Overview:
We are looking for an end-to-end Data Scientist to design, build, and maintain ML-powered systems that solve core data quality and classification problems across the business. You will own the full lifecycle — from exploratory analysis and feature engineering through model training, deployment, and ongoing performance monitoring. The work spans entity resolution (identifying duplicate records across large datasets) and multi-class classification models that drive decision-making across a variety of business domains.
Responsibilities:
What You'll Do
- Own the end-to-end model lifecycle: problem framing, data exploration, feature engineering, model training, evaluation, deployment, and monitoring
- Build and maintain entity resolution systems that detect duplicate records using supervised ML and string similarity techniques
- Develop classification models that categorize unstructured or semi-structured data into meaningful business categories
- Engineer features from messy, real-world text data (names, addresses, free-text fields) using string matching algorithms, phonetic encoding, n-grams, and other NLP techniques
- Design candidate retrieval and indexing strategies to make models performant at scale
- Tune thresholds, scoring logic, and rule-based overrides to balance precision and recall for production use cases
- Maintain production model artifacts and data pipelines, ensuring models stay current as underlying data evolves
- Collaborate with engineering and product teams to understand requirements and translate business problems into well-scoped modeling tasks
Qualifications:
- 10+ years of experience building and deploying ML models end-to-end (not just notebooks)
- Strong Python skills: pandas, NumPy, scikit-learn, XGBoost or similar gradient boosting frameworks
- Hands-on experience with record linkage, entity resolution, or deduplication problems
- Experience building classification models (binary and multi-class) on structured and semi-structured data
- Deep familiarity with string similarity algorithms: edit distance, sequence matching, phonetic encoding, shingling
- Strong feature engineering instincts, with the ability to extract signal from noisy, inconsistently formatted data
- Comfort working with large serialized data structures and understanding memory/performance tradeoffs in production contexts
- Experience with SQL and relational databases (PostgreSQL or similar)
- Clear communication skills, with the ability to explain model behavior and tradeoffs to non-technical stakeholders
Nice to Have
- Experience with blocking and indexing strategies for scalable record linkage
- Background in NLP, text normalization, or information extraction
- Familiarity with model serving in API contexts (Flask, FastAPI, or similar)
- Experience in data quality, master data management, or marketplace domains
- Exposure to deep learning frameworks (PyTorch, TensorFlow) for text classification