Key responsibilities
· Analytics exports. Build and maintain denormalised CSV and Parquet exports that the VS team can query or download without needing to understand our schema. Every export includes a manifest file and field dictionary.
· Feature engineering pipeline. Implement the transforms the VS team needs: unit normalisation to SI, date standardisation, null handling strategy (impute vs. flag vs. exclude), and outlier detection flags.
· VS dataset export format. Design the versioned, self-describing VS export format, including field definitions, provenance metadata, and a changelog so the VS team always knows what changed between exports.
· Data quality dashboard. Build the monitoring layer: per-document extraction scores, field-level fill rates, flagged item counts, and trends over time. This is how the team knows whether an ML change is an improvement.
· Schema evolution. Work with the Pipeline engineer on database schema migrations. You own the analytics-facing views - the denormalised representations that make the data queryable without joins.
· Team coordination. Attend the weekly sync. Map the model's required fields against what the database currently produces. Document the gaps and drive closure.
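To make the feature engineering responsibility concrete, here is a minimal sketch of the four transforms named above, using only the Python standard library. The conversion table, the list of date formats, the function names, and the choice of a median-based modified z-score for outlier flags are all illustrative assumptions, not the team's actual design.

```python
from datetime import datetime
from statistics import median

# Hypothetical conversion factors to SI units (illustrative only).
TO_SI = {
    "mm": ("m", 1e-3), "cm": ("m", 1e-2), "m": ("m", 1.0),
    "g": ("kg", 1e-3), "kg": ("kg", 1.0),
}

# Hypothetical input date formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def normalise_unit(value, unit):
    """Convert a (value, unit) pair to its SI equivalent."""
    si_unit, factor = TO_SI[unit]
    return value * factor, si_unit


def standardise_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def flag_nulls(values):
    """'Flag' strategy for nulls: keep the row and record which values are missing."""
    return [v is None for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median and MAD based) exceeds threshold.

    Missing values are never flagged here; flag_nulls covers them.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [False] * len(values)
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return [False] * len(values)
    return [
        v is not None and abs(0.6745 * (v - med) / mad) > threshold
        for v in values
    ]
```

Flags are returned as separate boolean lists rather than applied destructively, so the choice between imputing, flagging, and excluding can stay with the consumer of the export.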
Key deliverables
- CI/CD pipeline - linting, tests, ML evaluation harness as a merge gate
- Analytics dataset exports - CSV and Parquet
- Feature engineering pipeline - unit normalisation, date standardisation, null strategy, outlier flags
- VS dataset export format v1 - versioned, with manifest and field dictionary
- Data quality dashboard - per-doc scores, fill rates, flagged item trends
- VS field mapping and gap analysis document
- Ongoing data quality monitoring
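As a rough illustration of what a self-describing export could look like, the sketch below writes rows to CSV and builds a manifest carrying a version, a row count, a checksum, the field dictionary, and field-level fill rates. The function name `build_export`, the manifest keys, and the sample field names are hypothetical; the real format is for the person in this role to design.

```python
import csv
import hashlib
import io
import json


def build_export(rows, field_dictionary, version):
    """Serialise rows to CSV text and return (csv_text, manifest).

    field_dictionary maps each column name to its description and unit;
    its key order defines the column order of the export.
    """
    fields = list(field_dictionary)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    csv_text = buf.getvalue()

    def fill_rate(field):
        # Share of rows with a non-empty value for this field.
        if not rows:
            return 0.0
        filled = sum(1 for r in rows if r.get(field) not in (None, ""))
        return filled / len(rows)

    manifest = {
        "export_version": version,
        "row_count": len(rows),
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
        "fields": field_dictionary,
        "fill_rates": {f: fill_rate(f) for f in fields},
    }
    return csv_text, manifest


# Usage with made-up sample data.
sample_rows = [
    {"part_id": "A1", "mass_kg": 0.5},
    {"part_id": "A2", "mass_kg": None},
]
sample_fields = {
    "part_id": {"description": "Part identifier", "unit": None},
    "mass_kg": {"description": "Mass of the part", "unit": "kg"},
}
csv_text, manifest = build_export(sample_rows, sample_fields, version="1.0.0")
manifest_json = json.dumps(manifest, indent=2)
```

Keeping the checksum and fill rates in the manifest lets a consumer verify a download and spot sparse fields without opening the data file.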
Technical skills and experience
- Minimum 4 years of data engineering experience in production environments
- Bachelor’s degree in engineering or science
- CI/CD pipeline ownership: GitHub Actions, CircleCI, or equivalent. You have built pipelines from scratch, not just maintained them
- Infrastructure as code: Terraform or Pulumi. You write it, not just read it
- ML infrastructure experience: you have supported a team running models in production - training pipelines, experiment tracking, model versioning, deployment
- Cloud platform depth on AWS or GCP - networking, IAM, secrets, storage, compute cost management
- Strong enough in Python to read, debug, and instrument the ML team's training code without their help
Nice to have
- Experience with ML experiment tracking tools (MLflow, Weights & Biases, Neptune)
- Experience feeding data into ML training pipelines
- Familiarity with scientific units and measurement conventions (helpful for normalisation work)
- Experience with dbt, Great Expectations, or similar data quality tooling
- Background in engineering or industrial data (sensor data, test reports, datasheets)
Pay: ₹1,500,000.00 - ₹1,600,000.00 per year
Benefits:
- Paid sick time
- Paid time off
- Work from home
Application Question(s):
- What is your current CTC?
- What is your expected CTC?
- What is your notice period?
Experience:
- data engineer: 4 years (Preferred)
Work Location: Remote