Key responsibilities
· CI/CD pipeline. Design and own the full CI/CD pipeline from day one: linting, unit tests, integration tests, and, critically, the ML evaluation harness as a first-class CI gate. Every PR that touches extraction code automatically runs against the full ground truth dataset. A drop in accuracy blocks the merge - automatically, not by asking someone to check.
· ML pipeline infrastructure. Own the infrastructure that the ML engineers' work runs on: model training jobs (GPU or cloud-based), experiment tracking (MLflow, Weights & Biases, or equivalent), model versioning and the model registry, and the active learning trigger pipeline that kicks off retraining when reviewer corrections hit a threshold.
· IaC and cloud infrastructure. All cloud resources defined as code (Terraform or Pulumi) from Week 1. No hand-crafted resources, no configuration drift. You own networking, storage, secrets management, environment parity between dev, staging, and production, and cloud cost governance - especially for training compute and cloud OCR, the two biggest variable cost lines.
· Production operations. Own the production deployment process: environment configuration, secrets rotation, database backups, alerting on queue depth, error rate and latency thresholds, and a runbook for every common failure mode. On-call starts with you.
· Security and load testing. Lead the security review: authentication flows, API key handling, file storage access controls, input sanitisation. Commission and coordinate the penetration test. Run load testing against the pipeline - simulate concurrent document uploads, identify the throughput ceiling, and fix the top bottlenecks before launch.
· Cost and compute governance. Model training and cloud OCR are variable costs that can surprise you. Own the cost model: what does each training run cost, what is the cost per document at current OCR provider pricing, and what does the margin look like at 10× volume? If GPU-accelerated OCR becomes the right call, you make that recommendation with numbers behind it.
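To make the merge gate from the CI/CD bullet concrete, here is a minimal Python sketch of the kind of check we have in mind. It is an illustration under stated assumptions, not our actual harness: the function names, the exact-match accuracy metric, the baseline of 0.940, and the tolerance of 0.005 are all hypothetical.

```python
# Hypothetical values: in practice the baseline would come from the last
# recorded evaluation on the main branch, and the tolerance is a team decision.
BASELINE_ACCURACY = 0.940
TOLERANCE = 0.005


def accuracy(predictions, ground_truth):
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not ground_truth:
        raise ValueError("ground truth dataset is empty")
    correct = sum(
        1 for key, expected in ground_truth.items()
        if predictions.get(key) == expected
    )
    return correct / len(ground_truth)


def gate(candidate_accuracy, baseline=BASELINE_ACCURACY, tolerance=TOLERANCE):
    """Return True if the merge may proceed, False if accuracy regressed."""
    return candidate_accuracy >= baseline - tolerance


def main(predictions, ground_truth):
    """Return a process exit code: 0 passes the CI job, 1 fails it."""
    score = accuracy(predictions, ground_truth)
    if not gate(score):
        print(f"FAIL: accuracy {score:.3f} is below baseline {BASELINE_ACCURACY:.3f}")
        return 1
    print(f"PASS: accuracy {score:.3f}")
    return 0
```

A CI job would pass the return value of `main` to `sys.exit`, so a regression produces a non-zero exit code and the merge is blocked without anyone having to check by hand.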
Key deliverables
- CI/CD pipeline - linting, tests, ML evaluation harness as a merge gate
- ML experiment tracking and model registry
- IaC - all cloud resources, environments, secrets, cost governance
- Active learning trigger infrastructure - corrections → retraining → evaluation → deploy
- Production deployment pipeline and operator runbook
- Security review and penetration test coordination
- Load testing - throughput ceiling documented, top bottlenecks resolved
- Cloud cost model - per-document cost at 1×, 10×, 100× volume
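As a rough picture of the cloud cost model deliverable, the Python sketch below computes a per-document cost at 1×, 10×, and 100× volume. Every name and number in it is a placeholder, and it assumes training and fixed infrastructure spend stay constant per month while OCR cost scales with pages processed. A real model would need to test that assumption.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Placeholder monthly cost inputs, in any single currency unit."""
    ocr_cost_per_page: float
    pages_per_document: float
    training_runs_per_month: int
    cost_per_training_run: float
    fixed_infra_per_month: float


def cost_per_document(inputs, documents_per_month):
    """Variable OCR cost plus this document's share of monthly shared costs."""
    if documents_per_month <= 0:
        raise ValueError("documents_per_month must be positive")
    variable = inputs.ocr_cost_per_page * inputs.pages_per_document
    shared = (
        inputs.training_runs_per_month * inputs.cost_per_training_run
        + inputs.fixed_infra_per_month
    )
    return variable + shared / documents_per_month


# Illustrative numbers only; none of these come from real pricing.
example = CostInputs(
    ocr_cost_per_page=0.1,
    pages_per_document=5,
    training_runs_per_month=4,
    cost_per_training_run=50.0,
    fixed_infra_per_month=300.0,
)
base_volume = 1_000  # hypothetical current documents per month
for multiplier in (1, 10, 100):
    volume = base_volume * multiplier
    print(f"{multiplier}x volume: {cost_per_document(example, volume):.3f} per document")
```

Under these assumptions the per-document cost falls towards the pure OCR cost as volume grows, which is why OCR provider pricing dominates the margin question at 10× and beyond.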
Technical skills and experience
- At least 4 years in a DevOps, MLOps, or platform engineering role, with hands-on experience on both the DevOps and the ML infrastructure sides, not just one of them
- Bachelor’s degree in engineering or science
- CI/CD pipeline ownership: GitHub Actions, CircleCI, or equivalent. You have built pipelines from scratch, not just maintained them
- Infrastructure as code: Terraform or Pulumi. You write it, not just read it
- ML infrastructure experience: you have supported a team running models in production - training pipelines, experiment tracking, model versioning, deployment
- Cloud platform depth on AWS or GCP - networking, IAM, secrets, storage, compute cost management
- Strong enough in Python to read, debug, and instrument the ML team's training code without their help
Nice to have
- Experience with ML experiment tracking tools (MLflow, Weights & Biases, Neptune)
- Familiarity with active learning or continual learning pipeline infrastructure
- Experience with GPU compute management - spot instances, cost optimisation, training job scheduling
- Background supporting document processing or OCR pipelines at scale
- SOC 2 readiness or security compliance experience
- Experience with ML model monitoring in production - drift detection, performance degradation alerts
Pay: ₹1,300,000.00 - ₹1,500,000.00 per year
Benefits:
- Paid sick time
- Paid time off
- Work from home
Application Question(s):
- What is your current CTC?
- What is your expected CTC?
- What is your notice period?
Education:
Experience:
- DevOps: 4 years (Required)
Work Location: Remote