BBI is a global data engineering consulting firm that empowers clients to scale and modernize effectively. We combine engineering fundamentals and innovative tools to execute business-critical, end-to-end projects on time and on budget. We offer expert services across Data Integration, Data Modernization, Data Migration, Data Architecture, Platform Support, and Application Services. Our goal is to deliver business value in the most effective way so that our clients can focus on growth.
Experience level: 5 - 8 Years
Qualification: Postgraduate/Graduate
Location: Chennai/Pune/Bangalore
At Black buck Insights (BBI), we hire great minds who can embrace technology to innovate and build. We are always on the lookout for individuals who are thrilled by the idea of developing solutions, features, and services while managing ambiguity and fast-paced projects. If this is you, come chart your own path at BBI!
Position Summary
The Senior Machine Learning Software Engineer (Senior MLSE) is a senior-level technical contributor responsible for leading the development of software infrastructure, tools, and platforms that enable scalable and maintainable machine learning operations. This role plays a critical part in bridging the gap between research and production by architecting reliable systems for training, testing, deployment, and monitoring of machine learning models. The Senior MLSE ensures AI capabilities are production-grade, reliable, and scalable, unlocking innovation across all AI-driven products. In addition to making significant technical contributions, the Senior MLSE mentors junior engineers and fosters best practices in software quality, MLOps, and automation across the machine learning lifecycle.
Responsibilities:
Infrastructure Design & Development
- Architect, build, and maintain reusable components and tools to support model training, evaluation, and deployment at scale.
- Optimize model serving frameworks, feature stores, data pipelines, and CI/CD systems for ML workflows.
- Ensure reliability, observability, and performance across ML systems in production.
Technical Leadership & Execution
- Lead cross-functional engineering initiatives involving platform stability, experimentation infrastructure, or real-time inference systems.
- Review code, propose architectural improvements, and uphold software engineering best practices within the ML engineering team.
- Drive design and implementation of MLOps pipelines, automation, and model governance workflows.
Collaboration with Research & Product Engineering
- Work closely with ML researchers to productionize experimental models, ensuring compatibility with existing infrastructure.
- Coordinate with data engineering to integrate pipelines, data validations, and model input/output schemas.
- Contribute to product engineering discussions when ML systems require edge optimization, user-facing API integrations, or UI-linked inference.
Mentorship & Knowledge Sharing
- Mentor ML Software Engineers I and II, with a proven track record of advancing at least one MLSE I to MLSE II.
- Contribute to internal documentation, architecture reviews, and engineering learning resources.
- Set high standards for code quality, reproducibility, and maintainability across the ML engineering discipline.
Requirements:
- 3+ years building and operating production software systems (ML software/inference platform experience strongly preferred).
- Strong Python engineering skills plus solid Linux/bash debugging skills.
- Hands-on experience with NVIDIA Triton Inference Server (or an equivalent model serving platform).
- Practical experience with model optimization and deployment pipelines (e.g., ONNX/TensorRT, performance/latency tuning, packaging for production).
- Proven experience deploying and operating services on AWS, including ECS, plus Docker/container workflows, S3/ECR, IAM/secrets, and safe rollout/rollback practices.
- Experience with CI/CD and artifact/version management for ML software (DVC/MLflow-equivalent workflows are a plus).
- Production reliability mindset: monitoring, incident triage, and staged release safety.
- Strong ownership, communication, and demonstrated ability to ramp up quickly on missing stack-specific pieces within a 3-6 month onboarding window.