Required Skills:
- MLOps
- AWS / Databricks
- Terraform
- CI/CD
- Docker
- Python
- Observability
- DevSecOps

MLOps Engineer

Role Overview

We are building a next-generation enterprise AI Delivery team and are seeking an experienced MLOps Engineer to build and operate the pipelines, platforms, and guardrails that take ML/GenAI from notebooks to reliable, secure, and scalable production services. You will engineer infrastructure-as-code, CI/CD for ML and LLM applications, secure model serving, observability, and runtime cost/performance optimization, partnering closely with Data Scientists, AI Product Owners, and Platform/DevOps teams.

Ideal candidates will have 3-5 years of production experience with ML platforms (e.g., SageMaker, Azure ML, Databricks) and expertise in Kubernetes-based model serving and GitOps automation. You will champion reliability (SLOs/SLIs), compliance, and automation-first practices across the ML lifecycle.

Key Responsibilities

- Design, develop, and document Infrastructure as Code (Terraform) for ML/LLM platform components on AWS/Databricks; implement secure, scalable foundations for data, compute, networking, and secrets.
- Build and maintain GitHub-based pipelines (Actions/Workflows) for training, packaging, validation, and deployment of ML/LLM assets (models, evaluation suites, prompts, policies), using GitOps for environment promotion.
- Containerize models using Docker and deploy them primarily through managed endpoints (SageMaker/Azure ML); Kubernetes-based serving (KServe/Triton/Seldon) is a plus.
- Operate model registries and feature stores; enforce versioning, lineage, and artifact governance via MLflow/Databricks and cloud-native services.
- Implement logs/metrics/traces, performance profiling, and drift/quality monitors; define SLIs/SLOs and on-call runbooks; drive incident response and post-mortems with accountability (business-hours support rotation).
- Embed DevSecOps: secrets management, IAM/RBAC, vulnerability scanning, image signing, policy as code, least-privilege access, and backup/DR/resiliency patterns; align with enterprise security standards.
- Operationalize GenAI: prompt/content safety filters, evaluation harnesses (human-in-the-loop), grounding/attribution logging, token cost and latency tracking, and red-teaming pipelines integrated into CI/CD.
- Monitor and optimize compute/storage/bandwidth and inference costs; implement right-sizing, autoscaling, and caching strategies.
- Partner with Data Scientists to productize models; co-design platform features with stakeholders; deliver documentation, templates, and knowledge transfers that accelerate safe reuse.
- Run operations (RUN): troubleshoot escalations, improve monitoring, automate administration/IRP tasks, and continuously harden reliability, performance, and security across environments.

Required Skills & Qualifications

Technical Experience:
- Understanding of DevOps concepts such as reference implementation enforcement, use of shared DevOps stacks, infrastructure optimization (performance, cost, HA, resiliency), release management (GitOps best practices), and QA automation frameworks.
- Strong knowledge of AWS ecosystems and Databricks integration.
- Proficiency in Terraform for developing, testing, and maintaining Infrastructure-as-Code to manage cloud services for ML engineering.
- Hands-on experience with CI/CD using GitHub, GitHub Actions, and workflow automation to support continuous integration, delivery, and deployment of ML assets.
- Strong experience with Docker; Kubernetes is a plus.
- Experience with MLflow (tracking/registry), model registries, feature stores, experiment tracking, and lineage management, on Databricks and cloud-native equivalents.
- Ability to build pipelines for training, testing (unit/integration/e2e), evaluation, and deployment.
- Experience designing or contributing to infrastructure, application, and performance monitoring (logs, metrics, dashboards) and supporting observability strategies.
- Ability to produce efficient, maintainable code in Python; experience troubleshooting and extending Python-based services.

Consulting Experience:
- Proven track record in an IT consulting environment, engaging with large enterprises and MNCs in strategic data solutioning projects.
- Experience working with enterprise stakeholders on platform adoption, requirement clarification, effort sizing, and change management for ML platform rollouts.

Leadership & Soft Skills:
- Strong collaboration and communication across Delivery and RUN.
- Excellent communication, documentation, and presentation skills.
- Strong problem-solving, analytical thinking, and strategic vision.

Educational Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related quantitative field.

Preferred Certifications:
- AWS DevOps Engineer – Professional
- AWS Certified Machine Learning – Specialty (or Azure DevOps Engineer Expert)
- CKA (Certified Kubernetes Administrator)
- HashiCorp Terraform Associate

What We’re Looking For

- Self-starters who are highly motivated, ambitious, and eager to challenge the status quo.
- Builders who combine scientific rigor with pragmatic engineering and can balance accuracy, latency, and cost.
- Effective leaders who collaborate openly, freely share knowledge, and elevate team performance.
- Straightforward, results-oriented individuals who value impact and accountability.
- Adaptable experts who stay on top of fast-evolving AI technologies and practices.

- Opportunity to shape and build an AI product portfolio that delivers meaningful business impact for Regions.
- Work alongside a motivated and innovative team that values learning, ownership, and excellence.
- Thrive in a culture that challenges the status quo and embraces diverse perspectives.