Required Skills:
- Python
- MLOps
- Cloud Platform
- Incident Management
- Monitoring & Observability
- Containers & Orchestration
- CI/CD
- Team Leadership
Nice to Have:
- DevSecOps
- Cloud Certifications

Expert, AI Application Sustenance – Engineer

We are seeking an experienced Support Expert to oversee the operational support, incident management, and technical maintenance of AI/ML-driven applications built using Python and deployed across AWS and Azure cloud platforms. The ideal candidate will have deep technical expertise in cloud infrastructure, Python-based application development, AI/ML operations (MLOps), and team leadership. This role ensures high availability, reliability, and performance of AI systems that support business-critical processes.

Key Responsibilities:

1. Technical Leadership & Team Management
- Lead, mentor, and guide a team of support engineers (L1/L2/L3).
- Act as the primary escalation point for complex issues across applications, ML models, and cloud services.
- Drive continuous improvement initiatives and establish best practices for operational excellence.

2. Application Support & Troubleshooting
- Provide end-to-end support for Python-based APIs, microservices, AI/ML pipelines, and data processing systems.
- Diagnose and resolve issues related to Python environments, dependencies, performance bottlenecks, and integrations.
- Maintain high availability and optimal performance of production workloads.

3. Monitoring, Incident, and Problem Management
- Establish monitoring dashboards and alerts using Dynatrace, CloudWatch, Azure Monitor, or Application Insights.
- Own the full lifecycle of incidents, from detection through root cause analysis to preventive measures.
- Ensure adherence to SLA commitments.

Required Qualifications:

Technical Skills:
- 5–10 years of experience supporting production systems, preferably AI/ML or data-intensive applications.
- Strong proficiency in Python, debugging, and API frameworks (FastAPI, Flask, Django).
- Experience with AWS (EC2, Lambda, S3, SageMaker, RDS) and Azure (App Services, Functions, Azure ML, Storage, AKS).
- Experience with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins) and infrastructure-as-code.
- Familiarity with containerization (Docker, Kubernetes) and microservices architecture.
- Strong understanding of AI/ML concepts, model deployment, and monitoring.

Soft Skills:
- Excellent problem-solving skills and ability to handle high-pressure incidents.
- Strong leadership, mentoring, and team management capabilities.
- Effective communication and stakeholder management skills.

Preferred Qualifications:
- Certifications such as AWS and Azure Administrator, or Azure/AWS Data/AI certifications.
- Experience with MLOps tools (MLflow, Kubeflow, SageMaker Pipelines, Azure ML Pipelines).
- Knowledge of DevSecOps, automation, and cost optimization best practices.
- Experience with ITIL practices for incident, problem, and change management.