1. MLOps
JD:
Key Responsibilities:
- Maintain and support machine learning applications running on Windows and Linux servers in on-premises environments.
- Manage and troubleshoot Kubernetes clusters hosting ML workloads.
- Collaborate with data scientists and engineers to deploy machine learning models reliably and efficiently.
- Implement and maintain monitoring and alerting solutions using DataDog to ensure system health and performance.
- Debug and resolve issues in production environments using Python and monitoring tools.
- Automate operational tasks to improve system reliability and scalability.
- Ensure best practices in security, performance, and availability for ML applications.
- Document system architecture, deployment processes, and troubleshooting guides.
Required Qualifications:
- Proven experience working with Windows and Linux operating systems in production environments.
- Hands-on experience managing on-premises servers, Kubernetes clusters, and Docker containers.
- Strong proficiency in Python programming, with a solid background in developing Python-based solutions and applications.
- Deep understanding of machine learning workflows, including practical knowledge of ML model training processes, frameworks, and evaluation.
- Proven experience in ML deployment, including building end-to-end model serving pipelines and managing the model lifecycle.
- Familiarity with monitoring and debugging tools, e.g., DataDog.
- Ability to troubleshoot complex issues in distributed systems.
- Experience with CI/CD pipelines for ML applications.
- Familiarity with the AWS cloud platform.
- Background in Site Reliability Engineering or DevOps practices.
- Strong problem-solving skills and attention to detail.
- Excellent communication and collaboration skills.
AI, AWS, Data Science, Jenkins, ML, Python