Overview:
The Senior AI Solutions Engineer is responsible for provisioning and deploying solutions that support AI training and inference workloads. The ideal candidate has a strong understanding of Kubernetes, IaC (Terraform), CI/CD pipelines (ArgoCD, Jenkins), MLOps and LLMOps (LLM pipelines), virtualization, inference/model serving, and compute, storage and networking within a data center; a strong grasp of Gen AI, LLMs, machine learning and deep learning; and hands-on experience deploying Kubernetes in both virtualized and bare-metal environments. Knowledge of operating systems, virtualization, container orchestration, configuration management, automation, distributed systems, and artificial intelligence and its capabilities is a must.
Responsibilities:
- Implement AI solutions for customers that will drive business impact and introduce operational efficiencies.
- Deploy, configure, and maintain Kubernetes clusters across a variety of client environments.
- Engage in all aspects of container orchestration administration, including the management of users and policies, resources, networking configuration, creation and management of applications, and configuration of pod scheduling and cluster scaling.
- Perform routine upgrades, patching and maintenance to ensure infrastructure is secure and up to date.
- Implement and maintain automated solutions for provisioning and configuring the operating environment and its associated infrastructure.
- Monitor and analyze performance metrics, working proactively to prevent issues before they impact operations.
- Troubleshoot and resolve infrastructure issues, ensuring minimal downtime and a high level of performance.
- Provide best practice guidance on configuration for container orchestration platforms across multiple applications and projects.
- Maintain thorough documentation for processes, platform architecture, system configurations, and troubleshooting steps.
Qualifications:
- Experience:
  - Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
  - 5+ years of experience provisioning and administering container orchestration platforms to support mission-critical AI workloads.
  - Experience working on projects involving compute, network and storage components within a data center.
  - Experience writing and formatting high-level and low-level technical documentation for proposed solutions.
  - Experience in AIOps and in coordinating with platform support teams.
- Required skills:
  - Understanding of machine learning, deep learning, neural networks, and foundation models.
  - Understanding of AI training and fine-tuning workflows, inference pipelines, and feature engineering.
  - Hands-on knowledge of deploying, configuring and maintaining container orchestration tools such as Red Hat OpenShift, RKE2, or upstream Kubernetes.
  - Working knowledge of container fundamentals: container networking and storage volumes, as well as building and deploying Docker images.
  - Working knowledge of various types of operating systems, such as Unix, Linux and Windows.
  - Understanding of integration with observability and monitoring (Prometheus, Grafana) and logging.
  - Familiarity with scripting, Python, Ansible, Terraform, Git, and CI/CD pipelines.
  - Understanding of vGPU, pass-through, MIG, or container-based GPU orchestration options.
  - Familiarity with Agile and DevOps ways of working.
  - Strong stakeholder management and communication skills.
  - Familiarity with the Kubernetes ecosystem, including Helm charts, operators, and container registries (e.g. Quay).
  - Experience with automation tools for infrastructure provisioning and configuration such as Ansible and Terraform.
  - Experience with monitoring and logging tools such as Prometheus, Grafana, Zabbix or similar.
  - Understanding of firewalls, security policies, NAT, VPN tunnels, RBAC, TLS, PKI and certificates.
  - Understanding of distributed systems requirements and design (scalability, fault tolerance, HA).
  - Understanding of Gen AI, LLMs, RAG pipelines, and relevant use cases.
  - Familiarity with TensorFlow, PyTorch, RAPIDS, and other GPU-accelerated libraries.
  - Professional-level certifications.
- Preferred Skills & Experience:
  - Kubernetes, IaC (Terraform), CI/CD pipelines (ArgoCD, Jenkins).
  - MLOps and LLMOps (LLM pipelines).
  - Hands-on experience with machine learning and deep learning frameworks, GPU virtualization, and inference/model serving.
Working Conditions:
- This position may require evening and weekend work for time-sensitive project implementations.