As a Sr. AWS Cloud AI Infrastructure Engineer, you will play a pivotal role in designing, deploying, and optimizing cloud-native infrastructure tailored for advanced AI/ML applications. Your expertise in AWS services and cloud architecture will be instrumental in enabling scalable, secure, and high-performance environments that support the full lifecycle of AI-driven solutions, from data ingestion to real-time inference. Working alongside a team of specialists, you’ll contribute to cutting-edge AI initiatives, including generative AI, large language models (LLMs), and retrieval-augmented generation (RAG) systems.
You will be responsible for architecting and operationalizing AWS infrastructure to support enterprise-scale AI applications. This includes managing GPU-powered training/inference environments, serverless orchestration, scalable model hosting, and secure data pipelines. You will ensure AI workloads are production-ready, cost-effective, and compliant with governance and privacy standards. Collaborating with cross-functional teams, you will automate infrastructure provisioning, streamline MLOps workflows, and support the deployment of AI models using services like Bedrock, SageMaker, EKS, and ECS.
- Bachelor’s degree in Computer Science, Data Engineering, or a related field. AWS certifications (e.g., Solutions Architect, Machine Learning Specialty) are highly desirable.
- 4-5+ years of experience in designing and implementing cloud infrastructure, preferably for AI/ML workloads.
- Proven expertise in the AWS ecosystem: EC2 (GPU instances), SageMaker, Bedrock, Lambda, Step Functions, EKS/ECS, CloudFormation/CDK, IAM, and S3.
- Deep understanding of cloud-native AI architectures (e.g., LLM hosting, vector search, scalable inference endpoints).
- Strong programming skills in Python (or similar), with scripting experience to automate infrastructure and deployment.
- Experience in building and managing reproducible AI pipelines (training, evaluation, deployment).
- Familiarity with MLOps practices and tools like SageMaker Pipelines, MLflow, or Kubeflow.
- Solid foundation in cloud security, performance tuning, and monitoring AI infrastructure using CloudWatch, X-Ray, or Prometheus.