Intro – Born-in-the-cloud
QloudX is a born-in-the-cloud digital transformation boutique with deep expertise in multi-cloud technologies and strategic IT, and specialized experience in the transportation and retail domains. Our goal is to help transportation and retail businesses across the globe adopt cloud technologies and get the most value from their digital transformation journeys.
QloudX is the brainchild of former CIOs who, drawing on our group's 30+ years of experience and resources, made a strategic decision to create a next-generation cloud company focused on all things cloud. We are proud to have our headquarters in Europe and two delivery centers in India (Mumbai and Pune), and our reach, presence, and active customers also extend to the USA and Asia.
As an AWS Advanced Consulting Partner, we build solutions for our customers across a range of cloud technologies, including Data/Analytics, Serverless, DevOps, and Containers.
# Cloud Platform Engineer — ACS (Assurance Cloud Service)
Experience: 10+ years
MyCom's multi-tenant SaaS platform is called ACS.
About the Role
We're looking for a hands-on Cloud Platform Engineer to own and evolve the infrastructure behind our multi-tenant SaaS platform that delivers network assurance and analytics to global telecom operators (Globe, Singtel, Deutsche Telekom/Magenta, Telekom Serbia, and others). You'll work across the full provisioning lifecycle — from writing IaC to deploying production tenant environments and keeping them healthy. This is not a "design it and hand it off" role. You'll be in the code, in the AWS console, in kubectl, and in CI pipelines every day.
Key Responsibilities
Infrastructure Automation: Design, build, and maintain scalable AWS infrastructure using Terraform and AWS CDK (Python).
Kubernetes Orchestration: Manage and optimize our Amazon EKS clusters, ensuring high availability and performance.
GitOps Implementation: Drive continuous delivery by managing the Flux GitOps lifecycle to automate deployments and maintain environmental consistency.
Scaling & Efficiency: Implement Karpenter for intelligent, right-sized node provisioning to optimize compute costs and performance.
System Reliability: Monitor system health, troubleshoot complex infrastructure issues, and participate in architectural reviews to ensure best practices in security and reliability.
Details:
- Develop and maintain infrastructure-as-code using AWS CDK (Python) and Terraform to provision and manage isolated customer environments across multiple AWS regions
- Provision and operate EKS clusters with Karpenter autoscaling, Kyverno policies, FluxCD GitOps, RBAC, and Helm-based add-ons (Spark Operator, Trident/NetApp, ADOT Collector, Strimzi)
- Manage the full tenant lifecycle: onboard new customers by authoring tenant configs, running CDK/Terraform pipelines via GitLab CI, and updating GitOps repos
- Build and maintain the data platform layer: MongoDB Atlas clusters, Amazon MSK (Kafka), AWS Glue ETL, Athena connectors, and MSK-to-Mongo connector pipelines
- Operate and evolve storage infrastructure: FSx ONTAP (Gen1-to-Gen2 migrations, SnapMirror, Trident storage classes), EBS, S3 with KMS encryption
- Manage RDS Oracle instances (NetExpert, ProAssure, ProOptima) including multi-AZ, snapshots, parameter groups, switchover automation, and maintenance windows
- Develop and maintain Lambda functions for operational automation: FSx monitoring, RDS monitoring, NLB target group updates, ENI tagging, database switchover, streaming telemetry
- Work on GenAI/Bedrock integration: agents, action groups, guardrails, knowledge bases, and IRSA for Kubernetes-to-Bedrock access
- Maintain and extend GitLab CI/CD pipelines for both the platform repo (test, build, release, schema docs) and the tenant provisioning repo (plan, validate, deploy, git repo updates)
- Manage cross-account networking: Transit VPCs, VPN gateways, Transit Gateways, NACLs, WAF, and site integrations with customer on-prem networks
- Operate Elastic Cloud deployments with VPC endpoints and traffic filters
- Write and expand Python unit tests (pytest, CDK assertions) and improve code coverage
- Participate in merge request reviews on a production-protected repo with mandatory approvals
Core Requirements
AWS Expert: Proven experience with core AWS services (VPC, IAM, RDS, Route 53) and advanced networking.
Kubernetes Specialist: Deep understanding of the K8s ecosystem, including networking, storage, and resource management.
IaC Polyglot: Proficiency in Terraform for foundational infrastructure and AWS CDK (Python) for programmatic resource definition.
Modern DevOps Tooling: Hands-on experience with:
- Flux CD (or similar GitOps tools like ArgoCD)
- Karpenter for automated scaling
- CI/CD pipeline construction and maintenance
Details:
- Strong Python skills (you'll write CDK constructs, Lambda functions, automation scripts, and tests daily)
- Deep, practical AWS experience: EKS, VPC networking, IAM, CloudFormation, RDS, FSx ONTAP, S3, Lambda, MSK, Glue, Athena, Bedrock, SES, GuardDuty, Private CA, Transfer Family
- Solid Terraform skills (modules, state management via GitLab-managed backends, imports, multi-provider configs including NetApp ONTAP and Elastic Cloud)
- Kubernetes operations: kubectl, Kustomize, FluxCD, Helm, Karpenter, Kyverno, IRSA, RBAC, storage classes, ingress controllers
- Experience with GitOps workflows and GitLab CI/CD
- Comfort with Oracle RDS administration (parameter groups, snapshots, multi-AZ, storage scaling)
- Familiarity with NetApp ONTAP concepts (SVMs, volumes, SnapMirror, intercluster peering) is a plus
- Experience operating multi-tenant SaaS platforms with production customer environments
Nice to Have
- Experience with MongoDB Atlas provisioning via CDK/Terraform
- Kafka (MSK) operations: broker sizing, connector plugins, cluster policies
- AWS Bedrock agents and action groups
- Apache Airflow / Spark on Kubernetes
- Elastic Cloud (cloud.elastic.co) deployment and VPC endpoint management
- Telecom domain knowledge (network assurance, performance management)
Tools & Environment
- Python 3.12, uv package manager, Ruff linter
- AWS CDK v2.177, Terraform
- GitLab CI/CD (self-hosted)
- Jira (Atlassian) for ticketing