Role description
We are looking for a highly skilled and experienced Senior Site Reliability Engineer (SRE) to design, build, operate, and support highly available, scalable, and resilient platforms on Microsoft Azure.
The ideal candidate will possess strong expertise in Azure cloud services, Azure Kubernetes Service (AKS), Infrastructure as Code (IaC), CI/CD automation, observability, and production operations. This role focuses on implementing SRE best practices to improve system reliability, reduce operational toil, and drive operational excellence across cloud-native and containerized environments.
You will collaborate closely with Platform Engineering, Application Development, DevOps, and Security teams to ensure the stability, scalability, and performance of mission-critical systems and services.
Key Responsibilities
- Own and improve the reliability, scalability, performance, and availability of Azure-based platforms and services
- Implement and drive SRE best practices including SLIs, SLOs, error budgets, incident response, and blameless postmortems
- Design, develop, and manage Infrastructure as Code (IaC) using Terraform
- Build and maintain CI/CD pipelines using Azure DevOps for both infrastructure and application deployments
- Automate operational processes and workflows using:
  - Azure Automation Runbooks
  - Azure Logic Apps
  - Automation Jobs
- Design, deploy, and manage scheduled and batch workloads using:
  - Kubernetes Jobs
  - Kubernetes CronJobs
- Perform cloud administration, troubleshooting, and automation using Azure CLI
- Implement monitoring, alerting, and observability solutions using Dynatrace, Azure Monitor, and Application Insights
- Automate configuration management and operational tasks using Ansible, PowerShell, and Python
- Lead production incident management, root cause analysis (RCA), and reliability improvement initiatives
- Collaborate with development teams supporting .NET and Python applications hosted on Azure and AKS
- Continuously identify opportunities to reduce manual effort through automation and self-healing mechanisms
- Support platform upgrades, patching, scaling, and performance optimization activities
- Participate in on-call rotations and ensure timely resolution of production issues
Required Technical Skills
Cloud Platform – Microsoft Azure
- Strong hands-on experience with Microsoft Azure services
- Azure Compute:
  - Virtual Machines
  - Azure Kubernetes Service (AKS)
  - App Services
  - Azure Functions
- Azure Storage:
  - Blob Storage
  - Azure Files
  - Managed Disks
- Azure Networking:
  - VNets
  - Subnets
  - NSGs
  - Load Balancers
  - Application Gateway
Containers & Kubernetes
- Strong experience with Azure Kubernetes Service (AKS)
- Solid understanding of Kubernetes architecture and core concepts:
  - Deployments
  - Services
  - Ingress
  - RBAC
- Experience managing:
  - Kubernetes Jobs
  - Kubernetes CronJobs
- Expertise in AKS scaling, node pools, upgrades, and production troubleshooting
Infrastructure Automation & Configuration Management
- Terraform (Azure Provider, Modules, Remote State Management)
- Ansible automation and configuration management
- Azure Automation Runbooks and Jobs
- Azure Logic Apps for workflow orchestration
CI/CD & DevOps
- Azure DevOps:
  - Pipelines
  - Repositories
  - Release Management
- CI/CD automation for:
  - Infrastructure deployments
  - AKS workloads
  - Batch jobs and CronJobs
- Integration with:
  - Azure CLI
  - Terraform
  - Automation Scripts
Observability & Monitoring
- Dynatrace:
  - APM
  - Infrastructure Monitoring
  - Dashboards
  - Alerting
- Azure Monitor
- Application Insights
Scripting & Application Support
- Azure CLI
- PowerShell
- Python scripting
- Experience supporting and troubleshooting .NET and Python applications in production environments
Skills
Azure, Site Reliability Engineering, Terraform, Dynatrace, PowerShell
About UST
UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact—touching billions of lives in the process.