Role Description
We are seeking a motivated and detail-oriented Junior Site Reliability Engineer (SRE) to join our Cloud Operations team. In this role, you will support the reliability, availability, performance, and operational excellence of cloud-native platforms hosted on Microsoft Azure.
The ideal candidate possesses strong hands-on troubleshooting skills across Azure infrastructure, Kubernetes environments, and monitoring platforms, along with a proactive mindset toward automation and continuous improvement. You will work closely with senior SREs, cloud engineers, and development teams to maintain highly available and resilient systems while gaining exposure to modern DevOps and SRE practices.
Key Responsibilities
- Support the day-to-day operations of Microsoft Azure cloud infrastructure and services.
- Monitor, maintain, and improve the reliability, availability, and performance of cloud-native applications and platforms.
- Perform hands-on troubleshooting across Azure compute, storage, networking, and application layers.
- Assist in maintaining and supporting Azure Kubernetes Service (AKS) environments.
- Participate in production incident response, triage, troubleshooting, escalation, and post-incident reviews.
- Investigate and resolve infrastructure and application issues related to:
  - Azure Virtual Machines
  - Azure Kubernetes Service (AKS)
  - Kubernetes pods, nodes, services, ingress controllers
  - Azure Storage services and connectivity
  - Cloud networking and platform dependencies
- Utilize Azure logs, metrics, monitoring dashboards, and diagnostic tools to identify root causes and implement resolutions.
- Monitor infrastructure and application health using tools such as Dynatrace, Azure Monitor, and related observability platforms.
- Respond to alerts, investigate anomalies, and assist in reducing recurring operational issues.
- Support performance analysis and system optimization initiatives.
- Assist with Infrastructure as Code (IaC) deployments using Terraform.
- Support CI/CD pipeline execution and troubleshooting using Azure DevOps.
- Automate repetitive operational tasks using:
  - Azure Automation Runbooks
  - Azure Logic Apps
  - PowerShell and Python scripts
- Perform diagnostics, administration, and remediation activities using Azure CLI.
- Support containerized workloads running on AKS.
- Assist with deployment, troubleshooting, and maintenance of:
  - Kubernetes Deployments
  - Services
  - Jobs
  - CronJobs
- Collaborate with engineering teams to improve platform reliability and operational efficiency.
- Learn and adopt Site Reliability Engineering best practices focused on reducing operational toil and improving system stability.
- Contribute to operational documentation, runbooks, and knowledge-sharing initiatives.
- Participate in a rotational on-call support schedule aligned to Eastern Standard Time (EST).
Required Technical Skills
- Hands-on experience supporting and troubleshooting Microsoft Azure environments.
- Strong understanding of:
  - Azure Virtual Machines
  - Azure Kubernetes Service (AKS)
  - Azure Storage (Blob Storage, Azure Files)
  - Azure Networking (VNets, NSGs, Load Balancers – foundational level)
- Experience using Azure logs, metrics, diagnostics, and monitoring tools.
- Practical experience with Azure CLI for troubleshooting and administration.
- Foundational experience with Kubernetes concepts and administration.
- Knowledge of:
  - Pods
  - Deployments
  - Services
  - Ingress
  - Jobs and CronJobs
- Exposure to AKS-based production environments.
- Familiarity with Terraform and Infrastructure as Code (IaC) concepts.
- Experience with Azure DevOps repositories and CI/CD pipelines.
- Exposure to Azure Automation Runbooks, Logic Apps, or similar workflow automation tools.
- Understanding of monitoring, alerting, logging, and observability principles.
- Exposure to Dynatrace, Azure Monitor, Application Insights, or equivalent monitoring platforms.
- Basic scripting and automation skills using PowerShell, Python, and Azure CLI.
- Basic understanding of .NET and Python-based applications and their operational support requirements.
Preferred Qualifications
- Microsoft Azure certifications (AZ-900, AZ-104, or equivalent).
- Exposure to Site Reliability Engineering (SRE) practices and cloud-native architectures.
- Familiarity with Git-based version control systems.
- Strong analytical, troubleshooting, and problem-solving skills.
- Excellent communication and collaboration abilities.
What Success Looks Like
- Effective troubleshooting and resolution of Azure platform issues.
- Consistent support of reliable and highly available cloud services.
- Increased automation and reduction of manual operational effort.
- Timely response to incidents and proactive identification of system risks.
- Continuous learning and growth within cloud, DevOps, and SRE disciplines.
Skills
Azure, Dynatrace, Terraform, Kubernetes, Azure Cloud
About UST
UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact—touching billions of lives in the process.