At LPL’s Global Capability Center, you'll find a collaborative culture where your voice matters, integrity guides every decision, and technology fuels progress. Your skills, talents, and ideas will redefine what's possible. LPL's success reflects its exceptional employees, who together pursue one noble purpose: empowering financial advisors to deliver personalized advice for all who need it. We’re proud to be expanding and reaching new heights in Hyderabad.
Join us as we create something extraordinary together.
Job Summary
We are seeking experienced Site Reliability Engineers (SREs) to build and operate an AI-driven observability and reliability platform for enterprise-scale environments. This team will be responsible for monitoring LPL environments using Dynatrace, building intelligent agents that analyze logs, metrics, and traces, generating automated alerts based on dynamic thresholds, and implementing self-healing recovery mechanisms to reduce MTTR and prevent outages.
This role blends SRE, platform engineering, and applied AI/ML, with a strong focus on automation, resilience, and operational excellence.
Job Responsibilities
Observability & Monitoring
- Design, implement, and operate enterprise observability solutions using Dynatrace, including logs, metrics, traces, and RUM.
- Define golden signals, SLOs, SLIs, and error budgets for critical applications and infrastructure.
- Create dynamic and adaptive alerting strategies to reduce noise and improve signal quality.
- Partner with application teams to embed reliability and observability standards into services from design through production.
AI Agents & Intelligent Automation
- Build and deploy AI/ML-powered agents that analyze:
  - Dynatrace logs and metrics
  - Historical incident data
  - Behavioral and anomaly patterns
- Implement predictive monitoring and anomaly detection using statistical models or ML techniques.
- Automate alert triage, correlation, and root cause analysis (RCA) using AI-driven approaches.
- Continuously improve models based on operational feedback and outcomes.
Self-Healing & Recovery
- Design and implement self-healing systems that automatically:
  - Restart or rescale services
  - Redirect traffic
  - Roll back failed deployments
  - Trigger remediation workflows
- Integrate automated recovery actions with CI/CD pipelines, runbooks, and orchestration tools.
- Ensure recovery mechanisms are safe, auditable, and compliant with enterprise governance.
Reliability Engineering & Operations
- Lead incident response, post-incident reviews, and root cause analysis.
- Establish runbooks, automated remediation playbooks, and operational standards.
- Drive reductions in MTTR, incident frequency, and operational toil.
- Participate in on-call rotations and help build a sustainable on-call culture.
Platform, Cloud & Infrastructure
- Work with cloud and on-premises infrastructure teams to improve reliability at scale.
- Build automation using Infrastructure as Code (IaC) practices.
- Ensure systems meet security, compliance, and resiliency requirements expected in financial or regulated environments.
Job Qualifications
Core SRE & Platform Skills
- 4+ years of experience in SRE, DevOps, Platform Engineering, or Production Operations
- Strong experience designing and operating highly available, distributed systems
- Hands-on expertise with:
  - Monitoring and observability platforms (strong preference for Dynatrace)
  - Linux-based systems and networking fundamentals
- Deep understanding of SLOs, SLIs, error budgets, and reliability trade-offs
Automation & Programming
- Proficiency in at least one programming language such as Python, Go, Java, or similar
- Strong scripting experience (Python, Bash, etc.) for automation and tooling
- Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
AI / ML & Intelligent Systems
- Experience building or integrating AI/ML models for operational use cases, such as:
  - Anomaly detection
  - Log analytics
  - Predictive alerts
- Familiarity with:
  - ML pipelines, feature extraction, and model lifecycle management
  - Rule-based + ML hybrid approaches for production systems
- Experience integrating AI workflows into monitoring and alerting platforms is highly valued.
Cloud, Containers & CI/CD
- Experience with cloud platforms (AWS, Azure, or GCP)
- Strong knowledge of:
  - Kubernetes and container orchestration
  - CI/CD pipelines and deployment automation
- Understanding of scalable architecture patterns and failure modes in cloud-native systems.