Sr Software Engineer Syseng

GCC India - Hyderabad, Telangana

Job details

Full-time
1 day ago

Qualifications

CI/CD
Cloud infrastructure
Azure
Go
Kubernetes
Software deployment
DevOps
Build automation
Java
Bash (Unix shell)
AWS
Distributed systems
Terraform
Continuous integration
Scripting
Linux
AI
Python
Analytics

Full job description

At LPL’s Global Capability Center, you'll find a collaborative culture where your voice matters, integrity guides every decision, and technology fuels progress. Your skills, talents, and ideas will redefine what's possible. LPL's success reflects its exceptional employees, who together pursue one noble purpose: empowering financial advisors to deliver personalized advice for all who need it. We’re proud to be expanding and reaching new heights in Hyderabad.

Join us as we create something extraordinary together.

Job Summary

We are seeking experienced Site Reliability Engineers (SREs) to build and operate an AI-driven observability and reliability platform for enterprise-scale environments. The team will monitor LPL environments using Dynatrace, build intelligent agents that analyze logs, metrics, and traces, generate automated alerts based on dynamic thresholds, and implement self-healing recovery mechanisms to reduce mean time to recovery (MTTR) and prevent outages.

This role blends SRE, platform engineering, and applied AI/ML, with a strong focus on automation, resilience, and operational excellence.

Job Responsibilities

Observability & Monitoring

Design, implement, and operate enterprise observability solutions using Dynatrace, including logs, metrics, traces, and real user monitoring (RUM).
Define golden signals, SLOs, SLIs, and error budgets for critical applications and infrastructure.
Create dynamic and adaptive alerting strategies to reduce noise and improve signal quality.
Partner with application teams to embed reliability and observability standards into services from design through production.

AI Agents & Intelligent Automation

Build and deploy AI/ML-powered agents that analyze Dynatrace logs and metrics, historical incident data, and behavioral and anomaly patterns.
Implement predictive monitoring and anomaly detection using statistical models or ML techniques.
Automate alert triage, correlation, and root cause analysis (RCA) using AI-driven approaches.
Continuously improve models based on operational feedback and outcomes.

Self-Healing & Recovery

Design and implement self-healing systems that automatically restart or rescale services, redirect traffic, roll back failed deployments, and trigger remediation workflows.
Integrate automated recovery actions with CI/CD pipelines, runbooks, and orchestration tools.
Ensure recovery mechanisms are safe, auditable, and compliant with enterprise governance.

Reliability Engineering & Operations

Lead incident response, post-incident reviews, and root cause analysis.
Establish runbooks, automated remediation playbooks, and operational standards.
Drive reductions in MTTR, incident frequency, and operational toil.
Participate in on-call rotations and help build a sustainable on-call culture.

Platform, Cloud & Infrastructure

Work with cloud and on-premises infrastructure teams to improve reliability at scale.
Build automation using Infrastructure as Code (IaC) practices.
Ensure systems meet security, compliance, and resiliency requirements expected in financial or regulated environments.

Job Qualifications

Core SRE & Platform Skills

4+ years of experience in SRE, DevOps, Platform Engineering, or Production Operations
Strong experience designing and operating highly available, distributed systems
Hands-on expertise with monitoring and observability platforms (strong preference for Dynatrace) and with Linux-based systems and networking fundamentals
Deep understanding of SLOs, SLIs, error budgets, and reliability trade-offs

Automation & Programming

Proficiency in at least one programming language such as Python, Go, Java, or similar
Strong scripting experience (Python, Bash, etc.) for automation and tooling
Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)

AI / ML & Intelligent Systems

Experience building or integrating AI/ML models for operational use cases, such as anomaly detection, log analytics, and predictive alerts
Familiarity with ML pipelines, feature extraction, and model lifecycle management, and with hybrid rule-based and ML approaches for production systems
Experience integrating AI workflows into monitoring and alerting platforms is highly valued.

Cloud, Containers & CI/CD

Experience with cloud platforms (AWS, Azure, or GCP)
Strong knowledge of Kubernetes and container orchestration, and of CI/CD pipelines and deployment automation
Understanding of scalable architecture patterns and failure modes in cloud-native systems.

LPL Global Business Services, LLP - PRIVACY POLICY
