Chennai | 8+ Years | Full-time | Enterprise Observability, Monitoring & APM Engineering, SRE Practices, Automation & Incident Management | OPT0262
- Design, implement, and manage observability solutions covering monitoring, logging, and tracing.
- Build dashboards, alerts, and reports for real-time system health and performance insights.
- Instrument applications and infrastructure for effective telemetry collection.
- Provide expert production support, resolving incidents, outages, and performance bottlenecks.
- Collaborate with engineering teams to ensure applications are built for observability and reliability.
- Conduct proactive trend analysis and thematic studies to identify risks and optimization opportunities.
- Improve system scalability, performance, and operational efficiency.
- Develop automation scripts to streamline observability workflows and incident response.
- Automate deployment and configuration of observability infrastructure.
- Ensure compliance with banking regulations and audit requirements.
- Participate in on-call rotations and post-incident reviews.
- Document procedures, best practices, and knowledge base articles.
- Strong expertise in observability platforms such as Prometheus, Grafana, ELK Stack, Datadog, New Relic, and Splunk.
- Deep understanding of APM tools such as AppDynamics and Dynatrace, and of instrumentation with OpenTelemetry.
- Experience with Docker and Kubernetes.
- Proficiency in scripting languages such as Python and Bash.
- Strong understanding of SRE principles.
- Experience with cloud platforms including AWS, Azure, and GCP.
- Hands-on experience with Infrastructure as Code tools such as Terraform and Ansible.
- Excellent analytical, troubleshooting, and communication skills.
- Bachelor’s degree in Computer Science, IT, or a related field.
- Relevant cloud or DevOps certifications are preferred.
- At least 8 years of experience in observability, SRE, DevOps, or production support roles.
- Strong experience working in banking or financial services environments.
- Proven track record supporting production systems at enterprise scale.
- Experience implementing automation for monitoring and incident response.
- Exposure to compliance-driven environments and audits.