Role Overview
We are looking for a highly motivated Observability Engineer to design, implement, and operate end-to-end observability solutions for modern, cloud-native platforms.
The role focuses on building and maintaining metrics, events, logs, and traces (MELT) pipelines using industry-standard tools, and on ensuring high system reliability, performance, and visibility.
You will work closely with SRE, DevOps, Platform, and Application teams to improve system monitoring, troubleshoot production issues, and drive a culture of operational excellence.
________________________________________
Key Responsibilities
Observability Platform Engineering
Design, implement, and maintain observability platforms using OpenTelemetry, Prometheus, Grafana, Loki, and Tempo
Build scalable pipelines for metrics, logs, and distributed traces
Define and enforce observability standards across teams
Monitoring & Alerting
Create and maintain SLOs, SLIs, and alerting strategies
Design actionable alerts that reduce noise and prevent alert fatigue
Build dashboards and alerts, and maintain runbooks for production systems
Kubernetes & Cloud Observability
Implement observability for Kubernetes (EKS/GKE/AKS) workloads
Enable pod-level, node-level, and cluster-level visibility
Integrate observability with cloud services (AWS/GCP/Azure)
Incident Response & Troubleshooting
Support production incident investigations using logs, metrics, and traces
Perform root cause analysis (RCA) and post-incident reviews
Reduce MTTR by enhancing observability coverage
Automation & Optimization
Automate observability deployment using Helm, Terraform, or GitOps
Optimize cost and performance of telemetry pipelines
Improve data retention, sampling, and aggregation strategies
Collaboration & Enablement
Partner with development teams to onboard applications onto the observability platform
Provide guidance on instrumentation best practices
Document observability architectures and operational playbooks
________________________________________
Required Skills
Core Technical Skills
Strong understanding of observability concepts (metrics, logs, traces)
Hands-on experience with:
OpenTelemetry (SDKs, Collector in agent and gateway deployments)
Prometheus (scraping, recording rules, alerts)
Grafana (dashboards, alerts, correlations)
Loki or other log aggregation systems
Tempo / Jaeger for distributed tracing
Cloud & Platform
Experience with Kubernetes
Experience running workloads on AWS (preferred) or other clouds
Familiarity with core AWS services (EKS, EC2, IAM, S3, load balancers)
DevOps & SRE Tooling
CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI)
Infrastructure as Code (Terraform / CloudFormation)
Linux and networking fundamentals