Bangalore, Pune, Hyderabad
full-time
senior
Job Summary

We are seeking an Observability / Monitoring Engineer with strong experience in designing, configuring, and maintaining monitoring and logging solutions across application and GCP cloud environments. The ideal candidate will have hands-on expertise with modern APM and monitoring tools, strong Python development skills, and the ability to build insightful dashboards that improve system reliability and performance.

Key Responsibilities

- Build, configure, and maintain monitoring and observability tools across application and GCP environments.
- Implement and manage APM solutions, log aggregation systems, and performance monitoring frameworks.
- Develop monitoring scripts, automation, and integrations using Python.
- Create, enhance, and maintain dashboards for system metrics, logs, and performance insights.
- Work with SRE, DevOps, Development, and Cloud teams to establish monitoring best practices.
- Troubleshoot monitoring issues and ensure high availability of observability platforms.
- Define alerting rules, SLO/SLIs, and thresholds for proactive detection of system issues.
- Contribute to improving reliability, performance visibility, and root-cause analysis processes.
- Support observability solutions during incident response and post-incident reviews.

Required Skills & Experience

- Hands-on experience with monitoring/observability tools such as:
  - Prometheus
  - Grafana
  - Dynatrace stack (or similar APM tools like New Relic, Datadog, AppDynamics)
- Strong programming experience with Python, especially for automation and integration tasks.
- Good understanding of dashboards, metric visualization, and alerting design.
- Experience with log aggregation, metrics pipelines, and monitoring architectures.
- Familiarity with GCP cloud services and their native monitoring capabilities (e.g., Cloud Monitoring, Cloud Logging).
- Strong analytical and problem-solving skills.
- Ability to collaborate with cross-functional engineering teams.