- Configure and maintain Datadog dashboards, alerts, monitors, SLOs, and SLIs.
- Integrate Datadog with cloud environments (AWS / Azure / GCP), Kubernetes, and on-prem applications.
- Implement APM traces, RUM, Infrastructure Monitoring, and Log Management.
- Develop and standardize observability best practices across teams.
- Troubleshoot performance issues using Datadog metrics, logs, and traces.
- Automate monitoring setup using Terraform / Ansible / CI/CD tools.
- Work closely with DevOps, SRE, and development teams to ensure platform reliability.
- Optimize alerting to reduce noise and enhance incident response processes.

Required Skills

- Hands-on experience with Datadog (Dashboards, Log Pipelines, Metrics, Alerts, APM).
- Strong knowledge of Linux-based systems and system performance metrics.
- Experience working with Containers & Kubernetes (EKS / AKS / GKE).
- Proficiency with at least one scripting language: Python / Bash / Shell.
- Experience with Cloud platforms: AWS / Azure / GCP.
- Understanding of CI/CD pipelines and Infrastructure as Code (Terraform preferred).

Good to Have

- Experience with Incident Management / SRE practices.
- Familiarity with Prometheus, Grafana, Splunk, New Relic, or similar tools.
- Knowledge of Service Mesh / Microservices architecture.
- Networking basics (DNS, Load balancing, SSL/TLS).