As a Staff Software Engineer (Observability), you will be responsible for defining and implementing the observability strategy across PCS Digital Solutions Cloud Applications.
Roles and Responsibilities
In this role, you will:
-
Define and evolve the observability vision and roadmap for PCS DS applications.
-
Design and implement or integrate standardized observability frameworks (metrics, logs, traces, events, profiling).
-
Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
-
Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
-
Evaluate, integrate, and optimize observability agents (e.g., Prometheus, Fluent Bit, OTel).
-
Design self-remediation solutions leveraging observability tooling.
-
Implement best practices for using the GenAI tools of observability platforms.
-
Lead or contribute to incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
-
Conduct Operational Readiness Reviews (ORRs) to validate monitoring, alerting, and rollback strategies before go-live.
-
Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, GDPR, HITRUST).
-
Mentor engineers and promote a culture of observability-first development.
Required Qualifications
-
Bachelor’s or master’s degree in computer science, engineering, or a related technical field.
-
10+ years of experience in software engineering, SRE, or platform engineering roles.
-
4+ years of experience contributing to observability solutions in cloud-native environments (Kubernetes, microservices, serverless).
-
Deep expertise in observability pillars (metrics, logs, traces) and tools such as OpenTelemetry, Prometheus, Grafana, Datadog, and Dynatrace.
-
Strong programming/scripting skills (e.g., Go, Python, Bash, Terraform).
-
Experience with distributed tracing, SLO/SLI frameworks, and incident response workflows.
-
Deep expertise in distributed systems, microservices, and cloud platforms (AWS, Azure, GCP).
-
Experience with AI-powered anomaly detection, automated incident response, and cost optimization for observability at scale.
-
Familiarity with SRE practices and chaos engineering.
-
Excellent communication and collaboration skills.
Desired Characteristics
-
Experience in healthcare or regulated industries.
-
Knowledge of data privacy and compliance (HIPAA, HITRUST).
-
Experience with cost optimization and telemetry data governance.
-
Contributions to open-source observability projects.
Relocation Assistance Provided: No