We are seeking an experienced Observability & Chaos Engineering Specialist to support monitoring, resilience, and operational excellence initiatives for AI-driven and cloud-native systems. The ideal candidate will have strong expertise in Langfuse, AWS native observability services, MCP (Model Context Protocol) agent-based environments, and Chaos Engineering using AWS Fault Injection Simulator (FIS).
The role focuses on building highly observable, resilient, and fault-tolerant distributed systems by implementing advanced monitoring, tracing, logging, and controlled failure testing practices.
- Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
- Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
- Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
- Implement and optimize AWS native observability services, including:
  - CloudWatch
  - X-Ray
  - CloudTrail
  - OpenSearch / Logging frameworks
- Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
- Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
- Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
- Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
- Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
- Analyze system behavior under stress conditions and recommend architecture improvements
- Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
- Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks
- 7-9 years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
- Strong hands-on experience with:
  - Langfuse for AI/LLM observability
  - AI workflow tracing and telemetry
- Expertise in AWS native observability tools, including:
  - CloudWatch
  - AWS X-Ray
  - CloudTrail
  - AWS monitoring and logging services
- Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
- Strong understanding of:
  - Distributed systems observability
  - Telemetry pipelines
  - Logging, tracing, and metrics collection
- Hands-on experience with Chaos Engineering practices
- Expertise using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
- Knowledge of:
  - Incident management and root cause analysis
  - Reliability engineering and operational best practices
- Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
- Experience with CI/CD pipelines and infrastructure automation
- Strong scripting/programming skills in Python or similar languages
- Strong analytical, troubleshooting, and problem-solving skills
- Experience with:
  - OpenTelemetry
  - Grafana
  - Prometheus
  - ELK/OpenSearch stack
- Familiarity with:
  - AI/LLM platforms and agentic architectures
  - Event-driven and microservices-based systems
- Knowledge of:
  - DevSecOps and cloud security monitoring
  - Performance engineering and load testing
- AWS certifications preferred
- Experience working in highly regulated or enterprise-scale environments