Observability & Chaos Engineering Specialist

Techblocks
Hyderabad, Telangana

Job details

Qualifications

CI/CD
Law
Incident management
MCP
Kubernetes
SAP
DevOps
Master's degree
Microservices
AWS
Analysis skills
Cloud security
Distributed systems
Continuous integration
Scripting
ISO 27001
Root cause analysis
AI
Python
Analytics

Full job description

Position Title: Observability & Chaos Engineering Specialist

Experience: 7 - 10 years

Job Location: Remote

Work Mode: Remote

Time Zone/Shift: Starts at 12:30 PM IST

Requirement:

Role: Observability & Chaos Engineering Specialist (Langfuse / AWS / MCP Agents)

Role Overview

We are seeking an experienced Observability & Chaos Engineering Specialist to support monitoring, resilience, and operational excellence initiatives for AI-driven and cloud-native systems. The ideal candidate will have strong expertise in Langfuse, AWS native observability services, MCP agent-based environments, and Chaos Engineering using AWS Fault Injection Simulator (FIS).

The role focuses on building highly observable, resilient, and fault-tolerant distributed systems by implementing advanced monitoring, tracing, logging, and controlled failure testing practices.

Key Responsibilities

Design and implement observability frameworks for AI/agent-based systems and distributed cloud-native applications
Configure and manage Langfuse for LLM/AI workflow observability, tracing, monitoring, and evaluation
Develop monitoring and telemetry solutions for MCP agent setups and multi-agent orchestration environments
Implement and optimize AWS native observability services, including:
- CloudWatch
- X-Ray
- CloudTrail
- OpenSearch / Logging frameworks
Establish centralized logging, distributed tracing, metrics collection, and alerting mechanisms
Design and execute Chaos Engineering experiments using AWS Fault Injection Simulator (FIS) to validate system resilience and recovery capabilities
Simulate infrastructure, network, and service failures to identify system weaknesses and improve fault tolerance
Collaborate with DevOps, Platform Engineering, AI Engineering, and Security teams to improve operational reliability
Build dashboards, alerts, and health monitoring systems for proactive incident detection and response
Analyze system behavior under stress conditions and recommend architecture improvements
Support incident troubleshooting, root cause analysis, and reliability optimization initiatives
Maintain technical documentation for observability architecture, chaos testing scenarios, and operational runbooks

Required Skills & Qualifications

7-9 years of experience in Observability Engineering, SRE, DevOps, or Platform Engineering
Strong hands-on experience with:
- Langfuse for AI/LLM observability
- AI workflow tracing and telemetry
Expertise in AWS native observability tools, including:
- CloudWatch
- AWS X-Ray
- CloudTrail
- AWS monitoring and logging services
Experience working with MCP (Model Context Protocol) agent setups or multi-agent orchestration frameworks
Strong understanding of:
- Distributed systems observability
- Telemetry pipelines
- Logging, tracing, and metrics collection
Hands-on experience with Chaos Engineering practices
Expertise in using AWS Fault Injection Simulator (FIS) for resilience and fault-tolerance testing
Knowledge of:
- Incident management and root cause analysis
- Reliability engineering and operational best practices
Familiarity with containerized and cloud-native environments (ECS/EKS/Kubernetes)
Experience with CI/CD pipelines and infrastructure automation
Strong scripting/programming skills in Python or similar languages
Strong analytical, troubleshooting, and problem-solving skills

Preferred / Nice-to-Have Skills

Experience with:
- OpenTelemetry
- Grafana
- Prometheus
- ELK/OpenSearch stack
Familiarity with:
- AI/LLM platforms and agentic architectures
- Event-driven and microservices-based systems
Knowledge of:
- DevSecOps and cloud security monitoring
- Performance engineering and load testing
AWS certifications preferred
Experience working in highly regulated or enterprise-scale environments

About Us:

We are a global, cloud-native organization with a strong presence across North America and India, delivering innovative digital transformation solutions to clients across diverse industries such as Financial Services, Healthcare, Retail & E-commerce, Manufacturing, and Technology. Our strong client base includes Fortune 500 enterprises as well as high-growth mid-market and startup organizations, giving our teams exposure to a wide variety of business challenges and cutting-edge solutions.

Our technology practices are built around modern, future-ready capabilities including Cloud Engineering, Data & Analytics, AI/ML, Digital Experience Platforms, Application Modernization, and Enterprise Solutions such as SAP and other leading platforms. We follow a design thinking-led approach combined with agile and lean engineering practices to deliver scalable, high-impact solutions. Backed by globally recognized certifications such as ISO 27001, SOC 1, SOC 2, SOC 3, UK Cyber Essentials Plus, and CMMI Level 3, we ensure the highest standards of security, compliance, process maturity, and quality across all our engagements.

Why Join Us:

Opportunity to work on global projects and Fortune 500 clients
Exposure to cutting-edge technologies
Strong learning, mentorship, and career growth programs
Collaborative and innovation-driven work culture

If you are passionate about working on innovative technologies and want to be part of a fast-growing organization, we encourage you to apply.

Company Details:

Website: http://tblocks.com

LinkedIn: https://linkedin.com/company/techblocks/about
