Experience Level: 6-8 years
We are seeking engineers with an AI-first mindset who can design and build intelligent, data-driven solutions on top of observability platforms, with a focus on scalable and reliable production systems.
Key Responsibilities
Design and develop AI-driven backend systems for observability and outage management
Build intelligent services for event correlation, noise reduction, root cause analysis, anomaly detection, and prediction
Develop capabilities for incident summarization, knowledge retrieval, and operational insights
Design and optimize data pipelines for large-scale telemetry data (logs, metrics, traces, events)
Implement LLM-powered features, including conversational interfaces, RAG pipelines, and automated insights
Integrate AI/ML models into production systems, ensuring scalability and reliability
Work with OpenTelemetry and observability platforms to process and analyze system signals
Collaborate with engineering, SRE, and DevOps teams to build cloud-native solutions on Oracle Cloud Infrastructure (OCI)
Contribute to system design, code reviews, and platform evolution
Primary Skills & Experience
AI / Machine Learning & Data Engineering (Primary)
Strong proficiency in Python for AI/ML and data engineering
Experience designing and deploying AI/ML applications in production
Hands-on experience with LLMs and their APIs (OCI Generative AI, OpenAI, or similar)
Experience with prompt engineering, evaluation frameworks, and RAG pipelines
Understanding of anomaly detection, pattern recognition, and time-series analysis
Experience with vector databases / similarity search systems
Observability, Backend & Distributed Systems (Core)
Strong understanding of observability principles (metrics, logs, traces, events)
Experience with distributed systems debugging and reliability engineering
Hands-on experience with OpenTelemetry and monitoring tools (Prometheus, Grafana, OCI Monitoring)
Strong backend development experience with Python, APIs, and microservices
Familiarity with event-driven architectures and streaming platforms (Kafka, OCI Streaming)
Understanding of scalable, fault-tolerant system design
Experience with monitoring, alerting, dashboards, and search platforms (Elasticsearch/OpenSearch)
Qualifications
Bachelor's or Master's degree in computer science or a related field
Experience with AI-powered observability or AIOps systems preferred
Knowledge of incident management, root cause analysis, and SLO/SLA frameworks
Experience with multi-tenant, large-scale distributed systems
Strong communication and collaboration skills in an agile environment
Pay: ₹1,600,000.00 - ₹2,400,000.00 per year
Work Location: In person