Job Description – 24×7 NOC Engineer (NOC)
We are looking for a 24×7 NOC Engineer to ensure the availability, performance, and reliability of production systems. The role is hands-on across monitoring/observability, incident response, troubleshooting, and automation, and works closely with engineering and infrastructure teams to reduce downtime and improve operational excellence.
Shift / Support Model: 24×7 rotational shifts (including nights/weekends) with on-call participation as required.
Key Responsibilities
- Monitor applications and infrastructure using New Relic, Datadog, Grafana, and related observability tooling; maintain dashboards and actionable alerting.
- Create and tune alerts, and reduce alert noise.
- Provide L1/L2 incident response in a 24×7 environment; triage alerts, restore service quickly, and manage escalations.
- Perform deep troubleshooting across Linux systems, Kubernetes workloads, infrastructure components, and network paths.
- Conduct log analysis using New Relic/ELK (and/or similar platforms) to identify patterns, correlate events, and support root cause analysis.
- Build and enhance automation for routine operational tasks, alert remediation, and reporting using Python and Bash.
- Manage infrastructure changes using Terraform and follow Infrastructure-as-Code practices (review, version control, rollback readiness).
- Support Kubernetes platform operations by assisting with deployments, performing cluster/service health checks, executing scaling and recycling activities, monitoring capacity and performance, and troubleshooting issues.
- Maintain clear runbooks, SOPs, and shift handover notes; ensure knowledge is captured and reusable.
- Partner with engineering and cloud/infrastructure teams to improve reliability through post-incident reviews, problem management, and continuous improvements to observability.
Must-have Skills
- Monitoring & Observability: New Relic, Datadog, Grafana; strong alert triage and dashboarding skills.
- Linux: administration fundamentals, process/service troubleshooting, permissions, performance basics.
- Automation & Scripting: Bash and Python for operational tooling and automation.
- Infrastructure as Code: Terraform (hands-on).
- Containers: Kubernetes (workload troubleshooting, cluster concepts).
- Networking: TCP/IP basics, DNS, HTTP/HTTPS, load balancing concepts, connectivity troubleshooting.
- Log Analysis: ELK (or equivalent), querying/correlation for RCA support.
Secondary Skills
- Cloud infrastructure fundamentals (AWS/Azure/GCP).
- Good communication skills: clear incident updates, shift handovers, and stakeholder coordination.
Qualifications & Experience
- Bachelor’s degree (B.Tech/B.E., MCA) or equivalent practical experience.
- 4–6 years of experience in SRE / NOC / Production Support / DevOps / Infrastructure Operations.
- Experience working in a shift-based operations environment with strong ownership and urgency.
- Ability to document clearly (runbooks, post-incident notes) and collaborate effectively with cross-functional teams.