Engineer Sr Lead, Site Reliability

Job details

Mandatory Skills:

Design and maintain monitoring solutions for infrastructure, application performance, and user experience.
Implement automation tools to streamline tasks, scale infrastructure, and ensure seamless deployments.
Ensure application reliability, availability, and performance, minimizing downtime and optimizing response times.
Lead incident response, including identification, triage, resolution, and post-incident analysis.
Conduct capacity planning, performance tuning, and resource optimization.
Collaborate with security teams to implement best practices and ensure compliance.
Manage deployment pipelines and configuration management to keep application deployments consistent and reliable.
Develop and test disaster recovery plans and backup strategies.
Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response processes.
Participate in on-call rotations and provide 24/7 support for critical incidents.

What you bring:

Proficiency in development technologies, architectures, and platforms (web, API).
Experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code (IaC) tools.
Hands-on experience with Docker and Kubernetes.
Knowledge of monitoring tools (Prometheus, Grafana, Datadog) and logging frameworks (Splunk, ELK Stack).
Experience in incident management and post-mortem reviews.
Strong troubleshooting skills for complex technical issues.
Proficiency in scripting languages (Python, Bash) and automation tools (Terraform, Ansible).
Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
A strong sense of ownership over engineering and product outcomes.
Excellent interpersonal communication, negotiation, and influencing skills.