JD - Engineer Sr Lead, Site Reliability
Mandatory Skills:
- Experience in Resiliency Testing / Chaos Engineering
- Strong knowledge of Service Health Monitoring using SLI/SLO frameworks
- Solid understanding of core SRE concepts
- Hands-on experience in Performance Engineering / Performance Testing
Responsibilities:
- Design and maintain monitoring solutions for infrastructure, application performance, and user experience.
- Implement automation tools to streamline tasks, scale infrastructure, and ensure seamless deployments.
- Ensure application reliability, availability, and performance, minimizing downtime and optimizing response times.
- Lead incident response, including identification, triage, resolution, and post-incident analysis.
- Conduct capacity planning, performance tuning, and resource optimization.
- Collaborate with security teams to implement best practices and ensure compliance.
- Manage deployment pipelines and configuration management for consistent and reliable application deployments.
- Develop and test disaster recovery plans and backup strategies.
- Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response processes.
- Participate in on-call rotations and provide 24/7 support for critical incidents.
What you bring:
- Proficiency in development technologies, architectures, and platforms (web, API).
- Experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code (IaC) tools.
- Hands-on experience with Docker and Kubernetes.
- Knowledge of monitoring tools (Prometheus, Grafana, Datadog) and logging frameworks (Splunk, ELK Stack).
- Experience in incident management and post-mortem reviews.
- Strong troubleshooting skills for complex technical issues.
- Proficiency in scripting languages (Python, Bash) and automation tools (Terraform, Ansible).
- Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
- An ownership mindset toward engineering and product outcomes.
- Excellent interpersonal communication, negotiation, and influencing skills.