Job Description:
Job Purpose
Site Reliability Engineer (SRE) to assist with day-to-day, incident-related activities supporting ESRE services. Build actionable alerts and automation to prevent incidents, detect performance bottlenecks, and identify maintenance activities.
Responsibilities
- Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
- Collaborate with Product and Support teams to plan and deploy product releases.
- Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams.
- Ensure services are designed for 24/7 availability, operational readiness, and rigor.
- Implement proactive monitoring, alerting, trend analysis, and self-healing systems.
- Contribute to product development and engineering as needed to ensure quality of service for highly available services.
- Resolve product or service defects, design changes, infrastructure changes, and operational changes.
- Implement automated tests, automated deployments, and operational tools.
Knowledge and Experience
- 3+ years of relevant experience as an SRE in a production support environment
- BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
- Excellent troubleshooting skills, utilizing a systematic problem-solving approach
- Experience with elastic scalability, fault tolerance, and other cloud architecture patterns
- Experience operating on AWS (both PaaS and IaaS offerings)
- Experience with both Windows (2016 R2+) and Linux
- Experience with Continuous Integration and Continuous Delivery concepts
- Nice to have: experience with containerization technologies such as Docker