Position Summary:
We are seeking an experienced Site Reliability Engineer (SRE) to join our dynamic team, responsible for ensuring the scalability, reliability, and performance of our critical production systems. The ideal candidate will have a strong background in AWS, software engineering, and cloud infrastructure, with a focus on automation and operational excellence.
In this role, you will manage and optimize cloud-based services, integrate and support Java/C++ applications, and build robust CI/CD pipelines. You will leverage containerization technologies (e.g., Docker, Kubernetes) and work with AWS managed services (RDS, DynamoDB, ElastiCache) to automate, tune, and monitor systems at scale. With a deep understanding of networking protocols and monitoring tools (e.g., Datadog, CloudWatch), you will ensure continuous system availability and drive improvements through automation.
The role requires strong problem-solving skills to handle high-volume environments and production issues, with a focus on enhancing system performance and reliability through innovative engineering solutions.
Your Responsibilities and Duties:
- Automate and Optimize Systems: Develop, maintain, and enhance automated tools and systems to ensure the high availability, performance, and reliability of services.
- Collaborative Development: Work closely with development teams to design and implement scalable software solutions.
- Problem Resolution: Identify, troubleshoot, and resolve issues related to infrastructure, network, and system performance.
- CI/CD Management: Implement and manage continuous integration and deployment pipelines for streamlined software delivery.
- Proactive Monitoring: Monitor service metrics and logs to detect patterns and predict potential issues before they occur.
- Incident Response: Participate in the on-call rotation, responding promptly to incidents and emergencies.
- Post-Incident Analysis: Conduct thorough post-incident reviews to analyze root causes and prevent future outages.
- Cloud Automation: Utilize cloud services and infrastructure as code (IaC) to automate resource provisioning and management.
- Comprehensive Documentation: Develop and maintain detailed documentation for system configurations, mapping, processes, and service records.
- Best Practices Advocacy: Promote and apply best practices in system security, reliability, and scalability.
Required Qualifications:
- BS degree in Computer Science, Engineering, or a related technical subject area
- 5+ years of hands-on AWS experience integrating, developing, and managing applications
- 5+ years of relevant work experience in a high-volume and/or critical production software environment
- 5+ years of hands-on experience in software engineering, or in supporting/maintaining software systems (Java and/or C++ services)
- 3+ years of experience with building automation into daily operational processes through one or more programming languages
- Experience with container technologies and orchestration (e.g., Docker, Kubernetes, EKS)
- Experience in configuring, tuning, and automating operational responsibilities for AWS managed data services, including RDS, DynamoDB, and ElastiCache
- Experience with monitoring and log management tools (e.g., Datadog, CloudWatch, Splunk)
- Hands-on experience in triaging and tuning Java cloud applications with integration into AWS managed services
- Solid understanding of AWS networking services and protocols (e.g., ALB, Route 53, API Gateway, TCP/IP, HTTP/HTTPS, DNS)
- Experience with developing or supporting Continuous Integration and Continuous Delivery/Deployment pipelines (CI/CD)
Soft Skills:
- Problem-Solving: Strong analytical and troubleshooting skills.
- Collaboration: Excellent teamwork and communication skills to work effectively with cross-functional teams.
- Adaptability: Ability to manage multiple tasks and projects in a fast-paced environment.
- Attention to Detail: Precision in diagnosing and fixing issues.
- Continuous Learning: A proactive attitude towards learning new technologies and improving existing skills.