About
We are recruiting on behalf of a leading global financial technology organization that is committed to delivering best-in-class service reliability and performance. This is a full-time opportunity on the client's payroll.
As part of its continued investment in platform stability and operational excellence, the organization is expanding its Site Reliability Engineering (SRE) team to support a highly available 24x7 FX trading environment.
We are looking for a highly motivated and technically talented Senior Application SRE who will focus on application monitoring, automation, and optimization to enhance system stability, minimize downtime, and improve overall user experience.
The ideal candidate will bring strong problem-solving skills, experience working with large-scale distributed systems, and a deep understanding of software and infrastructure reliability principles. This role offers an opportunity to work on mission-critical applications in a fast-paced, high-performance environment where reliability and operational excellence are key priorities.
Responsibilities
- Ensure the reliability, performance, and availability of applications through proactive monitoring and automation.
- Develop and maintain real-time monitoring, alerting, and logging systems to detect and resolve issues before they impact customers.
- Automate manual operations, including application deployment, configuration, scaling, and recovery.
- Collaborate with software engineering teams to integrate reliability best practices into the development lifecycle.
- Conduct root cause analysis (RCA) and implement preventive measures to mitigate recurring issues.
- Support a 24x7 distributed enterprise environment across multiple global data centers.
- Work closely with Support to enhance incident response processes, ensuring fast and effective resolution of technical escalations.
- Participate in on-call rotations to support critical application issues and outages.
- Maintain and optimize CI/CD pipelines to ensure fast and reliable application releases.
- Enhance system security by managing SSL certificates, encryption, and authentication mechanisms.
- Foster a culture of continuous improvement by evaluating new tools, frameworks, and methodologies to enhance system reliability.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 5+ years of experience in a similar role, focusing on application reliability, automation, and performance optimization.
- Strong expertise in Linux and Windows system administration.
- Proficiency in at least one scripting language (e.g., Python, Shell, Perl, JavaScript).
- Experience with Docker, Kubernetes, or other containerization technologies.
- Familiarity with CI/CD tools such as Jenkins and with deployment automation frameworks.
- Hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack, New Relic, Datadog).
- Understanding of networking concepts (TCP/IP, DNS, load balancing, firewalls).
- Experience with configuration management tools like Ansible, Salt, or Puppet.
- Strong debugging and troubleshooting skills across application, database, and infrastructure layers.
- Ability to work in a fast-paced, high-pressure environment with multiple priorities.
- Excellent communication and collaboration skills to work effectively with engineering and support teams.
Nice-to-Have Skills
- Experience in the financial services or trading industry.
- Knowledge of distributed computing and cloud platforms (AWS, GCP, Azure).
- Exposure to security best practices and compliance standards.
- Familiarity with incident management frameworks (ITIL, SRE best practices, or similar methodologies).
Work Location: Hybrid remote in Bangalore City, Bengaluru, Karnataka