We are seeking a seasoned Site Reliability Engineer (SRE) with a solid background in payment systems and high-availability architectures. The ideal candidate will have hands-on experience managing large-scale, distributed systems in production, with a deep understanding of reliability, scalability, and performance tuning in the financial services or payments industry.
- Design, build, and maintain scalable, resilient, and secure infrastructure for high-volume payment platforms.
- Ensure system uptime, reliability, and performance through effective monitoring, alerting, and incident response strategies.
- Collaborate with software engineering and DevOps teams to implement CI/CD pipelines and improve deployment efficiency.
- Automate infrastructure management tasks using Infrastructure-as-Code (IaC) tools (Terraform, Ansible, etc.).
- Proactively identify and mitigate system bottlenecks and potential points of failure.
- Manage disaster recovery strategies, failover planning, and performance testing for critical payment services.
- Work with development teams to ensure services are designed for reliability, scalability, and observability from the ground up.
- Participate in root cause analysis and post-incident reviews to prevent future outages.
- 8+ years of overall experience in infrastructure engineering or SRE roles, including at least 3 years in the payments/fintech domain.
- Strong understanding of payment protocols (UPI, IMPS, RTGS, NEFT, SWIFT, etc.) and transaction processing systems.
- Proven expertise in Linux systems administration, cloud platforms (AWS, GCP, or Azure), and container orchestration (Kubernetes).
- Solid experience with monitoring and logging tools such as Prometheus, Grafana, the ELK Stack, and Splunk.
- Proficiency in one or more scripting languages (Python, Shell, Go, etc.) for automation.
- Experience with incident management, SLAs, and system troubleshooting in high-pressure environments.
- Familiarity with security and compliance practices in the financial sector (e.g., PCI-DSS, ISO 27001).
- Previous experience supporting mission-critical applications in banking or financial services.
- Exposure to Kafka, Redis, or other real-time streaming and caching technologies.
- Experience with Site Reliability Engineering principles and implementing SLOs/SLIs.
- Understanding of the error budget concept and how it ties into availability and release decisions.
- Experience with a performance testing tool such as k6, JMeter, or LoadRunner.
- Familiarity with mocking tools such as Mockito, WireMock, or Microcks.