We are looking for a highly skilled Linux Systems Administrator / Site Reliability Engineer (SRE) to join our Infrastructure & Platform Services (ISP) team. The role focuses on managing large-scale Linux environments, implementing automation, improving observability, and ensuring the reliability of critical infrastructure.
Key Responsibilities:
- Maintain, secure, and operate a large-scale Unix/Linux server environment (1,300+ servers).
- Participate in on-call rotations to proactively address issues and prevent outages.
- Support production environments through automation of patching, upgrades, and performance improvements.
- Develop monitoring and observability solutions using Zabbix, Prometheus, Grafana, and the ELK stack.
- Manage Linux virtual machines, containers, and associated applications.
- Perform lifecycle management, including patching, updates, upgrades, and troubleshooting.
- Process daily service tickets (service requests, incidents, and change requests).
- Manage internal and external DNS updates and configurations.
- Collaborate with third-party vendors for troubleshooting and issue resolution.
- Contribute to system improvements, automation initiatives, and operational efficiency.
- Support and maintain CI/CD pipeline integrations.
- Implement Infrastructure as Code (IaC) using Ansible, Terraform, and scripting tools.
Qualifications & Skills:
- Strong experience in Linux system administration (Debian, Ubuntu, RHEL).
- Proficiency in scripting languages such as Bash, Python, or Go.
- Hands-on experience with automation tools like Ansible and Terraform.
- Good understanding of cloud platforms and virtualized environments (OpenStack, VM-based platforms).
- Experience with monitoring and observability tools such as Zabbix, Prometheus, Grafana, and the ELK stack.
- Exposure to SQL/NoSQL databases such as PostgreSQL and Couchbase.
- Experience with CI/CD pipelines and version control tools like Git.
- Working knowledge of web servers such as Nginx or Apache.
- Strong problem-solving and troubleshooting skills.
- Ability to work collaboratively across teams and contribute to process improvements.
- Strong written and verbal communication skills.
- Ability to remain calm and composed during incident response situations.