Job Description
Key responsibilities include:
- Help design, build, deploy and configure the new monitoring infrastructure that will enable us to work faster and smarter.
- Work with the tech leads of system migrations to ensure they correctly monitor their new platforms, and help create alerting rules and escalation paths.
- Ensure that the monitoring system is itself monitored, with redundant escalation paths to detect when parts of it have failed.
- Develop and maintain any codebase required to deliver solutions and customer-specific configuration.
- Ensure the platform is configured as automatically as possible, using technologies such as service discovery, Ansible, and Git to reduce manual configuration.
- Help tech leads and system owners build dashboards in Grafana and other dashboarding tools.
- Work with our NOC teams and system owners to gather requirements for monitoring and alerting and ensure these critical functions are maintained during system transitions.
- Help transition custom monitoring scripts from Nagios to either the Prometheus or Icinga 2 platform.
- Integrate existing monitoring systems into the new design and help transition away from those systems as required.
Qualifications:
Basic degree or diploma in IT. Certifications from Microsoft and in Enterprise Linux, Cloud Foundations, AWS Cloud Practitioner, or similar, as well as DevOps-centered training and qualifications.
Experience:
More than 5 years of experience in a systems administration role implementing, developing, and maintaining enterprise-level platforms, preferably in the media industry.
In-depth knowledge of the design and implementation of the following areas is crucial:
- Management of Docker and/or Kubernetes platforms
- Docker container build processes
- Red Hat/Oracle Linux/CentOS system administration
- Monitoring technologies: Prometheus, InfluxDB, Icinga 2, Nagios, SNMP, Grafana
- Logging technologies: Kibana, Elasticsearch, CloudWatch
- Orchestration management with a focus on one of the following: Ansible, CloudFormation, Terraform, Puppet, Chef
- JSON and API integration