The Cloud/Infrastructure Engineer will help design, implement, and maintain our organization's cloud and on-premises infrastructure. This role owns day-to-day infrastructure work independently while contributing to broader strategy alongside senior engineers. The ideal candidate has a solid understanding of system stability, monitoring, and performance optimization, and can troubleshoot complex issues and drive practical solutions with limited oversight.
At Redwood, we strive to automate and optimize our operations and application integrations to support the scale of our business and customer needs. As an observability engineer, you will build and maintain systems that collect, analyze, and visualize data from software applications and infrastructure. You will provide insight into system health and performance, enable proactive problem identification, and contribute to the ongoing evolution of our observability solutions in partnership with other teams.
Implement Infrastructure Strategy: Contribute to the design and implementation of infrastructure initiatives, including data collection pipelines, data processing, and visualization dashboards, with scalability and performance in mind. This includes metrics, logging, tracing, and alerting to provide end-to-end visibility into system behavior.
System Monitoring and Performance Optimization: Monitor system performance, identify bottlenecks, and implement solutions to optimize efficiency and reliability, including proactive identification of potential issues and preventative measures.
Incident Response and Troubleshooting: Respond to incidents, troubleshoot complex system issues, and perform root cause analysis to help prevent future occurrences.
Tooling and Infrastructure: Configure, manage, and maintain observability tooling and infrastructure (log aggregation, metric collection, distributed tracing, alerting), and contribute to tool evaluation and selection. Work with tools such as Prometheus, Datadog, New Relic, VictoriaMetrics, and similar platforms to capture relevant metrics, logs, and traces.
Alerting and Incident Management: Configure and maintain alerts based on defined thresholds and anomalies, ensuring timely notification of potential issues and supporting rapid incident response.
Data Analysis and Troubleshooting: Analyze collected data to identify performance bottlenecks, root causes, and trends within complex systems, collaborating with engineering teams to resolve problems.
Automation and Optimization: Develop and implement automation for monitoring and alerting to improve response times and reduce manual effort. Create automations for repetitive tasks (e.g., RMJ automation for new employees).
Cross-functional Collaboration: Collaborate with development, operations, and other teams to apply cloud and infrastructure best practices across the software development lifecycle, ensuring proper instrumentation and data collection.
Capacity Planning: Use observability data to identify potential capacity constraints and support infrastructure scaling decisions.
Security and Compliance: Support security and compliance efforts, including access reviews, in line with organizational requirements.
Problem Solving: Resolve challenges related to system stability, data connectivity, and application configurations.
Dashboard Development: Build and maintain comprehensive, user-friendly dashboards that provide real-time insights into system health, key performance indicators, and operational metrics.
- Bachelor's degree in Computer Science, a related field, or equivalent experience.
- 3–5 years of systems administration/engineering experience.
- Solid understanding of Windows Server and Linux systems.
- Hands-on experience with layer 2/3 networking, including VLANs, routing, and ACLs; experience with multi-vendor switches and firewalls (e.g., HP, Dell, Palo Alto, Cisco ASA, Sophos) a plus.
- Good understanding of monitoring principles and best practices.
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Scripting and automation skills (e.g., Python, Bash, PowerShell).
- Experience configuring IPsec VPN tunnels, plus familiarity with BGP routing and subnet peering across hybrid cloud environments.
- Ability to interpret data sets, identify patterns, and draw actionable insights.
- Strong communication and collaboration skills, with the ability to present findings to cross-functional teams.
- Ability to manage multiple priorities and work in a fast-paced environment.
- Experience with observability tools such as Prometheus, Grafana, Elasticsearch, Kibana, Jaeger, Datadog, New Relic, VictoriaMetrics, Zipkin, or similar.
- Experience contributing to security reviews and documentation with cross-functional/product teams.
- Experience with SSO and/or MFA implementations.
- Deeper public cloud platform experience or certifications (AWS, Azure, GCP).
- Experience with networking infrastructure across multi-vendor environments.
- Experience with at least one scripting language for automation or proof-of-concept work.