We are building an AI-native software factory that rapidly launches SaaS products into the market. Our product engineers use AI-assisted development tools to build fast. We are hiring a senior, hands-on Platform Reliability Engineer to make sure every product we launch is production-grade: deployable, observable, secure, scalable, resilient, and cost-efficient.
You will own the shared production layer across our portfolio. Your job is to turn “it works” into “it runs reliably for customers.” You will define the standards, tooling, and operating practices that let a small engineering team launch and maintain many products without operational chaos.
What you will own
This role owns productionization across the service lifecycle: deployment standards, production readiness reviews, observability, SLOs and alerting, automated recovery, scalability, security hardening, and incident response. The goal is to reduce manual operational toil by turning repeatable ops work into software, templates, and automation.
Responsibilities
Build and own the “golden path” for launching and operating products in production:
- infrastructure templates
- CI/CD pipelines
- environment provisioning
- secrets management
- DNS, SSL, and edge configuration
- rollout and rollback workflows
- backups and restore testing
- monitoring, logging, tracing, dashboards, and alerts
Define and enforce a Production Readiness Review for every launch, covering reliability, security, scalability, rollback, observability, and recovery.
Define service-level indicators and service-level objectives for each product, and build alerting tied to customer impact rather than noisy infra events.
Architect and operate reliable cloud infrastructure for multi-product SaaS workloads:
- autoscaling
- load balancing
- caching
- queues and background jobs
- database reliability
- failover and disaster recovery
- capacity planning and performance tuning
Own runtime and cloud security hardening:
- IAM and least-privilege access
- secret rotation and key management
- dependency and container scanning
- patching and vulnerability management
- network boundaries and service-to-service access
- audit logging
- WAF/CDN and edge protections
- secure release controls
Lead incident response for production issues:
- triage
- mitigation
- root cause analysis
- postmortems
- follow-through remediation
Reduce operational toil by automating repetitive support, maintenance, and recovery work.
Partner closely with product engineers from design through launch so every new app is deployable through a standard platform, not a one-off setup.
For AI-native products, design runtime guardrails around:
- model/API credentials
- provider rate limits
- graceful degradation during vendor issues
- latency and cost monitoring
- fallback behavior for core AI workflows
What we’re looking for
- 5+ years of hands-on experience in SRE, platform engineering, production engineering, DevSecOps, or an infra-heavy backend role with direct production ownership
- Strong experience with at least one major cloud platform such as AWS, GCP, or Azure
- Strong infrastructure-as-code skills with Terraform, OpenTofu, Pulumi, or equivalent
- Strong CI/CD and release engineering experience
- Strong observability skills across logs, metrics, traces, dashboards, and alerting
- Strong security fundamentals across IAM, secrets, network controls, vulnerability management, and secure delivery
- Experience operating containers and/or serverless systems in production
- Solid coding and scripting ability in at least one language such as TypeScript, Python, Go, or Bash
- Experience with PostgreSQL, Redis, queues, background workers, and modern web app infrastructure
- Experience owning on-call, incidents, postmortems, and recovery processes
- Comfort working in a fast-moving startup where many products are launched from shared building blocks
- Comfort reviewing and hardening AI-generated or AI-assisted code and infrastructure changes
Nice to have
- Experience with multi-tenant SaaS products
- Experience building internal developer platforms
- SOC 2, ISO 27001, or security compliance preparation experience
- Experience with LLM/AI application operations
- Experience with FinOps or cloud cost optimization
- Experience supporting a product portfolio rather than a single application
Success in the first 90 days
- Establish a standard production deployment template for all new products
- Put centralized monitoring, logging, tracing, and alerting in place
- Create and enforce a production readiness checklist for launches
- Define initial SLOs for core products
- Implement backups and successfully test restore procedures
- Roll out a baseline security hardening standard across all production apps
- Create incident response runbooks and escalation paths
Success metrics
- Time from product-ready codebase to production launch
- Change failure rate
- Mean time to detect and mean time to recover
- Uptime and latency performance against agreed SLOs
- Number of critical production incidents
- Backup restore success rate
- Security findings closed within target time
- Infrastructure cost per product and per active customer