Who We Are
Finance leaders choose Billtrust to get paid faster, control costs, and maximize customer satisfaction. As the leader in B2B accounts receivable workflow and payment software, we provide the world’s leading brands with AI-powered solutions across the full AR lifecycle—from invoice presentment and payment processing to cash application and collections. With over 2,600 global customers, more than $1 trillion in invoice dollars processed, and a proprietary network of 13 million buyers, Billtrust delivers business value through deep industry expertise and a culture relentlessly focused on meaningful customer outcomes.
We’re an AI-first company, not just in what we build for our customers, but in how we work. Across every function, our teams use AI tools daily to work faster, make better decisions, and deliver higher-quality outcomes. We hire exceptional people, give them cutting-edge AI capabilities, and measure success by the impact they create. If you want to do the best work of your career at the frontier of AI and fintech, Billtrust is the place to do it.
Our Values
Customers
We relentlessly increase value for our customers and do the right thing for them.
Action
We make ‘thoughtfully fast’ decisions, act quickly, cut through red tape, deliver progress not perfection, and take ownership and accountability.
Team Spirit
We put the team ahead of ourselves, foster trust and respect, collaborate with passion, despise toxic politics, value our differences, and celebrate together.
Innovation
We challenge the status quo, experiment thoughtfully, and are novel and brilliant in what we create.
Excellence
We love to win, but we hate losing even more. We aspire to be the best and take pride in our work. When we fall short, we own it and come back stronger.
Site Reliability Engineer
As a Site Reliability Engineer within our Operations Engineering Center, you'll ensure the reliability, scalability, and performance of Billtrust's infrastructure that powers mission-critical order-to-cash operations. You'll participate in our follow-the-sun SRE coverage across time zones. You'll respond to incidents, implement monitoring and alerting strategies, and engineer autonomous incident response systems through agentic runbooks and intelligent triage. Your work will directly impact billions of dollars in transactions processed through our platform while pioneering AI-driven operational excellence.
Key Responsibilities
- Respond to incidents, perform root cause analysis, and lead post-mortem discussions
- Implement and maintain comprehensive monitoring, alerting, and observability across infrastructure
- Establish and maintain SLO frameworks, tracking and improving reliability metrics
- Engineer autonomous alert triage agents and agentic runbooks for incident response
- Design and build intelligent incident correlation engines using AI/ML techniques
- Develop and maintain infrastructure automation, CI/CD pipelines, and deployment procedures
- Manage Kubernetes clusters, container orchestration, and cloud platform resources (AWS)
- Lead toil reduction initiatives through automation, focusing on high-impact pain points
- Collaborate with platform and product teams on infrastructure requirements and capacity planning
Required Qualifications
Experience & Technical Background
- 5+ years of hands-on experience in Site Reliability Engineering or infrastructure operations
- Strong proficiency with Linux/Unix systems administration and shell scripting
- Experience with cloud platforms (AWS preferred, Azure or GCP acceptable)
- Hands-on Kubernetes and container orchestration experience
- Demonstrated expertise in incident response, troubleshooting, and post-mortem analysis
- Strong background with monitoring tools (Datadog, Prometheus, Grafana, PagerDuty)
- Experience with infrastructure automation and infrastructure-as-code tools (Terraform)
- Proficiency with at least one programming/scripting language (Python, Go, Bash preferred)
- Proficiency with Claude Code, GitHub Copilot, or similar AI coding assistants
Soft Skills & Attributes
- Excellent communication skills, particularly during high-stress incident situations
- Problem-solving mindset with a focus on automated solutions over manual workarounds
- Reliability-first mentality with attention to detail and systems thinking
- Ability to thrive in a distributed, follow-the-sun team environment
- Comfort with on-call responsibilities and 24x7 operational commitment