Platform Engineer
🇺🇸 United States
Logistics
Management
Python
Rust
Docker
Kubernetes
AWS
Terraform
Machine Learning
Design
Backend
Devops
$140 - $250
Platform for building RL environments and evals
Tech description:
AWS
Docker
Rust
Postgres
Python
Next + React
Job description:
## About HUD
[HUD](https://www.hud.ai/) is building infrastructure to create RL training data and evals for frontier AI agents, along with the HUD marketplace, where these are sold to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were part of YC W25.
## About the role
We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.
**This is not a pure infrastructure role.** The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems. You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.
## Responsibilities
- Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services
- Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management
- Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths
- Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly
- Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk
- Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling
## Experience
You may be a good fit if you:
- Have owned production cloud infrastructure for a high-availability, user-facing platform, with responsibility for uptime, performance, deployment safety, and cost
- Have deep experience with AWS infrastructure and containerized systems; experience with tools like Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, and secrets management is strongly preferred
- Have built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems
- Have strong backend engineering judgment and can reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes
- Can write clean, maintainable code and apply strong software engineering judgment across product architecture, infrastructure, backend systems, and developer workflows
Strong candidates may also have:
- Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms
- Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services
- Experience reducing cloud spend through better architecture, autoscaling, workload placement, caching, cleanup systems, or observability
- Experience building internal platforms or tools that make engineers faster without hiding too much complexity
_We prioritize technical aptitude, ownership, and learning potential over years of experience._
## Team & company details
- **Team size:** ~15 people currently, mostly full-time and in person, with some remote.
- **Our team:** Includes 4 International Olympiad medalists (IOI, ILO, IPhO), serial AI startup founders, and researchers with publications at ICLR, NeurIPS, etc.
- **Company stage:** We have 8 figures in funding and high revenue growth. We’re scaling profitably and quickly to meet very strong demand.
## Logistics
- **Employment:** Full-time.
- **Location:** On-site in the San Francisco Bay Area.
- **Visa sponsorship:** We support relocation and US visas for strong full-time candidates.
- **Timeline:** Applications are rolling. The process is 2 technical interviews and a 1-week work trial.
## What we offer
- Competitive compensation based on experience and location
- 100% covered top-of-the-line medical, dental, and vision from Blue Shield of CA
- Lunch and dinner when you’re in the office
- Company-wide holiday break (Christmas Eve to New Year’s Day) on top of PTO and paid holidays
- Other perks including an Equinox membership, 401k, and commuter benefits
- Unlimited\* access to tokens for ChatGPT, Claude Code, Cursor, etc. \*_By unlimited, we mean no one on our token usage leaderboard has ever hit a limit. So we have no idea what the limit is._
Due to high volume, we may not actively respond to every application, but feel free to contact us at [email removed] or elsewhere if we missed your application!








