Platform Engineer

🇺🇸 United States

Logistics

Management

Python

Rust

Docker

Kubernetes

AWS

Terraform

Machine Learning

Design

Backend

Devops

$140 - $250

About HUD

Platform for building RL environments and evals

Tech description:

AWS
Docker
Rust
Postgres
Python
Next + React

Job description:

## About HUD

[HUD](https://www.hud.ai/) is building infrastructure to create RL training data and evals for frontier AI agents, along with the HUD marketplace, where these are sold to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were part of YC W25.

## About the role

We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.

**This is not a pure infrastructure role.** The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems. You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.

## Responsibilities

- Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services

- Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management

- Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths

- Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly

- Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk

- Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling

## Experience

You may be a good fit if you:

- Have owned production cloud infrastructure for a high-availability, user-facing platform, with responsibility for uptime, performance, deployment safety, and cost

- Have deep experience with AWS infrastructure and containerized systems; experience with tools like Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, and secrets management is strongly preferred

- Have built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems

- Have strong backend engineering judgment and can reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes

- Can write clean, maintainable code and apply strong software engineering judgment across product architecture, infrastructure, backend systems, and developer workflows

Strong candidates may also have:

- Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms

- Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services

- Experience reducing cloud spend through better architecture, autoscaling, workload placement, caching, cleanup systems, or observability

- Experience building internal platforms or tools that make engineers faster without hiding too much complexity

_We prioritize technical aptitude, ownership, and learning potential over years of experience._

## **Team & company details**

- **Team Size** : ~15 people currently, mostly full-time and in person, with some remote.

- **Our team:** 4 International Olympiad medalists (IOI, ILO, IPhO), serial AI startup founders, and researchers with publications at ICLR, NeurIPS, etc.

- **Company stage:** We have 8 figures in funding and high revenue growth. We’re scaling profitably and quickly to meet very strong demand.

## **Logistics**

- **Employment** : Full-time.

- **Location** : On-site in the San Francisco Bay Area.

- **Visa Sponsorship** : We provide relocation and visa support for strong full-time candidates moving to the US.

- **Timeline** : Applications are rolling. The process is 2 technical interviews and a 1-week work trial.

## **What we offer**

- Competitive compensation based on experience and location

- 100% covered top-of-the-line medical, dental, and vision from Blue Shield of CA

- Lunch and dinner when you’re in the office

- Company-wide holiday break (Christmas Eve to New Year’s Day) on top of PTO and paid holidays

- Other perks including an Equinox membership, 401k, and commuter benefits

- Unlimited\* access to tokens for ChatGPT, Claude Code, Cursor, etc. \*_By unlimited, we mean no one on our token usage leaderboard has ever hit a limit. So we have no idea what the limit is._

Due to high volume, we may not actively respond to every application, but feel free to contact us at [[email removed]](mailto:[email removed]) or elsewhere if we missed your application!