DevOps / SRE Engineer - AI Platform
from Thailand
The DevOps / SRE Engineer owns the operational substrate of an AI-native retail decisioning platform: infrastructure, CI / CD, observability, cost meter, and incident response for a system that runs production agents taking real business actions. The role builds on the enterprise Terraform standard, CI / CD spine, and FinOps tagging policy rather than reinventing parallel infrastructure.
Remote candidates outside of Thailand are welcome to apply.
Key Responsibilities:
- Adopt the enterprise Terraform standard and module library for all platform infrastructure; author platform-specific modules where needed (agent runtime, vector DB, knowledge graph); run drift detection weekly.
- Build platform-specific CI / CD pipelines on the enterprise spine (service deploys, agent deploys, eval-gate enforcement); integrate eval gates so no agent reaches production without a passing eval.
- Operate rollback orchestration with sub-15-minute recovery; run quarterly game days.
- Own the platform observability stack: OpenTelemetry, Langfuse for LLM traces, custom dashboards for per-agent cost.
- Implement the per-agent cost meter end-to-end (token counts, vector queries, model inference, downstream LLM Gateway costs); surface cost data to the enterprise GenAI cost dashboard.
- Stand up the platform on-call rotation; author runbooks for every production agent and service; lead incident response with measurable corrective actions.
- Implement platform cost-tagging policy consistent with the enterprise standard (team, domain, environment, project, agent, suite, persona); report monthly to Cost Review.
- Drive cost optimisation: right-sizing, caching, model routing decisions, reserved compute.
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
- 5+ years of SRE / DevOps experience with production ownership.
- Terraform at scale: modules, state, drift, environment promotion.
- CI / CD for data + ML / AI services (GitLab CI / CD or comparable).
- Cloud platform (Azure preferred; AWS / GCP transferable).
- Observability: OpenTelemetry, Langfuse (or comparable LLM traces), custom dashboards.
- FinOps: tagging policies, attribution, optimisation.
- Incident response: on-call, post-mortems, runbook authorship.
Preferred Qualifications
- AI / agent platform SRE experience; cost-meter / chargeback systems built or operated.
- Multi-cloud production experience; open-source contributions to IaC / observability tooling.
- AI / ML / agent system observability instrumentation (LLM cost, agent cost, eval scores).
- Vendor certifications such as HashiCorp Terraform Associate / Professional, Azure Solutions Architect Associate, or Databricks Data Engineer Professional.