Lead Director, Site Reliability Engineering - Client Experience
🇺🇸 United States
Management
Kubernetes
GCP
Azure
Finance
Machine Learning
Design
DevOps
Testing
$144,200.00 - $288,400.00
We're building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.
Position Summary
The Lead Director, Site Reliability Engineering - Client Experience is responsible for building, leading, and scaling hands-on SRE teams supporting Adjudication and Client Experience platforms across on-premises, Azure, and GCP environments.
This role owns end-to-end reliability engineering, from defining SLOs and error budgets to designing resilient cloud architectures, automating operations, and embedding reliability directly into the SDLC. The ideal candidate is a deeply technical leader who has personally designed, operated, and scaled highly available distributed systems and can coach teams to do the same.
You will work closely with engineering, architecture, product, infrastructure, and security teams to shift operations from reactive to predictive, reduce operational toil, and ensure platform stability at enterprise scale.
Key Responsibilities
Lead and grow hands-on SRE teams responsible for reliability, scalability, performance, and availability of Tier-1 services across Azure and GCP
Establish and enforce SRE best practices, including SLIs, SLOs, error budgets, toil reduction, and automation-first operations
Review and influence architecture, reliability designs, and failure modes for critical platforms and services
Drive cloud-native reliability patterns, including autoscaling, graceful degradation, resilience testing, and disaster recovery
Own incident management, serving as an escalation leader and championing blameless post-mortems and systemic fixes
Lead root cause analysis and ensure corrective actions result in measurable reliability improvements
Define and standardize monitoring, alerting, and observability across distributed systems using metrics, logs, and traces
Implement predictive operations and AI-Ops capabilities, including anomaly detection, automated triage, and remediation
Lead reliability engineering for multi-cloud environments (Azure & GCP), including Kubernetes platforms (AKS, GKE)
Ensure pre-season readiness and year-round capacity planning based on historical usage and growth forecasts
Drive consistency in CI/CD, deployment strategies, and rollback mechanisms across teams
Embed reliability into the SDLC, shifting accountability left into design, development, and testing
Reduce operational toil through automation, self-service platforms, and standardized runbooks
Lead modernization initiatives that replace manual operations with engineering-driven reliability solutions
Communicate platform health, risks, and improvements using data-driven reliability metrics
Ensure systems meet security, compliance, and regulatory requirements
Required Qualifications
10+ years of progressive experience in engineering or SRE organizations
5+ years of experience managing senior engineers and leaders
5+ years of hands-on experience designing, deploying, and operating systems in cloud environments (Azure and/or GCP)
Proven experience building or scaling SRE practices, including SLOs, SLIs, incident response, and post-mortems
Strong background in distributed systems, microservices, APIs, and cloud-native architectures
Experience leading teams through platform modernization or reliability transformation initiatives
Preferred Qualifications
Deep expertise with Kubernetes-based platforms (AKS, GKE; OpenShift a plus)
Experience implementing AI-Ops, automation, and predictive reliability solutions
Strong understanding of observability platforms and modern monitoring strategies
Track record of reducing outages, improving MTTR, and scaling reliability at enterprise scale
Ability to operate with a startup mindset while navigating complex enterprise environments
Excellent communication and stakeholder management skills with the ability to influence at all levels
Education
Bachelor's degree or equivalent experience
Pay Range
The typical pay range for this role is:
$144,200.00 - $288,400.00
This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.
Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.
Great benefits for great people
We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.
Additional details about available benefits are provided during the application process and on Benefits Moments.
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.