Principal Observability & Reliability Architect
🇺🇸 United States
Consulting
Management
Kubernetes
Machine Learning
Design
DevOps
Analyst
$180,000 - $240,000
Responsibilities
- Lead client discovery, architecture workshops, and solution design across observability, telemetry, reliability, and operational intelligence initiatives.
- Design enterprise observability architectures spanning monitoring, logging, metrics, tracing, telemetry pipelines, alerting, event correlation, service visibility, and platform integrations.
- Define scalable standards for telemetry onboarding, naming, tagging, RBAC, service ownership, dashboards, alert governance, runbooks, and operational handoff.
- Advise on telemetry governance, including data quality, retention, access control, sampling, cardinality, and cost optimization.
- Lead modernization initiatives including tool rationalization, dashboard and alert rationalization, telemetry strategy, and migration from legacy monitoring platforms.
- Guide SRE practices including SLIs, SLOs, error budgets, production readiness, and incident response maturity.
- Design integration patterns across ITSM, CMDB, event management, and automation platforms.
- Support pursuits by shaping solution strategy, validating scope, informing estimates, and building client-facing technical narratives.
- Serve as a senior escalation point and provide architecture governance during delivery.
- Build reusable reference architectures, playbooks, and accelerators while mentoring architects, consultants, and offshore teams.
Qualifications
- 10+ years in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains, including 5+ years leading architecture and delivery strategy for enterprise observability or reliability initiatives.
- Deep, hands-on experience designing and implementing solutions across monitoring, logging, metrics, tracing, telemetry collection, and pipeline patterns in hybrid and multi-cloud environments.
- Strong knowledge of telemetry governance, including routing, transformation, normalization, enrichment, retention, access control, and cost management.
- Experience defining enterprise standards for dashboards, alerts, tagging, naming, service ownership, RBAC, and operating model adoption.
- Strong command of incident response, event correlation, alert strategy, service health, and business-service visibility, plus applied SRE concepts including SLIs, SLOs, error budgets, and production readiness.
- Ability to lead executive and technical workshops and translate business needs into actionable architecture and delivery plans.
- Consulting or professional services experience with strong client-facing communication, estimation, risk management, and cross-functional leadership.
Preferred Qualifications
- Platform experience such as Dynatrace, Splunk, Grafana, LogicMonitor, Datadog, New Relic, AppDynamics, Elastic, Prometheus, or OpenTelemetry.
- Experience with telemetry pipeline tools such as OpenTelemetry Collector, Grafana Alloy, Fluent Bit, Kafka, Cribl, or Vector, plus familiarity with cloud platforms, Kubernetes, CI/CD, and infrastructure as code.
- Experience integrating with platforms such as ServiceNow, Jira Service Management, PagerDuty, Opsgenie, BigPanda, or xMatters.
- Experience developing reusable consulting assets such as reference architectures, governance models, playbooks, POVs, and accelerators; relevant cloud, SRE, ITIL, or FinOps certifications are a plus.



