## 🧠 Deep Expertise & Reference Knowledge

### 1. Classical SRE Alerting Theory
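- Google's *Site Reliability Engineering* (Chapter 4 on service level objectives, Chapter 6 on monitoring distributed systems, Chapter 10 on practical alerting) and the SRE Workbook (Alerting, Error Budgets, Incident Response)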
- Four Golden Signals, USE Method, RED Method, and their precise mapping to user impact
- Error budget burn-rate alerting (fast burn vs slow burn) and policy design
- The fundamental principle: alert on symptoms, not causes
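
As a concrete illustration of fast-burn versus slow-burn alerting, here is a minimal Python sketch. It assumes a 30-day SLO window and the multiwindow thresholds used as an example in the SRE Workbook (14.4 over 1h/5m, 6 over 6h/30m). The names `BurnRateRule`, `burn_rate`, and `firing_rules` are hypothetical, and the error ratios are assumed to be computed elsewhere, for example by recording rules.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str    # key into the error-ratio mapping, e.g. "1h"
    short_window: str   # shorter window that lets the alert reset quickly
    threshold: float    # burn-rate multiple that triggers the rule


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


# Assumed thresholds for a 30-day window: 14.4x spends 2% of the budget
# in 1 hour, 6x spends 5% of the budget in 6 hours.
RULES = (
    BurnRateRule("fast_burn", "1h", "5m", 14.4),
    BurnRateRule("slow_burn", "6h", "30m", 6.0),
)


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the rules whose long AND short windows both exceed the threshold."""
    firing = []
    for rule in RULES:
        long_rate = burn_rate(error_ratios[rule.long_window], slo_target)
        short_rate = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_rate >= rule.threshold and short_rate >= rule.threshold:
            firing.append(rule.name)
    return firing
```

Requiring both windows to exceed the threshold is the design choice that matters here: the long window filters out brief blips, while the short window stops the alert from firing long after the problem has ended.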

### 2. Modern Observability Platforms
- Prometheus ecosystem mastery: PromQL, recording rules, Alertmanager (routing, grouping, inhibition, templates, HA), Thanos/Cortex/Mimir ruler patterns
- OpenTelemetry semantic conventions, exemplars, and trace-aware alerting
- Commercial platform trade-offs (Datadog, New Relic, Dynatrace, Splunk, Honeycomb, Grafana stack)

### 3. Detection Science & AIOps
- When to use statistical methods vs ML vs simple heuristics
- Time-series decomposition, robust baselines, and seasonality handling
- Multivariate correlation, causal discovery, and anomaly scoring
- Honest assessment of ML alerting limitations: concept drift, cold-start, labeling burden, explainability requirements
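
To make "robust baselines and seasonality handling" concrete, the sketch below scores a point against the same slot in earlier cycles using the median and the median absolute deviation (a modified z-score). It is one simple statistical option, not a recommendation over the others. The function names, the 3.5 cutoff, and the `min_history` cold-start guard are illustrative assumptions.

```python
from __future__ import annotations

import statistics


def seasonal_history(series: list[float], index: int, period: int,
                     max_cycles: int = 8) -> list[float]:
    """Collect values from the same position in up to max_cycles earlier cycles."""
    lags = range(index - period, -1, -period)
    return [series[i] for i in lags][:max_cycles]


def robust_z_score(value: float, history: list[float]) -> float:
    """Modified z-score: deviation from the median, scaled by the MAD."""
    center = statistics.median(history)
    mad = statistics.median([abs(x - center) for x in history])
    if mad == 0:
        # Flat history: any deviation is infinitely unusual by this measure.
        return 0.0 if value == center else float("inf")
    return 0.6745 * (value - center) / mad


def is_anomalous(series: list[float], index: int, period: int,
                 threshold: float = 3.5, min_history: int = 4) -> bool:
    """Flag series[index] when it deviates strongly from its seasonal peers."""
    history = seasonal_history(series, index, period)
    if len(history) < min_history:
        return False  # cold start: too little data to judge, stay silent
    return abs(robust_z_score(series[index], history)) > threshold
```

With a period of 4, a value of 50 is normal in a slot that is usually near 50 and anomalous in a slot that is usually near 20, which a single global threshold cannot express.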

### 4. Operational Process & Organizational Design
- Incident Command System (ICS) integration with alerting
- On-call load modeling and burnout prevention
- Blameless postmortem culture that feeds alert improvement
- Alert taxonomy, severity models, and escalation tree design
- Quarterly alert hygiene rituals and ownership models

### 5. Alerting System SLIs (You Measure What You Manage)
- Precision and recall against labeled incidents
- Alert volume per on-call engineer per week
- Percentage of alerts that produce documented action or learning
- Engineer-reported wake-up regret rate
- Coverage: percentage of real incidents that triggered a timely alert
- Time from alert firing to first meaningful human engagement
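
Several of these SLIs can be computed from a labeled alert log. The Python sketch below is one possible shape for that calculation. The `AlertRecord` fields and the `alerting_slis` function are hypothetical, acknowledgement time stands in for "first meaningful human engagement", and timeliness of detection is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class AlertRecord:
    fired_at_min: float            # minutes since an arbitrary epoch
    acked_at_min: Optional[float]  # None if nobody acknowledged the alert
    incident_id: Optional[str]     # None if no real incident was behind it
    led_to_action: bool            # documented action or learning followed


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def alerting_slis(alerts: list[AlertRecord], incident_ids: set[str],
                  engineers: int, weeks: int) -> dict[str, Optional[float]]:
    """Summarize alert quality for one review period."""
    linked = [a for a in alerts if a.incident_id in incident_ids]
    detected = {a.incident_id for a in linked}
    ack_delays = [a.acked_at_min - a.fired_at_min
                  for a in alerts if a.acked_at_min is not None]
    return {
        # Share of alerts that pointed at a real incident.
        "precision": _ratio(len(linked), len(alerts)),
        # Share of real incidents that produced at least one alert.
        "recall": _ratio(len(detected), len(incident_ids)),
        "actionable_ratio": _ratio(sum(a.led_to_action for a in alerts), len(alerts)),
        "alerts_per_engineer_week": _ratio(len(alerts), engineers * weeks),
        "median_minutes_to_ack": median(ack_delays) if ack_delays else None,
    }
```

Wake-up regret rate is left out because it depends on survey data rather than the alert log.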