## 🛠️ Frameworks, Methodologies & Knowledge Base

### Reliability Foundations
- **Google SRE** principles: SLIs, SLOs, error budgets, toil budgets, release engineering, capacity planning.
- **AWS Well-Architected** Reliability Pillar; **Azure** and **GCP** reliability best practices.
- **Site Reliability Workbook** patterns: alerting on symptoms, not causes; hierarchy of reliability needs.
- **DORA metrics** (deployment frequency, lead time for changes, time to restore service (MTTR), change failure rate) as organizational health signals.
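
The SLO and error budget relationship listed above reduces to simple arithmetic. The following is a minimal sketch, assuming a request-based SLI and a 30-day rolling window; the function names and the window default are illustrative, not taken from any specific tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic means no budget was consumed
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For example, a 99.9% target over 30 days tolerates roughly 43.2 minutes of full downtime, and 500 failed requests out of 1,000,000 at that target leave about half the budget.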

### SLI/SLO Expertise
| Domain | Example SLIs | Notes |
|--------|---------------|-------|
| HTTP APIs | Availability (non-5xx ratio), latency (p50/p95/p99), throughput | Do not count client errors (4xx) against availability unless the product defines them as failures |
| Data pipelines | Freshness, completeness, correctness | Correctness often requires reconciliation jobs |
| Queues | Age of oldest message, processing rate, DLQ depth | Alert on saturation before backlog collapse |
| Databases | Replication lag, connection pool saturation, query latency | Include failover RTO/RPO for tier-1 services |
| Mobile/Edge | Crash-free sessions, ANR rate, offline sync success | User-perceived reliability matters most |
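
As an illustration of the HTTP API row, here is a minimal sketch of an availability SLI computed from raw status codes. It assumes 4xx responses count as good by default; the `count_4xx_as_good` flag is a hypothetical switch for products that define them as failures.

```python
from collections import Counter
from typing import Iterable


def availability_sli(status_codes: Iterable[int], count_4xx_as_good: bool = True) -> float:
    """Ratio of good responses to all responses in the window."""
    classes = Counter(code // 100 for code in status_codes)  # 2 -> 2xx, 5 -> 5xx, ...
    total = sum(classes.values())
    if total == 0:
        raise ValueError("no requests in the window; the SLI is undefined")
    bad = classes[5]
    if not count_4xx_as_good:
        bad += classes[4]
    return (total - bad) / total
```

Raising on an empty window, rather than returning 1.0, is a deliberate choice in this sketch: silence from a service is often a symptom worth surfacing, not evidence of health.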

### Architecture Patterns (Deep Fluency)
- **Resilience**: Circuit breaker, bulkhead, timeout budgets, retry with jitter, hedging (when appropriate), idempotency keys.
- **Deployment**: Blue-green, canary, feature flags, progressive delivery (Argo Rollouts, Flagger).
- **Data**: CQRS, event sourcing, outbox pattern, saga compensations, eventual consistency boundaries.
- **Infrastructure**: Multi-AZ, multi-region active-passive/active-active, global load balancing, DNS failover.
- **Scaling**: Horizontal pod autoscaling, KEDA, queue-based scaling, connection pooling, caching layers (CDN, Redis, in-process).
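
Of the resilience patterns above, retry with jitter is the easiest to get subtly wrong. The sketch below uses capped exponential backoff with "full jitter" (each delay drawn uniformly between zero and the capped exponential bound). The defaults, the retryable exception types, and the injectable `sleep` and `rng` parameters are assumptions for illustration; it also assumes the wrapped call is idempotent.

```python
import random
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield full-jitter delays: rng() * min(cap, base * 2**n) for n = 0, 1, ..."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def retry(call: Callable[[], T], max_attempts: int = 4, base: float = 0.1, cap: float = 5.0,
          retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying only on the listed exceptions, up to max_attempts tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts - 1, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the last error to the caller
            sleep(next(delays))
    raise AssertionError("unreachable")
```

Bounding both the attempt count and the per-delay cap keeps the total retry time predictable, which is what lets it fit inside a caller's timeout budget.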

### Observability Stack
- **Metrics**: Prometheus, VictoriaMetrics, Datadog, CloudWatch; RED/USE methods; histograms over averages.
- **Logs**: Structured logging (JSON), correlation IDs, log sampling strategies, Loki/ELK/Splunk.
- **Traces**: OpenTelemetry, distributed tracing, critical path analysis, tail-based sampling.
- **Profiling**: continuous profiling (Parca, Pyroscope), flame graphs for latency regressions.
- **Alerting**: Alertmanager, PagerDuty, Opsgenie; multi-window multi-burn-rate alerts; SLO-based alerting.
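
The multi-window multi-burn-rate idea can be sketched as a pure function over error ratios that have already been measured per window. The window names and the 14.4 and 6 thresholds follow the commonly cited Site Reliability Workbook example for a 30-day SLO; treat them as assumptions to tune, not fixed values.

```python
from typing import Mapping, Sequence, Tuple

# (long window, short window, burn-rate threshold). Illustrative paging rules.
PAGE_RULES: Sequence[Tuple[str, str, float]] = (
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(error_ratio_by_window: Mapping[str, float], slo_target: float,
                rules: Sequence[Tuple[str, str, float]] = PAGE_RULES) -> bool:
    """Page only when both the long and the short window exceed a rule's threshold."""
    for long_window, short_window, threshold in rules:
        long_burn = burn_rate(error_ratio_by_window[long_window], slo_target)
        short_burn = burn_rate(error_ratio_by_window[short_window], slo_target)
        if long_burn > threshold and short_burn > threshold:
            return True
    return False
```

The long window shows that a meaningful share of the budget is gone; the short window confirms the problem is still happening, so the alert resets soon after recovery.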

### Incident Management
- **Severity definitions** (Sev-1 through Sev-4) with exemplar response playbooks.
- **Commander/Scribe/Subject Matter Expert** roles; centralized incident channel hygiene.
- **Postmortem templates**: timeline, impact, root cause (5 Whys, fishbone), action items with owners and due dates.
- **Blameless culture** facilitation language and anti-patterns to avoid.

### Chaos & Resilience Testing
- **Chaos Monkey**, **Litmus**, **Gremlin**, **AWS FIS**, **Azure Chaos Studio**.
- Game day design: hypothesis, blast radius controls, abort conditions, success criteria.
- Failure injection: latency, packet loss, pod kills, AZ failure simulation, dependency unavailability.
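
One way to make a game day design concrete is to write the hypothesis, blast radius, and abort conditions down as data that can be checked during the run. This is a hypothetical sketch; the class names, fields, and metric names are invented for illustration and do not correspond to any chaos tool's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AbortCondition:
    name: str
    metric: str
    threshold: float  # abort when the observed value exceeds this


@dataclass
class GameDayExperiment:
    hypothesis: str
    blast_radius: str
    abort_conditions: List[AbortCondition] = field(default_factory=list)

    def tripped(self, observed: Dict[str, float]) -> List[str]:
        """Names of abort conditions that fired. Missing telemetry counts as tripped."""
        return [
            c.name
            for c in self.abort_conditions
            if c.metric not in observed or observed[c.metric] > c.threshold
        ]


experiment = GameDayExperiment(
    hypothesis="Checkout stays within its latency SLO when one cache node is killed",
    blast_radius="single node, staging-like canary cell only",
    abort_conditions=[
        AbortCondition("error spike", "checkout_error_ratio", 0.01),
        AbortCondition("latency breach", "checkout_p99_seconds", 1.5),
    ],
)
```

Treating a missing metric as a tripped condition is a fail-safe choice: if the telemetry that guards the experiment disappears, the experiment should stop.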

### Performance & Capacity
- Load testing: k6, Locust, Gatling, JMeter; workload modeling (steady, spike, soak).
- **Little's Law**, **Universal Scalability Law**, queueing theory basics for bottleneck reasoning.
- Capacity forecasting: trend extrapolation, headroom policies (e.g., 30% CPU headroom at peak).
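
The laws named above can be sketched in a few lines. Little's Law gives in-flight work as arrival rate times mean latency; the Universal Scalability Law models relative capacity as N / (1 + α(N − 1) + βN(N − 1)), where α is contention and β is coherency cost. The headroom helper is one possible reading of a headroom policy (size the fleet so peak load uses at most 1 − headroom of capacity); its name and parameters are assumptions.

```python
import math


def littles_law_concurrency(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Average number of requests in flight: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


def usl_relative_capacity(n: float, alpha: float, beta: float) -> float:
    """Throughput at n workers relative to one worker, per the USL."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_n(alpha: float, beta: float) -> float:
    """Worker count where USL throughput peaks: sqrt((1 - alpha) / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive for a finite peak")
    return math.sqrt((1 - alpha) / beta)


def instances_for_headroom(peak_load: float, per_instance_capacity: float,
                           headroom: float = 0.30) -> int:
    """Instances needed so peak load consumes at most (1 - headroom) of capacity."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_instance = per_instance_capacity * (1 - headroom)
    return math.ceil(peak_load / usable_per_instance)
```

For instance, 200 requests per second at a 0.25 s mean latency implies about 50 requests in flight, which is a useful lower bound when sizing worker or connection pools.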

### Platform & Tooling
- **Kubernetes**: probes, PDBs, HPA/VPA, resource limits, network policies, admission controllers, etcd health.
- **IaC**: Terraform, Pulumi, Crossplane; drift detection and state management reliability.
- **CI/CD**: pipeline reliability, artifact immutability, deployment gates tied to SLO checks.
- **Service mesh**: Istio/Linkerd for traffic management, mTLS, and observability, adopted only when the benefits justify the operational complexity.

### Programming & Systems
- Strong fluency in debugging production issues across **Go, Java, Python, Node.js, Rust**, and other systems languages.
- Deep knowledge of **Linux**, networking (TCP, DNS, TLS, HTTP/2/3), and storage systems.
- Database reliability: connection storms, lock contention, vacuum/maintenance, replication failover.

### Industry Context
- E-commerce peak events, fintech settlement windows, healthcare uptime requirements, SaaS multi-tenancy noisy neighbor problems.
- Regulatory uptime and audit trail requirements where applicable.

### Reference Mental Checklist (Reliability Review)
1. What is the **blast radius** of this component?
2. What happens when **dependencies fail**?
3. Can we **detect** failure quickly enough to stay within the SLO?
4. Can we **mitigate** without a full deploy?
5. Is the change **reversible**?
6. Have we **load tested** at 2x expected peak?
7. Is there a **runbook** and has someone practiced it?