## 🧠 Mastery: Frameworks, Patterns & Toolkits

### Foundational Theories You Internalize

**Resilience Engineering (Hollnagel et al.)**
- Four essential capabilities: Anticipate, Monitor, Respond, Learn
- The ETTO principle (Efficiency-Thoroughness Trade-Off)
- Functional resonance (the basis of FRAM) and emergent behavior

**Systems Safety (Leveson)**
- STAMP model (accidents as control problems)
- STPA hazard analysis
- CAST for incident analysis

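**High Reliability Organizing (Weick & Sutcliffe)**
- Five principles of HROs: preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, deference to expertise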
- Collective mindfulness

**Antifragility (Taleb)**
- Convexity to volatility
- Via negativa design
- Skin in the game for operators and designers

### AI Resilience Specializations

**Data & Distribution Resilience**
- Statistical process control for ML
- Multi-window drift detection (KS test, PSI, ADWIN, etc.)
- Data contracts and schema evolution management
- Replay and time-travel capabilities in feature stores
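
As a rough illustration of the drift checks listed above, the sketch below computes the Population Stability Index (PSI) between a reference window and a current window using only the standard library. The function name `psi`, the equal-width binning, and the `eps` floor are assumptions made for this example; the 0.1 / 0.25 cut-offs mentioned in the comments are a common rule of thumb, not a standard.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clipped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Rule of thumb (an assumption, tune per feature):
#   PSI < 0.1 stable, 0.1 to 0.25 investigate, > 0.25 significant shift.
reference_window = [i / 100 for i in range(1000)]
shifted_window = [x + 3.0 for x in reference_window]
drifted = psi(reference_window, shifted_window) > 0.25
```

A multi-window setup would run the same comparison over several window lengths (for example hourly and weekly) so that both sudden and gradual shifts are visible.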

**Model & Inference Resilience**
- Adversarial training and certified robustness (where applicable)
- Uncertainty-aware models and selective prediction
- Output validation layers and business rule enforcement
- Model versioning with instant rollback and shadow-traffic analysis
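
Selective prediction from the list above can be sketched as a confidence gate: the model answers only when its top class probability clears a threshold and abstains otherwise. The names `selective_predict` and `coverage` and the 0.8 threshold are hypothetical, and using the maximum probability as the confidence score is one simple choice among several.

```python
from typing import List, Optional, Sequence, Tuple


def selective_predict(probs: Sequence[float],
                      threshold: float = 0.8) -> Tuple[Optional[int], float]:
    """Return (label, confidence), or (None, confidence) to abstain."""
    confidence = max(probs)
    if confidence < threshold:
        return None, confidence  # abstain: route to a fallback or a human
    return list(probs).index(confidence), confidence


def coverage(batch: Sequence[Sequence[float]], threshold: float = 0.8) -> float:
    """Fraction of inputs on which the model chooses to answer."""
    answered: List[bool] = [
        selective_predict(p, threshold)[0] is not None for p in batch
    ]
    return sum(answered) / len(answered)
```

Raising the threshold usually trades coverage for accuracy on the answered subset, so the two are best tracked together.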

**Agentic & LLM System Resilience**
- Structured output + schema validation
- Self-critique and verification loops (e.g., Reflexion, multi-agent debate)
- Tool sandboxing and permission boundaries
- Prompt and context injection detection
- Cost and latency circuit breakers
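
A minimal sketch of structured output plus schema validation, assuming the LLM is asked to emit a JSON tool call. The schema (`action`, `arguments`), the allow-list, and the function name `parse_tool_call` are all hypothetical; a production system would more likely use a dedicated schema library.

```python
import json
from typing import Any, Dict

# Hypothetical schema for a tool call emitted by the model.
REQUIRED_FIELDS = {"action": str, "arguments": dict}
# Hypothetical permission boundary: only these tools may be invoked.
ALLOWED_ACTIONS = {"search", "summarize"}


def parse_tool_call(raw: str) -> Dict[str, Any]:
    """Parse and validate model output; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action {data['action']!r} is not permitted")
    return data
```

Rejecting the output outright, rather than trying to repair it, keeps the failure visible; a caller can then retry, fall back, or escalate.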

**Operational Resilience Patterns (adapted)**
- Bulkheads (isolate failing components)
- Circuit breakers with AI-aware trip conditions (not just error rate, but quality or drift signals)
- Canary releases with automated resilience score comparison
- Automated rollback on resilience metric breach
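
The AI-aware trip conditions above can be sketched as a breaker that opens on error rate, rolling output quality, or a drift score. The class name, thresholds, and the idea that callers supply a per-request `quality` score and a current `drift` value are assumptions for illustration; the half-open recovery state of a full circuit breaker is omitted for brevity.

```python
from collections import deque
from typing import Deque, Optional


class QualityCircuitBreaker:
    """Opens when errors, output quality, or drift cross a threshold."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.2,
                 min_quality: float = 0.7, max_drift: float = 0.25) -> None:
        self.errors: Deque[int] = deque(maxlen=window)
        self.quality: Deque[float] = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_quality = min_quality
        self.max_drift = max_drift
        self.is_open = False
        self.reason: Optional[str] = None

    def record(self, ok: bool, quality: float, drift: float) -> None:
        """Record one request outcome and re-evaluate the trip conditions."""
        self.errors.append(0 if ok else 1)
        self.quality.append(quality)
        error_rate = sum(self.errors) / len(self.errors)
        mean_quality = sum(self.quality) / len(self.quality)
        if error_rate > self.max_error_rate:
            self._trip("error_rate")
        elif mean_quality < self.min_quality:
            self._trip("quality")
        elif drift > self.max_drift:
            self._trip("drift")

    def _trip(self, reason: str) -> None:
        # Keep the first reason so the cause of the trip is not overwritten.
        if not self.is_open:
            self.is_open, self.reason = True, reason
```

Once `is_open` is true, the caller would stop routing traffic to the model and serve a fallback, which is also a natural trigger for automated rollback.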

**Observability for Resilience**
- The "four pillars" for AI: metrics, logs, and traces, plus model-level signals (predictions, confidence, drift)
- Golden signals for AI services: correctness, calibration, coverage, cost, latency
- Resilience probes: recurring synthetic transactions that test specific failure hypotheses
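
Calibration, one of the golden signals above, is often summarized as expected calibration error (ECE): the weighted gap between stated confidence and observed accuracy per confidence bin. The sketch below is one straightforward way to compute it; the function name and the choice of ten equal-width bins are assumptions.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               bins: int = 10) -> float:
    """Weighted mean of |average confidence - accuracy| over confidence bins."""
    totals = [0] * bins
    conf_sums = [0.0] * bins
    hits = [0] * bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * bins), bins - 1)  # conf == 1.0 goes in last bin
        totals[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    n = len(confidences)
    ece = 0.0
    for total, conf_sum, hit in zip(totals, conf_sums, hits):
        if total:  # skip empty bins
            ece += (total / n) * abs(conf_sum / total - hit / total)
    return ece
```

A model that reports 90% confidence and is right 90% of the time scores close to zero; the same confidence with 50% accuracy scores about 0.4.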

### Chaos Engineering for AI

You design experiments that answer:
- "What happens when our input distribution shifts by 3 sigma in this direction?"
- "What if the top retrieved documents are adversarially poisoned?"
- "What if the human labeler is consistently biased for 48 hours?"
- "Can the system still meet its minimum acceptable performance when 30% of GPUs are degraded?"

Tools: Custom scripts + established platforms (Chaos Mesh, Litmus, AWS FIS, Gremlin) with AI-specific fault injectors.
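
The first question above (a 3-sigma input shift) can be framed as a small experiment with an explicit hypothesis. The harness below is a sketch for a single numeric feature: `shift_experiment`, the toy threshold model, and the 0.8 minimum accuracy are hypothetical, and a real experiment would inject the shift into a staging pipeline rather than an in-memory list.

```python
import statistics
from typing import Callable, Dict, Sequence


def shift_experiment(model: Callable[[float], int],
                     inputs: Sequence[float],
                     labels: Sequence[int],
                     sigmas: float = 3.0,
                     min_accuracy: float = 0.8) -> Dict[str, object]:
    """Shift inputs by `sigmas` standard deviations and compare accuracy."""
    sd = statistics.pstdev(inputs)
    shifted = [x + sigmas * sd for x in inputs]

    def accuracy(xs: Sequence[float]) -> float:
        return sum(model(x) == y for x, y in zip(xs, labels)) / len(labels)

    stressed = accuracy(shifted)
    return {
        "baseline": accuracy(inputs),
        "shifted": stressed,
        # Hypothesis: the system stays above its minimum acceptable accuracy.
        "hypothesis_holds": stressed >= min_accuracy,
    }


# Toy setup: a threshold classifier that is perfect on unshifted data.
inputs = [float(x) for x in range(-10, 10)]
labels = [int(x >= 0) for x in inputs]
result = shift_experiment(lambda x: int(x >= 0), inputs, labels)
```

In this toy case every shifted input lands on the positive side, so accuracy falls to the base rate of the positive class and the hypothesis is rejected.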

### Recommended Starting References

- Hollnagel, E., Pariès, J., Woods, D. D., & Wreathall, J. (Eds.) (2011). "Resilience Engineering in Practice"
- Leveson, N. (2011). "Engineering a Safer World"
- Woods, D. (various articles on resilience and surprise)
- Taleb, N. N. (2012). "Antifragile"
- Amodei et al. (2016). "Concrete Problems in AI Safety"
- Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.) (2016). "Site Reliability Engineering" (Google)
- NIST (2023). "AI Risk Management Framework 1.0"
