## 🧠 Frameworks, Methodologies & Knowledge Base

### Incident Management Frameworks

| Framework | Application |
|-----------|-------------|
| **ICS (Incident Command System)** | Role assignment, span of control, unified command |
| **NIST SP 800-61** | Computer security incident handling lifecycle |
| **SRE Incident Management** | Error budgets, toil reduction, post-mortem culture |
| **Google SRE Post-Mortem Template** | Timeline, root cause, action items, lessons learned |
| **OODA Loop** | Rapid iteration under uncertainty |

### AI/ML-Specific Incident Taxonomy

1. **Model Serving** — Latency spikes, OOM, GPU failures, batching errors, wrong model version routed.
2. **Data Pipeline** — Feature drift, stale embeddings, index corruption, poisoned retrieval docs.
3. **Output Quality** — Hallucination spikes, toxicity/safety filter bypass, language regression.
4. **Agent/Tool Use** — Unauthorized API calls, scope creep, infinite loops, cost explosions.
5. **Security** — Prompt injection, jailbreaks, training data extraction, model theft attempts.
6. **Fairness/Bias** — Disparate impact detected in production monitoring.
7. **Compliance** — Missing human-in-the-loop, inadequate disclosure, audit trail gaps.

### Diagnostic Playbooks (High-Level)

**LLM Quality Degradation**
1. Check model version / routing changes (deploys, flags).
2. Compare input distribution (length, language, domain shift).
3. Inspect safety filter changes and temperature/top-p configs.
4. Sample outputs against golden-set eval; check RAG retrieval relevance.
5. Roll back to last known-good model+config bundle.

**RAG Pipeline Failure**
1. Verify index freshness and chunk integrity.
2. Check embedding model version mismatch.
3. Audit recent document ingestions for poisoned content.
4. Isolate affected namespace/tenant; rebuild index from snapshot.

**Agent Tool Abuse**
1. Enable restrictive tool policy / read-only mode.
2. Trace tool call logs for anomalous patterns.
3. Revoke and rotate compromised credentials.
4. Patch system prompt and tool schema validation.

### Metrics & Evidence to Request

- Error rate, p50/p99 latency, token volume, cost per request
- Safety classifier scores distribution shift
- Retrieval hit rate, MRR, empty-result rate
- User complaint velocity, NPS dip correlation
- Deploy timeline vs incident start (change correlation)
- Model card / version hash / artifact checksums

### Communication & Regulatory Awareness

- **GDPR** — 72-hour breach notification assessment
- **EU AI Act** — Serious incident reporting for high-risk AI systems
- **SEC/Cyber disclosure** — Material incident materiality assessment (coordinate with legal)
- **Consumer protection** — Clear, timely user notification for harmful AI behavior

### Post-Mortem Structure (Blameless)

1. **Summary** — Impact, duration, severity
2. **Timeline** — Detection → response → mitigation → resolution (UTC)
3. **Root Cause** — 5 Whys + contributing factors (not single-point blame)
4. **What Went Well** — Reinforce good practices
5. **What Went Poorly** — Process and system gaps
6. **Action Items** — Owner, due date, priority; categorize: fix, detect faster, prevent recurrence
7. **Lessons Learned** — Runbook updates, new monitors, game-day scenarios

### Game-Day & Preparedness

- Maintain **AI incident runbooks** per service tier
- Quarterly **tabletop exercises**: model rollback, RAG poison, prompt injection, regulatory clock
- Pre-approved **comms templates** and **kill-switch authority matrix**
- On-call rotation including ML engineer, safety reviewer, and comms backup