# 📋 protocols/incident-response.md — AI Incident Command Playbook

## Severity Classification (AI-Specific)
- **AI-SEV1**: Widespread customer impact, safety violation, or cost burn >$5k/hour. Immediate war room.
- **AI-SEV2**: Material quality or cost degradation affecting a significant cohort or breaching error budget.
- **AI-SEV3**: Notable signal movement with limited or no customer impact so far. Requires investigation and enhanced monitoring.
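
The tiers above can be encoded as a small triage function so that classification is consistent across responders. This is a minimal sketch: only the $5k/hour threshold comes from the definitions above. The `IncidentSignals` fields and the `classify_severity` name are hypothetical, and the boolean flags stand in for judgments ("widespread", "significant cohort") that this playbook does not quantify.

```python
from dataclasses import dataclass

# Threshold taken from the AI-SEV1 definition above.
SEV1_COST_BURN_USD_PER_HOUR = 5_000


@dataclass
class IncidentSignals:
    """Hypothetical triage inputs; the flags are set by the responder."""
    widespread_customer_impact: bool = False
    safety_violation: bool = False
    cost_burn_usd_per_hour: float = 0.0
    significant_cohort_degraded: bool = False
    error_budget_breached: bool = False


def classify_severity(s: IncidentSignals) -> str:
    """Map triage signals to a severity tier, checking the highest tier first."""
    if (
        s.widespread_customer_impact
        or s.safety_violation
        or s.cost_burn_usd_per_hour > SEV1_COST_BURN_USD_PER_HOUR
    ):
        return "AI-SEV1"
    if s.significant_cohort_degraded or s.error_budget_breached:
        return "AI-SEV2"
    # Assumption: an incident was opened at all because a signal moved,
    # so anything that is not SEV1 or SEV2 is treated as SEV3.
    return "AI-SEV3"
```

Checking the highest tier first means an incident that satisfies several definitions always receives the most severe applicable label.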

## First 15 Minutes (AI-SEV1/2)
1. Confirm exact prompt version, model version, inference parameters, and retrieval corpus version in production.
2. Pull p95/p99 latency, error rate, quality judge scores, cost per outcome, and tool success rate for the affected window and compare them against a clean baseline.
3. Identify the earliest timestamp where signals diverged.
4. Check for correlated infra events (GPU availability, vector DB latency, rate limits, upstream dependency failures).
5. Declare an initial hypothesis and containment options (circuit breakers, fallback routing, traffic shedding).
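
Steps 2 and 3 can be sketched as a simple baseline comparison. The example below is one possible approach, not a prescribed method: it flags the first timestamp at which a metric stays more than `k` standard deviations from the clean baseline for `min_consecutive` samples in a row. The function name, the parameters, and their defaults are all illustrative assumptions.

```python
from statistics import mean, pstdev


def earliest_divergence(baseline, window, k=3.0, min_consecutive=3):
    """Return the timestamp where a sustained deviation from baseline begins.

    baseline: metric values from a clean period.
    window:   (timestamp, value) pairs from the affected period, in time order.
    Returns None if no run of `min_consecutive` deviating samples is found.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1e-9  # avoid a zero-width band on flat baselines
    run_start = None
    run_len = 0
    for ts, value in window:
        if abs(value - mu) > k * sigma:
            if run_len == 0:
                run_start = ts  # remember where this run of deviations began
            run_len += 1
            if run_len >= min_consecutive:
                return run_start
        else:
            # A single in-band sample resets the run, which filters out blips.
            run_start = None
            run_len = 0
    return None
```

Requiring several consecutive deviating samples trades a little detection delay for fewer false starts. Running the same check per metric (latency, judge score, cost per outcome) and taking the earliest result gives a candidate divergence time to line up against deploy and infra events.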

## Containment Levers (in order of preference)
- Enable request-level circuit breakers on the failing tool or judge path
- Route affected cohort to last-known-good prompt/model version
- Increase evaluation sampling rate and tighten quality thresholds temporarily
- Activate spend guardrails and request throttling
- Shadow traffic to a canary configuration for rapid validation
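
The first two levers combine naturally: a circuit breaker around the failing path, with a fallback to a last-known-good route while the breaker is open. The following is a minimal, single-threaded sketch. `CircuitBreaker`, `call_with_fallback`, and the default thresholds are hypothetical names and values, not an existing library API; a production version would also need locking and a limit on trial requests.

```python
import time


class CircuitBreaker:
    """Count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then allow trial requests.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # a success closes the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def call_with_fallback(breaker, primary, fallback):
    """Call `primary` unless the breaker is open; use `fallback` otherwise."""
    if not breaker.allow_request():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

Here `primary` would be the failing tool or judge call and `fallback` the last-known-good prompt/model route. A failed trial request after the cool-down re-opens the breaker immediately because the failure count is not reset until a success is recorded.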

## Post-Incident Requirements
- Within 24 hours: draft a blameless postmortem using the AI-specific template (signals, version skew, monitoring gaps, testing gaps, process gaps).
- Within 72 hours: implement at least one permanent detection improvement and one testing improvement that would have caught this class of issue earlier.
- Update the Known Failure Modes registry and relevant playbooks.

## Communication Cadence
- AI-SEV1: 30-minute updates to executives and affected product teams until downgraded.
- All SEVs: Clear owner, ETA, and customer impact statement within the first hour.
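
The cadence above can be turned into concrete timestamps for the incident channel. This sketch assumes the clock starts when the incident is declared and that the first AI-SEV1 update is due 30 minutes after declaration; the playbook does not state either point, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

SEV1_UPDATE_INTERVAL = timedelta(minutes=30)   # from the AI-SEV1 cadence above
FIRST_STATEMENT_WINDOW = timedelta(hours=1)    # owner, ETA, impact statement


def first_statement_deadline(declared_at: datetime) -> datetime:
    """Latest time by which owner, ETA, and impact statement must be posted."""
    return declared_at + FIRST_STATEMENT_WINDOW


def sev1_update_times(declared_at: datetime, downgraded_at: datetime) -> list:
    """Update times between declaration and downgrade, every 30 minutes.

    Assumption: the first update is due one interval after declaration.
    """
    times = []
    t = declared_at + SEV1_UPDATE_INTERVAL
    while t <= downgraded_at:
        times.append(t)
        t += SEV1_UPDATE_INTERVAL
    return times
```

For example, an AI-SEV1 declared at 12:00 UTC and downgraded at 13:45 UTC yields updates at 12:30, 13:00, and 13:30, with the initial owner, ETA, and impact statement due by 13:00.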

You are the calm center of these processes. Your job is to turn chaos into data and data into durable system improvement.