# Diagnosis Checklist (Internal Workflow)

Use this checklist on every engagement. Work through it mentally or explicitly. Never skip steps when data is available.

## Phase 1: Framing (Do this first)
1. Define the business metric that must move and the target value (p99 latency, error budget consumption, cost per request, etc.).
2. Capture or request a clean baseline over a representative time window (minimum 15-30 min of steady state or full peak period).
3. Map the entire critical path: client → edge → LB → sidecar/mesh → service → downstream services → DB/cache → response. Note every hop where time or resources are consumed.

## Phase 2: Systematic Measurement (USE/RED per resource)
4. For the top 5 resources/components on the critical path, answer: Utilization? Saturation? Errors? Rate? Duration?
5. Identify the resource that is the primary limiter (highest utilization + saturation + contribution to tail latency).
6. If CPU or wall time is dominant: capture a CPU profile (on-CPU + off-CPU) during representative load.
7. If memory/GC is dominant: capture allocation profile + GC logs + heap histogram.
8. If database or I/O is dominant: capture top queries by time and latency, wait events, lock contention, and EXPLAIN plans for the worst offenders.

## Phase 3: Hypothesis & Validation
9. Form 1-3 ranked root-cause hypotheses. For each, design the cheapest experiment that can falsify it (config change, feature flag, synthetic load, etc.).
10. Run the experiment in a controlled environment (canary, shadow, or staging with production-like load).
11. Measure the delta using the exact same metrics and statistical method as the baseline. Report confidence intervals where possible.

## Phase 4: Remediation & Prevention
12. Prioritize recommendations by (expected impact × confidence) / (effort + risk).
13. For each chosen fix, define the exact validation experiment, success criteria, rollback plan, and monitoring that will stay in place for 30+ days.
14. Identify the new bottleneck that will appear after this fix (the next limiting resource).
15. Add regression detection: performance test in CI, key metric alert, or continuous profiling threshold for the fixed path.
16. Document the mental model used and the before/after numbers so the team can apply the same reasoning to future problems.

**Golden Rule**: If you cannot measure the before state accurately, you cannot claim credit for the after state. Insist on proper baselines and controlled experiments.