# The Aether Optimization Flywheel

This is the repeatable, high-rigor process you follow on every engagement. Each phase produces specific artifacts that become organizational assets.

## Phase 0 – Immersion & Context Capture (1-3 days)

**Activities**
- Read all architecture decision records, prompt repositories (with version history), cost dashboards, trace samples, user research, support tickets, and prior incident reports.
- Interview 3-6 stakeholders across product, engineering, and operations using a structured questionnaire.
- Build the initial Task Taxonomy: cluster 200-500 real production requests into 8-15 meaningful behavioral categories.

**Artifacts**
- Task Taxonomy v1 (with example requests and rough volume distribution)
- Current-state value stream map (user intent → final value delivered, with time, cost, and error rate at each step)
- Documented stakeholder priorities and success definitions

## Phase 1 – Instrumentation & Visibility (Parallel with Phase 0)

**Checklist**
- [ ] Every production AI path emits full prompt, completion, model ID, token counts (prompt + completion), latency breakdown, attributed cost, trace/span IDs, and at least one user outcome signal (thumbs, task completion flag, downstream business event).
- [ ] PII redaction and retention policy verified.
- [ ] Cost attribution by feature, team, and task cluster exists and is trusted.
- [ ] Any trace can be replayed with identical inputs for debugging and A/B testing.
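
As a minimal sketch of what the first checklist item implies, the record below lists the fields each production AI call would need to emit, plus a helper that reports which ones are missing. The class name `TraceRecord`, the field names, and the `missing_fields` helper are all illustrative assumptions, not an existing schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceRecord:
    """One production AI call. Every field name here is hypothetical."""
    trace_id: str
    span_id: str
    model_id: str
    prompt: str                    # full prompt, stored after PII redaction
    completion: str                # full completion, stored after PII redaction
    prompt_tokens: int
    completion_tokens: int
    latency_ms: dict[str, float]   # breakdown, e.g. {"retrieval": 120.0, "generation": 840.0}
    cost_usd: float                # attributed cost for this call
    feature: str = ""              # cost attribution dimensions
    team: str = ""
    task_cluster: Optional[str] = None
    outcome_signals: dict[str, object] = field(default_factory=dict)


def missing_fields(record: TraceRecord) -> list[str]:
    """Return the names of checklist fields this record fails to provide."""
    problems = [
        name
        for name in ("trace_id", "span_id", "model_id", "prompt", "completion")
        if not getattr(record, name)
    ]
    if record.prompt_tokens <= 0:
        problems.append("prompt_tokens")
    if record.completion_tokens < 0:
        problems.append("completion_tokens")
    if not record.latency_ms:
        problems.append("latency_ms")
    if record.cost_usd < 0:
        problems.append("cost_usd")
    if not record.outcome_signals:  # at least one user outcome signal is required
        problems.append("outcome_signals")
    return problems
```

Running `missing_fields` over a sample of recent traces gives a quick, per-path view of which checklist boxes can honestly be ticked.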

**Go/No-Go Gate**: Do not proceed to heavy diagnosis until you can see the actual production distribution and cost curve.
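
One way to check this gate is to compute the volume and cost distribution per task cluster directly from trace data. The sketch below assumes each trace has already been reduced to a `(task_cluster, cost_usd)` pair; the function name `cost_curve` and the row keys are hypothetical.

```python
from collections import defaultdict


def cost_curve(records):
    """Summarize (task_cluster, cost_usd) pairs, sorted by total cost descending.

    Each row carries the cluster's share of volume and cost, plus the
    cumulative cost share, which is the "cost curve" the gate asks for.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster, cost in records:
        totals[cluster] += cost
        counts[cluster] += 1

    grand_total = sum(totals.values())
    n_requests = sum(counts.values())
    rows = []
    cumulative = 0.0
    for cluster, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += cost
        rows.append({
            "task_cluster": cluster,
            "volume_share": counts[cluster] / n_requests,
            "cost_share": cost / grand_total if grand_total else 0.0,
            "cumulative_cost_share": cumulative / grand_total if grand_total else 0.0,
        })
    return rows
```

If a large share of traces cannot be assigned to a cluster, or the totals disagree with the billing dashboard, treat that as a signal that the gate has not been met.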

## Phase 2 – Constraint Diagnosis

Apply Theory of Constraints together with Value Stream Mapping. Ask repeatedly: "What is the one thing that, if improved by 30%, would unlock the largest end-to-end gain?"

Common constraint categories:
- Model intelligence ceiling on hard slices
- Retrieval recall/precision
- Prompt/context pollution
- Agent loop inefficiency (too many steps or serial dependencies)
- Tail latency causing abandonment
- Evaluation blind spots (we don't know when we are wrong)
- Unit economics that make scaling the feature impossible

Produce a one-page Constraint Diagnosis with quantified impact (e.g., "Poor retrieval recall on long-tail product questions is responsible for an estimated 38% of failed self-serve resolutions, costing ~$X/month in escalated tickets.").
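
The quantified impact line is simple arithmetic once the inputs are measured. The sketch below is one hedged way to compute it: every parameter is a placeholder to be filled from your own trace and ticketing data, and the function names are hypothetical.

```python
def constraint_monthly_cost(failed_per_month, share_attributed, cost_per_failure):
    """Estimated monthly cost of failures attributed to one constraint.

    failed_per_month: failed outcomes per month (e.g. failed self-serve resolutions)
    share_attributed: fraction of those failures traced to this constraint (0..1)
    cost_per_failure: downstream cost of one failure (e.g. an escalated ticket)
    """
    if not 0.0 <= share_attributed <= 1.0:
        raise ValueError("share_attributed must be between 0 and 1")
    return failed_per_month * share_attributed * cost_per_failure


def gain_if_improved(monthly_cost, improvement=0.30):
    """Monthly savings if the constraint improves by the given fraction."""
    return monthly_cost * improvement
```

Stating the three inputs next to the headline number on the one-pager makes it easy for stakeholders to challenge the estimate where it is weakest.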

## Phase 3 – Portfolio Construction (ICE-P)

Score every identified opportunity:
- **Impact** (1-10): Expected movement on the business north-star metric
- **Confidence** (1-10): Strength of evidence and mechanistic understanding
- **Effort** (1-10): Calendar weeks to the first reliable signal, mapped onto the 1-10 scale (higher means more effort)
- **Reversibility** (1-10): How fast and cheap it is to undo if it fails or regresses (higher means easier to undo)

Priority Score = (Impact × Confidence × Reversibility) / Effort

Categorize into:
- Quick Wins (high score, ≤2 weeks to signal)
- Medium Experiments (3-6 weeks)
- Strategic Bets (major architecture or capability moves, 2+ quarters)
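
The scoring and bucketing above can be sketched in a few lines. This is an illustration under stated assumptions: the `Opportunity` class and its field names are hypothetical, the text does not define a threshold for "high score" so bucketing here uses time to signal only, and anything longer than 6 weeks is treated as a Strategic Bet.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    impact: int           # 1-10
    confidence: int       # 1-10
    effort: int           # 1-10, higher means more effort
    reversibility: int    # 1-10, higher means easier to undo
    weeks_to_signal: float

    def priority_score(self) -> float:
        """Priority Score = (Impact x Confidence x Reversibility) / Effort."""
        for field_name in ("impact", "confidence", "effort", "reversibility"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")
        return self.impact * self.confidence * self.reversibility / self.effort


def category(opp: Opportunity) -> str:
    """Bucket by time to first signal (assumed cutoffs, see lead-in)."""
    if opp.weeks_to_signal <= 2:
        return "Quick Win"
    if opp.weeks_to_signal <= 6:
        return "Medium Experiment"
    return "Strategic Bet"


def rank(opportunities):
    """Return opportunities sorted by priority score, highest first."""
    return sorted(opportunities, key=lambda o: o.priority_score(), reverse=True)
```

Because the three numerator terms multiply, a single low score (for example, Reversibility of 1) pulls an opportunity far down the ranking, which is usually the intended behavior for hard-to-undo changes.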

## Phase 4 – Experiment Design

For every experiment above Level 1 (prompt changes), produce a one-pager containing:
- Clear hypothesis in "If we X then we expect Y because Z" form
- Primary success metric and guardrail metrics
- Statistical design (power analysis, minimum detectable effect, planned duration or event count)
- Traffic allocation and stop criteria
- Rollback triggers and owner
- Required instrumentation delta
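
For the statistical design item, a rough sample size can be estimated with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a binary primary metric (such as task completion), a two-sided test, and equal traffic per arm; the function names and default values are illustrative, not prescribed by this process.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Events needed per arm to detect an absolute lift of `mde_abs`.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0) or mde_abs == 0:
        raise ValueError("rates must be strictly between 0 and 1 and mde_abs non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def planned_duration_days(n_per_arm, eligible_events_per_day, arms=2):
    """Days of traffic needed if all eligible events are split across the arms."""
    return ceil(n_per_arm * arms / eligible_events_per_day)
```

Halving the minimum detectable effect roughly quadruples the required sample, so it is worth agreeing on the smallest lift that would actually change the decision before committing traffic.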

Prefer shadow mode (no user-visible change) for the first signal whenever possible.
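
A minimal sketch of shadow mode, assuming the production and candidate paths are plain callables: the user always receives the production response, while the candidate output is only logged for offline comparison. `handle_request`, `production_fn`, `candidate_fn`, and `shadow_log` are hypothetical names.

```python
def handle_request(request, production_fn, candidate_fn, shadow_log):
    """Serve the production response; record the candidate's output on the side."""
    response = production_fn(request)
    entry = {"request": request, "production": response, "candidate": None, "error": None}
    try:
        entry["candidate"] = candidate_fn(request)
    except Exception as exc:  # a failing candidate must never affect the user
        entry["error"] = repr(exc)
    shadow_log.append(entry)
    return response
```

In a real service the candidate call would typically run asynchronously or off the request path so that it adds no user-visible latency; it is kept inline here only to keep the sketch short.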

## Phase 5 – Ship, Measure, Decide

Implement behind feature flags or in shadow mode. Run the measurement protocol. Document the result with before/after numbers, confidence intervals, and qualitative observations. Decide: kill, iterate, or scale with expanded guardrails.
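
As one hedged illustration of the measure-and-decide step, the sketch below computes a normal-approximation confidence interval for the difference in success rate between control and treatment, then maps it to a decision. The mapping in `decide` is an assumption for illustration; real decisions should also weigh the qualitative observations.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(success_a, n_a, success_b, n_b, level=0.95):
    """Wald interval for (treatment rate - control rate); a = control, b = treatment."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    std_err = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * std_err, diff + z * std_err


def decide(interval, guardrails_ok):
    """Assumed rule: kill on guardrail breach or a clearly negative effect,
    scale on a clearly positive effect, otherwise iterate."""
    low, high = interval
    if not guardrails_ok or high < 0:
        return "kill"
    if low > 0:
        return "scale"
    return "iterate"
```

Recording the interval itself, not just the decision, makes it possible to revisit the call later when more data or a better metric becomes available.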

## Phase 6 – Codify & Restart

Winning patterns become:
- New default prompts or prompt modules in the shared repository
- Updated model router rules or capability-to-model mapping
- Additional automated regression tests in the eval harness
- Updated playbooks and onboarding material for the team
- Refined Task Taxonomy and instrumentation
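
To illustrate how a winning pattern might be locked in as an automated regression test, the sketch below assumes a very small eval harness in which each case pairs a request with a pass/fail check on the output. `RegressionCase`, `run_regression`, and the `generate` callable are hypothetical stand-ins for whatever harness the team already runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegressionCase:
    case_id: str
    task_cluster: str              # ties the case back to the Task Taxonomy
    request: str
    check: Callable[[str], bool]   # returns True when the output is acceptable


def run_regression(cases, generate):
    """Run every case through `generate` and return the IDs of failing cases."""
    failures = []
    for case in cases:
        output = generate(case.request)
        if not case.check(output):
            failures.append(case.case_id)
    return failures
```

Tagging each case with its task cluster lets the harness report regressions per cluster, which keeps the suite aligned with the taxonomy as it is refined.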

Then immediately restart the flywheel on the next highest-priority constraint or task cluster.

This process compounds. Organizations that run it consistently for 6+ months typically see 3-6x cumulative gains in effective AI output per dollar while dramatically increasing reliability.