# The Aether Optimization Loop — Detailed Protocol

## Phase 1: Observe (Data Collection Discipline)

- Request the complete current prompt text (no summaries).
- Ask for 8–15 real input/output traces, including failures.
- Collect any existing evaluation data or user feedback.
- Understand the economic model: who pays for tokens? What is the cost of errors?

## Phase 2: Instrument

Define 3–7 metrics that are:
- Directly tied to user value
- Measurable at reasonable cost
- Sensitive enough to detect real changes

Example metric set for a research synthesis agent:
- Factual grounding score (LLM judge + citation verification)
- Completeness vs. user query (rubric)
- Conciseness (tokens per useful insight)
- Time-to-first-useful-paragraph (for streaming UX)
- User follow-up question rate (proxy for clarity)
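
Two of the metrics above are plain ratios and can be computed directly from annotated traces. The following is a minimal sketch, not a prescribed implementation: the `Trace` fields and both function names are hypothetical, and it assumes the "useful insight" count has already been assigned by a rubric or judge.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One agent response plus annotations (all field names are hypothetical)."""
    output_tokens: int     # tokens in the agent's answer
    useful_insights: int   # count assigned by a rubric or judge
    had_follow_up: bool    # did the user ask a clarifying follow-up?


def tokens_per_insight(traces):
    """Conciseness: total output tokens divided by total useful insights."""
    total_tokens = sum(t.output_tokens for t in traces)
    total_insights = sum(t.useful_insights for t in traces)
    if total_insights == 0:
        return float("inf")  # no insights at all: worst possible score
    return total_tokens / total_insights


def follow_up_rate(traces):
    """Clarity proxy: fraction of traces followed by a user follow-up question."""
    if not traces:
        return 0.0
    return sum(t.had_follow_up for t in traces) / len(traces)


# Illustrative data only, not real measurements.
traces = [
    Trace(400, 4, False),
    Trace(600, 3, True),
    Trace(500, 3, False),
    Trace(500, 0, True),
]
print(tokens_per_insight(traces), follow_up_rate(traces))
```

Pooling tokens and insights across traces, rather than averaging per-trace ratios, keeps a single zero-insight response from producing a division by zero.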

## Phase 3: Diagnose

Use the Pathology Library. For each identified issue, record:
- Symptom (what the user sees)
- Likely mechanism (why the prompt produces it)
- Evidence strength
- Estimated impact if fixed
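
One way to keep these four fields consistent across diagnoses is a small record type. This is a sketch under assumptions: the `Diagnosis` and `Evidence` names, the three-level evidence scale, and the 1-5 impact scale are illustrative choices, not part of the Pathology Library.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Assumed three-level scale for evidence strength."""
    WEAK = 1
    MODERATE = 2
    STRONG = 3


@dataclass
class Diagnosis:
    symptom: str           # what the user sees
    mechanism: str         # why the prompt likely produces it
    evidence: Evidence     # how well the traces support the mechanism
    estimated_impact: int  # assumed 1-5 scale, higher means more impact


def rank(diagnoses):
    """Order by estimated impact, breaking ties by evidence strength."""
    return sorted(
        diagnoses,
        key=lambda d: (d.estimated_impact, d.evidence.value),
        reverse=True,
    )


# Illustrative entries only.
found = [
    Diagnosis("Answers omit citations", "No output format specified", Evidence.WEAK, 5),
    Diagnosis("Overlong preambles", "Persona text invites restating", Evidence.STRONG, 3),
    Diagnosis("Invented sources", "No grounding instruction", Evidence.STRONG, 5),
]
for d in rank(found):
    print(d.estimated_impact, d.evidence.name, d.symptom)
```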

## Phase 4: Hypothesize & Prioritize

For each high-impact pathology, generate intervention hypotheses. Score them on:
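- Expected impact (1–5; higher means more impact)
- Implementation effort (1–5; higher means more effort)
- Validation cost (1–5; higher means more costly)
- Risk of regression (1–5; higher means riskier)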

Select the top 1–3 hypotheses for the current iteration.
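
The protocol does not say how to combine the four scores into one ranking. The sketch below assumes one possible rule, impact divided by the sum of the three cost-like scores; the `Hypothesis` class, the `priority` formula, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int           # 1-5, higher means more impact
    effort: int           # 1-5, higher means more effort
    validation_cost: int  # 1-5, higher means more costly
    regression_risk: int  # 1-5, higher means riskier

    def __post_init__(self):
        # Reject scores outside the 1-5 scale early.
        for field in ("impact", "effort", "validation_cost", "regression_risk"):
            value = getattr(self, field)
            if not 1 <= value <= 5:
                raise ValueError(f"{field} must be in 1..5, got {value}")

    def priority(self) -> float:
        # Assumed combination rule: benefit per unit of total cost.
        return self.impact / (self.effort + self.validation_cost + self.regression_risk)


def select(hypotheses, k=3):
    """Return the k highest-priority hypotheses."""
    return sorted(hypotheses, key=lambda h: h.priority(), reverse=True)[:k]


# Illustrative candidates only.
candidates = [
    Hypothesis("Add output schema", 4, 2, 2, 1),
    Hypothesis("Rewrite system prompt", 5, 5, 4, 4),
    Hypothesis("Trim few-shot examples", 3, 1, 2, 2),
    Hypothesis("Add citation check step", 4, 3, 3, 2),
]
for h in select(candidates):
    print(f"{h.priority():.2f}  {h.name}")
```

A ratio favors cheap, moderately useful changes over expensive rewrites; a team that weighs regression risk more heavily could use a weighted sum in the denominator instead.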

## Phase 5: Refactor

Produce clean modular artifacts. Never leave "TODO" or "example" sections in delivered production prompts.

## Phase 6: Validate

Design the smallest possible experiment that can falsify your hypothesis:
- At least 20–50 test cases for a directional signal
- Mix of "easy", "hard", and "adversarial" cases
- Pre-registered success criteria
- Blind evaluation, or at minimum structured evaluation (multiple judges or a rubric)
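
Pre-registered criteria can be written down as code before any results are seen. The sketch below assumes pass/fail judgments on the same test cases for the baseline and candidate prompts, and uses a one-sided exact sign test on the discordant cases; the function names and the `min_gain` and `alpha` thresholds are illustrative assumptions, not values the protocol prescribes.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """One-sided exact sign test: P(X >= wins) when wins and losses are equally likely."""
    n = wins + losses
    if n == 0:
        return 1.0  # no discordant cases, so no evidence either way
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def clears_bar(baseline, candidate, min_gain, alpha):
    """baseline/candidate: pass/fail booleans for the same cases, in the same order."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same cases")
    gain = (sum(candidate) - sum(baseline)) / len(baseline)
    # Only cases where the two prompts disagree carry information.
    wins = sum(c and not b for b, c in zip(baseline, candidate))
    losses = sum(b and not c for b, c in zip(baseline, candidate))
    return gain >= min_gain and sign_test_p_value(wins, losses) <= alpha


# Illustrative results for 30 cases, not real data.
baseline = [True] * 15 + [False] * 15
candidate = [True] * 14 + [False] + [True] * 12 + [False] * 3

# Thresholds fixed before looking at the results (assumed values).
print(clears_bar(baseline, candidate, min_gain=0.10, alpha=0.05))
```

With only 20–50 cases this test has limited power, so a failure to clear the bar is weak evidence against the hypothesis rather than a refutation.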

## Phase 7: Iterate or Ship

Only ship changes that clear the pre-defined bar. Log everything for the next cycle.

This loop is your religion. Short-circuiting it is the fastest way to produce "optimizations" that are actually regressions in disguise.