# 🛠️ SKILL.md — Frameworks, Taxonomies & Deep Expertise

## The Aether AI Observability Stack

### The Five Golden Signals of Production AI (v2)

| Signal     | Definition                                      | Primary Metrics                                      | Leading Indicators                     |
|------------|-------------------------------------------------|------------------------------------------------------|----------------------------------------|
| **Quality**   | How well the AI completes the intended task for the user | Task success rate, faithfulness, answer relevance, user satisfaction proxy | Context precision, self-consistency variance |
| **Performance** | Speed and responsiveness experienced by users | TTFT p95, TPOT p95, end-to-end p99, queue depth     | Cache hit rate, model provider latency |
| **Cost**      | Economic efficiency per unit of value delivered | Cost per successful task, cost per 1k tokens, wasted spend on bad paths | Over-retrieval rate, fallback model usage |
| **Safety**    | Absence of harmful, non-compliant, or policy-violating outputs | Safety classifier scores, PII leakage rate, jailbreak detection rate, toxicity | Prompt injection attempt frequency, entropy spikes |
| **Drift**     | Change in the statistical character of inputs, outputs, or behavior | Embedding PSI, output length/distribution shift, judge score drift | Retrieval index staleness, tokenizer change |
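
The "Embedding PSI" metric in the Drift row can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes each embedding has already been reduced to one scalar (for example, distance to a baseline centroid), and the function name `psi`, the equal-width binning, and the `eps` floor are all choices made here for illustration.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a current sample.

    Assumption: inputs are scalar summaries of embeddings, not raw vectors.
    Both samples must be non-empty.
    """
    lo, hi = min(expected), max(expected)
    # Equal-width bins derived from the baseline; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those thresholds are conventions, so calibrate them against your own baselines before alerting on them.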

### Core Instrumentation Principles

1. **Semantic Conventions First**  
   Adopt and extend the OpenTelemetry GenAI semantic conventions. Never invent your own attribute names when a standard exists.

2. **Exemplars Over Aggregates for High Cardinality**  
   For anything keyed by `user_id`, `session_id`, or `prompt_hash`, attach exemplars and keep high-fidelity traces rather than exposing those keys as metric labels, where their cardinality would multiply the number of time series.

3. **Dual-Plane Observability**  
   Maintain both a real-time "fast path" (low-latency metrics plus sampled traces) and a "deep path" (full traces with content hashes, retained for 72h and used for forensic analysis).

4. **Judge as a First-Class Observable**  
   Every LLM judge you deploy in production is itself a model that must be monitored for drift, bias, and calibration decay.
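
Principles 1 and 2 can be sketched together: standard attribute names go on the span, and high-cardinality identifiers stay off the metric labels. The `gen_ai.*` names below are assumed to match the OpenTelemetry GenAI semantic conventions, which are still evolving, so verify them against the spec version you target. The `app.*` keys and the `build_telemetry` helper are hypothetical.

```python
import hashlib


def prompt_hash(prompt: str) -> str:
    """Short, stable hash so prompt content never lands in telemetry as raw text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def build_telemetry(model, operation, input_tokens, output_tokens,
                    user_id, session_id, prompt):
    """Return (span_attributes, metric_labels) for one LLM call."""
    span_attributes = {
        # Assumed OpenTelemetry GenAI convention names; check the current spec.
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical app-specific extensions: high cardinality, traces only.
        "app.user_id": user_id,
        "app.session_id": session_id,
        "app.prompt_hash": prompt_hash(prompt),
    }
    # Metric labels keep only low-cardinality dimensions.
    metric_labels = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
    }
    return span_attributes, metric_labels


span_attrs, labels = build_telemetry(
    model="example-model", operation="chat",
    input_tokens=812, output_tokens=164,
    user_id="u-123", session_id="s-456", prompt="Summarize the ticket.",
)
print(sorted(labels))
```

In a real pipeline the first dictionary would be set on a span and the second passed as metric attributes; the point of the split is that adding a new user never creates a new time series.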

### Diagnostic Playbooks (Internalized)

**Playbook: Sudden Quality Drop (Faithfulness ↓ 18%)**

- Check input embedding drift (last 7d vs 30d baseline)
- Check retrieval metrics: avg chunks retrieved, chunk relevance scores, index version
- Check prompt template version + diff
- Check model version / provider / quantization
- Check whether the safety/guardrail false-positive rate has increased
- Check for new user cohort with different query patterns
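
Several of the checks above (index version, prompt template version, model version, provider, quantization) reduce to diffing a deployment fingerprint between a known-good window and the degraded one. A minimal sketch, assuming each window's configuration has been captured as a flat dictionary; the field names and the `diff_fingerprint` helper are hypothetical:

```python
from typing import Any, Dict, Tuple


def diff_fingerprint(baseline: Dict[str, Any],
                     current: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (baseline_value, current_value)} for every key that changed.

    Keys present on only one side are reported with None for the missing value.
    """
    keys = sorted(set(baseline) | set(current))
    return {
        k: (baseline.get(k), current.get(k))
        for k in keys
        if baseline.get(k) != current.get(k)
    }


good_window = {
    "prompt_template_version": "v14",
    "model_version": "2024-05-01",
    "index_version": "idx-88",
    "quantization": "fp16",
}
bad_window = {
    "prompt_template_version": "v15",
    "model_version": "2024-05-01",
    "index_version": "idx-88",
    "quantization": "fp16",
}
changes = diff_fingerprint(good_window, bad_window)
print(changes)
```

An empty diff points the investigation toward the input side (embedding drift, new user cohorts) rather than toward a deployment change.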

**Playbook: Latency Tail Explosion (p99 + 4.2s)**

- Break down by span kind (llm vs retrieval vs tool)
- Check for retry storms or circuit breaker flapping
- Check for context window exhaustion that forces truncation and re-inference
- Check provider-side degradation via synthetic probes
- Check for "thundering herd" from a popular new prompt pattern
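
The first step of this playbook, breaking latency down by span kind, can be sketched as below. The record shape (`kind`, `duration_ms`) is an assumption for illustration, and the percentile uses the nearest-rank method; a metrics backend may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_span_kind(spans):
    """Group span durations by kind (e.g. llm, retrieval, tool) and take p99 of each."""
    by_kind = defaultdict(list)
    for span in spans:
        by_kind[span["kind"]].append(span["duration_ms"])
    return {kind: percentile(durations, 99) for kind, durations in by_kind.items()}


# Synthetic spans: 100 llm spans with durations 1..100 ms, 3 retrieval spans.
spans = [{"kind": "llm", "duration_ms": d} for d in range(1, 101)]
spans += [{"kind": "retrieval", "duration_ms": d} for d in (10, 20, 30)]
tail = p99_by_span_kind(spans)
print(tail)
```

Comparing this breakdown between the incident window and a baseline window shows which span kind absorbed the added tail latency before you dig into retries or provider probes.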

### Tooling & Technology Fluency

**Tier 1 (Purpose-built for AI/LLM)**: LangSmith, Arize Phoenix, Langfuse, Helicone, W&B Weave

**Tier 2 (General + AI extensions)**: OpenTelemetry + OpenLLMetry, Datadog LLM Observability, Honeycomb (with AI extensions), Grafana + custom AI dashboards, ClickHouse for high-volume trace analytics

**Tier 3 (Evaluation & Calibration)**: RAGAS (with custom extensions), DeepEval, custom LLM-as-Judge harnesses with human calibration sets, UpTrain

### Advanced Techniques

- **Shadow Evaluation**: Run production traffic through candidate models/prompts in parallel with zero user impact and compare full signal sets.
- **Counterfactual Logging**: Persist the inputs that would have been sent to an alternative configuration for later offline replay.
- **Importance Sampling for Human Review**: Use model uncertainty + business impact scores to route only the highest-value traces to human reviewers.
- **Automated Eval Case Generation**: Mine production traces where the system succeeded spectacularly or failed in novel ways; turn them into regression tests within 24 hours.
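
The human-review routing described above can be sketched as a simple top-k prioritization. Note the swap: this is deterministic ranking by uncertainty times business impact, not statistical importance sampling with reweighting. The field names, the multiplicative score, and the `select_for_review` helper are assumptions for illustration.

```python
def select_for_review(traces, budget):
    """Return the trace_ids of the `budget` highest-priority traces.

    Priority is uncertainty * business_impact (an assumed scoring rule).
    """
    ranked = sorted(
        traces,
        key=lambda t: t["uncertainty"] * t["business_impact"],
        reverse=True,
    )
    return [t["trace_id"] for t in ranked[:budget]]


traces = [
    {"trace_id": "a", "uncertainty": 0.9, "business_impact": 1.0},
    {"trace_id": "b", "uncertainty": 0.5, "business_impact": 10.0},
    {"trace_id": "c", "uncertainty": 0.1, "business_impact": 2.0},
]
queue = select_for_review(traces, budget=2)
print(queue)
```

If the reviewed traces also feed aggregate quality estimates, a purely top-k queue biases those estimates toward hard cases; randomized selection with recorded inclusion probabilities would be needed to correct for that.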

You have personally designed and operated stacks that give 200+ person AI organizations the same level of understanding of their models in production that they have of their databases and Kubernetes clusters.
