# 🧠 SKILL.md

## Mastery Domains & Reference Frameworks

### Inference Serving & Kernel Optimization
- **Batching & Memory Management**: vLLM PagedAttention internals, continuous batching tuning (max_num_seqs vs max_num_batched_tokens; called in-flight batching in TensorRT-LLM), prefix caching economics, chunked prefill.
- **Quantization Stack**: GPTQ, AWQ, HQQ, SmoothQuant, FP8/INT8/INT4 dynamic & static, calibration dataset selection (domain-specific > generic), KV cache quantization (KIVI, KVQuant).
- **Speculative & Accelerated Decoding**: Speculative decoding, Medusa, EAGLE, self-speculative, draft model selection and acceptance criteria tuning.
- **Hardware-Specific Engines**: TensorRT-LLM, ONNX Runtime with CUDA graphs, TGI, vLLM, Ollama/llama.cpp for edge, OpenVINO.
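
The interaction between the two batching limits named above can be illustrated with a toy admission step. This is a minimal sketch, not vLLM's actual scheduler: the `Request` class and `admit_requests` function are hypothetical, and the model assumes each running sequence costs one decode token per step and each new sequence costs its full prompt (no chunked prefill).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens: int  # tokens that still need prefill


def admit_requests(waiting, running_count, max_num_seqs, max_num_batched_tokens):
    """Admit waiting requests in FIFO order until either budget runs out.

    Toy assumption: every running sequence consumes one token of the step
    budget, and every admitted request consumes its whole prompt.
    """
    token_budget = max_num_batched_tokens - running_count
    seq_budget = max_num_seqs - running_count
    admitted = []
    while waiting and seq_budget > 0:
        head = waiting[0]
        if head.prompt_tokens > token_budget:
            break  # head of the queue does not fit; stop to preserve FIFO order
        admitted.append(waiting.popleft())
        token_budget -= head.prompt_tokens
        seq_budget -= 1
    return admitted


# Hypothetical workload: two sequences already decoding, three waiting.
queue = deque([Request("a", 300), Request("b", 500), Request("c", 100)])
admitted = admit_requests(queue, running_count=2, max_num_seqs=4,
                          max_num_batched_tokens=1024)
print([r.req_id for r in admitted])
```

In this example the sequence limit binds before the token limit: request `c` would fit in the remaining token budget but is held back because the batch already holds four sequences. Swapping which limit binds first is the essence of the tuning trade-off.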

### Prompt, Agent & Workflow Optimization
- **DSPy Mastery**: Signature design, optimizers (formerly called teleprompters: BootstrapFewShotWithRandomSearch, MIPROv2, COPRO), multi-hop agent optimization, automatic few-shot selection from golden sets.
- **Context Efficiency**: LLMLingua-2, selective context compression, attention-based pruning, summary caching, prompt compression before retrieval augmentation.
- **Structured Generation**: Constrained decoding (Outlines, Guidance, LMQL), tool-use efficiency, parallel tool calling with early termination.
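
Parallel tool calling with early termination can be sketched with the standard library alone. The example below is illustrative only: `call_tool` is a hypothetical stub that sleeps to simulate latency, and `first_sufficient` is an assumed helper name, not part of any agent framework.

```python
import asyncio


async def call_tool(name, delay, result):
    """Hypothetical tool stub: sleeps to simulate latency, then returns."""
    await asyncio.sleep(delay)
    return name, result


async def first_sufficient(tool_specs, is_sufficient):
    """Run all tools concurrently and return the first sufficient result.

    Remaining tool calls are cancelled as soon as one result passes the
    is_sufficient check. Returns None if no result is sufficient.
    """
    tasks = [asyncio.create_task(call_tool(*spec)) for spec in tool_specs]
    try:
        for finished in asyncio.as_completed(tasks):
            name, result = await finished
            if is_sufficient(result):
                return name, result
        return None
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
        # Wait for cancellations to settle so no task is left pending.
        await asyncio.gather(*tasks, return_exceptions=True)


# Hypothetical tools: (name, simulated latency in seconds, result).
specs = [
    ("slow_search", 0.2, None),
    ("cache_lookup", 0.01, "hit"),
    ("db_query", 0.05, "row"),
]
winner = asyncio.run(first_sufficient(specs, lambda r: r is not None))
print(winner)
```

The design choice here is to pay for launching every tool up front in exchange for latency bounded by the fastest sufficient answer. Whether that trade is worthwhile depends on per-call cost.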

### Evaluation & Experimentation
- Golden set construction (200-500 high-signal examples that predict real-world performance).
- Multi-objective optimization using Bayesian methods (Optuna, Ax) over model choice, prompt params, sampling, and routing.
- Online experimentation: shadow deployments, interleaving, counterfactual logging, cost-quality Pareto frontier mapping.
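
Cost-quality Pareto frontier mapping reduces to filtering out dominated configurations. The sketch below assumes each candidate is summarized by one cost number and one quality number. The candidate names and values are made up for illustration, not measured results.

```python
def pareto_frontier(configs):
    """Return the configs not dominated by any other config.

    configs: list of (name, cost, quality) tuples, where lower cost and
    higher quality are better. The result is sorted by ascending cost.
    """
    frontier = []
    # Sort by cost ascending, breaking ties by quality descending.
    for name, cost, quality in sorted(configs, key=lambda c: (c[1], -c[2])):
        # Keep a config only if it improves on the best quality seen so far
        # among configs that cost the same or less.
        if not frontier or quality > frontier[-1][2]:
            frontier.append((name, cost, quality))
    return frontier


# Hypothetical candidates: (name, relative cost, golden-set quality score).
candidates = [
    ("small-int4", 0.2, 0.78),
    ("small-fp16", 0.4, 0.80),
    ("medium-int8", 0.5, 0.79),
    ("large-fp16", 2.0, 0.91),
    ("large-fp8", 1.2, 0.90),
]
frontier_names = [name for name, _, _ in pareto_frontier(candidates)]
print(frontier_names)
```

Here `medium-int8` drops out because `small-fp16` is both cheaper and better. Every remaining point represents a genuine trade-off.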

### AI FinOps & Observability
- End-to-end tracing and token-level cost attribution (LangSmith, Helicone, Arize Phoenix, custom OpenTelemetry).
- Workload characterization: heavy-hitters analysis, traffic pattern classification (interactive chat vs batch vs agentic).
- Right-sizing & cascades: model distillation, early-exit, model routing, frugal vs frontier allocation.
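
A frugal-to-frontier cascade can be sketched as a loop that escalates only when a cheaper tier is not confident enough. This is a minimal illustration under stated assumptions: each model is a hypothetical callable returning `(answer, confidence)`, and the costs and threshold are arbitrary placeholder values.

```python
def cascade(prompt, tiers, min_confidence):
    """Try tiers from cheapest to most expensive; stop at the first confident answer.

    tiers: non-empty list of (name, cost, model_fn), where
    model_fn(prompt) returns (answer, confidence).
    Returns (answer, tier_name, total_cost). The last tier is always accepted.
    """
    if not tiers:
        raise ValueError("tiers must not be empty")
    total_cost = 0.0
    for index, (name, cost, model_fn) in enumerate(tiers):
        answer, confidence = model_fn(prompt)
        total_cost += cost  # every attempted tier is paid for
        is_last = index == len(tiers) - 1
        if confidence >= min_confidence or is_last:
            return answer, name, total_cost


# Hypothetical stand-ins for real models: confidence depends on prompt length.
def frugal_model(prompt):
    return "short answer", 0.9 if len(prompt) < 20 else 0.4


def frontier_model(prompt):
    return "detailed answer", 0.95


tiers = [("frugal", 1.0, frugal_model), ("frontier", 20.0, frontier_model)]
easy = cascade("2+2?", tiers, min_confidence=0.8)
hard = cascade("Summarize this contract clause by clause", tiers, min_confidence=0.8)
print(easy, hard)
```

Note that an escalated request pays for both tiers, so a cascade only saves money when the cheap tier resolves a large enough share of traffic.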

### Strategic Prioritization Framework
Every opportunity is scored on the **Optimization Prioritization Score**:
(Expected Efficiency Gain × Quality Retention × Durability) / (Engineering Effort × Risk)
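
The score can be computed and used for ranking as in the sketch below. The formula is taken directly from the line above. The 1-to-5 scales and the example opportunities are assumptions made for illustration, since the scales are not specified here.

```python
def prioritization_score(gain, quality_retention, durability, effort, risk):
    """(gain * quality_retention * durability) / (effort * risk).

    Assumes all inputs are positive ratings on a shared scale (1 to 5 here).
    """
    if effort <= 0 or risk <= 0:
        raise ValueError("effort and risk must be positive")
    return (gain * quality_retention * durability) / (effort * risk)


# Hypothetical opportunities, rated on an assumed 1-to-5 scale.
opportunities = {
    "prefix caching": dict(gain=4, quality_retention=5, durability=4, effort=2, risk=1),
    "INT4 quantization": dict(gain=5, quality_retention=3, durability=3, effort=4, risk=3),
}
scores = {name: prioritization_score(**r) for name, r in opportunities.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Because effort and risk multiply in the denominator, an opportunity that is both hard and risky is penalized much more heavily than one that is only one of the two.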

You maintain an internal taxonomy of workload archetypes (high-QPS classification, long-context RAG, multi-turn agent, creative generation) and know which techniques have proven most effective for each.