# 🛠️ SKILLS.md — Technical Mastery & Reference Knowledge

## Inference Architecture Patterns (2025-2026 State of the Art)

**Pattern A — Homogeneous Continuous Batching Fleet** (default for most interactive workloads): Single pool of GPUs (vLLM or TensorRT-LLM) with continuous batching and prefix caching. Best for balanced input/output ratios and contexts under roughly 8k-16k tokens. Simple autoscaling and operational model.

**Pattern B — Disaggregated Prefill + Decode**: Separate prefill tier (compute-optimized, since prefill is compute-bound) and decode tier (memory-bandwidth-optimized, since decode is bandwidth-bound), with KV cache transfer over a high-speed fabric (InfiniBand/NVLink). Essential for strict TTFT SLOs with long contexts or highly variable output lengths. Requires sophisticated routing and a cache transfer protocol.
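A rough way to judge whether the fabric can keep up with Pattern B is to estimate the KV cache size per prompt and divide by link bandwidth. The sketch below assumes a Llama-3.1-70B-class geometry (80 layers, 8 KV heads, head dimension 128) and a hypothetical 50 GB/s effective link; both are illustrative inputs, not measurements, and real transfers add protocol and scheduling overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention geometry needed to size the KV cache (illustrative values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int  # 2 for FP16/BF16, 1 for FP8


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_element


def kv_transfer_seconds(shape: ModelShape, prompt_tokens: int, link_gbytes_per_s: float) -> float:
    """Idealized time to ship a finished prefill's KV cache to a decode worker."""
    total_bytes = kv_bytes_per_token(shape) * prompt_tokens
    return total_bytes / (link_gbytes_per_s * 1e9)


# Assumed geometry for a 70B-class model with grouped-query attention.
llama_70b_fp16 = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_bytes_per_token(llama_70b_fp16)  # 327,680 bytes (320 KiB) per token
t_32k = kv_transfer_seconds(llama_70b_fp16, 32_768, link_gbytes_per_s=50.0)
print(f"{per_token} bytes/token, 32k-token prompt transfer ~ {t_32k:.3f} s")
```

Under the same assumed geometry, a single 128k-token request holds about 40 GiB of FP16 KV cache, which is why long-tail contexts dominate both transfer time and decode-tier memory planning.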

**Pattern C — Speculative + Cascaded Serving**: Either a small draft model (1-8B) for speculative decoding (1.6-2.8x effective speedup, depending on acceptance rate), or a lightweight classification/routing tier in front of the large model. Improves cost and latency most on mixed workloads where many requests do not need the large model for every token.
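To sanity-check a claimed speculative decoding speedup, the commonly used approximation below treats each drafted token as accepted independently with the same probability. That independence is an assumption real workloads only approximate, and the function names and example inputs are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with probability
    acceptance_rate; the target model always contributes one token.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def speculative_speedup(acceptance_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    draft_cost_ratio is the draft model's per-token latency divided by the
    target model's per-token latency.
    """
    # One step = draft_len draft passes plus one target verification pass.
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost


# Illustrative inputs: 80% acceptance, 4 drafted tokens, draft costs 5% of target.
print(f"{speculative_speedup(0.8, 4, 0.05):.2f}x")
# With a poor draft model the same setup is slower than not speculating at all.
print(f"{speculative_speedup(0.2, 4, 0.05):.2f}x")
```

The second call shows why `speculative_acceptance_rate` belongs on a dashboard: the speedup range quoted above only holds while acceptance stays high.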

**Pattern D — Multi-Adapter Hot-Swap (LoRA/DoRA/QLoRA)**: One base model + 50-300 task-specific adapters. Requires careful adapter memory management, swapping strategy, and cache affinity. Ideal for internal platforms and multi-tenant SaaS.
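One simple swapping strategy for Pattern D is least-recently-used eviction over the adapters resident on a GPU. The sketch below is a minimal illustration, not how any particular serving engine implements it; `AdapterCache` and `load_fn` are hypothetical names, and a production version would also account for adapter size and in-flight requests.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """LRU cache of adapters resident on one GPU (hypothetical helper)."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn          # loads adapter weights from host memory or disk
        self._resident = OrderedDict()  # adapter_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(adapter_id)
            self.hits += 1
            return self._resident[adapter_id]
        self.misses += 1
        if len(self._resident) >= self.capacity:
            # Evict the least recently used adapter to make room.
            self._resident.popitem(last=False)
        adapter = self.load_fn(adapter_id)
        self._resident[adapter_id] = adapter
        return adapter


cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights:{adapter_id}")
for request_adapter in ["support", "legal", "support", "sales", "legal"]:
    cache.get(request_adapter)
print(f"hits={cache.hits} misses={cache.misses}")
```

Routing requests for the same adapter to the same replica (cache affinity) raises the hit count for a fixed capacity, which is the point of pairing the swap policy with an affinity-aware router.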

## Key Performance Numbers (Reference — Always Validate Locally)

- H100 SXM (FP8, Llama-3.1-70B class, continuous batching, 2k input / 300 output tokens, tuned batch size): 420-620 output tokens/s per GPU at p95 time between tokens (TBT) < 45 ms.
- H200 (larger HBM): 30-45% higher effective throughput than H100 on memory-bound workloads.
- Typical prefix cache hit rates in production chat/code: 25-65% depending on workload; each additional 10 percentage points of hit rate often yields an 8-15% cost reduction.
- Realistic cold-start TTFT for a 70B model on 8xH100: 1.8-4.2 s depending on weight loading and KV initialization strategy.

## Orchestration & Platform Stack

- Kubernetes 1.29+ with the NVIDIA GPU Operator, MIG or time-slicing, and topology-aware scheduling.
- Schedulers: Volcano/YuniKorn for gang scheduling and fair queuing; Kueue for job queueing and quota; Ray 2.10+ for unified training+serving+data.
- GitOps: Argo CD + ApplicationSets + Kustomize overlays; Crossplane or Terraform for control plane.
- Model deployment CRDs and controllers for self-service golden paths with progressive delivery and automated canarying against evaluation harnesses.

## Critical Observability Metrics

**Infrastructure**: DCGM GPU utilization/memory/power/ECC, fabric congestion, node repair latency, spot interruption rate.

**Inference QoS**: time_to_first_token_seconds histogram, time_per_output_token_seconds, num_requests_running/waiting, gpu_cache_usage_perc, speculative_acceptance_rate, cache_hit_rate, error rate by stage.
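Latency histograms such as `time_to_first_token_seconds` are usually exported as cumulative buckets, so percentiles have to be estimated rather than read directly. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank; the bucket bounds and counts are made-up sample data.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count). Observations are
    assumed to be uniformly distributed inside each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


# Sample TTFT buckets in seconds (illustrative data only).
ttft_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(f"p50 ~ {histogram_quantile(0.50, ttft_buckets):.3f} s")
print(f"p90 ~ {histogram_quantile(0.90, ttft_buckets):.3f} s")
```

The estimate is only as fine as the bucket layout, so bucket bounds should bracket the SLO threshold closely.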

**Economics**: cost per million tokens (input vs output), cost per successful request, GPU-hour waste, utilization at p10/p50/p95.
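The two headline economics metrics reduce to simple arithmetic once throughput, price, and utilization are known. The sketch below is a simplified model that attributes all GPU cost to output tokens; the $3.00 GPU-hour price, 60% utilization, and request counts are placeholder assumptions, and splitting cost between input and output tokens would need a separate attribution rule.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_s_per_gpu: float,
                            utilization: float = 1.0) -> float:
    """USD per million output tokens for one GPU at a given average utilization."""
    if tokens_per_s_per_gpu <= 0:
        raise ValueError("tokens_per_s_per_gpu must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_gpu_hour = tokens_per_s_per_gpu * 3600 * utilization
    return gpu_hour_usd / tokens_per_gpu_hour * 1_000_000


def cost_per_successful_request(total_cost_usd: float, total_requests: int,
                                error_rate: float) -> float:
    """Spread the full bill over only the requests that succeeded."""
    successful = total_requests * (1.0 - error_rate)
    if successful <= 0:
        raise ValueError("no successful requests to attribute cost to")
    return total_cost_usd / successful


# Placeholder inputs: $3.00 per GPU-hour, 500 output tokens/s, 60% average utilization.
usd_per_m = cost_per_million_tokens(3.00, 500, utilization=0.6)
usd_per_req = cost_per_successful_request(1_000.0, 400_000, error_rate=0.02)
print(f"${usd_per_m:.2f} per million output tokens, ${usd_per_req:.5f} per successful request")
```

Idle capacity enters the first function only through `utilization`, which makes GPU-hour waste visible as a direct multiplier on unit cost.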

## Common Anti-Patterns You Aggressively Prevent

- Sizing clusters on theoretical peak instead of p95 + burst + failure headroom.
- Mixing latency-sensitive inference and heavy training on the same nodes without strong isolation or QoS.
- Ignoring long-tail context lengths (one 128k request can destroy a fleet sized for 4k).
- Running stateful inference on spot without checkpointing, draining, and resumption logic.
- Treating model weights as ordinary container images (registry bandwidth, node disk pressure, and cold-start time are first-class concerns).
- Creating shared infrastructure without metering, attribution, and chargeback mechanisms.
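The first anti-pattern has a straightforward counterpart: size from observed p95 demand, multiply by a burst factor, then add spare nodes for failures. The sketch below is a minimal illustration with hypothetical inputs (40,000 output tokens/s at p95, a 1.3x burst factor, 8 GPUs per node, one tolerated node failure) and uses the conservative end of the per-GPU throughput range quoted earlier.

```python
import math


def nodes_required(p95_tokens_per_s: float, burst_factor: float,
                   per_gpu_tokens_per_s: float, gpus_per_node: int,
                   tolerated_node_failures: int = 1) -> int:
    """Nodes needed for p95 demand plus burst, with spare nodes for failures."""
    if per_gpu_tokens_per_s <= 0 or gpus_per_node < 1:
        raise ValueError("throughput and node size must be positive")
    demand = p95_tokens_per_s * burst_factor
    serving_gpus = math.ceil(demand / per_gpu_tokens_per_s)
    # Round up to whole nodes, since capacity is added a node at a time.
    serving_nodes = math.ceil(serving_gpus / gpus_per_node)
    return serving_nodes + tolerated_node_failures


nodes = nodes_required(
    p95_tokens_per_s=40_000,   # hypothetical measured p95 demand
    burst_factor=1.3,          # hypothetical burst allowance
    per_gpu_tokens_per_s=420,  # conservative end of the H100 range above
    gpus_per_node=8,
    tolerated_node_failures=1,
)
print(f"{nodes} nodes ({nodes * 8} GPUs)")
```

Sizing against the low end of measured throughput, rather than a theoretical peak, keeps the headroom honest when batch composition or context lengths shift.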