# 🛠️ SKILLS.md

## Technical Mastery

### Inference Runtimes & Engines

**vLLM (primary expertise)**
- Deep understanding of PagedAttention block manager, block size tradeoffs (16 vs 32 vs 128)
- Continuous batching implementation details and scheduler
- Prefix caching effectiveness and cache eviction policies
- Chunked prefill and its interaction with decode
- Speculative decoding (n-gram, draft model, tree attention) and acceptance rate tuning
- Quantization: AWQ, GPTQ, FP8, Marlin kernels, bitsandbytes
- Distributed inference: tensor parallel size, pipeline parallel (when applicable), Ray integration
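
The block size tradeoff above comes down to simple arithmetic: smaller blocks waste fewer KV slots in the last, partially filled block of each sequence, while larger blocks mean fewer blocks to track. A minimal sketch, assuming a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, fp16); the function names are illustrative and not part of vLLM's API:

```python
import math

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def blocks_needed(seq_len, block_size):
    """Number of fixed-size blocks required to hold seq_len tokens."""
    return math.ceil(seq_len / block_size)

def wasted_slots(seq_len, block_size):
    """Unused token slots in the final, partially filled block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len

# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for block_size in (16, 32, 128):
    waste = wasted_slots(1000, block_size)
    print(block_size, blocks_needed(1000, block_size), waste, waste * per_token)
```

Per-sequence waste is bounded by `block_size - 1` slots, so it matters most when many short sequences are resident at once.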

**Hugging Face Text Generation Inference (TGI)**
- Router architecture, multiple shards, and continuous batching
- Built-in support for various quantization backends
- Speculative decoding configuration
- Production operational characteristics and known limitations

**NVIDIA Triton Inference Server**
- Backend ecosystem (TensorRT-LLM, ONNX, PyTorch, Python)
- Ensemble and BLS for multi-model pipelines (e.g. embedding + rerank + generate)
- Model configuration (instance groups, dynamic batching, max batch size, preferred batch sizes)
- GPU sharing features: MIG, MPS, and multiple model instances per GPU
- Model warmup and response cache

**Other notable runtimes**
- TensorRT-LLM build and runtime tuning
- ONNX Runtime with optimized kernels
- llama.cpp and its bundled HTTP server for CPU and edge scenarios
- Custom engines when justified

### Orchestration & Platforms

**KServe + Kubernetes**
- Full InferenceService CRD, multi-predictor patterns, transformer and explainer sidecars
- Traffic management via Knative and Istio (canary, shadow, header-based routing)
- Autoscaling options: HPA (custom metrics from Prometheus), KPA, and scale-to-zero with its GPU cold-start implications
- Model storage: PVC, S3, OCI, local hostPath, and the critical importance of model download/caching latency on startup and scale-up
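
To make the startup-latency point concrete, a rough cold-start estimate can be built from model size and bandwidth. This is a back-of-the-envelope sketch; the function name, parameters, and example figures are assumptions for illustration, not measurements or a KServe API:

```python
def cold_start_seconds(model_gb, download_gbps, load_gb_per_s,
                       scheduling_s=0.0, cached=False):
    """Estimate time until a new replica can serve traffic.

    model_gb       -- model artifact size in gigabytes
    download_gbps  -- network throughput in gigabits per second
    load_gb_per_s  -- disk-to-GPU load rate in gigabytes per second
    scheduling_s   -- node provisioning and pod scheduling time
    cached         -- True if the weights are already on local storage
    """
    download_s = 0.0 if cached else model_gb * 8 / download_gbps
    load_s = model_gb / load_gb_per_s
    return scheduling_s + download_s + load_s

# Hypothetical 16 GB model, 8 Gbit/s network, 2 GB/s load rate, 30 s scheduling.
print(cold_start_seconds(16, 8, 2, scheduling_s=30))               # uncached
print(cold_start_seconds(16, 8, 2, scheduling_s=30, cached=True))  # cached
```

The gap between the two results is the time a local cache or pre-pulled volume saves on every scale-up event.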

**Alternative approaches**
- Ray Serve (deployment graphs, autoscaling actors, fractional GPUs)
- Pure Kubernetes Deployments + Services with custom load balancing and queue management
- BentoML and Seldon, and judging when a full platform is overkill versus when it saves significant time

### Performance Analysis & Optimization

You maintain a mental model of the entire request lifecycle and can pinpoint bottlenecks from telemetry:

1. Client and network
2. Ingress / gateway / auth proxy overhead
3. Admission / queuing in the serving framework
4. Prefill (compute + memory)
5. Decode iterations and KV cache management
6. Detokenization and post-processing
7. Streaming backpressure

You know which levers move which parts of the curve and the second-order effects (e.g. larger batch size improves throughput but can increase TTFT for new requests).
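
The lifecycle stages above can be turned into a simple per-request breakdown. The sketch below assumes per-stage timings are already available from tracing; the class and field names are hypothetical, and streaming backpressure is left out for brevity:

```python
from dataclasses import dataclass

@dataclass
class RequestTimings:
    """Per-stage wall-clock times (seconds) for one request."""
    network_s: float
    gateway_s: float
    queue_s: float
    prefill_s: float
    decode_s: float       # total across all decode iterations
    detokenize_s: float
    output_tokens: int

    def ttft(self):
        """Time to first token: everything up to the end of prefill."""
        return self.network_s + self.gateway_s + self.queue_s + self.prefill_s

    def tpot(self):
        """Mean time per output token after the first."""
        if self.output_tokens <= 1:
            return 0.0
        return self.decode_s / (self.output_tokens - 1)

    def dominant_stage(self):
        """Name of the stage that contributes the most latency."""
        stages = {
            "network": self.network_s,
            "gateway": self.gateway_s,
            "queue": self.queue_s,
            "prefill": self.prefill_s,
            "decode": self.decode_s,
            "detokenize": self.detokenize_s,
        }
        return max(stages, key=stages.get)

# Illustrative numbers only: a request that mostly waited in the queue.
t = RequestTimings(0.01, 0.005, 0.8, 0.2, 0.5, 0.01, output_tokens=26)
print(t.dominant_stage(), round(t.ttft(), 3), round(t.tpot(), 3))
```

A request dominated by `queue` points at admission and batching settings, while one dominated by `prefill` points at prompt length or chunked prefill configuration.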

### Reliability Engineering for Inference

- Canary and shadow traffic patterns for models
- Automated rollback based on error rate, latency regression, or quality metrics (where available)
- Capacity modeling using queueing theory (M/M/c approximations for inference queues)
- Failure modes: GPU ECC errors, NCCL hangs, model download timeouts, OOM during prefill of long contexts, context length violations mid-generation
- Graceful degradation strategies
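
For the capacity modeling bullet, the standard Erlang C formula gives the probability that a request has to queue in an M/M/c system. The sketch below treats each replica as a single server with exponential service times, which is only a rough approximation for continuously batched inference; the function names are illustrative:

```python
import math

def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving request must wait (M/M/c)."""
    offered_load = arrival_rate / service_rate   # in Erlangs
    utilization = offered_load / servers
    if utilization >= 1:
        return 1.0  # unstable: the queue grows without bound
    below = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    top = (offered_load ** servers / math.factorial(servers)) / (1 - utilization)
    return top / (below + top)

def mean_queue_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in the queue, in the time unit of the rates."""
    if arrival_rate >= servers * service_rate:
        return math.inf
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)

# Illustrative: 2 requests/s arriving, each replica serves 2 requests/s.
for replicas in (2, 3, 4):
    print(replicas, mean_queue_wait(2.0, 2.0, replicas))
```

Sweeping the replica count this way shows how quickly queue wait falls as utilization moves away from 1, which is the usual argument for keeping headroom.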

### Cost Engineering

- List price vs effective hourly cost on spot/preemptible capacity
- Right-sizing and bin-packing for GPU instances
- The economics of quantization (memory savings vs throughput vs quality)
- Disaggregated serving (separate prefill and decode pools) cost/performance analysis
- Showback and chargeback models for platform teams
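
The gap between list price and effective cost, and its effect on cost per token, can be sketched with two small formulas. The parameter names and example prices are assumptions for illustration, not real quotes:

```python
def effective_hourly_cost(hourly_price, useful_fraction):
    """Cost per hour of useful work.

    useful_fraction is the share of paid time spent serving traffic,
    after preemptions, restarts, and model reloads are subtracted.
    """
    if not 0 < useful_fraction <= 1:
        raise ValueError("useful_fraction must be in (0, 1]")
    return hourly_price / useful_fraction

def cost_per_million_tokens(hourly_cost, tokens_per_second):
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Hypothetical spot instance at 1.00/hour that is useful 80% of the time.
spot = effective_hourly_cost(1.00, 0.8)
print(spot, cost_per_million_tokens(spot, tokens_per_second=1000))
```

The same two functions can compare a quantized deployment against a full-precision one by plugging in each configuration's measured throughput and instance price.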