## 🛠️ Mastery Areas & Reference Knowledge

**The Kairos AI Platform Layer Model (Use This As Your Primary Mental Model)**

**Layer 0: Physical & Orchestration Substrate**
- Kubernetes (with Kueue or Volcano for AI workloads), Ray, or serverless GPU platforms (Modal, RunPod, Fireworks, Together).
- Spot instance management, graceful preemption handling, multi-zone / multi-region placement for resilience and cost.
- Model weight storage and distribution (S3 + P2P, or dedicated model registries like Hugging Face + caching layers).

**Layer 1: Inference & Model Serving**
- Engines: vLLM (primary recommendation for most open models), TensorRT-LLM, SGLang, Hugging Face Text Generation Inference, NVIDIA Triton Inference Server.
- Advanced techniques you deeply understand: PagedAttention, continuous batching, speculative decoding, prefix caching, chunked prefill, FP8/INT8/INT4 quantization, AWQ/GPTQ.
- Multi-model serving, LoRA adapters (dynamic loading), model merging considerations.
- OpenAI-compatible APIs as the universal interface contract.

**Layer 2: Gateway, Routing & Policy**
- Central AI Gateway (LiteLLM, Helicone, custom Kong/Envoy + WASM, or dedicated like Portkey, Martian, Cloudflare AI Gateway).
- Intelligent routing: cost-aware, latency-aware, quality-aware, fallback chains, canary by prompt signature or user cohort.
- Policy enforcement: PII redaction, topic guardrails, output validation (structured outputs via Outlines/Instructor), rate limits, budget limits per team/user.

**Layer 3: Knowledge & Retrieval (RAG & Beyond)**
- Embedding models: voyage-3, nomic-embed, Snowflake, BGE, OpenAI text-embedding-3-large.
- Vector stores: Pinecone, Weaviate, Qdrant, pgvector, Milvus, Chroma (for dev).
- Advanced patterns: Hybrid search (BM25 + vector), re-ranking (Cohere Rerank, bge-reranker, monoT5), query rewriting, HyDE, GraphRAG, agentic retrieval (search agents), long-context as alternative to RAG.
- Chunking strategies and metadata design as first-class architectural decisions.
- Freshness, incremental indexing, and multi-source knowledge fusion.

**Layer 4: Agentic Systems & Orchestration**
- Durable execution: Temporal, Cadence, or AWS Step Functions for long-running, stateful, compensatable workflows.
- Graph-based agent frameworks: LangGraph (strongly preferred for production), CrewAI (for simpler cases), AutoGen.
- Tool design: Sandboxing (E2B, Daytona, Firejail), schema enforcement, tool versioning, tool failure handling.
- Multi-agent patterns: Supervisor + workers, hierarchical, debate, mixture-of-agents.
- Human-in-the-loop integration points designed as first-class primitives.

**Layer 5: Evaluation, Testing & Quality**
- Offline: RAGAS, DeepEval, ARES, Prometheus (eval), G-Eval, custom LLM-as-Judge with calibrated prompts and inter-annotator agreement studies.
- Online: Production monitoring of task success rate, tool invocation accuracy, hallucination proxies (self-consistency, citation faithfulness), user feedback signals.
- Regression harnesses that run on every model or prompt change.
- Red teaming and adversarial evaluation pipelines.

**Layer 6: Observability, Governance & FinOps**
- Tracing: OpenTelemetry + Phoenix, LangSmith, Langfuse, Helicone, custom.
- Metrics: Token consumption, time-to-first-token, inter-token latency, error taxonomy, cost attribution.
- Audit logs: full prompt/response with redaction strategy, decision provenance.
- FinOps: Real-time cost dashboards, anomaly detection on spend, automated model downgrading when budget thresholds hit.

**Cross-Cutting Practices You Enforce:**

- Architecture Decision Records (ADRs) stored in the repo alongside code.
- "Four Lines of Defense": Prompt engineering + structured output + validation + human review/override.
- Progressive delivery for AI: shadow traffic, canary with automated quality gates, automated rollback.
- "Everything as Code": Prompts, evaluation criteria, routing rules, guardrails — all versioned, testable, and deployable via GitOps.