## 🛠️ Frameworks & Methodologies

### Platform Strategy Frameworks

#### AI Platform Maturity Model (5 Levels)
1. **Ad Hoc** — Individual teams, no shared infra, shadow IT
2. **Foundational** — Shared GPU pool, basic model registry, manual deployments
3. **Managed** — Self-service deployment, CI/CD for ML, centralized observability
4. **Optimized** — Autoscaling inference, FinOps dashboards, automated eval gates, multi-tenant governance
5. **Intelligent** — Predictive capacity, federated learning options, platform-driven AI product innovation

#### Build vs. Buy Decision Matrix
Evaluate across: time-to-value, customization depth, integration complexity, data residency, vendor lock-in risk, TCO (3-year), team skill match, and exit strategy.

### Architecture Patterns

| Pattern | Use When | Watch Out For |
|---------|----------|---------------|
| **Centralized Inference Gateway** | Multi-team LLM access, unified auth/billing | Single point of failure—design for HA |
| **Embedded Model Sidecar** | Low-latency, domain-specific models | Ops complexity per service |
| **RAG Pipeline as a Service** | Enterprise knowledge bases | Stale indexes, permission-aware retrieval |
| **Batch Inference Queue** | Large offline scoring jobs | Queue backlog, priority starvation |
| **Hybrid Cloud GPU Burst** | Training spikes exceed on-prem capacity | Data egress costs, latency |
| **Multi-Model Router** | Cost/latency optimization across model tiers | Routing logic complexity, eval drift |

### Operational Playbooks

#### Inference SLO Template
- **Availability**: 99.9% monthly (exclude scheduled maintenance)
- **Latency**: p50 < 500ms, p99 < 2s (adjust per use case)
- **Throughput**: Define tokens/sec per replica at target concurrency
- **Error budget**: 0.1% for 99.9% SLO—tie to release cadence

#### Incident Response (AI-Specific)
1. Detect: latency spikes, error rate, GPU OOM, model version mismatch
2. Triage: inference vs. training vs. data pipeline vs. external API
3. Mitigate: rollback model version, scale replicas, enable fallback model, circuit-break upstream
4. Postmortem: eval regression? prompt change? infra saturation?

### FinOps for AI
- **Unit economics**: cost per inference, cost per training run, cost per 1M tokens
- **GPU utilization targets**: 60-80% for training clusters, right-size inference replicas
- **Spot/preemptible strategies** for fault-tolerant training
- **Model tiering**: route simple queries to smaller/cheaper models

### Governance Toolkit
- **Model cards** and **system cards** for every production model
- **ADR template**: Context → Decision → Consequences → Alternatives considered
- **Data classification** driving retention, encryption, and access policies
- **Promotion gates**: dev → staging → prod requires eval thresholds, security scan, load test

### Evaluation Stack
- Offline: benchmark suites (domain-specific), regression tests on golden datasets
- Online: shadow deployments, canary releases, A/B with guardrail metrics
- Safety: red-team prompt libraries, PII leakage tests, toxicity/bias classifiers

### Reference Technology Landscape (Non-Exhaustive)
- **Serving**: vLLM, Ray Serve, BentoML, SageMaker, Vertex AI, Azure ML
- **Orchestration**: Kubernetes + Karpenter, Slurm, Flyte, Metaflow
- **Vector/Retrieval**: Pinecone, Weaviate, pgvector, Milvus, Elasticsearch
- **Observability**: Prometheus/Grafana, Datadog, Arize, LangSmith, Helicone
- **Agents**: LangGraph, CrewAI, custom tool-use loops with guardrail middleware