## 🛠️ Core Competency Framework and Knowledge Base

### The Three Pillars of Observability (AI-Extended)

#### 1. Metrics
- **Infrastructure metrics**: GPU utilization, queue depth, batch size, KV cache hit rate
- **Model metrics**: TTFT (time to first token), TPOT (time per output token), tokens/sec, context window utilization
- **Business metrics**: task completion rate, escalation rate, CSAT proxy, revenue per inference
- **Cost metrics**: $/1K tokens by model, embedding cost per query, vector DB read units
- **Quality metrics**: faithfulness score, relevance score, tool call accuracy, refusal rate
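
The model metrics above can be derived from per-token arrival timestamps on a streaming response. The following is a minimal sketch, assuming timestamps are captured on a single monotonic clock; `StreamTiming` and `summarize_stream` are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    ttft_s: float          # time to first token
    tpot_s: float          # mean time per output token after the first
    tokens_per_sec: float  # output tokens over total wall-clock time


def summarize_stream(request_start: float, token_times: Sequence[float]) -> StreamTiming:
    """Derive latency metrics from the arrival timestamp of each output token.

    `request_start` and `token_times` are seconds on the same monotonic clock.
    """
    if not token_times:
        raise ValueError("no tokens were produced")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    ttft = first - request_start
    # TPOT excludes the first token, whose latency is dominated by prefill.
    tpot = (last - first) / (n - 1) if n > 1 else 0.0
    total = last - request_start
    return StreamTiming(ttft, tpot, n / total if total > 0 else 0.0)
```

Reporting TTFT and TPOT separately is useful because they tend to degrade for different reasons: TTFT tracks queueing and prompt length, while TPOT tracks decode throughput.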

#### 2. Traces
- **Standard span hierarchy**:
  ```
  ai.request (root)
  ├── ai.retrieval (RAG)
  │   ├── vector.search
  │   └── rerank
  ├── ai.prompt.construct
  ├── ai.llm.inference
  │   ├── ai.llm.streaming
  │   └── ai.llm.tool_call
  ├── ai.tool.execute.{tool_name}
  └── ai.post_process
  ```
- **Context propagation**: W3C Trace Context, OpenTelemetry semantic conventions for GenAI
- **Correlation IDs**: session_id, conversation_id, user_id (hashed), experiment_id
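
To make context propagation and hashed correlation IDs concrete, here is a small standard-library sketch. It builds and parses a version-00 W3C `traceparent` header value and pseudonymizes a user ID before it is attached to a span. In practice an OpenTelemetry propagator would handle the header for you; the function names and the `salt` parameter are assumptions made for illustration.

```python
import hashlib
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a version-00 `traceparent` header value with random IDs."""
    trace_id = secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)    # 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace or parent ID is invalid")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def hash_user_id(user_id: str, salt: str) -> str:
    """Pseudonymize a user ID before attaching it to spans (salt is an assumed secret)."""
    return hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()[:16]
```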

#### 3. Logs
- **Structured logging**: JSON schema with `prompt_hash` (not the raw prompt), `model_id`, `latency_breakdown`
- **Event types**: inference_start, tool_error, guardrail_triggered, cache_miss, rate_limited
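
A structured log record along these lines might look as follows. This is a sketch under assumptions: the field names `ts` and `latency_total_ms` and the function `build_inference_log` are illustrative additions, and SHA-256 is one reasonable choice for `prompt_hash`, not a prescribed one.

```python
import hashlib
import json
import time


def build_inference_log(event_type: str, prompt: str, model_id: str,
                        latency_breakdown_ms: dict) -> str:
    """Serialize one log event as a JSON line, hashing the prompt instead of storing it."""
    record = {
        "ts": time.time(),
        "event_type": event_type,
        "model_id": model_id,
        # Only the hash is logged, so raw prompt text never reaches the log store.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "latency_breakdown_ms": latency_breakdown_ms,
        "latency_total_ms": sum(latency_breakdown_ms.values()),
    }
    return json.dumps(record, sort_keys=True)
```

Hashing keeps identical prompts groupable for cache and regression analysis while avoiding storage of user content.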

### Tooling Ecosystem (Objective Overview)

| Category | Tools | Strengths | Weaknesses |
|------|------|------|------|
| LLM-native | LangSmith, Langfuse, Helicone | Integrated tracing + eval | Vendor lock-in risk |
| General APM | Datadog LLM Observability, New Relic | Integrates with the existing enterprise stack | Limited depth of GenAI semantics |
| Open Source | OpenTelemetry + Jaeger/Tempo, Phoenix (Arize) | Standardized, self-hostable | Eval must be integrated separately |
| Eval Platforms | Braintrust, Promptfoo, DeepEval | Eval automation | Not full production monitoring |
| Feature Flags | LaunchDarkly, Statsig | A/B testing for prompts/models | Not primarily tracing tools |

### Evaluation Framework

#### Offline Eval Pipeline
```
Dataset (versioned) → Prompt Template (versioned) → Model (pinned) → Eval Metrics → Regression Gate → Deploy
```
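
The regression gate at the end of this pipeline can be as simple as comparing candidate scores to a pinned baseline. The sketch below assumes every metric is "higher is better" and uses a hypothetical `max_drop` tolerance of 0.02; both are illustrative choices rather than recommended values.

```python
from typing import Mapping


def regression_gate(baseline: Mapping[str, float], candidate: Mapping[str, float],
                    max_drop: float = 0.02) -> list[str]:
    """Return the metrics on which the candidate regressed by more than `max_drop`.

    Assumes every metric is "higher is better"; an empty list means the gate passes.
    A metric missing from the candidate run is treated as a failure.
    """
    failures = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or base_score - cand_score > max_drop:
            failures.append(name)
    return failures
```

A CI job could block the deploy step whenever the returned list is non-empty.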

#### 核心 Eval Metrics
- **RAG**: context precision/recall, answer relevance, faithfulness (RAGAS)
- **Agent**: task success rate, steps to completion, tool selection accuracy
- **Safety**: toxicity, PII leakage, jailbreak resistance
- **Performance**: p50/p95/p99 latency under load
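
Two of these metrics are easy to compute by hand. The sketch below uses a simplified, set-based definition of context precision and recall over document IDs (RAGAS computes its versions with an LLM judge, so the numbers are not directly comparable) and a nearest-rank percentile for latency. The function names are assumptions introduced for illustration.

```python
import math
from typing import Iterable, Sequence


def context_precision_recall(retrieved_ids: Iterable[str],
                             relevant_ids: Iterable[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunks against labeled relevant chunks."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty sample."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    # Multiply before dividing to avoid float error on ranks such as 95 * 100 / 100.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```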

#### Online Eval
- Shadow traffic comparison
- Interleaved human preference (RLHF-style feedback collection)
- LLM-as-judge with calibration against human labels
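
One way to calibrate an LLM judge against human labels is to measure chance-corrected agreement on a shared sample. The following sketch computes Cohen's kappa over categorical labels; using kappa here is an illustrative choice, and any acceptance threshold for a "calibrated" judge would be a team decision not specified in this document.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge and human labels on the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because one label dominates; kappa discounts that, which matters when most sampled responses pass.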

### SLO Design Template

```yaml
# AI Chat Assistant SLO Example
service: ai-assistant
slis:
  - name: inference_availability
    metric: successful_responses / total_requests
    window: 30d
  - name: latency_p99
    metric: http_request_duration_p99
    threshold_ms: 5000
  - name: quality_score
    metric: avg_eval_score (daily sample)
    threshold: 0.85
  - name: cost_per_session
    metric: total_tokens * price / sessions
    threshold_usd: 0.05
slo_targets:
  inference_availability: 99.5%
  latency_p99: 95% of time < 5s
  quality_score: 90% of days >= 0.85
error_budget_policy:
  - burn_rate > 10x: page on-call
  - burn_rate > 2x: freeze non-critical deploys
```
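
The `burn_rate` thresholds in the error budget policy can be evaluated with a few lines of arithmetic: burn rate is the observed error rate divided by the error budget (1 minus the SLO target). This sketch mirrors the 99.5% availability target and the 10x / 2x thresholds from the template; the function names and the choice of measurement window are left as assumptions.

```python
def burn_rate(failed: int, total: int, slo_target: float = 0.995) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total <= 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def policy_action(rate: float) -> str:
    """Map a burn rate to the action named in the error budget policy."""
    if rate > 10:
        return "page on-call"
    if rate > 2:
        return "freeze non-critical deploys"
    return "none"
```

For example, 60 failures in 1,000 requests is a 6% error rate against a 0.5% budget, a burn rate of about 12, which would page on-call.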

### Instrumentation Code Patterns

#### OpenTelemetry (Python / LangChain)
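```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("ai.observability", "1.0.0")

# Placeholder prices (USD per 1K tokens); replace with your provider's actual rates.
PRICE_PER_1K_USD = {"input": 0.0005, "output": 0.0015}


def calculate_cost(usage: dict) -> float:
    """Estimate request cost from a LangChain-style usage_metadata dict."""
    return (
        usage.get("input_tokens", 0) * PRICE_PER_1K_USD["input"]
        + usage.get("output_tokens", 0) * PRICE_PER_1K_USD["output"]
    ) / 1000


def traced_llm_call(llm, prompt: str, model: str, session_id: str):
    """Invoke a LangChain chat model inside an `ai.llm.inference` span."""
    with tracer.start_as_current_span(
        "ai.llm.inference",
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "ai.session.id": session_id,
        },
    ) as span:
        try:
            response = llm.invoke(prompt)
            # LangChain chat models report token usage on AIMessage.usage_metadata (may be None).
            usage = getattr(response, "usage_metadata", None) or {}
            span.set_attribute("gen_ai.usage.input_tokens", usage.get("input_tokens", 0))
            span.set_attribute("gen_ai.usage.output_tokens", usage.get("output_tokens", 0))
            span.set_attribute("ai.cost.usd", calculate_cost(usage))
            return response
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```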

### Incident Response Playbook Structure
1. **Triage**: scope of impact (% of users, which model/feature)
2. **Hypothesis tree**: model degradation vs. retrieval failure vs. infra saturation
3. **Evidence collection**: dashboard links, trace exemplars, recent deploys
4. **Mitigation**: roll back the model, increase cache, reduce context, switch to a fallback model
5. **Postmortem**: timeline, root cause, action items with owners
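
The "fallback model" mitigation is easier to apply during an incident if the call path already supports it. A minimal sketch, assuming each model is exposed as a plain callable that takes a prompt and returns text; `call_with_fallback` and the model names in the usage note are hypothetical.

```python
from typing import Callable, Sequence


def call_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, callable) in order and return (model_name, response) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the next model
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Usage might look like `call_with_fallback(prompt, [("primary", primary_fn), ("fallback", cheaper_fn)])`. Returning the model name lets the caller record which model actually served the request, which helps the postmortem timeline.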

### Drift Detection
- **Data drift**: embedding distribution shift (KL divergence)
- **Model drift**: eval score trend over 7d/30d rolling window
- **Prompt drift**: template hash changes without version bump
- **Cost drift**: tokens/request increase > 20% WoW
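
Three of these checks reduce to short calculations. The sketch below assumes embeddings have already been bucketed into histograms with identical bins (for example by cluster assignment or a 1-D projection, which is an assumption, not something this document specifies), uses additive smoothing to avoid division by zero, and takes the 20% week-over-week threshold from the list above. All function names are illustrative.

```python
import hashlib
import math
from typing import Sequence


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float],
                  eps: float = 1e-9) -> float:
    """KL(P || Q) in nats between two histograms over the same bins, with smoothing."""
    if len(p_counts) != len(q_counts) or not p_counts:
        raise ValueError("histograms must be non-empty and share the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p, q = (pc + eps) / p_total, (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def cost_drifted(last_week_tokens_per_req: float, this_week_tokens_per_req: float,
                 threshold: float = 0.20) -> bool:
    """True if tokens/request grew by more than `threshold` week over week."""
    if last_week_tokens_per_req <= 0:
        return False
    growth = (this_week_tokens_per_req - last_week_tokens_per_req) / last_week_tokens_per_req
    return growth > threshold


def prompt_drifted(template: str, version: str, registry: dict) -> bool:
    """True if `version` is already registered with a different template hash."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return version in registry and registry[version] != digest
```

Note that KL divergence is asymmetric, so the reference window should consistently be passed as the same argument when comparing values over time.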

### Maturity Model (Observability Maturity)
| Level | Characteristics |
|-------|------|
| L0 | Application logs only, no AI-specific signals |
| L1 | Basic latency + error rate per endpoint |
| L2 | Full trace with retrieval + inference spans |
| L3 | Automated eval gates + cost attribution |
| L4 | Predictive alerting + self-healing rollback |