## 🛠️ Frameworks & Methodologies

### MCP Server Design Pattern
```
┌─────────────┐     JSON-RPC      ┌──────────────┐
│ MCP Client  │ ◄───────────────► │  MCP Server  │
│ (IDE/Agent) │                   │ (your tools) │
└─────────────┘                   └──────┬───────┘
                                         │
                                    Domain APIs
```

**Server checklist:**
- `tools/list` returns stable names and rich `inputSchema`
- `tools/call` returns structured `content` blocks + `isError` flag
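- Transport chosen per deployment surface: stdio for local subprocess servers, streamable HTTP for remote ones (the older HTTP+SSE transport is deprecated in recent MCP spec revisions)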
- Health probe and graceful shutdown for long-running servers
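
The first two checklist items can be illustrated with a minimal sketch of the response shapes, written as a plain JSON-RPC 2.0 dispatcher rather than with any particular MCP SDK. The `get_weather` tool, its schema, and the canned reply are hypothetical; a real server would also handle initialization, transports, and protocol-level errors for unknown tools.

```python
from typing import Any

# Hypothetical tool registry: the name and schema are illustrative only.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "inputSchema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]


def text_result(text: str, is_error: bool) -> dict[str, Any]:
    """Wrap text in a tools/call-style result: content blocks plus isError."""
    return {"content": [{"type": "text", "text": text}], "isError": is_error}


def call_tool(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Run the (single, hypothetical) tool and report failures via isError."""
    if name != "get_weather":
        return text_result(f"Unknown tool: {name}", True)
    city = arguments.get("city")
    if not isinstance(city, str) or not city:
        return text_result("Missing required argument: city", True)
    # Stand-in for a real domain API call.
    return text_result(f"{city}: 21 C", False)


def handle(request: dict[str, Any]) -> dict[str, Any]:
    """Dispatch one JSON-RPC 2.0 request to the matching handler."""
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        result = call_tool(params.get("name", ""), params.get("arguments", {}))
    else:
        # -32601 is the standard JSON-RPC "Method not found" code.
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": "Method not found"},
        }
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
```

Returning tool failures inside the result, instead of raising, lets the calling model read the error text and retry with corrected arguments.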

### Modular Soul / Skill Architecture
| File | Responsibility |
|------|----------------|
| `SOUL.md` | Identity, mission, success metrics |
| `STYLE.md` | Voice, formatting, response templates |
| `RULES.md` | Non-negotiable constraints |
| `SKILL.md` | Deep domain playbooks |
| `prompts/*.md` | Task-specific activation templates |

**Composition rules:**
- Lazy-load skills only when triggers match
- Keep each module < 4k tokens; split when cohesion breaks
- Cross-reference by path, never duplicate full paragraphs
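
One possible way to enforce the first two composition rules is sketched below. The `Skill` type, the keyword-substring trigger matching, and the 4-characters-per-token estimate are all assumptions made for illustration; a real loader would likely use a proper tokenizer and richer trigger logic.

```python
from dataclasses import dataclass

MAX_MODULE_TOKENS = 4000  # the "< 4k tokens" budget from the rules above


@dataclass(frozen=True)
class Skill:
    path: str                  # file that holds the playbook
    triggers: tuple[str, ...]  # lowercase keywords that activate it


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return len(text) // 4


def select_skills(message: str, skills: list[Skill]) -> list[str]:
    """Return paths of skills whose triggers appear in the message."""
    lowered = message.lower()
    return [s.path for s in skills if any(t in lowered for t in s.triggers)]


def oversized(modules: dict[str, str]) -> list[str]:
    """List module paths whose estimated size reaches or exceeds the budget."""
    return sorted(
        path for path, text in modules.items()
        if estimate_tokens(text) >= MAX_MODULE_TOKENS
    )
```

Only the paths returned by `select_skills` would then be read from disk and appended to the context, so unmatched playbooks cost no tokens.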

### Tool Evaluation Rubric (score 1–5 each)
1. **Reliability** – deterministic errors, idempotency
2. **Latency** – cold start + p95 call time
3. **Observability** – logs, traces, metrics hooks
4. **Auth model** – OAuth, API key rotation, scoped tokens
5. **DX** – local dev story, mocks, test fixtures
6. **Cost** – infra + per-call token overhead
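
To compare candidate tools, the six scores can be folded into one number. The sketch below assumes a weighted mean with a default weight of 1.0 per criterion; the criterion keys and the weighting scheme are illustrative choices, not part of the rubric itself.

```python
from typing import Optional

CRITERIA = ("reliability", "latency", "observability", "auth_model", "dx", "cost")


def score_tool(scores: dict[str, int],
               weights: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of the six rubric scores; each score must be 1 to 5."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    weights = weights or {}
    # Criteria without an explicit weight count once.
    total_weight = sum(weights.get(c, 1.0) for c in CRITERIA)
    weighted_sum = sum(scores[c] * weights.get(c, 1.0) for c in CRITERIA)
    return weighted_sum / total_weight
```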

### Prompt CI Pipeline
1. **Static checks**: JSON validity, forbidden phrases, max length
2. **Golden tests**: fixed inputs → schema-valid outputs
3. **Regression diffs**: token count and latency vs baseline
4. **Red-team suite**: injection strings, tool-call hijack attempts
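
Stage 1 can be sketched as a pure function that returns a list of problems, so a CI job fails whenever the list is non-empty. The forbidden phrases and the length limit below are placeholder values, and the check assumes the artifact under test is a JSON document.

```python
import json

# Placeholder policy values; tune these per project.
FORBIDDEN_PHRASES = ("as an ai language model", "ignore previous instructions")
MAX_CHARS = 8000


def static_check(text: str) -> list[str]:
    """Return human-readable problems found in one prompt or output artifact."""
    problems = []
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc.msg}")
    lowered = text.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            problems.append(f"forbidden phrase: {phrase}")
    if len(text) > MAX_CHARS:
        problems.append(f"too long: {len(text)} > {MAX_CHARS} chars")
    return problems
```

Collecting every problem, rather than stopping at the first, gives one complete report per CI run.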

### Multi-Agent Orchestration Patterns
- **Supervisor/router**: central planner delegates to specialists
- **Pipeline**: sequential transforms with schema contracts between stages
- **Parallel fan-out / gather**: independent subtasks with merge strategy
- **Human-in-the-loop**: approval gates on destructive or high-cost tools
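
The parallel fan-out / gather pattern maps naturally onto `asyncio.gather`. In the sketch below, `specialist` is a hypothetical stand-in for a real agent call, and the merge strategy (keep each answer keyed by agent name, record failures as strings) is one simple choice among many.

```python
import asyncio


async def specialist(name: str, task: str) -> str:
    """Hypothetical specialist agent; a real one would await a model call."""
    await asyncio.sleep(0)  # yield control to the event loop
    if not task:
        raise ValueError("empty task")
    return f"{name} handled: {task}"


async def fan_out_gather(task: str, agents: list[str]) -> dict[str, str]:
    """Run all specialists concurrently and merge results by agent name."""
    results = await asyncio.gather(
        *(specialist(agent, task) for agent in agents),
        return_exceptions=True,  # one failure should not cancel the others
    )
    merged = {}
    for agent, result in zip(agents, results):
        if isinstance(result, Exception):
            merged[agent] = f"failed: {result}"
        else:
            merged[agent] = result
    return merged
```

`asyncio.gather` returns results in the order the awaitables were passed, which is what makes the `zip` with `agents` safe.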

### Observability Stack (minimal production set)
- Correlation ID propagated: `user_msg → model → tool → model`
- Per-turn token and cost attribution
- Tool failure taxonomy: `TIMEOUT | AUTH | VALIDATION | UPSTREAM | UNKNOWN`
- Weekly eval notebook against held-out task suite
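
The failure taxonomy and correlation ID can be sketched with the standard library alone. The mapping from Python exception types to taxonomy buckets is an assumption for illustration; real tools usually need their own mapping from HTTP status codes or SDK-specific errors.

```python
import json
import uuid
from enum import Enum


class ToolFailure(str, Enum):
    TIMEOUT = "TIMEOUT"
    AUTH = "AUTH"
    VALIDATION = "VALIDATION"
    UPSTREAM = "UPSTREAM"
    UNKNOWN = "UNKNOWN"


def classify(exc: BaseException) -> ToolFailure:
    """Map an exception to a taxonomy bucket (illustrative mapping only)."""
    if isinstance(exc, TimeoutError):
        return ToolFailure.TIMEOUT
    if isinstance(exc, PermissionError):
        return ToolFailure.AUTH
    if isinstance(exc, (ValueError, TypeError)):
        return ToolFailure.VALIDATION
    if isinstance(exc, ConnectionError):
        return ToolFailure.UPSTREAM
    return ToolFailure.UNKNOWN


def new_correlation_id() -> str:
    """Create one ID per user message and pass it to every later stage."""
    return uuid.uuid4().hex


def log_event(correlation_id: str, stage: str, **fields: object) -> str:
    """Serialize one structured log line; stage is e.g. 'model' or 'tool'."""
    record = {"correlation_id": correlation_id, "stage": stage, **fields}
    return json.dumps(record, sort_keys=True)
```

A tool wrapper might then emit `log_event(cid, "tool", failure=classify(exc).value)` inside its `except` block, so every failure lands in exactly one bucket.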

### Reference Stacks
| Layer | Common choices |
|-------|----------------|
| Protocol | MCP, OpenAI function calling, custom gRPC |
| Orchestration | LangGraph, Temporal, plain async Python |
| Gateway | LiteLLM, Portkey, custom router |
| Storage | Postgres + pgvector, Redis for session state |
| Eval | Promptfoo, Braintrust, custom pytest + LLM judge |

### Debugging Playbook (ordered)
1. Reproduce with minimal tool subset
2. Inspect raw request/response payloads (redact secrets)
3. Validate schema round-trip
4. Bisect context size
5. Swap model tier to isolate capability vs infra bugs
6. Add structured logging at tool boundary
7. Write regression test capturing the failure
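
Step 4 can be automated with a binary search over the conversation prefix. The sketch below assumes the failure is monotonic (once a prefix fails, every longer prefix also fails) and that `fails` is a caller-supplied, hypothetical predicate that replays a context and reports whether the bug reproduces.

```python
from typing import Callable, Sequence


def smallest_failing_prefix(
    messages: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> int:
    """Return the smallest n such that fails(messages[:n]) is True."""
    if not fails(messages):
        raise ValueError("full context does not reproduce the failure")
    lo, hi = 0, len(messages)  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(messages[:mid]):
            hi = mid       # failure already present at this length
        else:
            lo = mid + 1   # need a longer prefix to trigger it
    return lo
```

The message at index `n - 1` is then the first one whose presence triggers the failure, which makes it a good starting point for the regression test in step 7.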