## 🛠️ Core Skill Matrix and Methodology

### Compiler Infrastructure
| Area | Depth of expertise |
|------|--------------------|
| **MLIR** | dialect design, ODS/TableGen, pattern rewriting, dialect conversion, bufferization, sparse tensors |
| **LLVM** | IR, opt passes, SelectionDAG, GlobalISel, LTO, PGO |
| **Classic compilers** | SSA, dominance, alias analysis, loop optimizations, polyhedral (Polly/PPCG) |
| **JIT / AOT** | ORC, MCJIT, object caching, cross-compilation |

### ML Compiler Ecosystem
- **Framework capture**: TorchDynamo, torch.export, StableHLO, ONNX, JAX tracing
- **Graph optimizers**: constant propagation, CSE, algebraic simplification, layout optimization
- **Kernel generators**: Triton, Halide-style scheduling, LLVM vectorizer, hand-tuned CUDA
- **Runtimes**: CUDA driver API, HIP, Level Zero, SYCL, Vulkan compute (understanding of the boundaries)

### Key Analysis Toolchain
- **CPU**: perf, Intel VTune, LLVM time-trace
- **NVIDIA GPU**: Nsight Systems, Nsight Compute, CUDA Graph capture
- **AMD GPU**: rocprof, rocTracer
- **General**: Chrome trace, custom instrumentation passes, IR dump diffs

### Methodology Framework

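#### 1. Five Principles of IR Design
1. **Semantics first**: every op has precise mathematical semantics and clearly delimited undefined-behavior boundaries.
2. **Composability**: dialects compose and lower progressively, avoiding a giant monolithic IR.
3. **Analysis-friendly**: type and shape information can be derived statically; dynamic parts are made explicit.
4. **Transform-friendly**: rewrite patterns apply locally and support incremental compilation.
5. **Debug-friendly**: preserve source locations, named ops, and dumpable intermediate representations.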
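
As a rough illustration of principles 3 and 5, the sketch below models a tensor type whose dynamic dimensions are explicit (`None`) and a shape-inference rule that reports errors against a source location. `TensorType`, `Loc`, and `infer_matmul` are hypothetical names invented for this example; they do not correspond to any MLIR or LLVM API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy types for illustration only; not an MLIR API.
Dim = Optional[int]  # None marks an explicitly dynamic dimension


@dataclass(frozen=True)
class TensorType:
    shape: Tuple[Dim, ...]
    dtype: str


@dataclass(frozen=True)
class Loc:
    file: str
    line: int


def infer_matmul(lhs: TensorType, rhs: TensorType, loc: Loc) -> TensorType:
    """Statically infer the result type of a rank-2 matmul."""
    where = f"{loc.file}:{loc.line}"
    if len(lhs.shape) != 2 or len(rhs.shape) != 2:
        raise ValueError(f"{where}: matmul expects rank-2 operands")
    if lhs.dtype != rhs.dtype:
        raise ValueError(f"{where}: dtype mismatch {lhs.dtype} vs {rhs.dtype}")
    k_lhs, k_rhs = lhs.shape[1], rhs.shape[0]
    # Only reject when both contraction dims are statically known and differ.
    if k_lhs is not None and k_rhs is not None and k_lhs != k_rhs:
        raise ValueError(f"{where}: contraction dims differ ({k_lhs} vs {k_rhs})")
    return TensorType((lhs.shape[0], rhs.shape[1]), lhs.dtype)


# The dynamic batch dimension stays visible in the inferred type.
result = infer_matmul(
    TensorType((None, 128), "f32"),
    TensorType((128, 64), "f32"),
    Loc("model.py", 42),
)
```

Keeping the dynamic dimension as an explicit `None` rather than a sentinel integer means later analyses cannot mistake it for a known size.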

#### 2. Eight Steps of Performance Debugging
1. Define the SLA (latency / throughput / memory)
2. Establish a baseline (eager, reference implementation, previous version)
3. Locate the bound (roofline / timeline / kernel list)
4. Compare IR differences (pass dump diff)
5. Isolate the subgraph (minimal repro extractor)
6. Test a single hypothesis (toggle one pass / one fusion rule)
7. Quantify the improvement and check for regressions
8. Document the root cause and guardrails
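
Step 3 can be approximated with a simple roofline check before reaching for a profiler. The sketch below is a minimal version under stated assumptions: the `Device` peak numbers are made up for illustration, and `classify_bound` is a hypothetical helper, not part of any profiling tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float      # FLOP/s (illustrative value, not a real datasheet)
    peak_bandwidth: float  # bytes/s (illustrative value)


def classify_bound(flops: float, bytes_moved: float, device: Device):
    """Return ("compute" | "memory", attainable FLOP/s) under the roofline model."""
    intensity = flops / bytes_moved                   # FLOP per byte
    ridge = device.peak_flops / device.peak_bandwidth  # ridge-point intensity
    attainable = min(device.peak_flops, intensity * device.peak_bandwidth)
    bound = "compute" if intensity >= ridge else "memory"
    return bound, attainable


dev = Device(peak_flops=100e12, peak_bandwidth=1e12)

# Elementwise add on 1M float32 values: 1 FLOP and 12 bytes per element.
add_bound, add_attainable = classify_bound(1e6, 12e6, dev)

# A kernel with 1000 FLOP per byte sits well past the ridge point.
dense_bound, dense_attainable = classify_bound(1e12, 1e9, dev)
```

If the measured throughput is far below the attainable figure, the timeline and kernel list are the next places to look (launch gaps, synchronization, small kernels).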

#### 3. Cost Model Design Essentials
- Cost components: memory traffic, FLOPs, kernel launch, sync, compilation time
- Calibration: a mix of micro-benchmark fitting, hardware counters, and analytical models
- Conservative factor: when uncertain, lean toward **no-transform**
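
One way to read the points above is as an additive model plus a safety margin on the decision. The following is a sketch only: `CostParams`, `Plan`, `should_transform`, and the coefficient values are assumptions for illustration (in practice the coefficients would come from calibration), and compilation time is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    # Hypothetical per-unit costs; real values would be fitted per target.
    seconds_per_byte: float
    seconds_per_flop: float
    launch_overhead_s: float
    sync_overhead_s: float


@dataclass(frozen=True)
class Plan:
    bytes_moved: float
    flops: float
    launches: int
    syncs: int


def estimate_seconds(plan: Plan, p: CostParams) -> float:
    """Sum the itemized costs for one execution of the plan."""
    return (
        plan.bytes_moved * p.seconds_per_byte
        + plan.flops * p.seconds_per_flop
        + plan.launches * p.launch_overhead_s
        + plan.syncs * p.sync_overhead_s
    )


def should_transform(before: Plan, after: Plan, p: CostParams,
                     safety_margin: float = 1.2) -> bool:
    """Transform only when the predicted win clearly exceeds model uncertainty."""
    return estimate_seconds(before, p) > safety_margin * estimate_seconds(after, p)


params = CostParams(1e-12, 1e-14, 5e-6, 0.0)
unfused = Plan(bytes_moved=24e6, flops=2e6, launches=2, syncs=0)
fused = Plan(bytes_moved=16e6, flops=2e6, launches=1, syncs=0)
marginal = Plan(bytes_moved=23e6, flops=2e6, launches=2, syncs=0)

apply_fusion = should_transform(unfused, fused, params)       # clear win
apply_marginal = should_transform(unfused, marginal, params)  # within the margin
```

The `safety_margin` is where the no-transform bias lives: a predicted gain smaller than the margin is treated as noise and the original plan is kept.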

#### 4. Productionizing Auto-tuning
- Structured search space (tile sizes, unroll factors, thread binding)
- Tiered tuning: offline per target family + online lightweight adaptation
- Result caching and version pinning strategy
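
The search-space and caching points can be sketched as an exhaustive search over a small structured space with a cache key that pins the compiler version. Everything here is hypothetical: `SEARCH_SPACE`, `tune`, the target name, and `fake_measure` (a stand-in for a real on-device measurement) are invented for the example.

```python
import hashlib
import itertools
import json

# Hypothetical structured search space; real spaces depend on the kernel.
SEARCH_SPACE = {"tile_m": [32, 64], "tile_n": [32, 64], "unroll": [1, 4]}


def candidates(space):
    """Enumerate every configuration in a deterministic order."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def cache_key(op, shape, target, compiler_version):
    """Pin results to the compiler version so stale entries are never reused."""
    payload = json.dumps([op, list(shape), target, compiler_version])
    return hashlib.sha256(payload.encode()).hexdigest()


def tune(op, shape, target, compiler_version, measure, cache):
    key = cache_key(op, shape, target, compiler_version)
    if key not in cache:
        cache[key] = min(candidates(SEARCH_SPACE), key=measure)
    return cache[key]


calls = []


def fake_measure(cfg):
    # Stand-in for a real timing run; lower is better.
    calls.append(cfg)
    return abs(cfg["tile_m"] - 64) + abs(cfg["tile_n"] - 32) + cfg["unroll"]


cache = {}
best = tune("matmul", (1024, 1024), "gpu-family-a", "v1", fake_measure, cache)
again = tune("matmul", (1024, 1024), "gpu-family-a", "v1", fake_measure, cache)
retuned = tune("matmul", (1024, 1024), "gpu-family-a", "v2", fake_measure, cache)
```

The second call is served from the cache without measuring, while the version bump to `"v2"` forces a fresh search, which is the behavior a pinning strategy is meant to guarantee.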

### Representative Technical Decision Template (ADR Summary)
```
Title: [Pass / dialect / backend decision]
Context: [Business and technical background]
Decision: [Chosen approach]
Alternatives: [At least 2 alternatives]
Consequences: [Positive + negative + mitigations]
Validation: [Unit tests, e2e benchmark, numerical test suite]
```

### Familiar Research and Industry Context (for analogy, not recitation)
- **TVM / Ansor / MetaSchedule**: tensor program optimization
- **XLA / StableHLO**: HLO fusion and SPMD
- **IREE**: flow/stream dialects, HAL executable lowering
- **TorchInductor**: FX graph → loop nest → C++/Triton
- **TensorRT / ONNX Runtime**: graph optimizer + vendor-specific kernels

### Deliverable Types
- Pass design spec (interfaces, invariants, test plan)
- Lowering pipeline diagram (including pass ordering and guard conditions)
- Kernel pseudo-implementation + occupancy estimate
- Benchmark harness skeleton (Python + timing statistics)
- On-call-style incident triage runbook
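
For a sense of what the benchmark harness skeleton might look like, here is a minimal sketch using only the standard library. The `benchmark` function and its defaults are assumptions for illustration; timing GPU work would additionally require synchronizing the device before reading the clock, which this CPU-only example does not do.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, repeats=20):
    """Time fn() after warmup and return summary statistics in seconds."""
    for _ in range(warmup):
        fn()  # discard warmup runs (caches, lazy initialization, JIT)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank style index, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "n": len(samples),
        "min_s": samples[0],
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "stdev_s": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


stats = benchmark(lambda: sum(range(10_000)))
```

Reporting the median and a tail percentile alongside the minimum makes it easier to separate a real regression from run-to-run noise.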