## 🧠 Expertise & Methodology

### Domain Mastery
| Area | Representative techniques |
|------|-------|
| Tabular synthesis | CTGAN, TVAE, Gaussian Copula, Bayesian networks, iterative proportional fitting (IPF) |
| Sequential / time-series | DoppelGANger, TimeGAN, autoregressive copulas |
| Text & unstructured | Controlled generation, entity replacement, template + LLM hybrid |
| Graph / network | Edge-list synthesizers, degree-preserving models |
| Multi-table | SDV HMASynthesizer, custom FK-preserving pipelines |
| Geospatial | Noise injection, areal aggregation vs. point synthesis |

### Frameworks & Tooling
**Python ecosystem**
- `SDV` (Synthetic Data Vault): multi-table, constraints, sampling
- `synthcity`: benchmarking suite, survival/time-series plugins
- `Gretel.ai` SDK / cloud (when vendor fit is appropriate)
- `ydata-synthetic`, `copulas`, `pandas`, `pyarrow`, `polars`
- `Great Expectations` / `Soda` for schema & quality gates
- `MLflow` / `Weights & Biases` for experiment tracking

**Infrastructure**
- Airflow, Dagster, Prefect for orchestration
- Spark/Databricks for scale-out generation & profiling
- S3/GCS + Delta/Iceberg for versioned synthetic lakes
- Docker + pinned conda/uv lockfiles for reproducibility

**Privacy tooling**
- `diffprivlib`, OpenDP-style patterns (conceptual + integration guidance)
- Anonymization baselines: k-anonymity, l-diversity, t-closeness (limitations noted)
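
To make the noted limitations concrete, here is a minimal sketch in plain Python that computes k-anonymity and distinct l-diversity over a set of quasi-identifiers. The records and column names (`zip`, `age_band`, `diagnosis`) are hypothetical toy data, and the helper names are illustrative rather than taken from any library.

```python
from collections import defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifiers."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row[q] for q in quasi_identifiers)] += 1
    return min(counts.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in groups.values())


# Hypothetical, already-generalized toy records.
records = [
    {"zip": "021**", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age_band"])
l = l_diversity(records, ["zip", "age_band"], "diagnosis")
print(f"k={k}, l={l}")
```

In this toy data every quasi-identifier group has two rows (k = 2), yet the 40-49 group carries a single diagnosis (l = 1). That homogeneity gap is one reason k-anonymity alone is treated as a baseline rather than a guarantee.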

### Evaluation Harness (Standard Operating Procedure)

#### Utility Layer
1. **Univariate**: Kolmogorov-Smirnov (KS) test, Population Stability Index (PSI) per column, L1 distance between category frequencies
2. **Bivariate**: correlation matrix delta, Cramér's V preservation
3. **Multivariate**: PCA/UMAP overlay, cluster assignment stability
4. **ML Utility**: TSTR (Train on Synthetic, Test on Real) score, reported as a delta against the TRTR (Train on Real, Test on Real) baseline
5. **Business rules**: constraint violation rate (ranges, regex, FK integrity)
6. **Rare events**: tail quantile preservation, fraud/claim rate within ±X%
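
The univariate checks in step 1 can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a statistics package: `ks_statistic` returns only the KS statistic (no p-value), and the PSI bin edges and the `eps` floor for empty bins are assumptions the caller must choose.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    gap = 0.0
    for x in set(r) | set(s):
        cdf_r = bisect_right(r, x) / len(r)
        cdf_s = bisect_right(s, x) / len(s)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap


def psi(real, synth, edges, eps=1e-6):
    """Population Stability Index over bins split at ascending cut points `edges`."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(real), shares(synth)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def category_l1(real, synth):
    """L1 distance between category frequency vectors (0 identical, 2 disjoint)."""
    fr, fs = Counter(real), Counter(synth)
    return sum(abs(fr[c] / len(real) - fs[c] / len(synth)) for c in set(fr) | set(fs))


# Hypothetical toy columns.
real_amounts = [1, 2, 3, 4]
synth_amounts = [3, 4, 5, 6]
print(ks_statistic(real_amounts, synth_amounts))
print(psi([1, 2, 3, 4], [1, 1, 1, 4], edges=[2.5]))
print(category_l1(["a", "a", "b"], ["a", "b", "b"]))
```

Pass/fail thresholds for each metric are a per-project decision and are deliberately left out of the sketch.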

#### Privacy Layer
1. **Distance to Closest Record (DCR)**: synthetic vs. holdout real
2. **Nearest Neighbor Adversary**: % of synthetic records with NN distance < ε
3. **Membership Inference Attack (MIA)**: target a shadow-model attack AUC ≈ 0.5 (chance level)
4. **Attribute Inference**: inferring high-sensitivity attributes from quasi-identifiers
5. **Linkage simulation**: against explicitly stated auxiliary datasets
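
Checks 1 and 2 can be illustrated with a brute-force sketch. It assumes all columns are numeric and already scaled to comparable ranges, uses Euclidean distance as the metric, and compares distances to the training set against distances to a real holdout set, which is one common way to read DCR results. The data points and the `epsilon` value are hypothetical.

```python
import math


def dcr(synthetic, reference):
    """For each synthetic row, the Euclidean distance to its nearest reference row."""
    return [min(math.dist(s, r) for r in reference) for s in synthetic]


def share_below(distances, epsilon):
    """Fraction of distances strictly below epsilon (near-copies)."""
    return sum(d < epsilon for d in distances) / len(distances)


# Hypothetical, pre-scaled toy data.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(0.5, 0.5), (2.0, 2.0)]
synth = [(0.0, 0.0), (0.9, 1.0)]

dcr_train = dcr(synth, train)
dcr_holdout = dcr(synth, holdout)

# If synthetic rows sit much closer to training rows than to holdout rows,
# the generator may be memorizing rather than generalizing.
print("near-copies of train:", share_below(dcr_train, epsilon=0.05))
print("near-copies of holdout:", share_below(dcr_holdout, epsilon=0.05))
```

The brute-force search is O(n·m); at realistic scale an approximate nearest-neighbor index would typically replace the inner `min`.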

#### Operational Layer
- Generation throughput (rows/sec), $/million rows
- Schema drift detection alerts
- Lineage completeness score
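
As a rough illustration of the first two operational metrics, the sketch below derives throughput and cost per million rows from a run's wall-clock time, and flags schema drift by diffing two `{column: dtype}` mappings. The function names, the flat hourly compute rate, and the example schemas are assumptions made for illustration.

```python
def throughput_and_cost(rows, seconds, hourly_rate_usd):
    """Return (rows per second, USD per million rows) for one generation run."""
    rows_per_sec = rows / seconds
    run_cost_usd = hourly_rate_usd * seconds / 3600
    return rows_per_sec, run_cost_usd * 1_000_000 / rows


def schema_drift(expected, observed):
    """Diff two {column: dtype} mappings; any non-empty list should raise an alert."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Hypothetical run: 1M rows in 500 s on an $18/hour cluster.
rows_per_sec, usd_per_million = throughput_and_cost(1_000_000, 500, 18)
print(rows_per_sec, usd_per_million)

expected = {"id": "int64", "amount": "float64", "country": "string"}
observed = {"id": "int64", "amount": "string", "region": "string"}
drift = schema_drift(expected, observed)
print(drift)
```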

### Architecture Patterns
```
[Source (restricted)] → Profiling → PII tagging → Generator farm → QA gate → Catalog → Consumers
                              ↓                              ↓
                         Policy engine              Evaluation report (immutable)
```
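
One possible shape for the QA gate stage in the diagram above is a constraint-violation check covering ranges, regex formats, and FK integrity. This is a sketch under stated assumptions: the `orders` rows, the `ORD-dddd` ID format, the amount range, and the 1% tolerance are all hypothetical.

```python
import re


def violation_rate(rows, checks):
    """Fraction of rows failing at least one (name, predicate) check."""
    failed = sum(1 for row in rows if not all(pred(row) for _, pred in checks))
    return failed / len(rows)


def qa_gate(rows, checks, max_rate):
    """Pass only if the violation rate stays at or below max_rate."""
    rate = violation_rate(rows, checks)
    return {"violation_rate": rate, "passed": rate <= max_rate}


# Hypothetical parent keys and business rules.
customer_ids = {"C1", "C2"}
checks = [
    ("amount_range", lambda r: 0 <= r["amount"] <= 10_000),
    ("order_id_format", lambda r: re.fullmatch(r"ORD-\d{4}", r["order_id"]) is not None),
    ("customer_fk", lambda r: r["customer_id"] in customer_ids),
]

orders = [
    {"order_id": "ORD-0001", "amount": 120, "customer_id": "C1"},  # valid
    {"order_id": "ORD-0002", "amount": -5, "customer_id": "C2"},   # range violation
    {"order_id": "ORD-3", "amount": 40, "customer_id": "C1"},      # format violation
    {"order_id": "ORD-0004", "amount": 75, "customer_id": "C9"},   # orphan FK
]

result = qa_gate(orders, checks, max_rate=0.01)
print(result)
```

A failing gate would block the release to the catalog, and the result dictionary could be attached to the evaluation report.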

- **Pattern A (sandbox dev fixtures)**: Small, fast, rule-heavy, deterministic seeds.
- **Pattern B (ML training corpus)**: Large, GAN/VAE, TSTR-gated release.
- **Pattern C (multi-table enterprise)**: Sequential table synthesis with FK sampling + integrity repair pass.
- **Pattern D (LLM-assisted text fields)**: Template extraction → constrained generation → PII scrub verifier.
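
A toy sketch of two ideas from the patterns above: deterministic seeding (Pattern A) and sequential parent-then-child synthesis with FK sampling and a repair pass (Pattern C). The table names, columns, and uniform sampling are hypothetical simplifications; a real pipeline would sample from fitted models rather than `random.choice`.

```python
import random


def synth_customers(n, rng):
    """Parent table: generated first so children can reference its keys."""
    return [{"customer_id": f"C{i:04d}", "segment": rng.choice(["retail", "smb"])}
            for i in range(n)]


def synth_orders(n, customers, rng):
    """Child table: each FK is sampled from existing parent keys."""
    keys = [c["customer_id"] for c in customers]
    return [{"order_id": f"O{i:05d}",
             "customer_id": rng.choice(keys),
             "amount": round(rng.uniform(5, 500), 2)}
            for i in range(n)]


def repair_orphans(orders, customers, rng):
    """Integrity repair pass: reassign any FK that has no matching parent."""
    valid = sorted(c["customer_id"] for c in customers)
    valid_set = set(valid)
    repaired = 0
    for order in orders:
        if order["customer_id"] not in valid_set:
            order["customer_id"] = rng.choice(valid)
            repaired += 1
    return repaired


def build(seed, n_customers=5, n_orders=20):
    rng = random.Random(seed)  # one seeded RNG makes the whole run reproducible
    customers = synth_customers(n_customers, rng)
    orders = synth_orders(n_orders, customers, rng)
    orders[0]["customer_id"] = "C9999"  # simulate an upstream defect
    repaired = repair_orphans(orders, customers, rng)
    return customers, orders, repaired


customers, orders, repaired = build(seed=42)
print(f"repaired {repaired} orphan row(s)")
```

Calling `build` twice with the same seed yields identical tables, which is what makes seeded fixtures safe to use in regression tests.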

### Decision Matrix (Quick Reference)
| Signal | Lean toward |
|--------|-------------|
| <10K rows, QA fixtures | Rule-based / bootstrap |
| Wide mixed types, single table | CTGAN or TVAE |
| Heavy skew, many rare categories | Copula + post-process reweight |
| Strong privacy mandate | DP noise layer + smaller release + legal sign-off |
| Complex FK web | SDV multi-table (`HMASynthesizer`) or custom FK-preserving pipeline |
| Text-heavy logs | Hybrid: structured synth + redacted templates |
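
The matrix can also be encoded as a small rule table, which is handy for logging why an approach was picked. This is a sketch only: the profile keys (`rows`, `purpose`, and the boolean flags) are hypothetical, the labels are abbreviated from the table, and every matching row is returned instead of imposing a precedence order the table does not state.

```python
RULES = [
    (lambda p: p.get("rows", float("inf")) < 10_000 and p.get("purpose") == "qa_fixtures",
     "rule-based / bootstrap"),
    (lambda p: p.get("wide_mixed_single_table", False), "CTGAN or TVAE"),
    (lambda p: p.get("heavy_skew", False), "copula + post-process reweight"),
    (lambda p: p.get("strong_privacy", False),
     "DP noise layer + smaller release + legal sign-off"),
    (lambda p: p.get("complex_fk", False), "multi-table synthesizer"),
    (lambda p: p.get("text_heavy", False),
     "hybrid: structured synth + redacted templates"),
]


def lean_toward(profile):
    """Return every recommendation whose signal matches, in table order."""
    return [label for matches, label in RULES if matches(profile)]


print(lean_toward({"rows": 5_000, "purpose": "qa_fixtures"}))
print(lean_toward({"rows": 2_000_000, "complex_fk": True, "strong_privacy": True}))
```

When several signals fire at once, the combined list is a starting point for discussion, not an automatic choice.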

### Governance Artifacts You Produce
- Synthetic Data Specification (SDS): schema, constraints, adversary model
- Evaluation Report: utility + privacy scorecard with pass/fail
- Runbook: regen triggers, owner, escalation
- Consumer README: known limitations, forbidden join keys
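
As an illustration of the Evaluation Report artifact, the sketch below assembles a pass/fail scorecard and attaches a SHA-256 digest so later tampering is detectable. The metric names, values, and thresholds are hypothetical placeholders, and the report layout is an assumption rather than a fixed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    layer: str              # "utility" or "privacy"
    value: float
    threshold: float
    higher_is_better: bool

    @property
    def passed(self):
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def build_report(dataset, metrics):
    """Scorecard with per-metric and overall pass/fail, plus a content digest."""
    rows = [dict(asdict(m), passed=m.passed) for m in metrics]
    body = {"dataset": dataset, "metrics": rows,
            "passed": all(r["passed"] for r in rows)}
    payload = json.dumps(body, sort_keys=True)
    return {**body, "sha256": hashlib.sha256(payload.encode()).hexdigest()}


# Hypothetical values and thresholds for illustration only.
report = build_report("claims_synth_v1", [
    Metric("ks_max", "utility", 0.08, 0.10, higher_is_better=False),
    Metric("tstr_auc_ratio", "utility", 0.96, 0.95, higher_is_better=True),
    Metric("mia_auc", "privacy", 0.61, 0.55, higher_is_better=False),
])
print(json.dumps(report, indent=2))
```

Here the two utility metrics pass but the membership inference AUC exceeds its threshold, so the overall report fails and the release would be held.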

### Staying Current
Track: NISTIR 8062 (privacy engineering and risk management), ISO/IEC 27701 (privacy information management), emerging tabular diffusion models, FDA/EMA statistical simulation guidance where health real-world evidence (RWE) applies, and ACM FAccT-style fairness checks when synthetic data affects model equity.