# 🧠 SKILL.md

## Mastered Synthesis Methodologies

**Tabular & Structured Data**
- SDV family: CTGAN, CopulaGAN, TVAE, GaussianCopula — deep knowledge of when each succeeds or fails
- Advanced GAN variants: CTAB-GAN / CTAB-GAN+ (well suited to mixed column types and class imbalance)
- Diffusion models: TabDDPM, TabSyn, TabDiff (among the strongest reported fidelity on many tabular benchmarks)
- Classical & interpretable: Bayesian networks, vine copulas, Private-PGM (the last with formal DP guarantees)

**Sequential, Time-Series & Event Data**
- PAR (SDV), TimeGAN, DoppelGANger, and conditional variants for controlled simulation

**Relational & Multi-Table**
- HMA (Hierarchical Modeling Approach) and custom graph synthesis with referential integrity enforcement

**Privacy-Enhanced Generation**
- DP-SGD, DP-GAN, PATE-GAN, Private-PGM
- OpenDP / SmartNoise integration patterns
- Post-generation calibrated noise mechanisms
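
As a minimal illustration of what "calibrated" noise means, the sketch below applies the textbook Laplace mechanism to a counting query, with the noise scale set to sensitivity divided by epsilon. The function names `laplace_noise` and `dp_count` are hypothetical and not part of OpenDP or SmartNoise. Hand-rolled floating-point sampling like this is for explanation only; a vetted library should be used wherever a real guarantee is required.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one record changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only so the demo is repeatable
    ages = [23, 35, 41, 52, 67, 70]  # illustrative values, not real data
    print(dp_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` widen the noise distribution, which is the privacy/utility trade-off the rest of this skill set is built around.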

## Evaluation & Diagnostics (Your Primary Toolkit)

- **SDMetrics**: Full mastery of Column Shapes, Column Pair Trends, Coverage, Synthesis, and Detection metrics. You interpret every score in context.
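- **ML Utility Protocols**: TSTR (train on synthetic, test on real), TRTR (train on real, test on real), and TSTS (train on synthetic, test on synthetic), with proper statistical testing of performance deltas.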
- **Privacy Attack Suite**:
  - Membership Inference Attacks (shadow model methodology)
  - Attribute inference risk
  - Distance to Closest Record (DCR) and Nearest Neighbor Distance Ratio (NNDR) for memorization detection
  - Reconstruction risk for high-dimensional data
- **Visual & Qualitative**: Distribution overlays, correlation matrix deltas, PCA/t-SNE fidelity, conditional distribution checks
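
The two memorization checks in the privacy suite can be sketched in a few lines. For each synthetic row, DCR is the distance to its nearest real row, and NNDR is that distance divided by the distance to the second-nearest real row. The sketch assumes purely numeric, already-scaled features and Euclidean distance. The name `dcr_and_nndr` is hypothetical, and the brute-force search is only practical for small tables.

```python
import math
from typing import List, Sequence, Tuple

Row = Sequence[float]


def dcr_and_nndr(synthetic: Sequence[Row], real: Sequence[Row]) -> List[Tuple[float, float]]:
    """For each synthetic row, return (DCR, NNDR) measured against the real rows."""
    if len(real) < 2:
        raise ValueError("need at least two real rows to compute NNDR")
    results = []
    for s in synthetic:
        distances = sorted(math.dist(s, r) for r in real)
        nearest, second = distances[0], distances[1]
        # Assumption: when both distances are zero the ratio is undefined,
        # so the row is conservatively reported with NNDR 0.0.
        nndr = nearest / second if second > 0 else 0.0
        results.append((nearest, nndr))
    return results


if __name__ == "__main__":
    real_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]  # illustrative values
    synthetic_rows = [(0.0, 0.0), (0.5, 0.0)]
    for dcr, nndr in dcr_and_nndr(synthetic_rows, real_rows):
        print(f"DCR={dcr:.3f}  NNDR={nndr:.3f}")
```

A DCR of zero means the synthetic row is an exact copy of a real one. The scores are most informative when compared with the same statistics computed between the training data and a real holdout set, since "small" depends on the data's scale.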

## Production & MLOps Patterns

- Synthetic data pipeline as versioned software (data contracts, CI-triggered regeneration whenever the real data is refreshed)
- Holdout discipline for honest synthesizer evaluation (never tune on data later used for final quality scoring)
- Post-processing constraint solvers and rule engines
- Drift detection between evolving real data and previously generated synthetic snapshots
- Governance frameworks (dataset expiration, access logging, purpose limitation)
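
One simple way to approach the drift-detection item is a per-column two-sample Kolmogorov-Smirnov statistic between the refreshed real data and an older synthetic snapshot. The sketch below is one possible check, not a prescribed method: `ks_statistic` and `drifted_columns` are hypothetical names, the 0.1 default threshold is an arbitrary placeholder, and only numeric columns are handled.

```python
from typing import Dict, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a_sorted[i], b_sorted[j])
        # Advance both samples past x so ties are handled together.
        while i < n and a_sorted[i] <= x:
            i += 1
        while j < m and b_sorted[j] <= x:
            j += 1
        gap = max(gap, abs(i / n - j / m))
    return gap


def drifted_columns(
    real: Dict[str, Sequence[float]],
    synthetic: Dict[str, Sequence[float]],
    threshold: float = 0.1,  # placeholder value; tune per column in practice
) -> Dict[str, float]:
    """Return the shared columns whose KS statistic exceeds `threshold`."""
    report = {}
    for column in sorted(real.keys() & synthetic.keys()):
        score = ks_statistic(real[column], synthetic[column])
        if score > threshold:
            report[column] = score
    return report


if __name__ == "__main__":
    refreshed_real = {"age": [1, 2, 3, 4], "income": [10, 20, 30]}
    old_snapshot = {"age": [3, 4, 5, 6], "income": [10, 20, 30]}
    print(drifted_columns(refreshed_real, old_snapshot, threshold=0.2))
```

A flagged column is a prompt to retrain or regenerate, not proof of a problem: with small samples the statistic is noisy, so a significance test or a threshold tied to sample size is a sensible refinement.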

## Common Failure Modes You Diagnose Immediately

- Mode collapse / low diversity in GAN-based synthesizers
- Over-smoothing (excellent privacy scores but destroyed utility)
- Proxy leakage of sensitive attributes through non-sensitive but correlated columns
- Violation of hard constraints or temporal ordering
- Poor tail modeling on financial or clinical variables
- Referential integrity breakage in relational synthesis

You proactively test for and mitigate these in every project.
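
Several of these failure modes can be tested mechanically. The sketch below shows three such checks under simple assumptions: rows are plain dictionaries, and the helper names (`orphaned_foreign_keys`, `temporal_order_violations`, `category_coverage`) and column names are hypothetical, chosen only for illustration.

```python
from datetime import date
from typing import Any, Dict, Hashable, List, Sequence

Record = Dict[str, Any]


def orphaned_foreign_keys(
    child_rows: Sequence[Record], parent_rows: Sequence[Record], fk: str, pk: str
) -> List[Record]:
    """Referential integrity: child rows whose foreign key has no parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def temporal_order_violations(
    rows: Sequence[Record], earlier: str, later: str
) -> List[Record]:
    """Temporal ordering: rows where the `earlier` field falls after `later`."""
    return [row for row in rows if row[earlier] > row[later]]


def category_coverage(
    real_values: Sequence[Hashable], synthetic_values: Sequence[Hashable]
) -> float:
    """Diversity: share of real categories that appear in the synthetic column."""
    real_categories = set(real_values)
    if not real_categories:
        raise ValueError("real_values must not be empty")
    return len(real_categories & set(synthetic_values)) / len(real_categories)


if __name__ == "__main__":
    # Illustrative synthetic tables, not real data.
    users = [{"user_id": 1}, {"user_id": 2}]
    orders = [
        {"order_id": 10, "user_id": 1,
         "ordered": date(2024, 1, 5), "shipped": date(2024, 1, 7)},
        {"order_id": 11, "user_id": 3,
         "ordered": date(2024, 2, 1), "shipped": date(2024, 1, 30)},
    ]
    print(orphaned_foreign_keys(orders, users, fk="user_id", pk="user_id"))
    print(temporal_order_violations(orders, "ordered", "shipped"))
    print(category_coverage(["a", "b", "c", "a"], ["a", "a", "b"]))
```

Low category coverage is only a coarse signal of mode collapse. It catches categories that vanish entirely but not ones that are badly under-represented, so it complements rather than replaces distribution-level metrics.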