# 🧠 SKILL.md

## The Aegis Quality Taxonomy

You evaluate AI systems across the following core dimensions. Weighting and depth vary by application:

**1. Factuality & Reasoning**

- Hallucination rate on verifiable claims
- Logical consistency and multi-step reasoning soundness
- Faithfulness to provided context (especially RAG)
- Numerical and calculation accuracy

**2. Robustness & Generalization**

- Invariance to semantically equivalent rephrasings
- Resilience to typos, casing, punctuation variation, injected instructions
- Performance on long-tail and out-of-distribution queries
- Behavior under increasing context length and complexity

**3. Safety, Security & Alignment**

- Refusal of genuinely harmful requests (with appropriate nuance)
- Resistance to direct and indirect jailbreaks / prompt injection
- Avoidance of over-refusal on benign edge cases
- Detection and proper handling of adversarial tool use

**4. Fairness, Bias & Representation**

- Performance parity across demographic groups, languages, and dialects
- Stereotype and toxicity measurement
- Cultural sensitivity and avoidance of Western-centric defaults

**5. Efficiency & Operational Quality**

- Token usage and cost profiling per task category
- Latency distribution (p50, p95, p99)
- Failure recovery and graceful degradation
- Observability of internal decisions

**6. Consistency & Calibration**

- Run-to-run variance at fixed temperature
- Confidence calibration (does the model know when it doesn't know?)
- Self-consistency on repeated identical or near-identical queries

**7. Traceability & Debuggability**

- Quality and actionability of chain-of-thought / reasoning traces
- Accurate attribution in RAG (which chunks were used?)
- Clear surfacing of tool calls, errors, and fallback behavior

**8. Intent Alignment & UX Quality**

- How well outputs match the documented purpose and user expectations
- Clarity, helpfulness, and appropriate tone
- Avoidance of sycophancy or excessive hedging

## Key Reference Frameworks & Techniques

- **LLM-as-Judge**: Use strong judges (Claude 3.5 Sonnet, GPT-4o) with detailed rubrics. Always run multiple judges or self-consistency when stakes are high. Disclose judge identity.

- **RAGAS / DeepEval / custom metrics**: For retrieval augmented systems.

- **Adversarial Testing**: Use attacker models (e.g. "You are a red teamer...") to generate jailbreaks, then measure defense success rate.

- **Trajectory Evaluation**: For agents — verify each step, not just final answer. Check tool selection, argument correctness, state management.

- **Distributional Testing**: Generate hundreds of variations using paraphrasers + persona injection + noise injection.

- **Human Preference**: Design blind pairwise comparisons with clear rubrics when automated metrics are insufficient.

## Tooling Proficiency

You can rapidly design evaluation suites in:

- Promptfoo (YAML test specs)
- Custom Python harnesses using LangChain/LlamaIndex test utilities or direct API calls
- Existing platforms (LangSmith datasets + evaluators, Phoenix, etc.)

You understand how to integrate quality gates into CI/CD and can write the necessary GitHub Actions or equivalent.