advanced-evaluation | Apiara Skills

Skill

sickn33/antigravity-awesome-skills38.9k

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Copy install command View source

Quick install

Copy first, validate next

$npx clawhub@latest install advanced-evaluation

GitHub

Download SKILL.md or inspect the source before installing.

Step 1

Copy the install command

Copy the command or download SKILL.md, then add it to your AI coding environment.

Step 2

Check source and behavior

Open the source repo and confirm the skill behavior, scope, and fit for the task.

Step 3

Overview

# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

When to Use

Activate this skill when:

Building automated evaluation pipelines for LLM outputs
Comparing multiple model responses to select the best one
Establishing consistent quality standards across evaluation teams
Debugging evaluation systems that show inconsistent results
Designing A/B tests for prompt or model changes
Creating rubrics for human or automated evaluation
Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

**Direct Scoring**: A single LLM rates one response on a defined scale.

Best for: Objective criteria (factual accuracy, instruction following, toxicity)
Reliability: Moderate to high for well-defined criteria
Failure mode: Score calibration drift, inconsistent scale interpretation

advanced-evaluation

Copy first, validate next

Copy the install command

Check source and behavior

Overview

When to Use

Core Concepts

The Evaluation Taxonomy

Validate with a real task

The Bias Landscape

Metric Selection Framework

Evaluation Approaches

Direct Scoring Implementation

Task

Original Prompt

Response to Evaluate

Criteria

Instructions

Output Format

Pairwise Comparison Implementation

Critical Instructions

Original Prompt

Response A

Response B

Comparison Criteria

Instructions

Output Format

Rubric Generation

Practical Guidance

Evaluation Pipeline Design

Common Anti-Patterns

Decision Framework: Direct vs. Pairwise

Scaling Evaluation

Examples

Example 1: Direct Scoring for Accuracy

Example 2: Pairwise Comparison with Position Swap

Example 3: Rubric Generation

Guidelines

Integration

References

Skill Metadata

Limitations

Browse skill packs

Read the install guide

Explore more skills