"INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt."
This skill covers designing, creating, and running **LLM-as-judge evaluators** on Arize. An evaluator defines the judge; a **task** is how you run it against real data.
---
Prerequisites
Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.
If an `ax` command fails, troubleshoot based on the error:
- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
- LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → check `.env`, load if present (snippet below), otherwise ask the user
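If the keys do live in a local `.env`, a quick way to load them into the shell before retrying (assumes plain `KEY=value` lines; adjust if your `.env` uses quoting or comments):
```bash
# Export everything in .env into the current shell, then sanity-check auth
set -a; source .env; set +a
ax spaces list -o json
```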
---
Concepts
What is an Evaluator?
An **evaluator** is an LLM-as-judge definition. It contains:
| Field | Description |
|-------|-------------|
| **Template** | The judge prompt. Uses `{variable}` placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |
| **Classification choices** | The set of allowed output labels (e.g. `factual` / `hallucinated`). Binary is the default and most common. Each choice can optionally carry a numeric score. |
| **AI Integration** | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| **Model** | The specific judge model (e.g. `gpt-4o`, `claude-sonnet-4-5`). |
| **Invocation params** | Optional JSON of model settings like `{"temperature": 0}`. Low temperature is recommended for reproducibility. |
| **Optimization direction** | Whether higher scores are better (`maximize`) or worse (`minimize`). Sets how the UI renders trends. |
| **Data granularity** | Whether the evaluator runs at the **span**, **trace**, or **session** level. Most evaluators run at the span level. |
Evaluators are **versioned** — every prompt or model change creates a new immutable version. The most recent version is active.
What is a Task?
A **task** is how you run one or more evaluators against real data. Tasks are attached to a **project** (live traces/spans) or a **dataset** (experiment runs). A task contains:
| Field | Description |
|-------|-------------|
| **Evaluators** | List of evaluators to run. You can run multiple in one task. |
| **Column mappings** | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `"input" → "attributes.input.value"`). This is what makes evaluators portable across projects and experiments. |
| **Query filter** | SQL-style expression to select which spans/runs to evaluate (e.g. `"span_kind = 'LLM'"`). Optional but important for precision. |
| **Continuous** | For project tasks: whether to automatically score new spans as they arrive. |
| **Sampling rate** | For continuous project tasks: fraction of new spans to evaluate (0–1). |
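As a rough sketch, the wiring for a single evaluator inside a task's evaluators JSON often looks like the snippet below. Only `column_mappings` and `query_filter` are fields named above; the surrounding key names are illustrative, so confirm the exact shape against the tasks CLI help before using it.
```bash
# Illustrative only: key names other than column_mappings and query_filter are assumptions
cat > evaluators.json <<'EOF'
[
  {
    "evaluator_id": "EVALUATOR_ID",
    "column_mappings": {
      "input": "attributes.input.value",
      "output": "attributes.output.value"
    },
    "query_filter": "span_kind = 'LLM'"
  }
]
EOF
```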
---
Data Granularity
The `--data-granularity` flag controls what unit of data the evaluator scores. It defaults to `span` and only applies to **project tasks** (not dataset/experiment tasks — those evaluate experiment runs directly).
| Level | What it evaluates | Use for | Result column prefix |
|-------|-------------------|---------|----------------------|
| `trace` | All spans in a trace, grouped by `context.trace_id` | Agent trajectory, task correctness — anything that needs the full call chain | `trace_eval.{name}.label` / `.score` / `.explanation` |
| `session` | All traces in a session, grouped by `attributes.session.id` and ordered by start time | Multi-turn coherence, overall tone, conversation quality | `session_eval.{name}.label` / `.score` / `.explanation` |
How trace and session aggregation works
For **trace** granularity, spans sharing the same `context.trace_id` are grouped together. Column values used by the evaluator template are comma-joined into a single string (each value truncated to 100K characters) before being passed to the judge model.
For **session** granularity, the same trace-level grouping happens first, then traces are ordered by `start_time` and grouped by `attributes.session.id`. Session-level values are capped at 100K characters total.
The `{conversation}` template variable
At session granularity, `{conversation}` is a special template variable that renders as a JSON array of `{input, output}` turns across all traces in the session, built from `attributes.input.value` / `attributes.llm.input_messages` (input side) and `attributes.output.value` / `attributes.llm.output_messages` (output side).
At span or trace granularity, `{conversation}` is treated as a regular template variable and resolved via column mappings like any other.
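A minimal sketch of a session-level judge that uses `{conversation}` (the prompt wording and label names are illustrative; the flags are the ones documented under Basic CRUD below):
```bash
# Sketch: session-granularity evaluator scoring multi-turn coherence
ax evaluators create \
  --name "Conversation Coherence" \
  --space-id SPACE_ID \
  --commit-message "Initial version" \
  --ai-integration-id INTEGRATION_ID \
  --model-name gpt-4o \
  --data-granularity session \
  --template 'You are judging a multi-turn conversation. Turns as JSON: {conversation}. Respond with exactly one of these labels: coherent, incoherent' \
  --classification-choices '{"coherent": 1, "incoherent": 0}' \
  --invocation-params '{"temperature": 0}'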
Multi-evaluator tasks
A task can contain evaluators at different granularities. At runtime the system uses the **highest** granularity (session > trace > span) for data fetching and automatically **splits into one child run per evaluator**. Per-evaluator `query_filter` in the task's evaluators JSON further narrows which spans are included (e.g., only tool-call spans within a session).
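For example (shape illustrative, as in the Concepts sketch above): a task that mixes a session-level evaluator with a span-level one fetches data at session granularity, and each child run then applies its own `query_filter`.
```bash
# Illustrative only: the session evaluator relies on the special {conversation} variable,
# while the span-level evaluator is narrowed to tool-call spans by its own query_filter
cat <<'EOF'
[
  { "evaluator_id": "SESSION_COHERENCE_ID" },
  {
    "evaluator_id": "TOOL_CALL_CORRECTNESS_ID",
    "column_mappings": { "input": "attributes.input.value", "output": "attributes.output.value" },
    "query_filter": "span_kind = 'TOOL'"
  }
]
EOF
```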
---
Basic CRUD
AI Integrations
AI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the **arize-ai-provider-integration** skill.
Quick reference for the common case (OpenAI):
```bash
# Check for an existing integration first
ax ai-integrations list --space-id SPACE_ID
# Create if none exists
ax ai-integrations create \
--name "My OpenAI Integration" \
--provider openAI \
--api-key $OPENAI_API_KEY
```
Copy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.
Evaluators
```bash
# List / Get
ax evaluators list --space-id SPACE_ID
ax evaluators get EVALUATOR_ID
ax evaluators list-versions EVALUATOR_ID
ax evaluators get-version VERSION_ID
# Create (creates the evaluator and its first version)
ax evaluators create \
  --name "Answer Correctness" \
  --space-id SPACE_ID \
  --description "Judges if the model answer is correct" \
  --commit-message "Initial version" \
  --ai-integration-id INTEGRATION_ID \
  --model-name gpt-4o \
  --template 'Question: {input} Answer: {output} Is the answer correct? Respond with exactly one of these labels: correct, incorrect' \
  --classification-choices '{"correct": 1, "incorrect": 0}' \
  --invocation-params '{"temperature": 0}' \
  --include-explanations
```
Key flags for `ax evaluators create`:
| Flag | Required | Description |
|------|----------|-------------|
| `--commit-message` | yes | Description of this version |
| `--ai-integration-id` | yes | AI integration ID (from above) |
| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |
| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |
| `--classification-choices` | yes | JSON object mapping choice labels to numeric scores e.g. `'{"correct": 1, "incorrect": 0}'` |
| `--description` | no | Human-readable description |
| `--include-explanations` | no | Include reasoning alongside the label |
| `--use-function-calling` | no | Prefer structured function-call output |
| `--invocation-params` | no | JSON of model params e.g. `'{"temperature": 0}'` |
| `--data-granularity` | no | `span` (default), `trace`, or `session`. Only relevant for project tasks, not dataset/experiment tasks. See Data Granularity section. |
| `--provider-params` | no | JSON object of provider-specific parameters |
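After creating, confirm the evaluator and its first version exist before wiring it into a task:
```bash
# Verify the new evaluator and inspect its first version
ax evaluators list --space-id SPACE_ID
ax evaluators list-versions EVALUATOR_ID
```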
Inspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose **1–3 concrete evaluator ideas**. Let the user pick.
Each suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:
1. **Name** — Description of what is being judged. (`label_a` / `label_b`)
Example:
1. **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)
2. **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)
Step 3: Confirm or create an AI integration
```bash
ax ai-integrations list --space-id SPACE_ID -o json
```
If a suitable integration exists, note its ID. If not, create one using the **arize-ai-provider-integration** skill. Ask the user which provider/model they want for the judge.
Step 4: Create the evaluator
Use the template design best practices below. Keep the evaluator name and variables **generic** — the task (Step 6) handles project-specific wiring via `column_mappings`.
For each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — **always verify on your actual data before using**:
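The paths below are common starting points; the first two match attribute names used elsewhere in this skill, while the `{context}` path is a guess that varies by retriever instrumentation. Treat every one as an assumption until checked against an exported span.
```bash
# Starting-point mappings: verify each path on a real exported span before relying on it
cat <<'EOF'
{
  "input": "attributes.input.value",
  "output": "attributes.output.value",
  "context": "attributes.retrieval.documents"
}
EOF
```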
**Validate span kind alignment:** If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.
Look at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose **1–3 evaluator ideas**. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.
Step 3: Confirm or create an AI integration
Same as Workflow A, Step 3.
Step 4: Create the evaluator
Same as Workflow A, Step 4. Keep variables generic.
Step 5: Determine column mappings from real run data
Template Design Best Practices
1. Keep template variables generic
Use `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.
2. Default to binary labels
Use exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). Binary labels are:
- Easiest for the judge model to produce consistently
- Most common in the industry
- Simplest to interpret in dashboards
If the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).
3. Be explicit about what the model must return
The template must tell the judge model to respond with **only** the label string — nothing else. The label strings in the prompt must **exactly match** the labels in `--classification-choices` (same spelling, same casing).
Good:
```
Respond with exactly one of these labels: hallucinated, factual
```
Bad (too open-ended):
```
Is this hallucinated? Answer yes or no.
```
4. Keep temperature low
Pass `--invocation-params '{"temperature": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.
5. Use `--include-explanations` for debugging
During initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.
6. Pass the template in single quotes in bash
Single quotes hand the template to `ax` exactly as written. With double quotes, the shell expands `$`, backticks, and history `!` if they appear in the template, which can silently corrupt the prompt:
```bash
# Correct
--template 'Judge this: {input} → {output}'
# Risky: the shell expands $, backticks, and ! inside double quotes
--template "Judge this: {input} → {output}"
```
7. Always set `--classification-choices` to match your template labels
The labels in `--classification-choices` must exactly match the labels referenced in `--template` (same spelling, same casing). Omitting `--classification-choices` causes task runs to fail with "missing rails and classification choices."
---
Troubleshooting
| Problem | Solution |
|---------|----------|
| `ax: command not found` | See references/ax-setup.md |
| `401 Unauthorized` | API key may not have access to this space. Verify at https://app.arize.com/admin > API Keys |
| `Evaluator not found` | `ax evaluators list --space-id SPACE_ID` |
| `Integration not found` | `ax ai-integrations list --space-id SPACE_ID` |
| `Task not found` | `ax tasks list --space-id SPACE_ID` |
| `project-id and dataset-id are mutually exclusive` | Use only one when creating a task |
| `experiment-ids required for dataset tasks` | Add `--experiment-ids` to `create` and `trigger-run` |
| `sampling-rate only valid for project tasks` | Remove `--sampling-rate` from dataset tasks |
| Validation error on `ax spans export` | Pass project ID (base64), not project name — look up via `ax projects list` |
| Template validation errors | Use single-quoted `--template '...'` in bash; single braces `{var}`, not double `{{var}}` |
| Run stuck in `pending` | `ax tasks get-run RUN_ID`; then `ax tasks cancel-run RUN_ID` |
| Run `cancelled` ~1s | Integration credentials invalid — check AI integration |
| Run `cancelled` ~3min | Found spans but LLM call failed — wrong model name or bad key |
| Run `completed`, 0 spans | Widen time window; eval index may not cover older data |
| No scores in UI | Fix `column_mappings` to match real paths on your spans/runs |
| Scores look wrong | Add `--include-explanations` and inspect judge reasoning on a few samples |
| Evaluator cancels on wrong span kind | Match `query_filter` and `column_mappings` to LLM vs CHAIN spans |
| Time format error on `trigger-run` | Use `2026-03-21T09:00:00` — no trailing `Z` |
| Run failed: "missing rails and classification choices" | Add `--classification-choices '{"label_a": 1, "label_b": 0}'` to `ax evaluators create` — labels must match the template |
| Run `completed`, all spans skipped | Query filter matched spans but column mappings are wrong or template variables don't resolve — export a sample span and verify paths |
---
Related Skills
- **arize-ai-provider-integration**: Full CRUD for LLM provider integrations (create, update, delete credentials)
- **arize-trace**: Export spans to discover column paths and time ranges
- **arize-experiment**: Create experiments and export runs for experiment column mappings
- **arize-dataset**: Export dataset examples to find input fields when runs omit them
- **arize-link**: Deep links to evaluators and tasks in the Arize UI
---
Save Credentials for Future Use
See references/ax-profiles.md § Save Credentials for Future Use.