"INVOKE THIS SKILL when optimizing, improving, or debugging LLM prompts using production trace data, evaluations, and annotations. Covers extracting prompts from spans, gathering performance signal, and running a data-driven optimization loop using the ax CLI."
LLM applications emit spans following OpenInference semantic conventions. Prompts are stored in different span attributes depending on the span kind and instrumentation:
| Column | What it contains | When to use |
|--------|-----------------|-------------|
| `attributes.llm.input_messages` | Structured chat messages (system, user, assistant, tool) in role-based format | **Primary source** for chat-based LLM prompts |
| `attributes.llm.input_messages.contents` | Array of message content strings | Extract message text |
| `attributes.input.value` | Serialized prompt or user question (generic, all span kinds) | Fallback when structured messages are not available |
| `attributes.llm.prompt_template.template` | Template with `{variable}` placeholders (e.g., `"Answer {question} using {context}"`) | When the app uses prompt templates |
| `attributes.llm.prompt_template.variables` | Template variable values (JSON object) | See what values were substituted into the template |
| `attributes.output.value` | Model response text | See what the LLM produced |
**LLM span** (`attributes.openinference.span.kind = 'LLM'`): Check `attributes.llm.input_messages` for structured chat messages, OR `attributes.input.value` for a serialized prompt. Check `attributes.llm.prompt_template.template` for the template.
**Chain/Agent span**: `attributes.input.value` contains the user's question. The actual LLM prompt lives on **child LLM spans** -- navigate down the trace tree.
**Tool span**: `attributes.input.value` has tool input, `attributes.output.value` has tool result. Not typically where prompts live.
Performance Signal Columns
These columns carry the feedback data used for optimization:
| `eval.<name>.explanation` | LLM-as-judge evals | Why the eval gave that score -- **most valuable for optimization** |
| `attributes.input.value` | Trace data | What went into the LLM |
| `attributes.output.value` | Trace data | What the LLM produced |
| `{experiment_name}.output` | Experiment runs | Output from a specific experiment |
Prerequisites
Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.
If an `ax` command fails, troubleshoot based on the error:
`command not found` or version error → see references/ax-setup.md
`401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
Project unclear → check `.env` for `ARIZE_DEFAULT_PROJECT`, or ask, or run `ax projects list -o json --limit 100` and present as selectable options
LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → check `.env`, load if present, otherwise ask the user
Phase 1: Extract the Current Prompt
Find LLM spans containing prompts
```bash
# List LLM spans (where prompts live)
ax spans list PROJECT_ID --filter "attributes.openinference.span.kind = 'LLM'" --limit 10
# Filter by model
ax spans list PROJECT_ID --filter "attributes.llm.model_name = 'gpt-4o'" --limit 10
# Filter by span name (e.g., a specific LLM call)
ax spans list PROJECT_ID --filter "name = 'ChatCompletion'" --limit 10
Once you have the span data, reconstruct the prompt as a messages array:
```json
[
{"role": "system", "content": "You are a helpful assistant that..."},
{"role": "user", "content": "Given {input}, answer the question: {question}"}
]
```
If the span has `attributes.llm.prompt_template.template`, the prompt uses variables. Preserve these placeholders (`{variable}` or `{{variable}}`) -- they are substituted at runtime.
Phase 2: Gather Performance Data
From traces (production feedback)
```bash
# Find error spans -- these indicate prompt failures
ax spans list PROJECT_ID \
--filter "status_code = 'ERROR' AND attributes.openinference.span.kind = 'LLM'" \