ai-eval is an open-source eval harness for LLM-powered features. Define your prompt and test cases in YAML, run them against any supported provider (OpenRouter, OpenAI, or Anthropic), and get a JSON report. Built around a CI-first workflow: it exits with code 1 on any failure, so it slots into GitHub Actions or any other pipeline. Five assertion types ship in v0.1: contains, not-contains, regex, equals, and llm-judge (a grader model checks the output against a rubric, which is useful when correctness can't be checked mechanically). A web viewer at /eval renders any report locally, so prompts and outputs never leave the browser. Companion to the Multi-Agent PR Reviewer: PR Reviewer reviews the code, ai-eval reviews the AI.
Working CLI + web viewer + GitHub Action snippet
Three-provider support (OpenRouter, OpenAI, Anthropic) from a single YAML config
Companion tool to Multi-Agent PR Reviewer — same engineer voice, different layer of the stack
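To make the five assertion types and the CI exit code concrete, here is a minimal Python sketch. It is not ai-eval's actual code: the `{"type": ..., "value": ...}` assertion shape, the `check` and `exit_code` helpers, and the `judge` callable are all hypothetical names chosen for illustration, and `equals` is assumed to mean an exact string match.

```python
import re
from typing import Callable, Optional

# Hypothetical assertion shape: {"type": <name>, "value": <string>}.
# This is an illustration, not ai-eval's real schema.
def check(assertion: dict, output: str,
          judge: Optional[Callable[[str, str], bool]] = None) -> bool:
    kind, value = assertion["type"], assertion["value"]
    if kind == "contains":
        return value in output
    if kind == "not-contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "equals":
        return output == value  # assumed: exact match, no trimming
    if kind == "llm-judge":
        # The grader model is injected as a callable: judge(output, rubric).
        if judge is None:
            raise ValueError("llm-judge needs a grader callable")
        return judge(output, value)
    raise ValueError(f"unknown assertion type: {kind}")

def exit_code(results: list[bool]) -> int:
    # CI-first behaviour: any single failure yields exit code 1.
    return 0 if all(results) else 1

output = "Refunds are processed within 5 business days."
results = [
    check({"type": "contains", "value": "Refunds"}, output),
    check({"type": "regex", "value": r"\d+ business days"}, output),
    check({"type": "not-contains", "value": "guarantee"}, output),
]
code = exit_code(results)
```

Passing the grader in as a callable keeps the four mechanical checks testable without a network call; a real run would hand `code` to `sys.exit`.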
Chain LLM steps into a workflow with {{stepId.output}} substitution between steps. Pick a preset, edit any prompt, watch the chain execute step by step.
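The `{{stepId.output}}` substitution can be sketched roughly as follows. This is an illustrative guess at the mechanism rather than the project's implementation: the step shape (`id`, `prompt`), the `render` and `run_chain` helpers, and the `call_model` callable are all assumptions.

```python
import re
from typing import Callable

# Matches {{stepId.output}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([\w-]+)\.output\s*\}\}")

def render(template: str, outputs: dict[str, str]) -> str:
    """Replace each {{stepId.output}} with that step's recorded output."""
    def swap(match: re.Match) -> str:
        step_id = match.group(1)
        if step_id not in outputs:
            raise KeyError(f"step '{step_id}' has not run yet")
        return outputs[step_id]
    return PLACEHOLDER.sub(swap, template)

def run_chain(steps: list[dict], call_model: Callable[[str], str]) -> dict[str, str]:
    """Run steps in order; each prompt may reference earlier steps' outputs."""
    outputs: dict[str, str] = {}
    for step in steps:
        outputs[step["id"]] = call_model(render(step["prompt"], outputs))
    return outputs

# Stub model for demonstration: upper-cases the prompt instead of calling an LLM.
steps = [
    {"id": "summarise", "prompt": "Summarise: cats"},
    {"id": "translate", "prompt": "Translate {{summarise.output}}"},
]
chain_outputs = run_chain(steps, lambda prompt: prompt.upper())
```

Using a replacement function rather than a replacement string means backslashes in a model's output are inserted literally instead of being read as regex escapes.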
Paste any public GitHub PR: four specialised AI agents review it in parallel (correctness, security, style, tests), and a lead reviewer synthesises a severity-graded verdict.
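The parallel review followed by a synthesis step might look something like this with `asyncio`. It is a sketch under stated assumptions, not the reviewer's real code: the `ask` coroutine stands in for whatever LLM client is used, and the prompt wording and the `review` and `review_pr` helpers are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

REVIEWERS = ["correctness", "security", "style", "tests"]

Ask = Callable[[str], Awaitable[str]]  # hypothetical LLM call: prompt -> reply

async def review(focus: str, diff: str, ask: Ask) -> dict:
    findings = await ask(f"Review this diff for {focus}:\n{diff}")
    return {"focus": focus, "findings": findings}

async def review_pr(diff: str, ask: Ask) -> dict:
    # Fan out: the four specialised reviews run concurrently.
    reports = await asyncio.gather(*(review(f, diff, ask) for f in REVIEWERS))
    # Fan in: the lead reviewer sees every report and produces one verdict.
    summary = "\n".join(f"[{r['focus']}] {r['findings']}" for r in reports)
    verdict = await ask(f"Synthesise a severity-graded verdict from:\n{summary}")
    return {"reports": reports, "verdict": verdict}

# Stub model for demonstration: echoes the first line of the prompt.
async def fake_ask(prompt: str) -> str:
    return prompt.splitlines()[0]

result = asyncio.run(review_pr("- old line\n+ new line", fake_ask))
```

`asyncio.gather` returns results in the order the coroutines were passed, so the reports line up with `REVIEWERS` regardless of which call finishes first.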