ai
2026

ai-eval — Open-Source LLM Eval Harness

ai-eval is an open-source eval harness for LLM-powered features. Define your prompt and test cases in YAML, run them against any supported provider (OpenRouter, OpenAI, or Anthropic), and get a JSON report. It is built around a CI-first workflow: the CLI exits with code 1 on any failure, so it slots into GitHub Actions or any other pipeline. Five assertion types ship in v0.1: contains, not-contains, regex, equals, and llm-judge (a grader model checks the output against a rubric, which is useful when correctness cannot be checked mechanically). A web viewer at /eval renders any report locally, so prompts and outputs never leave the browser. ai-eval is a companion to the Multi-Agent PR Reviewer: PR Reviewer reviews the code, ai-eval reviews the AI.
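The four mechanical assertion types can be pictured as a small discriminated union with one evaluator. This is only a sketch of the idea, not ai-eval's actual code: the `Assertion` shape, the field names (`value`, `pattern`), the `checkAssertion` and `checkAll` helpers, and the whitespace-trimming behaviour of `equals` are all assumptions made for illustration.

```typescript
// Hypothetical shapes: ai-eval's real config schema and field names may differ.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "not-contains"; value: string }
  | { type: "regex"; pattern: string }
  | { type: "equals"; value: string };

interface AssertionResult {
  type: Assertion["type"];
  passed: boolean;
}

// Check one model output against one mechanical assertion.
function checkAssertion(output: string, assertion: Assertion): AssertionResult {
  switch (assertion.type) {
    case "contains":
      return { type: assertion.type, passed: output.includes(assertion.value) };
    case "not-contains":
      return { type: assertion.type, passed: !output.includes(assertion.value) };
    case "regex":
      return { type: assertion.type, passed: new RegExp(assertion.pattern).test(output) };
    case "equals":
      // Assumption: surrounding whitespace is ignored when comparing.
      return { type: assertion.type, passed: output.trim() === assertion.value.trim() };
  }
}

// A test case passes only if every one of its assertions passes.
function checkAll(output: string, assertions: Assertion[]): boolean {
  return assertions.every((a) => checkAssertion(output, a).passed);
}
```

Keeping each assertion as plain data is what would let the same definitions live in a YAML file and be echoed back in the JSON report.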

Technologies

Frontend

Next.js 15
React 19
TypeScript
Tailwind CSS

Backend

Node.js (CLI)
TypeScript
yaml parser

Tools

OpenRouter / OpenAI / Anthropic
GitHub Actions
JSON Schema

Challenges
  • Most teams shipping LLM features write the same eval harness twice — once badly, the second time still badly
  • Hosted options (Braintrust, LangSmith, Patronus) are pricey at scale; existing OSS is fragmented
  • Subjective correctness (was the answer 'reasonable'?) is hard to test mechanically — you need an LLM in the loop
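One way to put an LLM in the loop is to hand the rubric and the output to a grader model and parse a strict verdict from its reply. The sketch below assumes a hypothetical `Grader` function that wraps whichever provider is in use, and a PASS/FAIL reply convention. None of these names or the prompt wording come from ai-eval itself.

```typescript
// Hypothetical types: not ai-eval's actual API.
// A Grader wraps a provider call: it takes a prompt and resolves to the reply text.
type Grader = (prompt: string) => Promise<string>;

interface JudgeVerdict {
  passed: boolean;
  raw: string; // raw grader reply, kept for debugging
}

// Build the grading prompt from the rubric and the output under test.
function buildJudgePrompt(rubric: string, output: string): string {
  return [
    "You are grading a model output against a rubric.",
    `Rubric: ${rubric}`,
    `Output: ${output}`,
    "Reply with PASS or FAIL on the first line.",
  ].join("\n");
}

// Fail closed: anything other than an explicit PASS on the first line is a failure.
function parseVerdict(raw: string): boolean {
  const firstLine = (raw.trim().split("\n")[0] ?? "").trim().toUpperCase();
  return firstLine === "PASS";
}

async function llmJudge(rubric: string, output: string, grader: Grader): Promise<JudgeVerdict> {
  const raw = await grader(buildJudgePrompt(rubric, output));
  return { passed: parseVerdict(raw), raw };
}
```

Injecting the grader as a function keeps the judging logic testable with a stub, without any network call.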

Solutions
  • YAML config + 5 assertion types covering both mechanical (contains, regex, equals) and judgement-based (llm-judge with a rubric) correctness
  • Non-zero exit code on any failure so it drops into any CI pipeline without glue code
  • Browser-only web viewer for inspecting reports — auto-expands failing cases, shows the exact assertion that fired
  • Single GitHub Action example wires PR comments + artifact upload in under 30 lines of YAML
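The CI gate described above comes down to mapping a report to an exit code. A minimal sketch, assuming a hypothetical report shape with a `results` array of named pass/fail entries (the real JSON report may be structured differently):

```typescript
// Hypothetical report shape: the real JSON report may use different fields.
interface CaseResult {
  name: string;
  passed: boolean;
}

interface Report {
  results: CaseResult[];
}

// Any failing case yields exit code 1; an all-green report yields 0.
function exitCodeFor(report: Report): 0 | 1 {
  return report.results.some((r) => !r.passed) ? 1 : 0;
}

// Names of the failing cases, e.g. for a PR comment or for a viewer to expand first.
function failingCases(report: Report): string[] {
  return report.results.filter((r) => !r.passed).map((r) => r.name);
}
```

In a Node.js CLI the result could be assigned to `process.exitCode`, and a non-zero value is what makes a GitHub Actions step fail.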

Key Outcomes & Impact

Working CLI + web viewer + GitHub Action snippet

Three-provider support (OpenRouter, OpenAI, Anthropic) from a single YAML config

Companion tool to Multi-Agent PR Reviewer — same engineer voice, different layer of the stack

Other Projects

AI Workflow Builder

Chain LLM steps into a workflow with {{stepId.output}} substitution between steps. Pick a preset, edit any prompt, watch the chain execute step by step.

Multi-Agent PR Reviewer

Paste any public GitHub PR — four specialised AI agents review it in parallel (correctness, security, style, tests), a lead reviewer synthesises a severity-graded verdict.