2026

ai-eval — Open-Source LLM Eval Harness

ai-eval is an open-source eval harness for LLM-powered features. Define your prompt and test cases in YAML, run them against a supported provider (OpenRouter, OpenAI, or Anthropic), and get a JSON report. The workflow is CI-first: the CLI exits with code 1 on any failure, so it slots into GitHub Actions or any other pipeline. Five assertion types ship in v0.1: contains, not-contains, regex, equals, and llm-judge, in which a grader model checks the output against a rubric (useful when correctness cannot be checked mechanically). A web viewer at /eval renders any report locally, so prompts and outputs never leave the browser. ai-eval is a companion to the Multi-Agent PR Reviewer: PR Reviewer reviews the code, ai-eval reviews the AI.
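To make the four mechanical assertion types concrete, here is a minimal TypeScript sketch of how such checks could be evaluated against a model output. The type names, field names (`value`, `pattern`, `flags`), and the trimming behavior of `equals` are assumptions made for illustration; the actual ai-eval config schema and implementation may differ.

```typescript
// Hypothetical shapes: ai-eval's real config schema is not shown in this write-up.
type MechanicalAssertion =
  | { type: "contains"; value: string }
  | { type: "not-contains"; value: string }
  | { type: "regex"; pattern: string; flags?: string }
  | { type: "equals"; value: string };

interface AssertionResult {
  assertion: MechanicalAssertion;
  passed: boolean;
}

// Check one model output against one mechanical assertion.
function checkAssertion(output: string, assertion: MechanicalAssertion): boolean {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "not-contains":
      return !output.includes(assertion.value);
    case "regex":
      // A fresh RegExp per call avoids lastIndex state leaking between checks.
      return new RegExp(assertion.pattern, assertion.flags).test(output);
    case "equals":
      // Assumption: compare after trimming surrounding whitespace.
      return output.trim() === assertion.value.trim();
  }
}

// Evaluate every assertion (no short-circuit) so a report can show
// exactly which assertion failed for a given case.
function checkAll(output: string, assertions: MechanicalAssertion[]): AssertionResult[] {
  return assertions.map((assertion) => ({
    assertion,
    passed: checkAssertion(output, assertion),
  }));
}
```

Evaluating every assertion instead of stopping at the first failure is a design choice in this sketch: it lets a report list each failing assertion for a case.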

Technologies

Frontend

Next.js 15

React 19

TypeScript

Tailwind CSS

Backend

Node.js (CLI)

TypeScript

yaml parser

Tools

OpenRouter / OpenAI / Anthropic

GitHub Actions

JSON Schema

Challenges

• Most teams shipping LLM features write the same eval harness twice — once badly, the second time still badly
• Hosted options (Braintrust, LangSmith, Patronus) are pricey at scale; existing OSS is fragmented
• Subjective correctness (was the answer 'reasonable'?) is hard to test mechanically — you need an LLM in the loop

Solutions

• YAML config + 5 assertion types covering both mechanical (contains, regex, equals) and judgement-based (llm-judge with a rubric) correctness
• Non-zero exit code on any failure so it drops into any CI pipeline without glue code
• Browser-only web viewer for inspecting reports — auto-expands failing cases, shows the exact assertion that fired
• A single GitHub Actions workflow example wires up PR comments and artifact upload in under 30 lines of YAML
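The judgement-based assertion and the CI exit code can be sketched together. The TypeScript below is an illustrative sketch, not ai-eval's actual code: the `Grader` function, the PASS/FAIL reply convention, the prompt wording, and the `CaseResult` shape are all assumptions. The grader is injected as a plain function so the example stays independent of any provider SDK.

```typescript
// Hypothetical types: the real report format of ai-eval is not shown in this write-up.
interface JudgeAssertion {
  type: "llm-judge";
  rubric: string;
}

interface CaseResult {
  name: string;
  passed: boolean;
  reason: string;
}

// Assumption: the grader takes a prompt and resolves to the grader model's reply.
// A real one would call OpenRouter, OpenAI, or Anthropic.
type Grader = (prompt: string) => Promise<string>;

// Build the grading prompt from the output under test and the rubric.
function buildJudgePrompt(output: string, rubric: string): string {
  return [
    "You are grading a model output against a rubric.",
    `Rubric: ${rubric}`,
    `Output: ${output}`,
    'Reply with "PASS" or "FAIL" on the first line, then one sentence of reasoning.',
  ].join("\n");
}

// Ask the grader for a verdict and turn its reply into a case result.
async function runJudge(
  name: string,
  output: string,
  assertion: JudgeAssertion,
  grade: Grader,
): Promise<CaseResult> {
  const reply = (await grade(buildJudgePrompt(output, assertion.rubric))).trim();
  const [verdict = "", ...rest] = reply.split("\n");
  return {
    name,
    // Fail closed: anything other than an explicit PASS counts as a failure.
    passed: verdict.trim().toUpperCase() === "PASS",
    reason: rest.join(" ").trim(),
  };
}

// Exit code 1 on any failure, 0 otherwise; a CLI would hand this to process.exit.
function exitCodeFor(results: CaseResult[]): 0 | 1 {
  return results.every((r) => r.passed) ? 0 : 1;
}
```

Treating an unparseable grader reply as a failure keeps the pipeline conservative: a confused grader blocks the build instead of silently passing it.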

Key Outcomes & Impact

Working CLI + web viewer + GitHub Action snippet

Three-provider support (OpenRouter, OpenAI, Anthropic) from a single YAML config

Companion tool to Multi-Agent PR Reviewer — same engineer voice, different layer of the stack

Other Projects

JobJam.io — AI Job Search & Application Platform

AI-powered job search platform: discover roles, evaluate fit, tailor applications, and close skill gaps. One-time pricing, no subscriptions.

Fathohm — Comprehension-Debt System of Record

See how much of your codebase no human understands. A treemap of your repo colored by human comprehension score, with a tracked, assignable metric for 'who understands what' in an AI-native codebase.
