
Agent Observability: Tracing, Evaluating, and Debugging Multi-Step AI Workflows

The new ops stack for AI — how to trace agent runs, build eval harnesses, and catch silent regressions before your users do.

Amit Shrivastava · May 18, 2026 · 8 min read

A code snippet from this post was tested

Node.js v22.22.2 · Verified May 18, 2026

Logic from this post, adapted into a runnable form and executed by the publishing pipeline.

node verify.mjs

Snippet

function runEvaluation(agent, testCases) {
    const results = [];
    for (const testCase of testCases) {
        const startTime = 1000; // Fixed start time for determinism
        const { output, traceId } = agent.run(testCase.input); // Agent's run method now returns output and traceId
        const endTime = startTime + Math.floor(testCase.input.length / 5) * 10; // Simulate latency based on input length

        // Simulate LLM-based evaluation
        // Use a deterministic "random" for predictable output
        const factualAccuracyScore = (testCase.input.length % 10) / 10;
        
        // Simplified criteria check
        const criteriaMet = output.includes("differences") && output.length < 200; 

        results.push({
            testCaseInput: testCase.input,
            agentOutput: output,
            agentTraceId: traceId,
            metrics: {
                latencyMs: endTime - startTime,
                tokenCost: 0.001, 
                factualAccuracyScore,
                criteriaMet,
            },
            evaluatorFeedback: `Factual accuracy: ${ (factualAccuracyScore * 100).toFixed(0) }%. Criteria met: ${ criteriaMet }.`
        });
    }
    return results;
}

// Mock Agent to make the snippet self-contained and runnable
const mockAgent = {
    run: (input) => {
        let output = "";
        let traceId = "trace_" + input.substring(0, 5).replace(/\s/g, '_');

        if (input.includes("React and Vue")) {
            output = `React uses JSX, Vue uses templates. React's learning curve is steeper. Vue is generally perceived as simpler. Key differences.`;
        } else if (input.includes("Python vs JavaScript")) {
            output = `Python is server-side, great for AI/ML. JavaScript is client-side, for web. Both are popular.`;
            traceId = "trace_py_js";
        } else {
            output = `This is a generic response for: ${input.substring(0, 20)}...`;
        }
        return { output, traceId };
    }
};

// Test Cases
const testCases = [
    { 
        input: "Summarize the key differences between React and Vue.",
        expectedOutputCriteria: "Must mention 3 differences",
        expectedFacts: []
    },
    { 
        input: "Compare Python and JavaScript in web development.",
        expectedOutputCriteria: "Mention server-side vs client-side",
        expectedFacts: []
    },
    {
        input: "What is the capital of France?",
        expectedOutputCriteria: "Should be 'Paris'",
        expectedFacts: []
    }
];

// Run evaluation and log results
const evalResults = runEvaluation(mockAgent, testCases);
console.log(JSON.stringify(evalResults, null, 2));

// A second, shorter prompt. The mock agent returns the same React/Vue answer, so criteriaMet stays true here
const shortOutputTestCase = [
    {
        input: "Briefly compare React and Vue.",
        expectedOutputCriteria: "Must mention 3 differences",
        expectedFacts: []
    }
];
const shortEvalResults = runEvaluation(mockAgent, shortOutputTestCase);
console.log("\nResults for short output test case:");
console.log(JSON.stringify(shortEvalResults, null, 2));

Captured output

[
  {
    "testCaseInput": "Summarize the key differences between React and Vue.",
    "agentOutput": "React uses JSX, Vue uses templates. React's learning curve is steeper. Vue is generally perceived as simpler. Key differences.",
    "agentTraceId": "trace_Summa",
    "metrics": {
      "latencyMs": 100,
      "tokenCost": 0.001,
      "factualAccuracyScore": 0.2,
      "criteriaMet": true
    },
    "evaluatorFeedback": "Factual accuracy: 20%. Criteria met: true."
  },
  {
    "testCaseInput": "Compare Python and JavaScript in web development.",
    "agentOutput": "This is a generic response for: Compare Python and J...",
    "agentTraceId": "trace_Compa",
    "metrics": {
      "latencyMs": 90,
      "tokenCost": 0.001,
      "factualAccuracyScore": 0.9,
      "criteriaMet": false
    },
    "evaluatorFeedback": "Factual accuracy: 90%. Criteria met: false."
  },
  {
    "testCaseInput": "What is the capital of France?",
    "agentOutput": "This is a generic response for: What is the capital ...",
    "agentTraceId": "trace_What_",
    "metrics": {
      "latencyMs": 60,
      "tokenCost": 0.001,
      "factualAccuracyScore": 0,
      "criteriaMet": false
    },
    "evaluatorFeedback": "Factual accuracy: 0%. Criteria met: false."
  }
]

Results for short output test case:
[
  {
    "testCaseInput": "Briefly compare React and Vue.",
    "agentOutput": "React uses JSX, Vue uses templates. React's learning curve is steeper. Vue is generally perceived as simpler. Key differences.",
    "agentTraceId": "trace_Brief",
    "metrics": {
      "latencyMs": 60,
      "tokenCost": 0.001,
      "factualAccuracyScore": 0,
      "criteriaMet": true
    },
    "evaluatorFeedback": "Factual accuracy: 0%. Criteria met: true."
  }
]

The New AI Ops Stack: Why Agent Observability is Non-Negotiable

As a senior software engineer who's dabbled in everything from frontend wizardry to the wild west of Web3, I've seen my fair share of paradigm shifts. The current AI revolution, particularly with the rise of multi-step AI agents, feels different. It's not just about building smarter models; it's about building reliable AI systems that can operate autonomously without self-destructing or silently failing in production. And for that, we need a whole new approach to operations – what I'm calling the "AI Ops Stack," with agent observability at its core.

Think about it: a traditional microservice usually fails loudly. If a database query fails, you get an error. If an API call times out, you see it in your logs. But an AI agent, especially one orchestrating multiple tools and models, can appear to work fine while producing subtly incorrect or suboptimal results. These "silent regressions" are the nightmares keeping me up at night, because by the time a user points one out, it's often too late. This post is about how we can proactively tackle that challenge by tracing, evaluating, and debugging these complex AI workflows.

The Challenge of Multi-Step AI Workflows

My most recent projects involve building intelligent agents that handle complex tasks—think agents that can research a topic, summarize findings, draft an email, and then integrate with a CRM. Each step might involve an LLM call, a tool invocation (like a web search or API call), and decision-making logic.

Here's a simplified visualization of such a workflow:

graph TD
    A[User Query] --> B{LLM: Planning};
    B --> C{Tool: Web Search?};
    C -- Yes --> D[Tool Output];
    C -- No --> E{LLM: Draft Response};
    D --> E;
    E --> F{API: CRM Update?};
    F -- Yes --> G[Tool Output];
    F -- No --> H[Final Response];
    G --> H;

When something goes wrong in this kind of system, how do you pinpoint the issue? Was the initial LLM prompt ambiguous? Did the web search tool return irrelevant results? Did the CRM update fail silently due to an incorrect payload? Without proper observability, it's like debugging in the dark.

Introducing the Three Pillars of Agent Observability

To bring clarity to these opaque workflows, I've found that focusing on three key areas is essential:

  1. Tracing Agent Runs: Understanding the "what" and "how" of each step.
  2. Evaluating Agent Performance: Quantifying the "goodness" of the agent's output.
  3. Debugging Silent Regressions: Proactively catching issues before they impact users.

Let's dive into each.

1. Tracing Agent Runs: The Digital Breadcrumbs

Tracing in traditional software gives us visibility into requests across services. For AI agents, it's about seeing the entire thought process and execution flow. This means capturing:

  • LLM calls: Prompts, responses, tokens used, latency, model name.
  • Tool invocations: Tool name, input parameters, raw output, execution time.
  • Intermediate thoughts/reasoning: The agent's internal monologue or chain of thought.
  • State changes: How the agent's internal state evolves throughout the run.

Most modern agent frameworks (like LangChain, LlamaIndex, or even custom orchestrators) offer hooks for this. I typically integrate with dedicated AI observability platforms or build a custom logging layer on top of a standard tracing system like OpenTelemetry.
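To make the list above concrete, here is a minimal sketch of what a captured trace could look like as data. The event shapes and field names (`TraceEvent`, `AgentTrace`, `summarizeTrace`) are my own illustrative assumptions, not the schema of any particular framework or observability platform.

```typescript
// Hypothetical event shapes: field names are illustrative, not taken from a specific framework.
type TraceEvent =
    | { kind: "llm_call"; model: string; prompt: string; response: string; tokensUsed: number; latencyMs: number }
    | { kind: "tool_call"; tool: string; params: Record<string, unknown>; rawOutput: string; latencyMs: number }
    | { kind: "thought"; text: string }
    | { kind: "state_change"; key: string; before: unknown; after: unknown };

interface AgentTrace {
    traceId: string;
    events: TraceEvent[];
}

// Roll a trace up into the numbers you would typically chart or alert on.
function summarizeTrace(trace: AgentTrace) {
    let totalTokens = 0;
    let totalLatencyMs = 0;
    const toolsUsed: string[] = [];
    for (const event of trace.events) {
        if (event.kind === "llm_call") {
            totalTokens += event.tokensUsed;
            totalLatencyMs += event.latencyMs;
        } else if (event.kind === "tool_call") {
            totalLatencyMs += event.latencyMs;
            if (!toolsUsed.includes(event.tool)) toolsUsed.push(event.tool);
        }
    }
    return { traceId: trace.traceId, steps: trace.events.length, totalTokens, totalLatencyMs, toolsUsed };
}

// A made-up run: one thought, one tool call, one state change, one LLM call.
const exampleTrace: AgentTrace = {
    traceId: "trace_demo",
    events: [
        { kind: "thought", text: "I should search before answering." },
        { kind: "tool_call", tool: "web_search", params: { query: "React vs Vue" }, rawOutput: "...", latencyMs: 120 },
        { kind: "state_change", key: "searchDone", before: false, after: true },
        { kind: "llm_call", model: "gpt-4", prompt: "Summarize...", response: "React uses JSX...", tokensUsed: 350, latencyMs: 800 },
    ],
};

const traceSummary = summarizeTrace(exampleTrace);
console.log(traceSummary);
```

A discriminated union like this keeps each event type honest about which fields it carries, and the summary gives you per-run numbers (steps, tokens, latency, tools) that are easy to compare across runs.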

Here's a simplified TypeScript example of how you might wrap an LLM call to add tracing information:

import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-ai-agent-tracer');

async function callLLMWithTracing(
    model: string,
    prompt: string,
    agentStepName: string
): Promise<string> {
    const span = tracer.startSpan(`AgentStep: ${agentStepName} - LLM Call`);
    return context.with(trace.setSpan(context.active(), span), async () => {
        try {
            span.setAttribute('agent.model', model);
            span.setAttribute('agent.prompt', prompt);
            // Simulate LLM call
            const response = await new Promise<string>(resolve =>
                setTimeout(() => resolve(`Response for: ${prompt.substring(0, 50)}...`), 200)
            );
            span.setAttribute('agent.response', response);
            span.setStatus({ code: SpanStatusCode.OK });
            return response;
        } catch (error) {
            // A caught value is `unknown` in strict TypeScript, so normalize it before reading `.message`
            const err = error instanceof Error ? error : new Error(String(error));
            span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
            span.recordException(err);
            throw error;
        } finally {
            span.end();
        }
    });
}

// Usage example:
async function executeAgentFlow() {
    const initialPrompt = "Summarize the key differences between React and Vue.";
    const llmResponse = await callLLMWithTracing("gpt-4", initialPrompt, "Initial Summary Generation");
    console.log("LLM produced:", llmResponse);
    // ... continues with more agent steps
}

This ensures every LLM interaction, regardless of where it happens in the agent's logic, is captured as a clear trace segment. The same principle applies to tool calls and internal decision points.
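For tool calls and decision points, the same idea can be expressed as a generic wrapper. The sketch below is deliberately dependency-free: `recordedSpans` is an in-memory stand-in I'm assuming in place of a real tracing backend, and `withSpan` and `fakeWebSearch` are hypothetical names used only for illustration.

```typescript
// In-memory stand-in for a tracing backend (assumption: a real setup would export spans, e.g. via OpenTelemetry).
interface RecordedSpan {
    name: string;
    attributes: Record<string, string>;
    status: "ok" | "error";
    durationMs: number;
}

const recordedSpans: RecordedSpan[] = [];

// Wrap any async step (tool call, decision point) so it always leaves a span behind,
// whether it succeeds or throws.
async function withSpan<T>(
    name: string,
    attributes: Record<string, string>,
    step: () => Promise<T>
): Promise<T> {
    const startedAt = Date.now();
    const attrs: Record<string, string> = { ...attributes }; // copy so the caller's object is not mutated
    let status: "ok" | "error" = "ok";
    try {
        return await step();
    } catch (error) {
        status = "error";
        attrs["error.message"] = error instanceof Error ? error.message : String(error);
        throw error;
    } finally {
        recordedSpans.push({ name, attributes: attrs, status, durationMs: Date.now() - startedAt });
    }
}

// Hypothetical tool: a fake web search that rejects an empty query.
async function fakeWebSearch(query: string): Promise<string[]> {
    if (query.trim() === "") throw new Error("empty query");
    return [`Result for ${query}`];
}

async function demoToolTracing(): Promise<void> {
    await withSpan("tool:web_search", { "tool.input": "React vs Vue" }, () => fakeWebSearch("React vs Vue"));
    try {
        await withSpan("tool:web_search", { "tool.input": "" }, () => fakeWebSearch(""));
    } catch {
        // The failure is still recorded as a span with status "error".
    }
    console.log(recordedSpans.map(s => `${s.name} -> ${s.status}`));
}

const toolTracingDone = demoToolTracing();
```

Recording the span in `finally` is the important design choice here: a tool that throws still shows up in the trace, which is exactly the kind of failure you want to be able to find later.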

2. Evaluating Agent Performance: Knowing What "Good" Looks Like

While tracing shows you what happened, evaluation tells you how well it happened. For AI agents, traditional unit tests are often insufficient. We need robust evaluation harnesses that can:

  • Compare outputs against a gold standard: For tasks with a definitive correct answer.
  • Assess quality metrics: Fluency, factual accuracy, completeness, helpfulness, conciseness. This often requires another LLM as an evaluator or a human in the loop.
  • Measure latency and cost: Crucial for production systems.
  • Detect hallucinations or safety violations: Using specific prompt templates and LLM checkers.

I often build an EvalRunner component that takes an agent, a set of test cases (inputs and expected outputs/criteria), and then runs the agent against each case, capturing both the agent's full trace and evaluating its performance.

interface TestCase {
    input: string;
    expectedOutputCriteria?: string; // e.g., "Must mention 3 differences", "Should not exceed 100 words"
    expectedFacts?: string[]; // For factual accuracy checks
}

interface EvalResult {
    testCaseInput: string;
    agentOutput: string;
    agentTraceId: string; // Link to the full trace
    metrics: {
        latencyMs: number;
        tokenCost: number; // Placeholder for actual token cost calculation
        factualAccuracyScore?: number; // 0-1, from LLM-based evaluation
        criteriaMet?: boolean; // From LLM-based or regex check
        // ... other metrics
    };
    evaluatorFeedback?: string; // Specific feedback from LLM evaluator
}

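// Minimal agent contract assumed by the harness: run() resolves to the output plus a trace ID
interface Agent {
    run(input: string): Promise<{ output: string; traceId: string }>;
}

async function runEvaluation(agent: Agent, testCases: TestCase[]): Promise<EvalResult[]> {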
    const results: EvalResult[] = [];
    for (const testCase of testCases) {
        const startTime = Date.now();
        const { output, traceId } = await agent.run(testCase.input); // run() resolves to the output and the trace ID
        const endTime = Date.now();

        // Simulate LLM-based evaluation
        const factualAccuracyScore = Math.random(); // Placeholder: in reality, another LLM call or a RAG-based check
        const criteriaMet = output.includes("differences") && output.length < 200; // Simplified criteria check

        results.push({
            testCaseInput: testCase.input,
            agentOutput: output,
            agentTraceId: traceId,
            metrics: {
                latencyMs: endTime - startTime,
                tokenCost: 0.001, // Actual token calculation would be more complex
                factualAccuracyScore,
                criteriaMet,
            },
            evaluatorFeedback: `Factual accuracy: ${ (factualAccuracyScore * 100).toFixed(0) }%. Criteria met: ${ criteriaMet }.`
        });
    }
    return results;
}

This evaluation harness can be run continuously in CI/CD pipelines, flagging deviations in key metrics.
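The `Math.random()` placeholder in the harness is obviously not something you'd ship. As a cheap deterministic stand-in, you could score how many of a test case's `expectedFacts` actually show up in the output. This is only a sketch under a strong assumption (facts are short phrases that appear verbatim); `factCoverageScore` is a hypothetical helper, and a real evaluator would more likely use an LLM judge or a retrieval-based check.

```typescript
// Fraction of expected facts that appear in the output (case-insensitive substring match).
// Assumption: facts are short phrases that appear verbatim; paraphrases will be missed.
function factCoverageScore(output: string, expectedFacts: string[]): number {
    if (expectedFacts.length === 0) return 1; // nothing to check, so treat as vacuously satisfied
    const haystack = output.toLowerCase();
    const hits = expectedFacts.filter(fact => haystack.includes(fact.toLowerCase())).length;
    return hits / expectedFacts.length;
}

const sampleOutput = "React uses JSX, Vue uses templates. React's learning curve is steeper.";
const coverage = factCoverageScore(sampleOutput, ["JSX", "templates", "virtual DOM"]);
console.log(coverage.toFixed(2)); // prints "0.67" (2 of 3 facts found)
```

Because the score is deterministic, it is stable across CI runs, which makes it a reasonable first metric to track before layering on LLM-based evaluators.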

3. Debugging Silent Regressions: Proactive Issue Detection

This is where tracing and evaluation converge. A silent regression is when your agent appears to be working correctly from a high level, but the quality of its output has degraded, or it's subtly failing on edge cases.

To catch these:

  • Automated Regression Tests: Every time you deploy a new model, prompt, or tool, run your evaluation suite. Compare current metrics (e.g., factual accuracy, latency, token usage) against a baseline. If a defined threshold is crossed, alert the team.
  • Anomaly Detection on Traces: Look for deviations in trace patterns over time. Is the agent taking significantly more steps than usual for a given task type? Is it using different tools? Are LLM calls suddenly much longer or shorter? These can indicate a change in behavior, intentional or otherwise.
  • Human-in-the-Loop Feedback: While not strictly automated, making it easy for users to provide feedback (e.g., "This answer was unhelpful") and linking that feedback directly to the agent's run trace is invaluable. This allows for quick root cause analysis of real-world issues.
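As a rough illustration of anomaly detection on traces, here is a sketch that flags a run whose step count sits far from the historical mean for its task type. The z-score approach and the threshold of 3 are my assumptions for the example, and `isStepCountAnomalous` is a hypothetical helper; production systems often need something more robust to skewed data.

```typescript
// Flag a run whose step count is far from the historical mean for its task type.
// Assumption: step counts per task type are roughly stable; the default threshold of 3 is illustrative.
function isStepCountAnomalous(history: number[], current: number, zThreshold = 3): boolean {
    if (history.length < 2) return false; // not enough data to judge
    const mean = history.reduce((sum, n) => sum + n, 0) / history.length;
    // Sample variance (divides by n - 1)
    const variance = history.reduce((sum, n) => sum + (n - mean) * (n - mean), 0) / (history.length - 1);
    const stdDev = Math.sqrt(variance);
    if (stdDev === 0) return current !== mean; // history is constant: any deviation is notable
    return Math.abs(current - mean) / stdDev > zThreshold;
}

const recentStepCounts = [5, 6, 5, 7, 6, 5, 6];
console.log(isStepCountAnomalous(recentStepCounts, 6));  // false: within the usual range
console.log(isStepCountAnomalous(recentStepCounts, 14)); // true: far outside it
```

The same check works for other per-trace numbers, such as LLM call latency or token counts, as long as you keep a separate history per task type.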

My workflow often involves:

  1. A new PR triggers an EvalRunner against a diverse set of test cases.
  2. Results are compared to a baseline. If metrics degrade or specific "canary" test cases fail, the PR is blocked.
  3. For accepted PRs, the new agent version is deployed to a staging environment with full tracing.
  4. A small percentage of production traffic could be routed to the new agent (canary deployment) for real-world validation, with dedicated monitoring for success rates and trace anomalies.
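Steps 1 and 2 of that workflow boil down to a comparison function. The sketch below assumes the eval results have already been aggregated into a summary; the `EvalSummary` shape, the tolerance values, and `findRegressions` are all hypothetical and would need tuning for a real project.

```typescript
interface EvalSummary {
    meanFactualAccuracy: number; // 0-1, higher is better
    criteriaPassRate: number;    // 0-1, higher is better
    p95LatencyMs: number;        // lower is better
}

// Hypothetical tolerances: how much degradation is accepted before a PR is blocked.
const tolerances = { accuracyDrop: 0.05, passRateDrop: 0.05, latencyIncreaseRatio: 0.2 };

// Returns a list of human-readable problems; an empty list means the PR may proceed.
function findRegressions(baseline: EvalSummary, candidate: EvalSummary, failedCanaries: string[]): string[] {
    const problems: string[] = [];
    if (baseline.meanFactualAccuracy - candidate.meanFactualAccuracy > tolerances.accuracyDrop) {
        problems.push("factual accuracy dropped");
    }
    if (baseline.criteriaPassRate - candidate.criteriaPassRate > tolerances.passRateDrop) {
        problems.push("criteria pass rate dropped");
    }
    if (candidate.p95LatencyMs > baseline.p95LatencyMs * (1 + tolerances.latencyIncreaseRatio)) {
        problems.push("p95 latency increased");
    }
    // Canary test cases block the PR regardless of aggregate metrics.
    for (const canary of failedCanaries) {
        problems.push(`canary failed: ${canary}`);
    }
    return problems;
}

// Made-up numbers for illustration only.
const baselineSummary: EvalSummary = { meanFactualAccuracy: 0.9, criteriaPassRate: 0.95, p95LatencyMs: 1200 };
const candidateSummary: EvalSummary = { meanFactualAccuracy: 0.82, criteriaPassRate: 0.94, p95LatencyMs: 1300 };

const regressions = findRegressions(baselineSummary, candidateSummary, []);
console.log(regressions); // a CI script would fail the build when this list is non-empty
```

Returning the list of problems, instead of a bare boolean, means the CI log tells the PR author which metric regressed without anyone having to dig through the raw eval output.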

The Future of AI Ops

Agent observability isn't just a nice-to-have; it's a fundamental requirement for building robust, reliable, and scalable AI applications. As our agents become more sophisticated, interacting with more tools and making more autonomous decisions, our ability to understand their internal workings must keep pace.

By implementing comprehensive tracing, building rigorous evaluation harnesses, and leveraging these insights for proactive debugging, we can build confidence in our AI systems and tackle the unique challenges that this new era of intelligent automation presents. The days of treating LLMs as black boxes are over; it's time to shine a light on their inner workings.


I'm always eager to discuss the frontiers of AI Ops and engineering challenges. If you're building similar systems or have thoughts on improving agent observability, let's connect! You can find me on LinkedIn or X.
