Reasoning Models in Production: When Extended Thinking Is Worth the Cost

A field guide to deciding when to invoke reasoning models, how to budget thinking tokens, and patterns that justify the latency tradeoff.

Amit Shrivastava · May 8, 2026 · 8 min read

A code snippet from this post was tested

Node.js v22.22.2 · Verified May 18, 2026

Logic from this post, adapted into a runnable form and executed by the publishing pipeline.

node verify.mjs

Snippet

async function debugCodeSnippet(code, error) {
    const prompt = `You are an expert JavaScript debugger.
        The user has provided a code snippet and an error message.
        Your task is to identify the root cause of the error and provide a fix.

        Think step-by-step.
        1. Analyze the code snippet.
        2. Analyze the error message.
        3. Formulate hypotheses for the error's cause.
        4. Test each hypothesis against the code and error.
        5. Identify the most probable root cause.
        6. Propose a corrected code snippet and explain your fix.

        Code:
        \`\`\`javascript
        ${code}
        \`\`\`

        Error:
        ${error}

        Please provide your reasoning process and then the corrected code.`

    // In a real scenario, this would call an external LLM.
    // For this runnable snippet, we'll simulate the LLM's response
    // based on the detailed prompt structure.
    let simulatedResponse;
    if (code.includes('console.log(x);') && error.includes('x is not defined')) {
        simulatedResponse = `Reasoning Process:
        1. The code attempts to log a variable 'x'.
        2. The error states 'x is not defined'. This indicates 'x' was used before it was declared or assigned a value in its current scope.
        3. Hypothesis: The variable 'x' is missing a declaration.
        4. Testing: A simple 'let x = ...' or 'const x = ...' declaration would resolve the 'not defined' error.
        5. Most probable root cause: 'x' was not declared.
        6. Proposing fix: Declare 'x' with 'let' or 'const'.

        Corrected Code:
        \`\`\`javascript
        let x = 10;
        console.log(x);
        \`\`\`
        Explanation: Added 'let x = 10;' to declare and initialize the variable 'x' before it's used, resolving the 'x is not defined' error.`;
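    } else if (code.includes('someArray.forEach(item => {') && error.includes('item.map is not a function')) {
        // This check matches only the literal array name `someArray`. Example 2 below iterates
        // over `data`, so it falls through to the generic branch, as the captured output shows.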
        simulatedResponse = `Reasoning Process:
        1. The code iterates over 'someArray' using 'forEach'. Inside the loop, it attempts to call 'item.map()'.
        2. The error 'item.map is not a function' suggests that 'item' itself is not an array or an object with a 'map' method.
        3. Hypothesis: The 'item' within the 'forEach' loop is expected to be a single element, not an array that can be mapped.
        4. Testing: 'forEach' iterates over individual elements. If 'item' is a primitive or an object, calling '.map()' on it will fail.
        5. Most probable root cause: Misunderstanding of 'forEach' vs. 'map' or an expectation that individual 'item's are iterable.
        6. Proposing fix: Re-evaluate the intent. If each 'item' itself is an array that needs mapping, re-structure the data or use nested loops. If not, remove the '.map()' call.

        Corrected Code (example, assuming item should not be mapped):
        \`\`\`javascript
        const someArray = [1, 2, 3];
        someArray.forEach(item => {
            console.log(item * 2); // Example of processing the item directly
        });
        \`\`\`
        Explanation: Removed the '.map()' call on 'item', as 'forEach' iterates over individual elements which are typically not arrays themselves. The action should be performed directly on 'item'.`;
    }
    else {
        simulatedResponse = `Reasoning Process:
        1. Analyzed code: ${code}
        2. Analyzed error: ${error}
        3. Hypotheses: The error suggests a generic issue. Without more context or specific details, it's hard to pinpoint immediately. It could be syntax, runtime, or logical.
        4. Testing: None specified.
        5. Most probable root cause: Insufficient information to diagnose effectively.
        6. Proposing fix: Please provide a more specific error message or a simpler, isolated snippet that reproduces the issue.`;
    }

    return simulatedResponse;
}

async function runDebugExamples() {
    console.log("--- Debugging Example 1: Undefined Variable ---");
    const code1 = `
const y = 5;
console.log(x);
`;
    const error1 = `ReferenceError: x is not defined`;
    const result1 = await debugCodeSnippet(code1, error1);
    console.log(result1);
    console.log("\n");

    console.log("--- Debugging Example 2: Incorrect Method Call ---");
    const code2 = `
const data = [{id: 1}, {id: 2}];
data.forEach(item => {
    item.map(i => i.id); // This is likely the error
});
`;
    const error2 = `TypeError: item.map is not a function`;
    const result2 = await debugCodeSnippet(code2, error2);
    console.log(result2);
    console.log("\n");

    console.log("--- Debugging Example 3: Generic Error (Simulated Fallback) ---");
    const code3 = `
function greet() {
    console.log("Hello");
}} // Extra brace
`;
    const error3 = `SyntaxError: Unexpected token '}'`;
    const result3 = await debugCodeSnippet(code3, error3);
    console.log(result3);
    console.log("\n");
}

runDebugExamples();

Captured output

--- Debugging Example 1: Undefined Variable ---
Reasoning Process:
        1. The code attempts to log a variable 'x'.
        2. The error states 'x is not defined'. This indicates 'x' was used before it was declared or assigned a value in its current scope.
        3. Hypothesis: The variable 'x' is missing a declaration.
        4. Testing: A simple 'let x = ...' or 'const x = ...' declaration would resolve the 'not defined' error.
        5. Most probable root cause: 'x' was not declared.
        6. Proposing fix: Declare 'x' with 'let' or 'const'.

        Corrected Code:
        ```javascript
        let x = 10;
        console.log(x);
        ```
        Explanation: Added 'let x = 10;' to declare and initialize the variable 'x' before it's used, resolving the 'x is not defined' error.


--- Debugging Example 2: Incorrect Method Call ---
Reasoning Process:
        1. Analyzed code: 
const data = [{id: 1}, {id: 2}];
data.forEach(item => {
    item.map(i => i.id); // This is likely the error
});

        2. Analyzed error: TypeError: item.map is not a function
        3. Hypotheses: The error suggests a generic issue. Without more context or specific details, it's hard to pinpoint immediately. It could be syntax, runtime, or logical.
        4. Testing: None specified.
        5. Most probable root cause: Insufficient information to diagnose effectively.
        6. Proposing fix: Please provide a more specific error message or a simpler, isolated snippet that reproduces the issue.


--- Debugging Example 3: Generic Error (Simulated Fallback) ---
Reasoning Process:
        1. Analyzed code: 
function greet() {
    console.log("Hello");
}} // Extra brace

        2. Analyzed error: SyntaxError: Unexpected token '}'
        3. Hypotheses: The error suggests a generic issue. Without more context or specific details, it's hard to pinpoint immediately. It could be syntax, runtime, or logical.
        4. Testing: None specified.
        5. Most probable root cause: Insufficient information to diagnose effectively.
        6. Proposing fix: Please provide a more specific error message or a simpler, isolated snippet that reproduces the issue.

As a Senior Software Engineer who’s delved deep into frontend, Web3, and now AI, I've seen countless technologies promise to revolutionize how we build. Large Language Models (LLMs) are undoubtedly transformative, but a crucial aspect often gets overlooked in the rush to integrate them: the art of reasoning in production. More specifically, when is it truly worth invoking an LLM for multi-step, "extended thinking" tasks, and how do we budget for the inherent cost (latency, tokens) that comes with it?

This isn't about simply calling openai.chat.completions.create and hoping for the best. This is about strategically deploying models that can genuinely reason their way to a solution, rather than just generating text based on patterns. Let’s dive in.

The Trade-off: Latency, Cost, and Accuracy

Before we even consider reasoning, we need to acknowledge the elephant in the room: LLM calls are expensive, not just in dollar terms but in latency. Every additional "think step" an LLM takes translates directly into higher response times and, often, higher token usage.

So, why would we ever opt for extended thinking? The answer lies in complexity and accuracy. For simple, factual recall or basic text generation, a single-shot prompt often suffices. But for tasks requiring planning, decomposition, constraint satisfaction, or multi-modal analysis, forcing everything into a single prompt often leads to hallucinations, generic responses, or outright failures.

When is "Thinking" Worth It?

From my experience, extended thinking shines brightest in scenarios where:

  1. High Stakes & High Accuracy Requirements: Decisions based on LLM output have significant impact (e.g., medical diagnoses, financial advice, critical code generation).
  2. Complex, Multi-step Problems: The task cannot be solved by a single, atomic operation. It requires breaking down into sub-problems.
  3. Dynamic & Unpredictable Inputs: Inputs vary widely, making hardcoded logic or simpler models insufficient.
  4. Novelty & Creativity: The problem requires generating novel solutions or adapting to unforeseen circumstances.

Conversely, if your task is low-stakes, repeatable with minimal variation, or can be solved with traditional algorithms or simpler ML models, extended model reasoning is likely overkill and will only burden your system.
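To make the checklist concrete, here is a minimal routing sketch. Everything in it is an assumption for illustration: the `TaskProfile` fields simply mirror the four criteria above, and the two-signal rule and the 2000 ms latency floor are placeholder thresholds you would tune against your own traffic.

```typescript
// Hypothetical task descriptor; the boolean fields mirror the four criteria listed above.
interface TaskProfile {
  highStakes: boolean;          // errors are costly (medical, financial, critical code)
  multiStep: boolean;           // needs decomposition into sub-problems
  unpredictableInput: boolean;  // inputs vary too widely for hardcoded logic
  needsNovelty: boolean;        // requires novel or adaptive solutions
  latencyBudgetMs: number;      // how long the caller can wait for an answer
}

type Route = "traditional" | "single-shot" | "extended-thinking";

// Assumed threshold: below this budget, a multi-step chain is unlikely to finish in time.
const MIN_REASONING_LATENCY_MS = 2000;

function chooseRoute(task: TaskProfile): Route {
  const signals = [
    task.highStakes,
    task.multiStep,
    task.unpredictableInput,
    task.needsNovelty,
  ].filter(Boolean).length;

  // No criterion applies: traditional code or a simpler model is enough.
  if (signals === 0) return "traditional";

  // Several criteria apply and the caller can afford the wait.
  if (signals >= 2 && task.latencyBudgetMs >= MIN_REASONING_LATENCY_MS) {
    return "extended-thinking";
  }

  // Otherwise fall back to a single prompt.
  return "single-shot";
}
```

A rule this crude will misroute some requests, but putting it in code forces you to write the thresholds down and gives you something to measure.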

Budgeting Thinking Tokens: The Art of Prompt Engineering for Reasoning

Once you’ve decided that extended thinking is necessary, the next challenge is to manage the token budget effectively. Every "thought" the model generates consumes tokens, and you often need to provide system instructions, context, and previous thoughts back to the model for coherence.
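The re-sending is what makes chains expensive, and a little arithmetic shows how quickly it adds up. The sketch below is a simplified model, not a billing calculator: it assumes every step re-sends a fixed base prompt plus all earlier thoughts, and that each thought has the same length. The names `ChainBudget` and `estimateChainTokens` are mine.

```typescript
interface ChainBudget {
  steps: number;             // number of model calls in the chain
  baseTokens: number;        // system prompt + task context sent on every call
  tokensPerThought: number;  // assumed average size of each generated step
}

function estimateChainTokens({ steps, baseTokens, tokensPerThought }: ChainBudget) {
  let input = 0;
  let output = 0;
  for (let step = 0; step < steps; step++) {
    // Each call carries the base prompt plus every thought produced so far.
    input += baseTokens + step * tokensPerThought;
    output += tokensPerThought;
  }
  return { input, output, total: input + output };
}

const chain = estimateChainTokens({ steps: 4, baseTokens: 500, tokensPerThought: 200 });
const single = estimateChainTokens({ steps: 1, baseTokens: 500, tokensPerThought: 200 });
console.log(chain.total, single.total); // 4000 700
```

Under these assumptions, four steps cost 4,000 tokens against 700 for a single call, and input tokens grow quadratically with the number of steps. That is the number to keep in mind when deciding how much history each step really needs.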

Patterns for Efficient Reasoning

Here are some patterns I’ve found particularly effective for guiding LLMs through complex reasoning while keeping token usage in check:

  1. Chain-of-Thought (CoT) & Self-Correction:
This is perhaps the most well-known reasoning pattern. By explicitly asking the model to "think step by step" or "explain your reasoning," you encourage it to decompose the problem.

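// Example: CoT for debugging a code snippet
import OpenAI from "openai";

// Assumes the official `openai` npm package; the client reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

async function debugCodeSnippet(code: string, error: string): Promise<string> {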
    const prompt = `You are an expert JavaScript debugger.
    The user has provided a code snippet and an error message.
    Your task is to identify the root cause of the error and provide a fix.

    Think step-by-step.
    1. Analyze the code snippet.
    2. Analyze the error message.
    3. Formulate hypotheses for the error's cause.
    4. Test each hypothesis against the code and error.
    5. Identify the most probable root cause.
    6. Propose a corrected code snippet and explain your fix.

    Code:
    \`\`\`javascript
    ${code}
    \`\`\`

    Error:
    ${error}

    Please provide your reasoning process and then the corrected code.`

    const response = await openai.chat.completions.create({
        model: "gpt-4o", // The "o" stands for "omni"; GPT-4o is a general-purpose model, so the step-by-step prompt does the reasoning work here
        messages: [{ role: "user", content: prompt }],
        temperature: 0.2 // Lower temp for factual, reasoning tasks
    });

    return response.choices[0].message.content || "";
}
For self-correction, you might follow up by asking the model to evaluate its own output against criteria, or provide new information and ask it to refine its previous answer. This usually involves multiple API calls.
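A self-correction loop can be sketched along these lines. This is one possible shape, not a canonical implementation: `ModelFn` is a hypothetical stand-in for whatever chat-completion call you use, and the "reply exactly OK" convention and the default of two revisions are assumptions of mine.

```typescript
// Hypothetical stand-in for a chat-completion call: prompt in, text out.
type ModelFn = (prompt: string) => Promise<string>;

async function answerWithSelfCorrection(
  task: string,
  callModel: ModelFn,
  maxRevisions = 2, // assumed cap; every revision costs two more calls
): Promise<{ answer: string; calls: number }> {
  let answer = await callModel(`Solve the following task step by step.\n\n${task}`);
  let calls = 1;

  for (let i = 0; i < maxRevisions; i++) {
    // Ask the model to critique its own draft against the task.
    const critique = await callModel(
      `Task:\n${task}\n\nDraft answer:\n${answer}\n\n` +
        `List concrete errors in the draft, or reply exactly "OK" if there are none.`,
    );
    calls++;
    if (critique.trim() === "OK") break;

    // Feed the critique back and ask for a revised answer.
    answer = await callModel(
      `Task:\n${task}\n\nDraft answer:\n${answer}\n\nCritique:\n${critique}\n\n` +
        `Rewrite the answer so that every issue in the critique is fixed.`,
    );
    calls++;
  }

  return { answer, calls };
}
```

Returning the call count alongside the answer makes the cost of each correction pass visible to whatever is monitoring the chain.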
  2. Tool Use / Function Calling / Agents:
This is where LLMs truly become agents. Instead of trying to solve everything internally, the model identifies when it needs external information or actions. It then "calls" a function (which you implement), processes the result, and continues its reasoning.

// Simplified example for a travel planning agent
interface FlightSearchTool {
    type: "function";
    function: {
        name: "searchFlights";
        description: "Searches for flights between an origin and destination for given dates.";
        parameters: {
            type: "object";
            properties: {
                origin: { type: "string"; description: "Departure city or airport code" };
                destination: { type: "string"; description: "Arrival city or airport code" };
                departureDate: { type: "string"; format: "date"; description: "Departure date in YYYY-MM-DD" };
                returnDate?: { type: "string"; format: "date"; description: "Return date in YYYY-MM-DD (optional)" };
            };
            required: ["origin", "destination", "departureDate"];
        };
    };
}

interface HotelSearchTool { /* similar structure */ }

async function planTrip(query: string): Promise<string> {
    let messages: any[] = [{ role: "user", content: query }];

    const tools: (FlightSearchTool | HotelSearchTool)[] = [/* define tools */];

    let loopCount = 0;
    const MAX_LOOPS = 5; // Prevent infinite loops

    while (loopCount < MAX_LOOPS) {
        const response = await openai.chat.completions.create({
            model: "gpt-4o",
            messages: messages,
            tools: tools,
            tool_choice: "auto", // Let the model decide if it needs a tool or to respond
        });

        const responseMessage = response.choices[0].message;

        if (responseMessage.tool_calls) {
            messages.push(responseMessage); // Add assistant's tool call to history

            for (const toolCall of responseMessage.tool_calls) {
                const functionName = toolCall.function.name;
                const functionArgs = JSON.parse(toolCall.function.arguments);

                console.log(`Calling tool: ${functionName} with args: ${JSON.stringify(functionArgs)}`);

                // Here you would execute the actual function (e.g., API call to a flight service)
                let toolOutput: string;
                if (functionName === "searchFlights") {
                    // Simulate an API call
                    toolOutput = await new Promise<string>(resolve => setTimeout(() => resolve(`Found flights from ${functionArgs.origin} to ${functionArgs.destination} on ${functionArgs.departureDate}. Price range: $300-$500.`), 1000));
                } else if (functionName === "searchHotels") {
                    // Simulate an API call
                    toolOutput = `Found hotels in ${functionArgs.location}. Average price: $150/night.`;
                } else {
                    // Every tool call needs a matching tool message, so report unknown tools instead of sending undefined content
                    toolOutput = `Error: unknown tool "${functionName}".`;
                }
                // Add tool output to messages so the model can see it
                messages.push({
                    tool_call_id: toolCall.id,
                    role: "tool",
                    name: functionName,
                    content: toolOutput,
                });
            }
        } else {
            // If no tool calls, the model has a final answer (or is stuck)
            return responseMessage.content || "Could not complete the request.";
        }
        loopCount++;
    }
    return "Max thinking loops reached. Could not complete the request.";
}

// Usage:
// planTrip("Find me a round trip flight from New York to London departing on 2024-12-01 and returning on 2024-12-10. Also suggest a hotel in London.");
This approach significantly reduces "hallucinations" by grounding the model in real-world data and capabilities. However, each tool call results in at least two API calls (one to decide to call the tool, one to pass the tool output back), directly impacting latency. Careful planning of which tools are available and when they are invoked is critical.
  3. Tree-of-Thought (ToT) / Graph-of-Thought:
For highly complex problems that require exploring multiple paths or evaluating alternatives, ToT extends CoT by allowing the model to generate multiple "thoughts" at each step, prune less promising ones, and backtrack. This is often implemented agentically, where an "evaluator" model scores the quality of generated thoughts.

This is significantly more token-intensive and often involves orchestrators like LangChain or LlamaIndex. It's best suited for research-grade problems or scenarios where the cost of incorrect output is astronomically high.
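To show the shape of the idea without an orchestrator, here is a stripped-down beam-search variant. It is a sketch under assumptions: `expand` and `score` are hypothetical callbacks that would each be a model call in practice (a generator and an evaluator), and pruning to the top `beamWidth` paths stands in for full backtracking.

```typescript
// Hypothetical callbacks; in practice each would wrap a model call.
type Expand = (path: string[]) => Promise<string[]>; // propose candidate next thoughts
type Score = (path: string[]) => Promise<number>;    // evaluator: higher is better

async function treeOfThought(
  expand: Expand,
  score: Score,
  depth: number,
  beamWidth: number,
): Promise<string[]> {
  let beam: string[][] = [[]]; // start from a single empty reasoning path

  for (let level = 0; level < depth; level++) {
    const candidates: { path: string[]; value: number }[] = [];

    // Generate several thoughts from every surviving path and score each extension.
    for (const path of beam) {
      for (const thought of await expand(path)) {
        const next = [...path, thought];
        candidates.push({ path: next, value: await score(next) });
      }
    }
    if (candidates.length === 0) break;

    // Prune: keep only the most promising paths for the next level.
    candidates.sort((a, b) => b.value - a.value);
    beam = candidates.slice(0, beamWidth).map((c) => c.path);
  }

  return beam[0] ?? [];
}
```

Even in this toy form the cost is visible: with a branching factor of b, each level makes roughly beamWidth generator calls and beamWidth × b evaluator calls, which is why this pattern is hard to justify outside high-value problems.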

Operationalizing Reasoning: Beyond the Prototype

Getting a reasoning model to work in a Jupyter notebook is one thing; putting it into a production system with SLAs is another.

Monitoring and Observability:

You need sophisticated logging. Simple request/response logs aren't enough. You need to log:
  • Each turn of the conversation/reasoning step.
  • Token usage for each step.
  • Latency for each step.
  • Tool calls made and their outputs.
  • Any system prompts or orchestrator logic applied.

This allows you to diagnose why a reasoning chain failed, identify bottlenecks, and optimize prompts.
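As a rough illustration of what per-step logging might look like, the sketch below keeps one record per reasoning step and rolls them up into a summary. The field names and the `createTrace` helper are hypothetical; in a real system the records would go to your logging or tracing backend rather than an in-memory array.

```typescript
// One record per reasoning step; field names are illustrative.
interface StepRecord {
  step: number;
  kind: "model" | "tool";
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  toolName?: string;
  toolOutput?: string;
}

function createTrace(traceId: string) {
  const steps: StepRecord[] = [];
  return {
    // Append a step; the step number is assigned in call order.
    record(entry: Omit<StepRecord, "step">): void {
      steps.push({ step: steps.length + 1, ...entry });
    },
    // Roll the steps up into totals and point at the slowest step.
    summary() {
      let totalTokens = 0;
      let totalLatencyMs = 0;
      let slowest: StepRecord | undefined;
      for (const s of steps) {
        totalTokens += s.promptTokens + s.completionTokens;
        totalLatencyMs += s.latencyMs;
        if (!slowest || s.latencyMs > slowest.latencyMs) slowest = s;
      }
      return { traceId, stepCount: steps.length, totalTokens, totalLatencyMs, slowestStep: slowest?.step };
    },
  };
}

const trace = createTrace("trip-123");
trace.record({ kind: "model", promptTokens: 100, completionTokens: 50, latencyMs: 800 });
trace.record({ kind: "tool", promptTokens: 0, completionTokens: 0, latencyMs: 1200, toolName: "searchFlights" });
console.log(trace.summary());
```

With records like these, "why was this request slow?" becomes a lookup (here, step 2, the tool call) instead of guesswork.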

Caching Strategies:

For reasoning paths that might recur, consider aggressive caching. If a sub-problem is solved frequently with the same inputs, cache its reasoned output. This is particularly effective for static information retrieval or common computation patterns.
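A minimal in-memory version of this might look like the following. It is a sketch with assumptions: `createReasoningCache` is a name I made up, the default key is `JSON.stringify` of the input (which is sensitive to property order), and there is no expiry, so it only suits inputs whose answers do not go stale.

```typescript
function createReasoningCache<I, O>(
  solve: (input: I) => Promise<O>,                                 // the expensive reasoning call
  keyOf: (input: I) => string = (input) => JSON.stringify(input),  // assumed cache key
) {
  const cache = new Map<string, Promise<O>>();
  let hits = 0;

  return {
    get(input: I): Promise<O> {
      const key = keyOf(input);
      const cached = cache.get(key);
      if (cached) {
        hits++;
        return cached;
      }
      // Cache the promise itself so concurrent identical requests share one call.
      const pending = solve(input).catch((err) => {
        cache.delete(key); // do not keep failed results around
        throw err;
      });
      cache.set(key, pending);
      return pending;
    },
    stats: () => ({ hits, size: cache.size }),
  };
}
```

Storing the promise rather than the resolved value means two identical requests arriving at the same moment trigger only one reasoning chain.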

Asynchronous Processing & User Experience:

Since reasoning chains can be lengthy, consider an asynchronous approach. Instead of blocking the user, acknowledge their request immediately and process the reasoning in the background. Notify the user when the complete, reasoned response is ready. This maintains a good user experience even with inherent latency.
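The acknowledge-then-notify flow can be sketched in a few lines. Treat this as an illustration only: the in-memory `Map` stands in for a real job queue or database, and `reason` and `notify` are hypothetical callbacks for the reasoning chain and for whatever channel reaches the user (webhook, WebSocket, email).

```typescript
type JobStatus =
  | { state: "pending" }
  | { state: "done"; result: string }
  | { state: "failed"; error: string };

function createJobRunner(
  reason: (query: string) => Promise<string>,          // the slow reasoning chain
  notify: (jobId: string, status: JobStatus) => void,  // tells the user when it finishes
) {
  const jobs = new Map<string, JobStatus>();
  let nextId = 1;

  return {
    // Returns immediately with a job id; the reasoning runs in the background.
    submit(query: string): string {
      const jobId = `job-${nextId++}`;
      jobs.set(jobId, { state: "pending" });

      reason(query)
        .then((result): JobStatus => ({ state: "done", result }))
        .catch((err): JobStatus => ({ state: "failed", error: String(err) }))
        .then((status) => {
          jobs.set(jobId, status);
          notify(jobId, status);
        });

      return jobId;
    },
    // Lets the client poll if it missed the notification.
    status(jobId: string): JobStatus | undefined {
      return jobs.get(jobId);
    },
  };
}
```

The caller gets a job id back right away, and failures are recorded as a status rather than surfacing as an unhandled rejection.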

Guardrails and Human-in-the-Loop:

Even the most sophisticated reasoning models can err. Implement guardrails (e.g., toxicity checks, cost ceilings, maximum reasoning steps) and consider a "human-in-the-loop" for critical decisions or when the model expresses uncertainty.
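One way to wire such checks into a reasoning loop is a small function called before every step. This sketch makes several assumptions: the limits are placeholders, `selfReportedConfidence` is a hypothetical 0-to-1 score you would have to elicit from the model or an evaluator, and toxicity checks are left out because they depend on an external classifier.

```typescript
interface RunState {
  steps: number;                  // reasoning steps taken so far
  tokensUsed: number;             // total tokens consumed so far
  selfReportedConfidence: number; // hypothetical 0..1 score from the model or an evaluator
  critical: boolean;              // whether the decision is high-stakes
}

interface Limits {
  maxSteps: number;
  maxTokens: number;
  minConfidence: number;
}

type Decision =
  | { action: "continue" }
  | { action: "stop"; reason: string }
  | { action: "escalate"; reason: string };

function checkGuardrails(state: RunState, limits: Limits): Decision {
  // Hard ceilings come first: they protect cost and latency regardless of content.
  if (state.steps >= limits.maxSteps) {
    return { action: "stop", reason: "maximum reasoning steps reached" };
  }
  if (state.tokensUsed >= limits.maxTokens) {
    return { action: "stop", reason: "token ceiling reached" };
  }
  // Critical decisions with low confidence go to a person instead of the next step.
  if (state.critical && state.selfReportedConfidence < limits.minConfidence) {
    return { action: "escalate", reason: "low confidence on a critical decision" };
  }
  return { action: "continue" };
}
```

Returning a decision object instead of throwing keeps the policy in one place and leaves the orchestration loop free to decide how to stop or hand off.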

Conclusion: Think Deliberately

Reasoning models are powerful, but they demand a deliberate approach. Don't reach for a complex reasoning chain when a simple RAG query will do. But when the problem truly warrants it – when accuracy, novelty, and complexity are paramount – investing in these extended thinking patterns can unlock capabilities previously impossible. The key is in understanding the trade-offs, budgeting your "thinking tokens" wisely, and meticulously operationalizing the solution.

As I continue to build and experiment in the AI space, the lines between simple LLM calls and complex agentic systems become increasingly blurred. Understanding when and how to invoke these deeper reasoning capabilities will define the next generation of intelligent applications.


I'm always eager to connect with fellow builders and discuss these challenges! Feel free to reach out on LinkedIn or X. Let's share insights and build the future together.
