The Economics of Inference: Routing, Caching, and Batching for LLM Apps
A practical playbook for cutting per-request cost without sacrificing latency — model routing, semantic cache hit rates, request batching, and the metrics that actually matter.
As a Senior Software Engineer with a decade of experience across various tech landscapes, I've seen technologies evolve from nascent ideas to industry standards. Right now, large language models (LLMs) are experiencing that explosive growth. Building LLM-powered applications is exhilarating, but the operational costs – specifically, inference costs – can quickly escalate from an exciting experiment to a daunting budget line item.
The dream is simple: powerful AI at our fingertips. The reality? Each token generated, each model call made, chips away at our profit margins. This isn't just about saving money; it's about building sustainable, scalable LLM applications that can actually go to market. Over the past year, I've been deep diving into strategies to optimize these costs without sacrificing the user experience. This post is a practical playbook, a distillation of what I've learned about cutting per-request cost through intelligent routing, semantic caching, and request batching.
Let's dive into the metrics that truly matter and how to influence them.
The Metrics That Matter Most
Before we talk solutions, let's align on the problems we're trying to solve. The key metrics I focus on are:
- Cost Per Request (CPR): This is the holy grail. It’s the total cost of serving a single user interaction that involves an LLM. It includes model inference, RAG store lookups (if applicable), orchestration, and sometimes even infrastructure overhead. Our goal is to drive this down.
- Latency Per Request (LPR): The time a user waits for a response. While we want to reduce CPR, we absolutely cannot tank LPR. A cheap but slow LLM app is a dead LLM app.
- Throughput: The number of requests our system can handle per unit of time. This often correlates with scaling capacity.
These three are a constant balancing act. Optimizing one often impacts the others.
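To make CPR concrete, here is a minimal sketch of the token arithmetic behind the model-inference portion of the cost. It assumes a provider that bills input and output tokens at separate per-token rates; the type names and the prices are hypothetical placeholders, not real price-list values.

```typescript
// Hypothetical pricing shape: dollars per single token, billed separately
// for input (prompt) and output (completion) tokens.
type TokenPricing = {
  inputCostPerToken: number;
  outputCostPerToken: number;
};

type RequestUsage = {
  inputTokens: number;
  outputTokens: number;
};

// Model-inference portion of Cost Per Request. Retrieval, orchestration and
// infrastructure overhead would be added on top of this figure.
function inferenceCost(usage: RequestUsage, pricing: TokenPricing): number {
  return (
    usage.inputTokens * pricing.inputCostPerToken +
    usage.outputTokens * pricing.outputCostPerToken
  );
}

// Average CPR over a sample of requests, e.g. one hour of traffic.
function averageCostPerRequest(usages: RequestUsage[], pricing: TokenPricing): number {
  if (usages.length === 0) return 0;
  const total = usages.reduce((sum, u) => sum + inferenceCost(u, pricing), 0);
  return total / usages.length;
}

// Illustrative numbers only.
const examplePricing: TokenPricing = {
  inputCostPerToken: 0.000001,
  outputCostPerToken: 0.000002,
};
const sampleUsage: RequestUsage[] = [
  { inputTokens: 1000, outputTokens: 500 },
  { inputTokens: 3000, outputTokens: 500 },
];
console.log(averageCostPerRequest(sampleUsage, examplePricing).toFixed(4)); // prints 0.0030
```

Tracking this average next to latency percentiles and throughput makes it easier to see what each optimization below actually buys.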
Intelligent Model Routing: The Right Model for the Job
Not all LLMs are created equal, nor are all requests. Using a 70B parameter model for a simple summarization task when a fine-tuned 7B model or even a quick regex could do the trick is like using a sledgehammer to crack a nut – expensive and overkill.
Intelligent model routing is about conditionally directing requests to the most appropriate (and typically, most cost-effective) LLM.
How It Works:
- Request Classification: Analyze the incoming request to understand its intent and complexity. This can be done using a smaller, cheaper LLM (e.g., gpt-3.5-turbo to classify whether a request is a simple Q&A, a creative writing prompt, or a code generation task) or even deterministic rules.
- Model Selection Logic: Based on the classification, choose the optimal LLM. This logic could consider cost, latency, specific capabilities (e.g., code generation models for code, vision models for image analysis), and even current model load.
- Dynamic Invocation: Route the request to the selected model.
Example Scenario: A support chatbot.
- Simple FAQ: Can be handled by a highly optimized, smaller open-source model running on cheaper hardware or even a keyword-based lookup for deterministic answers.
- Complex Troubleshooting: Might require a gpt-4 level model for nuanced understanding and multi-turn conversations.
- Internal Knowledge Base Query: Could use a different model, perhaps one fine-tuned on internal documentation, or one optimized for RAG.
Here's a simplified TypeScript example of a routing function:
type ModelConfig = {
  name: string;
  costPerToken: number; // For relative cost comparison
  latencyMs: number;
  capabilities: string[]; // e.g., ['summarization', 'code-gen', 'creative-writing']
  endpoint: string;
};

const MODEL_CATALOG: Record<string, ModelConfig> = {
  "fast-cheap-model": {
    name: "FastCheapModel",
    costPerToken: 0.0000005,
    latencyMs: 150,
    capabilities: ["summarization", "simple-qa"],
    endpoint: "/api/llm/fast-cheap"
  },
  "mid-tier-model": {
    name: "MidTierModel",
    costPerToken: 0.0000015,
    latencyMs: 300,
    capabilities: ["complex-qa", "creative-writing", "sentiment-analysis"],
    endpoint: "/api/llm/mid-tier"
  },
  "premium-model": {
    name: "PremiumModel",
    costPerToken: 0.00001,
    latencyMs: 800,
    capabilities: ["code-gen", "advanced-reasoning", "multi-modal"],
    endpoint: "/api/llm/premium"
  },
};

async function classifyRequest(prompt: string): Promise<string> {
  // In a real scenario, this would call a smaller LLM or use ML classification.
  // For demo purposes, use simple keyword logic. Matching runs on a lowercased
  // copy so that "Summarize" and "summarize" are treated alike.
  const text = prompt.trim().toLowerCase();
  if (text.includes("generate code") || text.includes("write a function")) {
    return "code-gen";
  }
  if (text.includes("summarize") || text.includes("shorten")) {
    return "summarization";
  }
  // Short prompts that end in a question mark are treated as simple Q&A
  if (text.split(/\s+/).length < 10 && text.endsWith('?')) {
    return "simple-qa";
  }
  return "complex-qa";
}

async function routeLLMRequest(prompt: string): Promise<string> {
  const requestType = await classifyRequest(prompt); // e.g., 'code-gen', 'summarization'
  let selectedModel: ModelConfig | undefined;

  // Simple routing logic - prefer cheapest model that meets capabilities
  for (const modelName in MODEL_CATALOG) {
    const model = MODEL_CATALOG[modelName];
    if (model.capabilities.includes(requestType)) {
      if (!selectedModel || model.costPerToken < selectedModel.costPerToken) {
        selectedModel = model;
      }
    }
  }

  if (!selectedModel) {
    // Fallback to a default or error
    console.warn(`No specific model found for type: ${requestType}. Falling back to mid-tier.`);
    selectedModel = MODEL_CATALOG["mid-tier-model"];
  }

  console.log(`Routing request to ${selectedModel.name} for type: ${requestType}`);
  // In a real app, you'd then call an API client for selectedModel.endpoint
  return `Response from ${selectedModel.name} for "${prompt}"`;
}

// Example Usage:
// routeLLMRequest("Summarize this article.");
// routeLLMRequest("Generate code for a simple API in Node.js.");

Semantic Caching: Avoiding Rework
Semantic caching is probably my favorite optimization – it's like a superpower for reducing costs and latency for repeat queries. Unlike traditional exact-match caching which misses slightly rephrased questions, a semantic cache understands meaning.
How It Works:
- Embed Input: When an LLM request comes in, embed the user's prompt into a vector representation.
- Vector Search: Search a vector database (e.g., Pinecone, Weaviate, Milvus, ChromaDB) for semantically similar historical queries.
- Cache Hit/Miss:
  - Hit: If a sufficiently similar query's response is found (based on a cosine similarity threshold), return the cached response. Bingo! You skip the LLM call entirely and pay only for the embedding and the vector lookup, which is typically far cheaper and faster than a full inference.
  - Miss: If no similar query is found, proceed with the normal LLM inference. Once the LLM response is generated, store the prompt's embedding together with the response in the vector database for future use. The embedding computed in the first step can be reused here.
The key to success here is a good embedding model and a well-tuned similarity threshold. Too high, and you miss valid cache opportunities. Too low, and you return irrelevant answers.
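To show what the threshold is compared against, here is a minimal sketch of cosine similarity and the hit decision. The vectors are toy values chosen for illustration, and `isCacheHit` is a hypothetical helper; in practice the vector database usually computes the score for you.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; 1 means the vectors point in the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // Treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cache hit is declared only when the score clears the threshold.
function isCacheHit(query: number[], cached: number[], threshold: number): boolean {
  return cosineSimilarity(query, cached) >= threshold;
}

// Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
const queryVec = [1, 0, 1];
const nearVec = [1, 0.1, 0.9];
const farVec = [0, 1, 0];
console.log(isCacheHit(queryVec, nearVec, 0.85)); // true
console.log(isCacheHit(queryVec, farVec, 0.85)); // false
```

One way to tune the threshold is to replay a labelled sample of real query pairs and pick the lowest value that produces no wrong answers you would be unwilling to serve.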
Here’s a diagram illustrating the semantic cache flow:
graph TD
    A[User Request] --> B[Calculate Embedding];
    B --> C{Vector Search in Cache};
    C -- High Similarity --> D[Return Cached Response];
    C -- Low Similarity --> E[Call LLM for Inference];
    E --> F[Get LLM Response];
    F --> G[Store Prompt Embedding & Response in Vector Cache];
    F --> H[Return LLM Response];

Code Snippet for Semantic Cache (Conceptual):
import { embedInput } from './embeddingService'; // Assumes you have an embedding service
import { vectorDb } from './vectorDbService'; // Assumes you have a vector database client

const SIMILARITY_THRESHOLD = 0.85; // Tune this based on your use case

interface CacheEntry {
  inputPrompt: string;
  response: string;
  embedding: number[];
}

async function getLlmResponseWithSemanticCache(prompt: string): Promise<string> {
  const promptEmbedding = await embedInput(prompt);

  // 1. Search cache for similar prompts
  const searchResults = await vectorDb.search<CacheEntry>({
    queryVector: promptEmbedding,
    tableName: 'semantic_cache',
    topK: 1,
  });

  if (searchResults.length > 0) {
    const { item: cachedEntry, score } = searchResults[0];
    if (score >= SIMILARITY_THRESHOLD) {
      console.log(`Cache hit! Score: ${score.toFixed(2)}`);
      return cachedEntry.response;
    }
  }

  // 2. Cache miss: Call LLM
  console.log("Cache miss. Calling LLM...");
  // In a real app, this would be your actual LLM inference logic
  const llmResponse = await callActualLLM(prompt);

  // 3. Store new prompt and response in cache
  const newCacheEntry: CacheEntry = {
    inputPrompt: prompt,
    response: llmResponse,
    embedding: promptEmbedding, // Re-use the embedding calculated earlier
  };
  await vectorDb.upsert('semantic_cache', [newCacheEntry]);

  return llmResponse;
}

// Helper (mock) functions
async function callActualLLM(prompt: string): Promise<string> {
  // Simulate LLM call
  await new Promise(resolve => setTimeout(resolve, 500));
  return `LLM responded to: "${prompt}"`;
}

Request Batching: Volume Discounts for Your Queries
Batching is a classic optimization technique that applies beautifully to LLMs, especially for asynchronous or background tasks. Many LLM providers (and self-hosted models) offer discounted rates or better throughput for batched requests. The idea is to combine multiple individual requests into a single, larger request to the LLM.
Types of Batching:
- Static Batching: You have a fixed number of inputs ready (e.g., summarising 10 articles from a daily feed), so you send them all at once.
- Dynamic Batching (Micro-batching): You accumulate requests over a short time window (e.g., 50ms) or until a certain batch size is reached. This is crucial for latency-sensitive applications where you don't want to wait too long, but still want to find opportunities to batch.
Considerations:
- Latency vs. Throughput: Dynamic batching introduces a small delay (the batching window), but it significantly improves overall throughput and reduces per-inference cost.
- Context Window Limits: Ensure your combined batched prompts don't exceed the LLM's maximum context window.
- Error Handling: If one sub-request in a batch fails, how do you handle the others?
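The context-window consideration can be sketched as a greedy packer that splits prompts into batches under a token budget. This is an illustration under stated assumptions: `estimateTokens` uses a rough four-characters-per-token heuristic rather than a real tokenizer, and both function names are hypothetical.

```typescript
// Rough token estimate: about 4 characters per token. This is a heuristic
// assumption for the sketch; use the model's actual tokenizer in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily packs prompts, in order, into batches whose estimated token total
// stays within maxTokensPerBatch. A single prompt larger than the budget is
// placed in a batch of its own so the caller can decide how to handle it.
function splitIntoBatches(prompts: string[], maxTokensPerBatch: number): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const prompt of prompts) {
    const tokens = estimateTokens(prompt);
    // Flush the current batch if adding this prompt would exceed the budget
    if (current.length > 0 && currentTokens + tokens > maxTokensPerBatch) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(prompt);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Three prompts of 40, 40 and 80 characters (10, 10 and 20 estimated tokens).
const demoPrompts = ["a".repeat(40), "b".repeat(40), "c".repeat(80)];
console.log(splitIntoBatches(demoPrompts, 25).map(b => b.length)); // [ 2, 1 ]
```

Leave headroom in the budget for the instructions that wrap the combined prompts and for the expected output, since both count against the same window.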
Conceptual Batching Service:
type BatchedLLMRequest = {
  id: string;
  prompt: string;
  resolve: (response: string) => void;
  reject: (error: Error) => void;
  timestamp: number;
};

const BATCH_SIZE = 5; // Or based on token limits
const BATCH_WINDOW_MS = 100; // ms to wait before flushing batch

let currentBatch: BatchedLLMRequest[] = [];
let batchTimeout: NodeJS.Timeout | null = null;

async function processBatch() {
  // Cancel any pending flush timer; this call is the flush.
  if (batchTimeout) clearTimeout(batchTimeout);
  batchTimeout = null;
  if (currentBatch.length === 0) return;

  // Take ownership of the pending requests and reset the shared queue BEFORE
  // awaiting, so requests that arrive during the LLM call start a fresh batch
  // instead of being resolved with a missing response and then discarded.
  const batch = currentBatch;
  currentBatch = [];

  const prompts = batch.map(req => req.prompt);
  console.log(`Processing batch of ${prompts.length} requests...`);
  try {
    // In a real scenario, this would be a single API call to YOUR LLM provider's batch endpoint
    // or to a self-hosted model with batching capabilities.
    const combinedResponse = await callBatchedLLM(prompts); // Returns an array of responses
    batch.forEach((req, index) => {
      req.resolve(combinedResponse[index]);
    });
  } catch (error: unknown) {
    const message = error instanceof Error ? error.message : String(error);
    batch.forEach(req => {
      req.reject(new Error(`Batch processing failed: ${message}`));
    });
  }
}

async function addRequestToBatch(prompt: string): Promise<string> {
  return new Promise((resolve, reject) => {
    currentBatch.push({ id: Math.random().toString(36), prompt, resolve, reject, timestamp: Date.now() });
    if (currentBatch.length >= BATCH_SIZE) {
      // Batch is full: flush immediately
      processBatch();
    } else if (!batchTimeout) {
      // First request of a new batch: start the flush window
      batchTimeout = setTimeout(processBatch, BATCH_WINDOW_MS);
    }
  });
}

// Mock batched LLM caller
async function callBatchedLLM(prompts: string[]): Promise<string[]> {
  await new Promise(resolve => setTimeout(resolve, 300 + prompts.length * 50)); // Simulating longer batch call
  return prompts.map(p => `Batched response for: "${p}"`);
}

// Example Usage:
// (async () => {
//   console.log("Adding requests...");
//   const p1 = addRequestToBatch("What is the capital of France?");
//   const p2 = addRequestToBatch("Tell me a fun fact about giraffes.");
//   const p3 = addRequestToBatch("Explain photosynthesis simply.");
//   const p4 = addRequestToBatch("Suggest a good book.");
//   const p5 = addRequestToBatch("Who won the last Super Bowl?"); // This should trigger batch processing
//   const [r1, r2, r3, r4, r5] = await Promise.all([p1, p2, p3, p4, p5]);
//   console.log({ r1, r2, r3, r4, r5 });
// })();

Bringing It All Together: A Layered Approach
The most effective strategy combines these techniques. A typical request flow might look like this:
- Semantic Cache Lookup: Is there a cached response for this exact or semantically similar query? (Lowest cost, lowest latency)
- If Cache Miss, Route Request: Classify the request and route it to the most cost-effective LLM capable of handling it.
- If Asynchronous/Batchable, Apply Batching: Pool requests that can tolerate slight latency for cost savings.
- LLM Inference: Finally, call the selected LLM.
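This layered approach creates a highly optimized inference pipeline: it reduces CPR on every path and LPR on cache hits, and it confines the extra batching delay to requests that can tolerate it, all without compromising the quality of your LLM application.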
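The flow above can be sketched as a single handler with each layer injected as a function. This is a minimal illustration, not a definitive design: the `PipelineDeps` interface, the `latencyTolerant` flag, the model names, and the in-memory exact-match cache are all hypothetical stand-ins for the real cache, router, and batcher.

```typescript
// Hypothetical dependency interfaces: each function stands in for one of the
// layers described above (cache, router, batcher, direct model call).
type PipelineDeps = {
  cacheLookup: (prompt: string) => Promise<string | null>;
  cacheStore: (prompt: string, response: string) => Promise<void>;
  selectModel: (prompt: string) => Promise<string>; // returns a model name
  callModel: (model: string, prompt: string) => Promise<string>;
  enqueueBatch: (model: string, prompt: string) => Promise<string>;
};

type RequestOptions = {
  latencyTolerant: boolean; // true for background or asynchronous work
};

async function handleRequest(
  prompt: string,
  options: RequestOptions,
  deps: PipelineDeps
): Promise<string> {
  // 1. Cache lookup: the cheapest and fastest path
  const cached = await deps.cacheLookup(prompt);
  if (cached !== null) return cached;

  // 2. Cache miss: pick the most cost-effective capable model
  const model = await deps.selectModel(prompt);

  // 3 and 4. Batch if the caller can wait, otherwise call the model directly
  const response = options.latencyTolerant
    ? await deps.enqueueBatch(model, prompt)
    : await deps.callModel(model, prompt);

  // Store the fresh answer so the next similar request is a cache hit
  await deps.cacheStore(prompt, response);
  return response;
}

// In-memory stand-ins so the sketch runs on its own. The exact-match Map is a
// placeholder for a real semantic cache.
const memoryCache = new Map<string, string>();
const demoDeps: PipelineDeps = {
  cacheLookup: async (prompt) => memoryCache.get(prompt) ?? null,
  cacheStore: async (prompt, response) => { memoryCache.set(prompt, response); },
  selectModel: async (prompt) => (prompt.length < 40 ? "small-model" : "large-model"),
  callModel: async (model, prompt) => `[${model}] direct answer to: ${prompt}`,
  enqueueBatch: async (model, prompt) => `[${model}] batched answer to: ${prompt}`,
};

handleRequest("What is our refund policy?", { latencyTolerant: false }, demoDeps)
  .then((answer) => console.log(answer));
```

Injecting the layers as functions keeps the ordering logic testable on its own and lets you swap any one layer without touching the others.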
Final Thoughts
The world of LLMs is rapidly evolving, and so are the strategies for managing their operational costs. Implementing intelligent routing, robust semantic caching, and thoughtful request batching are no longer "nice-to-haves" but essential components of building financially viable LLM applications. These techniques empower you to deliver powerful AI experiences at a fraction of the cost, making your product more competitive and sustainable.
I'm constantly exploring new ways to optimize LLM interactions. If you're tackling similar challenges or have insights to share, I'd love to connect and exchange ideas!