Infra & Security 5 June 2026 · 7 min

Semantic Caching for Agent Crews: Cutting Token Bills by 40%

By wGrow Project Team · 5 June 2026

We audited our OpenAI bill last quarter. The number was embarrassing: thousands of dollars spent generating answers to questions we had already answered — not the same questions exactly, but the same questions with a swapped preposition, a different article, a rephrased subordinate clause. The LLM did not care. It billed us full generation cost every time.

That is the exact-match fallacy. And it is expensive.

The Exact-Match Fallacy in Agent Workflows

Traditional caching operates on cryptographic hashes. SHA-256 the input string, check the cache, return on hit, compute on miss. This works for deterministic function calls where the same bytes produce the same result. LLM prompts are another matter entirely.

A changed comma is a cache miss. A synonym is a cache miss. “What is the penalty for late submission under Section 14?” and “Under Section 14, what are the penalties if a submission is late?” hash to completely different keys. The LLM returns functionally identical answers. You paid twice.
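
A minimal sketch of the problem, using only the standard library. The helper name `exact_match_key` is ours for illustration; the two queries are the ones quoted above.

```python
import hashlib


def exact_match_key(prompt: str) -> str:
    """Return the SHA-256 hex digest used as an exact-match cache key."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


q1 = "What is the penalty for late submission under Section 14?"
q2 = "Under Section 14, what are the penalties if a submission is late?"

key1 = exact_match_key(q1)
key2 = exact_match_key(q2)

# Same intent, different bytes: the two keys never collide,
# so the second query is always a cache miss.
print(key1 == key2)  # False
```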

In agentic workflows, this compounds fast. Intermediate agents dynamically assemble prompts from upstream context injection, so the same user intent arrives as a slightly different string every time. Exact-match hashing on dynamically assembled prompts produces hit rates that are functionally zero.

We saw this concretely on a document processing crew deployed for a public-sector client handling thousands of daily policy queries, with citizens and officers asking about the same policies in different phrasing. In a three-month internal audit of production query logs (stratified random sample across policy domains, manually reviewed), our initial cache, SHA-256 keys against a Redis store, hit on under 10 percent of requests. Manual review of the miss sample confirmed that the majority were paraphrases of queries the crew had already answered. We were paying full LLM generation rates on more than 90 percent of traffic, most of it semantically redundant.

Architecture of a Semantic Interceptor

Figure: Architecture Map. Data routing through a central filter node.

The fix is to cache on meaning, not bytes.

We built a semantic cache layer that sits in front of the agent crew and intercepts duplicate intent before it reaches the model. The stack is deliberately boring: we already run Redis Stack, whose RediSearch query engine provides HNSW vector indexing. We used that rather than standing up a dedicated vector database — fewer moving parts, one less managed service, one less failure domain. When you’re debugging a production incident at 2 a.m., you want boring.

The embedding model is OpenAI’s text-embedding-3-small. Cache key generation needs to be fast and cheap, not semantically rich — just stable and consistent. text-embedding-3-small fits that requirement well.

Incoming queries are embedded, then run against the Redis index via KNN vector search. If the cosine similarity of the nearest neighbor exceeds our threshold, we return the cached answer directly — the crew never runs. Below threshold, the crew executes, and we write the answer along with its embedded query vector and source document metadata back to Redis.
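
The lookup-then-write-back flow can be sketched in plain Python. This is an illustration under stated assumptions, not our production code: a brute-force scan over an in-memory list stands in for the Redis KNN search, `embed` is an injected callable standing in for the text-embedding-3-small call, `run_crew` is a hypothetical function returning an answer plus the document IDs it consulted, and a hit is taken to be any similarity at or above the threshold.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    vector: Vector            # embedded query
    answer: str               # crew output
    source_doc_ids: Set[str]  # provenance, used later for invalidation


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]  # assumption: wraps the embedding API
    threshold: float
    entries: List[CacheEntry] = field(default_factory=list)

    def nearest(self, query_vector: Vector) -> Optional[CacheEntry]:
        """Return the nearest entry if it clears the threshold, else None."""
        best_entry, best_sim = None, -1.0
        for entry in self.entries:  # brute force; Redis would run a KNN query
            sim = cosine_similarity(query_vector, entry.vector)
            if sim > best_sim:
                best_entry, best_sim = entry, sim
        if best_entry is not None and best_sim >= self.threshold:
            return best_entry
        return None


def answer_query(
    cache: SemanticCache,
    query: str,
    run_crew: Callable[[str], Tuple[str, Set[str]]],
) -> str:
    """Serve from cache on a hit; otherwise run the crew and write back."""
    query_vector = cache.embed(query)
    hit = cache.nearest(query_vector)
    if hit is not None:
        return hit.answer  # the crew never runs
    answer, doc_ids = run_crew(query)
    cache.entries.append(CacheEntry(query_vector, answer, set(doc_ids)))
    return answer
```

The query is embedded once per request and the same vector is reused for the write-back, so a miss costs one embedding call plus the generation.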

Routing is binary: a hit is a hit. Some teams add a lightweight verification pass for high-stakes domains — routing cache hits through a smaller model to confirm relevance before serving. For the statutory board deployment, a well-calibrated threshold made that unnecessary, but it is worth considering for any use case where a false positive carries real cost.

The Unit Economics of Interception

Figure: Caching Economics (illustrative). Comparing OpenAI token generation estimates against embedding generation.

Gen compute (gpt-4o): $10.00 per million output tokens
Embed compute (text-embedding-3-small): $0.02 per million input tokens
Redis VSS latency: <50 ms

Caching is not free. You trade generation compute for embedding compute plus vector search. The math has to close.

At the time of this deployment, OpenAI’s pricing page listed gpt-4o output at $10.00 per million output tokens (openai.com/api/pricing, accessed [month year]). A typical policy query response runs 400–600 output tokens; call it 500. That is $0.005 per generation.

At the same pricing date, OpenAI listed text-embedding-3-small at $0.02 per million input tokens (openai.com/api/pricing, same access date). A typical query is 20–40 tokens; call it 30. That is roughly $0.0000006 per embedding.

The ratio is approximately 8,300 to 1. Avoiding a single LLM generation pays for the embedding compute on roughly 8,300 incoming queries. The break-even math is so lopsided it barely warrants a spreadsheet — and yet most teams skip the cache entirely because the architecture feels fiddly.
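
For anyone who wants to rerun the arithmetic with their own prices and token counts, a short sketch follows. The prices and token counts are the figures quoted above; the constant names are ours, and the break-even hit rate is a simplification that counts only output-token generation cost against embedding cost.

```python
# Prices quoted in the article, in USD per million tokens.
GEN_PRICE_PER_M_OUTPUT = 10.00   # gpt-4o output tokens
EMBED_PRICE_PER_M_INPUT = 0.02   # text-embedding-3-small input tokens

# Typical sizes quoted in the article.
OUTPUT_TOKENS_PER_ANSWER = 500
INPUT_TOKENS_PER_QUERY = 30

cost_per_generation = GEN_PRICE_PER_M_OUTPUT * OUTPUT_TOKENS_PER_ANSWER / 1_000_000
cost_per_embedding = EMBED_PRICE_PER_M_INPUT * INPUT_TOKENS_PER_QUERY / 1_000_000

# How many embeddings one avoided generation pays for.
ratio = cost_per_generation / cost_per_embedding

# Every query pays for an embedding; only hits avoid a generation.
# The cache pays for itself once the hit rate exceeds this fraction.
break_even_hit_rate = cost_per_embedding / cost_per_generation

print(f"generation: ${cost_per_generation:.4f}")
print(f"embedding:  ${cost_per_embedding:.7f}")
print(f"ratio:      {ratio:,.0f} to 1")
print(f"break-even hit rate: {break_even_hit_rate:.4%}")
```

Under these assumptions the break-even hit rate is about 0.012 percent, which is why the embedding cost barely registers in the analysis.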

Latency is a secondary benefit that becomes significant at scale. In production traces over the first eight weeks post-launch (184,219 requests, p50s measured at the edge), Redis HNSW lookup added 20–50 ms at p50 on the hot path. Cache misses generating roughly 500 output tokens took 2–6 s end to end depending on load. Cache-hit responses completed in under 100 ms at p50 — a >95 percent reduction against the miss path.

Calibrating the Similarity Threshold

Vector caching introduces a failure mode that text caching does not have: fuzzy matching can return wrong answers.

Set the threshold too low, and you start serving cached answers to questions that are superficially similar but semantically distinct. “What is the penalty for late submission under Section 14?” might retrieve an answer generated for a question about Section 14 registration fees. Both questions are about Section 14. They are not the same question. In a regulatory context, that is not a UX problem — it is a liability.

Set it too high, and you approach near-exact-match behaviour. Hit rate collapses, and you have added embedding latency to every miss without gaining anything on the hits.

We do not guess this number. We ran three months of historical query logs through a calibration script: embed every logged query pair that produced the same crew output, plot the distribution of pairwise cosine similarities, find the threshold that maximises true positives while holding false positives below an acceptable rate. The distribution tends to be steep in the relevant range — there is a band where the threshold does useful work, and outside that band it is either useless or dangerous. You see it clearly when you plot the data.
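
The threshold search itself is simple once the similarities are computed. The sketch below is one way to do it, not our calibration script: it assumes you have pairwise cosine similarities for query pairs that produced the same crew output (positives) and, as an added assumption, for pairs that produced different outputs (negatives), which is what makes a false-positive rate measurable. The function name and the `max_false_positive_rate` parameter are hypothetical.

```python
from typing import Optional, Sequence


def calibrate_threshold(
    same_answer_sims: Sequence[float],
    different_answer_sims: Sequence[float],
    max_false_positive_rate: float,
) -> Optional[float]:
    """Return the lowest threshold whose false-positive rate fits the budget.

    A pair counts as a cache hit when similarity >= threshold. Lowering the
    threshold can only add hits, so the lowest threshold that respects the
    false-positive budget also maximises true positives.
    Returns None if no observed similarity value satisfies the budget.
    """
    if not different_answer_sims:
        raise ValueError("need at least one different-answer pair")
    candidates = sorted(set(same_answer_sims) | set(different_answer_sims))
    for threshold in candidates:  # ascending: most permissive first
        false_positives = sum(s >= threshold for s in different_answer_sims)
        rate = false_positives / len(different_answer_sims)
        if rate <= max_false_positive_rate:
            return threshold
    return None


def true_positive_rate(same_answer_sims: Sequence[float], threshold: float) -> float:
    """Fraction of same-answer pairs that would be served from cache."""
    return sum(s >= threshold for s in same_answer_sims) / len(same_answer_sims)
```

Plotting both distributions before trusting the returned number is still worthwhile: the function finds the cut, but only the plot shows how steep the band around it is.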

For the statutory board project, we settled on 0.93 — deliberately strict. At that level, minor phrasing variation is absorbed; genuinely distinct questions are not collapsed. Across the first eight weeks after launch (184,219 requests), production telemetry showed a 42 percent cache hit rate. Support review of a sampled ticket set found no confirmed false-positive cache hits over that window — we tracked this as a monitoring signal rather than a guarantee of zero errors. OpenAI API spend for the workflow fell 40 percent against the eight-week pre-cache baseline, normalized for request volume.

The 0.93 figure is specific to this domain. Regulatory queries have tight semantic boundaries; a customer support workflow handling conversational, loosely-worded queries will calibrate lower. The methodology — empirical calibration from production logs — holds regardless of where you land.

Cache Invalidation and State Management

Figure: Targeted Purge Workflow. Update Policy (trigger event, clear cache); Ingestion Pipeline (save doc, fire webhook); Redis Layer (listen, purge by doc ID).

Agent crews operating on live documents face a cache invalidation problem more serious than typical web caching. A stale cached answer about a regulatory penalty that was amended three weeks ago is not just wrong — in certain contexts it is actively harmful.

TTL expiration does not solve this. A 24-hour TTL means potentially serving stale regulatory guidance for a full day after a policy update. Shrink the TTL to an hour and you narrow the staleness window but gut your hit rate. You have traded one problem for another.

We use event-driven invalidation instead. When the statutory board updates a source document, the ingestion pipeline fires a webhook to our cache management service. Every cache entry in Redis carries metadata identifying which source document IDs were consulted to generate that answer. The webhook triggers a targeted purge of every entry referencing the updated document ID.

This requires discipline in the generation path: when the crew produces an answer, we record which document chunks were retrieved, and that provenance travels with the cache entry. Invalidation is precise. Unaffected entries survive. Hit rate recovers within a query cycle rather than waiting out a TTL window.

The implementation runs to roughly 80 lines of Python as a FastAPI webhook receiver. The complexity is not in the code — it is in deciding to do it correctly from the start rather than retrofitting after the first production incident.
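
The purge logic, separated from the webhook plumbing, can be sketched with in-memory dictionaries. This is an illustration only: the class and function names are hypothetical, a dict stands in for Redis, and the payload shape (`{"doc_id": ...}`) is an assumption about what the ingestion pipeline sends. In a FastAPI receiver, the route handler would call something like `handle_document_updated` with the parsed request body.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class ProvenanceIndexedCache:
    """Cache entries plus a reverse index from source doc ID to entry IDs."""

    def __init__(self) -> None:
        self._answers: Dict[str, str] = {}
        self._sources: Dict[str, Set[str]] = {}
        self._by_doc: Dict[str, Set[str]] = defaultdict(set)

    def put(self, entry_id: str, answer: str, source_doc_ids: Iterable[str]) -> None:
        """Store an answer along with the documents consulted to produce it."""
        self._remove(entry_id)  # keep the reverse index clean on overwrite
        doc_ids = set(source_doc_ids)
        self._answers[entry_id] = answer
        self._sources[entry_id] = doc_ids
        for doc_id in doc_ids:
            self._by_doc[doc_id].add(entry_id)

    def get(self, entry_id: str) -> Optional[str]:
        return self._answers.get(entry_id)

    def _remove(self, entry_id: str) -> None:
        """Drop one entry and every reverse-index reference to it."""
        self._answers.pop(entry_id, None)
        for doc_id in self._sources.pop(entry_id, set()):
            self._by_doc[doc_id].discard(entry_id)

    def purge_by_doc_id(self, doc_id: str) -> int:
        """Remove every entry that consulted doc_id; return how many were purged."""
        entry_ids = list(self._by_doc.get(doc_id, set()))
        for entry_id in entry_ids:
            self._remove(entry_id)
        self._by_doc.pop(doc_id, None)
        return len(entry_ids)


def handle_document_updated(cache: ProvenanceIndexedCache, payload: dict) -> dict:
    """Webhook body handler: purge entries that reference the updated document."""
    doc_id = payload["doc_id"]  # assumed payload field
    return {"doc_id": doc_id, "purged": cache.purge_by_doc_id(doc_id)}
```

An entry that consulted several documents is purged when any one of them changes, and entries that never touched the updated document are left alone.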

Compute is for Reasoning, Not Regurgitating

Generating tokens is the most expensive phase of an AI pipeline. Context windows are growing, which means per-query generation costs can rise even as per-token prices fall. Hardware improvements help at the margins. The underlying cost structure is compute-bound, and that is not changing soon.

The statutory board deployment shows what a semantic routing layer delivers at production scale: a 42 percent cache hit rate, a 40 percent cost reduction, no confirmed false positives in sampled review, and deterministic invalidation, all with no exotic tooling.

The architecture generalises directly to any agent workflow with high query repetition and retrievable answers: customer support crews, internal knowledge-base agents, document processing pipelines. The embedding step is cheap enough that break-even analysis is almost beside the point.

Save the token budget for problems the model actually has to reason through. Let $0.02-per-million embeddings handle the regurgitation.
