An inference context memory primer — and why the next AI infrastructure bottleneck is context, not compute.
For the last two years, the AI infrastructure conversation has been dominated by one question: how big is your context window?
That question is no longer sufficient.
A 128K, 200K, or 1M token context window sounds like a model feature. In a product brief, it is a single line item. In production, it becomes a memory problem, a latency problem, a cost problem, a security problem, and a governance problem — sometimes all at once.
The new question is not whether your model can accept this much context. The new question is: can your infrastructure afford, reuse, move, compress, and govern that context?
That is what inference context memory is about. And like the disaggregated inference patterns we examined in the previous piece, it represents a quiet but consequential shift in how large language model systems are actually built.
What is inference context?
Inference context is everything the model needs during a request or session in order to produce its next output. It includes:
- The user prompt
- The system prompt
- Retrieved documents (in RAG workflows)
- Conversation history
- Tool call results
- Agent state and intermediate reasoning artifacts
- The KV cache that materializes all of the above inside the model
- User memory and organizational memory references
This is a longer list than most teams realize. When a chatbot answers a question, the model is not just looking at the latest user message. It is attending over everything the application has stuffed into context — and the cost of that operation is invisible until you start measuring it.
A clarifying point worth making early: the KV cache is not a peer item on this list. It is the model's internal representation of everything else once it has been processed. The user prompt, retrieved documents, and conversation history are raw inputs; the KV cache is what those inputs become inside the model's attention mechanism. This distinction matters when we get to the memory hierarchy, because raw text lives differently from materialized attention state.
Context window versus context memory
Two terms get conflated all the time. They are not the same thing.
The context window is the model's input capacity. It is a property of the model architecture. A model with a 128K window can attend over up to 128K tokens.
Context memory is the system's ability to manage the state created by using that context. It is a property of the inference infrastructure. It includes how the KV cache is stored, where it lives, who can see it, when it expires, what gets reused, and what gets recomputed.
Context window is a model feature. Context memory is an infrastructure problem.

A vendor can give you a million-token context window. They cannot give you the infrastructure discipline to actually use it well. That part is yours to build, buy, or rent.
The KV cache as working memory
In the disaggregated inference piece, we introduced the KV cache as the central artifact of LLM serving. It is worth revisiting that idea here, because the KV cache is also the most important object in any discussion of context memory.
When the model processes a prompt, it computes key and value tensors for each token at each transformer layer. These tensors are stored in the KV cache so that during generation, the model does not need to recompute the entire prompt for every output token. It can simply attend over the stored keys and values.
On long prompts, this is the difference between an LLM that generates each token in milliseconds and one that takes seconds or more per token.
But the KV cache has a property that makes it operationally awkward: it grows linearly with context length, multiplied by concurrency. For a Llama 3.1 70B model in BF16, the per-token KV cache footprint is roughly 320 KB. That sounds small. Multiplied by 128,000 tokens, it becomes about 40 GB. For one sequence.
Four concurrent sequences at 128K context: about 160 GB of KV cache alone, before counting the 140 GB of model weights. That already exceeds the 141 GB of HBM on an H200.

This is the operative number behind almost every infrastructure decision in this article. The KV cache, not the model, is what most often pushes long-context inference off the cliff.
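The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch, assuming the commonly published Llama 3.1 70B shape (80 layers, 8 grouped-query KV heads, head dimension 128) and 2 bytes per element for BF16; the function names are illustrative, not from any library.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed model shape (not stated in the article): 80 layers, 8 KV heads, head dim 128, BF16.
PER_TOKEN = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)


def kv_cache_gb(tokens: int, sequences: int = 1) -> float:
    """Total KV cache in decimal gigabytes for `sequences` sequences of `tokens` tokens each."""
    return PER_TOKEN * tokens * sequences / 1e9


print(PER_TOKEN / 1024)         # KiB per token
print(kv_cache_gb(128_000))     # one 128K-token sequence
print(kv_cache_gb(128_000, 4))  # four concurrent sequences
```

Under these assumptions the per-token footprint is exactly 320 KiB. A single 128K-token sequence comes to roughly 42 GB in decimal units (about 39 GiB), which is where the "about 40 GB" figure comes from.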
Why long-context AI changes infrastructure economics
Long context is being adopted aggressively across enterprise workloads. Legal teams are running contract review over hundreds of pages. Financial analysts are summarizing earnings calls and supporting documents in single passes. Engineering teams are giving coding assistants entire repositories. Customer support copilots are loading product manuals into every conversation. Multi-agent systems are sharing large prompt prefixes across half a dozen agents.
Every one of these workloads increases:
- Prefill cost. The compute required to process input scales at least linearly with input length, and the attention portion scales quadratically before any optimization. Long prompts can make prefill the single most expensive phase of inference.
- KV cache size. Already discussed. Linear in context length, multiplied by concurrency.
- Time to first token. With a long prompt, the user waits while the model reads. For interactive workloads, this is the latency the user actually feels.
- HBM memory pressure. Once HBM is saturated, the system has to make uncomfortable tradeoffs we will catalog shortly.
- Cache transfer complexity. In disaggregated serving, the KV cache may need to move from a prefill worker to a decode worker. Long contexts mean larger transfers.
- Storage tier decisions. Not all context state can live in HBM. Where else does it live, and how does it get back?
What used to be a model performance question is now a systems engineering question.
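The prefill-cost item in the list above can be made concrete with a back-of-the-envelope estimate. The sketch below uses two standard approximations: about 2 FLOPs per parameter per token for the weight matmuls, and about 4 x layers x model width x tokens squared for attention. It ignores causal-masking savings and kernel-level optimizations. The 70B-class shape is an assumption chosen for illustration.

```python
def prefill_flops(tokens: int, n_params: float, n_layers: int, d_model: int) -> tuple:
    """Return (dense_flops, attention_flops) for one forward pass over `tokens` tokens."""
    # Weight matmuls: roughly 2 FLOPs per parameter per token, so linear in tokens.
    dense = 2.0 * n_params * tokens
    # Attention: QK^T and the weighted sum over V each cost about 2 * T^2 * d_model
    # per layer, so this term is quadratic in tokens.
    attention = 4.0 * n_layers * d_model * tokens ** 2
    return dense, attention


# Assumed 70B-class shape: 70e9 parameters, 80 layers, model width 8192.
SHAPE = dict(n_params=70e9, n_layers=80, d_model=8192)

for t in (2_000, 32_000, 128_000):
    dense, attention = prefill_flops(t, **SHAPE)
    print(f"{t:>7} tokens: attention/dense = {attention / dense:.2f}")
```

With these assumptions the quadratic term is a small fraction of the work at a few thousand tokens and the larger share at 128K tokens, which is why long prompts change the cost profile rather than just scaling it.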
The memory hierarchy for AI inference
Not all memory in an inference stack serves the same purpose. Production systems need a hierarchy, and they need a policy for what lives where.
| Tier | Typical capacity | Bandwidth | Primary purpose |
| --- | --- | --- | --- |
| GPU HBM | 80–192 GB per GPU | 3–8 TB/s | Active KV cache, model weights |
| CPU DRAM | 1–4 TB per node | ~100–400 GB/s | KV cache offload, hot tiers |
| Local NVMe | 4–60 TB per node | 6–14 GB/s | Cold KV cache spill, session state |
| RDMA-pooled cluster cache | 10s–100s of TB | 10–50 GB/s effective | Cross-node KV reuse |
| Vector database | Application-defined | ms-class lookup | Retrieval, not inference state |
| Object storage | Effectively unlimited | s-class latency | Audit, archival, long-term memory |

A few things are worth noting about this table.
First, the bandwidth gap between HBM and CPU DRAM is roughly an order of magnitude, and the gap between DRAM and NVMe is another order of magnitude. These cliffs are why memory placement decisions matter so much. Spilling KV cache from HBM to DRAM is not free, and spilling to NVMe costs far more.
Second, the pooled cluster cache row, often just called a “distributed cache,” is not actually a single rung. It is a cross-cutting layer that can be implemented at multiple tiers. LMCache can back onto CPU DRAM. Mooncake pools DRAM and SSD across an entire cluster and moves cache between nodes over RDMA. Some implementations use object storage as a colder tier. The point is not that there is one distributed cache layer but that production systems can build a logical cache that spans multiple physical tiers.
Third, vector databases are often mentioned in the same breath as inference memory, but they are not the same thing. A vector database is for retrieval — turning a query into relevant documents. Once those documents are in the prompt, they become inference context, and from that point on it is the KV cache, not the vector store, that determines cost and latency.
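To put the bandwidth cliffs in the table above into time units, the sketch below estimates how long it takes to stream a 40 GB KV cache once from each tier. The bandwidth values are illustrative picks from inside the table's ranges, and the result is a lower bound that ignores latency, contention, and the interconnect between the tier and the GPU, which is often the real limit.

```python
# Illustrative bandwidths in GB/s, chosen from within the ranges in the table (assumptions).
TIER_BANDWIDTH_GBPS = {
    "GPU HBM": 4000.0,
    "CPU DRAM": 200.0,
    "Local NVMe": 10.0,
}


def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Lower-bound time to stream `size_gb` once at `bandwidth_gbps`."""
    return size_gb / bandwidth_gbps


KV_GB = 40.0  # roughly one 128K-token sequence from the earlier example
for tier, bandwidth in TIER_BANDWIDTH_GBPS.items():
    print(f"{tier:12s} {transfer_seconds(KV_GB, bandwidth) * 1000:8.1f} ms")
```

Under these assumptions the same cache takes about 10 ms to read from HBM, 200 ms from DRAM, and 4 seconds from NVMe. During decode that read happens for every generated token, which is why placement policy matters more than raw capacity.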
Context reuse — and what makes it actually work
The cheapest context is the context you do not have to recompute.
Prefix caching is the simplest and most common form of context reuse. If two requests share the first 10,000 tokens (because they share the same system prompt and the same retrieved documents, for instance), the inference engine can reuse the KV cache for those 10,000 tokens and only run prefill on the divergent suffix. The savings can be dramatic. For a 20K-token RAG prompt with a stable prefix and a short user query at the end, prefix caching can take time-to-first-token from seconds to milliseconds.

But there are two technical points about prefix caching that are usually glossed over.
Prefix caching requires exact token-sequence prefix matches. One changed or inserted token at position 5 invalidates the cache from position 5 onward, because every later token's cached state was computed with that token in its history. This is why prompt structure matters for cache friendliness. Stable elements should come first; volatile elements should come last. A typical cache-friendly RAG prompt structure looks like:
- System prompt (stable across all requests)
- Retrieved documents in a stable canonical order (stable per query)
- Conversation history (stable per session)
- User query (volatile)
If your application interleaves volatile content with stable content — say, by injecting timestamps every few paragraphs — your cache hit rate will collapse and you will not know why your inference bill is rising.
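The exact-match rule and the ordering advice above can be illustrated with a toy block-hash cache. This is a sketch of one common approach (hashing fixed-size token blocks chained to the hash of everything before them), not the implementation of any particular engine. The block size and all names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; an assumption, real engines typically use larger blocks


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything that precedes it."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache, tokens, block_size=BLOCK_SIZE):
    """Number of leading tokens whose KV blocks are already present in `cache`."""
    hit_blocks = 0
    for digest in block_hashes(tokens, block_size):
        if digest not in cache:
            break
        hit_blocks += 1
    return hit_blocks * block_size


stable = list(range(100, 116))             # 16 tokens: system prompt plus documents
cache = set(block_hashes(stable + [1, 2, 3, 4]))

same_prefix = stable + [9, 9, 9, 9]        # only the user query differs
timestamp_first = [999] + stable + [1, 2, 3, 4]  # volatile token placed at the front
edited_early = stable[:5] + [0] + stable[6:] + [1, 2, 3, 4]  # token at index 5 changed

print(cached_prefix_tokens(cache, same_prefix))
print(cached_prefix_tokens(cache, timestamp_first))
print(cached_prefix_tokens(cache, edited_early))
```

The request that only changes the trailing query reuses all 16 prefix tokens. The one with a volatile token at the front reuses none, and the one with an early edit reuses only the single block before the change.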
RadixAttention generalizes prefix caching to a radix tree. The SGLang inference engine indexes prefixes across all requests in a tree structure, allowing fine-grained sharing of cached prefixes between requests that share partial prefixes but diverge at different points. This is meaningfully better than naïve per-session caching and is now appearing in multiple production stacks. For workloads with many overlapping but non-identical prompts — multi-agent systems are the canonical example — RadixAttention can produce cache hit rates that simple session caching cannot reach.
The cheapest context is the context you do not have to recompute. The second cheapest is the context you can reuse from a sibling request.
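The tree-indexed sharing described for RadixAttention can be sketched with a plain token trie. This is a deliberate simplification: a real radix tree compresses runs of tokens into single edges and needs an eviction policy, and this code is not SGLang's implementation. The class and method names are hypothetical.

```python
class PrefixTrie:
    """Token-level trie: each node stands for the cached KV state of one prefix."""

    def __init__(self):
        self.children = {}  # token id -> PrefixTrie

    def insert(self, tokens):
        """Record that KV state now exists for every prefix of `tokens`."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def longest_prefix(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, matched = self, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched


system = [1, 2, 3]                 # shared system prompt
agent_a = system + [10, 11]
agent_b = system + [10, 12]        # diverges from agent_a at the last token
agent_c = system + [20]            # diverges right after the system prompt

trie = PrefixTrie()
trie.insert(agent_a)
hit_b = trie.longest_prefix(agent_b)
trie.insert(agent_b)
hit_c = trie.longest_prefix(agent_c)
print(hit_b, hit_c)
```

Each sibling request reuses as much as it shares with any earlier request, rather than only with its own session. That is the property per-session caching lacks.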
The position encoding constraint
There is a constraint on cache reuse that most articles skip over, and it is worth understanding because it shapes what systems can actually do.
Most modern LLMs use rotary position embeddings (RoPE) or similar position-dependent encodings. This means a KV cache entry computed at position N inside the input cannot be trivially reused at position M. The encoding is tied to where the token sits in the sequence.
The practical consequences:
- You cannot naïvely splice KV caches from two different retrieved documents together in arbitrary order. If document A was cached as positions 0–999 and document B was cached as positions 0–999, you cannot just paste them together to get positions 0–1999, because document B's positions are wrong in the combined sequence.
- Prefix caching works because positions are preserved across runs of the same prefix. The system prompt is always at position 0, the retrieved documents are always at the same offset, and so on.
- Reordering retrieved chunks between requests breaks cache reuse, even when the chunks themselves are identical.
This is not an academic concern. It is one of the reasons enterprise RAG systems that look like they should be cache-friendly often are not.
Research systems like CacheBlend address this by selectively recomputing a small fraction of tokens whose cached state no longer matches their new position and surrounding context. The idea is a partial re-prefill, much cheaper than a full one, that repairs the position- and context-dependent state without recomputing the cache from scratch. CacheGen, often mentioned alongside it, tackles a different problem: compressing the KV cache so it can be stored and transferred more cheaply.
For architecture leaders, the takeaway is this: cache reuse is a function of how stable your prompt construction is, including the order of components. If your retrieval system returns chunks in scoring order and the order changes between requests, you are paying for prefill you could have avoided.
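The position dependence described in this section can be seen in a two-dimensional toy version of rotary embeddings. This is a minimal sketch: real RoPE applies the same rotation to many dimension pairs at different frequencies, and the frequency and vectors here are arbitrary assumptions.

```python
import math

THETA = 0.01  # assumed rotation frequency for this single 2-D pair


def rope(vec, pos):
    """Rotate a 2-D (x, y) pair by pos * THETA radians, as RoPE does per dimension pair."""
    angle = pos * THETA
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)


def score(q, k):
    """Dot-product attention score between a rotated query and a rotated key."""
    return q[0] * k[0] + q[1] * k[1]


key, query = (1.0, 0.5), (0.3, -0.8)

# The same token produces a different cached key depending on its position.
k_at_10 = rope(key, 10)
k_at_1010 = rope(key, 1010)

# Scores depend only on the query-key offset (40 in both cases), so a prefix
# cached at fixed positions stays valid on later requests.
s_original = score(rope(query, 50), rope(key, 10))
s_shifted = score(rope(query, 1050), rope(key, 1010))

# Pasting the key cached at position 10 into slot 1010 gives a different score.
s_spliced = score(rope(query, 1050), k_at_10)

print(s_original, s_shifted, s_spliced)
```

The first two scores agree and the third does not. That is the splice problem in miniature: the cached key is correct for where it was computed, not for where it was pasted.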
Context compression
Compression is the lever that turns memory pressure into a tunable knob. Three families of techniques matter in production.
KV cache quantization
KV cache quantization stores keys and values in lower precision than the model itself runs in. FP16 or BF16 KV cache is the default. FP8 or INT8 KV cache cuts the footprint roughly in half. INT4 cuts it to roughly a quarter, at a larger quality cost.
Note that this is separate from weight quantization. A model can run in BF16 weights with FP8 KV cache, and the quality impact of the latter is typically much smaller than the impact of weight quantization. For long-context workloads where KV cache dominates HBM usage, KV cache quantization is one of the highest-leverage production techniques available.
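As a rough illustration of what KV cache quantization does to one cached vector, the sketch below applies symmetric per-vector INT8 quantization in plain Python. Production engines use fused GPU kernels and often finer-grained scales; the function names and sample values are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the codes."""
    return [c * scale for c in codes]


def footprint_ratio(bits, baseline_bits=16):
    """Cache size relative to a 16-bit baseline, ignoring the small scale metadata."""
    return bits / baseline_bits


key_vector = [0.12, -1.5, 0.73, 0.0, 2.54]   # made-up values for one cached key
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes, scale, max_error)
print(footprint_ratio(8), footprint_ratio(4))
```

The reconstruction error per element is bounded by half the scale, so one outlier value sets the error for the whole vector. That is one reason finer-grained scales are common in practice.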
Eviction policies
Not all tokens in a long context contribute equally to future generation. Several techniques exploit this.
H2O (Heavy-Hitter Oracle) observes that a small subset of tokens, the “heavy hitters,” receive most of the attention from later tokens. Keep those along with the most recent tokens; evict the rest. The KV cache shrinks dramatically with little quality loss on many workloads.
StreamingLLM attention sinks preserve the first few tokens of the sequence (which act as anchoring “attention sinks” for stability) plus a sliding window of recent tokens, and discard the middle. This allows effectively infinite context length with bounded memory, at the cost of some long-range recall.
Sliding window attention is built into some model architectures, like several Mistral variants. The model itself attends only to a fixed-size recent window, so the KV cache size is bounded by the window, not the full input length.
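The first two eviction ideas reduce to choosing which token positions to keep. The sketch below shows both selections as plain index computations. It is a simplification: the published methods operate per layer and per head and update their choices during decoding, and the parameter names and defaults here are assumptions.

```python
def sink_plus_window(seq_len, n_sink=4, window=8):
    """StreamingLLM-style keep set: the first `n_sink` positions plus the last `window`."""
    sinks = range(min(n_sink, seq_len))
    recent = range(max(n_sink, seq_len - window), seq_len)
    return list(sinks) + list(recent)


def heavy_hitters(attn_mass, budget, n_recent=2):
    """H2O-style keep set: recent tokens plus the older tokens with the most attention.

    `attn_mass[i]` is the accumulated attention that position i has received so far.
    """
    seq_len = len(attn_mass)
    recent = list(range(max(0, seq_len - n_recent), seq_len))
    older = range(0, max(0, seq_len - n_recent))
    n_heavy = max(0, budget - len(recent))
    heavy = sorted(older, key=lambda i: attn_mass[i], reverse=True)[:n_heavy]
    return sorted(heavy) + recent


print(sink_plus_window(20))
print(heavy_hitters([0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.1, 0.2], budget=4))
```

In both cases the cache size is set by the budget rather than by the input length. What is given up is whatever information lived in the discarded positions.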
Token-level prompt compression
Tools like LLMLingua and its successors compress the prompt text itself before it enters the model, removing tokens that contribute little to the model's output. This trades a small upfront compute cost (running the compressor) for a much smaller prompt — which means smaller prefill cost and smaller KV cache for the rest of the request lifecycle.
For long retrieved contexts in RAG, prompt compression can reduce token counts by 50% or more with modest quality impact on many tasks.
PagedAttention
PagedAttention deserves its own callout. Introduced in vLLM, it manages KV cache in fixed-size pages similar to virtual memory in an operating system. Instead of allocating contiguous KV cache memory per sequence — which leads to fragmentation when sequences have wildly different lengths — PagedAttention allocates pages on demand and lets the inference engine pack many variable-length sequences into available memory efficiently.
This was one of the most consequential systems innovations in LLM serving over the last few years. It is now table stakes. Most modern inference engines either implement PagedAttention or something functionally equivalent.
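The on-demand paging idea can be sketched as a free list plus a per-sequence page table. This toy allocator only tracks page ownership. It is not vLLM's implementation and omits features such as page sharing between sequences; all names are hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: sequences own lists of fixed-size pages, not contiguous ranges."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; returns False when no page is available."""
        length = self.lengths.get(seq_id, 0)
        table = self.page_tables.get(seq_id, [])
        if length == len(table) * self.page_size:  # last page is full, or none allocated yet
            if not self.free_pages:
                return False
            table.append(self.free_pages.pop())
            self.page_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):
    alloc.append_token("a")        # 20 tokens -> 2 pages
for _ in range(5):
    alloc.append_token("b")        # 5 tokens -> 1 page
results_c = [alloc.append_token("c") for _ in range(17)]  # the 17th token needs a 5th page
alloc.free("a")
retry_c = alloc.append_token("c")  # succeeds once pages are returned
print(results_c.count(True), retry_c, len(alloc.free_pages))
```

Because pages are handed out one at a time, a short sequence never reserves memory for a length it does not reach, and a finished sequence's pages are immediately usable by any other.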
Memory pressure: what actually happens when HBM runs out
When KV cache demand exceeds HBM capacity, the inference system has to choose one of four costs. There is no fifth option.
Evict cache. The system drops some KV cache entries. The next time those tokens are needed, the system pays the re-prefill cost. This trades future compute for present memory.
Spill to slower tiers. Move some KV cache from HBM to CPU DRAM, or from DRAM to NVMe. The cache is still there, but every access pays a latency penalty. For decode-heavy workloads where every output token touches the cache, this penalty is paid many times.
Reject or queue the request. The system tells the user, or the calling application, to wait or try again. This is the cleanest from a systems perspective and the worst from a user experience perspective.
Shrink the batch. Reduce the number of concurrent sequences being processed. The remaining sequences fit in HBM and run at full speed, but overall throughput drops.
A well-designed inference stack makes these tradeoffs explicitly via policy. A poorly designed one makes them by accident, and the failure mode is usually “everything got slow at some point yesterday and we don't know why.”
For an enterprise architecture team, the question to ask is: “When our HBM saturates, which of these costs are we paying, and is that the cost we would have chosen?”
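One way to make the choice explicit is to write the policy down as code. The sketch below is purely illustrative: the inputs, the ordering of the rules, and every name are assumptions, and a real policy would weigh SLOs, tenant priorities, and re-prefill cost estimates.

```python
from enum import Enum


class Action(Enum):
    EVICT = "evict"
    SPILL = "spill"
    SHRINK_BATCH = "shrink_batch"
    QUEUE = "queue"


def choose_action(needed_gb, idle_cache_gb, dram_free_gb, interactive):
    """Pick which of the four costs to pay when a request needs `needed_gb` of HBM."""
    # 1. Cache that no active request is reading is the cheapest thing to give up.
    if idle_cache_gb >= needed_gb:
        return Action.EVICT
    # 2. Batch workloads can tolerate slower cache reads from DRAM.
    if not interactive and dram_free_gb >= needed_gb:
        return Action.SPILL
    # 3. Interactive traffic keeps per-request speed and gives up throughput.
    if interactive:
        return Action.SHRINK_BATCH
    # 4. Otherwise the caller waits.
    return Action.QUEUE


print(choose_action(needed_gb=10, idle_cache_gb=12, dram_free_gb=0, interactive=True))
print(choose_action(needed_gb=10, idle_cache_gb=2, dram_free_gb=64, interactive=False))
```

The specific ordering matters less than the fact that it is written down, versioned, and observable, so the answer to "which cost are we paying?" is a lookup rather than an investigation.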
Software requirements
A production-grade inference stack that takes context memory seriously needs the following layers. Most of these were touched on in the disaggregated inference piece, but here they are organized around context rather than around prefill/decode.
Context router. Decides where each request goes based on cache locality, topology, model availability, and SLO targets. The router is the brain of the system. Bad routing means good caches don't get used.
KV cache manager. Tracks what cache exists, where it lives, which requests share it, when it should be evicted, and when it should be transferred. This is the bookkeeping layer that makes everything else possible.
Memory tiering layer. Moves cache between HBM, DRAM, NVMe, and pooled remote tiers based on access patterns and capacity pressure. Implements the “spill” and “promote” operations that the manager decides on.
Compression engine. Applies KV quantization, eviction policies, and prompt compression based on configuration and workload characteristics.
Policy engine. Enforces who is allowed to put what into context, what content is sensitive, how long cache can live, and which tenants can share cache state.
Observability. Tracks time-to-first-token, inter-token latency, cache hit rate, HBM pressure, transfer latency, and cost per million tokens. Without this, the team is flying blind.
Audit trail. Records what was retrieved, what was cached, what was shared across tenants, and what was eventually deleted. This is increasingly a compliance requirement, not just a nice-to-have.
Retention and deletion controls. Enforces TTL on cache, supports right-to-be-forgotten requests, and ensures that when a user deletes their data, the deletion propagates to the inference layer and not just the application database.
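As an illustration of the context router's job, the sketch below routes each request to the worker holding the longest matching cached prefix, unless that worker is saturated. The data model, the capacity limit, and the tie-breaking rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: list = field(default_factory=list)  # list of token-id lists


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(tokens, workers, max_active=8):
    """Pick the worker with the longest cached prefix among those with spare capacity."""
    candidates = [w for w in workers if w.active_requests < max_active] or workers

    def rank(worker):
        best_hit = max((shared_prefix_len(tokens, p) for p in worker.cached_prefixes), default=0)
        return (best_hit, -worker.active_requests)  # prefer cache hits, then lighter load

    return max(candidates, key=rank)


workers = [
    Worker("w1", active_requests=2, cached_prefixes=[[1, 2, 3, 4, 5]]),
    Worker("w2", active_requests=0),
    Worker("w3", active_requests=8, cached_prefixes=[[1, 2, 3, 4, 5, 6]]),  # full
]

print(route([1, 2, 3, 4, 5, 6, 7], workers).name)  # best available cache hit
print(route([9, 9], workers).name)                 # no hit anywhere: least loaded
```

Note that the worker with the best cache match is skipped because it is saturated. Whether to wait for it instead is exactly the kind of tradeoff the router and policy engine have to settle together.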
Context governance and security
Once a production inference stack treats context memory as a first-class resource, governance becomes unavoidable. A short list of questions every enterprise will eventually need to answer.
Who is allowed to put what into context? A user-facing chatbot should not be able to retrieve and inject another customer's data into a prompt. Sounds obvious. Implementation is harder than it sounds, especially when retrieval is mediated by embeddings that do not respect tenant boundaries unless explicitly configured.
Can sensitive documents enter a prompt at all? Some classes of data — protected health information, payment card data, certain categories of legal documents — may need to stay out of model context regardless of who is asking.
Can retrieved content be reused across users? A FAQ document is reasonable to share. An internal policy document for one customer is not.
How long should context state live? A KV cache entry for an active session is reasonable to keep for hours. A KV cache entry from a deleted session should not survive the deletion.
Should agent memory decay? Long-running agents accumulate state. At what rate should that state be summarized, archived, or discarded? Most teams have not yet decided.
Should memory be auditable? For regulated industries, the answer is yes. For consumer applications, the answer may be a function of how much the user wants their data persisted.
There is one security concern worth pulling out specifically: cross-tenant cache reuse is not a performance choice. It is a security control.
If KV cache is keyed by hash of token sequence and shared globally across tenants, two different organizations with the same system prompt can have their cache behavior interact in measurable ways. At minimum this is a timing side channel — a careful attacker can probe what is in the shared cache by measuring response latencies. In pathological cases involving retrieval and partial cache reuse, it can leak content. Most production systems scope caches by tenant key for this reason. Some go further and scope by user within a tenant.
If your cache reuse strategy is more aggressive than your tenancy model, you have a security problem you have not noticed yet.
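Scoping the cache by tenant can be as simple as making the tenant part of the cache key. This is a minimal sketch, assuming a hash-keyed cache; the function name and identifiers are hypothetical, and a real system would also need to scope eviction, metrics, and deletion the same way.

```python
import hashlib
from typing import Optional


def cache_key(tenant_id: str, token_ids, user_id: Optional[str] = None) -> str:
    """Cache key covering the tenant (and optionally the user) as well as the tokens."""
    # Null-byte separators keep ("a", "b...") and ("ab", "...") from colliding.
    parts = [tenant_id, user_id or "", ",".join(map(str, token_ids))]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


shared_system_prompt = [101, 7, 42, 9]

key_acme = cache_key("acme", shared_system_prompt)
key_globex = cache_key("globex", shared_system_prompt)
key_acme_again = cache_key("acme", shared_system_prompt)
key_acme_alice = cache_key("acme", shared_system_prompt, user_id="alice")

print(key_acme == key_acme_again)  # same tenant, same tokens: reuse is allowed
print(key_acme == key_globex)      # different tenants never share an entry
```

Two tenants with identical prompts now get distinct entries, so one tenant's traffic cannot warm or probe the other's cache. The price is a lower hit rate, which is the tradeoff the tenancy model is supposed to decide.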
Agent memory: a useful typology
For the agent-platform audience, it is worth borrowing the cognitive-science vocabulary that has been quietly standardizing across the agent literature. Four types of memory map cleanly onto the inference memory hierarchy.
Working memory is the current in-context state. In LLM terms, this is the KV cache. It is fast, volatile, and exists only for the duration of the active session.
Episodic memory is past sessions, retrievable on demand. This is what most “memory” features in commercial AI products actually implement — a store of past conversations that gets vectorized and selectively recalled when relevant. It lives in vector databases, not in HBM.
Semantic memory is consolidated facts about the user, the organization, or the world. “This user prefers terse responses.” “This company's fiscal year ends in March.” Structured, queryable, durable. Typically backed by application databases and surfaced into context as needed.
Procedural memory is learned patterns about how to do things. In current AI systems, this is often expressed as few-shot examples, prompt scaffolding, or — in more sophisticated systems — fine-tuned adapters or in-weight specializations. It lives in the model itself or in adapter caches.
Working memory in HBM. Episodic in vector stores. Semantic in application databases. Procedural in model weights. This mapping gives architecture teams a vocabulary that connects the agent product surface to the underlying inference infrastructure. Without it, “memory” becomes an undifferentiated bucket where everything from session state to vector recall to fine-tuning gets mixed together in conversation.
Multimodal context: the next step-function
A brief callout: every curve in this article gets steeper when modalities mix.
A single high-resolution image, depending on the model, can become hundreds to thousands of tokens. A minute of video, depending on the model's frame sampling and tokenization strategy, can become tens of thousands. Audio is comparatively cheap but still material at length.
The implications for context memory are direct. A multimodal RAG system that retrieves three images and a transcript may have a prompt that looks short in characters but is enormous in tokens, with corresponding KV cache footprint. A video understanding agent that watches a thirty-minute recording is doing a long-context inference workload regardless of whether the user thinks of it that way.
This article focuses on text because the systems patterns are clearer there, but the patterns extend. If anything, multimodal workloads amplify every concern above — they pay more prefill, they generate larger KV caches, they spend more HBM, and they get less out of prefix caching because each modality's tokens are typically less reusable across requests than text tokens are.
When this matters
Inference context memory is most relevant when:
- Context windows are large, typically 32K and above
- Documents in the workload are long
- Users have recurring multi-turn sessions
- Agents perform multi-step workflows with shared context
- Retrieved context is reused across requests
- Latency and cost matter materially to the business case
- Data sensitivity is high enough that governance is a hard requirement
When it may not matter
It is fair to skip most of this for workloads where:
- Prompts are short, typically a few thousand tokens or fewer
- Sessions are stateless
- Workloads are low volume
- Outputs are simple and short
- The application does not reuse context across requests
- Existing latency and cost are already acceptable
The same discipline that applies to disaggregated inference applies here: do not over-engineer. Most small applications do not need a context memory architecture. They need a working API call. The right question is not “should we build this?” but “where is our actual bottleneck?”
The business case
The business case for taking context memory seriously is the same shape as the case for disaggregated inference. Better cache reuse reduces cost per token. Better context placement improves latency, particularly time-to-first-token. Better memory management enables workflows that would otherwise be infeasible — million-token contexts, long-running agents, multi-tenant RAG at scale.
Poor context design silently inflates cloud spend. We have seen enterprise AI deployments where the difference between a well-tuned context strategy and a careless one was roughly an order of magnitude in monthly inference cost, with no visible difference in user experience until the bill arrived. The careless deployment was paying for prefill it could have cached, KV cache it could have compressed, and HBM it could have shared.
Governance is the other side of the case. As enterprise AI deployments come under regulatory scrutiny — data residency, right to deletion, audit requirements — the inference layer becomes part of the compliance perimeter. A team that has not thought about context memory will discover this the hard way during their first audit.
For hyperscalers and model providers, the strategic angle is sharper. Inference is a continuous margin problem in a way training is not. Every user request consumes compute, memory, and network resources. The platforms that figure out context memory will run their fleets more efficiently than the ones that do not, and the difference will compound over time.
The future: a database engine for tokens
The inference stack is converging on something that looks structurally familiar to anyone who has worked on database systems.
There is a query planner — the request router that decides where each request should be served and which caches to consult. There is a buffer pool — the KV cache manager that decides what stays hot in HBM and what spills to colder tiers. There is a storage hierarchy — HBM, DRAM, NVMe, pooled remote cache, each with different access characteristics. There is a transaction log — the audit and retention layer that records what context entered the system and what eventually left it. There is replication — caches distributed across workers and, in larger deployments, across regions. There is an optimizer — cost-aware routing, compression policy, eviction policy, batching strategy, all dynamically chosen.
The lineage is not coincidental. The problems are recognizably the same problems database systems have been solving for forty years: managing fast volatile state against slow durable state, optimizing reuse, enforcing access policy, meeting latency targets under unpredictable load. The vocabulary of database systems will increasingly be the vocabulary of inference systems, and the engineers who learn to think this way will build the platforms that scale.
For now, most enterprise teams will consume this capability rather than build it. Their cloud providers and inference platforms will manage the buffer pool and the query planner. But the teams that understand what is happening inside those platforms will make better decisions about model choice, prompt design, retrieval architecture, and vendor selection. The ones that do not will keep paying for prefill they could have cached and HBM they could have shared.
Final takeaway
The first wave of generative AI infrastructure was about training capacity. The second wave was about getting access to GPUs. The wave we are in now is about inference efficiency, and inside that wave, context memory is one of the most consequential and least appreciated layers.
The old model was simple: send a prompt to a model, get a response.
The new model is more sophisticated. Manage what enters context, manage what is cached, manage where the cache lives, manage who can see it, manage when it expires, manage how it compresses, and manage what it costs.
As context windows grow, as agents become more common, and as enterprise AI moves from experiment to production, this layer will become harder to ignore. The platforms that treat context memory as a first-class infrastructure concern will have lower costs, better latency, stronger security, and more durable competitive positions than the ones that treat it as an implementation detail.
Context window is a model feature. Context memory is an infrastructure problem. The companies that figure out the second one will quietly outperform the ones that only optimized for the first.


