The "Memory Fragments" of Modern AI Systems: The Evolution from KV Cache to PagedAttention
During Large Language Model (LLM) inference, the dominant cost is often not raw compute but memory: both the bandwidth needed to read it and the capacity needed to hold it. When you converse with an AI, the model has to attend over all previous context. If it recomputed the Keys and Values of every earlier token each time it generated a new one, the work per token would grow with the length of the context, and the total work for a sequence would grow quadratically. To avoid this, the industry introduced the KV Cache (Key-Value Cache).
However, the KV Cache has brought a new nightmare: memory fragmentation.
The Essence and Pain Points of KV Cache
Simply put, the KV Cache keeps the Key and Value vectors of already-processed tokens in GPU memory during Transformer decoding. When generating the next token, the model only computes the Key and Value for the current token and appends them to the cached history.
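To make the mechanism concrete, here is a minimal single-head sketch in plain Python. The weights, the 2-dimensional embeddings, and every name (`KVCache`, `decode_step_cached`, and so on) are made up for illustration; real frameworks do this with batched GPU tensors. It shows that projecting only the newest token and appending to the cache gives the same attention output as re-projecting the whole history.

```python
import math

def project(x, w):
    """Return x @ w for a vector x and a square matrix w given as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    """Holds the Keys and Values of every token processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

def decode_step_cached(x, cache, wq, wk, wv):
    """Project only the newest token, append to the cache, then attend."""
    cache.keys.append(project(x, wk))
    cache.values.append(project(x, wv))
    return attend(project(x, wq), cache.keys, cache.values)

def decode_step_recompute(xs, wq, wk, wv):
    """Baseline without a cache: re-project every token seen so far."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values)

# Toy 2-dimensional embeddings and weights (arbitrary values).
WQ = [[0.5, -0.2], [0.1, 0.3]]
WK = [[0.4, 0.0], [-0.3, 0.6]]
WV = [[1.0, 0.2], [0.0, -0.5]]
TOKENS = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]

cache = KVCache()
cached_outputs = [decode_step_cached(x, cache, WQ, WK, WV) for x in TOKENS]
recomputed_outputs = [
    decode_step_recompute(TOKENS[: i + 1], WQ, WK, WV) for i in range(len(TOKENS))
]
```

The cached path performs one Key/Value projection per token, while the baseline performs 1 + 2 + 3 of them for the same three tokens. That gap is the quadratic growth the cache removes, at the price of keeping every Key and Value resident in memory.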
But in actual production environments, the KV Cache faces three core challenges:
1. Unpredictable Lengths: The number of tokens generated per request varies, so the system cannot know in advance how much memory to allocate.
2. Waste from Static Allocation: To prevent overflow, earlier inference systems typically pre-allocate a contiguous block of memory sized for the maximum sequence length a request might reach (e.g., 32k tokens). Even if a request generates only 10 tokens, it still occupies space for 32k tokens. This reserved-but-unused space is internal fragmentation.
3. External Fragmentation: Because requests have different sizes and lifetimes (some finish quickly while others have just started), freed regions leave gaps in GPU memory that are too small or too scattered to hold a new contiguous allocation.
Together, these effects keep GPU memory utilization low, which directly limits concurrency (batch size) and therefore overall throughput.
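A quick back-of-the-envelope calculation shows how large the waste from static allocation can be. The model shape below (32 layers, 32 KV heads, head dimension 128, 2-byte values) is an assumption picked for illustration, not a figure from any specific model, and "32k" is taken to mean 32,768 tokens.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: one Key vector and one Value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
PER_TOKEN = kv_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2
)

MAX_CONTEXT = 32 * 1024  # slots reserved up front per request
TOKENS_USED = 10         # tokens the request actually needed

reserved_bytes = PER_TOKEN * MAX_CONTEXT
used_bytes = PER_TOKEN * TOKENS_USED
utilization = used_bytes / reserved_bytes

print(f"per token: {PER_TOKEN / 2**20:.2f} MiB")
print(f"reserved:  {reserved_bytes / 2**30:.1f} GiB")
print(f"used:      {used_bytes / 2**20:.1f} MiB ({utilization:.4%} of the reservation)")
```

Under these assumptions each token costs 0.5 MiB of cache, so a full 32k reservation ties up 16 GiB for a request that needs about 5 MiB.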
PagedAttention: Borrowing from Operating System Virtual Memory
To address this problem, the vLLM team proposed PagedAttention. Its core idea is remarkably simple: manage the KV Cache the way an operating system manages memory, using virtual memory and paging.
With traditional contiguous storage, each request's KV Cache must occupy one continuous region of GPU memory. PagedAttention instead splits the KV Cache into fixed-size "pages" (called blocks in vLLM), each holding the Keys and Values for a fixed number of tokens (e.g., 16).
How It Works
- Logical Blocks → Physical Blocks: The model still sees a continuous sequence (logical blocks), but the underlying system maps these to non-contiguous physical GPU memory blocks via a "Block Table."
- Dynamic On-Demand Allocation: A new physical block is allocated for a request only when it needs more space. There is no longer a need to pre-reserve space for the maximum possible length.
- Efficient Sharing: This is one of the most powerful aspects of PagedAttention. In scenarios such as parallel sampling, or multi-turn conversations and requests that reuse a common prefix (for example, the same System Prompt), sequences can share the same physical blocks simply by incrementing each block's reference count. A shared block only needs to be copied when one of the sequences writes to it (copy-on-write).
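The three points above can be sketched as a small bookkeeping model in plain Python. This is a simplified illustration, not vLLM's implementation: the class and method names (`BlockAllocator`, `Sequence`, `fork`, `locate`) are hypothetical, and the sketch only tracks block ids and reference counts without storing any tensor data.

```python
class BlockAllocator:
    """Hands out physical block ids and tracks a reference count for each."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block):
        self.ref_counts[block] += 1

    def release(self, block):
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            del self.ref_counts[block]
            self.free_blocks.append(block)


class Sequence:
    """One request: a token count plus a block table (logical -> physical)."""

    def __init__(self, allocator, block_size=16):
        self.allocator = allocator
        self.block_size = block_size
        self.num_tokens = 0
        self.block_table = []  # index = logical block, value = physical block

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # No block yet, or the last one is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.ref_counts[self.block_table[-1]] > 1:
            # Last block is shared: switch to a private block before writing
            # (copy-on-write; a real system would also copy the block's data).
            old = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(old)
        self.num_tokens += 1

    def fork(self):
        """Create a sequence that shares every existing block with this one."""
        child = Sequence(self.allocator, self.block_size)
        child.num_tokens = self.num_tokens
        child.block_table = list(self.block_table)
        for block in child.block_table:
            self.allocator.share(block)
        return child

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0

    def locate(self, token_index):
        """Translate a token position into (physical block, offset in block)."""
        block = self.block_table[token_index // self.block_size]
        return block, token_index % self.block_size


allocator = BlockAllocator(num_blocks=8)
prompt = Sequence(allocator)
for _ in range(20):      # 20 tokens need 2 blocks of 16 slots
    prompt.append_token()
sample = prompt.fork()   # a parallel sample shares both blocks
sample.append_token()    # writing into the shared, partly filled block
```

After the last line, the two sequences still share the first (full) block, while the sample has moved to its own copy of the second block. The full block is never duplicated, which is where the memory saving for shared prefixes comes from.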
From Theory to Practice: Impact on Inference Performance
By using GPU memory far more efficiently, PagedAttention eases the memory bottleneck in LLM serving:
- Improved GPU Memory Utilization: Paging eliminates external fragmentation and confines internal fragmentation to the last, partly filled block of each sequence. The vLLM team reports that KV Cache waste drops below 4%, i.e. utilization above 96%.
- Throughput Leap: Because more requests fit into the same GPU memory, larger batches can be run on the same hardware. The vLLM authors report throughput gains of roughly 2-4 times over earlier serving systems.
- Long-Text Support: Because blocks are allocated on demand, a long context no longer requires one large contiguous region, so the system is less prone to OOM (Out of Memory) errors.
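The utilization gain can be illustrated with a short slot-counting exercise. The request lengths below are invented for the example, and the calculation ignores everything except KV Cache slots, so the percentages are illustrative rather than a reproduction of any published measurement.

```python
import math

def contiguous_utilization(lengths, max_len):
    """Each request reserves max_len slots regardless of its real length."""
    return sum(lengths) / (len(lengths) * max_len)

def paged_utilization(lengths, block_size):
    """Each request holds only ceil(length / block_size) blocks."""
    slots = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return sum(lengths) / slots

# Hypothetical request lengths (prompt plus generated tokens).
LENGTHS = [37, 512, 1200, 90, 2048]

static = contiguous_utilization(LENGTHS, max_len=2048)
paged = paged_utilization(LENGTHS, block_size=16)

print(f"contiguous reservation: {static:.1%}")
print(f"paged, block size 16:   {paged:.1%}")
```

For these five requests, reserving 2,048 slots each uses about 38% of the reserved space, while 16-token blocks waste only 17 slots in total, because each sequence can lose at most 15 slots in its final block.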
Conclusion
If Continuous Batching solved the temporal scheduling problem in inference, then PagedAttention solved the spatial management problem. It has upgraded AI inference from simple "matrix operations" to complex "resource scheduling." For developers, understanding this evolution means being able to better tune deployment parameters (such as vLLM's gpu_memory_utilization and max_num_seqs) to find the right balance between cost and performance.
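As a rough aid for reasoning about those two parameters, the sketch below estimates how many KV Cache blocks fit in a memory budget and how many sequences they can hold. It is a simplification under stated assumptions: the GPU size, weight size, per-token cache cost, and average length are hypothetical, the helper names are made up, and activation memory and other overheads that a real engine accounts for are ignored.

```python
def kv_block_budget(gpu_bytes, gpu_memory_utilization, weight_bytes,
                    kv_bytes_per_token, block_size=16):
    """Rough number of KV cache blocks that fit after the weights are loaded."""
    usable = round(gpu_bytes * gpu_memory_utilization) - weight_bytes
    if usable <= 0:
        return 0
    return usable // (kv_bytes_per_token * block_size)

def max_concurrent_sequences(num_blocks, avg_tokens_per_seq, block_size=16):
    """How many sequences of a given average length fit in num_blocks."""
    blocks_per_seq = -(-avg_tokens_per_seq // block_size)  # ceiling division
    return num_blocks // blocks_per_seq

GIB = 2**30
blocks = kv_block_budget(
    gpu_bytes=80 * GIB,           # hypothetical 80 GiB GPU
    gpu_memory_utilization=0.9,   # fraction of GPU memory the engine may use
    weight_bytes=14 * GIB,        # hypothetical model weights
    kv_bytes_per_token=512 * 1024,
)
concurrency = max_concurrent_sequences(blocks, avg_tokens_per_seq=1000)

print(f"KV cache blocks: {blocks}")
print(f"sequences of ~1000 tokens that fit: {concurrency}")
```

With these assumed numbers the budget works out to 7,424 blocks, enough for 117 sequences of about 1,000 tokens. Setting max_num_seqs far above what the block budget can hold would not add real concurrency, while setting it far below would leave cache memory idle.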