Modern AI's "Memory Management": The Engineering Truth Behind KV Cache and PagedAttention

In the inference process of Large Language Models (LLMs), the most expensive resource is not computational power (FLOPs), but memory bandwidth. When you convers

Illustration
Modern AI's "Memory Management": The Engineering Truth Behind KV Cache and PagedAttention

Modern AI's "Memory Management": The Engineering Truth Behind KV Cache and PagedAttention

In the inference process of Large Language Models (LLMs), the most expensive resource is not computational power (FLOPs), but memory bandwidth. When you converse with an AI, the model needs to review all previous dialogue history. If it had to recompute all previous Key and Value vectors every time a new token is generated, inference speed would degrade quadratically.

To solve this problem, the industry introduced KV Cache (Key-Value Cache). However, KV Cache brings another nightmare: memory fragmentation.

KV Cache: The Trade-off of Space for Time

During the decoding phase of a Transformer, the generation of each token depends on the attention weights of all preceding tokens. This means that for the $n$-th token, we need to compute the dot product with the $\text{Key}$ and $\text{Value}$ vectors of the previous $n-1$ tokens.

Since the previous $n-1$ tokens were already computed in earlier steps and their results remain unchanged during inference, we can store these $\text{K}$ and $\text{V}$ vectors in GPU memory (VRAM). Thus, at each step, we only need to compute the $\text{K}$ and $\text{V}$ for the current token and append them to the cache.

The cost is a dramatic increase in VRAM usage. For a Llama-3-70B model at FP16 precision, the KV Cache occupancy per request grows linearly with sequence length. As concurrent requests increase, VRAM fills up rapidly, leading to OOM (Out of Memory) errors.

The Dilemma of Static Allocation: Memory Fragmentation

Early inference frameworks (such as HuggingFace Transformers) adopted a contiguous memory allocation strategy. To ensure performance, the system would pre-allocate a sufficiently large contiguous block of VRAM for each request (e.g., supporting up to 4096 tokens).

This approach led to severe memory waste:
1. Internal Fragmentation: If a user inputs only 10 tokens, the remaining 4086 positions remain occupied and unavailable to others.
2. External Fragmentation: As requests are dynamically added and released, numerous small, non-contiguous gaps appear in VRAM, making it impossible to accommodate new large requests.
3. Reservation Waste: To prevent crashes caused by sequence growth, developers had to conservatively set maximum lengths, further reducing throughput.

PagedAttention: Borrowing from Operating System Virtual Memory

The PagedAttention mechanism proposed by the vLLM team fundamentally changed this landscape. Its core idea is to analogize the LLM's KV Cache to virtual memory pages in an operating system.

Core Mechanism

PagedAttention no longer requires the KV Cache to be stored contiguously in physical VRAM. Instead, it divides the cache into fixed-size blocks.
- Logical Blocks $\rightarrow$ Physical Blocks: The model sees a continuous sequence logically, but underneath, a "Block Table" maps logical indices to non-contiguous physical VRAM blocks.
- Dynamic On-Demand Allocation: A new physical block is allocated for a request only when the current physical block is full.
- Efficient Sharing: During parallel sampling or Beam Search, multiple output sequences can share the same physical blocks for the prefix (prompt), eliminating the need to store identical KV data repeatedly.

Engineering Impact

With PagedAttention, memory utilization for KV Cache has increased from the previous $60\% \sim 80\%$ to nearly over $96\%$. This means that under the same hardware conditions, the number of concurrent requests (Batch Size) a single machine can handle has increased several-fold, directly reducing inference costs and improving response speeds.

Conclusion: Closing the Loop from Algorithms to Engineering

KV Cache optimizes the computational redundancy of Transformers, while PagedAttention optimizes the storage redundancy of KV Cache. This reveals a key trend in modern AI systems: the upper limit of model capability is determined by algorithms, but the lower limit of commercial deployment is determined by engineering implementation. When we discuss "inference acceleration," the focus has shifted from simple operator optimization to deeper resource scheduling and memory management solutions.

Comments

Share your thoughts!

Leave a Comment

0/500

Loading comments…