"Memory Expansion" for Modern AI Systems: A Deep Dive into KV Cache Compression and Quantization

In Large Language Model (LLM) inference, the scarcest resource is often not compute (FLOPs) but VRAM bandwidth and capacity. In a long conversation, the model must attend to all previous context. To avoid recomputing the keys and values of every earlier token each time a new token is generated, inference systems store them in the KV Cache (Key-Value Cache).

However, the KV Cache is a massive "memory hog." It grows linearly with context length and with the number of concurrent requests, quickly filling up GPU VRAM. This lowers the maximum concurrency (batch size) a GPU can serve and can even trigger OOM (Out of Memory) errors.

This article explores how modern AI systems optimize the KV Cache through Compression and Quantization, achieving longer context windows and higher throughput without significantly sacrificing accuracy.

1. The Essence of KV Cache: Why Is It So Large?

In the Transformer's attention mechanism, each token generates a Query (Q), Key (K), and Value (V) vector at every layer.
- Query: What the current token is "looking for."
- Key: What information historical tokens "contain."
- Value: The actual content provided by historical tokens.

During autoregressive generation, the K and V vectors of historical tokens remain unchanged in subsequent steps. By caching them, the system only needs to compute the QKV for the current new token.
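
The caching idea can be illustrated with a toy single-head attention loop. This is a minimal sketch, not a real model: `project` is a hypothetical stand-in for the learned Q/K/V projections, and the embeddings are made-up numbers. It shows only that appending each new token's K and V to a cache yields the same outputs as recomputing K and V for the whole prefix at every step.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(x, factor):
    # Hypothetical stand-in for a learned projection matrix (W_q, W_k or W_v).
    return [xi * factor for xi in x]

def decode_with_cache(embeddings):
    cached_keys, cached_values, outputs = [], [], []
    for x in embeddings:
        # Only the new token's Q, K and V are computed at each step.
        q, k, v = project(x, 1.0), project(x, 0.5), project(x, 2.0)
        cached_keys.append(k)
        cached_values.append(v)
        outputs.append(attend(q, cached_keys, cached_values))
    return outputs

def decode_without_cache(embeddings):
    outputs = []
    for t in range(len(embeddings)):
        # K and V for the entire prefix are recomputed at every step.
        prefix = embeddings[: t + 1]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        outputs.append(attend(project(embeddings[t], 1.0), keys, values))
    return outputs
```

The uncached version does work proportional to the prefix length just to rebuild K and V at each step, which is the cost the cache trades for memory.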

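VRAM Usage Formula (per request):
$\text{Size} = 2 \times \text{layers} \times \text{kv\_heads} \times \text{dim\_head} \times \text{bytes\_per\_element} \times \text{seq\_len}$

The leading factor of $2$ accounts for storing both K and V. Note that the head count is the number of key/value heads, which is smaller than the number of query heads in models that use Grouped-Query Attention (GQA).

Taking Llama-3-8B (FP16 precision) as an example:
- Layers $= 32$
- Query heads $= 32$, KV heads $= 8$ (GQA)
- Dimension per head $= 128$
- Bytes per element $= 2$ (FP16)

For $1024$ tokens, the KV Cache for a single request requires approximately $0.125\text{GB}$. A model of the same shape with full multi-head attention ($32$ KV heads) would need $0.5\text{GB}$. When the context length hits $32\text{k}$, a single Llama-3-8B request needs about $4\text{GB}$, and $100$ concurrent requests of $1024$ tokens need about $12.5\text{GB}$ for the cache alone, on top of the model weights.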
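
The formula is easy to check with a few lines of Python. The function name and parameter names below are illustrative, and the model shapes are assumptions: 32 layers, head dimension 128, FP16, evaluated both with 8 KV heads (the GQA configuration of Llama-3-8B) and with 32 KV heads (full multi-head attention).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_element, seq_len, batch_size=1):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element * seq_len * batch_size

GIB = 1024 ** 3

gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, bytes_per_element=2, seq_len=1024)
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, bytes_per_element=2, seq_len=1024)
long_ctx = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, bytes_per_element=2, seq_len=32 * 1024)

print(gqa / GIB)       # 0.125
print(mha / GIB)       # 0.5
print(long_ctx / GIB)  # 4.0
```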

2. KV Cache Quantization: From FP16 to INT8/INT4

Quantization is the most direct way to shrink the cache: it converts high-precision floating-point values into lower-precision formats such as 8-bit or 4-bit integers.

FP16 $\rightarrow$ INT8 / FP8

By mapping $\text{FP16}$ values to $\text{INT8}$ or $\text{FP8}$ (such as the E4M3/E5M2 formats supported by NVIDIA H100), KV Cache memory is roughly halved; the stored scale factors add a small overhead. To minimize precision loss, Per-token Quantization or Per-channel Quantization is typically used, maintaining one scale factor for each token vector or for each channel.
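
As an illustration of per-token quantization, the sketch below applies symmetric absmax INT8 quantization to one token's vector. It assumes a single scale per vector and plain Python lists; production kernels operate on GPU tensors and may use different rounding or asymmetric schemes. The function names are hypothetical.

```python
def quantize_per_token_int8(vector):
    """Symmetric absmax quantization: one scale factor for the whole vector."""
    absmax = max(abs(x) for x in vector)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the INT8 range used here.
    quantized = [max(-127, min(127, round(x / scale))) for x in vector]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floating-point values from integers and the scale."""
    return [q * scale for q in quantized]

key_vector = [0.02, -1.5, 0.75, 3.2]
q, scale = quantize_per_token_int8(key_vector)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why a vector with one very large element loses resolution for all of its small elements.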

INT4 Quantization and Outlier Handling

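Compressing further to $\text{INT4}$ saves $75\%$ of the space relative to $\text{FP16}$. However, the KV Cache contains a small number of "outliers" with very large magnitudes, and these values matter for model quality. With only 16 quantization levels, a single outlier stretches the scale of its group so far that the remaining values collapse onto one or two levels, so naive linear quantization causes severe accuracy loss. Modern solutions like KIVI treat K and V differently: keys are quantized per channel, because their outliers concentrate in a few channels, while values are quantized per token, each group with its own dynamically computed scale factor.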
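
Why the grouping direction matters can be seen on a small made-up key matrix in which one channel is consistently large. The sketch below is not KIVI's implementation; it is a simplified illustration using 4-bit asymmetric min/max quantization, with hypothetical function names and synthetic data, that compares grouping by token (rows) with grouping by channel (columns).

```python
def fake_quantize(values, bits=4):
    """Asymmetric min/max quantization of one group, returned already dequantized."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]

def quantize_rows(matrix, bits=4):
    """Per-token grouping: each row (one token) gets its own scale."""
    return [fake_quantize(row, bits) for row in matrix]

def quantize_columns(matrix, bits=4):
    """Per-channel grouping: each column (one channel) gets its own scale."""
    columns = [fake_quantize(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Synthetic keys: 6 tokens x 4 channels, channel 0 is an outlier channel (~50).
keys = [[50.0 + t, 0.9 - 0.1 * t, -0.5 + 0.2 * t, 0.3 * t - 0.7] for t in range(6)]

per_token_err = mean_abs_error(keys, quantize_rows(keys))
per_channel_err = mean_abs_error(keys, quantize_columns(keys))
print(per_token_err > per_channel_err)  # True
```

With per-token grouping, the outlier in channel 0 sets every row's range to about 50, so the small channels all land on the same level. Per-channel grouping confines the outlier to its own group and leaves the other channels with a fine-grained scale.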

3. KV Cache Compression: Discarding Unimportant Memories

Not all historical tokens are equally important for current predictions. The core of compression lies in: identifying and removing redundant KV pairs.

H2O (Heavy Hitter Oracle)

Research has found that attention weight distribution is highly sparse: only a few "Heavy Hitter" tokens receive most of the attention. The H2O algorithm maintains a fixed-size cache budget and tracks, during decoding, the cumulative attention weight each cached token has received. It retains the highest-scoring tokens together with the most recent tokens, and evicts the remaining low-contribution tokens from the cache.
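
The eviction policy can be sketched as a small bookkeeping class. This is a simplified illustration in the spirit of H2O, not the published implementation: the class name, the `budget` and `recent` parameters, and the attention weights in the demo are all made up, and a real system would track scores per layer and per head on the GPU.

```python
class HeavyHitterCache:
    """Keeps at most `budget` tokens; the last `recent` tokens are never evicted."""

    def __init__(self, budget, recent):
        assert 0 <= recent < budget
        self.budget = budget
        self.recent = recent
        self.positions = []  # token positions currently kept, oldest first
        self.scores = {}     # position -> cumulative attention weight

    def step(self, position, attention_weights):
        """Add a token, accumulate this step's attention, evict if over budget."""
        self.positions.append(position)
        self.scores[position] = 0.0
        for pos, weight in attention_weights.items():
            if pos in self.scores:
                self.scores[pos] += weight
        if len(self.positions) <= self.budget:
            return None
        # Only tokens outside the protected recent window are candidates.
        candidates = self.positions[:-self.recent] if self.recent else self.positions
        victim = min(candidates, key=lambda p: self.scores[p])
        self.positions.remove(victim)
        del self.scores[victim]
        return victim

# Demo with made-up attention rows in which token 0 is a heavy hitter.
cache = HeavyHitterCache(budget=4, recent=2)
steps = [
    {0: 1.0},
    {0: 0.6, 1: 0.4},
    {0: 0.6, 1: 0.1, 2: 0.3},
    {0: 0.6, 1: 0.1, 2: 0.1, 3: 0.2},
    {0: 0.6, 1: 0.1, 2: 0.05, 3: 0.05, 4: 0.2},
    {0: 0.6, 1: 0.1, 3: 0.1, 4: 0.1, 5: 0.1},
]
evicted = [cache.step(pos, weights) for pos, weights in enumerate(steps)]
print(evicted)          # [None, None, None, None, 2, 3]
print(cache.positions)  # [0, 1, 4, 5]
```

Token 0 survives because of its high cumulative score, tokens 4 and 5 survive because they are recent, and the low-scoring middle tokens are the ones evicted.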

StreamingLLM (Sliding Window + Attention Sinks)

With a plain sliding-window cache, generation quality collapses on ultra-long texts once the earliest tokens are evicted. StreamingLLM observed that these first few tokens act as Attention Sinks: they receive a large share of attention regardless of their content. By retaining the Attention Sinks along with a recent sliding window, the model can maintain stable inference without retraining, while keeping VRAM usage constant.
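
The retention rule itself is simple enough to write down directly. The sketch below only computes which token positions stay in the cache. The function name and the default values for `num_sinks` and `window` are illustrative assumptions, and details of the published method, such as how positions are assigned inside the cache, are omitted.

```python
def streaming_kept_positions(seq_len, num_sinks=4, window=8):
    """Positions kept by a sink-plus-sliding-window cache policy."""
    sinks = list(range(min(num_sinks, seq_len)))
    # The recent window never overlaps the sink tokens.
    recent_start = max(num_sinks, seq_len - window)
    recent = list(range(recent_start, seq_len))
    return sinks + recent

print(streaming_kept_positions(3))   # [0, 1, 2]
print(streaming_kept_positions(20))  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
print(len(streaming_kept_positions(10_000)))  # 12
```

The cache never holds more than `num_sinks + window` tokens, which is what makes memory usage constant. Everything between the sinks and the window is discarded, so the model cannot recall it.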

4. Trade-offs in Engineering Practice

When deploying AI systems in production, the choice of scheme depends on the business scenario:

| Scheme | Implementation Difficulty | VRAM Savings | Impact on Accuracy | Applicable Scenarios |
|---|---|---|---|---|
| FP8 Quantization | Low (Hardware support) | $2\times$ | Negligible | General acceleration, H100 clusters |
| INT4 Quantization | Medium | $4\times$ | Perceptible | Edge devices, ultra-long context |
| H2O / Sparsification | High (Requires dynamic management) | Customizable | Moderate | Long document analysis, RAG systems |
| StreamingLLM | Medium | Constant usage | Severe loss of distant memory | Infinite streaming chat, Bot services |

Conclusion

Optimizing the KV Cache is a crucial step in transforming LLMs from "lab toys" into "industrial-grade products." By reducing the size of individual elements through quantization and decreasing the number of elements through compression, AI systems are breaking through the memory wall. The future trend will be a deep integration of quantization and sparsification—enabling models to store massive amounts of information efficiently while precisely extracting key memories, much like humans do.
