SmallFireDragon Lab

AI Science

Making complex AI concepts understandable for humans

The "Spatial Magic" of Modern AI Inference: How PagedAttention Ends GPU Memory Fragmentation
Science

In production environments for Large Language Models (LLMs), inference costs are not directly determined by the number of model parameters, but by a core metric

The "Memory Fragments" of Modern AI Systems: The Evolution from KV Cache to PagedAttention
Science

In the inference process of Large Language Models (LLMs), one of the most expensive costs is not computational power, but memory bandwidth. When you converse wi

The "Scheduling Art" of Modern AI Inference: From Static Batching to Continuous Batching
Science

In production environments for Large Language Models (LLMs), inference cost is not directly determined by the number of model parameters, but rather by a core m

KV Cache in Modern AI Systems: The Alchemy from Memory Bottlenecks to Inference Acceleration
Science

In the inference process of Large Language Models (LLMs), one of the most expensive costs is not computational power, but memory bandwidth. When you converse wi

The "Memory Wall" in Modern AI Systems: Pressure from KV Cache and Optimization Paths
Science

During the inference process of Large Language Models (LLMs), the most expensive resource is often not computational power (FLOPs), but memory bandwidth. When d

The Battle for “Memory” in Modern AI Systems: Engineering Trade-offs from Context Window to RAG
Science

In current LLM application development, the most common dilemma developers face is: how much can a model “remember,” and how much can it “retrieve”? With the em

KV Cache Optimization in Modern AI Systems: From the Memory Wall to PagedAttention
Science

During LLM inference, one of the most critical performance bottlenecks is not compute capacity (compute-bound), but memory bandwidth (memory-bound). When we dis

Evaluation Drift: Why High Benchmark Scores Don’t Equal Production Stability
Science

Model evaluations are often used as the basis for procurement and upgrades. If a model scores two percentage points higher on a leaderboard, it appears to be a

On-Device AI Runtimes: When MLX, Core ML, and WebGPU Are Worth It
Science

In the past, AI applications defaulted to sending inference requests to the cloud. This approach was simple, centralized, easy to scale, and facilitated unified
