SmallFireDragon Lab

AI Science

Making complex AI concepts understandable for humans

06/19/2026

The "Spatial Magic" of Modern AI Inference: How PagedAttention Ends GPU Memory Fragmentation

In production environments for Large Language Models (LLMs), inference costs are not directly determined by the number of model parameters, but by a core metric

The "Memory Fragments" of Modern AI Systems: The Evolution from KV Cache to PagedAttention

06/19/2026

Science

In Large Language Model (LLM) inference, one of the most expensive resources is not computational power, but memory bandwidth. When you converse wi

The "Scheduling Art" of Modern AI Inference: From Static Batching to Continuous Batching

06/18/2026

Science

In production environments for Large Language Models (LLMs), inference cost is not directly determined by the number of model parameters, but rather by a core m

KV Cache in Modern AI Systems: The Alchemy from Memory Bottlenecks to Inference Acceleration

06/17/2026

Science

In Large Language Model (LLM) inference, one of the most expensive resources is not computational power, but memory bandwidth. When you converse wi

The "Memory Wall" in Modern AI Systems: Pressure from KV Cache and Optimization Paths

06/16/2026

Science

During the inference process of Large Language Models (LLMs), the most expensive resource is often not computational power (FLOPs), but memory bandwidth. When d

The Battle for “Memory” in Modern AI Systems: Engineering Trade-offs from Context Window to RAG

06/15/2026

Science

In current LLM application development, the most common dilemma developers face is: how much can a model “remember,” and how much can it “retrieve”? With the em

KV Cache Optimization in Modern AI Systems: From the Memory Wall to PagedAttention

06/14/2026

Science

During LLM inference, one of the most critical performance bottlenecks is not compute capacity (compute-bound), but memory bandwidth (memory-bound). When we dis

Evaluation Drift: Why High Benchmark Scores Don’t Equal Production Stability

06/13/2026

Science

Model evaluations are often used as the basis for procurement and upgrades. If a model scores two percentage points higher on a leaderboard, it appears to be a

On-Device AI Runtimes: When MLX, Core ML, and WebGPU Are Worth It

06/12/2026

Science

In the past, AI applications defaulted to sending inference requests to the cloud. This approach was simple, centralized, easy to scale, and facilitated unified