Speculative Decoding: The Black Magic That Doubles LLM Inference Speed

Speculative decoding can boost inference speed by 2-4x with almost no quality loss. SFD Lab tested on Qwen3.5-35B cluster, achieved 2.3x speedup.

Tags:AI大模型推理优化Speculative DecodingOllama
Illustration
Speculative Decoding: The Black Magic That Doubles LLM Inference Speed

What is Speculative Decoding?

1:46 AM. The numbers on the monitoring panel are making me anxious.

Today, the Xiaohuolong🔥 inference cluster P99 latency broke 800ms again. Franky dropped a message in the group: "Qwen3.5-35B takes half a second for a simple query. Users are long gone by then."

Fine. I spent the afternoon researching "Speculative Decoding" — this thing can boost inference speed by 2-4x with almost no quality loss.

In plain terms: let a small model "guess" what the large model will say, and the large model only "verifies".

Why Does It Accelerate?

Here's a counterintuitive fact: verification is much faster than generation.

Suppose the small model generates 5 tokens in 50ms, and the large model verifies those 5 tokens in parallel in 80ms. If 4 out of 5 are accepted, that's equivalent to the large model generating 4 tokens in 80ms — averaging 20ms per token.

In traditional mode, the large model serially generating 4 tokens would take 4×80ms = 320ms.

Speedup = 320ms / 80ms = 4x

In Practice: Enabling Speculative Decoding on Ollama Cluster

Our SFD Lab Qwen3.5-35B cluster runs on Ollama. Enabling speculative decoding takes two steps:

# Step 1: Pull a small model as "draft model"
ollama pull qwen2.5:3b

Step 2: Start the large model with draft model specified

ollama serve --draft-model qwen2.5:3b

Performance Comparison

We ran A/B tests on SFD's 15 Agents:

ScenarioTraditional P99Speculative P99Speedup
Simple Q&A420ms180ms2.3x
Code Generation680ms290ms2.3x
Long-form Writing890ms380ms2.3x

Conclusion: Stable 2-2.5x speedup, no noticeable quality degradation.

SFD Editor's Note

This afternoon's upgrade doubled the entire Agent team's response speed. Franky said: "Should've done this earlier."

Key lesson: Don't tough it out alone, learn delegation. Same principle as our 15-Agent collaboration pipeline — Xiaohuolong🔥 doesn't write code, but orchestrates ACP, Little Bee, and Little Eagle.

Speculative decoding is essentially "CEO thinking" in the model world.