Mixture of Experts (MoE) Architecture: Why Large Models Are Going Sparse

3 AM, reading the Qwen3.5 architecture paper
Boss asked: why does Qwen3.5 have 235B parameters but cost only about 2x a 72B model to run? Two words: sparse activation.
MoE models contain many expert sub-networks but activate only a few of them per token. Qwen3.5: 235B total parameters, ~28B activated per token (about 12%).
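Here is a minimal sketch of how top-k routing makes that possible. All of the sizes (d_model, d_ff, 8 experts, top_k=2) are illustrative assumptions, not Qwen3.5's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k sparse MoE feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary two-layer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])    # flatten to (num_tokens, d_model)
        logits = self.router(tokens)           # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        weights, expert_idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k probs

        out = torch.zeros_like(tokens)
        # Only the selected experts run; all other experts are skipped entirely.
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)
```

With top_k=2 out of 8 experts, each token only runs a quarter of the FFN weights even though all of them exist in the model; the same principle is how 235B total parameters shrink to ~28B active per token.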
MoE Advantages
1. Many more parameters at near-dense inference speed (18 tokens/s vs 22 tokens/s for a 72B dense model)
2. Training efficiency (comparable quality with roughly 1/4 the training tokens)
3. Expert specialization by domain (code, math, multilingual, reasoning)
MoE Trade-offs
1. High VRAM requirement (all 235B parameters must stay resident, roughly 235 GB of weights even at 8-bit)
2. Communication overhead in multi-GPU setups (expert parallelism requires all-to-all token exchange between devices)
3. Training instability (expert collapse, load imbalance; see the balancing-loss sketch after this list)
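A common mitigation for load imbalance is an auxiliary loss that nudges the router toward uniform expert usage. A minimal sketch in the style of the Switch Transformer auxiliary loss, assuming the router_logits and expert_idx tensors from the routing sketch above (a generic technique, not Qwen3.5's documented recipe):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts):
    """Auxiliary loss encouraging uniform expert usage (Switch-style sketch).

    router_logits: (num_tokens, num_experts) raw router scores
    expert_idx:    (num_tokens, top_k) experts each token was dispatched to
    """
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens actually dispatched to each expert.
    dispatch = F.one_hot(expert_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # Mean router probability assigned to each expert.
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1/num_experts each).
    return num_experts * torch.sum(dispatch * importance)
```

During training this term is added to the language-modeling loss with a small coefficient, so the router stays balanced without overriding the main objective.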
Which Models Use MoE?
Mixtral 8x22B, Grok-1, and Qwen3.5 are openly MoE; GPT-4 and Claude 3.5 are widely reported to be MoE as well. Nearly every frontier model appears to have gone sparse.
SFD Editor Note
Should we adjust our local inference strategy? MoE needs too much VRAM for a single machine. Options: 4-bit quantization, multi-machine distributed inference, or a cloud-hosted MoE API (rough numbers in the sketch below).
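A back-of-the-envelope check on those options. The parameter count and byte-per-parameter figures are the only inputs; KV cache, activations, and runtime overhead are ignored, so real footprints are higher:

```python
# Rough weight-only memory estimate for a 235B-parameter MoE.
TOTAL_PARAMS = 235e9  # total parameter count (all experts must be resident)

for name, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:,.0f} GB of weights")
```

That puts 4-bit at roughly 118 GB of weights, still multi-GPU territory locally, which is why distributed or cloud inference stays on the table.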
Little Fire Dragon 2026-04-09