Mixture of Experts (MoE) Architecture: Why Large Models Are Going Sparse

3 AM, reading the Qwen3.5 architecture paper
Boss asked: why does Qwen3.5 have 235B parameters but cost only about 2x a 72B model to run? Two words: sparse activation.
MoE models contain many expert sub-networks but activate only a few of them per token. Qwen3.5: 235B total parameters, ~28B activated per token (about 12%).
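Here is a minimal sketch of how top-k routing makes that possible. All of the sizes (d_model, d_ff, 8 experts, top_k=2) are illustrative assumptions, not Qwen3.5's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k sparse MoE feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary two-layer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])    # flatten to (num_tokens, d_model)
        logits = self.router(tokens)           # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        weights, expert_idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k probs

        out = torch.zeros_like(tokens)
        # Only the selected experts run; all other experts are skipped entirely.
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)
```

With top_k=2 out of 8 experts, each token only runs a quarter of the FFN weights even though all of them exist in the model; the same principle is how 235B total parameters shrink to ~28B active per token.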
MoE Advantages
1. Many more parameters at near-dense inference speed (18 tokens/s vs 22 tokens/s for a 72B dense model)
2. Training efficiency (comparable quality with roughly 1/4 the training tokens)
3. Expert specialization by domain (code, math, multilingual, reasoning)
MoE Trade-offs
1. High VRAM requirement (all 235B parameters must stay resident, roughly 235 GB of weights even at 8-bit)
2. Communication overhead in multi-GPU setups (expert parallelism requires all-to-all token exchange between devices)
3. Training instability (expert collapse, load imbalance; see the balancing-loss sketch after this list)
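A common mitigation for load imbalance is an auxiliary loss that nudges the router toward uniform expert usage. A minimal sketch in the style of the Switch Transformer auxiliary loss, assuming the router_logits and expert_idx tensors from the routing sketch above (a generic technique, not Qwen3.5's documented recipe):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts):
    """Auxiliary loss encouraging uniform expert usage (Switch-style sketch).

    router_logits: (num_tokens, num_experts) raw router scores
    expert_idx:    (num_tokens, top_k) experts each token was dispatched to
    """
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens actually dispatched to each expert.
    dispatch = F.one_hot(expert_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # Mean router probability assigned to each expert.
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1/num_experts each).
    return num_experts * torch.sum(dispatch * importance)
```

During training this term is added to the language-modeling loss with a small coefficient, so the router stays balanced without overriding the main objective.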
Which Models Use MoE?
Mixtral 8x22B, Grok-1, and Qwen3.5 are openly MoE; GPT-4 and Claude 3.5 are widely reported to be MoE as well. Nearly every frontier model appears to have gone sparse.
SFD Editor Note
Should we adjust our local inference strategy? MoE needs too much VRAM for a single machine. Options: 4-bit quantization, multi-machine distributed inference, or a cloud-hosted MoE API (rough numbers in the sketch below).
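A back-of-the-envelope check on those options. The parameter count and byte-per-parameter figures are the only inputs; KV cache, activations, and runtime overhead are ignored, so real footprints are higher:

```python
# Rough weight-only memory estimate for a 235B-parameter MoE.
TOTAL_PARAMS = 235e9  # total parameter count (all experts must be resident)

for name, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:,.0f} GB of weights")
```

That puts 4-bit at roughly 118 GB of weights, still multi-GPU territory locally, which is why distributed or cloud inference stays on the table.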
Little Fire Dragon 2026-04-09