Qwen3.5 35B Local Deployment: Complete Ollama Cluster Guide
Complete guide to deploying Qwen3.5 35B locally with Ollama on Mac Studio cluster. Quantization choices, troubleshooting, performance benchmarks.

Why We Deployed 35B Models Locally
In April 2026, SFD Lab made a decision: migrate core inference tasks from cloud to local. Not because we distrust APIs, but simple math—9 daily articles + 15 Agent collaborations = 500K+ monthly API calls. Two Mac Studio vs cloud costs: local deployment pays back in 18 months.
More importantly: data stays in-house. User conversations, skill configs, memory fragments—no need to send these to third parties.
Hardware Choice: Why Mac Studio
We chose two Mac Studio M3 Ultra:
- MS01: 96GB unified memory, primary inference node
- MS02: 96GB unified memory, backup + coding专用
Why not H100? Simple: VRAM is expensive. 80GB H100 single card costs 250K RMB, 96GB Mac Studio complete unit is 30K RMB. Ollama optimization on Apple Silicon is mature—Qwen3.5 35B Q8 quantized needs only 38GB, MS01 handles it easily.
Ollama Deployment Flow
Step 1: Install Ollama
brew install ollama
# macOS requires manual service start
ollama serve
Step 2: Pull Models
# MS01: Qwen3.5 35B Q8
ollama pull qwen3.5:35b-q8_0
MS02: Qwen3-Coder-Next (coding专用)
ollama pull qwen3-coder-next:latest
Quantization Guide
| Quant | Size | RAM | Quality Loss |
|---|---|---|---|
| Q8_0 | 38GB | ~42GB | Negligible |
| Q6_K | 30GB | ~34GB | Minimal |
| Q4_K_M | 23GB | ~27GB | Acceptable |
Lessons Learned
Issue 1: Download Interruption—38GB takes 20-30 min, network fluctuations cause breaks. Ollama supports resume, but ensure correct permissions on ~/.ollama/models.
Issue 2: OOM—First run on MS01 crashed macOS. Fix: limit Ollama max memory via launchctl setenv OLLAMA_MAX_VRAM 40000000000.
Issue 3: Request Queuing—Single instance handles one request at a time. Solution: run two instances on different ports (11434, 11435).
Performance Benchmarks
1000 tokens prompt, 500 tokens output:
- Q8_0 (MS01): First token 1.2s, 28 tokens/s
- Q4_K_M (MS02): First token 0.8s, 35 tokens/s
SFD Editor Note
This dual-machine cluster has run 2 weeks with 50+ daily inference requests, zero failures. Electricity cost: ~300 SGD/month vs 2000+ SGD cloud API savings. Next: explore Ollama distributed inference for 72B models.
Comments
Share your thoughts!
Loading comments…