Claude Sonnet 4 Deep Review: Not Just Faster — Actually Smarter

Honest Claude Sonnet 4 deep review from SFD Lab — instruction following, long document comprehension, tool calling reliability, and what actually didn't improve. Based on 2 weeks of production agent usage.

Tags: Claude, Anthropic, AI Evaluation, Claude Sonnet 4, 2026

Background: Why We Ran This Review

SFD Lab runs 14 agents around the clock, most of them running Claude under the hood. After migrating from Sonnet 3.7 to Sonnet 4, some behaviors changed, some didn't, and some changed in ways we didn't expect. That seemed worth a proper writeup.

This isn't an official benchmark comparison. If you want MMLU/HumanEval numbers, check Anthropic's site. This is about real-world workflow differences.

Change #1: Instruction Following Is More Consistent

This is where we felt the biggest improvement. Sonnet 3.7 had an annoying habit: you'd give very specific formatting instructions, and it would occasionally "forget" or improvise. Ask for pure JSON output — it would add "Here's the response in JSON format:" before the JSON.

Sonnet 4 tightened this up significantly. In our content pipeline (where 小狐狸 needs strict JSON template output), formatting deviation dropped from roughly 15% to under 3%. For automation workflows, that's massive — fewer exception handlers, less human intervention.
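Even at a sub-3% deviation rate, a production pipeline still needs a guard for the occasional preamble or stray code fence. Here's a minimal sketch of the kind of defensive parser we mean; `parse_strict_json` is a hypothetical helper, not part of any SDK:

```python
import json

def parse_strict_json(raw: str) -> dict:
    """Parse model output that should be pure JSON, tolerating the
    preamble and code-fence deviations we saw with Sonnet 3.7."""
    text = raw.strip()
    # Strip a markdown code fence if the model wrapped its output in one
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: extract the outermost JSON object from surrounding prose
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start : end + 1])
        raise

# Handles both clean output and a "Here's the response..." preamble
assert parse_strict_json('{"title": "x"}') == {"title": "x"}
assert parse_strict_json('Here is the JSON:\n{"title": "x"}') == {"title": "x"}
```

With Sonnet 4, the fallback branch fires far less often, but keeping it in place is what lets the deviation rate drop without any human intervention at all.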

Change #2: Long Document Comprehension Is More Even

We use Claude for contract review, typically 30–50 pages. Sonnet 3.7 had a subtle issue: detailed analysis in the first half, "efficiency mode" in the second half. The last few pages got skimmed.

Sonnet 4 improved this. We tested several 50+ page documents — attention distribution is more even. A small clause on page 45 gets cited accurately, not approximated from earlier context.

One caveat: if the document itself is structurally chaotic (e.g., garbled text from scanning), Sonnet 4 doesn't outperform 3.7 by much. Clean input is still a prerequisite.
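To catch the "approximated from earlier context" failure mode in contract review, we spot-check whether a cited quote actually appears on the cited page. A minimal sketch of that check, assuming the document is already split into a list of page strings (`verify_citation` is our own illustrative helper):

```python
def verify_citation(cited_page: int, quote: str, pages: list[str]) -> bool:
    """Return True if the quoted clause actually appears on the
    cited page (1-indexed) of the source document."""
    return 1 <= cited_page <= len(pages) and quote in pages[cited_page - 1]

pages = [
    "1. Definitions ...",
    "14.2 Termination requires 30 days written notice.",
    "Miscellaneous ...",
]
assert verify_citation(2, "30 days written notice", pages)
# A citation pointing at the wrong page fails the check
assert not verify_citation(1, "30 days written notice", pages)
```

It's a crude substring check, but it's enough to flag the late-page clauses that Sonnet 3.7 tended to paraphrase from memory.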

Change #3: Tool Calling Is More Reliable

This matters most for us. SFD Lab workflows depend heavily on precise tool calls — agents need to invoke APIs, read/write files, trigger pipeline steps. Sonnet 3.7 occasionally produced malformed tool call parameters or inexplicably ignored available tools.

Sonnet 4 is more stable here. Tool call success rate went from roughly 92% to 97% in our testing. That five-point gap, across hundreds of daily tool calls, means dozens fewer errors per day — direct impact on manual intervention frequency.

Specific example: when 小章鱼 calls API endpoints with nested JSON parameters, Sonnet 3.7 sometimes mangled quote escaping. Sonnet 4 basically never does this.
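Regardless of model version, we validate tool-call arguments before dispatching them to a real API, so a mangled parameter fails fast with a retryable error instead of a downstream 400. A minimal sketch (the function name and error messages are ours, not from any SDK):

```python
import json

def validate_tool_args(raw_args: str, required: set[str]) -> dict:
    """Guard against malformed tool-call parameters (e.g., broken quote
    escaping) before dispatching to the real endpoint. Returns parsed
    args, or raises ValueError so the agent loop can retry the call."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        raise ValueError(f"tool arguments are not valid JSON: {e}")
    missing = required - args.keys()
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    return args

# Nested JSON with escaped quotes — the exact shape 3.7 sometimes mangled
good = '{"endpoint": "/v1/items", "payload": {"name": "a \\"quoted\\" value"}}'
assert validate_tool_args(good, {"endpoint", "payload"})["endpoint"] == "/v1/items"
```

The ValueError message gets fed back to the model as a tool result, which usually produces a corrected call on the next turn.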

What Didn't Improve

Speed perception: Official claims say it's faster, but for our use case (generating 1000+ word articles), the subjective difference is minimal. Probably more noticeable for short responses.

Math and logical reasoning: We don't need heavy math, but occasional financial calculations feel about the same as 3.7. For complex reasoning, still consider Opus-tier.

Creative divergence: Sonnet 4 feels slightly more conservative. For creative marketing copy, 3.7 produced wilder ideas; Sonnet 4 tends toward "safe but bland." This may be intentional — accuracy at the cost of some randomness.

In Agent Contexts

Sonnet 4 shows improvement in multi-turn dialogue, role-play consistency, and long-task tracking. Our agents had a common failure mode: mid-task drift — forgetting the original goal and getting absorbed in a sub-task. Sonnet 4 significantly improves goal-persistence across many reasoning steps.

Another observation: Sonnet 4 is more likely to proactively ask for clarification on ambiguous instructions rather than guessing. Double-edged sword — fewer "spent hours on wrong assumption" situations, but more confirmation rounds needed. Which matters more depends on your workflow.

Migration Cost

Low. The API is compatible, so no code changes are needed. But run regression tests before cutting over, especially on formatted-output behavior, prompt-sensitive workflows, and tool call error rates.
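The cheapest regression check is replaying a fixed prompt set against the new model and asserting on output shape. A sketch of the kind of check we mean for strict-JSON outputs (`check_no_preamble` is a hypothetical helper, not a real library function):

```python
def check_no_preamble(output: str) -> bool:
    """Regression check for strict-JSON prompts: the output must start
    with '{' or '[' — any leading prose means the model added a preamble.
    Run this over a replayed prompt set when switching model versions."""
    return output.lstrip().startswith(("{", "["))

assert check_no_preamble('  {"ok": true}')
assert not check_no_preamble('Here is the JSON you asked for: {"ok": true}')
```

Tracking the pass rate of a check like this per model version is how we got the 15% vs. 3% deviation numbers above.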

We needed to adjust two prompts: one where defensive language (preventing Sonnet 3.7 from adding unwanted preambles) could simply be deleted; another where we needed to explicitly add "don't ask for clarification, just execute" to prevent extra confirmation rounds.

SFD Lab Note

Bottom line: Sonnet 4 is a "consolidate the foundation" upgrade, not a revolution. It's better at the "infrastructure" layer — reliability, instruction following, tool calling — which matters a lot for production agents. If you're prototyping, the difference is minor. If you're running real workflows that need stability, the switch is worth it. Two weeks in with all 14 agents on Sonnet 4, we're satisfied. The main benefit isn't any single capability — it's that the frequency of unexpected behavior went down. In agent operations, that's worth more than any new feature.
