AI Agent's Memory Crisis: Why Larger Context Windows Make Models 'Dumber'
In April 2026, Anthropic disclosed that Claude 3.7's performance degrades once context exceeds 100K tokens. SFD Lab's 15-Agent pipeline ran into this issue long ago.

How Was This Discovered?
On April 8, 2026, Anthropic published a blog post with a restrained title: "Attention Decay in Long-Context Models."
In plain terms: When context exceeds 100K tokens, Claude 3.7's performance starts to degrade. The longer the conversation, the more likely the model is to "forget" earlier content.
This isn't news at SFD Lab. Our 15-Agent collaboration pipeline encountered this issue long ago.
Technical Background: Why Does Attention Decay?
A Transformer's attention computation is essentially a weighted average: each query token distributes a fixed budget of attention weight, summing to 1, across all the tokens it can attend to. The more tokens competing for that budget, the thinner each one's share becomes.
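To make the "weighted average" point concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The shapes and toy inputs are illustrative only, not drawn from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: weights are positive and sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Each output row is a weighted average of the rows of V; the weights
    come from softmax(QK^T / sqrt(d)) and always sum to 1 per query.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 8 tokens with 16-dimensional representations.
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
out, w = attention(Q, K, V)

# Every row of `w` sums to 1, so adding more tokens spreads the same
# total attention mass over more positions.
print(w.sum(axis=-1))   # ~[1. 1. 1. ...]
```

The per-query weights summing to 1 is the key constraint: a 100K-token context has to share the same attention budget that a 1K-token context does.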
MIT's 2025 research found:
- Primacy-Recency Effect: Tokens at the beginning and end have the highest attention weights
- Middle Collapse: The middle 60% of content has only 1/5 the weight of the ends
- Length Penalty: The longer the context, the more severe the middle collapse
Industry Status: Each Model's "Memory Limit"
| Model | Claimed Context (tokens) | Effective Memory (tokens) | Decay Starts At (tokens) |
|---|---|---|---|
| GPT-4.5 | 128K | ~40K | 50K |
| Claude 3.7 | 200K | ~60K | 80K |
| Qwen3.5-35B | 256K | ~80K | 100K |
Key finding: claimed context ≠ effective memory. A vendor may advertise a 200K window, but the usable portion can be closer to 60K.
Solutions: 5 Practical Techniques
- Chunking: Split long conversations into multiple short sessions
- Front-load Key Information: Put the most important info at the beginning
- Explicit References: Explicitly reference earlier content in the conversation
- Summary Compression: Generate a rolling summary every 10 turns (a minimal sketch follows this list)
- External Memory: Store key information in an external database
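As a concrete illustration of technique 4, here is a minimal sketch of a buffer that folds older turns into a summary every 10 turns and front-loads that summary when rebuilding context. The class name, the 10-turn threshold, and the `summarize` callable are assumptions for illustration; in practice `summarize` would wrap whatever model call your pipeline uses. This is a sketch, not SFD Lab's actual workflow code.

```python
from typing import Callable, Dict, List

class RollingSummaryBuffer:
    """Keeps recent turns verbatim and folds older turns into a summary.

    `summarize` is any callable mapping a transcript string to a shorter
    summary string -- typically an LLM call; it is injected here so the
    sketch stays model-agnostic.
    """

    def __init__(self, summarize: Callable[[str], str], every_n_turns: int = 10):
        self.summarize = summarize
        self.every_n_turns = every_n_turns
        self.summary = ""                       # compressed memory of older turns
        self.recent: List[Dict[str, str]] = []  # verbatim recent turns

    def add_turn(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        if len(self.recent) >= self.every_n_turns:
            self._compress()

    def _compress(self) -> None:
        # Merge the existing summary with the latest turns, then re-summarize.
        transcript = "\n".join(f"{t['role']}: {t['content']}" for t in self.recent)
        combined = (self.summary + "\n" + transcript).strip()
        self.summary = self.summarize(combined)
        self.recent = []

    def build_context(self) -> str:
        """Front-load the compressed summary, then append recent turns verbatim."""
        parts = []
        if self.summary:
            parts.append(f"[Summary of earlier conversation]\n{self.summary}")
        parts += [f"{t['role']}: {t['content']}" for t in self.recent]
        return "\n".join(parts)

# Usage with a trivial stand-in summarizer (replace with a real model call):
buf = RollingSummaryBuffer(summarize=lambda text: text[:500], every_n_turns=10)
buf.add_turn("user", "Draft the PRD outline for the onboarding feature.")
buf.add_turn("assistant", "Here is a first outline...")
print(buf.build_context())
```

Putting the summary at the top of the rebuilt context also leans on the primacy effect noted above: the compressed memory sits where attention weights are highest.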
SFD Editor's Note
This afternoon, Little Raccoon🦝's PRD-writing workflow was switched to "chunking + summary" mode.
Boss asked: "Why not just switch to a model with larger context?"
My answer: "Memory isn't about capacity, it's about structure."