Set Temperature Wrong and Your AI Is Basically Drunk

LLM temperature parameter production guide with SFD lab experience and real Agent configurations

Tags: LLM, temperature, AI parameter tuning, LLM hands-on experience

Last Wednesday, Our Customer Service Bot Started Flirting With Users

Not exaggerating. At 2 AM, our monitoring panel flagged an alert: a customer service reply had hit 4,000 characters. A user asked "when will my order arrive," and the AI went from order status to the meaning of life, then wrote a poem about waiting.

Half an hour of debugging later, we found someone had bumped the temperature from 0.3 to 0.9. The reason: "0.3 feels too robotic, let me make it more natural."

Too natural. A temperature of 0.9 is like giving AI two shots of whiskey—it starts saying anything, making up any answer it can think of.

What the Heck Is Temperature

Don't let the formulas in papers scare you. One sentence: temperature controls how bold the AI is when picking the next word.

AI generates text one word at a time. For each word, it calculates probabilities for all possible candidates. Temperature adjusts that probability distribution:

  • temperature = 0: Always picks the highest probability word. Every response is identical. Like a student who only memorizes textbooks.
  • temperature = 0.2-0.4: Mostly picks the top word, occasionally tries the second choice. Stable but not rigid. Perfect for customer service, code generation, translation.
  • temperature = 0.5-0.7: Getting creative. Same question, slightly different answers each time. Good for copywriting, brainstorming.
  • temperature = 0.8-1.0: AI starts going wild. Answers bring surprises and scares alike. Fits creative writing and storytelling.
  • temperature > 1.0: Total chaos. It starts hallucinating, even outputting gibberish. Unless you're experimenting, stay away.
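The mechanics behind those ranges fit in a few lines. Here is a minimal sketch of temperature scaling with toy logits (three candidate words, made-up scores); real models do the same thing over a vocabulary of tens of thousands of tokens:

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature, then softmax into probabilities.

    Low temperature sharpens the distribution (the top word dominates);
    high temperature flattens it (more candidates become plausible).
    temperature=0 would divide by zero; real samplers special-case it
    as a plain argmax.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate next words
logits = [2.0, 1.0, 0.5]

cold = apply_temperature(logits, 0.2)  # near-deterministic
warm = apply_temperature(logits, 1.0)  # the model's raw distribution
hot = apply_temperature(logits, 2.0)   # flattened — the "drunk" zone
```

At 0.2 the top word takes almost all the probability mass; at 2.0 the three candidates are nearly even, which is exactly why high temperature produces poems about the meaning of life.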

Three Blood-Stained Lessons

Lesson 1: Temperature and top_p Are Not Independent

Many people tune both temperature and top_p thinking their effects stack independently. Wrong. They operate in series, not in parallel.

What actually happens in a typical sampling stack: temperature first rescales the logits, then top_p prunes the low-probability tail, and the model samples from whatever survives. So setting top_p=0.9 and then tuning temperature gives a more conservative result than tuning temperature alone.

Our SFD lab practice: fix top_p=0.9, only tune temperature. One variable means when things go wrong, you know exactly whose fault it is.
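The "series, not parallel" point is easier to see as a toy sampler. This sketch follows the ordering used in common open-source stacks (temperature, then top-p); your provider's exact pipeline may differ:

```python
import math
import random

def sample_with_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Toy sampler: temperature and top_p applied in series."""
    # Stage 1 — temperature: rescale logits, softmax into probabilities
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    denom = sum(exps)
    probs = [e / denom for e in exps]

    # Stage 2 — top_p: keep the smallest set of candidates whose
    # cumulative probability reaches the threshold, highest first
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Stage 3 — renormalize over the survivors and sample
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Notice that with a tight top_p, cranking temperature barely matters: the wild candidates were already pruned before sampling. That is why tuning both at once makes debugging miserable.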

Lesson 2: Temperature Values Don't Transfer Across Models

temperature=0.7 on GPT-4 might feel like 0.5 on Claude. Each model's underlying probability distribution is different.

Tested with the same prompt (write a short poem about spring):

GPT-4 @ 0.7:    Polished, rich imagery, but formulaic
Claude @ 0.7:   Already quite wild, occasionally weird metaphors
Qwen @ 0.7:     Safe and steady, more conservative than GPT-4
Llama-3 @ 0.7:  Most creative, but sometimes off-topic

So when switching models, don't lazily reuse old temperature values. Spend 10 minutes on comparison tests.
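Those 10 minutes can be a script. Here is a minimal harness for the comparison run; query_model is a hypothetical stand-in for your actual API client (OpenAI SDK, Anthropic SDK, a local endpoint), and the model names are placeholders:

```python
# Sketch of a quick cross-model temperature comparison.
PROMPT = "Write a short poem about spring."
MODELS = ["gpt-4", "claude", "qwen", "llama-3"]  # placeholder names
TEMPERATURES = [0.3, 0.5, 0.7, 0.9]

def query_model(model, prompt, temperature):
    # Hypothetical stub — replace with a real API call for your stack.
    return f"[{model} @ {temperature}] response to: {prompt}"

def run_comparison():
    """Collect 3 samples per (model, temperature) pair.

    Multiple samples per setting matter: you are judging the variance
    of the output, not a single lucky draw.
    """
    results = {}
    for model in MODELS:
        for temp in TEMPERATURES:
            results[(model, temp)] = [
                query_model(model, PROMPT, temp) for _ in range(3)
            ]
    return results
```

Read the samples side by side and pick the temperature on the new model that matches the feel of your old setting, rather than copying the number.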

Lesson 3: Temperature Doesn't Fix Factual Errors

This is the most common misunderstanding. People see factual errors and their first instinct is to lower temperature. But temperature only affects variation in phrasing, not factual accuracy.

If the AI says "Mount Everest is 8,848 meters"—whether temperature is 0 or 1, it will say 8,848. That is what it learned.

If the AI says "the Sun orbits the Earth"—temperature at 0 still won't fix that. Temperature does not change what the model knows, only how it expresses it.

Factual accuracy requires RAG or fine-tuning, not temperature.

SFD Lab Temperature Configurations

Here are the settings we actually run across our 15 Agents (a representative five below)—not theoretical optimums, but battle-tested:

Agent     Purpose              Temperature  Why
小猎鹰    Security audit       0.1          Zero tolerance for creativity
小狐狸    Copywriting          0.7          Needs creativity but must stay on topic
小章鱼    Code generation      0.2          Code cannot be ambiguous
小蝴蝶    Design descriptions  0.8          The more creative the better
小春蚕    Translation          0.3          Accuracy first, style second
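In practice it helps to keep these in one config object rather than scattered across call sites. A minimal sketch (names and values from the table above; the structure and the safe-default choice are assumptions, not our actual codebase):

```python
# Per-Agent temperature settings kept in a single lookup table.
AGENT_TEMPERATURES = {
    "小猎鹰": {"purpose": "security audit", "temperature": 0.1},
    "小狐狸": {"purpose": "copywriting", "temperature": 0.7},
    "小章鱼": {"purpose": "code generation", "temperature": 0.2},
    "小蝴蝶": {"purpose": "design descriptions", "temperature": 0.8},
    "小春蚕": {"purpose": "translation", "temperature": 0.3},
}

def temperature_for(agent, default=0.3):
    """Look up an agent's temperature; unknown agents get a
    conservative default instead of whatever was set last."""
    return AGENT_TEMPERATURES.get(agent, {}).get("temperature", default)
```

A centralized table also means a 2 AM "let me make it more natural" edit shows up in one diff, not buried in an Agent's call site.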

SFD Editor's Note

Writing this article, I checked on that romantic customer service bot again. It is back to 0.3 now—boring replies, but at least no more poems. Sometimes I think the line between boring and interesting AI is just one temperature parameter. And our job as engineers is finding that sweet spot—not too rigid, not too crazy.