A Guide to Avoiding Pitfalls: Managing "Hallucinations" and Establishing an Engineering Closed Loop in AI Agent Deployment

In many delivery projects within AI Labs, the biggest headache for engineers isn't whether the model can generate an answer, but how to ensure it "consistently avoids nonsense" in a production environment.

Many teams reach 90% accuracy during the Demo phase by carefully crafting Few-shot Prompts. However, once the Agent is deployed in real business scenarios and faces massive volumes of uncontrolled user input, hallucinations quickly drag usability below 60%.

This article shares our three key iterations in building enterprise-grade AI Agents, shifting from "Prompt Tuning" to an "Engineering Closed Loop."

Phase 1: From Prompt Engineering to RAG Enhancement

Initially, we attempted to reduce hallucinations by adding constraints to the System Prompt. For example: "You must answer strictly based on the provided context; if you do not know, state directly that you do not know."

Result: The model swung between two failure modes. It either became extremely conservative, repeatedly falling back on "I cannot answer this question," or it still forced an answer together whenever the context was slightly ambiguous.

Engineering Solution: We introduced a structured RAG (Retrieval-Augmented Generation) pipeline.
1. Hybrid Search: Instead of relying solely on Vector Search, we combined it with BM25 keyword retrieval. In engineering domains, specific error codes (e.g., ORA-00600) or component names are strong features; vectorization often leads to a loss of precision for these exact matches.
2. Reranking: The Top-K documents retrieved are not necessarily all relevant. We added a lightweight Cross-Encoder reranking model to filter out noisy documents, ensuring that the context fed to the LLM is of high purity.
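The two steps above can be sketched as follows. This is a minimal illustration, not our production pipeline: it assumes the keyword and vector rankings are merged with Reciprocal Rank Fusion (one common fusion choice; the text above does not prescribe a method), and `score_fn` is a hypothetical stand-in for a Cross-Encoder relevance scorer. The function names, document IDs, `threshold`, and `top_n` are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in.
    k=60 is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def filter_by_rerank_score(query, doc_ids, score_fn, threshold=0.5, top_n=3):
    """Keep only documents the reranker scores at or above the threshold.

    score_fn(query, doc_id) is a hypothetical relevance scorer; in practice
    it would wrap a Cross-Encoder model.
    """
    scored = [(score_fn(query, d), d) for d in doc_ids]
    kept = [(s, d) for s, d in scored if s >= threshold]
    kept.sort(key=lambda pair: -pair[0])
    return [d for _, d in kept[:top_n]]


# Illustrative rankings: BM25 catches the exact error code first,
# the vector search surfaces a semantically similar but noisier hit first.
bm25_hits = ["doc_ora600", "doc_listener", "doc_backup"]
vector_hits = ["doc_memory", "doc_ora600", "doc_listener"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])


def toy_score(query, doc_id):
    # Toy scorer used only to make the example runnable.
    return 0.9 if "ora600" in doc_id else 0.1


context_docs = filter_by_rerank_score("ORA-00600", fused, toy_score)
```

A document that ranks well in both lists rises to the top of the fused ranking, and the threshold then drops low-scoring candidates before anything reaches the LLM.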

Phase 2: Introducing "Reflection" Mechanisms and Self-Correction

Even with high-quality context, LLMs may still make logical leaps during reasoning.

Real-world Case: In an automated operations Agent, the repair suggestions generated by the model after analyzing logs sometimes included non-existent API parameters.

Engineering Solution: We designed a dual-loop structure: $\text{Generator} \rightarrow \text{Verifier}$.
- Generation Loop: The Agent generates a preliminary solution based on the knowledge base.
- Verification Loop: A separate Prompt (or a more powerful model) acts as an "Auditor," with the sole task of checking whether every technical point in the solution is supported by the context.
- Closed Loop: If the Verification Loop identifies contradictions $\rightarrow$ feed the error details back to the Generation Loop $\rightarrow$ regenerate.
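A minimal sketch of this loop is shown below. The `generate` and `verify` callables are placeholders for the two LLM calls; the stub implementations, the `Verdict` type, the `max_rounds` cap, and the choice to return `None` when no draft passes are assumptions made for illustration, not details of our system.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Verdict:
    supported: bool
    issues: list = field(default_factory=list)


def run_with_verification(question, context, generate, verify, max_rounds=3):
    """Generate a draft, audit it, and regenerate with feedback if needed.

    Returns (answer, rounds_used). The answer is None when no draft passes
    within max_rounds, so the caller can escalate instead of shipping it.
    """
    feedback = []
    for round_no in range(1, max_rounds + 1):
        draft = generate(question, context, feedback)
        verdict = verify(draft, context)
        if verdict.supported:
            return draft, round_no
        feedback = verdict.issues  # passed to the next generation attempt
    return None, max_rounds


# Stubs standing in for the LLM calls (hypothetical behaviour).
def stub_generate(question, context, feedback):
    if not feedback:
        # First attempt "hallucinates" a parameter that does not exist.
        return "restart-service --name db --force-all"
    return "restart-service --name db"


def stub_verify(draft, context):
    # Every flag in the draft must literally appear in the context.
    flags = re.findall(r"--[\w-]+", draft)
    missing = [f for f in flags if f not in context]
    issues = [f"unsupported parameter: {f}" for f in missing]
    return Verdict(supported=not missing, issues=issues)


context = "restart-service accepts --name and --timeout"
answer, rounds = run_with_verification(
    "How do I restart the db service?", context, stub_generate, stub_verify
)
```

Capping the number of rounds keeps the extra Token cost and latency bounded, which matters because each failed audit triggers another full generation.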

Although this "self-reflection" mechanism increases Token costs and latency, it improved the accuracy of critical instructions by approximately 15%.

Phase 3: Establishing Quantifiable Evaluation Sets (Eval Sets) and Regression Testing

The most dangerous situation is: "I tweaked a Prompt to solve Problem A, but Problem B broke."

Pain Point: Relying on manual spot checks cannot support rapid iteration.

Engineering Solution: Build a Golden Dataset containing $\sim 500$ typical cases.
1. Case Definition: Each case includes Input $\rightarrow$ Expected Output $\rightarrow$ Critical Constraints (keywords that must be included / words that must be excluded).
2. LLM-as-a-Judge: Use GPT-4o or a comparable model as a judge to score the Agent's output based on predefined dimensions (accuracy, completeness, safety).
3. CI/CD Integration: Integrate Eval into the Git Pipeline. After every Prompt modification or knowledge base update, the full test suite runs automatically. A change may be merged only if the overall score does not decrease and the pass rate for core cases is $100\%$.
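The case definition and merge gate above could look roughly like this. It is a sketch under stated assumptions: the `EvalCase` fields, the `core` flag, and the way judge scores arrive (a plain dictionary keyed by case ID, assumed to be produced by a separate LLM-as-a-Judge step) are illustrative names, not our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected_output: str
    must_include: tuple = ()   # keywords the answer must contain
    must_exclude: tuple = ()   # words the answer must not contain
    core: bool = False         # core cases must pass at 100%


def passes_constraints(case, output):
    """Check the critical constraints with case-insensitive matching."""
    text = output.lower()
    has_required = all(k.lower() in text for k in case.must_include)
    has_forbidden = any(k.lower() in text for k in case.must_exclude)
    return has_required and not has_forbidden


def merge_allowed(cases, outputs, judge_scores, baseline_score):
    """Gate: every core case passes and the mean judge score has not dropped."""
    core_ok = all(
        passes_constraints(c, outputs[c.case_id]) for c in cases if c.core
    )
    overall = sum(judge_scores[c.case_id] for c in cases) / len(cases)
    return core_ok and overall >= baseline_score


# Two illustrative cases; real sets would hold hundreds.
cases = [
    EvalCase("c1", "How do I fix ORA-00600?", "Contact support with the trace file.",
             must_include=("trace file",), must_exclude=("drop table",), core=True),
    EvalCase("c2", "What is the default timeout?", "Not stated in the docs."),
]
outputs = {
    "c1": "Collect the trace file and open a support ticket.",
    "c2": "The documentation does not state a default timeout.",
}
judge_scores = {"c1": 0.9, "c2": 0.8}  # assumed output of the judge model

ok_to_merge = merge_allowed(cases, outputs, judge_scores, baseline_score=0.85)
```

Keeping the keyword constraints as a deterministic check, separate from the judge score, means a core regression blocks the merge even when the judge model happens to score the answer generously.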

Summary and Insights

Deploying AI Agents is not a one-time "alchemy" but a process of continuous engineering optimization.

If you find your Agent unstable in production, stop blindly tweaking Prompt wording. Instead, try approaching the problem from these three dimensions:
- Data Side: Is the retrieval quality high enough? Is there noise?
- Logic Side: Is there an independent verification step? Can self-correction be implemented?
- Evaluation Side: Do you have a quantifiable dataset that can quickly provide feedback on the impact of changes?

True stability comes from managing uncertainty, not from pursuing the perfect Prompt.
