How to Reduce Hallucinations in LLMs for Reliable Enterprise Use
Gen AI has been well received by enterprise decision-makers. Yet, wary of technology pitfalls from past experience, they have expressed serious concern about hallucinations: model responses that sound plausible but are false or not grounded in source data.
Just how challenging is the problem? A recent study found that LLMs may hallucinate between 3% and 27% of the time, depending on the model. In specific contexts, the rate can be far worse: another study found that LLMs provide false legal information between 69% and 88% of the time, a worrying figure given how much rides on legal transactions.
When LLMs Lie
What are hallucinations? Large language models (LLMs), such as GPT-4, Llama 3, and Mixtral, can generate rapid, fluent responses to varied user prompts in many scenarios. But some of these responses are nonsensical, some are untruthful yet hard to detect as incorrect, and a few are accurate but not derived from the source data; all of these count as hallucinations. The underlying causes include insufficient context during training, overfitting, and data ingestion errors such as incorrect encoding.
It is still early days, but undesirable scenarios caused by hallucinations may convince enterprise leaders to pull back on funding gen AI initiatives and make business heads reluctant to pilot or deploy solutions.
First, business users may hesitate to use LLM tools in their daily work once they find they cannot trust the output, setting up a significant barrier to adoption. Second, even a single undetected hallucination in a sensitive use case, such as health care, can cause serious harm to external stakeholders like patients and significant reputational damage to the organization, negating any ROI.
Guiding Models to Tell the Truth: Why Retrieval Augmented Generation (RAG) Helps
Technology leaders, acutely aware of the urgent need to increase the reliability of LLM outputs, are quickly creating effective governance, including automated and human-guided accuracy checks. The strategies being tried include adding guardrails to prompts, providing examples of the desired output while querying, and regularly fine-tuning models on curated, up-to-date data sets.
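To make the first two strategies concrete, here is a minimal sketch in Python, assuming an OpenAI-style chat message format; the guardrail wording, the few-shot pair, and the build_messages helper are hypothetical placeholders rather than any specific product's API.

# Guardrail instruction plus few-shot examples assembled into a chat request.
# All strings below are illustrative placeholders.
GUARDRAIL = (
    "You are a finance assistant. Answer only from the provided documents. "
    "If the documents do not contain the answer, reply: 'I don't have that information.'"
)

# One few-shot pair showing the desired refusal behavior.
FEW_SHOT = [
    {"role": "user", "content": "What was headcount growth last quarter?"},
    {"role": "assistant", "content": "I don't have that information."},
]

def build_messages(question: str, documents: str) -> list[dict]:
    """Assemble a guardrailed, few-shot message list for a chat-style LLM API."""
    return (
        [{"role": "system", "content": GUARDRAIL}]
        + FEW_SHOT
        + [{"role": "user", "content": f"Documents:\n{documents}\n\nQuestion: {question}"}]
    )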
While constant fine-tuning on massive data sets requires significant resources, the other two approaches are not structured enough to guarantee reliability. This brings us to a promising fourth route: retrieval augmented generation (RAG). RAG leverages the generative strength of an LLM while focusing it on a limited set of pre-approved and up-to-date information sources. For instance, if an internal user outside the finance team wants to know the company's latest turnover, a model not restricted by RAG may pick up these numbers from external websites of low credibility. A RAG-restricted model, however, can be instructed to get the numbers from the latest internal financial updates, improving accuracy.
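The mechanics are straightforward: retrieve the most relevant approved documents for a query, then force the model to answer only from them. Below is a minimal sketch, assuming scikit-learn for TF-IDF retrieval; the sample documents, their figures, and the final generate() call are hypothetical placeholders for whichever corpus and LLM endpoint an enterprise actually uses.

# Minimal retrieval-augmented prompt construction over an approved corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder internal documents; in practice these come from a vetted store.
APPROVED_DOCS = [
    "FY24 annual report: total turnover was USD 1.2B, up 8% year over year.",
    "Q1 FY25 internal update: quarterly turnover reached USD 310M.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k approved documents most similar to the query."""
    matrix = TfidfVectorizer().fit_transform(docs + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(query: str) -> str:
    """Ground the question in retrieved context and forbid outside knowledge."""
    context = "\n".join(retrieve(query, APPROVED_DOCS))
    return (
        "Answer using ONLY the context below. If the answer is not there, "
        f"say you do not know.\n\nContext:\n{context}\n\nQuestion: {query}"
    )

# The prompt is then sent to the LLM, e.g. answer = generate(build_prompt(query)).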
Hallucination Detection to Improve the Reliability of RAG-Restricted LLM Outputs
Restricting the LLM’s behavior with RAG can boost the reliability of its output and reduce hallucinations, but it does not eliminate them completely. Consider a marketing team using a tailored RAG-led LLM to scour the web for campaign ideas. The LLM may come up with something lifted from a successful competitor campaign, not understanding that while it should look at what competitors are doing, it must not reuse their material in its own ideas.
Data scientists are fast building strategies to avert such disasters. To spot hallucinations in the black-box LLMs that enterprises mainly use today, SelfCheckGPT, a recent research paper on hallucination detection, offers three approaches: BERTScore, which uses semantic similarity; a prompt-based method, which asks another LLM to judge consistency; and evidence-based evaluation leveraging natural language inference (NLI).
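The NLI variant checks each sentence of the main response against several independently sampled responses and treats frequent contradiction as a sign of hallucination. Here is a hedged sketch, assuming a Hugging Face NLI cross-encoder; the model name, the sample sentences, and the contradiction_score helper are illustrative assumptions rather than the paper's exact implementation.

# SelfCheckGPT-style NLI check: score a sentence against sampled responses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # assumed NLI model; any MNLI cross-encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Find the index of the "contradiction" label for this particular model.
CONTRA_IDX = next(i for i, lbl in model.config.id2label.items()
                  if "contradiction" in lbl.lower())

def contradiction_score(sentence: str, samples: list[str]) -> float:
    """Average probability that sampled responses contradict the sentence;
    higher values suggest the sentence is hallucinated."""
    scores = []
    for sample in samples:
        inputs = tokenizer(sample, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        scores.append(probs[CONTRA_IDX].item())
    return sum(scores) / len(scores)

# Example: one sentence from the main answer vs. three re-sampled answers.
print(contradiction_score(
    "Q3 revenue grew 12% year over year.",
    ["Revenue rose 12% in Q3.", "Q3 revenue was flat.", "Sales data is unavailable."],
))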
To test which of these is the most effective within a RAG set-up, we built a RAG system based on the Llama 2-13B-chat model, using a corpus of financial reports and a set of relevant questions, and applied the three approaches to evaluate the responses. The NLI-led method came out ahead, helping to deliver hallucination-free output 88.63% of the time with optimal resource utilization.
But is this enough, given that gen AI will soon see adoption in high-stakes situations where people’s lives or millions of dollars are on the line? To spot hallucinations at a finer granularity, identifying them within individual responses, we recommend the integrated gradients approach, which uses a baseline input to attribute the output to specific tokens and can detect hallucinations with up to 99% confidence. In the marketing use case, for example, the integrated gradients method can pick out the parts of a response that do not tally with the company’s brand and style guidelines.
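Integrated gradients attribute a model's output score to its input tokens by averaging gradients along a straight path from a baseline (for example, an all-zero embedding) to the actual input. The sketch below shows only that core computation, assuming a model that maps input embeddings to a single scalar score; the model, the scoring choice, and the thresholds for flagging spans are assumptions left to the specific deployment.

# Core integrated-gradients computation over input embeddings (PyTorch).
import torch

def integrated_gradients(model, embeddings, baseline=None, steps=50):
    """Attribute a scalar model output to each token's embedding.
    `embeddings` has shape (1, seq_len, dim); `model` returns a scalar."""
    if baseline is None:
        baseline = torch.zeros_like(embeddings)  # conventional all-zero baseline

    total_grads = torch.zeros_like(embeddings)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the real input.
        point = (baseline + alpha * (embeddings - baseline)).detach().requires_grad_(True)
        score = model(point)   # scalar score at this interpolation step
        score.backward()
        total_grads += point.grad
    # Average path gradient, scaled by how far the input is from the baseline.
    attributions = (embeddings - baseline) * total_grads / steps
    return attributions.sum(dim=-1)  # one attribution score per token

Tokens whose attributions show little support from the retrieved evidence can then be highlighted for human review, which is how response-level detection is refined into span-level detection.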
Ready for Enterprise-level Adoption
The combination of RAG, NLI-based checks, and integrated gradients gives enterprises a winning strategy for gen AI adoption. Users can confidently isolate problematic responses, increasing their trust in model output and making them more willing to use the technology frequently. While competitors struggle to tame pilot projects, IT teams that consistently generate high-quality output using this three-pronged method can rapidly scale LLMs enterprise-wide. Generative AI can then be extended to more use cases and complex workflows, empowering employees with new insights, increasing ROI, and cementing competitive advantage.