Understanding AI Hallucination in Chatbot Responses
Explore top LinkedIn content from expert professionals.
Summary
AI hallucinations occur when chatbots or large language models (LLMs) generate false information that sounds plausible but isn't backed by factual or verified data. While generative AI can be a powerful tool, understanding and minimizing hallucinations is critical, particularly in industries like law, healthcare, and customer service where accuracy is vital.
- Focus on input quality: Provide clear and well-constructed prompts to set boundaries for the chatbot’s output, reducing the chance of generating misleading or incorrect information.
- Leverage external knowledge: Use methods like Retrieval-Augmented Generation (RAG) to ensure AI responses are supported by trusted, up-to-date knowledge bases.
- Involve human oversight: Incorporate human review processes to verify AI-generated content, especially in high-stakes or complex applications.
-
A note of caution when using Gen AI to create content (e.g., blogs, whitepapers, podcast scripts). I recently ran a fun experiment: I gave GPT-3.5 the following prompt: "Write me a 500-word article on the business value of AI in customer service, written in an authoritative tone and providing statistics on ROI."

The resulting article was impressive at first glance. It offered compelling information backed by data attributed to Aberdeen Strategy & Research, Gartner, McKinsey & Company, and IDC. One of the data points just didn't make sense, though. It came from this paragraph:

"Furthermore, AI-driven chatbots have emerged as a game-changer in customer service, providing real-time assistance to customers round the clock. Research by Gartner predicts that by 2025, over 80% of customer service interactions will be handled by AI chatbots. These intelligent virtual agents can efficiently handle routine inquiries, offer personalized recommendations, and even facilitate transactions, all while providing a seamless conversational experience."

As an industry analyst who spent over a decade covering the #contactcenter & #cx space, I know 80% of customer interactions will not be handled by AI chatbots in a mere 8 months. #AI is just not ready for that. It's well suited to simple interactions, but it can't yet match the human critical thinking & empathy required to handle more complex interactions effectively. In fact, Aberdeen's latest research shows that as of February 2024, 49% of firms are using AI in their contact center.

So I did more (traditional online) research on the 80% figure and found that GPT's reference to the Gartner statistic was misrepresented. An August 2023 press release by the firm reports that it predicts 80% of service organizations will use #GenAI by 2025. (Side note: as of February 2024, Aberdeen's research shows Gen AI adoption in the contact center standing at half that predicted rate: 40%.)

This should be a good reminder that AI "hallucinations" are real. In other words, AI can make things up - in this case, misrepresent data while even referencing the sources of that data. When I asked GPT-3.5 to provide links for the sources of the data in the article it wrote, it responded that it can't provide real-time links but that I could trace the sources by following the titles of the articles it reported using to generate the content. A quick Google search on the source name GPT provided was how I discovered the actual context of the Gartner prediction that was misrepresented in the GPT-created article.

#Contentmarketing is changing rapidly. Gen AI is undoubtedly a very powerful tool that'll significantly boost #productivity in the workplace. However, it's not an alternative that can replace humans. Firms aiming to create accurate & engaging content should instead focus on empowering employees with AI capabilities, pairing human ingenuity with computer efficiency.
-
Don't be afraid of hallucinations! It's usually an early question in most talks I give on GenAI: "But doesn't it hallucinate? How do you use a technology that makes things up?" It's a real issue, but it's a manageable one.

1. Decide what level of accuracy you really need in your GenAI application. For many applications it just needs to be better than a human, or good enough for a human first draft. It may not need to be perfect.
2. Control your inputs. If you do your "context engineering" well, you can better point the model at the data you want it to use. Well-written prompts will also reduce the need for unwanted creativity!
3. Pick a "temperature". You can select a model setting that is more "creative" or one that sticks more narrowly to the facts; this adjusts the internal probabilities. "Higher temperature" results can often be more human-like and more interesting.
4. Cite your sources. RAG and other approaches allow you to be transparent about what the answers are based on, to give a degree of comfort to the user.
5. AI in the loop. You can build an AI "checker" to assess the quality of the output (see the sketch after this post).
6. Human in the loop. You aren't going to just rely on the AI checker, of course!

In the course of a few months we've seen concern around hallucinations go from a "show stopper" to a "technical parameter to be managed" for many business applications. It's by no means a fully solved problem, but we are highly encouraged by the pace of progress. #mckinseydigital #quantumblack #generativeai
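Below is a minimal sketch of points 3 and 5 from the post above: a low-"temperature" draft call plus a second-pass AI "checker". It assumes the OpenAI Python SDK; the model name, prompts, and PASS/FAIL format are illustrative choices, not the author's actual setup.

```python
# Minimal sketch (illustrative assumptions, not a McKinsey implementation):
# (a) low temperature so the model sticks more narrowly to the facts,
# (b) an "AI in the loop" checker pass that grades the draft's grounding.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_answer(question: str, context: str) -> str:
    """Generate a draft grounded in supplied context, with low temperature."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        temperature=0.2,      # lower = less "creative", closer to the provided facts
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided context. "
                                          "If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

def check_answer(question: str, context: str, answer: str) -> str:
    """AI checker: a second call grades whether the draft is supported by the context."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # the checker should be as deterministic as possible
        messages=[
            {"role": "system", "content": "You verify answers. Reply PASS if every claim "
                                          "is supported by the context, otherwise FAIL."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```

In practice the checker's verdict would decide whether the draft ships or gets routed to the human in the loop (point 6).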
-
While integrating generative AI into financial advisory services at Crediture, I encountered the propensity of LLMs to occasionally 'hallucinate', or generate convincing yet erroneous information. In this article, I share some of the strategies I had to implement to safeguard against hallucination and protect our users. In summary, they include:
▪ Constrained prompts that scope the capabilities of the LLM to minimize false information generation.
▪ Rigorous testing, including invalid-input testing with nonsensical prompts to detect over-eager responses.
▪ Evaluating confidence scores to filter out low-certainty responses and reduce misinformation risk.
Follow Crediture's LinkedIn Page to learn more and keep up with our latest advancements: https://lnkd.in/ggAH79yx
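As one way to picture the third strategy, here is a minimal sketch of confidence-score filtering based on token log-probabilities from the OpenAI API. The model name and the 0.80 cutoff are assumptions for illustration; this is not Crediture's actual implementation.

```python
# A minimal sketch (illustrative assumptions) of confidence-score filtering:
# compute an average token probability for the response and withhold answers
# that fall below a threshold.
import math
from openai import OpenAI

client = OpenAI()

CONFIDENCE_THRESHOLD = 0.80  # assumed cutoff; tune on held-out examples

def answer_with_confidence(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",           # assumed model
        logprobs=True,                 # ask the API to return token log-probabilities
        messages=[{"role": "user", "content": question}],
    )
    choice = resp.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric-mean token probability as a crude confidence proxy.
    confidence = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    if confidence < CONFIDENCE_THRESHOLD:
        return "I'm not confident enough to answer that; please consult an advisor."
    return choice.message.content
```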
-
Hallucination in LLMs refers to generating factually incorrect information. This is a critical issue because LLMs are increasingly used in areas where accurate information is vital, such as medical summaries, customer support, and legal advice. Errors in these applications can have significant consequences, underscoring the need to address hallucinations effectively.

This paper (https://lnkd.in/ergsBcGP) presents a comprehensive overview of the current research and methodologies addressing hallucination in LLMs. It categorizes more than thirty-two different approaches, emphasizing the importance of Retrieval-Augmented Generation (RAG), Knowledge Retrieval, and other advanced techniques. These methods represent a structured approach to understanding and combating hallucination, which is critical to ensuring the reliability and accuracy of LLM outputs across applications.

Here are the three most effective and practical strategies that data scientists can implement today:

1. Prompt Engineering: Adjusting prompts to provide specific context and expected outcomes, improving the accuracy of LLM responses.
2. Retrieval-Augmented Generation (RAG): Enhancing LLM responses by accessing external, authoritative knowledge bases, which helps in generating current, pertinent, and verifiable responses.
3. Supervised Fine-Tuning (SFT): Aligning LLMs with specific tasks using labeled data to increase the faithfulness of model outputs. This helps better match the model's output with input data or ground truth, reducing errors and hallucinations.
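To make strategy 2 concrete, here is a minimal RAG sketch under stated assumptions: a toy word-overlap retriever stands in for a real vector store, and the model name is an assumed placeholder. It illustrates the grounding pattern the paper surveys, not any specific method from it.

```python
# A minimal RAG sketch (illustrative, not the paper's reference implementation):
# retrieve the most relevant snippet from a small knowledge base, then ground
# the prompt in that snippet.
from openai import OpenAI

client = OpenAI()

KNOWLEDGE_BASE = [  # stand-in for a real, authoritative document store
    "Policy X covers refunds within 30 days of purchase.",
    "Policy Y requires manager approval for discounts above 15%.",
]

def retrieve(query: str) -> str:
    """Toy retriever: rank snippets by word overlap with the query."""
    def overlap(snippet: str) -> int:
        return len(set(query.lower().split()) & set(snippet.lower().split()))
    return max(KNOWLEDGE_BASE, key=overlap)

def rag_answer(query: str) -> str:
    context = retrieve(query)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        temperature=0.0,
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context; "
                                          "if the answer is not in the context, say you don't know."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```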
-
OpenAI says reusing three key parameters can substantially reduce hallucinations and encourage deterministic generations. tl;dr: set the same seed and temperature parameters with each GPT API call to mitigate LLMs' non-deterministic nature.

How?
(1) Set a seed by choosing any number and using it consistently across API requests.
(2) Ensure all other parameters (prompt, temperature, top_p) are identical for each call.
(3) Monitor the system_fingerprint field and ensure it doesn't change.

Elaborated explanation: Many developers don't know that every GPT API call returns an extra field called system_fingerprint, which is OpenAI's identifier for the currently running GPT model configuration. Storing and reusing the seed parameter for future API calls is likely to return the same result for the same system_fingerprint. Setting the same temperature further increases the likelihood of consistent results.

What do these three parameters have to do with reducing hallucinations?
(a) It is easier to identify hallucination patterns when responses are more consistent (i.e., similar) and to employ safety nets that mitigate downstream implications.
(b) More consistent generations also reduce the probability that a new hallucination pattern slips through the already-deployed safety nets.

Combined with advanced prompt engineering techniques, hallucinations can be significantly diminished: https://lnkd.in/g7_6eP6y

I'd be excited to see researchers publish the seed, system_prompt, temperature, and prompt in an AIConfig [0] format so others can easily reproduce their results. This would foster more reliable and trustworthy research at a time when the AI community questions the credibility of reported benchmarks.

[0] https://lnkd.in/gmvNTf8g from LastMile AI
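A short sketch of these settings with the OpenAI Python SDK is below; the model name and prompt are placeholders. It reuses a fixed seed and temperature and checks whether system_fingerprint changed between calls.

```python
# Minimal sketch of the reproducibility settings described above
# (model and prompt are assumptions).
from openai import OpenAI

client = OpenAI()

SEED = 12345        # pick any number and reuse it across requests
TEMPERATURE = 0.0   # keep sampling parameters identical between calls

def ask(prompt: str):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",      # assumed model
        seed=SEED,
        temperature=TEMPERATURE,
        top_p=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    # system_fingerprint identifies the backend configuration; if it changes
    # between calls, results may differ even with the same seed.
    return resp.choices[0].message.content, resp.system_fingerprint

answer_1, fp_1 = ask("List three causes of LLM hallucinations.")
answer_2, fp_2 = ask("List three causes of LLM hallucinations.")
if fp_1 != fp_2:
    print("Backend configuration changed; outputs may not be reproducible.")
```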
-
Big shout out to this intuitive yet highly effective way to detect hallucinations, or confabulations—a specific type of hallucination where the LLM's generation is a result of pure randomness.

The innovation is to measure the confidence level (a.k.a. log-likelihood or entropy) not over token sequences but over semantic meanings, thereby closing the gap between how humans understand language (by understanding its semantics) and how LLMs are trained (by predicting tokens).

The attached chart depicts how it works. Naive hallucination detection measures the confidence level of the most probable token sequence ("Paris" in the chart). But the signal strength is significantly lowered by the fact that multiple token sequences can carry identical semantic meanings ("It's Paris" and "France's capital Paris"), in which case it's hard to differentiate between fact (which carries a relatively high confidence level compared to the rest) and random generation (which carries a relatively low confidence level). By clustering token sequences into semantically equivalent groups, the relative signal strength of the correct answer is boosted (see the right-hand side of the chart) and can reliably be used to measure the true confidence level. Additionally, we can use the same LLM to perform the semantic-equivalence grouping by asking whether "Paris" and "It's Paris" have the same meaning, and vice versa.

It's a pleasure to read a paper where a simple trick greatly advances the usefulness of LLMs. Paper link: https://lnkd.in/e3sMpzeR

#artificialintelligence #machinelearning #deeplearning
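Here is a rough, self-contained sketch of the clustering idea (a simplification, not the paper's reference code): sample several answers, group them into semantically equivalent clusters, and compute entropy over the clusters' probability mass. The same_meaning judge is assumed to be an LLM or NLI model; the toy version below just checks for a shared keyword.

```python
# Rough sketch of semantic entropy over clusters of sampled answers
# (my simplification of the idea described above).
import math

def semantic_entropy(samples, same_meaning):
    """
    samples: list of (answer_text, sequence_logprob) from repeated sampling.
    same_meaning: callable(a, b) -> bool; assumed to be an LLM/NLI equivalence judge.
    """
    clusters = []  # each cluster is a list of (text, logprob)
    for text, logprob in samples:
        for cluster in clusters:
            if same_meaning(cluster[0][0], text):
                cluster.append((text, logprob))
                break
        else:
            clusters.append([(text, logprob)])

    # Probability mass per semantic cluster, normalized over the samples.
    masses = [sum(math.exp(lp) for _, lp in c) for c in clusters]
    total = sum(masses)
    probs = [m / total for m in masses]
    return max(0.0, -sum(p * math.log(p) for p in probs if p > 0))

# Toy usage: "Paris"-style answers collapse into one high-mass cluster, so
# entropy is low; scattered unrelated answers would give high entropy instead.
answers = [("Paris", -0.5), ("It's Paris", -1.0), ("France's capital Paris", -1.2)]
print(semantic_entropy(answers, lambda a, b: "paris" in a.lower() and "paris" in b.lower()))
```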
-
I agree with Confucius, who said, "True wisdom is knowing what you DON'T KNOW." And when the 'wisest' AI bot of them all, #ChatGPT, hallucinates instead of saying "I don't know," it might as well be stupid, and it is also potentially dangerous.

I find it crazy that it's not common practice to always output certainty along with every prediction. At a basic level, any reliable AI system, or intelligent entity for that matter, must know how certain it is of its responses. But more importantly, it must abstain from predicting when it's not sufficiently certain. 🤖

🍄 Foundation large language models (LLMs) are particularly prone to uncertainty: ambiguous prompts and general-purpose training data can lead to hallucinations.

🤷🏽 To address these challenges with uncertainty, Conformal Prediction (CP) to the rescue! It is considered the most robust method for uncertainty estimation because it provides reliable statistical guarantees by assigning confidence levels to predictions and calibrating predicted confidence levels with true frequencies. The method is flexible and applicable to various models and data types, with no retraining necessary, making it extremely versatile. If you want to learn more about CP and its many implementations, check out this awesome repository: https://lnkd.in/ecDa3GSd

🔬 Researchers from Princeton University and DeepMind have created "KnowNo," a CP-based framework for LLMs such as GPT-3.5 and PaLM-2L. It enables any LLM-based planner to recognize uncertainty in robot tasks (link in comments) and request human intervention when needed. Most importantly, this framework can be applied to other LLM pipelines without requiring (rather expensive) retraining. Moreover, the research team is actively working on extending the KnowNo framework to include vision-language models.

🤝 By incorporating uncertainty quantification into #AI systems, we can build trust and ensure safer interactions between humans and machines.

#LLMS #ConformalPrediction #UQ #ArtificialIntelligence #Robotics #ResponsibleAI #FoundationModels
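For intuition, here is a toy split-conformal sketch (my own illustration, not the KnowNo code): calibrate a nonconformity threshold on held-out data, form prediction sets, and treat a multi-option set as a signal to abstain and ask a human.

```python
# Toy split-conformal prediction sketch (illustrative assumptions throughout):
# calibrate a score threshold, build prediction sets, and abstain / ask a human
# whenever the set is not a single confident option.
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """cal_probs: (n, k) predicted class probabilities; cal_labels: (n,) true labels."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample-corrected quantile level
    return float(np.quantile(scores, q_level, method="higher"))

def prediction_set(probs: np.ndarray, qhat: float) -> list[int]:
    """All classes whose nonconformity score falls within the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

# Usage: with alpha=0.1 (~90% coverage), a multi-element set signals uncertainty,
# so the system should abstain or request human intervention.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=200)
cal_labels = cal_probs.argmax(axis=1)            # toy "ground truth" for the demo
qhat = conformal_threshold(cal_probs, cal_labels)
options = prediction_set(np.array([0.5, 0.4, 0.1]), qhat)
if len(options) != 1:
    print("Not sufficiently certain; asking a human:", options)
```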
-
#LLM #GENAI #LAW: #Hallucinate, much? Much anecdotal evidence supports that thesis about #GenerativeAI's clumsy foray into law, but research at Stanford University Human-Centered #ArtificialIntelligence delivers hard data. Here are just a couple of findings from Stanford's recent study:

- "[I]n answering queries about a court’s core ruling (or holding), models hallucinate at least 75% of the time. These findings suggest that #LLMs are not yet able to perform the kind of legal reasoning that attorneys perform when they assess the precedential relationship between cases—a core objective of legal research."
- "Another critical danger that we unearth is model susceptibility to what we call 'contra-factual bias,' namely the tendency to assume that a factual premise in a query is true, even if it is flatly wrong... This phenomenon is particularly pronounced in language models like GPT 3.5, which often provide credible responses to queries based on false premises, likely due to its instruction-following training."

Read the full article here: https://lnkd.in/gEab43qK
-
RAG solutions are failing, and a new Stanford study reveals why. 🚨

A Stanford study finds a 20-30% hallucination rate for AI RAG solutions claiming "100% hallucination-free results." Their analysis explains why many RAG systems fail to meet expectations.

Why hallucinations are still happening:
1. A RAG (Retrieval-Augmented Generation) solution retrieves a "relevant" document or chunk, but that reference may not actually "answer" the question.
2. In the case of legal research, a hallucination can occur when:
- A response contains incorrect information.
- A response includes a false assertion that a source supports a proposition.
3. Legal research involves an understanding of facts (not just style) beyond the information retrieved. This level of grounded knowledge and reasoning is beyond today's LLMs.

What should you do:
🧠 Understand the limits of RAG. LLMs have been purposely trained to output information that "appears" accurate and trustworthy. You need to include domain experts early in the evaluation. This will tell you whether RAG can answer questions correctly.
📖 Get educated. Go read the paper. It provides a wealth of examples. If you weren't an expert in the law, you would not be aware that the model is hallucinating.

The takeaway 📚
RAG helps us make sense of enormous amounts of information, so keep using it for preliminary research.
- Don't expect RAG to bring a sophisticated understanding of the subject matter.
- Evaluation must include subject matter experts! LLMs are designed to output information that "appears" accurate and trustworthy.

Go check out the paper: Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools https://lnkd.in/gXb7zVzk
-
Improving response quality in generative AI is a hot topic right now. At Botco.ai we've been experimenting with a new technique that takes inspiration from the scientific peer-review process. In a recent experimental project to enhance the reliability of our language models, we've been delving into innovative ways to improve the quality of their output. Here's a glimpse into the process we tested (a rough sketch of such a loop follows after this post):

1. A user types in an input/question.
2. This input is fed directly into Botco.ai's InstaStack (our retrieval-augmented generation - RAG - product).
3. A language model (LLM) then processes the output from InstaStack, carefully extracting relevant information from the knowledge base pertaining to the user's question.
4. The LLM then crafts a response, drawing from the insights it gathered and the original input.
5. Experimental feature: Another LLM critically reviews the proposed answer, cross-examining it against the user's input and the retrieved information to ensure accuracy and coherence.
6. If it detects a low-quality output (we've tested many thresholds), the response is refined and reassessed iteratively, with safeguards to avoid potential infinite loops.
7. Once the response is verified as accurate, it is delivered to the user.

Overall, this method yielded good results, but the user experience did take a hit as the messages took longer to deliver. From a business perspective, the experiment was a success in terms of quality control: much higher accuracy in responses and nearly zero hallucinations. Now the challenge is generating this result with less latency. On to the next experiment! I'd love your feedback on what we should try next.

CC: Crystal Taggart, MBA Vincent Serpico Ana Tomboulian Jacob Molina

#genai #aiexplained #llms
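For readers who want to see the shape of such a loop, here is a rough generate-review-refine sketch under stated assumptions. It is not Botco.ai's implementation: retrieve_context is a hypothetical stand-in for the InstaStack step, and the model name, scoring format, threshold, and iteration cap are illustrative choices.

```python
# Rough sketch of a generate-review-refine loop in the spirit of the process
# described above; NOT Botco.ai's implementation. `retrieve_context` is a
# hypothetical stand-in for the RAG retrieval step.
from openai import OpenAI

client = OpenAI()

QUALITY_THRESHOLD = 8   # assumed 0-10 cutoff
MAX_ITERATIONS = 3      # hard cap to avoid infinite refine loops

def retrieve_context(question: str) -> str:
    raise NotImplementedError("Stand-in for the RAG retrieval step (e.g., a vector store).")

def generate(question: str, context: str, feedback: str = "") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\n"
                                        f"Reviewer feedback (if any): {feedback}"},
        ],
    )
    return resp.choices[0].message.content

def review(question: str, context: str, answer: str) -> int:
    # Assumes the reviewer follows the single-integer instruction.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "Rate 0-10 how accurate and grounded the answer is. "
                                          "Reply with a single integer."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

def answer_with_review(question: str) -> str:
    context = retrieve_context(question)
    answer, feedback = generate(question, context), ""
    for _ in range(MAX_ITERATIONS):
        score = review(question, context, answer)
        if score >= QUALITY_THRESHOLD:
            break
        feedback = f"Previous draft scored {score}/10; fix unsupported claims."
        answer = generate(question, context, feedback)
    return answer
```

The extra review and refine calls are also where the latency cost mentioned above comes from: each iteration adds at least two model round-trips before anything reaches the user.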