Some challenges in building LLM-powered applications (including RAG systems) for large companies:

1. Hallucinations are very damaging to the brand. It only takes one for people to lose faith in the tool completely. Contrary to popular belief, RAG doesn't fix hallucinations.
2. Chunking a knowledge base is not straightforward. Poor chunking leads to poor context retrieval, which leads to bad answers from the model powering a RAG system.
3. As information changes, you also need to update your chunks and embeddings. Depending on the complexity of the information, this can become a nightmare (see the sketch after this list for one way to keep re-embedding incremental).
4. Models are black boxes. We can only modify their inputs (prompts), and it's hard to determine cause and effect when troubleshooting (e.g., why does "Produce concise answers" work better than "Reply in short sentences"?).
5. Prompts are brittle. Every new version of a model can cause your previous prompts to stop working, and you won't know why or how to fix them (see #4 above).
6. It is not yet clear how to reliably evaluate production systems.
7. Costs and latency are still significant issues. The best models cost a lot of money and are slow; cheap and fast models have very limited applicability.
8. There are not enough qualified people to deal with these issues. I cannot highlight this problem enough.

You may encounter several of these problems at once in a single project. Depending on your requirements, some may be showstoppers (hallucinated movement instructions for a robot) and others minor nuisances (a support agent hallucinating an incorrect product description). There's still a lot of work to do before these systems mature to the point where they are viable for most use cases.
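On point 3, one workable pattern is to fingerprint each chunk and re-embed only what changed between runs. The sketch below is a minimal illustration of that idea, not a production design: `embed` is a dummy placeholder for whatever embedding model or API you actually call, and the in-memory `index` dict stands in for a real vector store.

```python
import hashlib

def chunk_fingerprint(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed(text: str) -> list[float]:
    """Dummy placeholder; a real system would call an embedding model here."""
    return [float(len(text))]

def refresh_index(chunks: dict[str, str], index: dict[str, dict]) -> None:
    """Re-embed only chunks whose content changed since the last run."""
    for chunk_id, text in chunks.items():
        fp = chunk_fingerprint(text)
        stored = index.get(chunk_id)
        if stored is not None and stored["hash"] == fp:
            continue  # content unchanged, keep the existing embedding
        index[chunk_id] = {"hash": fp, "vector": embed(text)}
    for stale_id in set(index) - set(chunks):
        del index[stale_id]  # drop chunks that no longer exist in the source

index: dict[str, dict] = {}
refresh_index({"doc1#0": "Return policy: 30 days."}, index)
refresh_index({"doc1#0": "Return policy: 60 days."}, index)  # only this chunk is re-embedded
```

Deleting stale entries matters as much as adding new ones; otherwise retrieval keeps surfacing passages that no longer exist in the source documents.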
Challenges of Retrieval Augmented Generation
Summary
Retrieval-augmented generation (RAG) combines large language models (LLMs) with external data sources to generate more accurate, context-aware responses. However, creating scalable and reliable RAG systems presents challenges, from reducing inaccuracies like hallucinations to optimizing data retrieval methods.
- Focus on context accuracy: Design retrieval systems that maintain relevance by refining query structures and understanding complex relationships between data points.
- Anticipate changing data: Implement adaptive systems that can update knowledge bases and embeddings to account for evolving information efficiently.
- Streamline performance: Balance speed and accuracy by employing multi-tiered storage strategies and tools like query refinement to ensure relevant and timely responses (see the sketch below for one way to layer the storage tiers).
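As a rough illustration of the "multi-tiered storage" idea in the last bullet, the sketch below checks an exact-match FAQ cache before falling back to a vector search. The `embed` function, the cache contents, and the tiny in-memory vector store are toy stand-ins chosen only to keep the example self-contained.

```python
import math

def embed(text: str) -> list[float]:
    """Toy bag-of-letters vector; a real system would call an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Tier 1: exact-match cache of frequent questions (fast and cheap).
faq_cache = {"what is your return policy?": "Returns are accepted within 30 days."}

# Tier 2: vector search over document chunks (slower, broader coverage).
chunks = ["Shipping takes 3-5 business days.", "Premium support is available 24/7."]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str) -> str:
    cached = faq_cache.get(query.strip().lower())
    if cached is not None:
        return cached  # answered from the hot tier, no vector search needed
    q_vec = embed(query)
    best_chunk, _ = max(vector_store, key=lambda item: cosine(q_vec, item[1]))
    return best_chunk

print(retrieve("What is your return policy?"))   # hits the cache
print(retrieve("How long does shipping take?"))  # falls back to vector search
```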
Many companies have started experimenting with simple RAG systems, often as their first use case, to test how well generative AI can extract knowledge from unstructured data like PDFs, text files, and PowerPoint files. If you've built basic RAG architectures with tools like LlamaIndex or LangChain, you have probably already run into three key problems:

𝟭. 𝗜𝗻𝗮𝗱𝗲𝗾𝘂𝗮𝘁𝗲 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Existing metrics fail to catch subtle errors like unsupported claims or hallucinations, making it hard to accurately assess and improve system performance.

𝟮. 𝗗𝗶𝗳𝗳𝗶𝗰𝘂𝗹𝘁𝘆 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗖𝗼𝗺𝗽𝗹𝗲𝘅 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀: Standard RAG methods often struggle to find and combine information from multiple sources, leading to slower responses and less relevant results.

𝟯. 𝗦𝘁𝗿𝘂𝗴𝗴𝗹𝗶𝗻𝗴 𝘁𝗼 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗮𝗻𝗱 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻𝘀: Basic RAG approaches often miss the deeper relationships between pieces of information, resulting in incomplete or inaccurate answers that don't fully meet user needs.

In this post I will introduce three useful papers that address these gaps:

𝟭. 𝗥𝗔𝗚𝗖𝗵𝗲𝗰𝗸𝗲𝗿: Introduces a framework for evaluating RAG systems with fine-grained, claim-level metrics: claim-level precision, recall, and F1 to measure the correctness and completeness of responses; claim recall and context precision to evaluate the retriever; and faithfulness, noise sensitivity, hallucination rate, self-knowledge reliance, and context utilization to diagnose the generator. Consider using these metrics to identify errors, improve accuracy, and reduce hallucinations in generated outputs (a toy version of the claim-level scores is sketched after this post).

𝟮. 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗥𝗔𝗚: Uses a labeler-and-filter mechanism to identify and retain only the most relevant parts of retrieved information, reducing the need for repeated large language model calls. This iterative approach refines search queries efficiently, lowering latency and cost while maintaining high accuracy on complex, multi-hop questions.

𝟯. 𝗚𝗿𝗮𝗽𝗵𝗥𝗔𝗚: By leveraging structured data from knowledge graphs, GraphRAG methods enhance retrieval, capturing complex relationships and dependencies between entities that traditional text-based retrieval often misses. This enables more precise, context-aware generation, which is particularly valuable in domains that require a deep understanding of interconnected data, such as scientific research, legal documentation, and complex question answering. In tasks such as query-focused summarization, for example, GraphRAG shows substantial gains by leveraging graph structure to capture both local and global relationships within documents.

It's encouraging to see how quickly gaps are identified and improvements are made in the GenAI world.
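To make the RAGChecker-style metrics more concrete, here is a toy version of claim-level precision, recall, and F1. It assumes claims have already been extracted and compares them by exact string match; the actual RAGChecker framework extracts claims with an LLM and checks entailment rather than equality, so treat this only as the shape of the computation.

```python
def claim_level_scores(response_claims: set[str], gold_claims: set[str]) -> dict[str, float]:
    """Toy claim-level precision/recall/F1 in the spirit of RAGChecker.

    Assumes claims are already extracted; RAGChecker itself uses an LLM to
    extract claims and checks entailment instead of exact matching.
    """
    if not response_claims or not gold_claims:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    supported = response_claims & gold_claims           # response claims backed by the gold answer
    precision = len(supported) / len(response_claims)   # how much of the response is correct
    recall = len(supported) / len(gold_claims)          # how much of the gold answer is covered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(claim_level_scores(
    {"returns are accepted within 30 days", "refunds take 5 days"},
    {"returns are accepted within 30 days", "refunds go to the original card"},
))
# -> {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```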
The Harsh Reality of Building a Production-Ready RAG Pipeline

Building an AI chatbot with a RAG pipeline sounds simple—just watch a few YouTube tutorials, throw in an off-the-shelf LLM API, and boom, you have your own AI assistant. But anyone who has ventured beyond the tutorials knows that a real-world, production-level RAG pipeline is a completely different beast.

I'm almost a month into my journey at LLE, where I've been working on an in-house RAG pipeline built on foundation models—not just for efficiency, but also to prevent data breaches and ensure enterprise-grade robustness. And let me tell you, the challenges are far from what the simplified tutorials portray.

A Few Hard-Hitting Lessons I've Learned:

✅ Chunking is not just splitting text. You can use pymupdf to extract chunks, but it falls short when you need adaptive chunking—especially for scientific documents, where preserving tables, equations, and formatting is critical. This is where vision transformer models that perform OCR and convert scientific documents into a markup language come into play.

✅ Query refinement is everything. A chatbot is only as good as the data it retrieves. Rewriting follow-up queries effectively is key to making sure the LLM understands intent correctly; precision in query structuring directly affects retrieval efficiency and response quality (a minimal sketch of this rewriting step follows the post).

✅ Optimizing retrieval = speed + relevance. It's not just about retrieving data faster; it's about retrieving the right data. Reducing the number of candidate chunks improves retrieval efficiency, but that's not enough—multi-tiered storage strategies ensure queries hit the right system for fast and relevant responses.

These are just a few of the many challenges that separate a toy RAG implementation from a real-world, scalable, and secure pipeline. The deeper I dive, the clearer it becomes: production-ready AI isn't just about making things work; it's about making them work at scale, securely, and efficiently.

Would love to hear from others working in this space—what are some of the biggest roadblocks you've faced while building a RAG pipeline? 🚀
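On the query-refinement point above, a common pattern is to rewrite a follow-up question into a standalone query before retrieval. The sketch below shows the general shape under that assumption: `call_llm` is a placeholder for whatever chat-completion client you use, and the rewrite shown in the final comment is only what one would expect an LLM to return, not a guaranteed output.

```python
def rewrite_followup(history: list[tuple[str, str]], followup: str, call_llm=None) -> str:
    """Rewrite a follow-up question into a standalone query before retrieval.

    `call_llm` is a placeholder for whatever chat-completion client you use;
    without one, the prompt itself is returned so the sketch stays runnable.
    """
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = (
        "Rewrite the final user question as a standalone search query, "
        "resolving pronouns and implicit references from the conversation.\n\n"
        f"{transcript}\nUser: {followup}\nStandalone query:"
    )
    return call_llm(prompt) if call_llm else prompt

history = [("What does the premium plan cost?", "The premium plan is $30 per month.")]
print(rewrite_followup(history, "Does it include support?"))
# With an LLM attached, a reasonable rewrite would be something like:
# "Does the premium plan include support?"
```

The standalone query, not the raw follow-up, is what gets embedded and sent to the retriever, which is usually where most of the relevance gains come from.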