Optimizing Enterprise Chatbots with LlamaIndex Chat Modes in a RAG System
Large language models (LLMs) like GPT have immense potential for enterprise applications. However, applying them to real-world scenarios requires additional optimization. This is where retrieval-augmented generation (RAG) systems come in. RAG combines the reasoning abilities of LLMs with retrieval of relevant information from a company's knowledge base.
We discussed text chunking strategies in our previous post on using large language models (LLMs) for enterprise applications. Today we will discuss how to design your system so that the 'retrieval' step of a RAG system is optimal. We will focus on the LlamaIndex features available for optimizing this step. Even if you do not use LlamaIndex, the concepts still apply, and you can implement them on your own.
Why do enterprise applications need a RAG system?
LLMs are trained on massive datasets but may lack the full context to answer queries accurately. RAG augments LLMs by retrieving relevant information from a knowledge base. A key part of RAG systems is retrieving appropriate text chunks from the knowledge database to generate relevant responses to user queries.
LlamaIndex provides helpful abstractions for building RAG systems and controlling retrieval and responses. Specifically, LlamaIndex offers chat modes to set the context before retrieval and response modes to handle retrieved data. Properly leveraging these modules ensures that generated responses are highly relevant to the user query.
Here are our insights from experiments with LlamaIndex's chat modes to demonstrate how to optimize RAG system performance:
LlamaIndex Chat Modes: add user context for better retrieval
Below is a high-level architecture of a RAG system and where chat modes come into the picture. The high-level steps to build such a system are:
- Step 1: Embed and store text from the knowledge base.
- Step 2: Identify the context of the user query before retrieval.
- Step 3: Use semantic search to retrieve chunks relevant to the query.
LlamaIndex's chat modes help set the context (Step 2 of the RAG system). These chat modes affect retrieval quality and response accuracy, as data can come either from the knowledge base or from the LLM's prior knowledge. You can read more about them in the LlamaIndex documentation.
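To make this concrete, here is a minimal sketch of the pipeline using LlamaIndex's high-level API (pre-1.0 import paths; newer releases expose the same classes under `llama_index.core`). The `data` directory and the sample question are placeholders, and an OpenAI API key is assumed to be configured:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Step 1: load, embed, and store text from the knowledge base
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Steps 2 and 3: the chat engine sets the query context according to the
# chosen chat mode, then retrieves relevant chunks via semantic search
chat_engine = index.as_chat_engine(chat_mode="context")  # or "condense_question", "react"
response = chat_engine.chat("What did the author work on before college?")
print(response)
```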
Experiments and observations on chat modes
We conducted a variety of tests on business-related topics. Though we performed these tests on our own data, we share results here using Paul Graham's essays to make our findings easy to reproduce.
Context mode: A simple chat mode using question context
This mode searches the knowledge base with the user's question directly. It adds the context of all previous questions in the chat to the LLM prompt on each call. While this approach works well for standalone queries, it struggles when questions are interdependent. We also observed that this mode taps into the prior knowledge of the LLM while answering the question.
Even though it uses the LLM's knowledge to provide answers, there can be mismatches. For instance, in the image below, the follow-up query "Tell me more" retrieves information from the knowledge base that is unrelated to the initial question. The context is nonetheless maintained, as the answer still pertains to Apple II computers; this is likely due to the LLM's inherent knowledge. Given this limitation, Context mode might not be ideal for chatbots that rely heavily on a specific knowledge base.
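For reference, this is how Context mode is selected; the two queries below mirror the experiment described above (the exact wording of our test questions is illustrative):

```python
# Reuses the `index` built earlier
chat_engine = index.as_chat_engine(chat_mode="context")

# Standalone question: retrieval on the raw query works well
print(chat_engine.chat("What was the gold standard of computers in 1980?"))

# Follow-up: "Tell me more" is searched as-is, so the retrieved chunks may
# be unrelated; the LLM's prior knowledge keeps the answer on topic
print(chat_engine.chat("Tell me more"))
```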
Condense mode: Refining questions with context
This mode transforms each question based on past queries before searching the knowledge base. While it's great for preserving the chat's context and ensuring answers come strictly from the knowledge base, it struggles with meta-questions and broader knowledge topics. It also requires prompt tuning to optimize retrieval.
For example, when asked "Tell me more," this mode transforms the query into "What were the top computer options in 1980, and which was the gold standard?," providing an answer from the knowledge base. Prompt engineering can further improve the accuracy of these question transformations. However, it cannot handle questions like "What was my previous question?" because every question is rewritten and answered via semantic search against the knowledge base.
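Below is a sketch of Condense mode with a custom condense prompt, which is where the prompt tuning mentioned above happens. The prompt text is our own illustrative wording, not LlamaIndex's default; `{chat_history}` and `{question}` are the variables the condense step fills in:

```python
from llama_index.prompts import PromptTemplate

# Illustrative rewrite prompt; tune this to improve question transformations
custom_prompt = PromptTemplate(
    "Given the conversation below and a follow-up message, rephrase the "
    "follow-up as a standalone question that preserves all specifics.\n"
    "Chat history:\n{chat_history}\n"
    "Follow-up message: {question}\n"
    "Standalone question: "
)

chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    condense_question_prompt=custom_prompt,
)

# "Tell me more" is first condensed into a standalone question before
# retrieval; meta-questions like "What was my previous question?" fail
# because they too are rewritten and searched against the knowledge base
print(chat_engine.chat("Tell me more"))
```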
ReAct mode: Leveraging the LLM's reasoning for responses
The ReAct mode chooses between querying the knowledge base and tapping into the LLM's own knowledge to reason and answer contextually. However, there is no control over when ReAct mode uses the knowledge base versus general knowledge.
For the question "What was the gold standard in 1980?", even though the correct text is fetched using the query engine tool, the LLM decides that the data isn't enough. It then turns to its existing knowledge, which results in an inaccurate answer. The behavior of the ReAct agent heavily depends on the LLM being used. It's best suited for chatbots that sometimes need broader general knowledge.
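A sketch of ReAct mode follows; `verbose=True` prints the agent's thought and action trace, so you can observe when it calls the query engine tool and when it falls back on the LLM's own knowledge:

```python
# The agent decides per turn whether to use the query engine tool
chat_engine = index.as_chat_engine(chat_mode="react", verbose=True)

# Even when the correct text is fetched, the agent may judge it
# insufficient and answer from the LLM's prior knowledge instead
print(chat_engine.chat("What was the gold standard in 1980?"))
```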
Conclusion
Chat modes are instrumental in honing RAG systems built with LlamaIndex, guiding them toward contextually relevant outputs. Through testing, we uncovered behaviors that were not apparent from the documentation and identified the nuances of each mode: the direct querying of Context mode, the question transformation of Condense mode for continuous context, and the reasoning power of the ReAct agent backed by the LLM. While each has its merits, it's crucial to recognize their respective limitations. Based on these experiments and observations, we chose Condense mode for our use case.
In the next post, we'll dive deeper into LlamaIndex response modes.