Enhancing Large Language Models with RAG

27 June 2024
10 min read
Utilizing retrieval augmented generation (RAG) models with LLMs promises better effectiveness of all LLM-based solutions. Explore more.

As of 2024 and potentially beyond, Retrieval-augmented generation (RAG) models are gaining traction. As evidence, Databricks reports that 60% of their LLMs use RAG. And, since the global LLM market, valued at $4.35 billion, shows an extremely fast compound annual growth rate (CAGR) equal to 35.9%, using RAG seems more than beneficial. 

In this post, you’ll learn basic concepts related to retrieval-augmented generation. Besides, you’ll find out how it works in relationship with LLMs, what tools are used to integrate RAG, and what the workflow of processing a user’s query looks like when there’s RAG technology involved. 

What is retrieval-augmented generation and why do you need it?

Retrieval-Augmented generation or simply RAG is a generative AI framework that augments or simply updates and improves Large Language Models (LLMs) with relevant data from chosen sources. 

RAG technique is primarily suited to Natural Language Processing (NLP) applications. RAG combines retrieval-based methods with generation-based models, leveraging the strengths of both approaches and putting them to use. 

Here’s why developers widely used RAG with LLMs:

  • Enhanced accuracy of LLMs’ predictions
  • Usage of up-to-day information in real-time
  • Higher-quality outputs that are both contextually accurate and creatively generated
  • Better efficiency when working with high volumes of data
  • Better capabilities for operating with highly customized responses

Using the RAG approach has no cons except for making the system more complex, and, therefore somewhat harder to develop and maintain. 

Need consulting with an expert provider of AI and ML development services? Reach out to Intelliarts. 

How does RAG work in LLMs?

RAG utilizes pre-trained LLMs, which allows it to generate text. At the same time, the technology incorporates a retrieval mechanism that allows the model to access external knowledge sources. This way, RAG integrates retrieval into the generation processes.

As for the RAG’s retrieval mechanism, it operates by retrieving relevant texts from external sources, which are later used as inputs to the generative model. The latter synthesizes the information and produces coherent and contextually appropriate responses.

Here’s the complete two-step procedure for the model to learn how to select the best documents and generate appropriate responses based on them simultaneously:

Step 1: Document retrieval with DPR techniques

RAG frameworks start by exploiting the dense passage retrieval (DPR) techniques. DPR encodes both the input query and the external documents into dense vectors using a transformer-based model. Such dense vectors represent the query and the documents in a high-dimensional space. It’s used by the model as a reference for retrieving documents. Basically, the selection of suitable documents is conducted by calculating the cosine similarity of the dot product between the query vector and document vectors and choosing the ones with the best scores. 

The use of dense vectors is a key feature of DPR that sets it apart from traditional information-retrieval methods that often use sparse representations. It allows the extraction of documents that are semantically related to the query, even if keyword matches are absent. It makes DPR effective in cases when a deep understanding of context is a must.

Step 2: Conditioning the response generation through BART

Once the relevant documents are retrieved, they are used to condition the response generation. This is done using a sequence-to-sequence model, such as Bidirectional and auto-regressive transformers (BART), which generate responses based on both the input and the retrieved documents. The documents are regarded as input extensions. So, the model is able to learn to generate responses based on both the original input and extensions to it. 

Integrating DPR within the RAG framework enhances the model’s knowledge base, making it exceptionally adept at handling tasks that demand a profound comprehension of the subject matter.

What is retrieval-augmented generation (RAG) in LLMs?

The RAG-LLM relationship in executing retrieval and generative functions

On the backend, RAGs execute retrieval and generative models in 5 key steps. Here’s the detailed procedure description:

#1 Sourcing

Sourcing from internal text documents and enterprise systems is the starting point of an RAG solution. Here, the unstructured and structured source data is searched and queried. The outcome of the sourcing stage is the pool of relevant information. 

Pro tip: To ensure accurate and diverse data sourcing, it’s essential to maintain up-to-date metadata and minimize data redundancy.

#2 Pre-processing data for retrieval

It’s essential to organize multi-source data and metadata for real-time access, which includes integration and unification. Here are the possible types of data:

  • Customer master data. Customer ID, name, contact information, and account information
  • Transactional data. Service requests, purchases, payments, and invoices
  • Interaction data. Emails, chats, and phone call transcripts

Depending on the use case, data may need to be arranged by other business entities, such as employees, products, or suppliers.

#3 Pre-processing documents for retrieval

Unstructured data, such as textual documents, must be divided into smaller, manageable pieces through chunking. 

Chunking is the process of breaking down large pieces of unstructured data, such as textual documents, into smaller, more manageable units called “chunks.”

Textual chunks are then transformed into vectors for storage in a vector database, enabling efficient semantic search. This embedding process, using an LLM-based embedding model, links the embeddings back to the source, ensuring more accurate and meaningful responses.

#4 Protecting the data

Here, Role-Based Access Controls (RBAC) and Dynamic Data Masking are the primary protection measures. 

RBAC is a security mechanism that restricts access to information based on the roles of individual users within an organization.

At the same time:

Dynamic data masking is a technique used to hide sensitive data in real-time, ensuring that unauthorized users see only obfuscated or anonymized information.

By using RBAC and dynamic data masking, a company ensures that sensitive customer information is protected and only accessible by authorized personnel. Thereby it complies with GDPR requirements for data minimization and access control and provides the required security level. 

#5 Engineering the prompt from enterprise data

After enterprise application data is retrieved, the retrieval-augmented generation model generates an enriched prompt by building a “story” out of the retrieved 360-degree data. 

Ongoing prompt engineering, ideally aided by machine learning models and advanced chain-of-thought prompting techniques, enhances the prompt’s quality. This might involve reflecting on the data to determine additional information needs or checking if the user has finished the conversation.

Bonus Example

Let’s review a more practical instance, revealing how the RAG with LLM relationship works on user queries. The RAG architecture diagram below illustrates the retrieval-augmented generation framework and data flow from the user prompt to response:

Retrieval-augmented generation framework and data flowSo, in practice, the simplified procedure encompasses these steps:

  1. The user enters a prompt.
  2. The retrieval model accesses the company’s internal sources for structured data and unstructured documents.
  3. The retrieval model enriches the user’s prompt with additional contextual information.
  4. The enriched prompt is passed to the generation model (LLM).
  5. The LLM uses the augmented prompt to generate a more accurate and relevant response.
  6. The response is provided to the user.

Integrating RAG with language models

Here’s the basic, step-by-step procedure for integrating RAG with any of the top large language models:

1. Setup prerequisites

Hugging Face Account: Sign up for a free Hugging Face account to access embedding models and download the Llama-2 model. Recommended options and ways of their usage are:

  • Llama-2: Request access to Meta’s Llama-2 model via Meta’s website for integration.
  • LlamaIndex: Use this framework to load data and feed it into Llama-2.
  • Chroma: Employ this embedding database for fast vector similarity search and retrieval to store the index.

2. Install required libraries

Install the necessary libraries for the project.

3. Import necessary modules

Import the required modules from the libraries to set up vector indexing, embeddings, storage functionalities, and other computations.

4. Provide context to the model

Load the data into the model. For this demonstration, a PDF research paper is used. However, other data sources like Arxiv Papers Loader can also be used.

5. Download and configure Llama-2

Set up an account with Meta and have your access token ready. Configure the model to fit into constrained memory environments, such as Google Colab, by using 4-bit quantization.

6. Initialize the language model

Initialize the Llama-2 model with the necessary configurations, including the model name, tokenizer name, query wrapper prompt, and quantization settings.

7. Compare outputs without context

Test the model by asking questions without providing any additional context to see how it responds and identify hallucinations.

8. Create a ChromaDB client and collection

Set up a client to interact with the ChromaDB database and create a new collection to hold the vector index.

9. Initialize the embedding model

Set up the HuggingFaceEmbedding class with a pre-trained model for embedding the text into vectors.

10. Index-embedded document vectors

Set up the vector store and use it to index the embedded document vectors, allowing quick retrieval of relevant passages for a given query.

11. Establish a summary index

Create a SummaryIndex to enable the model to summarize the data efficiently and retrieve pertinent information.

12. Query with indexed data

Ask the model the same question, this time using the indexed data to provide relevant context and observe the improved response quality.

13. Enable interactive conversations

Set up a chat engine to have a back-and-forth conversation with the model about the content of the PDF file and ask follow-up questions to support more productive and natural interactions.

As a result, you should obtain an instrument capable of answering user queries with data that is reinforced with additional information from the database in real-time. 

Looking for software engineering services from a trusted provider? Drop Intelliarts a line and let’s discuss. 

Key tools for integrating RAG with language models

You can see the comparison of the key tools for integrating RAG in the table below:

Key Tools for Integrating RAG with Language Models

Here are the Gitgub repositories for each of the provided tools:

Intelliarts experience with retrieval augmented generation language models

Data extraction chatbot development using ChatGPT

Data extraction chatbot development using ChatGPTThe customer, who is a non-governmental organization that advocates for gun control, requested to create a software solution. It was intended to allow employees to extract data from their internal knowledge database in an easy way. 

To assist the customer, the Intelliarts team built a chatbot using the GPT-4 model reinforced with the RAG model specifically for empowering the report requests. The chatbot was trained on the customer’s data. It can carry on a conversation exclusively about gun safety and related topics. It also can provide references as additional information.

The end solution which is custom and well-tailored to the customer’s workflow and content topics offers better productivity, automation, streamlined decision making, and ease of use. 

Discover more about the chatbot development case study.

AI assistant to help students pass the cybersecurity course

AI assistant

The Intelliarts’ customer, who is a provider of cybersecurity training, reached out to Intelliarts with the request to create an AI agent. The solution was intended to help analyze students’ work in virtual environments and provide recommendations. 

As a solution, the Intelliarts team provided recommendations regarding data collection. We researched and consulted the company on building an AI assistant by combining a classification ML approach and an LLM with an RAG model. Our team completed the discovery phase and built the PoC proved to perform the intended tasks. Now we’re moving to the full-scale solution implementation. 

As for business outcomes, the Intelliarts team proved the feasibility of such an AI agent and provided the customer with recommendations for the next project steps, including data pipeline and data type requirements. 

Discover more about the AI assistant development case study.

Final take

RAG enhances the capabilities of LLMs by combining retrieval-based methods with generative models. This innovative approach provides accurate, contextually rich responses and supports high data volumes. Businesses adopting RAG can achieve better decision-making, efficiency, and customer satisfaction. 

Should you need RAG-based solution development or another AI/ ML assistance, don’t hesitate to reach out to the Intelliarts team. With more than 24 years in the market and hundreds of completed projects under our belt, we are ready, willing, and able to address your needs with flying colors. 

Need AI or ML expert?
Contact Us


See all questions
Volodymyr Mudryi
DS/ML Engineer
Rate this article
1 ratings
White paper
White paper
Turning Predictive Maintenance into a Success Story for Your Manufacturing Company
Download now
Related Posts