Understanding Retrieval Augmented Generation
Understanding how RAG Content Creation works behind the scenes through the fusion of Retrieval and Generation Techniques
Introduction
Retrieval Augmented Generation (RAG) has become a fundamental technique in the fast-moving landscape of Large Language Models (LLMs). This article explains what RAG is and how it allows AI models to retrieve relevant information from external sources and incorporate it into generated text.
It also explores how combining retrieval with generation improves content creation and broadens what language applications can do.
What is Retrieval Augmented Generation?
RAG is an artificial intelligence technique that combines information retrieval with text generation. LLMs are trained on enormous bodies of data, but they are not trained on your private data. Retrieval-Augmented Generation solves this problem by adding your private data to the data the LLM already has access to.
RAG was first introduced by Meta AI researchers in 2020 in their paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, to address exactly these knowledge-intensive tasks. In essence, RAG combines an information retrieval component with a text generator model.
The diagram above depicts a process in which users load data from multiple sources (structured, semi-structured, and unstructured). Once the data is loaded, it is prepared for queries (it is “indexed”). A user query then acts on the index, which filters the data down to the most relevant context. This context and the user query go to the LLM along with a prompt, and the LLM responds. This process can be simplified as in the following diagram:
To understand what the whole process involves technically, it is necessary to understand the different stages within RAG:
The five key stages shown in the image above are a fundamental part of any application you build on top of an LLM, specifically with LlamaIndex:
- Loading: This stage ingests your data from the sources you have available, whether structured, semi-structured, or unstructured, into your pipeline. For this purpose, LlamaHub provides hundreds of connectors to choose from.
- Indexing: This is one of the fundamental processes in RAG. It consists of creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- Storing: Once the data has been indexed, you will almost always want to store the index, as well as other metadata, to avoid having to re-index it.
- Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.
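To make the indexing, storing, and querying stages more concrete, here is a minimal sketch in plain Python. It is not LlamaIndex code: the `embed` function is a toy stand-in (normalised word counts) for a real embedding model, and the names `build_index`, `save_index`, `load_index`, and `query` are hypothetical helpers chosen for illustration.

```python
import json
import math
import re
import tempfile
from collections import Counter
from pathlib import Path


def embed(text: str) -> dict[str, float]:
    """Toy 'embedding': L2-normalised word counts (assumption: stands in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_index(chunks: list[str]) -> list[dict]:
    """Indexing: pair every chunk of text with its vector."""
    return [{"text": chunk, "vector": embed(chunk)} for chunk in chunks]


def save_index(index: list[dict], path: Path) -> None:
    """Storing: persist the index so it does not have to be rebuilt."""
    path.write_text(json.dumps(index), encoding="utf-8")


def load_index(path: Path) -> list[dict]:
    """Reload a previously stored index from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


def query(index: list[dict], question: str, top_k: int = 1) -> list[str]:
    """Querying: return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return [entry["text"] for entry in ranked[:top_k]]


chunks = [
    "Invoices are due within thirty days of delivery.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]

# Build the index once, store it, and query the restored copy.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    save_index(build_index(chunks), index_path)
    restored = load_index(index_path)

top_chunks = query(restored, "When are invoices due?")
print(top_chunks)
```

A real pipeline would replace `embed` with a learned embedding model and the JSON file with a vector store, but the shape of the three stages stays the same.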
Main Components
Besides the previously described stages, RAG has two main components:
The Retrieval Model essentially acts as a specialized ‘librarian’ that pulls in relevant information from the sources it has available, from databases to very large document corpora, to find pieces of information that can be used for text generation. These models use algorithms to rank and select the most pertinent data, offering a way to introduce external knowledge into the text generation process.
Retrieval models can be implemented through the use of vector embeddings and vector search, and also document indexing databases.
On the other hand, the Generative Model will act as the ‘writer’ that crafts coherent and informative text based on the retrieved data. Once the retrieval models have sourced the required information, generative models come into play by synthesizing the retrieved information into coherent and contextually relevant text. Through this process, they have the capability of creating text that is grammatically correct, semantically meaningful, and aligned with the initial user query.
They both work together to provide answers that are not only accurate but also contextually rich.
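The way the two components cooperate can be sketched roughly as follows. This is an illustrative sketch, not the method of any particular library: the retriever here ranks passages by simple word overlap, and `fake_llm` is a placeholder for a real generative model. All function names are assumptions made for the example.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lower-cased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """The 'librarian': rank passages by how many query words they share."""
    q = words(query)
    scored = [(len(q & words(passage)), passage) for passage in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep only passages that share at least one word with the query.
    return [passage for score, passage in ranked[:top_k] if score > 0]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved passages and the user query into one prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, corpus: list[str], generate: Callable[[str], str]) -> str:
    """The 'writer' is any callable that maps a prompt to generated text."""
    return generate(build_prompt(query, retrieve(query, corpus)))


def fake_llm(prompt: str) -> str:
    # Placeholder generator: a real system would call an LLM here.
    return f"[LLM response to a prompt of {len(prompt)} characters]"


corpus = [
    "RAG was introduced by Meta AI researchers in 2020.",
    "Vector embeddings are numerical representations of meaning.",
    "Chunking splits documents into smaller pieces.",
]
question = "Who introduced RAG?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
print(answer(question, corpus, fake_llm))
```

Passing the generator in as a callable keeps the retrieval logic independent of whichever model ends up writing the answer.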
Why use RAG to build accurate LLM models?
In the ever-evolving NLP field, building increasingly intelligent and context-aware LLM systems has become a crucial part of development. This is where RAG fits into the picture, by filling a gap left by traditional generative models: generating text that is grounded in accurate context.
There are three main reasons why RAG is being adopted in many LLM systems:
- RAG combines retrieval models with generative models, which ensures that the final responses are both well-informed and well-written. The retrieval component provides the “what” (the factual content), while the generative component provides the “how” (the way it composes the facts into coherent and meaningful language). This way RAG provides a solution for generating text that isn’t just fluent but also factually accurate and information-rich.
- RAG’s dual nature offers a natural advantage in tasks requiring external knowledge or contextual understanding. An example is question-answering systems: where traditional generative models might struggle to generate accurate answers, RAG can pull in up-to-date information through its retrieval component, making its responses more accurate and detailed.
- RAG’s capability to search, select, and synthesize information makes it a unique solution in scenarios requiring multi-step reasoning from multiple sources, such as legal research, scientific literature reviews, or even complex customer service queries.
Technical implementation of RAG with LLMs
Implementing RAG in your LLM system is an intricate process that involves most of the five stages described in previous sections, going from data sourcing to the final output. In this article, we work specifically with the Python implementation of LlamaIndex, so it is recommended to refer to the API Reference to start working with its loading, transformation, and customization capabilities.
Loading and Ingesting data to our system
The starting point of any RAG system is its source data. It often consists of an enormous corpus of data stored in documents, websites, or databases. This data works as silos of knowledge that the retrieval model scans through to find relevant information.
LlamaIndex performs data loading via data connectors, also called Readers. Data connectors, which are available from a registry called LlamaHub, ingest data from different data sources and format it into Document objects. A Document is a collection of data and metadata.
The simplest reader provided by LlamaIndex is SimpleDirectoryReader, which creates documents out of every file in a provided directory. It can read a variety of formats including Markdown, PDFs, Word documents, and PowerPoint decks, among others.
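To illustrate the idea behind a directory reader, the following is a rough plain-Python sketch. It is not the LlamaIndex implementation: the `Document` dataclass and `load_directory` function are hypothetical stand-ins, and the sketch only handles plain-text and Markdown files, whereas a real reader also parses PDFs, Word documents, and other formats.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_directory(directory: Path, suffixes=(".txt", ".md")) -> list[Document]:
    """Create one Document per matching file, recording where it came from."""
    documents = []
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            documents.append(
                Document(
                    text=path.read_text(encoding="utf-8"),
                    metadata={"file_name": path.name, "file_path": str(path)},
                )
            )
    return documents


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text(
        "# Notes\nRAG combines retrieval and generation.", encoding="utf-8"
    )
    (root / "faq.txt").write_text("Q: What is an index?", encoding="utf-8")
    (root / "image.png").write_bytes(b"\x89PNG")  # skipped: unsupported suffix
    documents = load_directory(root)

for doc in documents:
    print(doc.metadata["file_name"], len(doc.text))
```

Keeping the file name and path in the metadata is what later allows a retrieved chunk to be traced back to its source file.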
Data Transformations
After the data is loaded and before the retrieval model can search it, some transformations are usually applied, including chunking, metadata extraction, and embedding each chunk. The result is then put into a storage system. This process ensures that the system can scan through the data efficiently and retrieve relevant content quickly.
LlamaIndex provides high-level and lower-level APIs for transforming documents into chunks of data.
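from llama_index import SimpleDirectoryReader, VectorStoreIndex
# Load every file in a directory into Document objects ("./data" is an assumed path)
documents = SimpleDirectoryReader("./data").load_data()
# Split the documents into nodes, embed them, and build the index
vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()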
Under the hood, the code above splits the data stored in Documents into Node objects (a Document can be understood as a subclass of a Node) that contain text and metadata and have a relationship to their parent Document.
Effective chunking strategies can drastically improve the model’s speed and accuracy: a document may be its own chunk, but it could also be split up into chapters/sections, paragraphs, sentences, or even just “chunks of words.”
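The chunking granularities mentioned above can be sketched in plain Python as follows. These splitters are deliberately naive illustrations, not the node parsers that LlamaIndex ships: the function names, the regular expressions, and the default `chunk_size` and `overlap` values are all assumptions made for the example.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Fixed-size windows of words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = (
    "RAG retrieves context. It then generates an answer.\n\n"
    "Chunk size is a trade-off. Small chunks are precise but lose context."
)
paragraphs = split_paragraphs(text)
sentences = split_sentences(text)
word_chunks = split_words(text, chunk_size=8, overlap=2)
print(paragraphs)
print(sentences)
print(word_chunks)
```

The overlap between word windows is one common way to avoid cutting a fact in half at a chunk boundary, at the cost of storing some text twice.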
Other customizations can be applied in the transformation stage of your LLM pipeline; they are described in the official LlamaIndex documentation.
Conclusion
Retrieval Augmented Generation (RAG) has a diverse array of applications across multiple domains that require advanced NLP capabilities. RAG’s approach of combining retrieval and generative components makes it a valuable building block for many LLM systems.
The challenges that RAG can solve range from complex AI-driven news aggregation platforms that generate concise, coherent, and contextually relevant summaries, providing a rich user experience, to question-answering systems where RAG scans through an extensive corpus of data to retrieve the most relevant information and craft detailed, accurate answers.
However, the advantages of RAG covered in this article do not come free of limitations and drawbacks. One of the most obvious is model complexity: because it combines both retrieval and generative components, the overall architecture becomes more intricate and requires more computational power than simpler models.
Another difficulty lies in data preparation, which requires clean and non-redundant data and then developing and testing an approach to chunk that data into pieces.
References
“Retrieval Augmented Generation (RAG)” www.promptingguide.ai, www.promptingguide.ai/techniques/rag.
Lewis, Patrick, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2020.
“LlamaIndex 🦙 0.9.24.” docs.llamaindex.ai/en/stable/index.html