Implementing a RAG Model for PDF Content Extraction and Query Answering
Using Python to extract and process text from a PDF document, generate embeddings, calculate cosine similarity, and answer queries using the extracted content.
The Retrieval-Augmented Generation (RAG) model combines two robust methodologies: information retrieval and language generation. Given a query, the model first gathers pertinent information from an extensive dataset, then formulates a reply using the retrieved context. This design improves the precision of generated responses by anchoring them in real data, making it especially useful for intricate information requests across extensive datasets, such as lengthy PDF files.
This tutorial will walk you through using Python to extract and process text from a PDF document, create embeddings, calculate cosine similarity, and answer queries based on the extracted content.
Prerequisites
Ensure you have the following libraries installed in your Python environment:
- PyMuPDF (fitz): For extracting text from PDFs.
- rake-nltk: For phrase extraction.
- openai: To interact with OpenAI's embedding and language models.
- pandas: To handle and export data.
- numpy and scipy: For numerical operations and cosine similarity calculations.
- langchain-text-splitters: For splitting text into overlapping chunks (used in Step 2).
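If needed, you can install these from PyPI (package names as published at the time of writing). Note that rake-nltk relies on NLTK data (the stopword list and punkt tokenizer), which may require a one-time download:
pip install PyMuPDF rake-nltk openai pandas numpy scipy langchain-text-splitters
python -m nltk.downloader stopwords punkt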
Step-by-Step Guide
Step 1: Import Libraries and Open the PDF
Import the libraries and open the PDF using this code:
import fitz  # PyMuPDF

# Open the PDF file
pdf_document = "path/to/your/document.pdf"
document = fitz.open(pdf_document)

# Initialize a dictionary to hold the text for each page
pdf_text = {}

# Loop through each page
for page_number in range(document.page_count):
    # Get a page
    page = document.load_page(page_number)
    # Extract text from the page
    text = page.get_text()
    # Store the extracted text in the dictionary (pages are 1-indexed for readability)
    pdf_text[page_number + 1] = text

# Close the document
document.close()

# Output the dictionary
for page, text in pdf_text.items():
    print(f"Text from page {page}:\n{text}\n")
Step 2: Chunk Text for Embedding
The text needs to be broken down into smaller, manageable chunks. We use RecursiveCharacterTextSplitter to split each page's text into overlapping chunks.
Breaking text into overlapping chunks this way is important for several reasons, especially when dealing with natural language processing (NLP) tasks, large documents, or continuous text analysis. Here’s why it’s beneficial:
1. Improves Context Retention
- When text is split into overlapping chunks, each chunk retains some of the previous and following content. This helps preserve context, which is especially crucial for algorithms that rely on surrounding information, like NLP models.
- Overlapping text ensures that important details spanning across chunk boundaries aren’t lost, which is critical for maintaining the coherence of the information.
2. Enhances Accuracy in NLP Tasks
- Many NLP models (such as question-answering systems or sentiment analysis models) can perform better when provided with complete context. Overlapping chunks help these models access more relevant information, leading to more accurate and reliable results.
3. Manages Memory and Processing Efficiency
- Breaking down large texts into smaller parts helps manage memory usage and processing time, making it feasible to handle extensive documents without overwhelming the system.
- Smaller chunks allow for parallel processing, improving the efficiency of tasks like keyword extraction, summarization, or entity recognition on large texts.
4. Facilitates Chunked Data Storage and Retrieval
- Overlapping chunks can be stored and retrieved more flexibly, making it easier to reconstruct portions of the text for further processing, such as when analyzing text in a sliding window approach for time series data or contextual searches.
5. Supports Recursive Splitting for Optimal Size
- RecursiveCharacterTextSplitter can recursively split text until the desired chunk size is achieved, allowing you to tailor chunk sizes according to model input limits or memory constraints while keeping context intact.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split each page's text into overlapping chunks
page_chunks = {}
for page, text in pdf_text.items():
    chunks = text_splitter.split_text(text)
    page_chunks[page] = chunks

# Output chunks for each page
for page, chunks in page_chunks.items():
    print(f"Text chunks from page {page}:")
    for i, chunk in enumerate(chunks, start=1):
        print(f"Chunk {i}:\n{chunk}\n")
Step 3: Extract Key Phrases
To extract meaningful phrases from the text, we use rake-nltk, a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm.
RAKE is an algorithm for extracting keywords from text, designed to be fast and efficient. It works by identifying words or phrases that are statistically significant within a document. Here's an overview of how it works:
How RAKE Works
- Word Segmentation: It splits the text into individual words and phrases, discarding common stop words (like "and," "the," "is," etc.).
- Phrase Construction: RAKE groups together contiguous words that are not stop words to form candidate phrases.
- Scoring: Each candidate phrase is given a score based on the frequency of its words and the degree of co-occurrence with other words in the text. This score helps determine the relevance of each phrase as a potential keyword.
- Sorting: The phrases are sorted based on their scores, and the highest-scoring phrases are selected as keywords.
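To make the scoring step concrete, here is a minimal, self-contained sketch of RAKE-style scoring. This is an illustration of the idea, not rake-nltk's exact internals: each word is scored as degree divided by frequency, and a phrase's score is the sum of its words' scores.
from collections import defaultdict

# Toy RAKE-style scoring: score(word) = degree(word) / frequency(word),
# and score(phrase) = sum of its words' scores.
def rake_scores(candidate_phrases):
    freq = defaultdict(int)    # how often each word appears across phrases
    degree = defaultdict(int)  # a word "co-occurs" with every word in its phrase
    for phrase in candidate_phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            degree[word] += len(words)
    return {p: sum(degree[w] / freq[w] for w in p.split()) for p in candidate_phrases}

print(rake_scores(["data recovery", "virtual machine", "recovery"]))
# {'data recovery': 3.5, 'virtual machine': 4.0, 'recovery': 1.5}
In this toy example, "virtual machine" outscores the lone word "recovery" because its words appear only inside a longer, more distinctive phrase. The actual extraction below uses rake-nltk directly: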
from rake_nltk import Rake

rake = Rake()

# Extract phrases from each page and store in a dictionary
page_phrases = {}
for page, text in pdf_text.items():
    rake.extract_keywords_from_text(text)
    phrases = rake.get_ranked_phrases()
    page_phrases[page] = phrases

# Extract phrases for each chunk
chunk_phrases = {}
for page, chunks in page_chunks.items():
    for chunk_number, chunk in enumerate(chunks, start=1):
        rake.extract_keywords_from_text(chunk)
        phrases = rake.get_ranked_phrases()
        chunk_phrases[(page, chunk_number)] = phrases

# Output phrases for each chunk
for (page, chunk_number), phrases in chunk_phrases.items():
    print(f"Key phrases from page {page}, chunk {chunk_number}:\n{phrases}\n")
Step 4: Generate Embeddings
Generate embeddings for each phrase using OpenAI's text-embedding-ada-002 model and save them in Excel format. This model produces numerical representations (embeddings) of text that capture its semantic meaning, allowing you to compare and analyze pieces of text based on their content.
import openai
import pandas as pd

openai.api_key = "YOUR-API-KEY"

# Function to get the embedding for a phrase
def get_embedding(phrase):
    response = openai.Embedding.create(input=phrase, model="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Dictionary to hold embeddings
phrase_embeddings = {}

# Generate embeddings for each phrase
for (page, chunk_number), phrases in chunk_phrases.items():
    embeddings = [get_embedding(phrase) for phrase in phrases]
    phrase_embeddings[(page, chunk_number)] = list(zip(phrases, embeddings))

# Prepare data for Excel
excel_data = []
for (page, chunk_number), phrases in phrase_embeddings.items():
    for phrase, embedding in phrases:
        excel_data.append({"Page": page, "Chunk": chunk_number, "Phrase": phrase, "Embedding": embedding})

# Create a DataFrame and save it to Excel
df = pd.DataFrame(excel_data)
excel_filename = "phrases_embeddings.xlsx"
df.to_excel(excel_filename, index=False)
print(f"Embeddings saved to {excel_filename}")
Step 5: Query Processing and Similarity Calculation
Generate embeddings for query phrases and find the most similar chunks using cosine similarity. Cosine similarity is a measure used to determine how similar two vectors are based on the angle between them in a multi-dimensional space. It’s commonly used in text analysis and information retrieval to compare text embeddings or document vectors, as it quantifies similarity irrespective of the vectors' magnitude. In the context of text embeddings, cosine similarity helps identify which documents or sentences are closely related based on their meaning, rather than just their content or word count.
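For intuition, here is a tiny standalone illustration with made-up 3-dimensional vectors (real text-embedding-ada-002 embeddings have 1,536 dimensions):
import numpy as np

# Cosine similarity: dot(a, b) / (|a| * |b|)
def cos_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1, 2, 3], [2, 4, 6]))  # ~1.0: same direction, magnitude ignored
print(cos_sim([1, 0, 0], [0, 1, 0]))  # 0.0: orthogonal, no similarity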
def extract_phrases_from_query(query):
    rake.extract_keywords_from_text(query)
    return rake.get_ranked_phrases()

# Example query (this question should be based on your own PDF)
query = "What are the results of the 2DRA algorithm?"

# Extract phrases from the query
query_phrases = extract_phrases_from_query(query)
print(f"Query phrases:\n{query_phrases}\n")

# Get embeddings for the query phrases, reusing get_embedding from Step 4
def get_embeddings(phrases):
    return [get_embedding(phrase) for phrase in phrases]

query_embeddings = get_embeddings(query_phrases)
import numpy as np
from scipy.spatial.distance import cosine

# Function to calculate cosine similarity
def cosine_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)

# Dictionary to store similarities
chunk_similarities = {}

# Calculate cosine similarity for each chunk
for (page, chunk_number), phrases in phrase_embeddings.items():
    similarities = []
    for phrase, embedding in phrases:
        # Take the highest similarity between this phrase and any query phrase
        phrase_similarities = [cosine_similarity(embedding, query_embedding) for query_embedding in query_embeddings]
        similarities.append(max(phrase_similarities))
    # Average the phrase similarities to score the chunk
    chunk_similarities[(page, chunk_number)] = np.mean(similarities)

# Get the top 5 chunks by similarity
top_chunks = sorted(chunk_similarities.items(), key=lambda x: x[1], reverse=True)[:5]

# Output the top 5 chunks and collect their text for the next step
print("Top 5 most relevant chunks:")
selected_chunks = []
for (page, chunk_number), similarity in top_chunks:
    print(f"Page: {page}, Chunk: {chunk_number}, Similarity: {similarity}")
    print(f"Chunk text:\n{page_chunks[page][chunk_number - 1]}\n")
    selected_chunks.append(page_chunks[page][chunk_number - 1])
Step 6: Generate and Retrieve Answer Using OpenAI
Compose the context for the query from the most similar chunks and retrieve the answer using OpenAI’s GPT model.
# Compose the context from the most similar chunks
context = "\n\n".join(selected_chunks)
prompt = f"Answer the following query based on the provided text:\n\n{context}\n\nQuery: {query}\nAnswer:"

# Use the OpenAI API to get a response
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=300,
    temperature=0.1,
)

# Extract the answer from the response
answer = response['choices'][0]['message']['content'].strip()

# Output the answer
print(f"Answer:\n{answer}")
Finally, this is the answer that I received after asking that question:
Answer: The 2DRA model was utilized to perform data recovery on the Virtual Machine (VM) affected by ransomware. It was successful in retrieving all the 14,957 encrypted files. Additionally, an analysis of the encrypted files and their associated hash values on the VM was conducted using the 2DRA model after the execution of WannaCry ransomware. The analysis revealed that the hexadecimal values of the files were distinct prior to encryption, but were altered after the encryption.
This answer is based on the PDF I used in Step 1; your answer will depend on the PDF you supply.
This concludes our implementation of a basic RAG pipeline that reads PDF content, extracts meaningful phrases, generates embeddings, calculates similarities, and answers queries based on the most relevant content.