Building a RAG Model Pipeline Using Python With Online Text Data
In this article, you'll find an end-to-end guide for extracting, embedding, and querying text from online sources like Wikipedia, with OpenAI models generating the answers.
In this tutorial, I will walk you through the process of constructing a Retrieval-Augmented Generation (RAG) pipeline in Python that fetches, processes, and queries content from online publications. Text will be extracted from a Wikipedia page and split into manageable chunks; embeddings will be created, similarities calculated, and user queries answered with information relevant to the question.
Prerequisites
Create a .ipynb file and start following the steps below:
!pip3 install requests beautifulsoup4 openai pandas numpy scipy spacy langchain openpyxl
Library Descriptions
Requests
This library allows us to make HTTP requests in Python, which is essential for retrieving online data, such as extracting text from websites (e.g., Wikipedia articles).
BeautifulSoup4
A powerful library for web scraping, BeautifulSoup4 is used here to parse and extract text from HTML, which is particularly helpful for structuring text from online sources.
OpenAI
The OpenAI library enables interaction with OpenAI’s API for tasks like generating text and embeddings or performing language-based tasks using models such as GPT-3 or GPT-4.
pandas
pandas is a versatile data manipulation library that allows for structured data storage and management, making it easy to organize and export data (e.g., to Excel).
NumPy and SciPy
These libraries provide efficient mathematical functions. NumPy is used for numerical operations, while SciPy includes functions for calculating cosine similarity, which is helpful for comparing text embeddings (see the short sketch after this list).
spaCy
spaCy is a natural language processing (NLP) library that allows for keyword extraction, entity recognition, and other linguistic processing. Here, it helps us extract key phrases from chunks of text.
LangChain
This library supports the implementation of language model applications. It includes tools like RecursiveCharacterTextSplitter, which enables us to split text into manageable chunks while preserving context.
openpyxl
openpyxl is used to write data into Excel files, allowing us to save embeddings or other structured data for later use.
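As a quick illustration of the cosine similarity we will rely on later, here is a minimal sketch with toy three-dimensional vectors (real text-embedding-ada-002 embeddings have 1,536 dimensions):
from scipy.spatial.distance import cosine
# Two toy "embeddings" -- real ones are much longer vectors
a = [0.1, 0.3, 0.6]
b = [0.1, 0.25, 0.65]
# SciPy's cosine() returns a distance, so similarity = 1 - distance
print(1 - cosine(a, b))  # close to 1.0 for near-parallel vectors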
After installing these libraries, we're ready to set up our data processing pipeline for the RAG model.
Step-by-Step Guide
Step 1: Import Libraries and Fetch Article Content
We begin by importing the requests library and BeautifulSoup, which allow us to fetch and parse content from a Wikipedia article. We use requests to send an HTTP request to the article's URL, and BeautifulSoup to parse the HTML response and extract the main text. We gather all the paragraphs, tidy up the text, and save each paragraph in a list. Finally, we merge these paragraphs into a single string, article_text, which is ready for the subsequent processing steps.
import requests
from bs4 import BeautifulSoup
# URL of the Wikipedia article (use any topic you prefer)
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
# Fetch the page content
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Extract paragraphs from the article
text_data = []
for p in soup.find_all('p'):
    text = p.get_text(strip=True)
    text_data.append(text)
# Join all paragraphs into a single text
article_text = "\n\n".join(text_data)
print("Article content extracted.")
Step 2: Chunk Text for Embedding
In this step, we use RecursiveCharacterTextSplitter from langchain_text_splitters to break the article text into manageable chunks while preserving essential context. Setting a chunk_size of 1000 characters, along with a chunk_overlap of 200 characters, allows each chunk to share some overlap with the surrounding text, which helps preserve context across chunk boundaries.
Next, we divide the complete article text into these segments and display each segment to verify the separation. This sets up the text for embedding and similarity analysis in the following steps.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# Split article text into chunks
chunks = text_splitter.split_text(article_text)
# Display chunks
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}:\n{chunk}\n")
Step 3: Extract Key Phrases Using spaCy
Here, we use spaCy to extract key phrases from each text chunk, focusing on significant noun phrases. Once the en_core_web_sm language model is installed and loaded, we define a function that identifies noun chunks in each segment of text. The function keeps only phrases consisting of multiple words, ensuring the extraction of more meaningful keywords.
Next, we apply this function to each chunk, collecting the extracted phrases in a dictionary and showing them to confirm our results. This step is crucial for identifying key terms and concepts that will be used in the following embedding and similarity calculations.
import spacy

# Download and load spaCy's English model (the download only needs to run once)
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Define a function to extract keywords using noun chunks from spaCy
def extract_keywords_spacy(text):
    doc = nlp(text)
    # Only keep phrases longer than one word
    return [chunk.text for chunk in doc.noun_chunks if len(chunk.text.split()) > 1]
# Initialize a dictionary to store phrases for each chunk
chunk_phrases = {}
# Extract phrases for each chunk
for chunk_number, chunk in enumerate(chunks, start=1):
    phrases = extract_keywords_spacy(chunk)
    chunk_phrases[chunk_number] = phrases
# Display extracted phrases
for chunk_number, phrases in chunk_phrases.items():
    print(f"Key phrases from chunk {chunk_number}:\n{phrases}\n")
Step 4: Generate Embeddings for Key Phrases
In this step, we create embeddings for each extracted key phrase using OpenAI's text-embedding-ada-002 model, which produces numerical representations of text grounded in semantic meaning. A function named get_embedding is defined to retrieve the embedding for a given phrase. We apply this function to each phrase in our chunks and store the results.
Next, we organize the phrases and embeddings from each chunk into a format suitable for export to Excel. Finally, we save the embeddings to an Excel file called phrases_embeddings_article.xlsx, making them convenient to reuse in later analysis.
import openai
import pandas as pd

# Note: this code uses the legacy OpenAI Python SDK interface (openai<1.0)
openai.api_key = "YOUR-API-KEY"

def get_embedding(phrase):
    response = openai.Embedding.create(input=phrase, model="text-embedding-ada-002")
    return response['data'][0]['embedding']
# Generate embeddings for each phrase
phrase_embeddings = {}
for chunk_number, phrases in chunk_phrases.items():
    embeddings = [get_embedding(phrase) for phrase in phrases]
    phrase_embeddings[chunk_number] = list(zip(phrases, embeddings))
# Prepare data for Excel output
excel_data = []
for chunk_number, phrases in phrase_embeddings.items():
    for phrase, embedding in phrases:
        excel_data.append({"Chunk": chunk_number, "Phrase": phrase, "Embedding": embedding})
# Save embeddings to Excel
df = pd.DataFrame(excel_data)
df.to_excel("phrases_embeddings_article.xlsx", index=False)
print("Embeddings saved to phrases_embeddings_article.xlsx")
Step 5: Query Processing and Similarity Calculation
In this step, we compute similarity scores between the query and each chunk's embeddings to identify the most relevant content. First, we embed the query with the same OpenAI model and define a cosine_similarity function for comparing embeddings. For each chunk, we calculate similarity scores between the query embedding and every phrase embedding within that chunk, taking the highest score for each phrase.
We then record the average similarity for each chunk, sort the chunks, and extract the top five with the highest scores: the content most relevant to the query.
from scipy.spatial.distance import cosine
import numpy as np
def cosine_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)
query = "Explain the applications of NLP in healthcare."
# Embed the query (stored in a list so the loop below can handle multiple query embeddings)
query_phrases = [get_embedding(query)]
chunk_similarities = {}
# Calculate similarity for each chunk
for chunk_number, phrases in phrase_embeddings.items():
    similarities = []
    for phrase, embedding in phrases:
        phrase_similarities = [cosine_similarity(embedding, query_embedding) for query_embedding in query_phrases]
        similarities.append(max(phrase_similarities))
    # Guard against chunks with no extracted phrases
    chunk_similarities[chunk_number] = np.mean(similarities) if similarities else 0.0
# Retrieve top 5 most relevant chunks
top_chunks = sorted(chunk_similarities.items(), key=lambda x: x[1], reverse=True)[:5]
selected_chunks = [chunks[chunk_number-1] for chunk_number, _ in top_chunks]
print("Top 5 relevant chunks:", selected_chunks)
Step 6: Generate and Retrieve Answer Using OpenAI
Combine relevant chunks into a context and ask a question using the OpenAI model.
context = "\n\n".join(selected_chunks)
prompt = f"Answer the following question based on the article:\n\n{context}\n\nQuestion: {query}\nAnswer:"
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=300,
    temperature=0.1
)
answer = response['choices'][0]['message']['content'].strip()
print(f"Answer:\n{answer}")
Once you have completed all of these steps, you can expect an answer similar to the following:
"In healthcare, Natural Language Processing (NLP) is used to analyze notes and text in electronic health records. This data, which would otherwise be inaccessible, is crucial when seeking to improve care or protect patient privacy."
Note: I used the Wikipedia article/page about NLP; feel free to use any other article of your choice.