Better Search Results Through Intelligent Chunking and Metadata Integration
Chunking is an important, but underrated, aspect of LLM-based content retrieval, and managing it effectively will yield world-class results.
Often, the knowledge bases over which we build LLM-based retrieval applications contain large amounts of data in a variety of formats. To give the LLM the most relevant context for answering a question about a specific section of the knowledge base, we chunk the text in the knowledge base and keep those chunks ready for retrieval.
Chunking
Chunking is the process of slicing text into meaningful units to improve information retrieval. By ensuring each chunk represents a focused thought or idea, chunking assists in maintaining the contextual integrity of the content.
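For instance, a simple sentence-level splitter keeps each thought intact. Here is a minimal sketch using Python's re module (the same split pattern the full example below relies on):

import re

paragraph = "AI is transforming healthcare. It also raises ethical questions."
# Split after sentence-ending punctuation so each chunk holds one complete thought.
chunks = re.split(r'(?<=[.!?]) +', paragraph)
print(chunks)
# ['AI is transforming healthcare.', 'It also raises ethical questions.']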
In this article, we will look at three aspects of chunking:
- How poor chunking leads to less relevant results
- How good chunking leads to better results
- How good chunking with metadata leads to well-contextualized results
To effectively showcase the importance of chunking, we will take the same piece of text, apply three different chunking strategies to it, and examine how information is retrieved based on the query.
Chunk and Store to Qdrant
Let us look at the following code, which shows three different ways to chunk the same text.
import re

import qdrant_client
from qdrant_client.models import PointStruct, Distance, VectorParams
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    print(f"Generating embedding for: '{text[:50]}'...")  # Show a snippet of the text being embedded
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    embedding = response.data[0].embedding  # Access using the attribute, not as a dictionary
    print(f"Generated embedding of length {len(embedding)}.")  # Confirm embedding generation
    return embedding

# Function to create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name, vector_size):
    collections = client.get_collections().collections
    if collection_name not in [collection.name for collection in collections]:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
        )
        print(f"Created collection: {collection_name} with vector size: {vector_size}")  # Collection creation
    else:
        print(f"Collection {collection_name} already exists.")  # Collection existence check
# Sample text to be chunked (used purely for illustration)
text = """
Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
"""
# Poor Chunking Strategy
def poor_chunking(text, chunk_size=40):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}")  # Show chunks produced
    return chunks

# Good Chunking Strategy
def good_chunking(text):
    sentences = re.split(r'(?<=[.!?]) +', text)
    print(f"Good Chunking produced {len(sentences)} chunks: {sentences}")  # Show chunks produced
    return sentences

# Good Chunking with Metadata
def good_chunking_with_metadata(text):
    chunks = good_chunking(text)
    metadata_chunks = []
    for chunk in chunks:
        if "healthcare" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})
        elif "ethical implications" in chunk or "data privacy" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})
        else:
            metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})
    print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}")  # Show chunks produced
    return metadata_chunks

# Store chunks in Qdrant
def store_chunks(chunks, collection_name):
    if len(chunks) == 0:
        print(f"No chunks were generated for the collection '{collection_name}'.")
        return
    # Generate an embedding for the first chunk to determine the vector size
    sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]
    sample_embedding = embed_text(sample_text)
    vector_size = len(sample_embedding)
    create_collection_if_not_exists(collection_name, vector_size)
    for idx, chunk in enumerate(chunks):
        chunk_text = chunk if isinstance(chunk, str) else chunk["text"]
        embedding = embed_text(chunk_text)
        payload = chunk if isinstance(chunk, dict) else {"text": chunk_text}  # Always ensure there's text in the payload
        client.upsert(collection_name=collection_name, points=[
            PointStruct(id=idx, vector=embedding, payload=payload)
        ])
    print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing separately for each strategy
print("Starting poor_chunking...")
store_chunks(poor_chunking(text), "poor_chunking")
print("Starting good_chunking...")
store_chunks(good_chunking(text), "good_chunking")
print("Starting good_chunking_with_metadata...")
store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")
The above code does the following:
- The embed_text method takes in the text, generates an embedding using the OpenAI embedding model, and returns the generated embedding
- Initializes a text string that is used for chunking and later content retrieval
- Poor chunking strategy: Splits the text into chunks of 40 characters each
- Good chunking strategy: Splits the text based on sentences to obtain a more meaningful context
- Good chunking strategy with metadata: Adds appropriate metadata to the sentence-level chunks
- Once embeddings are generated for the chunks, they are stored in corresponding collections in Qdrant Cloud
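Note that both scripts in this article load credentials from a config.yaml file. A minimal sketch of that file might look like the following; the URL, keys, and model name are placeholders you must replace with your own, and text-embedding-3-small is just one example of an OpenAI embedding model:

qdrant:
  url: "https://YOUR-CLUSTER-URL.cloud.qdrant.io"
  api_key: "YOUR_QDRANT_API_KEY"
openai:
  api_key: "YOUR_OPENAI_API_KEY"
  model_name: "text-embedding-3-small"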
Keep in mind that the poor chunks are created only to showcase how poor chunking impacts retrieval.
If you inspect the three collections in Qdrant Cloud, you can see that metadata was added to the sentence-level chunks to indicate the source and topic.
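For example, the point stored for the challenges sentence in the good_chunking_with_metadata collection carries a payload like this (because the sentence contains the keyword "healthcare," the code tags it as the Healthcare Section first):

{
    "text": "The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.",
    "source": "Healthcare Section",
    "topic": "AI in Healthcare"
}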
Retrieval Results Based on Chunking Strategy
Now let us write some code to retrieve the content from Qdrant Vector DB based on a query.
import qdrant_client
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    response = openai.embeddings.create(
        input=[text],  # Ensure input is a list of strings
        model=config['openai']['model_name']
    )
    embedding = response.data[0].embedding  # Access using the attribute, not as a dictionary
    return embedding

# Define the query
query = "ethical implications of AI in healthcare"
query_embedding = embed_text(query)

# Function to perform retrieval and print results
def retrieve_and_print(collection_name):
    result = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=3
    )
    print(f"\nResults from '{collection_name}' collection for the query: '{query}':")
    if not result:
        print("No results found.")
        return
    for idx, res in enumerate(result):
        if 'text' in res.payload and res.payload['text']:
            print(f"Result {idx + 1}:")
            print(f" Text: {res.payload['text']}")
            print(f" Source: {res.payload.get('source', 'N/A')}")
            print(f" Topic: {res.payload.get('topic', 'N/A')}")
        else:
            print(f"Result {idx + 1}:")
            print(" No relevant text found for this chunk. It may be too fragmented or out of context to match the query effectively.")

# Execute retrieval for each collection
retrieve_and_print("poor_chunking")
retrieve_and_print("good_chunking")
retrieve_and_print("good_chunking_with_metadata")
The above code does the following:
- Defines the query "ethical implications of AI in healthcare" and generates the embedding for it
- The retrieve_and_print function searches the given Qdrant collection and retrieves the top 3 vectors closest to the query embedding
Now let us look at the output:
python retrieval_test.py
Results from 'poor_chunking' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
 Text: . The ethical implications of AI in heal
 Source: N/A
 Topic: N/A
Result 2:
 Text: ant impact is healthcare. AI is being us
 Source: N/A
 Topic: N/A
Result 3:
 Text: 
Artificial intelligence is transforming
 Source: N/A
 Topic: N/A

Results from 'good_chunking' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
 Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
 Source: N/A
 Topic: N/A
Result 2:
 Text: One of the key areas where AI is making a significant impact is healthcare.
 Source: N/A
 Topic: N/A
Result 3:
 Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
 Source: N/A
 Topic: N/A

Results from 'good_chunking_with_metadata' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
 Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
 Source: Healthcare Section
 Topic: AI in Healthcare
Result 2:
 Text: One of the key areas where AI is making a significant impact is healthcare.
 Source: Healthcare Section
 Topic: AI in Healthcare
Result 3:
 Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
 Source: General
 Topic: AI Overview
The output for the same search query varies depending on the chunking strategy implemented.
- Poor chunking strategy: The results are the least relevant because the text was split into arbitrary 40-character slices that cut sentences, and even words, mid-stream.
- Good chunking strategy: The results are more relevant because the text was split into sentences, preserving the semantic meaning.
- Good chunking strategy with metadata: The results are the most useful because the text was thoughtfully chunked and enhanced with metadata indicating each chunk's source and topic.
Inference From the Experiment
- Chunking needs to be carefully strategized, and the chunk size should be neither too small nor too big.
- An example of poor chunking is chunks that are too small, cutting off sentences in unnatural places, or too big, mixing multiple topics in the same chunk and confusing retrieval.
- The whole idea of chunking revolves around providing better context to the LLM.
- Metadata massively enhances properly structured chunking by providing extra layers of context. For example, we added source and topic as metadata elements to our chunks.
- The retrieval system benefits from this additional information. For example, if the metadata indicates that a chunk belongs to the "Healthcare Section," the system can prioritize those chunks when a healthcare-related query is made, as shown in the sketch after this list.
- By improving upon chunking, the results can be structured and categorized. If a query matches multiple contexts within the same text, the chunk metadata tells us which context or section each result belongs to.
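As a minimal sketch of that prioritization, qdrant-client's filtering types can restrict a search to chunks from a particular section. This reuses the client and query_embedding from the retrieval script above:

from qdrant_client.models import Filter, FieldCondition, MatchValue

# Only consider chunks whose metadata marks them as part of the Healthcare Section.
healthcare_filter = Filter(
    must=[FieldCondition(key="source", match=MatchValue(value="Healthcare Section"))]
)

result = client.search(
    collection_name="good_chunking_with_metadata",
    query_vector=query_embedding,
    query_filter=healthcare_filter,
    limit=3
)

This combines vector similarity with a hard metadata constraint, so a healthcare-specific query is never answered with chunks from unrelated sections.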
Keep these strategies in mind and chunk your way to success in LLM-based search applications.