Unlocking the Power of Search: Keywords, Similarity, and Semantics Explained

An overview of keyword, similarity, and semantic search techniques, providing insights into how each one works and guidance on when to use them effectively.

Pavan Vemuri

Jun. 13, 24 · Tutorial

Delving Into Different Search Techniques

To set the context, let’s say we have a collection of texts about various technical topics and would like to look for information related to “Machine Learning.” We will now look at how Keyword Search, Similarity Search, and Semantic Search offer different levels of depth and understanding, from simple keyword matching to recognizing related concepts and contexts.

Let us first look at the standard code components used for the program.

1. Standard Code Components Used

A. Libraries Imported

    Python

import os
import re
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
import numpy as np

The following necessary libraries are imported in this block:

os for file system operations.
re for regular expressions.
whoosh for creating and managing a search index.
scikit-learn for TF-IDF vectorization and similarity computation.
transformers for using a deep learning model for feature extraction.
numpy for numerical operations, specifically sorting.

B. Sample Document Initialization

    Python

# Sample documents used for demonstrating all three search techniques
documents = [
    "Machine learning is a field of artificial intelligence that uses statistical techniques.",
    "Natural language processing (NLP) is a part of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "Deep learning models are a subset of machine learning algorithms that use neural networks with many layers.",
    "AI is transforming the world by automating tasks, providing insights through data analysis, and enabling new technologies like autonomous vehicles and advanced robotics.",
    "Natural language processing can be challenging due to the complexity and variability of human language.",
    "The application of machine learning in healthcare is revolutionizing the way diseases are diagnosed and treated.",
    "Autonomous vehicles rely heavily on AI and machine learning to navigate and make decisions.",
    "Speech recognition technology has advanced considerably thanks to deep learning models."
]

This block defines a list of sample documents covering topics in artificial intelligence, machine learning, and natural language processing.

C. Highlight Function

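    Python

def highlight_term(text, term):
    # Escape the term so regex metacharacters in a query are matched literally
    pattern = f"({re.escape(term)})"
    # Wrap each case-insensitive match in ANSI codes (bold red), then reset
    return re.sub(pattern, r'\033[1;31m\1\033[0m', text, flags=re.IGNORECASE)

This helper beautifies the output by highlighting the search term within the text. It wraps every case-insensitive match in ANSI escape codes, so the term appears in bold red in terminals that support them.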

2. Keyword Search

Keyword search is a traditional method that matches search queries against exact or partial keywords found in the documents.

It relies heavily on exact term matching and simple query operators (AND, OR, NOT).

A. How Keyword Search Works

Since our search query is “Machine Learning,” keyword search looks for literal matches and returns only texts that contain “Machine Learning.” Examples of texts that would be returned are “Machine learning is transforming many industries” and “A course on machine learning was introduced recently.”
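To make the AND, OR, and NOT operators concrete, here is a minimal pure-Python sketch of boolean keyword matching. It is not how Whoosh works internally. The names tokenize and matches are hypothetical helpers introduced only for illustration, and the third sample text is made up to show the NOT case.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, must=(), any_of=(), must_not=()):
    """Return True if the text satisfies simple AND / OR / NOT keyword constraints."""
    tokens = tokenize(text)
    has_all = all(term.lower() in tokens for term in must)                   # AND
    has_any = not any_of or any(term.lower() in tokens for term in any_of)   # OR
    has_none = not any(term.lower() in tokens for term in must_not)          # NOT
    return has_all and has_any and has_none

sample_texts = [
    "Machine learning is transforming many industries.",
    "A course on machine learning was introduced recently.",
    "Deep learning needs large datasets.",  # hypothetical text without "machine"
]

# machine AND learning
and_hits = [t for t in sample_texts if matches(t, must=["machine", "learning"])]
# learning NOT machine
not_hits = [t for t in sample_texts if matches(t, must=["learning"], must_not=["machine"])]
```

Here and_hits holds the first two texts, while not_hits holds only the third, because it contains "learning" but not "machine."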

B. Let’s Examine the Code Behind the Keyword Search

    Python

# Function for Keyword Search using Whoosh
def keyword_search(query_str):
    # Define a schema with one stored, searchable text field
    schema = Schema(content=TEXT(stored=True))
    if not os.path.exists("index"):
        os.mkdir("index")
    # Create a new index in the "index" directory and add every document
    index = create_in("index", schema)
    writer = index.writer()
    for doc in documents:
        writer.add_document(content=doc)
    writer.commit()

    with index.searcher() as searcher:
        query = QueryParser("content", index.schema).parse(query_str)
        results = searcher.search(query)
        # Build the result list while the searcher is still open
        highlighted_results = [(highlight_term(result['content'], query_str), result.score) for result in results]
        return highlighted_results

I used the Whoosh library to perform the keyword search.

Schema and TEXT: Define the schema with a single field, content.
os.path.exists and os.mkdir: Check if the index directory exists and create it if not.
create_in: Establishes an index in a directory named index.
writer: Opens a writer to add documents to the index.
add_document: Adds documents to the index.
commit: Commits the changes to the index.
with index.searcher(): Opens a searcher to search the index.
QueryParser: Parses the query string.
searcher.search: Searches the index with the parsed query.
highlighted_results: Highlights the search term in the results and stores the results with their scores.
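To build some intuition for what a search index does, here is a minimal sketch of an inverted index in plain Python. It is a simplification and not Whoosh's actual implementation. The helper names build_inverted_index and search_all_terms are hypothetical, and the sketch assumes that all query terms must be present, which as far as I know matches the default behavior of Whoosh's QueryParser.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Three of the sample documents from this article
docs = [
    "Machine learning is a field of artificial intelligence that uses statistical techniques.",
    "Natural language processing can be challenging due to the complexity and variability of human language.",
    "Autonomous vehicles rely heavily on AI and machine learning to navigate and make decisions.",
]
index = build_inverted_index(docs)
hits = search_all_terms(index, "Machine Learning")  # ids of matching documents
```

The lookup only touches the posting sets of the query terms, which is why an index answers keyword queries without scanning every document.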

We will check the keyword search output and other search techniques later in this article.

3. Similarity Search

This method finds text similar to the search query by comparing the provided text to other texts based on features like the presence of related words or themes.

A. How Similarity Search Works

Returning to the same search query, "Machine Learning," similarity search brings back similar texts such as "AI applications in healthcare use machine learning techniques" and "Predictive modeling often relies on machine learning." With the TF-IDF approach used below, similarity reflects shared, weighted terms rather than an understanding of meaning.

B. Let’s Examine the Code Behind the Similarity Search

    Python

# Function for Similarity Search using Scikit-learn
def similarity_search(query_str):
    vectorizer = TfidfVectorizer()
    # Learn the vocabulary and IDF weights, then vectorize the documents
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Vectorize the query with the same vocabulary and weights
    query_vec = vectorizer.transform([query_str])
    similarity = cosine_similarity(query_vec, tfidf_matrix)
    similar_docs = similarity[0].argsort()[-3:][::-1]  # Top 3 similar documents
    similarities = similarity[0][similar_docs]
    highlighted_results = [(highlight_term(documents[i], query_str), similarities[idx]) for idx, i in enumerate(similar_docs)]
    return highlighted_results

I used the Scikit-learn library to write a function that performs a similarity search.

TfidfVectorizer: Converts documents into TF-IDF features.
fit_transform: Fits the vectorizer to the documents and transforms the documents into a TF-IDF matrix. The fit step learns the vocabulary (the unique words) and the IDF weights from the document list; the transform step then computes the TF-IDF values for each document.
transform: Transforms the query string into a TF-IDF vector using the same vocabulary and statistics learned during the fit step.
cosine_similarity: Computes cosine similarity between the query vector and the TF-IDF matrix.
argsort()[-3:][::-1]: Gets the indices of the top 3 similar documents in descending order of similarity. This step is specific to this article; remove the slice if you do not want to restrict the search results to the top 3.
highlighted_results: Highlights the search term in the results and stores the results with their similarity scores.
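To see what the vectorizer and cosine similarity compute, the steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the tokenizer is a basic regex, the IDF uses a smoothed form, and the helper names tfidf_vectors and cosine are hypothetical. The scores will therefore not necessarily match what similarity_search returns.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs, query):
    """Return (document vectors, query vector) as dicts mapping term -> weight."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rarer terms get larger weights
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def weigh(tokens):
        # Terms that were not seen in the documents are ignored
        tf = Counter(t for t in tokens if t in idf)
        return {term: freq * idf[term] for term, freq in tf.items()}

    return [weigh(tokens) for tokens in tokenized], weigh(tokenize(query))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three of the sample documents from this article
docs = [
    "Machine learning is a field of artificial intelligence that uses statistical techniques.",
    "Natural language processing can be challenging due to the complexity and variability of human language.",
    "Autonomous vehicles rely heavily on AI and machine learning to navigate and make decisions.",
]
doc_vecs, query_vec = tfidf_vectors(docs, "Machine Learning")
scores = [cosine(query_vec, vec) for vec in doc_vecs]
```

The second document shares no terms with the query, so its score is zero. This is the main limitation of term-based similarity.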

4. Semantic Search

We are now entering the realm of powerful search techniques. This method takes the meaning and context of the search term into account and returns texts that use the concept even when the search term is not directly mentioned.

A. How Semantic Search Works

The same search query, "Machine Learning," when applied with semantic search, yields texts related to the concept of Machine Learning, such as "AI and data-driven decision-making are changing industries" and "Neural networks are a key component of many AI systems."

B. Let’s Examine the Code Behind the Semantic Search

    Python

# Function for Semantic Search using Transformers
def semantic_search(query_str):
    semantic_searcher = pipeline("feature-extraction", model="distilbert-base-uncased")
    # [0][0] selects the first sequence, then its first token vector
    query_embedding = semantic_searcher(query_str)[0][0]

    def get_similarity(query_embedding, doc_embedding):
        return cosine_similarity([query_embedding], [doc_embedding])[0][0]

    doc_embeddings = [semantic_searcher(doc)[0][0] for doc in documents]
    similarities = [get_similarity(query_embedding, embedding) for embedding in doc_embeddings]
    sorted_indices = np.argsort(similarities)[-3:][::-1]  # Top 3 similar documents
    highlighted_results = [(highlight_term(documents[i], query_str), similarities[i]) for i in sorted_indices]
    return highlighted_results

This function performs semantic search using the Hugging Face transformers library.

There is a lot going on in the semantic_searcher = pipeline("feature-extraction", model="distilbert-base-uncased") code snippet.

pipeline: The function imported from the transformers library that helps set up various types of NLP tasks using pre-trained models.
feature-extraction: The task the pipeline performs. It converts text into numerical representations (embeddings) that can be used for various downstream tasks.
distilbert-base-uncased: The pre-trained model used for this task. It is a smaller, faster version of the BERT model, trained on English text without case sensitivity.
query_embedding: Gets the embedding for the query string. The [0][0] index selects the first (and only) sequence in the output and then its first token vector, which for BERT-style models is the [CLS] position.
get_similarity: A nested function to compute the cosine similarity between the query embedding and a document embedding.
doc_embeddings: Gets embeddings for all documents.
similarities: Computes similarities between the query embedding and all document embeddings.
argsort()[-3:][::-1]: Gets the indices of the top 3 similar documents in descending order of similarity.
highlighted_results: Highlights the search term in the results and stores the results with their similarity scores.
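The feature-extraction output holds one vector per token, so the code has to reduce it to a single vector per text. The function above keeps the first token's vector. Another common choice is to average all token vectors (mean pooling). The sketch below contrasts the two on toy data. The numbers and the helper names first_token_embedding and mean_pooled_embedding are hypothetical, and whether mean pooling would rank these documents better is not something this article tests.

```python
def first_token_embedding(token_vectors):
    """Use the vector of the first token (the [CLS] position in BERT-style models)."""
    return token_vectors[0]

def mean_pooled_embedding(token_vectors):
    """Average the token vectors dimension by dimension."""
    n_tokens = len(token_vectors)
    return [sum(values) / n_tokens for values in zip(*token_vectors)]

# Hypothetical model output for one short text: 3 tokens with 4 dimensions each.
# A real DistilBERT output has 768 dimensions per token.
toy_token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]

cls_vec = first_token_embedding(toy_token_vectors)   # [1.0, 0.0, 2.0, 4.0]
mean_vec = mean_pooled_embedding(toy_token_vectors)  # [2.0, 2.0, 1.0, 2.0]
```

Either vector can be passed to cosine similarity in the same way; the two strategies differ only in how much each token contributes to the final embedding.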

Output

Now that we have covered the various search techniques and set up the documents to search, let us look at the output each technique produces for a search query.

    Python

# Main execution
if __name__ == "__main__":
    query = input("Enter your search term: ")

    print("\nKeyword Search Results:")
    keyword_results = keyword_search(query)
    for result, score in keyword_results:
        print(f"{result} (Score: {score:.2f})")

    print("\nSimilarity Search Results:")
    similarity_results = similarity_search(query)
    for result, similarity in similarity_results:
        print(f"{result} (Similarity: {similarity * 100:.2f}%)")

    print("\nSemantic Search Results:")
    semantic_results = semantic_search(query)
    for result, similarity in semantic_results:
        print(f"{result} (Similarity: {similarity * 100:.2f}%)")

Let us now search our documents using the search term "Machine Learning" and look at the search results.

Highlights from the search results:

The highlight_term function helps us highlight the search term.
Only 3 results are returned for similarity search and semantic search because our code limits the results to 3 for both of those techniques.
Keyword Search ranks results with Whoosh's default scoring model, BM25F, which, like TF-IDF, scores documents by the frequency and importance of the query terms they contain.
Similarity Search uses TF-IDF vectorization and cosine similarity to measure how similar the documents are to the query in a vector space.
Semantic Search uses embeddings from transformer models and cosine similarity to capture the semantic meaning and relevance of the documents to the query.
Notice how semantic search retrieves text about natural language processing, as it is close in context to machine learning.
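The limit of 3 results comes from the argsort()[-3:][::-1] slice. The same idea can be sketched without NumPy, which makes it easy to change the number of results. The helper name top_k_indices and the scores below are hypothetical, and ties may be ordered differently than with argsort.

```python
def top_k_indices(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical similarity scores for five documents
scores = [0.12, 0.80, 0.05, 0.47, 0.33]

top3 = top_k_indices(scores)                        # [1, 3, 4]
all_ranked = top_k_indices(scores, k=len(scores))   # every document, best first
```

Passing k=len(scores) returns every document in ranked order, which is one way to lift the top-3 restriction.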

Now let us take a look at the search results for two other search terms, "Artificially Intelligent" and "Artificial Intelligence" (please note that the incorrect spelling of "Artificial" is intentional), and discuss the findings.