How To Implement Cosine Similarity in Python
Cosine similarity is an indispensable tool that has a wide range of applications, from simplifying searches in large datasets to understanding natural language.
Cosine similarity has several real-world applications, and by using embedding vectors, we can compare real-world meanings in a programmatic manner. Python is one of the most popular languages for data science, and it offers various libraries to calculate cosine similarity with ease. In this article, we’ll discuss how you can implement cosine similarity in Python with the help of the scikit-learn and NumPy libraries.
What Is Cosine Similarity?
Cosine similarity is a measure of similarity between two non-zero vectors in an n-dimensional space. It is used in various applications, such as text analysis and recommendation systems, to determine how similar two vectors are in terms of their direction in the vector space.
Cosine Similarity Formula
The cosine similarity between two vectors, A and B, is calculated using the following formula:
Cosine Similarity (A, B) = (A · B) / (||A|| * ||B||)
In this formula, A · B represents the dot product of vectors A and B. This is calculated by multiplying the corresponding components of the two vectors and summing up the results. ||A|| represents the Euclidean norm (magnitude) of vector A, which is the square root of the sum of the squares of its components. It's calculated as ||A|| = √(A₁² + A₂² + ... + Aₙ²). ||B|| represents the Euclidean norm (magnitude) of vector B, calculated in the same way as ||A||.
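As a worked illustration of the formula, here is the calculation done by hand for the vectors [5, 3, 4] and [4, 2, 4], the same pair used in the Python examples in this article. The final value is rounded to three decimal places.

```latex
\begin{aligned}
A \cdot B &= 5 \cdot 4 + 3 \cdot 2 + 4 \cdot 4 = 42 \\
\|A\| &= \sqrt{5^2 + 3^2 + 4^2} = \sqrt{50} \approx 7.071 \\
\|B\| &= \sqrt{4^2 + 2^2 + 4^2} = \sqrt{36} = 6 \\
\text{Cosine Similarity}(A, B) &= \frac{42}{\sqrt{50} \cdot 6} \approx 0.990
\end{aligned}
```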
How To Calculate Cosine Similarity
To calculate cosine similarity, first compute the dot product of the two vectors, then divide it by the product of their magnitudes. The resulting value will be in the range of -1 to 1, where:
- If the cosine similarity is 1, it means the vectors have the same direction and are perfectly similar.
- If the cosine similarity is 0, it means the vectors are perpendicular to each other and have no similarity.
- If the cosine similarity is -1, it means the vectors have opposite directions and are perfectly dissimilar.
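The three cases above can be checked with a small sketch. The helper below and the two-dimensional vectors are illustrative choices, not part of any library; because of floating-point rounding, the results for parallel and opposite vectors may differ from exactly 1 and -1 by a tiny amount.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction (one vector is a positive multiple of the other): close to 1
same_direction = cosine_similarity([1, 2], [2, 4])

# Perpendicular vectors: the dot product is 0, so the similarity is 0
perpendicular = cosine_similarity([1, 0], [0, 1])

# Opposite directions (one vector is a negative multiple of the other): close to -1
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, perpendicular, opposite)
```

Note that the magnitude of the vectors does not matter: [1, 2] and [2, 4] differ in length but point the same way, so they score as fully similar.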
In text analysis, cosine similarity is used to measure the similarity between document vectors, where each document is represented as a vector in a high-dimensional space, with each dimension corresponding to a term or word in the corpus. By calculating the cosine similarity between document vectors, you can determine how similar or dissimilar two documents are to each other.
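To make the document-vector idea concrete, here is a minimal sketch that turns each text into a vector of word counts and compares them. The tokenizer (lowercase, split on whitespace), the function names, and the sample sentences are all assumptions made for illustration; real pipelines usually apply more careful tokenization and weighting.

```python
import math
from collections import Counter

def term_vector(text):
    """Count lowercase, whitespace-separated tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())

def document_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two non-empty texts."""
    counts_a, counts_b = term_vector(text_a), term_vector(text_b)
    # Only terms that appear in both documents contribute to the dot product
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

doc_1 = "the cat sat on the mat"
doc_2 = "the cat lay on the rug"
doc_3 = "stock prices fell sharply today"

similar_score = document_similarity(doc_1, doc_2)    # shares "the", "cat", "on"
unrelated_score = document_similarity(doc_1, doc_3)  # no shared words

print(similar_score, unrelated_score)
```

The first pair shares several words and scores about 0.75, while the second pair shares none and scores 0. Word counts are never negative, so similarities computed this way fall between 0 and 1.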
Libraries for Cosine Similarity Calculation
- NumPy: Great for numerical operations, and it's optimized for speed.
- scikit-learn: Offers various machine learning algorithms and includes a method for cosine similarity in its metrics package.
The following are some examples to show how cosine similarity can be calculated using Python. We’ll use two example book review vectors, [5, 3, 4] and [4, 2, 4].
Straight Python
Here is how you can compute cosine similarity using Python with no additional libraries:
A = [5, 3, 4]
B = [4, 2, 4]
# Calculate dot product
dot_product = sum(a*b for a, b in zip(A, B))
# Calculate the magnitude of each vector
magnitude_A = sum(a*a for a in A)**0.5
magnitude_B = sum(b*b for b in B)**0.5
# Compute cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
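The snippet above assumes both vectors are non-zero and have the same length. A hedged sketch of a reusable version that checks those assumptions might look like the following; the function name and the error messages are hypothetical choices for this example.

```python
import math

def cosine_similarity_checked(a, b):
    """Cosine similarity with explicit checks for the cases the formula cannot handle."""
    # zip() silently stops at the shorter input, so mismatched lengths must be caught here
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so the similarity is undefined (division by zero)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

score = cosine_similarity_checked([5, 3, 4], [4, 2, 4])
print(f"Cosine Similarity with input checks: {score}")
```

Raising an error is one design choice; some code bases prefer to return 0.0 for zero vectors instead, which is worth deciding explicitly rather than leaving to a `ZeroDivisionError`.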
NumPy
Embedding vectors typically have many dimensions, often hundreds or thousands, and sometimes far more. With NumPy, you can calculate cosine similarity using array operations, which are highly optimized.
import numpy as np
A = np.array([5, 3, 4])
B = np.array([4, 2, 4])
dot_product = np.dot(A, B)
magnitude_A = np.linalg.norm(A)
magnitude_B = np.linalg.norm(B)
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using NumPy: {cosine_similarity}")
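The same array operations extend to comparing one vector against many at once. The sketch below assumes the candidates are stacked as rows of a 2D array; the variable names and the extra candidate rows are made up for illustration.

```python
import numpy as np

query = np.array([5, 3, 4], dtype=float)

# Each row is one candidate vector (hypothetical data)
candidates = np.array([
    [4, 2, 4],
    [5, 3, 4],
    [-5, -3, -4],
], dtype=float)

# One matrix-vector product gives every dot product at once
dot_products = candidates @ query

# Row-wise norms of the candidates, times the norm of the query
norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)

similarities = dot_products / norms
best_match = int(np.argmax(similarities))

print(similarities)
print(f"Most similar candidate is row {best_match}")
```

Here the second row is identical to the query, so it scores highest, and the third row is its negation, so it scores close to -1.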
Scikit-Learn
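Scikit-learn's cosine_similarity function makes this even easier. It expects 2D arrays with one row per vector and returns a matrix of pairwise similarities, which is why the inputs below are wrapped in an extra pair of brackets and the result is read with [0][0]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity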
A = np.array([[5, 3, 4]])
B = np.array([[4, 2, 4]])
cosine_similarity_result = cosine_similarity(A, B)
print(f"Cosine Similarity using scikit-learn: {cosine_similarity_result[0][0]}")
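Because `cosine_similarity` works on rows, passing a single 2D array compares every row with every other row. The following is a small sketch of that usage; the third review vector is a hypothetical addition for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One row per review vector; the last row is made up for illustration
reviews = np.array([
    [5, 3, 4],
    [4, 2, 4],
    [1, 5, 1],
])

# With a single argument, each row is compared against every row of the same array
similarity_matrix = cosine_similarity(reviews)

print(similarity_matrix.shape)
print(similarity_matrix)
```

The result is a 3 x 3 symmetric matrix: entry [i][j] is the similarity between row i and row j, and the diagonal is approximately 1 because each vector is compared with itself.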
Tips for Optimizing Cosine Similarity Calculations in Python
If you are going to use Python to directly compute cosine similarity, there are some things to consider:
- Use optimized libraries like NumPy or scikit-learn: These libraries are optimized for performance and are generally faster than vanilla Python.
- Use Numba: Numba is an open-source JIT compiler for Python and NumPy code, built specifically to optimize scientific computing functions.
- Use GPUs: If you have access to a GPU, use Python libraries such as TensorFlow that have been optimized for use on a GPU.
- Parallelize Computations: If you have the hardware capabilities, consider parallelizing your computations to speed them up.
Search Large Numbers of Vectors With Vector Search on Astra DB
If you need to search large numbers of vectors, you may find it more efficient and scalable to use a vector database such as DataStax Astra’s Vector Search capability. Vector Search on Astra DB offers a powerful platform to help you execute vector searches with built-in cosine similarity calculations so you can get more insights from your data.