Loading Vector Data Into Cassandra in Parallel Using Ray
DEMO: Delve into the nuances of combining the prowess of DataStax Astra with the power of Ray; this article is a companion to this demo on GitHub.
This blog will delve into the nuances of combining the prowess of DataStax Astra with the power of Ray and is a companion to this demo on GitHub. We’ll explore the step-by-step procedure, the pitfalls to avoid, and the advantages this dynamic duo brings to the table. Whether you’re a data engineer, a developer looking to optimize your workflows, or just a tech enthusiast curious about the latest in data solutions, this guide promises insights aplenty. Soon, you’ll be able to use Cassandra 5 in place of AstraDB in this demo, but for now, AstraDB is a quick way to get started with a vector-search-capable Cassandra database!
Introduction
Vector search is a technology that works by turning the data we are interested in into numerical representations of locations in a coordinate system. Similar items have their vector locations close to each other in this space, so given some items, we can find the items most similar to them. A database that holds and operates on vectors is called a vector store. This functionality is coming in Cassandra 5.0, which will be released soon; to preview it today, we can make use of DataStax Astra. In this case, the items are bits of text that are embedded. Embedding runs text through a machine-learning model that returns vectors representing the data. You can think of embedding as translating data from real text into vectors.
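As a toy illustration of "close to each other" (this is not the embedding model used later in the demo, and the vectors here are made up), similarity between two vectors is often measured with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "embeddings": related items point in
# similar directions, unrelated items do not.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

cosine_similarity(cat, kitten)  # close to 1.0
cosine_similarity(cat, car)     # much smaller
```

Real embedding models do exactly this, only in hundreds of dimensions (384 for the model used later in this demo).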
Ray is a processing engine for Python code that is specialized for distributed machine-learning tasks. In this tutorial, we use Ray core to parallelize running text through a specific embedding model so that we can load those vectors into Astra.
We use Ray to speed up the process of embedding our text chunks. While running a pre-trained machine-learning model requires far fewer calculations than training a new one, computing the model’s output a large number of times can still take a long time and a significant amount of computing power. Ray lets us run the process on multiple machines simultaneously to reduce how long it takes to complete.
Prerequisites
For this tutorial, you will need the following:
- An Astra Account from DataStax
- A Colaboratory account from Google
Before you can proceed further, you will need to set up your Astra or Cassandra database. After creating a free account, you will need to create a database within that account and create a Keyspace within that database. All of this can be done purely using the Astra UI.
Running the Code
Once the setup is complete, open Google Colab. Under File → Open Notebook, go to the GitHub tab and paste in the link to the astra_vector_search.ipynb from our repo in order to open the notebook in Colab.
Download the files local_creds_secrets.py and requirements.txt from the GitHub repo. Download your secure connect bundle and generate a token from the Astra UI.
Edit local_creds_secrets.py, pasting your client ID and client secret from AstraDB into the empty strings on the specified rows. Change the file name on the secure_connect_bundle line to reflect the file name of your secure connect bundle. If you changed the Keyspace name when creating your database, enter it into the db_keyspace line.
In Google Colab, enter the file sidebar on the left of the screen and upload local_creds_secrets.py, requirements.txt, and your secure connect bundle. Restart the runtime. In order to run individual cells from the notebook, select them and press Ctrl+Enter.
Explaining and Using the Code
Within the notebook, you can click on any cell to edit its contents. To execute a cell and see its output, select the cell and press Shift + Enter. As you make modifications, save your notebook regularly using the save icon or Ctrl + S. Because Colab hosts the runtime for you, you can simply close the browser tab when you’re done; if you run the notebook on a local Jupyter server instead, press Ctrl + C in its terminal to safely shut the server down.
Notebook Cell Explanation
The first cell in the notebook installs the Python dependencies listed in requirements.txt.
The second cell imports ray and starts a Ray runtime environment that includes the specified dependencies.
You can, instead, connect your notebook to an existing Ray cluster during this step. Provide the IP address of the head node of an external Ray cluster in ray.init in order to run Ray processes on the specified cluster.
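The two modes can be sketched as below. The pinned packages and the head-node address are illustrative placeholders, not values from the demo, and `10001` is simply the default Ray Client port:

```python
# Sketch: build keyword arguments for ray.init() for either mode.
def ray_init_kwargs(head_address=None):
    if head_address:
        # Attach to an existing Ray cluster via the Ray Client port.
        return {"address": f"ray://{head_address}:10001"}
    # Otherwise start a local runtime, installing our dependencies
    # on every worker through runtime_env.
    return {"runtime_env": {"pip": ["langchain", "sentence-transformers"]}}

# import ray
# ray.init(**ray_init_kwargs())            # local Ray runtime
# ray.init(**ray_init_kwargs("10.0.0.1"))  # external cluster head node
```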
After that, we load more packages and create the RecursiveCharacterTextSplitter object, which splits long text objects into shorter chunks of a specified length.
Then, we use ArxivLoader (a LangChain utility for downloading text versions of scientific research papers from arxiv.org) to pull a paper and split it up into chunks using the previously defined text splitter.
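A sketch of that load-and-split step is below. Module paths vary between LangChain versions, and the arXiv ID and chunk sizes are illustrative, not taken from the demo:

```python
def load_and_split(arxiv_id, chunk_size=512, chunk_overlap=50):
    """Download one paper from arxiv.org and split it into text chunks."""
    # Imports are local so this sketch only needs LangChain when called.
    from langchain.document_loaders import ArxivLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    docs = ArxivLoader(query=arxiv_id, load_max_docs=1).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(docs)

# chunks = load_and_split("1706.03762")  # illustrative arXiv ID
```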
Once that is complete, we create a Ray data object from our collection of split text bits. We do some minor pre-processing by replacing the newline character that ArxivLoader returns when there is a line break in the paper with a normal space character. We make this simple operation performant by using Ray’s flat_map function on our collection, applying the standard Python replace function to each bit of text.
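That preprocessing step can be sketched as follows; the row schema (a `"text"` column) is an assumption about how the dataset was built:

```python
def clean_text(row):
    # Replace the hard line breaks ArxivLoader leaves in the paper text
    # with ordinary spaces.
    row["text"] = row["text"].replace("\n", " ")
    return [row]  # flat_map expects a list of output rows

# With a Ray Dataset `ds` built from the split chunks (assumed):
# ds = ds.flat_map(clean_text)
```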
Next, after we define the name of our embedding model (“intfloat/multilingual-e5-small” from HuggingFace, a model that turns up to 512 tokens of text into a vector of length 384), we create a class that loads the model upon creation and uses the model to return our embedded vectors when called.
Then, we use the Ray map_batches function with the Embed class as an argument to apply the embedding model to each piece of text in our dataset. The map_batches function applies a user-provided function (or callable class) to all elements in a Ray Dataset, splitting the data into batches to parallelize the process.
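A sketch of the callable class and the map_batches call is below. The model name matches the article; loading it via the sentence-transformers library, the dict-of-columns batch format, and the map_batches keyword arguments (which vary across Ray versions) are assumptions:

```python
class Embed:
    """Loads the embedding model once per worker, then embeds each batch."""

    def __init__(self):
        # Local import so the model is loaded on the Ray worker process.
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("intfloat/multilingual-e5-small")

    def __call__(self, batch):
        # `batch` is assumed to be a dict of columns, as Ray Data passes it.
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

# Applied across the dataset in parallel batches (keyword names vary
# by Ray version; newer releases use `concurrency=`):
# ds = ds.map_batches(Embed, batch_size=64, concurrency=4)
```

Because the model loads in `__init__`, each worker pays the model-loading cost once rather than once per batch.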
After that, we create our connection to Astra and create our tables and indices for storing the vector data. Our table is named papers and is defined using the following schema definition:
CREATE TABLE IF NOT EXISTS papers (
    id int PRIMARY KEY,
    name TEXT,
    description TEXT,
    item_vector VECTOR<FLOAT, 384>
);
We also create a Storage Attached Index (SAI) named ann_index, defined below:
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
ON {table_name}(item_vector) USING 'StorageAttachedIndex';
Once the table and index are created, we can load our data into the database. This is where we take the Python object that contains the article information and the embedded vector and process each entry. We make an insert query for each row and send it to the database. Do this by running the cell in the notebook under the header “Insert vector records into Astra DB.”
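The per-row insert can be sketched like this, assuming a connected cassandra-driver `session` and records shaped like the table schema (the helper name and record keys are illustrative):

```python
INSERT_CQL = (
    "INSERT INTO papers (id, name, description, item_vector) "
    "VALUES (%s, %s, %s, %s)"
)

def make_params(row_id, record):
    """Map one embedded record onto the papers table's columns."""
    return (
        row_id,
        record["title"],
        record["text"],
        list(record["embedding"]),  # bind the vector as a plain list of floats
    )

# for i, rec in enumerate(ds.iter_rows()):
#     session.execute(INSERT_CQL, make_params(i, rec))
```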
We can then search the table using an ANN (Approximate Nearest Neighbor) query, like this:
SELECT * FROM vector_search.papers
ORDER BY item_vector ANN OF {embedding}
LIMIT 2;
where {embedding} is the vector embedding of the example sentence we want to find similar items to.
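From Python, the same ANN query can be run through the driver. This is a hedged sketch: the `session`, the embedding model, and the exact parameter-binding behavior for vectors are assumed and may differ by driver version:

```python
ANN_CQL = (
    "SELECT name, description FROM vector_search.papers "
    "ORDER BY item_vector ANN OF %s "
    "LIMIT 2"
)

# query_vec = model.encode("an example sentence").tolist()
# for row in session.execute(ANN_CQL, [query_vec]):
#     print(row.name)
```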
Conclusion
That puts us at the end of the embedding process.
In summary, the first step to make use of Astra’s new vector storage feature is to load pieces of vectorized text into an Astra database. This involves running bits of text through a vector embedding model, often tens of thousands to millions of times. To speed this up, we can use multiprocessing frameworks to split the work into manageable pieces and complete the process quickly across many machines. For this, we use Ray, a Python compute framework optimized for machine-learning tasks. We create a temporary Ray cluster inside the Google Colab resources and pull down data using ArxivLoader, which we then split, vectorize, and upload into Astra. Once all that is done, we can query the data using the Astra ANN vector search feature.
Getting Help
You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. You can use this demo to introduce yourself to Ray, a pivotal data engineering tool that takes many of the benefits of Spark and tunes them specifically for AI/ML and vector processing. We’ve found it works very well with Cassandra, and we’re sure you will, too. Happy coding!
Published at DZone with permission of Patrick McFadin. See the original article here.