Build a Philosophy Quote Generator With Vector Search and Astra DB (Part 2)
In part 2 of this 'Infinite Wisdom Series,' we see how we store data from famous philosophers in a database and query it with a semantic search engine.
This is the penultimate post in a mini-series about building a vector-search-based generative AI application with Python and Astra DB from scratch. We covered several important concepts in part 1. In this installment, we will store quotes from famous philosophers in the database and query them through a semantic search engine.
Create and Fill the Vector Store
The present application will focus on quotes by famous philosophers, such as:
"Restlessness is the hallmark of existence." (J.-P. Sartre)
"No one is more hated than he who speaks the truth." (Plato)
The starting point is a ready-made corpus of these quotes, where each quote is attributed to an author and can bear a few labels. Here is an excerpt from the dataset (which we adapted from a Creative Commons-licensed source and enriched somewhat — see the Appendix for more details):
{
    "plato": [
        {
            "body": "No one is more hated than he who speaks the truth.",
            "tags": [
                "ethics",
                "knowledge"
            ]
        },
        ...
    ],
    ...
}
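To get a feel for the data, here is a minimal sketch that loads the file and prints a per-author summary (the file name is an assumption; in the full file, the author-to-quotes mapping excerpted above sits under a top-level "quotes" key, as the ingestion code later in this post assumes):
import json

# Load the quotes file (path assumed; adjust to your local copy):
quote_dict = json.load(open("philo_quotes.json"))

# The author => quote-list mapping sits under the "quotes" key:
for author, quotes in quote_dict["quotes"].items():
    print(f"{author}: {len(quotes)} quotes, e.g. \"{quotes[0]['body']}\"")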
Now, let's start by building a quote search engine. By exploiting the power of sentence embeddings and vector search, which do all the heavy lifting, we can easily achieve something better than mere term-based search: semantic search. For instance, the quote by Plato above could be found with a search string such as "If you say things how they are, you'll be not liked at all."
Database Access
As remarked earlier, the present application can be implemented by direct usage of the Cassandra drivers, which use CQL (Cassandra Query Language) to interface with the database, or by using the CassIO library, built to automate most of the lower-level operations around ML/GenAI workloads (first and foremost vector-related interactions). While the latter choice certainly saves several lines of code and lets you get a running application without much knowledge of Cassandra-specific interfaces, the former opens the way to customizing your data structures and the corresponding workloads.
Note: The database connection steps described in this section refer to CassIO; the equivalent details for the CQL case are found in the corresponding notebook (see references at the end of this post).
First, we need a vector store, where we'll write all quotes, together with their vector embeddings, for later querying; for Astra DB, this means a table. We'll create the table, along with the index(es) required to run the search, in our Astra DB account: to do so, we need credentials and a connection parameter from the Astra web UI, namely, the Database ID and a secret Token (with role Database Administrator):
astra_token = "AstraCS:..."
database_id = "3df2a5b6-..."
In the rest of the post, we will mainly stick to the CassIO approach, highlighting the main differences with the CQL-based one: remember that, for deeper inspection, you can find links to the complete code for both implementations at the end of the post. In this spirit, the required dependencies are as follows:
pip install cassio openai
The CassIO library takes care of handling all of the low-level details of the database connection (the "Session" object, in CQL drivers parlance): all we need is:
import cassio
cassio.init(token=astra_token, database_id=database_id)
Vector Store Creation
Creating a vector store with CassIO is as simple as instantiating one of its table classes: having initialized the database connection with cassio.init above, we just need to specify the table name and the dimensionality of the vectors. We are going to use OpenAI's embedding model, whose vectors have 1536 components:
from cassio.table import ClusteredMetadataVectorCassandraTable

v_table_partitioned = ClusteredMetadataVectorCassandraTable(
    table="philosophers_cassio_partitioned",
    vector_dimension=1536,
)
Of the various vector-capable table classes supported by CassIO, we are using the one that, besides supporting metadata for search/retrieval, is partitioned. What does that mean, and why do we make this choice?
Well, the fact is, each philosophical quote belongs to exactly one author, so we choose a storage model where quotes by the same author are kept together (on the same nodes of the distributed database, next to each other for more efficient retrieval). In this way, we can optimize for queries restricted to a single author. To be clear, vector ANN (approximate nearest neighbor) search can (and will!) be run on the whole table with no problems, and we could have lumped the quote author in with the rest of the quote metadata; indeed, the full example in the notebook demonstrates that choice as well. But the lesson here is: if your data is "naturally" grouped in disjoint subsets, and you anticipate you will often restrict ANN queries to a single group, consider partitioned storage for a significant improvement in query latency.
What happens on the database when you run the above statement? A table is created, with a schema managed within CassIO, along with a couple of SAIs (Storage Attached Indexes, which power the vector-based queries we'll run). The created table has a column of the right type to host our embedding vectors, of course!
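If you are curious, you can verify this from the database side, for instance by describing the newly created table in the CQL console (the keyspace name here is just a placeholder); the output includes the CREATE INDEX statements for the SAIs as well:
DESCRIBE TABLE my_keyspace.philosophers_cassio_partitioned;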
Vector Store Creation, the CQL Way
Still, you may opt for more advanced usage and create the table and the indexes yourself, with your own choice of schema and column names, rather than leaving it all to CassIO. Fair choice. In that case, you will run something like this:
create_table_p_statement = f"""CREATE TABLE IF NOT EXISTS {keyspace}.philosophers_cql_partitioned (
    author TEXT,
    quote_id UUID,
    body TEXT,
    embedding_vector VECTOR<FLOAT, 1536>,
    tags SET<TEXT>,
    PRIMARY KEY ( (author), quote_id )
) WITH CLUSTERING ORDER BY (quote_id ASC);"""
session.execute(create_table_p_statement)

create_vector_index_p_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_embedding_vector_p
    ON {keyspace}.philosophers_cql_partitioned (embedding_vector)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""
session.execute(create_vector_index_p_statement)

create_tags_index_p_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_tags_p
    ON {keyspace}.philosophers_cql_partitioned (VALUES(tags))
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
"""
session.execute(create_tags_index_p_statement)
You can see there are a few CQL commands that are run on the Session object (check the notebook for the full code, including how to create the Session). The structure of the table implements the same choice of partitioning (in this case, explicitly expressed by the author column) we discussed earlier.
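For reference, a minimal sketch of how such a Session can be created against Astra DB might look like the following (the bundle path and keyspace name are assumptions; the notebook has the authoritative code):
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Connect using the secure connect bundle downloaded from the Astra UI,
# authenticating with the token obtained earlier:
cluster = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("token", astra_token),
)
session = cluster.connect()
keyspace = "my_keyspace"  # assumption: your target keyspace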
Sanity Check: OpenAI Embeddings
Before starting to compute and store embedding vectors for all these texts, let's see what these vectors look like. We use OpenAI, so we need an API Key. We also specify an explicit choice for the embedding model being used:
import openai
openai.api_key = "..."
embedding_model_name = "text-embedding-ada-002"
We compute the embeddings for three sentences as easily as:
texts = [
    "The cat is on the table",
    "A feline sits on my desk right now!",
    "The dragon was spitting fire at the knights!",
]
test_results = openai.Embedding.create(
    input=texts,
    engine=embedding_model_name,
)
vectors = [
    d.embedding
    for d in test_results.data
]
Let's see the basic properties of embedding vectors in action. Observe the following REPL interaction:
>>> # 1. What's the dimensionality of these vectors?
>>> print([len(v) for v in vectors])
[1536, 1536, 1536]
>>> # 2. What's their length as vectors (or "norm")?
>>> print([sum(x*x for x in v)**0.5 for v in vectors])
[1.0000000319557796, 1.0000000459125642, 1.0000000206259116]
>>> # 3a. What's the scalar product between the first two?
>>> print(sum(x*y for x, y in zip(vectors[0], vectors[1])))
0.902832337449373
>>> # 3b. What's the scalar product between first and third?
>>> print(sum(x*y for x, y in zip(vectors[0], vectors[2])))
0.7731318100915844
The first two answers above confirm that OpenAI returns vectors of length 1536, normalized to unit length (modulo machine-precision fluctuations), and the last two calculations show that, indeed, sentences of similar content result in a higher cosine than unrelated sentences (compare the contents of the texts list above)!
Smaller angles have higher cosine: the cosine of the angle between identical vectors is one, and it decreases as the vectors are less and less aligned. (Sketch, not to scale.)
As a technical note, remember that you can calculate the "dot" (or scalar) product in place of the more expensive cosine, as long as the vectors are guaranteed to have unit length. In the above REPL interaction, we just took advantage of this fact.
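To make this concrete, here is a quick check with the vectors computed above: the full cosine formula divides the dot product by the two norms, which for these vectors are all essentially one, so the two computations agree up to machine precision:
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return sum(x * x for x in u) ** 0.5

def cosine(u, v):
    # full formula: (u . v) / (|u| * |v|)
    return dot(u, v) / (norm(u) * norm(v))

print(cosine(vectors[0], vectors[1]))  # ~0.9028...
print(dot(vectors[0], vectors[1]))     # ~0.9028..., same up to precision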
Populate the Store
A preliminary note: in this simple example, we first load the whole dataset into the vector store and then run the queries. A more complex, realistic application need not be structured like that: new items can be ingested, and even removed, at any moment, and the vector search results will reflect these changes in real time.
We will now loop through the quotes (loaded from a JSON file in this case), compute the corresponding embedding vectors, and save them into the vector store.
OpenAI's embedding API enables passing a batch of input texts and getting back a corresponding list of embedding vectors: we take advantage of this and make the whole process of embedding computation much faster. We will choose batches of 50 because there are exactly 50 quotes per author in the input dataset.
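(As an aside, if your dataset were not conveniently pre-grouped like this, a small chunking helper, sketched here with a hypothetical name, would achieve the same batching:)
def batched(items, batch_size):
    # yield successive fixed-size chunks of a list (the last may be shorter):
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]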
Another crucial optimization for populating the vector store faster is concurrent DB insertions. Astra DB (just like Cassandra) is designed to sustain heavy traffic, especially for write operations: it won't even flinch if we write all 50 rows of a batch together. To do so, we can make use of CassIO's put_async method, which returns a future (the later future.result() invocation ensures completion of the write operation):
import json

quote_dict = json.load(open(".../philo_quotes.json"))

for philosopher, quotes in quote_dict["quotes"].items():
    print(f"{philosopher}: ", end="")
    result = openai.Embedding.create(
        input=[quote["body"] for quote in quotes],
        engine=embedding_model_name,
    )
    futures = []
    for quote_idx, (quote, q_data) in enumerate(zip(quotes, result.data)):
        futures.append(v_table_partitioned.put_async(
            partition_id=philosopher,
            row_id=f"q_{philosopher}_{quote_idx}",
            body_blob=quote["body"],
            vector=q_data.embedding,
            metadata={tag: True for tag in quote["tags"]},
        ))
    # wait for all the writes in this batch to complete:
    for future in futures:
        future.result()
    print(f" Done ({len(quotes)} quotes inserted).")
print("Finished inserting.")
When writing to the store, we provide the quote author as the partition_id to the put operation while still constructing a unique row_id for the quote entry (technically, it just needs to be unique within the partition, but there's no harm in adding the author to the ID). The signature of the put_async method is the same as that of the (synchronous) put, which blocks until completion: so, a more "casual" use of the vector store would just consist of a loop over the quotes and iterated invocation of the latter blocking insertion (check the linked notebook to see synchronous insertions demonstrated as well).
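For reference, such a blocking loop would look like this minimal sketch (same arguments as above, one write at a time):
# Synchronous variant: each put blocks until the row is written.
for quote_idx, (quote, q_data) in enumerate(zip(quotes, result.data)):
    v_table_partitioned.put(
        partition_id=philosopher,
        row_id=f"q_{philosopher}_{quote_idx}",
        body_blob=quote["body"],
        vector=q_data.embedding,
        metadata={tag: True for tag in quote["tags"]},
    )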
Populating the Store, the CQL Way
If you want to work at the CQL level, you will perform the writes in the form of CQL statements (essentially what CassIO would do for you otherwise), of course, adapted to the specific table schema you chose earlier. Moreover, you can perform concurrent writes very easily by executing several statements at once with the Cassandra Python driver's execute_concurrent_with_args primitive:
from uuid import uuid4
from cassandra.concurrent import execute_concurrent_with_args

prepared_insertion = session.prepare(
    f"INSERT INTO {keyspace}.philosophers_cql_partitioned (quote_id, author, body, embedding_vector, tags) VALUES (?, ?, ?, ?, ?);"
)

for philosopher, quotes in quote_dict["quotes"].items():
    print(f"{philosopher}: ", end="")
    result = openai.Embedding.create(
        input=[quote["body"] for quote in quotes],
        engine=embedding_model_name,
    )
    tuples_to_insert = []
    for quote, q_data in zip(quotes, result.data):
        quote_id = uuid4()
        tuples_to_insert.append(
            (quote_id, philosopher, quote["body"], q_data.embedding, set(quote["tags"]))
        )
    conc_results = execute_concurrent_with_args(
        session,
        prepared_insertion,
        tuples_to_insert,
    )
    # check that all insertions succeeded (better to always do this):
    if any(not success for success, _ in conc_results):
        print("Something failed during the insertions!")
    else:
        print(f"Done ({len(quotes)} quotes inserted).")
print("Finished inserting.")
A Vector-Based Search Engine
The next step is to build a search function that queries the store given an input "query string" (presumably a philosophical sentence of some sort). Conceptually, what we need to do unfolds in two steps:
- Convert the input text into an embedding vector V.
- Run an ANN query on the database that expresses, "Give me the top n rows whose vector is the closest to this query vector V."
The first step is nothing different from what we did earlier for the insertions: it amounts to invoking the OpenAI embedding model. For the second step, if we rely on CassIO to manage the low-level interaction with the store, we just call the ann_search method of the vector table object.
Remember, we are using a partitioned table, so, when available, we can specify a partition_id (i.e., an author) to restrict the search. Likewise, the search engine can accept labels as filters on the results, which means passing an optional metadata parameter to ann_search. We end up with the following function, which accepts a textual input, a number of entries to return, and, optionally, an author and/or any number of labels to narrow down the search:
def find_quote_and_author_p(query_quote, n, author=None, tags=None):
    query_vector = openai.Embedding.create(
        input=[query_quote],
        engine=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    partition_id = None
    if author:
        partition_id = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    results = v_table_partitioned.ann_search(
        query_vector,
        n=n,
        partition_id=partition_id,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["partition_id"])
        for result in results
    ]
The function returns a list of (quote, author) pairs. Let's see it at work with different filtering options:
>>> find_quote_and_author_p("We struggle all our life for nothing", 3)
[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.', 'schopenhauer'), ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.', 'aristotle'), ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry', 'freud')]
>>> find_quote_and_author_p("We struggle all our life for nothing", 2, author="nietzsche")
[('To live is to suffer, to survive is to find some meaning in the suffering.', 'nietzsche'), ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.', 'nietzsche')]
>>> find_quote_and_author_p("We struggle all our life for nothing", 2, tags=["politics"])
[('Mankind will never see an end of trouble until lovers of wisdom come to hold political power, or the holders of power become lovers of wisdom', 'plato'), ('Everything the State says is a lie, and everything it has it has stolen.', 'nietzsche')]
Vector Search, the CQL Way
By now, you know the pattern: there is an equivalent CQL command, to be run with session.execute(...), that performs the vector ANN search query and returns the desired matches. The CQL command looks similar to the following:
SELECT body, author
FROM my_keyspace.philosophers_cql_partitioned
WHERE author = 'nietzsche'
AND tags CONTAINS 'politics'
AND tags CONTAINS 'ethics'
ORDER BY embedding_vector ANN OF [0.0012, -0.00091, ...]
LIMIT 3;
To collect the returned (quote, author) pairs in a list, one can employ a list comprehension as follows:
result_rows = session.execute(search_statement)
pairs = [
    (result_row.body, result_row.author)
    for result_row in result_rows
]
We have simplified the management of the CQL statement somewhat to better highlight the important aspects: in practice, one usually does not embed literal values in the statement, but rather passes them separately to the execute method to match placeholders in the CQL string. Moreover, prepared statements are strongly recommended for queries that are run over and over (something that CassIO automatically does behind the scenes). Please check the CQL end-to-end example linked at the end of this post for the full code in this case.
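As a rough sketch of what that looks like (table and column names as in the schema above; this is an outline, not the notebook's exact code), one would prepare the statement once and bind the values at execution time:
# Prepare once, with placeholders instead of literal values:
search_statement_p = session.prepare(
    f"""SELECT body, author
        FROM {keyspace}.philosophers_cql_partitioned
        WHERE author = ?
        ORDER BY embedding_vector ANN OF ?
        LIMIT ?;"""
)

# Bind and execute for each incoming query:
query_vector = openai.Embedding.create(
    input=["We struggle all our life for nothing"],
    engine=embedding_model_name,
).data[0].embedding
result_rows = session.execute(search_statement_p, ("nietzsche", query_vector, 2))
pairs = [(row.body, row.author) for row in result_rows]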
Vector Search At-A-Glance
Now, we have a pretty advanced search engine. Before wrapping up here and turning to the quote generator in the final part of this mini-series, let's try to summarize the "essence" of vector search in a single picture. Let's pretend for a moment that the embedding vectors are not 1536-, but three-dimensional vectors, i.e., points in our everyday space, the three numeric components anchored to a "zero point" or origin. Remember, all embeddings we use are of length one: this means that the "space" spanned by the arrows' tips is … the surface of a sphere.
All good so far. Now, when we run a search, we have a vector, i.e., an arrow pointing to a certain location on the sphere, and we want to get the closest sentences (i.e., the closest vectors). This is easily pictured:
In the space where quote embeddings live, similarity search means locating the vectors (or points on the sphere) closest to the query vector.
A quick aside: so far, we have mentioned the "cosine" as the way to check how similar, or how "close to each other," two vectors happen to be. You might prefer a different measure, such as what you'd get by placing a "measuring tape" between the two arrow tips: that is the Euclidean distance. Now, as long as your vectors lie on a sphere, choosing one or the other does not really make a difference, in that you'll get the very same results back (and in the same order)!
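The reason is a one-line identity: for unit vectors, the squared Euclidean distance equals 2 minus twice the dot product, so it decreases exactly when the cosine increases, and both measures rank neighbors identically. A quick numerical check with the vectors from the REPL session above:
def euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

# squared distance vs. 2 - 2*cosine, for the sentence pairs used earlier:
for other in (vectors[1], vectors[2]):
    d2 = euclidean(vectors[0], other) ** 2
    cos = sum(x * y for x, y in zip(vectors[0], other))
    print(f"||u-v||^2 = {d2:.6f}   2 - 2*cos = {2 - 2 * cos:.6f}")  # ~equal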
Next Steps
This is the end of Part 2. In the next and last episode, the search engine will become the heart of a full GenAI application: we will implement a generator of new philosophical quotes similar in style and content to the actual quotes we have seen so far.
Remember that you can jump straight to the Google Colab version of the application to see it all in action: all you need is an Astra DB free account, along with an OpenAI API key.
Reference Links
- DataStax Astra DB, Vector overview
- CassIO version of the notebook (opens in Colab)
- CQL version of the notebook (opens in Colab)
Appendix: Dataset Source and Preprocessing
The 450 philosophical quotes used in this example are adapted from a dataset with a CC BY-NC-SA 4.0 license, originally hosted on Kaggle. That dataset consisted of one text file per author, with a variable number of quote texts (one per line) and nothing else.
We selected a smaller number of quotes, cleaned out "weird" characters for easier processing with Python, and added the tags in an AI-assisted fashion. While the present code loads the quotes from a local JSON file, the same is available (including an outline of the data-preparation procedure) as a HuggingFace dataset hosted within DataStax's organization.