Getting Started With Vector Databases

Table of Contents

Introduction
About Vector Databases
Key Concepts of Vector Databases
Vector Databases: Getting Started
Conclusion

Section 1

Introduction

Vector databases are specialized databases designed for scenarios where understanding context, similarity, or patterns is more important than matching exact values. They leverage the mathematics of vectors and the principles of geometry to understand and organize data, capabilities that are essential to boosting the power of analytical and generative artificial intelligence (AI). The explosion of AI and machine learning (ML) technologies is the key driver behind the rapid growth of vector databases in the last two years, as they provide greater value via performance, agility, and cost.

Unlike other evolutions in databases, vector databases were not made to replace any technology but to solve new use cases for which there was no existing technological alternative. The main purpose of this Refcard is to provide a clear and accessible overview of vector databases, outlining their importance, applications, and underlying principles.

In addition, we will use a functional example throughout to better demonstrate key points and objectives.

Section 2

About Vector Databases

A vector database is a specialized database for storing, searching, and managing information as vectors, which are numerical representations of objects (e.g., documents, text, images, videos, audio) in a high-dimensional space that capture certain features of the objects themselves. This numerical representation is called a vector embedding, or simply an embedding, which we will cover in more detail later on.

Figure 1: Vector database overview

Vector embeddings are created using ML models that translate the semantic and qualitative value of an object into a numerical representation. There are a variety of embedding models for each data type, such as text, audio, and images. A vector database is not required to generate or use vector embeddings, as many vector index libraries focus on storing embeddings in in-memory indexes. However, vector databases are highly recommended for enterprise architectures, production environments, and workloads with high concurrency and data volume.

Modern vector databases are designed to support the association of each embedding with the object's metadata, which can include a variety of information such as the structured definition and object definition. Having this information alongside the vectors enables more sophisticated querying, filtering, and management capabilities, similar to the queries made in traditional databases. This makes vector databases easier to integrate, more versatile, and more interpretable for end users and within data architectures.

Figure 2: Metadata

Vector databases are a complete system designed to manage embeddings at scale. Here are the key differentiators and advantages of using vector databases:

Persistence and durability: Allow data to be stored on disk as well as in-memory and provide fault-tolerant features like data replication or regular backups.
High availability and reliability: Operate continuously and provide tolerance to failures and errors based on clustering and data replication architectures.
Scalability: Scale horizontally across multiple nodes.
Optimized performance and cost effectiveness: Handle and organize data through high-dimensional vectors that can contain thousands of dimensions.
Support complex queries and APIs: Enable complex queries that combine vector similarity searches with traditional database queries.
Security and access control: Contain built-in security features, such as authentication and authorization, data encryption, data isolation, and access control mechanisms, that are essential for enterprise applications and compliance with data protection regulations.
Seamless integration and SDKs: Integrate seamlessly with existing data ecosystems, providing integration libraries for several programming languages, a variety of APIs (e.g., GraphQL, RESTful), and integrations with Apache Kafka.
Support for CRUD operations: Vector databases allow you to add, update, and delete objects with their vectors. This is so that users don't have to reindex the entire database when any underlying data changes.
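
As a rough illustration of two of the points above (CRUD operations and queries that combine vector similarity with metadata filters), the following is a minimal in-memory sketch. The TinyVectorStore class, its method names, and the sample vectors are hypothetical and only mimic the idea; a real vector database adds persistence, indexing, replication, and security on top.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: each object has a vector plus metadata."""

    def __init__(self):
        self._objects = {}  # object id -> (vector, metadata)

    def upsert(self, obj_id, vector, metadata):
        # Create or update a single object without touching the others.
        self._objects[obj_id] = (list(vector), dict(metadata))

    def delete(self, obj_id):
        self._objects.pop(obj_id, None)

    def search(self, query, limit=3, where=None):
        # Keep only objects whose metadata matches every key in `where`,
        # then rank the survivors by Euclidean distance to the query.
        where = where or {}
        hits = []
        for obj_id, (vector, metadata) in self._objects.items():
            if all(metadata.get(k) == v for k, v in where.items()):
                hits.append((math.dist(query, vector), obj_id))
        return [obj_id for _, obj_id in sorted(hits)[:limit]]


# Invented sample vectors and metadata, for illustration only.
store = TinyVectorStore()
store.upsert("tee-red", [0.9, 0.1], {"section": "Men", "family": "T-shirts"})
store.upsert("tee-white", [0.8, 0.2], {"section": "Women", "family": "T-shirts"})
store.upsert("jeans-black", [0.1, 0.9], {"section": "Women", "family": "Jeans"})

print(store.search([1.0, 0.0], limit=2))
print(store.search([1.0, 0.0], where={"section": "Women"}))
store.delete("tee-white")
print(store.search([1.0, 0.0], where={"section": "Women"}))
```

The filter is applied before ranking here, which is only one possible design; actual databases choose their own filtering strategy.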

Traditional Relational vs. Vector Database

Traditional relational databases are indispensable for applications that work with structured and semi-structured data and need queries to return exact matches. These databases store information in rows or documents, where each record holds structured information such as product attributes or customer details.

Vector databases, on the other hand, are optimized for storing and searching through high-dimensional vector data that will return items based on similarity metrics rather than exact matches.

Figure 3: Differences between traditional and vector databases

Section 3

Key Concepts of Vector Databases

Using vector databases involves understanding their fundamental concepts: embeddings, indexes, and distance and similarity.

Embeddings and Dimensions

As explained previously, embeddings are numerical representations of objects that capture their semantic meaning, relationships, contextual usage, or features in a high-dimensional space. This numerical representation is composed of an array of numbers in which each element corresponds to a specific dimension.

Figure 4: Embedding representation

The number of dimensions in an embedding is important because each dimension corresponds to a feature captured from the object. Each feature is represented as a numerical, quantitative value, and together the dimensions define the dimensional map in which each object is located.

Let's consider a simple example with a numerical representation of words, where the words are the description of each fashion retail product stored in our transaction database. Imagine that we could capture the essence of these products with only two dimensions.

Figure 5: Array of embeddings

In Figure 6, we can see the dimensional representation of these objects to visualize their similarity. The two t-shirts are closest to each other because they are the same product in different colors. The jacket is close to the t-shirts because they share attributes like sleeves and a collar. Furthest to the right are the jeans, which do not share attributes with the other products.

Figure 6: Dimensional map

Obviously, two dimensions are not enough to capture the essence of the products. Dimensionality plays a crucial role in how well embeddings capture the relevant features of the products. More dimensions may provide more accuracy, but they also require more resources in terms of compute, memory, latency, and cost.
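
To make the two-dimensional example concrete, here is a small sketch that places four products on a plane and sorts them by distance. The coordinates are invented for illustration (they are not the values from the figures), chosen only so that the t-shirts sit close together, the jacket nearby, and the jeans far to the right.

```python
import math

# Hypothetical two-dimensional embeddings for the four products.
# The coordinates are invented for illustration only.
embeddings = {
    "red t-shirt": (1.0, 2.0),
    "green t-shirt": (1.2, 2.1),
    "jacket": (2.5, 3.0),
    "jeans": (6.0, 1.0),
}


def nearest(name):
    """Return the other products sorted from closest to furthest."""
    origin = embeddings[name]
    others = [n for n in embeddings if n != name]
    return sorted(others, key=lambda n: math.dist(origin, embeddings[n]))


print(nearest("red t-shirt"))
```

With these assumed coordinates, the green t-shirt comes first, then the jacket, and the jeans last, which mirrors the layout described for the dimensional map.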

Vector Embedding Models Integration

Some vector databases provide seamless integration with embedding models, allowing us to generate vector embeddings from raw data and seamlessly integrate ML models into database operations. This feature simplifies the development process and abstracts away the complexities involved in generating and using vector embeddings for both data insertion and querying processes.

Figure 7: Embeddings generation patterns

Table 1: Embedding generation comparative

Examples	Without Integration	With Model Integrations
Data ingestion	1. Before we can insert each object, we must call our model to generate a vector embedding. 2. Then, we can insert our data with the vector.	We can insert each object directly into the vector database, delegating the transformation to the database.
Query	1. Before we run a query, we must call our model to generate a vector embedding from our query. 2. Then, we can run a query with that vector.	We can run a query directly in the vector database, delegating the transformation to the database.
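
The two ingestion patterns in Table 1 can be sketched as follows. Everything here is a stand-in: toy_embed is a hypothetical function that plays the role of an embedding model, and the two store classes are invented to show where the call to the model happens, not how any particular database implements it.

```python
def toy_embed(text):
    """Stand-in for an embedding model (hypothetical, not a real model).

    It maps text to a 3-number vector: length, vowel count, word count.
    """
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels), float(len(text.split()))]


class StoreWithoutIntegration:
    """The caller must supply a vector for every insert."""

    def __init__(self):
        self.rows = []

    def insert(self, obj, vector):
        self.rows.append((obj, vector))


class StoreWithIntegration:
    """The store calls the embedding model on the caller's behalf."""

    def __init__(self, embed):
        self._embed = embed
        self.rows = []

    def insert(self, obj):
        self.rows.append((obj, self._embed(obj["name"])))


product = {"name": "Relaxed Fit Tee", "color": "Red"}

# Pattern 1: two steps on the client side.
manual = StoreWithoutIntegration()
manual.insert(product, toy_embed(product["name"]))

# Pattern 2: one step; the transformation is delegated to the store.
integrated = StoreWithIntegration(toy_embed)
integrated.insert(product)

print(manual.rows == integrated.rows)
```

Both stores end up with the same data; the difference is only which side is responsible for calling the model.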

Distance Metrics and Similarity

Distance metrics are mathematical measures and functions used to determine the distance (similarity) between two elements in a vector space. In the context of embeddings, distance metrics evaluate how far apart two embeddings are. A similarity query search retrieves the embeddings that are similar to a given input based on a distance metric; this input can be a vector embedding, text, or another object. There are several distance metrics. The most popular ones are the following.

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vector embeddings, and it's often used as a distance metric in text analysis and other domains where the magnitude of the vector is less important than the direction.

Figure 8: Cosine

Euclidean Distance

Euclidean distance measures the straight-line distance between two points in Euclidean space.

Figure 9: Euclidean

Manhattan Distance

Manhattan distance (L1 norm) is the sum of the absolute differences between the coordinates of two vectors.

Figure 10: Manhattan

The choice of distance metric has a profound impact on the behavior and performance of ML models. As a general recommendation, use the same distance metric that was used to train the given embedding model.
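
The three metrics above can be written in a few lines of plain Python. This is a minimal sketch using only the standard library; the sample vectors are arbitrary, and production systems typically rely on optimized numerical libraries instead.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between a and b (undefined for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    # Straight-line (L2) distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 norm).
    return sum(abs(x - y) for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
```

Because b points in the same direction as a, the cosine similarity is at its maximum even though the Euclidean and Manhattan distances are not zero. This is why cosine is preferred when magnitude matters less than direction.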

Vector Indexes

Vector indexes are specialized data structures designed to efficiently store, organize, and query high-dimensional vector embeddings. These indexes provide fast search queries in a cost-effective way. There are several indexing strategies that are optimized for handling the complexity and scale of the vector space. Some examples include:

Approximate nearest neighbor (ANN)
Inverted index
Locality-sensitive hashing (LSH)

Generally, each database implements a subset of these index strategies, and in some cases, they are customized for better performance.
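
As one example of these strategies, locality-sensitive hashing can be sketched with random hyperplanes: each vector gets one bit per hyperplane depending on which side it falls on, and vectors with the same bit pattern share a bucket. The function names, the number of planes, and the sample vectors below are assumptions made for this sketch; real implementations use several hash tables and tuned parameters.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes (as normal vectors) through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    buckets = {}
    for vec_id, vector in vectors.items():
        buckets.setdefault(lsh_signature(vector, hyperplanes), []).append(vec_id)
    return buckets


planes = make_hyperplanes(num_planes=4, dim=3)
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],      # same direction as "a"
    "a_flipped": [-1.0, -2.0, -3.0],  # opposite direction
}
index = build_index(vectors, planes)

# Only the bucket matching the query's signature is scanned.
query = [3.0, 6.0, 9.0]
candidates = index.get(lsh_signature(query, planes), [])
print(candidates)
```

The search inspects a single bucket instead of every vector, trading some recall for speed, which is the general idea behind approximate indexes.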

Scalability

Vector databases are usually highly scalable solutions that support vertical and horizontal scaling. Horizontal scaling is based on two fundamental strategies: sharding and replication. Both strategies are crucial for managing large-scale and distributed databases.

Sharding

Sharding involves dividing a database into smaller, more manageable pieces called shards. Each shard contains a subset of the database's data, making it responsible for a particular segment of the data.

Table 2: Key sharding advantages and considerations

Advantages	Considerations
By distributing the data across multiple servers, sharding can reduce the load on any single server, leading to improved performance.	Implementing sharding can be complex, especially in terms of data distribution, shard management, and query processing across shards.
Sharding allows a database to scale by adding more shards across additional servers, effectively handling more data and users without degradation in performance.	Ensuring even distribution of data and avoiding hotspots where one shard receives significantly more queries than others can be challenging.
It can be cost effective to add more servers with moderate specifications than to scale up a single server with high specifications.	Query throughput does not improve when adding more sharded nodes.
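
A common way to decide which shard owns an object is to hash its key. The sketch below assumes simple modulo-based routing with hypothetical product IDs; it is not how any specific database assigns shards, and it ignores resharding, which is where much of the complexity mentioned in Table 2 comes from.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Assign every key to its shard and return the shard contents."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical product IDs, used only to show the distribution.
product_ids = [f"product-{i}" for i in range(1000)]
shards = distribute(product_ids, num_shards=4)
for shard_id, members in shards.items():
    print(shard_id, len(members))
```

A well-mixed hash spreads keys roughly evenly, which helps avoid the hotspots described above, although query traffic can still be uneven if some objects are far more popular than others.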

Replication

Replication involves creating copies of a database on multiple nodes within the cluster.

Table 3: Key advantages and considerations for replication

Advantages	Considerations
Replication ensures that the database remains available for read operations even if some servers are down.	Maintaining data consistency across replicas, especially in write-heavy environments, can be challenging and may require sophisticated synchronization mechanisms.
Replication provides a mechanism for disaster recovery as data is backed up across multiple locations.	Replication requires additional storage and network resources, as data is duplicated across multiple servers.
Replication can improve the read scalability of a database system by allowing read queries to be distributed across multiple replicas.	In asynchronous replication setups, there can be a lag between when data is written to the primary index and when it is replicated to the secondary indexes. This lag can impact applications that require real-time or near-real-time data consistency across replicas.
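
The read-availability and read-scalability points in Table 3 can be illustrated with a toy model. The ReplicatedStore class below is hypothetical: it writes synchronously to every replica and serves reads in round-robin order, skipping replicas that are marked as down. Real systems add consistency levels, asynchronous replication, and failure detection.

```python
import itertools


class ReplicatedStore:
    """Toy replica set: writes go everywhere, reads rotate."""

    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self._next = itertools.cycle(range(num_replicas))
        self.down = set()  # indexes of replicas that are unavailable

    def write(self, key, value):
        # Synchronous replication: every replica gets the write.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Round-robin over replicas, skipping the ones marked as down.
        for _ in range(len(self.replicas)):
            idx = next(self._next)
            if idx not in self.down:
                return idx, self.replicas[idx][key]
        raise RuntimeError("no replica available")


store = ReplicatedStore(num_replicas=3)
store.write("tee-red", [0.9, 0.1])
print([store.read("tee-red")[0] for _ in range(3)])

store.down.add(1)  # simulate a failed node
print([store.read("tee-red")[0] for _ in range(4)])
```

Reads keep succeeding after replica 1 is marked as down, at the cost of storing every object three times.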

Use Cases

Vector databases and embeddings are crucial for several key use cases, including semantic search, vector data in generative AI, and more.

Semantic Search

You can retrieve information by leveraging the capabilities of vector embeddings to understand and match the semantic context of queries with relevant content. Searches are performed by calculating the similarity between the query vector and the document vectors in the database, using one of the previously explained metrics, such as cosine similarity. Some applications include:

Recommendation systems: Perform similarity searches to find items that match a user's interests, providing accurate and timely recommendations to enhance the user experience.
Customer support: Obtain the most relevant information to solve customers' doubts, questions, or problems.
Knowledge management: Find relevant information quickly from the organization's knowledge base, composed of documents, slides, videos, or reports in enterprise systems.

Vector Data in Generative AI: Retrieval-Augmented Generation

Generative AI and large language models (LLMs) have certain limitations because they must be trained with a large amount of data. Training imposes high costs in terms of time, resources, and money. As a result, these models are usually trained with general contexts and are not constantly updated with the latest information.

Retrieval-augmented generation (RAG) plays a crucial role because it was developed to improve the response quality in specific contexts using a technique that incorporates an external source of relevant and updated information into the generative process. A vector database is particularly well suited for implementing RAG models due to its unique capabilities in handling high-dimensional data, performing efficient similarity searches, and integrating seamlessly with AI/ML workflows.

Figure 11: Overview of RAG architecture

Using vector databases in the RAG integration pattern has the following advantages:

Semantic understanding: Vector embeddings capture the nuanced semantic relationships within data, whether text, images, or audio. This deep understanding is essential for generative models to produce high-quality, realistic outputs that are contextually relevant to the input or prompt.
Dimensionality reduction: Embeddings represent complex data in a lower-dimensional vector space, reducing vast datasets to a form that is feasible for AI models to process and learn from.
Quality and precision: The precision of similarity search in vector databases ensures that the information retrieved for generation is of high relevance and quality.
Seamless integration: Vector databases provide APIs, SDKs, and tools that make it easy to integrate with various AI/ML frameworks. This flexibility facilitates the development and deployment of RAG models, allowing researchers and developers to focus on model optimization rather than data management challenges.
Context generation: Vector embeddings capture the semantic essence of text, images, videos, and more, enabling AI models to understand context and generate new content that is contextually similar or related.
Scalability: Vector databases provide a scalable solution that can manage large-scale information without compromising retrieval performance.

Vector databases provide the technological foundation necessary for the effective implementation of RAG models and make them an optimal choice for interaction with large-scale knowledge bases.
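
The retrieval step of RAG can be sketched in a few lines. This example makes strong simplifying assumptions: embed is a toy bag-of-words function standing in for a real embedding model, the knowledge base is a hypothetical in-memory list rather than a vector database, and the final call to an LLM is left out. It only shows how retrieved context is added to the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question, documents, k=2):
    """Augment the question with retrieved context before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base entries, for illustration only.
knowledge_base = [
    "The Trucker Jacket is 100% cotton denim with a point collar.",
    "The Relaxed Fit Tee is 100% cotton jersey with short sleeves.",
    "Store opening hours are 9am to 8pm on weekdays.",
]

prompt = build_prompt("What is the Trucker Jacket made of?", knowledge_base, k=1)
print(prompt)  # this string would then be sent to the LLM of your choice
```

In a real pipeline, retrieve would be a similarity query against the vector database, and the knowledge base could be updated without retraining the model.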

Other Specific Use Cases

Beyond the main use cases discussed above are several others, such as:

Anomaly detection: Embeddings capture the nuanced relationships and patterns within the data, making it possible to detect anomalies that might not be evident through traditional methods.
Retail comparable products: By converting product features into vector embeddings, retailers can quickly find products that are similar in characteristics (e.g., design, material, price, sales).
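
For the anomaly detection case, one simple approach is to flag embeddings that lie unusually far from the centroid of the data. The sketch below is only one of many possible techniques; the function names, the threshold of two standard deviations, and the sample points are assumptions made for illustration.

```python
import math
import statistics


def centroid(vectors):
    """Average the vectors dimension by dimension."""
    return [statistics.fmean(dim) for dim in zip(*vectors)]


def find_anomalies(vectors, threshold=2.0):
    """Flag vectors whose distance to the centroid is unusually large.

    A vector is flagged when its distance exceeds the mean distance by
    more than `threshold` standard deviations (an arbitrary choice here).
    """
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    mean = statistics.fmean(distances)
    spread = statistics.pstdev(distances)
    return [
        i for i, d in enumerate(distances)
        if spread and (d - mean) / spread > threshold
    ]


# Nine invented points clustered around (1, 1) and one far away.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),
    (0.8, 1.0), (1.0, 0.8), (1.1, 1.1), (0.9, 0.9), (10.0, 10.0),
]
print(find_anomalies(points))
```

Only the index of the distant point is reported; the clustered points stay below the threshold.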

Section 4

Vector Databases: Getting Started

To get started, we have conducted a practical exercise below that demonstrates the use of a vector database for identifying comparable products in a fashion retail scenario (i.e., semantic search use case). We'll go through setting up the environment, loading fashion product data into the open-source vector database, and querying it to find similar items.

For the environment, ensure the following tools are installed:

Docker 24 or higher
Docker Compose v2
Python 3.8 or higher

Data Sample

The following is a list of the datasets that we will use during this practical exercise based on the concepts explained in previous sections:

Table 4: Data sample

Name	Section	Family	Fit	Composition	Color
Relaxed Fit Tee	Men	T-shirts	Non-stretch, Relaxed fit	100% cotton. Jersey. Crewneck, Short sleeves	Red
Relaxed Fit Tee	Men	T-shirts	Non-stretch, Relaxed fit	100% cotton. Jersey. Crewneck, Short sleeves	Green
Trucker Jacket	Men	Jackets	Standard fit	100% cotton, Denim, Point collar, Long sleeves	Gray
Slim Welt Pocket Jeans	Women	Jeans	Mid rise: 8 3/4'', Inseam: 30'', Leg opening: 13''	62% cotton, 28% viscose (ECOVERO™), 8% elastomultiester, 2% elastane, Denim, Stretch, Zip fly, 5-pocket styling	Black
Baggy Dad Utility Pants	Women	Jeans	Mid rise, Straight leg	95% cotton, 5% recycled cotton, Denim, No Stretch	Green
The Perfect Tee	Women	T-shirts	Standard fit, Model wears a size small	100% cotton, Crewneck, Short sleeves	White
Lelou Shrunken Moto Jacket	Women	Jackets	Slim fit	100% polyurethane — releases plastic microfibers into the environment during washing, Long sleeves	Black

Step 1: Start Up Your Vector Database

In this example, we are going to use the following Docker Compose file to locally run our vector database instance, using the open-source Weaviate vector database in the following configuration:

    YAML

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.24.4
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      ENABLE_MODULES: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:
...

In this example, the most relevant part is the modules' configuration:

DEFAULT_VECTORIZER_MODULE is the vectorization module used by default to transform objects into embeddings. Without a vectorizer, you would need to provide a vector manually for each data point that you add.
TRANSFORMERS_INFERENCE_API is the location of the inference API. In our case, this service runs in a separate container defined in the same Docker Compose file.
ENABLE_MODULES lists the modules that are enabled inside Weaviate. We are going to use text2vec-transformers to vectorize the products' data objects.
t2v-transformers is the service that runs the image providing the text2vec-transformers inference API.

Once we create the Docker Compose file, all we have to do is execute it:

    Shell
   
# Docker Compose starts two containers: the Weaviate database and t2v-transformers
$ sudo docker compose up -d

To check if our vector database is running, we will run the following commands:

    Shell
   
   # Check if the container's status is up.
$ sudo docker ps

CONTAINER ID                         …                       STATUS
16dbc16744a8                        …                       Up 2 minutes                    
cb4175cec9a2                         …                       Up 2 minutes

# Check database status by querying the API
$ curl -X GET http://localhost:8080/v1/meta

# In case of error, check the logs
$ sudo docker compose logs -f --tail 100 weaviate
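
The same check can be scripted from Python before loading any data. The sketch below assumes the database is reachable at http://localhost:8080, as configured in the Docker Compose file, and polls Weaviate's readiness endpoint; the is_ready helper name and the timeout value are choices made for this example.

```python
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    url = f"{base_url}/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer.
        return False


if __name__ == "__main__":
    print("Weaviate ready:", is_ready())
```

A script can call is_ready in a short retry loop to wait until the containers have finished starting.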

Step 2: Install the Client Library

Next, install the Weaviate Python client:

    Shell
   
   $ pip install weaviate-client

Step 3: Preparing Your Fashion Retail Data

Prepare a dataset of fashion retail products based on Table 4. Each product should have attributes like name, fit, or composition.

    Python
   
 

   products_data = [
    {
        "name": "Relaxed Fit Tee",
        "section": "MEN",
        "family": "T-SHIRTS",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Red"
    },
    {
        "name": "Relaxed Fit Tee",
        "section": "MEN",
        "family": "T-SHIRTS",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Green"
    },
    {
        "name": "TRUCKER JACKET",
        "section": "MEN",
        "family": "JACKETS",
        "fit": "Standard fit",
        "composition": "100% cotton, Denim, Point collar, Long sleeves",
        "color": "Gray"
    },
    {
        "name": "SLIM WELT POCKET JEANS",
        "section": "WOMEN",
        "family": "JEANS",
        "fit": "Mid rise: 8 3/4'', Inseam: 30'', Leg opening: 13''",
        "composition": "62% cotton, 28% viscose (ECOVERO™), 8% elastomultiester, 2% elastane, Denim, Stretch, Zip fly, 5-pocket styling",
        "color": "Black"
    },
    {
        "name": "BAGGY DAD UTILITY PANTS",
        "section": "WOMEN",
        "family": "JEANS",
        "fit": "Mid rise, Straight leg",
        "composition": "95% cotton, 5% recycled cotton, Denim, No Stretch",
        "color": "Green"
    },
    {
        "name": "THE PERFECT TEE",
        "section": "WOMEN",
        "family": "T-SHIRTS",
        "fit": "Standard fit, Model wears a size small",
        "composition": "100% cotton, Crewneck, Short sleeves",
        "color": "White"
    },
    {
        "name": "LELOU SHRUNKEN MOTO JACKET",
        "section": "WOMEN",
        "family": "JACKETS",
        "fit": "Slim fit",
        "composition": "100% polyurethane - releases plastic microfibers into the environment during washing, Long sleeves",
        "color": "Black"
    }
]
  

Step 4: Create a Collection

To create a collection, we need to define the collection and schema for the products' data objects. There are two options here:

Create a schema that includes these properties
Let your vector database auto-detect and generate the properties automatically

In this case, we are going to use the second option, using Weaviate as our example:

    Python

import weaviate
import weaviate.classes as wvc

# Placeholder: replace with the full list defined in Step 3
products_data = [...]

# Connect with default parameters
client = weaviate.connect_to_local()

try:
    # Check if the connection was successful
    if client.is_ready():
        print("Successfully connected to Weaviate.")

    # Create the collection; its properties are auto-detected on insert
    products_collection = client.collections.create(
        name="Products",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=True
        ),
    )

    # Build one object per product with the properties to store
    products_objs = []
    for d in products_data:
        products_objs.append({
            "name": d["name"],
            "section": d["section"],
            "family": d["family"],
            "fit": d["fit"],
            "composition": d["composition"],
            "color": d["color"],
        })

    # Insert all objects in a single request
    products_collection.data.insert_many(products_objs)

finally:
    # Always release the connection, even if an error occurs
    client.close()

Step 5: Similarity Query

Once your data is indexed, we can query for similar products using Weaviate's vector search capabilities. For example, to find products similar to a "Red T-Shirt" or "Jeans for women," you can use a search query with its description:

    Python
   
 

   import weaviate
import weaviate.classes as wvc

# Connect with default parameters
client = weaviate.connect_to_local()

# Check if the connection was successful
try:
    client.is_ready()
    print("Successfully connected to Weaviate.")

    products  = client.collections.get("Products")

    response = products.query.near_text(
        query="Red T-Shirt",
        return_metadata=wvc.query.MetadataQuery(distance=True),
        limit=2,
        return_properties=["name", "family", "color"]
    )

    for o in response.objects:
        print(o.properties)
        print(o.metadata.distance)
finally:
    client.close()
  

This query uses the near_text function to find products that are semantically similar to the given text. Weaviate vectorizes the query with the configured model and returns the objects whose vector embeddings are closest to it.

Step 6: Output

The output of this query returns the two closest products, including some of the object properties and the distance:

    Shell
   
 

   Successfully connected to Weaviate.
{'family': 'T-SHIRTS', 'color': 'Red', 'name': 'Relaxed Fit Tee'}
0.0
{'family': 'T-SHIRTS', 'color': 'White', 'name': 'THE PERFECT TEE'}
0.0
  

Section 5

Conclusion

This Refcard provides an overview of vector database fundamentals as well as a practical application in fashion retail. By customizing the dataset and queries, you can explore the full potential of vector databases for similarity searches and other AI-driven applications. This is just a starting point in the world of vectors. ML models and vectors are powerful tools in the area of machine learning and artificial intelligence, offering a nuanced and high-dimensional representation of complex data.

Vector databases are not a magical solution that provides immediate value. Like a good wine, they require careful experimentation, parameter optimization, and ongoing evaluation to deliver their best results.
