Semantic Search With Weaviate Vector Database
Learn how to implement a vector similarity search using Weaviate, a vector database.
In a previous blog, the influence of the document format and the way it is embedded in combination with semantic search was discussed. LangChain4j was used to accomplish this. One of the main conclusions was that the way the document is embedded has a major influence on the results. However, a perfect result was not achieved.
In this post, you will take a look at Weaviate, a vector database that has a Java client library available. You will investigate whether better results can be achieved.
The source documents are two Wikipedia documents. You will use the discography and list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. Parts of these documents are converted to Markdown in order to have a better representation. The same documents were used in the previous blog, so it will be interesting to see how the findings from that post compare to the approach used in this post.
The sources used in this blog can be found on GitHub.
Prerequisites
The prerequisites for this blog are:
- Basic knowledge of embedding and vector stores
- Basic Java knowledge, Java 21 is used
- Basic knowledge of Docker
The Weaviate starter guides are also interesting reading material.
How to Implement Vector Similarity Search
1. Installing Weaviate
There are several ways to install Weaviate. An easy installation is through Docker Compose. Just use the sample Docker Compose file.
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.23.2
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-cohere,text2vec-huggingface,text2vec-palm,text2vec-openai,generative-openai,generative-cohere,generative-palm,ref2vec-centroid,reranker-cohere,qna-openai'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:
Start the Compose file from the root of the repository.
$ docker compose -f docker/compose-initial.yaml up
You can shut it down with CTRL+C or with the following command:
$ docker compose -f docker/compose-initial.yaml down
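Before connecting from Java, it can help to wait until the container actually accepts requests. The following is a minimal sketch that polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) with the JDK HTTP client. The class name `WeaviateReadiness`, the method names, and the timeout values are assumptions made for this illustration; they are not part of the repository that accompanies this blog.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of the blog's repository.
class WeaviateReadiness {

    // Builds the readiness URL for a base address such as "http://localhost:8080".
    static URI readyUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(trimmed + "/v1/.well-known/ready");
    }

    // Returns true when the endpoint answers with HTTP 200,
    // false for any other status or when the connection fails.
    static boolean isReady(HttpClient http, String baseUrl) {
        HttpRequest request = HttpRequest.newBuilder(readyUri(baseUrl))
                .timeout(Duration.ofSeconds(2)) // assumed timeout
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    http.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Polls until the instance is ready or maxAttempts is exhausted.
    static boolean waitUntilReady(String baseUrl, int maxAttempts, long pauseMillis)
            throws InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (isReady(http, baseUrl)) {
                return true;
            }
            Thread.sleep(pauseMillis);
        }
        return false;
    }
}
```

A call such as `WeaviateReadiness.waitUntilReady("http://localhost:8080", 30, 1000)` would then block for at most roughly half a minute before giving up.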
2. Connect to Weaviate
First, let’s try to connect to Weaviate through the Java library. Add the following dependency to the pom file:
<dependency>
<groupId>io.weaviate</groupId>
<artifactId>client</artifactId>
<version>4.5.1</version>
</dependency>
The following code will create a connection to Weaviate and display some metadata information about the instance.
Config config = new Config("http", "localhost:8080");
WeaviateClient client = new WeaviateClient(config);
Result<Meta> meta = client.misc().metaGetter().run();
if (meta.getError() == null) {
    System.out.printf("meta.hostname: %s\n", meta.getResult().getHostname());
    System.out.printf("meta.version: %s\n", meta.getResult().getVersion());
    System.out.printf("meta.modules: %s\n", meta.getResult().getModules());
} else {
    System.out.printf("Error: %s\n", meta.getError().getMessages());
}
The output is the following:
meta.hostname: http://[::]:8080
meta.version: 1.23.2
meta.modules: {generative-cohere={documentationHref=https://docs.cohere.com/reference/generate, name=Generative Search - Cohere}, generative-openai={documentationHref=https://platform.openai.com/docs/api-reference/completions, name=Generative Search - OpenAI}, generative-palm={documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/chat/test-chat-prompts, name=Generative Search - Google PaLM}, qna-openai={documentationHref=https://platform.openai.com/docs/api-reference/completions, name=OpenAI Question & Answering Module}, ref2vec-centroid={}, reranker-cohere={documentationHref=https://txt.cohere.com/rerank/, name=Reranker - Cohere}, text2vec-cohere={documentationHref=https://docs.cohere.ai/embedding-wiki/, name=Cohere Module}, text2vec-huggingface={documentationHref=https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task, name=Hugging Face Module}, text2vec-openai={documentationHref=https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, name=OpenAI Module}, text2vec-palm={documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings, name=Google PaLM Module}}
The output shows the version and the activated modules, which correspond to the modules enabled in the Docker Compose file.
3. Embed Documents
In order to query the documents, the documents need to be embedded first. This can be done by means of the text2vec-transformers module. Create a new Docker Compose file with only the text2vec-transformers module enabled. You also set this module as DEFAULT_VECTORIZER_MODULE, set TRANSFORMERS_INFERENCE_API to the transformer container, and use the sentence-transformers-all-MiniLM-L6-v2-onnx image for the transformer container. You use the ONNX image when you do not make use of a GPU.
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.23.2
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2-onnx
volumes:
  weaviate_data:
Start the containers:
$ docker compose -f docker/compose-embed.yaml up
Embedding the data is an important step that needs to be executed thoroughly. It is therefore important to know the Weaviate concepts.
- Every data object belongs to a Class, and a class has one or more Properties.
- A Class can be seen as a collection, and every data object (represented as a JSON document) can be represented by a vector (i.e. an embedding).
- Every Class contains objects which belong to this class, which corresponds to a common schema.
Three markdown files with data about Bruce Springsteen are available. The embedding will be done as follows:
- Every markdown file will be converted to a Weaviate Class.
- A markdown file starts with a header. The header contains the column names, which will be converted into Weaviate Properties. Properties need to be valid GraphQL names. Therefore, the column names have been altered a bit compared to the previous blog. E.g. writer(s) has become writers, album details has become AlbumDetails, etc.
- After the header, the data is present. Every row in the table will be converted to a data object belonging to a Class.
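The renaming of the column names could also be automated. A GraphQL name must match `[_A-Za-z][_0-9A-Za-z]*`, so the sketch below drops every other character and joins multiple words in camel case. The class `PropertyNames` and the method `toPropertyName` are hypothetical and only illustrate the idea; the renaming rules the author actually applied to the markdown files are not described beyond the two examples above.

```java
// Hypothetical helper that turns a table column name into a valid GraphQL name.
class PropertyNames {

    static String toPropertyName(String columnName) {
        StringBuilder name = new StringBuilder();
        for (String word : columnName.strip().split("\\s+")) {
            // Keep only characters that are allowed in a GraphQL name
            String cleaned = word.replaceAll("[^A-Za-z0-9_]", "");
            if (cleaned.isEmpty()) {
                continue;
            }
            if (name.length() == 0) {
                // The first word is kept as written
                name.append(cleaned);
            } else {
                // Following words start with an upper case letter
                name.append(Character.toUpperCase(cleaned.charAt(0)))
                    .append(cleaned.substring(1));
            }
        }
        if (name.length() == 0) {
            throw new IllegalArgumentException("No usable characters in: " + columnName);
        }
        // A GraphQL name may not start with a digit
        if (Character.isDigit(name.charAt(0))) {
            name.insert(0, '_');
        }
        return name.toString();
    }
}
```

With this sketch, `toPropertyName("writer(s)")` gives `writers` and `toPropertyName("Album details")` gives `AlbumDetails`.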
An example of a markdown file is the Compilation Albums file.
| Title | US | AUS | CAN | GER | IRE | NLD | NZ | NOR | SWE | UK |
|----------------------------------|----|-----|-----|-----|-----|-----|----|-----|-----|----|
| Greatest Hits | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| Tracks | 27 | 97 | — | 63 | — | 36 | — | 4 | 11 | 50 |
| 18 Tracks | 64 | 98 | 58 | 8 | 20 | 69 | — | 2 | 1 | 23 |
| The Essential Bruce Springsteen | 14 | 41 | — | — | 5 | 22 | — | 4 | 2 | 15 |
| Greatest Hits | 43 | 17 | 21 | 25 | 2 | 4 | 3 | 3 | 1 | 3 |
| The Promise | 16 | 22 | 27 | 1 | 4 | 4 | 30 | 1 | 1 | 7 |
| Collection: 1973–2012 | — | 6 | — | 23 | 2 | 78 | 19 | 1 | 6 | — |
| Chapter and Verse | 5 | 2 | 21 | 4 | 2 | 5 | 4 | 3 | 2 | 2 |
In the next sections, the steps taken to embed the documents are explained in more detail. The complete source code is available at GitHub. This is not the cleanest code, but I do hope it is understandable.
3.1 Basic Setup
A map is created, which contains the file names linked to the Weaviate Class names to be used.
private static Map<String, String> documentNames = Map.of(
"bruce_springsteen_list_of_songs_recorded.md", "Songs",
"bruce_springsteen_discography_compilation_albums.md", "CompilationAlbums",
"bruce_springsteen_discography_studio_albums.md", "StudioAlbums");
In the basic setup, a connection is set up to Weaviate, all data is removed from the database, and the files are read. Every file is then processed one by one.
Config config = new Config("http", "localhost:8080");
WeaviateClient client = new WeaviateClient(config);
// Remove existing data
Result<Boolean> deleteResult = client.schema().allDeleter().run();
if (deleteResult.hasErrors()) {
    // Print the error messages, the result itself carries no information when the call failed
    System.out.println("Removing existing data failed: " + deleteResult.getError().getMessages());
}
List<Document> documents = loadDocuments(toPath("markdown-files"));
for (Document document : documents) {
    ...
}
3.2 Convert Header to Class
The header information needs to be converted to a Weaviate Class.
- Split the complete file row by row.
- The first line contains the header. Split it by means of the | separator and store it in variable tempSplittedHeader.
- The header starts with a | and therefore the first entry in tempSplittedHeader is empty. Remove it and store the remaining part of the row in variable splittedHeader.
- For every item in splittedHeader (i.e. the column names), a Weaviate Property is created. Strip all leading and trailing spaces from the data.
- Create the Weaviate documentClass with the class name as defined in the documentNames map and the just created Properties.
- Add the class to the schema and verify the result.
// Split the document line by line
String[] splittedDocument = document.text().split("\n");
// split the header on | and remove the first item (the line starts with | and the first item is therefore empty)
String[] tempSplittedHeader = splittedDocument[0].split("\\|");
String[] splittedHeader = Arrays.copyOfRange(tempSplittedHeader, 1, tempSplittedHeader.length);
// Create the Weaviate collection, every item in the header is a Property
ArrayList<Property> properties = new ArrayList<>();
for (String splittedHeaderItem : splittedHeader) {
    Property property = Property.builder().name(splittedHeaderItem.strip()).build();
    properties.add(property);
}
WeaviateClass documentClass = WeaviateClass.builder()
        .className(documentNames.get(document.metadata("file_name")))
        .properties(properties)
        .build();
// Add the class to the schema
Result<Boolean> collectionResult = client.schema().classCreator()
        .withClass(documentClass)
        .run();
if (collectionResult.hasErrors()) {
    System.out.println("Creation of collection failed: " + documentNames.get(document.metadata("file_name")));
}
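To see what the split on the | separator does with a header row, the following standalone sketch uses only the JDK. It relies on the fact that `String.split` keeps the leading empty string (the text before the first |) and drops trailing empty strings (the text after the last |). The class `HeaderSplitDemo` and the method `columnNames` are names chosen for this illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Standalone illustration of the header split, independent of Weaviate.
class HeaderSplitDemo {

    static List<String> columnNames(String headerLine) {
        // "| Title | US |" becomes ["", " Title ", " US "]
        String[] cells = headerLine.split("\\|");
        if (cells.length < 2) {
            // Nothing after the leading separator
            return List.of();
        }
        // Skip the empty first entry and strip the surrounding spaces
        return Arrays.stream(cells, 1, cells.length)
                .map(String::strip)
                .toList();
    }
}
```

For the header of the Compilation Albums file, `columnNames` returns the eleven column names from Title up to and including UK.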
3.3 Convert Data Rows to Objects
Every data row needs to be converted to a Weaviate data object.
- Copy the rows containing data in variable dataOnly.
- Loop over every row. A row is represented by variable documentLine.
- Split every line by means of the | separator and store it in variable tempSplittedDocumentLine.
- Just like the header, every row starts with a |, and therefore, the first entry in tempSplittedDocumentLine is empty. Remove it and store the remaining part of the row in variable splittedDocumentLine.
- Every item in the row becomes a property. The complete row is converted to properties in variable propertiesDocumentLine. Strip all leading and trailing spaces from the data.
- Add the data object to the Class and verify the result.
- At the end, print the result.
// Preserve only the rows containing data, the first two rows contain the header
String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length);
for (String documentLine : dataOnly) {
    // split a data row on | and remove the first item (the line starts with | and the first item is therefore empty)
    String[] tempSplittedDocumentLine = documentLine.split("\\|");
    String[] splittedDocumentLine = Arrays.copyOfRange(tempSplittedDocumentLine, 1, tempSplittedDocumentLine.length);
    // Every item becomes a property
    HashMap<String, Object> propertiesDocumentLine = new HashMap<>();
    int i = 0;
    for (Property property : properties) {
        propertiesDocumentLine.put(property.getName(), splittedDocumentLine[i].strip());
        i++;
    }
    Result<WeaviateObject> objectResult = client.data().creator()
            .withClassName(documentNames.get(document.metadata("file_name")))
            .withProperties(propertiesDocumentLine)
            .run();
    if (objectResult.hasErrors()) {
        System.out.println("Creation of object failed: " + propertiesDocumentLine);
        // Nothing was stored for this row, so there is no object to print
        continue;
    }
    String json = new GsonBuilder().setPrettyPrinting().create().toJson(objectResult.getResult());
    System.out.println(json);
}
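The loop above assumes that every data row has at least as many cells as the header has columns; a shorter row would end in an ArrayIndexOutOfBoundsException. The sketch below is a more defensive variant of the row conversion that fills missing cells with an empty string. It is a standalone illustration using only the JDK; the class `RowMapper`, the method `toProperties`, and the choice of an empty string as the fallback are assumptions and not taken from the repository.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, defensive variant of the row-to-properties conversion.
class RowMapper {

    static Map<String, Object> toProperties(List<String> propertyNames, String dataLine) {
        String line = dataLine.strip();
        // A limit of -1 keeps trailing empty cells
        String[] cells = line.split("\\|", -1);
        // Ignore the empty pieces before the first and after the last separator
        int start = line.startsWith("|") ? 1 : 0;
        int end = line.endsWith("|") ? cells.length - 1 : cells.length;

        Map<String, Object> properties = new LinkedHashMap<>();
        for (int i = 0; i < propertyNames.size(); i++) {
            int cellIndex = start + i;
            // Assumed fallback: a missing cell becomes an empty string
            String value = cellIndex < end ? cells[cellIndex].strip() : "";
            properties.put(propertyNames.get(i), value);
        }
        return properties;
    }
}
```

The resulting map has the same shape as propertiesDocumentLine, so it could be handed to withProperties in the same way.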
3.4 The Result
Running the code to embed the documents prints what is stored in the Weaviate vector database. As you can see below, a data object has a UUID, the class is StudioAlbums, the properties are listed and the corresponding vector is displayed.
{
"id": "e0d5e1a3-61ad-401d-a264-f95a9a901d82",
"class": "StudioAlbums",
"creationTimeUnix": 1705842658470,
"lastUpdateTimeUnix": 1705842658470,
"properties": {
"aUS": "3",
"cAN": "8",
"gER": "1",
"iRE": "2",
"nLD": "1",
"nOR": "1",
"nZ": "4",
"sWE": "1",
"title": "Only the Strong Survive",
"uK": "2",
"uS": "8"
},
"vector": [
-0.033715352,
-0.07489116,
-0.015459526,
-0.025204511,
...
0.03576842,
-0.010400549,
-0.075309984,
-0.046005197,
0.09666792,
0.0051724687,
-0.015554721,
0.041699238,
-0.09749843,
0.052182134,
-0.0023900834
]
}
4. Manage Collections
So, now you have data in the vector database. What kind of information can be retrieved from the database? You are able to manage the collection, for example.
4.1 Retrieve Collection Definition
The definition of a collection can be retrieved as follows:
String className = "CompilationAlbums";
Result<WeaviateClass> result = client.schema().classGetter()
.withClassName(className)
.run();
String json = new GsonBuilder().setPrettyPrinting().create().toJson(result.getResult());
System.out.println(json);
The output is the following:
{
"class": "CompilationAlbums",
"description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024",
"invertedIndexConfig": {
"bm25": {
"k1": 1.2,
"b": 0.75
},
"stopwords": {
"preset": "en"
},
"cleanupIntervalSeconds": 60
},
"moduleConfig": {
"text2vec-transformers": {
"poolingStrategy": "masked_mean",
"vectorizeClassName": true
}
},
"properties": [
{
"name": "uS",
"dataType": [
"text"
],
"description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024",
"tokenization": "word",
"indexFilterable": true,
"indexSearchable": true,
"moduleConfig": {
"text2vec-transformers": {
"skip": false,
"vectorizePropertyName": false
}
}
},
...
}
You can see how it was vectorized, the properties, etc.
4.2 Retrieve Collection Objects
Can you also retrieve the collection objects? Yes, you can, but this is not possible at the moment of writing with the Java client library. You will notice, when browsing the Weaviate documentation, that there is no example code for the Java client library. However, you can make use of the GraphQL API, which can also be called from Java code. The code to retrieve the title property of every data object in the CompilationAlbums Class is the following:
- You call the graphQL method from the Weaviate client.
- You define the Weaviate Class and the fields you want to retrieve.
- You print the result.
Field song = Field.builder().name("title").build();
Result<GraphQLResponse> result = client.graphQL().get()
        .withClassName("CompilationAlbums")
        .withFields(song)
        .run();
if (result.hasErrors()) {
    System.out.println(result.getError());
    return;
}
System.out.println(result.getResult());
The result shows you all the titles:
GraphQLResponse(
data={
Get={
CompilationAlbums=[
{title=Chapter and Verse},
{title=The Promise},
{title=Greatest Hits},
{title=Tracks},
{title=18 Tracks},
{title=The Essential Bruce Springsteen},
{title=Collection: 1973–2012},
{title=Greatest Hits}
]
}
},
errors=null)
5. Semantic Search
The whole purpose of embedding the documents is to verify whether you can search the documents. In order to search, you also need to make use of the GraphQL API. Different search operators are available. Just like in the previous blog, five questions are asked about the data.
- on which album was “adam raised a cain” originally released? The answer is “Darkness on the Edge of Town”.
- what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US? The answer is #60.
- what is the highest chart position of the album “tracks” in canada? The album did not have a chart position in Canada.
- in which year was “Highway Patrolman” released? The answer is 1982.
- who produced “all or nothin’ at all”? The answer is Jon Landau, Chuck Plotkin, Bruce Springsteen and Roy Bittan.
In the source code, you provide the class name and the corresponding fields. This information is added in a static class for each collection. The code contains the following:
- Create a connection to Weaviate.
- Add the fields of the class and also add two additional fields, the certainty and the distance.
- Embed the question using a NearTextArgument.
- Search the collection via the GraphQL API, limit the result to 1.
- Print the result.
private static void askQuestion(String className, Field[] fields, String question) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);
    Field additional = Field.builder()
            .name("_additional")
            .fields(Field.builder().name("certainty").build(), // only supported if distance==cosine
                    Field.builder().name("distance").build()   // always supported
            ).build();
    Field[] allFields = Arrays.copyOf(fields, fields.length + 1);
    allFields[fields.length] = additional;
    // Embed the question
    NearTextArgument nearText = NearTextArgument.builder()
            .concepts(new String[]{question})
            .build();
    Result<GraphQLResponse> result = client.graphQL().get()
            .withClassName(className)
            .withFields(allFields)
            .withNearText(nearText)
            .withLimit(1)
            .run();
    if (result.hasErrors()) {
        System.out.println(result.getError());
        return;
    }
    System.out.println(result.getResult());
}
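The Java client builds a GraphQL query from these builder calls. To get a feeling for what such a query looks like, the sketch below assembles a comparable query string by hand, following Weaviate's GraphQL Get syntax with a nearText argument. The class `NearTextQuery` and its `build` method are hypothetical, and the exact text the client library sends may differ. Only backslashes and double quotes in the question are escaped here, which is a simplification.

```java
import java.util.List;

// Hypothetical helper that writes out a Get query with a nearText argument.
class NearTextQuery {

    static String build(String className, List<String> fields, String question, int limit) {
        // Escape backslashes and double quotes so the question fits in a GraphQL string
        String escaped = question.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{ Get { " + className
                + "(nearText: {concepts: [\"" + escaped + "\"]}, limit: " + limit + ") { "
                + String.join(" ", fields)
                + " _additional { certainty distance } } } }";
    }
}
```

Such a string could be posted to the GraphQL endpoint of the Weaviate instance, which is a convenient way to try out a query outside of Java.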
Invoke this method for the five questions.
askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?");
askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "what is the highest chart position of the album \"tracks\" in canada?");
askQuestion(Song.NAME, Song.getFields(), "in which year was \"Highway Patrolman\" released?");
askQuestion(Song.NAME, Song.getFields(), "who produced \"all or nothin' at all?\"");
The result is amazing: for all five questions, the correct data object is returned.
GraphQLResponse(
data={
Get={
Songs=[
{_additional={certainty=0.7534831166267395, distance=0.49303377},
originalRelease=Darkness on the Edge of Town,
producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant),
song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978}
]
}
},
errors=null)
GraphQLResponse(
data={
Get={
StudioAlbums=[
{_additional={certainty=0.803815484046936, distance=0.39236903},
aUS=71,
cAN=—,
gER=—,
iRE=—,
nLD=—,
nOR=—,
nZ=—,
sWE=35,
title=Greetings from Asbury Park,N.J., uK=41, uS=60}
]
}
},
errors=null)
GraphQLResponse(
data={
Get={
CompilationAlbums=[
{_additional={certainty=0.7434340119361877, distance=0.513132},
aUS=97,
cAN=—,
gER=63,
iRE=—,
nLD=36,
nOR=4,
nZ=—,
sWE=11,
title=Tracks,
uK=50,
uS=27}
]
}
},
errors=null)
GraphQLResponse(
data={
Get={
Songs=[
{_additional={certainty=0.743279218673706, distance=0.51344156},
originalRelease=Nebraska,
producers=Bruce Springsteen,
song="Highway Patrolman",
writers=Bruce Springsteen,
year=1982}
]
}
},
errors=null)
GraphQLResponse(
data={
Get={
Songs=[
{_additional={certainty=0.7136414051055908, distance=0.5727172},
originalRelease=Human Touch,
producers=Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan,
song="All or Nothin' at All",
writers=Bruce Springsteen,
year=1992}
]
}
},
errors=null)
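The certainty and the distance in these results carry the same information. For the cosine distance metric, Weaviate derives the certainty from the distance as certainty = 1 - distance / 2, and the values above are consistent with that relation. The small sketch below checks this for the first result; the class and method names are made up for this illustration.

```java
// Illustration of the relation between cosine distance and certainty.
class CertaintyCheck {

    // Valid for the cosine distance metric, where the distance lies between 0 and 2
    static double certaintyFromCosineDistance(double distance) {
        return 1.0 - distance / 2.0;
    }

    public static void main(String[] args) {
        // Distance reported for the question about "Adam Raised a Cain"
        double certainty = certaintyFromCosineDistance(0.49303377);
        System.out.printf("certainty: %.4f%n", certainty);
    }
}
```

A lower distance therefore always means a higher certainty, so either value can be used to rank results.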
6. Explore Collections
The semantic search implementation assumed that you knew in which collection to search for the answer. Most of the time, you do not know which collection to search. The explore function can help to search across multiple collections. There are some limitations to the use of the explore function:
- Only one vectorizer module may be enabled.
- The vector search must be nearText or nearVector.
The askQuestion method becomes the following. Just like in the previous section, you want to return some additional, more generic fields of the collection. The question is embedded in a NearTextArgument and the collections are explored.
private static void askQuestion(String question) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);
    ExploreFields[] fields = new ExploreFields[]{
            ExploreFields.CERTAINTY, // only supported if distance==cosine
            ExploreFields.DISTANCE,  // always supported
            ExploreFields.BEACON,
            ExploreFields.CLASS_NAME
    };
    NearTextArgument nearText = NearTextArgument.builder().concepts(new String[]{question}).build();
    Result<GraphQLResponse> result = client.graphQL().explore()
            .withFields(fields)
            .withNearText(nearText)
            .run();
    if (result.hasErrors()) {
        System.out.println(result.getError());
        return;
    }
    System.out.println(result.getResult());
}
Running this code returns an error for each of the five questions. Because the error message is vague, a bug has been reported.
GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
To work around this error, you can still verify whether the correct answer has the highest certainty over all collections. Therefore, for each question, every collection is queried. The complete code can be found in the GitHub repository; only the code for question 1 is shown below. The askQuestion implementation is the one used in the Semantic Search section.
private static void question1() {
    askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?");
    askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?");
    askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?");
}
Running this code returns the following output.
GraphQLResponse(data={Get={Songs=[{_additional={certainty=0.7534831166267395, distance=0.49303377}, originalRelease=Darkness on the Edge of Town, producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant), song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978}]}}, errors=null)
GraphQLResponse(data={Get={StudioAlbums=[{_additional={certainty=0.657206118106842, distance=0.68558776}, aUS=9, cAN=7, gER=—, iRE=73, nLD=4, nOR=12, nZ=11, sWE=9, title=Darkness on the Edge of Town, uK=14, uS=5}]}}, errors=null)
GraphQLResponse(data={Get={CompilationAlbums=[{_additional={certainty=0.6488107144832611, distance=0.7023786}, aUS=97, cAN=—, gER=63, iRE=—, nLD=36, nOR=4, nZ=—, sWE=11, title=Tracks, uK=50, uS=27}]}}, errors=null)
The interesting parts here are the certainties:
- Collection Songs has a certainty of 0.75
- Collection StudioAlbums has a certainty of 0.66
- Collection CompilationAlbums has a certainty of 0.65
The correct answer can be found in the Songs collection, which has the highest certainty. So, this is great. When you verify this for the other questions, you will see that the collection containing the correct answer always has the highest certainty.
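Comparing the certainties by hand can be replaced by a few lines of code. The sketch below assumes that the certainty of the best hit has already been extracted from each GraphQL response and collected in a map; how to extract it is left out. The class `BestCollection` and the method `highestCertainty` are hypothetical names for this illustration.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper that selects the collection with the highest certainty.
class BestCollection {

    static Optional<String> highestCertainty(Map<String, Double> certaintyByCollection) {
        return certaintyByCollection.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Certainties taken from the output of question 1
        Map<String, Double> certainties = Map.of(
                "Songs", 0.7534831166267395,
                "StudioAlbums", 0.657206118106842,
                "CompilationAlbums", 0.6488107144832611);
        System.out.println(highestCertainty(certainties).orElse("no results"));
    }
}
```

For the values of question 1, the Songs collection is selected. An empty map results in an empty Optional, which covers the case where no collection returned a hit.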
Conclusion
In this post, you transformed the source documents in order to fit in a vector database. The semantic search results are amazing. In the previous posts, it was kind of a struggle to retrieve the correct answers to the questions. By restructuring the data and by only using a vector semantic search, the correct data object was returned for all five questions.
Published at DZone with permission of Gunter Rotsaert, DZone MVB. See the original article here.