Enhancing GenAI Results With the Nearest Neighbor Algorithm
Nearest neighbor algorithms form the foundation of vector search functionality. So, how exactly does nearest neighbor search enhance generative AI?
Forget about artificial intelligence (AI) and all that fancy math for a moment. Let's talk about cheese. Specifically, the cheese you choose when creating a charcuterie board. If you're not familiar, the United States' version of a charcuterie board is (literally) a board of wood or stone with a spread of meats, cheeses, and other tasty bits. If you're doing it right, each meat on the board has been meticulously paired with a specific cheese to complement flavor, texture, and appearance (we eat with our eyes, you know). Creating a board is as much an art as it is a culinary delight.
What makes one cheese better than another? What makes one board better than another? A few distinct characteristics can categorize all cheeses. And one can use these characteristics to craft the perfect board. You could even follow a theme like “cheddar,” “goat’s milk,” or “high-contrast.”
If you’ve spent any time in machine learning (or AI), the term “categorized” probably tipped you off to where we are headed with the cheese board example. In the ML world, information about each cheese would be referred to as the data. The different characteristics of a cheese are known as features. You use a nearest-neighbor algorithm to analyze the data and features to find good cheese pairings.
What Is the Nearest Neighbor Algorithm?
Let’s say you wanted to build an AI application that takes the description of a board and finds complementing cheeses for it. A complement would be a cheese that shares similar characteristics. Cheeses that share similar characteristics will be our definition of the nearest neighbor. Cheddar cheeses share similar attributes, like their texture and how “smelly” they are. Thus, they are neighbors.
On the road to using AI to find complementing cheeses, we are going to need a large amount of data about cheese. So, we will index cheese.com. This gives us factual information about cheese, but also plenty of opinionated discussion about it. All of this data together will be a wealth of information for making decisions.
There will be no “data management” of the stored information. We won’t have cheesemongers combing over new cheese data tagging each entry as a “good fit for these themes” or “complements these other cheeses.” That's the job of the nearest neighbor algorithm.
How Does the Nearest Neighbor Algorithm Work?
A cheese expert would consider all the characteristics together to classify a given cheese. The nearest neighbor algorithm does something similar but in a natural language processing (NLP) way. It doesn’t know what the word “smelly” means. Instead, it compares words (or phrases) from different cheeses against one another. Depending on how similar they are, a probability is returned. That similarity of a word or phrase is known as semantic similarity. That is a core attribute of all nearest-neighbor algorithms.
The return from the nearest neighbor algorithm will never be definitive: "I am positive these cheeses are a perfect fit." It will be a probability that the two cheeses are a good fit, expressed as a number with a bunch of decimals between zero (0) and one (1) (zero meaning don't ever put these cheeses next to one another or a civil war will break out).
The nearest neighbor algorithm analyzes all the data on every request. Classification, categorization, and everything in between will happen at the time of search (i.e., just-in-time results). The search needs to be able to handle an unknown amount of data and an unknown amount of users at any given second. That’s fancy talk for saying it needs to be really, really fast. If you’ve ever attempted text comparisons on a large dataset, you will know that it’s anything but performant. To overcome this, the text is converted to a collection of numbers called vectors. Computers are very good at comparing numbers ;).
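To make the "comparing numbers" idea concrete, here is a minimal sketch using cosine similarity, one common way of measuring how close two vectors are. The cheese names and four-dimensional vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 4-dimensional "embeddings." Real embeddings are produced by a model
# and have hundreds or thousands of dimensions.
brie        = np.array([0.9, 0.1, 0.3, 0.7])
camembert   = np.array([0.8, 0.2, 0.4, 0.6])
pepper_jack = np.array([0.1, 0.9, 0.8, 0.2])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score how similar two vectors are: close to 1.0 = very similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(brie, camembert))    # high score: close neighbors
print(cosine_similarity(brie, pepper_jack))  # lower score: farther apart
```

Comparing a few floating-point numbers like this is orders of magnitude cheaper than comparing raw text, which is exactly why the conversion to vectors happens up front.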
The nearest neighbor algorithm plots all vectors in a multi-dimensional space and uses each of the points to find a neighboring point that is nearest. Different types of nearest-neighbor algorithms consider a neighboring point differently (more on that later).
Continuing with our example application, we gathered a bunch of data about cheese as unstructured text (individual documents) in an S3 bucket. Next, each document needs to be converted to a numeric value.
The act of converting text to numerics starts with tokenization (breaking the text into tokens), and turning those tokens into vectors is called vectorization or embedding. The result isn't a single number but quite a few of them, often more than a thousand values for each text. We won't get too deep into the vectorization process here, but you can learn more in our "What are vector embeddings?" guide.
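As a rough sketch of that vectorization step, the snippet below embeds a few cheese descriptions with the open-source sentence-transformers library. Both the library and the model name are my choices for illustration; the example isn't tied to any particular embedding provider.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Any embedding model works; this small general-purpose one is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

cheese_docs = [
    "Brie is a soft cow's-milk cheese with a bloomy rind and a mild, buttery flavor.",
    "Pepper jack is a semi-soft cheese flavored with spicy peppers.",
    "Aged cheddar is firm, sharp, and slightly crumbly.",
]

# One fixed-length vector of floats per document.
embeddings = model.encode(cheese_docs)
print(embeddings.shape)  # (3, 384) for this particular model
```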
With the cheese data "vectorized" and stored in a vector database, we can now calculate complementing cheeses (aka nearest neighbors). First, we take the description provided as input and generate its vectors, just like we did for the cheese data. Those generated vectors become the context for calculating where their nearest neighbors are.
Each vector in the provided description represents something about a cheese, so another cheese with the same (or very close) vectors would be a complementing cheese. Say we provided the description "I want a board that includes brie and pepper jack" to the application. Stop words like "I," "want," and "that" are discarded. The remaining words ("board," "brie," and "pepper jack") are vectorized. Anything in the database with vectors similar to those words is probably a neighbor, which is to say a complementing cheese. The search would hopefully return suggestions like cheddar, feta, and maybe Colby. It all depends on how cheese.com describes brie and pepper jack and how others discuss the two cheeses.
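A minimal, brute-force version of that lookup might look like the sketch below. The stored vectors, the query vector, and the helper function are all made up for illustration; a real application would hand this work to a vector database rather than looping over vectors in memory.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from the embedding step: cheese name -> stored vector.
cheese_vectors = {
    "cheddar":   np.array([0.70, 0.20, 0.50, 0.60]),
    "feta":      np.array([0.60, 0.30, 0.40, 0.70]),
    "colby":     np.array([0.80, 0.10, 0.50, 0.50]),
    "limburger": np.array([0.10, 0.90, 0.20, 0.10]),
}

# Vector for the (stop-word-stripped) description "board brie pepper jack".
query = np.array([0.75, 0.20, 0.45, 0.60])

# Rank every stored cheese by similarity to the query and keep the top 3.
ranked = sorted(cheese_vectors.items(),
                key=lambda item: cosine_similarity(query, item[1]),
                reverse=True)
print(ranked[:3])  # the "nearest neighbors": our complementing cheeses
```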
With the basics of nearest neighbor down, let’s look at the different algorithm types and some common conundrums the calculation runs into.
Popular Ways to Calculate Nearest Neighbor
Finding the nearest neighbor is the process of plotting all the vectors in all their dimensions and then comparing a context collection of vectors to them. Using a simple coordinate system, you can mathematically measure how far one point is from another (known as their distance).
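For a concrete sense of "distance," here is a two-dimensional example using the standard Euclidean formula. The coordinates are arbitrary; higher-dimensional vectors work exactly the same way, just with more terms under the square root.

```python
import math

# Two points plotted in a simple 2-D coordinate system.
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)

# Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))
print(distance)  # 5.0 -- the smaller the distance, the nearer the neighbor
```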
The typical American neighborhood is made up of connecting streets and cul-de-sacs. Along every street are houses, each with an address. When someone speaks of their "neighbors," they could be referencing the house next door, or they could be talking about a house on the other side of the neighborhood. The context is the boundaries of the neighborhood. Neighborhoods come in all shapes and sizes, so you need some reference or context when someone says, "My neighbor."
Depending on the precision your app needs when calculating the nearest neighbor, you choose the best-fitting algorithm (a.k.a. how to establish boundaries).
K-Nearest Neighbors (KNN)
The goal of KNN is usually to classify some piece of data against a large set of labeled data. Labeled data means a decision about what each item in the data set represents has already been made. In the example above, the data had no labels; that is called unsupervised data. You didn't know what each piece of text said or how it represented something about cheese. We could have gone through that data and added a label for which cheese was being talked about. Then it would be supervised data and a good candidate for KNN classification.
What if we had a picture of cheese and needed an AI application to figure out what family it was a part of? You could vectorize a whole bunch of cheese pictures with family labels on each and use those to compare to your picture’s vectors.
The "K" in KNN represents a bound: how many neighboring pictures of cheese are you willing to consider? Once those pictures are found, which family holds the majority in the group? That's a simple way to classify something like text or a picture against supervised data.
The return from KNN is a prediction of how well the provided data fits an existing label. There are usually two values returned: a probability and a class label. Our AI application would then need to decide whether the probability is strong enough to apply the given classification or whether some other action needs to be taken to continue.
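Here is a sketch of that classification flow using scikit-learn, which is simply my choice of library for illustration. The labeled "picture" vectors and the family names are made up; the point is the majority vote among the K nearest labeled neighbors and the confidence score that comes back with it.

```python
# pip install scikit-learn
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Pretend vectors for labeled cheese pictures (supervised data).
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],   # cheddar family
              [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]])  # brie family
y = ["cheddar", "cheddar", "cheddar", "brie", "brie", "brie"]

knn = KNeighborsClassifier(n_neighbors=3)  # "K" = 3 neighbors to consider
knn.fit(X, y)

new_picture = np.array([[0.82, 0.18]])  # the unlabeled picture's vector
print(knn.predict(new_picture))         # -> ['cheddar'] (majority vote)
print(knn.predict_proba(new_picture))   # -> how strong that vote was per class
```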
Approximate Nearest Neighbor (ANN)
The cheese board example above used ANN to find complementing cheeses to the given description of cheese. It’s one of the most common uses of nearest neighbor, mostly because it works well on non-labeled (unsupervised) data. Today’s AI applications try to use as much data as possible to make informed decisions. Still, it is such an investment of time and effort to label everything that it’s easier to adapt your algorithm(s).
The “approximate” in ANN should tip you off to the precision of the algorithm. The return will approximate what data is closely related to the input, but hallucinations are real, so be careful.
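As one illustrative example (the article doesn't prescribe a library), the sketch below builds an approximate index with the open-source Annoy library over a handful of made-up cheese vectors and asks it for the closest matches to a query vector.

```python
# pip install annoy
from annoy import AnnoyIndex

dimensions = 4                             # must match the embedding size
index = AnnoyIndex(dimensions, "angular")  # "angular" is close to cosine distance

# Add unlabeled (unsupervised) cheese vectors by integer id.
index.add_item(0, [0.70, 0.20, 0.50, 0.60])  # cheddar
index.add_item(1, [0.60, 0.30, 0.40, 0.70])  # feta
index.add_item(2, [0.10, 0.90, 0.20, 0.10])  # limburger

index.build(10)  # number of trees; more trees = better accuracy, slower build

# Approximate 2 nearest neighbors of a query vector.
nearest_ids = index.get_nns_by_vector([0.75, 0.20, 0.45, 0.60], 2)
print(nearest_ids)
```

The trade-off is baked into the index: it answers quickly by not examining every stored vector, so the results are very likely, but not guaranteed, to be the true nearest neighbors.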
Fixed Radius Nearest Neighbor
The "K" in K-nearest neighbors is a bound on how many points in space you are willing to consider before finding the majority. The focus is the number of points. Fixed radius is an extension of KNN where you still look at nearby points, but only those within a certain distance. This comes with the possibility that no points are found at all. But if the application is willing to accept that and can create a new classification, it's a quick way to limit the number of data points to consider. Limiting the number of points to consider is an easy way to speed up the overall calculation.
The radius is usually provided as a fixed value, along with the context value. The context value is a vector, and the radius is a measure of distance from that vector.
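A short sketch of that idea, using scikit-learn's radius-based neighbor search as an illustrative implementation (the vectors and the radius value are made up):

```python
# pip install scikit-learn
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Unlabeled cheese vectors (2-D here just to keep the sketch readable).
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.10, 0.90], [0.20, 0.80]])

# Only points within a fixed distance of 0.2 from the context vector count.
nn = NearestNeighbors(radius=0.2).fit(X)

query = np.array([[0.85, 0.15]])  # the context vector
distances, indices = nn.radius_neighbors(query)

print(indices[0])    # neighbors inside the radius (could be empty for a distant query)
print(distances[0])  # how far away each of those neighbors is
```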
Partitioning With k-Dimensional Tree (k-d tree)
When data is tokenized (converted to vectors), the number of dimensions is chosen. You choose the number of dimensions based on how accurate you need a search to be. But, like all things, there is a trade-off. The more dimensions an embedding has, the longer it’s going to take to compute a nearest neighbor. You have to find a balance between the two for your application.
When you have a lot of data, the k-d tree comes in quite handy. After the vectors are plotted in a single space (before the comparison of nearest neighbor distance takes place), the k-d tree splits that single space into a number of smaller spaces (called partitions). The "k" in the name refers to the number of dimensions of the space being partitioned. How the space is split at each level (so shared context is not lost) is an implementation choice; a common approach is to split at the median along one dimension.
Ideally, the number of leaves that make up each space in the tree is balanced. This makes for uniform, predictable search performance.
The power of using the k-d tree with the nearest neighbor is that the algorithm is given a sense of proximity while traversing the tree when an initial neighbor is found. With that knowledge, it can choose to not search large portions of the tree because it knows those leaves are too far. That can really speed up a search.
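To see this in action, here is a sketch using SciPy's KDTree, chosen purely for illustration. The random points stand in for vectorized data; the tree is built once and then answers nearest-neighbor queries without visiting every point.

```python
# pip install scipy numpy
import numpy as np
from scipy.spatial import KDTree

# Low-dimensional vectors; the k-d tree splits the space along one dimension
# at a time, typically at the median, so each partition stays balanced.
points = np.random.rand(10_000, 3)
tree = KDTree(points)

query = np.array([0.5, 0.5, 0.5])

# Find the 3 nearest neighbors. Branches of the tree that are provably too
# far from the query are skipped entirely, which is where the speedup comes from.
distances, indices = tree.query(query, k=3)
print(distances, indices)
```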