Using Machine Learning to Detect Dupes: Some Real-Life Examples
Today we would like to take a look at some interesting uses of machine learning to catch duplicates in all kinds of environments.
As companies collect more and more data about their customers, duplicate records inevitably start appearing in that data as well, causing a lot of confusion among internal teams. Since it would be impossible to manually go through all of the data and delete the duplicates, companies have turned to machine learning solutions that perform this work for them. Before we dive right in, let’s take a look at how machine learning systems identify duplicates.
How Do Machine Learning Systems Identify Duplicates?
When a person looks at two images or two strings of data, it is fairly easy for them to determine whether or not those images or strings are duplicates. But how would you train a machine to spot such duplicates? A good starting point would be to identify all of the similarities, but then you would need to explain exactly what 'similar' means. Are there gradations of similarity? To overcome such challenges, researchers use string metrics to train machine learning models.
There are many string metrics to choose from. The following are some of the most frequently used (a short code sketch follows the list):
- Hamming Distance: This method counts the number of substitutions required to turn one string into another string of equal length.
- Levenshtein Distance: This string metric expands on the Hamming distance by allowing deletions and insertions in addition to substitutions.
- Jaro-Winkler Distance: This metric measures the similarity between two sequences, giving more favorable scores to strings that match from the beginning (a common prefix).
- Learnable Distance: This one takes into consideration that different edit operations have varying significance in different domains.
- Sørensen–Dice coefficient: This one measures how similar two strings are in terms of the number of common bigrams (a bigram is a pair of adjacent letters in the string).
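To make these definitions concrete, here are toy Python implementations of three of the metrics above. This is a minimal sketch for illustration only; for real workloads, established libraries such as textdistance or jellyfish implement these (and many more) metrics.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def dice_coefficient(a: str, b: str) -> float:
    """Similarity in [0, 1] based on shared letter bigrams."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

print(hamming_distance("karolin", "kathrin"))     # 3
print(levenshtein_distance("kitten", "sitting"))  # 3
print(dice_coefficient("night", "nacht"))         # 0.25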
All of this may sound complicated, so let’s take a look at some real-world examples of using machine learning for deduplication.
Detecting Duplicate Questions in Quora and Reddit
Quora and Reddit are two very popular platforms used by millions of people all over the world. That scale creates a significant duplicate-content problem: a user asking a question is most likely unaware that somebody else has already asked the same thing. Since it is not possible to manually sift through all of the questions, these platforms use machine learning to identify duplicates. As an example, consider the two questions below:
- How do I book cheap hotel rooms?
- What are the best ways to find cheap hotel deals?
To train machine learning algorithms to identify whether questions like these are duplicates, Quora uses a massive dataset consisting of 404,290 labeled question pairs, plus a test set of 2,345,795 question pairs. So many examples are needed because the model has to cope with factors such as capitalization, abbreviations, and noisy ground-truth labels. All of this is meant to surface high-quality answers to questions, resulting in a better experience for all Quora users across the board.
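Quora’s production models are far more sophisticated than anything we can show here, but a duplicate-question classifier needs per-pair features to train on. The sketch below is purely illustrative: `pair_features` is a hypothetical helper computing a few simple similarities for our two example questions.

```python
from difflib import SequenceMatcher

def pair_features(q1: str, q2: str) -> dict:
    """A few illustrative similarity features for one question pair.
    Production systems add dozens more (TF-IDF, embeddings, etc.)."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    s1, s2 = set(w1), set(w2)
    return {
        # fraction of distinct words the two questions share
        "word_overlap": len(s1 & s2) / max(len(s1 | s2), 1),
        # character-level similarity ratio in [0, 1]
        "char_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
        # do both questions open with the same word ("how", "what", ...)?
        "same_lead_word": bool(w1 and w2 and w1[0] == w2[0]),
    }

print(pair_features(
    "How do I book cheap hotel rooms?",
    "What are the best ways to find cheap hotel deals?",
))
```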
Deduping Ads on Craigslist
Craigslist is another very popular platform for posting all kinds of advertisements, but very often sellers who are not satisfied with an ad’s performance will edit it and post it again, leaving multiple near-identical versions of the same listing.
A machine learning system can use the string metrics we mentioned earlier to measure the distance between each pair of ads (the edit operations needed to turn one listing into another) and then flag the ones close enough to be duplicates.
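As a concrete illustration (not Craigslist’s actual system), the sketch below uses Python’s standard-library `difflib.SequenceMatcher` to flag ad pairs whose normalized text is nearly identical. The ads and the 0.8 threshold are made up; a real deployment would tune the threshold on labeled data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical ad titles; the second is a lightly edited repost of the first.
ads = [
    "2015 Honda Civic, low miles, one owner",
    "2015 Honda Civic -- LOW MILES, one owner!",
    "Vintage oak dining table, seats six",
]

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so cosmetic edits don't hide dupes."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

for a, b in combinations(ads, 2):
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    if score > 0.8:  # arbitrary illustrative threshold
        print(f"Likely duplicate ({score:.2f}):\n  {a}\n  {b}")
```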
Deduping Lines of Code
Even people who are not IT professionals have heard of GitHub, a popular resource where developers can host, share, and discover software. With more than 190 million repositories on GitHub and more than 40 million users, it is pretty easy for duplicate code to appear. In fact, research into this issue shows that 93% of JavaScript on GitHub is duplicate. There are many different classifications of duplicates, ranging from completely identical copies to code that is semantically similar but syntactically different. GitHub relies on machine learning to parse through all the code submitted by users and detect the duplicates that are either exactly the same or perform the same functions.
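Clone detection at GitHub scale uses far richer representations than we can show here, but the core idea behind catching "syntactically different" clones can be illustrated with a toy normalize-then-compare sketch. Everything below (the tokenizer, the snippets) is invented for illustration: identifiers are collapsed to a placeholder so that renaming a variable does not hide the clone, and token trigrams are compared with Jaccard similarity.

```python
import re

def tokens(code: str) -> list[str]:
    """Crude tokenizer: collapse identifiers to a placeholder token."""
    keywords = {"def", "return", "for", "in", "if", "else", "while"}
    out = []
    for t in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        if (t[0].isalpha() or t[0] == "_") and t not in keywords:
            out.append("ID")  # normalize away the variable/function name
        else:
            out.append(t)
    return out

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over token trigrams of two code snippets."""
    grams = lambda ts: {tuple(ts[i:i + 3]) for i in range(len(ts) - 2)}
    x, y = grams(tokens(a)), grams(tokens(b))
    return len(x & y) / len(x | y) if x | y else 1.0

print(jaccard("def add(a, b): return a + b",
              "def plus(x, y): return x + y"))  # 1.0 after normalization
```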
Using Machine Learning to Dedupe Salesforce
Machine learning is a much better alternative to the traditional rule-based approach to deduping Salesforce. It is far more effective at identifying fuzzy duplicates, since it is simply not possible to write a rule for every possible scenario. A machine learning system takes the string metrics mentioned above, along with many others, and learns to replicate the human thought process: it presents you with a pair of records, you label them as unique or duplicates, and it automatically learns from your decisions to adjust its deduplication model to your data.
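Here is a deliberately minimal sketch of that labeling loop, assuming plain string records and a single similarity score. Real tools (such as the open-source Python dedupe library) learn weights over many fields rather than one threshold; the records below are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical candidate pairs pulled from a CRM export.
candidate_pairs = [
    ("Acme Corp.", "ACME Corporation"),
    ("Jane Doe, Springfield", "Jane Doe, Shelbyville"),
]

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

labeled = []
for a, b in candidate_pairs:
    answer = input(f"Duplicate? [y/n]\n  {a}\n  {b}\n> ")
    labeled.append((score(a, b), answer.strip().lower() == "y"))

# The crudest possible "model": a threshold halfway between the highest
# score you called unique and the lowest score you called a duplicate.
dup = [s for s, is_dup in labeled if is_dup]
uniq = [s for s, is_dup in labeled if not is_dup]
if dup and uniq:
    print(f"Learned threshold: {(min(dup) + max(uniq)) / 2:.2f}")
```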
Another big benefit of the machine learning approach is that it is much more scalable. If you start with a modest number of records, such as 50,000, adding another 5,000 means comparing each new record against every existing one: 5,000 × 50,000 = 250,000,000 comparisons. To extrapolate further, even if a computer manages to compare 10,000 records per second (which already requires enormous computational power), it would still take almost seven hours to do the full comparison and identify duplicates:
250,000,000 comparisons ÷ 10,000 comparisons per second ÷ 3,600 seconds per hour ≈ 6.94 hours
Machine learning tools take a much smarter approach: they block together records that share specific similarities and only check for duplicates within each block. This results in significantly fewer comparisons, saving you a lot of time.
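A minimal sketch of blocking, with invented records and an invented blocking key:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical contact records; the blocking key is purely illustrative.
records = ["Jane Doe", "Jane Do", "Jane Doolittle", "John Smith", "Jon Smith"]

def blocking_key(name: str) -> str:
    """Records sharing their first three letters land in the same block."""
    return name.lower().replace(" ", "")[:3]

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Only compare records inside the same block.
pairs = [pair for block in blocks.values()
         for pair in combinations(block, 2)]
full = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} comparisons instead of {full}")  # 3 instead of 10
```

Note the trade-off: blocking cuts the comparison count sharply, but a naive key can miss true duplicates ("John Smith" and "Jon Smith" land in different blocks here), which is why real systems combine or learn multiple blocking keys.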
The Role of Machine Learning In Deduplication Will Only Increase
Duplicate data can carry severe ramifications for your CRM or database, which is why a range of techniques has been developed to identify duplicates. As we have shown, machine learning is more efficient and smarter than hand-crafting filters, which is time-consuming and ultimately futile. So start using machine learning to dedupe your data today.
Published at DZone with permission of Ilya Dudkin.