How Do AI Systems Identify Duplicate Data?
A discussion of AI concepts, such as comparing records in a database, and how these techniques can be used in conjunction with Salesforce.
When you compare two Salesforce records (or records from any other CRM, for that matter) side by side, you can easily determine whether they are duplicates. However, even with a relatively small dataset, say fewer than 100,000 records, it would be almost impossible to sift through them one by one and perform such a comparison. This is why companies have developed tools that automate the process, but to do a good job, the machines need to be able to recognize all of the similarities and differences between the records. In this article, we will take a closer look at some of the methods data scientists use to train machine learning systems to identify duplicates.
How Can Machine Learning Systems Compare and Contrast Records?
One of the main tools researchers use is string metrics. This is when you take two strings of data and return a number that is low if the strings are similar and high if they are different. How does this work in practice? Well, let’s take a look at the two records below:
| First Name | Last Name | Email | Company Name |
| --- | --- | --- | --- |
| Ron | Burgundy | ron.burgundy@acme.com | Acme |
| Ronald | burgundy | ron.burgundy@acme.com | Acme Corp |
If a human were to look at these two records, it would be pretty obvious that they are duplicates. Machines, however, rely on string metrics to replicate that human judgment, which is what AI is all about. One of the most famous string metrics is the Hamming distance, which measures the number of substitutions that need to be made in order to turn one string into another. For example, if we return to the two records above, only one substitution is needed to turn “burgundy” into “Burgundy,” so the Hamming distance would be 1.
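To make this concrete, here is a minimal sketch of the Hamming distance in Python. This is an illustrative implementation, not code from any particular deduplication product:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("burgundy", "Burgundy"))  # 1 -- a single substitution
```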
There are many other string metrics that measure the similarity between two strings, and what distinguishes them is the set of edit operations they allow. We mentioned the Hamming distance, but that metric only allows substitutions, meaning it can only be applied to strings of equal length. Something like the Levenshtein distance allows for deletion, insertion, and substitution, so it can compare strings of different lengths.
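As a minimal sketch, the Levenshtein distance can be computed with the classic Wagner–Fischer dynamic-programming algorithm. Again, this is an illustrative implementation rather than any specific library’s code:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Wagner-Fischer algorithm)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to "" is i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein_distance("Ron", "Ronald"))  # 3 -- three insertions
```

Note that “Ron” and “Ronald” have different lengths, so the Hamming distance could not even be computed here, while the Levenshtein distance handles them naturally.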
How Can All of This Be Used to Dedupe Salesforce?
There are a couple of ways an AI system can approach Salesforce deduplication. One of the ways is the blocking method, which is illustrated below:
| Record 1 | Record 2 |
| --- | --- |
| Ron Burgundy, ron.burgundy@acme.com, Acme | Ronald burgundy, ron.burgundy@acme.com, Acme Corp |
This blocking methodology is what makes the approach scalable. Whenever you upload new records into your Salesforce, the system automatically blocks together records that look “similar” according to some criterion, such as sharing the first three letters of the first name, as sketched below.
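Here is a rough sketch of what a blocking key might look like in Python. The field names and the “first three letters of the first name” criterion are illustrative assumptions, not how any particular product defines its blocks:

```python
from collections import defaultdict

def block_key(record: dict) -> str:
    """Illustrative blocking key: the first three letters of the first
    name, lowercased. Real systems typically combine several such keys."""
    return record.get("first_name", "")[:3].lower()

records = [
    {"first_name": "Ron", "last_name": "Burgundy",
     "email": "ron.burgundy@acme.com", "company": "Acme"},
    {"first_name": "Ronald", "last_name": "burgundy",
     "email": "ron.burgundy@acme.com", "company": "Acme Corp"},
]

# Group records into blocks; candidate pairs are only ever generated
# within a block, never across blocks.
blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

for key, group in blocks.items():
    print(key, "->", len(group), "record(s)")  # ron -> 2 record(s)
```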
This is very beneficial because it dramatically reduces the number of comparisons that need to be made. For example, say you have 100,000 records in your Salesforce and you would like to upload an Excel spreadsheet containing 50,000 more. A traditional rule-based deduplication app would need to compare each new record with every existing one, which amounts to 5,000,000,000 comparisons (100,000 × 50,000). Imagine how long that would take and how much it increases the probability of an error. Keep in mind, too, that 100,000 records is a fairly modest number for a Salesforce org; plenty of organizations have hundreds of thousands or even millions of records. The traditional approach simply does not scale to such volumes.
The next step is to compare each field individually:
| Field | Record 1 | Record 2 |
| --- | --- | --- |
| First Name | Ron | Ronald |
| Last Name | Burgundy | burgundy |
| Email | ron.burgundy@acme.com | ron.burgundy@acme.com |
| Company | Acme | Acme Corp |
Once the system has blocked together “similar” records, it proceeds to analyze each record field by field. This is where all of the string metrics we talked about earlier come into play. In addition, the system assigns each field a particular “weight,” or importance. For example, let’s say that for your dataset, the “Email” field is the most important. You can either adjust the weights yourself, or, as you label record pairs as duplicates (or not), the system will learn the correct weights automatically. This is called active learning, and it is preferable because the system can precisely calculate the importance of one field relative to another. A rough sketch of how such weighted scoring might work follows.
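The weights, field names, and similarity function below are all illustrative assumptions; a real active-learning system would fit the weights from user-labeled pairs rather than hard-coding them:

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; difflib is used here for brevity,
    but any string metric (Hamming, Levenshtein, etc.) could be plugged in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative weights only. In an active-learning setup these would be
# learned from record pairs the user has labeled duplicate / not duplicate,
# e.g. by fitting a simple classifier over the per-field similarity scores.
WEIGHTS = {"first_name": 0.15, "last_name": 0.25, "email": 0.45, "company": 0.15}

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted sum of per-field similarities; values near 1.0 suggest a duplicate."""
    return sum(w * field_similarity(rec1.get(f, ""), rec2.get(f, ""))
               for f, w in WEIGHTS.items())

rec1 = {"first_name": "Ron", "last_name": "Burgundy",
        "email": "ron.burgundy@acme.com", "company": "Acme"}
rec2 = {"first_name": "Ronald", "last_name": "burgundy",
        "email": "ron.burgundy@acme.com", "company": "Acme Corp"}

print(f"{match_score(rec1, rec2):.2f}")  # around 0.89 -- very likely duplicates
```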
What Are the Advantages of the Machine Learning Approach?
The biggest benefit machine learning offers is that it does the heavy lifting for you. The active learning we described in the previous section applies the necessary weight to each field automatically, which means there is no complicated setup process and no rules to create. Consider the alternative: a sales rep discovers a duplicate and notifies the Salesforce admin, who then creates a rule to prevent that kind of duplicate from occurring in the future. This process has to be repeated every time a new kind of duplicate is discovered, which makes it unsustainable.
We also need to remember that the built-in deduplication in Salesforce is rule-based too; it is just very limited. For example, you can only merge three records at a time, there is no support for custom objects, and there are a number of other restrictions. Machine learning is simply the smarter way to go: rule creation is basic automation, whereas AI and machine learning try to recreate the human thought process. (More about the differences between machine learning and automation is discussed in this article.) It would not make sense to choose a deduplication product that merely extends Salesforce’s built-in functionality instead of fixing the underlying process, which is why the machine learning approach is the best way to go.