Matchcode Optimization: First Component
Concepts for Efficient Matching: First Component Combinations
Once a workflow has been tested and found to be stable and secure, the two most important concerns for data processing engineers when implementing a process in production are:
- Throughput – How long will it take to process the data?
- Accuracy – Does the output meet the expectations for accuracy?
Choosing the right matching strategy for MatchUp aims to satisfy both criteria. In many cases, however, a trade-off must be made: the more granular matching strategies required to detect inexact records as duplicates come at a cost of processing speed, or record throughput.
During processing, a matchcode key is generated as a representation of each record, to be compared to the keys of other records. Ideally, every record's key would be compared to every other record's key. This, however, is practical only in very trivial applications, because the number of comparisons grows roughly with the square of the number of records processed. For example, a record set of 100 records requires 4,950 comparisons (99 + 98 + ... + 1). A larger set of 10,000 records requires 49,995,000 comparisons (9,999 + 9,998 + ... + 1). Large record sets would take prohibitive amounts of time to process.
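The arithmetic above can be checked with a short sketch. The function name `pairwise_comparisons` is hypothetical and is not part of MatchUp; it simply evaluates the sum (n - 1) + (n - 2) + ... + 1 in closed form.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique record pairs among n records.

    Equivalent to (n - 1) + (n - 2) + ... + 1, i.e. n * (n - 1) / 2.
    """
    return n * (n - 1) // 2


for n in (100, 10_000):
    print(f"{n:,} records require {pairwise_comparisons(n):,} comparisons")
```

Multiplying the record count by 100 multiplies the comparison count by roughly 10,000, which is why comparing every key to every other key does not scale.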
To let you process large amounts of data with fewer comparisons and without affecting accuracy, MatchUp uses the concept of neighborhood sorting, or clustering, to place records in subgroups of potential matches, thus cutting the total number of comparisons during the deduping process.
In many cases, the value used to form these groups will be all or part of the ZIP/Postal Code. MatchUp then compares only records that are (in this example) in the same ZIP or Postal Code. In the US, using 5-digit ZIP codes, this cuts the average number of comparisons per record by a factor of thousands. This requires that the ZIP/Postal Code component be enabled in all used columns (or matchcode combinations).
This concept is known as break grouping, clustering, partitioning, or neighborhood sorting. It is very likely that most, if not all, other deduping programs use some form of clustering method.
Here is an example set of matchcode keys using ZIP/Postal Code (5 characters), Last Name(4), First Name(2), Street Number(3), Street Name(5):
02346BERNMA49 GARD
02346BERNMA49 GARD
02357STARBR18 DAME
02357MILLLI123MAIN
03212STARMA18 DAME
When the deduping engine encounters this set of matchcode keys, it compares all the keys in “02346” (2 keys), then “02357” (2 keys), and finally “03212” (1 key). For this small set, 10 comparisons are turned into 2.
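The reduction from 10 comparisons to 2 can be reproduced with a minimal sketch. It groups the example keys by their first 5 characters (the ZIP/Postal Code) and counts the pairs inside each group. The helper names are hypothetical, and this is a simplification for illustration, not MatchUp's actual clustering engine.

```python
from collections import defaultdict

ZIP_LEN = 5  # ZIP/Postal Code occupies the first 5 characters of each key

keys = [
    "02346BERNMA49 GARD",
    "02346BERNMA49 GARD",
    "02357STARBR18 DAME",
    "02357MILLLI123MAIN",
    "03212STARMA18 DAME",
]


def cluster_by_prefix(keys, prefix_len):
    """Group keys that share the same leading prefix_len characters."""
    clusters = defaultdict(list)
    for key in keys:
        clusters[key[:prefix_len]].append(key)
    return dict(clusters)


def pairs(n):
    """Number of unique pairs among n items."""
    return n * (n - 1) // 2


clusters = cluster_by_prefix(keys, ZIP_LEN)
unclustered = pairs(len(keys))
clustered = sum(pairs(len(members)) for members in clusters.values())

print(f"without clustering: {unclustered} comparisons")
print(f"with ZIP clustering: {clustered} comparisons")
```

Only keys inside the same cluster are ever compared, so the single key in "03212" contributes no comparisons at all.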
In reality, MatchUp’s clustering engine is a bit more complicated than this, but this description will aid in understanding its mechanics.
If the second component in the matchcode is also configured in all used matchcode combinations, this increases the number of characters in a record's matchcode key that can be used to subdivide, or cluster, records into more efficient subgroups, further reducing the number of comparisons.
A second deduping engine, the Intersecting deduper, allows you to create matching strategies with rule sets completely independent of each other. This eliminates having to run multiple passes, but at a significant cost in speed, and is recommended only for real-time deduping or very small data sets.
Often, when users have unverified and/or incomplete addresses, they set up a logically accurate but very slow matching strategy:
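As a rough illustration of the effect of a longer cluster prefix, the sketch below counts comparisons for the example keys when clustering on the ZIP/Postal Code alone (5 characters) and on ZIP/Postal Code plus Last Name (9 characters). It assumes, for illustration only, that Last Name is the second component and is used in every combination; the function name is hypothetical.

```python
from collections import Counter

keys = [
    "02346BERNMA49 GARD",
    "02346BERNMA49 GARD",
    "02357STARBR18 DAME",
    "02357MILLLI123MAIN",
    "03212STARMA18 DAME",
]


def comparisons_for_prefix(keys, prefix_len):
    """Count within-cluster pairs when clustering on the first prefix_len characters."""
    sizes = Counter(key[:prefix_len] for key in keys)
    return sum(n * (n - 1) // 2 for n in sizes.values())


zip_only = comparisons_for_prefix(keys, 5)       # ZIP/Postal Code (5)
zip_and_last = comparisons_for_prefix(keys, 9)   # ZIP/Postal Code (5) + Last Name (4)

print(f"ZIP only: {zip_only} comparisons")
print(f"ZIP + Last Name: {zip_and_last} comparisons")
```

With the longer prefix, the two "02357" keys fall into separate clusters (STAR and MILL), so they are never compared.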
Component    Size  Fuzzy  Blank  Combinations (1 2 3 4)
ZIP/PC       5     No     Yes    X X
City         12    No     Yes    X X
State        2     No     Yes    X X
Street #     5     No     Yes    X X
Street Name  5     No     No     X X
PO Box       10    No     No     X X
Last Name    5     No     Yes    X X X X
A customer use case shows that verifying your addresses can allow you to turn a 58-hour process into a 4-hour process by satisfying the first component combination conditions:
Component    Size  Fuzzy  Blank  Combinations (1 2)
ZIP/PC       5     No     Yes    X X
Street #     5     No     Yes    X
Street Name  5     No     No     X
PO Box       10    No     No     X
Last Name    5     No     Yes    X X
Since another component (Last Name) also satisfies the first component combination conditions, dragging it up in the matchcode component order can make the process run even faster – without changing the logic used to identify duplicates.
Component    Size  Fuzzy  Blank  Combinations (1 2)
ZIP/PC       5     No     Yes    X X
Last Name    5     No     Yes    X X
Street #     5     No     Yes    X
Street Name  5     No     No     X
PO Box       10    No     No     X
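The reordering step can be sketched as follows: find the components that are used in every combination and move them to the front of the component order, leaving the rest in their existing order. The function name is hypothetical, and the assignment of Street #, Street Name, and PO Box to specific combinations below is an assumption made for illustration; only the components marked in both combinations (ZIP/PC and Last Name) are taken from the tables above.

```python
def reorder_for_clustering(component_order, combos):
    """Move components used in every combination to the front.

    Relative order is preserved within both groups, so the matching
    logic (which components each combination uses) is unchanged.
    """
    in_all = [c for c in component_order if all(c in combo for combo in combos)]
    rest = [c for c in component_order if c not in in_all]
    return in_all + rest


order = ["ZIP/PC", "Street #", "Street Name", "PO Box", "Last Name"]

# Assumed split: a street-address combination and a PO Box combination.
combos = [
    {"ZIP/PC", "Street #", "Street Name", "Last Name"},
    {"ZIP/PC", "PO Box", "Last Name"},
]

new_order = reorder_for_clustering(order, combos)
print(new_order)
```

Under these assumptions the result places ZIP/PC and Last Name first, matching the component order shown in the last table.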