MatchUp Object:Hybrid Deduping
The Hybrid deduper differs from the Incremental and Read/Write dedupers in that it does not maintain a key file of its own. It is up to the developer to maintain a list of match keys to use for deduping operations. This increases the flexibility of the Hybrid deduper but at the expense of programming complexity.
The main advantage of Hybrid deduping is that it allows the developer to build smaller lists of match keys on the fly and quickly compare records to a small subset of the database.
Clustering
The concept of Clustering, outlined in the first chapter, is essential to the Hybrid deduper. Unlike the other dedupers, where the clustering is taking place behind the scenes, the Hybrid deduper allows the developer to use clustering to compare a record against only a small portion of a list.
The Hybrid deduper uses the concept of a cluster size, which is the maximum number of characters at the beginning of a key that can be used to group a number of keys into smaller groups that can be compared against each other. For example, a cluster size of 5 means that the first five characters of a match key are used to create the clusters.
In other words, only the records where the first five characters of the match key for one record are identical to the first five characters of the match key for another record are considered when performing a Hybrid deduping operation.
Key Maintenance
Unlike the other interfaces, the Hybrid deduper does not automatically handle the read/write operations to a key file. While this forces the developer to do more work, it allows a great deal of flexibility in how match keys are stored and handled.
In the previous example, with a cluster size of 5, if the match keys are stored in a field within a SQL database, a cluster could be built quickly by performing a SELECT query where the first five characters of the match key field matches the first five characters of the match key for the new record.
While this gives the developer far more flexibility, it also requires a great deal more coding and a greater understanding of certain MatchUp concepts.
Hybrid Order of Operations
Using the Hybrid deduper is not as straightforward as the other interfaces, as it puts greater burden on the developer to handle storage and management of match keys.
This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Hybrid deduper.
- Initialize the Hybrid deduper.
- Create field mappings.
- Build a master list of keys.
- Build a match key for the new address record.
- Build the cluster list.
- Compare the match key to the cluster list.