MatchUp Object:Hybrid Deduping

From Melissa Data Wiki
Jump to navigation Jump to search


The Hybrid deduper differs from the Incremental and Read/Write dedupers in that it does not maintain a key file of its own. It is up to the developer to maintain a list of match keys to use for deduping operations. This increases the flexibility of the Hybrid deduper but at the expense of programming complexity.

The main advantage of Hybrid deduping is that it allows the developer to build smaller lists of match keys on the fly and quickly compare records to a small subset of the database.

Clustering

The concept of Clustering, outlined in the first chapter, is essential to the Hybrid deduper. Unlike the other dedupers, where the clustering is taking place behind the scenes, the Hybrid deduper allows the developer to use clustering to compare a record against only a small portion of a list.

The Hybrid deduper uses the concept of a cluster size, which is the maximum number of characters at the beginning of a key that can be used to group a number of keys into smaller groups that can be compared against each other. For example, a cluster size of 5 means that the first five characters of a match key are used to create the clusters.

In other words, only the records where the first five characters of the match key for one record are identical to the first five characters of the match key for another record are considered when performing a Hybrid deduping operation.

Key Maintenance

Unlike the other interfaces, the Hybrid deduper does not automatically handle the read/write operations to a key file. While this forces the developer to do more work, it allows a great deal of flexibility in how match keys are stored and handled.

In the previous example, with a cluster size of 5, if the match keys are stored in a field within a SQL database, a cluster could be built quickly by performing a SELECT query where the first five characters of the match key field matches the first five characters of the match key for the new record.

While this gives the developer far more flexibility, it also requires a great deal more coding and a greater understanding of certain MatchUp concepts.

Hybrid Order of Operations

Using the Hybrid deduper is not as straightforward as the other interfaces, as it puts greater burden on the developer to handle storage and management of match keys.

This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Hybrid deduper.

  1. Initialize the Hybrid deduper.
  2. After creating an instance of the Hybrid deduper, point the object toward its supporting data file, select a matchcode to use, and initialize these files.
  3. Create field mappings.
  4. In order to build keys to compare, the Hybrid deduper needs to know which types of data the program will be passing to the deduper and in what order.
  5. Build a master list of keys.
  6. Each record must have a match key so the Hybrid deduper can select a cluster of records or check for duplicates. This consists of passing the data used in record comparison from each record to the deduper in the same order used when creating a field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the Hybrid deduper uses this information to generate a match key.
  7. Build a match key for the new address record.
  8. Repeat the step above to create a match key for the record to be compared against the cluster.
  9. Build the cluster list.
  10. Cycle through the master key list, extract only those records where the first part of the match key equals the first part of the match key for the new record.
  11. Compare the match key to the cluster list.
  12. Loop through the cluster key file for any keys that match the new record. If it finds a match, the CompareKey function indicates a match.