MatchUp Object:Read/Write Deduping

From Melissa Data Wiki

Read/Write deduping is usually used to process entire lists. It works much like the MatchUp software product does. A calling program passes an entire list to the Read/Write deduping engine one record at a time. When the entire list has been passed, the calling program tells the API to process the records. The calling program then retrieves each record, along with additional deduplication information, from the Read/Write deduper.

Read/Write deduping consists of the following steps:

  1. One record at a time, the program sends the record data (ZIP/PC, Name, Address, etc.) to the MatchUp API.
  2. When step 1 is complete, the program sends a “process” command to the API.
  3. The program retrieves the results for each record with deduplication information.

Order of Output Records

The program sends records in a particular sequence, either in record (raw) order or in a more sophisticated order (by ZIP/PC, record type, and so on). MatchUp Object will not return the records in that same order. By default, records are output in cluster order, which is loosely based on the matchcode. For example, if the matchcode has Zip5 as its first component, output records will be more or less sorted by ZIP Code (but the developer should not count on this). If the application called the SetGroupSorting function, records in the same dupe group will be adjacent. Otherwise, duplicate records may or may not be adjacent (though they are usually near each other).

If a certain sequence is important (for example, records ordered in the same sequence they were input), sort the results after MatchUp Object has processed the data.
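As an illustration of sorting the results afterwards, here is a minimal Python sketch. It assumes each record was tagged with its input sequence number before being passed to the deduper (for instance through the user-info value described under Record Identity) and that the results have already been collected into a list. The dictionary layout and the `restore_input_order` helper are hypothetical and are not part of MatchUp Object.

```python
def restore_input_order(results):
    """Return results sorted by the input sequence number carried as user info.

    `results` is a hypothetical list of dicts, one per output record, where
    "user_info" holds the input sequence number as a string.
    """
    return sorted(results, key=lambda result: int(result["user_info"]))


# Hypothetical output in cluster order: the record read first (sequence 0)
# is no longer first.
cluster_ordered = [
    {"user_info": "2", "dupe_group": 1},
    {"user_info": "0", "dupe_group": 2},
    {"user_info": "1", "dupe_group": 2},
]

input_ordered = restore_input_order(cluster_ordered)
```

Converting the identifier with `int` matters here: sorting the raw strings would place "10" before "2".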

Data Lifetime

A Read/Write deduping session is relatively short-lived. Although reading and writing the records may take time (hours or days), the process is strictly divided into three distinct steps. The key file does not persist beyond the session. Because of this, Read/Write deduping is not usually the choice for ongoing or online processes.

Record Identity

Because MatchUp Object does not read or write directly to the database, some mechanism must be provided so that the application can match each record back to the original data source. The SetUserInfo function allows the application to pass a unique identifier for each record.
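The sketch below shows one way an application might use such an identifier to write deduplication results back to its own database. It is only an illustration: it assumes the table's primary key was passed as the user info for each record, and it uses Python's standard `sqlite3` module with a made-up `customers` table. The `write_back` helper and the `(user_info, dupe_group)` tuples are hypothetical, not MatchUp Object output.

```python
import sqlite3


def write_back(conn, results):
    """Store each record's dupe group on the matching source row.

    `results` is a hypothetical list of (user_info, dupe_group) pairs, where
    user_info is the primary key that was passed in for that record.
    """
    conn.executemany(
        "UPDATE customers SET dupe_group = ? WHERE id = ?",
        [(dupe_group, int(user_info)) for user_info, dupe_group in results],
    )
    conn.commit()


# Made-up source table standing in for the application's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, dupe_group INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(101, "Ann Smith"), (102, "Bob Jones"), (103, "Ann Smith")],
)

# Results arrive in the deduper's order, not the table's order.
write_back(conn, [("102", 1), ("101", 2), ("103", 2)])

rows = conn.execute("SELECT id, dupe_group FROM customers ORDER BY id").fetchall()
```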

Read/Write Order of Operations

Using the Read/Write deduper is straightforward. This section outlines the basic steps and then shows an example of the programming logic for a typical implementation of the Read/Write deduper.

  1. Initialize the Read/Write deduper.
     After creating an instance of the Read/Write deduper, point the object toward its supporting data file, select a matchcode and key file to use, and initialize these files.
  2. Create field mappings.
     In order to build a key to be written to the key file, the Read/Write deduper needs to know which types of data the application will be passing to the deduper and in what order.
  3. Read the records from the database.
     Loop through the master database and get the data fields needed to build a key, according to the mappings defined in step 2.
  4. Build a match key for each record.
     This consists of passing the actual data to the deduper in the same order used when creating the field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the deduper uses this information to generate a match key.
  5. Write each match key to the key file.
     The WriteRecord function stores each match key in a temporary key file.
  6. Process the keys.
     After building the keys, calling the Process function loops through the keys and compares them to each other.
  7. Loop through the records and read the deduping data for each one.
     The ReadRecord function loops through the entire set of deduped records and allows the application to read information on the record’s duplicate/unique status, the number of duplicates for each record and the record dupe group.
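The overall flow can be sketched with a toy stand-in written in Python. This is not the MatchUp Object API: the `ToyReadWriteDeduper` class, its snake_case method names, and its matching rule (an exact match on trimmed, upper-cased fields) are assumptions made for illustration only. The method names loosely echo AddField, WriteRecord, Process, ReadRecord, and SetUserInfo from the text, but real matchcodes and signatures are not reproduced here.

```python
from collections import defaultdict


class ToyReadWriteDeduper:
    """Hypothetical in-memory stand-in; NOT the MatchUp Object API."""

    def __init__(self):
        self._mappings = []   # field types, in the order data will be passed
        self._fields = []     # values for the record currently being built
        self._user_info = ""  # caller's identifier for the current record
        self._key_file = []   # temporary (match_key, user_info) pairs
        self._results = []    # filled in by process()
        self._cursor = 0

    def add_mapping(self, field_type):
        self._mappings.append(field_type)

    def set_user_info(self, user_info):
        self._user_info = user_info

    def add_field(self, value):
        self._fields.append(value)

    def write_record(self):
        if len(self._fields) != len(self._mappings):
            raise ValueError("fields must follow the mappings, in order")
        # Toy key: normalized exact match. Real matchcodes are far richer.
        key = "|".join(value.strip().upper() for value in self._fields)
        self._key_file.append((key, self._user_info))
        self._fields = []

    def process(self):
        groups = defaultdict(list)
        for key, user_info in self._key_file:
            groups[key].append(user_info)
        self._results = []
        # Output follows key (cluster) order, not input order.
        for group_no, key in enumerate(sorted(groups), start=1):
            members = groups[key]
            for position, user_info in enumerate(members):
                self._results.append({
                    "user_info": user_info,
                    "dupe_group": group_no,
                    "count": len(members),
                    "is_duplicate": position > 0,
                })
        self._key_file = []  # the key file does not outlive the session
        self._cursor = 0

    def read_record(self):
        """Return the next result, or None when all have been read."""
        if self._cursor >= len(self._results):
            return None
        result = self._results[self._cursor]
        self._cursor += 1
        return result


FIELD_ORDER = ("zip", "last_name", "address")


def dedupe(rows):
    deduper = ToyReadWriteDeduper()            # initialize the deduper
    for field_type in FIELD_ORDER:             # create field mappings
        deduper.add_mapping(field_type)
    for row in rows:                           # read records from the source
        deduper.set_user_info(str(row["id"]))  # identifier to match back later
        for name in FIELD_ORDER:               # pass fields in mapping order
            deduper.add_field(row[name])
        deduper.write_record()                 # write the key to the key file
    deduper.process()                          # compare the keys
    return list(iter(deduper.read_record, None))  # read every result


rows = [
    {"id": 1, "zip": "92688", "last_name": "Smith", "address": "22382 Avenida Empresa"},
    {"id": 2, "zip": "02109", "last_name": "Jones", "address": "1 Main St"},
    {"id": 3, "zip": "92688", "last_name": "smith ", "address": "22382 AVENIDA EMPRESA"},
]
results = dedupe(rows)
```

In this sketch, records 1 and 3 end up in the same dupe group, and the results come back ordered by key rather than by input position, which mirrors the behavior described under Order of Output Records.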