MatchUp Object:Best Practices: Difference between revisions

From Melissa Data Wiki
Tim (talk | contribs)
 
(13 intermediate revisions by 2 users not shown)

Latest revision as of 20:17, 25 May 2018



MatchUp Object Best Practices collects support recommendations for cases where performance is not optimal.

Intersecting Deduper

When you use a matchcode that circumvents the first component restrictions (the first component is not used in all combinations, or has a fuzzy algorithm applied), expect significantly slower throughput. Such matchcodes can also cause stability issues, so we do not recommend them when processing large numbers of records. Test thoroughly with a small number of records before scaling up to larger data sets and a production environment. Note that the problem will not appear with the Hybrid Deduper or with small amounts of data. For more information, see Matchcode Combinations.


Matchcodes with Fuzzy Algorithms

Because fuzzy algorithms can slow a process down exponentially or raise stability issues in enterprise-level processes, we recommend that you first establish acceptable throughput benchmarks with a standard exact matchcode, then make small incremental changes toward the desired matching strategy. Depending on the quality of the data and the number of records, some matchcode properties may be impractical at the speeds you need.


Optimizing Speed: General

  1. Network data traffic: We recommend that the source data be local to the machine where the Melissa Data program is installed. Network permissions, network throughput, and in some cases MatchUp's need to access record 'x' to complete consolidation with record 'y' are all potential causes of a slower process.
  2. Source datatype: Some database or file types can be read by the calling language or IDE more efficiently than others. Finding the most efficient file type for your environment requires trial-and-error testing by the developer.
  3. Hardware: In general, the more hardware you dedicate to a process, the faster it will run. However, many processes cannot take advantage of additional hardware, or show diminishing returns. For example, data with varied ZIP codes may be able to use multiple processors to process individual clusters of records, but a database in which every record shares the same ZIP code may not. Given the factors above, hardware may also not be the overriding factor in a fast process; a good matchcode may be the most important factor.


Optimizing Speed: Matchcodes

  1. Matchcode: Components: Before MatchUp dedupes, it clusters records into groups of possible matches. If your matchcode has no component that is used in every combination, MatchUp cannot place records into those subgroup clusters. In general, the more components that are used in every combination, the faster the process will be.
  2. Matchcode: Fuzzy: MatchUp has an extensive list of fuzzy options. Some are performed during the key building process (e.g., Soundex) and do not slow the process down. Others are performed on the constructed match keys (e.g., Near, Jaro) and therefore do slow the process down. If your process requires the latter types, place them in the component order below an exact component that is, if possible, also used in every combination.
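
To illustrate why clustering matters, the following minimal sketch counts pairwise comparisons with and without a grouping key. It is an illustration of the general idea only, not MatchUp's internal algorithm; the function names `pair_count` and `clustered_pair_count` and the use of a ZIP code as the clustering key are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Number of unordered pairs among n records: n * (n - 1) / 2.
std::uint64_t pair_count(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// Comparisons needed when records are first grouped by a clustering key
// (hypothetically, a ZIP code component used in every combination) and
// only records that share a key are compared with each other.
std::uint64_t clustered_pair_count(const std::vector<std::string>& cluster_keys) {
    std::map<std::string, std::uint64_t> sizes;
    for (const auto& key : cluster_keys) {
        ++sizes[key];  // count the records that fall into each cluster
    }
    std::uint64_t total = 0;
    for (const auto& entry : sizes) {
        total += pair_count(entry.second);
    }
    return total;
}
```

With six records spread over three ZIP codes (two, three, and one record each), comparing every record against every other takes 15 comparisons, while comparing only within clusters takes 4. The gap widens quickly as the record count grows.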


Order of Components in Matchcode

Although the Matchcode Editor interface lets you place the components in any order, the Object has a few restrictions on the order in which the AddMapping methods are called. Namely, the Address Line AddMappings must be called last, even if you have added another component after the Address matchcode components. Calling AddMappings in the wrong order will throw an error, so when using the Matchcode Editor, place your address components last. The exception would be rare cases where address components are used in every combination (specified column) but a different component is not.
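
One way to guard against ordering mistakes is to build the list of planned mappings first and move the address lines to the end before making any calls. The sketch below assumes a hypothetical `PlannedMapping` struct and component labels invented for this example; it does not call the MatchUp API and only shows the reordering step.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one planned mapping call; not a MatchUp type.
struct PlannedMapping {
    std::string label;     // free-form label chosen by the caller
    bool is_address_line;  // true for Address Line mappings
};

// Moves Address Line mappings to the end while preserving the relative
// order of all other mappings and of the address lines themselves.
void put_address_lines_last(std::vector<PlannedMapping>& mappings) {
    std::stable_partition(
        mappings.begin(), mappings.end(),
        [](const PlannedMapping& m) { return !m.is_address_line; });
}
```

After the call, iterate over the vector in order and issue the corresponding mapping call for each entry.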


Back up your Matchcode database

If you create your own matching strategies, you should occasionally back up the matchcode database file in case someone changes a matchcode or the file becomes corrupted.

For the Object, SSIS, and Contact Zone, this file is named mdMatchUp.mc.

For the MatchUp Software version, the file is named DTake.mc.

A good practice is to include the date in the backup name, for example mdMatchUp_20140123.mc, so that you can see the original matchcodes used in processes before Jan 23, 2014.
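
The naming convention above can be generated programmatically. The following is a minimal sketch; the helper name `backup_name` is an assumption for this example, and it expects a bare file name rather than a full path.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds a date-stamped backup name such as "mdMatchUp_20140123.mc"
// from a bare file name and a calendar date.
std::string backup_name(const std::string& file_name, int year, int month, int day) {
    char stamp[16];
    std::snprintf(stamp, sizeof stamp, "_%04d%02d%02d", year, month, day);
    const std::string::size_type dot = file_name.rfind('.');
    if (dot == std::string::npos) {
        return file_name + stamp;  // no extension: append the stamp
    }
    // Insert the stamp between the base name and the extension.
    return file_name.substr(0, dot) + stamp + file_name.substr(dot);
}
```

The returned name can then be passed to whatever file copy routine your environment provides.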


Using Efficient SetUserInfo

By default, the UserInfo value passed to SetUserInfo (the unique identifier attached to each built match key) is 1024 bytes, which lets the developer pass an advanced custom identifier, or even source data, to the key file. While this can have data handling advantages, it causes the key file and temporary sort files to grow much larger than most jobs need, and slows the process down. A new reserve function has been added that lets the user override the default UserInfo size. For example:

ReadWrite->SetReserved("UserInfoSize","12");

In our tests, this reduced key and temporary disk storage usage by a factor of 10 and processing time by as much as 60%.
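
For a rough sense of scale, the sketch below computes the space taken by the UserInfo field alone. It deliberately ignores the match key itself and any file overhead, so it is not a prediction of total key file size; the function name `user_info_bytes` is an assumption for this example.

```cpp
#include <cassert>
#include <cstdint>

// Bytes consumed by the UserInfo field alone across all keys.
// Ignores the match key itself and any per-file overhead.
std::uint64_t user_info_bytes(std::uint64_t record_count, std::uint64_t user_info_size) {
    return record_count * user_info_size;
}
```

For 10 million records, the default 1024-byte field accounts for 10,240,000,000 bytes, while a 12-byte field accounts for 120,000,000 bytes.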

To determine programmatically whether you have the necessary update (build 2072 or newer):

printf(" BUILD NUMBER: %s\n",mdMUReadWriteGetBuildNumber(ReadWrite));
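
The returned string can also be compared against the required build in code. The sketch below assumes the build number string consists of plain decimal digits (for example "2072"); that format is an assumption, not something this page confirms, and `build_at_least` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdlib>

// Returns true when build_text parses as a decimal number >= required.
// Assumes the build number string is plain decimal digits (e.g. "2072").
bool build_at_least(const char* build_text, long required) {
    if (build_text == nullptr) {
        return false;
    }
    char* end = nullptr;
    const long build = std::strtol(build_text, &end, 10);
    if (end == build_text) {
        return false;  // no digits were parsed
    }
    return build >= required;
}
```

If the check fails or the string cannot be parsed, treat the UserInfoSize setting as unavailable and keep the default.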


Keep Work File Location Local

MatchUp uses the work file location to store the process key file as well as temporary sorting files.

By default, Windows stores these files in the temp directory of the logged-in user. On *nix platforms, they are stored in the directory from which the executable is run.

Although users can override this location, we do not recommend it unless you are pointing it to a fast local drive with plenty of writable disk space and full read/write permissions.