MatchUp Hub:Data Considerations: Difference between revisions
Latest revision as of 14:44, 2 October 2018
Data Considerations
After making sure your environment is set up correctly to run MatchUp (Environment Evaluation Areas) and your matchcode has been evaluated and optimized (Matchcode Optimization), users can still experience slow processing speeds due to bad data.
In MatchUp, clustering is possible when at least one component is common to all matchcode combinations in use. Since ZIP5 is used in all matchcode combinations, built keys will be grouped into clusters based on that data type. Therefore, if your database contains an uneven distribution of ZIP codes, as in Table 1 below, changing your matchcode to include LAST NAME, for example, would create better clustering.
Table 1
RECID  LAST NAME  ADDRESS           CITY    ZIP
1      Jones      12 Main Street    Boston  02125
2      Smith      57 Maple Lane     Boston  02125
3      Connor     34 Summer Street  Boston  02125
4      Williams   1 Oak Drive       Boston  02125
n      ***        ***               ***     02125
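The effect of the clustering key can be illustrated outside of MatchUp. The following is a minimal sketch in plain Python, not the MatchUp API: the `records` list, the field names, and the `cluster_sizes` helper are all hypothetical, and exact-value grouping stands in for whatever key-building the matchcode actually performs.

```python
from collections import Counter

# Hypothetical sample rows mirroring Table 1 (illustration only).
records = [
    {"recid": 1, "last_name": "Jones", "zip5": "02125"},
    {"recid": 2, "last_name": "Smith", "zip5": "02125"},
    {"recid": 3, "last_name": "Connor", "zip5": "02125"},
    {"recid": 4, "last_name": "Williams", "zip5": "02125"},
]


def cluster_sizes(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


# Grouping on ZIP5 alone puts every row in one cluster.
by_zip = cluster_sizes(records, ["zip5"])

# Adding LAST NAME to the grouping key spreads the rows out.
by_zip_and_name = cluster_sizes(records, ["zip5", "last_name"])

print("clusters by ZIP5:", dict(by_zip))
print("clusters by ZIP5 + LAST NAME:", dict(by_zip_and_name))
```

Running a count like this against a sample of your own data is one quick way to spot a ZIP code that dominates the file before you start a long deduplication run.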
In the case of Table 2, an extensive number of NULL values can also be a source of clustering issues. You can check for this by using one of our Profiler products to find NULL/Empty values, as well as incorrect data types in columns. Passing your data through an address verification service to correct empty field values can help fix bad ZIP codes and addresses. For other NULL data types, we suggest using an alternate matchcode before deduping or splitting the data into multiple threads.
Table 2
RECID  LAST NAME  ADDRESS           CITY    ZIP
1      Jones      12 Main Street    Boston  NULL
2      Smith      57 Maple Lane     Boston  NULL
3      Connor     34 Summer Street  Boston  NULL
4      Williams   1 Oak Drive       Boston  NULL
n      ***        ***               ***     NULL
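If a profiling tool is not at hand, a rough NULL/Empty check can be approximated with a few lines of Python. This is only a sketch and does not represent the Profiler product: the `null_rates` helper, the field names, and the choice to treat `None`, empty strings, and the literal text `NULL` as missing are all assumptions made for illustration.

```python
# Hypothetical sample rows mirroring Table 2 (illustration only).
rows = [
    {"last_name": "Jones", "city": "Boston", "zip5": None},
    {"last_name": "Smith", "city": "Boston", "zip5": ""},
    {"last_name": "Connor", "city": "Boston", "zip5": "NULL"},
    {"last_name": "Williams", "city": "Boston", "zip5": None},
]


def is_missing(value):
    """Treat None, blank strings, and the literal text 'NULL' as missing (assumption)."""
    return value is None or str(value).strip().upper() in {"", "NULL"}


def null_rates(data, fields):
    """Return the fraction of rows in which each field is missing."""
    if not data:
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_missing(row.get(field)) for row in data) / len(data)
        for field in fields
    }


rates = null_rates(rows, ["last_name", "city", "zip5"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A field that is missing in most rows is a poor candidate for the component shared by every matchcode combination, since all of those rows would end up in the same cluster.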
Table 3 shows a good example of data that has been standardized and verified before processing with MatchUp. Because records are clustered by ZIP code, the data below will be grouped into two clusters, which will provide much better processing speeds.
Table 3
RECID  LAST NAME  ADDRESS           CITY    ZIP
1      Jones      12 Main Street    Boston  02125
2      Smith      57 Maple Road     Boston  02125
3      Connors    3 Summer Circle   Boston  02121
4      Williams   17 Oak Drive      Boston  02121
n      ***        ***               ***     ***
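To get a feel for why smaller clusters help, the sketch below estimates the number of record pairs that would need comparing under a simplifying assumption: every record is compared with every other record in its own cluster and never across clusters. This is an illustrative model only, not a description of MatchUp's internal processing, and the `pair_count` helper is hypothetical.

```python
from collections import Counter


def pair_count(cluster_keys):
    """Sum n * (n - 1) / 2 over clusters, where n is the cluster size."""
    sizes = Counter(cluster_keys)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# ZIP values of the four listed rows in Table 1 and Table 3.
table1_zips = ["02125", "02125", "02125", "02125"]
table3_zips = ["02125", "02125", "02121", "02121"]

print("Table 1 pairs:", pair_count(table1_zips))
print("Table 3 pairs:", pair_count(table3_zips))
```

Under this model the pair count grows roughly with the square of the cluster size, so the gap between one large cluster and several small ones widens quickly as the file grows.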