Difference between revisions of "FAQ:SSIS:Data Quality Components"

From Melissa Data Wiki
Jump to navigation Jump to search
Line 1: Line 1:
[[SSIS:Data Quality Components|← SSIS:Data Quality Components]]
==Can we save current settings for DQC for future?==
==Can we save current settings for DQC for future?==

Revision as of 21:14, 7 October 2014

← SSIS:Data Quality Components

Can we save current settings for DQC for future?

Yes, Melissa Data SSIS includes the feature that you can save currents settings for future use. Under File Tab you can save current setting in a .cfg file in local your drive. Under same File tab click "Open" and upload the .cfg file.

Speed Facts for CVC?

EVC is an enterprise component so performance of EVC is highly dependent on machine configuration. Operating System, RAM, Processor or using network drive for data input or output, make significant impact on performance of EVC. In addition performance also depends upon 2 more main factors:

How you are using EVC?

It is highly recommended for EVC users, select the only those features i.e. LACSLink(Address Add-on) DNS Lookup(Email, Add-on), which are required. Unnecessary use of add-on and feature will effect the performance.

Type of Data?

If the input data is really corrupted or having lot of errors, will effect the performance.

We have additional 2 different scenarios for using EVC, which will help you to analyse the throughput of the component in different situations. For more information you can click on following link: SSIS Speed Scenarios

In which format Data Quality Components(DQC) accept input?

You can use any format as input for Melissa Data SSIS components. It can be .csv, .xls, .dbf, xlsx, SQL database etc. In some cases you might not see any column in drop down options of MD SSIS component, in this case you can use "Data Conversion"(in-built component in BIDS) to change the data type of particular missing column to "DT_WSTR".

How we can update individual Melissa Data Object rather than updating whole CVC?

EVC component installs all Melissa Data Objects in "Binn" folder under DTS folder of SQL Server. If you are using EVC in SQL Server 2008 and need to update only Address Object grab latest mdAddr.dll and replace older mdAddr.dll in following location:
C:\\Program Files\\Microsoft SQL Server\\100\\DTS\\PipelineComponents
Same way you can update other Melissa Data Objects i.e for Email Object replace mdEmail.dll, for GeoCoder mdGeo.dll, etc.
If you are installed EVC on SQL Server 2005, the location of Melissa Data Object will be in following location:
C:\\Program Files\\Microsoft SQL Server\\90\\DTS\\PipelineComponents

Microsoft Fuzzy Lookup vs. The Data Matching and Cleansing Component?

The Microsoft transforms are only available with SQL Server enterprise edition which is a significant cost in itself if your organization doesn’t have it already.
All of the following transforms are SQL Server Enterprise edition only:
  • Data Mining Training Destination
  • Data Mining Query Component
  • Fuzzy Grouping
  • Fuzzy Lookup
  • Term Extraction
  • Term Lookup
  • Dimension Processing Destination
  • Partition Processing Destination

The MD Fuzzy Matching Component allows more granular control over matching whereas the SSIS native fuzzy matching does not allow any control of its algorithms or access to multiple algorithms. There are only two configurable parameters with the stock Fuzzy lookup and grouping components ie: similarity threshold and maximum number of results per input record.

The MD Fuzzy Matching component has 16 configurable matching algorithms some of them developed specifically by us such as the unique Keyboard distance for misspellings. All industry standard matching algorithms are supported, and probabilistic match settings can be configured for each column of data and the rules can be daisy chained for maximum catch all situations. Also the component has the ability to ignore common data quality issues when matching. ie; whitespaces, irregular characters, etc.

The Melissa Data Fuzzy Matching Component can also inherently cleanse your data through regular expressions. Regular expressions can be built and tested through an expression builder which is also built in to the component. Cleansing your data prior to matching is a crucial step in order to get the most accurate results for matching.

The component also is able to automatically split up the results into 3 destinations: Matches, Possible Matches and Non-Matches. Based off the percentage score between 2 records, the component will re-direct the results to the appropriate destination according to the percent thresholds you’ve set in the component. Also the Fuzzy SSIS component has full metadata (source and compare scores for multiple algorithms) for tracking record lineages.

One more thing to bear in mind is that, the stock Fuzzy Lookup component in SSIS requires that your reference table be a SQL Table. The Fuzzy Matching Component does not limit you to a SQL Table Reference Database.

The DMCC suite also comes with the MatchUP component, which is specially designed for the unique set of matching problems with Address, Name and Company data fields. MatchUp recognizes combinations, inverse names, Acronyms, Nick names, numerical streets such as 12th vs. twelfth, and so on.

What Versions of SQL Server are your Components Supported in?

Melissa Data currently supports 2008 and 2012 versions of SQL Server. We also provide components for SQL Server 2005 as part of the SSIS Installation, however, it is no longer supported.