Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out
Written By: Putten, Kok, Gupta
Presented By: Ernesto Ochandio
DSCI 5240 November Dec 7, 2005
Problem Definition
• Exponential growth in data capture leads to data fragmentation.
  • POS customer tracking
  • Corporate Data Warehouse
  • Advanced Analytics
• Increased popularity of personalized messages.
• Prohibitive attitudinal data costs.
Data Fusion Overview
• Data Fusion is the combination of information from different sources.
• Also known as: Micro Data Set Merging, Statistical Record Linkage, and Multi-Source Imputation.
• Example:
  • Demographic and psychographic data aggregated at the geographical level.
  • Everyone in the same region is assigned the same characteristics.
• Motivation:
  • Algorithms can create generalized fusions, providing richer data sets for use in applications or future data mining projects.
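The geographic example above can be sketched in a few lines of Python. This is only an illustration: the region codes, the `avg_income` and `lifestyle_segment` fields, and the `fuse_by_region` helper are hypothetical names, not taken from the paper.

```python
# Hypothetical region-level profile: one record per region (assumed data).
region_profile = {
    "R1": {"avg_income": 52000, "lifestyle_segment": "urban young"},
    "R2": {"avg_income": 38000, "lifestyle_segment": "rural family"},
}

# Hypothetical customer records that only carry a region code.
customers = [
    {"customer_id": 1, "region": "R1"},
    {"customer_id": 2, "region": "R1"},
    {"customer_id": 3, "region": "R2"},
]


def fuse_by_region(customers, region_profile):
    """Attach the region-level profile to every customer in that region."""
    fused = []
    for customer in customers:
        # Customers from an unknown region keep only their own fields.
        profile = region_profile.get(customer["region"], {})
        fused.append({**customer, **profile})
    return fused


fused_customers = fuse_by_region(customers, region_profile)
```

Customers 1 and 2 receive identical characteristics because they share a region, which is the coarseness that record-level fusion tries to improve on.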
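Data Fusion Terminology
• Recipient: the data set to be enriched.
• Donor: the data set that supplies the additional information.
• Common Variables: variables present in both the Recipient and the Donor.
• Fused Variables: variables present only in the Donor that are carried over to the Recipient.
• Critical Common Variables: Common Variables that must match exactly.
[Diagram: Recipient + Donor = Fused Dataset; the two inputs overlap on the Common Variables, and the Fused Variables come from the Donor.]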
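As a rough illustration of how these terms relate, the variable sets can be derived from the column names of the two data sets. The column names below are made up for the example, and which Common Variable is treated as critical is an assumption the analyst would make.

```python
# Hypothetical column names for the two data sets (assumed for illustration).
recipient_columns = {"age", "gender", "region", "purchases"}
donor_columns = {"age", "gender", "region", "brand_attitude", "media_usage"}

# Common Variables appear in both data sets and drive the matching.
common_variables = recipient_columns & donor_columns

# Fused Variables exist only in the Donor and are carried over.
fused_variables = donor_columns - recipient_columns

# Critical Common Variables are a chosen subset that must match exactly.
critical_common_variables = {"gender"}  # assumption: chosen by the analyst
assert critical_common_variables <= common_variables

# The fused data set keeps every Recipient column and adds the Fused Variables.
fused_dataset_columns = recipient_columns | fused_variables
```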
Data Fusion Algorithm
• Find the best Donor elements that match the Recipient element.
• Ensure an exact match on the Critical Variables.
• Limit Donor element usage.
• Use averages from the Donor set to estimate the Fused Variables for the Recipient set.
[Diagram: Recipient + Donor = Fused Dataset]
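The four steps above can be sketched as a small nearest-neighbour matching routine. This is a minimal sketch under stated assumptions, not the authors' implementation: the squared Euclidean distance, the greedy first-come allocation of donors, and the names `fuse`, `k`, and `max_uses` are all choices made for this example, as is the sample data.

```python
from collections import Counter


def fuse(recipients, donors, numeric_common, critical, fused, k=2, max_uses=2):
    """Estimate fused variables for each recipient from its k closest donors.

    Assumptions of this sketch: distance is squared Euclidean over the
    numeric common variables, and donors are allocated greedily in
    recipient order.
    """
    usage = Counter()  # how many times each donor has been used so far
    result = []
    for r in recipients:
        candidates = [
            (sum((r[v] - d[v]) ** 2 for v in numeric_common), i)
            for i, d in enumerate(donors)
            # Critical variables must match exactly; cap donor reuse.
            if all(d[v] == r[v] for v in critical) and usage[i] < max_uses
        ]
        candidates.sort()
        chosen = [i for _, i in candidates[:k]]
        for i in chosen:
            usage[i] += 1
        row = dict(r)
        for v in fused:
            # Average the donors' values; None if no eligible donor is left.
            row[v] = sum(donors[i][v] for i in chosen) / len(chosen) if chosen else None
        result.append(row)
    return result


# Hypothetical sample data (not from the paper).
donors = [
    {"gender": "F", "age": 30, "income": 40, "attitude": 4.0},
    {"gender": "F", "age": 34, "income": 42, "attitude": 2.0},
    {"gender": "M", "age": 31, "income": 41, "attitude": 5.0},
    {"gender": "F", "age": 60, "income": 20, "attitude": 1.0},
]
recipients = [
    {"gender": "F", "age": 31, "income": 41},
    {"gender": "F", "age": 32, "income": 40},
    {"gender": "F", "age": 33, "income": 42},
    {"gender": "M", "age": 50, "income": 30},
]

fused_rows = fuse(recipients, donors, ["age", "income"], ["gender"], ["attitude"])
```

In this sample, the first two female recipients use up the two closest female donors, so the third falls back to the remaining, more distant one. That shows the trade-off the usage limit introduces: donors are not over-represented, but later matches can be worse.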
Conclusion
• Data Fusion increases the value of Data Mining by creating more data to mine while reducing costs and ensuring the best matches possible without over-representing elements in the Donor set.