
Data Integration and Error: Big Data From the 1930s to Now



  1. Data Integration and Error: Big Data From the 1930s to Now

  2. Contents • Big Data in the 1930s and why that matters now • TV measurement and Return Path Data (STB) • Interesting questions for understanding error

  3. Big Data, 1930s Style

  4. Probability Sampling, 1930s Style

  5. Evolution of Statistical Concepts in Research • Early days: novel, non-scientific • 1930s: scientific sampling • Since the 1950s: weighting, probability models, imputation techniques, data fusion, time series analyses, hybrid (Big Data/sample integration)

  6. Nielsen and Audience Measurement • 1923: Nielsen founded • 1950: introduces TV audience measurement • Current technology: the People Meter (electronic measurement, probability samples, all people and sets in the home measured) • Nielsen ratings are the currency for US TV advertising

  7. The Changing TV Environment • Fragmentation of viewing choices • Proliferation of devices • Increasing population diversity

  8. Research Data – Statistical Tools • From: sample/measure/project (panel data) • To: sample/measure/project + integrate, using data fusion, probability modeling, calibration, and predictive modeling across multiple panels, census data, and surveys

  9. What STB and Panels Can Give Us • STB data: large convenience samples, stable results • Panels: completeness of audience measurement • In combination, STB data + panels offer the possibility of stable, unbiased research products

  10. STB Gaps and Bias • Total survey error: 1. Data quality/coverage/timeliness/representativeness 2. Set activity (on/off/other source) 3. Household characteristics 4. Persons viewing (including visitors in the home) 5. Other viewing activity • (Figure: bias vs. standard error profiles for STB data, the People Meter panel, and a possible STB + People Meter combination.)

  11. STB Data Quality – Example Analyses • Good: program junction spikes • Not so good: machine reboot activity

  12. Are We Improving the Measurement? 1. Transparency and validation at each step and overall 2. Total survey error (bias and standard error)
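
These two components combine in the standard total survey error decomposition; as a reminder (textbook statistics, not from the deck), for an estimator $\hat{\theta}$ of a true audience value $\theta$:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})
```

The standard error on the slides is $\sqrt{\mathrm{Var}(\hat{\theta})}$; a combined STB + panel estimator wins only if it reduces this sum, not just one term.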

  13. Assessing Integration Error • Input error (GIGO) • Matching error • Statistical error • Validity levels • Error compounding across multiple databases

  14. Assessing Integration Errors • Most problems remain, but some can be mitigated through integration • Input error (GIGO): coverage gaps, definitional problems, input errors, etc. • Possible improvement through integration weighting effects

  15. Assessing Integration Errors • Multiple databases may have correlated errors; that may be preferable to random errors, since the overall effect is restricted to a smaller group (e.g., new householders missing from some address lists) • Matching error (e.g., address matching): good = correct match, bad = no match, ugly = incorrect match • There is a trade-off between match rates and error rates (illustrated in the sketch below)
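
A minimal sketch of that trade-off, using Python's standard-library `difflib` for fuzzy address matching; the addresses and threshold values are hypothetical:

```python
# Good/bad/ugly matching: raising the similarity threshold lowers both
# the match rate and the error rate. All addresses here are invented.
from difflib import SequenceMatcher

panel = ["12 Oak St", "99 Elm Ave", "7 Main St"]
stb = ["12 Oak Street", "99 Elm Avenue", "77 Maine Street"]  # last is a different home

def best_match(addr, candidates):
    """Return (similarity, candidate) for the most similar candidate."""
    return max((SequenceMatcher(None, addr, c).ratio(), c) for c in candidates)

for threshold in (0.90, 0.70):
    print(f"threshold {threshold}:")
    for addr in panel:
        score, cand = best_match(addr, stb)
        status = "match" if score >= threshold else "no match"
        print(f"  {addr!r} -> {cand!r} ({score:.2f}): {status}")
# At 0.90 the true matches are missed (bad: no match); at 0.70 they are
# found, but '7 Main St' also picks up '77 Maine Street' (ugly: incorrect match).
```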

  16. Statistical Error (Sample-Based Imputation) • Model bias leads to attenuation (regression to the mean; see the sketch below) • Bias in an individual data point can be undetectable due to sampling error
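
A minimal sketch of the attenuation effect: values imputed from a regression model are less spread out than the real values, so estimates for extreme groups regress toward the mean. All numbers here are hypothetical.

```python
import random

random.seed(1)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]   # linking variable (e.g., demographics)
y = [xi + random.gauss(0, 1) for xi in x]    # true behavior: x explains half its variance

# Least-squares fit of y on x in a "donor" sample, then impute for recipients.
mx, my = sum(x) / n, sum(y) / n
beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
y_imputed = [my + beta * (a - mx) for a in x]

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

print(f"variance of true y    : {var(y):.2f}")          # ~2.0
print(f"variance of imputed y : {var(y_imputed):.2f}")  # ~1.0, pulled toward the mean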

  17. Separating Model Bias and Sampling Error • Z-tests on each comparison, with evaluation of the z-score distributions (sketched below) • Deviation from the expected distribution gives a bias estimate
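
A minimal sketch of that diagnostic: z-test every fused-vs-benchmark comparison and inspect the whole z distribution. Pure sampling error gives roughly N(0, 1); a shifted mean indicates model bias. The cell-level z-scores below are simulated, not real measurements.

```python
import random
import statistics

random.seed(7)
TRUE_BIAS = 0.5  # hypothetical systematic bias, in standard-error units

# One z-score per comparison cell (e.g., a network-by-demo audience estimate).
zs = [random.gauss(TRUE_BIAS, 1.0) for _ in range(500)]

print(f"mean z = {statistics.mean(zs):+.2f}  (0 expected if unbiased)")
print(f"sd   z = {statistics.stdev(zs):.2f}  (~1 expected from sampling error alone)")
# A nonzero mean z, scaled back by each cell's standard error, gives a
# rough estimate of the bias component on the original scale.
```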

  18. Statistical Error – Multiple Data Sets • (Figure: two fusion designs. Hub and spoke: TV fused separately with web and purchase data. Sequential: TV fused with web data first, then the result fused with purchase data.) • Comparison with single-source data: Nielsen National People Meter TV and internet usage matched with credit card purchase data

  19. Accuracy Test • (Figure: hub-and-spoke vs. sequential fusion accuracy, with link correlations of R = 0.67, R = 0.44, R = 0.4, and R = 0.5 across the two designs.) • Correlation of 8 product categories with 14 TV networks and 60 websites

  20. Sequential vs. Hub and Spoke • Unless the hub has all the relevant linking information, a sequential approach gives better results • In our example, the sequential fusion captured interactions between web and purchase behavior • However, sequential fusions can break down with too many data sets, as error compounds

  21. Validity Levels – Individual vs. Aggregated • Aggregate prediction: imputation methods can reliably predict aggregate-level behavior given good predictive variables (e.g., 90% accuracy, i.e., 10% regression to the mean, for TV audience estimates by product users); errors compound with multiple sources, but the extent varies by case • Individual prediction: the ideal scenario is predicting every individual's behavior; in reality, most imputation methods do better than random but rarely approach 100% accuracy (e.g., ~40% improvement on random when predicting product users from cookies, i.e., 14% of online ad impressions delivered to product users rather than 10%; the arithmetic is checked below)
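
A quick check of the "~40% improvement on random" figure from the slide: it is simply the relative lift of the targeted rate over the base rate.

```python
base_rate = 0.10      # share of impressions reaching product users at random
targeted_rate = 0.14  # share reached via cookie-based imputation (per the slide)
lift = (targeted_rate - base_rate) / base_rate
print(f"lift over random: {lift:.0%}")  # -> 40%
```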

  22. Conclusion • Data everywhere! • Data quality and relevance are essential • Integration brings insights and error • Statistical integrity is as important now as it was in the 1930s

  23. Appendix

  24. Ad Effectiveness – More Complicated • (Figure: hub-and-spoke diagram with "HUB: matching info" linked to website visit and purchase.) • Imagine a data set of 10,000 people for whom you have tracked exposure to a brand's website and subsequent purchase of that brand. In our initial thought experiment, 76% converted.

  25. A Basic Experiment • Now imagine that you have measurement error in 10% of your cases. We ran a simulation of 1,000 datasets with incorrect site-visit data in 10% of cases. The difference between the original conversion rate and that in the 1,000 error-ridden test datasets is about 8.5%. SD is xx. (A sketch of this kind of simulation follows.)
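
A minimal sketch of the experiment described above: 10,000 people, purchase linked to a site-visit flag, and 1,000 replicate datasets in which 10% of the visit flags are recorded incorrectly. The visit and conversion rates below are hypothetical, so the slide's exact 8.5% figure is not reproduced.

```python
import random
import statistics

random.seed(42)
N, N_SIMS, ERR = 10_000, 1_000, 0.10

visited = [random.random() < 0.5 for _ in range(N)]
# Visitors convert at 76% (as in the thought experiment); others at an assumed 20%.
bought = [(random.random() < 0.76) if v else (random.random() < 0.20) for v in visited]

def conversion(visits):
    """Conversion rate among people recorded as site visitors."""
    hits = [b for v, b in zip(visits, bought) if v]
    return sum(hits) / len(hits)

true_rate = conversion(visited)
rates = []
for _ in range(N_SIMS):
    noisy = [(not v) if random.random() < ERR else v for v in visited]
    rates.append(conversion(noisy))

print(f"error-free conversion rate : {true_rate:.3f}")
print(f"mean with 10% flag error   : {statistics.mean(rates):.3f}")
print(f"SD across {N_SIMS} datasets : {statistics.stdev(rates):.4f}")
```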

  26. A Basic Experiment • (Figure: hub-and-spoke diagram adding a "saw TV ad" node alongside website visit and purchase.) • What happens when we add another data set?

  27. More Data – Same Error • Given two types of ad-exposure data to measure, the impact of error in a single data source should be smaller. • Imagine that you have measurement error in 10% of your cases for one data source, the same error as in the previous experiment. As expected, conversion values are closer to those in our error-free data set. SD =

  28. More Data – More Error • Next, we introduced error into the TV data set as well. Performance worsens; SD is xx. But the effect looks more additive than exponential.

  29. More Data – Even More Error • Next, we imagined combining 6 data sets, each with 10% error. What do we see? (A sketch of this kind of multi-source experiment follows.)
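
A minimal sketch of the compounding question: corrupt k exposure sources (k = 1, 2, 3, 6), each with 10% recording error, and track how far the conversion rate among the "fully exposed" drifts from its error-free value. All rates are hypothetical; the point is the shape of the growth.

```python
import random
import statistics

random.seed(0)
N, ERR, SIMS = 10_000, 0.10, 30

def mean_drift(k):
    """Mean drift in conversion among the fully exposed with k noisy sources."""
    flags = [[random.random() < 0.7 for _ in range(k)] for _ in range(N)]
    bought = [(random.random() < 0.6) if all(f) else (random.random() < 0.2)
              for f in flags]

    def conv(fl):
        hits = [b for f, b in zip(fl, bought) if all(f)]
        return sum(hits) / len(hits)

    true = conv(flags)
    drifts = []
    for _ in range(SIMS):
        noisy = [[(not x) if random.random() < ERR else x for x in f] for f in flags]
        drifts.append(abs(conv(noisy) - true))
    return statistics.mean(drifts)

for k in (1, 2, 3, 6):
    print(f"{k} noisy source(s): mean drift = {mean_drift(k):.3f}")
# In runs like this the drift grows roughly with k,
# i.e., additively rather than exponentially.
```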

  30. Matching Error • In any data combination, there is an additional source of error: mismatches to the hub or identity variable. Misspelled names can lead to false negatives; non-deterministic matching can lead to false positives. • Introducing 10% matching error (to the first-only, both, and second-only data sets) suggests that the impact is negligible relative to conversion in error-free data. • This suggests the quality of the data matters more than the quality of the matching.

  31. Aside: The Importance of Weight • Here, the TV data was heavily weighted toward exposure. That overwhelmed any error from the website-visit data; indeed, it appeared to counterbalance it.

  32. Aside: The Importance of Correlation • (Figure: weaker correlation between web visit and purchase (xx) vs. stronger correlation (xx).) • The greater the correlation between the dependent and independent variables, the greater the impact of error. (A sketch follows.)
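
A minimal sketch of that claim: rerun the 10% flag-error experiment with a weak and a strong visit-purchase relationship and compare the drift in measured conversion. The rates are hypothetical.

```python
import random
import statistics

random.seed(5)
N, ERR, SIMS = 10_000, 0.10, 100

def mean_drift(p_buy_if_visit, p_buy_otherwise):
    visited = [random.random() < 0.5 for _ in range(N)]
    bought = [(random.random() < p_buy_if_visit) if v
              else (random.random() < p_buy_otherwise) for v in visited]

    def conv(vs):
        hits = [b for v, b in zip(vs, bought) if v]
        return sum(hits) / len(hits)

    true = conv(visited)
    drifts = []
    for _ in range(SIMS):
        noisy = [(not v) if random.random() < ERR else v for v in visited]
        drifts.append(abs(conv(noisy) - true))
    return statistics.mean(drifts)

print(f"weak visit-purchase link (60% vs 50%): drift = {mean_drift(0.60, 0.50):.4f}")
print(f"strong link              (90% vs 10%): drift = {mean_drift(0.90, 0.10):.4f}")
# The stronger the relationship, the more a flipped flag distorts the
# measured conversion rate.
```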

  33. What Do We Know Thus Far? • There is still more work to do, but we have formed certain hypotheses: • When combining multiple data sets, the error appears additive. • Error rates being equal, the underlying aspects of the data are more likely to impact the outcome than the combination. • It is important, however, to qualify the basic relatedness between each independent variable and the dependent outcome. This argues for a hub-and-spoke approach to data combination. • So how did these hypotheses fare in a quick test using real-world data?

  34. Combining Data Sets • There are two basic paths to integrating data. • A serial integration: (A+B)+C. Each data set resulting from an integration is smaller than either original source due to non-matches: A + B = A+B, then A+B + C = A+B+C. (See the sketch below.)
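
A minimal sketch of serial shrinkage: each join keeps only the IDs present in both inputs, so (A+B)+C is smaller than any single source. The 80% coverage per source is a hypothetical figure.

```python
import random

random.seed(3)
universe = range(10_000)
# Each source independently covers ~80% of the ID universe.
A, B, C = ({i for i in universe if random.random() < 0.8} for _ in range(3))

AB = A & B    # first integration drops non-matches (~64% of the universe left)
ABC = AB & C  # second integration shrinks the set again (~51% left)

print(f"|A| = {len(A)}, |B| = {len(B)}, |C| = {len(C)}")
print(f"|A+B|   = {len(AB)}")
print(f"|A+B+C| = {len(ABC)}")
```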

  35. Combining Data Sets • (Figure: hub-and-spoke diagram with "HUB: matching info" at the center.) • Another approach is a hub-and-spoke model: (A+B)+(A+C), etc. While the final integrated set is still reduced due to non-matches, the error from each match to the hub is known. (A companion sketch follows.)
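
A minimal companion sketch of the hub-and-spoke alternative: B and C are each matched to hub A independently, so each spoke's match rate (its exposure to matching error) is visible before the sets are combined. Coverage figures are hypothetical, as above.

```python
import random

random.seed(3)
universe = range(10_000)
A, B, C = ({i for i in universe if random.random() < 0.8} for _ in range(3))

AB, AC = A & B, A & C  # each spoke matched to the hub separately
print(f"spoke B match rate to hub: {len(AB) / len(A):.0%}")
print(f"spoke C match rate to hub: {len(AC) / len(A):.0%}")
print(f"fully linked (A+B)+(A+C) : {len(AB & AC)} records")
```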

  36. Ad Effectiveness – More Complicated • (Figure: hub-and-spoke diagram with purchase and other nodes around "HUB: matching info".) • Ad effectiveness captures the correlation between exposure to advertising and subsequent purchase of a product. When someone who sees an ad buys a product, we say they have CONVERTED.
