Why Big Data Is Not All It’s Cracked Up To Be
Peter H. Westfall, Paul Whitfield Horn Professor of Statistics, Texas Tech University
What Big Data Is Cracked Up to Be
"The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all."
- Chris Anderson (2008), "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired
A Response That I Agree With
"... crucially important caveats are needed when using such datasets: caveats that, worryingly, seem to be frequently overlooked."
- Mark Graham, Datablog (2012), from the blog entry "Big data and the End of Theory?"
Competing Views of Data
Statistician paradigm (probabilistic view): p(y | x, θ) → random DATA → observed data → decisions under uncertainty.
"Data scientist" paradigm (deterministic view): data → crunched data → crunched data → decisions (uncertainty?).
Resistance to Probabilistic Views of Big Data
"Data scientists" have limited training in probability and resist it.
Other sources of resistance:
- "Population data": all is known from the data, so nothing is random.
- Everything is statistically significant, but often meaningless, because N is so large.
Why Probabilistic Modeling Is Needed with Big Data
1. Processes define big data, not vice versa (it is a process, not a "population").
2. Big data becomes small data when sliced.
3. The data you really need are not there.
4. Really, n = 1 even with big data.
1. Processes Define Big Data, Not Vice Versa
The oldest living earthling is 124.5 years old (say). But 124.5 is biologically irrelevant: it is one random realization of the underlying biological process, not a meaningful constant of nature. Big data are probabilistic; the "population model" fails. And it fails spectacularly with sliced data.
2. Slicing Big Data Gives Small Data
Less risk? The sketch below shows how quickly cell sizes shrink once a few slicing variables are applied.
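To make the slicing point concrete, here is a minimal sketch on simulated data (the column names state, age_band, product, and month are purely hypothetical): a million rows collapse to roughly one row per cell once four slicing variables are applied.

```python
# Slicing "big" data into cells defined by a few attributes leaves only a few
# rows per cell -- the big-data-becomes-small-data point of this slide.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "state":    rng.integers(0, 50, n),   # 50 states
    "age_band": rng.integers(0, 12, n),   # 12 age bands
    "product":  rng.integers(0, 40, n),   # 40 products
    "month":    rng.integers(0, 36, n),   # 36 months
})

cell_sizes = df.groupby(["state", "age_band", "product", "month"]).size()
print("rows overall:        ", n)
print("possible cells:      ", 50 * 12 * 40 * 36)
print("median rows per cell:", int(cell_sizes.median()))
print("largest cell:        ", int(cell_sizes.max()))
```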
3. "Big Data" Does Not Mean "Right Data"
Example: credit scoring. The DATA are outcomes of accepted applicants only.
Y = repay / not repay
X's = personal financial measures
Logistic regression!
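As a hedged illustration of the problem (not the speaker's own analysis), the sketch below simulates the setup: repayment depends partly on an unobserved creditworthiness factor, past acceptance decisions also depended on that factor, and a logistic regression fit only to accepted applicants therefore gives different coefficients than a fit to all applicants would. All variable names and coefficients are invented for the example.

```python
# Minimal simulated illustration: fitting "Y = repay/not" on accepted
# applicants only does not recover the model for the full applicant pool,
# because acceptance depended on information correlated with repayment.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200_000
income = rng.normal(size=n)            # observed personal financial measure
credit = rng.normal(size=n)            # unobserved creditworthiness

repay_prob = 1 / (1 + np.exp(-(-1.0 + 1.0 * income + 1.5 * credit)))
repay = (rng.random(n) < repay_prob).astype(int)

# Past screening (loosely) used the unobserved factor, so acceptance is
# informative about repayment even after conditioning on income.
accepted = (0.5 * income + 1.0 * credit + rng.normal(size=n)) > 0

X = sm.add_constant(income)
fit_all      = sm.Logit(repay, X).fit(disp=0)                      # not available to the bank in practice
fit_accepted = sm.Logit(repay[accepted], X[accepted]).fit(disp=0)  # what the bank can actually fit

print("all applicants:     ", fit_all.params)
print("accepted applicants:", fit_accepted.params)   # noticeably different
```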
What You Have and What You Want
Have: repayment outcomes for the accepted applicants only.
Want: a model of repayment for all applicants, including those who would have been rejected.
Probabilistic Methods Needed for Selection Bias
- Imputation (reject inference)
- Bivariate probit (Heckman's selection model)
- ...
These are standard statistical methods, big data or not.
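As one concrete, heavily simplified example of the methods named above, here is a sketch of Heckman's classical two-step correction on simulated data. The outcome is kept continuous (think profit per account) so the two-step with an inverse Mills ratio applies directly; the binary repay/not outcome on the earlier slide would instead call for the bivariate probit. Every variable name and coefficient below is made up for the illustration.

```python
# Heckman two-step sketch: (1) probit model for selection (acceptance),
# (2) outcome regression on the selected cases with the inverse Mills ratio
# added as a regressor to absorb the selection effect.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 100_000
income = rng.normal(size=n)
branch = rng.normal(size=n)   # selection-only variable (exclusion restriction)

# Correlated errors link the selection and outcome equations (rho = 0.6).
e_sel, e_out = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T
accepted = (0.3 + 0.8 * income + 0.7 * branch + e_sel) > 0   # selection equation
profit   = 1.0 + 0.5 * income + e_out                        # outcome (seen only if accepted)

# Step 1: probit for acceptance, then the inverse Mills ratio.
Z = sm.add_constant(np.column_stack([income, branch]))
probit = sm.Probit(accepted.astype(int), Z).fit(disp=0)
zg = Z @ probit.params
imr = norm.pdf(zg) / norm.cdf(zg)

# Step 2: outcome regression for accepted cases, with and without the IMR.
X_naive = sm.add_constant(income[accepted])
X_corr  = sm.add_constant(np.column_stack([income[accepted], imr[accepted]]))
naive     = sm.OLS(profit[accepted], X_naive).fit()
corrected = sm.OLS(profit[accepted], X_corr).fit()

print("true income effect:      0.5")
print("naive income effect:    ", round(naive.params[1], 3))      # biased by selection
print("corrected income effect:", round(corrected.params[1], 3))  # roughly back to 0.5
```

In practice both equations would use richer covariate sets; the two steps are written out by hand here rather than relying on a packaged estimator.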
4. Really, n = 1 Even with Big DATA
[Figure: a grid over "Space", s (vertical axis) and Time, t (Past, Present, Future, horizontal axis); the observed DATA fill one cell, and every other spatio-temporal cell is marked "?".]
Probabilistic Models for DATA Production
[Figure: the same space-time grid, with each cell governed by its own model, e.g. p(y|x,t1,s1), p(y|x,t2,s1), p(y|x,t3,s2), p(y|x,t4,s3).]
The Big Data Estimate of p(y|x,t,s)
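A minimal sketch, assuming the "big data estimate" here means the empirical conditional relative frequency within the single (t, s) instance the data cover; all data and column names below are simulated and hypothetical.

```python
# With millions of observations, the empirical relative frequency pins down
# p-hat(y = 1 | x) almost exactly -- but only for the one spatio-temporal
# instance that generated the data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2_000_000
x = rng.integers(0, 5, n)                        # a discretized predictor
y = (rng.random(n) < 0.2 + 0.1 * x).astype(int)  # simulated outcome

p_hat = pd.DataFrame({"x": x, "y": y}).groupby("x")["y"].mean()
print(p_hat)   # close to the true 0.2 + 0.1*x for x = 0, ..., 4
```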
And Even After We Estimate p(y|x,t,s) ...
An inconvenient truth: p(y|x,t1,s1) ≠ p(y|x,t2,s2).
An even more inconvenient truth: even with n = ∞ (really BIG data!), the sample size is just 1 for generalizing to other spatio-temporal instances.
Quantifying Generalizability
The extent to which instance "A" generalizes to instance "B" is measured by the distance from A to B: the smaller the distance, the better the generalization.
Example: 30% of employees in company "A" need laptops. Generalizability to company "B" is measured by the distance |30% - (B's percentage)|.
Quantifying Generalizability
Fundamental notion: there is variation between instances (σb²) and variation within instances (σ²).
Generalizability is a function of both variances. Big data can reduce generalization error within instances, but not between them; the simulation sketch below illustrates this.
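A minimal simulation sketch of this point, with made-up values of σb and σ: as n grows, the error in estimating the observed instance's own mean vanishes, while the error in generalizing to a new, independently drawn instance levels off near √2·σb.

```python
# Within-instance error shrinks like sigma / sqrt(n); the error of
# generalizing to a different instance is floored by the between-instance
# standard deviation sigma_b, no matter how large n gets.
import numpy as np

rng = np.random.default_rng(4)
sigma_b, sigma, mu = 0.5, 2.0, 10.0   # between-instance sd, within-instance sd, grand mean
n_sims = 5_000

for n in (100, 10_000, 1_000_000):
    own_mean = mu + sigma_b * rng.normal(size=n_sims)                 # mean of the observed instance
    estimate = own_mean + sigma / np.sqrt(n) * rng.normal(size=n_sims)  # sample mean based on n points
    new_mean = mu + sigma_b * rng.normal(size=n_sims)                 # mean of a new instance
    rmse_within  = np.sqrt(np.mean((estimate - own_mean) ** 2))
    rmse_between = np.sqrt(np.mean((estimate - new_mean) ** 2))
    print(f"n={n:>9,}   RMSE within own instance: {rmse_within:.3f}   "
          f"RMSE generalizing to a new instance: {rmse_between:.3f}")
```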
When Is σb² "Big"?
- When generalizing from mouse to man.
- When generalizing from before the housing crisis to after it.
σb² is small when generalizing from human biology today to human biology tomorrow.
Conclusions
- Probability predicts data that are not there.
- Big data: not all it's cracked up to be, because the data you need are typically not there.
- Probabilistic modeling is needed.
"There is no greater increase in sample size than the increase from one to two."
- John Tukey