There Is No Big Data*
*Unless You Are Big Brother or Big Tech
Zachary G. Ives
University of Pennsylvania and Inc. (visiting for 2 more weeks)
We’ve All Heard the Story…
• Google has multi-PB to EB of data
• Facebook: 10PB data warehouse (Parikh keynote)
• Walmart: 500TB data warehouse in 2004
• “Data is becoming so huge, we in academia need to invent new BIG DATA capabilities!”
• … Marketers gave us a guide: ~4 “V”s
• … We latched onto volume + velocity
But Wait – Google has Jeff Dean etc.! Why Do They Need Us to Handle Scale?
Big Tech companies are solving scale themselves, and leading the way.
They have real data, real workloads, real $$, real machines.
MapReduce, Pregel, F1, MillWheel, Puma, Presto, …
Worse: The Problem Is Usually Not Too Much Data to Handle…
Rowstron+ 12: even Big Tech data isn’t always BIG
• Analytics clusters @ Microsoft: median job < 14GB
• Median Facebook job < 100GB
What about academia, science, or “medium tech” data?
• A genome: 3G bases
• A giant Twitter crawl: 284M edges, 53M entities
• Wikipedia with all history and languages: 30M pages
Single server-sized… But not what we want to look at: most of the data we want to process needs complementary data we don’t own!!!
Big Data is a Product, Not a Source
Current focus: BIG?
• Not just “variety” – proprietary “little data”:
  • Specialized science data
  • Individual observations
  • E-commerce data
  • …
“Growing” small data → BIG DATA
Many issues in “big data integration”
• How do we exploit existing knowledge?
• How do we take advantage of scale, of user populations, and history?
But: How do we convince “small data” owners they WANT to be BIG DATA?
• Need to measure impact of data
  Next step beyond provenance, responsibility, …
• Need to incentivize contributions
  Credit, badges, h-index, $$$, …
• Need user DRM or DUAs, not just corporate DRM, EULAs, and SLAs
“I am getting bigger and bigger”