Big Data: Analytics Platforms
Donald Kossmann, Systems Group, ETH Zurich
http://systems.ethz.ch
Why Big Data?
• because bigger is smarter
  • answer tough questions
• because we can
  • push the limits and good things will happen
bigger = smarter?
• Yes!
  • tolerate errors
  • discover the long tail and corner cases
  • machine learning works much better
bigger = smarter?
• Yes!
  • tolerate errors
  • discover the long tail and corner cases
  • machine learning works much better
• But!
  • more data, more error (e.g., semantic heterogeneity)
  • with enough data you can prove anything
  • still need humans to ask the right questions
Fundamental Problem of Big Data
• There is no ground truth
• gets more complicated with self-fulfilling prophecies
  • e.g., stock market predictions change the behavior of people
  • e.g., Web search engines determine the behavior of people
Fundamental Problem of Big Data
• There is no ground truth
  • gets more complicated with self-fulfilling prophecies
• Hard to debug: takes the human out of the loop
• Example: how to play the lottery in Napoli
  • Step 1: you visit "oracles" who predict numbers to play
  • Step 2: you visit "interpreters" who explain the predictions
  • Step 3: after you lost, "analysts" tell you that the "oracles" and "interpreters" were right and that it was your fault
• [Luciano De Crescenzo: Thus Spake Bellavista]
Why Big Data?
• because bigger is smarter
  • answer tough questions
• because we can
  • push the limits and good things will happen
Because we can… Really?
• Yes!
  • all data is digitally born
  • storage capacity is increasing
  • counting is embarrassingly parallel
Because we can… Really?
• Yes!
  • all data is digitally born
  • storage capacity is increasing
  • counting is embarrassingly parallel
• But,
  • data grows faster than energy on chip
  • value / cost tradeoff unknown
  • ownership of data unclear (aggregate vs. individual)
• I believe that all of these "buts" can be addressed
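The "counting is embarrassingly parallel" point can be sketched in a few lines of plain Python: each partition of the input is counted independently (each chunk could live on a different machine), and the partial counts merge associatively. A toy sketch with illustrative data, not a distributed implementation:

```python
from collections import Counter

def count_chunk(lines):
    # Each chunk is counted with no shared state -- this is the part
    # that can run on any number of machines in parallel.
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

def merge(counters):
    # Counter addition is associative, so partial counts can be
    # combined in any order (e.g., in a reduce tree).
    total = Counter()
    for c in counters:
        total += c
    return total

logs = ["a b a", "b c", "a"]
chunks = [logs[0:1], logs[1:3]]          # any partitioning works
result = merge(count_chunk(ch) for ch in chunks)
# result["a"] == 3 no matter how the input was partitioned
```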
Utility & Cost Functions of Data
[figure: utility and cost plotted against noise/error, with separate curves for curated, random, and malicious data]
Best Utility/Cost Tradeoff
[figure: utility and cost vs. noise/error, highlighting the malicious curves]
What is good enough?
[figure: utility and cost vs. noise/error, highlighting the curated curves]
What about platforms?
• Relational Databases
  • great for 20% of the data
  • not great for 80% of the data
• Hadoop
  • great for nothing
  • good enough for (almost) everything (if tweaked)
Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  • MR abstracts "counting" (= machine learning)
• it is an eco-system - it is NOT a platform
  • HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
• relational database systems
  • turned into a platform depending on app / problem
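The "MR abstracts counting" bullet can be made concrete with a minimal single-process sketch of the map/shuffle/reduce contract. This is illustrative Python, not Hadoop's actual API; the sort stands in for the shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # map: emit (key, value) pairs -- here, (word, 1)
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce: combine all values seen for one key
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    # the sort plays the role of the shuffle: it groups pairs by key
    pairs = sorted(kv for r in records for kv in map_fn(r))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

counts = map_reduce(["big data", "big platforms"], map_fn, reduce_fn)
# [("big", 2), ("data", 1), ("platforms", 1)]
```

Swapping in a different `map_fn`/`reduce_fn` pair gives other aggregations (sums, histograms, sufficient statistics for ML), which is what the slide means by "counting".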
Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full-text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!!!
  • current production system is proprietary
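The "HBase to index by timestamp and session id" bullet relies on a standard trick in sorted key-value stores: compose the row key from session id and timestamp so a session's events are contiguous and a prefix scan returns them in time order. A toy illustration with a plain dict standing in for the store; the key scheme and data are hypothetical, not the actual Amadeus schema:

```python
def row_key(session_id, ts):
    # Zero-pad the timestamp so lexicographic key order matches
    # numeric time order within a session.
    return f"{session_id}:{ts:013d}"

store = {}  # stand-in for a sorted key-value store such as HBase
store[row_key("sess42", 1700000000123)] = "search BCN->ZRH"
store[row_key("sess42", 1700000000456)] = "book flight"
store[row_key("sess99", 1700000000200)] = "login"

# "prefix scan": all events of one session, in timestamp order
events = [v for k, v in sorted(store.items()) if k.startswith("sess42:")]
# ["search BCN->ZRH", "book flight"]
```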
Some things Hadoop got wrong
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response-time guarantees, no real time
• essentially ignored 40 years of DB research
Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balancing, …
How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
  • implement joins, recursion, …
• Option 2: Push Hadoop into RDBMS?
  • build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
  • real-time, etc.
• Option 4: Patterns to compose components
  • log service, machine learning, …
  • but, do not build a "super-Hadoop"
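Option 1's "implement joins" is classically done as a reduce-side (repartition) join: the map phase tags each record with its source, the shuffle groups records by join key, and the reduce phase pairs the two sides. A minimal single-process sketch with illustrative data, not Hadoop code:

```python
from collections import defaultdict

def reduce_side_join(left, right):
    # "shuffle": group records from both inputs by their join key
    buckets = defaultdict(lambda: ([], []))
    for key, row in left:
        buckets[key][0].append(row)
    for key, row in right:
        buckets[key][1].append(row)
    # "reduce": per key, emit the cross product of the two sides
    return [(key, l, r)
            for key, (ls, rs) in buckets.items()
            for l in ls for r in rs]

users = [(1, "alice"), (2, "bob")]
clicks = [(1, "home"), (1, "search"), (3, "login")]
joined = reduce_side_join(users, clicks)
# keys without a partching partner (2 and 3) produce no output
```

In a real MapReduce job the grouping is done by the framework's shuffle rather than an in-memory dict, which is exactly the kind of DB knowledge Option 1 proposes pushing into Hadoop.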
Conclusion
• Focus on the "because we can…" part
  • help data scientists to make everything work
• Stick to our guns
  • develop clever algorithms & data structures
  • develop modeling tools and languages
  • develop abstractions for data, errors, failures, …
  • develop "glue"; get the plumbing right
• Package our results correctly
  • find the right abstractions (=> APIs of building blocks)