Intel “Big Data” Science and Technology Center Michael Stonebraker
Context • Intel held a national “beauty contest” to locate their next S & T center • MIT won, with a “Big Data” proposal • 160 proposals • $2.5M per year for 3-5 years plus 5 Intel scientists • 20 PIs, half at MIT
Big Data Means What? • Volume too large • Stupid analytics (i.e. SQL) • solved by commercial data warehouse products • Smart analytics (predictive modelling, machine learning, …) • Velocity too big • Drink from a firehose • Variety too large • Data integration problem • And what does this mean for computer architecture?
Big Data Means What? • Volume too large – smart analytics • Array data bases • Parallel algorithms • Integration of linear algebra • Scalable visualization • Velocity too big • Main memory DBs • And what does this mean for computer architecture? • Many core • Son-of-flash • Xeon Phi
Array Data Bases • Elasticity in SciDB • Query optimizer for SciDB • Genomics benchmark • Run on SciDB, SciDB + Phi, column stores, row stores, MADlib, Hadoop • Graphs as sparse arrays • EarthDB
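The "graphs as sparse arrays" bullet can be illustrated with a minimal sketch. It assumes a tiny hypothetical edge list and stores the adjacency array in coordinate form as a plain Python dict; it does not use SciDB or its query language, and the names `sparse_adjacency`, `out_neighbors`, and `sparse_matvec` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical edge list for a small directed graph: (source, destination, weight).
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 0, 1.0)]

# Coordinate form: only the non-empty cells of the 2-D adjacency array are stored.
sparse_adjacency = {(src, dst): weight for src, dst, weight in edges}


def out_neighbors(cells, node):
    """Slice one row of the sparse array: every cell whose first coordinate is `node`."""
    return sorted(dst for (src, dst) in cells if src == node)


def sparse_matvec(cells, vector):
    """Multiply the sparse adjacency array by a vector given as {index: value}."""
    result = defaultdict(float)
    for (src, dst), weight in cells.items():
        result[src] += weight * vector.get(dst, 0.0)
    return dict(result)
```

Under this representation a graph traversal step becomes an array slice, and one step of a PageRank-style computation becomes a sparse matrix-vector product.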
Scalable Algorithms • Parallelizing locality-sensitive hashing • Other algorithm people are going to work in other areas • Pick your favorite algorithm, parallelize it, and make it scale • Scalable Julia
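As a rough illustration of what parallelizing locality-sensitive hashing can look like, the sketch below uses random-hyperplane signatures and splits the points across workers, then merges the per-worker buckets. This is an assumed, generic scheme, not the center's algorithm. All function names are hypothetical, and a thread pool is used only to keep the example self-contained; in CPython it shows the partition-and-merge structure rather than a real speedup.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the sketch repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(point, planes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, point)) >= 0 else 0
        for plane in planes
    )


def hash_chunk(chunk, planes):
    """Hash one partition of (index, point) pairs into local buckets."""
    buckets = defaultdict(list)
    for idx, point in chunk:
        buckets[signature(point, planes)].append(idx)
    return buckets


def parallel_lsh(points, planes, num_workers=4):
    """Split the points across workers, hash each chunk, then merge the buckets."""
    indexed = list(enumerate(points))
    chunks = [indexed[i::num_workers] for i in range(num_workers)]
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(lambda chunk: hash_chunk(chunk, planes), chunks):
            for sig, ids in partial.items():
                merged[sig].extend(ids)
    return {sig: sorted(ids) for sig, ids in merged.items()}
```

Hashing needs no coordination between workers because each signature depends only on the point and the shared hyperplanes; the only serial step is the final bucket merge.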
Integration of Linear Algebra • Hardly anybody can beat BLAS/LAPACK/ScaLAPACK • 10^5 performance difference between Python and Intel-optimized C++ • If you write operation X, chances are you will lose to Jack Dongarra by an order of magnitude • Don’t fight the wizard
Integration of Linear Algebra • DBMS + ScaLAPACK • Federation required • Resource manager required • Recoverable ScaLAPACK required • Someday • A common storage format • Would make ACID much easier, …
Visualization • Resolution reduction • Using “explain” • Choose the rendering automatically • Decision tree • Smart prefetch • Integrate with SciDB backend and Stanford visualizer front end
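One simple way to picture "resolution reduction" is block averaging: before rendering, a large 2-D array is collapsed so that each output cell is the mean of a `factor` x `factor` block. The sketch below assumes that interpretation; the function name and the choice of the mean as the aggregate are illustrative assumptions, not a description of the SciDB or Stanford visualizer code.

```python
def reduce_resolution(grid, factor):
    """Downsample a 2-D list of numbers by averaging factor x factor blocks.

    Blocks on the right and bottom edges may be smaller when the grid size is
    not a multiple of `factor`; they are averaged over the cells that exist.
    """
    rows, cols = len(grid), len(grid[0])
    reduced = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
            ]
            out_row.append(sum(block) / len(block))
        reduced.append(out_row)
    return reduced
```

A front end could request a coarse version first and fetch finer blocks only for the region the user zooms into, which is also where prefetching would apply.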
High Velocity • Big pattern – little state • Find me a “banana” followed within 10 msec by a “strawberry” • Historically CEP • Big state – little pattern • Assemble my global real-time risk • Main memory DBMS
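The "banana followed within 10 msec by a strawberry" query is a small example of the big-pattern, little-state case. A minimal sketch, assuming events arrive as `(timestamp_ms, name)` pairs in timestamp order and that every still-open first event pairs with a later second event; the function name and these matching semantics are assumptions, not those of any particular CEP engine.

```python
from collections import deque


def match_followed_by(events, first, second, window_ms):
    """Return (first_ts, second_ts) pairs where `second` follows `first` within the window.

    The only state kept is the timestamps of recent `first` events.
    """
    matches = []
    pending = deque()  # timestamps of `first` events still inside the window
    for ts, name in events:
        # Expire `first` events that are now too old to match anything.
        while pending and ts - pending[0] > window_ms:
            pending.popleft()
        if name == second:
            matches.extend((start, ts) for start in pending)
        if name == first:
            pending.append(ts)
    return matches
```

For example, on the stream `[(0, "banana"), (4, "apple"), (8, "strawberry"), (20, "banana"), (35, "strawberry")]` with a 10 ms window, only the pair `(0, 8)` matches, because the second banana has expired by the time the next strawberry arrives.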
High Velocity • Lots of commonality between CEP and MM DBMS • We are adding queues/windows to H-Store • It’s clear we will do ACID – CEP as fast as CEP • I predict the death of CEP
High Velocity – Other Predictions • Death of ARIES • Command logging much faster than data logging • Death of disk-oriented OLTP data bases • H-Store with anti-caching is wildly faster than MySQL with or without Memcached • Trying an emulator for “son of flash” • Will make MM DBMSs even more attractive
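The difference between command logging and data logging can be sketched with a toy in-memory store. The assumption here is that transactions are deterministic stored procedures, so the command log only needs the procedure name and arguments, while the data log records a before/after image for every changed row. The class and procedure names are hypothetical and this is not H-Store code.

```python
PROCEDURES = {}


def procedure(fn):
    """Register a deterministic stored procedure under its function name."""
    PROCEDURES[fn.__name__] = fn
    return fn


@procedure
def give_bonus(table, amount):
    # Touches every row, so a data log needs one record per row.
    for key in table:
        table[key] += amount


@procedure
def transfer(table, src, dst, amount):
    table[src] -= amount
    table[dst] += amount


class Store:
    def __init__(self, rows):
        self.table = dict(rows)
        self.command_log = []  # one entry per transaction
        self.data_log = []     # one entry per changed row

    def execute(self, name, *args):
        before = dict(self.table)
        PROCEDURES[name](self.table, *args)
        self.command_log.append((name, args))
        for key, new_value in self.table.items():
            if before.get(key) != new_value:
                self.data_log.append((key, before.get(key), new_value))


def recover(initial_rows, command_log):
    """Rebuild the table by re-running the logged commands in order."""
    table = dict(initial_rows)
    for name, args in command_log:
        PROCEDURES[name](table, *args)
    return table
```

The sketch shows the trade-off only in terms of log volume: the command log grows by one small record per transaction regardless of how many rows it touches, at the price of re-executing every transaction during recovery.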
Many Core • 1000 cores will give major heartburn to all system software • Traditional DBMSs will collapse • DBMSs cannot have shared data structures • H-Store approach • Move the computation • Hardware-supported “move” • New concurrency control algorithms (revival of DORA?)
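The "no shared data structures, move the computation" idea can be sketched as a partitioned store in which each partition owns its data and runs work on a single dedicated thread, so the partition's data needs no locks. Callers ship a function to the owning partition instead of reading and writing its data directly. The class names and the modulo routing rule are assumptions made for illustration, not the H-Store design itself.

```python
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Owns a private dict; all access happens on its single worker thread."""

    def __init__(self):
        self.data = {}
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit(self, fn, *args):
        # The computation moves to the data: fn runs on this partition's thread.
        return self._executor.submit(fn, self.data, *args)

    def close(self):
        self._executor.shutdown(wait=True)


class PartitionedStore:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def run(self, key, fn, *args):
        """Route fn(data, key, *args) to the partition that owns integer `key`."""
        owner = self.partitions[key % len(self.partitions)]
        return owner.submit(fn, key, *args)

    def close(self):
        for partition in self.partitions:
            partition.close()


def increment(data, key, amount):
    data[key] = data.get(key, 0) + amount
    return data[key]


def read(data, key):
    return data.get(key)
```

Because each key is only ever touched by one thread, single-partition work such as `store.run(7, increment, 1)` is serialized per partition without latches; operations that span partitions are the hard case and are not covered by this sketch.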