big data and the cloud: programming futures

big data and the cloud: programming futures joe hellerstein

roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

In the Days of Kings and Priests • Computers and Data: Crown Jewels • Executives depend on computers • But cannot work with them directly • The DBA “Priesthood” • And their Acronymia: EDW, BI, OLAP • The “architected” EDW • “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse, 2005)

New Realities • The quest for knowledge used to begin with grand theories. • Now it begins with massive amounts of data. • Welcome to the Petabyte Age. • TB disks < $100 • Everything is data • Rise of data-driven culture • Very publicly espoused by Google, Wired, etc. • Sloan Digital Sky Survey, Terraserver, etc.

The New Practitioners “Looking for a career where your services will be in high demand? … Provide a scarce, complementary service to something that is getting ubiquitous and cheap. the sexy job in the next ten years will be statisticians • So what’s ubiquitous and cheap? Data. And what is complementary to data? Analysis.” Hal Varian, UC Berkeley, Chief Economist @ Google

The New Practitioners • Aggressively Datavorous • Statistically savvy • Diverse in training, tools

MAD Skills [Cohen, et al. VLDB 09] • Magnetic • attract data and practitioners • Agile • rapid iteration: ingest, analyze, productionalize • Deep • sophisticated analytics in Big Data


Dev tools 4 analytics: reality • Current focus: engines/languages for scalable analytics • Scalable analytics algorithms are a small % of analyst’s life • focus on Deep • not enough on Agile • not enough on Magnetic • dp • visualization • software development • data product management • collaboration/networking

Analytics Coding Landscape • Single-node stat packages (R, Matlab, SAS, etc.) • domain-specific languages for linear algebra and statistics • diverse set of open libraries (e.g. CRAN library for R) • scalability limits: in-core, no parallelism • MapReduce ecosystem • Google, MS Dryad, Hadoop open source • low-level single-node coding (Java), easy data-parallelization • SQL-like convenience languages above (Hive, Pig) • emerging open analytics toolkits (Mahout, Pregel) • SQL + extensions (user-defined functions) • more powerful than many realize • declarative coding, easy data-parallelism • poor support for extension developers (varies by vendor) • emerging open analytics toolkits (MADlib, Hazy)

Analytics Takeaways • little real dev difference between mapreduce and SQL • hadoop has more energetic dev tools development • SQL provides more breadth (of function, install base, HR) • lines are blurring • serious barrier: porting the R/SAS/Matlab ecosystem • will take a decade to develop data-parallel equivalent to CRAN • algorithmic challenge, not just a coding challenge • no shortcut here (at least for MMP) • in sum • analytics will be a “swiss army knife” approach for years to come • think portfolio. foster community, open libraries (MADlib/Mahout)
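To make the "little real dev difference" point concrete, here is a minimal sketch that computes the same per-key total twice: once as hand-written map, shuffle, and reduce phases, and once as a declarative GROUP BY. The sample data, the table name `sales`, and all function names are assumptions made up for illustration; the in-memory SQLite database stands in for a parallel SQL engine, and nothing here runs in parallel.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale_amount) pairs.
SALES = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's values independently of every other key."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(records):
    return reduce_phase(shuffle(map_phase(records)))


def totals_sql(records):
    """The same aggregate, stated declaratively as a GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_mapreduce(SALES))
    print(totals_sql(SALES))
```

Both versions partition the work by key, which is what makes either style easy to data-parallelize.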

big systems c. 2011 • features: • data-centric • distributed • highly available • scalable/elastic • lots of new/custom code • programming is becoming hard² • (parallelism + asynchrony + failure) × (software engineering)

root cause of hardness • order is pervasive in the von neumann model • state: an ordered array of cells • logic: an ordered array of instructions • terrible match for distributed systems

typical solution: shared storage • distributed storage replaces RAM • imposes/enforces order • e.g. via transactions or other consistency mechanisms • shift: data-centric development • storage is not persistence — it is a programming model • this has always been true • the cloud makes it pervasive

Dropping ACID? • early exposition: the “transaction concept” [Gray VLDB 1981] • many think distributed ACID transactions are infeasible today • cross-site transactions ⇒ coordination • ⇒ waiting • ⇒ queue buildups • ⇒ unpredictable problems • a major lesson of Internet companies: Brewer’s “CAP theorem” • though implications being revisited • by now, this lesson is kool-aid in the open source community…

NoSQL • “not only SQL” • really not about SQL per se. • focus on two things: • distributed storage with “loose consistency”, not ACID. • data models that are simpler than SQL schemas • key/value stores, documents • i.e. similar to distributed memory! • examples • BigTable (Google), HBase (Yahoo/Hadoop), Cassandra (Facebook/DataStax), Sherpa (Yahoo), Dynamo (Amazon), Voldemort (LinkedIn), … • cloud services (AppEngine, Azure)


Homework puzzle Given: • use storage layer for distributed coordination (order) • use NoSQL’s loose consistency for availability Q: how do programmers reason about order and correctness? A: very carefully.

correctness? • ACID • general correctness via theoretical foundations • read/write: serializability • coordination/consensus • concerns: latency, availability • loose consistency • app-specific correctness via design maxims, semantic assertions, custom compensation • concerns: hard to trust, test

the shift • before: application logic / system infrastructure / theoretical foundation • now: application logic / system infrastructure / quicksand

a vacuum here • state of the art: each app reasons about consistency • e.g. by making use of a locking service (a la Apache ZooKeeper) • e.g. by reasoning about “eventual consistency” of the storage system • this is, arguably, hard³ • (sweng) * (distribution) * (false abstractions) • don’t take my word for it • Gunawi’s FATE uncovered 16 fault-recovery bugs in Hadoop FS [NSDI ’11] • focus on storage systems • not enough on developers • CALM <~ bloom

calm <~ bloom disorderly programming for distributed systems

BOOM team ras bodik joe hellerstein peter alvaro neil conway bill marczak haryadi gunawi thibaud hottelier

desire: best of both worlds • theoretical foundation for correctness under loose consistency • embodiment of theory in a programming framework • stack: application logic / theoretical foundation / system infrastructure / quicksand

our approach • disorderly programming • state: unordered collections • logic: unordered statements • implications • default: partitioning, concurrency • ordering (data, logic) explicit, special-case • but can this make ordering decisions simpler?
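A minimal sketch of what "unordered collections plus unordered statements" can look like, written in plain Python rather than Bloom. State is a set of facts, logic is a list of rules, and the rules are applied repeatedly until nothing new appears. The fact encoding, the two rules, and the name `run_to_fixpoint` are assumptions chosen for illustration; the point is only that neither the order of the facts nor the order of the rules changes the outcome.

```python
from itertools import permutations

# Hypothetical facts: a small graph of links, stored as an unordered set.
LINKS = {("link", "a", "b"), ("link", "b", "c"), ("link", "c", "d")}


def rule_base(state):
    """path(a, b) <- link(a, b)"""
    return {("path", a, b) for (rel, a, b) in state if rel == "link"}


def rule_step(state):
    """path(a, c) <- link(a, b), path(b, c)"""
    links = [(a, b) for (rel, a, b) in state if rel == "link"]
    paths = [(b, c) for (rel, b, c) in state if rel == "path"]
    return {("path", a, c) for (a, b) in links for (b2, c) in paths if b == b2}


def run_to_fixpoint(facts, rules):
    """Apply rules in the given order, repeatedly, until no new fact appears."""
    state = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            derived = rule(state) - state
            if derived:
                state |= derived
                changed = True
    return state


if __name__ == "__main__":
    # Every ordering of the rules reaches the same set of facts.
    results = {
        frozenset(run_to_fixpoint(LINKS, list(order)))
        for order in permutations([rule_base, rule_step])
    }
    print(len(results), "distinct result(s)")
```

Both rules only ever add facts, which is why the evaluation order is irrelevant here; a rule that retracted or counted facts would not enjoy the same freedom.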

progress • CALM consistency (maxims ⇒ theorems) • Bloom language (theorems ⇒ programming)

CALM

monotonicity • monotonic code • info accumulation • the more you know, the more you know • e.g. map, filter, join • non-monotonic code • belief revision • new inputs can change your mind; need to “seal” input • e.g. counts, state update
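The distinction can be sketched in a few lines of Python. The order data, the threshold of 100, and the function names are assumptions for illustration. A filter only ever adds to its output as input grows, while a count must revise its earlier answer, so it is only safe to report once the input is sealed.

```python
def big_orders(orders):
    """Monotonic (a filter): more input can only add output."""
    return {order for order in orders if order[1] >= 100}


def order_count(orders):
    """Non-monotonic (a count): more input replaces the earlier answer."""
    return len(orders)


def sealed_count(orders, sealed):
    """Report a count only once the input is known to be complete."""
    return len(orders) if sealed else None


# Hypothetical input seen at two points in time.
EARLY = frozenset({("o1", 250), ("o2", 40)})
LATE = EARLY | {("o3", 120)}

if __name__ == "__main__":
    # Everything concluded early is still true later.
    print(big_orders(EARLY) <= big_orders(LATE))
    # The early count is retracted once o3 arrives.
    print(order_count(EARLY), order_count(LATE))
```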

an aside gamblers?


intuition • counting requires waiting • waiting requires counting

CALM Theorem • CALM: consistency as logical monotonicity • monotonic code ⇒ eventually consistent • non-monotonic ⇒ coordinate only at non-monotonic points of order • conjectures at PODS 2010 conference [Hellerstein, SIGMOD Record 2010] • formulations and theorems in 2011 [Ameloot, et al., PODS 2011]
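A small simulation can illustrate the theorem's two halves; it is a sketch under stated assumptions, not a proof. Three hypothetical messages are delivered to a replica in every possible order. The monotonic replica (set accumulation) ends in the same state every time. A replica that answers a count request immediately gives order-dependent answers. Waiting until the input is sealed restores agreement, and the seal is itself detected by counting arrivals against an assumed `expected_adds`.

```python
from itertools import permutations

# Hypothetical message stream: two additions and one request for a count.
MESSAGES = [("add", "x"), ("add", "y"), ("report", None)]


def monotonic_replica(delivery):
    """Accumulate items into a set; the final state ignores arrival order."""
    seen = set()
    for kind, item in delivery:
        if kind == "add":
            seen.add(item)
    return frozenset(seen)


def uncoordinated_report(delivery):
    """Answer the report with whatever count has been seen so far."""
    seen = set()
    for kind, item in delivery:
        if kind == "add":
            seen.add(item)
        else:
            return len(seen)
    return None


def coordinated_report(delivery, expected_adds):
    """Hold the report until the input is sealed (all expected adds arrived)."""
    seen = set()
    report_pending = False
    for kind, item in delivery:
        if kind == "add":
            seen.add(item)
        else:
            report_pending = True
        if report_pending and len(seen) == expected_adds:
            return len(seen)
    return None  # still waiting for more input


if __name__ == "__main__":
    orders = list(permutations(MESSAGES))
    print({monotonic_replica(o) for o in orders})
    print({uncoordinated_report(o) for o in orders})
    print({coordinated_report(o, expected_adds=2) for o in orders})
```

The coordinated version is where waiting shows up: the count is delayed until the replica can tell that nothing more is coming.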

practical implications • compiler can identify non-monotonic “points of order” • inject coordination code • or mark uncoordinated results as “tainted” • compiler can help programmer think about coordination costs • easy to do this with the right language…
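The analysis described above can be sketched as a pass over a program's rules. Everything in this example is hypothetical: the `Rule` representation, the operator names, and the sample program are assumptions and do not reflect how the Bloom tools are implemented. Rules that use a non-monotonic operator are flagged as points of order, and "taint" is propagated to every collection derived from them.

```python
from typing import NamedTuple, Tuple

# Assumed operator classification, following the examples on the slides
# (counts and state update), plus negation written here as "not_in".
NON_MONOTONIC_OPS = {"count", "update", "not_in"}


class Rule(NamedTuple):
    output: str              # collection the rule writes to
    op: str                  # operator the rule applies
    inputs: Tuple[str, ...]  # collections the rule reads from


def points_of_order(rules):
    """Rules whose operator is non-monotonic: candidates for coordination."""
    return [rule for rule in rules if rule.op in NON_MONOTONIC_OPS]


def tainted_collections(rules):
    """Collections that depend, directly or transitively, on a point of order."""
    tainted = {rule.output for rule in points_of_order(rules)}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.output not in tainted and any(i in tainted for i in rule.inputs):
                tainted.add(rule.output)
                changed = True
    return tainted


# Hypothetical program: two monotonic rules, one count, one downstream map.
PROGRAM = [
    Rule("big_orders", "filter", ("orders",)),
    Rule("order_details", "join", ("big_orders", "customers")),
    Rule("order_total", "count", ("orders",)),
    Rule("dashboard", "map", ("order_total",)),
]

if __name__ == "__main__":
    print([rule.output for rule in points_of_order(PROGRAM)])
    print(sorted(tainted_collections(PROGRAM)))
```

A tool built along these lines could either insert coordination at each flagged rule or leave the rule uncoordinated and report the tainted collections to the programmer.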

<~ bloom

background: BOOM Analytics • 2005-2010: designed a distributed logic language called Overlog • 2009-2010: rebuilt Hadoop File System and scheduler in Overlog • no kidding – API-compatible with Hadoop, comparable performance • win 1: Orders Of Magnitude smaller, 4 person-months dev time • win 2 (more important): evolvability • fixed HDFS single point of failure via Paxos-in-Overlog (6 person-weeks) • fixed HDFS scaling limits via state partitioning (1 day!) [Alvaro et al., Eurosys 2010]

we became greedy for more • time to build a language for real programmers. approach: • craft a disorderly DSL for distributed systems • embed in popular host languages. (I chose ruby first.) • embody the CALM theorem in programmer tools • identify points of order in code • synthesize coordination logic, or inject “taint” tracking • high-level analysis/debuggers to pinpoint tricky ordering issues <~ bloom

bud (bloom under development) • bloom embedded as a DSL in ruby • domain-specific code analysis tools • alpha released April, 2011 at http://bloom-lang.net • goodies • code analysis tools • library/example sandbox • EC2 deployment utilities • % gem install bud

http://bloom-lang.net

classic example: shopping cart • replicated, a la Amazon Dynamo • challenge: guarantee eventual consistency of replicas • maxim: use commutative operations • easier said than done! • Bloom/CALM paper shows compiler analysis (i.e. proofs) of the design maxims for correctness, efficiency [Alvaro, et al. CIDR 2011]
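One way to read the "use commutative operations" maxim, sketched in Python under stated assumptions: each replica keeps a set of cart actions, and replicas merge by set union, which is commutative, associative, and idempotent. The `CartReplica` class, the request ids, and the signed `delta` encoding are hypothetical and are not taken from Dynamo or from the Bloom paper. Only checkout sums the log, and that is the non-monotonic step that needs the input to be complete.

```python
from collections import Counter


class CartReplica:
    """A cart replica that only accumulates actions until checkout."""

    def __init__(self):
        self.actions = set()

    def record(self, request_id, item, delta):
        """Log an add (delta > 0) or remove (delta < 0); duplicates are harmless."""
        self.actions.add((request_id, item, delta))

    def merge(self, other):
        """Exchange state with another replica: set union, in any order."""
        self.actions |= other.actions

    def checkout(self):
        """Non-monotonic step: sum the log; meaningful once no more actions arrive."""
        totals = Counter()
        for _request_id, item, delta in self.actions:
            totals[item] += delta
        return {item: count for item, count in totals.items() if count > 0}


if __name__ == "__main__":
    r1, r2 = CartReplica(), CartReplica()
    r1.record(1, "book", 1)
    r1.record(2, "pen", 2)
    r2.record(3, "pen", -1)
    r2.record(1, "book", 1)  # the same request delivered to a second replica
    r1.merge(r2)
    r2.merge(r1)
    print(r1.checkout() == r2.checkout())
```

Because the accumulation phase is pure set union, replicas can gossip in any order; coordination is only needed to decide when the log is complete enough to check out.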

conclusion • CALM theorem • what is coordination for? non-monotonicity. • pinpoint non-monotonic points of order • coordination or taint tracking • Bloom • declarative, disorderly DSL for distributed programming • bud: organic Ruby embedding • CALM analysis of monotonicity • synthesize coordination/compensation • released to the dev community this spring • “friends-and-family” alpha at http://bloom-lang.net

influence propagation…? • Technology Review TR10 2010: • “The question that we ask is simple: is the technology likely to change the world?” • Fortune Magazine 2010 Top in Tech: • “Some of our choices may surprise you.” • Twittersphere: • “Read this. Read this now.”

more? • http://bloom-lang.net • http://boom.cs.berkeley.edu • thanks to: Microsoft Research, Yahoo! Research, IBM Research, NSF, AFOSR • Consensus in Logic [Alvaro, et al. NetDB 2009] • BOOM Analytics [Alvaro, et al., Eurosys 2010] • Declarative Imperative [Hellerstein, SIGMOD Record 3/2010] • CALM + Bloom [Alvaro, et al. CIDR 2011]

dp = datapeople • facilitating interactions between people and data throughout the analytic lifecycle • http://deepresearch.org

dp • Jeff Heer (Stanford) • Joe Hellerstein (Berkeley) • Tapan Parikh (Berkeley) • Maneesh Agrawala (Berkeley) • Sean Kandel • Diana MacLean • Ravi Parikh • Kuang Chen • Nicholas Kong • Wesley Willett

dp • wrangler: intelligent data xformation • commentspace: social data analysis • usher/shreddr: first-mile data entry • socialflows: mining, visualizing & browsing email • madlib: parallel in-database analytics

Remember the missing pieces! • visualization • software development • data product management • collaboration/networking
