This talk discusses the challenges and solutions of real-time analytics on big data streams, covering topics such as data streams, data-flow parallelism, and real-time analytics. It explores use cases in various industries and the limitations of current computational power and parallelization. The cost of synchronization and the impact on latency and scalability are also discussed.
Data Streams, Data-flow Parallelism, and Real-time Analytics
Christoph.Koch@epfl.ch -- EPFL DATA Lab
Real-time data processing use-cases
• Science: sensors; feedback for experiment control
• Monitoring, log analysis, root cause analysis for failures
• Finance: algorithmic trading, risk management, OLTP
• Commerce: OLTP
• Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis
Real-time Analytics on Big Data
• Big Data 3V: Volume, Velocity, Variety
• This talk focuses on velocity and volume
• Continuous data analysis
• Stream monitoring & mining; enforcing policies/security
• Timely response required (low latencies!)
• Performance: high throughput and low latencies!
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
Comp. Arch. not to the rescue
• Current data growth outpaces Moore's law.
• Sequential CPU performance does not grow anymore (already for three Intel processor generations).
  • Logical states need time to stabilize.
• Moore's law to fail by 2020: only a few (2?) die-shrink iterations left.
  • Limitation on the number of cores.
• Dennard scaling (the true motor of Moore's law) has ended.
  • Energy cost and cooling problems!
• More computational power will always be more expensive!
Parallelization is no silver bullet
• Computer architecture
  • Failure of Dennard scaling: parallelization is expensive!
• Computational complexity theory
  • There are inherently sequential problems: NC < PTIME (conjectured).
• Fundamental impossibilities in distributed computing:
  • Distributed computation requires synchronization.
  • Distributed consensus has a minimum latency dictated by the spatial distance of compute nodes (and other factors): msecs in a LAN, 100s of msecs in a WAN. Speed of light!
  • Maximum number of synchronous computation steps per second, no matter how much parallel hardware is available.
Reminder: Two-Phase Commit
Coordinator: send prepare
Subordinate: force-write prepare record; send yes or no
Coordinator: wait for all responses; force-write commit or abort record; send commit or abort
Subordinate: force-write abort or commit record; send ACK
Coordinator: wait for all ACKs; write end record
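A minimal sketch of this message flow (hypothetical classes and names, not taken from any particular system): the coordinator blocks on all subordinates twice, once for the votes and once for the ACKs, which is what puts two network round trips on the critical path.

```python
# Minimal 2PC sketch (hypothetical names, illustration only).
# The coordinator waits on all subordinates twice: once for the votes,
# once for the ACKs -- two network round trips before the end record.

class Subordinate:
    def prepare(self):
        self.force_write("prepare")          # force-write prepare record
        return self.can_commit()             # vote: yes or no

    def decide(self, decision):
        self.force_write(decision)           # force-write commit or abort record
        return "ACK"

    def force_write(self, record):
        pass                                 # stand-in for a forced log write

    def can_commit(self):
        return True                          # stand-in for local validation


class Coordinator:
    def __init__(self, subordinates):
        self.subordinates = subordinates

    def run(self):
        votes = [s.prepare() for s in self.subordinates]        # round trip 1
        decision = "commit" if all(votes) else "abort"
        self.force_write(decision)                              # force-write commit/abort
        acks = [s.decide(decision) for s in self.subordinates]  # round trip 2
        assert all(a == "ACK" for a in acks)
        self.force_write("end")                                 # write end record
        return decision

    def force_write(self, record):
        pass


coordinator = Coordinator([Subordinate(), Subordinate()])
print(coordinator.run())   # "commit"
```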
The cost of synchronization
• 2PC: minimum latency of two network round trips.
• Latency limits Xact throughput: throughput (#Xacts/second) = 1 / (cost of one run of 2PC).
• Lausanne–Shenzhen: 9491 km * 4 legs; 126 ms at the speed of light (back-of-the-envelope check below).
• Wide-area/cross-data-center consistency – 8 Xacts/s ?!?
• Consensus >= 2-phase commit.
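A back-of-the-envelope check of the slide's numbers (the distance and the four one-way legs of the two round trips are taken from the slide; the speed of light is used as an absolute lower bound, ignoring fiber routes and processing time):

```python
# Back-of-the-envelope: 2PC latency between Lausanne and Shenzhen.
# Two round trips = 4 one-way legs; speed of light as an absolute lower bound.
distance_km = 9491
c_km_per_s = 299_792

latency_s = 4 * distance_km / c_km_per_s
print(f"lower-bound latency: {latency_s * 1000:.1f} ms")          # ~126.6 ms
print(f"max serial 2PC throughput: {1 / latency_s:.1f} Xacts/s")  # ~7.9
```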
Xact throughput, local vs. cross-datacenter (chart; slide by P. Bailis)
The cost of synchronization
• For every network latency, there is a maximum scale at which a synchronized system can run.
• Jim Gray: in the late 1980s, optimal 2PC throughput was reached with ~50 machines (above that, throughput decreases).
• Today the number is higher, but not by much.
Latency and scaling of synchronized parallel systems
• SIMD in a CPU: does not need to scale, and is implemented in hardware on a single die: ok.
• Cache coherence in multi-socket servers: a headache for computer architects!
• Linear algebra in HPC:
  • a very special and easy problem; superfast interconnects; locality;
  • but scaling remains a challenge for the designers of supercomputers and linear algebra routines.
• Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions.
• Consistency inside a data center (Xacts, MPI syncs): <10000 Hz
• Cross-data-center consistency: ~10 Hz
• Latency of heterogeneous local batch jobs: <2 Hz (HPC, MapReduce, Spark)
Latency and scaling of synchronized parallel systems #2
Scaling of batch systems: <2 Hz
• HPC jobs: <1 Hz (?)
• Map/reduce: 0.1 Hz (synchronization via disk, slow scheduling)
• Spark: <2 Hz; "Spark Streaming"
Note: Hadoop efficiency: it takes 80-100 nodes to match single-core performance.
The cost of synchronization
Machine 1 holds a, Machine 2 holds b:
• a := 1, b := 1
• sync; Machine 1: a += a+b → 2
• sync; Machine 2: b += a+b → 3
• sync; Machine 1: a += a+b → 5
• no sync; Machine 2: b += a+b → 5 (a stale a is read; with synchronization the result would be 8 – see the sketch below)
Distributed computation needs synchronization. Low-latency stream processing? Asynchronous message forwarding – data-flow parallelism.
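A small sketch of the example above, under the assumed semantics that each update sets the variable to the sum of the latest a and b its machine has seen; skipping the final synchronization makes Machine 2 read a stale a and produce 5 instead of 8:

```python
# Sketch (assumed semantics of the slide's example): each machine owns one
# variable and updates it to the sum of the latest a and b it has seen.
# Without the final synchronization, Machine 2 uses a stale copy of a.

def run(sync_last_step=True):
    a, b = 1, 1
    a_seen_by_m2 = a            # Machine 2's local copy of a

    a = a + b                   # Machine 1: a = 2
    a_seen_by_m2 = a            # sync: Machine 2 learns the new a
    b = a_seen_by_m2 + b        # Machine 2: b = 3
    a = a + b                   # Machine 1: a = 5
    if sync_last_step:
        a_seen_by_m2 = a        # synchronize before the last update
    b = a_seen_by_m2 + b        # Machine 2: 8 with sync, 5 without
    return a, b

print(run(sync_last_step=True))   # (5, 8)
print(run(sync_last_step=False))  # (5, 5) -- stale read, wrong result
```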
Does streaming/message passing defeat the 2PC lower bound?
• Assume we compute each statement once.
• Different machines handle the statements.
• Don't compute until you have received all the msgs you need.
• Works!
• But requires synchronized timestamps on the input stream: one stream source, or synchronization of stream sources!
(Diagram: input machines (1, j) and (2, j) produce a := 1 and b := 1; statement machines (i, 1)…(i, 5) each handle one statement – a' += a+b, b'' += a'+b', a'' += a'+b'', b''' += a''+b'' – with the intermediate values flowing along the edges.)
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp; multiple msgs can share a timestamp.
• Works in this case!
• We compute only sums of two objects, so we know when we have received all the msgs we need to make progress!
(Same data-flow diagram as on the previous slide.)
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp; multiple msgs can share a timestamp.
• notify() when no more messages of a particular timestamp are to come from a sender.
• Requires waiting for notify() from all sources – synchronization again! (See the progress-tracking sketch below.)
• If there is a cyclic dependency (the same values are read as written), 2PC is back!
(Same data-flow diagram as on the previous slides.)
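A sketch of the epoch/notify() idea from these slides (hypothetical API, not Storm or Naiad code): an operator buffers messages per epoch and fires only once every upstream source has sent notify() for that epoch, i.e. has promised to send no more messages with that timestamp.

```python
# Epoch-based progress tracking sketch (hypothetical API): an operator may
# fire for epoch t only once every upstream source has sent notify(t).

from collections import defaultdict

class Operator:
    def __init__(self, sources, on_complete):
        self.sources = set(sources)
        self.on_complete = on_complete            # callback: (epoch, msgs) -> None
        self.pending = defaultdict(list)          # epoch -> buffered messages
        self.notified = defaultdict(set)          # epoch -> sources done with epoch

    def receive(self, source, epoch, msg):
        self.pending[epoch].append(msg)           # buffer until the epoch is complete

    def notify(self, source, epoch):
        self.notified[epoch].add(source)
        if self.notified[epoch] == self.sources:  # all sources done: safe to fire
            self.on_complete(epoch, self.pending.pop(epoch, []))


# usage: sum all values of an epoch once it is known to be complete
op = Operator(sources={"s1", "s2"},
              on_complete=lambda e, msgs: print(f"epoch {e}: sum = {sum(msgs)}"))
op.receive("s1", 0, 1)
op.receive("s2", 0, 2)
op.notify("s1", 0)
op.notify("s2", 0)     # prints: epoch 0: sum = 3
```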
Streaming+Iteration: Structured time [Naiad]
(Diagram: a dataflow with operators A, B, C, D, E containing a loop; timestamps (t) outside the loop are extended to (t, t') inside it, the feedback edge increments the loop counter to (t, t'+1), and the timestamp returns to (t) when leaving the loop.)
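A tiny sketch of structured timestamps in the spirit of Naiad (not its actual API): entering a loop extends a timestamp with a loop counter, the feedback edge increments that counter, and leaving the loop strips it again.

```python
# Structured timestamps in the spirit of Naiad (sketch, not Naiad's API):
# loop ingress extends the timestamp, feedback increments the loop counter,
# loop egress strips it.

def ingress(ts):                 # (t)     -> (t, 0)
    return ts + (0,)

def feedback(ts):                # (t, t') -> (t, t'+1)
    return ts[:-1] + (ts[-1] + 1,)

def egress(ts):                  # (t, t') -> (t)
    return ts[:-1]

t = (7,)
inner = ingress(t)               # (7, 0)
inner = feedback(inner)          # (7, 1)
print(egress(inner))             # (7,)
```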
Data-flow parallel systems
• Popular for real-time analytics.
• Most popular framework: Apache Storm / Twitter Heron.
• Simple programming model ("bolts" – analogous to map-reduce mappers).
• Requires nonblocking operators: e.g. symmetric hash join vs. sort-merge join (see the sketch below).
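The point about nonblocking operators can be made concrete with a symmetric hash join: each arriving tuple first probes the other input's hash table and is then inserted into its own, so results are emitted as the streams arrive instead of after a blocking sort or build phase. A minimal sketch (illustrative Python, not Storm/Heron code):

```python
# Symmetric hash join sketch: both inputs are streamed; every tuple probes the
# other side's hash table, then is inserted into its own side's table, so join
# results are produced incrementally instead of after a blocking build/sort phase.
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self):
        self.left = defaultdict(list)    # key -> left tuples seen so far
        self.right = defaultdict(list)   # key -> right tuples seen so far

    def on_left(self, key, tup):
        out = [(tup, r) for r in self.right[key]]   # probe the right table
        self.left[key].append(tup)                  # then insert on the left
        return out

    def on_right(self, key, tup):
        out = [(l, tup) for l in self.left[key]]    # probe the left table
        self.right[key].append(tup)                 # then insert on the right
        return out


join = SymmetricHashJoin()
print(join.on_left(1, "a"))    # [] -- nothing on the right yet
print(join.on_right(1, "x"))   # [('a', 'x')] -- emitted as soon as the match arrives
```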
Latency and Data Skew
• Data skew: uneven parallelization.
• Reasons:
  • bad initial parallelization
  • uneven blowup of intermediate results
  • bad hashing in (map/reduce) reshuffling
• Occurs both in batch systems such as map-reduce and in streaming systems.
• Fixes: skew-resilient repartitioning, load shedding, … (a partitioning sketch follows below)
• Node failures look similar.
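An illustration of how plain hash partitioning behaves under a skewed key distribution, plus one possible skew-resilient variant that salts keys so a hot key is spread over several partitions (the salting scheme is an illustrative choice, not from the talk; its partial results must be re-aggregated downstream):

```python
# Sketch: hash partitioning under a skewed key distribution. A single hot key
# lands entirely on one worker, so that worker becomes the latency bottleneck.
import random
from collections import Counter

def partition(keys, workers):
    load = Counter()
    for k in keys:
        load[hash(k) % workers] += 1
    return load

def partition_with_salt(keys, workers, fanout=8):
    # illustrative skew-resilient variant: salt every key so a hot key is
    # spread over up to `fanout` partitions (requires downstream re-aggregation)
    load = Counter()
    for k in keys:
        load[hash((k, random.randrange(fanout))) % workers] += 1
    return load

keys = ["hot"] * 9_000 + [f"k{i}" for i in range(1_000)]  # 90% of tuples share one key
print(partition(keys, workers=8))             # one partition holds >= 9000 tuples
print(partition_with_salt(keys, workers=8))   # the hot key's load is spread out
```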
Paths to (real-time) performance
• Parallelization
• Small data (seriously!) <cut>
• Incrementalization (online/anytime)
• Specialization
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime) <cut>
• Specialization
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
  • Hardware: GPU, FPGA, …
  • Software: compilation
  • Lots of activity on both the hardware and software fronts at EPFL
Summary
• The classical batch job is getting a lot of competition: people need low latency for a variety of reasons.
• Part of a cloud/OpenStack deployment will be used for low-latency work.
• Latency is a problem! (Virtualization; MS vs. Amazon.)
• Distributed computation at low latencies has fundamental limits.
• Very hard systems problems; a huge design space to explore.
• Incrementalization can give asymptotic efficiency improvements – by many orders of magnitude in practice.