Dryad and dataflow systems Michael Isard misard@microsoft.com Microsoft Research 4th June, 2014
Talk outline • Why is dataflow so useful? • What is Dryad? • An engineering sweet spot • Beyond Dryad • Conclusions
Computation on large datasets • Performance is mostly about efficient resource use • Locality • Data placed correctly in memory hierarchy • Scheduling • Get enough work done before being interrupted • Decompose into independent batches • Parallel computation • Control communication and synchronization • Distributed computation • Writes must be explicitly shared
Computational model • Vertices are independent • State and scheduling • Dataflow very powerful • Explicit batching and communication [Figure: dataflow graph of inputs, channels, processing vertices, and outputs]
Why dataflow now? • Collection-oriented programming model • Operations on collections of objects • Turn spurious (unordered) for into foreach • Not every for is foreach • Aggregation (sum, count, max, etc.) • Grouping • Join, Zip • Iteration • LINQ since ca 2008, now Spark via Scala, Java
Given some lines of text, find the most commonly occurring words:
• Read the lines from a file
• Split each line into its constituent words
• Count how many times each word appears
• Find the words with the highest counts
var lines = FS.ReadAsLines(inputFileName);
var words = lines.SelectMany(x => x.Split(' '));
var counts = words.CountInGroups();
var highest = counts.OrderByDescending(x => x.count).Take(10);
Well-chosen syntactic sugar:
• Type inference: counts is inferred to be a Collection<KeyValuePair<string,int>>
• Lambda expressions: x => x.count, instead of int SortKey(KeyValuePair<string,int> x) { return x.count; } or, without type safety, int SortKey(void* x) { return ((KeyValuePair<string,int>*)x)->count; }
• Generics and extension methods: Collection<T> Take(this Collection<T> c, int count) { … } instead of FooCollection FooTake(FooCollection c, int count) { … }
[Figure: example data — lines of the words blue, red, yellow flowing through the pipeline to the counts blue,4 yellow,3 red,2]
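For reference, here is a minimal plain-LINQ sketch of the same query that runs on a single machine; it replaces the deck's FS.ReadAsLines and CountInGroups with standard File.ReadLines and a GroupBy/Count pair, so it is an illustrative equivalent rather than the DryadLINQ program itself.

    using System;
    using System.IO;
    using System.Linq;

    class WordCount
    {
        static void Main(string[] args)
        {
            string inputFileName = args[0];

            // Read the lines from a file.
            var lines = File.ReadLines(inputFileName);

            // Split each line into its constituent words.
            var words = lines.SelectMany(x => x.Split(' '));

            // Count how many times each word appears (stand-in for CountInGroups).
            var counts = words.GroupBy(w => w)
                              .Select(g => new { word = g.Key, count = g.Count() });

            // Find the ten words with the highest counts.
            var highest = counts.OrderByDescending(x => x.count).Take(10);

            foreach (var x in highest)
                Console.WriteLine($"{x.word},{x.count}");
        }
    }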
Collections compile to dataflow • Each operator specifies a single data-parallel step • Communication between steps explicit • Collections reference collections, not individual objects! • Communication under control of the system • Partition, pipeline, exchange automatically • LINQ innovation: embedded user-defined functions var words = lines.SelectMany(x => x.Split(' ')); • Very expressive • Programmer ‘naturally’ writes pure functions
Distributed sorting var sorted = set.OrderBy(x => x.key) [Figure: plan for the partitioned input set — sample keys, compute histogram, range partition by key, sort locally, producing sorted]
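Below is a single-machine sketch of the same range-partitioned sort plan, written in plain C# with invented helper structure (the sampling rate and split-point choice are illustrative, not Dryad's): sample the keys, compute split points from the sample, route each record to a partition by key range, and sort each partition locally; concatenating the partitions in order then yields the globally sorted output.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class RangeSort
    {
        public static List<int>[] Sort(IEnumerable<int> set, int parts, Random rng)
        {
            var data = set.ToList();

            // 1. Sample the keys (roughly 1%, padded so the sample is never too small).
            var sample = data.Where(_ => rng.NextDouble() < 0.01)
                             .Concat(data.Take(parts))
                             .OrderBy(k => k)
                             .ToList();

            // 2. Compute split points (the histogram boundaries) from the sample.
            var splits = Enumerable.Range(1, parts - 1)
                                   .Select(i => sample[i * sample.Count / parts])
                                   .ToList();

            // 3. Range partition by key.
            var partitions = Enumerable.Range(0, parts)
                                       .Select(_ => new List<int>())
                                       .ToArray();
            foreach (var key in data)
                partitions[splits.Count(s => key >= s)].Add(key);

            // 4. Sort locally within each partition; the partitions are already
            //    ordered by key range, so their concatenation is globally sorted.
            foreach (var p in partitions)
                p.Sort();
            return partitions;
        }
    }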
Quiet revolution in parallelism • Programming model is more attractive • Simpler, more concise, readable, maintainable • Program is easier to optimize • Programmer separates computation and communication • System can re-order, distribute, batch, etc. etc.
Talk outline • Why is dataflow so useful? • What is Dryad? • An engineering sweet spot • Beyond Dryad • Conclusions
What is Dryad? • General-purpose DAG execution engine ca 2005 • Cited as inspiration for e.g. Hyracks, Tez • Engine behind Microsoft Cosmos/SCOPE • Initially MSN Search/Bing, now used throughout MSFT • Core of research batch cluster environment ca 2009 • DryadLINQ • Quincy scheduler • TidyFS
What Dryad does • Abstracts cluster resources • Set of computers, network topology, etc. • Recovers from transient failures • Rerun computations on machine or network fault • Speculatively execute duplicates of slow computations • Schedules a local DAG of work at each vertex
Scheduling and fault tolerance • DAG makes things easy • Schedule from source to sink in any order • Re-execute subgraph on failure • Execute “duplicates” for slow vertices
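As a toy illustration of why the DAG makes this easy, here is a hedged sketch (the Vertex type and its Run/PersistedOutput fields are invented for illustration, not Dryad's interfaces): a vertex can be launched in any order once all of its inputs have been persisted, and a failed vertex is simply run again because its inputs are still available from its upstream vertices.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative only: a vertex whose inputs are the persisted outputs of its parents.
    class Vertex
    {
        public List<Vertex> Inputs = new List<Vertex>();
        public Func<List<string>, string> Run;   // pure computation over input files
        public string PersistedOutput;           // null until the vertex has produced output
    }

    static class DagScheduler
    {
        public static void Execute(IEnumerable<Vertex> dag)
        {
            var pending = new List<Vertex>(dag);
            while (pending.Count > 0)
            {
                // Any vertex whose inputs are all persisted is ready; the order is otherwise free.
                var ready = pending.First(v => v.Inputs.All(i => i.PersistedOutput != null));
                try
                {
                    var inputs = ready.Inputs.Select(i => i.PersistedOutput).ToList();
                    ready.PersistedOutput = ready.Run(inputs);
                    pending.Remove(ready);
                }
                catch (Exception)
                {
                    // Transient failure: the inputs are still persisted upstream,
                    // so the vertex is simply retried on the next pass.
                }
            }
        }
    }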
Resources are virtualized • Each graph vertex is a process • Writes outputs to disk (usually) • Reads inputs from upstream nodes’ output files • Graph generally larger than cluster RAM • 1TB partitioned input, 250MB part size, 4000 parts • Cluster is shared • Don’t size program for exact cluster • Use whatever share of resources are available
Integrated system • Collection-oriented programming model (LINQ) • Partitioned file system (TidyFS) • Manages replication and distribution of large data • Cluster scheduler (Quincy) • Jointly schedule multiple jobs at a time • Fine-grain multiplexing between jobs • Balance locality and fairness • Monitoring and debugging (Artemis) • Within job and across jobs
Dryad cluster scheduling [Figure: the scheduler placing job processes (R) onto cluster machines]
Dryad features • Well-tested at scales up to 15k cluster computers • In heavy production use for 8 years • Dataflow graph is mutable at runtime • Repartition to avoid skew • Specialize matrices dense/sparse • Harden fault-tolerance
Talk outline • Why is dataflow so useful? • What is Dryad? • An engineering sweet spot • Beyond Dryad • Conclusions
Stateless DAG dataflow • MapReduce, Dryad, Spark, … • Stateless vertex constraint hampers performance • Iteration and streaming overheads • Why does this design keep repeating?
Software engineering • Fault tolerance well understood • E.g., Chandy-Lamport, rollback recovery, etc. • Basic mechanism: checkpoint plus log • Stateless DAG: no checkpoint needed! • Programming model “tricked” the user • All communication on typed channels • Only channel data needs to be persisted • Fault tolerance comes without programmer effort • Even with UDFs
Talk outline • Why is dataflow so useful? • What is Dryad? • An engineering sweet spot • Beyond Dryad • Conclusions
What about stateful dataflow? • Naiad • Add state to vertices • Support streaming and iteration • Opportunities • Much lower latency • Can model mutable state with dataflow • Challenges • Scheduling • Coordination • Fault tolerance
[Figure: batch processing, stream processing, and graph processing built on a common timely dataflow layer]
Batching vs. streaming • Batching (synchronous): requires coordination; supports aggregation • Streaming (asynchronous): no coordination needed; aggregation is difficult
Batch DAG execution [Figure: central coordinator]
Streaming DAG execution [Figure: inline coordination]
Batch iteration [Figure: central coordinator]
Streaming iteration [Figure]
Messages • B.SendBy(edge, message, time) delivers message to C.OnRecv(edge, message, time) • Messages are delivered asynchronously [Figure: vertex chain B → C → D]
Notifications • C.SendBy(_, _, time) delivers to D.OnRecv(_, _, time) • D.NotifyAt(time) requests a D.OnNotify(time) callback once there can be no more messages at time or earlier • Notifications support batching [Figure: vertex chain B → C → D]
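A sketch of how a vertex might use these callbacks to batch per-time aggregation, written against a simplified interface shaped like the slide's SendBy/OnRecv/NotifyAt/OnNotify (the IVertexContext and Edge types here are illustrative, not the actual Naiad API): messages are buffered as they arrive asynchronously, and the aggregate for a time is emitted only when the notification guarantees that no more messages can arrive at that time or earlier.

    using System.Collections.Generic;

    interface IVertexContext
    {
        void NotifyAt(int time);                              // request OnNotify(time) later
        void SendBy(Edge edge, string message, int time);     // send a message downstream
    }

    class Edge { }

    class CountingVertex
    {
        readonly Dictionary<int, int> countsByTime = new Dictionary<int, int>();
        readonly IVertexContext context;
        readonly Edge outputEdge = new Edge();

        public CountingVertex(IVertexContext context) { this.context = context; }

        // Messages are delivered asynchronously, in no particular order.
        public void OnRecv(Edge edge, string message, int time)
        {
            if (!countsByTime.ContainsKey(time))
            {
                countsByTime[time] = 0;
                context.NotifyAt(time);   // ask to be told when 'time' is complete
            }
            countsByTime[time] += 1;
        }

        // Delivered once no more messages can arrive at 'time' or earlier.
        public void OnNotify(int time)
        {
            context.SendBy(outputEdge, countsByTime[time].ToString(), time);
            countsByTime.Remove(time);
        }
    }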
Coordination in timely dataflow • Local scheduling with global progress tracking • Coordination with a shared counter, not a scheduler • Efficient, scalable implementation
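A minimal sketch of the shared-counter idea, with invented names rather than Naiad's implementation: every send increments a count of outstanding work at a timestamp, every delivery decrements it, and a notification at a timestamp may fire once nothing is outstanding at that timestamp or any earlier one, so no central scheduler is consulted.

    using System.Collections.Generic;
    using System.Linq;

    // Illustrative progress tracker: counts outstanding messages per logical time.
    class ProgressTracker
    {
        readonly Dictionary<int, int> outstanding = new Dictionary<int, int>();

        public void MessageSent(int time)
        {
            int count;
            outstanding.TryGetValue(time, out count);
            outstanding[time] = count + 1;
        }

        public void MessageDelivered(int time)
        {
            outstanding[time] -= 1;
            if (outstanding[time] == 0)
                outstanding.Remove(time);
        }

        // A notification at 'time' may be delivered once nothing is outstanding
        // at 'time' or earlier.
        public bool CanNotify(int time)
        {
            return !outstanding.Keys.Any(t => t <= time);
        }
    }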
Interactive graph analysis [Figure: streaming dataflow combining an input of 32K tweets/s with 10 queries/s through join (⋈) and max operators]
Cluster: 32 8-core 2.1 GHz AMD Opteron servers, 16 GB RAM per server, Gigabit Ethernet. Query latency: median 5.2 ms, 99th percentile 70 ms, max 140 ms.
Mutable state • In batch DAG systems collections are immutable • Functional definition in terms of preceding subgraph • Adding streaming or iteration introduces mutability • Collection varies as function of epoch, loop iteration
Key-value store as dataflow var lookup = data.join(query, d => d.key, q => q.key) • Modeled random access with dataflow… • Add/remove key is streaming update to data • Look up key is streaming update to query • High throughput requires batching • But that was true anyway, in general
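For intuition, here is the same join written as plain LINQ over static collections (the Record and Query types and their fields are made up for illustration); in the streaming version, each new element arriving on query produces the matching lookup result, and each new element arriving on data updates the store.

    using System;
    using System.Linq;

    class KeyValueJoin
    {
        class Record { public string key; public string value; }
        class Query  { public string key; }

        static void Main()
        {
            var data  = new[] { new Record { key = "a", value = "1" },
                                new Record { key = "b", value = "2" } };
            var query = new[] { new Query { key = "b" } };

            // Same shape as the dataflow join: match data and query elements on key.
            var lookup = data.Join(query, d => d.key, q => q.key,
                                   (d, q) => new { q.key, d.value });

            foreach (var r in lookup)
                Console.WriteLine($"{r.key} -> {r.value}");   // prints "b -> 2"
        }
    }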
What can’t dataflow do? • Programming model for mutable state? • Not as intuitive as functional collection manipulation • Policies for placement still primitive • Hash everything and hope • Great research opportunities • Intersection of OS, network, runtime, language
Talk outline • Why is dataflow so useful? • What is Dryad? • An engineering sweet spot • Beyond Dryad • Conclusions
Conclusions • Dataflow is a great structuring principle • We know good programming models • We know how to write high-performance systems • Dataflow is the status quo for batch processing • Mutable state is the current research frontier Apache 2.0 licensed source on GitHub http://research.microsoft.com/en-us/um/siliconvalley/projects/BigDataDev/