Dryad and DryadLINQ • Dryad provides automatic distributed execution • DryadLINQ provides automatic query plan generation
Dryad • General-purpose execution environment for distributed, data-parallel applications • Focuses on simplicity, reliability, scalability, and efficiency rather than on low latency or tolerating unreliable networks • Automatic management of scheduling, distribution, and fault tolerance • Exploits data parallelism
Dryad • Computations are expressed as a Directed Acyclic Graph (DAG) • The job's computation runs in the vertices • Edges are communication channels • Each vertex may have several input and output edges • Data transport mechanisms: files, TCP pipes, shared-memory FIFOs
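The graph model on this slide can be illustrated with a small Python sketch. It is not Dryad's API: the names Vertex, connect and run_job are hypothetical, channels are plain in-memory values, and everything runs in one process. It only shows how a vertex becomes runnable once all of its input edges have data.

```python
from collections import deque

class Vertex:
    """A processing vertex: a sequential function over its input channels."""
    def __init__(self, name, func):
        self.name = name
        self.func = func      # called with a list of input values
        self.inputs = []      # upstream vertices (incoming edges)
        self.outputs = []     # downstream vertices (outgoing edges)

def connect(src, dst):
    """Add a channel (edge) from src to dst."""
    src.outputs.append(dst)
    dst.inputs.append(src)

def run_job(vertices):
    """Run each vertex once all of its inputs are ready (Kahn's algorithm)."""
    pending = {v: len(v.inputs) for v in vertices}
    ready = deque(v for v in vertices if pending[v] == 0)
    results = {}
    while ready:
        v = ready.popleft()
        results[v.name] = v.func([results[u.name] for u in v.inputs])
        for w in v.outputs:
            pending[w] -= 1
            if pending[w] == 0:
                ready.append(w)
    if len(results) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a valid job")
    return results

# Two input vertices feed one summing vertex.
a = Vertex("a", lambda _: [1, 2, 3])
b = Vertex("b", lambda _: [10, 20])
total = Vertex("total", lambda ins: sum(x for part in ins for x in part))
connect(a, total)
connect(b, total)
job_results = run_job([a, b, total])
```

The cycle check is what the "acyclic" requirement buys: with no cycles, every vertex eventually has all of its inputs available.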
Job = Directed Acyclic Graph • [Figure: a job drawn as a DAG, labeled Inputs, Processing vertices, Channels (file, pipe, shared memory), and Outputs]
Dryad vs. MapReduce, Parallel DB • Gives the developer more control than MapReduce • MapReduce aims at simplicity at the expense of generality and performance • In a parallel DB the computation graph is implicit; in Dryad the developer builds it explicitly
Dryad System Architecture • Job manager – coordinates the job and constructs its graph • Name server – lists the available computers and exposes their position in the network topology • Daemons run on each computer in the cluster
Job (Graph) Construction • The graph is described with graph operators implemented in C++, which build it up from simpler subgraphs.
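Building a graph from simpler subgraphs can be sketched in Python as follows. This is an illustration only, not the C++ operator set: the Graph class and the method names clone, pointwise and bipartite are hypothetical, and pointwise composition is simplified to require equal widths.

```python
import itertools

_ids = itertools.count()

class Graph:
    """A subgraph: vertices, edges, and designated input/output vertices."""
    def __init__(self, vertices, edges, inputs, outputs):
        self.vertices = list(vertices)
        self.edges = list(edges)
        self.inputs = list(inputs)
        self.outputs = list(outputs)

    @staticmethod
    def single(label):
        """A graph with one vertex that is both its input and its output."""
        v = (label, next(_ids))   # unique id so clones stay distinct
        return Graph([v], [], [v], [v])

    def _copy(self):
        fresh = {v: (v[0], next(_ids)) for v in self.vertices}
        return Graph(
            fresh.values(),
            [(fresh[a], fresh[b]) for a, b in self.edges],
            [fresh[v] for v in self.inputs],
            [fresh[v] for v in self.outputs],
        )

    def clone(self, n):
        """n disjoint copies of this graph, side by side."""
        copies = [self._copy() for _ in range(n)]
        return Graph(
            [v for g in copies for v in g.vertices],
            [e for g in copies for e in g.edges],
            [v for g in copies for v in g.inputs],
            [v for g in copies for v in g.outputs],
        )

    def _join(self, other, new_edges):
        return Graph(
            self.vertices + other.vertices,
            self.edges + other.edges + new_edges,
            self.inputs,
            other.outputs,
        )

    def pointwise(self, other):
        """Connect output i of self to input i of other (equal widths only)."""
        if len(self.outputs) != len(other.inputs):
            raise ValueError("pointwise composition needs equal widths")
        return self._join(other, list(zip(self.outputs, other.inputs)))

    def bipartite(self, other):
        """Connect every output of self to every input of other."""
        edges = [(a, b) for a in self.outputs for b in other.inputs]
        return self._join(other, edges)

# Four readers, each feeding its own sorter, all feeding one merger.
readers = Graph.single("read").clone(4)
sorters = Graph.single("sort").clone(4)
merger = Graph.single("merge")
job = readers.pointwise(sorters).bipartite(merger)
```

Each composition returns a new graph whose inputs come from the left operand and whose outputs come from the right one, so larger jobs can be assembled by chaining calls.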
Job Execution • The job manager is not currently fault tolerant • Vertices may be scheduled multiple times because of failures • Each execution is versioned • An execution record is kept for each execution, including the versions of the vertices that supplied its inputs • Outputs are uniquely named (versioned) • Final outputs are selected if the job completes • Non-file communication (TCP pipe, shared-memory FIFO) may cascade failures • Vertices may specify hard constraints or preferences for the set of computers they run on • Scheduling is greedy and assumes the job is the only one running on the cluster
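The versioning described on this slide can be sketched as below. This is a minimal single-process illustration, not Dryad's job manager: VertexFailed, run_with_retries, the ".vN" naming scheme, and the retry limit are all assumptions made for the example.

```python
class VertexFailed(Exception):
    """Raised by a vertex function to simulate a failed execution."""

def run_with_retries(name, func, inputs, max_attempts=3):
    """Re-run a vertex after failures; every attempt gets a new version."""
    record = []   # execution record: one entry per attempt
    for version in range(1, max_attempts + 1):
        output_name = f"{name}.v{version}"   # uniquely named output
        try:
            value = func(inputs)
        except VertexFailed as err:
            record.append((output_name, "failed", str(err)))
            continue
        record.append((output_name, "ok", None))
        return output_name, value, record
    raise RuntimeError(f"{name} failed {max_attempts} times")

# A vertex that fails on its first execution and succeeds on the second.
calls = {"n": 0}

def flaky_sum(xs):
    calls["n"] += 1
    if calls["n"] == 1:
        raise VertexFailed("simulated machine crash")
    return sum(xs)

chosen, value, record = run_with_retries("sum", flaky_sum, [1, 2, 3])
```

Because each attempt writes to a differently named output, a half-written result from a failed attempt is never mistaken for the final one; only the version that completed is selected.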
Policy Managers • [Figure: vertices grouped into Stage R and Stage X, joined by Connection R-X; labels R Manager, X Manager, R-X Manager, and Job Manager]
Cluster network topology • [Figure: racks of computers, each with a top-of-rack switch, connected through a top-level switch]
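Dynamic Aggregation • [Figure: the static graph has six S vertices feeding a single T vertex; in the dynamic graph the S vertices are labeled with their racks (#1, #2, #1, #3, #3, #2) and grouped under per-rack aggregation vertices A (#1, #2, #3), which feed T]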
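The idea behind the dynamic graph can be sketched as follows, assuming the combining function is associative and commutative so partial results can be computed per rack. The function names and the "A<rack>" naming are hypothetical; the rack placement reuses the labels from the figure.

```python
from collections import defaultdict

def build_aggregation_tree(vertex_racks):
    """Group source vertices under one aggregator per rack.

    vertex_racks: dict of source vertex name -> rack id.
    Returns a dict of aggregator name -> list of source vertices.
    """
    by_rack = defaultdict(list)
    for vertex, rack in sorted(vertex_racks.items()):
        by_rack[rack].append(vertex)
    return {f"A{rack}": members for rack, members in sorted(by_rack.items())}

def aggregate(values, tree, combine=sum):
    """Combine inside each rack first, then combine the per-rack partials."""
    partials = [combine(values[v] for v in members) for members in tree.values()]
    return combine(partials)

# Placement taken from the figure labels: S1..S6 ran on racks 1, 2, 1, 3, 3, 2.
placement = {"S1": 1, "S2": 2, "S3": 1, "S4": 3, "S5": 3, "S6": 2}
tree = build_aggregation_tree(placement)
values = {"S1": 1, "S2": 2, "S3": 3, "S4": 4, "S5": 5, "S6": 6}
grand_total = aggregate(values, tree)
```

The tree is built only after the placement is known, which is why this refinement is dynamic: the grouping depends on where the scheduler actually ran each S vertex.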
SkyServer DB Query • 3-way join to find the gravitational lens effect • Table U: (objId, color), 11.8 GB • Table N: (objId, neighborId), 41.8 GB • Find neighboring stars with similar colors: • Join U and N to produce T = (U.color, N.neighborId) where U.objId = N.objId • Join U and T to find U.objId where U.objId = T.neighborId and U.color ≈ T.color
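The two join steps can be written out in plain Python to make the data flow concrete. This is a single-machine sketch on made-up rows, not the distributed plan; the function name, the sample data, and the tolerance d used to model "≈" are assumptions.

```python
def neighbors_with_similar_color(U, N, d):
    """U: list of (objId, color); N: list of (objId, neighborId)."""
    color_of = dict(U)   # objId -> color; assumes objId is unique in U

    # Step 1: T = (U.color, N.neighborId) where U.objId = N.objId
    T = [(color_of[obj], nbr) for obj, nbr in N if obj in color_of]

    # Step 2: U.objId where U.objId = T.neighborId and |U.color - T.color| < d
    matches = {nbr for color, nbr in T
               if nbr in color_of and abs(color_of[nbr] - color) < d}
    return sorted(matches)   # distinct object ids

# Objects 1 and 2 are neighbors with close colors; object 3 is not close.
U = [(1, 0.50), (2, 0.52), (3, 0.90)]
N = [(1, 2), (1, 3), (2, 1), (3, 1)]
similar = neighbors_with_similar_color(U, N, d=0.05)
```

Step 1 keys on objId while step 2 keys on neighborId, so a distributed version has to re-partition the intermediate table T between the two joins.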
SkyServer DB query • Took SQL plan • Manually coded in Dryad • Manually partitioned data • [Figure: the Dryad query graph, bottom to top. Inputs U (u: objid, color) and N (n: objid, neighborobjid), partitioned by objid, feed n X vertices; then n D, 4n M, and 4n S vertices; then n Y vertices, which also read U; then H. Annotations on the graph: "select u.color, n.neighborobjid from u join n where u.objid = n.objid", producing (u.color, n.neighborobjid); "re-partition by n.neighborobjid"; "order by n.neighborobjid"; "select u.objid from u join <temp> where u.objid = <temp>.neighborobjid and |u.color - <temp>.color| < d"; "distinct"; "merge outputs"]
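Optimization • [Figure: the full query graph (stages X, D, M, S, Y, H with inputs U and N) shown next to a single slice of it: one X and one D vertex, four M and four S vertices, and one Y vertex reading U]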
[Figure: speed-up (0 to 16) versus number of computers (0 to 10) for three configurations: Dryad In-Memory, Dryad Two-pass, and SQLServer 2005]
High-Level Programming Languages • Nebula – limited to existing binaries • SSIS – SQLServer workflow engine, distributed • DryadLINQ – supports both imperative and declarative operations on datasets
Dryad/DryadLINQ • Decoupling of Dryad and DryadLINQ • Dryad: execution engine (given DAG, do scheduling and fault tolerance) • DryadLINQ: programming model (given query, generate DAG)
DryadLINQ • Exploits LINQ (relational queries integrated into C#) to provide a hybrid of imperative and declarative programming • LINQ's design makes computations easy to express while giving the runtime leeway in how to implement them • A program is a sequential program composed of LINQ expressions • Expressions perform side-effect-free transformations on datasets • Written and debugged using .NET development tools • More general than distributed SQL • Programs can be automatically optimized and efficiently executed on a large cluster
DryadLINQ • Serialization for Dryad is provided by higher-level software layers such as DryadLINQ • DryadLINQ preserves the LINQ programming model and defines new operators and datatypes for data-parallel programming
DryadLINQ Data Model • [Figure: a partitioned table made up of partitions, each holding .NET objects] • The data model is a distributed implementation of LINQ collections • Each dataset is split into disjoint partitions across the cluster • A partitioned table exposes metadata: type, partition, compression scheme, serialization, etc.
DryadLINQ Constructs • Expressions must be side-effect free • The programmer can supply annotations (hints) to guide optimization • Operators • Hash Partition • Range Partition • Apply: allows arbitrary streaming computations • Fork: takes a single input and generates multiple output datasets
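What the four operators do can be illustrated with a local Python sketch. These are not the DryadLINQ operators themselves: the function names and signatures are hypothetical, everything runs on in-memory lists, and fork is simplified to a two-way split by a predicate.

```python
import bisect

def hash_partition(items, key, n):
    """Place each item in partition hash(key) mod n."""
    parts = [[] for _ in range(n)]
    for item in items:
        parts[hash(key(item)) % n].append(item)
    return parts

def range_partition(items, key, boundaries):
    """Split by sorted boundaries; yields len(boundaries) + 1 partitions."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for item in items:
        parts[bisect.bisect_right(boundaries, key(item))].append(item)
    return parts

def apply_stream(items, func):
    """Hand the whole stream to func, which may keep state across items."""
    return func(iter(items))

def fork(items, predicate):
    """One input, two outputs: items that pass the predicate and the rest."""
    passed, rest = [], []
    for item in items:
        (passed if predicate(item) else rest).append(item)
    return passed, rest

def running_sum(stream):
    """A streaming computation that a per-item map cannot express."""
    total = 0
    for x in stream:
        total += x
        yield total

by_hash = hash_partition(range(10), key=lambda x: x, n=3)
by_range = range_partition([5, 12, 7, 30, 18], key=lambda x: x, boundaries=[10, 20])
sums = list(apply_stream([1, 2, 3], running_sum))
evens, odds = fork(range(6), lambda x: x % 2 == 0)
```

Integer keys are used in the example because Python's hash of a small non-negative integer is the integer itself; string hashes vary between processes, so a real system would need a stable hash function.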
System Implementation • Execution Plan Graph (EPG): the system starts by converting the raw LINQ expression into an EPG • DryadLINQ Optimizations • Static optimizations • Dynamic optimizations • Code Generation: uses dynamic code generation to automatically synthesize the LINQ code that runs at the Dryad vertices
Conclusions • Goal: use a compute cluster as if it were a single computer • Dryad/DryadLINQ represent a significant step toward that goal • Getting there requires close collaboration across many fields of computing, including • Distributed systems • Distributed and parallel databases • Programming language design and analysis
References • Dryad: Distributed Data-parallel Programs from Sequential Building Blocks (Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, March 2007) • DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey, December 2008)