Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Presented by: Theodoros Ioannou
Why Dryad
• An efficient way to build parallel and distributed applications
• Takes advantage of multicore servers
• Exploits data parallelism
• Motivation: GPUs, MapReduce, and parallel databases
What is Dryad
• A dataflow graph made of vertices and channels
• Dryad executes the vertices, which communicate through the channels
What is Dryad (cont'd)
• Vertices: sequential programs supplied by the programmer
• Channels: files, TCP pipes, or shared-memory FIFOs (a toy model of the graph follows below)
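To make the abstraction concrete, here is a minimal Python sketch of the vertex/channel model. This is not the real Dryad API (which is C++); the names `Vertex`, `Graph`, `vertex`, and `channel` are illustrative assumptions. Vertices wrap sequential functions, channels are edges, and the graph runs in topological order.

```python
from collections import defaultdict, deque

class Vertex:
    def __init__(self, name, fn):
        self.name = name   # unique vertex name
        self.fn = fn       # the sequential program: list of inputs -> output

class Graph:
    def __init__(self):
        self.vertices = {}
        self.out = defaultdict(list)   # channels: producer -> consumers
        self.indeg = defaultdict(int)  # number of incoming channels per vertex

    def vertex(self, name, fn):
        v = Vertex(name, fn)
        self.vertices[name] = v
        return v

    def channel(self, src, dst):
        self.out[src.name].append(dst.name)
        self.indeg[dst.name] += 1

    def run(self):
        # Topological execution: a vertex runs once all of its inputs arrived.
        indeg = {n: self.indeg[n] for n in self.vertices}
        inputs = defaultdict(list)
        ready = deque(n for n in self.vertices if indeg[n] == 0)
        results = {}
        while ready:
            n = ready.popleft()
            results[n] = self.vertices[n].fn(inputs[n])
            for m in self.out[n]:
                inputs[m].append(results[n])
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
        return results

g = Graph()
a = g.vertex("read", lambda _: [3, 1, 4, 1, 5])
b = g.vertex("sort", lambda ins: sorted(ins[0]))
c = g.vertex("sum",  lambda ins: sum(ins[0]))
g.channel(a, b)
g.channel(b, c)
print(g.run()["sum"])   # 14
```

In real Dryad the channel implementation (file, TCP pipe, or FIFO) is chosen per edge; here every channel is just an in-memory value passed along.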
Differences from other systems
• Allows developers to define the communication between vertices
• More difficult, but provides better options
• A programmer can master it in a few weeks
• Not as restrictive as MapReduce: multiple inputs and outputs
• Scales from multicore computers to clusters (~1,800 machines)
System Overview
• Everything is based on the communication flow
• Every vertex runs on a CPU of the cluster
• Channels are the data flows between the vertices
• The logical communication graph is mapped to physical resources at run time (sketched below)
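A hedged sketch of the run-time mapping idea: logical vertices are bound to physical machines only when the job executes. The round-robin policy and machine names here are illustrative assumptions, not Dryad's actual scheduler (which is greedy and locality-aware, as later slides discuss).

```python
from itertools import cycle

def map_to_cluster(vertex_names, machines):
    """Assign each logical vertex to a physical machine (toy policy)."""
    slot = cycle(machines)
    return {v: next(slot) for v in vertex_names}

placement = map_to_cluster(["read", "sort", "sum"], ["m01", "m02"])
print(placement)   # {'read': 'm01', 'sort': 'm02', 'sum': 'm01'}
```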
SQL Example
"It finds all the objects in the database that have neighboring objects within 30 arc seconds such that at least one of the neighbors has a color similar to the primary object's color."
Execution
• Input: the data file is a distributed file
• The graph is changed dynamically according to the positions of the data-file partitions
• Output: the result is again a distributed file
• The scheduler on the job manager (JM) keeps a history of each vertex
• On a failure the job would be terminated; re-executing (replicating) vertices avoids that
• Versioning is used to obtain the right result
• The job only fails if a vertex is re-run more than a threshold number of times (see the sketch below)
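A minimal sketch of this retry-with-versioning policy, under an assumed in-process model: each execution of a vertex gets a version number so downstream vertices consume the output of exactly one successful run. `MAX_RETRIES` and the exception-based failure signal are illustrative, not values or mechanisms from the paper.

```python
MAX_RETRIES = 5   # illustrative threshold, not the paper's actual value

def execute_with_retries(vertex_fn, inputs):
    version = 0
    while version < MAX_RETRIES:
        version += 1
        try:
            # Tag the output with the version so downstream vertices can
            # consume the result of exactly one successful execution.
            return version, vertex_fn(inputs)
        except Exception:
            continue   # schedule another version of the vertex
    raise RuntimeError("vertex failed too many times; terminating the job")

print(execute_with_retries(sum, [1, 2, 3]))   # (1, 6)
```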
Execution (cont'd)
• The JM assumes it is the only job running on the cluster
• It uses a greedy scheduling algorithm
• Vertex programs are deterministic: they give the same result whenever you run them
• On a vertex failure the JM is notified or gets a heartbeat timeout
• If the channels are FIFOs or pipes, all connected vertices are killed and re-executed (see the sketch below)
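The last point can be sketched as a graph traversal: when channels are TCP pipes or shared-memory FIFOs, a failed vertex's inputs cannot be replayed from disk, so its whole connected component over pipe/FIFO edges must be re-executed. The data structures below are assumptions for illustration.

```python
def vertices_to_restart(failed, pipe_channels):
    """pipe_channels: dict vertex -> set of neighbours joined by pipe/FIFO."""
    stack, doomed = [failed], set()
    while stack:
        v = stack.pop()
        if v in doomed:
            continue
        doomed.add(v)
        stack.extend(pipe_channels.get(v, ()))
    return doomed

pipes = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(vertices_to_restart("B", pipes))   # e.g. {'A', 'B', 'C'}
```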
Execution (cont'd)
• Vertices run on the machines (or cluster) as close as possible to the data they use (sketched below)
• Because the JM cannot know the amount of intermediate data in advance, a dynamic solution is needed
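A hedged sketch of locality-preferring placement: given the machines that already hold a vertex's input data, pick a free one among them if possible. This is a simplification of the paper's greedy, locality-aware scheduling; the machine names are made up.

```python
def place_vertex(input_locations, free_machines):
    """input_locations: machines holding the data; prefer running locally."""
    for m in free_machines:
        if m in input_locations:
            return m                  # run next to the data
    return free_machines[0]           # otherwise fall back to any free machine

print(place_vertex({"m07", "m12"}, ["m03", "m12", "m20"]))   # m12
```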
Experiments
• First: an SQL query turned into a Dryad application (compared against SQL Server while varying the number of machines used)
• Second: a simple MapReduce data-mining operation turned into a Dryad application (10.2 TB of data on 1,800 machines)
• Uses horizontal partitioning of the data, pipelined parallelism within processes, and inter-partition exchange operations to move partial results (sketched below)
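A toy sketch of the inter-partition "exchange" idea: partial results computed on each partition are re-partitioned on a key so that records with equal keys meet in the same downstream partition. The hash function and record layout are illustrative assumptions, kept deterministic so the example output is stable.

```python
def exchange(partitions, key, n_out):
    out = [[] for _ in range(n_out)]
    for part in partitions:           # partial results, one list per node
        for record in part:
            # deterministic toy hash on the key
            bucket = sum(map(ord, key(record))) % n_out
            out[bucket].append(record)
    return out

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
print(exchange(parts, key=lambda r: r[0], n_out=2))
# [[('b', 2)], [('a', 1), ('a', 3), ('c', 4)]]
```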
Shortcomings and Future Work
• The programmer can manipulate inter-process communication, which can lead to deadlocks
• The programmer should know the physical resources of the system, which breaks the abstraction
• The assumption of one job on the cluster means only one job can run at a time
• SQL experiment: fewer capabilities than SQL Server
• MapReduce experiment: only shows that the system works "sufficiently well" for those cases, with no detailed results
• Future work: use statistics to predict resources before executing a known program ("we may be able to...")
• Simplicity is sacrificed: the code is more relaxed (less constrained) than MapReduce's