Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Presented by Asma’a Nassar
Supervised by Dr. Amer Al-badarneh
Outline
• Introduction
• Differences from other systems
• System Overview
• System Organization Schema
• SQL example and how Dryad maps it
• Execution
• Experiments
• Results
Introduction
• What is Dryad?
• A dataflow graph made up of vertices and channels.
• Dryad executes the vertices, which communicate through the channels.
Introduction (cont’d)
• Vertices: sequential programs supplied by the programmer.
• The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking.
• Channels: files, TCP pipes, or shared-memory FIFOs.
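The vertex and channel model above can be illustrated with a minimal in-process sketch. It is not Dryad's API: the names `Channel`, `run_vertex`, `parse`, `square`, and `run_pipeline` are hypothetical, and an in-memory queue stands in for a file, TCP pipe, or shared-memory FIFO.

```python
from collections import deque


class Channel:
    """Hypothetical FIFO channel; stands in for a file, TCP pipe, or shared-memory FIFO."""

    def __init__(self):
        self._items = deque()

    def write(self, item):
        self._items.append(item)

    def read_all(self):
        # Drain the channel in FIFO order.
        while self._items:
            yield self._items.popleft()


def run_vertex(program, inputs, outputs):
    """Run one sequential vertex program: feed it its input channels, copy results to outputs."""
    for item in program([ch.read_all() for ch in inputs]):
        for out in outputs:
            out.write(item)


# Two vertex programs: plain sequential code, no threads and no locks.
def parse(inputs):
    for line in inputs[0]:
        yield int(line)


def square(inputs):
    for x in inputs[0]:
        yield x * x


def run_pipeline(lines):
    """Wire the graph source -> parse -> square -> sink and execute it vertex by vertex."""
    src, mid, sink = Channel(), Channel(), Channel()
    for line in lines:
        src.write(line)
    run_vertex(parse, [src], [mid])
    run_vertex(square, [mid], [sink])
    return list(sink.read_all())
```

Calling `run_pipeline(["1", "2", "3"])` pushes three records through both vertices. The point of the sketch is that the vertex programs never mention how their channels are implemented.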
Introduction (cont’d)
• Why Dryad?
• An efficient way to build parallel and distributed applications.
• Takes advantage of multicore servers.
• Exploits data parallelism.
• Schedules vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer.
• Motivation: GPUs, MapReduce, and parallel databases.
Introduction (cont’d)
• The application can discover the size and placement of data at run time.
• It can modify the graph as the computation progresses to make efficient use of the available resources.
• Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers.
Introduction (cont’d)
The Dryad execution engine solves the difficult problems of creating a large distributed, concurrent application:
• scheduling the use of computers and their CPUs
• recovering from communication or computer failures
• transporting data between vertices
Differences from other systems
• Allows developers to define the communication between vertices.
• More difficult to use, but provides better options.
• A programmer can master it in a few weeks.
• Not as restrictive as MapReduce.
• Vertices can have multiple inputs and outputs.
• Scales from multicore computers to clusters (~1800 machines).
System Overview
• Everything is based on the communication flow.
• Every vertex runs on a CPU of the cluster.
• Channels are the data flows between the vertices.
• The job is described as a logical communication graph.
• The graph is mapped to physical resources at run time.
SQL Example
“It finds all the objects in the database that have neighboring objects within 30 arc seconds such that at least one of the neighbors has a color similar to the primary object’s color.”

    select distinct p.objID
    from photoObjAll p
    join neighbors n                      -- call this join "X"
      on p.objID = n.objID
      and n.objID < n.neighborObjID
      and p.mode = 1
    join photoObjAll l                    -- call this join "Y"
      on l.objid = n.neighborObjID
      and l.mode = 1
      and abs((p.u-p.g)-(l.u-l.g))<0.05
      and abs((p.g-p.r)-(l.g-l.r))<0.05
      and abs((p.r-p.i)-(l.r-l.i))<0.05
      and abs((p.i-p.z)-(l.i-l.z))<0.05
How the SQL query is mapped into a Dryad computation
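As a rough illustration of splitting the query into stages, the sketch below evaluates joins "X" and "Y" as two sequential functions chained together, the way two vertices would be chained by a channel. It is a single-machine sketch, not the graph from the presentation: the function names are hypothetical, rows are plain dictionaries, and it assumes `objID` is unique within `photoObjAll`.

```python
def color_close(p, l, tol=0.05):
    """Color test from the query: adjacent-band differences u-g, g-r, r-i, i-z must agree within tol."""
    bands = ["u", "g", "r", "i", "z"]
    return all(
        abs((p[a] - p[b]) - (l[a] - l[b])) < tol
        for a, b in zip(bands, bands[1:])
    )


def join_x(photo_objs, neighbors):
    """Join "X": pair each primary object (mode = 1) with its neighbor IDs."""
    # Assumption: objID is unique, so a dict lookup replaces the join.
    by_id = {p["objID"]: p for p in photo_objs if p["mode"] == 1}
    for n in neighbors:
        if n["objID"] < n["neighborObjID"] and n["objID"] in by_id:
            yield by_id[n["objID"]], n["neighborObjID"]


def join_y(x_rows, photo_objs):
    """Join "Y": keep primaries whose neighbor (mode = 1) has a similar color."""
    by_id = {l["objID"]: l for l in photo_objs if l["mode"] == 1}
    for p, neighbor_id in x_rows:
        l = by_id.get(neighbor_id)
        if l is not None and color_close(p, l):
            yield p["objID"]


def run_query(photo_objs, neighbors):
    """select distinct p.objID: chain X into Y, then de-duplicate."""
    return sorted(set(join_y(join_x(photo_objs, neighbors), photo_objs)))
```

In a distributed setting each of these functions would run as many vertex instances over partitions of the tables; the sketch only shows the per-stage logic.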
Execution
• Dryad includes a runtime library that is responsible for setting up and executing vertices as part of a distributed computation.
• Input: the data file is a distributed file.
• The graph is changed dynamically according to where the partitions of the data file are located.
• Output: the result is again a distributed file.
Execution (cont’d)
• The scheduler on the job manager (JM) keeps a history of each vertex.
• If the JM itself fails, the job is terminated.
• Replication can be used to avoid that.
• Each execution of a vertex is versioned so that the right result is used.
• The job fails only if a vertex is re-run more than a threshold number of times.
Execution (cont’d)
• The JM assumes its job is the only one running on the cluster.
• It uses a greedy scheduling algorithm.
• Vertex programs are deterministic: they give the same result whenever they are run.
• If a vertex fails, the JM is either notified or sees a heartbeat timeout.
• If the failed vertex uses FIFOs or pipes, all the connected vertices are killed and re-executed together.
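The re-execution policy described above can be sketched as a retry loop that records a versioned history and gives up past a threshold. This is a minimal sketch, not the JM's real logic: `run_with_retries`, `make_flaky`, `JobFailed`, and the `max_attempts` parameter (total executions allowed) are hypothetical names chosen for illustration.

```python
class JobFailed(Exception):
    """Raised when a vertex exceeds the allowed number of executions."""


def run_with_retries(vertex, max_attempts=3):
    """Re-execute a deterministic vertex until it succeeds or the threshold is hit.

    Returns (result, history); history holds one (version, outcome) pair per execution.
    """
    history = []
    for version in range(1, max_attempts + 1):
        try:
            result = vertex()
        except Exception as err:
            # A failure report or heartbeat timeout would land here.
            history.append((version, f"failed: {err}"))
            continue
        history.append((version, "ok"))
        return result, history
    raise JobFailed(f"vertex re-run limit of {max_attempts} reached")


def make_flaky(failures, value):
    """Build a test vertex that fails `failures` times, then returns `value`."""
    state = {"left": failures}

    def vertex():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("machine lost")
        return value

    return vertex
```

Because vertex programs are deterministic, re-running one is safe: any successful version yields the same output, so the history only needs to remember which version finished.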
Execution (cont’d)
• Vertices are run on machines (or clusters) as close as possible to the data they use.
• Because the JM cannot know the amount of intermediate data in advance, a dynamic solution is needed.
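A greedy, locality-aware placement of the kind described above might look like the following sketch. It is an assumption-laden simplification: `place_vertices` is a hypothetical helper, each machine is taken to have exactly one free slot, and each vertex lists the machines that hold its input data.

```python
def place_vertices(vertices, free_machines):
    """Greedily assign each vertex to a free machine, preferring one that holds its data.

    vertices: dict mapping vertex name -> list of preferred machines (data locations).
    free_machines: machines with one free slot each (simplifying assumption).
    """
    free = list(free_machines)
    placement = {}
    for name, preferred in vertices.items():
        if not free:
            break  # remaining vertices wait until a machine frees up
        # Take the first preferred machine that is still free, else any free machine.
        choice = next((m for m in preferred if m in free), free[0])
        free.remove(choice)
        placement[name] = choice
    return placement
```

The algorithm never revisits an earlier choice, which is what makes it greedy: a later vertex may lose its preferred machine to an earlier one.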
Experiments
• First: an SQL query converted to a Dryad application (compared with SQL Server while varying the number of machines used).
• Second: a simple MapReduce data-mining operation converted to a Dryad application (10.2 TB of data and 1800 machines).
• Both use horizontal partitioning of data, pipelined parallelism within processes, and inter-partition exchange operations to move partial results.
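Horizontal partitioning followed by an exchange of partial results can be sketched with a toy counting job. This is not the experiment's code: the function names are hypothetical, keys are assumed to be strings, and `zlib.crc32` is used only to get a hash that is stable across runs.

```python
import zlib
from collections import Counter


def partition_of(key, n):
    """Stable hash partitioning: every occurrence of a key maps to the same bucket."""
    return zlib.crc32(key.encode("utf-8")) % n


def local_count(records):
    """Per-partition stage: compute a partial result from local data only."""
    return Counter(records)


def exchange(partials, n):
    """Exchange stage: route each partial count to the bucket that owns its key."""
    buckets = [Counter() for _ in range(n)]
    for partial in partials:
        for key, count in partial.items():
            buckets[partition_of(key, n)][key] += count
    return buckets


def distributed_count(partitions, n_out=2):
    """Count records across horizontal partitions via local counts plus one exchange."""
    partials = [local_count(p) for p in partitions]
    buckets = exchange(partials, n_out)
    total = Counter()
    for bucket in buckets:
        total.update(bucket)
    return dict(total)
```

Aggregating locally before the exchange means only one partial count per key and partition crosses the network, rather than every raw record.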