1 / 48

DryadLINQ: General-Purpose Distributed Data-Parallel Computing

DryadLINQ is a high-performance distributed computing engine that simplifies data-parallel programming, enabling scalability across clusters without requiring knowledge of concurrent programming. It bridges Dryad and LINQ, translating LINQ programs into Dryad job execution plans.

Download Presentation

DryadLINQ: General-Purpose Distributed Data-Parallel Computing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Microsoft DryadLINQ --Jinling Li

  2. What’s DryadLINQ? • A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.[1] • Data-Parallel Computing (large data example) • High-Level Language • DryadLINQ=Dryad+LINQ Dryad LINQ

  3. DryadLINQ=Dryad+LINQ Figure source: [1]

  4. Outline • Dryad • LINQ • DryadLINQ • DryadLINQ in Machine Learning • Strengths and Weaknesses

  5. Dryad • Microsoft Dryad is a high-performance, general-purpose distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. [2] • Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs.[2] • A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.[2]

  6. Dryad System Architecture • The job manager contains the application-specific code to construct the job’s communication graph along with library code to schedule the work across the available resources. [2] • The name server is used to enumerate all the available computers. The name server also exposes the position of each computer within the network topology . [2] Figure source:[2]

  7. Dryad System Architecture • The job manager (JM) consults the name server (NS) to discover the list of available computers. It maintains the job graph and schedules running vertices (V) as computers become available using the daemon (D) as a proxy. [2] • The first time a vertex (V) is executed on a computer its binary is sent from the job manager to the daemon and subsequently it is executed from a cache. [2] Figure source: [2]

  8. Dryad Computational Model • The basic computational model for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. [3] Figure source: [3]

  9. Software Stack • Dryad is mostly used as middleware below a high-level language layer and low-level internal cluster infrastructure. [3] Figure source:[3]

  10. Below Dryad • Below Dryad is a cluster-management system that supports some low-level actions like starting a process on a remote computer, and one or more distributed storage systems that support partitioned files. [4] • Most of the Dryad development has been done on top of a Microsoft internal cluster infrastructure called Cosmos that was developed by the Bing product group. [4]

  11. Outline • Dryad • Introduction • Computational Model • System Architecture • Software Stack • LINQ • DryadLINQ • DryadLINQ in Machine Learning • Strengths and Weaknesses

  12. LINQ • LINQ=Language Integrated Queries • A Microsoft .NET Framework component that adds native data querying capabilities to .NET languages • Comprises a set of operators to manipulate collections of .Net objects and bridges the gap between the world of objects and the world of data. [5] Figure source: [5]

  13. LINQ • LINQ adds high level declarative data manipulation to many of the .NET programming languages, including C#, Visual Basic and F#. [5] • LINQ datasets are .NET collections. Technically, a .NET collection of values of type T is a data type which implements the predefined interface IEnumerable<T>[5] • Another interface IQueryable<T> represents a query (i.e., a computation) that can produce a collection with elements of type T. [5]

  14. LINQ operators

  15. A Simple Example: Word Count • Count word frequency in a set of documents [6] • vardocuments = GetDocuments(); • var words = documents.SelectMany (document => document.Words); • vargroups = words.GroupBy(word=>word); • var counts = groups.Select • (group => new WordCount(group.Key, group.Count()));

  16. Outline • Dryad • LINQ • Introduction • Interfaces • Operators • DryadLINQ • DryadLINQ in Machine Learning • Strengths and Weaknesses

  17. DryadLINQ • DryadLINQbridges the gap between Dryad and LINQ layer. • DryadLINQ translates programs written in LINQ into Dryad job execution plans that can be executed on a cluster by Dryad, and transparently returns the results to the host application[5]. Figure source: [5]

  18. Example: LINQ operators Figure source: [5]

  19. 2D Piping • The Dryad job execution plans generated by DryadLINQ are composable: the output of one graph can become the input of another one. In fact, this is exactly how complex LINQ queries are translated: each operator is translated to a graph independently, and the graphs are then concatenated. [7] Figure source: [7]

  20. A Simple Example: Word Count • Count word frequency in a set of documents[6]: • vardocuments = GetDocuments(); • var words = documents.SelectMany (document => document.Words); • vargroups = words.GroupBy(word=>word); • var counts = groups.Select • (group => new WordCount(group.Key, group.Count()));

  21. Word Count in DryadLINQ • Count word frequency in a set of documents[6]: • var documents = DryadLinq.GetTable<Document> • (“file://docs.txt”); • var words = documents.SelectMany • (document => document.Words); • var groups = words.GroupBy(word=>word); • var counts = groups.Select • (group => new WordCount(group.Key, group.Count()));

  22. Distributed Execution of Word Count Figure source: [6]

  23. Another Example: extract Ulfar’s favorite web pages from many web log files

  24. DryadLINQ Execution Overview Figure source: [2]

  25. DryadLINQ=Dryad+LINQ Figure source: [8]

  26. Outline • Dryad • LINQ • DryadLINQ • Introduction • DryadLINQ = Compiles LINQ to Dryad • LINQ operators and other examples • DryadLINQ in Machine Learning • Strengths and Weaknesses

  27. DryadLINQ in Machine Learning

  28. Real-life Application: XBox Figure source: [9]

  29. Example: k-means[5] • K-means in LINQ

  30. K-means in DryadLINQ[5] • How to implement the GroupBy operation at the heart of the k-means aggregation? • DryadLINQ generates a job execution plan that uses two-level aggregation: each computer builds local groups with the local data and only sends the aggregated information about these groups to the next stage; the next stage computes the actual centroid. [5]

  31. K-means in DryadLINQ Figure source: [5]

  32. Example: Decision Tree[5] • Represent a decision tree with a dictionary that maps tree node indices (integer values) to attribute indices in the attribute array • The most common algorithm to induce a decision tree starts from an empty tree and a set of records with class labels and attributes with values. • The algorithm repeatedly extends the tree by grouping records by their current location under the partial tree, and for each such group determining the attribute resulting in the greatest reduction in conditional entropy. • records.GroupBy(record => TreeWalk(record, tree)) .Select(group => FindBestAttribute(group));

  33. Decision Tree Induction in DryadLINQ[5]

  34. Decision Tree Induction in DryadLINQ • Each iteration through the loop invokes a query returning the list of attribute indices that are best for each of the leaves in the old tree. the tree variable is updated on the client computer, and retransmitted to the cluster by DryadLINQ with each iteration. [5]

  35. Decision Tree Induction in DryadLINQ[5] Figure source: [5]

  36. Example: Singular Value Decomposition[5] • The Singular Value Decomposition (SVD) lies at the heart of several large scale data analyses: principal components analysis, collaborative filtering, image segmentation, among many others. [5] • The SVD of a n*m matrix A is a decomposition such that U and V are both orthonormal And is a diagonal matrix with non-negative entries. [5] • Example

  37. Singular Value Decomposition in DryadLINQ[5]

  38. Singular Value Decomposition in DryadLINQ[5]

  39. Singular Value Decomposition in DryadLINQ • Figure source: [5] Figure source: [5]

  40. Outline • Dryad • LINQ • DryadLINQ • DryadLINQin Machine Learning • K-means • DecisionTree • Singular Value Decomposition • Strengths and Weaknesses

  41. Strengths & Weaknesses • DryadLINQ has the following features [8]: • Declarative programming • Automatic parallelization • Integration with Visual Studio • Integration with .Net • Job graph optimizations • Conciseness

  42. Strengths & Weaknesses • While DryadLINQ is a great tool to program clusters, there is a price to pay too for the convenience that it provides[5]. • Efficiency • managed code (C#) is not always as efficient as native code • Debugging • the experience of debugging a cluster program remains more painful than debugging a single-computer program • Transparency • In most cases one needs to have some understanding of the operation of the compiler and particularly of the job execution plans generated to avoid egregious mistakes

  43. You can have it! • Dryad+DryadLINQ available for download • Academic license • Commercial evaluation license • Runs on Windows HPC platform

  44. Conclusion Figure source: [1]

  45. Thank you!

  46. Reference • [1] Microsoft Research webpage: Dryad and DryadLINQ for Data Intensive Research http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx • [2] DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level LanguageYuan Yu et al. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008. • [3] Microsoft Research webpage: Dryadhttp://research.microsoft.com/en-us/projects/dryad/ • [4] Dryad: Distributed Data-Parallel Programs from Sequential Building BlocksMichael Isard, MihaiBudiu, Yuan Yu, Andrew Birrell, and Dennis FetterlyEuropeanConference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007 • [5] Large-Scale Machine Learning using DryadLINQ, chapter in Scaling Up Machine Learning, Frank McSherry, Yuan Yu, MihaiBudiu, Michael Isard, and Dennis Fetterly, Cambridge University Press, December 2011 • [6] DryadLINQ: A System for General-Purpose Distributed Data-Parallel ComputingPresentation by Yuan Yu at OSDI, December, 2008 • [7] Cluster Computing with DryadLINQPresentation by MihaiBudiu at Palo Alto Research Center CSL Colloquium, Palo Alto, CA May 8, 2008 • [8] Microsoft Research webpage: DryadLINQ http://research.microsoft.com/en-us/projects/dryadlinq/ • [9] A Machine-Learning toolking in DryadLINQPresentation slides in PowerPoint by MihaiBudiu and KannanAchan.

  47. Parallel Runtimes – DryadLINQ vs. Hadoop

  48. Expirement by Indiana University Bloomington

More Related