Cluster Computing with DryadLINQ. Mihai Budiu, MSR-SVC. PARC, May 8, 2008
Acknowledgments: MSR SVC and ISRC SVC. Michael Isard, Yuan Yu, Andrew Birrell, Dennis Fetterly, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Computer Evolution ? 1961 2008 2040
Computer Evolution • ENIAC (1943): 30 tons, 200 kW • Datacenter (2008): 500,000 ft², 40 MW • 2040: ?
Layers Applications Programming Languages and APIs Resource Management Scheduling Distributed Execution Operating System Caching and Synchronization Storage Identity & Security Networking
The Rest of This Talk Machine Learning Large Vectors DryadLINQ Dryad Distributed Filesystem CIFS/NTFS Cluster Services Windows Server Windows Server Windows Server Windows Server
TeraSort • How fast can you sort 10^10 100-byte records (1 TB)? • Sequential scan/disk = 4.6 hours • Current record: 435 seconds (7.2 min), cluster of 40 Itanium2, 2520 SAN disks • Code: 3300 lines of C • Our result: 349 seconds (5.8 min), cluster of 240 AMD64 (quad) machines, 920 disks • Code: 17 lines of LINQ
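The figures on this slide can be checked with a little arithmetic. The per-disk scan rate and the speedup ratios below are derived from the slide's numbers and are not stated on it; a sort must also write its output, so the scan time is only a lower bound for one disk.

```latex
\begin{align*}
10^{10}\ \text{records} \times 100\ \text{B/record} &= 10^{12}\ \text{B} = 1\ \text{TB} \\
\frac{10^{12}\ \text{B}}{4.6\ \text{h} \times 3600\ \text{s/h}} = \frac{10^{12}\ \text{B}}{16560\ \text{s}} &\approx 60\ \text{MB/s (implied sequential rate of one disk)} \\
\frac{435\ \text{s}}{349\ \text{s}} &\approx 1.25 \quad \text{(new result vs. previous record)} \\
\frac{16560\ \text{s}}{349\ \text{s}} &\approx 47 \quad \text{(cluster sort vs. a single sequential scan)}
\end{align*}
```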
Outline • Introduction • Dryad • DryadLINQ • Building on DryadLINQ
Outline • Introduction • Dryad • deployed since 2006 • many thousands of machines • analyzes many petabytes of data/day • DryadLINQ • Building on DryadLINQ
Design Space Grid Internet Data-parallel Dryad Search Shared memory Private data center Transaction HPC Latency Throughput
Data Partitioning DATA RAM DATA
2-D Piping • Unix Pipes: 1-D grep | sed | sort | awk | perl • Dryad: 2-D grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
Dryad = Execution Layer Job (application) Pipeline ≈ Dryad Shell Cluster Machine
Virtualized 2-D Pipelines • 2D DAG • multi-machine • virtualized
Dryad Job Structure Channels Input files Stage Output files sort grep awk sed perl sort grep awk sed grep sort Vertices (processes)
Channels • Finite Streams of items • distributed filesystem files (persistent) • SMB/NTFS files (temporary) • TCP pipes (inter-machine) • memory FIFOs (intra-machine) X Items M
Architecture data plane Files, TCP, FIFO, Network job schedule V V V NS PD PD PD control plane Job manager cluster
Dynamic Graph Rewriting X[0] X[1] X[3] X[2] X’[2] Slow vertex Duplicate vertex Completed vertices Duplication Policy = f(running times, data volumes)
Dynamic Aggregation S S S S S S T static S S S S S S # 1 # 2 # 1 # 3 # 3 # 2 rack # A A A # 1 # 2 # 3 T dynamic
Data-Parallel Computation Parallel Databases Map-Reduce Application Dryad Execution GFSBigTable Storage
Outline • Introduction • Dryad • DryadLINQ • Building on DryadLINQ
DryadLINQ Dryad
LINQ Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value }; Vertex code Query plan (Dryad job) Data collection C# C# C# C# results
Data Model C# objects Partition Collection
Query Providers Client machine DryadLINQ C# Data center Distributed query plan Invoke Query Expr Query ToDryadTable Input Tables JM Dryad Execution Output DryadTable C# Objects Results Output Tables (11) foreach
Example: Histogram public static IQueryable<Pair> Histogram( IQueryable<LineRecord> input, int k) { var words = input.SelectMany(x => x.line.Split(' ')); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); return top; }
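The same operator chain can be tried on one machine with LINQ to Objects. The sketch below is not DryadLINQ: it swaps IQueryable for IEnumerable so that nothing is distributed, and the LineRecord and Pair classes are hypothetical stand-ins for the types the slide assumes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the slide's LineRecord and Pair types.
public sealed class LineRecord
{
    public string line;
    public LineRecord(string line) { this.line = line; }
}

public sealed class Pair
{
    public string word;
    public int count;
    public Pair(string word, int count) { this.word = word; this.count = count; }
}

public static class LocalHistogram
{
    // Same operators as the slide, evaluated in-process over IEnumerable.
    public static IEnumerable<Pair> Histogram(IEnumerable<LineRecord> input, int k)
    {
        var words = input.SelectMany(x => x.line.Split(' '));         // one item per word
        var groups = words.GroupBy(x => x);                           // group equal words
        var counts = groups.Select(x => new Pair(x.Key, x.Count()));  // word -> frequency
        var ordered = counts.OrderByDescending(x => x.count);         // most frequent first
        return ordered.Take(k);                                       // keep the top k
    }

    public static void Main()
    {
        var input = new[] { new LineRecord("a b a"), new LineRecord("c a b") };
        foreach (var p in Histogram(input, 2))
            Console.WriteLine(p.word + " " + p.count);   // prints "a 3" then "b 2"
    }
}
```

Only the input type differs from the slide's version: with IQueryable the lambdas are captured as expression trees that a query provider can translate into a distributed plan, whereas here they are ordinary delegates.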
Histogram Plan SelectMany HashDistribute Merge GroupBy Select OrderByDescending Take MergeSort Take
Map-Reduce in DryadLINQ public static IQueryable<S> MapReduce<T,M,K,S>( this IQueryable<T> input, Expression<Func<T, IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<IGrouping<K,M>,S>> reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result; }
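For a sense of how this three-operator definition is used, here is a single-process analogue applied to word counting. It is a sketch under stated assumptions: plain Func delegates replace the slide's Expression<Func<...>> parameters, IEnumerable replaces IQueryable, and the class name LocalMapReduce is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalMapReduce
{
    // In-memory analogue of the slide's MapReduce extension method.
    public static IEnumerable<S> MapReduce<T, M, K, S>(
        this IEnumerable<T> input,
        Func<T, IEnumerable<M>> mapper,
        Func<M, K> keySelector,
        Func<IGrouping<K, M>, S> reducer)
    {
        var map = input.SelectMany(mapper);    // map: each input yields zero or more items
        var group = map.GroupBy(keySelector);  // shuffle: collect items with equal keys
        return group.Select(reducer);          // reduce: one result per key group
    }

    public static void Main()
    {
        var lines = new[] { "to be or", "not to be" };
        var counts = lines.MapReduce(
            line => line.Split(' '),           // mapper: line -> words
            word => word,                      // key: the word itself
            g => g.Key + "=" + g.Count());     // reducer: word -> "word=count"
        Console.WriteLine(string.Join(", ", counts));  // prints "to=2, be=2, or=1, not=1"
    }
}
```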
Map-Reduce Plan [diagram, panels (1), (2), (3)] Stages: map (M), sort (Q), groupby (G1), reduce (R), distribute (D), mergesort (MS), groupby (G2), reduce (R), consumer (X); partial aggregation
Distributed Sorting in DryadLINQ public static IQueryable<TSource> DSort<TSource, TKey>(this IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, int pcount) { var samples = source.Apply(x => Sampling(x)); var keys = samples.Apply(x => ComputeKeys(x, pcount)); var parts = source.RangePartition(keySelector, keys); return parts.OrderBy(keySelector); }
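The four steps of DSort (sample, compute range keys, range-partition, sort each partition) can be illustrated in a single process. This is only a sketch of the sample-sort idea, not the DryadLINQ implementation: it is restricted to int keys, the stride-based Sample and the rank-based ComputeKeys are assumed stand-ins for the slide's Sampling and ComputeKeys helpers, and it assumes a non-empty input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalDSort
{
    // Sampling stand-in: keep every stride-th key (deterministic for the example).
    static List<int> Sample(IReadOnlyList<int> keys, int stride)
    {
        var samples = new List<int>();
        for (int i = 0; i < keys.Count; i += stride) samples.Add(keys[i]);
        return samples;
    }

    // ComputeKeys stand-in: pcount - 1 splitters at evenly spaced ranks of the sorted sample.
    static int[] ComputeKeys(List<int> samples, int pcount)
    {
        var sorted = samples.OrderBy(s => s).ToArray();
        var splitters = new int[pcount - 1];
        for (int i = 1; i < pcount; i++)
            splitters[i - 1] = sorted[i * sorted.Length / pcount];
        return splitters;
    }

    // RangePartition stand-in: bucket b holds keys k with splitters[b-1] <= k < splitters[b].
    static List<int>[] RangePartition(IEnumerable<int> keys, int[] splitters)
    {
        var parts = new List<int>[splitters.Length + 1];
        for (int b = 0; b < parts.Length; b++) parts[b] = new List<int>();
        foreach (var k in keys)
        {
            int b = 0;
            while (b < splitters.Length && k >= splitters[b]) b++;
            parts[b].Add(k);
        }
        return parts;
    }

    public static List<int> DSort(IReadOnlyList<int> keys, int pcount, int stride)
    {
        var splitters = ComputeKeys(Sample(keys, stride), pcount);
        var parts = RangePartition(keys, splitters);
        // Buckets cover disjoint, increasing key ranges, so sorting each one
        // independently and concatenating in bucket order gives a total order.
        return parts.SelectMany(p => p.OrderBy(k => k)).ToList();
    }

    public static void Main()
    {
        var data = new[] { 9, 3, 7, 1, 8, 2, 6, 4, 5, 0 };
        Console.WriteLine(string.Join(" ", DSort(data, 3, 2)));  // prints "0 1 2 3 4 5 6 7 8 9"
    }
}
```

Sampling matters because the splitters decide how evenly the data spreads over partitions; in a cluster setting each bucket would be sorted on a different machine.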
Distributed Sorting Plan DS DS DS DS DS H H H O D D D D D (1) (2) (3) M M M M M S S S S S
Outline • Introduction • Dryad • DryadLINQ • Building on DryadLINQ
Machine Learning in DryadLINQ Data analysis Machine learning Large Vector DryadLINQ Dryad
Operations on Large Vectors: Map 1 T f U f preserves partitioning T f U
Map 2 (Pairwise) T f U V T U f V
Map 3 (Vector-Scalar) T f U V T U f V 47
Reduce (Fold) f U U U U f f f U U U f U
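The four operations on large vectors (Map, pairwise Map, vector-scalar Map, Reduce) can be mimicked over an in-memory partitioned vector. The PartitionedVector class and its method names below are hypothetical and are not the Large Vector library's API; Reduce additionally assumes that f is associative and that every partition is non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical partitioned vector: a list of partitions, each a plain array.
public sealed class PartitionedVector<T>
{
    public readonly List<T[]> Parts;
    public PartitionedVector(IEnumerable<T[]> parts) { Parts = parts.ToList(); }

    // Map 1: apply f element-wise; the partitioning is preserved.
    public PartitionedVector<U> Map<U>(Func<T, U> f) =>
        new PartitionedVector<U>(Parts.Select(p => p.Select(f).ToArray()));

    // Map 2 (pairwise): combine two vectors that share the same partitioning.
    public PartitionedVector<V> PairwiseMap<U, V>(PartitionedVector<U> other, Func<T, U, V> f) =>
        new PartitionedVector<V>(Parts.Zip(other.Parts, (a, b) => a.Zip(b, f).ToArray()));

    // Map 3 (vector-scalar): the same scalar is paired with every element.
    public PartitionedVector<V> ScalarMap<U, V>(U scalar, Func<T, U, V> f) =>
        new PartitionedVector<V>(Parts.Select(p => p.Select(x => f(x, scalar)).ToArray()));

    // Reduce (fold): reduce each partition locally, then combine the partial results.
    public T Reduce(Func<T, T, T> f) =>
        Parts.Select(p => p.Aggregate(f)).Aggregate(f);
}

public static class LargeVectorDemo
{
    public static void Main()
    {
        var x = new PartitionedVector<double>(new[] { new[] { 1.0, 2.0 }, new[] { 3.0 } });
        var y = new PartitionedVector<double>(new[] { new[] { 4.0, 5.0 }, new[] { 6.0 } });

        double sumSquares = x.Map(a => a * a).Reduce((a, b) => a + b);               // 1 + 4 + 9
        double dot = x.PairwiseMap(y, (a, b) => a * b).Reduce((a, b) => a + b);       // 4 + 10 + 18
        double scaled = x.ScalarMap(2.0, (a, s) => a * s).Reduce((a, b) => a + b);    // 2 + 4 + 6

        Console.WriteLine(sumSquares);  // prints 14
        Console.WriteLine(dot);         // prints 32
        Console.WriteLine(scaled);      // prints 12
    }
}
```

The two-level Reduce mirrors the slide's picture: each partition is folded where it lives, and only the per-partition results need to travel to be combined.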
Linear Algebra T T V = U , ,
Linear Regression • Data • Find • S.t.