DryadLINQ: Computer Vision (among other things) on a cluster
ECCV AC workshop, 14th June, 2008
Michael Isard, Microsoft Research, Silicon Valley
Parallel programming, yada yada
• Intel claims we will all have many-core, etc.
• “This algorithm is easily parallelizable”
  • Not “we implemented a parallel version”
• Historically, low-latency fine-grain parallelism
  • Shared-memory SMP (threads, locks, etc.)
  • MPI (finite-element analysis, etc.)
• But also data-parallel!
  • We have lots of data now (video, the web)
  • But most people still use their laptops/toy data
  • Even “big” systems use tens of computers
Why do people use Matlab?
• Parallel programming tedious and complex
  • Distributed programming even worse
  • Perl scripts, manual management of data, …
• Matlab is easy (or at least popular)
  • Relatively few high-level constructs
  • System “does the right thing”
  • Programmers willing to put up with a lot
• We want similarly low barrier to entry
  • Familiar languages, legacy codebase, etc.
What are we doing?
• When single-computer processing runs out of steam
  • Web-scale processing of terabytes of data
  • Infeasible without a big cluster
• Network log-mining, machine learning
  • Multi-week job → 4 hours on 250 computers
  • 1-hour iteration → 3.5 minutes on 4 computers
A typical data-intensive query

var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

var user =
    from access in logentries
    where access.user.EndsWith(@"\ulfar")
    select access;

var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
Steps in the query

// Go through logs and keep only lines that are not comments.
// Parse each line into a LogEntry object.
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

// Go through logentries and keep only entries that are accesses by ulfar.
var user =
    from access in logentries
    where access.user.EndsWith(@"\ulfar")
    select access;

// Group ulfar’s accesses according to what page they correspond to.
// For each page, count the occurrences.
var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());

// Sort the pages ulfar has accessed according to access frequency.
var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
Serial execution

// For each line in logs, do…
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

// For each entry in logentries, do…
var user =
    from access in logentries
    where access.user.EndsWith(@"\ulfar")
    select access;

// Sort entries in user by page. Then iterate over the sorted list,
// counting the occurrences of each page as you go.
var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());

// Re-sort entries in accesses by page frequency.
var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
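The serial plan described above can be written out by hand. The following is a minimal sketch using plain in-memory collections and C# top-level statements; the sample entries and the tuple layout are assumptions made for illustration and are not part of the original slides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: (user, page) pairs standing in for parsed log entries.
var entries = new List<(string User, string Page)>
{
    (@"REDMOND\ulfar", "index.htm"),
    (@"REDMOND\misard", "index.htm"),
    (@"REDMOND\ulfar", "papers.htm"),
    (@"REDMOND\ulfar", "index.htm"),
    (@"REDMOND\ulfar", "logo.png"),
};

// Step 1: scan every entry and keep only ulfar's accesses.
var user = new List<(string User, string Page)>();
foreach (var e in entries)
    if (e.User.EndsWith(@"\ulfar")) user.Add(e);

// Step 2: sort by page so that equal pages become adjacent.
user.Sort((a, b) => string.CompareOrdinal(a.Page, b.Page));

// Step 3: one pass over the sorted list, counting runs of equal pages.
var counts = new List<(string Page, int Count)>();
foreach (var e in user)
{
    int last = counts.Count - 1;
    if (last >= 0 && counts[last].Page == e.Page)
        counts[last] = (e.Page, counts[last].Count + 1);
    else
        counts.Add((e.Page, 1));
}

// Step 4: keep .htm pages and re-sort by descending frequency.
var htmAccesses = counts
    .Where(c => c.Page.EndsWith(".htm"))
    .OrderByDescending(c => c.Count)
    .ToList();

foreach (var c in htmAccesses)
    Console.WriteLine($"{c.Page}: {c.Count}");
```

Every step touches the whole dataset on one machine, which is the cost the parallel plan is meant to avoid.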
Parallel execution

var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

var user =
    from access in logentries
    where access.user.EndsWith(@"\ulfar")
    select access;

var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
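One way to picture a data-parallel version of this query is: filter and count each partition independently, then merge the partial counts. The sketch below illustrates that idea on a single machine, using PLINQ's `AsParallel` as a stand-in for cluster vertices. It is not the plan DryadLINQ generates; the partition layout and sample data are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical partitions: each inner array stands in for one machine's share of the log.
var partitions = new[]
{
    new[] { (User: @"REDMOND\ulfar", Page: "index.htm"), (User: @"REDMOND\misard", Page: "index.htm") },
    new[] { (User: @"REDMOND\ulfar", Page: "papers.htm"), (User: @"REDMOND\ulfar", Page: "index.htm") },
    new[] { (User: @"REDMOND\ulfar", Page: "logo.png") },
};

// Stage 1 (parallel): each partition filters and counts on its own.
var partials = partitions
    .AsParallel()
    .Select(part => part
        .Where(e => e.User.EndsWith(@"\ulfar"))
        .GroupBy(e => e.Page)
        .ToDictionary(g => g.Key, g => g.Count()))
    .ToList();

// Stage 2 (merge): add up the partial counts for each page.
var totals = new Dictionary<string, int>();
foreach (var partial in partials)
    foreach (var kv in partial)
    {
        totals.TryGetValue(kv.Key, out var soFar);   // soFar is 0 when the key is new
        totals[kv.Key] = soFar + kv.Value;
    }

// Stage 3: final filter and sort on the small merged result.
var htmAccesses = totals
    .Where(kv => kv.Key.EndsWith(".htm"))
    .OrderByDescending(kv => kv.Value)
    .ToList();

foreach (var kv in htmAccesses)
    Console.WriteLine($"{kv.Key}: {kv.Value}");
```

Because addition is commutative, the merge gives the same totals whatever order the partitions finish in.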
Linear Regression

Vectors x = input(0), y = input(1);

Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();

Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();

OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
Execution Graph
[Figure: inputs X[0..2] and Y[0..2] feed X×Xᵀ and Y×Xᵀ vertices; each set is summed (Σ); the X×Xᵀ sum is inverted ([ ]⁻¹) and multiplied (*) with the Y×Xᵀ sum to produce A]
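In matrix notation, the Linear Regression code appears to compute the standard least-squares estimate for the model y ≈ A x. The symbols S_xx and S_yx are introduced here for illustration and correspond to `xxs` and `yxs` in the code.

```latex
\begin{aligned}
S_{xx} &= \sum_i x_i x_i^{\mathsf{T}}, &
S_{yx} &= \sum_i y_i x_i^{\mathsf{T}}, \\
A &= S_{yx}\, S_{xx}^{-1}.
\end{aligned}
```

Each outer product depends on a single data point, so the products can be formed on whichever machine holds that point, and only the two sums need to be combined across machines.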
DryadLINQ
• Programmer writes sequential C# code
  • Rich type system, libraries, modules, loops…
• System can figure out data-parallelism
  • Sees declarative expression plans
  • Full control of high-level optimizations
  • Traditional parallel-database tricks
Dryad execution engine
Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu
• General-purpose execution environment for distributed, data-parallel applications
• Concentrates on throughput not latency
• Assumes private data center
• Automatic management of scheduling, distribution, fault tolerance, etc.
• Well tested over two years on clusters of thousands of computers
Job = Directed Acyclic Graph
[Figure: inputs flow through processing vertices, connected by channels (file, pipe, shared memory), to outputs]
Scheduler state machine
• Scheduling a DAG
  • Vertex can run anywhere once all its inputs are ready
  • Constraints/hints place it near its inputs
• Fault tolerance
  • If A fails, run it again
  • If A’s inputs are gone, run upstream vertices again (recursively)
  • If A is slow, run another copy elsewhere and use output from whichever finishes first
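The first two rules (run a vertex once its inputs are ready, and re-run it if it fails) can be sketched as a small loop. This is a toy single-threaded simulation, not Dryad's scheduler: the vertex names, the injected failure, and the `TryRun` helper are hypothetical, and re-running upstream vertices and duplicating slow ones are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical four-vertex DAG: each vertex lists the vertices whose output it reads.
var inputs = new Dictionary<string, string[]>
{
    ["read"]   = new string[0],
    ["filter"] = new[] { "read" },
    ["count"]  = new[] { "filter" },
    ["sort"]   = new[] { "count" },
};

// Simulated transient fault: "count" fails on its first attempt only.
var failuresLeft = new Dictionary<string, int> { ["count"] = 1 };

var done = new HashSet<string>();
var log = new List<string>();

// Runs one attempt of a vertex; returns false to signal a failure.
bool TryRun(string v)
{
    if (failuresLeft.TryGetValue(v, out var n) && n > 0)
    {
        failuresLeft[v] = n - 1;
        log.Add($"{v}: failed");
        return false;
    }
    log.Add($"{v}: ok");
    return true;
}

// Scheduling loop: a vertex is runnable once all of its inputs are done.
while (done.Count < inputs.Count)
{
    var ready = inputs.Keys
        .Where(v => !done.Contains(v) && inputs[v].All(i => done.Contains(i)))
        .ToList();

    foreach (var v in ready)
        if (TryRun(v)) done.Add(v);   // a failed vertex stays pending and is retried on the next pass
}

foreach (var line in log)
    Console.WriteLine(line);
```

Retrying is safe here only because a vertex's result depends on nothing but its inputs, so a second attempt produces the same output.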
Static/dynamic optimizations
• Static optimizer builds execution graph
• Dynamic optimizer mutates running graph
  • Picks number of partitions when size is known
  • Builds aggregation trees based on locality
LINQ
• Constructs/type system in .NET v3.5
• Operators to manipulate datasets
  • Data elements are arbitrary .NET types
• Traditional relational operators
  • Select, Join, Aggregate, etc.
• Extensible
  • Add new operators
  • Add new implementations
DryadLINQ
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
• Automatically distribute a LINQ program
• Few Dryad-specific extensions
• Same source program runs on single-core through multi-core up to cluster
A complete DryadLINQ program

public class LogEntry {
    public string user;
    public string ip;
    public string page;

    // Parse one space-separated log line into its fields.
    public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class UserPageCount {
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

DryadDataContext ddc = new DryadDataContext("fs://logfile");
DryadTable<string> logs = ddc.GetTable<string>();

var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

var user =
    from access in logentries
    where access.user.EndsWith(@"\ulfar")
    select access;

var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;

htmAccesses.ToDryadTable("fs://results");
DryadLINQ: From LINQ to Dryad
• Automatic query plan generation
• Distributed query execution by Dryad
[Figure: LINQ query → Query plan → Dryad]

LINQ query:

var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

Query plan: logs → where → select
How does it work?
• Sequential code “operates” on datasets
  • But really just builds an expression graph
  • Lazy evaluation
• When a result is retrieved
  • Entire graph is handed to DryadLINQ
  • Optimizer builds efficient DAG
  • Program is executed on cluster
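Standard LINQ shows the same lazy behaviour on a single machine, which may help make the bullets above concrete. The sketch below uses only `System.Linq` (no DryadLINQ types) with hypothetical sample lines: building the query runs nothing, enumerating it triggers the work, and an `IQueryable` exposes the query as an expression tree that a provider can inspect.

```csharp
using System;
using System.Linq;

// Hypothetical sample input.
var logs = new[] { "# header", "a.htm", "b.htm" };
int evaluated = 0;

// Deferred: building the query runs no predicate yet.
var query = logs.Where(line => { evaluated++; return !line.StartsWith("#"); });
int beforeEnumeration = evaluated;

// Enumerating the query is what triggers the work, once per input line.
var result = query.ToList();
int afterEnumeration = evaluated;

// With IQueryable the query is kept as an expression tree rather than run directly.
var tree = logs.AsQueryable().Where(line => !line.StartsWith("#")).Expression;

Console.WriteLine($"before: {beforeEnumeration}, after: {afterEnumeration}");
Console.WriteLine(tree.NodeType);
```

The expression tree is what lets a provider see the whole query before choosing how to execute it.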
Terasort
• 10 billion 100-byte records (10^12 bytes)
• 240 computers, 960 disks
• 349 secs
• Comparable with the record

public struct TeraRecord : IComparable<TeraRecord> {
    public const int RecordSize = 100;
    public const int KeySize = 10;
    public byte[] content;

    // Order records by their first KeySize bytes.
    public int CompareTo(TeraRecord rec) {
        for (int i = 0; i < KeySize; i++) {
            int cmp = this.content[i] - rec.content[i];
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static TeraRecord Read(DryadBinaryReader rd) {
        TeraRecord rec;
        rec.content = rd.ReadBytes(RecordSize);
        return rec;
    }

    public static int Write(DryadBinaryWriter wr, TeraRecord rec) {
        return wr.WriteBytes(rec.content);
    }
}

class Terasort {
    public static void Main(string[] args) {
        DryadDataContext ddc =
            new DryadDataContext(@"file://\\svc-yuanbyu-00\dryad\terasort");
        DryadTable<TeraRecord> records =
            ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
        var q = records.OrderBy(x => x);
        q.ToDryadPartitionedTable("sherwood-sort2.pt");
    }
}
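A single `OrderBy` over a partitioned table has to be turned into something that can run on many machines. One common approach is a range-partitioned sort: pick split points, route each key to the bucket for its range, sort the buckets independently, and concatenate them. The sketch below shows that technique with in-memory integers. The slides do not say this is the plan DryadLINQ uses, and the keys and split points are assumptions.

```csharp
using System;
using System.Linq;

// Hypothetical keys standing in for the 10-byte record keys.
var keys = new[] { 42, 7, 93, 18, 65, 3, 77, 51, 29, 88 };

// Assumed split points; a real system would typically choose them by sampling the data.
var splits = new[] { 30, 60 };

// Range-partition: a key's bucket index is the number of split points it has reached.
var buckets = keys
    .GroupBy(k => splits.Count(s => k >= s))
    .OrderBy(g => g.Key)
    // Each bucket can be sorted independently, for example on its own machine.
    .Select(g => g.OrderBy(k => k).ToArray())
    .ToArray();

// Concatenating the buckets in range order yields a globally sorted sequence.
var sorted = buckets.SelectMany(b => b).ToArray();

Console.WriteLine(string.Join(" ", sorted));
```

No bucket needs to see another bucket's data once the split points are fixed, so the expensive sorting step needs no communication between machines.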
Machine Learning in DryadLINQ
Kannan Achan, Mihai Budiu
[Figure: software stack with layers Data analysis, Machine learning, Large Vector, DryadLINQ, Dryad]
Linear Regression Code

Vectors x = input(0), y = input(1);

Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();

Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();

OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
Expectation Maximization
• 160 lines
• 3 iterations shown
Computer vision
• Ongoing
  • Epitomes, features for image search, …
• Anecdotal evidence
  • Nebojsa Jojic, Anitha Kannan
  • Tutorial from Mihai
  • Anitha implemented Probabilistic Image Map algorithm in an afternoon
Continuing research
• Application-level research
  • What can we write with DryadLINQ?
• System-level research
  • Performance, usability, etc.
• Lots of interest from learning/vision researchers