Large Scale Data Processing with DryadLINQ
Dennis Fetterly, Microsoft Research, Silicon Valley
Workshop on Data-Intensive Scientific Computing Using DryadLINQ
Outline • Brief introduction to TidyFS • Preparing/loading data onto a cluster • Desirable properties in a Dryad cluster • Detailed description of several IR algorithms
TidyFS Goals • A simple distributed filesystem that provides the abstractions necessary for data parallel computations • High performance, reliable, scalable service • Workload • High throughput, sequential IO, write once • Cluster machines working in parallel • Terasort • 240 machines reading at 240 MB/s = 56 GB/s • 240 machines writing at 160 MB/s = 37 GB/s
TidyFS Names • Stream: a sequence of partitions • e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English • Can have leases for temp files or cleanup from crashes • Partition: • Immutable • 64-bit identifier • Can be a member of multiple streams • Stored as an NTFS file on cluster machines • Multiple replicas of each partition can be stored (Diagram: Stream-1 composed of partitions Part 1 through Part 4)
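The naming model above (streams as ordered lists of immutable, replicated partitions, with one partition possibly shared by several streams) can be sketched as a small in-memory model. All class and member names here are hypothetical stand-ins, not the actual TidyFS API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the TidyFS naming scheme: a stream is an ordered
// sequence of immutable partitions, each identified by a 64-bit id and
// possibly replicated on several machines.
public class Partition
{
    public ulong Id { get; }                      // 64-bit partition identifier
    public List<string> ReplicaMachines { get; }  // machines holding a replica
    public Partition(ulong id, params string[] machines)
    {
        Id = id;
        ReplicaMachines = machines.ToList();
    }
}

public class TidyStream
{
    public string Name { get; }  // e.g. "tidyfs://dryadlinqusers/fetterly/input"
    public List<Partition> Partitions { get; } = new List<Partition>();
    public TidyStream(string name) { Name = name; }
}

class Program
{
    static void Main()
    {
        var p1 = new Partition(1, "machine1", "machine7");  // two replicas
        var p2 = new Partition(2, "machine2");
        var s1 = new TidyStream("tidyfs://dryadlinqusers/fetterly/input");
        s1.Partitions.Add(p1);
        s1.Partitions.Add(p2);
        // The same immutable partition can be a member of another stream.
        var s2 = new TidyStream("tidyfs://dryadlinqusers/fetterly/sample");
        s2.Partitions.Add(p1);
        Console.WriteLine(s1.Partitions.Count);  // 2
        Console.WriteLine(s2.Partitions[0].Id);  // 1
    }
}
```

Because partitions are immutable, sharing one across streams is safe: neither stream can observe a change made through the other.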
Preparation of Data • Often substantially harder than it appears • Issues: • Data format • Distribution of data • Network bandwidth • Generating synthetic datasets is sometimes useful
Data Prep – Format • Text records are simplest • Caveat – watch for information that is not in the line itself • e.g. if the line number encodes information • Binary records often require custom code to load onto the cluster • Serialization/deserialization code generated by DryadLINQ uses C# Reflection
Custom Deserialization Code
public class UrlDocIdScoreQuery
{
    public string queryId;
    public string url;
    public string docId;
    public string queryString;
    public double score;

    public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
    {
        UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
        rec.queryId = ReadAnyString(reader);
        rec.queryString = ReadAnyString(reader);
        rec.url = ReadAnyString(reader);
        rec.docId = ReadAnyString(reader);
        rec.score = reader.ReadDouble();
        return rec;
    }

    public static string ReadAnyString(DryadBinaryReader dbr) {…}
}
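The same read-one-record pattern can be tried locally with the standard System.IO.BinaryReader in place of Dryad's DryadBinaryReader. The record layout below (a length-prefixed string followed by a double) is an assumption for illustration, not the actual ClueWeb09 format:

```csharp
using System;
using System.IO;

class Program
{
    // Local analogue of the custom Read method above, using BinaryReader
    // instead of DryadBinaryReader. Field names are hypothetical.
    public class UrlScoreRecord
    {
        public string url;
        public double score;

        public static void Write(BinaryWriter w, UrlScoreRecord r)
        {
            w.Write(r.url);    // BinaryWriter length-prefixes strings
            w.Write(r.score);
        }

        public static UrlScoreRecord Read(BinaryReader r)
        {
            var rec = new UrlScoreRecord();
            rec.url = r.ReadString();
            rec.score = r.ReadDouble();
            return rec;
        }
    }

    static void Main()
    {
        // Round-trip one record through an in-memory buffer.
        var ms = new MemoryStream();
        UrlScoreRecord.Write(new BinaryWriter(ms),
            new UrlScoreRecord { url = "http://example.com/", score = 0.5 });
        ms.Position = 0;
        var rec = UrlScoreRecord.Read(new BinaryReader(ms));
        Console.WriteLine(rec.url);    // http://example.com/
        Console.WriteLine(rec.score);  // 0.5
    }
}
```

Writing the serializer and deserializer as a matched pair, as here, is the easiest way to keep a hand-rolled binary format consistent.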
Data Prep - Loading • DryadLINQ job • Often needs a dummy input anchor • Custom program • Write records to TidyFS partitions • “SneakerNet” often a good option
Data Loading - DryadLINQ • Need input “anchor” to run on cluster • Generate or use existing stream • Sample:
IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++)
    {
        // code to generate records
        yield return record;
    }
}
DryadLINQ Job
var streamname = "tidyfs://datasets/anchor";
var os = @"tidyfs://msri/teamname/data?compression=" + CompressionScheme.GZipFast;
var r = PartitionedTable.Get<int>(streamname)
        .Take(1)
        .SelectMany(x => Enumerable.Range(0, partitions))
        .HashPartition(x => x, partitions)
        .Select(x => new Random(x))
        .SelectMany(x => GenerateEntries(x, numItems))
        .ToPartitionedTable(os);
Data Loading - Databases • Bulk copy into files • Use queries to produce multiple files • Perform queries within a DryadLINQ UDF
IEnumerable<Entry> PerformQuery(string queryArg)
{
    // RunQuery is a placeholder for real database access code
    var results = RunQuery("select * from …");
    foreach (var record in results)
    {
        yield return record;
    }
}
Building a Cluster • Overall goal – a high-throughput system • Not latency sensitive • More slower computers often better than fewer faster computers • Multiple cores better than higher clock frequency • Multiple disks – increase throughput • Sufficient RAM
Networking a Cluster • Network topology – medium to large clusters • Attempt to maximize cross rack bandwidth • Two tier topology • Rack switches and core switches • Port aggregation • Bond multiple connections together • 1 GbE or 10 GbE
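The cross-rack bandwidth goal above comes down to simple arithmetic: compare the aggregate demand of the machines in a rack against the capacity of the rack's uplinks to the core. A sketch with illustrative numbers (none of them from the talk):

```csharp
using System;

class Program
{
    // Oversubscription = worst-case cross-rack demand / uplink capacity.
    // A ratio of 1 means full cross-rack bandwidth; larger means a bottleneck.
    public static double Oversubscription(int machinesPerRack, double nicGbps,
                                          int uplinks, double uplinkGbps)
    {
        double demand = machinesPerRack * nicGbps;
        double supply = uplinks * uplinkGbps;
        return demand / supply;
    }

    static void Main()
    {
        // Hypothetical rack: 40 machines with 1 GbE NICs, four 10 GbE uplinks
        // (port aggregation bonds the uplinks into one logical link).
        Console.WriteLine(Oversubscription(40, 1.0, 4, 10.0));  // 1
    }
}
```

With the illustrative numbers above the rack is not oversubscribed; halving the uplinks would double the ratio and halve the cross-rack throughput available to operators like HashPartition that shuffle data between racks.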
Cluster Software • Runs on Windows HPC Server 2008 • Academic Release • For non-commercial use • Commercial License
DryadLINQ IR Toolkit • Library that uses DryadLINQ • Source code for a number of IR algorithms • Text retrieval - BM25/BM25F • Link based ranking - PageRank/SALSA-SETR • Text processing - Shingle based duplicate detection • Designed to work well with ClueWeb09 collection • Including preprocessing the data to load the cluster • Available from http://research.microsoft.com/dryadlinqir/
ClueWeb09 Collection • Collected/Distributed by CMU • 1 billion web pages crawled in Jan/Feb 2009 • 10 different languages • en, zh, es, ja, de, fr, ko, it, pt, ar • 5 TB compressed, 25 TB uncompressed • Available to research community • Dataset available for your projects • Web graph, 503m English web pages
Example: Term Frequencies
Count term frequencies in a set of documents:
var docs = new PartitionedTable<Doc>("tidyfs://dennis/docs");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToPartitionedTable("tidyfs://dennis/counts.txt");
(Diagram: dataflow IN → SM (doc => doc.words) → GB (word => word) → S (g => new …) → OUT)
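Because DryadLINQ queries are ordinary LINQ, the same operator chain can be checked on one machine with LINQ to Objects before running it on a cluster. Here the partitioned table and the Doc/WordCount types are replaced by in-memory stand-ins:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Same operator chain as the cluster query (SelectMany, GroupBy, Select),
    // run locally; each string[] stands in for one document's words.
    public static Dictionary<string, int> TermFrequencies(IEnumerable<string[]> docs)
    {
        return docs
            .SelectMany(words => words)
            .GroupBy(word => word)
            .Select(g => new { Word = g.Key, Count = g.Count() })
            .ToDictionary(c => c.Word, c => c.Count);
    }

    static void Main()
    {
        var docs = new List<string[]>
        {
            new[] { "dryad", "linq", "dryad" },
            new[] { "linq" },
        };
        var counts = TermFrequencies(docs);
        Console.WriteLine(counts["dryad"]);  // 2
        Console.WriteLine(counts["linq"]);   // 2
    }
}
```

Debugging the query logic locally like this is much cheaper than debugging it through cluster job failures; only the data-size and partitioning behavior need the cluster.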
Distributed Execution of Term Frequencies (Diagram: DryadLINQ translates the LINQ expression IN → SM → GB → S → OUT into a Dryad execution graph)
Execution Plan for Term Frequencies (Diagram, part 1: SelectMany (SM) → Sort (Q) → GroupBy (GB) → Count (C), pipelined; then Distribute (D) → Mergesort (MS) → GroupBy (GB) → Sum, pipelined)
Execution Plan for Term Frequencies (Diagram, part 2: the same plan across four partitions — each input partition runs SM → Q → GB → C → D, the data is exchanged, and each downstream vertex runs MS → GB → Sum)
BM25 “Grep” • For batch evaluation of queries, calculating BM25 is just a select operation
string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
PartitionedTable<InitialWordRecord> initialWords =
    PartitionedTable.Get<InitialWordRecord>(initialWordsURL);
var BM25s = from doc in initialWords
            select ComputeDocBM25(queries, doc, dfs);
BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");
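To see why batch BM25 is a pure per-document select: once the small document-frequency table has been broadcast, every input to the score (term frequency, document length, df) is available per record. A minimal single-term BM25 sketch, using the standard Robertson formulation; this is not the toolkit's ComputeDocBM25, and the constants are the usual defaults, not values from the talk:

```csharp
using System;

class Program
{
    // Single-term BM25: idf(term) * tf * (k1 + 1) / (tf + k1 * lengthNorm),
    // where lengthNorm = 1 - b + b * docLen / avgDocLen.
    public static double Bm25Term(int tf, long docLen, double avgDocLen,
                                  long docFreq, long numDocs,
                                  double k1 = 1.2, double b = 0.75)
    {
        double idf = Math.Log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        double norm = tf + k1 * (1 - b + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1) / norm;
    }

    static void Main()
    {
        // Illustrative numbers, not from the ClueWeb09 collection.
        double s = Bm25Term(tf: 3, docLen: 100, avgDocLen: 100.0,
                            docFreq: 50, numDocs: 100000);
        Console.WriteLine(s > 0);  // True
    }
}
```

A full document score is just the sum of this quantity over the query's terms, which is why no join or group-by is needed in the cluster query.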
PageRank • Ranks web pages by propagating scores along the hyperlink structure • Each iteration as an SQL-style query: • Join edges with ranks • Distribute rank along edges • GroupBy edge destination • Aggregate into ranks • Repeat
One PageRank Step in DryadLINQ
// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);
    // re-accumulate.
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
A Complete DryadLINQ Program
public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);
    // re-accumulate.
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;
    public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; }
    public Rank[] Disperse(Rank rank)
    {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank
{
    public UInt64 name;
    public double rank;
    public Rank(UInt64 n, double r) { name = n; rank = r; }
}

var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");
// repeat the iterative computation several times
var ranks = pages.Select(page => new Rank(page.name, 1.0));
for (int iter = 0; iter < iterations; iter++)
{
    ranks = PRStep(pages, ranks);
}
ranks.ToDryadTable<Rank>("outputranks.txt");
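Since PRStep takes plain IQueryable inputs, it can be exercised locally with LINQ to Objects via AsQueryable() before being pointed at a cluster table. The tiny two-page graph below is an assumed test input; one useful invariant to check is that a step conserves the total rank:

```csharp
using System;
using System.Linq;

class Program
{
    public struct Page
    {
        public ulong name; public long degree; public ulong[] links;
        public Page(ulong n, long d, ulong[] l) { name = n; degree = d; links = l; }
        public Rank[] Disperse(Rank rank)
        {
            var ranks = new Rank[links.Length];
            double score = rank.rank / degree;
            for (int i = 0; i < ranks.Length; i++)
                ranks[i] = new Rank(links[i], score);
            return ranks;
        }
    }

    public struct Rank
    {
        public ulong name; public double rank;
        public Rank(ulong n, double r) { name = n; rank = r; }
    }

    // Same body as the cluster PRStep above.
    public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
    {
        var updates = from page in pages
                      join rank in ranks on page.name equals rank.name
                      select page.Disperse(rank);
        return from list in updates
               from rank in list
               group rank.rank by rank.name into g
               select new Rank(g.Key, g.Sum());
    }

    static void Main()
    {
        // Two pages linking to each other: ranks swap, total is conserved.
        var pages = new[]
        {
            new Page(1, 1, new ulong[] { 2 }),
            new Page(2, 1, new ulong[] { 1 }),
        }.AsQueryable();
        var ranks = pages.Select(p => new Rank(p.name, 1.0));
        var next = PRStep(pages, ranks).ToArray();
        Console.WriteLine(next.Sum(r => r.rank));  // 2
    }
}
```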
PageRank Optimizations • Benchmark PageRank on a 954m-page graph • Naïve approach – 10 iterations, ~3.5 hours, 1.2 TB • Apply several optimizations • Change data distribution • Pre-group pages by host • Rename host groups with dense names • Cull out leaf nodes • Pre-aggregate ranks for each host • Final version – 10 iterations, 11.5 min, 116 GB
Tactics for Improving Performance • Loop unrolling • Reduce data movement • Improve data locality • Choose what to Group
Gotchas • Non-deterministic output • e.g. a random number generator in a user-defined function • Writing to shared state
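The RNG gotcha matters because Dryad may re-execute a failed vertex: if a user-defined function draws unseeded random numbers, the rerun produces different records than the original attempt. One fix, which the data-generation job earlier already uses, is to derive the seed deterministically from the partition index. A local sketch of that idea (the partition index here is an arbitrary example value):

```csharp
using System;
using System.Linq;

class Program
{
    // Deterministic per-partition generation: the same partition index
    // always yields the same record stream, so a re-executed vertex
    // regenerates identical output.
    public static int[] GenerateRun(int partitionIndex, int count)
    {
        var rng = new Random(partitionIndex);
        return Enumerable.Range(0, count).Select(_ => rng.Next(1000)).ToArray();
    }

    static void Main()
    {
        var firstAttempt = GenerateRun(7, 5);
        var reExecution = GenerateRun(7, 5);  // simulated vertex re-execution
        Console.WriteLine(firstAttempt.SequenceEqual(reExecution));  // True
    }
}
```

The same reasoning applies to any side effect in a UDF: vertices must be pure functions of their inputs (plus deterministic seeds), or re-execution and speculative duplicates will produce inconsistent output.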
Schedule for Today • 9:30 – 10:00 Meet with team, finalize project • 10:30 – 12:00 Work on projects, discuss approach with a speaker
Cluster Configuration (Diagram: a head node, TidyFS servers, and cluster machines that both run tasks and host the TidyFS storage service)
How a Dryad job reads from TidyFS (Diagram: the job manager asks the TidyFS service to list the partitions in the stream, learning that Part 1 is on Machine 1 and Part 2 on Machine 2; it schedules the vertex for each partition on the machine holding it; each vertex calls GetReadPath and reads its partition from a local path such as D:\tidyfs\0001.data)
How a Dryad job writes to TidyFS (Diagram: the job manager creates a temporary stream per vertex, Str1_v1 and Str1_v2, then schedules Vertex 1 on Machine 1 and Vertex 2 on Machine 2, each writing its partition into its own stream)
How a Dryad job writes to TidyFS, continued (Diagram: each vertex calls GetWritePath, writes its partition to a local path such as D:\tidyfs\0001.data, calls AddPartitionInfo with the partition's size and fingerprint, and reports Completed; the job manager then creates the output stream Str1, merges the per-vertex streams into it with ConcatenateStreams(str1, str1_v1, str1_v2), and deletes the temporary streams str1_v1 and str1_v2)