Implementing Scalable Tree-based Algorithms using MRNet

Ting Chen, Mark Cowlishaw

Introduction • Background on MRNet • Our Applications • Reverse index • Online queries • Description of Experiments • Progress • Next steps

Background: MRNet [Roth, Arnold, Miller03] • Tree-Based Overlay Network (TBON) • Nodes of a distributed application are arranged in a tree structure • Leaves produce data that is aggregated and filtered by higher levels of the tree • Programs are separated from the tree topology they run on • A TBON program can run on tree networks of different topologies
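The TBON idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not MRNet code: the `Node` class and `reduce_up` function are hypothetical names, and the leaf values are made up. It shows the same filter (here `sum`) running unchanged on two different topologies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A node in a hypothetical tree overlay; only leaves carry a value."""
    children: List["Node"] = field(default_factory=list)
    value: Optional[int] = None


def reduce_up(node: Node, filt: Callable[[List[int]], int]) -> int:
    """Leaves return their data; internal nodes apply the filter to child results."""
    if not node.children:
        return node.value
    return filt([reduce_up(child, filt) for child in node.children])


# Eight leaves producing made-up data.
leaves = [Node(value=v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]

# Topology 1: a single root with fan-out 8.
flat = Node(children=leaves)

# Topology 2: a root over four internal nodes, each with fan-out 2.
deep = Node(children=[Node(children=leaves[i:i + 2]) for i in range(0, 8, 2)])

flat_sum = reduce_up(flat, sum)
deep_sum = reduce_up(deep, sum)
flat_max = reduce_up(flat, max)
deep_max = reduce_up(deep, max)
```

Because the filter is associative, both topologies give the same answer; only the amount of work per node and the number of hops differ.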

Reverse indexes for Keyword Queries • A keyword query is a list of words: < w1, w2, ..., wn > • A typical reverse index is a list of words • Each word points to a list of the IDs of documents containing the word • A document ID list can be sorted either by document name or by the number of times the word appears in the document • Example query: Wisconsin Badgers
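A small sketch of such a reverse index, assuming whitespace tokenization and lowercase matching. The function names and sample documents are illustrative, and the query semantics (a document must contain every word) is an assumption; the slides do not specify it.

```python
from collections import Counter, defaultdict


def build_reverse_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def postings_by_name(index, word):
    """Document IDs containing the word, sorted by document name."""
    return sorted(index.get(word, {}))


def postings_by_frequency(index, word):
    """Document IDs containing the word, most frequent first (ties by name)."""
    entries = index.get(word, {})
    return sorted(entries, key=lambda doc_id: (-entries[doc_id], doc_id))


def query(index, words):
    """Assumed AND semantics: documents that contain every query word."""
    sets = [set(index.get(w.lower(), {})) for w in words]
    return sorted(set.intersection(*sets)) if sets else []


# Made-up sample documents.
docs = {
    "DocA": "wisconsin badgers win",
    "DocB": "badgers badgers badgers",
    "DocC": "wisconsin cheese",
}
index = build_reverse_index(docs)
by_name = postings_by_name(index, "badgers")
by_freq = postings_by_frequency(index, "badgers")
hits = query(index, ["Wisconsin", "Badgers"])
```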

On-line queries: N-best frequency (N=2) • [Figure: per-document frequencies merged up the tree; values shown: DocA: 5, DocB: 3, DocC: 7, DocD: 0, DocE: 2, DocF: 6; N-best result: DocC: 7, DocF: 6]
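A sketch of how an N-best merge might work at each tree node, using the frequencies from the slide. The assignment of documents to two leaves is an assumption, as is the premise that each document lives on exactly one leaf (so per-leaf counts are final). Function names are illustrative.

```python
import heapq


def n_best(entries, n):
    """Return the n (doc_id, count) pairs with the highest counts, ties by doc ID."""
    return heapq.nsmallest(n, entries, key=lambda e: (-e[1], e[0]))


def merge_n_best(child_lists, n):
    """Filter run at an internal node: merge child results, keep the n best."""
    merged = [entry for child in child_lists for entry in child]
    return n_best(merged, n)


N = 2

# Assumed split of the slide's documents across two leaves.
leaf1 = [("DocC", 7), ("DocB", 3), ("DocD", 0)]
leaf2 = [("DocF", 6), ("DocA", 5), ("DocE", 2)]

# Each leaf sends only its own N best upstream.
partial1 = n_best(leaf1, N)
partial2 = n_best(leaf2, N)

# The parent merges the partial lists and keeps the overall N best.
root = merge_n_best([partial1, partial2], N)
```

Each node forwards at most N entries, so the message size stays constant no matter how many documents a subtree holds.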

Objective / Experiments • How does tree topology affect application performance? • Macro-benchmarks: throughput and time-to-completion for index building, response time for on-line queries • Micro-benchmarks: total I/O, I/O performed at each node, data transferred • Scale-up and speed-up curves as the number of cluster nodes increases • Do more than two levels ( > 2) help?
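For reference, a sketch of how the speed-up and scale-up curves could be computed from measured times, using the usual definitions: speed-up holds the data size fixed while adding nodes, and scale-up grows the data in proportion to the nodes. The timings below are hypothetical placeholders, not results from these experiments.

```python
def speed_up(time_one_node, time_n_nodes):
    """Fixed problem size: how many times faster n nodes are than one node."""
    return time_one_node / time_n_nodes


def scale_up(time_small, time_large):
    """Problem size grown with node count: 1.0 means perfect scale-up."""
    return time_small / time_large


# Hypothetical placeholder timings in seconds, keyed by node count.
fixed_size_times = {1: 640.0, 2: 320.0, 4: 200.0, 8: 160.0}
grown_size_times = {1: 100.0, 2: 100.0, 4: 125.0, 8: 200.0}

speed_up_curve = {
    n: speed_up(fixed_size_times[1], t) for n, t in fixed_size_times.items()
}
scale_up_curve = {
    n: scale_up(grown_size_times[1], t) for n, t in grown_size_times.items()
}
```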

Progress – Data • Experiment Data • All Wikipedia documents • 8 GB of data • 4 million documents • Probably enough for a tree with 32 leaves • Data in Wiki-text format

Progress – Design / Implementation • Message Formats • Document/keyword delivery • Acknowledgment • keys / statistics • Messages for Microbenchmarks • Shared Classes • DocumentStatistics (back end) [in test] • StatisticsEntry (all) [in test] • StatisticsList (all) • Familiarity with MRNet Toolset • Debugging MRNet Programs

Next Steps • Deploy at medium scale (~32 Nodes) • Experiments to determine fan-out • Maximize throughput (bytes/second) • Minimize time-to-completion • Microbenchmarks • Collected and filtered using MRNet • Idle time • Messages per second, total message traffic
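When planning the fan-out experiments, it helps to know the shape each fan-out implies for a fixed number of leaves. The helper below is a hypothetical sketch that assumes a balanced tree filled as fully as possible at each level; it is not part of the MRNet toolset.

```python
def tree_shape(num_leaves, fan_out):
    """Return (levels above the leaves, internal node count) for a balanced tree."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    levels, internal, width = 0, 0, num_leaves
    while width > 1:
        width = -(-width // fan_out)  # ceiling division: parents needed for this level
        internal += width
        levels += 1
    return levels, internal


# Candidate fan-outs for the 32-leaf deployment.
shapes = {fan_out: tree_shape(32, fan_out) for fan_out in (2, 4, 8, 32)}
```

For 32 leaves, a fan-out of 32 gives a single root, while a fan-out of 2 needs five levels and 31 internal nodes, so the experiments trade per-node load against extra hops and extra processes.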

Futures • Replace N-best with distance from median • Add support for new document types • XML • HTML • Generalize to other MapReduce [Dean04] applications • More realistic relevance ranking • Reverse hyperlink count • Collection term vectors
