100 likes | 188 Views
Implementing Scalable Tree-based Algorithms using MRNet. Ting Chen Mark Cowlishaw. Introduction. Background on MRNet Our Applications Reverse index Online queries Description of Experiments Progress Next steps. Background: MRNet [Roth, Arnold, Miller03]. Tree-Based Overlay Network
E N D
Implementing Scalable Tree-based Algorithms using MRNet Ting ChenMark Cowlishaw
Introduction • Background on MRNet • Our Applications • Reverse index • Online queries • Description of Experiments • Progress • Next steps
Background: MRNet [Roth, Arnold, Miller03] • Tree-Based Overlay Network • Nodes of a distributed application are arranged in a tree-structure • Leaves producing data that is aggregated and filtered by higher levels of the tree • Separation of programs and their running tree topology • A TBON program can run on tree-network of different topologies
Reverse indexes for Keyword Queries • A keyword query is a list of words: < w1,w2, ...,wn > . • A typical reverse index is a list of words • Each word points to a list of document IDs containing the word. • A document ID list can either be sorted by document names or by the number of times the word appears in the document. Wisconsin Badgers
On-line queries: N-best frequency (N=2) DocC: 7 DocB: 3 DocD: 0 DocF: 6 DocA: 5 DocC: 7 DocF: 6 DocD: 0 DocB: 3 DocA: 5 DocE: 2
Objective / Experiments • How does tree topology affect application performance • Macro-benchmarks: throughput/time-to-completion for index building and response time for on-line queries • Micro-benchmarks: the amount of total IO, I/O performed for each node, data transferred • Scale-Up and Speed-Up curves with the increase of cluster nodes • Is Multiple-level ( > 2) helpful?
Progress - data • Experiment Data • All Wikipedia documents • 8GB of data • 4 Million documents • Probably enough for a tree with 32 leaves • Data in Wiki-text format
Progress – Design / Implementation • Message Formats • Document/keyword delivery • Acknowledgment • keys / statistics • Messages for Microbenchmarks • Shared Classes • DocumentStatistics (back end) [in test] • StatisticsEntry (all) [in test] • StatisticsList (all) • Familiarity with MRNet Toolset • Debugging MRNet Programs
Next Steps • Deploy at medium scale (~32 Nodes) • Experiments to determine fan-out • Maximize throughput (bytes/second) • Minimize time-to-completion • Microbenchmarks • Collected and filtered using MRNet • Idle time • Messages per second, total message traffic
Futures • Replace N-best with distance from median • Add support for new document types • XML • HTML • Generalize to other MapReduce[Dean04] Applications • More realistic relevance ranking • Reverse hyperlink count • Collection term vectors