Optimizing and Parallelizing Ranked Enumeration (VLDB 2011, Seattle, WA)
Background: DB Search at HebrewU
• A search system over databases built at HebrewU (the slide shows its search box with the query "eu brussels"); demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06
• The initial implementation was too slow… so we purchased a multi-core server
• It didn’t help: the cores were usually idle, due to the inherent flow of the enumeration technique we used
• We needed a deeper understanding of ranked enumeration to benefit from parallelization: this paper
Ranked Enumeration Problem
• The user receives the best answer, then the 2nd best, then the 3rd best, … out of a huge number (e.g., 2^|problem|) of ranked answers; we can’t afford to instantiate all of them
• Examples:
  • Various graph optimizations: shortest paths, smallest spanning trees, best perfect matchings
  • Top results of keyword search on DBs (graph search)
  • Most probable answers in probabilistic DBs
  • Best recommendations for schema integration
• "Complexity": what is the delay between successive answers? How much time to get the top k? (the focus here)
Abstract Problem Formulation
• Input: a collection O of objects
• Answers: the set A of answers a ⊆ O; A is huge, described by a condition on the subsets of O
• score(a) is high when a is of high quality
• Goal: find the top-k answers a_1, a_2, a_3, …, a_k (a minimal sketch of this setting follows below)
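To make the abstraction concrete, here is a deliberately naive Python sketch of the setting (function and parameter names are illustrative, not from the paper): the instance supplies the object collection, a test for which subsets are answers, and a score function, and the brute-force enumeration of all 2^|O| subsets shows exactly why instantiating A is infeasible.

```python
from itertools import chain, combinations

def top_k_answers(objects, is_answer, score, k):
    """Brute force over all subsets of O; feasible only for tiny instances,
    which is exactly why ranked enumeration avoids instantiating A."""
    subsets = chain.from_iterable(
        combinations(objects, r) for r in range(len(objects) + 1))
    answers = (a for a in map(frozenset, subsets) if is_answer(a))
    return sorted(answers, key=score, reverse=True)[:k]

# Example (hypothetical): objects are weighted items, answers are the
# non-empty subsets, and the score of an answer is its total weight.
weights = {"a": 28, "b": 31, "c": 17}
print(top_k_answers(weights, lambda s: len(s) > 0,
                    lambda s: sum(weights[o] for o in s), k=3))
```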
Graph Search in the Abstraction
• Input: a data graph G and a set Q of keywords
• O = the edges of G
• A = the subtrees (edge sets) a ⊆ O containing all the keywords in Q (w/o redundancy, see [GKS 2008])
• Goal: find the top-k answers
What is the Challenge?
• Starting from the 1st (top) answer, then the 2nd, and so on, the jth answer must be
  • different from the previous (j-1) answers, and
  • the best remaining answer
• This is an optimization problem, conceivably much more complicated than finding the top-1!
• How do we handle these constraints? (j may be large!)
Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]
• Lawler-Murty’s gives a general reduction: finding the top-k answers is in PTIME if finding the top-1 answer under simple constraints is in PTIME
• We understand (top-1) optimization much better! Often it amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])
• Another general top-k procedure, [Hamacher & Queyranne 84], is very similar!
Among the Uses of Lawler-Murty’s
• Graph/combinatorial algorithms: shortest simple paths [Yen 1972]; minimum spanning trees [Gabow 1977, Katoh et al. 1981]; best solutions in resource allocation [Katoh et al. 1981]; best perfect matchings, best cuts [Hamacher & Queyranne 1985]; minimum Steiner trees [KS 2006]
• Bioinformatics: Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]
• Data management: ORDER-BY queries [KS 2006, 2007]; graph/XML search [GKS 2008]; generation of forms over integrated data [Talukdar et al. 2008]; course recommendation [Parameswaran & Garcia-Molina 2009]; querying Markov sequences [K & Ré 2010]
1. Find & Print the Top Answer
• In principle, at this point we should find the second-best answer, but instead…
2. Partition the Remaining Answers
• Inclusion constraint: "must contain ⟨a given object⟩"
• Exclusion constraint: "must not contain ⟨a given object⟩"
• Each partition is defined by a set of such simple constraints
3. Find the Top of Each Set
4. Find & Print the Second Answer
• The next answer is the best among all the top answers in the partitions
5. Further Divide the Chosen Partition
• … and so on, until k answers are printed (a code sketch of the full procedure follows below)
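As referenced above, the following is a minimal Python sketch of Lawler-Murty’s procedure, under the assumption of a user-supplied routine best_answer(included, excluded) that returns a (score, answer) pair for the best answer satisfying the given inclusion/exclusion constraints, or None if no such answer exists. The names are illustrative; this is a sketch of the classical procedure, not the paper’s implementation.

```python
import heapq
import itertools

def lawler_murty(best_answer, k):
    """Yield up to k (score, answer) pairs in non-increasing score order.

    best_answer(included, excluded) is assumed to return (score, answer)
    for the best answer containing every object in `included` and none in
    `excluded`, or None if no such answer exists; `answer` is a frozenset.
    """
    tie = itertools.count()    # tie-breaker so the heap never compares answers
    heap = []                  # max-heap via negated scores

    def push(included, excluded):
        result = best_answer(included, excluded)
        if result is not None:
            score, answer = result
            heapq.heappush(heap, (-score, next(tie), answer, included, excluded))

    push(frozenset(), frozenset())    # top answer of the unconstrained space
    for _ in range(k):
        if not heap:
            return                    # fewer than k answers exist
        neg_score, _, answer, included, excluded = heapq.heappop(heap)
        yield -neg_score, answer
        # Partition the remaining answers of this subspace: the i-th child
        # keeps the first i-1 "free" objects of `answer` and excludes the i-th.
        free = list(answer - included)    # any fixed order works
        for i, obj in enumerate(free):
            push(included | frozenset(free[:i]), excluded | frozenset({obj}))
```

Each printed answer spawns one child subproblem per free object it contains, and each child requires one top-1 computation; these top-1 computations are exactly the tasks that the rest of the talk optimizes and parallelizes.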
Lawler-Murty’s: Actual Execution
• The output accumulates the answers printed already; alongside it, a queue holds the partition representatives together with the best answer of each partition
• For each new partition, a task is created to find its best answer
• The best among the partitions’ computed answers is the next answer printed
Typical Bottleneck
• The expensive part is computing the best answer of each new partition
• Each such answer is fully computed before we know whether it will end up in the top k at all ("in top k?")
Progressive Upper Bound
• Throughout its execution, an optimization algorithm can often upper bound the score of its final solution
• Progressive: the bound gets smaller over time (e.g., ≤24, ≤22, ≤18, ≤14, …)
• Often these bounds are nontrivial, e.g.,
  • Dijkstra's algorithm: the distance at the top of the queue
  • Similarly: some Steiner-tree algorithms [Dreyfus & Wagner 72]
  • Viterbi algorithms: the maximal intermediate probability
  • Primal-dual methods: the value of the dual LP solution
(a small sketch of the Dijkstra case follows below)
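To illustrate the Dijkstra bullet above, here is a small Python sketch with an assumed adjacency-list input and a hypothetical report_bound callback: since popped distances never decrease, the negated distance at the top of the queue is a shrinking (progressive) upper bound on the final score of a shortest-path task, where the score is the negated path length.

```python
import heapq

def dijkstra_with_bound(graph, source, target, report_bound):
    """Shortest path with a progressive upper bound on the score (= -length).

    graph: dict mapping a node to a list of (neighbor, weight) pairs.
    report_bound: callback receiving the current upper bound on the score.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        report_bound(-d)            # popped distances only grow, so -d shrinks
        if u == target:
            return -d               # final score = negated shortest distance
        if d > dist.get(u, float("inf")):
            continue                # stale queue entry
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                     # target unreachable

# Example: graph = {"s": [("a", 2), ("b", 5)], "a": [("b", 1)], "b": []}
# dijkstra_with_bound(graph, "s", "b", print) prints the shrinking bounds
# 0, -2, -3 and returns -3 (the score of the shortest s→b path).
```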
Freezing Tasks (Simplified)
• Instead of running every partition’s task to completion, track the progressive upper bound of each running task
• Freeze a task as soon as its bound shows it cannot provide the next answer, e.g., when an already computed answer of score 22 beats a task whose bound has dropped to ≤20
• Resume a frozen task only when its bound becomes the best among the remaining candidates
(a code sketch of this loop follows below)
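The following Python sketch renders the freezing loop just described, under assumptions: each task object carries answer/score/bound attributes and a resumable run(limit) method that returns ("done", score, answer) or ("frozen", bound) once its progressive bound drops below limit. These are illustrative names, not the paper’s data structures.

```python
import heapq, itertools

_tie = itertools.count()    # heap tie-breaker so tasks are never compared

def next_best(tasks_heap):
    """Pop the next answer among partition tasks, freezing and resuming lazily.

    tasks_heap holds entries (-bound_or_score, tie, task); a task whose
    answer is already computed reports its exact score as its bound.
    """
    while tasks_heap:
        _, _, task = heapq.heappop(tasks_heap)
        if task.answer is not None:
            return task               # its exact score beats every remaining bound
        # The best bound among the other tasks is the freezing threshold:
        limit = -tasks_heap[0][0] if tasks_heap else float("-inf")
        status, *rest = task.run(limit)    # resume until done or bound < limit
        if status == "done":
            task.score, task.answer = rest
            heapq.heappush(tasks_heap, (-task.score, next(_tie), task))
        else:                         # "frozen": re-insert with the tighter bound
            (task.bound,) = rest
            heapq.heappush(tasks_heap, (-task.bound, next(_tie), task))
    return None
```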
Improvement of Freezing
• Experiments: graph search; 2 Intel Xeon processors (2.67 GHz), 4 cores each (8 in total); 48 GB memory
• Compared: simple Lawler-Murty vs. Lawler-Murty with freezing, on Mondial, DBLP (part), and DBLP (full), for k = 10 and k = 100
• On average, freezing saved 56% of the running time
Straightforward Parallelization
• Keep a queue of awaiting tasks, one per new partition; idle threads pick tasks from this queue and compute the best answer of the corresponding partition
Not so fast…
• Typically, parallelization reduced the running time by only about 30%
• The same reduction for 2, 3, …, 8 threads!
Idle Cores while Waiting
• New tasks are generated only after the next answer is printed, so while the last tasks of the current round are still running, the other cores sit idle
Early Popping
• Do not wait for all pending tasks: pop a computed answer as soon as its score is at least the progressive bound of every task still running (e.g., a computed 22 beats a task bounded by ≤20), and generate its new partition tasks right away
• Skipped issues here: thread synchronization (semaphores, locking, etc.) and correctness
(a sketch of the check follows below)
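A minimal Python sketch of the early-popping check, assuming worker threads report progressive bounds and completed answers to a shared object; the class and method names are hypothetical, and synchronization is reduced to a single condition variable, far simpler than the actual implementation.

```python
import threading

class EarlyPopper:
    """Shared state through which workers report bounds/answers and the
    main loop pops an answer once it provably cannot be beaten."""

    def __init__(self):
        self.lock = threading.Condition()
        self.bounds = {}         # task id -> current progressive upper bound
        self.completed = []      # (score, answer) pairs of finished tasks

    def update_bound(self, task_id, bound):
        with self.lock:
            self.bounds[task_id] = bound
            self.lock.notify_all()

    def finish(self, task_id, score, answer):
        with self.lock:
            self.bounds.pop(task_id, None)
            self.completed.append((score, answer))
            self.lock.notify_all()

    def pop_next(self):
        """Block until the best completed answer beats every running task's bound."""
        with self.lock:
            while True:
                if self.completed:
                    best = max(self.completed, key=lambda sa: sa[0])
                    if all(best[0] >= b for b in self.bounds.values()):
                        self.completed.remove(best)
                        return best    # safe to print and to spawn its sub-tasks
                self.lock.wait()
```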
Improvement of Early Popping
• Experiments: graph search; 2 Intel Xeon processors (2.67 GHz), 4 cores each (8 in total); 48 GB memory
• Datasets: Mondial and DBLP (part), with short, medium-size, and long queries
Early Popping vs. (Serial) Freezing
• Same setup: Mondial and DBLP (part), with short, medium-size, and long queries
• Early popping needs 4 threads to start gaining over serial freezing, and even then the gain is fairly poor…
Combining Freezing & Early Popping
• We discuss additional ideas and techniques to further utilize the cores (not here, see the paper)
• The main speedup comes from combining early popping with freezing: cores are kept busy… on high-potential tasks
• Thread synchronization is quite involved
• At a high level, the final algorithm has the following flow:
Combining: General Idea
• The partition representatives are maintained as frozen tasks; worker threads repeatedly resume frozen tasks
• A task that finishes contributes a computed answer (placed in a to-print buffer) plus new frozen tasks for its sub-partitions
• The main task just pops computed results to print, but validates that no frozen task can still produce a better result
(a high-level sketch follows below)
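A high-level Python sketch of this combined flow, under loud assumptions: each task exposes a bound attribute, a resumable run() returning an object with done/score/answer fields, and children(answer) producing the sub-partition tasks; synchronization is collapsed into one condition variable and, unlike the real algorithm, the bounds of tasks currently being resumed are not tracked. This is an illustration of the flow, not the paper’s algorithm.

```python
import heapq, itertools, threading

class CombinedEnumerator:
    """Worker threads resume frozen tasks; the main task pops validated answers."""

    def __init__(self, initial_task):
        self.lock = threading.Condition()
        self.tie = itertools.count()
        self.frozen = [(-initial_task.bound, next(self.tie), initial_task)]
        self.computed = []    # heap of (-score, tie, answer)

    def worker(self):
        """Repeatedly pick the frozen task with the best bound and resume it."""
        while True:
            with self.lock:
                if not self.frozen:
                    return
                _, _, task = heapq.heappop(self.frozen)
            result = task.run()    # resume until done or the bound tightens
            with self.lock:
                if result.done:    # computed answer + new frozen sub-tasks
                    heapq.heappush(self.computed,
                                   (-result.score, next(self.tie), result.answer))
                    for child in task.children(result.answer):
                        heapq.heappush(self.frozen,
                                       (-child.bound, next(self.tie), child))
                else:              # re-freeze with the tightened bound
                    heapq.heappush(self.frozen, (-task.bound, next(self.tie), task))
                self.lock.notify_all()

    def pop_to_print(self):
        """Return a computed answer only once no frozen task can still beat it.

        NOTE: a full version must also account for the bounds of tasks that
        a worker is currently resuming."""
        with self.lock:
            while True:
                best_bound = -self.frozen[0][0] if self.frozen else float("-inf")
                if self.computed and -self.computed[0][0] >= best_bound:
                    return heapq.heappop(self.computed)[2]
                if not self.frozen and not self.computed:
                    return None    # enumeration exhausted
                self.lock.wait()
```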
Combined vs. (Serial) Freezing
• Experiments: graph search; 2 Intel Xeon processors (2.67 GHz), 4 cores each (8 in total); 48 GB memory; Mondial and DBLP
• Now there is a significant gain (≈50%) already with 2 threads
Improvement of Combined
• Experiments: graph search; 2 Intel Xeon processors (2.67 GHz), 4 cores each (8 in total); 48 GB memory
• Mondial: 3%-10% of the original running time; DBLP: 4%-5%
• On average, with 8 threads we got 5.7% of the original running time
Conclusions
• Considered Lawler-Murty’s ranked enumeration: it has theoretical complexity guarantees, …but a direct implementation is very slow, and straightforward parallelization poorly utilizes the cores
• Ideas: progressive bounds, freezing, early popping (in the paper: additional ideas and the combination of ideas)
• The most significant speedup comes from combining these ideas: the flow substantially differs from the original procedure; 20x faster on 8 cores
• Test case: graph search; focus: general applications
• Future: additional test cases
Questions?