Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
May 9, 2006
P. Krusche / A. Tiskin - Efficient LLCS Computation using Bulk-Synchronous Parallelism
Outline
• Introduction
  • LLCS Computation
  • The BSP Model
• Problem Definition and Algorithms
  • Standard Algorithm
  • Parallel Algorithm
• Experiments
  • Experiment Setup
  • Predictions
  • Speedup
Motivation
Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to
• examine the suitability of high-level BSP programming for such problems
• compare different BSP libraries on different systems
• see what happens when there is good sequential performance
• examine performance predictability
Related Work
• Sequential dynamic programming algorithm (Hirschberg, ’75)
• Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the Longest Common Subsequence problem (2001)
• Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
• Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)
Our Work
• Combination of bit-parallel algorithms and fast BSP-style communication
• A BSP performance model and predictions
• Comparison using different libraries on different systems
• Estimation of the block size parameter before the calculation for better speedup
The BSP Model
• p identical processor/memory pairs (computing nodes)
• Computation speed f on every node
• Arbitrary interconnection network with latency l and bandwidth gap g
BSP Programs
• SPMD execution, organized in supersteps
• Communication may be delayed until the end of the superstep
• Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communication volume, and S the number of supersteps
• Bytes are used as the base unit for communication size
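The cost formula can be read directly as a prediction function. A minimal sketch (the parameter values below are hypothetical, chosen only to illustrate the formula, and are not measurements from this talk):

```python
def bsp_time(f, g, l, W, H, S):
    """Predicted BSP running time T = f*W + g*H + l*S:
    computation + communication + synchronization cost."""
    return f * W + g * H + l * S

# Example: 10^9 local operations, 10^6 bytes communicated, 100 supersteps.
t = bsp_time(f=1e-8, g=1e-7, l=1e-4, W=1e9, H=1e6, S=100)
```

Note how the three terms separate cleanly: for large W the computation term dominates, while for many small supersteps the l·S term takes over.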
Problem Definition
Let X = x1 x2 … xm and Y = y1 y2 … yn be two strings over a finite alphabet.
• Subsequence U of string X: U can be obtained by deleting zero or more elements from X, i.e. U = xi1 xi2 … xik with iq < iq+1 for all q with 1 ≤ q < k.
• LCS(X, Y): any string that is a subsequence of both X and Y and has maximum possible length.
• LLCS(X, Y): the length of these sequences.
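The subsequence definition above translates into a one-liner (an illustrative sketch, not part of the original slides):

```python
def is_subsequence(u, x):
    """True iff u can be obtained from x by deleting zero or more
    elements, i.e. u's characters appear in x in order (not
    necessarily contiguously)."""
    it = iter(x)
    # 'c in it' consumes the iterator, so positions stay increasing.
    return all(c in it for c in u)
```

For example, "ace" is a subsequence of "abcde", while "aec" is not.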
Sequential Algorithm
• Dynamic programming matrix L0..m,0..n with Li,j = LLCS(x1 x2 … xi, y1 y2 … yj)
• The values in this matrix can be computed in O(mn) time and space
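The standard O(mn) recurrence can be sketched as follows:

```python
def llcs(x, y):
    """LLCS via the classic dynamic programming recurrence:
    L[i][j] = L[i-1][j-1] + 1            if x[i-1] == y[j-1]
            = max(L[i-1][j], L[i][j-1])  otherwise."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```

Keeping the full matrix costs O(mn) space; only the previous row is needed if just the length is required.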
Parallel Algorithm
• Based on a simple parallel algorithm for grid DAG computation
• The dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G)×(n/G) (G: grid size)
• Blocks in a wavefront can be processed in parallel
• Assumptions:
  • strings of equal length m = n
  • the ratio a = G/p is an integer
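The wavefront structure comes from the data dependencies of the DP recurrence: block (i, j) needs only blocks (i−1, j), (i, j−1) and (i−1, j−1), so all blocks on one anti-diagonal are independent. A small sketch of the schedule:

```python
def wavefronts(G):
    """Group the G x G grid of blocks into anti-diagonal wavefronts.
    All blocks within one wavefront are data-independent and can be
    processed in parallel; there are 2G - 1 wavefronts in total."""
    waves = [[] for _ in range(2 * G - 1)]
    for i in range(G):
        for j in range(G):
            waves[i + j].append((i, j))
    return waves
```

For G = 3 this yields wavefronts of sizes 1, 2, 3, 2, 1: the grid fills up to p-fold parallelism in the middle and tapers off at both ends.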
Parallel Cost Model
• Input/output data distribution is block-cyclic
• Data for block-columns can be kept locally
• The running time follows the BSP cost formula T = f·W + g·H + l·S, with W, H and S determined by the wavefront schedule
• The parameter a can be used to tune performance
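A rough prediction function can be derived from first principles (this is a sketch under my own assumptions, NOT the exact formula from the paper: grid size G = a·p, one superstep per wavefront, block work f·(n/G)², and one block boundary of n/G entries exchanged per processor per superstep):

```python
from math import ceil

def predicted_time(n, p, a, f, g, l):
    """Rough BSP cost prediction for the wavefront algorithm."""
    G = a * p
    block_work = f * (n / G) ** 2
    comp = comm = 0.0
    for w in range(1, 2 * G):             # wavefront w has min(w, 2G-w) blocks
        k = min(w, 2 * G - w)
        comp += ceil(k / p) * block_work  # blocks per processor run in sequence
        comm += g * (n / G)               # boundary exchange per superstep
    sync = l * (2 * G - 1)
    return comp + comm + sync
```

The tuning role of a is visible here: larger a smooths out processor idling at the wavefront edges but increases the number of supersteps, so the l·S term grows.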
Bit-Parallel Algorithms
• Bit-parallel computation processes w entries of L in parallel (w: machine word size)
• This leads to substantial speedup in the sequential computation phase and slightly lower communication cost per superstep
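The idea behind the bit-vector approach of Crochemore et al. can be sketched in a few lines: one row of the DP matrix is kept as a bit-vector, so w entries are updated per word operation. Python integers serve as arbitrarily long bit-vectors here (the update form below is the well-known (V + U) | (V − U) formulation; a real implementation would use machine words):

```python
def llcs_bitparallel(x, y):
    """Bit-parallel LLCS: row of the DP matrix kept as a bit-vector."""
    m = len(x)
    mask = (1 << m) - 1
    # Match masks: bit i of match[c] is set iff x[i] == c.
    match = {}
    for i, c in enumerate(x):
        match[c] = match.get(c, 0) | (1 << i)
    V = mask                      # all ones: no matches found yet
    for c in y:
        U = V & match.get(c, 0)   # positions where a new match extends the LCS
        V = ((V + U) | (V - U)) & mask
    # Each zero bit in V accounts for one matched position.
    return m - bin(V).count("1")
```

A whole row update costs O(m/w) word operations instead of O(m) cell updates, which is where the sequential speedup reported later comes from.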
Systems Used
Measurements were made on parallel machines at the Centre for Scientific Computing:
• aracari: IBM cluster, 64 × 2-way SMP Pentium 3, 1.4 GHz / 128 GB of memory (interconnect: Myrinet 2000, MPI: mpich-gm)
• argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon, 2.6 GHz / 62 GB of memory (interconnect: 100 Mbit Ethernet, MPI: mpich-p4)
• skua: SGI Altix shared-memory machine, 56 × Itanium 2, 1.6 GHz / 112 GB of memory (MPI: SGI native)
BSP Libraries Used
• The Oxford BSP Toolset on top of MPI (www.bsp-worldwide.org/implmnts/oxtool/)
• PUB on top of MPI, except on the SGI (wwwcs.uni-paderborn.de/~bsp/)
• A simple BSPlib implementation based on MPI(-2)
Input and Parameters
• Input strings of equal length, generated randomly
• Predictability examined for string lengths between 8192 and 65536 and grid size parameter a between 1 and 5
• Values of l and g measured by timing random permutations
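Extracting l and g from such timings amounts to fitting the linear cost model T ≈ g·H + l to measured (H, T) pairs. A sketch with synthetic timings (no MPI machine is assumed here; on a real system the samples would come from timed communication rounds):

```python
def fit_g_l(samples):
    """Least-squares fit of T = g*H + l to (H, T) timing samples.
    Returns (g, l): slope = bandwidth gap, intercept = latency."""
    n = len(samples)
    sH = sum(h for h, _ in samples)
    sT = sum(t for _, t in samples)
    sHH = sum(h * h for h, _ in samples)
    sHT = sum(h * t for h, t in samples)
    g = (n * sHT - sH * sT) / (n * sHH - sH * sH)
    l = (sT - g * sH) / n
    return g, l

# Synthetic timings generated from g = 2e-8 s/byte, l = 1e-4 s.
data = [(h, 2e-8 * h + 1e-4) for h in (1024, 4096, 16384, 65536)]
g, l = fit_g_l(data)
```

On noiseless data the fit recovers the generating parameters exactly; real measurements would scatter around the line, which is one source of the prediction error discussed below.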
Experimental Values of f and f′
• Simple algorithm (f):
  • skua: 0.008 µs/op (130 M op/s)
  • argus: 0.016 µs/op (61 M op/s)
  • aracari: 0.012 µs/op (86 M op/s)
• Bit-parallel algorithm (f′):
  • skua: 0.00022 µs/op (4.5 G op/s)
  • argus: 0.00034 µs/op (2.9 G op/s)
  • aracari: 0.00055 µs/op (1.8 G op/s)
Predictions
Good results on distributed memory systems (aracari/MPI, 32 processors)
Predictions
Slightly worse results on shared memory (skua, MPI, p = 32)
Problems when Predicting Performance
• Results for PUB are less accurate on shared memory
• Setup costs are only covered by the parameter l, which is difficult to measure
• Problems on the shared memory machine when communication size is small:
  • PUB shows a performance break when communication size reaches a certain value
  • A busy communication network can create ‘spikes’
Predictions for the Bit-Parallel Version
• Good results on distributed memory systems
• Results on the SGI have a larger prediction error because local computations use block sizes for which f′ is not stable
Speedup Results (LLCS, aracari)
Speedup for the Bit-Parallel Version
• Speedup is slightly lower than for the standard version
• However, overall running times for the same problem sizes are shorter
• Parallel speedup can be expected for larger problem sizes
Speedup for the Bit-Parallel Version
(figure panels: argus, p = 10; skua, p = 32)
Result Summary
Summary and Outlook
• Summary:
  • High-level BSP programming is efficient for the dynamic programming problem we considered
  • Implementations benefit from a low-latency library (the Oxford BSP Toolset / PUB)
  • Very good predictability
• Outlook:
  • Different modeling of bandwidth allows better predictions
  • Lower latency is possible by using subgroup synchronization
  • Extraction of the LCS itself is possible, using a post-processing step or another algorithm