
Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism


Presentation Transcript


  1. Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
May 9, 2006

  2. Outline
• Introduction
  • LLCS Computation
  • The BSP Model
• Problem Definition and Algorithms
  • Standard Algorithm
  • Parallel Algorithm
• Experiments
  • Experiment Setup
  • Predictions
  • Speedup

  3. Motivation
Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to:
• Examine the suitability of high-level BSP programming for such problems
• Compare different BSP libraries on different systems
• See how parallel speedup holds up when sequential performance is already good
• Examine performance predictability

  4. Related Work
• Hirschberg: sequential dynamic programming algorithm for the LCS problem (1975)
• Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the Longest Common Subsequence problem (2001)
• Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
• Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)

  5. Our Work
• Combination of bit-parallel algorithms and fast BSP-style communication
• A BSP performance model and predictions
• Comparison of different libraries on different systems
• Estimation of the block size parameter before the computation, for better speedup

  6. The BSP Model
• p identical processor/memory pairs (computing nodes)
• Computation speed f on every node
• Arbitrary interconnection network with latency l and bandwidth gap g

  7. BSP Programs
• SPMD execution, organized in supersteps
• Communication may be delayed until the end of the superstep
• Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communication size, and S the number of supersteps
• Bytes are used as the base unit for communication size
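As an illustration of how such a prediction is evaluated in practice, a minimal C++ sketch (not the authors' code; all parameter values below are placeholders):

    #include <cstdio>

    // BSP cost model: predicted time T = f*W + g*H + l*S.
    // f: seconds per operation, W: local operations,
    // g: seconds per byte,      H: bytes communicated,
    // l: seconds per superstep, S: number of supersteps.
    double bsp_cost(double f, double W, double g, double H, double l, double S) {
        return f * W + g * H + l * S;
    }

    int main() {
        // Hypothetical machine parameters, for illustration only.
        double T = bsp_cost(1e-8, 1e9, 1e-8, 1e7, 1e-4, 200);
        std::printf("predicted running time: %g s\n", T);
        return 0;
    }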

  8. Problem Definition
Let X = x_1 x_2 … x_m and Y = y_1 y_2 … y_n be two strings over a finite alphabet.
• Subsequence U of a string X: U can be obtained by deleting zero or more elements from X, i.e. U = x_{i_1} x_{i_2} … x_{i_k} with i_q < i_{q+1} for all q with 1 ≤ q < k
• An LCS(X, Y) is any string which is a subsequence of both X and Y and has maximum possible length
• The length of these sequences is denoted LLCS(X, Y)
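For example, for X = ABCBDAB and Y = BDCABA, one LCS is BCBA, so LLCS(X, Y) = 4.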

  9. Sequential Algorithm
• Dynamic programming matrix L_{0..m, 0..n} with L_{i,j} = LLCS(x_1 x_2 … x_i, y_1 y_2 … y_j)
• Recurrence: L_{i,j} = L_{i-1,j-1} + 1 if x_i = y_j, and max(L_{i-1,j}, L_{i,j-1}) otherwise
• The values in this matrix can be computed in O(mn) time and space
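A minimal C++ sketch of this dynamic program (an illustration; the talk's actual implementation is not shown in the transcript):

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    // LLCS(X, Y) in O(mn) time and space via the standard recurrence.
    int llcs(const std::string& x, const std::string& y) {
        const std::size_t m = x.size(), n = y.size();
        std::vector<std::vector<int>> L(m + 1, std::vector<int>(n + 1, 0));
        for (std::size_t i = 1; i <= m; ++i)
            for (std::size_t j = 1; j <= n; ++j)
                L[i][j] = (x[i - 1] == y[j - 1])
                              ? L[i - 1][j - 1] + 1
                              : std::max(L[i - 1][j], L[i][j - 1]);
        return L[m][n];
    }

    int main() {
        assert(llcs("ABCBDAB", "BDCABA") == 4);  // example from slide 8
        return 0;
    }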

  10. Parallel Algorithm
• Based on a simple parallel algorithm for grid DAG computation
• The dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G)×(n/G), where G is the grid size
• Blocks on an anti-diagonal wavefront can be processed in parallel (see the sketch below)
• Assumptions: strings of equal length m = n; the ratio a = G/p is an integer
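A schematic C++ sketch of the wavefront schedule over the G × G block grid (an illustration of the ordering and the block-cyclic ownership only; the block computation and BSP communication are left abstract):

    #include <algorithm>
    #include <cstdio>

    // Blocks (i, j) with i + j == wave are independent and can run in parallel.
    // Block column j is owned block-cyclically by processor j % p.
    void wavefront_schedule(int G, int p) {
        for (int wave = 0; wave <= 2 * (G - 1); ++wave) {
            const int lo = std::max(0, wave - (G - 1));
            const int hi = std::min(wave, G - 1);
            for (int i = lo; i <= hi; ++i) {
                const int j = wave - i;
                std::printf("wave %2d: block (%d, %d) -> processor %d\n",
                            wave, i, j, j % p);
            }
            // A BSP implementation would place a superstep boundary here:
            // exchange block boundaries, then synchronize.
        }
    }

    int main() { wavefront_schedule(6, 3); return 0; }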

  11. Parallel Cost Model
• Input/output data distribution is block-cyclic
• Data for block-columns can be kept locally
• Running time: see the cost estimate sketched below
• The parameter a can be used to tune performance
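A plausible form of this cost estimate, reconstructed from the standard BSP analysis of wavefront block scheduling (an approximation under the assumptions m = n and G = ap, not necessarily the authors' exact formula):

    T(n, p, a) \approx f \cdot \frac{n^2}{p} + g \cdot O(a\,n) + l \cdot (2ap - 1)

Each processor performs about n²/p cell updates in total; in each of the 2ap − 1 wavefront supersteps it exchanges O(n/p) boundary elements, giving O(an) communication overall; and it synchronizes once per superstep. This trade-off (communication and synchronization costs grow with a, while load imbalance at the ends of the wavefront shrinks) is what makes a useful as a tuning parameter.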

  12. Bit-Parallel Algorithms
• Bit-parallel computation processes w entries of L in parallel (w: machine word size)
• This leads to substantial speedup in the sequential computation phase and slightly lower communication cost per superstep
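A compact single-word sketch of a bit-vector LLCS kernel in this spirit, following the Crochemore et al. line of algorithms (restricted to m ≤ 64 here; the talk's multi-word implementation is not shown in the transcript):

    #include <bit>       // std::popcount, C++20
    #include <cassert>
    #include <cstdint>
    #include <string>

    // Each bit of V tracks one row of the DP matrix, so a single word
    // operation updates w = 64 entries of L at once.
    int llcs_bitparallel(const std::string& x, const std::string& y) {
        const unsigned m = static_cast<unsigned>(x.size());
        assert(m >= 1 && m <= 64);
        const uint64_t mask = (m == 64) ? ~0ULL : ((1ULL << m) - 1);

        uint64_t M[256] = {0};  // per-character match masks for X
        for (unsigned i = 0; i < m; ++i)
            M[static_cast<unsigned char>(x[i])] |= 1ULL << i;

        uint64_t V = mask;      // all ones: no matches recorded yet
        for (const char c : y) {
            const uint64_t U = V & M[static_cast<unsigned char>(c)];
            V = ((V + U) | (V - U)) & mask;  // carries implement the DP max
        }
        return static_cast<int>(m) - std::popcount(V);  // zero bits = LLCS
    }

    int main() {
        assert(llcs_bitparallel("ABCBDAB", "BDCABA") == 4);
        return 0;
    }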

  13. Systems Used
Measurements on parallel machines at the Centre for Scientific Computing:
• aracari: IBM cluster, 64 × 2-way SMP Pentium 3 1.4 GHz / 128 GB of memory (interconnect: Myrinet 2000; MPI: mpich-gm)
• argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon 2.6 GHz / 62 GB of memory (interconnect: 100 Mbit Ethernet; MPI: mpich-p4)
• skua: SGI Altix shared memory machine, 56 × Itanium-2 1.6 GHz processors / 112 GB of memory (MPI: SGI native)

  14. BSP Libraries Used
• The Oxford BSP Toolset on top of MPI (www.bsp-worldwide.org/implmnts/oxtool/)
• PUB on top of MPI, except on the SGI (wwwcs.uni-paderborn.de/~bsp/)
• A simple BSPlib implementation based on MPI(-2)

  15. Input and Parameters
• Input strings of equal length, generated randomly
• Predictability examined for string lengths between 8192 and 65536, with the grid size parameter a between 1 and 5
• The values of l and g were measured by timing random permutations

  16. Experimental Values of f and f′
• Simple algorithm (f):
  • skua: 0.008 µs/op (130 M op/s)
  • argus: 0.016 µs/op (61 M op/s)
  • aracari: 0.012 µs/op (86 M op/s)
• Bit-parallel algorithm (f′):
  • skua: 0.00022 µs/op (4.5 G op/s)
  • argus: 0.00034 µs/op (2.9 G op/s)
  • aracari: 0.00055 µs/op (1.8 G op/s)

  17. Predictions: good results on distributed memory systems
[Plot: aracari/MPI, 32 processors]

  18. Predictions: slightly worse results on shared memory
[Plot: skua/MPI, p = 32]

  19. Problems when Predicting Performance
• Results for PUB are less accurate on shared memory
• Setup costs are only covered by the parameter l, which is difficult to measure
• Problems on the shared memory machine when the communication size is small
• PUB shows a performance break-in when the communication size reaches a certain value
• A busy communication network can create 'spikes'

  20. Predictions for the Bit-Parallel Version
• Good results on distributed memory systems
• Results on the SGI have a larger prediction error because the local computations use block sizes for which f′ is not stable

  21. Speedup Results (LLCS, aracari)

  22. Speedup for the Bit-Parallel Version
• Speedup is slightly lower than for the standard version
• However, overall running times for the same problem sizes are shorter
• Parallel speedup can be expected for larger problem sizes

  23. Speedup for the Bit-Parallel Version
[Plots: argus, p = 10; skua, p = 32]

  24. Result Summary

  25. Summary and Outlook
• Summary
  • High-level BSP programming is efficient for the dynamic programming problem we considered
  • Implementations benefit from a low-latency library (the Oxford BSP Toolset, PUB)
  • Very good predictability
• Outlook
  • Different modeling of bandwidth allows better predictions
  • Lower latency is possible by using subgroup synchronization
  • Extraction of an LCS is possible, using a post-processing step or another algorithm
