
A Million-Fold Speed Improvement in Genomic Repeats Detection

John W. Romein, Jaap Heringa, Henri E. Bal. A Million-Fold Speed Improvement in Genomic Repeats Detection. Vrije Universiteit, Faculty of Sciences, Department of Computer Science, Bio-Informatics Group & Computer Systems Group, Amsterdam, the Netherlands.



Presentation Transcript


  1. A Million-Fold Speed Improvement in Genomic Repeats Detection • John W. Romein • Jaap Heringa • Henri E. Bal • Vrije Universiteit • Faculty of Sciences, Department of Computer Science • Bio-Informatics Group & Computer Systems Group • Amsterdam, the Netherlands

  2. repeats in biological sequences • important to detect • essential for evolution • protein structure & function • diseases • hard to detect • any length • mutations • insertions/deletions → different fragment sizes • tandem and distant

  3. repro • delineates repeats • sensitive • two phases • find top alignments (slow) • find repeats • replaced phase 1 • old algorithm: O(n⁴) → n < 2,000 • new algorithm: O(n³) → n < 60,000 • 3-level parallel: SIMD, SMP, cluster

  4. sidestep: sequence alignment • superpose two sequences (TATGCAG, TCTGAG) • match symbols vertically (good: +2, bad: -1) • allow gaps (penalty: -2 - 1*length) • maximize the score • compute the matrix using dynamic programming (see the sketch below)
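A minimal sketch in C of the dynamic programming just described, using the slide's parameters (match +2, mismatch -1, gap penalty -(2 + 1*length)) and the standard Gotoh three-matrix recurrence for such affine gap costs. The names and the fixed-size matrices are ours, for illustration only, not the authors' code.

    #include <stdio.h>
    #include <string.h>

    #define MATCH     2          /* good vertical match           */
    #define MISMATCH -1          /* bad vertical match            */
    #define GAP_OPEN  2          /* gap penalty = -(2 + 1*length) */
    #define GAP_EXT   1
    #define NEG_INF  (-1000000)

    static int max2(int a, int b) { return a > b ? a : b; }
    static int max3(int a, int b, int c) { return max2(max2(a, b), c); }

    int align_score(const char *a, const char *b)
    {
        int n = strlen(a), m = strlen(b);
        int H[64][64], E[64][64], F[64][64];   /* fixed size: sketch only */

        H[0][0] = 0;
        for (int i = 1; i <= n; i++) {         /* leading gap in b */
            H[i][0] = -(GAP_OPEN + GAP_EXT * i);
            E[i][0] = NEG_INF;
        }
        for (int j = 1; j <= m; j++) {         /* leading gap in a */
            H[0][j] = -(GAP_OPEN + GAP_EXT * j);
            F[0][j] = NEG_INF;
        }
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
                E[i][j] = max2(H[i][j-1] - (GAP_OPEN + GAP_EXT),  /* open   */
                               E[i][j-1] - GAP_EXT);              /* extend */
                F[i][j] = max2(H[i-1][j] - (GAP_OPEN + GAP_EXT),
                               F[i-1][j] - GAP_EXT);
                H[i][j] = max3(H[i-1][j-1] + s, E[i][j], F[i][j]);
            }
        return H[n][m];
    }

    int main(void)
    {
        printf("%d\n", align_score("TATGCAG", "TCTGAG"));  /* prints 6 */
        return 0;
    }

On the slide's example pair this yields the score 6 that slide 6 quotes: five matches (+10), one A/C mismatch (-1), and one length-1 gap (-3).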

  5. sidestep: local alignment • find sub-sequences that match well • ignore non-matching regions before and after the sub-sequences (by disallowing negative cell values) • constructing the actual alignment: O(n³) time • computing only the scores: O(n²) time • (see paper, and the sketch below)
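A score-only sketch of this local variant (the Smith-Waterman idea): the same recurrence as the global code above, but every cell is clamped at zero so junk before and after the matching sub-sequences cannot drag the score negative, and the answer is the matrix maximum. Again the names are illustrative.

    #include <stdio.h>
    #include <string.h>

    #define MATCH     2
    #define MISMATCH -1
    #define GAP_OPEN  2
    #define GAP_EXT   1

    static int max2(int a, int b) { return a > b ? a : b; }

    int local_score(const char *a, const char *b)
    {
        int n = strlen(a), m = strlen(b);
        int H[64][64] = {{0}}, E[64][64] = {{0}}, F[64][64] = {{0}};
        int best = 0;

        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
                E[i][j] = max2(H[i][j-1] - (GAP_OPEN + GAP_EXT),
                               E[i][j-1] - GAP_EXT);
                F[i][j] = max2(H[i-1][j] - (GAP_OPEN + GAP_EXT),
                               F[i-1][j] - GAP_EXT);
                H[i][j] = max2(0,                    /* disallow negatives */
                               max2(H[i-1][j-1] + s,
                                    max2(E[i][j], F[i][j])));
                if (H[i][j] > best) best = H[i][j];  /* best local score   */
            }
        return best;   /* scores only: O(n^2) time, no O(n^3) traceback */
    }

    int main(void)
    {
        /* junk around TATGCAG and TCTGAG does not change the result */
        printf("%d\n", local_score("CCTATGCAGAA", "GGTCTGAGTT"));
        return 0;
    }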

  6. summary • computing the score of (TATGCAG, TCTGAG) => 6 takes O(n²) time • constructing the actual alignment of (TATGCAG, TCTGAG) takes O(n³) time • matching <junk1> TATGCAG <junk2> with <junk3> TCTGAG <junk4> gives the same result as matching only the substrings TATGCAG and TCTGAG

  7. finding top alignments • red lines (in the slide's dot plot): top alignments • split the sequence every possible way • align each subsequence pair • the best one is the first top alignment • trick: find the next-best (top) alignment score using the O(n²) algorithm n times; construct the top alignment itself using the O(n³) algorithm • repeat while avoiding already-found top alignments • a user typically wants 5-30 top alignments • keep an ordered list, do the most promising alignments first • realign only 3-10% (see the sketch below)
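A schematic sketch of the first step of this loop, assuming the local_score() routine sketched under slide 5: every split position is scored with the O(n²) algorithm, and the best split yields the first top alignment. Masking out found alignments and the O(n³) construction step are deliberately elided here.

    #include <stdio.h>
    #include <string.h>

    int local_score(const char *a, const char *b);  /* O(n^2) sketch, slide 5 */

    /* find the best split point: the first top alignment */
    void first_top_alignment(const char *seq)
    {
        int n = strlen(seq);
        int best = -1, best_i = 1;
        char left[64], right[64];                   /* sketch-sized buffers */

        for (int i = 1; i < n; i++) {               /* every possible split */
            memcpy(left, seq, i); left[i] = '\0';
            strcpy(right, seq + i);
            int s = local_score(left, right);       /* score-only, O(n^2)   */
            if (s > best) { best = s; best_i = i; }
        }
        printf("first top alignment: split at %d, score %d\n", best_i, best);
        /* next: construct the alignment itself with the O(n^3) algorithm,
         * exclude its cells, and repeat to get the next top alignment      */
    }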

  8. performance old vs. new • sequence: longest known protein (titin) • speed improvement increases with sequence length

  9. parallel alignment • parallelism within one alignment: limited by a loop-carried dependency • concurrent alignments instead: speculative parallelism, good performance • three-level parallelism • SSE/SSE2 multimedia extensions (SIMD) • shared-memory MIMD • distributed-memory MIMD

  10. SIMD parallelism • multimedia extensions • 4 (SSE) or 8 (SSE2) parallel operations on consecutive 2-byte words • compiler intrinsics • compute 4 (or 8) neighboring matrices concurrently • interleaved memory layout • use fine-grained hardware for coarse-grained computation • applicable to any program that does many alignments
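A minimal sketch of one vector step under the interleaved layout this slide describes, assuming SSE2 compiler intrinsics (8 parallel 16-bit lanes): cell (i,j) of eight independent alignment matrices sits in one __m128i, so a single update advances all eight alignments at once. A linear gap penalty is used for brevity, and the helper name is ours.

    #include <emmintrin.h>   /* SSE2 compiler intrinsics */

    /* With the interleaved layout, lane k of each vector belongs to
     * matrix k; one call computes H[i][j] for all 8 matrices:
     *   H[i][j] = max(H[i-1][j-1] + sub, H[i-1][j] - gap, H[i][j-1] - gap) */
    static inline __m128i cell_update_x8(__m128i diag, __m128i up,
                                         __m128i left, __m128i sub,
                                         __m128i gap)
    {
        __m128i h = _mm_adds_epi16(diag, sub);            /* match/mismatch */
        h = _mm_max_epi16(h, _mm_subs_epi16(up,   gap));  /* gap, 8 lanes   */
        h = _mm_max_epi16(h, _mm_subs_epi16(left, gap));
        return h;                                         /* 8 cells/step   */
    }

The saturating adds keep the 16-bit lanes from wrapping, and _mm_max_epi16 is the hardware MAX operator the next slide credits for part of the superlinear speedup; with SSE (64-bit MMX registers) the same idea runs 4 lanes wide.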

  11. SSE/SSE2 performance • speedups w.r.t. the sequential new algorithm • superlinear speedups, thanks to: • the hardware MAX operator • 8 extra MMX/XMM registers • better instruction scheduling • cache-aware alignment: 4-6.5 times faster

  12. MIMD parallelism • SIMD (SSE) parallelism is speculative • if a matrix (alignment) is 'promising', its neighbors probably are promising too • MIMD parallelism: • dynamic task scheduling, selecting the most promising tasks from a job queue • shared memory (SMP): easy • distributed memory: MPI, master/worker (sketched below)
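A bare-bones master/worker sketch of the distributed-memory level, assuming MPI: rank 0 hands out task indices from a queue (in the real system, ordered by how promising each alignment is) and workers send back scores. Task contents and the promise ordering are elided; the tags and names are ours.

    #include <mpi.h>
    #include <stdio.h>

    #define WORK_TAG 1
    #define DONE_TAG 2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                /* master: dynamic task scheduling */
            int ntasks = 100, next = 0, done = 0, score;
            MPI_Status st;
            for (int w = 1; w < size && next < ntasks; w++) {
                MPI_Send(&next, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
                next++;                 /* seed: assumes more tasks than workers */
            }
            while (done < next) {       /* collect results, hand out more  */
                MPI_Recv(&score, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                done++;
                if (next < ntasks) {    /* most promising task goes next   */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, WORK_TAG,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, DONE_TAG,
                             MPI_COMM_WORLD);
                }
            }
        } else {                        /* worker: score one task at a time */
            int task, score;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == DONE_TAG) break;
                score = task;           /* placeholder for a real alignment */
                MPI_Send(&score, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }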

  13. total parallel performance • SMP: 2 CPUs → 2 times faster • cluster: 64 × 2 CPUs → 548–889-fold speedup • up to 125× faster than the SSE version on 1 CPU

  14. conclusions • new algorithm: well over 100 times faster • much more for longer sequences • parallel: SSE(2), SMP, cluster • SSE(2) parallelism yields superlinear speedups • 128 CPUs: 548–889-fold speedup • in total: a 1,000,000-fold speed improvement
