A Million-Fold Speed Improvement in Genomic Repeats Detection

John W. Romein, Jaap Heringa, Henri E. Bal • Vrije Universiteit • Faculty of Sciences, Department of Computer Science • Bio-Informatics Group & Computer Systems Group • Amsterdam, the Netherlands

repeats in bio sequences • important to detect • essential for evolution • protein structure & function • diseases • hard to detect • any length • mutations • insertions/deletions → different fragment sizes • tandem and distant

Repro • delineates repeats • sensitive • two phases • find top alignments (slow) • find repeats • replaced phase 1 • old algorithm • O(n^4) → n < 2,000 • new algorithm • O(n^3) → n < 60,000 • 3-level parallel: SIMD, SMP, cluster

sidestep: sequence alignment • superpose two sequences (TATGCAG, TCTGAG) • match symbols vertically (good: +2, bad: -1) • allow gaps (-2 - 1*length) • maximize score • compute matrix using dynamic programming
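The scoring rules on this slide can be sketched as a small dynamic program. This is a minimal sketch, assuming the gap rule means a gap of length L costs 2 + L (an affine penalty) and that the whole of both sequences must be aligned. The function name `global_score` and its parameters are illustrative and not taken from Repro.

```python
def global_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Best end-to-end alignment score; a gap of length L costs gap_open + L * gap_extend."""
    neg = float("-inf")
    m = len(b)

    def gap(length):
        # Cost of one leading gap of the given length (0 for no gap).
        return -(gap_open + gap_extend * length) if length else 0

    prev_h = [gap(j) for j in range(m + 1)]  # row 0 of the score matrix
    prev_f = [neg] * (m + 1)                 # best scores ending in a vertical gap
    for i in range(1, len(a) + 1):
        cur_h = [gap(i)] + [neg] * m
        cur_f = [neg] * (m + 1)
        e = neg                              # best score ending in a horizontal gap
        for j in range(1, m + 1):
            # Either extend an open gap or open a new one.
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(diag, e, cur_f[j])
        prev_h, prev_f = cur_h, cur_f
    return prev_h[m]


print(global_score("TATGCAG", "TCTGAG"))
```

Only two rows of the matrix are kept, so the sketch uses O(n) memory and O(n^2) time for the score alone.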

sidestep: local alignment • Find subsequences that match well • Ignores non-matching values before and after the subsequence (by disallowing negative values) • Construct actual alignment: O(n^3) time • Computing only the scores: O(n^2) time • (see paper)

summary • (TATGCAG, TCTGAG) => 6 • takes O(n^2) time • (TATGCAG, TCTGAG) => the actual alignment • takes O(n^3) time • Matching <junk1> TATGCAG <junk2> with <junk3> TCTGAG <junk4> gives the same result as matching only the substrings TATGCAG and TCTGAG
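The two claims in this summary (a score of 6, and insensitivity to surrounding junk) can be checked with a score-only local alignment. This is a minimal sketch, assuming a gap of length L costs 2 + L. The name `local_score` and the junk strings are made up for illustration; the junk uses symbols that occur in neither sequence, so it cannot contribute matches.

```python
def local_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Best local alignment score; negative cell values are clamped to 0."""
    neg = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)
    prev_f = [neg] * (m + 1)     # best scores ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg] * (m + 1)
        e = neg                  # best score ending in a horizontal gap
        for j in range(1, m + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Disallowing negative values lets an alignment start anywhere.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


plain = local_score("TATGCAG", "TCTGAG")
padded = local_score("XXXX" + "TATGCAG" + "ZZZ", "WW" + "TCTGAG" + "YYYYY")
print(plain, padded)
```

Both calls return the same score because the clamp to 0 discards any prefix or suffix that would only lower the total.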

finding top alignments • red lines: top alignments • split sequence every possible way • align subsequence-pair • best is first top alignment • trick: find next best (top) alignment using O(n^2) algorithm n times; construct top alignment using O(n^3) algorithm • repeat while avoiding found top alignments • user typically wants 5-30 top alignments • ordered list, do most promising alignments first • realign 3-10%
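The first three bullets (split, align each prefix/suffix pair, keep the best) can be sketched naively as follows. This is only an illustration of the splitting idea: it does not implement the avoidance of earlier top alignments, the ordered list, or the realignment step, and the names `local_score` and `first_top_alignment` are hypothetical.

```python
def local_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score-only local alignment; a gap of length L costs gap_open + L * gap_extend."""
    neg = float("-inf")
    m = len(b)
    prev_h, prev_f, best = [0] * (m + 1), [neg] * (m + 1), 0
    for i in range(1, len(a) + 1):
        cur_h, cur_f, e = [0] * (m + 1), [neg] * (m + 1), neg
        for j in range(1, m + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def first_top_alignment(seq):
    """Try every split point; return (best score, split index) or (0, None)."""
    best_score, best_split = 0, None
    for k in range(1, len(seq)):
        # Align everything before the split against everything after it.
        score = local_score(seq[:k], seq[k:])
        if score > best_score:
            best_score, best_split = score, k
    return best_score, best_split


print(first_top_alignment("ACDEFGHIK" * 2))
```

For a sequence made of two identical nine-symbol units, the best split falls exactly between the units.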

performance old vs. new • sequence: longest known protein (titin) • speed improvement increases with sequence length

parallel alignment • parallelism within alignment • loop-carried dependency • concurrent alignments • speculative parallelism • good performance • three-level parallelism • SSE/SSE2 multimedia extensions (SIMD) • shared memory MIMD • distributed memory MIMD

SIMD parallelism • multimedia extensions • 4 (SSE) or 8 (SSE2) parallel operations on consecutive 2-byte words • compiler intrinsics • compute 4 (or 8) neighboring matrices concurrently • interleaved memory layout • use fine-grained hardware for coarse-grained computation • applicable to any program that does many alignments
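The idea of filling several matrices in lockstep with an interleaved layout can be emulated in plain Python. This is a sketch under simplifying assumptions, not the SSE/SSE2 code from the talk: it uses a linear gap penalty of -3 per gap symbol instead of the talk's gap rule, it aligns one query against four equal-length targets rather than neighboring splits of one sequence, and the innermost loop over lanes stands in for a single packed instruction. All names are illustrative.

```python
LANES = 4  # SSE handles 4 two-byte words at once, SSE2 handles 8


def scalar_local_score(a, b, match=2, mismatch=-1, gap=-3):
    """Reference: one local alignment score with a linear gap penalty."""
    prev, best = [0] * (len(b) + 1), 0
    for x in a:
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if x == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def lanewise_local_scores(query, targets, match=2, mismatch=-1, gap=-3):
    """Fill LANES matrices in lockstep; cell j of lane l lives at index j*LANES + l."""
    assert len(targets) == LANES and len({len(t) for t in targets}) == 1
    m = len(targets[0])
    prev = [0] * ((m + 1) * LANES)
    best = [0] * LANES
    for q in query:
        cur = [0] * ((m + 1) * LANES)
        for j in range(1, m + 1):
            base = j * LANES
            # In SIMD hardware this loop is one packed add/max sequence,
            # because the LANES values sit next to each other in memory.
            for l in range(LANES):
                s = match if q == targets[l][j - 1] else mismatch
                h = max(0,
                        prev[base - LANES + l] + s,   # diagonal neighbor
                        prev[base + l] + gap,         # cell above
                        cur[base - LANES + l] + gap)  # cell to the left
                cur[base + l] = h
                best[l] = max(best[l], h)
        prev = cur
    return best


targets = ["TCTGAG", "TATGCA", "GGGGGG", "CAGTAT"]
print(lanewise_local_scores("TATGCAG", targets))
```

The interleaving is what makes coarse-grained work fit fine-grained hardware: each lane is an independent alignment, yet one vector load fetches the same cell of all lanes.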

SSE/SSE2 performance • speedups w.r.t. new algorithm • superlinear speedups • MAX operator • 8 extra mmx/xmm registers • scheduling • cache-aware alignment: 4 – 6.5 times faster

MIMD parallelism • SIMD (SSE) parallelism is speculative • If a matrix (alignment) is ‘promising’, its neighbors probably also are promising • MIMD parallelism: • use dynamic task scheduling, selecting most promising tasks from a job queue • Shared memory (SMP): easy • Distributed memory: MPI, master/worker
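Dynamic task scheduling from a job queue can be sketched with Python's standard `queue.PriorityQueue` and threads. This is a hedged illustration of the shared-memory case only. The priority used here (an upper bound of 2 points per symbol of the shorter part) is an assumption made for the example, not the measure of "promising" used in the talk, and `run_workers` is a hypothetical name. Python threads do not speed up CPU-bound work, so the sketch shows the scheduling pattern rather than real performance.

```python
import queue
import threading


def local_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score-only local alignment; a gap of length L costs gap_open + L * gap_extend."""
    neg = float("-inf")
    m = len(b)
    prev_h, prev_f, best = [0] * (m + 1), [neg] * (m + 1), 0
    for i in range(1, len(a) + 1):
        cur_h, cur_f, e = [0] * (m + 1), [neg] * (m + 1), neg
        for j in range(1, m + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def run_workers(seq, n_workers=2):
    """Score every split of seq; workers always take the most promising job next."""
    jobs = queue.PriorityQueue()
    for k in range(1, len(seq)):
        bound = 2 * min(k, len(seq) - k)   # assumed priority: best achievable score
        jobs.put((-bound, k))              # smallest tuple first, so negate the bound
    results, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                _, k = jobs.get_nowait()
            except queue.Empty:
                return                     # queue drained, worker finishes
            score = local_score(seq[:k], seq[k:])
            with lock:
                results[k] = score

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


scores = run_workers("ACDEFGHIK" * 2)
print(max(scores, key=scores.get), max(scores.values()))
```

In a distributed-memory setting the same pattern would have a master process hand out jobs from the queue to workers over MPI, as the slide indicates.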

total parallel performance • SMP: 2 CPUs → 2 times faster • cluster: 64*2 CPUs → 548 – 889-fold speedup • Up to 125x faster than SSE version on 1 CPU

conclusions • new algorithm >> 100 times faster • much more for longer sequences • parallel: SSE(2), SMP, cluster • SSE(2) parallelism yields superlinear speedups • 128 CPUs: 548 – 889-fold speedup • 1,000,000-fold speed improvement
