High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by Steve Rumble
Motivation • NGS technologies produce a ton of data • AB SOLiD: 22e6 25-mers • Others are even worse… • How do 200e6 50-mers sound? • Algorithms have been pushed hard, but typically assume the same workstation CPU • Wozniak and others showed Smith-Waterman (S-W) could be well-parallelised on special H/W. • What of other algorithms/hardware?
Motivation • GPUs have recently evolved general purpose programmability (GPGPU) • E.g.: nVidia 8800 GTX • 16 multiprocessors • 8 processors each • => 128 stream processors • 768MB onboard • 1.35GHz clock • Almost a year old now…
Short GPU Overview • Highly parallel execution (hundreds of simultaneous operations) • Hundreds of gigaflops per chip! • Large on-board memories (up to 2GB) • Limitations: • No recursion (no stacks) • Each multiprocessor’s constituent processors execute same instruction • Thread Divergence due to conditionals hurts… • No direct host memory access • Small caches (locality is key) • High memory latency • No dynamic memory allocation (why one would ever do that, I don’t know)
Short GPU Overview • GPGPU environments • Previously had to reduce problems to graphics primitives… no more • Simplified C-like programming • Paper has very little detail, but they make it sound enticingly simple… • Each processor runs the same ‘kernel’
Muh-muh-muh… MUMmer! • Maximal Unique Match • Find longest match for each subsequence of a read (of reasonable length) • Employs Suffix Trees
MUMmerGPU • Plug-and-play replacement for MUMmer • MUMmer is not ‘arithmetic intensive’ • Is the GPU a good fit? • Six-step process • 1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU • 2) Suffix Tree -> GPU Memory • 3) Queries -> GPU Memory • 4) Kick off the GPU… • 5) Results -> Host Memory • 6) Final processing on Host CPU
Suffix Trees • We want to find the longest match for a string (query) in the reference quickly • Suffix Trees permit O(m) string search, m = query length • Space complexity is O(n), n = reference length • But constants are apparently pretty big
Suffix Trees • Definition (for a string S of length n): • Each edge has an edge label • A substring of S • Non-empty (but can be just the terminating character) • A path label is the string formed by concatenating the edge labels from the root to a node • 1-1 correspondence of suffixes of S to root-to-leaf path labels • Internal nodes have at least 2 children • n leaf nodes – one for each suffix of S
Suffix Trees • O(n) space • n leaf nodes • => at most n – 1 internal nodes • => n + (n – 1) + 1 = 2n nodes (worst case) • Example: n = 3 leaves, n – 1 = 2 internal nodes, 3 + 2 + root = 6 nodes
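The node-count bound can be checked on small strings. The sketch below is not Ukkonen's algorithm: it builds a naive O(n^2) suffix trie and counts only the nodes that would survive compression into a suffix tree (leaves and branching nodes). The function names are made up for illustration, and n here counts the terminating '$' as a character.

```python
def build_suffix_trie(s):
    """Naive suffix trie as nested dicts, one level per character (O(n^2))."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def count_suffix_tree_nodes(s):
    """Count the nodes the compressed suffix tree of s would have.

    Leaves and branching nodes (>= 2 children) of the trie survive
    compression; single-child chains collapse into edge labels.
    Returns (leaves, internal_non_root, total_including_root).
    Assumes s ends in a unique terminator such as '$'.
    """
    root = build_suffix_trie(s)
    leaves = 0
    internal = 0
    stack = list(root.values())
    while stack:
        node = stack.pop()
        if not node:
            leaves += 1          # no children: end of a suffix
        elif len(node) >= 2:
            internal += 1        # branching node
        stack.extend(node.values())
    return leaves, internal, leaves + internal + 1


if __name__ == "__main__":
    for text in ("TORONTO$", "BANANA$"):
        leaves, internal, total = count_suffix_tree_nodes(text)
        print(text, leaves, internal, total, "bound:", 2 * len(text))
```

For both example strings the total stays below 2n, as the bound predicts.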
Suffix Trees • Example: TORONTO$ • ‘$’ is terminating character • [Figure: suffix tree for TORONTO$; edge labels T, O, NTO$, RONTO$, ORONTO$, O$, $; leaves numbered 0 to 6 by suffix start position]
Suffix Trees • Example: TORONTO$ • Searching for ‘ONT’ • [Figure: suffix tree for TORONTO$; edge labels T, O, NTO$, RONTO$, ORONTO$, O$, $; leaves numbered 0 to 6 by suffix start position] • ‘ONT’ at position 3 in S
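The search can be reproduced with a few lines of Python. This is a minimal sketch, not MUMmer's data structure: it uses an uncompressed suffix trie built naively in O(n^2), and each node simply remembers the start positions of the suffixes passing through it. `Node`, `build_suffix_trie` and `find` are illustrative names. The lookup itself walks one node per query character, which is the O(m) behaviour the slides describe.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.starts = []     # start positions of suffixes passing through here


def build_suffix_trie(text):
    """Insert every suffix of text; naive O(n^2) construction."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            node.starts.append(i)
    return root


def find(root, pattern):
    """Return the 0-based positions where pattern occurs, in O(len(pattern)) steps."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts


if __name__ == "__main__":
    trie = build_suffix_trie("TORONTO$")
    print(find(trie, "ONT"))   # positions are 0-based, as on the slide
    print(find(trie, "TO"))
```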
Suffix Trees • MUMmer wants to find all maximal unique matches for all suffixes: • E.g., for query ACCGTGCGTC, we want: • ACCGTGCGTC • CCGTGCGTC • CGTGCGTC • GTGCGTC • … • Up to some reasonable limit… • Don’t want to go back to root of tree each time…
Suffix Trees • Suffix Links • All internal, non-root nodes have a suffix link to another node • If x is a single character and a is a (possibly empty) string, then a node v whose path-label is xa has a suffix link to the node v’ whose path-label is a (i.e., the same string with its first character dropped). • Got that?
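A small sketch may make the definition concrete. It is a simplification under stated assumptions: it uses an uncompressed suffix trie (every prefix of every suffix is an explicit node), so suffix links can be filled in with a breadth-first pass rather than by Ukkonen's algorithm, and it reports only match lengths, ignoring MUMmer's uniqueness and minimum-length rules. All names are hypothetical. The point is the last line of `longest_matches`: after matching the query suffix starting at i, the suffix link jumps to the node for the same string minus its first character, so the next suffix continues from there instead of restarting at the root.

```python
from collections import deque


class Node:
    def __init__(self, depth):
        self.children = {}   # character -> Node
        self.link = None     # suffix link: node spelling the same string minus its first character
        self.depth = depth


def build_trie_with_links(text):
    root = Node(0)
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
    # Breadth-first pass: if v spells xa and link(v) spells a,
    # then the child of v via c (spelling xac) links to the child of link(v) via c.
    root.link = root
    queue = deque()
    for child in root.children.values():
        child.link = root
        queue.append(child)
    while queue:
        v = queue.popleft()
        for ch, child in v.children.items():
            child.link = v.link.children[ch]
            queue.append(child)
    return root


def longest_matches(root, query):
    """For each start i, length of the longest prefix of query[i:] found in the text."""
    lengths = []
    node = root
    j = 0                      # invariant: node spells query[i:j]
    for i in range(len(query)):
        if j < i:              # previous suffix matched nothing; restart at the root
            j = i
        while j < len(query) and query[j] in node.children:
            node = node.children[query[j]]
            j += 1
        lengths.append(j - i)
        node = node.link       # drop the first character instead of backtracking
    return lengths


if __name__ == "__main__":
    trie = build_trie_with_links("BANANA$")
    print(longest_matches(trie, "ANANA"))
```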
Suffix Trees • Example: TORONTO$ • Suffix Links… Don’t backtrack (bad ex.) • [Figure: suffix tree for TORONTO$; edge labels T, O, NTO$, RONTO$, ORONTO$, O$, $; leaves numbered 0 to 6 by suffix start position]
Suffix Trees • Example: BANANA$ • Better example of Suffix Links • [Figure: suffix tree for BANANA$; edge labels A, NA, BANANA$, NA$, $; leaves numbered 0 to 5 by suffix start position]
Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree for BANANA$; edge labels A, NA, BANANA$, NA$, $; leaves numbered 0 to 5 by suffix start position]
Memory Limitations • Suffix trees take up a fair bit of memory • GPUs have hundreds of MB, but this is still small • Divide the target sequence into ‘k’ segments with overlaps
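The segmentation step could look roughly like the sketch below. The slides do not say how large the overlap is; the sketch assumes it is at least the maximum match length minus one, so that any match up to that length lies entirely inside some segment. `split_with_overlap` and its parameters are hypothetical names.

```python
def split_with_overlap(reference, k, overlap):
    """Split reference into at most k segments of roughly equal size.

    Each segment is extended by `overlap` characters into its successor.
    Returns a list of (offset, segment) pairs; offset is the segment's
    start position in the original reference.
    """
    n = len(reference)
    step = -(-n // k)            # ceil(n / k)
    segments = []
    for start in range(0, n, step):
        end = min(n, start + step + overlap)
        segments.append((start, reference[start:end]))
    return segments


if __name__ == "__main__":
    max_match_len = 3            # assumed upper bound on match length
    for offset, seg in split_with_overlap("ACGTACGTAC", k=3, overlap=max_match_len - 1):
        print(offset, seg)
```

A match found at position p inside a segment maps back to position offset + p in the full reference; matches that fall wholly inside an overlap show up in two segments and would need de-duplicating.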
Cache Optimisation • Memory latency high, cache performance crucial • We’re walking a tree here, not crunching numbers down an array • Can store read-only data in 2D textures; nVidia caching scheme optimises access • Re-order and squish tree nodes into ‘texel blocks’ such that: • Nodes near root are level-ordered (BFS) • Nodes further down are ordered with descendants
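One way to picture the re-ordering is as a function that assigns each tree node a storage index. The sketch below is an assumption-heavy illustration, not the authors' layout: it level-orders every node above a hypothetical cut-off depth `bfs_depth`, then writes each deeper subtree out contiguously so a node sits near its descendants. How those indices are then packed into 2D texel blocks is not covered by the slides and is left out.

```python
from collections import deque


def layout_order(children, root, bfs_depth):
    """Return the nodes in storage order.

    children:  dict mapping a node to the list of its children
    bfs_depth: nodes at depth < bfs_depth are level-ordered (BFS);
               each subtree rooted at depth == bfs_depth is stored contiguously
    """
    order = []
    frontier = []                       # roots of the deeper subtrees
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth < bfs_depth:
            order.append(node)
            for child in children.get(node, []):
                queue.append((child, depth + 1))
        else:
            frontier.append(node)
    for top in frontier:                # pre-order walk keeps descendants together
        stack = [top]
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
    return order


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4], 2: [5], 3: [6]}
    print(layout_order(tree, root=0, bfs_depth=2))
```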
Cache Optimisation • Texture cache organized in 2x2 blocks. • Try to place all children of a node in the same cache block • [Figure: tree nodes numbered 1 to 29 in their storage order] • Shamelessly cribbed from: http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt
Cache Optimisation • Reference Sequence stored in 4x2^16 blocks of a 2D array • [Figure: the sequence A B C D E F G H … stored in the 2D array in the order A E B F C G D H … α Φ β Χ Γ Ψ Δ Ω] • Why? It worked well.
Cache Optimisation • Memory layouts heuristically determined • nVidia cache details not public • Cache optimisation improves execution speed ‘by several fold’.
Conclusions • GPGPU isn’t just good for ‘arithmetic intensive’ applications • 5-11x speed-up for NGS data
Conclusions • Fine Print: • 5-11x is for the Suffix Tree kernel on the GPU • Reality is different! • 3.5x speed-up for real data in terms of total application runtime. • Pretty constant across read lengths (35-700+ bp) • Careful management of memory layout is crucial • Authors claim several-fold performance increase (could be difference between some improvement and none)
Conclusions • Runtime dominated by serial parts of MUMmer
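Amdahl's law gives a rough feel for how a 5-11x kernel speed-up shrinks to about 3.5x overall. The sketch below is illustrative arithmetic only: it assumes the kernel and whole-application figures were measured on the same workload, which the slides do not state, and the function names are made up.

```python
def overall_speedup(parallel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speed-up when only a fraction is accelerated."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup)


def implied_parallel_fraction(overall, kernel_speedup):
    """Invert Amdahl's law: the fraction of original runtime spent in the accelerated part."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    for kernel in (5.0, 11.0):
        p = implied_parallel_fraction(3.5, kernel)
        print(f"kernel {kernel:g}x -> accelerated fraction {p:.2f}")
```

Under that assumption, a 3.5x overall gain corresponds to roughly 79% (11x kernel) to 89% (5x kernel) of the original runtime being in the suffix-tree kernel; the remaining serial work then caps the gain no matter how fast the GPU part becomes.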
Food for Thought • 8800 GTX costs ~$400, uses 100-150 watts • Quad Core 2 chip runs ~$250, uses 100-130 watts • Each core approx. 2x faster than their test CPU • MUMmerGPU maximally 3.5x faster than test CPU • What have we won here?
Food for Thought • Confusing reports • “Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement • Earlier course paper (early/mid-2007) • Why from 35x down to 5-11x with MUMmerGPU?
My Impressions… • (…whatever they’re worth) • GPU is not a clear win (in this case) • Suffix trees seem unsuited: • Cache locality trouble • O(n) footprint, but multiplicative constants are still substantial • Host CPUs seem to be as good or better (in $ and watts)
My Impressions… • GPGPUs aren’t a great fit here • At least for this algorithm… • MUMmerGPU isn’t the order-of-magnitude win it claims to be • But this is a first-generation, general-purpose chip • geared toward number-crunching, not pointer-traversing • I don’t think we’ve seen the last (nor the best) of GPUs…