
High-throughput sequence alignment using Graphics Processing Units


Presentation Transcript


  1. High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by Steve Rumble

  2. Motivation • NGS technologies produce a ton of data • AB SOLiD: 22e6 25-mers • Others are even worse… • How does 200e6 50-mers sound? • Alignment algorithms have been pushed hard, but typically assume a single workstation CPU • Wozniak and others showed Smith-Waterman (S-W) could be parallelised well on special hardware • What about other algorithms and hardware?

  3. Motivation • GPUs have recently evolved general purpose programmability (GPGPU) • E.g.: nVidia 8800 GTX • 16 multiprocessors • 8 processors each • => 128 stream processors • 768MB onboard • 1.35GHz clock • Almost a year old now…

  4. Short GPU Overview • Highly parallel execution (hundreds of simultaneous operations) • Hundreds of gigaflops per chip! • Large on-board memories (up to 2GB) • Limitations: • No recursion (no stacks) • Each multiprocessor’s constituent processors execute the same instruction • Thread divergence due to conditionals hurts… • No direct host memory access • Small caches (locality is key) • High memory latency • No dynamic memory allocation (why one would ever do that, I don’t know)

  5. Short GPU Overview • GPGPU environments • Previously had to reduce problems to graphics primitives… no more • Simplified C-like programming • Paper has very little detail, but they make it sound enticingly simple… • Each processor runs the same ‘kernel’
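
To make the “every processor runs the same kernel” model concrete, here is a minimal CUDA sketch; the kernel name and the per-element work are invented for illustration, not taken from the paper. The data-dependent branch inside the kernel is exactly the kind of conditional that causes the thread divergence slide 4 warns about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread runs this same function on its own element.
// The data-dependent branch is the kind of conditional that causes thread
// divergence within a multiprocessor.
__global__ void toyKernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)          // threads in the same warp may take different
        out[i] = in[i] / 2;      // sides of this branch, serialising execution
    else
        out[i] = 3 * in[i] + 1;
}

int main() {
    const int n = 1024;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // One thread per element; the same kernel runs on all stream processors.
    toyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("out[7] = %d\n", h_out[7]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```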

  6. Muh-muh-muh… MUMmer! • Maximal Unique Match • Find the longest match for each suffix of a read (above some reasonable minimum length) • Employs suffix trees

  7. MUMmerGPU • Plug-and-play replacement for MUMmer • MUMmer is not ‘arithmetic intensive’ • Is the GPU a good fit? • Six-step process • 1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU • 2) Suffix Tree -> GPU Memory • 3) Queries -> GPU Memory • 4) Kick off the GPU… • 5) Results -> Host Memory • 6) Final processing on Host CPU
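
A hedged host-side sketch of steps 2–5 of the pipeline above, using the standard CUDA runtime API. The flattened node layout, buffer names, and `matchKernel` are placeholders for illustration; MUMmerGPU's actual data structures and kernel are not reproduced here.

```cuda
#include <cuda_runtime.h>

// Placeholder flattened suffix-tree node; MUMmerGPU's real layout differs.
struct FlatNode { int firstChild; int suffixStart; int edgeLen; };

// Placeholder kernel: one thread matches one query against the tree.
__global__ void matchKernel(const FlatNode *tree, const char *queries,
                            const int *queryOffsets, int numQueries,
                            int *results) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q < numQueries) results[q] = -1;   // real matching logic omitted
}

void alignOnGpu(const FlatNode *tree, size_t numNodes,
                const char *queries, size_t queryBytes,
                const int *queryOffsets, int numQueries, int *results) {
    FlatNode *dTree;  char *dQueries;  int *dOffsets, *dResults;

    // Step 2: copy the suffix tree (built on the host CPU) to GPU memory.
    cudaMalloc(&dTree, numNodes * sizeof(FlatNode));
    cudaMemcpy(dTree, tree, numNodes * sizeof(FlatNode), cudaMemcpyHostToDevice);

    // Step 3: copy the query reads to GPU memory.
    cudaMalloc(&dQueries, queryBytes);
    cudaMemcpy(dQueries, queries, queryBytes, cudaMemcpyHostToDevice);
    cudaMalloc(&dOffsets, numQueries * sizeof(int));
    cudaMemcpy(dOffsets, queryOffsets, numQueries * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc(&dResults, numQueries * sizeof(int));

    // Step 4: kick off the GPU, one thread per query.
    matchKernel<<<(numQueries + 255) / 256, 256>>>(dTree, dQueries, dOffsets,
                                                   numQueries, dResults);

    // Step 5: copy results back to host memory for final processing (step 6).
    cudaMemcpy(results, dResults, numQueries * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dTree); cudaFree(dQueries); cudaFree(dOffsets); cudaFree(dResults);
}
```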

  8. Suffix Trees • We want to find the longest match of a query string within the reference quickly • Suffix trees permit O(m) string search, where m is the query length • Space complexity is O(n), where n is the reference length • But the constants are apparently pretty big

  9. Suffix Trees • Definition: • Each edge has an edge label • A non-empty substring of S (it may be just the terminating character) • A path label is the string formed by concatenating edge labels from the root down to a node • 1-1 correspondence between suffixes of S and the path labels of leaves • Internal nodes have at least 2 children • n leaf nodes – one for each suffix of S

  10. Suffix Trees • O(n) space • n leaf nodes • => at most n – 1 internal nodes • => n + (n – 1) + 1 = 2n nodes (worst case) • Example from the figure: n = 3, so 3 leaves + 2 internal nodes + root = 6 nodes

  11. Suffix Trees • Example: TORONTO$ • ‘$’ is the terminating character • [Figure: suffix tree of TORONTO$, leaves labelled with suffix start positions]

  12. Suffix Trees • Example: TORONTO$ • Searching for ‘ONT’ • [Figure: suffix tree of TORONTO$, search for ‘ONT’ in progress]

  13. Suffix Trees • Example: TORONTO$ • Searching for ‘ONT’ • [Figure: suffix tree of TORONTO$, search for ‘ONT’ in progress]

  14. Suffix Trees • Example: TORONTO$ • Searching for ‘ONT’ • [Figure: suffix tree of TORONTO$, search for ‘ONT’ in progress]

  15. Suffix Trees • Example: TORONTO$ • Searching for ‘ONT’ • [Figure: suffix tree of TORONTO$, completed search] • ‘ONT’ at position 3 in S
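
To tie the walkthrough together, here is a small self-contained sketch of a suffix tree and the O(m) descent it enables. The tree is built naively in O(n²) time for brevity (MUMmer uses Ukkonen's O(n) construction), and the example mirrors the slides: searching for ‘ONT’ in TORONTO$ reports position 3.

```cpp
#include <cstdio>
#include <map>
#include <string>

// A minimal suffix tree over S. Each edge label is stored as offsets into S
// rather than as a copied string.
struct Node {
    int start, len;                 // edge label into this node = S[start, start+len)
    int suffixStart;                // starting position of the suffix; -1 if internal
    std::map<char, Node*> child;    // children keyed by first character of their edge
    Node(int s, int l, int suf) : start(s), len(l), suffixStart(suf) {}
};

// Insert the suffix S[p..] by walking down from the root, splitting an edge
// if the walk stops in the middle of one.
void insertSuffix(Node *root, const std::string &S, int p) {
    Node *cur = root;
    int i = p;
    while (true) {
        auto it = cur->child.find(S[i]);
        if (it == cur->child.end()) {               // no edge starts with S[i]: new leaf
            cur->child[S[i]] = new Node(i, (int)S.size() - i, p);
            return;
        }
        Node *next = it->second;
        int k = 0;                                  // match along the edge label
        while (k < next->len && S[next->start + k] == S[i + k]) ++k;
        if (k == next->len) { cur = next; i += k; continue; }
        Node *mid = new Node(next->start, k, -1);   // mismatch mid-edge: split it
        cur->child[S[i]] = mid;
        next->start += k; next->len -= k;
        mid->child[S[next->start]] = next;
        mid->child[S[i + k]] = new Node(i + k, (int)S.size() - (i + k), p);
        return;
    }
}

// O(m) search: walk down comparing query characters against edge labels.
// Returns one position where the query occurs in S, or -1 if it is absent.
int find(const Node *root, const std::string &S, const std::string &q) {
    const Node *cur = root;
    size_t i = 0;
    while (i < q.size()) {
        auto it = cur->child.find(q[i]);
        if (it == cur->child.end()) return -1;
        const Node *next = it->second;
        for (int k = 0; k < next->len && i < q.size(); ++k, ++i)
            if (S[next->start + k] != q[i]) return -1;
        cur = next;
    }
    while (cur->suffixStart < 0)                    // any leaf below the match
        cur = cur->child.begin()->second;           // gives an occurrence position
    return cur->suffixStart;
}

int main() {
    std::string S = "TORONTO$";                     // '$' is the terminating character
    Node root(0, 0, -1);
    for (int p = 0; p < (int)S.size(); ++p) insertSuffix(&root, S, p);
    std::printf("'ONT' found at position %d in S\n", find(&root, S, "ONT"));  // prints 3
    return 0;
}
```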

  16. Suffix Trees • MUMmer wants to find all maximal unique matches for all suffixes: • E.g., for query ACCGTGCGTC, we want: • ACCGTGCGTC • CCGTGCGTC • CGTGCGTC • GTGCGTC • … • Up to some reasonable limit… • Don’t want to go back to root of tree each time…

  17. Suffix Trees • Suffix Links • All internal, non-root nodes have a suffix link to another node • If x is a single character and a is a (possibly empty) string (subsequence), then the path from the root to a node v spelling ax (path-label is ax) has a suffix link to node v’, whose path-label is a. • Got that?

  18. Suffix Trees • Example: TORONTO$ • Suffix links… don’t backtrack to the root (TORONTO$ is a poor example of this, though) • [Figure: suffix tree of TORONTO$]

  19. Suffix Trees • Example: BANANA$ • Better example of suffix links • [Figure: suffix tree of BANANA$ with suffix links between internal nodes]

  20. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, search step]

  21. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, search step]

  22. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, search step]

  23. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, search step]

  24. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, search step]

  25. Suffix Trees • Example: BANANA$ • Searching for suffixes of ‘ANANA’ • [Figure: suffix tree of BANANA$, completed search]

  26. Memory Limitations • Suffix trees take up a fair bit of memory • GPUs have 100’s of MBs, but this is still small • Divide the target sequence into ‘k’ segments with overlaps
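
A small sketch of the segmentation idea, under the assumption that the reference is cut into roughly equal pieces that overlap by at least the maximum match length we care about, so no match is lost at a segment boundary. The function name and parameters are illustrative, not MUMmerGPU's API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Split a reference sequence into roughly k segments that overlap by `overlap`
// characters, so a match spanning a segment boundary is still found entirely
// inside one segment (each segment's tree then fits in GPU memory).
std::vector<std::string> splitReference(const std::string &ref,
                                        int k, std::size_t overlap) {
    std::vector<std::string> segments;
    std::size_t base = (ref.size() + k - 1) / k;     // nominal segment length
    for (std::size_t start = 0; start < ref.size(); start += base) {
        std::size_t len = std::min(base + overlap, ref.size() - start);
        segments.push_back(ref.substr(start, len));  // last piece may be shorter
    }
    return segments;
}
```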

  27. Cache Optimisation • Memory latency is high, so cache performance is crucial • We’re walking a tree here, not crunching numbers down an array • Can store read-only data in 2D textures; nVidia’s caching scheme optimises access • Re-order and squish tree nodes into ‘texel blocks’ such that: • Nodes near the root are level-ordered (BFS) • Nodes further down are grouped with their descendants
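
For flavour, here is a sketch of how read-only data can be placed behind the texture cache using the legacy CUDA texture-reference API that was current when the paper was written (it has since been removed from newer CUDA releases). The array contents and the kernel are invented; this only illustrates the mechanism of binding a 2D cudaArray to a texture and fetching through tex2D.

```cuda
#include <cuda_runtime.h>

// Legacy 2D texture reference; reads through it go via the texture cache.
texture<int, 2, cudaReadModeElementType> nodeTex;

// Hypothetical kernel: fetch a value by 2D texel coordinate.
__global__ void touchNodes(int *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(nodeTex, x, y);   // cached read-only fetch
}

int main() {
    const int width = 64, height = 64;
    int host[width * height];
    for (int i = 0; i < width * height; ++i) host[i] = i;

    // Read-only data lives in a cudaArray bound to the texture reference.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, host, width * sizeof(int),
                        width * sizeof(int), height, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(nodeTex, arr, desc);

    int *dOut;
    cudaMalloc(&dOut, width * height * sizeof(int));
    dim3 block(16, 16), grid(width / 16, height / 16);
    touchNodes<<<grid, block>>>(dOut, width, height);   // results not copied back in this sketch

    cudaUnbindTexture(nodeTex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return 0;
}
```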

  28. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Cache Optimisation • Texture cache organized in 2x2 blocks. • Try to place all children of a node are in the same cache block Shamelessly cribbed from: http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

  29. Cache Optimisation • Reference sequence stored in 4 x 2^16 blocks of a 2D array • [Figure: the linear sequence A B C D E F G H … is reordered/interleaved across the 2D array] • Why? It worked well.
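
The exact block shape MUMmerGPU uses isn't spelled out here, but the underlying trick is just mapping a linear sequence offset to 2D coordinates so the data can live in a (cached) 2D texture. A generic sketch, assuming a row width of 2^16 texels:

```cpp
#include <cstdio>

// Generic illustration (not necessarily MUMmerGPU's exact layout): map a linear
// offset in the reference sequence to 2D coordinates in an array whose width is
// fixed, e.g. by the maximum texture dimension (2^16 texels in this sketch).
constexpr long long kRowWidth = 1LL << 16;

struct Texel { int x; int y; };

Texel offsetToTexel(long long offset) {
    return Texel{ static_cast<int>(offset % kRowWidth),    // column within the row
                  static_cast<int>(offset / kRowWidth) };  // which row
}

int main() {
    long long offset = 200000;                    // arbitrary position in the reference
    Texel t = offsetToTexel(offset);
    std::printf("offset %lld -> (x=%d, y=%d)\n", offset, t.x, t.y);
    return 0;
}
```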

  30. Cache Optimisation • Memory layouts heuristically determined • nVidia cache details not public • Cache optimisation improves execution speed ‘by several fold’.

  31. Conclusions • GPGPU isn’t just good for ‘arithmetic intensive’ applications • 5-11x speed-up for NGS data

  32. Conclusions • Fine print: • 5-11x is for the suffix tree kernel on the GPU • Reality is different! • 3.5x speed-up for real data in terms of total application runtime • Pretty constant across read lengths (35-700+ bp) • Careful management of memory layout is crucial • The authors claim the layout gives a several-fold performance increase (which could be the difference between some speed-up and none)

  33. Conclusions • Runtime dominated by serial parts of MUMmer

  34. Food for Thought • 8800 GTX costs ~$400, uses 100-150 watts • Quad Core 2 chip runs ~$250, uses 100-130 watts • Each core approx. 2x faster than their test CPU • MUMmerGPU maximally 3.5x faster than test CPU • What have we won here?

  35. Food for Thought • Confusing reports • “Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement • Earlier course paper (early/mid-2007) • Why the drop from 35x down to 5-11x with MUMmerGPU?

  36. My Impressions… • (…whatever they’re worth) • GPU is not a clear win (in this case) • Suffix trees seem unsuited: • Cache locality trouble • O(n) footprint, but multiplicative constants are still substantial • Host CPUs seem to be as good or better (in $ and watts)

  37. My Impressions… • GPGPUs aren’t a great fit here • At least for this algorithm… • MUMmerGPU isn’t the order-of-magnitude win it claims to be • But this is a first-generation, general-purpose chip • geared toward number-crunching, not pointer-traversing • I don’t think we’ve seen the last (nor the best) of GPUs…
