MUMmer: fast alignment of large-scale DNA and protein sequences • Presented by: Arthur Loder • Course: CSE 497 Computational Issues in Molecular Biology • Date: February 17, 2004
MUMmer’s Significance • MUMmer is a system for rapidly aligning entire genomes or very large protein sequences • Input: 2 sequences; Output: a base-to-base alignment • What distinguishes MUMmer from previous algorithms? • It can align very large sequences (millions of nucleotides long)
Complete Genome Alignment • What it means to align complete genomes: • Human chromosome #1: about 200 million base pairs • Previous alignment algorithms can run space-efficiently in O(n), but their O(n²) time complexity is unacceptably slow • We would like to align with a global dynamic programming algorithm (Needleman-Wunsch), but that is infeasible at this scale; we need a shortcut
MUMmer vs. BLAST • Instead, MUMmer uses a technique somewhat similar to BLAST (find similar regions between strings) • BLAST: compare one known query sequence against a large set of database sequences • MUMmer: align two very similar, fully known sequences
Overview: MUMs Part 1 • String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… • String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… • How align without Needleman/Wunsch ?
Overview: MUMs Part 2 • String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… • String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… • Easier with large exact matches highlighted?
Overview: MUMs Part 3 • Any optimal global alignment will probably use these two subsequences as “anchors” • This is the shortcut needed to calculate a global alignment quickly on large sequences • Intuitive as an alignment strategy, but only a heuristic
Overview: MUMs Part 4 • MUM: Maximal Unique Match • A MUM occurs exactly once in each sequence and cannot be extended in either direction (that is what “maximal” means) • Uniqueness means repeat sequences are ignored • MUMs are found efficiently using a suffix tree data structure (explained later)
Overview: Choosing MUMs • Once we have anchors, we need to choose which ones to use in the alignment • All MUMs: • MUMs used in the alignment (a subset):
Overview: Closing the Gaps • After choosing and aligning anchors, what next? • Close the gaps • Use dynamic programming (Smith-Waterman) to extend the MUMs • The regions are small, so this can be computed quickly • Implicit assumption: the sequences are very similar
3 Phases of MUMmer • 3 phases: • 1) Obtaining MUMs (via suffix trees) • 2) Choosing MUMs (via longest increasing subsequences) • 3) Closing gaps (via dynamic programming, Smith-Waterman) • Composed of previously known algorithms, packaged together to form a new alignment system
Critiquing MUMmer’s Output • Sample output: Sequence A: …ACTGC_TGAC_CTA… Sequence B: …ACC_CA_GGCTCG_… (gaps marked with “_”) • MUMmer best case: the same alignment as Needleman-Wunsch • MUMmer worst case: a suboptimal alignment • At least it is computable, whereas Needleman-Wunsch is not feasible at this scale
Phase 1: Obtaining MUMs via Suffix Trees Edward M. McCreight (1976). A Space-Economical Suffix Tree Construction Algorithm. http://doi.acm.org/10.1145/321941.321946
Suffix Trees Outline • I. Suffix Trees • A. Motivations • B. Tries • C. Suffix Trees • D. How MUMmer utilizes suffix trees
Tries • The term “trie” comes from “retrieval” • Introduced in the 1960s by Fredkin • Suffix trees are a type of trie • Uses: • Quickly search large texts after preprocessing • Used for regular expressions, longest common substring, automatic command completion, etc.
Non-Compact Trie Example • 5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH • Every edge represents a symbol of the alphabet
Implementation of Tries • One option: a linked-list representation • Each node keeps pointers to its next sibling and its first child (a minimal sketch follows)
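A minimal sketch of the linked-list trie just described, assuming one edge per character and first-child/next-sibling pointers; class and function names are illustrative, not from MUMmer's code.

```python
class TrieNode:
    def __init__(self, char=None):
        self.char = char           # character on the edge leading to this node
        self.first_child = None    # leftmost child
        self.next_sibling = None   # next child of this node's parent
        self.is_word = False       # marks the end of an inserted string

def _find_child(node, char):
    child = node.first_child
    while child is not None and child.char != char:
        child = child.next_sibling
    return child

def insert(root, word):
    node = root
    for c in word:
        child = _find_child(node, c)
        if child is None:          # prepend a new child for this character
            child = TrieNode(c)
            child.next_sibling = node.first_child
            node.first_child = child
        node = child
    node.is_word = True

def contains(root, word):
    node = root
    for c in word:
        node = _find_child(node, c)
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:   # the five example strings
    insert(root, w)
print(contains(root, "BILL"), contains(root, "GO"))   # True False
```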
Compacting Tries : Part 1 • Method 1: trim chains leading to leaves • Compact trie for strings: BIG, BIGGER, BILL, GOOD, GOSH
Compacting Tries : Part 2 • Method 2: Patricia Tries • Before, one edge per character • Now, unary nodes are collapsed
Suffix Trie • A normal trie, but the input strings are the suffixes of a text • Assume a text string t_1 … t_n • Q: How many leaves does the tree have? • A: n leaves
Suffix Tree • First compact suffix trie • Next collapse unary nodes
Suffix Trees: Decreasing storage • Rather than storing substrings on the edges, store a pair of indices (x, y), where x is the start and y the end of the substring in the original text • Since the tree has O(n) edges and each label is just two integers, storage becomes O(n)
Suffix Tree Algorithms • The first linear-time algorithm was given by Weiner (1973) • McCreight developed a more space-efficient algorithm (1976) • Both original papers have a reputation for being difficult to understand
McCreight’s Algorithm Part 1 • Algorithm M: • “Maps finite string S of characters into auxiliary index to S in the form of a digital search tree T whose paths are the suffixes of S, and whose terminal nodes correspond uniquely to positions within S.” A Space-Economical Suffix Tree Construction Algorithm. Edward M. McCreight. http://doi.acm.org/10.1145/321941.321946
McCreight’s Algorithm Part 2 • S = ababc • Definitions: • suf_i is the suffix of S beginning at character position i • head_i is the longest prefix of suf_i that is also a prefix of suf_j for some j < i • tail_i is suf_i minus head_i • suf_3 = ? abc • head_3 = ? ab • tail_3 = ? abc – ab = c
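A tiny brute-force check of these definitions (illustrative only; positions are 1-based to match the slide):

```python
def suf(s, i):
    return s[i - 1:]

def head(s, i):
    """Longest prefix of suf_i that is also a prefix of some suf_j with j < i."""
    best = ""
    for j in range(1, i):
        other = suf(s, j)
        k = 0
        while k < len(suf(s, i)) and k < len(other) and suf(s, i)[k] == other[k]:
            k += 1
        best = max(best, suf(s, i)[:k], key=len)
    return best

def tail(s, i):
    return suf(s, i)[len(head(s, i)):]

S = "ababc"
print(suf(S, 3), head(S, 3), tail(S, 3))   # abc ab c
```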
McCreight’s Algorithm Part 3 • Builds the suffix tree by adding suf_i to Tree_(i-1) • Initially, Tree_1 contains only suf_1 (the entire string) • To obtain Tree_2, add suf_2 to Tree_1 • Continue until suf_n has been added to Tree_(n-1) • Tree_n is the final suffix tree
McCreight’s Algorithm Part 4 • Adding a suffix (going from Tree_2 to Tree_3) • suf_3 = abc; head_3 = ab; tail_3 = c • A naive version of this insertion step is sketched below
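A naive sketch of the incremental construction just described: insert suf_i into the tree built from suf_1 … suf_(i-1), with edge labels stored as (start, end) index pairs into the text (the O(n)-storage trick from the earlier slide). This version runs in O(n²); McCreight's suffix links are what bring it down to linear time. Names and the "$" terminator convention are illustrative, not MUMmer's code.

```python
class STNode:
    def __init__(self):
        self.children = {}   # first character of edge -> (start, end, child)

def build_suffix_tree(text):
    text += "$"                              # unique terminator: every suffix ends at a leaf
    root = STNode()
    for i in range(len(text)):               # insert suf_i
        node, k = root, i
        while True:
            c = text[k]
            if c not in node.children:       # nothing matches: the rest is tail_i, a new leaf
                node.children[c] = (k, len(text), STNode())
                break
            start, end, child = node.children[c]
            j = 0                            # walk down the edge while characters agree (head_i)
            while j < end - start and text[start + j] == text[k + j]:
                j += 1
            if j == end - start:             # consumed the whole edge, continue below it
                node, k = child, k + j
                continue
            mid = STNode()                   # mismatch inside the edge: split it at offset j
            mid.children[text[start + j]] = (start + j, end, child)
            mid.children[text[k + j]] = (k + j, len(text), STNode())
            node.children[c] = (start, start + j, mid)
            break
    return root, text

def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, _, child in node.children.values())

root, text = build_suffix_tree("ababc")
print(count_leaves(root))                    # 6 leaves, one per suffix of "ababc$"
```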
Suffix Trees Complexity • Adding a new internal node and the new arc corresponding to tail_i takes at most constant time • If head_i could be found in (amortized) constant time, the construction would run in time linear in n, the length of the string S • McCreight achieves this using suffix links (see the paper for details)
Finding MUMs from Suffix Trees Part 3 • More general case: build a single suffix tree containing the suffixes of both sequences; a MUM corresponds to an internal node with exactly two leaf children, one from each sequence, whose match cannot be extended to the left • A naive, non-suffix-tree sketch of what counts as a MUM follows
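For intuition only, here is a brute-force MUM finder usable on tiny strings. MUMmer finds MUMs in linear time with the suffix tree, not by scanning substrings like this; the function name, the min_len cutoff, and the demo sequences are all made up.

```python
def count_overlapping(s, sub):
    # str.count misses overlapping occurrences, so count them explicitly
    n, pos = 0, s.find(sub)
    while pos != -1:
        n += 1
        pos = s.find(sub, pos + 1)
    return n

def find_mums(a, b, min_len=4):
    """A MUM is a match occurring exactly once in a, exactly once in b,
    and not extendable left or right while still matching at those positions."""
    mums = []
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            m = a[i:j]
            if count_overlapping(a, m) != 1 or count_overlapping(b, m) != 1:
                continue                                  # must be unique in both
            k = b.find(m)
            left = i > 0 and k > 0 and a[i - 1] == b[k - 1]
            right = j < len(a) and k + len(m) < len(b) and a[j] == b[k + len(m)]
            if not left and not right:                    # maximal: no extension matches
                mums.append((i, k, m))
    return mums

A = "GGGACCTGAGTTT"
B = "CCCACCTGAGAAA"
print(find_mums(A, B))    # [(3, 3, 'ACCTGAG')]
```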
Phase 2: Choosing MUMs For Alignment via Longest Increasing Subsequence (LIS) Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.
Motivation For Choosing MUMs • Q: Why can’t we use all MUMs for the alignment? • A: MUMs can cross one another; we can only keep a set of MUMs that appears in the same (increasing) order in both sequences • Problem: given a set of MUMs, how do we choose the optimal subset?
Choosing MUMs (Continued) • The configuration can be uniquely represented as a sequence of numbers: • P = {1, 2, 3, 4, 6, 7, 5} • LIS(P) = {1, 2, 3, 4, 6, 7} • Determining the optimal set of MUMs reduces to finding the LIS of P
IS Definition • Increasing Subsequence: values (strictly) increase from left to right • Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} • Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10}
DS Definition • Decreasing subsequence: values decrease from left to right • Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} • Examples? <insert class participation here> {4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6}
Covers Definition Part 1 • Cover of P: a set of decreasing subsequences of P that together contain all the numbers of P • P = {7, 3, 4, 8, 6, 2, 1} • Some possible covers? {{7, 3} {4} {8} {6, 2, 1}} or {{7, 3, 2, 1} {4} {8, 6}}, and others (a small checker is sketched below)
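A quick sanity check of the cover definition (illustrative only, assuming the values of P are distinct): every number of P must appear in some pile, each pile must respect the left-to-right order of P, and each pile must be decreasing.

```python
def is_cover(p, cover):
    # every number of P appears exactly once across the piles
    if sorted(x for pile in cover for x in pile) != sorted(p):
        return False
    for pile in cover:
        idx = [p.index(x) for x in pile]             # positions in P (values are distinct)
        if idx != sorted(idx):                       # must respect left-to-right order
            return False
        if any(pile[i] <= pile[i + 1] for i in range(len(pile) - 1)):
            return False                             # must be decreasing
    return True

P = [7, 3, 4, 8, 6, 2, 1]
print(is_cover(P, [[7, 3], [4], [8], [6, 2, 1]]))    # True
print(is_cover(P, [[7, 3, 2, 1], [4], [8, 6]]))      # True
print(is_cover(P, [[7, 3], [4], [8, 6]]))            # False: 2 and 1 are missing
```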
Covers Definition Part 2 • Size of a cover: the number of decreasing subsequences it contains • Smallest cover: a cover with minimum size • “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P” • Why?
Covers & Relation to LIS • “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P”. Why? • Because no increasing subsequence can contain more than one element from each decreasing subsequence in a cover
Covers: Examples • P = {7, 3, 4, 8, 6, 2, 1} • Two possible covers: {{7, 3} {4} {8} {6, 2, 1}} {{7, 3, 2, 1} {4} {8, 6}} • What is the size of the smallest cover? • 3 (no cover can contain < 3 decreasing subsequences) • How many elements in LIS? • 3
Covers: Examples Continued • P = {7, 3, 4, 8, 6, 2, 1} • For a particular cover, say {{7, 3, 2, 1} {4} {8, 6}}: • You can choose only one element from each decreasing subsequence, otherwise the chosen subsequence would not be increasing • Example: IS {3, 4, 6}; to add an element, you would need to pick from a subsequence you have already used
Greedy Cover Algorithm Part 1 • To create a smallest cover, use the Greedy Cover algorithm: • Start from the left of sequence P • Examine each number in turn • Place the number at the end of the left-most decreasing subsequence it can extend • If none exists, start a new decreasing subsequence (at the far right); see the sketch after the example below
Greedy Cover Algorithm Part 2 • Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4 • {6} • {6, 3} • {6, 3} {5} • {6, 3, 1} {5} • {6, 3, 1} {5} {9} • {6, 3, 1} {5} {9, 7} • {6, 3, 1} {5, 2} {9, 7} • {6, 3, 1} {5, 2} {9, 7} {10} • {6, 3, 1} {5, 2} {9, 7, 4} {10} • {6, 3, 1}{5, 2}{9, 7, 4} {10} (smallest cover)
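A sketch of the Greedy Cover algorithm above (function name is mine, not from MUMmer): each number is appended to the left-most decreasing subsequence ("pile") whose last element is larger than it, otherwise it starts a new pile on the right. A linear scan over the piles is used for clarity; because the pile tails are always increasing, a binary search over them gives the O(n log n) bound quoted on a later slide.

```python
def greedy_cover(p):
    piles = []
    for x in p:
        for pile in piles:
            if pile[-1] > x:          # left-most pile this number can extend
                pile.append(x)
                break
        else:                         # no pile can take it: start a new one
            piles.append([x])
    return piles

P = [6, 3, 5, 1, 9, 7, 2, 10, 4]
print(greedy_cover(P))                # [[6, 3, 1], [5, 2], [9, 7, 4], [10]]
```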
Obtaining LIS From Smallest Cover • LIS algorithm: • Set i = number of subsequences in the greedy cover • Set I to the empty list • Choose any element x in subsequence i and place it at the front of list I • While i > 1: • Scan subsequence i − 1 from the top and find the first element y smaller than x • Set x = y and i = i − 1 • Place x at the front of list I
Obtaining LIS: Example • P = {6, 3, 5, 1, 9, 7, 2, 10, 4} • Smallest cover: {6, 3, 1} {5, 2} {9, 7, 4} {10} • i = # subsequences = 4 • i = 4; x = 10; I = {10} • i = 3; x = 9; I = {9, 10} • i = 2; x = 5; I = {5, 9, 10} • i = 1; x = 3; I = {3, 5, 9, 10} • LIS: {3, 5, 9, 10}
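A sketch of this extraction step, working directly on the smallest cover produced by the greedy algorithm (names are mine). Each pile is stored largest-first, so "scan from the top" is just a left-to-right scan.

```python
def lis_from_cover(cover):
    i = len(cover) - 1
    x = cover[i][0]                                   # any element of the last pile
    lis = [x]
    while i > 0:
        i -= 1
        x = next(y for y in cover[i] if y < x)        # first element smaller than x
        lis.insert(0, x)                              # place at the front of I
    return lis

cover = [[6, 3, 1], [5, 2], [9, 7, 4], [10]]          # smallest cover of P from the slide
print(lis_from_cover(cover))                          # [3, 5, 9, 10]
```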
How MUMmer Utilizes LIS • P = {1, 2, 3, 4, 6, 7, 5} • LIS = {1, 2, 3, 4, 6, 7} • A sketch of how the LIS step plugs into MUM choosing follows
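One way this step might look in code (a sketch, not MUMmer's implementation): sort the MUMs by their start position in sequence A; the order of their start positions in B defines P, and the LIS of P keeps a non-crossing chain of anchors. The (start_in_A, start_in_B, length) tuple format and the demo numbers are hypothetical, and a plain O(n²) LIS routine stands in for the greedy-cover method of the previous slides.

```python
def lis_indices(p):
    # Plain O(n^2) LIS by dynamic programming; the greedy-cover method from
    # the previous slides would work just as well (and in O(n log n)).
    best = [1] * len(p)
    prev = [-1] * len(p)
    for i in range(len(p)):
        for j in range(i):
            if p[j] < p[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(len(p)), key=lambda k: best[k])
    chosen = []
    while i != -1:
        chosen.append(i)
        i = prev[i]
    return chosen[::-1]            # indices into p of one longest increasing subsequence

def choose_anchors(mums):
    mums = sorted(mums, key=lambda m: m[0])            # order by position in A
    keep = lis_indices([m[1] for m in mums])           # LIS over positions in B
    return [mums[i] for i in keep]

# Seven hypothetical MUMs whose B-positions play the role of P = 1 2 3 4 6 7 5:
mums = [(0, 10, 20), (30, 40, 15), (55, 70, 25), (90, 100, 18),
        (120, 160, 22), (150, 190, 16), (180, 130, 12)]
print(choose_anchors(mums))        # drops the crossing MUM whose B-position is 130
```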
Obtaining LIS: Complexity Analysis • Greedy cover can be found in O(n log n) • LIS found from greedy cover in O(n)
Phase 3: Closing the Gaps via Smith-Waterman Smith, T. and Waterman, M. (1981) Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147, 195–197. http://citeseer.ist.psu.edu/smith81identification.html
Closing the Gaps • After the global MUM alignment is found, close the remaining local gaps • Gap: an interruption in the MUM alignment • Types of gaps: • 1) SNP (single nucleotide polymorphism) • 2) Insertion • 3) Highly polymorphic region • 4) Repeat • A textbook Smith-Waterman sketch for closing a small gap follows
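A textbook Smith-Waterman sketch for aligning the short region between two anchors. The match/mismatch/gap scores are illustrative; MUMmer's own gap-closing code and parameters differ.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming: fill the score matrix,
    then trace back from the best cell until a zero score is reached."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    i, j = best_pos
    aln_a, aln_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            aln_a.append(a[i - 1]); aln_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            aln_a.append(a[i - 1]); aln_b.append("-"); i -= 1
        else:
            aln_a.append("-"); aln_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(aln_a)), "".join(reversed(aln_b))

print(smith_waterman("ACGTTAGC", "ACGTAGC"))    # aligns the two sequences with one gap
```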