Rapid Global Alignments

Rapid Global Alignments How to align genomic sequences in (more or less) linear time

Motivation • Genomic sequences are very long: • Human genome = 3 x 109 –long • Mouse genome = 2.7 x 109 –long • Aligning genomic regions is useful for revealing common gene structure • Useful to compare regions > 1,000,000-long

Main Idea Genomic regions of interest contain ordered islands of similarity, such as genes • Find local alignments • Chain an optimal subset of them

Outline • Methods to FIND Local Alignments • Sorting k-long words • Suffix Trees • Methods to CHAIN Local Alignments • Dynamic Programming • Sparse Dynamic Programming

Methods to FIND Local Alignments Sorting K-long words BLAST, BLAT, and the like Suffix Trees

Finding Local Alignments: Sorting k-long words Given sequences x, y: • Write down all (w, 0, i): w = xi+1…xi+k (z, 1, j): z = yj+1…yj+k • Sort them lexicographically • Deduce all k-long matches between x and y • Extend to local alignments

Sorting k-long words: example Let x, y be matched with 3-long words: x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2) y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2) Sorted: (agg,0,1),(cag,0,0),(cag,1,2),(ggc,0,2),(ggc,1,0),(gca,1,1) Matches: 1. cag: x1x2x3 = y3y4y5 2. ggc: x3x4x5 = y1y2y3

Running time • Worst case: O(NxM) • In practice: a large value of k results in a short list of matches Tradeoff: Low k: worse running time High k: significant alignments missed PatternHunter: Sampling non-consecutive positions increases the likelihood to detect a conserved region, for a fixed value of k – refer to Lecture 3

Suffix Trees • Suffix trees are a method to find all maximal matches between two strings (and much more) Example: x = dabdac d a b d a c 1 a c b d a b 4 c d c a c c 3 2 6 5

Definition of a Suffix Tree Definition: For string x = x1…xm, a suffix tree is: • A rooted tree with m leaves Leaf i: xi…xm • Each edge is a substring • No two edges out of a node, start with same letter It follows, every substring corresponds to an initial part of a path from root to a leaf

Constructing a Suffix Tree • Naïve algorithm: O( N2 ) time • Better algorithms: O( N ) time (outside the scope of this class – too technical and not so interesting) Memory: O( N ) but with a significant constant

Naïve Algorithm to Construct a Suffix Tree • Initialize tree T: a single root node r • Insert special symbol $ at end of x • For j = 1 to m • Find longest match of xi…xm to T, starting from r • Split edge where match stops: new node w • Create edge (w, j), and label with unmatched portion of xi…xm

1. Insert d a b d a $ 2. Insert a b d a $ d a b d a $ 3. Insert b d a $ 4. Insert d a $ a $ b 5. Insert a $ d a b 6. Insert $ 4 $ d $ a $ $ 3 2 6 5 Example of Suffix Tree Construction x = d a b d a $ 1

Faster Construction Several algorithms O( N ) time, O( N ) memory with a big constant Technical but not deep, outside the scope of this course Optional: Gusfield, chapter 6

Memory to Store Suffix Tree • Can store in O( N ) memory! • Every edge is labeled with (i, j): (i,j) denotes xi…xj • Tree has O( N ) nodes Proof: • # leafs  # nodes – 1 • # leafs = |x|

Application: find all matches between x, y • Build suffix tree for x, mark nodes with x • Insert y in suffix tree, mark all nodes y “passes from” with y • The path label of every node marked both 0 and 1, is a common substring

y y 2. Insert a b a d a $ 3. Insert b a d a $ a x y 4. Insert a d a $ y d 4 x y a 5. Insert d a $ 6 a $ 6. Insert a $ d d a 6. Insert $ 2 $ a 5 3 $ 1 Example of Suffix Tree construction x = d a b d a $ y = a b a d a $ d a b d a $ 1 1. Construct tree for x x x a $ b d a b 4 $ x d $ a 6 $ $ 3 2 5

Application: String search on a database Say we have a database D = { s1, s2, …sn } (e.g., proteins) Question: Given new string x, find all matches of x to database • Build suffix tree for {s1,…, sn} • All new queries x take O( |x| ) time (somewhat like BLAST)

Application: common substrings of k strings To find the longest common substring of s1, s2, …sn • Build suffix tree for s1,…, sn • All nodes labeled {si1, …, sik} represent a match between si1, …, sik

Methods toCHAINLocal Alignments Sparse Dynamic Programming O(N log N)

The Problem: Find a Chain of Local Alignments (x,y)  (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight

Quadratic Time Solution • Build Directed Acyclic Graph (DAG): • Nodes: local alignments [(xa,xb) (ya,yb)] & score • Directed edges: local alignments that can be chained • edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) ) • xa < xb < xc < xd • ya < yb < yc < yd Each local alignment is a node vi with alignment score si

Quadratic Time Solution Dynamic programming: Initialization: Find each node va s.t. there is no edge (u,v0) Set score of V(a) to be sa Iteration: For each vi, optimal path ending in vi has total score: V(i) = max ( weight(vj, vi) + V(j) ) Termination: Optimal global chain: j = argmax ( V(j) ); trace chain from vj Worst case time: quadratic

Sparse Dynamic Programming Back to the LCS problem: • Given two sequences • x = x1,…, xm • y = y1, …, yn • Find the longest common subsequence • Quadratic solution with DP • How about when “hits” xi = yj are sparse?

Sparse Dynamic Programming • Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

Sparse Dynamic Programming – L.I.S. • Longest Increasing Subsequence • Given a sequence over an ordered alphabet • x = x1,…, xm • Find a subsequence • s = s1, …, sk • s1 < s2 < … < sk

Sparse LCS expressed as LIS x Create a sequence w • Every matching point x-to-y, (i, j), is inserted into a sequence as follows: • For each position j of x, from smallest to largest, insert in z the points (i, j), in decreasing column i order • The 11 example points are inerted in the order given • Any two points (ya, xa), (yb, xb) can be chained iff • a is before b in w, and • ya < yb y

Sparse LCS expressed as LIS x Create a sequence w w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) Consider now w’s elements as ordered lexicographically, where • (ya, xa) < (yb, xb) if ya < yb Claim: An increasing subsequence of w is a common subsequence of x and y y

Sparse Dynamic Programming for LIS x • Algorithm: initialize empty array L /* at each point, lj will contain the last element of the longest j-long increasing subsequence that ends with the smallest wi */ for i = 1 to |w| binary search for w[i] in L, to find lj< w[i] ≤ lj+1 replace lj+1 with w[i] keep a backptr lj w[i] y

Sparse Dynamic Programming for LIS x Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) L = • (4,2) • (3,3) • (3,3) (10,5) • (2,5) (10,5) • (2,5) (8,6) • (1,6) (8,6) • (1,6) (3,7) • (1,6) (3,7) (4,8) • (1,6) (3,7) (4,8) (7,9) • (1,6) (3,7) (4,8) (5,9) • (1,6) (3,7) (4,8) (5,9) (9,10) y Longest common subsequence: s = 4, 24, 3, 11, 18

Sparse DP for rectangle chaining • 1,…, N: rectangles • (hj, lj): y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of chain ending in j • L: list of triplets (lj, V(j), j) • L is sorted by lj • L is implemented as a balanced binary tree h l y

Sparse DP for rectangle chaining Go through rectangle x-coordinates, from lowest to highest: • When on the leftmost end of i: • j: rectangle in L, with largest lj < hi • V(i) = w(i) + V(j) • When on the rightmost end of i: • j: rectangle in L, with largest lj li • If V(i) > V(j): • INSERT (li, V(i), i) in L • REMOVE all (lk, V(k), k) with V(k)  V(i) & lk li

Example x 2 1: 5 5 6 2: 6 9 10 3: 3 11 12 14 4: 4 15 5: 2 16 y

Time Analysis • Sorting the x-coords takes O(N log N) • Going through x-coords: N steps • Each of N steps requires O(log N) time: • Searching L takes log N • Inserting to L takes log N • All deletions are consecutive, so log N per deletion • Each element is deleted at most once: N log N for all deletions • Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Rapid Global Alignments