Bioinformatics Algorithms and Data Structures


  1. Bioinformatics Algorithms and Data Structures Chapter 12.2.4: k-difference Inexact Matching Lecturer: Dr. Rose Slides by: Dr. Rose February 21, 2002, year of the palindrome Last night at 2 minutes past 8pm it was: 20:02, 20/02/2002

  2. Overview • k-difference inexact matching • Concepts: • d-path • Farthest-reaching d-path in a diagonal • O(km) time and space solution • Primer selection problem • Formulations: • Exact matching primer • Inexact matching primer • k-difference primer • O(km) time solution to k-difference primer problem

  3. Overview • Exclusion methods: fast expected time O(m) • Partition approaches: • BYP algorithm • Aho-Corasick exact matching algorithm • Keyword trees • Back to Aho-Corasick exact matching algorithm • Algorithm for computing failure links • Back to BYP algorithm

  4. K-difference Inexact Matching • Like the k-mismatch problem: allows mismatches • Harder than k-mismatch: • allows spaces • End spaces in T are not counted • |P| & |T| can be vastly different ⇒ can't focus on a band of width 2k+1 centered on the main diagonal.

  5. K-difference Inexact Matching Defn: • Diagonals above the main diagonal are numbered 1 through m. Diagonal i starts in cell (0,i). • Diagonals below the main diagonal are numbered -1 through -n. Diagonal -i starts in cell (i,0). • Row 0 is initialized to be all zeros. • Recall T can have free end spaces • Setting row 0 to be zeros allows the left end of T to start after a gap without any cost.

  6. K-difference Inexact Matching Defn: a d-path is a path that starts in row 0 and specifies exactly d mismatches & spaces. Defn: a d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column c is ≥ the ending column of any other d-path ending in diagonal i. You can visualize this as a d-path that ends farthest along diagonal i.

  7. K-difference Inexact Matching Approach: • Iterate over d (1 ≤ d ≤ k): • find the farthest-reaching d-path for each diagonal i (-n ≤ i ≤ m) • The farthest-reaching d-path for diagonal i is found from the farthest-reaching (d-1)-paths on diagonals i-1, i and i+1. • Observation: any d-path reaching row n corresponds to a d-difference occurrence of P in T.

  8. K-difference Inexact Matching Observation: a farthest-reaching 0-path in diagonal i is the longest match of T[i..m] and P[1..n]. Q: Why is this true? A: 0-path means an exact match ⇒ no deviation from the diagonal that you start on. Using suffix trees: Build the (generalized) suffix tree of P and T in linear time, O(n+m). Retrieve farthest-reaching 0-paths in constant time per path (longest common extension queries).
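The farthest-reaching 0-paths are exactly longest-common-extension queries. Below is a minimal sketch (the function name and the naive scan are mine, not from the slides): a direct character comparison answers the query in time proportional to the extension, whereas the suffix-tree-plus-LCA machinery the slides refer to answers it in constant time after linear preprocessing.

```python
def longest_common_extension(P, T, p_start, t_start):
    """Length of the longest common prefix of P[p_start:] and T[t_start:].

    Naive character-by-character scan, O(length of the extension) per query.
    The suffix-tree / constant-time-LCA approach from the slides answers the
    same query in O(1) after O(n + m) preprocessing.
    """
    length = 0
    while (p_start + length < len(P) and t_start + length < len(T)
           and P[p_start + length] == T[t_start + length]):
        length += 1
    return length

# Farthest-reaching 0-path on diagonal i (i >= 0): the longest exact match of
# a prefix of P against T starting at position i (0-based here).
P, T = "acgt", "ttacgatacgt"
print(longest_common_extension(P, T, 0, 2))   # 3: "acg" matches, then 't' vs 'a'
```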

  9. K-difference Inexact Matching Q: How do we find the farthest-reaching d-path on diagonal i for d > 0? A: The d-path for diagonal i depends on the previously found (d-1)-paths on diagonals i-1, i and i+1. The 3 cases are: • Path R1, the farthest-reaching (d-1)-path on diagonal i+1, followed by a vertical edge to diagonal i.

  10. K-difference Inexact Matching Since R1 is a (d-1)-path on diagonal i+1, extending it by a vertical edge (adding a space in T) to diagonal i makes it a d-path on diagonal i.

  11. K-difference Inexact Matching The 2nd case is: • Path R2, the farthest-reaching (d-1)-path on diagonal i-1, followed by a horizontal edge to diagonal i. Again extending a (d-1)-path into a d-path on diagonal i.

  12. K-difference Inexact Matching • Path R3, the farthest-reaching (d-1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch. Again extending a (d-1)-path into a d-path on diagonal i.

  13. K-difference Inexact Matching • Each of R1, R2, and R3 is initially a farthest-reaching (d-1)-path on diagonal i+1, i-1, and i, respectively. • Each is extended by a space or a mismatch, resulting in a d-path on diagonal i. • Each is subsequently extended along diagonal i as far as matches allow. • The farthest-reaching d-path on diagonal i must be one of these three.

  14. k-differences Algorithm
  d = 0
  /* Calculate farthest-reaching 0-paths on diagonals 0 through m */
  For i = 0 to m {
      Find the longest common extension between P[1..n] and T[i..m]
  }
  /* Calculate d-paths by extending (d-1)-paths R1, R2, and R3 */
  For d = 1 to k {
      For i = -n to m {
          Extend the (d-1)-paths R1, R2, and R3 (on diagonals i+1, i-1, and i, respectively) to diagonal i.
          One of these is the farthest-reaching d-path on diagonal i.
      }
      A path reaching row n defines an inexact match of P in T containing at most k differences.
      The column in row n indicates the ending character of the match in T.
  }
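To make the loop structure concrete, here is a hedged Python sketch of the bookkeeping above. All names (k_differences, lce, NO_PATH) are mine, the longest-common-extension step is done naively rather than with a suffix tree, and the boundary handling is simplified, so treat this as an illustration of the farthest-reaching d-path idea rather than a faithful O(km) implementation.

```python
def k_differences(P, T, k):
    """Sketch of the k-differences search via farthest-reaching d-paths.

    fr[i] is the ending row of the farthest-reaching d-path on diagonal i
    (its ending column is row + i).  Returns (d, end column in T) pairs for
    paths that reach row n, i.e. occurrences with at most d differences.
    """
    n, m = len(P), len(T)

    def lce(p_start, t_start):
        # Naive longest common extension; a suffix tree makes this O(1).
        length = 0
        while (p_start + length < n and t_start + length < m
               and P[p_start + length] == T[t_start + length]):
            length += 1
        return length

    NO_PATH = -1
    fr = {i: NO_PATH for i in range(-n, m + 1)}
    for i in range(0, m + 1):          # d = 0: exact matches starting in row 0
        fr[i] = lce(0, i)

    matches = [(0, i + n) for i in range(0, m + 1) if fr[i] == n]

    for d in range(1, k + 1):
        new_fr = {}
        for i in range(-n, m + 1):
            candidates = []
            if i + 1 <= m and fr[i + 1] != NO_PATH:
                candidates.append(fr[i + 1] + 1)   # R1: vertical edge (space in T)
            if i - 1 >= -n and fr[i - 1] != NO_PATH:
                candidates.append(fr[i - 1])       # R2: horizontal edge (space in P)
            if fr[i] != NO_PATH:
                candidates.append(fr[i] + 1)       # R3: mismatch along diagonal i
            if i == -d:
                candidates.append(d)               # boundary: d spaces down column 0
            if not candidates:
                new_fr[i] = NO_PATH
                continue
            row = min(max(candidates), n, m - i)   # stay inside the table
            row += lce(row, row + i)               # extend along diagonal i with matches
            new_fr[i] = row
            if row == n:
                matches.append((d, row + i))       # occurrence ends at column row + i of T
        fr = new_fr
    return matches
```

With a suffix tree answering each lce call in constant time, the two nested loops give the O(km) space and time bounds discussed on the next two slides.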

  15. K-difference Inexact Matching Space analysis: • For each d and i, we need to store the endpoint of the farthest-reaching d-path. • d ranges from 0 to k. • There are n + m + 1 diagonals. • ⇒ O(km) space is required (since n ≤ m).

  16. K-difference Inexact Matching Time analysis: • Constant time to retrieve the 3 (d-1)-paths for a particular d and i. • ⇒ O(km) for this aspect (as in the k-differences alignment) • Corresponding O(km) extensions of paths along diagonals. • Each path extension is a maximal identical substring in P & T, i.e., a longest common extension computation. • Using a suffix tree, each such extension takes only constant time. • Creating the suffix tree entails linear processing of the strings: O(n+m) • ⇒ altogether O(n+m+km) = O(km)

  17. Primer (Probe) Selection Problem Problem: start with two strings a and b (detailed description on pages 178-179). • Exact matching version: for j > j0, find the shortest substring g of a starting at aj such that g does not occur as a substring of b. • Can be solved in O(|a|+|b|) • Not too bad. • Inexact matching version: Given parameter p and j > j0, find the shortest substring g of a starting at aj that has edit distance at least |g|·p from every substring of b.

  18. Primer (Probe) Selection Problem • Inexact matching version: Given parameter p and j > j0, find the shortest substring g of a starting at aj that has edit distance at least |g|·p from every substring of b. • Q: How much work is this? • …find the shortest prefix g of a[j..n] with edit distance at least |g|·p from every substring of b. • The naïve approach appears daunting. • Let’s look at a less intimidating formulation!

  19. Primer (Probe) Selection Problem • Change |g|·p to k • Convert the inexact matching problem to a k-differences problem. • This works out since, in practice, |g|·p must fall in a small range for fixed p. • k-difference primer problem: Given parameter k and j > j0, find the shortest substring g of a starting at aj that has edit distance at least k from every substring of b.

  20. Primer (Probe) Selection Problem Approach: For each position j in a Find the shortest prefix of a[j..n] with edit distance ≥ k from every substring of b. Q: How does this compare with the k-differences inexact matching problem? A: It is the opposite problem: find matches with at most k differences, versus reject matches of prefixes of a[j..n] with substrings of b with fewer than k differences.

  21. Primer (Probe) Selection Problem Solution: • Use the k-differences algorithm. • Use a[j..n] in the place of P. • Use b in the place of T. • Compute the farthest-reaching d-paths, d ≤ k-1, in each diagonal. • d-paths, d < k, reaching row n mean there is no solution at j • Q: Why? • A: a d-path, d < k, reaching row n indicates that a[j..n] matches a substring of b with fewer than k differences.

  22. Primer (Probe) Selection Problem Solution: • Only if no farthest-reaching (k-1)-path reaches row n can there be a primer at position j. • In particular, if no farthest-reaching (k-1)-path reaches row r < n, then a[j..r] is a primer; the smallest such r gives the shortest primer at j. • Repeat this approach for every potential starting position j in a. • Analysis: if |a| = n and |b| = m, then the algorithm takes time O(knm).
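A hedged sketch of the k-difference primer computation described on the last two slides. For readability it fills in the full edit-distance table for each starting position j (row 0 is all zeros so the left end of b is free, and the minimum over a row is the distance to the closest substring of b); all names are mine, and the slides' farthest-reaching d-path computation would replace the inner table to reach O(km) per position.

```python
def k_difference_primers(a, b, k):
    """Sketch: for each start j in a, find the shortest prefix of a[j:] whose
    edit distance from every substring of b is at least k (hypothetical names).
    """
    m = len(b)
    primers = {}
    for j in range(len(a)):
        p = a[j:]
        prev = [0] * (m + 1)            # row 0: free left end of b
        found = None
        for r in range(1, len(p) + 1):
            cur = [r] + [0] * m         # column 0: r spaces against b
            for c in range(1, m + 1):
                cost = 0 if p[r - 1] == b[c - 1] else 1
                cur[c] = min(prev[c - 1] + cost,  # match / mismatch
                             prev[c] + 1,         # space in b
                             cur[c - 1] + 1)      # space in a
            if min(cur) >= k:           # no substring of b within k-1 differences
                found = a[j:j + r]      # shortest qualifying prefix: a primer at j
                break
            prev = cur
        if found is not None:
            primers[j] = found
    return primers

# Hypothetical toy example:
print(k_difference_primers("acgtacgt", "ttttttt", 2))
```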

  23. Exclusion Methods Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference? A: On average, yes. (Are we quibbling?) We adopt an algorithm whose expected time is below Θ(km), although the worst case may not be better than Θ(km)

  24. Exclusion Methods Partition Idea: exclude much of T from the search Preliminaries: Let α = |Σ|, where Σ is the alphabet used in P and T. Let n = |P|, and m = |T|. Defn. an approximate occurrence of P is an occurrence with at most k mismatches or differences. General Partition algorithm: three phases • Partition phase • Search Phase • Check Phase

  25. Exclusion Methods • Partition phase • Partition either T or P into r-length regions (depends on the particular algorithm) • Search Phase • Use exact matching to search T for r-length intervals • These are potential targets for approximate occurrences of P. • Eliminate as many intervals as possible. • Check Phase • Use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.

  26. BYP Method The BYP method has O(m) expected running time. Partition P into r-length regions, r = ⌊n/(k+1)⌋ Q: How many r-length regions of P are there? A: k+1; there may be an additional short region. Suppose there is a match of P & T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A: There must be at least one r-length interval that matches exactly, since k differences can touch at most k of the k+1 regions (pigeonhole).
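A small sketch of the partition step, with hypothetical names (partition_regions): P is cut into regions of length r = ⌊n/(k+1)⌋, giving k+1 full regions plus possibly a short leftover.

```python
def partition_regions(P, k):
    """Sketch: split P into r-length regions with r = floor(n / (k + 1)).

    The first k + 1 regions are full length; a short leftover region may
    remain at the end (it is not used by the BYP search phase).
    """
    n = len(P)
    r = n // (k + 1)
    regions = [P[i:i + r] for i in range(0, (k + 1) * r, r)]
    leftover = P[(k + 1) * r:]
    return regions, leftover

print(partition_regions("abcdefghij", 2))   # r = 3 -> (['abc', 'def', 'ghi'], 'j')
```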

  27. BYP Method BYP Algorithm: • Let P be the set of the first k+1 substrings of P’s partitioning. • Build a keyword tree for the set of patterns P. • Use Aho-Corasick to find I, the set of starting locations in T where a pattern in P occurs exactly. • ….. Oops! We haven’t talked about keyword trees or Aho-Corasick. Sooooo let’s do that now.

  28. Keyword Trees (section 3.4) Defn. The keyword tree for set P is a rooted directed tree K satisfying: • Each edge is labeled with one character • Any two edges out of the same node have distinct labels. • Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi • Every leaf in K is mapped by some pattern in P.

  29. Keyword Trees Example: From textbook P = {potato, poetry, pottery, science, school}

  30. Keyword Trees (section 3.4) Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K. • Every node corresponds to a prefix of a pattern in P. • Conversely, every prefix of a pattern maps to a node in K.

  31. Keyword Trees (section 3.4) • If n is the total length of all patterns in P, then we can construct K in O(n) time, assuming a fixed alphabet Σ. • Let Ki denote the partial keyword tree that encodes patterns P1,.. Pi of P.

  32. Keyword Trees (section 3.4) • Consider partial keyword tree K1 • comprised of a single path of |P1| edges out of root r. • Each edge is labeled with one character of P1 • Reading from the root to the leaf spells out P1 • The leaf is labeled 1

  33. Keyword Trees (section 3.4) Creating K2 from K1: • Find the longest path from the root of K1 that matches a prefix of P2. • This path ends by either (a) exhausting the characters of P2, or (b) ending at some existing node v in K1 where no extending match is possible. In case (a), label the node where the path ends 2. In case (b), create a new path out of v, labeled by the remaining characters of P2.
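The insertion procedure on this slide translates directly into code. Below is a minimal sketch assuming a dictionary-of-children node representation; the class and function names (KeywordTreeNode, build_keyword_tree) are mine, not from the textbook.

```python
class KeywordTreeNode:
    """One node of the keyword tree K: edges keyed by their character label."""
    def __init__(self, depth=0):
        self.children = {}      # character -> child KeywordTreeNode
        self.depth = depth      # |L(v)|, the length of the label spelled out to v
        self.pattern_id = None  # i if pattern Pi ends exactly at this node
        self.fail = None        # failure link n_v, filled in by a later sketch

def build_keyword_tree(patterns):
    """Sketch: construct K by inserting P1, ..., Pz one at a time.

    For each pattern, walk the longest matching path from the root; if the
    pattern is exhausted (case a), number the node it ends at, otherwise
    (case b) branch off with the remaining characters.  Linear in the total
    pattern length for a fixed alphabet.
    """
    root = KeywordTreeNode()
    for idx, pat in enumerate(patterns, start=1):
        node = root
        for ch in pat:
            if ch not in node.children:                 # case (b): branch here
                node.children[ch] = KeywordTreeNode(node.depth + 1)
            node = node.children[ch]
        node.pattern_id = idx                           # case (a): pattern ends here
    return root

# Example pattern set from the slides:
K = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
```

The depth field stores |L(v)| and fail is left empty here; both are used by the Aho-Corasick sketches after slides 41 and 44.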

  34. Keyword Trees (section 3.4) Example: P1 is potato • Case (a): P2 is pot • Case (b): P2 is potty

  35. Keyword Trees (section 3.4) Use of keyword trees for matching • Finding occurrences of patterns in P that occur starting at position l in T: • Starting at the root r in K, follow the unique path that matches a substring of T that starts at l. • Numbered nodes along this path indicate matched patterns in P that start at position l. • This takes time proportional to min(n, m) • Traversing K for each position l in T gives O(nm) • This can be improved!
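As a point of comparison, here is a sketch of the naive use of the keyword tree just described: restart at the root for every position l of T. It reuses the build_keyword_tree sketch from slide 33 and runs in worst-case O(nm), which is what the failure links below eliminate.

```python
def naive_tree_search(root, T):
    """Sketch of the naive keyword-tree search: for every start position l
    of T (1-based, as in the slides), walk the unique matching path from the
    root and report every numbered node passed along the way.
    """
    occurrences = []                             # (pattern_id, start position l)
    for l in range(1, len(T) + 1):
        node, c = root, l
        while c <= len(T) and T[c - 1] in node.children:
            node = node.children[T[c - 1]]
            if node.pattern_id is not None:
                occurrences.append((node.pattern_id, l))
            c += 1
    return occurrences

# Example: print(naive_tree_search(build_keyword_tree(["potato", "pot"]), "hotpotato"))
```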

  36. Keyword Tree Speedup Observation: Our naïve keyword tree is like the naïve approach to string comparison. • Every time we increment l, we start all over at the root of K ⇒ O(nm) Recall: KMP avoided O(nm) by shifting to get a speedup. Q: Is there an analogous operation we can perform in K? A: Of course, why else would I ask a rhetorical question?

  37. Keyword Tree Speedup First, we assume that no pattern Pi is a substring of another pattern Pj in P. Next, each node v in K is labeled with the string formed by concatenating the letters on the path from the root to v. Defn. Let L(v) denote the label of node v. Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.

  38. Keyword Tree Speedup Example: L(v) = potat, lp(v) = 2, the suffix at is the prefix of P4.

  39. Keyword Tree Speedup Note: if a is the lp(v)-length suffix of L(v), then there is a unique node labeled a. Example: at is the lp(v)-length suffix of L(v), w is the unique node labeled at.

  40. Keyword Tree Speedup Defn: For node v of K let nv be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then nv is the root of K. Defn: The ordered pair (v,nv) is called a failure link. Example:

  41. Aho-Corasick (section 3.4.6)
  Algorithm AC search
  l = 1; c = 1; w = root of K;
  Repeat {
      While there is an edge (w, w´) labeled with character T(c) {
          if w´ is numbered by pattern i then report that Pi occurs in T starting at position l;
          w = w´ and c = c + 1;
      }
      l = c - lp(w) and w = nw;
  } Until c > m;
  Note: if no edge out of the root matches T(c), increment c and repeat the loop.
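A Python rendering of Algorithm AC search, offered as a sketch rather than a definitive implementation. It assumes the keyword-tree nodes carry depth (|L(v)|) and fail (nv) attributes, as in the construction sketch after slide 33 and the failure-link sketch after slide 44, and it keeps the slide's assumption that no pattern is a substring of another.

```python
def aho_corasick_search(root, T):
    """Sketch of Algorithm AC search (1-based positions as in the slides)."""
    occurrences = []            # (pattern_id, start position l in T)
    l, c = 1, 1
    w = root
    m = len(T)
    while c <= m:
        # Follow matching edges as far as possible from the current node.
        while c <= m and T[c - 1] in w.children:
            w = w.children[T[c - 1]]
            if w.pattern_id is not None:
                occurrences.append((w.pattern_id, l))
            c += 1
        if w is root:
            c += 1              # no edge at the root: skip one character of T
            l = c
        else:
            nw = w.fail
            l = c - nw.depth    # lp(w) = |L(n_w)|, so l jumps forward
            w = nw
    return occurrences
```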

  42. Aho-Corasick Example: T = hotpotattach When l = 4 the traversal from the root matches potat, but the next position fails. At this point c = 9. The failure link points to the node labeled at and lp(v) = 2, so l = c – lp(v) = 9 – 2 = 7

  43. Computing nv in Linear Time • Note: if v is the root r or 1 character away from r, then nv = r. • Imagine nv has been computed for every node that is exactly k or fewer edges from r. • How can we compute nv for v, a node k+1 edges from r?

  44. Computing nv in Linear Time • We are looking for nv and L(nv). • Let v´ be the parent of v in K and x the character on the edge connecting them. • nv´ is known, since v´ is k edges from r. • Clearly, L(nv) must be a suffix of L(nv´) followed by x. • First check if there is an edge (nv´, w´) with label x. • If so, then nv = w´. • Otherwise, L(nv) is a proper suffix of L(nv´) followed by x. • Examine nnv´ for an outgoing edge labeled x. • If no joy, keep repeating, finally setting nv = r if we run out of edges.
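The breadth-first computation of the failure links, sketched under the same assumptions as the earlier keyword-tree code (the names are mine). Processing nodes level by level guarantees that nv´ is already known when v is handled, which is exactly the inductive step on this slide.

```python
from collections import deque

def set_failure_links(root):
    """Sketch of the linear-time failure-link computation.

    Uses the KeywordTreeNode sketch from slide 33: nodes one edge from the
    root get n_v = root, and every other node walks failure links from its
    parent's n_v' until an x-edge is found (or the root is reached).
    """
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root                   # one edge from r  =>  n_v = r
        queue.append(child)
    while queue:
        v_parent = queue.popleft()
        for x, v in v_parent.children.items():
            u = v_parent.fail               # start from n_{v'}
            while u is not root and x not in u.children:
                u = u.fail                  # "if no joy, keep repeating"
            if x in u.children:
                v.fail = u.children[x]      # L(n_v) = suffix of L(n_{v'}) + x
            else:
                v.fail = root               # ran out of edges: n_v = r
            queue.append(v)
    return root
```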

  45. BYP Method The BYP method has O(m) expected running time. Partition P into r-length regions, r = ⌊n/(k+1)⌋ Q: How many r-length regions of P are there? A: k+1; there may be an additional short region. Suppose there is a match of P & T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A: There must be at least one r-length interval that matches exactly.

  46. BYP Method BYP Algorithm: • Let P be the set of the first k+1 substrings of P’s partitioning. • Build a keyword tree for the set of patterns P. • Use Aho-Corasick to find I, the set of starting locations in T where a pattern in P occurs exactly. • For each i ∈ I, use approximate matching to locate the end points of approximate occurrences of P in T[i-n-k..i+n+k]
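Putting the pieces together, a hedged end-to-end sketch of the BYP pipeline that reuses the hypothetical helpers sketched earlier in this transcript (partition_regions, build_keyword_tree, set_failure_links, aho_corasick_search, k_differences); the coordinate handling around the window T[i-n-k..i+n+k] is simplified.

```python
def byp_search(P, T, k):
    """Sketch of the BYP partition / search / check pipeline."""
    n = len(P)
    regions, _ = partition_regions(P, k)                 # k+1 regions of length r
    tree = set_failure_links(build_keyword_tree(regions))

    # Search phase: exact occurrences of any region in T (1-based starts).
    hits = aho_corasick_search(tree, T)

    # Check phase: approximate matching around each surviving location.
    occurrences = set()
    for _, i in hits:
        lo = max(0, i - n - k - 1)                       # 0-based slice of T[i-n-k..i+n+k]
        hi = min(len(T), i + n + k)
        window = T[lo:hi]
        for d, end_col in k_differences(P, window, k):
            occurrences.add((d, lo + end_col))           # end position back in T coordinates
    return sorted(occurrences)
```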
