On Finding Repeats in Strings A Survey of Algorithms 胡吉祥 2004.09.29
I. Problem Description • Repeat: recurrence of a pattern. • Aim: 1) Find occurrences of repeated substrings in a string or a set of strings. 2) Find a common structure/pattern shared by a family of strings.
1. Basic Definition • A pair of substrings R = (S[i1, j1], S[i2, j2]) is called a repeat: - exact repeat: S[i1, j1] = S[i2, j2] - k-mismatch repeat: k mismatches between S[i1, j1] and S[i2, j2] - k-differences repeat: k differences (mismatches, insertions, deletions) between S[i1, j1] and S[i2, j2]
Definition 2 • A repeat in a string x is a tuple M_{x,u} = (p; i_1, i_2, …, i_r), r ≥ 2, where u = x[i_1 … i_1+p-1] = x[i_2 … i_2+p-1] = … = x[i_r … i_r+p-1]. • u is said to be a repeating substring of x and the generator of M_{x,u}. • p = |u| is the period of the repeat and r its exponent.
2. Maximal exact matches • Definition 3: If S[i, i+k] = S[j, j+k], where 1 ≤ k ≤ n-1 and 1 ≤ i, j ≤ n-k, then we say that S[i, i+k] and S[j, j+k] match. This match is represented by (i, j, k+1), i.e., by the two starting positions of the match and the length of the match.
Definition 4 • A match S[i, i+k] = S[j, j+k], written (i, j, k+1) as in Definition 3, is left-extensible if i, j > 1 and S_{i-1} = S_{j-1}, and right-extensible if i+k, j+k < n and S_{i+k+1} = S_{j+k+1}. • A match is maximal if it is neither left-extensible nor right-extensible and it is not the trivial match (1, 1, n), where n is the length of the string.
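Definitions 3 and 4 can be checked on small inputs with a brute-force enumeration. The following is a minimal sketch, not one of the surveyed algorithms: the function name `maximal_matches` is hypothetical, it runs in quadratic time or worse, and it reports each match once with i < j, using 1-based start positions and the match length.

```python
def maximal_matches(s):
    """Return all maximal exact matches (i, j, length) with i < j, 1-based."""
    n = len(s)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            # Skip pairs that could be extended to the left.
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right as far as the characters agree,
            # so the reported match is not right-extensible either.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            found.append((i + 1, j + 1, length))
    return found


print(maximal_matches("abcbc"))
```

For the string abcbc the only maximal match is bc at positions 2 and 4; the pair of c's at positions 3 and 5 is left-extensible and therefore skipped.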
3. Approximate Repeats • Exact repeat: identical substrings • Inexact repeats: distance measures between recurrences • Hamming distance: mismatch => k-mismatch repeat • Levenshtein / edit distance: substitution, insertion, deletion => k-differences repeat
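The two distance measures can be sketched in a few lines. This is an illustrative sketch with hypothetical function names: `hamming` counts mismatches between equal-length strings, and `levenshtein` uses the textbook dynamic program with unit costs for substitution, insertion, and deletion.

```python
def hamming(u, v):
    """Number of mismatching positions between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def levenshtein(u, v):
    """Minimum number of substitutions, insertions and deletions turning u into v."""
    prev = list(range(len(v) + 1))          # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute, or match for free
        prev = cur
    return prev[-1]
```

Two recurrences form a k-mismatch repeat when `hamming` is at most k, and a k-differences repeat when `levenshtein` is at most k.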
II. Algorithms for finding repeats in strings • 1. N-gram Iterative / KMR • 2. Martinez’s Sorting Algorithm • 3. Fingerprint • 4. Suffix Tree • 5. GST (Generalized Suffix Tree) • 6. Suffix Array • 7. Grammar Induction / Sequitur • 8. Term Frequency Statistical Approach • 9. Others
1. N-gram Iterative / KMR • [KMR, 1972] • A classical algorithm for finding all exact repetitions in a string. • An O(N log k) algorithm for finding the exact repeated words of length k in a sequence of length N.
Aim • Given a string s, KMR solves the following problems: • 1. Identify the positions of all words of a fixed length k that occur more than once in s. • 2. Find the length kmax of the longest repeated word in s, and solve problem 1 for k = kmax.
Idea • Iterative construction of an equivalence relation over the positions of s. • Definition: Given a string s = s_1 s_2 … s_n ∈ Σ^n, two positions i, j ∈ {1, …, n-k+1} of s are said to be k-equivalent, written i E_k j, if and only if the words of length k starting at these positions in s are identical.
Lemma 1: Let s = s_1 s_2 … s_n ∈ Σ^n. Let a, b ∈ {1, …, n} with b ≤ a, and i, j ∈ {1, …, n-(a+b)+1}. Then i E_{a+b} j iff i E_a j and i+b E_a j+b. • Lemma 2: Let s = s_1 s_2 … s_n ∈ Σ^n. Let a, b ∈ {1, …, n}, and i, j ∈ {1, …, n-(a+b)+1}. Then i E_{a+b} j iff i E_a j and i+a E_b j+a.
• Constructing E_k and its classes is basically an operation of set intersection on the classes of E_k' for k' < k. • Doubling technique => suffix sorting • Analysis: Time complexity: O(n log k). Space complexity: O(n).
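The doubling step of Lemma 1 can be sketched as follows. This is a simplified illustration, not the original KMR implementation: the names `kmr_classes` and `longest_repeat_length` are hypothetical, positions are 0-based, and classes are renumbered by sorting label pairs, which costs O(n log n) per round rather than the bound quoted above.

```python
def kmr_classes(s, k):
    """Group the start positions of s (0-based) by the length-k word found there.

    Only classes with at least two positions (repeated words) are returned.
    """
    n = len(s)
    if k < 1 or k > n:
        return []
    # E_1: two positions are 1-equivalent iff they hold the same character.
    ranks = {c: r for r, c in enumerate(sorted(set(s)))}
    label = [ranks[c] for c in s]
    a = 1
    while a < k:
        b = min(a, k - a)  # Lemma 1 requires b <= a
        # i E_{a+b} j  iff  i E_a j  and  i+b E_a j+b
        pairs = [(label[i], label[i + b]) for i in range(n - (a + b) + 1)]
        ranks = {p: r for r, p in enumerate(sorted(set(pairs)))}
        label = [ranks[p] for p in pairs]
        a += b
    classes = {}
    for i, lab in enumerate(label):
        classes.setdefault(lab, []).append(i)
    return [pos for pos in classes.values() if len(pos) > 1]


def longest_repeat_length(s):
    """kmax: a repeated word of length k implies one of length k-1, so scan upward."""
    k = 0
    while k < len(s) and kmr_classes(s, k + 1):
        k += 1
    return k
```

The upward scan in `longest_repeat_length` is chosen for brevity; a binary search over k would need fewer calls.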
Variation of KMR • [Land,1989]: finding a common structure shared by a family of strings • Aim: find a structure common to N strings of characters over a given alphabet. - A word is considered relevant if it appears without any distortion in at least a fixed number q of the N strings, and with some distortion in the remaining N-q strings.
Adaptation of the KMR Algorithm • Arrange the N strings in a single vector S, taking into account the boundaries of the N strings in S: any word must be wholly enclosed in a single string. • Memorize the locations in S of the first and last characters of each string. • Adapt the KMR method to keep only the words that satisfy the quorum condition and do not cross any boundary.
2. Martinez’s Sorting Algorithm • [Martinez, 1983]: • A sorting algorithm with time complexity O(n log n) that reveals a priori unknown, identically repeated patterns in several molecular sequences. • Treats the problem of finding repeats in molecular sequences as a sorting problem. • Linear space complexity and O(N log N) expected time complexity. • Involves no path tracing: the repeats are immediate and are reported during the sorting process.
Idea • (1) First construct a sequence P of pointers such that pointer value P[i] is the location of the ith element in the sequence S. • (2) Sort P so that it constitutes an ordering of S, comparing pointers through the elements they point to: P[i] <, >, = P[j] iff S[P[i]] <, >, = S[P[j]]. - All the pointer values which point to the same kind of element in S are grouped together. - There are at most m groups of pointer values in this first sorting (m = |Σ|).
Idea (cont.) • (3) Sort each of these groups of P so that in the resulting subgroups two pointer values belong to the same one if and only if the elements immediately following the ones they point to are equal. • (4) When no subgroups contain more than one pointer value the process is complete.
Conclusion: • The method is no more than repeated application of a sorting algorithm, so the overall speed is essentially determined by the speed of the sorting algorithm employed. • No significant storage space is required beyond that necessary for the sequence S and its pointer sequence P.
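Steps (1) to (4) can be sketched as repeated refinement of pointer groups. This is an illustrative sketch under stated assumptions, not Martinez's implementation: the name `repeats_by_sorting` is hypothetical, each refinement is done by bucketing on the next character instead of a comparison sort, positions are 0-based, and highly repetitive inputs can take quadratic time.

```python
def repeats_by_sorting(s):
    """Return {word: positions} for every word occurring at least twice in s."""
    repeats = {}
    # Each pending group holds pointers whose first `depth` characters agree.
    pending = [(0, list(range(len(s))))]
    while pending:
        depth, group = pending.pop()
        buckets = {}
        for p in group:
            if p + depth < len(s):              # pointer still has a next element
                buckets.setdefault(s[p + depth], []).append(p)
        for ch in sorted(buckets):
            sub = buckets[ch]
            if len(sub) > 1:                    # more than one pointer: a repeat
                repeats[s[sub[0]:sub[0] + depth + 1]] = sub
                pending.append((depth + 1, sub))
    return repeats


print(repeats_by_sorting("abcbc"))
```

As the slides note, the repeats are reported while the groups are being refined; no separate tracing pass is needed.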
3. Fingerprint • Definition: A fingerprint (a.k.a. signature) of an object Ob is a small tag f(Ob) with the following properties: • a) f is a function of Ob, so f(A) ≠ f(B) => A ≠ B. • b) Pr(f(A) = f(B) | A ≠ B) is very small.
Useful Properties of Fingerprints • Fast Calculation • Low collision rate • Cryptographically unbreakable • Updatable • Concatenation of Objects
Karp-Rabin Style Fingerprints • A = [a_0, a_1, …, a_{n-1}] • Calculation time is linear in n.
• Easy to calculate for consecutive n-grams • Easy to calculate signatures of concatenations
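The two "easy to calculate" properties can be illustrated with a polynomial fingerprint. This sketch assumes the usual Karp-Rabin form f(A) = Σ a_i · BASE^(n-1-i) mod MOD over byte strings; the constants `BASE` and `MOD` and all function names are assumptions chosen for illustration, not values taken from the slides.

```python
BASE = 257                 # assumed base, larger than any byte value
MOD = (1 << 61) - 1        # assumed modulus (a Mersenne prime)


def fingerprint(data):
    """f(A) = sum of a_i * BASE^(n-1-i) mod MOD, computed by Horner's rule."""
    f = 0
    for a in data:
        f = (f * BASE + a) % MOD
    return f


def rolling_fingerprints(data, n):
    """Fingerprints of all consecutive n-grams of data, in O(len(data)) time."""
    if n < 1 or n > len(data):
        return []
    top = pow(BASE, n - 1, MOD)        # weight of the symbol leaving the window
    f = fingerprint(data[:n])
    out = [f]
    for i in range(n, len(data)):
        # Drop data[i - n], shift the window, append data[i].
        f = ((f - data[i - n] * top) * BASE + data[i]) % MOD
        out.append(f)
    return out


def concat_fingerprint(f_a, f_b, len_b):
    """Fingerprint of A followed by B, from the fingerprints of A and B alone."""
    return (f_a * pow(BASE, len_b, MOD) + f_b) % MOD
```

Equal fingerprints only suggest equal n-grams; a candidate repeat found this way still has to be verified by direct comparison, since collisions are unlikely but possible.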
Discover repetitions using fingerprints • Algorithm FSR: finds the shortest repetition. - Input: a string y ∈ Σ*Σ, |y| = n - Output: a pair of indices 1 ≤ i < j ≤ n such that y(i, j-1) = y(j, 2j-i-1) and j-i is minimal • Time complexity: O(n log n)
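The input/output contract of FSR can be made concrete with a naive reference. This sketch is not Algorithm FSR itself: it compares substrings directly instead of using fingerprints, so it does not reach the O(n log n) bound, and the name `shortest_repetition` is hypothetical. It returns the 1-based pair (i, j) from the specification, or None when the string contains no repetition.

```python
def shortest_repetition(y):
    """Return 1-based (i, j) with y(i, j-1) == y(j, 2j-i-1) and j - i minimal."""
    n = len(y)
    for p in range(1, n // 2 + 1):            # p = j - i, tried in increasing order
        for i in range(n - 2 * p + 1):
            if y[i:i + p] == y[i + p:i + 2 * p]:
                return i + 1, i + p + 1
    return None


print(shortest_repetition("abcbc"))
```

A fingerprint-based version would replace the substring comparison with a comparison of precomputed tags, followed by a verification step.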
4. Suffix Tree Based Algorithms • Repeat finding with suffix trees: – Exact repeats – Approximate repeats
Suffix Trees • A suffix tree is a trie-like data structure representing all suffixes of a string. • A suffix tree of a string S, T(S), is a rooted tree whose edges are labeled with strings such that – all edges leaving a node begin with different characters and – the paths from the root to the leaves represent all the suffixes of S.
Suffix Tree (xabxac) • [Figure: suffix tree of xabxac, with leaves numbered 1 to 6] • Suffixes: {xabxac, abxac, bxac, xac, ac, c}
(1) Finding Exact Repeats • Folklore: (see e.g. Gusfield, 1997) It is possible to find all pairs of repeated substrings (repeats) in S in linear time. • Idea: • consider string S and its suffix tree T(S). •repeated substrings of S correspond to internal locations in T(S). • leaf numbers tell us positions where substrings occur. • Analysis: O(n + |output|) time, O(n) space
Finding maximal exact repeats • Idea: (see e.g. Gusfield, 1997) • For right-maximality (X ≠ Y) – consider only internal nodes of T(S) – report only pairs of leaves from different subtrees (or from different leaf-lists) • For left-maximality (A ≠ B) – keep lists for the different left-characters – report only pairs from different lists
Dup: Finding Duplication in Strings and Software (Baker, 1993) • Maximal matching pairs must differ in both right context and left context. • Build up lists of suffixes grouped by left context. • At each internal node, compare the lists found for its subtrees to identify the longest matches. • Analysis: O(n + |output|) time, O(n) space
Finding Maximal Repeats • [Figure: suffix tree of abcbc# with a left-diverse node marked; left characters a and c]
(2) Finding degenerate repeats • k-mismatch repeats (Hamming distance) / k-differences repeats (edit distance) • (Kurtz et al., 2000/2001; Adebiyi et al., 2001; Volfovsky et al., 2001; Kolpakov & Kucherov, 2001)
Idea: • Given a minimal length l and up to k errors, use a filter method (“seed and extend”): - first search for exact repeats of small but appropriate length; - form maximal approximate repeats by extending the exact repeats (called seeds) into the surrounding sequence, allowing k mismatches or k differences between recurrences; - two types of exact repeats can be used as seeds: maximal repeats and supermaximal repeats.
Algorithm: 1. Search for local exact repeats (seeds). 2. Extend the seeds while allowing up to k errors. 3. If the extension is long enough, output the repeat. • Analysis: O(n + ζ k^3) time, where ζ is the number of seeds, with E(ζ) = O(n^2 / 4^s) and s the minimal seed length.
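The three steps can be sketched for the k-mismatch case. This is a simplified illustration under several assumptions, not the method of the cited papers: seeds are all repeated words of a fixed length `seed_len` found with a dictionary (not maximal repeats from a suffix tree), the mismatch budget is spent greedily to the right first and then to the left, and the name `seed_and_extend` and its parameters are hypothetical.

```python
from collections import defaultdict


def seed_and_extend(s, seed_len, k, min_len):
    """Report (i, j, length), 0-based, for substring pairs with at most k mismatches."""
    # Step 1: exact seeds, grouped by the word they spell.
    seeds = defaultdict(list)
    for i in range(len(s) - seed_len + 1):
        seeds[s[i:i + seed_len]].append(i)
    found = set()
    for positions in seeds.values():
        for x in range(len(positions)):
            for y in range(x + 1, len(positions)):
                i, j = positions[x], positions[y]
                budget = k
                # Step 2a: extend to the right, spending mismatches greedily.
                right = seed_len
                while j + right < len(s):
                    if s[i + right] != s[j + right]:
                        if budget == 0:
                            break
                        budget -= 1
                    right += 1
                # Step 2b: spend whatever budget is left on the left side.
                left = 0
                while i - left - 1 >= 0:
                    if s[i - left - 1] != s[j - left - 1]:
                        if budget == 0:
                            break
                        budget -= 1
                    left += 1
                # Step 3: keep the repeat only if it is long enough.
                length = left + right
                if length >= min_len:
                    found.add((i - left, j - left, length))
    return sorted(found)


print(seed_and_extend("GATTACAxGATCACA", 3, 1, 7))
```

In the example, the seeds GAT and ACA both extend to the same 1-mismatch repeat GATTACA / GATCACA, which is why results are collected in a set. A fuller implementation would try every split of the error budget between the two sides, and would use edit distance for k-differences repeats.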
5. Generalized Suffix Tree (GST) • GST: • a Suffix Tree that combines the suffixes of a set {S1, ...., Sn} of strings.
Generalized Suffix Trees - Example • T1 = abab, T2 = aab • [Figure: generalized suffix tree of T1 and T2, with the terminators $ and # marking the ends of the two strings]
Generalized Suffix Trees - Applications • Searching for a pattern in a database of strings. • Finding longest common substring.
(1) k-common substring problem (Gusfield, 1997) • Problem definition: longest common substrings of more than two strings: - Input: strings S1, …, SK (total length n) - Output: l(k) (and pointers to the substrings) for 2 ≤ k ≤ K • l(k): the length of the longest substring that appears in at least k distinct input strings.
Solution • - Build a generalized suffix tree for the K strings; each string has a unique end character, so each leaf is labeled with exactly one string. - C(v): number of distinct leaf labels in the subtree rooted at node v. - Given the C(v) values and the string-depth values, do a simple traversal of the tree to find these K-1 values and the pointers to the locations of the substrings. • Time complexity: O(Kn)
(2) Color Set Size Problem (Lucas Hui, 1992) • CSS (Color Set Size) problem: given a rooted tree of size n with l leaves colored from 1 to m, m ≤ l, find for each vertex u the number of different leaf colors in the subtree rooted at u. • CSS Linear Theorem: the CSS problem can be solved in O(n) time and space.
K-out-of-m • k-out-of-m problem: find the longest substring that is common to at least k strings, for a fixed k between 1 and m. • The k-out-of-m problem reduces to finding an internal vertex u in the GST such that css(u) ≥ k and the path length from the root to u is maximum.
Algorithm KM • - Build the GST for the input strings; - Solve the CSS problem on the GST; - Compute the path length of every vertex using a pre-order traversal; - Among all vertices, select one with css value ≥ k and maximum path length; - Output the path from the root to the selected vertex as the answer.
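The k-out-of-m problem can be stated as a brute-force reference, useful for checking a GST-based implementation on small inputs. This sketch is not Algorithm KM: it enumerates every substring of every string, so it needs quadratic space per string, and the name `longest_common_to_at_least` is hypothetical. The per-string set plays the role of the distinct leaf colors counted by css.

```python
def longest_common_to_at_least(strings, k):
    """Longest substring occurring in at least k of the strings ('' if none)."""
    count = {}
    for s in strings:
        # A set, so each string contributes at most once per substring.
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for u in subs:
            count[u] = count.get(u, 0) + 1
    candidates = [u for u, c in count.items() if c >= k]
    # Longest first; ties broken alphabetically so the result is deterministic.
    return min(candidates, key=lambda u: (-len(u), u), default="")


for k in (1, 2, 3):
    print(k, longest_common_to_at_least(["abab", "aab", "bab"], k))
```

Calling the function for every k from 1 to m gives the answers to the multiple common substring problem in its first form, at a much higher cost than the suffix tree solution.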
Multiple common substring Problem • a. Given m input strings, for all k between 1 and m, find the longest pattern which appears in at least k input strings. • b. Given m input strings and integers k and l, find a pattern with length l which appears in exactly k input strings.
c. Given m input strings and integers l1 < l2, find the pattern with length between l1 and l2 which appears in as many input strings as possible. • All the above problems can be solved in O(n + |output|) time by modifying the k-out-of-m solution.
6. Suffix Array • Definition: Given a string D, the suffix array SA for this string is the sorted list of pointers to all suffixes of D. • (Manber & Myers, 1990)
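The definition translates directly into code, and repeated substrings then show up as common prefixes of neighbouring suffixes. This is a minimal sketch with hypothetical function names: it builds the array by sorting whole suffixes, which is much slower than the O(n log n) construction of Manber and Myers, and it uses 0-based positions.

```python
def suffix_array(d):
    """Start positions of the suffixes of d, in lexicographic order."""
    return sorted(range(len(d)), key=lambda i: d[i:])


def longest_repeated_substring(d):
    """Longest substring occurring at least twice, found via adjacent suffixes."""
    sa = suffix_array(d)
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        l = 0
        while a + l < len(d) and b + l < len(d) and d[a + l] == d[b + l]:
            l += 1
        if l > len(best):
            best = d[a:a + l]
    return best


print(suffix_array("banana"))
print(longest_repeated_substring("banana"))
```

For banana the sorted suffixes are a, ana, anana, banana, na, nana; the neighbours ana and anana share the prefix ana, which is the longest repeated substring (the two occurrences overlap).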