On Finding Repeats in Strings

A Survey of Algorithms • 胡吉祥 • 2004.09.29

I. Problem Description • Repeat: recurrence of a pattern • Aim: 1) Find occurrences of repeated substrings in a string or a set of strings. 2) Find a common structure/pattern shared by a family of strings.

1. Basic Definition • A pair of substrings R = (S[i1, j1], S[i2, j2]) is called a repeat: - exact repeat: S[i1, j1] = S[i2, j2] - k-mismatch repeat: k mismatches between S[i1, j1] and S[i2, j2] - k-differences repeat: k differences (mismatches, insertions, deletions) between S[i1, j1] and S[i2, j2]

Definition 2 • A repeat in a string x is a tuple M(x,u) = (p; i1, i2, …, ir), r ≥ 2, where u = x[i1…i1+p-1] = x[i2…i2+p-1] = … = x[ir…ir+p-1]. • u is said to be a repeating substring of x and the generator of M(x,u). • p = |u| is the period of the repeat and r is its exponent.

2. Maximal exact matches • Definition 3: If S[i, i+k] = S[j, j+k], where 1 ≤ k ≤ n-1 and 1 ≤ i, j ≤ n-k, then we say that S[i, i+k] and S[j, j+k] match. This match is represented by the triple (i, j, k+1), i.e., by the two starting positions of the match and the length of the match.

Definition 4 • A match (i, j, k) in a string S is left-extensible if i, j > 1 and S[i-1] = S[j-1], and right-extensible if i+k, j+k ≤ n and S[i+k] = S[j+k] (k is the match length, so S[i+k] and S[j+k] are the characters immediately after the two copies). • A match is maximal if it is neither left-extensible nor right-extensible and it is not the trivial match (1, 1, n), where n is the length of the string.

3. Approximate Repeats • Exact repeat: identical substrings • Inexact repeats: distance measures between recurrences • Hamming distance: mismatch => k-mismatch repeat • Levenshtein / edit distance: substitution, insertion, deletion => k-differences repeat
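
The two distance measures named above can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here, and the edit distance uses the textbook dynamic program rather than anything specific to the surveyed papers.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, O(|a| * |b|) time."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]
```

Under these definitions, two substrings form a k-mismatch repeat when `hamming_distance` is at most k, and a k-differences repeat when `edit_distance` is at most k.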

II. Algorithms for finding repeats in strings • 1. N-gram Iterative / KMR • 2. Martinez’s Sorting Algorithm • 3. Fingerprint • 4. Suffix Tree • 5. GST (Generalized Suffix Tree) • 6. Suffix Array • 7. Grammar Induction / Sequitur • 8. Term Frequency Statistical Approach • 9. Others

1. N-gram Iterative / KMR • [KMR, 1972] • A classical algorithm for finding all exact repetitions in a string. • An O(N log k) algorithm for finding exact repeated words of length k in a sequence of length N.

Aim • Given a string s, KMR solves the following problems: • 1. Identify the positions of all words of a fixed length k that appear repeated in s. • 2. Find the length kmax of the longest repeated word in s, and solve problem 1 for k = kmax.

Idea • Iterative construction of an equivalence relation over the positions of s. • Definition: Given a string s = s1s2…sn ∈ Σ^n, two positions i and j ∈ {1, …, n-k+1} of s are said to be k-equivalent, written i Ek j, if and only if the words of length k starting at these positions in s are identical.

Lemma 1: Let s = s1s2…sn ∈ Σ^n. Let a, b ∈ {1, …, n} with b ≤ a, and i, j ∈ {1, …, n-(a+b)+1}. Then i E(a+b) j iff i Ea j and i+b Ea j+b. • Lemma 2: Let s = s1s2…sn ∈ Σ^n. Let a, b ∈ {1, …, n}, and i, j ∈ {1, …, n-(a+b)+1}. Then i E(a+b) j iff i Ea j and i+a Eb j+a.

Constructing Ek and its classes is basically a set-intersection operation on the classes of Ek' for k' < k. • Doubling technique => suffix sorting • Analysis: time complexity O(n log k), space complexity O(n)
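
The doubling step of Lemma 1 can be sketched as follows. This is a minimal illustration, not the original KMR implementation: it relabels classes by sorting pairs of labels, which costs O(n log n) per round instead of the bucket-based refinement that gives the O(n log k) bound. Positions are 0-based, and the names `kmr_classes` and `repeated_words` are chosen here.

```python
from collections import defaultdict


def kmr_classes(s: str, k: int) -> list:
    """Return one label per 0-based start position so that two positions share
    a label iff the words of length k starting there are identical (E_k)."""
    n = len(s)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(s)")
    # E_1: two positions are equivalent iff they hold the same character.
    rank = {c: r for r, c in enumerate(sorted(set(s)))}
    labels = [rank[c] for c in s]
    a = 1
    while a < k:
        b = min(a, k - a)  # Lemma 1 requires b <= a
        # i E_{a+b} j  iff  i E_a j  and  i+b E_a j+b
        pairs = [(labels[i], labels[i + b]) for i in range(n - (a + b) + 1)]
        rank = {p: r for r, p in enumerate(sorted(set(pairs)))}
        labels = [rank[p] for p in pairs]
        a += b
    return labels


def repeated_words(s: str, k: int) -> dict:
    """Map each word of length k occurring at least twice to its start positions."""
    groups = defaultdict(list)
    for pos, label in enumerate(kmr_classes(s, k)):
        groups[label].append(pos)
    return {s[p[0]:p[0] + k]: p for p in groups.values() if len(p) > 1}
```

For example, `repeated_words("abracadabra", 4)` returns `{"abra": [0, 7]}`. Problem 2 (finding kmax) could be handled by searching over k with the same routine.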

Example for KMR

Example for KMR (cont.)

Variation of KMR • [Land, 1989]: finding a common structure shared by a family of strings • Aim: find a structure common to N strings of characters belonging to a given alphabet. - A word is considered relevant if it appears without any distortion in at least a fixed number q of the N strings, and with some distortion in the remaining N-q strings.

Adaptation of the KMR Algorithm • Arrange the N strings in a single vector S and take the boundaries of the N strings in S into account: any word must be wholly enclosed in a single string. • Memorize the locations in S of the first and last characters of each string. • Adapt the KMR method so that it keeps only the words that satisfy the quorum condition and do not cross any boundary.

2. Martinez’s Sorting Algorithm • [Martinez, 1983]: • A sorting algorithm with time complexity O(n log n) that displays a priori unknown identically repeated patterns in several molecular sequences. • Treats the problem of finding repeats in molecular sequences as a sorting problem. • Linear in space complexity and N log N in expected time complexity. • Involves no path tracing. The repeats are immediate and are reported during sorting.

Idea • (1) First construct a sequence P of pointers such that pointer value P[i] is the location of the ith element in the sequence S. • (2) Sort P so that it constitutes an ordering of S: P[i] <, >, = P[j] ⇔ S[P[i]] <, >, = S[P[j]] - all the pointer values which point to the same kind of element in S are grouped together. - there are at most m groups of pointer values in this first sorting (m = |Σ|).

Idea (cont.) • (3) Sort each of these groups of P so that in the resulting subgroups two pointer values belong to the same one if and only if the elements immediately following the ones they point to are equal. • (4) When no subgroups contain more than one pointer value the process is complete.
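
Steps (1) to (4) can be sketched as repeated sorting of pointer groups. This is an illustrative reading of the idea, not Martinez's original code: the name `martinez_repeats` and the `min_len` parameter are assumptions, positions are 0-based, and pointers that run off the end of the sequence are simply dropped. In the worst case (for example a run of one repeated character) this sketch takes quadratic time.

```python
from itertools import groupby


def martinez_repeats(s: str, min_len: int = 1) -> list:
    """Return (length, positions) for every group of >= 2 positions whose
    substrings of that length are identical, refining one character at a time."""
    n = len(s)
    groups = [list(range(n))]   # step (1): the pointer sequence P as one group
    depth = 0                   # number of characters already known to agree
    found = []
    while groups:               # step (4): stop when no group has > 1 pointer
        refined = []
        for group in groups:
            # steps (2)/(3): sort the group on the character at offset `depth`
            alive = sorted((p for p in group if p + depth < n),
                           key=lambda p: s[p + depth])
            # equal characters are now adjacent; split into subgroups
            for _, run in groupby(alive, key=lambda p: s[p + depth]):
                sub = list(run)
                if len(sub) > 1:
                    refined.append(sub)
        depth += 1
        # every surviving subgroup is a repeat of length `depth`
        found.extend((depth, sub) for sub in refined if depth >= min_len)
        groups = refined
    return found
```

For example, `martinez_repeats("abcbc", min_len=2)` returns `[(2, [1, 3])]`, the two occurrences of "bc". As the slides note, the repeats are reported while sorting, with no separate tracing pass.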

Example

Conclusion: • The method is no more than repeated application of a sorting algorithm. The overall speed is essentially determined by the speed of the sorting algorithm employed. No significant storage space is required beyond that necessary for the sequence S and its pointer sequence P.

3. Fingerprint • Definition: A fingerprint (a.k.a. signature) of an object Ob is a small tag f(Ob) with the following properties: • a) f is a function of Ob, so f(A) ≠ f(B) ⇒ A ≠ B • b) Pr(f(A) = f(B) | A ≠ B) is very small

Useful Properties of Fingerprints • Fast Calculation • Low collision rate • Cryptographically unbreakable • Updatable • Concatenation of Objects

Karp-Rabin Style Fingerprints • A = [a0, a1, …, an-1] • Calculation time is linear in n

• Easy to calculate for consecutive n-grams • Easy to calculate signatures of concatenations
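
The formula on the original slide did not survive extraction, so the following sketch uses the common polynomial form of a Karp-Rabin fingerprint, f(A) = (a0·B^(n-1) + a1·B^(n-2) + … + an-1) mod M. The radix `BASE` and modulus `MOD` are assumptions chosen for illustration, as is the function name `repeated_ngrams`. It shows why consecutive n-grams are cheap: each window's fingerprint is derived from the previous one in O(1).

```python
from collections import defaultdict

BASE = 257              # assumed radix
MOD = (1 << 61) - 1     # assumed modulus: the Mersenne prime 2^61 - 1


def repeated_ngrams(text: str, n: int) -> dict:
    """Map each n-gram occurring at least twice to its 0-based start positions."""
    if n <= 0 or n > len(text):
        return {}
    high = pow(BASE, n - 1, MOD)    # weight of the character leaving the window
    fp = 0
    for ch in text[:n]:             # fingerprint of the first window, O(n)
        fp = (fp * BASE + ord(ch)) % MOD
    buckets = defaultdict(list)
    buckets[fp].append(0)
    for i in range(1, len(text) - n + 1):
        # slide the window: drop text[i-1], append text[i+n-1], in O(1)
        fp = ((fp - ord(text[i - 1]) * high) * BASE + ord(text[i + n - 1])) % MOD
        buckets[fp].append(i)
    # equal fingerprints only suggest equality, so compare the actual n-grams
    verified = defaultdict(list)
    for positions in buckets.values():
        if len(positions) > 1:
            for p in positions:
                verified[text[p:p + n]].append(p)
    return {word: ps for word, ps in verified.items() if len(ps) > 1}
```

For example, `repeated_ngrams("xabxab", 3)` returns `{"xab": [0, 3]}`. The final verification pass guards against the small collision probability from property b).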

Discover repetitions using fingerprints • Algorithm FSR: finds the shortest repetition. - Input: a string y ∈ Σ*Σ (a non-empty string), |y| = n - Output: a pair of indices 1 ≤ i < j ≤ n such that y(i, j-1) = y(j, 2j-i-1) and |j-i| is minimal • Time complexity: O(n log n)
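
The input/output contract of FSR can be illustrated with a brute-force search over prefix fingerprints. This is not the O(n log n) FSR algorithm itself: it tries every period p in increasing order and takes O(n^2) fingerprint comparisons in the worst case. The constants and the name `shortest_repetition` are assumptions made for this sketch.

```python
BASE = 257              # assumed radix
MOD = (1 << 61) - 1     # assumed modulus: the Mersenne prime 2^61 - 1


def shortest_repetition(y: str):
    """Return 1-based (i, j) with y(i, j-1) = y(j, 2j-i-1) and j - i minimal,
    or None if y contains no two adjacent equal blocks."""
    n = len(y)
    prefix = [0] * (n + 1)      # prefix[t] = fingerprint of y[:t]
    power = [1] * (n + 1)       # power[t] = BASE**t mod MOD
    for idx, ch in enumerate(y):
        prefix[idx + 1] = (prefix[idx] * BASE + ord(ch)) % MOD
        power[idx + 1] = (power[idx] * BASE) % MOD

    def fp(lo: int, hi: int) -> int:
        """Fingerprint of y[lo:hi], obtained in O(1) from the prefix table."""
        return (prefix[hi] - prefix[lo] * power[hi - lo]) % MOD

    for p in range(1, n // 2 + 1):          # candidate period, smallest first
        for i in range(n - 2 * p + 1):      # 0-based start of the first block
            if (fp(i, i + p) == fp(i + p, i + 2 * p)
                    and y[i:i + p] == y[i + p:i + 2 * p]):  # rule out collisions
                return i + 1, i + p + 1     # convert to 1-based (i, j)
    return None
```

For example, `shortest_repetition("abcabc")` returns `(1, 4)`, since y(1, 3) = y(4, 6) = "abc" and no shorter adjacent repetition exists.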

4. Suffix Tree Based Algorithms • Repeat finding with suffix trees: – Exact repeats – Approximate repeats

Suffix Trees • A suffix tree is a trie-like data structure representing all suffixes of a string. • A suffix tree of a string S, T(S), is a rooted tree whose edges are labeled with strings such that – all edges leaving a node begin with different characters and – the paths from the root to the leaves represent all the suffixes of S.

[Figure: suffix tree of xabxac, with leaves numbered 1 to 6] • Suffixes: {xabxac, abxac, bxac, xac, ac, c}

(1) Finding Exact Repeats • Folklore (see e.g. Gusfield, 1997): it is possible to find all pairs of repeated substrings (repeats) in S in linear time. • Idea: • Consider string S and its suffix tree T(S). • Repeated substrings of S correspond to internal locations in T(S). • Leaf numbers tell us the positions where substrings occur. • Analysis: O(n + |output|) time, O(n) space

Finding maximal exact repeats • Idea: (see e.g. Gusfield, 1997) • For right-maximality (X ≠ Y) – consider only internal nodes of T(S) – report only pairs of leaves from different subtrees (or from different leaf-lists) • For left-maximality (A ≠ B) – keep lists for the different left-characters – report only pairs from different lists
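
The two conditions can be sketched on a naive suffix trie. This is an illustration only: a trie built by inserting every suffix takes O(n^2) time and space, whereas the linear bounds quoted in these slides need a real suffix tree. The names `Node` and `maximal_repeats`, the `$` terminator, and the treatment of position 0 as having a unique "no left character" marker are assumptions of this sketch; positions are 0-based.

```python
class Node:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.start = None       # set on leaves: start position of the suffix


def maximal_repeats(s: str, terminator: str = "$") -> dict:
    """Map each maximal repeated substring of s to its sorted start positions."""
    if terminator in s:
        raise ValueError("terminator must not occur in s")
    text = s + terminator
    root = Node()
    for start in range(len(text)):          # insert every suffix
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start                  # unique terminator => this is a leaf

    found = {}

    def visit(node: Node, label: str) -> list:
        """Return the suffix starts below `node`; record maximal repeats."""
        if node.start is not None:
            return [node.start]
        starts = []
        for ch, child in node.children.items():
            starts.extend(visit(child, label + ch))
        # right-maximal: the node branches, so occurrences continue differently
        if label and len(node.children) >= 2:
            # left-maximal: occurrences are preceded by different characters
            left = {text[p - 1] if p > 0 else None for p in starts}
            if len(left) >= 2:
                found[label] = sorted(starts)
        return starts

    visit(root, "")
    return found
```

For example, `maximal_repeats("xabxac")` returns `{"xa": [0, 3]}`: "x" always continues with "a" (not right-maximal) and "a" is always preceded by "x" (not left-maximal). The recursion depth grows with the string length, so this sketch is meant for short inputs.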

Dup: Finding Duplication in Strings and Software (Baker, 1993) • Maximal matching pairs must have both different right context and different left context. • Build up lists of suffixes grouped by left context. • At each node, compare the lists found for its subtrees to identify longest matches. • Analysis: O(n + |output|) time, O(n) space

[Figure: Finding Maximal Repeats • suffix tree of abcbc# • left diverse (left characters a, c)]

(2) Finding degenerate repeats • k-mismatch repeats (Hamming distance) / k-differences repeats (edit distance) • (Kurtz et al., 2000/2001; Adebiyi et al., 2001; Volfovsky et al., 2001; Kolpakov & Kucherov, 2001)

Idea: • Minimal length l, up to k errors → filter method (“seed and extend”) - First search for exact repeats of small but appropriate length. - Form maximal approximate repeats by extending the exact repeats (called seeds) into the surrounding sequence, allowing k mismatches or k differences between recurrences. - Two types of exact repeats can be used as seeds: maximal repeats and supermaximal repeats.

Algorithm: 1. Search for local exact repeats (seeds). 2. Extend the seeds while allowing up to k errors. 3. If the extension is long enough, output the repeat. • Analysis: O(n + ζk^3) time with E(ζ) = O(n^2 / 4^s), where s is the minimal seed length.
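
The three steps can be sketched for the k-mismatch (Hamming) case. This is a simplified illustration, not the algorithm of any of the cited papers: seeds are all pairs of identical substrings of a fixed length found with a dictionary (not maximal repeats from a suffix tree), and the extension is greedy, spending the mismatch budget to the right first and then to the left. The names `extend` and `k_mismatch_repeats` and all parameters are assumptions; positions are 0-based.

```python
from collections import defaultdict
from itertools import combinations


def extend(s: str, i: int, j: int, step: int, budget: int) -> tuple:
    """Walk from positions i and j in direction `step` (+1 or -1), allowing at
    most `budget` mismatches. Return (extension length, mismatches used); the
    extension always ends on a matching pair of characters."""
    length = used = 0
    best_len = best_used = 0
    while 0 <= i < len(s) and 0 <= j < len(s):
        if s[i] != s[j]:
            if used == budget:
                break
            used += 1
        length += 1
        if s[i] == s[j]:
            best_len, best_used = length, used
        i += step
        j += step
    return best_len, best_used


def k_mismatch_repeats(s: str, seed_len: int, k: int, min_len: int) -> list:
    """Return sorted (start1, start2, length) triples with <= k mismatches."""
    seeds = defaultdict(list)
    for p in range(len(s) - seed_len + 1):      # step 1: exact seeds
        seeds[s[p:p + seed_len]].append(p)
    results = set()
    for positions in seeds.values():
        for i, j in combinations(positions, 2):
            # step 2: extend right, then left with whatever budget remains
            right, used = extend(s, i + seed_len, j + seed_len, +1, k)
            left, _ = extend(s, i - 1, j - 1, -1, k - used)
            length = left + seed_len + right
            if length >= min_len:               # step 3: keep long extensions
                results.add((i - left, j - left, length))
    return sorted(results)
```

For example, `k_mismatch_repeats("abcdefXXabcxefYY", seed_len=3, k=1, min_len=6)` returns `[(0, 8, 6)]`, pairing "abcdef" with "abcxef". The k-differences case would replace `extend` with a banded edit-distance extension.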

5. Generalized Suffix Tree (GST) • GST: • a Suffix Tree that combines the suffixes of a set {S1, ...., Sn} of strings.

Generalized Suffix Trees - Example • T1 = abab, T2 = aab, with end markers $ and # • [Figure: generalized suffix tree of T1 and T2]

Generalized Suffix Trees - Applications • Searching for a pattern in a database of strings. • Finding longest common substring.

(1) k-common substring Problem (Gusfield, 1997) • Problem definition: longest common substrings of more than two strings: - Input: strings S1, …, SK (total length n) - Output: l(j) (and pointers to substrings) for 2 ≤ j ≤ K • l(j): the length of the longest substring that appears in at least j distinct strings of the input set.

Solution • - Build a generalized suffix tree for the K strings. Each string has a unique end character, so each leaf shows up only once. - C(v): number of distinct leaf labels in the subtree rooted at node v. - Given the C(v) values and the string-depth values, do a simple traversal of the tree to find the K-1 values l(j) and the pointers to locations in the substrings. • Time complexity: O(Kn)

(2) Color Set Size Problem (Lucas Hui, 1992) • CSS (Color Set Size) Problem: given a rooted tree of size n with l leaves colored from 1 to m, m ≤ l, for each vertex u find the number of different leaf colors in the subtree rooted at u. • CSS Linear Theorem: the CSS problem can be solved in O(n) time and space.

K-out-of-m • k-out-of-m Problem: find the longest substring that is common to at least k strings, for a fixed k between 1 and m. • k-out-of-m Problem ⇔ finding an internal vertex u in the GST such that css(u) ≥ k and the path length from the root to u is maximum.

Algorithm KM • - Build the GST for the input strings; - Solve the CSS problem for the GST; - Compute the path length for all vertices using a pre-order traversal; - Among all vertices with css() value ≥ k, select one with maximum path length; - Output the path from the root to the selected vertex as the answer.
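
The steps of Algorithm KM can be sketched on a naive generalized suffix trie. This is only an illustration: each node stores its color set explicitly while the suffixes are inserted, which takes quadratic time and space, whereas Hui's CSS algorithm computes the color set sizes in linear time on a real GST. The names `TrieNode` and `longest_k_common` are assumptions; ties between equally long answers are broken lexicographically.

```python
class TrieNode:
    __slots__ = ("children", "colors")

    def __init__(self):
        self.children = {}      # next character -> child node
        self.colors = set()     # indices of input strings passing through here


def longest_k_common(strings: list, k: int) -> str:
    """Return the longest substring occurring in at least k of the strings
    (the empty string if there is none)."""
    root = TrieNode()
    for color, text in enumerate(strings):      # build the generalized trie
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.colors.add(color)          # css(node) = len(node.colors)
    best = ""
    stack = [(root, "")]
    while stack:                                # traversal tracking path labels
        node, label = stack.pop()
        for ch, child in node.children.items():
            # a child's colors are a subset of its parent's, so pruning is safe
            if len(child.colors) >= k:
                word = label + ch
                if len(word) > len(best) or (len(word) == len(best) and word < best):
                    best = word
                stack.append((child, word))
    return best
```

For example, `longest_k_common(["abab", "aab"], 2)` returns `"ab"`, the longest common substring of the two example strings T1 and T2 used earlier.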

Multiple common substring Problem • a. Given m input strings, for all k between 1 and m, find the longest pattern which appears in at least k input strings. • b. Given m input strings and integers k and l, find a pattern with length l which appears in exactly k input strings.

c. Given m input strings and integers l1 < l2, find the pattern with length between l1 and l2 which appears in as many input strings as possible. • All the above problems can be solved in O(n + |output|) time by modifying the k-out-of-m solution.

6. Suffix Array • Definition: Given a string D, the suffix array SA for this string is the sorted list of pointers to all suffixes of D. • (Manber, Myers 1990)
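
The definition can be sketched directly by sorting suffix start positions. This is a naive construction for illustration, not the Manber-Myers algorithm: comparing whole suffixes makes the sort O(n^2 log n) in the worst case. The function names are chosen here and positions are 0-based. Because repeated substrings are prefixes of suffixes that end up adjacent in the sorted order, scanning neighbouring entries is enough to find the longest repeat.

```python
def suffix_array(d: str) -> list:
    """Start positions of all suffixes of d, in lexicographic order."""
    return sorted(range(len(d)), key=lambda i: d[i:])


def longest_repeated_substring(d: str) -> str:
    """Longest substring occurring at least twice (occurrences may overlap)."""
    sa = suffix_array(d)
    best = ""
    for a, b in zip(sa, sa[1:]):        # neighbouring suffixes in sorted order
        length = 0
        while (a + length < len(d) and b + length < len(d)
               and d[a + length] == d[b + length]):
            length += 1                 # extend the common prefix
        if length > len(best):
            best = d[a:a + length]
    return best
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]` and `longest_repeated_substring("banana")` returns `"ana"`.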
