Languages with mismatches and an application to approximate indexing

Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi

Outline • Motivations and basic definitions • The languages L(S,k,r) • The repetition index R(S,k,r) • Some combinatorial properties of the repetition index • A trie based approach for approximate indexing data structures • Conclusions and related works

Main motivation: Approximate String Matching. It concerns finding strings in texts in the presence of “errors” or “mismatches”. It has several applications in data analysis and data retrieval, such as: • Recovering original signals after their transmission over noisy channels; • Finding DNA subsequences after possible mutations; • Searching text that contains typing or spelling errors; • Retrieving musical passages.

Each application uses a different error model, which defines how different two strings are. Some of the best-studied error models are: • Levenshtein or edit distance [Levenshtein, 1965]: it allows insertions, deletions and substitutions of single characters (each substituted by a different one) in both strings; • Hamming distance [Sankoff and Kruskal, 1983]: it allows substitutions only; • Scoring functions: they are not distances in the mathematical sense; they measure the degree of similarity between two words.

The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transforms x into y (and ∞ if no such sequence exists). The possible operations are: 1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition. We consider the Hamming distance, which allows only substitutions, each of cost 1 (simplified definition). It is finite whenever |x| = |y|, and in that case 0 ≤ d(x,y) ≤ |x|. Ex.: x = acgtatct, y = aggttact, d(x,y) = 3 (in the simplified definition).
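A minimal Python sketch of the simplified Hamming distance defined above. The function name `hamming` is an illustrative choice, not part of the original slides; strings of different lengths get distance infinity because no sequence of substitutions transforms one into the other.

```python
import math

def hamming(x: str, y: str) -> float:
    """Number of mismatching positions, or infinity when |x| != |y|."""
    if len(x) != len(y):
        return math.inf
    # Each position where the characters differ costs one substitution.
    return sum(a != b for a, b in zip(x, y))

print(hamming("acgtatct", "aggttact"))  # 3, as in the slide's example
```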

Typical approaches for finding a string x in a text S: to allow a percentage D of errors, or to fix their number k. Hybrid approach: introduce a new parameter r and allow at most k errors in any window (or factor) of length r. • Let S be a string over the alphabet Σ, and let k, r be non-negative integers such that k ≤ r. A string u occurs in S at position l up to k errors in a window of size r, or simply kr-occurs in S at position l, if one of the following two conditions holds: • if |u| < r, then d(u, S(l, l+|u|-1)) ≤ k; • if |u| ≥ r, then ∀ i, 1 ≤ i ≤ |u|-r+1, d(u(i,i+r-1), S(l+i-1, l+i+r-2)) ≤ k. • A string u satisfying the above property is a kr-occurrence of S. Let L(S,k,r) be the set of words u that kr-occur in S at position l, for some l, 1 ≤ l ≤ |S|-|u|+1. The parameter r introduced in the previous definition can be fixed or can vary as a function of the text.
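The two conditions of the definition can be checked directly. The sketch below is only an illustration: the names `hamming` and `kr_occurs` are hypothetical, positions are 1-based as in the slides, and no attempt is made at efficiency.

```python
def hamming(x: str, y: str) -> int:
    # Assumes len(x) == len(y); counts mismatching positions.
    return sum(a != b for a, b in zip(x, y))

def kr_occurs(u: str, S: str, l: int, k: int, r: int) -> bool:
    """True if u kr-occurs in S at 1-based position l."""
    if not 1 <= l <= len(S) - len(u) + 1:
        return False
    seg = S[l - 1 : l - 1 + len(u)]  # the factor S(l, l+|u|-1)
    if len(u) < r:
        # Short word: at most k mismatches overall.
        return hamming(u, seg) <= k
    # Long word: at most k mismatches in every window of size r.
    return all(
        hamming(u[i : i + r], seg[i : i + r]) <= k
        for i in range(len(u) - r + 1)
    )

print(kr_occurs("bbba", "abaa", 1, 1, 2))  # True
print(kr_occurs("bbba", "abaa", 1, 1, 4))  # False: 2 mismatches in the single window
```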

Example of L(S,k,r) • S = abaa, k = 1, r = 2 • L(S,1,2) = {a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba} • bbba ∈ L(S,1,2), but bbba ∉ L(S,1,4)
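For a text this small, the language can be enumerated by brute force. The sketch below assumes the alphabet is the set of letters of S (here {a, b}) and only lists non-empty words, as the example does; `kr_occurs` and `language` are hypothetical helper names, and the enumeration is exponential in |S|.

```python
from itertools import product

def kr_occurs(u, S, l, k, r):
    """True if u kr-occurs in S at 1-based position l."""
    seg = S[l - 1 : l - 1 + len(u)]
    w = min(r, len(u))  # a word shorter than r is compared as a single window
    return all(
        sum(a != b for a, b in zip(u[i : i + w], seg[i : i + w])) <= k
        for i in range(len(u) - w + 1)
    )

def language(S, k, r, alphabet=None):
    """All non-empty words over the alphabet that kr-occur somewhere in S."""
    alphabet = sorted(set(S)) if alphabet is None else alphabet
    words = set()
    for n in range(1, len(S) + 1):
        for letters in product(alphabet, repeat=n):
            u = "".join(letters)
            if any(kr_occurs(u, S, l, k, r) for l in range(1, len(S) - n + 2)):
                words.add(u)
    return words

L = language("abaa", 1, 2)
print(len(L))                               # 22
print("bbba" in L)                          # True
print("bbba" in language("abaa", 1, 4))     # False
```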

The Repetition Index R(S,k,r) of S is the smallest integer h such that every string of length h kr-occurs in at most one position of the text: R(S,k,r) = min{h ≥ 1 s.t. ∀ i, j, 1 ≤ i, j ≤ |S| - h + 1, V(S(i,i+h-1),k,r) ∩ V(S(j,j+h-1),k,r) ≠ ∅ ⇒ i = j}, where V(u,k,r) is the set of all words of length |u| that have at most k errors in every window of size r with respect to u. • Remarks: • R(S,k,r) is well defined because the integer h = |S| is an element of the set described above; • If k/r ≥ 1/2 then R(S,k,r) = |S|.
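The definition can be evaluated naively on tiny texts. The sketch below is an illustration only (exponential time, hypothetical names `kr_occurs` and `repetition_index`). It assumes candidate words can be restricted to the letters of S, which is safe because replacing a letter that does not appear in S by a letter of S never adds a mismatch.

```python
from itertools import product

def kr_occurs(u, S, l, k, r):
    """True if u kr-occurs in S at 1-based position l."""
    seg = S[l - 1 : l - 1 + len(u)]
    w = min(r, len(u))  # a word shorter than r is compared as a single window
    return all(
        sum(a != b for a, b in zip(u[i : i + w], seg[i : i + w])) <= k
        for i in range(len(u) - w + 1)
    )

def repetition_index(S, k, r):
    """Smallest h such that no word of length h kr-occurs at two positions of S."""
    alphabet = sorted(set(S))
    for h in range(1, len(S) + 1):
        positions = range(1, len(S) - h + 2)
        repeated = any(
            sum(kr_occurs("".join(t), S, l, k, r) for l in positions) > 1
            for t in product(alphabet, repeat=h)
        )
        if not repeated:
            return h  # always reached at h = |S|, where only one position exists

print(repetition_index("abaa", 0, 1))  # 2: exact matching, all factors of length 2 are distinct
print(repetition_index("abaa", 1, 2))  # 4 = |S|, consistent with the remark for k/r >= 1/2
```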

Example. Let us consider the string S = a b c d e f g h i j k l m n o a b z d e z g h z j k z m n z with k = 1 and r = 2. • k/r = 1/2 ⇒ R(S,1,2) = |S| = 30. • A word w of length R(S,1,2)-1 = 29 that kr-occurs (with k = 1, r = 2) at both positions 1 and 2 is w = a c c e e g g i i k k m m o o b b d d z z h h j j z z n n
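The claim about w can be checked mechanically. The snippet below is a small verification sketch with a hypothetical helper `kr_occurs`; S and w are copied from the example with the spaces removed.

```python
def kr_occurs(u, S, l, k, r):
    """True if u kr-occurs in S at 1-based position l."""
    if not 1 <= l <= len(S) - len(u) + 1:
        return False
    seg = S[l - 1 : l - 1 + len(u)]
    w = min(r, len(u))  # a word shorter than r is compared as a single window
    return all(
        sum(a != b for a, b in zip(u[i : i + w], seg[i : i + w])) <= k
        for i in range(len(u) - w + 1)
    )

S = "abcdefghijklmnoabzdezghzjkzmnz"
w = "acceeggiikkmmoobbddzzhhjjzznn"

print(len(S), len(w))             # 30 29
print(kr_occurs(w, S, 1, 1, 2))   # True: mismatches fall on even positions only
print(kr_occurs(w, S, 2, 1, 2))   # True: mismatches fall on odd positions only
```

Since a word of length 29 kr-occurs at two positions, no h below 30 can satisfy the definition of the repetition index for this text.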

Some combinatorial properties of R(S,k,r) • Lemma 1: If k and S are fixed, R(S,k,r) is a non-increasing function of r; • Lemma 2: If r and S are fixed, R(S,k,r) is a non-decreasing function of k; • Lemma 3: If k and S are fixed and r ≥ R(S,k,r), the repetition index becomes constant. • Theorem: If k and S are fixed, there exists exactly one solution to the equation r = R(S,k,r).

An index over a fixed text S is an abstract data type whose basic set is Fact(S) and which contains operations giving access to the factors of S. The principal operations are: 1) Membership: given a word x, say whether x ∈ Fact(S); 2) Position: given x ∈ Fact(S), find the left position of its first (resp. last) occurrence in S; 3) Number of occurrences: given x ∈ Fact(S), find the number of occurrences of x in S; 4) List of positions: given x ∈ Fact(S), produce the list occ(x) of the occurrences of x in S. All these operations can easily be extended to the case of approximate string matching.

We give the following results. • The size of this indexing data structure is linear times a polylog of the size of the text S on average, i.e. O(|S|·log^k|S|). • For each word x, the time spent by our algorithms for finding the list occ(x) of all kr-occurrences of the word x in the text S is proportional to |x|+|occ(x)| on average.

Description of the indexing data structure • Build the trie T(I,k,r) that represents the set of all possible strings of length R(S,k,r) that kr-occur in the string S; • Attach to each leaf of the trie T(I,k,r) an integer i, the starting position in S of the kr-occurrence represented by the concatenation of the labels from the root to that leaf.
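A toy version of this construction, under explicit assumptions: the trie is a nested Python dict, a leaf is the dict `{"pos": i}`, the parameter `h` is assumed to equal R(S,k,r) (so each word has at most one position), and the words are found by brute-force enumeration over the letters of S. The names `build_trie` and `leaves` are hypothetical, and this sketch makes no claim about the construction cost or size bounds discussed in the talk.

```python
from itertools import product

def kr_occurs(u, S, l, k, r):
    """True if u kr-occurs in S at 1-based position l."""
    seg = S[l - 1 : l - 1 + len(u)]
    w = min(r, len(u))  # a word shorter than r is compared as a single window
    return all(
        sum(a != b for a, b in zip(u[i : i + w], seg[i : i + w])) <= k
        for i in range(len(u) - w + 1)
    )

def build_trie(S, k, r, h):
    """Trie of all words of length h that kr-occur in S; h is assumed to be R(S,k,r)."""
    root = {}
    for letters in product(sorted(set(S)), repeat=h):
        u = "".join(letters)
        for l in range(1, len(S) - h + 2):
            if kr_occurs(u, S, l, k, r):
                node = root
                for c in u:
                    node = node.setdefault(c, {})
                node["pos"] = l  # leaf label: starting position of the kr-occurrence
                break            # unique position, by the choice of h
    return root

def leaves(node, prefix=""):
    """Map each root-to-leaf word to the position stored in its leaf."""
    if "pos" in node:
        return {prefix: node["pos"]}
    out = {}
    for c, child in node.items():
        out.update(leaves(child, prefix + c))
    return out

print(leaves(build_trie("abaa", 0, 1, 2)))  # {'aa': 3, 'ab': 1, 'ba': 2}
```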

Finding all kr-occurrences of a string x in a text S • “Read” the string x in the trie for as long as possible and let q be the last visited node: i) if q is a leaf and |x| = R(S,k,r) ⇒ return i; ii) if q is a leaf and |x| > R(S,k,r) ⇒ if x kr-occurs in S at position i then return i, else “x is a false positive”; iii) if |x| < R(S,k,r) ⇒ return occ(x). • In cases i) and ii) the list of all kr-occurrences of x has at most one element; in case iii) it can have more than one element. • In iii) we use the Colored Range Query solution [Muthukrishnan, SODA’02].
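The three cases can be sketched on a toy dict-based trie. Everything here is an assumption made for illustration: the names `build_trie`, `leaf_positions` and `find`, the brute-force construction, and the parameter `h`, assumed to equal R(S,k,r). In case iii) the sketch simply walks the subtree and collects distinct leaf positions; this is a plain substitute for the Colored Range Query structure and does not achieve its time bound. The sketch also scans the last h-1 starting positions of S directly, since no word of length h starts there.

```python
from itertools import product

def kr_occurs(u, S, l, k, r):
    """True if u kr-occurs in S at 1-based position l."""
    if not 1 <= l <= len(S) - len(u) + 1:
        return False
    seg = S[l - 1 : l - 1 + len(u)]
    w = min(r, len(u))
    return all(sum(a != b for a, b in zip(u[i:i + w], seg[i:i + w])) <= k
               for i in range(len(u) - w + 1))

def build_trie(S, k, r, h):
    """Nested-dict trie of words of length h; a leaf is {"pos": i}."""
    root = {}
    for letters in product(sorted(set(S)), repeat=h):
        u = "".join(letters)
        for l in range(1, len(S) - h + 2):
            if kr_occurs(u, S, l, k, r):
                node = root
                for c in u:
                    node = node.setdefault(c, {})
                node["pos"] = l
                break
    return root

def leaf_positions(node):
    """Distinct positions stored in the leaves below node (naive traversal)."""
    if "pos" in node:
        return {node["pos"]}
    out = set()
    for child in node.values():
        out |= leaf_positions(child)
    return out

def find(x, S, k, r, h, trie):
    """Sorted list of 1-based positions where x kr-occurs in S."""
    node = trie
    for c in x[:h]:            # read x for as long as possible, at most h letters
        node = node.get(c)
        if node is None:
            break
    if len(x) >= h:            # cases i) and ii): at most one candidate position
        if node is None:
            return []
        i = node["pos"]
        return [i] if kr_occurs(x, S, i, k, r) else []  # else: false positive
    # case iii): several positions are possible
    found = leaf_positions(node) if node is not None else set()
    for l in range(len(S) - h + 2, len(S) - len(x) + 2):  # tail of S, checked directly
        if kr_occurs(x, S, l, k, r):
            found.add(l)
    return sorted(found)

S, k, r, h = "abaa", 0, 1, 2
trie = build_trie(S, k, r, h)
print(find("ab", S, k, r, h, trie))   # [1]        case i)
print(find("abb", S, k, r, h, trie))  # []         case ii), false positive
print(find("a", S, k, r, h, trie))    # [1, 3, 4]  case iii)
```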

Results • Proposition: The overall time for finding all kr-occurrences of a string x in a text S is O(|x|+|occ(x)|). • Theorem: The k-mismatch problem on a text S over a fixed alphabet can be settled by a data structure having average size O(|S|·log^k|S|) that answers queries in time O(|x|+|occ(x)|), for any query word x.

Conclusions and related work • The results of this paper appear in the PhD thesis of A. Gabriele [January 2005]. • Independently, M. Maass and J. Nowak gave analogous results, using essentially the same data structure and the CRQ solution [Preprint March 2005, CPM June 2005], but: - a window is not used; - the analysis of the size of the data structure is improved; - the technique is extended to edit distance. • It is still an open problem to find an indexing data structure whose size is linear times a polylog and whose searching time is O(|x|+|occ(x)|).
