Search Algorithms
Winter Semester 2004/2005
25 Oct 2004, 3rd Lecture
Christian Schindelhauer, schindel@upb.de
Chapter I: Searching Text (18 Oct 2004)
Searching Text (Overview)
• The task of string matching
  • Easy as pie
• The naive algorithm
  • How would you do it?
• The Rabin-Karp algorithm
  • Ingenious use of primes and number theory
• The Knuth-Morris-Pratt algorithm
  • Let a (finite) automaton do the job
  • This is optimal
• The Boyer-Moore algorithm
  • Bad letters allow us to jump through the text
  • This is even better than optimal (in practice)
• Literature
  • Cormen, Leiserson, Rivest, "Introduction to Algorithms", Chapter 34 (String Matching), The MIT Press, 1989, pp. 853-885
The Naive Algorithm

Naive-String-Matcher(T, P)
• n ← length(T)
• m ← length(P)
• for s ← 0 to n-m do
•   if P[1..m] = T[s+1..s+m] then
•     return "Pattern occurs with shift s"
•   fi
• od

Fact:
• The naive string matcher needs worst-case running time O((n-m+1)·m)
• For n = 2m this is O(n²)
• The naive string matcher is not optimal, since string matching can be done in time O(m + n)
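A minimal runnable version of the naive matcher in Python (0-indexed shifts; the function name and the list-of-shifts return convention are mine, not from the lecture):

    def naive_string_matcher(T, P):
        """Return all shifts s such that P occurs in T at position s (0-indexed)."""
        n, m = len(T), len(P)
        shifts = []
        for s in range(n - m + 1):        # try every possible shift
            if T[s:s + m] == P:           # compare pattern with the current window
                shifts.append(s)
        return shifts

    # Example with the text and pattern used on the following slides:
    print(naive_string_matcher("manamanapatipitipi", "pati"))  # [8]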
The Rabin-Karp Algorithm
• Idea: compute
  • a checksum for the pattern P, and
  • a checksum for each substring of T of length m
• Only windows whose checksum equals the checksum of P are verified letter by letter: a checksum match alone is a spurious hit, a match of the strings is a valid hit
(Slide figure: the text "manamanapatipitipi" with the checksum of each length-4 window written below it; the pattern "pati" has checksum 3, and among the windows with checksum 3 one is a spurious hit and one is the valid hit)
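A runnable Python sketch of the Rabin-Karp idea. The rolling polynomial hash modulo a small prime (d = 256, q = 101) is a standard textbook choice, not necessarily the checksum of the slide's figure; every checksum hit is verified to rule out spurious hits:

    def rabin_karp(T, P, q=101, d=256):
        """Find all shifts of P in T using a rolling hash modulo the prime q."""
        n, m = len(T), len(P)
        if m > n:
            return []
        h = pow(d, m - 1, q)                  # weight of the leading character
        p_hash = t_hash = 0
        for i in range(m):                    # checksums of pattern and first window
            p_hash = (d * p_hash + ord(P[i])) % q
            t_hash = (d * t_hash + ord(T[i])) % q
        shifts = []
        for s in range(n - m + 1):
            if p_hash == t_hash and T[s:s + m] == P:   # verify: no spurious hits
                shifts.append(s)
            if s < n - m:                     # roll the hash: drop T[s], add T[s+m]
                t_hash = (d * (t_hash - ord(T[s]) * h) + ord(T[s + m])) % q
        return shifts

    print(rabin_karp("manamanapatipitipi", "pati"))  # [8]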
Finite-Automaton-Matcher
• The example automaton (next slide) accepts exactly at the end of each occurrence of the pattern abba
• For every pattern of length m there exists an automaton with m+1 states that solves the pattern matching problem with the following algorithm:

Finite-Automaton-Matcher(T, δ, P)
• n ← length(T)
• q ← 0
• for i ← 1 to n do
•   q ← δ(q, T[i])
•   if q = m then
•     s ← i - m
•     return "Pattern occurs with shift" s
•   fi
• od
The Finite-Automaton-Matcher
• Q is a finite set of states
• q0 ∈ Q is the start state
• F ⊆ Q is the set of accepting states
• Σ: input alphabet
• δ: Q × Σ → Q: transition function
(Slide figure: the five-state automaton for the pattern abba, with states 0-4, state 4 accepting, and an example run showing the state reached after each letter of a sample text)
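A Python sketch of this approach: state q is the length of the longest prefix of P that is a suffix of the text read so far, and the transition table is precomputed naively. The function names and 0-indexed output shifts are mine; the sketch assumes the given alphabet covers all pattern characters (unknown text characters fall back to state 0):

    def build_transition(P, alphabet):
        """delta[q][a] = length of the longest prefix of P that is a suffix of P[:q] + a."""
        m = len(P)
        delta = [{} for _ in range(m + 1)]
        for q in range(m + 1):
            for a in alphabet:
                k = min(m, q + 1)
                while k > 0 and not (P[:q] + a).endswith(P[:k]):
                    k -= 1
                delta[q][a] = k
        return delta

    def finite_automaton_matcher(T, P, alphabet):
        delta = build_transition(P, alphabet)
        q = 0
        for i, c in enumerate(T):
            q = delta[q].get(c, 0)            # characters not in the alphabet reset to 0
            if q == len(P):
                print("Pattern occurs with shift", i - len(P) + 1)

    finite_automaton_matcher("abbabba", "abba", "ab")  # shifts 0 and 3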
Knuth-Morris-Pratt Pattern Matching

KMP-Matcher(T, P)
• n ← length(T)
• m ← length(P)
• π ← Compute-Prefix-Function(P)
• q ← 0
• for i ← 1 to n do
•   while q > 0 and P[q+1] ≠ T[i] do
•     q ← π[q] od
•   if P[q+1] = T[i] then
•     q ← q+1 fi
•   if q = m then
•     print "Pattern occurs with shift" i-m
•     q ← π[q] fi od

(Slide figure: the pattern "mamamm" sliding along the text "manamamapa"; after a mismatch the prefix function π tells how far the pattern can be shifted without losing a possible match)
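A self-contained, 0-indexed Python sketch of KMP; here pi[q] is the length of the longest proper prefix of P[:q+1] that is also its suffix (the slides use the 1-indexed π):

    def compute_prefix_function(P):
        """pi[q] = length of the longest proper prefix of P[:q+1] that is also its suffix."""
        m = len(P)
        pi = [0] * m
        k = 0
        for q in range(1, m):
            while k > 0 and P[k] != P[q]:
                k = pi[k - 1]
            if P[k] == P[q]:
                k += 1
            pi[q] = k
        return pi

    def kmp_matcher(T, P):
        pi = compute_prefix_function(P)
        q = 0                                 # number of pattern characters matched
        for i, c in enumerate(T):
            while q > 0 and P[q] != c:
                q = pi[q - 1]                 # fall back to the next shorter border
            if P[q] == c:
                q += 1
            if q == len(P):
                print("Pattern occurs with shift", i - len(P) + 1)
                q = pi[q - 1]                 # keep searching for further occurrences

    kmp_matcher("manamanapatipitipi", "pati")  # Pattern occurs with shift 8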
Boyer-Moore: The ideas!
Text: m a n a m a n a p a t i p i t i p i   Pattern: p i t i
• Start comparing at the end of the pattern
• What's this? There is no "a" in the search pattern, so we can shift m+1 letters
• An "a" again... another large shift
• First wrong letter! Do a large shift!
• Bingo! Do another large shift!
• That's it! 10 letters compared and ready!
(Slide figure: the pattern "piti" at its successive alignments against the text, jumping over large parts of the text)
Boyer-Moore-Matcher(T, P, Σ)
• n ← length(T)
• m ← length(P)
• λ ← Compute-Last-Occurrence-Function(P, m, Σ)
• γ ← Compute-Good-Suffix(P, m)
• s ← 0                                     (s runs over the valid shifts)
• while s ≤ n-m do
•   j ← m                                   (we start comparing at the right end)
•   while j > 0 and P[j] = T[s+j] do
•     j ← j-1 od
•   if j = 0 then
•     print "Pattern occurs with shift" s   (success! now do a valid shift)
•     s ← s + γ[0]
•   else
•     s ← s + max(γ[j], j - λ[T[s+j]]) fi od  (shift as far as possible, as indicated by the bad-character heuristic or the good-suffix heuristic)
Boyer-Moore: Last occurrence
Text: m a n a m a n t p a t i p i t i p i   Pattern: p i t i
• j = 4: What's this? There is no "a" in the search pattern, so we can shift by j - λ[a] = 4 - 0 = 4 letters
• j = 4: "p" occurs in "piti" at the first position: shift by j - λ[p] = 4 - 1 = 3 letters
• j = 4: "t" occurs in "piti" at the 3rd position: shift by j - λ[t] = 4 - 3 = one step
• j = 2: there is no "a" in the search pattern, so we can shift by at least j - λ[a] = 2 - 0 = 2 letters
Compute-Last-Occurrence-Function(P, m, Σ)
• for each character a ∈ Σ do
•   λ[a] ← 0 od
• for j ← 1 to m do
•   λ[P[j]] ← j od
• return λ

Running time: O(|Σ| + m)
Example for P = p i t i: λ[a] = 0, λ[i] = 4, λ[p] = 1, λ[t] = 3
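In Python, with a dictionary over the alphabet (1-indexed positions as on the slide; the function name is mine):

    def compute_last_occurrence(P, alphabet):
        """lam[a] = rightmost (1-indexed) position of a in P, or 0 if a does not occur."""
        lam = {a: 0 for a in alphabet}
        for j, a in enumerate(P, start=1):
            lam[a] = j
        return lam

    print(compute_last_occurrence("piti", "aipt"))  # {'a': 0, 'i': 4, 'p': 1, 't': 3}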
The Prefix Function
π[q] := max {k : k < q and Pk is a suffix of Pq}
(Slide figure: pattern a b a a b a a a a; the prefixes P8, P7, P6, P5, ... are slid against P7 = a b a a b a a until one of them, P4 = a b a a, is a suffix of P7, giving π[7] = 4)
π[q] := max {k : k < q and Pk is a suffix of Pq}
Pattern: a b a a b a a a a

q       1  2  3  4  5  6  7  8  9
π[q]    0  0  1  1  2  3  4  1  1
Computing π

Compute-Prefix-Function(P)
• m ← length(P)
• π[1] ← 0
• k ← 0
• for q ← 2 to m do
•   while k > 0 and P[k+1] ≠ P[q] do
•     k ← π[k] od      (if Pk+1 is not a suffix of Pq, shift the pattern to the next reasonable position, given by smaller values of π)
•   if P[k+1] = P[q] then
•     k ← k+1 fi       (if the letter fits, increment the position; otherwise k = 0)
•   π[q] ← k od        (we have found the position such that π[q] = max {k : k < q and Pk is a suffix of Pq})
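A direct, 1-indexed Python transcription of this pseudocode (pi[0] is a dummy entry; the function name is mine); it reproduces the π table of the previous slide:

    def compute_prefix_function_1idx(P):
        """1-indexed prefix function: pi[q] = max{k : k < q and P[1..k] is a suffix of P[1..q]}."""
        m = len(P)
        pi = [0] * (m + 1)                    # pi[0] unused
        k = 0
        for q in range(2, m + 1):
            while k > 0 and P[k] != P[q - 1]: # P[k+1] != P[q] in 1-indexed terms
                k = pi[k]
            if P[k] == P[q - 1]:
                k += 1
            pi[q] = k
        return pi

    print(compute_prefix_function_1idx("abaabaaaa")[1:])  # [0, 0, 1, 1, 2, 3, 4, 1, 1]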
Boyer-Moore: Good Suffix - the far jump
First mismatch at j = 6; pattern: m a n a m a n a
• Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or, equivalently, is P5 a suffix of P8)?
• Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?
• The largest such k is 4 ("mana"), i.e. π[8] = 4, so the shift is m - π[m] = 8 - 4 = 4
π[q] := max {k : k < q and Pk is a suffix of Pq}
Boyer-Moore: Good Suffix - the small jump
First mismatch at j = 6; pattern: m a m a m a m a
• Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or, equivalently, is P5 a suffix of P8)?
• Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?
• Here f[6] = 8, so the shift is f[j] - j = 8 - 6 = 2
f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}
π'[q] := max {k : k < q and Rev(P)k is a suffix of Rev(P)q}
Why is it the same?
π'[k] := max {j : j < k and Rev(P)j is a suffix of Rev(P)k}
f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}
(Slide figure: the boolean matrix of the relation "Rev(P)j is a suffix of Rev(P)k"; f reads this matrix along j and π' reads it along k, so both functions encode the same relation)
Compute-Good-Suffix-Function(P, m)
• π ← Compute-Prefix-Function(P)
• P' ← reverse(P)
• π' ← Compute-Prefix-Function(P')
• for j ← 0 to m do
•   γ[j] ← m - π[m] od        (the far jump)
• for l ← 1 to m do
•   j ← m - π'[l]
•   if γ[j] > l - π'[l] then
•     γ[j] ← l - π'[l] fi od  (or is it a small jump?)
• return γ
• Running time: O(m)
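A Python transcription of this pseudocode, reusing compute_prefix_function_1idx from the sketch above; gamma[j] is the good-suffix shift after a mismatch at position j, and gamma[0] the shift after a full match. This is a sketch under the slide's definitions, not a tuned implementation:

    def compute_good_suffix(P):
        """gamma[j] = shift allowed by the good-suffix heuristic after a mismatch at j."""
        m = len(P)
        pi = compute_prefix_function_1idx(P)
        pi_rev = compute_prefix_function_1idx(P[::-1])
        gamma = [m - pi[m]] * (m + 1)         # default: the "far jump"
        for l in range(1, m + 1):
            j = m - pi_rev[l]
            if gamma[j] > l - pi_rev[l]:
                gamma[j] = l - pi_rev[l]      # a smaller "small jump" applies
        return gamma

    print(compute_good_suffix("mamamama"))
    # [2, 2, 2, 2, 2, 2, 2, 2, 1]: a mismatch at j = 6 allows shift 2, as on the slide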
Boyer-Moore-Matcher(T, P, Σ)   (as above)
• n ← length(T)
• m ← length(P)
• λ ← Compute-Last-Occurrence-Function(P, m, Σ)
• γ ← Compute-Good-Suffix(P, m)
• s ← 0
• while s ≤ n-m do
•   j ← m
•   while j > 0 and P[j] = T[s+j] do
•     j ← j-1 od
•   if j = 0 then
•     print "Pattern occurs with shift" s
•     s ← s + γ[0]
•   else
•     s ← s + max(γ[j], j - λ[T[s+j]]) fi od

Running time: O((n-m+1)·m) in the worst case
In practice: O(n/m + v·m + m + |Σ|) for v hits in the text
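Putting the pieces together with the Python sketches above (compute_last_occurrence and compute_good_suffix), again with 0-indexed shifts in the output:

    def boyer_moore_matcher(T, P, alphabet):
        n, m = len(T), len(P)
        lam = compute_last_occurrence(P, alphabet)   # bad-character table
        gamma = compute_good_suffix(P)               # good-suffix table
        s = 0
        while s <= n - m:
            j = m                                    # start comparing at the right end
            while j > 0 and P[j - 1] == T[s + j - 1]:
                j -= 1
            if j == 0:
                print("Pattern occurs with shift", s)
                s += gamma[0]
            else:                                    # take the larger of the two shifts
                s += max(gamma[j], j - lam.get(T[s + j - 1], 0))

    boyer_moore_matcher("manamanapatipitipi", "pati", "aimnpt")
    # -> Pattern occurs with shift 8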
Chapter II: Searching in Compressed Text (25 Oct 2004)
Searching in Compressed Text (Overview)
• What is Text Compression?
  • Definition
  • The Shannon Bound
  • Huffman Codes
  • The Kolmogorov Measure
• Searching in Non-adaptive Codes
  • KMP in Huffman Codes
• Searching in Adaptive Codes
  • The Lempel-Ziv Codes
  • Pattern Matching in Z-Compressed Files
  • Adapting Compression for Searching
What is Text Compression?
• First approach:
  • Given a text s ∈ Σⁿ
  • Find a compressed version c ∈ Σᵐ with m < n
  • such that s can be derived from c
• Formally: a compression function f : Σ* → Σ* is one-to-one (injective) and efficiently invertible
• Fact: most text is incompressible
• Proof:
  • There are (|Σ|^(m+1) - 1)/(|Σ| - 1) strings of length at most m
  • There are |Σ|ⁿ strings of length n
  • Of these, at most (|Σ|^(m+1) - 1)/(|Σ| - 1) can be compressed to length at most m
  • This is a fraction of at most |Σ|^(m-n+1)/(|Σ| - 1)
  • E.g. for |Σ| = 256 and m = n-10 this is 8.3 × 10⁻²⁵, i.e. only a fraction of 8.3 × 10⁻²⁵ of all files of n bytes can be compressed to a string of length n-10
Why does Text Compression work?
• Usually texts use letters with different frequencies
• Relative frequencies of letters in general English plain text (from Cryptological Mathematics, by Robert Edward Lewand):
  • e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%
  • ...
  • k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%
• Special characters like $, %, # occur even less frequently
• Some character encodings are (nearly) unused, e.g. byte value 0 of ASCII
• Text obeys many rules
  • Words are (usually) the same (collected in dictionaries)
  • Not all words can be used in combination
  • Sentences are structured (grammar)
  • Program code uses code words
  • Digitally encoded pictures have smooth areas where colors change gradually
  • Patterns repeat
Information Theory: The Shannon Bound
• C. E. Shannon, in his 1948 paper "A Mathematical Theory of Communication", derives his definition of entropy
• The entropy rate of a data source is the average number of bits per symbol needed to encode it
• Example text: ababababab
  • Entropy: 1
  • Encoding: use 0 for a, use 1 for b
  • Code: 0101010101
• Huffman codes are a way to approach this Shannon bound (for sufficiently long texts)
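A small Python sketch computing the empirical entropy (average bits per symbol under the letter frequencies), reproducing the example:

    from collections import Counter
    from math import log2

    def entropy(text):
        """Average number of bits per symbol under the empirical letter frequencies."""
        counts = Counter(text)
        n = len(text)
        return -sum(c / n * log2(c / n) for c in counts.values())

    print(entropy("ababababab"))  # 1.0 -- one bit per symbol suffices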
Huffman Code
• A Huffman code is adapted to each text (but not within the text); it consists of
  • a dictionary, which maps each letter of the text to a binary string, and
  • the code, given as a prefix-free binary encoding
• A prefix-free code uses strings s1, s2, ..., sm of variable length such that no string si is a prefix of another string sj
• Example of a Huffman encoding:
  • Text: m a n a m a m a p a t i p i t i p i
  • Dictionary: a → 10, i → 01, m → 000, t → 001, n → 110, p → 111
  • Encoding: 000 10 110 10 000 10 000 10 111 10 001 01 111 01 001 01 111 01
Computing Huffman Codes
• Compute the letter frequencies
• Build root nodes labeled with the frequencies
• repeat
  • Build a node connecting the two least frequent unlinked nodes
  • Mark the sons with 0 and 1
  • The father node carries the sum of the frequencies
• until one tree is left
• The path to each letter spells out its code
(Slide figure: the Huffman tree of total weight 18 for the example text, with letter frequencies a: 5, i: 4, m: 3, p: 3, t: 2, n: 1 and resulting codes a → 10, i → 01, m → 000, t → 001, n → 110, p → 111)
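A compact Python sketch of this construction using a heap. Tie-breaking among equal frequencies may differ from the slide's figure, so individual codes can come out differently while the code lengths stay optimal:

    import heapq
    from collections import Counter

    def huffman_code(text):
        """Build a Huffman dictionary letter -> bitstring from the letter frequencies."""
        heap = [(freq, i, {letter: ""})
                for i, (letter, freq) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)   # the two least frequent unlinked nodes
            f1, _, c1 = heapq.heappop(heap)
            merged = {a: "0" + code for a, code in c0.items()}
            merged.update({a: "1" + code for a, code in c1.items()})
            heapq.heappush(heap, (f0 + f1, counter, merged))  # father carries the sum
            counter += 1
        return heap[0][2]

    code = huffman_code("manamamapatipitipi")
    print(code)                               # e.g. 2-bit codes for a and i, 3-bit otherwise
    print(sum(len(code[c]) for c in "manamamapatipitipi"))  # 45 bits in total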
Searching in Huffman Codes
• Let u be the size of the compressed text
• Let v be the size of the pattern, Huffman-encoded according to the text's dictionary
• KMP can search in Huffman codes in time O(u + v + m)
  • Encoding the pattern takes O(v + m) steps
  • Building the prefix function takes time O(v)
  • Searching the text on the bit level takes time O(u + v)
• Problem: this algorithm works bit-wise, not byte-wise
• Exercise: develop a byte-wise strategy
The Downside of Huffman Codes
• Example: consider the 128-byte text
  abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
  abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
• It will be encoded using 16 bytes (plus an extra byte for the dictionary) as
  0110011001100110011001100110011001100110011001100110011001100110
  0110011001100110011001100110011001100110011001100110011001100110
• This does not use the full compression possibilities of this text: e.g. writing it as (abba)^32 would need only 9 bytes
• The perfect code:
  • A self-extracting program for a string x is a program that, started without input, produces the output x and then halts
  • So the smallest self-extracting program is the ultimate encoding
  • The Kolmogorov complexity K(x) of a string x denotes the length of such a self-extracting program for x
Kolmogorov Complexity
• Does the Kolmogorov complexity depend on the programming language?
  • No, as long as the programming language is universal, i.e. can simulate any Turing machine
Lemma: Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then, for a constant c and all strings x: K1(x) ≤ K2(x) + c
• Is the Kolmogorov complexity useful?
  • No:
Theorem: K(x) is not recursive.
Proof of the Lemma
Lemma: Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then, for a constant c and all strings x: K1(x) ≤ K2(x) + c
Proof:
• Let M1 be the self-extracting program for x with respect to the first language
• Let U be a universal program in the second language that simulates a given machine M1 of the first language; the output of U applied to M1 (with empty input) is x
• Then, by the S-m-n theorem, one can find a machine M2 of length |U| + |M1| + O(1) with the same functionality
• Since U is a fixed (constant-size) machine, this proves the statement (exchanging the roles of the two languages gives the other direction)
Proof of the Theorem
Theorem: K(x) is not recursive.
Proof:
• Assume K(x) is recursive
• For every length n, let xn denote the smallest string of length n such that K(xn) ≥ |xn| = n (such strings exist, since most strings are incompressible)
• We can compute xn: compute, for all strings x of size n, the Kolmogorov complexity K(x), and output the first string x with K(x) ≥ n
• Let M be the program computing xn on input n
• This yields a short encoding of xn: combining M with the binary encoding of n gives K(xn) ≤ log n + |M| = log n + O(1)
• For large enough n this is a contradiction to K(xn) ≥ n
Thanks for your attention
End of 3rd lecture
Next lecture: Mo 8 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 25 Oct 2004, 1.15 pm, F0.530, or We 27 Oct 2004, 1.00 pm, E2.316