Introduction to Stringology Like Zhang
Outline • What is “Stringology”? • How to perform string matching? • KMP String Searching • Boyer-Moore Algorithm • Trie and Suffix Tree • Approximate Pattern Matching • Interesting Problems
Stringology? • Text algorithms; algorithms on strings • Practical problems: e.g. string matching, text compression • Theoretical problems: e.g. symmetric strings, repetitions, etc.
Why “Stringology”? • Most algorithm books give it only isolated and brief descriptions • The rich content is mostly accessible only in academic papers and journals • Applicable to many areas, including web search, intrusion detection, bioinformatics, multimedia, data compression, etc. • Understanding binary strings is fundamental to computer science
At the beginning… Question: Given a string “abcdefghijk” and a pattern P, determine whether P occurs in the given string. Solution:
C++:
    std::string s = "abcdefghijk";
    return s.find(P, 0) != std::string::npos;
Java:
    String s = "abcdefghijk";
    return s.indexOf(P) >= 0;  // indexOf returns -1 when P is absent; 0 is a valid match position
Do you care about performance? • What if the given string is 100 GB and the pattern is 100 MB? • What if indexOf() and find() use brute-force searching?
Brute-force string searching (pseudocode):
    for (int i = 0; i + p.Length <= s.Length; ++i)   // O(n) start positions
    {
        compare(s[i, i + p.Length], p);              // O(m) per start position
    }
Total time: O(n*m). For 100 GB of data and a 100 MB pattern that is about 10^19 character comparisons, which takes around 277,777 hours (32 years) on a 10 GHz CPU, supposing each comparison takes one clock cycle.
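The brute-force idea above can be sketched in C++ as follows. This is a minimal illustration, not the way any particular library implements find() or indexOf(); the function name brute_force_find is a placeholder chosen for this example.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of p in s,
// or std::string::npos if p does not occur.
// Worst case: O(n*m) character comparisons.
std::size_t brute_force_find(const std::string& s, const std::string& p) {
    if (p.size() > s.size()) return std::string::npos;
    // Try every start position i at which p still fits inside s.
    for (std::size_t i = 0; i + p.size() <= s.size(); ++i) {
        std::size_t j = 0;
        // Compare p against s[i..] one character at a time.
        while (j < p.size() && s[i + j] == p[j]) ++j;
        if (j == p.size()) return i;  // every character matched
    }
    return std::string::npos;
}
```

With this sketch, brute_force_find("abcdefghijk", "cde") returns 2, and searching for “abedabz” in “abedabcdfghij” returns std::string::npos.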
Why is it slow? Problem with brute-force string searching: • The same characters are compared multiple times
e.g. S=“abedabcdfghij”, P=“abedabz”;
1st: abedabcdfghij, start at index 0
2nd: abedabcdfghij, start at index 1
3rd: abedabcdfghij, start at index 2
…
KMP Algorithm • Knuth-Morris-Pratt algorithm, published in 1977 • Preprocesses the search pattern to avoid redundant comparisons
e.g. For pattern “abedabz”, if the mismatch happens at “z”, the part matched so far, “abedab”, ends with “ab”, which is also the start of the pattern. The pattern can therefore jump ahead to line up with that “ab” instead of shifting one position at a time.
KMP: (i is the current location)
    … if ( S[i]!=P[j] ) i = i + lookupTable[j]; …
Brute force: (i is the current location)
    … if ( S[i]!=P[j] ) i = i + 1; …
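Build KMP Table e.g. the pattern is “101101”
    Table[0]=1;
    Table[1]=1;
    Table[2]=1;
    Table[3]=2; // P[2,2]="1" equals P[0,0]="1"
    Table[4]=3; // P[3,3]="1" equals P[0,0]="1"
    Table[5]=3; // P[3,4]="10" equals P[0,1]="10"
Table[i]=k for the smallest k (0<k<i) such that P[k,i-1]==P[0,i-k-1], i.e. the smallest shift after which the already matched characters still line up with the start of the pattern.
Otherwise, Table[i]=1 (a safe default shift).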
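For comparison, here is a sketch of KMP in its common textbook form, which stores border lengths (a “failure table”) rather than shift amounts. It is a different formulation from the shift table on the slide, and the names build_failure_table and kmp_find are placeholders for this example. The two views are related: a failure value fail[j-1] corresponds to a shift of j - fail[j-1] on a mismatch at pattern index j.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// fail[j] = length of the longest proper prefix of p[0..j]
// that is also a suffix of p[0..j].
std::vector<std::size_t> build_failure_table(const std::string& p) {
    std::vector<std::size_t> fail(p.size(), 0);
    std::size_t k = 0;  // length of the current matched prefix
    for (std::size_t j = 1; j < p.size(); ++j) {
        while (k > 0 && p[j] != p[k]) k = fail[k - 1];  // fall back to a shorter border
        if (p[j] == p[k]) ++k;
        fail[j] = k;
    }
    return fail;
}

// Returns the index of the first occurrence of p in s, or std::string::npos.
// Runs in O(n + m): the text index i never moves backwards.
std::size_t kmp_find(const std::string& s, const std::string& p) {
    if (p.empty()) return 0;
    const std::vector<std::size_t> fail = build_failure_table(p);
    std::size_t k = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < s.size(); ++i) {
        while (k > 0 && s[i] != p[k]) k = fail[k - 1];  // reuse what is already matched
        if (s[i] == p[k]) ++k;
        if (k == p.size()) return i + 1 - k;  // full match ends at index i
    }
    return std::string::npos;
}
```

For the pattern “101101” the failure table is {0, 0, 1, 1, 2, 3}, which gives shifts of 2, 3 and 3 for mismatches at indices 3, 4 and 5.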
Boyer-Moore Algorithm • Published in 1977 • Typically, the longer the pattern is, the faster it works, because longer patterns allow longer skips • Compares starting from the end of the pattern, while KMP starts from the beginning • Works best for character strings (large alphabets), while KMP works best for binary strings (small alphabets) Live Demo: http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
Trie and Suffix Tree • KMP and Boyer-Moore - Preprocess the known patterns - Search for those patterns in input strings • Trie and Suffix Tree - Preprocess the known strings (e.g. a dictionary) - Search for input patterns in the built tree
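The right-to-left comparison and the skipping can be illustrated with the Boyer-Moore-Horspool simplification, which keeps only a bad-character shift table. This is not the full Boyer-Moore algorithm (it omits the good-suffix rule), and the name horspool_find is a placeholder for this example.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Boyer-Moore-Horspool: compare the pattern right to left and, on a mismatch,
// shift according to the text character under the pattern's last position.
// Returns the index of the first occurrence of p in s, or std::string::npos.
std::size_t horspool_find(const std::string& s, const std::string& p) {
    const std::size_t m = p.size();
    const std::size_t n = s.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    // Default shift: the whole pattern length (character not in the pattern).
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    // For each character except the last, record its distance to the pattern end.
    for (std::size_t j = 0; j + 1 < m; ++j)
        shift[static_cast<unsigned char>(p[j])] = m - 1 - j;

    std::size_t i = 0;  // current alignment of the pattern in the text
    while (i + m <= n) {
        std::size_t j = m;
        while (j > 0 && s[i + j - 1] == p[j - 1]) --j;  // compare from the end
        if (j == 0) return i;
        i += shift[static_cast<unsigned char>(s[i + m - 1])];
    }
    return std::string::npos;
}
```

Searching for “abcd” in “abedabcdfghij” needs only two alignments here: the first window “abed” fails, and the shift for its last character “d” moves the pattern four positions to the match at index 4.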
A Simple Non-Compact Trie For strings: BIG, BIGGER, BILL,GOOD, GOSH
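A non-compact trie like this one can be sketched as a tree with one node per character. The class and member names below (Trie, TrieNode, insert, contains, has_prefix) are placeholders for this example, and std::map is just one convenient choice for storing children.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per character; is_word marks the end of a stored string.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool is_word = false;
};

class Trie {
public:
    // Walk down from the root, creating missing nodes along the way.
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->is_word = true;
    }

    // True only if the exact word was inserted.
    bool contains(const std::string& word) const {
        const TrieNode* node = find(word);
        return node != nullptr && node->is_word;
    }

    // True if some inserted word starts with the given prefix.
    bool has_prefix(const std::string& prefix) const {
        return find(prefix) != nullptr;
    }

private:
    // Follow the key character by character; nullptr if the path breaks off.
    const TrieNode* find(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    TrieNode root_;
};
```

After inserting BIG, BIGGER, BILL, GOOD and GOSH, a lookup costs time proportional to the length of the query, independent of how many strings are stored.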
Compact Trie Shrink all chains leading to leaves
Patricia Each edge can represent multiple characters
Online Suffix Trie Building For each input character X: • Append X to every existing suffix (extend each suffix leaf by X) • Add X itself as a new suffix (if X is not already a child of the root, add it to the root's children)
Build a Suffix Trie Online Given text: abaab
Step 1 (start from the end): a
Step 2: input character “b” [Figure: trie after this step; added nodes are labeled “new suffix”]
Step 3: input character “a” [Figure: trie after this step; “a” is an existing suffix, added nodes are labeled “new suffix”]
Step 4: input character “a” [Figure: trie after this step; added nodes are labeled “new suffix”]
Step 5: input character “b” [Figure: trie after this step; added nodes are labeled “new suffix”]
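The online construction can be sketched as follows, assuming the characters are fed in text order (a, b, a, a, b). This is the simple quadratic-time suffix trie, not a linear-time suffix tree algorithm, and the names SuffixTrie, append and contains are placeholders for this example.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> children;
};

class SuffixTrie {
public:
    SuffixTrie() : active_{&root_} {}
    SuffixTrie(const SuffixTrie&) = delete;             // active_ points into this object
    SuffixTrie& operator=(const SuffixTrie&) = delete;

    // Feed the next character x of the text.
    void append(char x) {
        // The root stands for the empty suffix of the extended text.
        std::vector<SuffixTrieNode*> next{&root_};
        // Extend every suffix of the text read so far by x.
        for (SuffixTrieNode* node : active_) {
            std::unique_ptr<SuffixTrieNode>& child = node->children[x];
            if (!child) child = std::make_unique<SuffixTrieNode>();  // new suffix node
            next.push_back(child.get());
        }
        active_ = std::move(next);
    }

    // Every substring of the text is a path starting at the root.
    bool contains(const std::string& pattern) const {
        const SuffixTrieNode* node = &root_;
        for (char c : pattern) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return true;
    }

private:
    SuffixTrieNode root_;
    std::vector<SuffixTrieNode*> active_;  // end nodes of all current suffixes
};
```

After feeding “abaab”, contains("baa") is true and contains("bb") is false. The trie can grow to O(n^2) nodes, which is what compact representations such as suffix trees avoid.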
Suffix Array String Searching U. Manber and G. Myers, “Suffix arrays: a new method for on-line string searches”, SIAM Journal on Computing, 1993 Another source: “Programming Pearls”, Ch.15 • Sort the suffixes of the string (stored as pointers into the string) • Binary search the sorted suffixes for the pattern
Example of Suffix Array Search Existing string: google
Then we have the following suffixes: google, oogle, ogle, gle, le, e
Sorted: e, gle, google, le, ogle, oogle
Search pattern “good”:
• Compare with “le”
• Compare with “gle”
• “good” != “google”, return false
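The two steps (sort the suffixes, then binary search) can be sketched as below. The construction uses a plain comparison sort for clarity rather than the faster method from the Manber and Myers paper, and the names build_suffix_array and suffix_array_contains are placeholders for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Returns the start indices of all suffixes of text, in lexicographic order.
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search: is pattern a prefix of some suffix, i.e. a substring of text?
bool suffix_array_contains(const std::string& text,
                           const std::vector<std::size_t>& sa,
                           const std::string& pattern) {
    std::size_t lo = 0, hi = sa.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // Compare only the first pattern.size() characters of the suffix.
        const int cmp = text.compare(sa[mid], pattern.size(), pattern);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;  // suffix sorts before the pattern
        else hi = mid;              // suffix sorts after the pattern
    }
    return false;
}
```

For “google” the suffix array is {5, 3, 0, 4, 2, 1}, and searching for “good” visits “le”, “gle” and “google” before returning false, as in the example above.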
Performance Comparison Previous question: find the 100 MB string in 100 GB of content. What is the worst-case time complexity?
Brute force: O(n*m) is about 32 years
Suffix array (n ≈ 2^37 bytes, m ≈ 2^27 bytes, lg n ≈ 37; each suffix comparison during sorting is counted as one step):
• Quick sort the 100 GB: O(n lg n) = O(37 × 2^37)
• Binary search: O(m lg n) = O(37 × 2^27)
Total is about 38 × 2^37 steps, about 10 minutes
Approximate Pattern Question: “University” is the correct pattern, but we also allow typos, which means “Unversity”, “Oniversity” and “Univsitty” are also acceptable. How do we find all acceptable patterns in the content?
Distance Definition Strings s1 and s2 have distance K if s1 can be transformed into s2 in K steps and no fewer. Each step must be one of the following actions: • Change a character • Insert a character • Delete a character e.g. String “wojtk” can be transformed into “wjeek” in 3 steps, so Distance(“wojtk”, “wjeek”)=3
Distance Calculation Dynamic programming (similar to the longest common subsequence calculation)
For s1[1,m], s2[1,n], 0<i<=m, 0<j<=n:
Distance(i, j) = min{ Distance(i-1, j)+1, Distance(i, j-1)+1, Distance(i-1, j-1)+f(s1[i], s2[j]) }
where f(a,b) = (a==b) ? 0 : 1
Base cases: Distance(i, 0) = i and Distance(0, j) = j
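The recurrence translates directly into a table-filling routine. The sketch below assumes the usual base cases Distance(i, 0) = i and Distance(0, j) = j, and the name edit_distance is a placeholder for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum number of single-character changes, insertions and deletions
// needed to transform s1 into s2. O(m*n) time and space.
std::size_t edit_distance(const std::string& s1, const std::string& s2) {
    const std::size_t m = s1.size();
    const std::size_t n = s2.size();
    // d[i][j] = distance between the first i characters of s1
    // and the first j characters of s2.
    std::vector<std::vector<std::size_t>> d(m + 1, std::vector<std::size_t>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = i;  // delete i characters
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = j;  // insert j characters
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            // f(a, b) from the recurrence: 0 if the characters match, else 1.
            const std::size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // delete from s1
                                d[i][j - 1] + 1,          // insert into s1
                                d[i - 1][j - 1] + cost}); // change or keep
        }
    }
    return d[m][n];
}
```

With this sketch, edit_distance("wojtk", "wjeek") returns 3, and both “Unversity” and “Oniversity” are at distance 1 from “University”.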
Final Thoughts • String searching is critical to many applications • It is a problem you have to deal with, unless you do not care how indexOf() is implemented • 2D pattern matching is a hot topic in image/video research, e.g. object detection, face recognition, etc. • Many interesting questions are available, e.g. symmetric patterns, shortest common string • Be sure you can answer those questions for Microsoft/Google/etc. interviews