Introduction to Stringology

Like Zhang

Outline
• What is “Stringology”?
• How to perform string matching?
• KMP String Searching
• Boyer-Moore Algorithm
• Trie and Suffix Tree
• Approximate Pattern Matching
• Interesting Problems

Stringology?
• Text algorithms; algorithms on strings
  - Practical problems: e.g. string matching, text compression
  - Theoretical problems: e.g. symmetric strings, repetitions, etc.

Why “Stringology”?
• Most algorithm books give it only brief, isolated coverage
• Rich content is accessible only in academic papers or journals
• Applicable to many areas, including web search, intrusion detection, bioinformatics, multimedia, data compression, etc.
• Understanding binary strings is fundamental to computer science

At the beginning…
Question: Given a string “abcdefghijk” and a pattern P, determine whether P occurs in the given string.
Solution:
C++:
    std::string s = "abcdefghijk";
    return s.find(P, 0) != std::string::npos;
Java:
    String s = "abcdefghijk";
    return s.indexOf(P) >= 0;  // indexOf returns -1 when P is absent; 0 is a valid match position

Do you care about performance?
• What if the given string is 100 GB and the pattern is 100 MB?
• What if the indexOf() and find() methods use brute-force searching?
Brute-force string searching (pseudo code):
    for (int i = 0; i + p.Length <= s.Length; ++i)  // O(n)
    {
        compare(s[i, i + p.Length], p);              // O(m)
    }
Total time: O(n*m). For 100 GB of data and a 100 MB pattern, that is about 10^19 character comparisons, or around 277,777 hours (32 years) on a 10 GHz CPU, supposing one comparison takes one clock cycle.
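
The pseudo code above can be made concrete with a minimal Python sketch. The function name brute_force_find is an assumption chosen for illustration; it is not taken from any library.

```python
def brute_force_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):                        # O(n) candidate alignments
        j = 0
        while j < m and text[i + j] == pattern[j]:    # O(m) work per alignment
            j += 1
        if j == m:                                    # every pattern character matched
            return i
    return -1
```

Every alignment restarts the comparison from the first pattern character, which is where the O(n*m) worst case comes from.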

Why is it slow?
Problem with brute-force string searching:
• The same characters are compared multiple times
e.g. S = “abedabcdfghij”, P = “abedabz”
    1st: abedabcdfghij, start at index 0 (“abedab” matches, then “z” fails against “c”)
    2nd: abedabcdfghij, start at index 1
    3rd: abedabcdfghij, start at index 2
    …

KMP Algorithm
• Knuth-Morris-Pratt algorithm
• Proposed in 1977
• Preprocesses the search pattern to avoid redundant comparisons
e.g. For the pattern “abedabz”, if the mismatch happens at “z”, the only place a match could restart is the “ab” just before it (the start of “abz”), so we don't need to shift the pattern one position at a time.
KMP (i is the start of the current alignment, j is the position in P):
    … if (S[i + j] != P[j]) i = i + lookupTable[j]; …
Brute force (i is the start of the current alignment, j is the position in P):
    … if (S[i + j] != P[j]) i = i + 1; …
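
As a hedged illustration, here is the common textbook formulation of KMP in Python. It uses a failure table (longest border lengths) rather than the shift-style lookup table shown on the slide; the names prefix_function and kmp_find are assumptions made for this sketch.

```python
def prefix_function(pattern: str) -> list:
    """fail[j] = length of the longest proper prefix of pattern[:j+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]          # fall back to the next shorter border
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = prefix_function(pattern)
    j = 0                            # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]          # reuse the matched prefix instead of restarting
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1
```

The text index i never moves backwards, so the search takes O(n + m) time in total.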

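Build KMP Table
e.g. the pattern is “101101”
    Table[0]=1;
    Table[1]=1;
    Table[2]=1;
    Table[3]=2;  // matched “101”: shifting by 2 puts P[0] under its last “1”
    Table[4]=3;  // matched “1011”: shifting by 3 puts P[0] under its last “1”
    Table[5]=3;  // matched “10110”: shifting by 3 puts P[0,1] under its last “10”
Table[i]=k for the smallest k (0<k<i) such that P[k,i-1]==P[0,i-1-k]
Otherwise, Table[i]=1 (a shift of 1 is always safe)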
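
The table for “101101” can be reproduced with a short sketch. It reads Table[i] as the smallest shift k for which the pattern, moved k places to the right, still agrees with the i characters already matched, and falls back to 1 otherwise. The names shift_table and find_with_shifts are assumptions, and the table is built naively for clarity, not efficiency.

```python
def shift_table(pattern: str) -> list:
    """table[i] = how far to shift the pattern after a mismatch at position i."""
    table = []
    for i in range(len(pattern)):
        shift = 1                                   # fallback: always a safe shift
        for k in range(1, i):
            # Shifted by k, pattern[0:i-k] must line up with the matched pattern[k:i].
            if pattern[k:i] == pattern[:i - k]:
                shift = k
                break
        table.append(shift)
    return table


def find_with_shifts(text: str, pattern: str) -> int:
    """Search using the shift table; restarts j at 0 after each shift for simplicity."""
    table = shift_table(pattern)
    n, m = len(text), len(pattern)
    i = 0
    while i + m <= n:
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
        i += table[j]                               # skip alignments that cannot match
    return -1
```

This sketch restarts the comparison at the first pattern character after every shift, so it only shows how the shifts skip alignments; a full KMP implementation also resumes inside the pattern to reach linear time.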

Boyer-Moore Algorithm
• Published in 1977
• The longer the pattern is, the faster it generally works
• Compares from the end of the pattern, while KMP compares from the beginning
• Works best for character strings, while KMP works best for binary strings
Live Demo: http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
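
A simplified sketch of the idea follows. It implements only the bad-character rule of Boyer-Moore and omits the good-suffix rule, so it is not the full published algorithm; the function name is an assumption.

```python
def boyer_moore_bad_char_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Rightmost position of each character in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    i = 0
    while i + m <= n:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            return i
        # Align the mismatched text character with its rightmost occurrence in
        # the pattern, or move past it entirely if it does not occur at all.
        i += max(1, j - last.get(text[i + j], -1))
    return -1
```

When the mismatched text character does not occur in the pattern, the whole pattern jumps past it, which is why longer patterns over larger alphabets tend to allow longer skips.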

Trie and Suffix Tree
• KMP and Boyer-Moore
  - Preprocess the existing patterns
  - Search for the patterns in input strings
• Trie and Suffix Tree
  - Preprocess the existing strings (e.g. a dictionary)
  - Search for input patterns in the built tree

A Simple Non-Compact Trie
For strings: BIG, BIGGER, BILL, GOOD, GOSH
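
A non-compact trie for these five strings can be sketched as below, with one node per character. The class and function names (TrieNode, build_trie, trie_contains) are assumptions made for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the child TrieNode
        self.is_word = False    # True if a stored string ends at this node


def build_trie(words):
    """Insert every word character by character, sharing common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True
    return root


def trie_contains(root, word):
    """Return True if word was inserted as a whole string."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word
```

For example, build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]) shares the prefix “BI” among the first three strings and “GO” between the last two.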

Compact Trie
Shrink all chains leading to leaves

Patricia
Each edge represents multiple characters

Online Suffix Trie Building
For each input character X:
• Add X to all suffix leaves
• Make X a suffix (if X cannot be found, add it to the root's children)

Build a Suffix Trie Online
Given text: abaab
Step 1 (start from the end): a

Build a Suffix Trie Online
Step 2: Input character “b”
[Figure: trie after “ab”. New suffixes: ab, b]

Build a Suffix Trie Online
Step 3: Input character “a”
[Figure: trie after “aba”. Existing suffix: a. New suffixes: aba, ba]

Build a Suffix Trie Online
Step 4: Input character “a”
[Figure: trie after “abaa”. Existing suffix: a. New suffixes: abaa, baa, aa]

Build a Suffix Trie Online
Step 5: Input character “b”
[Figure: trie after “abaab”. Existing suffixes: ab, b. New suffixes: abaab, baab, aab]
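
The online construction walked through above can be sketched with nested dictionaries. This is a quadratic-time illustration of the idea, not an efficient suffix tree algorithm; the names build_suffix_trie and has_substring are assumptions.

```python
def build_suffix_trie(text: str) -> dict:
    """Build a suffix trie one input character at a time (nested dicts as nodes)."""
    root = {}
    active = []     # nodes at which the current suffixes end
    for ch in text:
        # Extend every current suffix with ch, then start the one-character
        # suffix at the root (creating the child only if it is missing).
        active = [node.setdefault(ch, {}) for node in active + [root]]
    return root


def has_substring(root: dict, pattern: str) -> bool:
    """Every substring of the text is a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

After build_suffix_trie("abaab"), any substring of the text can be checked by walking down from the root, in time proportional to the pattern length.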

Suffix Array String Searching
U. Manber and G. Myers, “Suffix arrays: a new method for on-line string searches”, SIAM Journal on Computing, 1993
Another source: “Programming Pearls”, Ch. 15
• Sort the string's suffixes (as pointers)
• Binary search

Example of Suffix Array Search
Existing string: google
Then we have the following suffixes:
    google, oogle, ogle, gle, le, e
Sorted:
    e, gle, google, le, ogle, oogle
Search pattern “good”:
    Compare with “le”
    Compare with “gle”
    “good” != “google”, return false
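
The example can be reproduced with a small sketch. The suffix array is built by naively sorting suffix start positions, which is fine for short strings but not how large inputs would be handled; the function names are assumptions.

```python
def build_suffix_array(text: str) -> list:
    """Return the start positions of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_array_contains(text: str, suffix_array: list, pattern: str) -> bool:
    """Binary search for a suffix that starts with pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = text[start:start + len(pattern)]   # compare at most m characters
        if prefix == pattern:
            return True
        if prefix < pattern:
            lo = mid + 1
        else:
            hi = mid
    return False
```

For “google” and the pattern “good”, this search probes the suffixes “le”, “gle” and “google” in that order before reporting no match, mirroring the comparisons listed on the slide.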

Performance Comparison
Previous question: when finding the 100 MB string in 100 GB of content, what is the worst-case time complexity?
Brute force: O(n*m), about 32 years
Suffix array (n ≈ 2^37, m ≈ 2^27, counting each suffix comparison during sorting as one step):
• Quick sort the 100 GB: O(n lg n) = O(37 * 2^37)
• Binary search: O(m lg n) = O(37 * 2^27)
Total is about 38 * 2^37 steps, about 10 minutes
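
The estimates can be checked with a few lines of arithmetic. This sketch takes the slide's assumptions at face value (a 10 GHz CPU, one step per clock cycle, each suffix comparison during sorting counted as one step); the function names are hypothetical.

```python
from math import log2

CLOCK_HZ = 10e9   # the slide's assumed 10 GHz CPU, one step per clock cycle


def brute_force_seconds(n: float, m: float) -> float:
    """O(n*m) character comparisons."""
    return n * m / CLOCK_HZ


def suffix_array_seconds(n: float, m: float) -> float:
    """O(n lg n) steps for sorting plus O(m lg n) steps for the binary search."""
    return (n * log2(n) + m * log2(n)) / CLOCK_HZ


brute_hours = brute_force_seconds(100e9, 100e6) / 3600
brute_years = brute_hours / (24 * 365)
suffix_minutes = suffix_array_seconds(2**37, 2**27) / 60
print(f"brute force: {brute_hours:,.0f} hours (~{brute_years:.1f} years)")
print(f"suffix array: ~{suffix_minutes:.1f} minutes")
```

Under these assumptions the brute-force figure comes out just under 32 years and the suffix-array figure under ten minutes, in line with the rounded numbers on the slide.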

Approximate Pattern
Question: “University” is the correct pattern, but we also allow typos, so “Unversity”, “Oniversity” and “Univsitty” are also acceptable. How do we find all acceptable patterns in the content?

Distance Definition
Strings s1 and s2 have distance K if K is the minimum number of steps needed to transform s1 into s2. Each step must be one of the following actions:
• Change a character
• Insert a character
• Delete a character
e.g. “wojtk” can be transformed into “wjeek” in 3 steps and no fewer, so Distance(“wojtk”, “wjeek”) = 3

Distance Calculation
Dynamic programming (similar to the longest common subsequence calculation)
For s1[1,m], s2[1,n], 0<i<=m, 0<j<=n:
    Distance(i, j) = min{ Distance(i-1, j) + 1,
                          Distance(i, j-1) + 1,
                          Distance(i-1, j-1) + f(s1[i], s2[j]) }
    where f(a, b) = (a == b) ? 0 : 1
Base cases: Distance(i, 0) = i and Distance(0, j) = j
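
A direct sketch of this recurrence in Python follows. It fills the full (m+1) x (n+1) table for readability, using Distance(i, 0) = i and Distance(0, j) = j as base cases; the function name edit_distance is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of changes, insertions and deletions turning s1 into s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i characters of s1
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete from s1
                             dist[i][j - 1] + 1,        # insert into s1
                             dist[i - 1][j - 1] + f)    # keep or change
    return dist[m][n]
```

A typo filter could then accept a candidate word when its distance to “University” stays below a chosen threshold.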

Final Thoughts
• String searching is critical to most applications
• It is a problem you have to deal with, unless you don't care how indexOf() is implemented
• 2D pattern matching is a hot topic in image/video research, e.g. object detection, face recognition, etc.
• Many interesting questions remain, e.g. symmetric patterns, shortest common string
• Be prepared to answer such questions in Microsoft/Google/etc. interviews
