A Fast Regular Expression Indexing Engine Junghoo “John” Cho (UCLA) Sridhar Rajagopalan (IBM)
Problem • How can we match a regular expression fast? • Large text-corpus • Several days to match a simple regular expression! • Our solution • Use an index! Junghoo "John" Cho (UCLA Computer Science)
Motivation • Advanced search interface • What is the middle name of Thomas Edison? • State-of-the-art: Keyword-based • Thomas Edison • Regular expression • Thomas [a-z]+ Edison • Data extraction [Brin 98]
Outline • Index key selection • Useful gram • Algorithm for key selection • Other issues • Experiments
Motivating example • All mp3 URLs on the Web: <a href=(“|’)?.*\.mp3(“|’)?> • Every matching string contains mp3. • Questions: • Should we index “mp3”? • Should we index “<a href=”?
What index entries? • Solution 1: Inverted index (English words) • Cannot handle many regular expressions • Solution 2: k-grams for k = 1, 2, …, 10 • Index too large (10 times as large!) • Our solution: multigram
Main idea • “mp3” is helpful. • Not many pages have it. • “<a href=” is not. • All pages have it. • We index only “useful” grams.
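The idea on this slide can be sketched as follows: look up a selective gram such as "mp3" to get candidate pages, then run the regular expression only on those candidates. This is a minimal illustration, not the authors' engine. The names `build_gram_index` and `match_with_index`, the toy pages, and the choice of "mp3" as the gram to look up are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_gram_index(pages, grams):
    """Map each chosen gram to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in grams:
            if gram in text:
                index[gram].add(page_id)
    return index

def match_with_index(pages, index, pattern, required_gram):
    """Run the regex only on pages that contain required_gram."""
    regex = re.compile(pattern)
    candidates = index.get(required_gram, set())
    return sorted(pid for pid in candidates if regex.search(pages[pid]))

# Toy corpus (hypothetical): only page 1 holds an mp3 link.
pages = {
    1: '<a href="song.mp3">song</a>',
    2: '<a href="index.html">home</a>',
    3: 'plain text mentioning mp3 players',
}
index = build_gram_index(pages, grams=["mp3"])
hits = match_with_index(pages, index, r'<a href=("|\')?.*\.mp3("|\')?>', "mp3")
print(hits)
```

Page 2 is never scanned because it does not contain the gram. Page 3 is a false candidate that the regular expression then rejects.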
Gram selectivity • Sel(x): selectivity of gram x • Sel(x) = M(x)/N • M(x): number of pages containing gram x • N: total number of pages • C-useful gram: any gram x with Sel(x) < C • C: system parameter (random access vs. sequential access time) • We index only “C-useful” grams
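The definition Sel(x) = M(x)/N translates directly into a few lines of code. The sketch below assumes the corpus fits in memory as a list of strings. The function names and the threshold value 0.5 are illustrative and do not come from the slides.

```python
def selectivity(gram, pages):
    """Sel(x) = M(x) / N: fraction of pages that contain the gram."""
    containing = sum(1 for text in pages if gram in text)
    return containing / len(pages)

def is_useful(gram, pages, c):
    """A gram is C-useful when its selectivity is below the threshold C."""
    return selectivity(gram, pages) < c

# Toy corpus (hypothetical): every page has a link, one page has an mp3.
corpus = [
    '<a href="a.mp3">',
    '<a href="b.html">',
    '<a href="c.html">',
    '<a href="d.html">',
]
print(selectivity("mp3", corpus))       # 0.25
print(selectivity("<a href=", corpus))  # 1.0
print(is_useful("mp3", corpus, c=0.5), is_useful("<a href=", corpus, c=0.5))
```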
Minimal useful gram • “Unix is great” • If “Unix” is useful, then “Unix i”, “Unix is”, “Unix is g”, … are all useful. • “Unix” is the minimal useful gram. • We index only the minimal useful gram.
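Extending a gram can only shrink the set of pages that contain it, so once a prefix is useful every longer prefix is useful too. A brute-force way to find the minimal useful gram of a string is therefore to grow the prefix until it first becomes useful. The sketch below assumes an in-memory corpus. The name `shortest_useful_prefix`, the toy pages, and C = 0.5 are hypothetical.

```python
def shortest_useful_prefix(s, pages, c):
    """Return the shortest prefix of s whose selectivity is below c, or None."""
    n = len(pages)
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        pages_with_prefix = sum(1 for text in pages if prefix in text)
        if pages_with_prefix / n < c:
            return prefix
    return None

# Toy corpus (hypothetical). "U" is on 4 pages, "Un" on 3, "Uni" on 2, "Unix" on 1.
docs = ["Unix is great", "Using Linux", "Under the sun", "Unit tests"]
print(shortest_useful_prefix("Unix is great", docs, c=0.5))
```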
Advantages • Versatile • We can look up “Unix” for all grams like “Unix i”, “Unix is g”, etc. • Easy to find • Reduction to “A priori” algorithm • Index size guarantee
Algorithm • Main idea: • If “abcde” is a minimal useful gram, then “abcd” is not useful. • If “abcd” is not useful, then “a”, “ab”, “abc” are not useful. • Minimal useful gram identification is equivalent to useless gram identification.
A priori algorithm • Useless gram identification • Find all sequences of characters that occur in more than k pages • A priori algorithm • Find all sets of items that occur in more than k baskets • Fewer than 4 scans of the corpus to find all minimal useful grams.
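The level-wise search described above can be sketched as follows. At each length k only grams whose (k-1)-character prefix is useless are counted. A counted gram that turns out to be useful is a minimal useful gram, and the rest are extended at the next level. This is a simplified illustration under stated assumptions: it makes one pass over the corpus per gram length, so it does not reproduce the "fewer than 4 scans" behaviour reported on the slide, and the name `minimal_useful_grams` and the `max_len` cap are hypothetical.

```python
from collections import defaultdict

def minimal_useful_grams(pages, c, max_len=10):
    """Level-wise (A priori style) search for minimal useful grams."""
    n = len(pages)
    useless = {""}  # the empty gram occurs on every page, so it is useless
    minimal_useful = set()
    for k in range(1, max_len + 1):
        # Count, per candidate k-gram, how many pages contain it.
        doc_freq = defaultdict(int)
        for text in pages:
            seen = set()
            for i in range(len(text) - k + 1):
                gram = text[i:i + k]
                if gram[:-1] in useless:  # only extend useless grams
                    seen.add(gram)
            for gram in seen:
                doc_freq[gram] += 1
        next_useless = set()
        for gram, count in doc_freq.items():
            if count / n < c:
                minimal_useful.add(gram)  # useful, and its prefix was useless
            else:
                next_useless.add(gram)
        if not next_useless:
            break
        useless = next_useless
    return minimal_useful

found = minimal_useful_grams(["abc", "abd", "xbc"], c=0.5)
print(sorted(found))
```

On this toy corpus the grams "a", "b", "c", "ab" and "bc" occur on at least half of the pages and are useless, so the search extends them until each extension becomes useful.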
Prefix free set • A set of grams X is prefix free if no x ∈ X is a prefix of any other x’ ∈ X • e.g.) X = {ab, ac, abc} is not prefix free. • A set of minimal useful grams is a prefix free set.
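The prefix-free property is easy to check in code. The sketch below relies on the fact that, after sorting, a gram that is a prefix of any other gram is also a prefix of its immediate successor, so comparing neighbours is enough. The function name `is_prefix_free` is an assumption for the example.

```python
def is_prefix_free(grams):
    """True if no gram in the set is a prefix of a different gram in the set."""
    ordered = sorted(set(grams))
    # In sorted order, a prefix is always directly followed by one of its extensions.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

print(is_prefix_free({"ab", "ac", "abc"}))  # False: "ab" is a prefix of "abc"
print(is_prefix_free({"ab", "ac", "bcd"}))  # True
```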
Size of a prefix free set • Let X be a prefix free set of grams extracted from corpus D. Then |X| ≤ |D| • |X|: number of grams in X • |D|: number of characters in D • The size of an index with minimal useful grams does not exceed the size of the corpus!
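One way to see why the bound holds is the short counting argument below. It is a sketch rather than the authors' proof, and the position map p is notation introduced here.

```latex
\begin{align*}
&\text{For each } x \in X \text{ pick one position } p(x) \in \{1, \dots, |D|\}
  \text{ at which } x \text{ occurs in } D. \\
&\text{If } p(x) = p(x') \text{ and } |x| \le |x'|, \text{ then } x \text{ is a prefix of } x',
  \text{ so } x = x' \text{ because } X \text{ is prefix free.} \\
&\text{Hence } p \text{ is injective, and } |X| \le |D|.
\end{align*}
```

The same reasoning shows that each character position in D starts an occurrence of at most one gram in X, so the total number of occurrences recorded for grams in X is also at most |D|.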
Shortest suffix gram • <a href=“k • If =“k is useful, then <a href=“k, a href=“k, href=“k, etc. are all useful. • =“k: shortest suffix gram • We index only the shortest suffix gram. • Pre-suf shell
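One reading of this slide is that, among the selected grams, any gram with a proper suffix that is also selected can be dropped, because the shorter suffix already covers it. The sketch below applies that reduction to a set of grams. It is an interpretation of the slide and not necessarily the exact pre-suf shell construction used by the authors, and the name `suffix_reduce` is hypothetical.

```python
def suffix_reduce(grams):
    """Keep only grams that have no proper suffix in the same set."""
    gram_set = set(grams)
    return {
        g for g in gram_set
        if not any(g[i:] in gram_set for i in range(1, len(g)))
    }

# '="k' is a suffix of the two longer grams, so only it (and "mp3") survive.
reduced = suffix_reduce({'<a href="k', 'href="k', '="k', "mp3"})
print(sorted(reduced))
```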
Other issues • Given a regular expression, how do we find an index entry to look up? • Optimization?
Experiments • Half a million Web documents • Comparison • Raw scanning • Multigram index • Complete: k-grams for k = 1, 2, …, 10 • Benchmark queries • No standard • Collected from IBM Almaden researchers
Example queries (simplified) • MP3 URLs: <a href=.*\.mp3> • Invalid HTML: <[^>]*< • Phone numbers: (\d\d\d) \d\d\d-\d\d\d\d • PowerPC chip number: (xpc|mpc)[0-9]+[0-9a-z]+ • Middle name of Clinton: William [a-z]+ Clinton
Evaluation metrics • Index construction time • Index size • Matching time • Overall throughput • Response time for first 10 matches
Construction time & Index size • An order of magnitude reduction in index size
Matching time • On average, Complete is only 33% faster than Multigram
Result size & Improvement
Related work • Suffix tree • Baeza-Yates et al., JACM, 1998 • Main-memory based • Disk-based string index • Cooper et al., VLDB, 2001 • Good for exact string matching • Inverted index • English words
Conclusion • Fast matching of regular expressions • Multigram index • Small size • Significant improvement in matching time • Future work • Optimization?