A Fast Regular Expression Indexing Engine

Junghoo “John” Cho (UCLA), Sridhar Rajagopalan (IBM)

Problem
• How can we match a regular expression fast?
• Large text corpus
• Several days to match a simple regular expression!
• Our solution
• Use an index!
Junghoo "John" Cho (UCLA Computer Science)

Motivation
• Advanced search interface
• What is the middle name of Thomas Edison?
• State-of-the-art: Keyword-based
• Thomas Edison
• Regular expression
• Thomas [a-z]+ Edison
• Data extraction [Brin 98]

Outline
• Index key selection
• Useful gram
• Algorithm for key selection
• Other issues
• Experiments

Motivating example
• All mp3 URLs on the Web: <a href=("|')?.*\.mp3("|')?>
• Every matching string contains mp3.
• Questions:
• Should we index “mp3”?
• Should we index “<a href=”?

What index entries?
• Solution 1: Inverted index (English words)
• Cannot handle many regular expressions
• Solution 2: k-grams for k = 1, 2, …, 10
• Index too large (10 times as large!)
• Our solution: multigram

Main idea
• “mp3” is helpful.
• Not many pages have it.
• “<a href=” is not.
• All pages have it.
• We index only “useful” grams.

Gram selectivity
• Sel(x): selectivity of gram x
• Sel(x) = M(x)/N
• M(x): number of pages containing gram x
• N: total number of pages
• C-useful gram: any gram x with Sel(x) < C
• C: system parameter (random access vs. sequential access time)
• We index only “C-useful” grams
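The definition of Sel(x) above can be illustrated with a minimal Python sketch. The function names, the toy pages, and the threshold value 0.5 are assumptions made for illustration; the slides do not say how C is chosen in practice.

```python
def selectivity(gram, pages):
    """Sel(x) = M(x) / N: the fraction of pages that contain the gram."""
    matching = sum(1 for page in pages if gram in page)  # M(x)
    return matching / len(pages)                         # N = len(pages)


def is_useful(gram, pages, c):
    """A gram is C-useful when its selectivity is strictly below C."""
    return selectivity(gram, pages) < c


# Toy corpus (hypothetical): every page has a link, only one links to an mp3.
PAGES = [
    '<a href="song.mp3">song</a>',
    '<a href="index.html">home</a>',
    '<a href="about.html">about</a>',
    '<a href="news.html">news</a>',
]

if __name__ == "__main__":
    for gram in ("mp3", "<a href="):
        print(gram, selectivity(gram, PAGES), is_useful(gram, PAGES, c=0.5))
```

On this toy corpus “mp3” appears in one page out of four and is useful, while “<a href=” appears in every page and is not.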

Minimal useful gram
• “Unix is great”
• If “Unix” is useful, then “Unix i”, “Unix is”, “Unix is g”, … are all useful.
• “Unix” is the minimal useful gram.
• We index only the minimal useful gram.

Advantages
• Versatile
• We can look up “Unix” for all grams like “Unix i”, “Unix is g”, etc.
• Easy to find
• Reduction to “A priori” algorithm
• Index size guarantee
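The “versatile” point can be sketched as a two-step lookup: use the indexed grams to narrow down candidate pages, then run the full regular expression only on those candidates. This is a hedged illustration, not the engine described in the talk. The helper names are hypothetical, and the `literal` argument (a string that every match must contain) is assumed to be supplied by the caller.

```python
import re


def build_index(pages, keys):
    """Map each index key (gram) to the set of page ids that contain it."""
    return {key: {pid for pid, page in enumerate(pages) if key in page}
            for key in keys}


def candidate_pages(index, literal, num_pages):
    """Intersect the postings of every index key that occurs inside `literal`."""
    candidates = set(range(num_pages))
    for key, postings in index.items():
        if key in literal:
            candidates &= postings
    return candidates


def match(pages, index, pattern, literal):
    """Run the regex only on candidate pages and return the matching page ids."""
    regex = re.compile(pattern)
    candidates = candidate_pages(index, literal, len(pages))
    return sorted(pid for pid in candidates if regex.search(pages[pid]))


PAGES = ["Unix is great", "Linux is fine", "Unix was first"]
INDEX = build_index(PAGES, keys={"Unix"})  # only the minimal gram is indexed

if __name__ == "__main__":
    print(match(PAGES, INDEX, r"Unix is g[a-z]+", literal="Unix is g"))
```

Here the longer gram “Unix is g” is never indexed; the lookup falls back on the indexed key “Unix”, which is a substring of it.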

Algorithm
• Main idea:
• If “abcde” is a minimal useful gram, then “abcd” is not useful.
• If “abcd” is not useful, then “a”, “ab”, “abc” are not useful.
• Minimal useful gram identification is equivalent to useless gram identification.

A priori algorithm
• Useless gram identification
• Find all sequences of characters that occur in more than k pages
• A priori algorithm
• Find all sets of items that occur in more than k baskets
• Fewer than 4 scans of the corpus to find all minimal useful grams.
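The level-wise idea can be sketched as follows: grow grams one character at a time, extend only grams that are still useless, and record a gram as minimal useful the first time its selectivity drops below the threshold. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function name and the `max_len` cap are hypothetical, and the sketch makes one corpus scan per gram length rather than reproducing the scan count quoted on the slide.

```python
from collections import defaultdict


def minimal_useful_grams(pages, c, max_len=10):
    """Return useful grams whose one-character-shorter prefix is useless."""
    n = len(pages)
    minimal = set()
    useless = {""}  # the empty gram occurs in every page, so it is useless
    for k in range(1, max_len + 1):
        # One scan: count pages per length-k gram that extends a useless prefix.
        page_count = defaultdict(int)
        for page in pages:
            seen = set()
            for i in range(len(page) - k + 1):
                gram = page[i:i + k]
                if gram[:-1] in useless:
                    seen.add(gram)
            for gram in seen:
                page_count[gram] += 1
        if not page_count:
            break  # no useless gram could be extended any further
        useless = set()
        for gram, count in page_count.items():
            if count / n < c:
                minimal.add(gram)   # useful, and its prefix was useless
            else:
                useless.add(gram)   # still useless: extend in the next round
    return minimal


if __name__ == "__main__":
    print(sorted(minimal_useful_grams(["abc", "abd", "abe", "xbc"], c=0.5)))
```

Minimality here is with respect to prefixes only, so a result such as “bd” can still contain a shorter useful suffix.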

Prefix free set
• A set of grams X is prefix free if no x ∈ X is a prefix of any other x′ ∈ X.
• E.g., X = {ab, ac, abc} is not prefix free.
• A set of minimal useful grams is a prefix free set.
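The prefix free property is easy to check mechanically. The sketch below is a minimal illustration with a hypothetical function name; it sorts the grams and compares neighbours only.

```python
def is_prefix_free(grams):
    """Return True if no gram in the set is a proper prefix of another."""
    ordered = sorted(set(grams))
    # In sorted order, a gram that is a prefix of any other gram is also a
    # prefix of its immediate successor, so checking adjacent pairs suffices.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    print(is_prefix_free({"ab", "ac", "abc"}))
    print(is_prefix_free({"ab", "ac", "bc"}))
```

The slide's example {ab, ac, abc} fails the check because “ab” is a prefix of “abc”.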

Size of a prefix free set
• Let X be a prefix free set of grams extracted from corpus D. Then |X| ≤ |D|.
• |X|: number of grams in X
• |D|: number of characters in D
• The size of an index with minimal useful grams does not exceed the size of the corpus!
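One way to see why the bound holds is a counting argument over positions in D. The following is a sketch of such an argument, not necessarily the proof the authors had in mind; the notation S(p) is introduced here for illustration.

```latex
\begin{align*}
S(p) &= \{\, x \in X : x \text{ occurs in } D \text{ starting at position } p \,\},
        \qquad 1 \le p \le |D| \\
|S(p)| &\le 1
        \quad \text{(two distinct grams starting at the same } p
        \text{ would make the shorter one a prefix of the longer)} \\
|X| &\le \sum_{p=1}^{|D|} |S(p)| \;\le\; |D|
        \quad \text{(every } x \in X \text{ occurs in } D \text{ at least once)}
\end{align*}
```

The middle sum is also the total number of (gram, position) occurrences, so under this argument the number of positional postings is bounded by |D| as well.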

Shortest suffix gram
• <a href="k
• If ="k is useful, then <a href="k, a href="k, href="k, etc. are all useful.
• ="k: shortest suffix gram
• We index only the shortest suffix gram.
• Pre-suf shell
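The suffix rule can be sketched as a filter over a set of useful grams: drop any gram that has a shorter useful gram as a proper suffix. The function name is hypothetical, and the input set is assumed to be given; the slides do not spell out how this step is combined with prefix minimality.

```python
def shortest_suffix_grams(useful_grams):
    """Keep only grams that have no proper suffix in the same set."""
    useful = set(useful_grams)
    kept = set()
    for gram in useful:
        # Proper suffixes of gram are gram[1:], gram[2:], ..., gram[-1:].
        has_shorter_useful_suffix = any(
            gram[i:] in useful for i in range(1, len(gram))
        )
        if not has_shorter_useful_suffix:
            kept.add(gram)
    return kept


USEFUL = {'<a href="k', 'a href="k', 'href="k', '="k'}

if __name__ == "__main__":
    print(shortest_suffix_grams(USEFUL))
```

For the slide's example, only the gram `="k` survives the filter.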

Other issues
• Given a regular expression, how do we find an index entry to look up?
• Optimization?

Experiments
• Half a million Web documents
• Comparison
• Raw scanning
• Multigram index
• Complete: k-grams for k = 1, 2, …, 10
• Benchmark queries
• No standard
• Collected from IBM Almaden researchers

Example queries (simplified)
• MP3 URLs: <a href=.*\.mp3>
• Invalid HTML: <[^>]*<
• Phone numbers: (\d\d\d) \d\d\d-\d\d\d\d
• PowerPC chip number: (xpc|mpc)[0-9]+[0-9a-z]+
• Middle name of Clinton: William [a-z]+ Clinton
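To get a feel for these queries, they can be tried with Python's `re` module. The sketch below adapts the patterns under two assumptions that are not stated on the slide: the parentheses in the phone pattern are escaped so that they match literally, and matching is case-insensitive. The sample strings in the usage lines are made up.

```python
import re

# Patterns adapted from the slide. Assumptions: the phone parentheses are
# escaped to match literally, and all patterns are applied ignoring case.
QUERIES = {
    "mp3_url": r"<a href=.*\.mp3>",
    "invalid_html": r"<[^>]*<",
    "phone": r"\(\d\d\d\) \d\d\d-\d\d\d\d",
    "powerpc": r"(xpc|mpc)[0-9]+[0-9a-z]+",
    "middle_name": r"William [a-z]+ Clinton",
}


def find_matches(name, text):
    """Return every substring of `text` matched by the named query."""
    pattern = QUERIES[name]
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]


if __name__ == "__main__":
    print(find_matches("invalid_html", "<b <i>"))
    print(find_matches("phone", "call (310) 555-0100 now"))
    print(find_matches("middle_name", "William Jefferson Clinton"))
```

Without the case-insensitive flag, `[a-z]+` would not match a capitalized middle name.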

Evaluation metrics
• Index construction time
• Index size
• Matching time
• Overall throughput
• Response time for first 10 matches

Construction time & Index size
• An order of magnitude reduction in index size

Matching time
• On average, Complete is only 33% faster than Multigram

Result size & Improvement

Related work
• Suffix tree
• Baeza-Yates et al., JACM, 1998
• Main-memory based
• Disk-based string index
• Cooper et al., VLDB, 2001
• Good for exact string matching
• Inverted index
• English words

Conclusion
• Fast matching of regular expressions
• Multigram index
• Small size
• Significant improvement in matching time
• Future work
• Optimization?
