Regular Expression and Finite State Machine Based on Slides by Jim Martin
Regular Expressions and Text Searching • Everybody does it • Emacs, vi, perl, grep, sed, awk, etc. • REs • Character sequence • Kleene star • Character set, complement set • Anchors • Disjunction • Grouping
Some Examples Courtesy of Kathy McCoy
RE                      E.G.
/pupp(y|ies)/           Morphological variants of ‘puppy’
/(.+)ier and \1ier/     happier and happier, fuzzier and fuzzier
Courtesy of Kathy McCoy
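The two patterns above can be tried out directly. This is a minimal sketch using Python's `re` module; the helper names `is_puppy_variant` and `has_repeated_comparative` are made up for illustration and do not come from the slides.

```python
import re

# Grouping plus disjunction: "pupp" followed by either "y" or "ies".
PUPPY = re.compile(r"pupp(y|ies)")

# Backreference: \1 must repeat exactly what the first group captured.
REPEATED = re.compile(r"(.+)ier and \1ier")


def is_puppy_variant(word: str) -> bool:
    """True if the whole word is 'puppy' or 'puppies'."""
    return PUPPY.fullmatch(word) is not None


def has_repeated_comparative(text: str) -> bool:
    """True if the text contains 'Xier and Xier' with the same X twice."""
    return REPEATED.search(text) is not None


if __name__ == "__main__":
    for word in ["puppy", "puppies", "puppys"]:
        print(word, is_puppy_variant(word))
    for text in ["happier and happier", "happier and fuzzier"]:
        print(text, has_repeated_comparative(text))
```

The backreference is what rules out "happier and fuzzier": the second half has to repeat the captured stem, not just match the same sub-pattern.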
Optionality and Repetition • /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck • /colou?r/ matches color or colour • /he{3}/ matches heee • /(he){3}/ matches hehehe • /(he){3,}/ matches a sequence of at least 3 he’s Courtesy of Kathy McCoy
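A quick way to check these claims is to test each pattern against whole strings. The sketch below uses Python's `re.fullmatch`; the helper `matches` is an illustrative name, not part of the slides.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    checks = [
        (r"[Ww]oodchucks?", "Woodchuck"),   # optional trailing s
        (r"colou?r", "colour"),             # optional u
        (r"he{3}", "heee"),                 # counter applies to e only
        (r"he{3}", "hehehe"),               # so this one fails
        (r"(he){3}", "hehehe"),             # counter applies to the group
        (r"(he){3,}", "hehehehe"),          # at least three repetitions
        (r"(he){3,}", "hehe"),              # only two, so this fails
    ]
    for pattern, text in checks:
        print(f"/{pattern}/ on {text!r}: {matches(pattern, text)}")
```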
Operator Precedence Hierarchy (highest to lowest) 1. Parentheses () 2. Counters * + ? {} 3. Sequences and anchors the ^my end$ 4. Disjunction | Examples /moo+/ /try|ies/ /and|or/ Courtesy of Kathy McCoy
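The three example patterns show why the hierarchy matters. As a rough illustration (Python's `re` follows the same precedence; the helper name `whole` is hypothetical), counters bind tighter than sequences, and sequences bind tighter than disjunction:

```python
import re


def whole(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # + applies to the last o only, not to the sequence "moo".
    print(whole(r"moo+", "moooo"), whole(r"moo+", "moomoo"))

    # | splits the whole sequence: "try" or "ies", not "tr" + ("y" or "ies").
    print(whole(r"try|ies", "tries"), whole(r"tr(y|ies)", "tries"))

    # Same for /and|or/: the alternatives are "and" and "or".
    print(whole(r"and|or", "or"), whole(r"and|or", "anor"))
```

Parentheses, at the top of the hierarchy, are the way to override the default grouping, as in `tr(y|ies)`.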
A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy
A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /the/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy
A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /[Tt]he/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy
A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /\b[Tt]he\b/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy
The Two Kinds of Errors • The process we just went through was based on fixing errors in the regular expression • Errors where some of the instances were missed (judged to not be instances when they should have been) – False negatives • Errors where the instances were included (when they should not have been) – False positives • This is pretty much going to be the story of the rest of the course! Courtesy of Kathy McCoy
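The three attempts from the exercise can be run against the example sentence to see both kinds of error. This is a small sketch with Python's `re`; the function name `find_hits` is an illustrative choice.

```python
import re

SENTENCE = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)

# The successive attempts from the exercise, in order.
ATTEMPTS = [r"the", r"[Tt]he", r"\b[Tt]he\b"]


def find_hits(pattern: str, text: str = SENTENCE) -> list:
    """Return (offset, matched text) for every match of the pattern."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    for pattern in ATTEMPTS:
        hits = find_hits(pattern)
        offsets = [start for start, _ in hits]
        print(f"/{pattern}/ -> {len(hits)} hits at offsets {offsets}")
```

The first pattern misses the capitalized "The" (a false negative) and also fires inside "their", "gathered", and "southern" (false positives). The second fixes the miss but keeps the false positives; the word boundaries in the third remove them.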
Finite State Automata as Graphs • Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. • Let’s start with the sheep language from the text • /baa+!/
Sheep FSA • We can say the following things about this machine • It has 5 states • At least b, a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions
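The machine described above can be written as a transition table. The slide's diagram is not reproduced here, so the sketch below assumes the usual layout for /baa+!/: states q0 to q4 in a chain with a self-loop on a at q3. The names `TRANSITIONS` and `accepts` are illustrative.

```python
# (current state, input symbol) -> next state; five transitions in total.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of extra a's
    (3, "!"): 4,
}
START = 0
ACCEPT = {4}


def accepts(text: str) -> bool:
    """Run the deterministic machine over the text, one symbol at a time."""
    state = START
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPT


if __name__ == "__main__":
    for candidate in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
        print(candidate, accepts(candidate))
```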
But note • There are other machines that correspond to this language • More on this one later
Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions
Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational
Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original
Nouns and Verbs (English) • Nouns are simple (not really) • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb
Regulars and Irregulars • Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.
Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut
Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. • Quasi-systematicity • Irregular meaning change • Changes of word class
Derivational Examples • Verb/Adj to Noun
Derivational Examples • Noun/Verb to Adj
Compute • Many paths are possible… • Start with compute • Computer -> computerize -> computerization • Computation -> computational • Computer -> computerize -> computerizable • Compute -> computee
Stemming vs Morphology • Sometimes you just need to know the stem of a word and you don’t care about the structure. • In fact you may not even care if you get the right stem, as long as you get a consistent string. • This is stemming… it most often shows up in IR applications
Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users’ queries • Match • This is basically a form of hashing
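The index-then-match pipeline can be sketched in a few lines. Everything below is illustrative: `crude_stem` is a hypothetical stand-in that strips a handful of suffixes, not a real stemmer, and the document collection is made up.

```python
from collections import defaultdict

# Hypothetical suffix list; a real stemmer uses staged rewrite rules.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one known suffix if enough of the word is left over."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Stem the query terms the same way, then intersect the postings."""
    postings = [index.get(crude_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    docs = {1: "wearing hats", 2: "he wears boots", 3: "weather report"}
    index = build_index(docs)
    print(search(index, "wear"))
    print(search(index, "wearing boots"))
```

The stem acts as the hash key: it only has to be consistent between documents and queries, not linguistically correct.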
Porter Stemmer • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes • Doesn’t guarantee that the resulting stem is really a stem (see first bullet) • Lack of guarantee doesn’t matter for IR
Porter Stemmer: Examples
Word           Stem
wear           wear
wearable       wearabl
wearer         wearer
wearied        weari
wearier        wearier
weariest       weariest
wearily        wearili
weariness      weari
wearing        wear
wearisome      wearisom
wearisomely    wearisom
wears          wear
weather        weather
weathercock    weathercock
weathercocks   weathercock
web            web
Webb           webb
Webber         webber
webs           web
Webster        webster
Websterville   webstervil
wedded         wedd
wedding        wedd
weddings       wedd
wedge          wedg
wedged         wedg
wedges         wedg
wedging        wedg
static RuleList step1a_rules[] = {
    {101, "sses", "ss", 3, 1, 0, NULL},
    {102, "ies", "i", 2, 0, 0, NULL},
    {103, "ss", "ss", 1, 1, 0, NULL},
    {104, "s", LAMBDA, 0, -1, 0, NULL},
    {000, NULL, NULL, 0, 0, 0, NULL}
};

static RuleList step1b_rules[] = {
    {105, "eed", "ee", 2, 1, 0, NULL},
    {106, "ed", LAMBDA, 1, -1, -1, ContainsVowel},
    {107, "ing", LAMBDA, 2, -1, -1, ContainsVowel},
    {000, NULL, NULL, 0, 0, 0, NULL}
};
static RuleList step1b1_rules[] = {
    {108, "at", "ate", 1, 2, 0, NULL},
    {109, "bl", "ble", 1, 2, 0, NULL},
    {110, "iz", "ize", 1, 2, 0, NULL},
    {111, "bb", "b", 1, 0, 0, NULL},
    {112, "dd", "d", 1, 0, 0, NULL},
    {113, "ff", "f", 1, 0, 0, NULL},
    {114, "gg", "g", 1, 0, 0, NULL},
    {115, "mm", "m", 1, 0, 0, NULL},
    {116, "nn", "n", 1, 0, 0, NULL},
    {117, "pp", "p", 1, 0, 0, NULL},
    {118, "rr", "r", 1, 0, 0, NULL},
    {119, "tt", "t", 1, 0, 0, NULL},
    {120, "ww", "w", 1, 0, 0, NULL},
    {121, "xx", "x", 1, 0, 0, NULL},
    {122, LAMBDA, "e", -1, 0, 0, AddAnE},
    {000, NULL, NULL, 0, 0, 0, NULL}
};
static RuleList step1c_rules[] = {
    {123, "y", "i", 0, 0, -1, ContainsVowel},
    {000, NULL, NULL, 0, 0, 0, NULL}
};

static RuleList step2_rules[] = {
    {203, "ational", "ate", 6, 2, 0, NULL},
    {204, "tional", "tion", 5, 3, 0, NULL},
    {205, "enci", "ence", 3, 3, 0, NULL},
    {206, "anci", "ance", 3, 3, 0, NULL},
    {207, "izer", "ize", 3, 2, 0, NULL},
    {208, "abli", "able", 3, 3, 0, NULL},
    {209, "alli", "al", 3, 1, 0, NULL},
    {210, "entli", "ent", 4, 2, 0, NULL},
    {211, "eli", "e", 2, 0, 0, NULL},
    {213, "ousli", "ous", 4, 2, 0, NULL},
static RuleList step3_rules[] = {
    {301, "icate", "ic", 4, 1, 0, NULL},
    {302, "ative", LAMBDA, 4, -1, 0, NULL},
    {303, "alize", "al", 4, 1, 0, NULL},
    {304, "iciti", "ic", 4, 1, 0, NULL},
    {305, "ical", "ic", 3, 1, 0, NULL},
    {308, "ful", LAMBDA, 2, -1, 0, NULL},
    {309, "ness", LAMBDA, 3, -1, 0, NULL},
    {000, NULL, NULL, 0, 0, 0, NULL}
};
static RuleList step4_rules[] = {
    {401, "al", LAMBDA, 1, -1, 1, NULL},
    {402, "ance", LAMBDA, 3, -1, 1, NULL},
    {403, "ence", LAMBDA, 3, -1, 1, NULL},
    {405, "er", LAMBDA, 1, -1, 1, NULL},
    {406, "ic", LAMBDA, 1, -1, 1, NULL},
    {407, "able", LAMBDA, 3, -1, 1, NULL},
    {408, "ible", LAMBDA, 3, -1, 1, NULL},
    {409, "ant", LAMBDA, 2, -1, 1, NULL},
    {410, "ement", LAMBDA, 4, -1, 1, NULL},
    {411, "ment", LAMBDA, 3, -1, 1, NULL},
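To see how one stage of these tables behaves, here is a simplified Python sketch of the step 1a rules only. It keeps just the suffix pairs from `step1a_rules` and ignores the numeric fields and condition functions, whose meaning the slides do not spell out, so it is an illustration of "first matching suffix rule wins", not a port of the C code. The names `STEP_1A` and `apply_first_rule` are hypothetical.

```python
# (old suffix, replacement) pairs taken from step1a_rules; LAMBDA becomes "".
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_first_rule(word: str, rules: list) -> str:
    """Rewrite the word with the first rule whose suffix matches.

    Rule order matters: "ss" -> "ss" comes before "s" -> "" so that
    words such as "caress" are left alone.
    """
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


if __name__ == "__main__":
    for word in ["caresses", "ponies", "caress", "wears", "web"]:
        print(word, "->", apply_first_rule(word, STEP_1A))
```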
Soundex • You work as a telephone information operator. Someone calls looking for our senior theory professor… • What do you type as your query string?
Soundex • Keep the first letter • Drop non-initial occurrences of vowels, h, w and y • Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1) • Replace strings of identical numbers with a single number (333 -> 3) • Drop any numbers beyond the third one
Soundex • Effect is to map (hash) all similar sounding transcriptions to the same code. • Structure your directory so that it can be accessed by code as well as by correct spelling • Used for census records, phone directories, author searches in libraries etc.
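The Soundex steps listed above translate almost line for line into code. The sketch below follows the slide's simplified procedure in the order given; full Soundex implementations typically also pad short codes with zeros and treat vowels as separators between repeated codes, which this version does not do. The letter groups are the standard Soundex ones, and the function name `soundex_sketch` is made up for illustration.

```python
# Standard Soundex letter groups.
GROUPS = [
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}
DROPPED = set("aeiouhwy")


def soundex_sketch(name: str) -> str:
    """Simplified Soundex following the steps on the slide."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    # Step 1: keep the first letter.
    first = letters[0].upper()
    # Step 2: drop non-initial vowels, h, w and y.
    kept = [c for c in letters[1:] if c not in DROPPED]
    # Step 3: replace the remaining letters with their group number.
    digits = [CODES[c] for c in kept]
    # Step 4: collapse runs of identical numbers.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Step 5: drop any numbers beyond the third.
    return first + "".join(collapsed[:3])


if __name__ == "__main__":
    for name in ["Robert", "Rupert", "Smith", "Smyth", "Martin", "Marten"]:
        print(name, soundex_sketch(name))
```

Names that sound alike but are spelled differently end up with the same code, which is what lets a directory be looked up by code as well as by exact spelling.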