Regular Expressions and Automata in Natural Language Analysis

Regular Expressions and Automata in Natural Language Analysis CS 4705 CS 4705

Statistical vs. Symbolic (Knowledge Rich) Techniques • How much linguistic knowledge do our representations and algorithms need to have to do ‘successful’ NLP? • Bill hit John. • John, Bill hit. • 80/20 Rule: when do we need to worry about the other 20%?

Today • Review some of the simple representations and ask ourselves how we might use them to do interesting and useful things • Regular Expressions • Finite State Automata • Think about the limits of these simple approaches • When are simple methods good enough? • When do we need more?

Regular Expression Pattern Matching in NLP • Simple but powerful tools for ‘shallow’ processing, e.g. of very large corpora • What word is most likely to begin a sentence? • What word is most likely to begin a question? • How often do people end sentences with prepositions? • With other simple statistical tools, allow us to • Obtain word frequency and co-occurrence statistics • Build simple interactive applications (e.g. Eliza) • Authorship: Who wrote Shakespeare’s plays? The Federalist papers? The Unibomber letters? • Deception detection

RE Matches Possible use /./ Any character A non-blank line /\./, /\?/ A ‘.’, a ‘?’ /[bckmsr]/ Any char in set Rhyme:/[bckmrs]ite/ /[a-z]/ Any l.c. letter /[A-Z]/ Any u.c. letter /[A-Z][a-z]*/ /[^A-Z]/ Any non-u.c. char /[^A-Z][a-z]*/ Review A statement, a question Rhyme: /[a-z]ite/

RE plus E.G. /kitt(y|ies)/ Morphological variants of ‘kitty’ -- but / (.+ier) and \1 / Patterns: happier and happier, fuzzier and fuzzier, classifier and classifier

Substitutions (Transductions) • E.g. unix sed or ‘s’ operator in Perl • s/regexpr/pattern/ • s/I am feeling (.+)/Why are you feeling \1 ?/ • s/I gave (.+) to (.+)/Why would you give \2 \1 ?/ • s/You are (.+)[.]*/Why would you say that I am \1?/ • s/([1]?[0-9]) o’clock ([AaPp][. ]*[Mm][. ]*)/\1:00 \2/ • How would you convert to 24-hour clock? • s/[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/ 000-000-0000/

Possible Applications • Predictions from a news corpus: • Which candidate for President is mentioned most often in the news? Is going to win? • What stock should you buy? • Which White House advisers have the most power? • Language use: • Which form of comparative is more common: ‘Xer’ or ‘moreX’? • Which pronouns occur most often in subject position? • How often do sentences end with infinitival ‘to’? • What words typically begin and end sentences? • What are the 20 most common words in your email? In the news? In Shakespeare’s plays?

Emotional language: • What words indicate what emotions? • Happiness • Anger • Confidence • Despair • What words signal deception?

a b a a ! q0 q1 q2 q3 q4 Finite State Automata • FSAs recognize the regular languages represented by regular expressions • SheepTalk: /baa+!/ • Directed graph with labeled nodes and arc transitions • Five states: q0 the start state, q4 the final state, 5 transitions

a b a a ! q0 q1 q2 q3 q4 Formally • FSA is a 5-tuple consisting of • Q: set of states {q0,q1,q2,q3,q4} • : an alphabet of symbols {a,b,!} • q0: a start state in Q • F: a set of final states in Q {q4} • (q,i): a transition function mapping Q x  to Q

FSA recognizes (accepts) strings of a regular language • baa! • baaa! • baaaa! • … • Tape metaphor: will this input be accepted?

State Transition Table for SheepTalk

a b a a ! q0 q1 q2 q3 q4 Non-Deterministic FSAs for SheepTalk b a a ! q0 q1 q2 q3 q4 

Problems of Non-Determinism • At any choice point, we may follow the wrong arc • Potential solutions: • Save backup states at each choice point • Look-aheadin the input before making choice • Pursue alternatives in parallel • Determinizeour NFSAs (and thenminimize) • FSAs can be useful tools for recognizing – and generating – subsets of natural language • But they cannot represent all NL phenomena (center embedding: The mouse the cat chased died.)

Simple vs. linguistically rich representations…. • How do we decide what we need?

FSAs as Grammars for Natural Language dr the rev mr pat l. robinson q0 q1 q2 q3 q4 q5 q6 ms hon  mrs 

Recognizing Person Names • If we want to extract all the proper names in the news, will this work? • What will it miss? • Will it accept something that is not a proper name? • How would you change it to accept all proper names without false positives? • Precision vs. recall….

Summing Up • Regular expressions and FSAs can represent subsets of natural language as well as regular languages • Both representations may be difficult for humans to understand for any real subset of a language • Can be hard to scale up: e.g., when many choices at any point (e.g. surnames) • But quick, powerful and easy to use for small problems • Next class: • Read Ch 3.1, 3.12 (new version)

Regular Expressions and Automata in Natural Language Analysis