Lecture 4: Matching Things. Regular Expressions

Methods in Computational Linguistics II, with reference to Matt Huenerfauth’s Language Technology material

Today • Regular Expressions • Snippet on Speech Recognition • At least half of it.

Regular Expressions • Can be viewed as a way to: • Specify search patterns over a text string • Design a particular kind of machine, a Finite State Automaton (FSA) (we probably won’t cover this today) • Define a formal “language”, i.e. a set of strings
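The "set of strings" view can be sketched in Python: a string belongs to the language exactly when the whole string matches the pattern. The pattern and the candidate strings below are chosen only for illustration.

```python
import re

# The pattern defines a very small language: the set {"color", "colour"}.
pattern = re.compile(r"colou?r")

# Made-up candidate strings to test for membership.
candidates = ["color", "colour", "colouur", "colr", "colours"]

# fullmatch succeeds only if the entire string matches the pattern.
in_language = [s for s in candidates if pattern.fullmatch(s)]
print(in_language)
```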

Uses of Regular Expressions • Simple but powerful tools for large corpus analysis and ‘shallow’ processing • What word is most likely to begin a sentence? • What word is most likely to begin a question? • Are you more or less polite than the people you correspond with?
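As a rough sketch of the first question, the snippet below counts sentence-initial words in a toy corpus. The corpus text is made up, and "start of the text, or right after ., ! or ? plus whitespace" is only a crude approximation of a sentence boundary.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "The dog barked. The cat ran away. What did the bird do? It flew. The end."

# (?:...) groups without capturing, so findall returns only the word in (...).
# A sentence start is approximated as the text start or [.!?] followed by whitespace.
first_words = re.findall(r"(?:^|[.!?]\s+)([A-Za-z']+)", corpus)

counts = Counter(first_words)
most_common_word, frequency = counts.most_common(1)[0]
print(most_common_word, frequency)
```

Restricting the punctuation class to `[?]` and looking at the word before it, or at the first word of each question, would be one way to approach the second question.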

Definitions • Regular Expression: Formula in algebraic notation for specifying a set of strings • String: Any sequence of characters • Regular Expression Search • Pattern: specifies the set of strings we want to search for • Corpus: the texts we want to search through

Simple Example

More Examples

And still more examples

Optionality and Repetition /[Ww]oodchucks?/ /colou?r/ /he{3}/ /(he){3}/ /(he){3,}/
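These patterns can be tried out in Python. The helper `accepts` is a hypothetical name introduced here; it tests whether a whole string matches a pattern. The point to notice is that a repetition count applies only to the item immediately before it, so `he{3}` repeats the `e` while `(he){3}` repeats the group, and the open-ended count `{3,}` means three or more.

```python
import re

def accepts(pattern, string):
    """Return True if the entire string matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, string) is not None

# ? makes the preceding character optional.
print(accepts(r"[Ww]oodchucks?", "Woodchucks"))   # optional s present
print(accepts(r"[Ww]oodchucks?", "woodchuck"))    # optional s absent
print(accepts(r"colou?r", "color"))

# {3} applies to the single character e ...
print(accepts(r"he{3}", "heee"))
# ... unless parentheses make the whole group the repeated unit.
print(accepts(r"(he){3}", "hehehe"))
# {3,} means three or more repetitions.
print(accepts(r"(he){3,}", "hehehehe"))
```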

Character Groups Some groups of characters are used very frequently, so the RE language includes shorthands for them
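A small sketch of three shorthands that Python's `re` module provides: `\d` (digits), `\w` (letters, digits and underscore) and `\s` (whitespace). The example text is made up, and these three are assumed to be among the groups meant here.

```python
import re

# Invented example text.
text = "Room 42, floor 3: open 9am-5pm"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(digits)
print(words)
print(len(spaces))
```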

Special Characters These enable the matching of multiple occurrences of a pattern
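A minimal sketch, assuming the special characters in question are the usual repetition operators `*` (zero or more), `+` (one or more) and `?` (zero or one). The test string is invented so that the three differ visibly.

```python
import re

text = "b ba baa baaa"

zero_or_more = re.findall(r"ba*", text)  # b followed by any number of a's
one_or_more = re.findall(r"ba+", text)   # b followed by at least one a
zero_or_one = re.findall(r"ba?", text)   # b followed by at most one a

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
```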

Escape Characters Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
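The difference can be seen by running the same search with and without the backslash. The test string is made up; `re.escape` is a standard-library function that adds the backslashes for you.

```python
import re

text = "a*b aab ab"

# Escaped: \* matches a literal asterisk.
literal = re.findall(r"a\*b", text)

# Unescaped: * is a modifier meaning "zero or more of the preceding a".
modifier = re.findall(r"a*b", text)

# re.escape builds a pattern that matches its argument literally.
escaped_pattern = re.escape("a*b")

print(literal)
print(modifier)
print(re.findall(escaped_pattern, text))
```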

RE Matching in Python NLTK • Set up: • import re • from nltk.util import re_show • sent = "colourless green ideas sleep furiously" • re_show(pattern, sent) • shows where the pattern matches
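If NLTK is not installed, the idea can be approximated with the standard library alone. The function `show_matches` below is a hypothetical stand-in written for this sketch, not NLTK's own code: it returns the string with each match wrapped in braces.

```python
import re

def show_matches(pattern, string, left="{", right="}"):
    """Return string with every match of pattern wrapped in left/right markers."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

sent = "colourless green ideas sleep furiously"
marked = show_matches(r"ee", sent)
print(marked)
```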

Substitutions • Replace every l with an s • re.sub('l', 's', sent) • 'cosoursess green ideas sseep furioussy' • re.sub('green', 'red', sent) • 'colourless red ideas sleep furiously'

Findall • re.findall(pattern, sent) • will return all of the substrings that match the pattern • re.findall('(green|sleep)', sent) • ['green', 'sleep']

Match • Matches only at the beginning of the string • re.match(pattern, string) • Returns: a Match object, or None if the pattern does not match there • Match objects contain information about the search

Methods in Match

More Match Methods
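A few of the methods that Match objects provide in Python's `re` module can be sketched as follows; the example sentence is reused from earlier, and the selection of methods is only a sample.

```python
import re

sent = "colourless green ideas sleep furiously"

# re.match only tries to match at the start of the string.
m = re.match(r"\w+", sent)
if m is not None:
    print(m.group())  # the matched text
    print(m.start())  # index where the match begins
    print(m.end())    # index just past the end of the match
    print(m.span())   # (start, end) as a tuple

# "green" is in the string, but not at the beginning, so match returns None.
miss = re.match(r"green", sent)
print(miss)
```

Because a failed match returns None, calling `.group()` directly on the result raises an AttributeError when nothing matches, so checking for None first is the safer habit.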

Search • re.search(pattern, string) • Finds the pattern anywhere in the string. • re.search(r'\d+', ' 1034 ').group() • '1034' • re.search(r'\d+', ' abc123 ').group() • '123'
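The contrast with match can be checked on the same input. This is a minimal sketch using one of the strings above.

```python
import re

text = " abc123 "

# match looks only at the start of the string, where there is a space.
at_start = re.match(r"\d+", text)

# search scans forward until it finds the first place the pattern matches.
anywhere = re.search(r"\d+", text)
found = anywhere.group() if anywhere else None

print(at_start)
print(found)
```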

Splitting • 'text can be made into lists'.split() • re.split(pattern, string) • uses the pattern to identify the split point • re.split(r'\d+', "I want 4 cats and 13 dogs") • ["I want ", " cats and ", " dogs"] • re.split(r'\s*\d+\s*', "I want 4 cats and 13 dogs") • ["I want", "cats and", "dogs"]

Joining ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings']) This simple formatting can be helpful to report results or merge information
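Splitting and joining are often used together, for example to normalise irregular spacing. A minimal sketch with a made-up input string:

```python
import re

# Input with irregular runs of spaces, invented for illustration.
messy = "text  can be   made into lists"

# Split on any run of whitespace, then rebuild with a single separator.
tokens = re.split(r"\s+", messy)
tidy = " ".join(tokens)

# The separator can be any string, which is handy when reporting results.
report = ", ".join(tokens)

print(tokens)
print(tidy)
print(report)
```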

Stemming with Regular Expressions
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
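Applying this function to the example sentence shows both what it does and where it goes wrong. The block repeats the definition so that it runs on its own; the word list is just the earlier sentence split on spaces.

```python
import re

def stem(word):
    # (.*?) is non-greedy, so it takes as little as possible and leaves
    # the longest workable suffix for the second, optional group.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

sent = "colourless green ideas sleep furiously"
stems = [stem(w) for w in sent.split()]
print(stems)
```

Note that "colourless" comes out as "colourles": the pattern strips a final s with no knowledge of whether it is really a suffix, which is typical of purely pattern-based stemming.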

Play with some code

Snippet on Speech Recognition
