LING 388: Language and Computers

Sandiway Fong
Lecture 5: 9/8

Administrivia
• Reminder
  • LING 388 Homework #1 is due today
  • Email submissions by midnight to sandiway@email.arizona.edu
• Need help?
  • Office hour after this class

Today’s Topic
• Regular Expressions
• Finite State Automata (FSA) in Prolog

Regular Expressions
[Figure: diagram linking FSA, Regular Expressions, and Regular Grammars]
• Formally equivalent to
  • Finite state automata (FSA), and
  • Regular grammars
• Practical use
  • string matching, typically for a single word form
  • searching text: Unix (e)grep, Perl, Microsoft Word
  • caution: notation and implementation differ between tools

Regular Expressions: Microsoft Word
• Terminology:
  • wildcard search


Regular Expressions: GNU grep
• Terminology:
  • metacharacter
    • a character with special meaning, not interpreted literally, e.g. ^ vs. a
    • must be quoted or escaped with the backslash \ to get its literal meaning, e.g. \^
• Excerpts from the manpage
  • A list of characters enclosed by [ and ] matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list.
  • For example, the regular expression [0123456789] matches any single digit. A range of characters may be specified by giving the first and last characters, separated by a hyphen.
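The bracket and escape rules above can be tried out in Python's re module, which shares this part of the notation with grep (other parts differ, as cautioned earlier). This is a minimal sketch; the variable names and the sample string are made up for illustration.

```python
import re

# A bracketed list matches any single character in the list;
# the range form [0-9] is shorthand for [0123456789].
digit_list = re.compile(r"[0123456789]")
digit_range = re.compile(r"[0-9]")

# A caret as the first character inside brackets negates the list.
non_digit = re.compile(r"[^0-9]")

# Outside brackets the caret is a metacharacter; the backslash makes it literal.
literal_caret = re.compile(r"\^")

sample = "x^2 = 49"                            # illustrative input
digits = digit_range.findall(sample)           # every single digit in the sample
caret_at = literal_caret.search(sample).start()  # index of the literal ^
```

Here `digits` holds the three digit characters of the sample and `caret_at` is the position of the literal caret.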

Regular Expressions: GNU grep
• Excerpts from the manpage
  • Finally, certain named classes of characters are predefined.
  • Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
  • For example, [[:alnum:]] means [0-9A-Za-z]
  • The period . matches any single character.
  • The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]]. (Later versions of the grep documentation give [_[:alnum:]] and [^_[:alnum:]], which agrees with the definition of word below.)
  • The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
  • The symbols \< and \> respectively match the empty string at the beginning and end of a word.
  • The symbol \b matches the empty string at the edge of a word.
• Terminology
  • word
    • an unbroken sequence of digits, underscores and letters
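Anchors and word boundaries can be illustrated the same way. The sketch below uses Python's re module, which supports ^, $, \b and \w but, unlike grep, has no named classes such as [[:alnum:]] and no \< or \>. The list of test lines is invented for illustration.

```python
import re

lines = ["the cat", "bathe", "see the_end", "the"]   # illustrative input

# ^ and $ match the empty string at the beginning and end of the line,
# so this pattern accepts only a line consisting of exactly "the".
whole_line = [s for s in lines if re.search(r"^the$", s)]

# \b matches the empty string at the edge of a word, where a word is an
# unbroken run of letters, digits and underscores.
as_word = [s for s in lines if re.search(r"\bthe\b", s)]

# \w matches a single word character; re.ASCII limits it to [A-Za-z0-9_].
word_chars = re.findall(r"\w", "a_1 !?", flags=re.ASCII)
```

Note that "see the_end" is not in `as_word`: the underscore counts as a word character, so there is no word edge after "the".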

Regular Expressions: GNU grep
• Excerpts from the manpage
  • A regular expression may be followed by one of several repetition operators:
    • ? The preceding item is optional and matched at most once.
    • * The preceding item will be matched zero or more times.
    • + The preceding item will be matched one or more times.
    • {n} The preceding item is matched exactly n times.
    • {n,} The preceding item is matched n or more times.
    • {n,m} The preceding item is matched at least n times, but not more than m times.
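Each repetition operator can be checked against a few strings. This is a small sketch in Python's re module, which uses the same operators as the manpage excerpt; the helper name `matches` and the example patterns are assumptions made for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# ?      optional, matched at most once
optional_u = (matches(r"colou?r", "color"), matches(r"colou?r", "colouur"))

# * and +  zero or more vs. one or more
star_vs_plus = (matches(r"ab*", "a"), matches(r"ab+", "a"))

# {n}, {n,}, {n,m}  exact and bounded counts
exactly_three = (matches(r"a{3}", "aaa"), matches(r"a{3}", "aa"))
two_or_more = matches(r"a{2,}", "aaaaa")
two_to_three = matches(r"a{2,3}", "aaaa")
```

`re.fullmatch` is used so that each pattern must cover the entire string; a substring search would hide the difference between, say, a{3} and a{2,3}.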

Regular Expressions: GNU grep
• Excerpts from the manpage
  • Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
  • Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

Regular Expressions: Examples
• Example
  • \b99 matches 99 in “there are 99 bottles …” but not in “there are 299 bottles …”
  • Note: $ is not a word character, so in $99 there is a word boundary between $ and 99, and \b99 matches 99 here
• Example (sheeptalk)
  • baa!
  • baaa!
  • baaaa! …
• Regular Expression
  • baaa*!
  • baa+!
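Both examples can be verified directly. A minimal sketch in Python's re module, whose \b, * and + behave like grep's here; the test sentences are the ones from the slide plus an invented "$99" sentence.

```python
import re

boundary_99 = re.compile(r"\b99")

in_99 = boundary_99.search("there are 99 bottles") is not None
in_299 = boundary_99.search("there are 299 bottles") is not None
in_dollar = boundary_99.search("it costs $99") is not None   # invented sentence

# The two sheeptalk patterns describe the same set of strings.
sheep_star = re.compile(r"baaa*!")
sheep_plus = re.compile(r"baa+!")

def is_sheeptalk(text):
    """True if both patterns accept the whole string."""
    return bool(sheep_star.fullmatch(text)) and bool(sheep_plus.fullmatch(text))
```

In "299" there is no word edge between 2 and 9, so `in_299` is False, while the $ in "$99" leaves an edge right before the digits.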

Regular Expressions: Examples
• Example
  • guppy
  • guppies
• Regular Expression
  • gupp(y|ies)
• Example
  • the (whole word, case insensitive)
  • the25
• Regular Expression
  • (^|[^a-zA-Z])[tT]he[^a-zA-Z]
  • the first ^ matches the beginning of the line; ^ as the first character inside [ ] is negation
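The two patterns can be exercised on a handful of inputs. A sketch in Python's re module, which accepts both patterns unchanged; the helper `finds_the` and the test lines are invented for illustration.

```python
import re

guppy = re.compile(r"gupp(y|ies)")
the_word = re.compile(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")

def finds_the(line):
    """True if the pattern finds the/The with no letter on either side."""
    return the_word.search(line) is not None

start_of_line = finds_the("The cat sat")    # ^ alternative applies
inside_word = finds_the("other things")     # letter before "the": no match
with_digits = finds_the("the25")            # digits are not letters: match
end_of_line = finds_the("I saw the")        # pattern needs a character after "the"
```

The last case shows a limitation of the pattern as written: it requires one non-letter character after "the", so a line that ends in "the" is not found.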

Regular Expressions
• Regular expressions are shorthand for describing sets of strings
• Examples:
  • string+ : the set of one or more occurrences of string
    • a+ = {a, aa, aaa, aaaa, aaaaa, …}
    • (abc)+ = {abc, abcabc, abcabcabc, …}
  • string* : the set of zero or more occurrences of string (ε is the empty string)
    • a* = {ε, a, aa, aaa, aaaa, …}
    • (abc)* = {ε, abc, abcabc, …}
  • Note: a a* = a+
    • a {ε, a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …}
  • string^n : exactly n occurrences of string
    • a^4 b^3 = {aaaabbb}
• Language = a set of strings
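Since these sets are infinite, a program can only list their shortest members. The sketch below uses Python generators to do that and to check the identity a a* = a+ on the first few strings; the function names `plus`, `star` and `first` are invented here.

```python
from itertools import count, islice

def plus(s):
    """Generate s+ : one or more copies of s, shortest first."""
    for n in count(1):
        yield s * n

def star(s):
    """Generate s* : zero or more copies of s, starting with the empty string."""
    for n in count(0):
        yield s * n

def first(strings, k):
    """Take the first k strings from a (possibly infinite) generator."""
    return list(islice(strings, k))

a_plus = first(plus("a"), 5)
abc_star = first(star("abc"), 3)

# a a* = a+ : prefixing each member of a* with a gives the members of a+.
a_then_a_star = ["a" + w for w in first(star("a"), 5)]

# string^n : exactly n occurrences.
a4_b3 = "a" * 4 + "b" * 3
```

The empty string shows up as `""` at the head of `abc_star`, which is the ε of the set notation.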

Regular Expressions
• Regular expressions
  • are formally equivalent to regular grammars/finite state automata
    • How to show this? Proof by construction…
  • can’t describe all possible languages
    • Examples:
      • {a^n b^n | n ≥ 0} is not regular
      • {w w^R | w ∈ {a,b}+} is not regular
        • R = reverse, e.g. (abc)^R = cba
    • How to show this? Proof by the Pumping Lemma
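The pumping argument for {a^n b^n | n ≥ 0} can be checked mechanically for small cases. The Pumping Lemma says that in a regular language every long enough string w can be split as w = x y z, with |xy| ≤ p and |y| ≥ 1, so that x y^i z stays in the language for all i ≥ 0. The sketch below is an illustration, not a proof: it tries every such split of a^p b^p for small p and finds none that survives pumping. The function names are invented here.

```python
def in_language(w):
    """Membership test for {a^n b^n | n >= 0}."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

def pumpable_splits(w, p):
    """Return all splits w = x y z with |xy| <= p and |y| >= 1
    for which x y^i z stays in the language for i = 0, 1, 2."""
    good = []
    for i in range(0, p + 1):            # i = |x|
        for j in range(i + 1, p + 1):    # j = |xy|, so y is non-empty
            x, y, z = w[:i], w[i:j], w[j:]
            if all(in_language(x + y * k + z) for k in (0, 1, 2)):
                good.append((x, y, z))
    return good

# For w = a^p b^p the part y lies inside the a's, so pumping changes the
# number of a's but not the number of b's.
survivors = {p: pumpable_splits("a" * p + "b" * p, p) for p in range(1, 7)}
```

Every entry of `survivors` is an empty list, which is what the lemma-based argument predicts for any choice of p.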

Finite State Automata (FSA)
• Example:
  • Language: L = {a+b+}, “one or more a’s followed by one or more b’s”
  • a regular language
    • described by a regular expression
• Note:
  • an infinite set of strings belongs to language L
    • e.g. abbb, aaaab, aabb, *abab, *ε
• Notation:
  • ε is the empty string (or string with zero length)
  • * means the string is not in the language

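Finite State Automata (FSA)
• L = {a+b+} = {aa*bb*}
[Figure: state diagram with states s, x, y; arcs labeled a from s to x and from x to x; arcs labeled b from x to y and from y to y]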

Finite State Automata (FSA)
• FSA shown on previous slide:
  • acceptor: no output
    • cf. transducer: input/output pairs
  • deterministic: no ambiguity, i.e. at any given state and input character, the next state is uniquely determined
    • cf. non-deterministic FSA (NDFSA)
• Note:
  • NDFSAs are not more powerful than deterministic FSAs
  • Proof: by construction
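The construction behind that proof is usually the subset construction: each state of the deterministic machine stands for a set of states the non-deterministic machine could be in. The Python sketch below is one way to write it, assuming an NDFSA without empty (ε) moves; the example machine, which accepts a+b+ using ambiguous arcs, and all names are invented for illustration.

```python
from itertools import chain

def determinize(nfa, start, finals):
    """Subset construction. nfa maps (state, symbol) to a set of next states."""
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        symbols = {sym for (st, sym) in nfa if st in current}
        for sym in symbols:
            # Union of everything reachable on sym from any member state.
            target = frozenset(chain.from_iterable(
                nfa.get((st, sym), ()) for st in current))
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_finals = {subset for subset in seen if subset & finals}
    return dfa, start_set, dfa_finals

def accepts(dfa, start, finals, string):
    """Run the deterministic machine; a missing transition rejects."""
    state = start
    for ch in string:
        if (state, ch) not in dfa:
            return False
        state = dfa[(state, ch)]
    return state in finals

# Hypothetical NDFSA for a+b+: on a, state s may stay or move on to x;
# on b, state x may stay or move on to the end state y.
NFA = {("s", "a"): {"s", "x"}, ("x", "b"): {"x", "y"}}
DFA, DFA_START, DFA_FINALS = determinize(NFA, "s", {"y"})
```

For this machine the construction produces three reachable subsets, {s}, {s,x} and {x,y}, so the deterministic version is no larger than the original.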

Finite State Automata (FSA)
• Practical Applications
  • Encode regular expressions
  • Morphological analyzers
    • Different word forms, e.g. want, wanted, unwanted (suffixation/prefixation)
  • Speech Recognizers
    • Markov models = FSA + probabilities
  • and many more …
• T9 text entry (tegic.com)
  • Probably built in to your cellphone
  • Predictive text entry for mobile messaging
  • Reduces the number of keystrokes for inputting words on a telephone keypad (8 keys)
    • how: 3 vs. 5 keystrokes
    • michael: 7 vs. 15 keystrokes
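The keystroke saving comes from pressing each letter's key only once and letting a word list resolve the ambiguity. The Python sketch below shows that idea only; it is not Tegic's actual algorithm. It assumes the standard letter layout on keys 2 to 9, and the small lexicon is made up.

```python
# Standard telephone keypad letters on the 8 keys 2..9 (assumed layout).
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items()
                 for letter in letters}

def key_sequence(word):
    """One key press per letter: the digit string typed for word."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def candidates(keys, lexicon):
    """All lexicon words that share the given key sequence."""
    return [w for w in lexicon if key_sequence(w) == keys]

LEXICON = ["how", "gnu", "good", "home", "gone", "michael"]  # illustrative

how_keys = key_sequence("how")
michael_keys = key_sequence("michael")
ambiguous = candidates("4663", LEXICON)
```

`how_keys` has 3 digits and `michael_keys` has 7, matching the single-press counts on the slide; `ambiguous` shows why a predictive system still has to choose among several words for one key sequence.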

Finite State Automata (FSA)
• More formally, an FSA can be specified as a tuple containing:
  • set of states: {s,x,y}
  • start state: s
  • end state: y
  • alphabet: {a, b}
  • transition function δ, with signature: character × state -> state
    • δ(a,s) = x
    • δ(a,x) = x
    • δ(b,x) = y
    • δ(b,y) = y
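The tuple translates almost directly into data. A minimal Python sketch, keeping the slide's argument order (character, state) for the transition function; the constant names and the `accepts` helper are invented here.

```python
STATES = {"s", "x", "y"}
START = "s"
END = {"y"}
ALPHABET = {"a", "b"}

# Transition function: (character, state) -> state, as on the slide.
DELTA = {
    ("a", "s"): "x",
    ("a", "x"): "x",
    ("b", "x"): "y",
    ("b", "y"): "y",
}

def accepts(string):
    """Follow DELTA one character at a time; accept if we stop in an end state."""
    state = START
    for ch in string:
        if (ch, state) not in DELTA:
            return False            # no transition defined: reject
        state = DELTA[(ch, state)]
    return state in END
```

Pairs missing from `DELTA`, such as (b, s), correspond to arcs that are absent from the state diagram, so the machine rejects as soon as it meets one.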

Finite State Automata (FSA)
• In Prolog:
  • Define one predicate for each state
    • taking one argument (the input string L)
    • consume an input character
    • call the next state with the remaining input string
• Initially:
  • fsa(L) :- s(L).
    • i.e. call start state s

Finite State Automata (FSA)
[Figure: state diagram with states s, x, y; arcs labeled a from s to x and from x to x; arcs labeled b from x to y and from y to y]
• State s: (start state)
  • s([a|L]) :- x(L).
    • match an input string beginning with a and call state x with the remainder of the input
• State x:
  • x([a|L]) :- x(L).
  • x([b|L]) :- y(L).
• State y: (end state)
  • y([]).
  • y([b|L]) :- y(L).
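For readers more used to a procedural language, the same one-predicate-per-state strategy can be sketched in Python as one function per state. This is a rough parallel to the Prolog clauses, not a replacement for them: each function inspects the first list element and hands the rest to the next state.

```python
def fsa(chars):
    """fsa(L) :- s(L).  Start in state s."""
    return s(chars)

def s(chars):
    # s([a|L]) :- x(L).
    return chars[:1] == ["a"] and x(chars[1:])

def x(chars):
    # x([a|L]) :- x(L).
    if chars[:1] == ["a"]:
        return x(chars[1:])
    # x([b|L]) :- y(L).
    if chars[:1] == ["b"]:
        return y(chars[1:])
    return False

def y(chars):
    # y([]).  The end state accepts when the input is used up.
    if chars == []:
        return True
    # y([b|L]) :- y(L).
    return chars[:1] == ["b"] and y(chars[1:])
```

A call such as `fsa(["a", "a", "b"])` plays the role of the query `?- fsa([a,a,b]).`; where Prolog fails for lack of a matching clause, the Python functions return False.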

Finite State Automata (FSA)
  fsa(L) :- s(L).
  s([a|L]) :- x(L).
  x([a|L]) :- x(L).
  x([b|L]) :- y(L).
  y([]).
  y([b|L]) :- y(L).
• Example:
  • ?- fsa([a,a,b]).
  • ?- s([a,a,b]).
  • ?- x([a,b]).
  • ?- x([b]).
  • ?- y([]).
  • Yes

Finite State Automata (FSA)
  fsa(L) :- s(L).
  s([a|L]) :- x(L).
  x([a|L]) :- x(L).
  x([b|L]) :- y(L).
  y([]).
  y([b|L]) :- y(L).
• Example:
  • ?- fsa([a,b,a]).
  • ?- s([a,b,a]).
  • ?- x([b,a]).
  • ?- y([a]).
  • No (no clause for y matches input beginning with a)
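The two call sequences, one ending in Yes and one in No, can be reproduced with a small tracing routine. The Python sketch below records each state call in the same s([...]) notation; the table and the function name `trace` are invented for illustration.

```python
TRANSITIONS = {("s", "a"): "x", ("x", "a"): "x",
               ("x", "b"): "y", ("y", "b"): "y"}
END_STATE = "y"

def trace(chars):
    """Return the list of state calls made and whether the input is accepted."""
    state, rest = "s", list(chars)
    calls = [f"{state}([{','.join(rest)}])"]
    # Keep going while there is input and a transition for its first character.
    while rest and (state, rest[0]) in TRANSITIONS:
        state = TRANSITIONS[(state, rest[0])]
        rest = rest[1:]
        calls.append(f"{state}([{','.join(rest)}])")
    accepted = not rest and state == END_STATE
    return calls, accepted

accepted_calls, accepted = trace("aab")
rejected_calls, rejected = trace("aba")
```

For "aba" the trace stops at y([a]): state y has no transition on a, which is the point where the Prolog query fails.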

Finite State Automata (FSA)
• Another possible Prolog encoding strategy:
  • fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  • fsa(y,[]).           % End state
  • transition(s,a,x).   % Encodes the transition function
  • transition(x,a,x).
  • transition(x,b,y).
  • transition(y,b,y).

Finite State Automata (FSA)
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(y,[]).
  transition(s,a,x).
  transition(x,a,x).
  transition(x,b,y).
  transition(y,b,y).
• Example:
  • ?- fsa(s,[a,a,b]).
  • ?- transition(s,a,T).      T = x
  • ?- fsa(x,[a,b]).
  • ?- transition(x,a,T').     T' = x
  • ?- fsa(x,[b]).
  • ?- transition(x,b,T'').    T'' = y
  • ?- fsa(y,[]).
  • Yes
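One advantage of this second encoding is that the machine becomes data: the fsa/2 rule stays fixed and only the transition facts change. The Python sketch below makes that point by running one interpreter over two tables. The second table, for the sheeptalk pattern baa+!, and its state names q0 to q4 are invented here and do not come from the slides.

```python
def fsa(state, chars, transitions, end_state):
    """Rough mirror of fsa(S,L): look up the transition for the first
    character and recurse on the rest; accept in the end state on empty input."""
    if not chars:
        return state == end_state              # fsa(y,[]).
    nxt = transitions.get((state, chars[0]))   # transition(S,C,T)
    if nxt is None:
        return False                           # no matching fact: fail
    return fsa(nxt, chars[1:], transitions, end_state)

# Transition facts from the slide.
AB = {("s", "a"): "x", ("x", "a"): "x", ("x", "b"): "y", ("y", "b"): "y"}

# Hypothetical machine for sheeptalk (baa+!).
SHEEP = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
         ("q3", "a"): "q3", ("q3", "!"): "q4"}

ab_ok = fsa("s", "aab", AB, "y")
sheep_ok = fsa("q0", "baaaa!", SHEEP, "q4")
sheep_bad = fsa("q0", "ba!", SHEEP, "q4")
```

In the Prolog version the same swap amounts to replacing the transition/3 facts and the end-state clause while leaving the recursive rule untouched.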

Next Time
• More on FSA, including
  • non-deterministic FSA (NDFSA)
  • Regular Grammars
