LING 388: Language and Computers

LING 388: Language and Computers. Sandiway Fong Lecture 5: 9/8. Administrivia. Reminder LING 388 Homework #1 due today Email submissions by midnight to sandiway@email.arizona.edu Need help? Office hour after this class. Today’s Topic. Regular Expressions


  1. LING 388: Language and Computers Sandiway Fong Lecture 5: 9/8

  2. Administrivia • Reminder • LING 388 Homework #1 • due today • Email submissions by midnight to • sandiway@email.arizona.edu • Need help? • Office hour after this class

  3. Today’s Topic • Regular Expressions • Finite State Automata (FSA) in Prolog

  4. Regular Expressions • Formally equivalent to • Finite state automata (FSA), and • Regular grammars • [Diagram relating FSA, Regular Expressions, and Regular Grammars] • Practical use • string matching • typically for a single word form • search text: Unix (e)grep, Perl, Microsoft Word • caution: differences in notation and implementation

  5. Regular Expressions: Microsoft Word • Terminology: • wildcard search

  6. Regular Expressions: Microsoft Word

  7. Regular Expressions: GNU grep • Terminology: • metacharacter • character with special meaning, not interpreted literally, e.g. ^ vs. a • must be quoted or escaped using the backslash \ to get its literal meaning, e.g. \^ • Excerpts from the manpage • A list of characters enclosed by [ and ] matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. • For example, the regular expression [0123456789] matches any single digit. A range of characters may be specified by giving the first and last characters, separated by a hyphen.

  8. Regular Expressions: GNU grep • Excerpts from the manpage • Finally, certain named classes of characters are predefined. • Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. • For example, [[:alnum:]] means [0-9A-Za-z] • The period . matches any single character. • The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]]. • The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. • The symbols \< and \> respectively match the empty string at the beginning and end of a word. • The symbol \b matches the empty string at the edge of a word • Terminology • word • unbroken sequence of digits, underscores and letters

  9. Regular Expressions: GNU grep • Excerpts from the manpage • A regular expression may be followed by one of several repetition operators: • ? The preceding item is optional and matched at most once. • * The preceding item will be matched zero or more times. • + The preceding item will be matched one or more times. • {n} The preceding item is matched exactly n times • {n,} The preceding item is matched n or more times. • {n,m} The preceding item is matched at least n times, but not more than m times.

  10. Regular Expressions: GNU grep • Excerpts from the manpage • Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions. • Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

  11. Regular Expressions: Examples • Example • \b99 matches 99 in “there are 99 bottles …” but not in “there are 299 bottles …” • Note: in $99 the $ is not a word character, so a word boundary precedes 99 and \b99 will match 99 here • Example (sheeptalk) • baa! • baaa! • baaaa! … • Regular Expression • baaa*! • baa+!

  12. Regular Expressions: Examples • Example • guppy • guppies • Regular Expression • gupp(y|ies) • Example • the (whole word, case insensitive) • the25 • Regular Expression • (^|[^a-zA-Z])[tT]he[^a-zA-Z] • the first ^ means beginning of line • ^ inside [ ] means negation

  13. Regular Expressions • Regular Expressions: shorthand for describing sets of strings • Examples: • string+ - set of one or more occurrences of string • a+ = {a, aa, aaa, aaaa, aaaaa, …} • (abc)+ = {abc, abcabc, abcabcabc, …} • string* - set of zero or more occurrences of string (ε is the empty string) • a* = {ε, a, aa, aaa, aaaa, …} • (abc)* = {ε, abc, abcabc, …} • Note: • a a* = a+ • a {ε, a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …} • string^n - exactly n occurrences of string • a^4 b^3 = {aaaabbb} • Language = a set of strings
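The set notation on this slide can be explored in Prolog by writing each expression as a predicate over lists of characters. The following is a minimal sketch, assuming SWI-Prolog; the predicate names a_star/1, a_plus/1 and strings_upto/3 are illustrative and do not come from the lecture.

```prolog
% a_star(L): L is a list of zero or more a's, i.e. L is in a*.
a_star([]).
a_star([a|L]) :- a_star(L).

% a_plus(L): L is in a+. Written as "one a followed by a*",
% mirroring the note that a a* = a+.
a_plus([a|L]) :- a_star(L).

% strings_upto(Pred, Max, Strings): all strings of length 0..Max
% accepted by Pred, shortest first.
strings_upto(Pred, Max, Strings) :-
    findall(L,
            ( between(0, Max, N), length(L, N), call(Pred, L) ),
            Strings).
```

With these definitions, `?- strings_upto(a_star, 3, S).` should give `S = [[], [a], [a, a], [a, a, a]]`, and the same query with `a_plus` gives the same list without the empty string `[]`.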

  14. Regular Expressions • Regular Expressions • formally equivalent to regular grammars/finite state automata • How to show this? • Proof by construction… • can’t describe all possible languages • Examples: • {a^n b^n | n >= 0} is not regular • {w w^R | w ∈ {a,b}+} is not regular • R = reverse, e.g. (abc)^R = cba • How to show this? • Proof by Pumping Lemma
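For the first example, the Pumping Lemma argument can be sketched as follows. This is the usual textbook outline, not taken from the slides; the symbols p (pumping length) and x, y, z (the three parts of the string) are the conventional ones, and y here is unrelated to any state name used elsewhere in the lecture.

```latex
\textbf{Claim.} $L = \{a^n b^n \mid n \ge 0\}$ is not regular.

\textbf{Sketch.} Suppose $L$ were regular, with pumping length $p \ge 1$.
Take $w = a^p b^p \in L$. The lemma gives a split $w = xyz$ with
\[
  |xy| \le p, \qquad |y| \ge 1, \qquad x y^i z \in L \ \text{for all } i \ge 0 .
\]
Because $|xy| \le p$, the substring $y$ lies inside the leading block of $a$'s,
so $y = a^k$ for some $k \ge 1$. Pumping once more gives
\[
  x y^2 z = a^{p+k} b^p ,
\]
which has more $a$'s than $b$'s and so is not in $L$. This contradicts the lemma,
hence $L$ is not regular.
```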

  15. Finite State Automata (FSA) • Example: • Language: L = {a+b+} “one or more a’s followed by one or more b’s” • regular language • described by a regular expression • Note: • infinite set of strings belonging to language L • e.g. abbb, aaaab, aabb, *abab, *ε • Notation: • ε is the empty string (or string with zero length) • * means string is not in the language

  16. Finite State Automata (FSA) • L = {a+b+} • L = {aa*bb*} • [State diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -b-> y, y -b-> y]

  17. Finite State Automata (FSA) • FSA shown on previous slide: • acceptor - no output • cf. transducer - input/output pairs • deterministic - no ambiguity i.e. at any given state and input character, the next state is uniquely determined • cf. non-deterministic FSA (NDFSA) • Note: • NDFSAs are not more powerful than FSAs • Proof: by construction
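To make the deterministic versus non-deterministic contrast concrete, here is a hedged sketch of a non-deterministic acceptor for the same language a+b+, written with one Prolog predicate per state. The machine and its state names (ns, nm, ny) are made up for illustration and do not appear in the lecture. Prolog's backtracking tries the alternative transitions in turn.

```prolog
% Non-deterministic acceptor for a+b+ (hypothetical states ns, nm, ny).
% In state ns, input a allows two next states: stay in ns, or move to nm.
ns([a|L]) :- ns(L).
ns([a|L]) :- nm(L).

% In state nm, input b allows two next states: stay in nm, or move to ny.
nm([b|L]) :- nm(L).
nm([b|L]) :- ny(L).

% ny is the end state: accept only when the input is exhausted.
ny([]).

% Entry point: start in state ns.
ndfsa(L) :- ns(L).
```

For example, `?- ndfsa([a,a,b,b]).` succeeds after backtracking over the wrong guesses, while `?- ndfsa([a,b,a]).` fails, matching the behaviour of the deterministic machine for this language.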

  18. Finite State Automata (FSA) • Practical Applications • Encode regular expressions • Morphological analyzers • Different word forms, e.g. want, wanted, unwanted (suffixation/prefixation) • Speech Recognizers • Markov models = FSA + probabilities • and many more … • T9 text entry (tegic.com) • Probably built in to your cellphone • Predictive text entry for mobile messaging • Reduces the number of keystrokes for inputting words on a telephone keypad (8 keys) • how: 3 vs. 5 keystrokes • michael: 7 vs. 15 keystrokes
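The idea behind one-keystroke-per-letter entry can be sketched in Prolog as a lookup from digit sequences to dictionary words. This is only an illustration of the keypad mapping, not Tegic's actual algorithm. The three-word lexicon and the predicate names key/2, word/1, spells/2 and t9/2 are assumptions made for the example, and SWI-Prolog is assumed.

```prolog
:- use_module(library(lists)).   % member/2 (SWI-Prolog)

% key(Digit, Letters): the 8 letter-bearing keys of a telephone keypad.
key(2, [a,b,c]).
key(3, [d,e,f]).
key(4, [g,h,i]).
key(5, [j,k,l]).
key(6, [m,n,o]).
key(7, [p,q,r,s]).
key(8, [t,u,v]).
key(9, [w,x,y,z]).

% Toy lexicon (an assumption; a real system uses a large dictionary).
word([h,o,w]).
word([g,o,o,d]).
word([h,o,m,e]).

% spells(Digits, Letters): each digit's key carries the corresponding letter.
spells([], []).
spells([D|Ds], [C|Cs]) :- key(D, Ls), member(C, Ls), spells(Ds, Cs).

% t9(Digits, Word): Word is a lexicon entry typed with one key press per letter.
t9(Digits, Word) :- word(Word), spells(Digits, Word).
```

Here `?- t9([4,6,9], W).` yields `W = [h,o,w]` from three key presses, while `?- t9([4,6,6,3], W).` yields both `[g,o,o,d]` and `[h,o,m,e]` on backtracking, which shows why predictive entry sometimes needs an extra keystroke to pick among candidates.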

  19. Finite State Automata (FSA) • More formally, FSA can be specified as a tuple containing: • set of states: {s,x,y} • start state: s • end state: y • alphabet: {a, b} • transition function δ: signature: character x state -> state • δ(a,s)=x • δ(a,x)=x • δ(b,x)=y • δ(b,y)=y

  20. Finite State Automata (FSA) • In Prolog: • Define one predicate for each state • taking one argument (the input string L) • consume input character • call next state with remaining input string • Initially: • fsa(L) :- s(L). i.e. call start state s

  21. Finite State Automata (FSA) • [State diagram: s -a-> x, x -a-> x, x -b-> y, y -b-> y] • State s: (start state) • s([a|L]) :- x(L). • match input string beginning with a and call state x with remainder of input • State x: • x([a|L]) :- x(L). • x([b|L]) :- y(L). • State y: (end state) • y([]). • y([b|L]) :- y(L).

  22. Finite State Automata (FSA) • Program: • s([a|L]) :- x(L). • x([a|L]) :- x(L). • x([b|L]) :- y(L). • y([]). • y([b|L]) :- y(L). • Example: • ?- fsa([a,a,b]). • ?- s([a,a,b]). • ?- x([a,b]). • ?- x([b]). • ?- y([]). • Yes

  23. Finite State Automata (FSA) • Program: • s([a|L]) :- x(L). • x([a|L]) :- x(L). • x([b|L]) :- y(L). • y([]). • y([b|L]) :- y(L). • Example: • ?- fsa([a,b,a]). • ?- s([a,b,a]). • ?- x([b,a]). • ?- y([a]). • No
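The same one-predicate-per-state strategy carries over to other regular expressions. As a sketch, here is the sheeptalk expression baa+! from the earlier examples encoded as an acceptor. The state names q0 to q4 and the entry predicate sheeptalk/1 are chosen for illustration and are not from the lecture. The exclamation mark is written as the quoted atom '!' so it is read as an ordinary character.

```prolog
% Acceptor for baa+! : b, then at least two a's, then !.
sheeptalk(L) :- q0(L).       % q0 is the start state

q0([b|L])   :- q1(L).        % b
q1([a|L])   :- q2(L).        % first a
q2([a|L])   :- q3(L).        % second a
q3([a|L])   :- q3(L).        % any further a's (loop)
q3(['!'|L]) :- q4(L).        % closing !
q4([]).                      % q4 is the end state: input must be exhausted
```

Queries such as `?- sheeptalk([b,a,a,'!']).` and `?- sheeptalk([b,a,a,a,'!']).` succeed, whereas `?- sheeptalk([b,a,'!']).` fails because state q2 has no transition on `!`.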

  24. Finite State Automata (FSA) • Another possible Prolog encoding strategy: • fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M). • fsa(y,[]). % End state • transition(s,a,x). % Encodes transition function • transition(x,a,x). • transition(x,b,y). • transition(y,b,y).

  25. Finite State Automata (FSA) • Program: • fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M). • fsa(y,[]). • transition(s,a,x). • transition(x,a,x). • transition(x,b,y). • transition(y,b,y). • Example: • ?- fsa(s,[a,a,b]). • ?- transition(s,a,T). T=x • ?- fsa(x,[a,b]). • ?- transition(x,a,T’). T’=x • ?- fsa(x,[b]). • ?- transition(x,b,T”). T”=y • ?- fsa(y,[]). • Yes
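One consequence of keeping the transitions in a table is that the same acceptor can also enumerate strings of the language when it is given a list of unbound variables. The sketch below repeats the slide's program so that it is self-contained. The helper strings_of_length/2 is an illustrative addition, not part of the lecture, and SWI-Prolog is assumed.

```prolog
% Table-driven acceptor from the slide.
fsa(S, L) :- L = [C|M], transition(S, C, T), fsa(T, M).
fsa(y, []).                  % end state

transition(s, a, x).         % transition function as facts
transition(x, a, x).
transition(x, b, y).
transition(y, b, y).

% strings_of_length(N, Strings): every string of length N in the language,
% found by running the acceptor from start state s on N unbound variables.
strings_of_length(N, Strings) :-
    length(L, N),
    findall(L, fsa(s, L), Strings).
```

For instance, `?- strings_of_length(3, S).` gives `S = [[a,a,b], [a,b,b]]`, and `?- strings_of_length(1, S).` gives `S = []` since no single-character string is in a+b+.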

  26. Next Time • More on FSA including • non-deterministic FSA (NDFSA) • Regular Grammars
