Regular Languages
Regular Expressions
Finite-State Automata
Torbjörn Lager, Stockholm University

Languages
• Sets of strings
• Example: {“ac” “abc” “abbc” “abbbc” ...}
FST - Torbjörn Lager, UU

Finite-State Automata
• Directed graphs consisting of states, and arcs labelled with symbols. For example:
[Figure: automaton with states 0, 1 and 2 and arcs labelled a, b and c]

Regular expressions
• Example: [a b* c]
• Note:
  • Atomic symbols (e.g. ‘a’) denote languages (e.g. {“a”})
  • Operations (here: concatenation and Kleene star) are operations over languages

Regular expressions, regular languages and automata
• Regular expressions denote regular languages
• Regular expressions compile into finite-state automata
• Finite-state automata generate (or accept) regular languages

Regular expressions, regular languages and automata
• [a b* c] denotes {“ac” “abc” “abbc” ...}
• [a b* c] compiles into an automaton with states 0, 1 and 2 and arcs labelled a, b and c
• That automaton generates (or accepts) {“ac” “abc” “abbc” ...}
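
The relation between [a b* c] and its automaton can be sketched in Python as a transition table plus a small acceptor. The exact arcs are an assumption, since the original figure is not recoverable here: the sketch takes state 0 as the start state, state 2 as the only final state, and a loop on state 1 for b*. The names `TRANSITIONS` and `accepts` are illustrative only.

```python
# Assumed layout of the automaton for [a b* c]:
#   0 --a--> 1,  1 --b--> 1 (loop),  1 --c--> 2
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
START = 0
FINALS = {2}


def accepts(string):
    """Return True if the automaton ends in a final state after reading string."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in FINALS


for word in ["ac", "abc", "abbbc", "ab", ""]:
    print(repr(word), accepts(word))
```

Under these assumptions the acceptor says yes to exactly the strings of the language {“ac” “abc” “abbc” ...}.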

Regular expression operators
• Concatenation            A B
• Union                    A | B
• Iteration (Kleene star)  A*
• Difference               A - B
• Intersection             A & B
• Grouping of expressions  [A]

Examples
• a b                {“ab”}
• a [b|c]            {“ab” “ac”}
• a* [b|c]*          {“” “a” “ab” ...}
• [a|b] & [b|c]      {“b”}
• [a|b] - b          {“a”}
• [a|b] - [b|a]      {}
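
As a rough illustration of the examples above, languages can be modelled as Python sets of strings, so that union, intersection and difference are the built-in set operators. This is only a sketch: the helper names `concat` and `star` are hypothetical, and `star` returns a length-bounded part of A*, because A* is usually infinite.

```python
def concat(A, B):
    """Concatenation A B: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}


def star(A, max_len):
    """Strings of A* no longer than max_len (a bounded stand-in for A*)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {
            x + y for x in frontier for y in A if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


a, b, c = {"a"}, {"b"}, {"c"}

print(concat(a, b))        # a b
print(concat(a, b | c))    # a [b|c]
print((a | b) & (b | c))   # [a|b] & [b|c]
print((a | b) - b)         # [a|b] - b
print((a | b) - (b | a))   # [a|b] - [b|a]
print(sorted(concat(star(a, 2), star(b | c, 2))))  # part of a* [b|c]*
```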

Special symbols
• ?          Any single symbol (the “any” symbol)
• ?*         The universal language
• []         The empty-string language
• 0 (or “”)  The empty string (epsilon)

Regular expression operators
• Optionality   (A)
• Kleene plus   A+
• Complement    ~A
• Containment   $A
• Restriction   A => B _ C

Examples
• a (b) c       {“ac” “abc”}
• a b+ c        {“abc” “abbc” “abbbc” ...}
• ~[a b c]      {“” ... “a” ... “ab” ... “abca” ...}
• $[a|b]        {“a” ... “abba” ... “abcd” ...}
• b => a _ c    {“” ... “a” ... “ccc” ... “abc” ...}
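
For readers more used to programming-language regexes, the examples above can be approximated with Python's `re` module. The translations below are assumptions made for illustration, not part of the original notation: `.` stands in for the any symbol (and does not match a newline by default), complement is expressed by negating a full match, and the restriction is written with lookbehind and lookahead.

```python
import re

# Hypothetical Python counterparts of the expressions on the slide.
OPTIONAL = r"ab?c"                       # a (b) c
KLEENE_PLUS = r"ab+c"                    # a b+ c
CONTAINS = r".*[ab].*"                   # $[a|b]
RESTRICTION = r"(?:[^b]|(?<=a)b(?=c))*"  # b => a _ c


def matches(pattern, string):
    """True if the whole string belongs to the language of pattern."""
    return re.fullmatch(pattern, string) is not None


def complement_abc(string):
    """~[a b c]: every string except "abc"."""
    return not matches(r"abc", string)


print(matches(OPTIONAL, "ac"), matches(OPTIONAL, "abc"))
print(matches(KLEENE_PLUS, "abbbc"), matches(KLEENE_PLUS, "ac"))
print(matches(CONTAINS, "abcd"), matches(CONTAINS, "cd"))
print(matches(RESTRICTION, "abc"), matches(RESTRICTION, "ab"))
print(complement_abc("abca"), complement_abc("abc"))
```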

Component technologies in FST
• Word lists and lexica
• Tokenisers
• Morphological analysers
• Part-of-speech taggers
• Parsers

Applications of FST
• Named-entity recognition
• Information extraction
• Corpus linguistics
• Spelling and grammar checking
• Speech processing applications

The Xerox Finite-State Tool
• Compiles extended regular expressions into finite-state machines (automata and transducers)
• Allows the user to display, examine and modify the machines

Non-deterministic FSAs
• At least one state has more than one transition leading from it labelled with the same symbol
[Figure: non-deterministic automaton with states 0, 1 and 2 and arcs labelled a, b and b]

Determinization of FSAs
• Any non-deterministic FSA can be transformed into an equivalent deterministic FSA.
• Example:
[Figure: a non-deterministic automaton and an equivalent deterministic automaton, each with states 0, 1 and 2 and arcs labelled a and b]
• Determinize for efficiency!
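
One standard way to determinize is the subset construction, sketched below for automata without epsilon transitions. The example automaton is hypothetical (the original figure is not recoverable): it assumes arcs 0 --a--> 1, 1 --b--> 1 and 1 --b--> 2, with state 2 final. The function name `determinize` and the dictionary encoding are illustrative choices, not the method of any particular tool.

```python
from collections import deque


def determinize(nfa, start, finals):
    """Subset construction for an NFA without epsilon transitions.

    nfa maps (state, symbol) to a set of target states.
    Returns (dfa, dfa_start, dfa_finals); DFA states are frozensets
    of NFA states.
    """
    alphabet = {symbol for (_, symbol) in nfa}
    dfa_start = frozenset([start])
    dfa, dfa_finals = {}, set()
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        current = queue.popleft()
        if current & finals:  # contains at least one final NFA state
            dfa_finals.add(current)
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            if not target:
                continue  # leave this transition undefined
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, dfa_start, dfa_finals


# Hypothetical non-deterministic automaton: two b-arcs leave state 1.
nfa = {(0, "a"): {1}, (1, "b"): {1, 2}}
dfa, dfa_start, dfa_finals = determinize(nfa, 0, {2})
for (state, symbol), target in dfa.items():
    print(sorted(state), symbol, sorted(target))
```

In the result, every DFA state has at most one outgoing arc per symbol, so recognition needs no backtracking.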

Minimization of FSAs
• Any (deterministic) FSA can be transformed into an equivalent FSA that has a minimal number of states.
• Minimize for space!
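
A minimal sketch of minimization by partition refinement (in the style of Moore's algorithm) follows. It assumes a deterministic automaton whose states are all reachable, treats missing transitions as leading to a shared dead state, and returns the groups of equivalent states rather than a rebuilt automaton. The example automaton, with two redundant states, is made up for illustration.

```python
def minimize(states, alphabet, delta, finals):
    """Group the states of a DFA into classes of equivalent states.

    delta maps (state, symbol) to a state; a missing entry is treated
    as a transition to a dead state (represented by None).
    """
    symbols = sorted(alphabet)
    # Initial partition: final versus non-final states.
    block_of = {s: (s in finals) for s in states}
    while True:
        # Signature: own block plus the block reached on each symbol.
        signature = {
            s: (block_of[s],)
            + tuple(block_of.get(delta.get((s, a))) for a in symbols)
            for s in states
        }
        ids = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(ids) == len(set(block_of.values())):
            break  # no block was split: the partition is stable
        block_of = refined
    blocks = {}
    for s in states:
        blocks.setdefault(block_of[s], set()).add(s)
    return {frozenset(b) for b in blocks.values()}


# Hypothetical DFA for [a b* c] with redundant copies of states 1 and 2.
delta = {
    (0, "a"): 1,
    (1, "b"): 3, (3, "b"): 1,
    (1, "c"): 2, (3, "c"): 4,
}
classes = minimize({0, 1, 2, 3, 4}, {"a", "b", "c"}, delta, {2, 4})
print(sorted(sorted(block) for block in classes))
```

Each resulting class becomes one state of the minimal automaton; here five states collapse into three.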

Representing word lists
• Think of a word list as a regular language
• Use the calculus of regular expressions to query and update the word list
• Determinize for speed!
• Minimize for space!
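
To make the idea of a word list as an automaton concrete, the sketch below builds a trie, which is a deterministic acyclic automaton, from a small made-up word list. The names `build_trie` and `lookup` are hypothetical. A trie shares common prefixes only; minimization would additionally merge common suffixes.

```python
def build_trie(words):
    """Build a deterministic acyclic automaton (a trie) from a word list."""
    transitions = {}  # (state, symbol) -> state
    finals = set()
    next_state = 1    # state 0 is the start state
    for word in words:
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        finals.add(state)  # the state reached at the end of a word
    return transitions, finals


def lookup(transitions, finals, word):
    """Return True if word is in the word list."""
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


transitions, finals = build_trie(["walk", "walks", "talk", "talks"])
print(lookup(transitions, finals, "walks"))
print(lookup(transitions, finals, "wal"))
print(len(transitions), "arcs")
```

Lookup time depends only on the length of the word, not on the size of the list, which is the point of "determinize for speed".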

Various equivalences
• (A) = A|[]
• A+ = A A*
• A+ = A* - [] (provided A does not contain the empty string)
• A - B = A & ~B
• ~A = ?* - A
• $A = ?* A ?*
• ~[A | B] = ~A & ~B
• ~[A & B] = ~A | ~B

Various equivalences
• A - A = ~[?*]
• A | ~[?*] = A
• A [] = A
• A ~[?*] = ~[?*]
• A & ?* = A
• A | ?* = ?*
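
Some of these equivalences can be spot-checked on finite data. The sketch below is a sanity check, not a proof: it assumes a two-letter alphabet, takes all strings up to length 3 as a stand-in for the universal language ?*, and defines complement relative to that bounded universe. All names are illustrative.

```python
from itertools import product

ALPHABET = "ab"
MAX_LEN = 3
# Bounded stand-in for ?*: every string over ALPHABET up to MAX_LEN.
UNIVERSE = {
    "".join(p) for n in range(MAX_LEN + 1) for p in product(ALPHABET, repeat=n)
}
EMPTY_LANGUAGE = set()  # plays the role of ~[?*]


def complement(A):
    """~A, relative to the bounded universe."""
    return UNIVERSE - A


A = {"a", "ab"}
B = {"ab", "b"}

print(A - B == A & complement(B))                          # A - B = A & ~B
print(complement(A | B) == complement(A) & complement(B))  # ~[A | B] = ~A & ~B
print(complement(A & B) == complement(A) | complement(B))  # ~[A & B] = ~A | ~B
print(A - A == complement(UNIVERSE))                       # A - A = ~[?*]
print(A | UNIVERSE == UNIVERSE)                            # A | ?* = ?*
```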

Important theoretical results
• Kleene’s theorem (concerning FSAs)
• Closure properties of regular languages and regular relations
• Decidability

FSAs and regular expressions
• Kleene’s theorem: Any language recognised by an FSA is denoted by a regular expression, and any language denoted by a regular expression can be recognised by an FSA.

Regex to FSA to regex
[Figure: conversions among regular expressions, nondeterministic FSAs with epsilon transitions, nondeterministic FSAs and deterministic FSAs]
Picture adapted from Hopcroft & Ullman 1979

From regular expressions to finite-state automata
• The only really necessary operators:
  • Disjunction
  • Concatenation
  • Iteration
• Sidenote: Compare regular grammars:
  • A --> x B
  • A --> x
  (where A and B are nonterminals, and where x is a sequence of terminals)
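
One common way to build an automaton from just these three operators is a Thompson-style construction, which produces a nondeterministic automaton with epsilon transitions. The sketch below is an illustration under that assumption and is not claimed to be how any particular tool compiles expressions. A machine is a hypothetical triple (start, final, edges), and an edge label of None stands for epsilon.

```python
import itertools

_fresh = itertools.count()  # supplies new state numbers


def symbol(a):
    """Automaton for the single-symbol language {a}."""
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, a, f)]


def concat(m1, m2):
    """Concatenation: link the final state of m1 to the start of m2."""
    s1, f1, e1 = m1
    s2, f2, e2 = m2
    return s1, f2, e1 + e2 + [(f1, None, s2)]


def union(m1, m2):
    """Disjunction: a new start branches into both machines."""
    s, f = next(_fresh), next(_fresh)
    s1, f1, e1 = m1
    s2, f2, e2 = m2
    extra = [(s, None, s1), (s, None, s2), (f1, None, f), (f2, None, f)]
    return s, f, e1 + e2 + extra


def star(m):
    """Iteration: allow skipping the machine or repeating it."""
    s, f = next(_fresh), next(_fresh)
    s1, f1, e1 = m
    extra = [(s, None, s1), (f1, None, s1), (f1, None, f), (s, None, f)]
    return s, f, e1 + extra


def closure(states, edges):
    """All states reachable from states through epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p, a, r in edges:
            if p == q and a is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(machine, string):
    """Simulate the epsilon-NFA on string."""
    start, final, edges = machine
    current = closure({start}, edges)
    for ch in string:
        step = {r for p, a, r in edges if p in current and a == ch}
        current = closure(step, edges)
    return final in current


# [a b* c] built from symbol, concat and star only.
abc = concat(symbol("a"), concat(star(symbol("b")), symbol("c")))
print(accepts(abc, "abbc"), accepts(abc, "ab"))
```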

Closure properties of regular languages
• A set is said to be closed under an operation iff applying the operation to members of the set will never take us outside the set
• Example: if A and B are regular languages, then [A|B] is always regular. Therefore regular languages are closed under union.
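
Closure under union (and intersection) can be made concrete with the product construction: run two deterministic automata in parallel and decide finality from the pair of states. The sketch below assumes each automaton is a hypothetical triple (delta, start, finals), with None acting as a dead state for missing transitions. The two example automata are made up.

```python
def product_dfa(d1, d2, combine):
    """Run two DFAs in parallel; combine decides which pairs are final."""
    delta1, start1, finals1 = d1
    delta2, start2, finals2 = d2
    alphabet = {a for (_, a) in delta1} | {a for (_, a) in delta2}
    start = (start1, start2)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        pair = todo.pop()
        p, q = pair
        if combine(p in finals1, q in finals2):
            finals.add(pair)
        for a in alphabet:
            # None plays the role of a dead state.
            target = (delta1.get((p, a)), delta2.get((q, a)))
            if target == (None, None):
                continue  # both machines are dead: no transition needed
            delta[(pair, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return delta, start, finals


def accepts(dfa, string):
    delta, state, finals = dfa
    for symbol in string:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in finals


A = ({(0, "a"): 1, (1, "b"): 1}, 0, {1})  # a b*
B = ({(0, "a"): 0}, 0, {0})               # a*

union = product_dfa(A, B, lambda x, y: x or y)          # [A | B]
intersection = product_dfa(A, B, lambda x, y: x and y)  # [A & B]
print(accepts(union, "abb"), accepts(union, "aa"), accepts(union, "aab"))
print(accepts(intersection, "a"), accepts(intersection, "ab"))
```

Because the result is again a finite-state automaton, the language it accepts is regular, which is exactly what closure means.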

Decidability
• Given one automaton A:
  • Is the string S a string in L(A)?
  • Does L(A) contain any strings at all?
  • Is L(A) equivalent to ?*?
• Given two automata A1 and A2:
  • Is L(A1) a subset of L(A2)?
  • Are L(A1) and L(A2) equivalent?
  • Do L(A1) and L(A2) overlap?
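
Several of these questions reduce to reachability in a product of the two automata, as the sketch below suggests. It assumes deterministic automata given as hypothetical triples (delta, start, finals), with None standing for a dead state. The universality question can be answered the same way by testing whether an automaton for ?* is a subset of A, which requires fixing an alphabet and is left out here.

```python
def reachable_pairs(d1, d2):
    """All state pairs reachable when two DFAs read the same input."""
    delta1, start1, _ = d1
    delta2, start2, _ = d2
    alphabet = {a for (_, a) in delta1} | {a for (_, a) in delta2}
    seen, todo = {(start1, start2)}, [(start1, start2)]
    while todo:
        p, q = todo.pop()
        for a in alphabet:
            target = (delta1.get((p, a)), delta2.get((q, a)))
            if target != (None, None) and target not in seen:
                seen.add(target)
                todo.append(target)
    return seen


def is_empty(dfa):
    """L(A) is empty iff no final state is reachable."""
    # Pairing the automaton with itself yields its reachable states.
    return not any(p in dfa[2] for p, _ in reachable_pairs(dfa, dfa))


def is_subset(d1, d2):
    """L(A1) is a subset of L(A2) iff no input is accepted by A1 only."""
    return not any(
        p in d1[2] and q not in d2[2] for p, q in reachable_pairs(d1, d2)
    )


def is_equivalent(d1, d2):
    return is_subset(d1, d2) and is_subset(d2, d1)


def overlaps(d1, d2):
    """True if some string is accepted by both automata."""
    return any(p in d1[2] and q in d2[2] for p, q in reachable_pairs(d1, d2))


# Made-up examples: A1 accepts a b* c, A2 accepts a followed by anything
# over {a, b, c}, and NOTHING accepts no string at all.
A1 = ({(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}, 0, {2})
A2 = ({(0, "a"): 1, (1, "a"): 1, (1, "b"): 1, (1, "c"): 1}, 0, {1})
NOTHING = ({(0, "a"): 0}, 0, set())

print(is_subset(A1, A2), is_subset(A2, A1))
print(is_equivalent(A1, A2), overlaps(A1, A2))
print(is_empty(A1), is_empty(NOTHING))
```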