Introduction to Computational Linguistics
Words and Finite State Machinery
CLINT-CS Finite State
Acknowledgement
Material derived from/copied from
• Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
• Richard Sproat, Lecture notes
Finite State Methods
• Word-Oriented Application Areas
  • Tokenization
  • Sentence breaking
  • Spelling correction
  • Morphology (analysis/generation)
  • Phonological disambiguation (Speech Recognition)
  • Morphological disambiguation (“Tagging”)
  • Pattern matching (“Named Entity Recognition”)
  • Shallow Parsing
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata
What is a Word? Some Distinctions
• Written
• Spoken
• Word Type
• Word Token
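The type/token distinction can be made concrete with a short count. This is a minimal sketch: the sample sentence is made up for illustration, and splitting on whitespace is an assumed, deliberately naive tokenizer.

```python
from collections import Counter

# Illustrative sample text (not from the slides).
text = "the cat saw the dog and the dog saw the cat"

# Assumption: whitespace is a good enough word delimiter for this example.
tokens = text.split()

# Each distinct word form is a type; the counter records how many tokens it has.
types = Counter(tokens)

num_tokens = len(tokens)   # every occurrence counts
num_types = len(types)     # every distinct form counts once

print(num_tokens, num_types, types["the"])
```

Here "the" is a single type that is realised by several tokens.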
Information Associated with Words
• Spelling
  • orthographic
  • phonological
• Syntax
  • POS
  • Valency
• Semantics
  • Meaning
  • Relationship to other words
Properties of Words
• Sequence
  • characters: pollution
  • phonemes
• Delimitation
  • whitespace
  • other?
• Structure
  • simple ("atomic") words
  • complex ("molecular") words
Complex Words
• Complex words have subparts:
  • e.g. "enlargement": en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways: (en + large) + ment, or en + (large + ment)
Morphological Processes
• affixation
  • prefix
  • suffix
  • circumfix: għandi - mgħandix
  • infix: phenidine; phenetidine
• other morphological processes
  • redoubling (mexa; mexxa)
  • vowel change (swim; swam)
Affixation uses Concatenation
prefixes           roots                                suffixes
dis re un en   +   large charge infect code decide  +   ed ing ee er ly
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  • A characteristic set of basic symbols (alphabet)
  • A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration
• Regular Language; Regular Sets
Characterising Classes of Sets
[Figure: diagram relating MACHINE, CLASS OF SETS or LANGUAGES, and NOTATION]
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata
Regular Languages
• A regular language is a language over a finite alphabet that can be built from finite sets of strings by finitely many applications of the following operations:
  • Set union
  • Concatenation
  • Closure (Kleene star): zero or more repetitions
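The three operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the function names are made up for illustration, and because the real Kleene star yields an infinite set, `star` here is a bounded approximation with an assumed repetition limit.

```python
def union(l1, l2):
    """Strings that belong to either language."""
    return l1 | l2

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_reps):
    """Bounded stand-in for Kleene star: at most max_reps pieces from lang."""
    result = {""}          # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, lang)
        result |= layer
    return result

# "Zero or more a's followed by zero or more b's", cut off at three of each.
a_star = star({"a"}, 3)
b_star = star({"b"}, 3)
a_then_b = concat(a_star, b_star)

print(sorted(a_then_b, key=lambda s: (len(s), s)))
```

Note that nothing in these operations can force the number of a's and b's to agree, which is why the "same number of b's" language falls outside the regular languages.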
Some things that are regular languages
• Zero or more a’s followed by zero or more b’s
• The set of words in an English dictionary
• Dates
• URLs
• English?
Some things that are not regular languages
• Zero or more a’s followed by exactly the same number of b’s
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all sentences of the form
  • the cat slept
  • the cat the dog bit slept
  • the cat the dog the man fed bit slept
Some special regular languages
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language is not the same as the empty string
Some closure properties of regular languages
• Intersection
• Complementation
• Difference
• Reversal
• Power
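Two of these closure properties can be sketched directly on deterministic automata: intersection by running two machines in parallel (the product construction), and complementation by swapping final and non-final states. The dictionary layout and the two example machines are assumptions made for this illustration, and both constructions assume complete DFAs (a transition for every state and symbol).

```python
from itertools import product

def run(dfa, s):
    """Follow the single path for s and report whether it ends in a final state."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["finals"]

def intersect(d1, d2, alphabet):
    """Product construction: a state is a pair (state of d1, state of d2)."""
    states = list(product(d1["states"], d2["states"]))
    delta = {}
    for (p, q) in states:
        for ch in alphabet:
            delta[((p, q), ch)] = (d1["delta"][(p, ch)], d2["delta"][(q, ch)])
    return {
        "states": states,
        "start": (d1["start"], d2["start"]),
        "finals": {(p, q) for p in d1["finals"] for q in d2["finals"]},
        "delta": delta,
    }

def complement(d):
    """Swap final and non-final states (valid only for a complete DFA)."""
    return {**d, "finals": set(d["states"]) - set(d["finals"])}

ALPHABET = "ab"
# Hypothetical example machines: an even number of a's, and strings ending in b.
even_a = {"states": [0, 1], "start": 0, "finals": {0},
          "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}}
ends_b = {"states": [0, 1], "start": 0, "finals": {1},
          "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

both = intersect(even_a, ends_b, ALPHABET)
odd_a = complement(even_a)
```

Difference then follows from these two, since A - B is the intersection of A with the complement of B.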
Characterising Classes of Sets
[Figure: diagram relating MACHINE, CLASS OF SETS or LANGUAGES, and NOTATION]
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite Automata
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but with a similar function.
Regular Expressions
a       a simple symbol
A B     concatenation
A | B   alternation operator
A & B   intersection operator
A*      Kleene star
Caveats
• Perl and other languages (see J&M, Chapter 2) have many extensions in their “regular expression” syntax. Strictly speaking, not all of these are regular expressions in the formal sense, since some of them describe languages that are not regular.
• For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …): /(…+)\1/
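The copying caveat can be checked in Python's `re` module. This is a minimal sketch: the pattern `(.+)\1` is chosen here for illustration and is not necessarily the exact pattern on the slide, and `is_doubled` is a made-up helper name.

```python
import re

# (.+) captures any non-empty substring; \1 demands the very same substring again.
# The backreference is what takes this beyond the regular languages.
copy_pattern = re.compile(r"(.+)\1")

def is_doubled(s):
    """True if the whole string is some non-empty substring written twice."""
    return copy_pattern.fullmatch(s) is not None

print(is_doubled("abcabc"))
print(is_doubled("abcabd"))
```

No finite automaton can do this for strings of unbounded length, because it would have to remember an arbitrarily long first half.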
Characterising Classes of Sets
[Figure: diagram relating MACHINE, CLASS OF SETS or LANGUAGES, and NOTATION]
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite Automata
Finite Automaton
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
  • Q is a finite set of states
  • Σ is an alphabet of symbols
  • q0 ∈ Q is a start state
  • F ⊆ Q is the set of final states
  • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
Representation of FSA’s: State Diagram
State Table
State diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with an arc 3 -h-> 2
Prolog
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).
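The same automaton can be written down and run in Python. This is a sketch under assumptions: the constant and function names are made up, and the arcs are copied from the Prolog facts above as (source, destination, label) triples. The recognizer tracks a set of current states, so it would also work if the arc relation were non-deterministic.

```python
# Facts transcribed from the Prolog representation: arc(Source, Destination, Label).
INITIAL = 1
FINALS = {4}
ARCS = {(1, 2, "h"), (2, 3, "a"), (3, 4, "!"), (3, 2, "h")}

def accepts(s):
    """Consume s symbol by symbol; accept if a final state is reachable at the end."""
    current = {INITIAL}
    for ch in s:
        current = {dst for (src, dst, label) in ARCS
                   if src in current and label == ch}
        if not current:          # no arc matches: the string is rejected
            return False
    return bool(current & FINALS)

for word in ["ha!", "hahaha!", "ha", "h!"]:
    print(word, accepts(word))
```

The machine accepts one or more repetitions of "ha" followed by "!".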
Mr. S.K. CLINT-CS Finite State
Kleene’s theorem
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
Converting a Regular Expression to an NFA
• The NFA representing the empty string is: 1 -ε-> 2
• The NFA representing a single character is: 1 -a-> 2
Converting a Regular Expression to an NFA
• The union operator is represented by a choice of paths from a node, e.g. a|b: 1 -a-> 2 and 1 -b-> 2
Converting a Regular Expression to an NFA
• Concatenation simply involves connecting one NFA to the other, so that ab is represented by: 1 -a-> 2 -b-> 3
Converting a Regular Expression to an NFA
• The Kleene star must allow for zero or more occurrences. So a* is represented by:
[Figure: NFA for a*, with one arc labelled a and four ε arcs]
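The four cases above can be put together in a small recursive builder. This is a sketch of one common textbook variant (Thompson-style), not necessarily the exact diagrams on the slides: it joins fragments with extra ε arcs rather than merging states. The tuple-based regex representation (`"sym"`, `"cat"`, `"alt"`, `"star"`, `"eps"`) and all function names are assumptions made for this example.

```python
import itertools

_ids = itertools.count(1)          # fresh state numbers

def build(node, arcs):
    """Add arcs (src, label, dst) for node; label None means ε. Return (start, end)."""
    kind = node[0]
    if kind == "cat":
        start, mid1 = build(node[1], arcs)
        mid2, end = build(node[2], arcs)
        arcs.append((mid1, None, mid2))              # connect one NFA to the other
        return start, end
    start, end = next(_ids), next(_ids)
    if kind == "eps":
        arcs.append((start, None, end))
    elif kind == "sym":
        arcs.append((start, node[1], end))
    elif kind == "alt":
        for child in node[1:]:                       # a choice of paths
            c_start, c_end = build(child, arcs)
            arcs.extend([(start, None, c_start), (c_end, None, end)])
    elif kind == "star":
        c_start, c_end = build(node[1], arcs)
        arcs.extend([(start, None, c_start), (c_end, None, end),
                     (c_end, None, c_start),         # repeat
                     (start, None, end)])            # zero occurrences
    else:
        raise ValueError(f"unknown node type: {kind}")
    return start, end

def eps_closure(states, arcs):
    """All states reachable from states using only ε arcs."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def matches(regex, s):
    """Build the NFA for regex and simulate it on s."""
    arcs = []
    start, end = build(regex, arcs)
    current = eps_closure({start}, arcs)
    for ch in s:
        moved = {dst for src, label, dst in arcs if src in current and label == ch}
        current = eps_closure(moved, arcs)
    return end in current

# (a|b)* c
REGEX = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(matches(REGEX, "abac"), matches(REGEX, "ab"))
```

Each operator contributes a fixed number of states and arcs, so the NFA grows only linearly with the size of the expression.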
Deterministic versus non-deterministic finite automata
• The definition of finite-state automata given above was for non-deterministic finite automata (NFA):
  • δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
  • In other words, δ is a function
A deterministic automaton
NFAs vs DFAs
• NFAs are typically smaller and simpler than their equivalent DFAs
• Why do we care about DFAs?
• EFFICIENCY! A DFA follows exactly one path through the input, with no backtracking and no sets of states to track.
Equivalence of NFA’s and DFA’s
Subset Construction for Determinisation
• Any two states that are connected by an ε transition may as well be the same, since we can move from one to the other without consuming any character.
• Thus states which are connected by an ε transition will be represented by the same state in the DFA.
• If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
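The procedure described above can be sketched as follows. The representation is an assumption for this example: arcs are (source, label, destination) triples with `None` standing for ε, each DFA state is a frozenset of NFA states, and the small NFA for "strings ending in ab" is made up for illustration. Missing transitions are simply left undefined, so the result is a partial DFA.

```python
from collections import deque

EPS = None   # label used for ε arcs in this sketch

def eps_closure(states, arcs):
    """Group a set of NFA states with everything reachable from it by ε arcs."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def determinise(start, finals, arcs, alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    d_start = eps_closure({start}, arcs)
    d_delta, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        subset = queue.popleft()
        if subset & finals:                   # contains an NFA final state
            d_finals.add(subset)
        for ch in alphabet:
            moved = {dst for src, label, dst in arcs
                     if src in subset and label == ch}
            target = eps_closure(moved, arcs)
            if not target:
                continue                      # no transition on this symbol
            d_delta[(subset, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, d_finals, d_delta

def dfa_accepts(dfa, s):
    state, finals, delta = dfa
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals

# Hypothetical NFA: strings over {a, b} ending in "ab", plus one ε arc into the final state.
NFA_ARCS = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, EPS, 3)]
dfa = determinise(start=0, finals={3}, arcs=NFA_ARCS, alphabet="ab")
dfa_states = {dfa[0]} | set(dfa[2].values())
print(len(dfa_states), dfa_accepts(dfa, "aab"))
```

Only subsets that are actually reachable from the start state are created, which in practice keeps the DFA far smaller than the worst case of one state per subset.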
Xerox Tools Finite State Machinery
The Xerox Approach
• Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
• Meta-languages for describing regular languages and regular relations.
• Compiler for mapping meta-language "programs" into efficient FS machinery
• Several tools and applications
Xerox tools
• xfst: Xerox Finite-State Tool
• lexc: Finite-State Lexicon Compiler
• twolc: Two-Level Rule Compiler
xfst
• xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.
• xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)
Simple Regular Expressions
• Atomic Expressions
  • Simple Symbols
  • Multicharacter Symbols
• Complex Expressions
  • Union
  • Intersection
  • Concatenation
xfst Notation Examples
A|B     Union
A&B     Intersection
A B     Concatenation
A*      Closure (Kleene Star)
(A)     Optional Element
?       Any symbol
\b      Any symbol other than b
~A      Complement (= [?* - A])
0       Empty string language
$A      = [?* A ?*]
Concatenation over Reg. Expression and Language
Regular Expression: E1 = [a|b], E2 = [c|d], E1 E2 = [a|b] [c|d]
Language: L1 = {"a", "b"}, L2 = {"c", "d"}, L1 L2 = {"ac", "ad", "bc", "bd"}
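Concatenation over FS Automata
[Figure: an automaton with arcs a, b, concatenated (+) with an automaton with arcs c, d, giving (=) an automaton with arcs a, b followed by arcs c, d]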
Simple Commands
• In addition to the notation there are also commands, e.g.
  • define: give a name to an RE
  • print: print information
  • read: read information
  • various stack operations
  • file interaction
  • various command line options