CSA3050: Natural Language Algorithms Words and Finite State Machinery Natural Language Processing
Acknowledgement
Material derived from/copied from
• Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
• Richard Sproat, Lecture notes
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata
What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• The smallest meaningful element of language. When written it stands alone with a space on either side of it.
Information Associated with Words
• Spelling
  • orthographic
  • phonological
• Syntax
  • POS
  • Valency
• Semantics
  • Meaning
  • Relationship to other words
Properties of Words
• Sequence
  • characters (e.g. pollution)
  • phonemes
• Delimitation
  • whitespace
  • other?
• Structure
  • simple ("atomic") words
  • complex ("molecular") words
Complex Words
• Complex words have subparts:
  • e.g. "enlargement" = en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways:
  (en + large) + ment
  en + (large + ment)
Morphological Processes
• affixation
  • prefix
  • suffix
  • circumfix: għandi - mgħandix
  • infix: phenidine - phenetidine
• other morphological processes
  • redoubling (mexa; mexxa)
  • vowel change (swim; swam)
Complex Words Formed by Concatenation
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly
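The concatenation pattern on this slide can be sketched as a regular expression. The affix and root lists are copied from the slide; making the prefix and the suffix optional, and the helper names `alternation` and `segment`, are assumptions made only for this illustration.

```python
import re

# Lists copied from the slide. Treating both affixes as optional is an
# assumption of this sketch, not something the slide states.
PREFIXES = ["dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["ed", "ing", "ee", "er", "ly"]


def alternation(items):
    """Build a regex group that matches any one of the given strings."""
    return "(" + "|".join(re.escape(s) for s in items) + ")"


# prefix? + root + suffix?, i.e. union inside each group, concatenation between groups
WORD = re.compile(
    alternation(PREFIXES) + "?" + alternation(ROOTS) + alternation(SUFFIXES) + "?"
)


def segment(word):
    """Return (prefix, root, suffix) if the word is a plain concatenation, else None."""
    m = WORD.fullmatch(word)
    if m is None:
        return None
    # Unmatched optional groups come back as None; show them as empty strings.
    return tuple(g or "" for g in m.groups())
```

Pure concatenation ignores spelling changes: `segment("disinfected")` succeeds, but `segment("encoded")` does not, because en + code + ed would be spelled "encodeed".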
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  • A characteristic set of basic symbols (alphabet)
  • A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration
• Regular Language; Regular Sets
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata
Regular Languages
• A regular language is a language over a finite alphabet that can be constructed from finite sets of strings by applying the following operations a finite number of times:
  • Set union
  • Concatenation
  • Reflexive-transitive closure (Kleene star)
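As a rough illustration of the three operations, the sketch below applies them to small finite sets of strings. The function names are invented for this example, and because A* is infinite for any non-empty A other than {""}, `star` only approximates it up to a chosen number of repetitions.

```python
from itertools import product


def union(A, B):
    """Set union: every string that is in A or in B."""
    return A | B


def concat(A, B):
    """Concatenation: every string ab with a taken from A and b taken from B."""
    return {a + b for a, b in product(A, B)}


def star(A, max_reps):
    """Bounded approximation of A*: at most max_reps strings from A joined together."""
    result = {""}   # zero repetitions always yields the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, A)   # exactly one more repetition
        result |= layer
    return result
```

For instance, `concat(star({"a"}, 2), star({"b"}, 1))` enumerates the short members of "zero or more a's followed by zero or more b's".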
Some things that are regular languages
• Zero or more a’s followed by zero or more b’s
• The set of words in an English dictionary
• Dates
• URLs
• English?
Some things that are not regular languages
• Zero or more a’s followed by exactly the same number of b’s
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all noun phrases of the form
  • the cat slept
  • the cat the dog bit slept
  • the cat the dog the man fed bit slept
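The contrast between the first example on each of these two lists can be sketched in code. The first function below uses an ordinary regular expression; the second has to compare lengths, which stands in for the unbounded counting that no finite-state device can do. Both function names are made up for this example.

```python
import re

A_STAR_B_STAR = re.compile(r"a*b*")


def in_a_star_b_star(s):
    """Regular: zero or more a's followed by zero or more b's."""
    return A_STAR_B_STAR.fullmatch(s) is not None


def in_an_bn(s):
    """Not regular: n a's followed by exactly n b's.

    The check relies on arithmetic over the string length, i.e. on memory
    that grows with the input, which a finite automaton does not have.
    """
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n
```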
Some special regular languages
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language, which contains no strings at all, is not the same as the language containing only the empty string
Some closure properties of regular languages
• Intersection
• Complementation
• Difference
• Reversal
• Power
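Two of these closure properties can be sketched with deterministic automata (introduced later in these slides): complementation swaps final and non-final states, and intersection runs two machines in parallel. The tuple layout `(states, alphabet, delta, start, finals)` and the two example machines are assumptions of this sketch, and both machines are assumed to be complete over the same alphabet.

```python
from itertools import product


def accepts(dfa, s):
    """Run a complete DFA on s (s must only use symbols from the alphabet)."""
    states, alphabet, delta, start, finals = dfa
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q in finals


def complement(dfa):
    """Same machine with final and non-final states swapped."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)


def intersection(d1, d2):
    """Product construction: track one state of each machine at the same time."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return (states, alphabet, delta, (q1, q2), finals)


# Hypothetical example machines over {a, b}.
ALPHABET = {"a", "b"}
# Accepts strings containing an even number of a's.
EVEN_A = ({0, 1}, ALPHABET,
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
# Accepts strings whose last symbol is b.
ENDS_B = ({"n", "y"}, ALPHABET,
          {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
          "n", {"y"})
```

Difference then follows from the other two, since A minus B is the intersection of A with the complement of B.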
Characterising Classes of Set
[Diagram relating MACHINE, CLASS OF SETS or LANGUAGES, and NOTATION]
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite Automata
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but with similar function.
Regular Expressions
a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
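For comparison, here is how the same operators look in Python's `re` module. This is only an illustration in one concrete dialect, not the notation of the slide: Python writes concatenation without a space, and it has no intersection operator, so the hypothetical helper `matches_both` emulates `A & B` by testing both patterns.

```python
import re


def matches(pattern, s):
    """True if the whole string s is described by the pattern."""
    return re.fullmatch(pattern, s) is not None


def matches_both(p1, p2, s):
    """Emulate A & B, since Python's re module has no intersection operator."""
    return matches(p1, s) and matches(p2, s)


# One example pattern per row of the table (Python syntax).
SIMPLE_SYMBOL = r"a"
CONCATENATION = r"ab"    # A B is written by juxtaposition
ALTERNATION = r"a|b"     # A | B
KLEENE_STAR = r"a*"      # A*
```

With these definitions, `matches_both(r"a*b*", r"(ab)*", "ab")` holds, while "abab" is matched by the second pattern only.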
Outline
• Words
• Regular Languages
• Regular Expressions
• Finite Automata
Finite Automaton
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
  • Q is a finite set of states
  • Σ is an alphabet of symbols
  • q0 ∈ Q is a start state
  • F ⊆ Q is the set of final states
  • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
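The quintuple can be written down almost literally as a data structure. The sketch below stores δ as a set of (q, σ, q') triples, so it may be nondeterministic; ε-transitions are left out to keep it short. The class name `FSA` and the example machine are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FSA:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    start: str            # q0, a member of Q
    finals: frozenset     # F, a subset of Q
    delta: frozenset      # transition relation as (q, σ, q') triples


def accepts(fsa, s):
    """Follow every possible transition at once and accept if a final state is reachable."""
    current = {fsa.start}
    for sym in s:
        current = {q2 for (q1, a, q2) in fsa.delta if q1 in current and a == sym}
    return bool(current & fsa.finals)


# Hypothetical example: strings over {a, b} that end in "ab".
ENDS_AB = FSA(
    states=frozenset({"0", "1", "2"}),
    alphabet=frozenset({"a", "b"}),
    start="0",
    finals=frozenset({"2"}),
    delta=frozenset({
        ("0", "a", "0"), ("0", "b", "0"),   # loop while reading the prefix
        ("0", "a", "1"),                    # guess that this a starts the final "ab"
        ("1", "b", "2"),
    }),
)
```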
Representation of FSAs: State Diagram
State Table
Mr. Kleene
Kleene’s theorem
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
http://www.cs.may.ie/~jpower/Courses/parsing/node6.html
Converting a Regular Expression to an NFA
• The NFA representing the empty string is: 1 --ε--> 2
• The NFA representing a single character is: 1 --a--> 2
Regular Expression to NFA
(Diagram from Leonidas Fegaras, Univ. Texas)
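Since the diagram itself is not reproduced here, the following is a sketch of a standard Thompson-style construction, which may differ in detail from the one pictured. Each combinator returns an NFA fragment with one start and one final state; the names `Frag`, `symbol`, `epsilon`, `concat`, `union` and `star`, and the use of `None` as the ε label, are choices made for this example. Each fragment should be used only once when building a larger one.

```python
from itertools import count

_ids = count(1)   # source of fresh state numbers
EPS = None        # label used for ε-transitions in this sketch


class Frag:
    """NFA fragment: one start state, one final state, a list of (p, label, q) edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def symbol(a):
    s, f = next(_ids), next(_ids)
    return Frag(s, f, [(s, a, f)])


def epsilon():
    return symbol(EPS)


def concat(x, y):
    # Link the final state of x to the start state of y with an ε-edge.
    return Frag(x.start, y.final, x.edges + y.edges + [(x.final, EPS, y.start)])


def union(x, y):
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (s, EPS, y.start),
             (x.final, EPS, f), (y.final, EPS, f)]
    return Frag(s, f, x.edges + y.edges + extra)


def star(x):
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (s, EPS, f),              # enter, or skip entirely
             (x.final, EPS, x.start), (x.final, EPS, f)]  # repeat, or leave
    return Frag(s, f, x.edges + extra)


def closure(states, edges):
    """All states reachable from the given states using ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, a, r) in edges:
            if p == q and a is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(frag, s):
    current = closure({frag.start}, frag.edges)
    for ch in s:
        moved = {r for (p, a, r) in frag.edges if p in current and a == ch}
        current = closure(moved, frag.edges)
    return frag.final in current
```

For example, `concat(star(union(symbol("a"), symbol("b"))), symbol("c"))` builds an NFA for (a|b)*c.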
Deterministic Finite Automata
• In a deterministic finite automaton (DFA), every state/symbol pair maps to a unique state
• In other words, δ is a function
• Why do we care about DFAs?
• EFFICIENCY!!
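The efficiency point can be made concrete: because δ is a function, recognition needs one table lookup per input symbol and only one current state. The table below is a hypothetical DFA for strings over {a, b} ending in "ab"; it is not taken from the slides.

```python
# Hypothetical state table: TABLE[state][symbol] is the single next state.
TABLE = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
START = 0
FINALS = {2}


def dfa_accepts(s):
    """Recognise s in time linear in its length (symbols must be a or b)."""
    state = START
    for sym in s:
        state = TABLE[state][sym]   # exactly one successor, so no sets and no backtracking
    return state in FINALS
```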
Equivalence of NFAs and DFAs
Subset Construction for Determinisation
• States which are connected by an ε-transition will be represented by the same state in the DFA.
• If there are multiple transitions on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
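The steps above can be sketched as follows. The NFA is given as a list of (p, label, q) edges with `None` standing for ε, and each DFA state is a frozenset of NFA states; these representation choices, the function names and the example NFA are all assumptions of this sketch.

```python
EPS = None   # label for ε-transitions (a convention chosen for this sketch)


def eps_closure(states, edges):
    """States connected by ε-transitions end up in the same DFA state."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, a, r) in edges:
            if p == q and a is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def determinise(edges, start, finals, alphabet):
    """Return (table, dfa_start, dfa_finals) for the equivalent DFA."""
    dfa_start = eps_closure({start}, edges)
    table, todo = {}, [dfa_start]
    while todo:
        S = todo.pop()
        if S in table:
            continue
        table[S] = {}
        for a in alphabet:
            # Union of all NFA states reachable from S on symbol a.
            moved = {r for (p, b, r) in edges if p in S and b == a}
            T = eps_closure(moved, edges)
            table[S][a] = T
            if T not in table:
                todo.append(T)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {S for S in table if S & set(finals)}
    return table, dfa_start, dfa_finals


def run(table, start, finals, s):
    state = start
    for sym in s:
        state = table[state][sym]
    return state in finals


# Hypothetical NFA for strings over {a, b} ending in "ab", with one ε-edge.
NFA_EDGES = [(0, EPS, 1), (1, "a", 1), (1, "b", 1), (1, "a", 2), (2, "b", 3)]
DFA_TABLE, DFA_START, DFA_FINALS = determinise(NFA_EDGES, 0, {3}, {"a", "b"})
```

Only the subsets actually reachable from the start state are built, which in practice is often far fewer than all possible subsets of NFA states.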
Subset construction for determinization
[Worked example shown as a sequence of diagrams]