Computational Linguistics Introduction. Finite State Machinery and Language Description. Acknowledgement. The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001. Today’s Topics.
Computational Linguistics Introduction
Finite State Machinery and Language Description
CLINT-LIN: Finite State Machinery
Acknowledgement
• The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001.
Today’s Topics
• Finite State Technology
• Regular Languages and Relations
• Review of Set Theory
• Understand the mathematical operations that can be performed on such languages.
• Understand how languages, relations, regular expressions, and networks are interrelated.
What is Finite State Technology?
• Finite State Technology refers to a collection of techniques for applying Finite State Automata (FSA) to a range of linguistically motivated problems.
• Such techniques include:
  • Design of user languages for specifying FSA
  • Compilation of such languages into efficient transition networks
  • Development environments and runtime systems
What is Finite-State Technology Good For?
• Finite-state techniques cannot handle unbounded centre embedding:
  • the man the dog the cat bit followed ate.
• They are well suited to “lower-level” natural language processing such as:
  • Tokenization: what is the next word?
  • Spelling error detection: does the next word belong to a list?
  • Morphological/phonological analysis/generation
  • Shallow syntactic parsing and “chunking”
Tokenisation Problems
• VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday. (Example from Mary Dalrymple, University of London.)
  • VfB Stuttgart, Manchester United
  • succession
  • 2-1
  • Wednesday
• Finite state techniques provide a means to specify the language of words, thus defining what it means to be the next token.
• There are three ways to specify such languages.
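As a rough illustration of specifying "the language of words", the sketch below uses a regular expression, which is one finite-state notation. The pattern, the hand-picked list of multiword names and the function name `tokenize` are assumptions made for this example and are not taken from the lecture. The pattern uses only regular operations (alternation, concatenation, repetition), and it does not attempt to rejoin a word hyphenated across a line break such as "success-ion".

```python
import re

# Each alternative describes one kind of token; order matters, because
# the first alternative that matches at a position wins.
TOKEN = re.compile(
    r"VfB Stuttgart|Manchester United"  # multiword names (hand-picked, hypothetical list)
    r"|\d+-\d+"                         # scores such as 2-1
    r"|\w+"                             # ordinary words
    r"|[^\w\s]"                         # any single punctuation mark
)

def tokenize(text):
    """Return the tokens of `text` from left to right, skipping whitespace."""
    return TOKEN.findall(text)

print(tokenize("VfB Stuttgart won 2-1 over Manchester United."))
# ['VfB Stuttgart', 'won', '2-1', 'over', 'Manchester United', '.']
```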
Languages, Notations and Machines
[Diagram relating three items: MACHINE, LANGUAGE (set of strings), NOTATION]
Languages, Notations and Machines
[Diagram relating three items: FINITE STATE AUTOMATON, FINITE STATE LANGUAGE, FINITE STATE NOTATION]
FINITE STATE AUTOMATA: preliminary definition
• A finite state automaton includes:
  • A finite set of states
  • A finite set of labelled transitions between states
Physical Machines with Finite States
• The Lightswitch Machine
[Diagram: two states, OFF and ON, with transitions labelled UP and DOWN]
Physical Machines with Finite States
• The Lightswitch Toggle Machine
[Diagram: two states, OFF and ON, with a PUSH transition in each direction]
The Five Cent Machine
• Problem:
  • Assume you have one, two, and five cent pieces.
  • Design a finite state automaton which accepts exactly 5 cents.
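One possible answer to this exercise, as a minimal sketch: take the running total (0 to 5 cents) as the state, with 0 as the start state and 5 as the only final state. The symbol names `one`, `two`, `five` and the function name `accepts` are choices made for this example only.

```python
# Coin symbols and their values in cents (symbol names are hypothetical).
COINS = {"one": 1, "two": 2, "five": 5}
START, FINAL = 0, 5

# Explicit transition table: (state, coin) -> next state.
# Arcs that would overshoot 5 cents simply do not exist.
TRANSITIONS = {
    (state, coin): state + value
    for state in range(FINAL + 1)
    for coin, value in COINS.items()
    if state + value <= FINAL
}

def accepts(coins):
    """Return True if the coin sequence ends exactly in the final state."""
    state = START
    for coin in coins:
        state = TRANSITIONS.get((state, coin))
        if state is None:        # no arc for this coin: reject
            return False
    return state == FINAL

print(accepts(["two", "two", "one"]))  # True
print(accepts(["two", "two"]))         # False (only 4 cents)
print(accepts(["five", "one"]))        # False (overshoots)
```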
The Cola Machine
• Need to enter 25 cents (USA) to get a drink
• Accepts the following coins:
  • Nickel = 5 cents
  • Dime = 10 cents
  • Quarter = 25 cents
• For simplicity, our machine needs exact change
• We will model only the coin-accepting mechanism
Physical Machines with Finite States
• The Cola Machine
[Diagram: states 0 (start state), 5, 10, 15, 20 and 25 (final state). N arcs: 0→5, 5→10, 10→15, 15→20, 20→25. D arcs: 0→10, 5→15, 10→20, 15→25. Q arc: 0→25.]
The Cola Machine Language
• List of all the sequences of coins accepted:
  • { Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN }
• Think of the coins as SYMBOLS or CHARACTERS
• The set of symbols accepted is the ALPHABET of the machine
• Think of sequences of coins as WORDS or “strings”
• The set of words accepted by the machine is its LANGUAGE
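The language of the cola machine can be checked by walking every path from the start state to the final state. The sketch below assumes the states are the running totals 0 to 25 and that each coin adds its value; the function names `arcs` and `language` are made up for this example.

```python
# Alphabet of the machine: coin symbol -> value in cents.
COINS = {"N": 5, "D": 10, "Q": 25}
START, FINAL = 0, 25

def arcs(state):
    """Yield (symbol, next_state) for every arc leaving `state`."""
    for symbol, value in COINS.items():
        if state + value <= FINAL:   # exact change only: no overshooting
            yield symbol, state + value

def language(state=START, prefix=""):
    """Generate every coin sequence leading from `state` to the final state."""
    if state == FINAL:
        yield prefix
        return
    for symbol, next_state in arcs(state):
        yield from language(next_state, prefix + symbol)

words = sorted(language())
print(len(words))  # 9
print(words)
```

The same network is used here to generate the language rather than to recognise a single input, which is the sense in which an automaton both recognises and generates.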
FINITE STATE AUTOMATA: better definition
• A finite state automaton includes:
  • A finite set of states
    • Initial state
    • Final state(s)
  • A finite set of labelled transitions between states
    • Labels are symbols from an alphabet
• Recognises a language
• Generates a language as well!
A Network that Accepts a One Word Language
[Diagram: a single path of arcs labelled c, a, n, t, o, accepting “canto”]
A Network that Accepts a Three Word Language
[Diagram: a network with paths spelling “canto”, “tigre” and “mesa”]
Scaling Up the Network
• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.
• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language.
• This is the basis for a Spanish spelling error detector.
Looking Up a Word
[Diagram: the three-word network (“canto”, “tigre”, “mesa”) with the input “mesa” applied to it]
Lookup Failure
• Lookup succeeds when all input is consumed and a final state is reached. Lookup can fail because:
  • Not all input is consumed ("libro", "tigra")
  • Input is fully consumed but the state is not final ("cant")
  • A final state is reached but there is still unconsumed input ("mesas")
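The three failure cases can be reproduced with a small sketch. It stores the three-word network as a letter tree (a trie), which is an assumption made for brevity: unlike a minimised network it shares only prefixes. The names `build_network` and `lookup` and the result strings are invented for this example.

```python
def new_state():
    """A state is a dict of outgoing arcs plus a 'final' flag."""
    return {"arcs": {}, "final": False}

def build_network(words):
    """Build a letter tree with one path per word."""
    start = new_state()
    for word in words:
        state = start
        for symbol in word:
            state = state["arcs"].setdefault(symbol, new_state())
        state["final"] = True
    return start

def lookup(network, word):
    """Apply `word` to the network and report success or the kind of failure."""
    state = network
    for symbol in word:
        if symbol not in state["arcs"]:
            if state["final"]:
                return "fail: final state reached, input left over"
            return "fail: not all input consumed"
        state = state["arcs"][symbol]
    if state["final"]:
        return "ok"
    return "fail: input consumed, state not final"

network = build_network(["canto", "tigre", "mesa"])
for word in ["mesa", "libro", "tigra", "cant", "mesas"]:
    print(word, "->", lookup(network, word))
```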
Shared Structure
[Diagram: a network with arcs labelled c, l, e, a, r, v, e, e, in which several words share states and arcs]
Transducers
“Lookdown” input: mesa+Noun+Fem+Pl
Upper side: m e s a +Noun +Fem +Pl
Lower side: m e s a 0 0 s
“Lookup” input: m e s a s
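The bidirectional behaviour of this path can be sketched as follows. This is a simplification under stated assumptions: it models a single path of upper:lower symbol pairs rather than a full network, it treats `0` as the empty symbol, and the names `PATH`, `apply_path`, `lookup` and `lookdown` are made up for this example.

```python
EPS = "0"  # the empty symbol: it consumes and produces nothing

# One path of the transducer as (upper symbol, lower symbol) pairs.
PATH = [
    ("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
    ("+Noun", EPS), ("+Fem", EPS), ("+Pl", "s"),
]

def apply_path(path, symbols, input_side):
    """Match `symbols` against one side of the path and return the other side.

    input_side is 0 to read the upper side, 1 to read the lower side.
    Returns None if the input does not match the path exactly.
    """
    output, position = [], 0
    for arc in path:
        expected, produced = arc[input_side], arc[1 - input_side]
        if expected != EPS:
            if position >= len(symbols) or symbols[position] != expected:
                return None
            position += 1
        if produced != EPS:
            output.append(produced)
    return "".join(output) if position == len(symbols) else None

def lookup(surface):
    """Surface string -> lexical string (analysis)."""
    return apply_path(PATH, list(surface), input_side=1)

def lookdown(lexical_symbols):
    """List of lexical symbols -> surface string (generation)."""
    return apply_path(PATH, lexical_symbols, input_side=0)

print(lookup("mesas"))                                        # mesa+Noun+Fem+Pl
print(lookdown(["m", "e", "s", "a", "+Noun", "+Fem", "+Pl"]))  # mesas
```

Both directions reuse the same path, which is the point of the slide: the data is written once and applied either way.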
A Morphological Analyzer
[Diagram: a transducer relating “dog +n +pl” to “dogs”]
A Morphological Analyzer
[Diagram: a transducer relating the Lexical Language to the Surface Language]
A Quick Review of Set Theory
• A set is a collection of objects.
[Diagram: a set containing B, A, E, D]
• We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.
• There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.
Uniqueness of Elements
• You cannot have two or more ‘A’ elements in the same set.
[Diagram: a set containing B, A, D, E]
• { A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.
Cardinality of Sets
• The Empty Set
• A Finite Set, e.g. { Norway, Denmark, Sweden }
• An Infinite Set, e.g. the set of all positive integers
Simple Operations on Sets: Union
[Diagram: Set 1 and Set 2 together contain A, B, C, D, E and share no element. Union of Set 1 and Set 2 = { A, B, C, D, E }.]
Simple Operations on Sets (2): Union
[Diagram: Set 1 = { A, B, C }, Set 2 = { C, D }. Union of Set 1 and Set 2 = { A, B, C, D }.]
Simple Operations on Sets (3): Intersection
[Diagram: Set 1 = { A, B, C }, Set 2 = { C, D }. Intersection of Set 1 and Set 2 = { C }.]
Simple Operations on Sets (4): Subtraction
[Diagram: Set 1 = { A, B, C }, Set 2 = { C, D }. Set 1 minus Set 2 = { A, B }.]
Formal Languages
Very important concept in Formal Language Theory: a language is just a set of words.
• We use the terms “word” and “string” interchangeably.
• A language can be empty, have finite cardinality, or be infinite in size.
• You can union, intersect and subtract languages, just like any other sets.
Union of Languages (Sets)
[Diagram: Language 1 = { dog, cat, rat }, Language 2 = { elephant, mouse }. Union of Language 1 and Language 2 = { dog, cat, rat, elephant, mouse }.]
Intersection of Languages (Sets)
[Diagram: Language 1 = { dog, cat, rat }, Language 2 = { elephant, mouse }. The intersection of Language 1 and Language 2 is empty.]
Intersection of Languages (Sets)
[Diagram: Language 1 = { dog, cat, rat }, Language 2 = { rat, mouse }. Intersection of Language 1 and Language 2 = { rat }.]
Subtraction of Languages (Sets)
[Diagram: Language 1 = { dog, cat, rat }, Language 2 = { rat, mouse }. Language 1 minus Language 2 = { dog, cat }.]
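Because a finite language is just a set of strings, the operations on these slides map directly onto Python's built-in `set` type. The sketch below uses the animal words from the slides; the variable names, and the reading of the first union example as { dog, cat, rat } and { elephant, mouse }, are assumptions.

```python
language_1 = {"dog", "cat", "rat"}
language_2 = {"rat", "mouse"}
language_3 = {"elephant", "mouse"}   # shares no word with language_1

# Duplicates collapse and order is irrelevant, as for any set.
assert {"rat", "rat", "dog"} == {"dog", "rat"}

print(sorted(language_1 | language_2))  # union:        ['cat', 'dog', 'mouse', 'rat']
print(sorted(language_1 & language_2))  # intersection: ['rat']
print(sorted(language_1 - language_2))  # subtraction:  ['cat', 'dog']
print(sorted(language_1 & language_3))  # disjoint languages: []
```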
Languages
• A language is a set of words (= strings).
• Words (strings) are composed of symbols (letters) that are “concatenated” together.
• At another level, words are composed of “morphemes”.
• In most natural languages, we concatenate morphemes together to form whole words.
For sets consisting of words (i.e. for languages), the operation of concatenation is very important.
Concatenation of Languages
Root Language: { work, talk, walk }
Suffix Language: { 0, ing, ed, s }
The concatenation of the Suffix language after the Root language:
{ work, working, worked, works, talk, talking, talked, talks, walk, walking, walked, walks }
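Concatenating two finite languages means pairing every word of the first with every word of the second. A minimal sketch, assuming that `0` on the slide stands for the empty string and using a made-up function name `concatenate`:

```python
def concatenate(first, second):
    """Return every string xy with x taken from `first` and y from `second`."""
    return {x + y for x in first for y in second}

roots = {"work", "talk", "walk"}
suffixes = {"", "ing", "ed", "s"}    # "" plays the role of 0 on the slide

words = concatenate(roots, suffixes)
print(len(words))     # 12
print(sorted(words))
```

Note that concatenation is not commutative: `concatenate(suffixes, roots)` would yield strings such as "ingwork".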
Languages and Networks
[Diagram: Network/Language 1 accepts the roots “talk”, “walk” and “work”. Network/Language 2 accepts the suffixes 0, “s”, “ing” and “ed”. A third network shows the concatenation of Network 1 and Network 2.]
Why is “Finite State” Computing so Interesting?
• Finite-state systems are mathematically elegant, easily manipulated and modifiable.
• Computationally efficient. Usually very compact.
• The programming we linguists do is declarative. We describe the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.
• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.
• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.
Languages, Notations and Machines
[Diagram relating three items: FINITE STATE MACHINE, FINITE STATE LANGUAGE, FINITE STATE NOTATION]