680 likes | 703 Views
This presentation delves into Finite State Automata (FSAs) and English Morphology concepts adapted from Jurafsky & Martin's work. Learn about recognition processes, deterministic and non-deterministic recognition, equivalence of machines, and more. Discover the role of morphology in building words through morphemes, such as stems and affixes. Gain insights into inflectional and derivational morphology in English. Key points covered include non-determinism, compositional machines, and the importance of finite-state methods in language processing tasks.
E N D
Natural Language Processing Finite State Morphology Slides adapted from Jurafsky & Martin, Speech and Language Processing
Outline • Finite State Automata • (English) Morphology • Finite State Transducers
Finite State Automata • Let’s start with the sheep language from Chapter 2 • /baa+!/
Sheep FSA • We can say the following things about this machine • It has 5 states • b, a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions
More Formally • You can specify an FSA by enumerating the following things. • The set of states: Q • A finite alphabet: Σ • A start state • A set of accept/final states • A transition relation that maps QxΣ to Q
Yet Another View If you’re in state 1 and you’re looking at an a, go to state 2 • The guts of FSAs can ultimately be represented as tables
Recognition • Recognition is the process of determining if a string should be accepted by a machine • Or… it’s the process of determining if a string is in the language we’re defining with the machine • Or… it’s the process of determining if a regular expression matches a string • Those all amount the same thing in the end
Recognition • Traditionally, (Turing’s notion) this process is depicted with a tape.
Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer • Until you run out of tape
Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all unambiguous regular languages. • To change the machine, you simply change the table.
Recognition as Search • You can view this algorithm as a trivial kind of state-space search. • States are pairings of tape positions and state numbers. • Operators are compiled into the table • Goal state is a pairing with the end of tape position and a final accept state • It is trivial because?
Non-Determinism • Yet another technique • Epsilon transitions • Key point: these transitions do not examine or advance the tape during recognition
Equivalence • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept
ND Recognition • Two basic approaches (used in all major implementations of regular expressions, see Friedl 2006) • Either take a ND machine and convert it to a D machine and then do recognition with that. • Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).
Non-Deterministic Recognition: Search • In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. • But not all paths directed through the machine for an accept string lead to an accept state. • No paths through the machine lead to an accept state for a string not in the language.
Non-Deterministic Recognition • So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • Failure occurs when all of the possible paths for a given string lead to failure.
Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
Why Bother? • Non-determinism doesn’t get us more formal power and it causes headaches so why bother? • More natural (understandable) solutions • Not always equivalent
Compositional Machines • Formal languages are just sets of strings • Therefore, we can talk about various set operations (intersection, union, concatenation) • This turns out to be a useful exercise
Words • Finite-state methods are particularly useful in dealing with a lexicon • Many devices, most with limited memory, need access to large lists of words • And they need to perform fairly sophisticated tasks with those lists • So we’ll first talk about some facts about words and then come back to computational methods
English Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning-bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions
English Morphology • We can further divide morphology up into two broad classes • Inflectional • Derivational
Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word: • Has the same word class as the original • Serves a grammatical/semantic purpose that is • Different from the original • But is nevertheless transparently related to the original
Nouns and Verbs in English • Nouns are simple • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb
Regulars and Irregulars • It is a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular are used to refer to words that follow the rules and those that don’t
Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut
Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. • Quasi-systematicity • Irregular meaning change • Changes of word class
Derivational Examples • Verbs and Adjectives to Nouns
Derivational Examples • Nouns and Verbs to Adjectives
Morphology and FSAs • We’d like to use the machinery provided by FSAs to capture these facts about morphology • Accept strings that are in the language • Reject strings that are not • And do so in a way that doesn’t require us to in effect list all the words in the language
Start Simple • Regular singular nouns are ok • Regular plural nouns have an -s on the end • Irregulars are ok as is
Derivational Rules If everything is an accept state how do things ever get rejected?
Parsing/Generation vs. Recognition • We can now run strings through these machines to recognize strings in the language • But recognition is usually not quite what we need • Often if we find some string in the language we might like to assign a structure to it (parsing) • Or we might have some structure and we want to produce a surface form for it (production/generation) • Example • From “cats” to “cat +N +PL”
Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”
Applications • The kind of parsing we’re talking about is normally called morphological analysis • It can either be • An important stand-alone component of many applications (spelling correction, information retrieval) • Or simply a link in a chain of further linguistic analysis