Morphology & FSTs Shallow Processing Techniques for NLP Ling570 October 17, 2011
Roadmap • Two-level morphology summary • Unsupervised morphology
Combining FST Lexicon & Rules • Two-level morphological system: ‘Cascade’ • Transducer from Lexicon to Intermediate • Rule transducers from Intermediate to Surface
Integrating the Lexicon • Replace classes with stems
Using the E-insertion FST • (fox, fox): q0, q0, q0, q1, accept • (fox#, fox#): q0, q0, q0, q1, q0, accept • (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept • (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject • (fox^z#, foxz#) ?
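Below is a minimal sketch (my own, not code from the lecture) of checking such a trace programmatically. The transition table is a hypothetical fragment covering only the (fox, fox) case; the real e-insertion transducer has many more arcs and states.

```python
# Sketch: trace a sequence of (lexical, surface) symbol pairs through a
# transducer given as a dict of transitions. Hypothetical fragment only.

def trace(pair_seq, transitions, start="q0", finals=frozenset({"q0", "q1"})):
    """pair_seq: list of (lexical_symbol, surface_symbol) pairs.
    Returns (state_path, 'accept' | 'reject')."""
    state = start
    path = [state]
    for pair in pair_seq:
        state = transitions.get((state, pair))
        if state is None:                  # no matching arc: reject
            return path, "reject"
        path.append(state)
    return path, "accept" if state in finals else "reject"

# Hypothetical fragment: 'other' letters loop at q0; x (one of {z, s, x})
# moves to q1, which is also accepting.
frag = {
    ("q0", ("f", "f")): "q0",
    ("q0", ("o", "o")): "q0",
    ("q0", ("x", "x")): "q1",
}
print(trace([("f", "f"), ("o", "o"), ("x", "x")], frag))
# -> (['q0', 'q0', 'q0', 'q1'], 'accept'), matching the (fox, fox) trace above
```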
Issues • What do you think of creating all the rules for a language by hand? • Time-consuming, complicated • Proposed approach: • Unsupervised morphology induction • Potentially useful for many applications • IR, MT
Unsupervised Morphology • Start from tokenized text (or word frequencies) • talk 60 • talked 120 • walked 40 • walk 30 • Treat as a coding/compression problem • Find the most compact representation of the lexicon • Popular model: MDL (Minimum Description Length) • Smallest total encoding: • Weighted combination of lexicon size & ‘rules’
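One generic way to write the MDL objective (a standard formulation, not necessarily the exact notation used in this course): choose the model M that minimizes the cost of describing the model plus the cost of encoding the corpus D given that model.

```latex
\hat{M} = \operatorname*{arg\,min}_{M} \big[\, L(M) + L(D \mid M) \,\big]
```

Here L(M) is the length of describing the lexicon and rules, and L(D | M) the length of the corpus encoded with the model.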
Approach • Generate initial model: • Base set of words, compute MDL length • Iterate: • Generate a new set of words + some model to create a smaller description size • E.g. for talk, talked, walk, walked • 4 words • 2 words (talk, walk) + 1 affix (-ed) + combination info • 2 words (t,w) + 2 affixes (alk,-ed) + combination info
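A toy illustration (my own sketch, not from the lecture) of why the segmented codings above are more compact: count the characters needed to store the lexicon entries under each coding, ignoring the small extra cost of the combination info.

```python
# Toy description-length comparison for {talk, talked, walk, walked}.
# Counts only the characters in the stored lexicon entries; the cost of
# encoding which stems combine with which affixes is ignored here.

codings = {
    "whole words":             ["talk", "talked", "walk", "walked"],
    "stems + 1 affix":         ["talk", "walk", "ed"],
    "short stems + 2 affixes": ["t", "w", "alk", "ed"],
}

for name, entries in codings.items():
    cost = sum(len(e) for e in entries)
    print(f"{name:25s} {entries}  ->  {cost} characters")

# whole words: 20; stems + 1 affix: 10; short stems + 2 affixes: 7.
# MDL weighs these lexicon savings against the cost of the extra 'rules'.
```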
Successful Applications • Inducing word classes (e.g. N,V) by affix patterns • Unsupervised morphological analysis for MT • Word segmentation in CJK • Word text/sound segmentation in English
Formal Languages • Formal Languages and Grammars • Chomsky hierarchy • Languages and the grammars that accept/generate them • Equivalences • Regular languages • Regular grammars • Regular expressions • Finite State Automata
Finite-State Automata & Transducers • Finite-State Automata: • Deterministic & non-deterministic automata • Equivalence and conversion • Probabilistic & weighted FSAs • Packages and operations: Carmel • FSTs & regular relations • Closures and equivalences • Composition, inversion
FSA/FST Applications • Range of applications: • Parsing • Translation • Tokenization… • Morphology: • Lexicon: cat: N, +Sg; -s: Pl • Morphotactics: N+PL • Orthographic rules: fox + s → foxes • Parsing & Generation
Implementation • Tokenizers • FSA acceptors • FST acceptors/translators • Orthographic rule as FST
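As a rough sketch (an approximation, not a genuine two-level implementation, which would compile the rule into an FST and compose it with the lexicon transducer), the e-insertion orthographic rule can be mimicked with a regular-expression rewrite:

```python
import re

# Rough approximation of the e-insertion rule: insert 'e' between a stem
# ending in s, z, x, ch, or sh and the suffix 's'; '^' marks the morpheme
# boundary and '#' the word end.
def e_insertion(s: str) -> str:
    s = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", s)
    return s.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # -> foxes
print(e_insertion("cat^s#"))   # -> cats (rule does not apply)
```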
Roadmap • Motivation: • LM applications • N-grams • Training and Testing • Evaluation: • Perplexity
Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect ….. • N-gram models: predict the next word from the previous N-1 words • Language models (LMs): • Statistical models of word sequences • Approach: • Build a model of word sequences from a corpus • Given alternative sequences, select the most probable
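A toy bigram sketch of this idea (my own illustration on a tiny hypothetical corpus; real LMs are trained on far more data): count bigrams, then pick the most frequent continuation of a given word.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real LM is trained on millions of words.
corpus = "i'd like to place a collect call please i'd like to place an order".split()

# follow[w] counts which words follow w.
follow = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follow[w1][w2] += 1

def predict(word):
    """Most frequent next word after `word` in this toy bigram model."""
    counts = follow.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict("collect"))   # -> 'call'
print(predict("place"))     # -> 'a' (ties broken by first occurrence)
```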
N-gram LM Applications • Used in • Speech recognition • Spelling correction • Augmentative communication • Part-of-speech tagging • Machine translation • Information retrieval
Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized • Word types: # of distinct words in a corpus • Word tokens: total # of words in a corpus
Corpus Counts • Estimate probabilities by counts in large collections of text/speech • Should we count: • Wordform vs lemma? • Case? Punctuation? Disfluency? • Type vs token?
Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent) • What about: • Disfluencies • main-: fragment • uh: filler (aka filled pause) • Keep or remove depending on the application: fillers can help prediction (uh vs. um)
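The type/token counts above can be reproduced with a deliberately naive whitespace tokenizer (a sketch; real tokenizers handle punctuation, case, and clitics more carefully):

```python
import string

sent = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Naive tokenization: split on whitespace, strip surrounding punctuation.
tokens = [w.strip(string.punctuation) for w in sent.split()]

print("tokens:", len(tokens))       # 16
print("types: ", len(set(tokens)))  # 14 ('the' occurs three times)
```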
LM Task • Training: • Given a corpus of text, learn probabilities of word sequences • Testing: • Given trained LM and new text, determine sequence probabilities, or • Select most probable sequence among alternatives • LM types: • Basic, Class-based, Structured
Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect) • How can we compute? • Relative frequency in a corpus • C(I’d like to place a collect call)/C(I’d like to place a collect) • Issues? • Zero counts: language is productive! • Joint word sequence probability of length N: • Count of all sequences of length N & count of that sequence
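Restating the relative-frequency bullet above as an equation (the maximum-likelihood estimate of a word w given a history h):

```latex
P(w \mid h) \approx \frac{C(h\,w)}{C(h)}
\qquad\text{e.g.}\qquad
P(\text{call} \mid \text{I'd like to place a collect})
  \approx \frac{C(\text{I'd like to place a collect call})}{C(\text{I'd like to place a collect})}
```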
Word Sequence Probability • Notation: • P(Xi=the) written as P(the) • P(w1 w2 w3 … wn) = ? (expanded below) • Compute the probability of a word sequence via the chain rule • Links to word prediction by history
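The chain rule referred to above, written out in its standard form, together with the N-gram approximation that conditions on only the previous N-1 words:

```latex
P(w_1 w_2 \dots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \dots w_{k-1})
\qquad\text{N-gram approximation:}\qquad
P(w_k \mid w_1 \dots w_{k-1}) \approx P(w_k \mid w_{k-N+1} \dots w_{k-1})
```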