Morphology & FSTs Shallow Processing Techniques for NLP Ling570 October 17, 2011
Roadmap • Two-level morphology summary • Unsupervised morphology
Combining FST Lexicon & Rules • Two-level morphological system: ‘Cascade’ • Transducer from Lexicon to Intermediate • Rule transducers from Intermediate to Surface
Integrating the Lexicon • Replace classes with stems
Using the E-insertion FST • (fox, fox): q0, q0, q0, q1, accept • (fox#, fox#): q0, q0, q0, q1, q0, accept • (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept • (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject • (fox^z#, foxz#) ?
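Below is a minimal sketch (my own, not code from the lecture) of checking such a trace programmatically. The transition table is a hypothetical fragment covering only the (fox, fox) case; the real e-insertion transducer has many more arcs and states.

```python
# Sketch: trace a sequence of (lexical, surface) symbol pairs through a
# transducer given as a dict of transitions. Hypothetical fragment only.

def trace(pair_seq, transitions, start="q0", finals=frozenset({"q0", "q1"})):
    """pair_seq: list of (lexical_symbol, surface_symbol) pairs.
    Returns (state_path, 'accept' | 'reject')."""
    state = start
    path = [state]
    for pair in pair_seq:
        state = transitions.get((state, pair))
        if state is None:                  # no matching arc: reject
            return path, "reject"
        path.append(state)
    return path, "accept" if state in finals else "reject"

# Hypothetical fragment: 'other' letters loop at q0; x (one of {z, s, x})
# moves to q1, which is also accepting.
frag = {
    ("q0", ("f", "f")): "q0",
    ("q0", ("o", "o")): "q0",
    ("q0", ("x", "x")): "q1",
}
print(trace([("f", "f"), ("o", "o"), ("x", "x")], frag))
# -> (['q0', 'q0', 'q0', 'q1'], 'accept'), matching the (fox, fox) trace above
```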
Issues • What do you think of creating all the rules for a language by hand? • Time-consuming, complicated • Proposed approach: • Unsupervised morphology induction • Potentially useful for many applications • IR, MT
Unsupervised Morphology • Start from tokenized text (or word frequencies) • talk 60 • talked 120 • walked 40 • walk 30 • Treat as a coding/compression problem • Find the most compact representation of the lexicon • Popular model: MDL (Minimum Description Length) • Smallest total encoding: • Weighted combination of lexicon size & ‘rules’
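One generic way to write the MDL objective (a standard formulation, not necessarily the exact notation used in this course): choose the model M that minimizes the cost of describing the model plus the cost of encoding the corpus D given that model.

```latex
\hat{M} = \operatorname*{arg\,min}_{M} \big[\, L(M) + L(D \mid M) \,\big]
```

Here L(M) is the length of describing the lexicon and rules, and L(D | M) the length of the corpus encoded with the model.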
Approach • Generate initial model: • Base set of words, compute MDL length • Iterate: • Generate a new set of words + some model to create a smaller description size • E.g. for talk, talked, walk, walked • 4 words • 2 words (talk, walk) + 1 affix (-ed) + combination info • 2 words (t,w) + 2 affixes (alk,-ed) + combination info
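A toy illustration (my own sketch, not from the lecture) of why the segmented codings above are more compact: count the characters needed to store the lexicon entries under each coding, ignoring the small extra cost of the combination info.

```python
# Toy description-length comparison for {talk, talked, walk, walked}.
# Counts only the characters in the stored lexicon entries; the cost of
# encoding which stems combine with which affixes is ignored here.

codings = {
    "whole words":             ["talk", "talked", "walk", "walked"],
    "stems + 1 affix":         ["talk", "walk", "ed"],
    "short stems + 2 affixes": ["t", "w", "alk", "ed"],
}

for name, entries in codings.items():
    cost = sum(len(e) for e in entries)
    print(f"{name:25s} {entries}  ->  {cost} characters")

# whole words: 20; stems + 1 affix: 10; short stems + 2 affixes: 7.
# MDL weighs these lexicon savings against the cost of the extra 'rules'.
```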
Successful Applications • Inducing word classes (e.g. N,V) by affix patterns • Unsupervised morphological analysis for MT • Word segmentation in CJK • Word text/sound segmentation in English
Formal Languages • Formal Languages and Grammars • Chomsky hierarchy • Languages and the grammars that accept/generate them • Equivalences • Regular languages • Regular grammars • Regular expressions • Finite State Automata
Finite-State Automata & Transducers • Finite-State Automata: • Deterministic & non-deterministic automata • Equivalence and conversion • Probabilistic & weighted FSAs • Packages and operations: Carmel • FSTs & regular relations • Closures and equivalences • Composition, inversion
FSA/FST Applications • Range of applications: • Parsing • Translation • Tokenization… • Morphology: • Lexicon: cat: N, +Sg; -s: Pl • Morphotactics: N+PL • Orthographic rules: fox + s → foxes • Parsing & Generation
Implementation • Tokenizers • FSA acceptors • FST acceptors/translators • Orthographic rule as FST
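As a rough sketch (an approximation, not a genuine two-level implementation, which would compile the rule into an FST and compose it with the lexicon transducer), the e-insertion orthographic rule can be mimicked with a regular-expression rewrite:

```python
import re

# Rough approximation of the e-insertion rule: insert 'e' between a stem
# ending in s, z, x, ch, or sh and the suffix 's'; '^' marks the morpheme
# boundary and '#' the word end.
def e_insertion(s: str) -> str:
    s = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", s)
    return s.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # -> foxes
print(e_insertion("cat^s#"))   # -> cats (rule does not apply)
```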
Roadmap • Motivation: • LM applications • N-grams • Training and Testing • Evaluation: • Perplexity
Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect ….. • N-gram models: predict the next word from the previous N-1 words • Language models (LMs): • Statistical models of word sequences • Approach: • Build a model of word sequences from a corpus • Given alternative sequences, select the most probable
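A toy bigram sketch of this idea (my own illustration on a tiny hypothetical corpus; real LMs are trained on far more data): count bigrams, then pick the most frequent continuation of a given word.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real LM is trained on millions of words.
corpus = "i'd like to place a collect call please i'd like to place an order".split()

# follow[w] counts which words follow w.
follow = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follow[w1][w2] += 1

def predict(word):
    """Most frequent next word after `word` in this toy bigram model."""
    counts = follow.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict("collect"))   # -> 'call'
print(predict("place"))     # -> 'a' (ties broken by first occurrence)
```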
N-gram LM Applications • Used in • Speech recognition • Spelling correction • Augmentative communication • Part-of-speech tagging • Machine translation • Information retrieval
Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized • Word types: # of distinct words in a corpus • Word tokens: total # of words in a corpus
Corpus Counts • Estimate probabilities by counts in large collections of text/speech • Should we count: • Wordform vs lemma? • Case? Punctuation? Disfluency? • Type vs token?
Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent) • What about: • Disfluencies • main-: fragment • uh: filler (aka filled pause) • Keep or remove depending on the application: fillers can help prediction (uh vs. um)
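The type/token counts above can be reproduced with a deliberately naive whitespace tokenizer (a sketch; real tokenizers handle punctuation, case, and clitics more carefully):

```python
import string

sent = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Naive tokenization: split on whitespace, strip surrounding punctuation.
tokens = [w.strip(string.punctuation) for w in sent.split()]

print("tokens:", len(tokens))       # 16
print("types: ", len(set(tokens)))  # 14 ('the' occurs three times)
```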
LM Task • Training: • Given a corpus of text, learn probabilities of word sequences • Testing: • Given trained LM and new text, determine sequence probabilities, or • Select most probable sequence among alternatives • LM types: • Basic, Class-based, Structured
Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect) • How can we compute? • Relative frequency in a corpus • C(I’d like to place a collect call)/C(I’d like to place a collect) • Issues? • Zero counts: language is productive! • Joint word sequence probability of length N: • Count of all sequences of length N & count of that sequence
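Restating the relative-frequency bullet above as an equation (the maximum-likelihood estimate of a word w given a history h):

```latex
P(w \mid h) \approx \frac{C(h\,w)}{C(h)}
\qquad\text{e.g.}\qquad
P(\text{call} \mid \text{I'd like to place a collect})
  \approx \frac{C(\text{I'd like to place a collect call})}{C(\text{I'd like to place a collect})}
```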
Word Sequence Probability • Notation: • P(Xi=the) written as P(the) • P(w1 w2 w3 … wn) = ? (expanded below) • Compute the probability of a word sequence via the chain rule • Links to word prediction by history
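The chain rule referred to above, written out in its standard form, together with the N-gram approximation that conditions on only the previous N-1 words:

```latex
P(w_1 w_2 \dots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \dots w_{k-1})
\qquad\text{N-gram approximation:}\qquad
P(w_k \mid w_1 \dots w_{k-1}) \approx P(w_k \mid w_{k-N+1} \dots w_{k-1})
```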