Towards unsupervised induction of morphophonological rules
Erwin Chan, University of Pennsylvania
Morphochallenge workshop, 19 Sept 2007
Goals of unsupervised morphology induction
1. Provide an analysis of the input data
2. Provide an analyzer for unseen data
Key task: generalize the analysis of the input data by inducing phonological characteristics
Example: inducing phonology (English plural nouns)
1. Input corpus
   processes witnesses matches hatches maids ferns mates
2. Induce segmentation
   process.es witness.es match.es hatch.es maid.s fern.s mate.s
3. Induce phonology
   es: ends in ch or sh
   s: other characters
4. Apply to novel words
   bench.es fate.s foe.s wish.es
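Steps 2 to 4 of the example can be sketched in a few lines. This is only an illustration, not the system described in the talk: it assumes the segmentation is already given in dot notation, and it conditions the suffix on the final character of the stem alone, which is coarser than the contexts listed above. The function names `induce_contexts` and `pluralize` are hypothetical.

```python
from collections import defaultdict

# Step 2 output: segmented words in "stem.suffix" form (from the example).
segmented = ["process.es", "witness.es", "match.es", "hatch.es",
             "maid.s", "fern.s", "mate.s"]

def induce_contexts(words):
    """Map each suffix to the set of stem-final characters it was seen after."""
    contexts = defaultdict(set)
    for word in words:
        stem, suffix = word.split(".")
        contexts[suffix].add(stem[-1])
    return dict(contexts)

def pluralize(stem, contexts, default="s"):
    """Attach the suffix whose observed contexts include the stem's last character."""
    for suffix, finals in contexts.items():
        if stem[-1] in finals:
            return f"{stem}.{suffix}"
    # Assumption: fall back to the default suffix for unseen final characters.
    return f"{stem}.{default}"

contexts = induce_contexts(segmented)
novel = [pluralize(w, contexts) for w in ["bench", "fate", "foe", "wish"]]
print(novel)
```

On the seven input words this learns that es follows stems ending in s or h, and s follows stems ending in d, n or e, which is enough to reproduce the four novel forms in step 4.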
Base-and-transforms model of morphological paradigms
Apply transforms to base forms to generate inflections
[Figure: Lexemes 1-3, each with a base form (base 1-3) and the shared transforms t1-t5 generating its inflections]
Base forms
• The base form serves as the lexical entry for all inflections of a lexeme
  e.g. the base of {help, helps, helping, helped} is help
• Same fine-grained POS type for all lexemes
  e.g. “nominative singular” for all nouns
Transforms
• A transform generates an inflected form from a base form
• Format: ( A, B )
  A, B: simple regular expressions
  A: characters in the base to replace
  B: characters that replace them in the inflected form
Transform examples
Base form    Inflected    Transform
eat          eating       ( $, ing )
time         times        ( $, s )
time         timing       ( e, ing )
hang         hung         ( *a*, *u* )   non-concatenative
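The four examples can be checked with a small function that applies a transform to a base form. This is a sketch under stated assumptions, not the talk's implementation: it reads "$" as the end of the word (nothing removed or nothing added) and "*x*" as a word-internal substring, and the name `apply_transform` is hypothetical.

```python
def apply_transform(base, transform):
    """Apply a transform (A, B) to a base form; return None if A does not match."""
    a, b = transform
    if a.startswith("*") and a.endswith("*"):
        # Non-concatenative case, e.g. (*a*, *u*): replace an internal substring.
        old, new = a.strip("*"), b.strip("*")
        return base.replace(old, new, 1) if old in base else None
    # Suffixal case. Assumption: "$" stands for the empty string at the word end.
    old = "" if a == "$" else a
    new = "" if b == "$" else b
    if not base.endswith(old):
        return None
    return base[:len(base) - len(old)] + new

examples = [("eat", ("$", "ing")), ("time", ("$", "s")),
            ("time", ("e", "ing")), ("hang", ("*a*", "*u*"))]
for base, transform in examples:
    print(base, transform, "->", apply_transform(base, transform))
```

Each call reproduces the corresponding entry in the Inflected column; a transform whose A side does not match the base returns None rather than guessing.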
Comparison to phonological rules
• Standard rewrite rule: A → B / C _ D
  1. A → B: rewrite operation
  2. C _ D: phonological context of application
• A transform is an ungeneralized rule: A → B / { set of base forms }
• Future work: induce phonological rules
  Learn generalized phonological properties of base forms
Compare with stem-suffix model
• Stem-suffix
  saves = save + s
  saving = sav + ing
  Drawback: multiple lexical representations (save, sav) for one lexeme
• Base-transform
  saves = save + ( $,s )
  saving = save + ( e,ing )
Limitations of model
• Simple morphotactic structure:
  • assumes one suffix
  • a word is either a base form, or inflected from a base form
• Does not account for:
  • agglutination
  • compounds
  • prefixing
  • irregulars, suppletion
Distribution of morphological forms
• What information is available in corpora for learning?
• Is there structure within the distribution of morphological forms that a learner can exploit?
• Examine annotated corpora for several languages
Spanish newswire verbs: sparse data
[Figure: log(freq) plotted by lemma and inflection]
Distribution of inflectional categories
[Figure: number of word types per inflection (Slovene, 2.5 M); roughly Zipfian]
High frequency of base form
The most frequent inflection (in types) often matches intuitions about which inflection a base form should be
Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg
Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act
Spanish: A.Sg, N.Sg, V.Inf
Goals of induction algorithm
• Select words from the corpus to be base forms
• Formulate transforms
Technique: take advantage of the high type frequency of the base inflectional category
Start state: Transforms = {}; all words unmodeled
End state: Transforms = {($,s), ($,’s), …}; words divided into base forms, inflected forms, and unmodeled
Greedy algorithm
At each iteration,
• construct potential transforms
• add the transform(s) that account for the most data
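The loop above can be sketched as follows. This is a toy illustration under several assumptions that the slides do not specify: candidates are limited to suffix transforms of at most three characters, the inflected form is assumed to be strictly longer than its base (a stand-in for a proper direction test), "accounts for most data" is read as the number of word pairs explained, and ties go to the shorter transform. The word list is a made-up toy vocabulary, not the Morphochallenge data, and all function names are hypothetical.

```python
from collections import defaultdict

def suffix_candidates(words, max_len=3):
    """Map each candidate suffix transform (A, B) to the word pairs it relates."""
    pairs = defaultdict(set)
    for base in words:
        for cut in range(max(1, len(base) - max_len), len(base) + 1):
            stem, a = base[:cut], base[cut:]
            for other in words:
                b = other[cut:]
                # Assumption: the inflected form is strictly longer than the base.
                if len(other) > len(base) and other.startswith(stem) and len(b) <= max_len:
                    pairs[(a or "$", b)].add((base, other))
    return pairs

def greedy_induce(words, iterations=3):
    candidates = suffix_candidates(words)
    transforms, bases, inflected = [], set(), set()

    def explained(t):
        # Assumption: a new inflected form must be unmodeled so far, and a
        # base must not already be modeled as an inflected form.
        return {(b, i) for b, i in candidates[t]
                if i not in bases and i not in inflected and b not in inflected}

    for _ in range(iterations):
        remaining = [t for t in candidates if t not in transforms]
        if not remaining:
            break
        # Most pairs first; ties go to the shorter transform (arbitrary choice).
        best = max(remaining,
                   key=lambda t: (len(explained(t)), -len(t[0]) - len(t[1]), t))
        gained = explained(best)
        if not gained:
            break
        transforms.append(best)
        bases.update(b for b, _ in gained)
        inflected.update(i for _, i in gained)
    unmodeled = set(words) - bases - inflected
    return transforms, bases, inflected, unmodeled

# Toy vocabulary for illustration only.
words = ["walk", "walks", "walking", "talk", "talks", "talking",
         "time", "times", "timing", "jump", "jumps"]
transforms, bases, inflected, unmodeled = greedy_induce(words)
print(transforms)
```

On this toy vocabulary the loop adds ( $, s ) first because it relates the most pairs, then ( $, ing ), then ( e, ing ), leaving no word unmodeled.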
Sources of words for transform
[Figure: mapping from the word sets of the current grammar (base, inflected, unmodeled) to the base and inflected sets of a new transform]
Choose direction of transform
Table for ( $, s ): Base greater: 3750, Inflected greater: 817
Choose ( $, s ) instead of ( s, $ )
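The direction choice amounts to comparing two counts. The sketch below assumes one plausible reading of the table, namely that "Base greater" counts the word pairs in which the candidate base is more frequent in the corpus than the candidate inflected form; the slides do not spell this out. The function names are hypothetical, and the toy frequencies are made up for illustration.

```python
def count_directions(pairs, freq):
    """Count pairs where the first word is more frequent, and where the second is."""
    first_greater = sum(1 for x, y in pairs if freq[x] > freq[y])
    second_greater = sum(1 for x, y in pairs if freq[y] > freq[x])
    return first_greater, second_greater

def choose_direction(transform, base_greater, inflected_greater):
    """Keep (A, B) if the base side wins the comparison, else flip to (B, A)."""
    a, b = transform
    return (a, b) if base_greater >= inflected_greater else (b, a)

# Counts reported on the slide for ( $, s ).
chosen = choose_direction(("$", "s"), 3750, 817)
print(chosen)

# Made-up toy frequencies, for illustration only (not from the talk).
toy_freq = {"walk": 50, "walks": 12, "fern": 3, "ferns": 7}
toy_pairs = [("walk", "walks"), ("fern", "ferns")]
toy_counts = count_directions(toy_pairs, toy_freq)
print(toy_counts)
```

With the slide's counts the base side wins clearly, so ( $, s ) is kept rather than ( s, $ ).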
Morphochallenge English data
• High number of word types ( ~250,000 ) leads to spurious transforms
• ( $, a ): (music, musica) (naam, naama) (nucci, nuccia) (retin, retina) (mash, masha) (gab, gaba)
• ( $, o ): (rutili, rutilio) (lazar, lazaro) (vern, verno) (berk, berko) (rikky, rikkyo) (economic, economico)
Summary
• Base-and-transforms model of morphological paradigms
  • First step towards learning morphophonological rules
  • More linguistically satisfying than stem-and-suffix
• Algorithm:
  • learn inventory of base forms
  • learn transforms (base-specific rules)
• Exploits high frequency of the base inflectional category
More slides available…
• Longer version of this presentation
• Base forms simplify POS induction
• Different system: transforms in parallel
• Slovene, Spanish