ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis
Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin
Monolingual Text -> Unsupervised Morphology Induction -> Morphologically Analyzed Text

Paradigms Organize Inflectional Morphology
Cross-linguistically, languages inflect using paradigms: sets of mutually exclusive cells. Exactly one cell from each paradigm can be filled (by an affix) in a surface word form.

Paradigm Discovery in 3 Steps
• Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. In the network figure, red candidate paradigms are active in the search.
• Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms
• Filter – Improve precision by removing unclustered and unlikely candidates
• Spanish data guided algorithm development and parameter adjustment

1. Recall Centric Search
Each node in the search network lists an affix set, the number of stems it covers, and sample stems:
e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...
azar.e.ido.ieron.ir.ió 1: sal
e.er.erá.ieron.ió 32: deb, padec, romp, ...
e.erá.ido.ieron.ió 28: deb, escog, ...
e.er.ido.ieron.ió 46: deb, parec, recog, ...
e.ido.ieron.irá.ió 28: asist, dirig, ...
e.ido.ieron.ir.ió 39: asist, bat, sal, ...
e.ido.ieron.ió 86: asist, deb, hund, ...
e.erá.ieron.ió 32: deb, padec, ...
er.ido.ieron.ió 58: ascend, ejerc, recog, ...
ido.ieron.ir.ió 44: interrump, sal, ...

2. Cluster Candidate Paradigms
Clustered candidates (number of affixes: affix set):
17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó

3. Filter Unlikely Candidates
Error analysis identified 2 major categories of incorrect candidates:
• Small candidates contain few affixes and cover few types.
• Incorrect morpheme boundary candidates segment too far to the left.
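The bottom-up search of step 1 can be illustrated with a minimal sketch. It assumes a candidate paradigm is just a set of word-final strings plus the stems that occur with all of them, and that the search grows a candidate one affix at a time while enough stems survive. The function names, the `max_suffix_len` limit, and the `min_ratio` stopping threshold are hypothetical choices for illustration; the poster does not state the actual stopping criterion.

```python
from collections import defaultdict

def candidate_stems(words, max_suffix_len=5):
    """Map every candidate suffix to the set of candidate stems it follows."""
    stems_by_suffix = defaultdict(set)
    for w in set(words):
        # Try every split point; the empty suffix (Ø) is allowed.
        for i in range(1, len(w) + 1):
            stem, suffix = w[:i], w[i:]
            if len(suffix) <= max_suffix_len:
                stems_by_suffix[suffix].add(stem)
    return stems_by_suffix

def greedy_search(seed, stems_by_suffix, min_ratio=0.25):
    """Grow a candidate paradigm upward from a single seed suffix.

    At each step, add the suffix that keeps the most stems. Stop when no
    suffix keeps any stem, or when the best one keeps fewer than
    min_ratio of the current stems (an assumed stopping rule).
    """
    affixes = frozenset([seed])
    stems = set(stems_by_suffix[seed])
    while True:
        best, best_stems = None, set()
        for suffix in sorted(stems_by_suffix):  # sorted for determinism
            if suffix in affixes:
                continue
            shared = stems & stems_by_suffix[suffix]
            if len(shared) > len(best_stems):
                best, best_stems = suffix, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return affixes, stems
        affixes = affixes | {best}
        stems = best_stems

words = ["habla", "hablan", "hablar", "habló",
         "canta", "cantan", "cantar", "cantó", "sal"]
table = candidate_stems(words)
affixes, stems = greedy_search("a", table)
print(".".join(sorted(affixes)), len(stems), sorted(stems))
```

On this toy vocabulary the search seeded with `a` grows to the affix set a.an.ar.ó over the stems cant and habl, which mirrors the "affix set, stem count, sample stems" layout of the network nodes listed above.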
More candidate paradigms from the clustering step (number of affixes: affix set):
16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó

Examples of filtered candidates:
• Small candidates: Ø.ipo covers 8 words; Ø.e.iu covers 12 words
• Incorrect morpheme boundaries: iza.izado.izan.izar.izaron.izarán.izó; der.derá.dido.diendo.dieron.dió.día

Segmentation
Example word: llega
• Match the word to be segmented against the clustered affixes
• Replace any matched affix with a new affix from the cluster
• Segment the original word if the corpus contains the hypothesized word form
Hypothesized word forms: lleg aba, lleg aban, lleg ada, …
Segmentation: lleg +a

Evaluation Methodology
• Sample pairs of words that share morphemes
  – Precision: sample pairs sharing a morpheme in the automatic analyses
  – Recall: sample pairs from an answer key of morphologically analyzed words
• Examine the corresponding analyses
  – Precision: count sampled pairs that share a morpheme in the answer key
  – Recall: count sampled pairs that share a morpheme in the automatic analyses

Results
• Morpho Challenge 2007: competition for unsupervised morphology induction algorithms
• English: 3rd place overall; bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm
• German: 1st place with the combined ParaMor-Morfessor system
A Closer Look at ParaMor vs. Morfessor (figure not reproduced)

The Next Steps
• Extend ParaMor to hypothesize more than one morpheme boundary per analysis
• Expand beyond suffixation to other morphological phenomena, prefixes, etc.
• Merge inflection classes of the same paradigm
• Identify morphophonemic changes
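The three segmentation steps (match an affix, substitute another affix from the same cluster, confirm the result in the corpus) can be sketched as below. This is a simplified illustration under stated assumptions: a cluster is a plain set of suffix strings, the corpus is a set of word forms, and the function name `segment` is hypothetical rather than taken from ParaMor.

```python
def segment(word, cluster_affixes, corpus):
    """Return (stem, affix) splits of `word` that the corpus supports.

    A split is kept when swapping the matched affix for some other affix
    in the cluster yields a word form that also occurs in the corpus.
    """
    analyses = []
    # Longest affixes first, so the output order is deterministic.
    for affix in sorted(cluster_affixes, key=lambda a: (-len(a), a)):
        if not word.endswith(affix):
            continue
        stem = word[: len(word) - len(affix)]
        if not stem:
            continue
        others = (a for a in cluster_affixes if a != affix)
        if any(stem + other in corpus for other in others):
            analyses.append((stem, affix))
    return analyses

corpus = {"llega", "llegaba", "llegaban", "llegada", "casa"}
cluster = {"a", "aba", "aban", "ada"}
print(segment("llega", cluster, corpus))  # [('lleg', 'a')]
print(segment("casa", cluster, corpus))   # []
```

Here llega is split as lleg +a because llegaba, llegaban and llegada occur in the toy corpus, while casa stays unsegmented because none of the hypothesized forms is attested.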