Linguistic Structure and Bilingual Informants Help Induce Machine Translation of Lesser-Resourced Languages
Christian Monson, Ariadna Font Llitjós, Vamshi Ambati, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Katharina Probst

[Figure: parse trees (S, NP, VP, Det, N, V) for English "John ate an apple" aligned with Hindi "John ne ek seb khaya"]

• SL: pu püchükeche awkantu y kiñe awkantun
• TL: niños jugaron un juego
• AL: ((1,1),(2,1),(3,2),(4,2),(5,3),(6,4))
• Action 1: add (W1=los)
• C_TL: los niños jugaron un juego
• CAL: ((1,2),(2,2),(3,3),(4,3),(5,4),(6,5))
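The action in the example above adds a word at the first position of the translation, which moves every target-side index in the alignment up by one. A minimal sketch of that bookkeeping follows. The function name `insert_target_word` and the representation of the alignment as a list of (source index, target index) pairs are assumptions made for illustration, not the authors' implementation.

```python
def insert_target_word(target, alignment, position, word):
    """Insert `word` at 1-based `position` in `target` and shift the alignment.

    `alignment` is assumed to be a list of (source_index, target_index)
    pairs, both 1-based, as in the AL line of the example.
    """
    new_target = target[:position - 1] + [word] + target[position - 1:]
    # Target indices at or after the insertion point move right by one.
    new_alignment = [
        (s, (t + 1) if t >= position else t) for s, t in alignment
    ]
    return new_target, new_alignment


tl = "niños jugaron un juego".split()
al = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4)]
c_tl, c_al = insert_target_word(tl, al, 1, "los")
print(" ".join(c_tl))
print(c_al)
```

With the values from the example, the shifted pairs agree with the CAL line.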
Monolingual Text → Unsupervised Morphology Induction → Morphologically Analyzed Text

Paradigms Organize Inflectional Morphology
Cross-linguistically, languages inflect using paradigms: sets of mutually exclusive cells. Exactly one cell from each paradigm can be filled (by an affix) in a surface word form.

Paradigm Discovery in 3 Steps
• Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. In the network figure, red candidate paradigms are active in search
• Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms
• Filter – Improve precision by removing unclustered and unlikely candidates
• Spanish data guided algorithm development and parameter adjustment

1. Recall Centric Search
e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...
azar.e.ido.ieron.ir.ió 1: sal
e.er.erá.ieron.ió 32: deb, padec, romp, ...
e.erá.ido.ieron.ió 28: deb, escog, ...
e.er.ido.ieron.ió 46: deb, parec, recog, ...
e.ido.ieron.irá.ió 28: asist, dirig, ...
e.ido.ieron.ir.ió 39: asist, bat, sal, ...
e.ido.ieron.ió 86: asist, deb, hund, ...
e.erá.ieron.ió 32: deb, padec, ...
er.ido.ieron.ió 58: ascend, ejerc, recog, ...
ido.ieron.ir.ió 44: interrump, sal, ...

2. Cluster Candidate Paradigms
17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó

3. Filter Unlikely Candidates
Error analysis identified 2 major categories of incorrect candidates:
• Small Candidates contain few affixes and cover few types
• Incorrect Morpheme Boundary Candidates segment too far to the left
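Each candidate in the search step is written as a dot-separated suffix set followed by the number of stems that occur with every suffix in that set. A minimal sketch of how such a count could be computed from a word list follows. The function name `stems_for` and the toy vocabulary are assumptions for illustration; the poster does not show the actual search code.

```python
def stems_for(suffixes, vocabulary):
    """Return the stems that occur in `vocabulary` with every suffix given."""
    stem_sets = []
    for suffix in suffixes:
        # Strip the suffix from every word that ends in it, keeping a
        # non-empty stem.
        stem_sets.append({
            word[:len(word) - len(suffix)]
            for word in vocabulary
            if word.endswith(suffix) and len(word) > len(suffix)
        })
    return set.intersection(*stem_sets)


# Toy vocabulary (an assumption, not the corpus used on the poster).
vocabulary = {
    "debe", "debido", "debieron", "debió", "deber",
    "asiste", "asistido", "asistieron", "asistió", "asistir",
}

shared = stems_for(["e", "ido", "ieron", "ió"], vocabulary)
with_er = stems_for(["e", "er", "ido", "ieron", "ió"], vocabulary)
with_ir = stems_for(["e", "ido", "ieron", "ir", "ió"], vocabulary)
print(sorted(shared), sorted(with_er), sorted(with_ir))
```

In this toy data the four-suffix candidate covers both stems, while adding er or ir keeps only one stem each. The same trade-off appears in the network above, where e.ido.ieron.ió covers 86 stems and its larger neighbours cover fewer.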
Small candidates:
• Ø.ipo covers 8 words
• Ø.e.iu covers 12 words

Incorrect morpheme boundary candidates:
• iza.izado.izan.izar.izaron.izarán.izó
• der.derá.dido.diendo.dieron.dió.día

Clustered candidates:
16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó

Segmentation
• Match word to segment against clustered affixes
• Replace any matched affix with new affix from cluster
• Segment the original word, if the corpus contains the hypothesized word form
Example: llega; hypothesized forms lleg aba, lleg aban, lleg ada, …; segmentation lleg +a

Evaluation Methodology
• Sample pairs of words that share morphemes
  • Precision: Sample pairs sharing a morpheme in the automatic analyses
  • Recall: Sample pairs from an answer key of morphologically analyzed words
• Examine corresponding analyses
  • Precision: Count sampled pairs that share a morpheme in the answer key
  • Recall: Count sampled pairs that share a morpheme in the automatic analyses

Results
• Morpho Challenge 2007: competition for unsupervised morphology induction algorithms
• English: 3rd Place Overall; bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm
• German: 1st Place with Combined ParaMor-Morfessor System
[Figure: A Closer Look at ParaMor vs. Morfessor]

The Next Steps
• Extend ParaMor to hypothesize more than one morpheme boundary per analysis
• Expand beyond suffixation to other morphological phenomena, prefixes, etc.
• Merge inflection classes of the same paradigm
• Identify morphophonemic changes
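The three segmentation bullets above (match an affix, substitute another affix from the same cluster, keep the split if the resulting form is attested) can be sketched as follows. The function name `segment`, the small affix cluster, and the toy corpus are assumptions for illustration. The poster does not say how many substituted forms must be attested, so this sketch assumes one is enough.

```python
def segment(word, cluster, corpus):
    """Return (stem, suffix) splits of `word` supported by one affix cluster."""
    analyses = []
    for suffix in cluster:
        # Step 1: match the word against a clustered affix.
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        stem = word[:len(word) - len(suffix)]
        # Step 2: replace the matched affix with the other affixes.
        hypothesized = [stem + other for other in cluster if other != suffix]
        # Step 3: segment only if some hypothesized form is in the corpus
        # (assumption: a single attested form suffices).
        if any(form in corpus for form in hypothesized):
            analyses.append((stem, suffix))
    return analyses


cluster = ["a", "aba", "aban", "ada"]  # part of a clustered paradigm
corpus = {"llega", "llegaba", "llegaban", "llegada", "casa"}  # toy corpus

print(segment("llega", cluster, corpus))
print(segment("casa", cluster, corpus))
```

For llega the only supported split is lleg plus a, as in the example on the poster. For casa no substituted form is attested in the toy corpus, so the word is left unsegmented.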