Bridging the Gap: Cutting Edge Technologies Working for Lesser-Resourced Languages
Christian Monson, Ariadna Font Llitjós, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Kathrin Probst
MT Challenges
[Diagram: MT pyramid from Source (e.g. Quechua) to Target (e.g. English)]
• Interlingua: Semantic Analysis, Sentence Planning
• Transfer Rules: Syntactic Parsing, Text Generation
• Direct: SMT, EBMT
• Interlingua and transfer approaches: need human expertise, but high quality
• Direct approaches: need large bilingual corpus, but fast to develop
AVENUE MT Approach
[Diagram: MT pyramid from Source (e.g. Quechua) to Target (e.g. English)]
• Interlingua: Semantic Analysis, Sentence Planning
• Transfer Rules: Syntactic Parsing, Text Generation
• Direct: SMT, EBMT
• AVENUE: Automate Rule Learning
• Leverage Linguistic Structure
• Utilize Bilingual Speakers
• Mapudungun: 900,000 speakers
• Inupiaq: 100s of speakers
• Marcello’s languages?: 100s of speakers
• Quechua: 6 million speakers
Three Sub-Problems
• Morphology Induction
• Initial Syntax Learning
• Syntax Refinement
Morphology Induction: Paradigms Organize Morphology
Paradigm Discovery in 3 Steps
• Search for partial paradigms in a network of candidates.
• Cluster overlapping partial paradigms.
• Filter the clusters, keeping the largest clusters, those most likely to model true paradigms.
[Figure: A portion of a Spanish paradigm candidate network. Each node lists a set of suffixes, the number of stems that occur with all of them, and example stems.]
e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...
azar.e.ido.ieron.ir.ió 1: sal
e.er.erá.ieron.ió 32: deb, padec, romp, ...
e.erá.ido.ieron.ió 28: deb, escog, ...
e.er.ido.ieron.ió 46: deb, parec, recog, ...
e.ido.ieron.irá.ió 28: asist, dirig, ...
e.ido.ieron.ir.ió 39: asist, bat, sal, ...
e.ido.ieron.ió 86: asist, deb, hund, ...
e.erá.ieron.ió 32: deb, padec, ...
er.ido.ieron.ió 58: ascend, ejerc, recog, ...
ido.ieron.ir.ió 44: interrump, sal, ...
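The first step, searching the candidate network, can be illustrated with a small Python sketch. It is a simplified greedy upward search and not the authors' implementation: the function names `build_table`, `stems_of`, and `grow`, the `keep_ratio` threshold, and the toy word list are all assumptions made for illustration. A candidate is a set of suffixes together with the stems that occur with every suffix in the set; the search starts from one suffix and keeps adding the suffix that retains the most stems, stopping when too many stems would be lost.

```python
from collections import defaultdict

# Toy corpus (an assumption for illustration, not data from the slides).
WORDS = [
    "debe", "deber", "debió", "debido",
    "vende", "vender", "vendió", "vendido",
    "rompe", "romper", "rompió",
    "asiste", "asistir", "asistió", "asistido",
    "sufre", "sufrir", "sufrió", "sufrido",
]

def build_table(words):
    """Map every candidate stem to the set of suffixes seen after it."""
    table = defaultdict(set)
    for word in words:
        for cut in range(1, len(word)):  # every split into non-empty stem + suffix
            table[word[:cut]].add(word[cut:])
    return table

def stems_of(suffixes, table):
    """Stems that occur with every suffix in the candidate suffix set."""
    return {stem for stem, seen in table.items() if suffixes <= seen}

def grow(seed, table, keep_ratio=0.5):
    """Greedily climb from a one-suffix candidate to a larger one.

    keep_ratio is a hypothetical threshold: a step is accepted only if the
    larger candidate keeps at least this fraction of the current stems.
    """
    all_suffixes = set().union(*table.values())
    current = frozenset([seed])
    current_stems = stems_of(current, table)
    while True:
        best, best_stems = None, set()
        # Sorted iteration makes tie-breaking deterministic.
        for suffix in sorted(all_suffixes - current):
            candidate = current | {suffix}
            stems = stems_of(candidate, table)
            if len(stems) > len(best_stems):
                best, best_stems = candidate, stems
        if best is None or len(best_stems) < keep_ratio * len(current_stems):
            return current, current_stems
        current, current_stems = best, best_stems

if __name__ == "__main__":
    table = build_table(WORDS)
    for seed in ("er", "ir"):
        suffixes, stems = grow(seed, table)
        print(".".join(sorted(suffixes)), len(stems), sorted(stems))
```

On this toy list, starting from `er` and from `ir` yields two separate four-suffix candidates, one per conjugation class. Clustering and filtering would then operate on the candidates such a search returns; they are left out here to keep the sketch short.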
Morpho Challenge 2007: Unsupervised Morphology Induction Competition
• English
  • 3rd Place Overall
  • Bested the strong baseline Morfessor (Creutz, 2006)
• German
  • 1st Place with Combined ParaMor-Morfessor System