Next slide is the Neutral Avenue System Diagram with a Morphology Learning box added
Avenue Overview
[System diagram]
Stages: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement
Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Learning Module (appears twice), Morphology Analyzer, Handcrafted rules, Transfer Rules, Lexical Resources, Run Time Transfer System, Lattice, Translation Correction Tool, Rule Refinement Module
Marked on the slide: Do NOT Use
The next slide is for Ari. It has her sections highlighted but also has the extra box that I added for Morphology Learning
Rule Refinement
[System diagram, with the same stages and components as the Avenue Overview slide]
The Challenge of Morphology
Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers)
Allkütulekefun
Allkütu  -le     -ke        -fu    -n
Listen   -prog.  -habitual  -past  -indic.1sg
I used to listen
Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
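The two tasks can be made concrete on the slide's own example. The sketch below is only an illustration: it assumes the stem and the suffix glosses are already known (they are copied from the gloss above), whereas the goal of the work is to learn them. The names `SUFFIX_FEATURES` and `segment` are hypothetical.

```python
# Suffix glosses copied from the slide; a learned system would have to induce these.
SUFFIX_FEATURES = {
    "le": "prog.",
    "ke": "habitual",
    "fu": "past",
    "n": "indic.1sg",
}

def segment(word, stem, suffix_features):
    """Task 1: strip a known stem, then peel known suffixes off left to right."""
    if not word.startswith(stem):
        raise ValueError("word does not start with the given stem")
    rest = word[len(stem):]
    morphemes = []
    while rest:
        # Try longer suffixes first so a short suffix cannot pre-empt a long one.
        for suffix in sorted(suffix_features, key=len, reverse=True):
            if rest.startswith(suffix):
                morphemes.append(suffix)
                rest = rest[len(suffix):]
                break
        else:
            raise ValueError(f"cannot segment remainder {rest!r}")
    return morphemes

morphemes = segment("Allkütulekefun", "Allkütu", SUFFIX_FEATURES)
# Task 2: map each morpheme onto its feature.
features = [SUFFIX_FEATURES[m] for m in morphemes]
print(morphemes)  # ['le', 'ke', 'fu', 'n']
print(features)   # ['prog.', 'habitual', 'past', 'indic.1sg']
```

A greedy left-to-right match is enough for this one word; real data would need to handle ambiguous segmentations.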
The Challenge of Morphology
Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
Learn these tasks
• unsupervised
• from data
• for any language
Our Approach
Leverage the Natural Structure of Morphology
• Paradigm: set of affixes that interchangeably attach to a set of stems
Example Vocabulary
blame blamed blames roamed roaming roams solve solves solving
Suffixes   Stems
Ø.s.d      blame
Ø.s        blame solve
e.es       blam solv
s          blame roam solve
Suffixes     Stems
me.mes.med   bla
e.es.ed      blam
Ø.s.d        blame
e.es         blam solv
Ø.s          blame solve
me.mes       bla
me.med       bla
e.ed         blam
Ø.d          blame
s.d          blame
mes.med      bla
es.ed        blam
e            blam solv
Ø            blame blames blamed roams roamed roaming solve solves solving
me           bla
s            blame roam solve
es           blam solv
mes          bla
med          bla roa
ed           blam roam
d            blame roame
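The candidate paradigms above can be recomputed from the example vocabulary. The sketch below is one straightforward way to do it, not necessarily how the system itself is implemented: split every word at every position, index the stems by suffix, and intersect the stem sets of a suffix set. The function names are hypothetical.

```python
from collections import defaultdict

VOCABULARY = ["blame", "blamed", "blames", "roamed", "roaming", "roams",
              "solve", "solves", "solving"]
NULL = "Ø"  # the empty suffix, written as on the slide

def suffix_to_stems(vocabulary):
    """Index every stem+suffix split of every word by its suffix."""
    index = defaultdict(set)
    for word in vocabulary:
        for cut in range(1, len(word) + 1):  # stems are never empty
            stem, suffix = word[:cut], word[cut:]
            index[suffix or NULL].add(stem)
    return index

def stems_for(suffixes, index):
    """Stems that occur in the vocabulary with every suffix in the set."""
    return set.intersection(*(index[s] for s in suffixes))

index = suffix_to_stems(VOCABULARY)
print(sorted(stems_for([NULL, "s"], index)))       # ['blame', 'solve']
print(sorted(stems_for(["e", "es"], index)))       # ['blam', 'solv']
print(sorted(stems_for([NULL, "s", "d"], index)))  # ['blame']
```

Adding a suffix to a set can only shrink the intersection, which is why larger suffix sets have fewer stems.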
Spanish Newswire Corpus
• 40,011 Tokens
• 6,975 Types

Level = number of suffixes (Level 5 = 5 suffixes)

Suffixes        Stem Type Count   Stems
a.as.o.os.tro   1                 cas
a.as.o.os       43                african, cas, jurídic, l, ...
a.as.o          59                cas, citad, jurídic, l, ...
a.as.os         50                afectad, cas, jurídic, l, ...
a.o.os          105               impuest, indonesi, italian, jurídic, ...
as.o.os         54                cas, implicad, jurídic, l, ...
a.as            199               huelg, incluid, industri, inundad, ...
as.o            85                intern, jurídic, just, l, ...
o.os            268               human, implicad, indici, indocumentad, ...
a.tro           2                 cas, cen
a.o             214               id, indi, indonesi, inmediat, ...
a.os            134               impedid, impuest, indonesi, inundad, ...
as.os           68                cas, implicad, inundad, jurídic, ...
tro             16                catas, ce, cen, cua, ...
a               1237              huelg, ib, id, iglesi, ...
as              404               huelg, huelguist, incluid, industri, ...
o               1139              hub, hug, human, huyend, ...
os              534               humorístic, human, hígad, impedid, ...

Adjective Inflection Class: the suffix sets built from a, as, o, os
From the spurious suffix “tro”: a.as.o.os.tro, a.tro, tro

Basic Search Procedure
• Increasing Suffix Count
• Decreasing Stem Count
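The slide names the search only by its two directions, so the following is a hedged sketch of one possible greedy reading, run on the English example vocabulary rather than the Spanish corpus. Start from a single suffix, repeatedly add the suffix that keeps the most stems, and stop when too many stems would be lost. The stopping rule and the `min_ratio` threshold are assumptions, not taken from the slides.

```python
from collections import defaultdict

VOCABULARY = ["blame", "blamed", "blames", "roamed", "roaming", "roams",
              "solve", "solves", "solving"]
NULL = "Ø"

def suffix_to_stems(vocabulary):
    """Index every stem+suffix split of every word by its suffix."""
    index = defaultdict(set)
    for word in vocabulary:
        for cut in range(1, len(word) + 1):
            index[word[cut:] or NULL].add(word[:cut])
    return index

def search_up(seed, index, min_ratio=0.6):
    """Greedy upward search: grow the suffix set while enough stems survive.

    min_ratio is a hypothetical stopping threshold: the step is rejected if
    the larger suffix set keeps fewer than min_ratio of the current stems.
    """
    current = (seed,)
    current_stems = set(index[seed])
    while True:
        best = None
        for suffix in sorted(index):  # sorted for deterministic tie-breaking
            if suffix in current:
                continue
            stems = current_stems & index[suffix]
            if best is None or len(stems) > len(best[1]):
                best = (suffix, stems)
        if (best is None or not best[1]
                or len(best[1]) < min_ratio * len(current_stems)):
            return current, current_stems
        current += (best[0],)
        current_stems = best[1]

index = suffix_to_stems(VOCABULARY)
scheme, stems = search_up("s", index)
print(scheme, sorted(stems))  # ('s', 'Ø') ['blame', 'solve']
```

From `s` (3 stems) the search accepts `Ø` (2 stems remain) and rejects `d`, which would leave only `blame`. On a real corpus the threshold would need tuning, and a spurious suffix such as “tro” shows why a check on stem counts matters.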
Examples and Evaluation of Automatically Selected Suffix Sets
Global Suffix Evaluation
• Precision: 0.506
• Recall: 0.517
• F1: 0.511
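As a consistency check, the reported F1 matches the harmonic mean of the reported precision and recall. This assumes the standard balanced F1 definition, which the slide does not spell out.

```python
precision = 0.506
recall = 0.517

# Balanced F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # 0.511
```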
Next Steps for Morphology Induction
• Improve the Quality of Induced Paradigms (Current Work)
• Convert Paradigms into a Segmenter (Soon)
• Learn Mappings from Morphemes to Features (Future Goal)
Mapudungun • Indigenous Language of Chile and Argentina • ~ 1 Million Mapuche Speakers
Collaboration
• Mapuche Language Experts
  • Universidad de la Frontera (UFRO)
  • Instituto de Estudios Indígenas (IEI), Institute for Indigenous Studies
• Chilean Funding
  • Chilean Ministry of Education (Mineduc)
  • Bilingual and Multicultural Education Program
Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, Salvador Cañulef, Carolina Huenchullan Arrúe, Claudio Millacura Salas
Accomplishments
• Corpora Collection
  • Spoken Corpus
    • Collected: Luis Caniupil Huaiquiñir
    • Medical Domain
    • 3 of 4 Mapudungun Dialects
      • 120 hours of Nguluche
      • 30 hours of Lafkenche
      • 20 hours of Pwenche
    • Transcribed in Mapudungun
    • Translated into Spanish
  • Written Corpus
    • ~ 200,000 words
    • Bilingual Mapudungun – Spanish
    • Historical and newspaper text

nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
Accomplishments
• Developed at UFRO
  • Bilingual Dictionary with Examples
    • 1,926 entries
  • Spelling Corrected Mapudungun Word List
    • 117,003 fully-inflected word forms
  • Segmented Word List
    • 15,120 forms
    • Stems translated into Spanish
Accomplishments
• Developed at LTI using Mapudungun language resources from UFRO
  • Spelling Checker
    • Integrated into OpenOffice
  • Hand-built Morphological Analyzer
  • Prototype Machine Translation Systems
    • Rule-Based
    • Example-Based
  • LenguasAmerindias.org