The Challenge of Morphology
Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers)

Allkütulekefun

Allkütu   -le      -ke         -fu     -n
Listen    -prog.   -habitual   -past   -indic.1sg
"I used to listen"

Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
• Learn these tasks
  • unsupervised
  • from data
  • for any language
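To make the two tasks concrete, here is a minimal Python sketch that segments the slide's example word and maps each morpheme onto its feature label. The suffix and feature tables are copied from the slide, but the dictionary layout, the function names `segment` and `gloss`, and the greedy right-to-left stripping rule are all assumptions made for illustration. It also relies on a hand-written suffix list, which is exactly what the unsupervised approach is meant to learn from data instead.

```python
# Feature labels copied from the slide; the dictionary layout is an assumption.
STEM_GLOSSES = {"allkütu": "listen"}
SUFFIX_FEATURES = {"le": "prog.", "ke": "habitual", "fu": "past", "n": "indic.1sg"}


def segment(word, suffixes):
    """Toy segmenter: repeatedly strip the longest known suffix from the right."""
    rest, stripped = word.lower(), []
    while True:
        # Never strip the whole remaining string: something must be left as a stem.
        matches = [s for s in suffixes if rest.endswith(s) and len(s) < len(rest)]
        if not matches:
            break
        longest = max(matches, key=len)
        stripped.append(longest)
        rest = rest[: -len(longest)]
    return [rest] + stripped[::-1]


def gloss(word):
    """Pair each morpheme with its feature label ('?' for an unknown stem)."""
    stem, *suffixes = segment(word, SUFFIX_FEATURES)
    pairs = [(stem, STEM_GLOSSES.get(stem, "?"))]
    pairs += [(s, SUFFIX_FEATURES[s]) for s in suffixes]
    return pairs


for morpheme, feature in gloss("Allkütulekefun"):
    print(f"{morpheme:<10}{feature}")
```

Greedy stripping happens to work for this one word; on real text it would over-segment whenever a stem ends in a string that looks like a suffix.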
Leverage the Natural Structure of Morphology
• Paradigm
  • Set of affixes that interchangeably attach to a set of stems
• English Example
  • Regular Verbs: Ø.s.ing.ed
  • Regular Adj: Ø.er.est
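The paradigm definition can be read as a small data structure: a dot-separated suffix set that applies to every stem in a stem set. The sketch below expands the two English paradigms from the slide; the function name `expand` and the sample stems `roam`, `walk`, and `tall` are illustrative assumptions, and `Ø` is treated as the empty suffix.

```python
NULL = "Ø"  # the slide's symbol for the empty (null) suffix


def expand(paradigm, stems):
    """Attach every suffix of a dot-separated paradigm to every stem."""
    suffixes = ["" if s == NULL else s for s in paradigm.split(".")]
    return {stem: [stem + s for s in suffixes] for stem in stems}


print(expand("Ø.s.ing.ed", ["roam", "walk"]))
print(expand("Ø.er.est", ["tall"]))
```

Plain concatenation ignores spelling changes (for example `blame` + `ing`), which is one reason the example vocabulary later yields competing analyses such as `blam` + `e.es.ed`.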
Example Vocabulary blame blamed blames roamed roaming roams solve solves solving
Schemes built from the Example Vocabulary

Suffixes      Stems
me.mes.med    bla
e.es.ed       blam
Ø.s.d         blame
e.es          blam, solv
Ø.s           blame, solve
me.mes        bla
me.med        bla
e.ed          blam
Ø.d           blame
s.d           blame
mes.med       bla
es.ed         blam
Ø             blame, blames, blamed, roams, roamed, roaming, solve, solves, solving
e             blam, solv
me            bla
s             blame, roam, solve
es            blam, solv
mes           bla
med           bla, roa
ed            blam, roam
d             blame, roame
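The schemes for the example vocabulary can be reproduced with a short sketch: split every word at every position, record which suffixes each candidate stem takes, and then look up the stems that take all suffixes of a given set. This is an illustrative reconstruction under that assumption, not necessarily how the original system is implemented; `stem_to_suffixes` and `scheme_stems` are hypothetical names.

```python
from collections import defaultdict

NULL = "Ø"  # stands for the empty suffix, as on the slides

VOCAB = ["blame", "blamed", "blames", "roamed", "roaming",
         "roams", "solve", "solves", "solving"]


def stem_to_suffixes(vocab):
    """Map every candidate stem to the set of suffixes it takes in vocab."""
    table = defaultdict(set)
    for word in vocab:
        # Every non-empty prefix is a candidate stem; the remainder is its suffix.
        for i in range(1, len(word) + 1):
            table[word[:i]].add(word[i:] or NULL)
    return table


def scheme_stems(table, suffixes):
    """Return the stems that occur with every suffix in a dot-separated set."""
    wanted = set(suffixes.split("."))
    return sorted(stem for stem, taken in table.items() if wanted <= taken)


table = stem_to_suffixes(VOCAB)
for suffixes in ["Ø.s.d", "e.es", "Ø.s", "s", "med", "d"]:
    print(suffixes, scheme_stems(table, suffixes))
```

Because every split point is considered, the table also contains implausible stems such as `bla` and `roame`; choosing among the competing schemes is the job of the later search step.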
Spanish Newswire Corpus
• 40,011 Tokens
• 6,975 Types

Level 5 = 5 suffixes (the top row, a.as.o.os.tro)

Suffixes        Stem Type Count   Stems
a.as.o.os.tro   1                 cas
a.as.o.os       43                african, cas, jurídic, l, ...
a.as.o          59                cas, citad, jurídic, l, ...
a.as.os         50                afectad, cas, jurídic, l, ...
a.o.os          105               impuest, indonesi, italian, jurídic, ...
as.o.os         54                cas, implicad, jurídic, l, ...
a.as            199               huelg, incluid, industri, inundad, ...
as.o            85                intern, jurídic, just, l, ...
o.os            268               human, implicad, indici, indocumentad, ...
a.tro           2                 cas, cen
a.o             214               id, indi, indonesi, inmediat, ...
a.os            134               impedid, impuest, indonesi, inundad, ...
as.os           68                cas, implicad, inundad, jurídic, ...
tro             16                catas, ce, cen, cua, ...
a               1237              huelg, ib, id, iglesi, ...
as              404               huelg, huelguist, incluid, industri, ...
o               1139              hub, hug, human, huyend, ...
os              534               humorístic, human, hígad, impedid, ...

• Adjective Inflection Class: the schemes built from a, as, o, os
• From the spurious suffix “tro”: a.as.o.os.tro, a.tro, tro
• Basic Search Procedure: Increasing Suffix Count, Decreasing Stem Count
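The slides describe the basic search procedure only by its direction: suffix count increases while stem count decreases. One way to sketch that is a greedy walk that starts from a single suffix, keeps adding the suffix that retains the most stems, and stops once too many stems would be lost. The stopping rule (`min_ratio`), the tie-breaking order, and all function names below are assumptions for illustration, run on the small English example vocabulary rather than the Spanish corpus.

```python
from collections import defaultdict

NULL = "Ø"  # stands for the empty suffix, as on the slides

VOCAB = ["blame", "blamed", "blames", "roamed", "roaming",
         "roams", "solve", "solves", "solving"]


def stem_to_suffixes(vocab):
    """Map every candidate stem to the set of suffixes it takes in vocab."""
    table = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word) + 1):
            table[word[:i]].add(word[i:] or NULL)
    return table


def stems_of(table, suffixes):
    """Stems that occur with every suffix in the given set."""
    return {stem for stem, taken in table.items() if suffixes <= taken}


def search_upward(table, seed, min_ratio=0.6):
    """Greedily add suffixes while at least min_ratio of the stems survive.

    min_ratio is a hypothetical stopping threshold, not taken from the slides.
    """
    current = frozenset([seed])
    stems = stems_of(table, current)
    while True:
        # Only suffixes seen with a current stem can yield a non-empty parent.
        candidates = set().union(*(table[s] for s in stems)) - current
        best = None
        for suffix in sorted(candidates):  # sorted for deterministic ties
            parent_stems = stems_of(table, current | {suffix})
            if best is None or len(parent_stems) > len(best[1]):
                best = (suffix, parent_stems)
        if best is None or len(best[1]) < min_ratio * len(stems):
            return current, stems
        current, stems = current | {best[0]}, best[1]


suffixes, stems = search_upward(stem_to_suffixes(VOCAB), "s")
print(sorted(suffixes), sorted(stems))
```

Starting from `s` (stems `blame`, `roam`, `solve`), the walk moves to `Ø.s` (stems `blame`, `solve`) and stops there, because adding `d` would keep only `blame`.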
Scaling Up
• Scaling Up
  • 1 Million word corpus
  • Network built on demand
• New Approach to Search
  • High Recall initial search
  • Weed the results to improve precision
• Results
  • Boost Recall of Suffixes in Spanish
    • from 0.5 to 0.8
  • But very low precision currently
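The slides do not say how suffix recall and precision were computed. A minimal sketch, assuming both are measured over sets of suffix types, looks like this. The gold set here is simply the adjective suffixes a, as, o, os from the Spanish example, and the predicted set adds the spurious "tro"; both are toy inputs, not the reported experiment.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over suffix types."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy inputs: assumed gold suffixes plus one spurious prediction.
gold = {"a", "as", "o", "os"}
predicted = {"a", "as", "o", "os", "tro"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.80 recall=1.00
```

A high-recall initial search makes the predicted set larger, which tends to lower precision until the spurious entries are weeded out.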
Next Steps for Morphology Induction
• Clean the Selected Schemes
  • Current Work
• Convert Paradigms into a Segmenter
  • Soon
• Agglutinative sequences of suffixes
  • Soon
• Learn Mappings from Morphemes to Features
  • Future Goal