Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. Christian Monson, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, Rosendo Huisca.
AVENUE Mapudungun • Instituto de Estudios Indígenas • Universidad de La Frontera, Temuco, Chile • Programa de Educación Intercultural Bilingüe • Ministry of Education (Mineduc), Chile • Language Technologies Institute • Carnegie Mellon University, USA
Goals of AVENUE Mapudungun • Multicultural and Bilingual Education, Mineduc, Chile • Basic skills taught in Spanish and mother tongue • Use of technology and networking even in rural areas • NLP tools for bilingual education: AVENUE Project, CMU, USA • On-line dictionary • Bilingual corpus • Spelling checker • NLP tools for languages with low resources • Machine learning of morphology and translation rules
Outline • Overview of Mapudungun language • Plan for on-line dictionary • Progress on dictionary • Plan for spelling checker • Progress on spelling checker
Mapudungun • Mapuche people • Around 900,000 • Chile and Argentina • Agglutinative/Polysynthetic • Up to 36 suffix slots (Smeets, 1989) • Typical verb has five or six suffixes • Noun incorporation • Noun goes immediately after the verb stem • Vstem+(noun)+(suffixes)+last-suffix • Last suffix for a finite verb is mood and person/number of agent or patient • Last suffix for a non-finite verb is nominalization or adverbialization • Other suffixes include aspect, negation, inversive, etc.
Examples of Mapudungun verbs
• Amu -ke -yngün | go -habitual -3plIndic | They (usually) go
• Ngütrümtu -a -lu | call -fut -adverb | While calling (tomorrow), …
• nentu -ñma -nge -ymi | extract -mal -pass -2sgIndic | You were extracted (on me)
• ngütramka -me -a -fi -ñ | tell -loc -fut -3obj -1sgIndic | I will tell her (away)
Plans for Dictionary (Mineduc) • Tri-lingual (Spanish-Mapudungun-English) • Pronunciation for each word in each language • Example of use for each Mapudungun word • Specific users can exchange suggestions and alternate pronunciations • Teachers and students of schools in the PEIB/Orígenes program www.origenes.cl • Web-based, using Flash • Based on shared lesson plans and network communications • Vocabulary • From the corpus of spoken Mapudungun • From the Chilean curriculum for the first four years of school • From the informatics domain • User interface will be designed by Mineduc
Corpus of spoken Mapudungun • 170 hours of speech • 120 hours: Nguluche dialect • 30 hours: Lafkenche dialect • 20 hours: Pewenche dialect • 0 hours: Williche dialect • Different and more endangered • Mapuche interviewer and interviewees • Dialogues about health problems treated by doctor or traditional healer. • Recorded with DAT recorder • Some recordings are poor quality • Some high enough quality for training a speech recognizer • Transcribed using TransEdit • Translated into Spanish by native speaker of Mapudungun
Examples from Mapudungun-Spanish corpus
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
Progress on Dictionary • Around 3000 Mapudungun words (stems and fully inflected forms) • Spanish translation of the word • Sentence from the corpus of spoken Mapudungun containing the word form • Spanish translation of the sentence, and • Reference into the corpus of spoken Mapudungun identifying the specific cited sentence • For 1600 words • segmentation of the word into morphemes • gloss for each morpheme • Stored as a Word file with delimiters between fields. • Can be easily converted to other formats
Examples from Dictionary • Lichi: • leche. translation • Feychi lichi, ¿chem lichingey? example • (Esta leche ¿qué leche es?) translation • nmlch-nmfhp1_x_0051_nmlch_00. Ec/Rh/Fc. Ec/ Rh02-01-03. index
Examples from Dictionary • Kümekünueymu: • küme-künu-eymu. segmentation • bien-quedar-él(ella).a.ti gloss • te ha dejado muy bien. translation • Ka kümekünueymu tati. example • (Y te ha dejado muy bien). translation • nmlch-nmpll1_x_0070_nmlch_00. EC/RH03-02-03. index
Examples from Dictionary • Mongepeürkelayan: • monge-pe-ürke-la-y-a-n. segmentation • sanar-tal.vez-acaso-no-0-futuro-yo gloss • no mejoraré tal vez. translation • "Feytüfachi operalayaymi, operaeliyu l'ayaymi" pieneu. "Mongepeürkelayan may" pin. Fey l'awen'tueneu, l'awen'tueneu; fey ka tripantun. example • ("Esta vez no te vas a operar, si te opero te vas a morir" me dijo. "No mejoraré tal vez, entonces", dije. Entonces me medicinó, me medicinó; entonces también estuve un año). translation • nmlch-nmpll1_x_0042_nmpll_00. Ec/Rh/Fc. Ec/ Rh23-12-02 index
Plans for spelling checker • Goal: identify misspellings even for morphologically complex words. • We don’t have a morphological analyzer • Mapudungun speakers don’t know computational linguistics • We don’t know Mapudungun • Currently training a field linguist from Argentina (Roberto Aranovich) in computational linguistics • Research on automated morphology learning (Christian Monson) • We want the spelling checker to be compatible with a major word processor. • Using MySpell and OpenOffice
MySpell • Open-source, standalone version of OpenOffice.org spell-checker • Functional equivalent of Unix 'ispell' • Data files specify stems and classes of affixes • each base-form word specifies valid affix classes • can condition applicability based on characters in base-form word • e.g. English plurals formed with -es if word ends in -ch • can modify base form prior to adding affix • e.g. change -y to -ie before adding -s • Limitation: at most one prefix and one suffix can be applied to each base form
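The two kinds of affix rule mentioned above (a condition on the end of the base form, and a modification of the base form before the affix is added) can be illustrated with a small Python model. This is only a sketch of the idea, not MySpell's code or file syntax: the function name `apply_suffix_rule` and its parameters are hypothetical, and the condition is written as an ordinary regular expression for convenience.

```python
import re

def apply_suffix_rule(word, strip, add, condition):
    """Apply one suffix rule to a base form (hypothetical helper).

    strip:     characters removed from the end of the base form ("" for none)
    add:       characters appended after stripping
    condition: regular expression the END of the base form must match
    Returns the derived form, or None when the rule does not apply.
    """
    if not re.search(f"(?:{condition})$", word):
        return None
    if strip and not word.endswith(strip):
        return None
    base = word[:-len(strip)] if strip else word
    return base + add

# English plural in -es when the word ends in -ch
plural_ch = apply_suffix_rule("church", "", "es", "ch")
# Change -y to -ie before adding -s, only after a consonant
plural_y = apply_suffix_rule("party", "y", "ies", "[^aeiou]y")
# The same rule is blocked when the condition fails
blocked = apply_suffix_rule("day", "y", "ies", "[^aeiou]y")
```

Because at most one suffix rule can fire per base form, a Mapudungun verb with five or six suffixes has to be covered by a single rule whose affix is the whole suffix string.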
Plans for Spelling Checker • MySpell for Mapudungun • Example of full segmentation • Mongepeürkelayan • monge-pe-ürke-la-y-a-n. • no mejoraré tal vez. • Example of segmentation for MySpell • monge: stem • peürkelayan: suffix string
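A stem plus suffix-string segmentation like the one above could be turned into MySpell-style data files roughly as follows. This is a minimal sketch under several assumptions: the function name `build_myspell_files`, the single affix class `A`, and the decision to put every suffix string in that one class are illustrative choices, not the project's actual setup, and the encoding declaration of a real affix file is omitted.

```python
def build_myspell_files(segmented, flag="A"):
    """Build .aff and .dic text from (stem, suffix_string) pairs.

    Hypothetical helper. All suffix strings go into one affix class,
    so every stem is allowed to combine with every suffix string;
    a real word list would need finer-grained classes.
    """
    stems = sorted({stem for stem, _ in segmented})
    suffixes = sorted({suffix for _, suffix in segmented})

    # Affix file: header "SFX <flag> <cross-product> <rule count>",
    # then one rule per suffix string: strip nothing ("0"),
    # add the suffix string, no condition on the stem (".").
    aff_lines = [f"SFX {flag} Y {len(suffixes)}"]
    aff_lines += [f"SFX {flag} 0 {suffix} ." for suffix in suffixes]

    # Dictionary file: entry count on the first line, then stem/flag.
    dic_lines = [str(len(stems))]
    dic_lines += [f"{stem}/{flag}" for stem in stems]

    return "\n".join(aff_lines), "\n".join(dic_lines)

aff_text, dic_text = build_myspell_files([("monge", "peürkelayan")])
```

With one shared class the checker would accept stem and suffix-string combinations that never occur, so grouping suffix strings by the stems they were actually observed with is the obvious refinement.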
Progress on Spelling Checker • Step 1: Devise spelling conventions • There are competing standards for Mapudungun spelling • First version of spelling checker: • AVENUE Mapudungun spelling standards by Cañulef, Huisca, Painequeo, and Carrasco • Step 2: Get a list of “correctly” spelled words, according to the conventions. • Currently have “correct” spelling for the 70,000 most frequent words from the corpus
Progress on Spelling Checker • Most frequent 70,000 words corrected by hand
Frequency Rank | Transcribed Word Form | Spelling-Corrected Word Form
…
103 | feli | feley
(rank not shown) | pichikeche | pichikeche
(rank not shown) | kümey | kümey
…
10,001 | chumkunual | chumkünuael
10,002 | puedelafuy | puedelafuy
10,003 | tulayin | tulayiñ
10,004 | kimngepelay | kimngepelay
…
Can we use this list instead of stemming?
Progress on Spelling Checker • Step 3: Iteration of stem/suffix boundaries • Start with 1600 segmented words from the dictionary • Identify the suffix strings • For the next most frequent 1000 words • If the word ends in a known suffix string, insert a stem/suffix boundary • Oversegments because we don’t check that the remaining stem is known after the suffix string is removed • Native speakers correct the boundaries • 333 had to be corrected • Two more iterations • Next most frequent 3000 words (579 were wrong) • Next most frequent 5000 words (1175 were wrong) • Results in 9000 words with correct stem/suffix boundaries
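The boundary-proposal step of this iteration can be sketched in Python as below. The function names are hypothetical, and trying the longest known suffix string first is an assumption; the slides do not say how competing matches are resolved. As on the slide, the remaining stem is not checked against a stem list, so the proposals oversegment and still need correction by native speakers.

```python
def known_suffix_strings(segmented):
    """Collect suffix strings from verified (stem, suffix_string) pairs."""
    return {suffix for _, suffix in segmented if suffix}

def propose_boundary(word, suffixes):
    """Propose a stem/suffix split if the word ends in a known suffix string.

    Longest suffix string first (an assumption). The stem is NOT validated,
    which is why the proposals must be reviewed by hand.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return None

# Seed pairs built from forms shown on the example slides.
seed = [("monge", "peürkelayan"), ("amu", "keyngün")]
suffixes = known_suffix_strings(seed)

# Hypothetical, unattested string used only to exercise the function.
proposal = propose_boundary("ngütramkakeyngün", suffixes)
```

After each round the corrected splits would be added to the seed list, so the set of known suffix strings grows before the next batch of words is processed.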
Effect of stemming on number of types • If the suffix string and the stem are in the list of 9000 correctly segmented words, treat it as an instance of the stem. • Otherwise, treat it as a new type.
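The counting rule above can be written out as a short Python sketch. The names `type_of` and `count_types` are hypothetical, and taking the first split point that yields a known stem and a known suffix string is an assumption; the slide does not describe how ambiguous splits are handled.

```python
def type_of(word, stems, suffixes):
    """Map a word to its type under the stemming rule.

    If the word splits into a known stem plus a known suffix string,
    it counts as an instance of the stem; otherwise it is its own type.
    """
    for i in range(1, len(word)):
        if word[:i] in stems and word[i:] in suffixes:
            return word[:i]
    return word

def count_types(tokens, stems, suffixes):
    """Number of distinct types in a token list after stemming."""
    return len({type_of(token, stems, suffixes) for token in tokens})

stems = {"amu", "monge"}
suffixes = {"keyngün", "peürkelayan"}

# "amupeürkelayan" is a hypothetical combination for illustration only.
tokens = ["amukeyngün", "amupeürkelayan", "mongepeürkelayan", "kümey"]

raw_types = len(set(tokens))
stemmed_types = count_types(tokens, stems, suffixes)
```

In this toy input the four distinct word forms collapse to three types, because the two forms built on the stem amu are counted together.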
Conclusion • Building tools that can be used for bilingual education in Chilean schools • Large parallel corpus of spoken Mapudungun translated into Spanish • Small dictionary with examples from the corpus • Can we build a spelling checker with MySpell? • We will let you know at a future conference.