NICE Machine Translation for Low-Density Languages
• NICE (Native Language Interpretation and Communication Environment) is a project for rapid development of machine translation for low and very low density languages
• Classification of MT by Language Density
  • High density pairs (E-F, E-S, E-J, …)
    • Statistical or traditional MT approaches are O.K.
  • Medium density (E-Czech, E-Croatian, …)
    • Example-based MT (success with Croatian, Korean)
    • JHU: initial success with statistical MT (Czech)
  • Low density (S-Mapudungun, E-Iñupiaq, …)
    • 10,000 to 1 million speakers
    • Insufficient bilingual corpora for SMT, EBMT
    • Partial corpus-based resources
    • Insufficient trained computational linguists
• Machine Translation of Very Low Density Languages
  • No text in electronic form
    • Can't apply current statistical MT methods
  • No standard spelling or orthography
  • Few literate native speakers
  • Few linguists familiar with the language
    • Nobody is available to do rule-based MT
  • Not enough money or time for years of linguistic information gathering and analysis
  • E.g., Siona (Colombia)
• Motivation for LDMT
  • Methods developed for languages with very scarce resources will generalize to all MT
  • Policy makers can get input from indigenous people (e.g., has there been an epidemic or a crop failure?)
  • Indigenous people can participate in government, education, and the internet without losing their language
  • First MT of polysynthetic languages
• New Ideas
  • MT without large amounts of text and without trained linguists
  • Machine learning of rule-based MT
  • A multi-engine architecture that can flexibly take advantage of whatever resources are available
  • Research partnerships with indigenous communities
  • (Future: exponential models for data-miserly SMT)
• Approach
  • Machine learning
    • Uncontrolled corpus (Generalized Example-Based MT)
    • Controlled corpus elicited from native speakers (Version Space Learning)
  • Multi-Engine MT (a minimal dispatcher sketch follows this list)
    • Flexibly adapt to whatever resources are available
    • Take advantage of the strengths of different MT approaches
• NICE Partners
  • Mapudungun (Chile)
  • Iñupiaq (Alaska)
  • Siona (Colombia)
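As a rough illustration of the multi-engine idea only, the sketch below shows how a translation request might be dispatched to whichever engines have their resources available, keeping the best-scoring hypothesis. All names (Hypothesis, TranslationEngine, MultiEngineMT) are hypothetical, not the project's actual architecture.

    # Minimal sketch of a multi-engine MT dispatcher. Hypothetical names;
    # an illustration of the idea, not the NICE implementation.

    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class Hypothesis:
        """One candidate translation produced by a single engine."""
        text: str
        score: float        # engine-internal confidence, higher is better
        engine: str


    class TranslationEngine:
        """Base class: an engine knows whether its resources are available."""
        name = "abstract"

        def available(self) -> bool:
            raise NotImplementedError

        def translate(self, sentence: str) -> Optional[Hypothesis]:
            raise NotImplementedError


    class MultiEngineMT:
        """Run every engine whose resources exist and keep the best hypothesis."""

        def __init__(self, engines: List[TranslationEngine]):
            self.engines = engines

        def translate(self, sentence: str) -> Optional[Hypothesis]:
            hypotheses: List[Hypothesis] = []
            for engine in self.engines:
                if not engine.available():
                    continue                      # skip engines missing resources
                hyp = engine.translate(sentence)
                if hyp is not None:
                    hypotheses.append(hyp)
            # A real system would rescore hypotheses jointly (e.g., with a shared
            # target-language model); here we simply keep the highest engine score.
            return max(hypotheses, key=lambda h: h.score, default=None)

Concrete subclasses could wrap, for example, glossary lookup, example-based matching over a parallel corpus, or learned transfer rules, each reporting available() according to which of those resources actually exist for a given language pair.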
NICE: Elicitation Process
• Purpose: controlled elicitation of data that will be input to machine learning of translation rules
• Elicitation Interface (a minimal sketch of one elicited item follows this list)
  • The native informant sees a source-language sentence (in English or Spanish)
  • The native informant types in a translation, then uses the mouse to add word alignments
  • The informant is
    • Literate
    • Bilingual
    • Not an expert in linguistics or computation
• The Elicitation Corpus
  • A list of sentences in a major language
    • English
    • Spanish
  • Dynamically adaptable
    • Different sentences are presented depending on what was previously elicited
  • Compositional
    • Joe, Joe's brother, I saw Joe's brother, I told you that I saw Joe's brother, etc.
  • Aims for typological completeness
    • Cover all types of languages
• Data Collection in Mapudungun
  • Spanish-Mapudungun parallel corpora
    • Total words: 223,366
  • Spanish-Mapudungun glossary
    • About 5,500 entries
  • 40 hours of speech recorded
  • 6 hours of speech transcribed
  • Speech data will be translated into Spanish
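To make the elicitation output concrete, here is a minimal sketch, under assumed field names rather than the project's actual file format, of what one elicited item might look like once the informant has typed a translation and clicked word alignments. The example uses an English-Spanish pair from the compositional corpus as a stand-in target, since a Mapudungun rendering cannot be vouched for here.

    # Minimal sketch of one elicited item: a source sentence, the informant's
    # translation, and the word alignments added with the mouse. Field names
    # are hypothetical, not the NICE corpus format.

    from dataclasses import dataclass, field
    from typing import List, Tuple


    @dataclass
    class ElicitedItem:
        source: str                                  # English or Spanish prompt
        target: str                                  # informant's translation
        # (source_word_index, target_word_index) pairs clicked by the informant
        alignments: List[Tuple[int, int]] = field(default_factory=list)

        def aligned_pairs(self) -> List[Tuple[str, str]]:
            """Return the aligned word pairs as text, e.g. for building a glossary."""
            src_tokens = self.source.split()
            tgt_tokens = self.target.split()
            return [(src_tokens[i], tgt_tokens[j]) for i, j in self.alignments]


    # Illustrative example (Spanish stands in for the informant's language):
    item = ElicitedItem(
        source="I saw Joe's brother",
        target="Vi al hermano de Joe",
        alignments=[(0, 0), (1, 0), (2, 3), (2, 4), (3, 2)],
    )
    print(item.aligned_pairs())

Collections of items like this are the kind of controlled data that the rule learner described on the next slide could consume.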
NICE – Current Work
• Instructible Knowledge-Based Machine Translation (iKBMT)
• The Learning Process
  • Learning instance:
      Hebrew:  ha-yeled ha-gadol
      English: the big boy
  • Acquired transfer rule:
      Hebrew: NP: N ADJ  <==>  English: NP: the ADJ N
      ;; x-side constraints (Hebrew)
      (X1 def) = *+
      (X2 def) = *+
      (X0 = X1)
      ;; y-side constraints (English)
      (Y0 = Y3)
      ;; x-y constraints (Hebrew-English equivalence, constituent alignments)
      (X:ADJ <==> Y:ADJ)
      (X:N <==> Y:N)
• Seeded Version Space Learning (SVS)
  • SVS is based on Mitchell-style inductive version-space learning, but instead of keeping full S and G boundaries for each concept, it starts from a seeded rule and grows by generalization, specialization, and rule bifurcation with incrementally acquired data (a minimal sketch follows).
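The sketch below illustrates the seeded-version-space idea in miniature, assuming a toy rule representation in which a rule is just a set of feature constraints: the learner generalizes a rule (drops constraints) when a positive example is not covered, specializes it (adds back a seed constraint) when a negative example is wrongly covered, and bifurcates into a second rule when generalization is not possible. All names are hypothetical; this is not the iKBMT implementation.

    # Toy sketch of seeded version-space (SVS) rule refinement, assuming a rule
    # is a set of feature constraints such as ("X1", "def", "+"). Hypothetical
    # names; not the iKBMT learner.

    from typing import FrozenSet, List, Set, Tuple

    Constraint = Tuple[str, str, str]            # (node, feature, value)
    Rule = FrozenSet[Constraint]


    def covers(rule: Rule, instance: Set[Constraint]) -> bool:
        """A rule covers an instance if every rule constraint holds in it."""
        return rule <= instance


    def generalize(rule: Rule, positive: Set[Constraint]) -> Rule:
        """Drop exactly the constraints that block a positive instance."""
        return frozenset(c for c in rule if c in positive)


    def specialize(rule: Rule, negative: Set[Constraint],
                   seed: Set[Constraint]) -> Rule:
        """Add back a seed constraint that the negative instance violates, if any.

        A full learner would also check that previously seen positives
        remain covered after specialization.
        """
        for c in seed - rule:
            if c not in negative:
                return rule | {c}
        return rule                              # no repairing constraint found


    def refine(rules: List[Rule], instance: Set[Constraint], is_positive: bool,
               seed: Set[Constraint]) -> List[Rule]:
        """One incremental SVS step: generalize, specialize, or bifurcate."""
        new_rules: List[Rule] = []
        for rule in rules:
            if is_positive and not covers(rule, instance):
                general = generalize(rule, instance)
                if general:                       # generalization is possible
                    new_rules.append(general)
                else:                             # bifurcate: keep the old rule
                    new_rules.append(rule)        # and seed a new one from the
                    new_rules.append(frozenset(instance))  # current instance
            elif not is_positive and covers(rule, instance):
                new_rules.append(specialize(rule, instance, seed))
            else:
                new_rules.append(rule)            # rule already consistent
        return new_rules

In the setting described above, the seed would be the transfer rule compiled directly from a single aligned learning instance (like the Hebrew-English example shown), and the constraints being added or dropped would be its x-side, y-side, and x-y constraints.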