Developing affordable technologies for resource-poor languages Ariadna Font Llitjós Language Technologies Institute Carnegie Mellon University September 22, 2004
[Map: each dot = a language] AMTA 2002
Motivation
Resource-poor scenarios
• Indigenous communities have difficulty accessing crucial information that directly affects their lives (such as land laws, health warnings, etc.)
• Formalizing a potentially endangered language
Affordable technologies, such as
• spell-checkers,
• on-line dictionaries,
• Machine Translation (MT) systems,
• computer-assisted tutoring
AVENUE Partners
Mapudungun for the Mapuche
Chile
• Official language: Spanish
• Population: ~15 million
• ~1/2 million Mapuche people
• Language: Mapudungun
What’s Machine Translation (MT)?
[Diagram: an MT system mapping a Japanese sentence to a Swahili sentence]
Speech to Speech MT
Why Machine Translation for resource-poor (indigenous) languages?
• Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers)
• Benefits include:
• Better government access to indigenous communities (epidemics, crop failures, etc.)
• Better participation by indigenous communities in information-rich activities (health care, education, government) without giving up their languages
• Language preservation
• Civilian and military applications (disaster relief)
MT for resource-poor languages: Challenges
• Minimal amount of parallel text (oral tradition)
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost
Machine Translation Pyramid
[Diagram: translation pyramid with "Interlingua" at the apex (interpretation), transfer rules at the intermediate level, and corpus-based methods near the base; "analysis" rises on the source side and "generation" descends on the target side. Example: "I saw you" → "Yo vi tú"]
AVENUE MT system overview

{VP,3}
VP::VP : [VP NP] -> [VP NP]
(
 (X1::Y1)
 (X2::Y2)
 ((x2 case) = acc)
 ((x0 obj) = x2)
 ((x0 agr) = (x1 agr))
 (y2 == (y0 obj))
 ((y0 tense) = (x0 tense))
 ((y0 agr) = (y1 agr))
)

V::V |: [stayed] -> [quedó]
(
 (X1::Y1)
 ((x0 form) = stay)
 ((x0 actform) = stayed)
 ((x0 tense) = past-pp)
 ((y0 agr pers) = 3)
 ((y0 agr num) = sg)
)

\spa Una mujer se quedó en casa
\map Kiñe domo mlewey ruka mew
\eng One woman stayed at home.
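To make the rule formalism concrete, here is a minimal Python sketch (hypothetical, not the actual AVENUE implementation) of how the agreement and case constraints in the {VP,3} rule above might be checked and how features propagate from source (x) to target (y) nodes:

```python
# Hypothetical sketch of AVENUE-style transfer-rule constraint checking.
# Feature structures are plain dicts; the rule mirrors {VP,3} above.

def apply_vp_rule(x1, x2):
    """Transfer [VP NP] -> [VP NP]; return (x0, y0) or None if constraints fail."""
    if x2.get("case") != "acc":           # ((x2 case) = acc): object NP must be accusative
        return None                        # constraint fails -> rule does not apply
    x0 = {"obj": x2,                       # ((x0 obj) = x2)
          "agr": x1.get("agr"),            # ((x0 agr) = (x1 agr))
          "tense": x1.get("tense")}
    y1 = dict(x1)                          # (X1::Y1): align source VP head with target VP head
    y2 = dict(x2)                          # (X2::Y2): align source NP with target NP
    y0 = {"obj": y2,                       # (y2 == (y0 obj))
          "tense": x0["tense"],            # ((y0 tense) = (x0 tense))
          "agr": y1.get("agr")}            # ((y0 agr) = (y1 agr))
    return x0, y0

src_vp = {"agr": {"pers": 3, "num": "sg"}, "tense": "past"}
src_np = {"case": "acc"}
result = apply_vp_rule(src_vp, src_np)
```

The point of the sketch is only the control flow: a rule applies when its checking constraints succeed, and the remaining constraints copy feature values into the newly built source (x0) and target (y0) nodes.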
Interactive and Automatic Refinement of Translation Rules
Or: how to recycle corrections of MT output back into the MT system by adjusting and adapting the grammar and lexical rules
Interactive elicitation of MT errors
Assumptions:
• non-expert bilingual users can reliably detect and minimally correct MT errors, given:
• the SL sentence (I saw you)
• the TL sentence (Yo vi tú)
• word-to-word alignments (I-yo, saw-vi, you-tú)
• (context)
• using an online GUI: the Translation Correction Tool (TCTool)
Goal:
• maximally simplify the MT correction task
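The elicitation setup above can be sketched as a small data structure. This is an illustrative Python model (the names and fields are assumptions, not the TCTool's actual data model) of what the tool collects: the SL/TL sentences, word-to-word alignments, and a log of the user's minimal edits, using the "el auto roja" example from the later slides:

```python
# Hypothetical record of one TCTool correction session.
from dataclasses import dataclass, field

@dataclass
class CorrectionRecord:
    sl: list[str]                      # source-language sentence
    tl: list[str]                      # MT output to be judged
    alignments: set[tuple[int, int]]   # word-to-word links (sl_index, tl_index)
    edits: list[tuple[str, int, str]] = field(default_factory=list)

    def edit_word(self, i, new_word):
        """User minimally corrects one TL word; log the original for refinement."""
        self.edits.append(("edit", i, self.tl[i]))
        self.tl[i] = new_word

rec = CorrectionRecord(
    sl=["the", "red", "car"],
    tl=["el", "auto", "roja"],
    alignments={(0, 0), (2, 1), (1, 2)},   # the-el, car-auto, red-roja
)
rec.edit_word(2, "rojo")   # minimal correction: *roja -> rojo
```

Logging the pre-edit word alongside the correction is what later lets the refinement module recover the (Wi, Wi') error pair automatically.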
Translation Correction Tool
[Screenshot of the TCTool interface] Actions:
SL + best TL picked by user
Changing “grande” into “gran”
Automatic Rule Refinement Framework
• Find the best RR (rule refinement) operations given:
• a grammar (G),
• a lexicon (L),
• a (set of) source language sentence(s) (SL),
• a (set of) target language sentence(s) (TL),
• its parse tree (P), and
• a minimal correction of TL (TL’),
such that TQ2 > TQ1 (translation quality after refinement exceeds quality before)
• Which can also be expressed as:
max TQ(TL | TL’, P, SL, RR(G, L))
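The maximization above can be read as a search over candidate refinement operations. Here is a toy Python sketch of that loop (all names, the word-substitution "translator", and the overlap-based TQ score are illustrative stand-ins, not the framework's actual components; reordering is ignored for simplicity):

```python
# Toy refinement search: apply each candidate RR operation, re-translate,
# and keep the refined grammar/lexicon that best matches the correction TL'.

def translate(g, l, sl):
    """Stand-in 'MT system': word-for-word substitution, no reordering."""
    return [l.get(w, w) for w in sl]

def tq(tl, tl_ref):
    """Stand-in translation-quality score: fraction of matching words."""
    return sum(a == b for a, b in zip(tl, tl_ref)) / len(tl_ref)

def refine(grammar, lexicon, sl, tl_corrected, candidate_ops):
    best, best_score = (grammar, lexicon), tq(translate(grammar, lexicon, sl), tl_corrected)
    for op in candidate_ops:
        g2, l2 = op(grammar, lexicon)        # apply one RR operation
        score = tq(translate(g2, l2, sl), tl_corrected)
        if score > best_score:               # keep only refinements with TQ2 > TQ1
            best, best_score = (g2, l2), score
    return best

lex = {"the": "el", "car": "auto", "red": "roja"}
op_fix = lambda g, l: (g, {**l, "red": "rojo"})      # candidate lexical refinement
g2, l2 = refine({}, lex, ["the", "car", "red"], ["el", "auto", "rojo"], [op_fix])
```

The real framework, of course, searches structured grammar/lexicon operations (next slide) rather than flat word substitutions, but the accept-only-if-TQ-improves skeleton is the same idea.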
Types of RR operations
• Grammar:
• R0 → R0 + R1 [= R0’ + constr]; Cov[R0] → Cov[R0, R1]
• R0 → R1 [= R0 + constr]; Cov[R0] → Cov[R1]
• R0 → R1 [= R0 + constr = −], R2 [= R0’ + constr = c+]; Cov[R0] → Cov[R1, R2]
• Lexicon:
• Lex0 → Lex0 + Lex1 [= Lex0 + constr]
• Lex0 → Lex1 [= Lex0 + constr]
• Lex0 → Lex1 [= Lex0 + TLword]
• → Lex1 (adding a lexical item)
Questions & Discussion Thanks!
Formalizing Error Information
• Wi = error
• Wi’ = correction
• Wc = clue word
Example:
SL: the red car
TL: *el auto roja
TL’: el auto rojo
Wi = roja, Wi’ = rojo, Wc = auto
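Given TL and the minimally corrected TL’, the (Wi, Wi’) pair can be recovered by a simple word-by-word comparison. A sketch (illustrative Python, assuming TL and TL’ have equal length; the clue word Wc must come from the user or from alignment/parse information, not from this comparison):

```python
# Illustrative extraction of the error word Wi and its correction Wi'
# by comparing the MT output TL with the user-corrected TL'.

def error_pair(tl, tl_corrected):
    """Return (Wi, Wi') for the first differing word, or None if identical."""
    for w, w2 in zip(tl, tl_corrected):
        if w != w2:
            return w, w2
    return None

pair = error_pair(["el", "auto", "roja"], ["el", "auto", "rojo"])  # ("roja", "rojo")
```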
Finding Triggering Features
Once we have the user’s correction (Wi’), we can compare it with Wi at the feature level and find the triggering feature(s). If the resulting set is empty, we need to postulate a new binary feature.
Delta function: Δ(Wi, Wi’) = the set of features whose values differ between the feature structures of Wi and Wi’
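The delta function described above amounts to a set comparison over feature structures. A minimal Python sketch (the dictionary entries for "roja"/"rojo" are hypothetical lexicon features):

```python
# Sketch of the delta function: return the triggering features, i.e. the
# features on which the error word Wi and its correction Wi' disagree.

def delta(fs_wi, fs_wi_prime):
    keys = set(fs_wi) | set(fs_wi_prime)
    return {f for f in keys if fs_wi.get(f) != fs_wi_prime.get(f)}

roja = {"pos": "adj", "gen": "fem", "num": "sg"}    # hypothetical entry for Wi
rojo = {"pos": "adj", "gen": "masc", "num": "sg"}   # hypothetical entry for Wi'
triggering = delta(roja, rojo)    # expect {"gen"}: gender triggered the error
```

An empty result (as when Wi and Wi’ have identical known features) is exactly the case where the slide says a new binary feature must be postulated.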