Developing affordable technologies for resource-poor languages Ariadna Font Llitjós Language Technologies Institute Carnegie Mellon University September 22, 2004
[Map: each dot = a language] AMTA 2002
Motivation
Resource-poor scenarios
• Indigenous communities have difficulty accessing crucial information that directly affects their lives (such as land laws, health warnings, etc.)
• Formalizing a potentially endangered language
Affordable technologies, such as
• spell-checkers,
• on-line dictionaries,
• Machine Translation (MT) systems,
• computer-assisted tutoring
AVENUE Partners
Mapudungun for the Mapuche
Chile
• Official language: Spanish
• Population: ~15 million
• ~1/2 million Mapuche people
• Language: Mapudungun
What’s Machine Translation (MT)?
[Diagram: an MT system mapping a Japanese sentence to a Swahili sentence]
Speech to Speech MT
Why Machine Translation for resource-poor (indigenous) languages?
• Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers)
• Benefits include:
• Better government access to indigenous communities (epidemics, crop failures, etc.)
• Better participation by indigenous communities in information-rich activities (health care, education, government) without giving up their languages
• Language preservation
• Civilian and military applications (disaster relief)
MT for resource-poor languages: Challenges
• Minimal amount of parallel text (oral tradition)
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost
Machine Translation Pyramid
[Diagram: translation pyramid with "Interlingua" at the apex (interpretation), transfer rules at the intermediate level, and corpus-based methods near the base; "analysis" rises on the source side and "generation" descends on the target side. Example: "I saw you" → "Yo vi tú"]
AVENUE MT system overview

{VP,3}
VP::VP : [VP NP] -> [VP NP]
(
 (X1::Y1)
 (X2::Y2)
 ((x2 case) = acc)
 ((x0 obj) = x2)
 ((x0 agr) = (x1 agr))
 (y2 == (y0 obj))
 ((y0 tense) = (x0 tense))
 ((y0 agr) = (y1 agr))
)

V::V |: [stayed] -> [quedó]
(
 (X1::Y1)
 ((x0 form) = stay)
 ((x0 actform) = stayed)
 ((x0 tense) = past-pp)
 ((y0 agr pers) = 3)
 ((y0 agr num) = sg)
)

\spa Una mujer se quedó en casa
\map Kiñe domo mlewey ruka mew
\eng One woman stayed at home.
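To make the rule formalism concrete, here is a minimal Python sketch (hypothetical, not the actual AVENUE implementation) of how the agreement and case constraints in the {VP,3} rule above might be checked and how features propagate from source (x) to target (y) nodes:

```python
# Hypothetical sketch of AVENUE-style transfer-rule constraint checking.
# Feature structures are plain dicts; the rule mirrors {VP,3} above.

def apply_vp_rule(x1, x2):
    """Transfer [VP NP] -> [VP NP]; return (x0, y0) or None if constraints fail."""
    if x2.get("case") != "acc":           # ((x2 case) = acc): object NP must be accusative
        return None                        # constraint fails -> rule does not apply
    x0 = {"obj": x2,                       # ((x0 obj) = x2)
          "agr": x1.get("agr"),            # ((x0 agr) = (x1 agr))
          "tense": x1.get("tense")}
    y1 = dict(x1)                          # (X1::Y1): align source VP head with target VP head
    y2 = dict(x2)                          # (X2::Y2): align source NP with target NP
    y0 = {"obj": y2,                       # (y2 == (y0 obj))
          "tense": x0["tense"],            # ((y0 tense) = (x0 tense))
          "agr": y1.get("agr")}            # ((y0 agr) = (y1 agr))
    return x0, y0

src_vp = {"agr": {"pers": 3, "num": "sg"}, "tense": "past"}
src_np = {"case": "acc"}
result = apply_vp_rule(src_vp, src_np)
```

The point of the sketch is only the control flow: a rule applies when its checking constraints succeed, and the remaining constraints copy feature values into the newly built source (x0) and target (y0) nodes.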
Interactive and Automatic Refinement of Translation Rules
Or: how to recycle corrections of MT output back into the MT system by adjusting and adapting the grammar and lexical rules
Interactive elicitation of MT errors
Assumptions:
• non-expert bilingual users can reliably detect and minimally correct MT errors, given:
• the SL sentence (I saw you)
• the TL sentence (Yo vi tú)
• word-to-word alignments (I-yo, saw-vi, you-tú)
• (context)
• using an online GUI: the Translation Correction Tool (TCTool)
Goal:
• maximally simplify the MT correction task
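The elicitation setup above can be sketched as a small data structure. This is an illustrative Python model (the names and fields are assumptions, not the TCTool's actual data model) of what the tool collects: the SL/TL sentences, word-to-word alignments, and a log of the user's minimal edits, using the "el auto roja" example from the later slides:

```python
# Hypothetical record of one TCTool correction session.
from dataclasses import dataclass, field

@dataclass
class CorrectionRecord:
    sl: list[str]                      # source-language sentence
    tl: list[str]                      # MT output to be judged
    alignments: set[tuple[int, int]]   # word-to-word links (sl_index, tl_index)
    edits: list[tuple[str, int, str]] = field(default_factory=list)

    def edit_word(self, i, new_word):
        """User minimally corrects one TL word; log the original for refinement."""
        self.edits.append(("edit", i, self.tl[i]))
        self.tl[i] = new_word

rec = CorrectionRecord(
    sl=["the", "red", "car"],
    tl=["el", "auto", "roja"],
    alignments={(0, 0), (2, 1), (1, 2)},   # the-el, car-auto, red-roja
)
rec.edit_word(2, "rojo")   # minimal correction: *roja -> rojo
```

Logging the pre-edit word alongside the correction is what later lets the refinement module recover the (Wi, Wi') error pair automatically.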
Translation Correction Tool
[Screenshot of the TCTool interface] Actions:
SL + best TL picked by user
Changing “grande” into “gran”
Automatic Rule Refinement Framework
• Find the best RR (rule refinement) operations given:
• a grammar (G),
• a lexicon (L),
• a (set of) source language sentence(s) (SL),
• a (set of) target language sentence(s) (TL),
• its parse tree (P), and
• a minimal correction of TL (TL’),
such that TQ2 > TQ1 (translation quality after refinement exceeds quality before)
• Which can also be expressed as:
max TQ(TL | TL’, P, SL, RR(G, L))
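The maximization above can be read as a search over candidate refinement operations. Here is a toy Python sketch of that loop (all names, the word-substitution "translator", and the overlap-based TQ score are illustrative stand-ins, not the framework's actual components; reordering is ignored for simplicity):

```python
# Toy refinement search: apply each candidate RR operation, re-translate,
# and keep the refined grammar/lexicon that best matches the correction TL'.

def translate(g, l, sl):
    """Stand-in 'MT system': word-for-word substitution, no reordering."""
    return [l.get(w, w) for w in sl]

def tq(tl, tl_ref):
    """Stand-in translation-quality score: fraction of matching words."""
    return sum(a == b for a, b in zip(tl, tl_ref)) / len(tl_ref)

def refine(grammar, lexicon, sl, tl_corrected, candidate_ops):
    best, best_score = (grammar, lexicon), tq(translate(grammar, lexicon, sl), tl_corrected)
    for op in candidate_ops:
        g2, l2 = op(grammar, lexicon)        # apply one RR operation
        score = tq(translate(g2, l2, sl), tl_corrected)
        if score > best_score:               # keep only refinements with TQ2 > TQ1
            best, best_score = (g2, l2), score
    return best

lex = {"the": "el", "car": "auto", "red": "roja"}
op_fix = lambda g, l: (g, {**l, "red": "rojo"})      # candidate lexical refinement
g2, l2 = refine({}, lex, ["the", "car", "red"], ["el", "auto", "rojo"], [op_fix])
```

The real framework, of course, searches structured grammar/lexicon operations (next slide) rather than flat word substitutions, but the accept-only-if-TQ-improves skeleton is the same idea.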
Types of RR operations
• Grammar:
• R0 → R0 + R1 [= R0’ + constr]; Cov[R0] → Cov[R0, R1]
• R0 → R1 [= R0 + constr]; Cov[R0] → Cov[R1]
• R0 → R1 [= R0 + constr = −], R2 [= R0’ + constr = c+]; Cov[R0] → Cov[R1, R2]
• Lexicon:
• Lex0 → Lex0 + Lex1 [= Lex0 + constr]
• Lex0 → Lex1 [= Lex0 + constr]
• Lex0 → Lex1 [= Lex0 + TLword]
• → Lex1 (adding a lexical item)
Questions & Discussion Thanks!
Formalizing Error Information
• Wi = error
• Wi’ = correction
• Wc = clue word
Example:
SL: the red car
TL: *el auto roja
TL’: el auto rojo
Wi = roja, Wi’ = rojo, Wc = auto
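Given TL and the minimally corrected TL’, the (Wi, Wi’) pair can be recovered by a simple word-by-word comparison. A sketch (illustrative Python, assuming TL and TL’ have equal length; the clue word Wc must come from the user or from alignment/parse information, not from this comparison):

```python
# Illustrative extraction of the error word Wi and its correction Wi'
# by comparing the MT output TL with the user-corrected TL'.

def error_pair(tl, tl_corrected):
    """Return (Wi, Wi') for the first differing word, or None if identical."""
    for w, w2 in zip(tl, tl_corrected):
        if w != w2:
            return w, w2
    return None

pair = error_pair(["el", "auto", "roja"], ["el", "auto", "rojo"])  # ("roja", "rojo")
```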
Finding Triggering Features
Once we have the user’s correction (Wi’), we can compare it with Wi at the feature level and find the triggering feature(s). If the resulting set is empty, we need to postulate a new binary feature.
Delta function: Δ(Wi, Wi’) = the set of features whose values differ between the feature structures of Wi and Wi’
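The delta function described above amounts to a set comparison over feature structures. A minimal Python sketch (the dictionary entries for "roja"/"rojo" are hypothetical lexicon features):

```python
# Sketch of the delta function: return the triggering features, i.e. the
# features on which the error word Wi and its correction Wi' disagree.

def delta(fs_wi, fs_wi_prime):
    keys = set(fs_wi) | set(fs_wi_prime)
    return {f for f in keys if fs_wi.get(f) != fs_wi_prime.get(f)}

roja = {"pos": "adj", "gen": "fem", "num": "sg"}    # hypothetical entry for Wi
rojo = {"pos": "adj", "gen": "masc", "num": "sg"}   # hypothetical entry for Wi'
triggering = delta(roja, rojo)    # expect {"gen"}: gender triggered the error
```

An empty result (as when Wi and Wi’ have identical known features) is exactly the case where the slide says a new binary feature must be postulated.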