Omnivorous MT

Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some other languages. Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Erik Peterson, Jaime Carbonell, and Alon Lavie.

Presentation Transcript


  1. Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some other languages • Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Erik Peterson, Jaime Carbonell, and Alon Lavie

  2. Omnivorous MT • Eat whatever resources are available • Eat large or small amounts of data • Mapusaurus roseae • Mapu = land • Mapuche = land people • Mapudungun = land speech

  3. AVENUE’s Inventory • Resources: Parallel corpus; Monolingual corpus; Lexicon; Morphological Analyzer (lemmatizer); Human Linguist; Human non-linguist • Techniques: Rule based transfer system; Example Based MT; Morphology Learning; Rule Learning; Interactive Rule Refinement; Multi-Engine MT • This research was funded in part by NSF grant number IIS-0121-631.

  4. Startup without corpus or linguist Requires someone who is bilingual and literate

  5. The Elicitation Tool has been used with these languages • Mapudungun • Hindi • Hebrew • Quechua • Aymara • Thai • Japanese • Chinese • Dutch • Arabic

  6. Purpose of Elicitation • Provide a small but highly targeted corpus of hand aligned data • To support machine learning from a small data set • To discover basic word order • To discover how syntactic dependencies are expressed • To discover which grammatical meanings are reflected in the morphology or syntax of the language
     srcsent: Tú caíste
     tgtsent: eymi ütrünagimi
     aligned: ((1,1),(2,2))
     context: tú = Juan [masculino, 2a persona del singular]
     comment: You (John) fell

     srcsent: Tú estás cayendo
     tgtsent: eymi petu ütrünagimi
     aligned: ((1,1),(2 3,2 3))
     context: tú = Juan [masculino, 2a persona del singular]
     comment: You (John) are falling

     srcsent: Tú caíste
     tgtsent: eymi ütrünagimi
     aligned: ((1,1),(2,2))
     context: tú = María [femenino, 2a persona del singular]
     comment: You (Mary) fell
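     The records above pair a source sentence with its translation and a word alignment. As a minimal sketch, the alignment string could be read as follows. The function names `parse_alignment` and `aligned_words` are hypothetical, and the reading of each pair as "source positions, then target positions" (1-based) is an assumption drawn from the examples, not the Elicitation Tool's actual code.

```python
import re

def parse_alignment(aligned):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner pair looks like "(source positions,target positions)".
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", aligned):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs

def aligned_words(srcsent, tgtsent, aligned):
    """Return (source phrase, target phrase) tuples using 1-based positions."""
    src, tgt = srcsent.split(), tgtsent.split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(aligned)]

print(aligned_words("Tú caíste", "eymi ütrünagimi", "((1,1),(2,2))"))
# [('Tú', 'eymi'), ('caíste', 'ütrünagimi')]
```

     Keeping multi-word groups such as `(2 3,2 3)` as lists of positions preserves the many-to-many alignments that the progressive example relies on.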

  7. Feature Structures
     srcsent: Mary was not a leader.
     context: Translate this as though it were spoken to a peer co-worker;
     ((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)
              (np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)…))
      (pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)
             (np-general-type common-noun-type)(np-specificity specificity-neutral)…))
      (c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)(c-solidarity solidarity-neutral)
      (c-v-grammatical-aspect gram-aspect-neutral)(c-v-absolute-tense past)(c-v-phase-aspect phase-aspect-neutral)
      (c-general-type declarative-clause)(c-polarity polarity-negative)(c-my-causer-intentionality intentionality-n/a)
      (c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)…)
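     A feature structure like the one above is a nested list of (feature value) pairs, so it can be loaded with a small parenthesis parser. The sketch below is illustrative only: `tokenize`, `parse`, and `to_dict` are hypothetical helper names, and the input is a shortened fragment of the slide's structure with the elided ("…") parts left out.

```python
def tokenize(text):
    """Split a parenthesized feature structure into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively turn a token list into nested Python lists."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return items

def to_dict(node):
    """Convert a list of (feature value) pairs into a dict; values may nest."""
    result = {}
    for feature, value in node:
        result[feature] = to_dict(value) if isinstance(value, list) else value
    return result

fragment = ("((actor ((np-function fn-actor)(np-animacy anim-human)))"
            " (c-copula-type copula-role)(c-polarity polarity-negative))")
fs = to_dict(parse(tokenize(fragment)))
print(fs["actor"]["np-animacy"], fs["c-polarity"])
# anim-human polarity-negative
```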

  8. Current Work • Search space: • Elements of meanings that might be expressed by syntax or morphology: tense, aspect, person, number, gender, causation, evidentiality, etc. • Syntactic dependencies: subject, object • Interactions of features: • Tense and person • Tense and interrogative mood • Etc.

  9. Current Work • For a new language • For each item of the search space • Eliminate it as irrelevant or • Explore it • Using as few sentences as possible
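     The eliminate-or-explore step can be pictured with a toy check over elicited minimal pairs. This is only a sketch under stated assumptions: `detect_features` and the feature labels are hypothetical, the translations are taken from the elicitation examples with the verb spelled uniformly as "ütrünagimi", and the real navigation procedure is not described on the slide.

```python
def detect_features(minimal_pairs):
    """Classify each feature as 'explore' or 'irrelevant'.

    minimal_pairs maps a feature name to the translations elicited for
    sentences that differ only in that feature's value.
    """
    decisions = {}
    for feature, translations in minimal_pairs.items():
        # If every value yields the same target sentence, the feature shows
        # no visible effect in this context and can be set aside.
        distinct = set(translations.values())
        decisions[feature] = "explore" if len(distinct) > 1 else "irrelevant"
    return decisions

pairs = {
    "gender of 2sg subject": {"masculine": "eymi ütrünagimi",
                              "feminine": "eymi ütrünagimi"},
    "progressive aspect": {"no": "eymi ütrünagimi",
                           "yes": "eymi petu ütrünagimi"},
}
print(detect_features(pairs))
# {'gender of 2sg subject': 'irrelevant', 'progressive aspect': 'explore'}
```

     A single identical pair is weak evidence, so a real system would presumably confirm an "irrelevant" verdict in more than one context before dropping a feature.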

  10. Tools for Creating Elicitation Corpora [pipeline diagram] • Feature Specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality, …): list of semantic features and values • Feature Maps: which combinations of features and values are of interest • XML Schema, XSLT Script • Feature Structure Sets • Reverse Annotated Feature Structure Sets: add English sentences • The Corpus • Sampling • Smaller Corpus

  11. Tools for Creating Elicitation Corpora [pipeline diagram] • Feature Specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality, …): list of semantic features and values • Feature Maps: which combinations of features and values are of interest • Feature Structure Sets • Combination Formalism • Reverse Annotated Feature Structure Sets: add English sentences • The Corpus • Sampling • Smaller Corpus

  12. Tools for Creating Elicitation Corpora [pipeline diagram] • Feature Specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality, …): list of semantic features and values • Feature Maps: which combinations of features and values are of interest • Feature Structure Sets • Feature Structure Viewer • Reverse Annotated Feature Structure Sets: add English sentences • The Corpus • Sampling • Smaller Corpus

  13. Tools for Creating Elicitation Corpora [pipeline diagram] • Feature Specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality, …): list of semantic features and values • Feature Maps: which combinations of features and values are of interest • Feature Structure Sets • Reverse Annotated Feature Structure Sets: add English sentences • The Corpus • Sampling • Smaller Corpus

  14. Outline • Two ideas • Omnivorous MT • Startup for low resource situation • Four Languages • Mapudungun • Quechua • Hindi • Hebrew

  15. The Avenue Low Resource Scenario [architecture diagram] • Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus • Morphology: Learning Module, Morphology Analyzer • Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules • Run-Time System: INPUT TEXT, Run Time Transfer System, Lexical Resources, Decoder, OUTPUT TEXT • Rule Refinement: Translation Correction Tool, Rule Refinement Module

  19. Mapudungun Language • 900,000 Mapuche people • At least 300,000 speakers of Mapudungun • Polysynthetic
      sl: pe-rke-fi-ñ Maria
          ver-REPORT-3pO-1pSgS/IND
      tl: DICEN QUE LA VI A MARÍA
          (They say that) I saw Maria.

  20. AVENUE Mapudungun • Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.

  21. Mapudungun to Spanish Resources • Initially: • Large team of native speakers at Universidad de la Frontera, Temuco, Chile • Some knowledge of linguistics • No knowledge of computational linguistics • No corpus • A few short word lists • No morphological analyzer • Later: Computational Linguists with non-native knowledge of Mapudungun • Other considerations: • Produce something that is useful to the community, especially for bilingual education • Experimental MT systems are not useful

  22. AVENUE architecture for Mapudungun [architecture diagram] • Added resources: Mapudungun Corpus (170 hours of spoken Mapudungun), Example Based MT, Spelling checker, Spanish Morphology from UPC, Barcelona • Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus • Morphology: Learning Module, Morphology Analyzer • Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules • Run-Time System: INPUT TEXT, Run Time Transfer System, Lexical Resources, Decoder, OUTPUT TEXT • Rule Refinement: Translation Correction Tool, Rule Refinement Module

  23. Mapudungun Products • http://www.lenguasamerindias.org/ • Click: traductor mapudungún • Dictionary lookup (Mapudungun to Spanish) • Morphological analysis • Example Based MT (Mapudungun to Spanish)

  24. I didn’t see Maria [diagram: source and target parse trees] • Mapudungun tree (S, VP, V, nested VSuffG/VSuff, NP, N) with leaves: pe, la, fi, ñ, Maria • Spanish tree (S, VP, V, NP, N) with leaves: “no”, vi, “a”, María

  25. Transfer to Spanish: Top-Down
      VP::VP [VBar NP] -> [VBar "a" NP]
      ( (X1::Y1)
        (X2::Y3)
        ((X2 type) = (*NOT* personal))
        ((X2 human) =c +)
        (X0 = X1)
        ((X0 object) = X2)
        (Y0 = X0)
        ((Y0 object) = (X0 object))
        (Y1 = Y0)
        (Y3 = (Y0 object))
        ((Y1 objmarker person) = (Y3 person))
        ((Y1 objmarker number) = (Y3 number))
        ((Y1 objmarker gender) = (Y3 gender)))
      [diagram: the Mapudungun tree for pe, la, fi, ñ, Maria on the source side; on the target side the VP expands to VBar, “a”, NP]
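      To make the effect of this rule concrete, here is a rough Python analogue. It is a sketch, not the AVENUE transfer engine: `transfer_vp` is a hypothetical function, feature structures are plain dicts, and reading `(*NOT* personal)` as "the object is not a personal pronoun" is an assumption.

```python
def transfer_vp(vbar, obj):
    """Sketch of VP::VP [VBar NP] -> [VBar "a" NP].

    The rule applies only if the object NP is human (the =c constraint)
    and is not of type 'personal'; otherwise None is returned.
    """
    if obj.get("human") != "+" or obj.get("type") == "personal":
        return None
    target_vbar = dict(vbar)
    # Copy the object's agreement features onto the verb's object marker,
    # mirroring the (Y1 objmarker ...) = (Y3 ...) equations.
    target_vbar["objmarker"] = {k: obj.get(k)
                                for k in ("person", "number", "gender")}
    return [target_vbar, "a", obj]

maria = {"lex": "María", "type": "proper", "human": "+",
         "person": 3, "number": "sg", "gender": "fem"}
result = transfer_vp({"lex": "ver"}, maria)
print(result[1], result[0]["objmarker"])
# a {'person': 3, 'number': 'sg', 'gender': 'fem'}
```

      The inserted "a" corresponds to the Spanish object marker in "vi a María"; a non-human object would fail the check and fall through to some other rule.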

  26. AVENUE Hebrew • Joint project of Carnegie Mellon University and University of Haifa

  27. Hebrew Language • Native language of about 3-4 million people in Israel • Semitic language, closely related to Arabic and with similar linguistic properties • Root+Pattern word formation system • Rich verb and noun morphology • Particles attach as prefixes to the following word: definite article (H), prepositions (B, K, L, M), coordinating conjunction (W), relativizers ($, K$)… • Unique alphabet and writing system • 22 letters represent (mostly) consonants • Vowels represented (mostly) by diacritics • Modern texts omit the diacritic vowels, which adds a level of ambiguity: “bare” word → word • Example: MHGR → mehager, m+hagar, m+h+ger
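      The prefix ambiguity in the MHGR example can be reproduced by enumerating every way of peeling the listed particles off a bare word. This is a toy sketch: `segmentations` is a hypothetical function, the prefix list contains only the particles named on the slide, and no lexicon is consulted, so it overgenerates compared with a real morphological analyzer.

```python
# Prefix particles from the slide, in its transliteration.
PREFIXES = ["K$", "H", "B", "K", "L", "M", "W", "$"]

def segmentations(word):
    """Enumerate every way to peel listed prefix particles off a bare word."""
    results = [[word]]  # the unsegmented reading
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            results += [[prefix] + tail for tail in segmentations(rest)]
    return results

print(segmentations("MHGR"))
# [['MHGR'], ['M', 'HGR'], ['M', 'H', 'GR']]
```

      The three candidates line up with the three readings on the slide (mehager, m+hagar, m+h+ger); choosing among them requires vowels or context that the bare spelling does not provide.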

  28. Hebrew Resources • Morphological analyzer developed at Technion • Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary • Human Computational Linguists • Native Speakers

  29. AVENUE architecture for Hebrew [architecture diagram] • Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus • Morphology: Learning Module, Morphology Analyzer • Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules • Run-Time System: INPUT TEXT, Run Time Transfer System, Lexical Resources, Decoder, OUTPUT TEXT • Rule Refinement: Translation Correction Tool, Rule Refinement Module

  30. Flat Seed Rule Generation

  31. Compositionality Learning

  32. Constraint Learning

  33. Quechua facts • Agglutinative language • A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes • Supposedly clear-cut boundaries, but in reality several suffixes change when followed by certain other suffixes • No irregular verbs, nouns or adjectives • Does not mark for gender • No adjective agreement • No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a task similar to that of articles and intonation in English or Spanish)

  34. Quechua examples • taki+ni (also written takiniy): sing 1sg (I sing) → canto • taki+sha+ni (takishaniy): sing progr 1sg (I am singing) → estoy cantando • taki+pa+ku+q+chu?: taki = sing; -pa+ku = to join a group to do something; -q = agentive; -chu = interrogative → (para) cantar con la gente (del pueblo)? (to sing with the people (of the village)?)
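      The segmented forms above can be glossed mechanically with a lookup table. The sketch below uses only the morphemes and glosses given on this slide; the `gloss` function and the treatment of `pa+ku` as a single two-suffix unit are assumptions made for illustration.

```python
# Glosses taken from the slide; "pa+ku" is glossed there as one unit.
GLOSSES = {
    "taki": "sing",
    "sha": "progr",
    "ni": "1sg",
    "pa+ku": "join a group to do something",
    "q": "agentive",
    "chu": "interrogative",
}

def gloss(word):
    """Gloss a '+'-segmented word morpheme by morpheme."""
    morphemes = word.rstrip("?").split("+")
    out, i = [], 0
    while i < len(morphemes):
        pair = "+".join(morphemes[i:i + 2])
        if i + 1 < len(morphemes) and pair in GLOSSES:
            out.append(GLOSSES[pair])  # two-suffix unit such as pa+ku
            i += 2
        else:
            out.append(GLOSSES.get(morphemes[i], "?"))  # "?" marks unknowns
            i += 1
    return out

print(gloss("taki+sha+ni"))
# ['sing', 'progr', '1sg']
```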

  35. Quechua Resources • A few native speakers, not linguists • A computational linguist learning Quechua • Two fluent, but non-native linguists

  36. AVENUE architecture for Quechua [architecture diagram] • Added resource: Quechua Parallel Corpus (OCR with correction) • Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus • Morphology: Learning Module, Morphology Analyzer • Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules • Run-Time System: INPUT TEXT, Run Time Transfer System, Lexical Resources, Decoder, OUTPUT TEXT • Rule Refinement: Translation Correction Tool, Rule Refinement Module

  37. Grammar rules
      ;taki+sha+ni -> estoy cantando (I am singing)
      {VBar,3}
      VBar::VBar : [V VSuff VSuff] -> [V V]
      ( (X1::Y2)
        ((x0 person) = (x3 person))
        ((x0 number) = (x3 number))
        ((x2 mood) =c ger)
        ((y2 mood) = (x2 mood))
        ((y1 form) =c estar)
        ((y1 person) = (x3 person))
        ((y1 number) = (x3 number))
        ((y1 tense) = (x3 tense))
        ((x0 tense) = (x3 tense))
        ((y1 mood) = (x3 mood))
        ((x3 inflected) =c +)
        ((x0 inflected) = +))
      Spanish Morphology Generation
      lex = cantar, mood = ger → cantando
      lex = estar, person = 1, number = sg, tense = pres, mood = ind → estoy
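      The generation step at the end of this slide maps feature bundles to inflected Spanish words. A toy stand-in is sketched below; it is not the Spanish morphology component used by the project. `generate` is a hypothetical function that covers only regular gerunds and the present indicative of "estar", which is enough for the slide's example.

```python
# Present indicative forms of "estar", keyed by (person, number).
ESTAR_PRESENT_IND = {
    (1, "sg"): "estoy", (2, "sg"): "estás", (3, "sg"): "está",
    (1, "pl"): "estamos", (2, "pl"): "estáis", (3, "pl"): "están",
}

def generate(fs):
    """Generate a surface form from a small feature structure (a dict)."""
    lex = fs["lex"]
    if fs.get("mood") == "ger":
        # Regular gerunds only: -ar verbs take -ando, -er/-ir verbs take -iendo.
        stem, ending = lex[:-2], lex[-2:]
        return stem + ("ando" if ending == "ar" else "iendo")
    if lex == "estar" and fs.get("tense") == "pres" and fs.get("mood") == "ind":
        return ESTAR_PRESENT_IND[(fs["person"], fs["number"])]
    raise ValueError("form not covered by this sketch")

aux = generate({"lex": "estar", "person": 1, "number": "sg",
                "tense": "pres", "mood": "ind"})
main = generate({"lex": "cantar", "mood": "ger"})
print(aux, main)
# estoy cantando
```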

  38. Hindi Resources • Large statistical lexicon from the Linguistic Data Consortium (LDC) • Parallel Corpus from LDC • Morphological Analyzer-Generator from LDC • Lots of native speakers • Computational linguists with little or no knowledge of Hindi • Experimented with the size of the parallel corpus • Miserly and large scenarios

  39. AVENUE architecture for Hindi [architecture diagram] • Added resources: Hindi Parallel Corpus, EBMT, SMT, 15,000 Noun Phrases from Penn TreeBank • Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus • Morphology: Learning Module, Morphology Analyzer • Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules • Run-Time System: INPUT TEXT, Run Time Transfer System, Lexical Resources, Decoder, OUTPUT TEXT • Rule Refinement: Translation Correction Tool, Rule Refinement Module • Supported by DARPA TIDES

  40. Manual Transfer Rules: Example
      [diagram: Hindi tree NP → PP NP1 with leaves jIvana, ke, eka, aXyAya; English tree NP → NP1 PP with leaves one, chapter, of, life]
      ; NP1 ke NP2 -> NP2 of NP1
      ; Ex: jIvana ke eka aXyAya
      ;     life of (one) chapter
      ;     ==> a chapter of life
      ;
      {NP,12}
      NP::NP : [PP NP1] -> [NP1 PP]
      ( (X1::Y2) (X2::Y1)
      ; ((x2 lexwx) = 'kA')
      )
      {NP,13}
      NP::NP : [NP1] -> [NP1]
      ( (X1::Y1) )
      {PP,12}
      PP::PP : [NP Postp] -> [Prep NP]
      ( (X1::Y2) (X2::Y1) )
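      The reordering performed by rules {NP,12} and {PP,12} can be traced with a few lines of Python. This is a sketch under stated assumptions: trees are `(label, children)` tuples, the `REORDER` table and `transfer` function are hypothetical, and the word-level lexicon holds only the four words from the example.

```python
# Toy bilingual lexicon for the slide's example (an assumption).
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

# Target-side order of the source children (0-based), i.e. (X1::Y2) (X2::Y1).
REORDER = {
    ("NP", ("PP", "NP1")): [1, 0],    # {NP,12}: [PP NP1] -> [NP1 PP]
    ("PP", ("NP", "Postp")): [1, 0],  # {PP,12}: [NP Postp] -> [Prep NP]
}

def transfer(node):
    """Translate a source tree top-down, reordering children where a rule matches."""
    if isinstance(node, str):
        return [LEXICON.get(node, node)]
    label, children = node
    child_labels = tuple(c[0] if isinstance(c, tuple) else None for c in children)
    order = REORDER.get((label, child_labels), range(len(children)))
    words = []
    for index in order:
        words.extend(transfer(children[index]))
    return words

source = ("NP", [("PP", [("NP", ["jIvana"]), ("Postp", ["ke"])]),
                 ("NP1", ["eka", "aXyAya"])])
print(" ".join(transfer(source)))
# one chapter of life
```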

  41. Hindi-English • Very miserly training data • Seven combinations of components • Strong decoder allows re-ordering • Three automatic scoring metrics

  42. Extra Slides

  44. Feature Specification • Defines Features and their values • Sets default values for features • Specifies feature requirements and restrictions • Written in XML

  45. Feature Specification
      Feature: c-copula-type (a copula is a verb like “be”; some languages do not have copulas)
      Values:
      copula-n/a
        Restrictions: 1. ~(c-secondary-type secondary-copula)
        Notes:
      copula-role
        Restrictions: 1. (c-secondary-type secondary-copula)
        Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"
      copula-identity
        Restrictions: 1. (c-secondary-type secondary-copula)
        Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
      copula-location
        Restrictions: 1. (c-secondary-type secondary-copula)
        Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.
      copula-description
        Restrictions: 1. (c-secondary-type secondary-copula)
        Notes: 1. A description is an attribute. "The children are happy." "The books are long."
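      The restrictions listed for `c-copula-type` amount to simple checks on another feature of the same clause. A minimal sketch of such a check follows; the table layout and the function `copula_type_allowed` are hypothetical, and `~` is read as negation, which is an assumption about the specification's notation.

```python
# value -> (feature, required value, must_hold); must_hold=False encodes "~".
RESTRICTIONS = {
    "copula-n/a": ("c-secondary-type", "secondary-copula", False),
    "copula-role": ("c-secondary-type", "secondary-copula", True),
    "copula-identity": ("c-secondary-type", "secondary-copula", True),
    "copula-location": ("c-secondary-type", "secondary-copula", True),
    "copula-description": ("c-secondary-type", "secondary-copula", True),
}

def copula_type_allowed(fs):
    """Check a clause's c-copula-type value against its restriction."""
    feature, value, must_hold = RESTRICTIONS[fs["c-copula-type"]]
    return (fs.get(feature) == value) == must_hold

clause = {"c-copula-type": "copula-role",
          "c-secondary-type": "secondary-copula"}
print(copula_type_allowed(clause))
# True
```

      The clause used here carries the same two values as the "Mary was not a leader" feature structure, so that structure passes the check.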

  46. Feature Maps • Some features interact in the grammar • English –s reflects person and number of the subject and tense of the verb. • In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement: • He is running. • Is he running? • We need to check many, but not all combinations of features and values. • Using unlimited feature combinations leads to an unmanageable number of sentences
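      The growth in sentence count is easy to see by expanding a feature map into all of its value combinations. The sketch below is illustrative: `count_combinations` and `expand` are hypothetical names, and the small map is an assumed example built from feature values that appear elsewhere in these slides.

```python
from itertools import product
from math import prod

def count_combinations(feature_map):
    """Number of feature structures a map expands to (product of value counts)."""
    return prod(len(values) for values in feature_map.values())

def expand(feature_map):
    """List every combination of values as a feature-to-value dict."""
    names = list(feature_map)
    return [dict(zip(names, combo))
            for combo in product(*(feature_map[n] for n in names))]

small_map = {
    "c-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-absolute-tense": ["past", "present", "future"],
}
print(count_combinations(small_map), expand(small_map)[0])
# 6 {'c-polarity': 'polarity-positive', 'c-v-absolute-tense': 'past'}
```

      Because the count is a product, each added feature multiplies the number of sentences to elicit, which is why a map restricts itself to the combinations of interest.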

  47. Evidentiality Map
      Lexical Aspect: activity-accomplishment
      Assertiveness: Assertiveness-asserted, Assertiveness-neutral
      Polarity: Polarity-positive, Polarity-negative
      Source: Hearsay, quotative, inferred, assumption | Visual, Auditory, non-visual-or-auditory
      Tense: Past | Present, Future | Past | Present
      Gram. Aspect: Perfective, progressive, habitual, neutral | Perfective, progressive, habitual, neutral | habitual, neutral, progressive | habitual, neutral, progressive

  48. Current Work • Navigation • Start: large search space of all possible feature combinations • Finish: each feature has been eliminated as irrelevant or has been explored • Goal: dynamically find the most efficient path through the search space for each language.

  49. Current Work • Feature Detection • Which features have an effect on morphosyntax? • What is the effect? • Drives the Navigation process
