Parallel Reverse Treebanks for the Discovery of Morpho-Syntactic Markings

Lori Levin, Robert Frederking, Alison Alvarez
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Jeff Good
Department of Linguistics, Max Planck Institute for Evolutionary Anthropology
Reverse Treebank (RTB)
• What?
  • Create the syntactic structures first
  • Then add sentences
• Why?
  • To elicit data from speakers of less commonly taught languages:
    • Decide what meaning we want to elicit
    • Represent the meaning in a feature structure
    • Add an English or Spanish sentence (plus context notes) to express the meaning
    • Ask the informant to translate it
Bengali Example

srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable) (np-specificity specific)
         (np-biological-gender bio-gender-n/a) (np-animacy anim-inanimate)
         (np-person person-third) (np-function fn-actor)
         (np-general-type common-noun-type) (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause) (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a) (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a) (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a) (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a) (c-causation-directness directness-n/a)
 (c-source source-neutral) (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral) (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive) (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a) (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment) (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none) (c-function fn-main-clause)
 (c-minor-type minor-n/a) (c-copula-type copula-n/a)
 (c-v-absolute-tense past) (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a) (c-question-gap gap-n/a))
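The feature structures use a Lisp-style parenthesized notation: a list of (feature value) pairs, where a value may itself be a nested list of pairs (as under actor). As a rough illustration only, and not the project's actual tooling, a few lines of Python can read this notation into nested dictionaries:

import re

def tokenize(text):
    # Split the notation into parentheses and bare symbols.
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens):
    # Read one parenthesized expression as a list of symbols/sublists.
    token = tokens.pop(0)
    if token != "(":
        return token
    expr = []
    while tokens[0] != ")":
        expr.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return expr

def to_features(expr):
    # Interpret (feature value) pairs; a list-valued pair recurses.
    # (Duplicate features, e.g. two mod-role entries, would need
    # list-valued entries in a fuller version.)
    fs = {}
    for feature, value in expr:
        fs[feature] = to_features(value) if isinstance(value, list) else value
    return fs

text = "((actor ((np-person person-third) (np-number num-sg))) (c-v-absolute-tense past))"
print(to_features(parse(tokenize(text))))
# {'actor': {'np-person': 'person-third', 'np-number': 'num-sg'}, 'c-v-absolute-tense': 'past'}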
Outline
• Background
  • The AVENUE Machine Translation System
• Contents of the RTB
  • An inventory of grammatical meanings
• Languages that have been elicited
• Tools for RTB creation
• Future work
  • Evaluation
  • Navigation
AVENUE Machine Translation System
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI); rule learning: Katharina Probst

A transfer rule combines type information, a synchronous context-free rule, alignments, x-side constraints, y-side constraints, and xy-constraints, e.g. ((Y1 AGR) = (X1 AGR)):

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]   ; type information and synchronous CFG rule
(
 (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)    ; alignments
 ((X1 AGR) = *3-SING)                   ; x-side constraints
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)                      ; y-side constraints
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
)
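To make the constraint notation concrete, here is a minimal sketch in Python; the dictionaries, the constraint encoding, and the satisfies helper are illustrative assumptions, not AVENUE's implementation. It checks literal constraints such as ((X1 AGR) = *3-SING) and equations between feature paths such as ((Y2 GENDER) = (Y4 GENDER)):

def satisfies(constraints, x, y):
    # A constraint pairs a feature path (side, constituent index, feature)
    # with either a literal value or another feature path.
    def lookup(path):
        side, index, feature = path
        return (x if side == "x" else y)[index].get(feature)
    for lhs, rhs in constraints:
        expected = lookup(rhs) if isinstance(rhs, tuple) else rhs
        if lookup(lhs) != expected:
            return False
    return True

# x-side: "the old man"; y-side: "ha-ish ha-zaqen" (indices as in the rule).
x = {1: {"AGR": "3-SING", "DEF": "DEF"}, 3: {"AGR": "3-SING", "COUNT": "+"}}
y = {1: {"DEF": "DEF"}, 2: {"AGR": "3-SING", "GENDER": "masc"},
     3: {"DEF": "DEF"}, 4: {"GENDER": "masc"}}

constraints = [(("x", 1, "AGR"), "3-SING"),               # ((X1 AGR) = *3-SING)
               (("y", 2, "GENDER"), ("y", 4, "GENDER"))]  # noun-adjective agreement
print(satisfies(constraints, x, y))  # True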
AVENUE
• Rules can be written by hand or learned automatically.
• Hybrid:
  • Rule-based transfer
  • Statistical decoder
  • Multi-engine combinations with SMT and EBMT
AVENUE systems (small and experimental, but tested on unseen data)
• Hebrew-to-English
  • Alon Lavie, Shuly Wintner, Katharina Probst
  • Hand-written and automatically learned
  • Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
  • Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
  • Automatically learned
  • Performs better than SMT when training data is limited to 50K words
AVENUE systems (small and experimental, but tested on unseen data)
• English-to-Spanish
  • Ariadna Font Llitjos
  • Hand-written, automatically corrected
• Mapudungun-to-Spanish
  • Roberto Aranovich and Christian Monson
  • Hand-written
• Dutch-to-English
  • Simon Zwarts
  • Hand-written
Elicitation
• Get data from someone who is:
  • Bilingual
  • Literate
  • Not experienced with linguistics
English-Hindi Example
Elicitation tool: Erik Peterson
Elicitation

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütrünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
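The records above follow a simple keyed-field layout (srcsent, tgtsent, aligned, context, comment). As a sketch only, assuming blank-line-separated records of key: value lines (the slides show the fields but not the exact file conventions), they can be read like this:

def parse_records(text):
    # One record per blank-line-separated chunk; one field per line.
    records = []
    for chunk in text.strip().split("\n\n"):
        record = {}
        for line in chunk.splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        records.append(record)
    return records

sample = """srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell"""

for rec in parse_records(sample):
    print(rec["srcsent"], "->", rec["tgtsent"])  # Tú caíste -> eymi ütrünagimi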
Outline
• Background
  • The AVENUE Machine Translation System
• Contents of the RTB
  • An inventory of grammatical meanings
• Languages that have been elicited
• Tools for RTB creation
• Future work
  • Evaluation
  • Navigation
Size of RTB
• Around 3200 sentences
• 20K words
Languages
• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
  • Thai
  • Bengali
• Plans to translate into:
  • Seven “strategic” languages per year for five years
  • As one small part of a language pack (BLARK) for each language
Languages
• Feature structures are being reverse annotated in Spanish at New Mexico State University (Helmreich and Cowie)
  • Plans to translate into Guarani
• Reverse annotation into Portuguese in Brazil (Marcello Modesto)
  • Plans to translate into Karitiana
    • 200 speakers
• Plans to translate into Inupiaq (Kaplan and MacLean)
Previous Elicitation Work
• Pilot corpus
  • Around 900 sentences
  • No feature structures
• Mapudungun
  • Two partial translations
• Quechua
  • Three translations
• Aymara
  • Seven translations
• Hebrew
• Hindi
  • Several translations
• Dutch
Sample: clause level

Example sentences:
• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.

Phenomena covered: tense, aspect, transitivity; questions, causation and permission; interaction of lexical and grammatical aspect; volitionality; embedded clauses and sequence of tense; relative clauses; phase aspect.
Sample: noun phrase level

Example sentences:
• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.
• My book (possession, definiteness)
• A book of mine (possession, indefiniteness)

Phenomena covered: temporal and locative meanings; quantifiers; numbers; combinations of different types of modifiers.
Example

srcsent: The large bus to the post office broke down.
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable) (np-specificity specific)
         (np-biological-gender bio-gender-n/a) (np-animacy anim-inanimate)
         (np-person person-third) (np-function fn-actor)
         (np-general-type common-noun-type) (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause) (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a) (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a) (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a) (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a) (c-causation-directness directness-n/a)
 (c-source source-neutral) (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral) (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive) (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a) (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment) (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none) (c-function fn-main-clause)
 (c-minor-type minor-n/a) (c-copula-type copula-n/a)
 (c-v-absolute-tense past) (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a) (c-question-gap gap-n/a))
Grammatical meanings vs syntactic categories
• Features and values are based on a collection of grammatical meanings, many of which are similar to the grammatemes of the Prague treebanks.
Grammatical Meanings

YES:
• Semantic roles
• Identifiability
• Specificity
• Time (before, after, or during time of speech)
• Modality

NO:
• Case
• Voice
• Determiners
• Auxiliary verbs
Grammatical Meanings

YES:
• How is identifiability expressed?
  • Determiner
  • Word order
  • Optional case marker
  • Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?

NO:
• How are English determiners translated?
  • The boy cried.
  • The lion is a fierce beast.
  • I ate a sandwich.
  • He is a soldier. / Il est soldat. (French)
Argument Roles
• Actor: roughly, the deep subject
• Undergoer: roughly, the deep object
• Predicate and predicatee: The woman is the manager.
• Recipient: I gave a book to the students.
• Beneficiary: I made a phone call for Sam.
Why not subject and object?
• Languages use their voice systems for different purposes.
• Mapudungun obligatorily uses an inverse-marked verb when a third person acts on a first or second person:
  • The verb agrees with the undergoer.
  • The undergoer exhibits other subjecthood properties.
  • The actor may be the object.
• Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)?
• No: How is English voice translated into another language?
Argument Roles
• Accompaniment
  • with someone
  • with pleasure
• Material
  • (out) of wood
• About 20 more roles
  • From the Lingua checklist (Comrie & Smith 1977)
  • Many are also found in tectogrammatical representations
• Around 80 locative relations
  • From the Lingua checklist
• Many temporal relations
Noun Phrase Features
• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles: accompaniment, material, location, time, etc.
• Type: proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
  • Not used yet because of limited context in the elicitation tool
Clause Level Features
• Tense
• Aspect: lexical, grammatical, phase
• Type: declarative, open-q, yes-no-q
• Function: main, argument, adjunct, relative
• Source: hearsay, first-hand, sensory, assumed
• Assertedness: asserted, presupposed, wanted
• Modality
  • Permission, obligation
  • Internal, external
Other clause types (constructions)
• Causative: make/let/have someone do something
• Predication: may be expressed with or without an overt copula
• Existential: There is a problem.
• Impersonal: One doesn’t smoke in restaurants in the US.
• Lament: If only I had read the paper.
• Conditional
• Comparative
• Etc.
Outline
• Background
  • The AVENUE Machine Translation System
• Contents of the RTB
  • An inventory of grammatical meanings
• Languages that have been elicited
• Tools for RTB creation
• Future work
  • Evaluation
  • Navigation
Tools for RTB Creation
• Change the inventory of grammatical meanings
• Make new RTBs for other purposes
The Process
Feature specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality, …)
→ List of semantic features and values
→ Feature maps (which combinations of features and values are of interest)
→ Feature structure sets
→ Reverse-annotated feature structure sets (add English sentences)
→ The corpus
→ Sampling → Smaller corpus
Feature Specification
• XML schema (sketched below)
• XSLT script
  • Human-readable form
• Example (causation):
  • Feature: Causer intentionality — values: intentional, unintentional
  • Feature: Causee control — values: in control, not in control
  • Feature: Causee volitionality — values: willing, unwilling
  • Feature: Causation type — values: direct, indirect
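As a sketch of what such a specification might look like, with the element names (features, feature, value) being assumptions, since the slides say only that the inventory is XML rendered to a human-readable form via XSLT:

import xml.etree.ElementTree as ET

# Illustrative XML; only the feature names and values come from the slide.
spec = """<features>
  <feature name="c-causer-intentionality">
    <value>intentional</value><value>unintentional</value>
  </feature>
  <feature name="c-causee-control">
    <value>in-control</value><value>not-in-control</value>
  </feature>
</features>"""

root = ET.fromstring(spec)
inventory = {f.get("name"): [v.text for v in f.findall("value")]
             for f in root.findall("feature")}
print(inventory["c-causer-intentionality"])  # ['intentional', 'unintentional']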
Feature Combination
• Person and number interact with tense in many fusional languages.
• In English, tense interacts with questions: Will you go?
Feature Combination Template
This template summarizes 288 feature structures, which are generated automatically (the expansion mechanism is sketched below):

((predicatee ((np-general-type pronoun-type common-noun-type)
              (np-person person-first person-second person-third)
              (np-number num-sg num-pl)
              (np-biological-gender bio-gender-male bio-gender-female)))
 {[(predicate ((np-general-type common-noun-type)
               (np-person person-third)))
   (c-copula-type role)]
  [(predicate ((adj-general-type quality-type)
               (c-copula-type attributive)))]
  [(predicate ((np-general-type common-noun-type)
               (np-person person-third)
               (c-copula-type identity)))]}
 (c-secondary-type secondary-copula)
 (c-polarity #all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))
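The expansion itself is a cross-product over the listed alternatives. A simplified sketch of the mechanism (flattening away the nesting and the disjunctive {[...]} blocks of the real templates):

from itertools import product

# Each feature maps to its alternative values, as in the template above.
template = {
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-pl"],
    "c-v-absolute-tense": ["past", "present", "future"],
    "c-polarity": ["polarity-positive", "polarity-negative"],
}

keys = list(template)
structures = [dict(zip(keys, combo)) for combo in product(*template.values())]
print(len(structures))  # 3 * 2 * 3 * 2 = 36 feature structures
print(structures[0])    # {'np-person': 'person-first', 'np-number': 'num-sg', ...}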
Annotation Tool
• Feature structure viewer
• Various views of the feature structure:
  • Omit features whose value is not-applicable (sketched below)
  • Group related features together (e.g., aspect, causation)
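As a sketch of the "omit not-applicable features" view, assuming the value-naming convention visible in the examples above (not-applicable values such as copula-n/a end in -n/a):

def visible_features(fs):
    # Keep only features whose value is applicable.
    return {k: v for k, v in fs.items() if not v.endswith("-n/a")}

fs = {"c-v-absolute-tense": "past",
      "c-copula-type": "copula-n/a",
      "np-animacy": "anim-inanimate"}
print(visible_features(fs))
# {'c-v-absolute-tense': 'past', 'np-animacy': 'anim-inanimate'}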
Outline
• Background
  • The AVENUE Machine Translation System
• Contents of the RTB
  • An inventory of grammatical meanings
• Languages that have been elicited
• Tools for RTB creation
• Future work
  • Evaluation
  • Navigation
Evaluation
• Current funding has not covered evaluation of the RTB, except for informal observations as it was translated into several languages.
• Does it elicit the meanings it was intended to elicit?
  • Informal observation: usually
• Is it useful for machine translation?
Hard Problems
• Reverse annotating meanings that are not grammaticalized in English.
• Evidentiality:
  • He stole the bread.
  • Context: Translate this as if you do not have first-hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”
Hard Problems
• Reverse annotating things that can be said in several ways in English.
• Impersonals:
  • One doesn’t smoke here.
  • You don’t smoke here.
  • They don’t smoke here.
  • Credit cards aren’t accepted.
• Problem in the Reflex corpus because space was limited.
Navigation
• Currently, feature combinations are specified by a human.
• Plan to work in active learning mode (as sketched below):
  • Build a seed RTB
  • Translate some data
  • Do some learning
  • Identify the most valuable pieces of information to get next
  • Generate an RTB for those pieces of information
  • Translate more, learn more, generate more, etc.
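A sketch of the proposed loop; every function below is a hypothetical stub, since the slides describe a plan rather than an implementation:

def translate_with_informant(rtb):      # a bilingual speaker translates the RTB
    return [(fs, "<translation>") for fs in rtb]

def learn_rules(translations):          # stand-in for AVENUE rule learning
    return {"rules-learned": len(translations)}

def most_valuable_unknowns(grammar):    # choose what is worth eliciting next
    return ["c-v-grammatical-aspect"]

def generate_rtb_for(gaps):             # build a targeted RTB for those gaps
    return [{feature: "?"} for feature in gaps]

def active_elicitation(seed_rtb, rounds=3):
    rtb, grammar = seed_rtb, None
    for _ in range(rounds):             # translate, learn, re-target, repeat
        grammar = learn_rules(translate_with_informant(rtb))
        rtb = generate_rtb_for(most_valuable_unknowns(grammar))
    return grammar

print(active_elicitation([{"c-v-absolute-tense": "past"}]))  # {'rules-learned': 1}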