Eliciting Features from Minor Languages

Alison Alvarez nosila@cs.cmu.edu
Lori Levin lsl@cs.cmu.edu
Robert Frederking ref+@cs.cmu.edu
Erik Peterson eepeter@cs.cmu.edu
Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15217

Jeff Good (MPI Leipzig) good@eva.mpg.de
Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig

This research is part of the AVENUE Machine Translation Project. AVENUE is supported by the US National Science Foundation, NSF grant number IIS-0121-631

Overview

In machine translation, fully aligned and tagged translation corpora are considered among the most valuable resources for automatically training translation systems. However, such resources are hard to find for minority languages.

It is possible to overcome this obstacle by using techniques inspired by field linguistics, that is, by drawing on bilingual informants to translate and align given sentences. Field linguists have relied on questionnaires that have remained relatively static over a number of years. We want the flexibility to change the questionnaire to reflect different semantic domains, different goals for machine translation systems, different levels of detail, etc. We also want the questionnaire to be available in multiple languages. For example, we would want a version of the questionnaire in Spanish for use by Latin American minority language speakers. We also want flexibility in lexical selection in order to avoid cultural bias and to choose appropriate lexical items for the major language.

This paper will look at methods for specifying the scope and depth of an elicitation corpus as well as methods for quick design and implementation of elicitation corpora. The resulting corpus can also be used as a test suite to explore existing machine translation systems or to design far-reaching corpora for studying low-resource languages.

Feature Structure Design

Feature Specification

    ((subj ((np-my-general-type pronoun-type common-noun-type)
            (np-my-person person-first person-second person-third)
            (np-my-number num-sg num-pl)
            (np-my-biological-gender bio-gender-male bio-gender-female)
            (np-my-function fn-predicatee)))
     {[(predicate ((np-my-general-type common-noun-type)
                   (np-my-definiteness definiteness-minus)
                   (np-my-person person-third)
                   (np-my-function predicate)))
       (c-my-copula-type role)]
      [(predicate ((adj-my-general-type quality-type)))
       (c-my-copula-type attributive)]
      [(predicate ((np-my-general-type common-noun-type)
                   (np-my-person person-third)
                   (np-my-definiteness definiteness-plus)
                   (np-my-function predicate)))
       (c-my-copula-type identity)]}
     (c-my-secondary-type secondary-copula)
     (c-my-polarity #all)
     (c-my-function fn-main-clause)
     (c-my-general-type declarative)
     (c-my-speech-act sp-act-state)
     (c-v-my-grammatical-aspect gram-aspect-neutral)
     (c-v-my-lexical-aspect state)
     (c-v-my-absolute-tense past present future)
     (c-v-my-phase-aspect durative))

Annotations on the specification:
• “Multiply out by these lists of values”: a feature followed by several values, for example (np-my-person person-first person-second person-third).
• “Disjoint set of copula types and their predicates”: the block in braces { }, with one alternative per pair of square brackets [ ].
• “Use all values of polarity”: (c-my-polarity #all).
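The "multiply out" behaviour annotated on the feature specification can be illustrated with a small sketch. This is not the AVENUE control language or GenKit; it is a hypothetical Python encoding of a reduced version of the specification, in which listed values are combined by Cartesian product, the bracketed copula alternatives are kept disjoint, and `#all` is looked up in an assumed value inventory. Only `polarity-positive` appears in the source, so `polarity-negative` is an assumption, as are the names `multiply_out`, `expand`, and `ALL_VALUES`.

```python
from itertools import product

# Hypothetical, reduced encoding of the specification: each feature maps to
# the list of values that should be multiplied out.
SUBJ = {
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
}
CLAUSE = {
    "c-my-polarity": "#all",
    "c-v-my-absolute-tense": ["past", "present", "future"],
}
# Disjoint alternatives: each copula type is paired with its own predicate.
COPULA_ALTERNATIVES = [
    {"c-my-copula-type": "role",
     "predicate": {"np-my-general-type": "common-noun-type",
                   "np-my-definiteness": "definiteness-minus"}},
    {"c-my-copula-type": "attributive",
     "predicate": {"adj-my-general-type": "quality-type"}},
    {"c-my-copula-type": "identity",
     "predicate": {"np-my-general-type": "common-noun-type",
                   "np-my-definiteness": "definiteness-plus"}},
]
# Assumed value inventory used to resolve "#all" (polarity-negative is a guess).
ALL_VALUES = {"c-my-polarity": ["polarity-positive", "polarity-negative"]}


def multiply_out(features):
    """Return every combination of one value per feature."""
    names = list(features)
    value_lists = [
        ALL_VALUES[name] if features[name] == "#all" else features[name]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def expand(subj, clause, alternatives):
    """Combine subject and clause combinations with each disjoint alternative."""
    structures = []
    for s in multiply_out(subj):
        for c in multiply_out(clause):
            for alt in alternatives:
                structures.append({"subj": s, **c, **alt})
    return structures


structures = expand(SUBJ, CLAUSE, COPULA_ALTERNATIVES)
print(len(structures))  # 108 = (3 * 2) subjects * (2 * 3) clause settings * 3 copula types
```

Even this reduced specification yields 108 feature structures, which suggests why the size and scope of the corpus need to be controlled explicitly.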
A control language is used to define the size and scope of the set of feature structures that will be used by GenKit to generate the corpus.

Example feature definition (np-my-number):

    <feature>
      <feature-name>np-my-number</feature-name>
      <value>
        <value-name>num-sg</value-name>
      </value>
      <value>
        <value-name>num-pl</value-name>
      </value>
      <value>
        <value-name>num-dual</value-name>
      </value>
      <note>
        Notes for analysis of data: CS, 2.1.2.4.1 page 38, seem to imply that some combinations of numbers are more expected than others
      </note>
    </feature>

Our Goals

• Tools for semi-automated corpus design:
    • Test suite for MT
    • Structured corpus for input to machine learning
• A user interface for producing high quality, word-aligned parallel corpora (Elicitation Tool)
• Automated learning of morpho-syntax for low-resource languages

The Elicitation Tool

The elicitation tool provides a simple interface for bilingual informants with no linguistic training and limited computer skills to translate and word-align a corpus in some source language. The output of the elicitation tool is a text file containing triplets of eliciting sentence, elicited sentence, and alignment. The elicitation tool can produce bilingual glossaries based on the aligned corpus. It also has a simple "auto-align" option to add alignments for unambiguous word pairs in the same file.

Feature Structures

Feature structures are multi-level sets of feature-value pairs that are used to reflect the grammatical structures intended for elicitation.

    ((subj ((np-my-general-type pronoun-type)
            (np-my-person person-third)
            (np-my-number num-sg)
            (np-my-biological-gender bio-gender-male)
            (np-my-function fn-predicatee)
            (np-my-animacy anim-human)
            (np-my-info-function info-neutral)
            (np-d-my-distance-from-speaker distance-neutral)
            (np-pronoun-reflexivity reflexivity-n/a)
            (np-my-emphasis emph-no-emph)
            (np-my-semantic-class NEED_VALUES)
            (np-pronoun-exclusivity exclusivity-n/a)
            (np-pronoun-antecedent-function antecedent-n/a)))
     (predicate ((np-my-general-type common-noun-type)
                 (np-my-person person-third)
                 (np-my-function predicate)
                 (np-my-animacy anim-human)
                 (np-my-info-function info-neutral)
                 (np-d-my-distance-from-speaker distance-neutral)
                 (np-pronoun-reflexivity reflexivity-n/a)
                 (np-my-emphasis emph-no-emph)
                 (np-my-number num-sg)
                 (np-my-semantic-class NEED_VALUES)
                 (np-pronoun-exclusivity exclusivity-n/a)
                 (np-pronoun-antecedent-function antecedent-n/a)))
     (c-my-copula-type role)
     (c-my-secondary-type secondary-copula)
     (c-my-polarity polarity-positive)
     (c-my-function fn-main-clause)
     (c-my-general-type declarative)
     (c-my-speech-act sp-act-state)
     (c-v-my-grammatical-aspect gram-aspect-neutral)
     (c-v-my-lexical-aspect state)
     (c-v-my-absolute-tense past)
     (c-v-my-phase-aspect durative)
     (c-my-imperative-degree imp-degree-n/a)
     (c-my-ynq-type ynq-n/a)
     (c-my-actor's-sem-role actor-sem-role-neutral)
     (c-my-minor-type minor-n/a)
     (c-my-headedness-rc rc-head-n/a)
     (c-my-answer-type ans-n/a)
     (c-my-restrictivess-rc rc-restrictive-n/a)
     (c-my-focus-rc focus-n/a)
     (c-my-actor's-status actor-neutral)
     (c-my-gaps-function gap-n/a)
     (c-my-relative-tense relative-n/a))
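To make the "multi-level sets of feature-value pairs" concrete, the sketch below reads a feature structure written in the parenthesized notation into nested Python dictionaries. It is an illustration only, not part of the AVENUE tools: the function names are hypothetical, the example is a shortened version of the structure shown here, and the reader assumes that every feature has exactly one value (a fully specified structure, not a specification with value lists).

```python
def tokenize(text):
    """Split the parenthesized notation into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens):
    """Read one expression, consuming tokens from the front of the list."""
    token = tokens.pop(0)
    if token != "(":
        return token  # an atom such as a feature name or a value
    items = []
    while tokens[0] != ")":
        items.append(read(tokens))
    tokens.pop(0)  # drop the closing ")"
    return items


def to_dict(pairs):
    """Turn a list of [feature, value] pairs into a nested dict."""
    result = {}
    for name, value in pairs:
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result


def parse_feature_structure(text):
    return to_dict(read(tokenize(text)))


# Shortened version of the feature structure for 'He was a teacher.'
EXAMPLE = """
((subj ((np-my-general-type pronoun-type)
        (np-my-person person-third)
        (np-my-number num-sg)
        (np-my-biological-gender bio-gender-male)))
 (c-my-copula-type role)
 (c-v-my-absolute-tense past))
"""

fs = parse_feature_structure(EXAMPLE)
print(fs["subj"]["np-my-person"])    # person-third
print(fs["c-v-my-absolute-tense"])   # past
```

Clause-level features sit at the top level of the dictionary, while noun-phrase features are nested under the grammatical function they describe, which mirrors the two levels visible in the notation.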
When paired with an English grammar and lexicon, the feature structure shown above will generate ‘He was a teacher.’

Feature Detection

[Figure: feature detection diagram. Labels appearing in the figure: Mapping, Difference Detection, Sentence Selection, Minimal Pair Linking, Translation/Alignment.]

Minimal pair used in the figure:

    “I was a teacher”    Watashi wa sensei deshita
    ((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
     (Obj ((person third) (animacy human) (identifiability -) …))
     (tense past) …)

    “I am a teacher”     Watashi wa sensei desu
    ((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
     (Obj ((person third) (animacy human) (identifiability -) …))
     (tense present) …)

The figure marks matching features with = and the differing feature with ≠. In this pair the two structures differ only in tense (past versus present).

Substitution mismatch
Difference is found on ME
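The comparison at the heart of the minimal pair example can be sketched as follows. The poster does not describe the actual algorithm, so this is a naive illustration under stated assumptions: feature structures are held as nested dictionaries, only the features visible in the example are included (the elided "…" parts are left out), and the two translations are compared position by position, which only works when they have the same number of words. All function names are hypothetical.

```python
def flatten(fs, prefix=()):
    """Map each feature path, e.g. ('Subj', 'person'), to its atomic value."""
    flat = {}
    for name, value in fs.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def feature_differences(fs_a, fs_b):
    """Return the feature paths whose values differ between two structures."""
    a, b = flatten(fs_a), flatten(fs_b)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


def token_differences(sent_a, sent_b):
    """Position-wise comparison; assumes a one-for-one word substitution."""
    return [(x, y) for x, y in zip(sent_a.split(), sent_b.split()) if x != y]


past = {
    "Subj": {"person": "first", "num": "sg", "animacy": "human"},
    "Obj": {"person": "third", "animacy": "human", "identifiability": "-"},
    "tense": "past",
}
present = {**past, "tense": "present"}

print(feature_differences(past, present))
# {('tense',): ('past', 'present')}
print(token_differences("watashi wa sensei deshita", "watashi wa sensei desu"))
# [('deshita', 'desu')]
```

When a single feature difference lines up with a single word substitution, as here, the pair suggests that the elicited language marks that feature on the substituted word. Real data would need the word alignments produced by the elicitation tool instead of a position-wise comparison.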