Meeting the scientific responsibilities of documentation efforts Lessons from NSF-DEL projects. Jonathan D Amith, Gettysburg College / LSA meetings 7 January 2011. Responsibilities of Endangered Language Research.
Meeting the scientific responsibilities of documentation efforts: Lessons from NSF-DEL projects
Jonathan D Amith, Gettysburg College / LSA meetings, 7 January 2011
Responsibilities of Endangered Language Research
• Humanistic: Involve native speakers and communities in creating an ethnographically rich corpus of threatened genres of discourse and endangered domains of cultural knowledge of interest to speakers, their descendants, and non-linguist scholars.
• Substantive: Produce dictionaries, recorded and transcribed corpora, and grammars of endangered languages.
• Scientific: Develop substantive archival resources in such a way as to facilitate future theoretical research and analysis of endangered or extinct languages.
Community support often depends on developing and donating materials of interest. Ethnobiological research facilitates this by providing a foundation for local field guides and exhibits, all with native language recordings and transcriptions. Ethnobiological language documentation also provides important material on endangered domains of cultural knowledge and on the disappearing lexicon used to describe them. It also addresses questions about:
1. conceptualization and classification (cognitive anthropology)
2. function and use (economic botany)
kwi1yo’1yo4 ti4na4ma4 (plant) / i3ta2bi1ka1 (flower)
Uses
The root of ti4na4ma4 is used to make a kind of shampoo against dandruff. First the root is crushed and then it is boiled. Boiling produces a foam that is used as a shampoo to remove dandruff. Although the dried vine can be used for firewood, it is rarely used for this purpose. The plant is quite rare and is difficult to cut because of the way it becomes entangled in the treetops.
Substantive materials to facilitate linguistic analysis
Dictionaries:
• Thorough: all lemmas in corpus, multiple senses, collocations and phrases
• Diagnostic: limitations on productivity (e.g., absence of potential stem for Mixtec verbs; inability to accept nonreferential object markers for ditransitive Nahuatl verbs)
• Language specific structure: entry architecture reflects particularities of target language
Corpus:
• Varied and extensive: 100 hours of endangered genres of discourse and threatened domains of cultural knowledge in texts with different speaker ages and sex
• Morphologically transparent: morphology should either be overtly represented (e.g., in an interlinear parse) or easy to determine (e.g., utilizing a deep orthographic representation); lemmas should be easily discoverable (either through parsing or orthographic conventions)
• Semantically clear: beyond simple glosses, text should include either free translation or facilitated lookup of headwords and appropriate senses and contextual use
Grammar:
• Corpus based: Tagged (e.g., PoS) corpus as primary source for grammar and lexicon.
• Elicitation: To complement the corpus, elicitation should also be considered primary material and archived as such.
• Testable: Particularly in regard to morphology: if electronic and executable, the grammar can be tested against a corpus (ability to parse) and evaluated in its predictive powers (ability to generate)
Language specific strategies in documentation efforts: Oapan Nahuatl and Yoloxóchitl Mixtec
Yoloxóchitl Mixtec
• Isolating morphology
• Surface forms vary little from underlying representations (some enclitic-induced harmonization, palatalization, labialization)
• Phrasal collocations prevalent
• Opaque and little studied derivational morphology, e.g., tonal variation (tio1ko4 ‘ant’; ña1ña4tio4ko4 ‘ant eater’; i3ta2tio14ko3 ‘ant flower’; ta1xi3 ‘to be fired’, ta3xi3 ‘to fire’)
Documentation strategy
• Deep orthography
• Lexicon is a finite number of headwords
• Dictionary subentries cover non-transparent meanings of collocations and phrases
• Cross-referencing of headwords is marked by XML tags in subentry phrases (e.g., under i3ni2 ‘heart’: <vmix>ka3ka3</vmix> <mix>i3ni2</mix> [lit., ‘to.wander heart’] to doubt)
Oapan Nahuatl
• Complex morphology
• Surface forms are often very distinct from underlying representations: {no+ikxi+pah+pa:ka+s+keh} > noxí:pa:seh
• Highly productive noun incorporation
• Rich derivational morphology
Documentation strategy
• Shallow orthography
• Necessity of limiting headwords
• Dictionary subentries cover non-transparent meanings of affixed stems (directionals, reduplications, nonreferential objects)
• Extensive cross-referencing of roots
• Reliance on transducer/parser for processing corpus
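The XML cross-referencing described for the Yoloxóchitl Mixtec dictionary can be read mechanically. The following is a minimal sketch in Python, not the project's actual tooling. It assumes the tagged phrase sits inside a wrapper element (here a hypothetical <subentry>) and that <vmix> and <mix> each mark a headword to be linked; the wrapper name and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

# The subentry phrase from the slide, wrapped in a hypothetical <subentry>
# element so that it forms a well-formed XML fragment.
subentry_xml = (
    "<subentry>"
    "<vmix>ka3ka3</vmix> <mix>i3ni2</mix> "
    "[lit., 'to.wander heart'] to doubt"
    "</subentry>"
)

def cross_references(xml_text, tags=("vmix", "mix")):
    """Return (tag, headword) pairs for every cross-reference tag, in order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text) for el in root.iter() if el.tag in tags]

refs = cross_references(subentry_xml)
for tag, headword in refs:
    print(tag, headword)
```

Each extracted headword could then serve as a lookup key into the main entries, which is what makes a phrase such as this one reachable from both of its component words.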
The Oapan Nahuatl transducer: Looking up and looking down through a morphologically complex language
The Nahuatl transducer (built by Mike Maxwell) can look up words (parsing) or produce surface forms (generating).
• Reading implications: Facilitates automatic dictionary look-up.
• Archiving implications: Facilitates parsed and glossed texts accompanied by dictionary entries of all lemmas present.
• Research implications: Facilitates lexical, morphological, syntactic, and semantic analysis. Facilitates lexicon enrichment, checking the accuracy of a grammar, and proofing corpus texts.
• Learning function: Enhances possibilities for teaching and revitalization.
• Cross-training/extensible function: Facilitates language processing across variants.
The transducer analyzes a written word into its morpheme constituents and provides for automatic lookup. Texts can be archived with parsed and glossed words and with the full text of all relevant dictionary entries.
Text Morphological Annotation: Advantages
Words that do not parse can be tagged as:
1. Misspelled (leading to correction of the transcription)
2. Morphologically unanalyzable (leading to correction of the morphological grammar or word classification in the lexicon)
3. Absent from the lexicon (leading to enrichment of the dictionary with new lemmas)
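The workflow above (parse every token, then review whatever fails) can be sketched as follows. This is only an illustration under stated assumptions: the real transducer is not reproduced here, so a one-entry lookup table (`TOY_ANALYSES`, hypothetical) stands in for it, using the single surface/underlying pair given on the earlier slide. The unaccented token `noxi:pa:seh` is a constructed example of a transcription slip, not attested data, and the three-way tag is assumed to be assigned by a human reviewer.

```python
from collections import Counter
from enum import Enum

class Unparsed(Enum):
    """The three tags from the slide, each with the follow-up it triggers."""
    MISSPELLED = "correct the transcription"
    UNANALYZABLE = "correct the morphological grammar or lexicon word class"
    NOT_IN_LEXICON = "add a new lemma to the dictionary"

# Stand-in for the transducer: surface form -> underlying analysis.
TOY_ANALYSES = {"noxí:pa:seh": "no+ikxi+pah+pa:ka+s+keh"}

def toy_parse(word):
    """Return the analysis for a word, or None if it does not parse."""
    return TOY_ANALYSES.get(word)

def split_by_parse(tokens, parse=toy_parse):
    """Separate parsed tokens from unparsed ones, counting the failures."""
    parsed, unparsed = {}, Counter()
    for tok in tokens:
        analysis = parse(tok)
        if analysis is None:
            unparsed[tok] += 1
        else:
            parsed[tok] = analysis
    return parsed, unparsed

tokens = ["noxí:pa:seh", "noxi:pa:seh", "noxí:pa:seh", "noxi:pa:seh"]
parsed, unparsed = split_by_parse(tokens)

# A reviewer then assigns one of the three tags to each unparsed form.
review = {"noxi:pa:seh": Unparsed.MISSPELLED}
for word, count in unparsed.items():
    print(word, count, review[word].value)
```

Counting failures by form, rather than listing every occurrence, keeps the review queue short: one decision about a recurring form can be applied to all of its tokens.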
Yoloxóchitl Mixtec: Conditions for use of a deep orthography and regular expression searches for concordancing
• Complicated tone patterns (approximately 30 sequences on a bimoraic word), but no tone sandhi or floating tones.
• Tonal elision occurs only on final tones: ku3xi2 ← ku3xi3 + 2 (written ku3xi(3)=2).
• Simple changes to stems (palatalization, labialization, vowel harmony) are all induced by encliticization.
• Example: Palatalization before an enclitic with initial /a/ or /o/:
  • Cii=[ao] → Cyaa/Cyoo
  • Ci’i=[ao] → Cya’a/Cyo’o
  • CVCi=[ao] → CVCya/CVCyo
Yoloxóchitl Mixtec research generously supported by the Endangered Languages Documentation Programme (SOAS) and NSF
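The tonal elision convention noted above, ku3xi(3)=2, keeps both the underlying stem and the surface form recoverable from one written string. A minimal Python sketch of that idea follows. It assumes the written form always has the shape stem + parenthesized elided tone + "=" + tonal enclitic, which is all the slide shows; enclitics with segmental content are not handled, and the function names are illustrative.

```python
import re

# stem, then the elided final tone in parentheses, then "=" and a tonal enclitic
ELIDED = re.compile(r"^(?P<stem>.*)\((?P<elided>\d+)\)=(?P<clitic>\d+)$")

def underlying(written):
    """Return (underlying stem, enclitic), or None if the shape does not match."""
    m = ELIDED.match(written)
    if m is None:
        return None
    return m["stem"] + m["elided"], m["clitic"]

def surface(written):
    """Drop the elided tone and attach the enclitic tone directly."""
    m = ELIDED.match(written)
    if m is None:
        return written
    return m["stem"] + m["clitic"]

print(underlying("ku3xi(3)=2"))
print(surface("ku3xi(3)=2"))
```

Because the elided tone stays in the written form, a search for the stem can match both the bare and the encliticized word, which is the property the regular expression searches rely on.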
Yoloxóchitl Mixtec: Regular expression searches and links to time-coded transcription
\b[Ii]3in\(?3\)?(=\w+?)?(=\w+?)? [\w\(\)='-]+? [\w\(\)='-]+
• \b[Ii]3in\(?3\)? : target word
• (=\w+?)?(=\w+?)? : optional clitics (2)
• [\w\(\)='-]+? [\w\(\)='-]+ : following two words [letters, numbers, parentheses, -, =, or ']
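A concordance search of this kind can be sketched in Python. This is an illustration under assumptions, not the project's actual search tool. The time-coded transcription is assumed to be available as (start, end, text) records, and the sample records are arbitrary strings of tokens taken from these slides, arranged only to exercise the pattern; they are not real Mixtec utterances. In the character class the hyphen is written last so that it is read as a literal character rather than as a range.

```python
import re

# Characters allowed in a following word: letters, digits, parentheses,
# "=", "'", and "-" (hyphen last, so it is literal).
WORD = r"[\w()='-]+"

PATTERN = re.compile(
    r"\b[Ii]3in\(?3\)?"      # target word; final tone may be in parentheses
    r"(?:=\w+)?(?:=\w+)?"    # up to two enclitics
    rf" {WORD} {WORD}"       # the two following words
)

# Hypothetical time-coded segments: (start seconds, end seconds, text).
segments = [
    (0.00, 2.40, "ka3ka3 i3in(3)=2 i3ni2 ta1xi3 tio1ko4"),
    (2.40, 4.10, "tio1ko4 ta3xi3 ku3xi3"),
    (4.10, 6.75, "I3in3 ku3xi(3)=2 i3ni2"),
]

def concordance(segments, pattern=PATTERN):
    """Return (start, end, matched text) for every hit in every segment."""
    hits = []
    for start, end, text in segments:
        for m in pattern.finditer(text):
            hits.append((start, end, m.group(0)))
    return hits

hits = concordance(segments)
for start, end, match in hits:
    print(f"{start:.2f}-{end:.2f}  {match}")
```

Returning the time codes with each hit is what lets a search result link straight back to the corresponding stretch of audio.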
The time-coded transcription can be accessed directly from the regex search results, which makes example sentences (sound and transcription) from natural discourse available for use.