Treebanking a Blackfoot Corpus • Joel Dunham • UBC
Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking
Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana
Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative
Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • ‘Why don’t you eat with her?’
OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages
OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes
Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.
BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)
BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)
BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory
BLAOLD • ... Collection (text) created by referencing Forms entered into the BLAOLD.
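A Collection, as described above, is a text built from ordered references to Forms rather than from copied text. The following Python sketch shows one way to model that idea; the `Form` and `Collection` classes, their fields, and the numeric ids are hypothetical and are not taken from the OLD source code.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """A single database entry (hypothetical, simplified)."""
    id: int
    transcription: str


@dataclass
class Collection:
    """A text defined as an ordered list of Form ids (hypothetical)."""
    title: str
    form_ids: list = field(default_factory=list)

    def render(self, forms_by_id):
        """Resolve each reference, in order, to its transcription."""
        return "\n".join(forms_by_id[i].transcription for i in self.form_ids)


# Transcriptions come from examples elsewhere in these slides; ids are invented.
forms = {
    f.id: f
    for f in [
        Form(1, "oma aakííwa iihpóma ónnikii"),
        Form(2, "kimaaksawohpokooyimasi"),
    ]
}
text = Collection("Example text", [2, 1, 2])
rendered = text.render(forms)
```

In this sketch a Form can appear in a Collection more than once, and correcting a Form's transcription changes every Collection that references it.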
BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video
BLAOLD • [Screenshot: Form with morphemic analysis] • Morpheme segmentation and morpheme gloss lines: blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num” • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)
BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?
Morphological Parser • ‘A morphological parser for Blackfoot’ (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
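The three output lines are parallel: the n-th morpheme, the n-th gloss, and the n-th POS tag describe the same unit. A minimal Python sketch of that alignment, using the example above, might look as follows. The `Morpheme` type and the `align` function are illustrative names, not part of the parser or the OLD, and the sketch assumes that hyphens occur only as morpheme delimiters.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    shape: str  # morpheme as segmented
    gloss: str  # morpheme gloss
    pos: str    # POS/category tag


def align(segmentation, glosses, pos_tags):
    """Zip three hyphen-delimited lines for one word into Morpheme triples."""
    parts = [line.split("-") for line in (segmentation, glosses, pos_tags)]
    if len({len(p) for p in parts}) != 1:
        raise ValueError("lines have different morpheme counts")
    return [Morpheme(*triple) for triple in zip(*parts)]


word = align(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
```

Here `word[4]` is the verb root: shape `ooyi`, gloss `eat`, POS `vai`.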
Morphological Parser • [Diagram: parsing pipeline] • Input: kimaaksawohpokooyimasi • FST: Phonology + Morphotactics (lexicon) • Phonology (from a grammar) hand-coded into FST • Morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: • variations in transcription • no hard and fast spelling rules • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
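The slide says that POS/morphemic N-grams select the most probable parse but gives no details. The Python sketch below illustrates the general idea only, under loud assumptions: it uses bigrams over POS tags, add-one smoothing, and a toy training set made of the per-word POS strings that appear in these slides. The actual N-gram order, smoothing, and training data of the parser are not stated in the source, and the second candidate parse is invented for the comparison.

```python
import math
from collections import Counter

# Toy training data: per-word POS strings from the examples in these slides.
TRAINING = [
    "prev-asp-vta", "drt-num", "nan", "drt-num", "agra-nan",
    "adt-asp-vai-oth-num",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "drt-num", "adt-asp-fin-fin-thm", "drt-num", "nan-nin", "und",
    "adt-adt-asp-fin-fin-thm-agrb-oth",
]


def bigrams(pos_string):
    """Tag bigrams of one word, padded with boundary symbols."""
    tags = ["<s>"] + pos_string.split("-") + ["</s>"]
    return list(zip(tags, tags[1:]))


bigram_counts = Counter()
context_counts = Counter()
vocab = set()
for pos_string in TRAINING:
    for a, b in bigrams(pos_string):
        bigram_counts[(a, b)] += 1
        context_counts[a] += 1
        vocab.update((a, b))


def log_prob(pos_string):
    """Add-one-smoothed bigram log-probability of a candidate parse."""
    v = len(vocab)
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + v))
        for a, b in bigrams(pos_string)
    )


# First candidate is attested in the slides; the second is hypothetical.
candidates = ["adt-asp-fin-fin-thm", "thm-fin-fin-asp-adt"]
best = max(candidates, key=log_prob)
```

With this toy data, the attested tag order scores higher than the scrambled one because each of its bigrams has been observed at least once.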
Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching
Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
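The search above can be tried directly with Python's `re` module. The sketch below runs the regex over three POS strings that appear in these slides; the dictionary keys and the helper name `has_two_nominal_roots` are illustrative only.

```python
import re

# The regex from the slide: two nominal-root tags anywhere in the POS string.
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr].*")

# POS strings copied from examples in these slides.
pos_strings = {
    "screenshot form": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "single word": "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "house sentence": "drt-num adt-asp-fin-fin-thm drt-num nan-nin und "
                      "adt-adt-asp-fin-fin-thm-agrb-oth",
}


def has_two_nominal_roots(pos_string):
    """True if the POS string contains two nominal-root tags, in order."""
    return NOMINAL_PAIR.search(pos_string) is not None


hits = {name: has_two_nominal_roots(s) for name, s in pos_strings.items()}
```

The first and third strings match and the second does not. The third matches only because of the single word tagged `nan-nin`, which contains two nominal roots, so the pattern can return a sentence that has just one overt nominal.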
Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • [Figure: example search results, labelled Good and Bad]
Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: ‘S < (NP $. (VP < NP))’ • [Tree diagram of the bracketing above]
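The TGrep pattern reads: an S that immediately dominates an NP whose immediately following sister is a VP that itself immediately dominates an NP. The Python sketch below hand-codes that one pattern over the bracketed tree, using only the standard library. It is not a TGrep implementation, the function names are illustrative, and the second tree is simply the slide's tree with the object NP deleted, included as a non-matching test case rather than as an attested sentence.

```python
import re


def parse(bracketed):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                    # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                    # consume ")"
        return (label, children)

    return read()


def label(node):
    """Node label, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def subtrees(node):
    """Yield every non-leaf node, top-down."""
    if isinstance(node, tuple):
        yield node
        for child in node[1]:
            yield from subtrees(child)


def matches(node):
    """Hand-coded check for: S < (NP $. (VP < NP))."""
    if label(node) != "S":
        return False
    kids = node[1]
    return any(
        label(left) == "NP"
        and label(right) == "VP"
        and any(label(k) == "NP" for k in right[1])
        for left, right in zip(kids, kids[1:])
    )


def has_overt_subject_and_object(bracketed):
    return any(matches(n) for n in subtrees(parse(bracketed)))


with_object = "(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))"
without_object = "(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))"
```

Calling `has_overt_subject_and_object(with_object)` returns `True`, and the same call on `without_object` returns `False`.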
Treebank • Assuming a flat morphological structure, syntactic phrase-structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words
Treebank • [Tree diagram; phrasal nodes: S, S, S, VP, VP, NP, NP; word-level tags: DEM VBZ DEM NN CC VBZ] • DEM: ann-wa (drt-num) • VBZ: á'p-á-istot-i-m (adt-asp-fin-fin-thm) • DEM: om-yi (drt-num) • NN: náápi-moyis (nan-nin) • CC: ki (und) • VBZ: saaki-á'p-á-istot-i-m-wa-áyi (adt-adt-asp-fin-fin-thm-agrb-oth) • ‘He is building that house and he is still building it.’
Treebank • Is treebanking Blackfoot worth the effort?