Spoken Language Corpora for the Official African Languages of South Africa
Jens Allwood, Göteborg University, Department of Linguistics
Leif Grönqvist, Växjö University, School of Mathematics and Systems Engineering; Göteborg University, Department of Linguistics
Allwood & Grönqvist

Background
• Corpus work in Gothenburg
• A project cooperation with UNISA (University of South Africa) in Pretoria
• Financed by SIDA and NRF
  • Travel money for Göteborg
  • Some more money for Pretoria covering practical corpus work

Why?
• Creating support for survival of endangered languages
• Linguistic corpora are very important resources for a language
• Spoken language corpora
  • Unexplored
  • Speech recognition/synthesis
  • Language learning
  • Standardization

Who?
• African Languages: Ncedile, Mmemesi
• Linguistics (UNISA): Rusandré Hendrikse, Mvuyesi
• Linguistics (Göteborg): Jens, Leif

THE ASMARA DECLARATION – 2000 (UNESCO)
• Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
• All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.

THE ASMARA DECLARATION – 2000 (UNESCO), cont’d
• Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
• The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.

OBJECTIVES
• To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA
• The resources will be in the form of
  • Archived audio-visual recordings of activity-based natural language use
  • Machine-readable transcriptions of recordings for corpus-driven searches
  • Morphologically tagged corpora for corpus-based searches
  • Other kinds of analysis – manual or automatic

Spoken language corpora for:
• Xhosa
• Zulu
• Ndebele
• Siswati
• Southern Sotho
• Tswana, Tsonga, Venda
• Northern Sotho (Pedi)
• Afrikaans
• English

PROJECT MANAGEMENT

PROJECT PHASES: 2002-2004
• Ongoing audio-video recordings of activity-based spoken language use (min. 200 hours per language).
• Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
• Checking and editing of transcriptions.
• Manual morphological tagging of corpora.
• Automated tagging of corpora.
• Research outputs.

Workshop overview
• The Asmara Declaration – Ncedile
• What’s the point of spoken language corpora? – Jens
• Overview of the project and its phases – Rusandré
• The recording phase – Jens/Mmemesi
• The transcription phase – Jens/Mvuyesi
• The checking phase – Jens/Ncedile
• The tagging phase – Leif/Rusandré
• Research output – Jens

The workshops, etc.
• Seminars at UNISA, Pretoria
• Rhodes University, Grahamstown
• University of the Transkei, Umtata
• Natal University, Durban
• Other places

Contacts from the workshops
• Durban
  • Isizulu Programme, University of Durban: NN Gumede, CT Gumede, NP Ndimande, NN Mathonsi
  • Isizulu Programme, University of Natal: NS Turner, S Naidoo, CNT Ntshangase, MP Kufa, SE Ximba
• Grahamstown
  • African Languages, Rhodes University: Bulelwa Nosilela, John Claughton, Ntosh Mazwi
  • ISEA, Rhodes University: Prof Laurence Wright, Ms Cossie Rasana
  • Vista, Port Edward: Prof BB Mkonto
  • SAUL, Fort Hare: Mr Zandisile Wilberforce
  • Dept. Sport, Arts & Culture, Grahamstown: Vaugham Japtha
• Umtata
  • UNITRA, African Languages: RM Nakin, N Vapi

The transcription header
@ Recorded activity ID: V010501
@ Activity type: Informal conversation
@ Recorded activity title: Getting to know each other
@ Recorded activity date: 20020725
@ Recorder: Britta Zawada
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
@ Transcriber: Mvuyisi Siwisa
@ Transcription date: 20020805
@ Checker: Rusandre Hendrikse
@ Checking date: 20020912
@ Anonymised: No
@ Activity Medium: face-to-face
@ Activity duration: 00:44:30
@ Other time coding: Each section
@ Tape: V0105
@ Section: Family affairs
@ Section: Crime
@ Section: Unemployment
@ Section: Closing
@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga

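Because every header line has the shape `@ Key: value`, the header lends itself to simple machine reading. The following is a minimal Python sketch, not the project's own tooling: the function name `parse_header` is hypothetical, and it assumes that the first colon on a line separates the key from the value and that repeated keys (such as Participant and Section) should be collected into lists.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ Key: value' header lines into a dict mapping key -> list of values."""
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so values like 00:44:30 stay intact.
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # no 'Key: value' structure on this line
        fields[key.strip()].append(value.strip())
    return dict(fields)


header = [
    "@ Recorded activity ID: V010501",
    "@ Activity type: Informal conversation",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Activity duration: 00:44:30",
    "@ Section: Family affairs",
    "@ Section: Crime",
]
parsed = parse_header(header)
print(parsed["Participant"])
```

Keeping every value in a list, even for keys that occur once, means callers do not have to special-case the repeated Participant and Section lines.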
Contrastive stress, pauses and lengthening
$B: abanyeke bazihlalele nje:/abanyeABAZANGE bafune sikolo //uyayiqonda ke la meko yokungabikho mzali uqhubayo /uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi //kodwa ke //andigxeki nto kuba ke /ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
$A: [2 ya /m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata

Overlaps
§ Religion
$B: uyakhonza kanene
$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
$B: [4 nantso ke sisi // e: e: ]4
@ < name >

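In the example above, overlapping stretches of speech are enclosed in brackets that share an index (`[4 ... ]4`) across the two speakers. A hedged Python sketch of how such spans could be paired automatically follows. The function name `find_overlaps` is hypothetical, and the sketch assumes that an overlap opens and closes within a single contribution line and that the closing index always matches the opening one.

```python
import re
from collections import defaultdict

# '[N text ]N' where the closing index repeats the opening index.
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1")


def find_overlaps(contributions):
    """Map each overlap index to the list of (speaker, overlapped text) pairs."""
    found = defaultdict(list)
    for line in contributions:
        # '$A: text' -> speaker label and contribution text (split at first colon).
        label, _, text = line.partition(":")
        speaker = label.lstrip("$").strip()
        for index, span in OVERLAP.findall(text):
            found[int(index)].append((speaker, span))
    return dict(found)


lines = [
    "$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze "
    "ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela",
    "$B: [4 nantso ke sisi // e: e: ]4",
]
result = find_overlaps(lines)
for speaker, span in result[4]:
    print(speaker, "|", span)
```

Grouping by index makes it easy to check that every overlap has a partner in another speaker's contribution, which may be useful in the checking phase.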
Comment Lines
$A: kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa
@ < loan English: bag >
@ < gesture: hand wipes >
$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu< >
@ < name: clan name >
@ < comment: A drops her book >

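The comment lines follow the contribution they annotate and have the shape `@ < type: detail >`. As an illustration only, the Python sketch below attaches each comment to the preceding contribution. The names `parse_comment` and `attach_comments` are hypothetical, and the sketch assumes that comments always refer to the nearest contribution above them and that the first colon separates the comment type from its detail.

```python
def parse_comment(line):
    """Turn '@ < type: detail >' into a (type, detail) pair; detail may be empty."""
    body = line.lstrip("@").strip().strip("<>").strip()
    kind, _, detail = body.partition(":")
    return kind.strip(), detail.strip()


def attach_comments(lines):
    """Group each '$X:' contribution with the '@' comment lines that follow it."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith("$"):
            label, _, text = line.partition(":")
            records.append(
                {"speaker": label[1:].strip(), "text": text.strip(), "comments": []}
            )
        elif line.startswith("@") and records:
            records[-1]["comments"].append(parse_comment(line))
    return records


lines = [
    "$A: kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa",
    "@ < loan English: bag >",
    "@ < gesture: hand wipes >",
    "$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana >",
    "@ < name: clan name >",
    "@ < comment: A drops her book >",
]
records = attach_comments(lines)
for record in records:
    print(record["speaker"], record["comments"])
```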
Current status
• 20 hours of Xhosa recordings and transcriptions
• A preliminary coding scheme for morphology
• Ongoing work on recording, transcription and manual coding of morphology

Things to do
• Make transcription standards with examples for each of the nine languages
• Hand-tag some transcriptions for morphology to train an experimental tagger
• A frequency dictionary and/or a thesaurus for Xhosa

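As a rough illustration of the frequency dictionary item, word frequencies can be counted directly from the contribution lines once the transcription markup is stripped. The Python sketch below is an assumption-laden starting point, not the project's method: the function name `tokens` is hypothetical, stress capitals are folded to lower case, lengthening colons and braces are dropped without splitting the word, and pauses, overlap brackets and angle brackets are treated as separators.

```python
import re
from collections import Counter


def tokens(contribution):
    """Strip the speaker label and transcription markup; return lowercased words."""
    _, _, text = contribution.partition(":")    # drop the '$A' speaker label
    text = re.sub(r"\[\d+|\]\d+", " ", text)    # overlap brackets such as [2 and ]2
    text = re.sub(r"[{}:]", "", text)           # braces and lengthening marks
    text = re.sub(r"[/<>]", " ", text)          # pauses and angle brackets
    return text.lower().split()


lines = [
    "$A: [2 ya /m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu "
    "bushuman bam ndiseza kutshata ndiseza kutshata",
]
freq = Counter()
for line in lines:
    freq.update(tokens(line))

# Most frequent word forms first.
for word, count in freq.most_common(3):
    print(word, count)
```

Counting surface word forms like this ignores morphology, so a frequency dictionary for Xhosa would presumably need the morphological tagging described in the project phases before the counts become lemma-based.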
Last slide
• Summary
• Long-term plans