HLT R&D in South Africa
HLT Collaboration between South Africa and the Low Countries Workshop
24 November 2008, Noordhoek, South Africa
Overview
• Specific R&D challenges
• Areas of active research
  • Text processing
  • Speech processing
  • Applications of HLT
• Main projects: current and recent
• Research institutions active in HLT
• Main R&D sponsors
Specific R&D Challenges
• Incompleteness of basic linguistic knowledge
• Scarcity of resources
  • Linguistic data
  • Technology components
• Uniqueness of user populations and languages
Research areas (1)
• Text processing:
  • Computational morphological analysis, POS tagging
  • Spelling checkers, grammar checkers
  • Machine translation, machine-aided translation
  • Computational lexicography
  • Wordnets
• Research focus:
  • Development of basic required components and tools
  • Data collection and corpus development
  • Technology transfer, cross-language learning, bootstrapping, language distances
  • Morphological analysis (MA) for agglutinative languages
Research areas (2)
• Speech processing:
  • ASR, TTS, spoken dialogue systems
  • Phonetic investigations for HLT
  • Speaker verification, spoken language identification (S-LID)
  • Speech tools (diarization, channel normalisation, speech detection)
• Research focus:
  • Development of basic required components and tools
  • Data collection and corpus development
  • Technology transfer, cross-language learning, bootstrapping, language distances
  • Timing information in speech
  • Multi-accent and multilingual acoustic modelling
  • Higher-order Markov models and other non-standard acoustic models
Research areas (3)
• Applications of HLT:
  • Telephone-based information systems
  • Computer-assisted language learning
  • Document proofing tools
  • Accessibility devices
  • Mobile devices
Main R&D initiatives
• Department of Arts and Culture (DAC): applications that support multilingualism, especially related to government service delivery
  • DAC A: Spelling checkers
  • DAC B: Machine-aided translation
  • DAC C: Lwazi: Multilingual telephony-based information delivery
• Department of Science and Technology (DST): directed research in HLT aimed at addressing SA national priorities
  • National HLT Network projects
• International collaborative projects
• Various individual research projects
Main R&D projects
• Text processing:
  • Computational morphological analysis: Unisa
  • Spellcheckers: DAC A
  • Machine translation: EtsaTrans, DAC B
• Speech:
  • Phonetic investigations: NHN PAST
  • ASR/TTS/spoken dialogue systems:
    • AST, Limpopo ASR
    • OpenPhone, Lwazi (DAC C)
    • Mobile E-learning for Africa (MELFA)
UNISA Computational Morphological Analysis
• Development of parsing tools for Bantu languages:
  • Computational morphological analysers
  • Disambiguators
  • Syntactic parsers
• Development of supporting resources for development and testing, including extensive underlying machine-readable lexicons
• Status:
  • Initiated in 2002 (for isiZulu morphological analyser)
  • Various prototypes under development (isiZulu, isiXhosa, Siswati, isiNdebele, Northern Sotho and Setswana)
  • Extended until 2010
• Principal researchers:
  • Sonja Bosch (Project Leader), Laurette Pretorius
  • Ansu Berg, Axel Fleisch, Albert Kotze, Petro Kotze, Memezi Mfusi, Lydia Mojapelo, Rigardt Pretorius, Linda van Huyssteen, Biffy Viljoen
• Sponsor: NRF
DAC A: Spelling checkers for public administration domain
• Development of spelling checkers for 10 official SA languages, specifically for use in government departments
• Spelling checkers for isiNdebele, isiXhosa, isiZulu and Siswati include morphological analysers for effective spellchecking of these agglutinative languages
• Status:
  • Final evaluation by client in progress
• Principal researchers:
  • MJ Puttkammer (NWU), S Pilon (NWU), DJ Prinsloo (UP), SE Bosch (Unisa)
• Sponsor: Department of Arts and Culture, CText
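The role of a morphological analyser inside such a spelling checker can be sketched as follows. This is a minimal illustration only, not the project's implementation: the tiny morpheme inventories, the simplified verb template (subject prefix + root + optional extension + final suffix) and the function names `segment` and `is_spelled_correctly` are all assumptions made for this example. The idea is that an agglutinative language builds far too many word forms to list individually, so a word is accepted when it can be segmented into known morphemes.

```python
from typing import List, Optional

# Toy morpheme inventories, loosely modelled on isiZulu verb structure.
# They are illustrative assumptions, not data from the DAC A project.
SUBJECT_PREFIXES = ("ngi", "ba", "u")
ROOTS = ("fund", "bon", "hamb")
EXTENSIONS = ("is",)          # optional, at most one in this sketch
FINALS = ("a", "ile")


def segment(word: str) -> Optional[List[str]]:
    """Return one segmentation prefix + root [+ extension] + final, or None."""
    for prefix in SUBJECT_PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for root in ROOTS:
            if not rest.startswith(root):
                continue
            tail = rest[len(root):]
            # Try the tail without an extension first, then with each extension.
            for ext in ("",) + EXTENSIONS:
                if not tail.startswith(ext):
                    continue
                final = tail[len(ext):]
                if final in FINALS:
                    return [part for part in (prefix, root, ext, final) if part]
    return None


def is_spelled_correctly(word: str) -> bool:
    """Accept a word if the toy analyser can segment it completely."""
    return segment(word.lower()) is not None


if __name__ == "__main__":
    for candidate in ("ngifunda", "bafundisa", "ngifnuda"):
        print(candidate, segment(candidate))
```

A production system would need a far richer grammar (noun classes, object markers, tense and negation morphemes, phonological alternations) and would typically use finite-state techniques rather than nested loops; the sketch only shows why segmentation scales better than a flat word list.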
EtsaTrans Machine Translation
• Development of a functional machine translation system
• Focus domain: mainly administrative documents
• Main languages: English to Afrikaans, Afrikaans to English
• Other languages: English to Xhosa, English to Southern Sotho
• Harvesting previously translated information to create parallel corpora
• Status:
  • Initiated in 2003, ongoing
  • Prototypes in use
• Principal researchers:
  • JA Naudé, L Jordaan
• Sponsor: UFS
DAC B: Machine-aided translation tools
• Development of translation tools:
  • An integrated translation environment (ITE)
  • Word translators
  • Machine translation systems for three language pairs
  • Terminology management system
  • Document management system
• Status:
  • Under development (2007-2010)
  • All tools, data and research output to be made available publicly
• Principal researchers:
  • HJ Groenewald, S Pilon (NWU)
  • DJ Prinsloo (UP)
• Sponsor: DAC
NHN PAST: Phonetics for Advanced Speech Technology
• Technology-orientated investigation and description of the vowel system of the Sotho languages, and of tone in the Sotho and Nguni languages
• Status:
  • Initiated May 2008
  • Due for completion June 2009
• Principal researchers:
  • E. Barnard (Meraka)
  • B. Khoali (independent consultant)
  • D. Wissing (NWU)
  • S. Zerbian (Wits)
• Sponsor: National HLT Network (DST/Meraka)
African Speech Technologies (AST)
• Development of a multilingual telephone-based hotel reservation system
• Developed corpora and technology components (TTS, ASR, dialogue systems) for South African English (SAE), Afrikaans, isiZulu, isiXhosa and Sesotho
• Status:
  • Completed 2004
  • Gave rise to a commercial company: Catchword
  • Data available for research purposes (release imminent)
• Principal researchers:
  • J.C. Roux, E.C. Botha, J. du Preez
  • Various collaborators
• Sponsor: DACST (Innovation Fund)
Limpopo ASR
• Development of baseline automatic speech recognition systems for the major languages of the Limpopo Province
• Languages: Sepedi (Sesotho sa Leboa), Setswana, Tshivenda and Xitsonga
• Telephone speech data collection and manual annotation
• Extension to text-to-speech synthesis and domain-specific prototype dialogue systems
• Status:
  • Baseline ASR systems completed (2004-2006)
  • Extension ongoing
• Principal researchers:
  • HJ Oosthuizen and MJD Manamela
• Sponsor: Telkom and other industry partners
OpenPhone
• Demonstrated use of telephone-based information services in providing health information in a rural setting
• Automated health information system that provides information to caregivers looking after HIV-positive children living in the vicinity of Gaborone in Botswana
• Includes Setswana TTS and ASR development
• Status:
  • Completed 2008, currently live
  • http://www.meraka.org.za/hlt_projects_ophone.htm
• Principal researchers:
  • Etienne Barnard, Marelie Davel, Madelaine Plauche
• Sponsor: OSI/OSISA, DST
Lwazi
• Development and piloting of a fully Open Source multilingual telephone-based information system
  • ASR and TTS systems in 11 official languages
  • ASR and TTS integrated into a telephony platform
  • Open Source resources and tools
  • Various pilots: first significant pilot with DPSA Community Development Workers
• Status:
  • Initiated September 2006
  • On track for completion September 2009
• Principal researchers:
  • Etienne Barnard, Marelie Davel, Gerhard van Huyssteen
• Sponsor: DAC
Mobile E-learning for Africa (MELFA)
• Mobile solutions for on-site literacy training and skills development for workers in the building and construction industry
• Includes text-to-speech and speech-to-speech translation
• Initially, 30 test persons in the Western Cape are involved in testing the modules for interactive M-learning
• Status:
  • Initiated in 2007, due for completion in 2009
• Principal researchers:
  • JC Roux (Project leader, SA), A Visagie, H Engelbrecht, A Magnusdottir, P Scholtz
• Sponsor: Danida (Danish government organisation)
Research institutions: Text 1
• Size: senior researchers / post-graduate students
Main R&D sponsors
• Department of Arts and Culture (DAC): applications that support multilingualism, especially related to government service delivery
• Department of Science and Technology (DST): directed research in HLT aimed at addressing SA national priorities
• National Research Foundation (NRF): support for individual researchers
• Industry: addressing industry-specific needs
  • ASR/TTS (Telkom, Intelleca, IBM, Google and others), spelling checkers (Microsoft)
  • Speech processing tools (Grintek, Armscor), speech-to-speech translation (Armscor)
• International donor funding: addressing developmental needs
  • Open Society Initiative (OSI/OSISA), Danish Danida
  • UK Dept for International Development (DfID)
  • Canadian International Development Research Centre (IDRC), and others
• Host institutions (universities, CSIR, etc.)