Speech Technology R&D in Pakistan
Sarmad Hussain
Center for Language Engineering, Al-Khawarizmi Institute of Computer Sciences
University of Engineering and Technology, Lahore, Pakistan
www.cle.org.pk
OCOCOSDA 2010, Kathmandu, Nepal

Overview
• 66+ languages in Pakistan
  • Indo-Aryan, Indo-Iranian, Sino-Tibetan and Dravidian language families
  • ~40 have a writing system
    • Mostly Nastalique style of Arabic script
    • Others: Tibetan, Gujarati, Gurmukhi and Latin
• This presentation
  • Text Corpora
  • Speech Corpora
  • R&D in ASR and TTS
  • Conclusions and Future Directions

Text Corpora
• EMILLE Corpus (2003) (ELRA W0037 & W0038) (http://www.emille.lancs.ac.uk/manual.pdf)
  • Urdu text (1,640k words)
  • 200k words parallel to English and other South Asian languages
  • Transcribed spoken data (512k words)
  • Automatically POS tagged with morphologically rich tagset (http://www.lancs.ac.uk/fass/projects/corpus/emille/U1tagset.htm)
• CRULP Corpus (unreleased; Ijaz & Hussain, 2007)
  • Urdu text (18,000k words)
  • Online news, balanced across domains
  • Manually POS tagged with syntactic tagset (100k words)
• PAN Localization Corpus (2009; available at www.cle.org.pk)
  • Urdu translation from Penn Treebank (150k words)
  • Manually POS tagged with syntactic tagset
  • Parallel to other Asian languages, translated in the project
• Sindhi Language Corpus (CLT10)
  • 4.1 million words from online sources
  • No annotation

Speech Corpora
• ARL Urdu Speech Database (2007) (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03)
  • 200 speakers from across Pakistan
  • 400 utterances each
  • Transcribed in Urdu
• CRULP HEC Corpus (2010; open content license; presented at this conference)
  • 80 speakers (40 male, 40 female), Lahore suburban dialect
  • 47 hours (31 spontaneous, 16 read)
  • Greedily developed sentences for phonetic coverage
  • Ages between 20 and 45
  • Transcribed in Urdu and IPA
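
The slide does not say how the sentences were "greedily developed", so the following is only a minimal sketch of one common approach: greedy set cover over phonetic units. It assumes each candidate sentence is already transcribed as a list of phoneme symbols and that the units to cover are adjacent phoneme pairs. The function name `greedy_select`, the sentence ids and the phoneme symbols are all hypothetical.

```python
def greedy_select(sentences, max_sentences=None):
    """Pick sentences that together cover as many phoneme pairs as possible.

    sentences: dict mapping a sentence id to its list of phoneme symbols.
    Returns the chosen sentence ids in the order they were picked.
    """
    # Assumption: the coverage unit is an adjacent phoneme pair.
    units = {sid: set(zip(ph, ph[1:])) for sid, ph in sentences.items()}
    uncovered = set().union(*units.values())
    chosen = []
    while uncovered and (max_sentences is None or len(chosen) < max_sentences):
        # Greedy step: take the sentence adding the most still-uncovered units.
        best = max(units, key=lambda sid: len(units[sid] & uncovered))
        gain = units[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy candidates with made-up phoneme strings.
candidates = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "d"],
}
selected = greedy_select(candidates)
```

In this toy run `s2` is never picked, because `s1` already covers its only phoneme pair. A real selection would likely weight rare units and cap sentence length, which this sketch does not attempt.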

TTS and ASR Research
• TTS (2003-2006)
  • Supported through E-Govt. Directorate, MoIT
  • Indigenous system developed, based on CELP
  • Rule-based complete NLP engine
  • Diphone synthesis, based on nonsense speech (5000 diphones)
  • Limited work on intonation
  • Released open source
• ASR (presented at this conference)
  • Supported through HEC/USAID grant
  • Sphinx-based system
  • High error rate, as it is based on 40 hours of speech
• ASR (at www.cle.org.pk)
  • Data collection and parameter tweaking in progress
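
To illustrate what a diphone inventory refers to, here is a minimal sketch of how a phoneme sequence breaks down into diphone units, each spanning the transition from one phoneme to the next. It is not the project's implementation: the function name `to_diphones`, the silence marker `_` and the example phonemes are assumptions made for illustration.

```python
def to_diphones(phonemes, silence="_"):
    """Split a phoneme sequence into diphone labels.

    The sequence is padded with a silence symbol at both ends so that
    the word-initial and word-final transitions are also represented.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Hypothetical three-phoneme word.
units = to_diphones(["k", "a", "m"])
```

A sequence of n phonemes yields n + 1 diphones under this padding, so the size of the recorded inventory grows with the number of distinct phoneme pairs that occur rather than with vocabulary size.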

Conclusions and Future Directions
• Need to include speech processing coursework in graduate-level curricula
• More funding needed for TTS and ASR, especially to develop the linguistic resources such systems require
  • Annotated data for unit-selection-based TTS
  • Transcribed data for ASR
• Need international collaboration to improve quality
• Need linguistic resources and work on other Pakistani languages, as most work so far is on Urdu