Problems and Prospects in Collecting Spoken Language Data

Kishore Prahallad, Suryakanth V Gangashetty, B. Yegnanarayana, Raj Reddy • IIIT Hyderabad, India • Carnegie Mellon University, USA

Outline • Need for digital library of audio and video data • Characteristics of spoken language data • Prototype data collection • IIIT Hyderabad • IIT Madras • Lessons learnt • Proposal to collect Indian language (IL) data • As a part of Jim Baker’s global project

Need for Digital Library of Audio & Video Data • Current and future data will be in audio and video formats • Current technology makes it possible to digitize and store such large amounts of data • Collecting, storing and indexing such data makes it possible to provide information to current and future generations • Acts as a test bed for several research challenges in organizing, indexing and retrieving such large data collections • Algorithms for quicker and easier access to information in AV format through text, audio or video queries • Algorithms using multi-modal data for biometric authentication • Development of multi-lingual speech synthesis and speech recognition systems

Characteristics of Spoken Language Data • Message – Information to be conveyed • Speaker – Who is the speaker? • His/her background – Age, gender, literacy level, knowledge level, mannerisms etc. • Emotions – Anger, sadness, happiness etc. • Idiolect – An individual's distinctive style of speaking • Medium of transmission – Microphone, telephone, satellite etc. • Environment – Party, airport/station etc. • Language • Dialect – The grammar and vocabulary associated with a regional or social use of a language • Culture and civilization – The richness of vocabulary, grammar etc. reflects the period of the language and the society

Characteristics of Spoken Language Data • How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond? • How was a famous poem recited or sung by its author? • How was a particular language spoken in different geographical locations of a state/country? • How has a particular language/dialect evolved over time? • What were the rare languages/dialects that no longer exist, and how were they spoken?

Phase 0: Prototype data collection at IIIT Hyd • High quality studio recordings • 2 hrs of single speaker recordings for speech synthesis • Telugu, Hindi, Tamil and Indian-English • Developed text to speech systems in these 4 languages • Telephone and Cell-phone corpus • 150 hrs (540 speakers) • Telugu, Tamil and Marathi • Developed speech recognition systems in these 3 languages

Phase 0: Prototype data collection at IIT Madras • 15 hours (72 speakers) • TV news in Tamil, Telugu and Hindi • Text-to-speech (TTS) systems • Language identification • Duration modeling for TTS systems

Tools Aiding Acquisition and Correction of Speech Data • Transcription correction tool (TCT) • Spoken errors at phone, syllable and word level • Background noise, abrupt beginning or end, low SNR • TCT corrects the above errors at three levels • Audio & Video Transcription Tool • Used to annotate movie databases • Correction of segment labels • Emulabel

Lessons Learnt • Correcting speech data takes 3-6 times more effort than collecting it • Better to collect more data than to correct it • Needs a unified framework • Standardize processes, procedures and tools • Need a larger collection of spoken and text corpora • For building practical speech systems in Indian languages

Proposal for a Larger Collection of Spoken Language Data for Indian Languages (IL) • Focus on information present in speech mode • Collect spoken language data from all Indian languages and also from neighboring countries • Collect about 200,000 (0.2 M) hours of speech • As a part of Jim Baker’s global project of collecting 1 million hours of speech

New in our approach • Collection of large speech data, up to 200,000 (0.2 M) hours • All Indian languages and dialects • 23 official Indian languages • Approx. 10,000 hours per language • All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc. • All modes: microphone, clean, telephone, cellphone, satellite etc. • Standard procedure for organizing, annotating and indexing • More focus on larger collection (and on elimination rather than correction) • Make this data available for general public use

Key Make-A-Difference Capability • Availability of information (stories, lectures, poems, books, articles) in spoken language • For the illiterate • For the visually impaired • Collection and storage of spoken language data of popular as well as rare languages & dialects • Promotes research and development in • Speech Technology • Speech-to-speech translation in Indian languages • Phonetic engine (language independent) • Speech synthesis (text-to-speech for Indian languages) • Speaker recognition (text independent and dependent) • Language identification • Speech enhancement • Speech signal processing • Biometrics: • Multimodal: audio-video modes • Information Access, Storage and Retrieval • Audio-video data (indexing) • Data mining (searching) • Speech coding (ultra-low bit-rate coding)

Implementation Plan • Phase 1 (3.5 months) • 10 languages • 33,300 hours • Phase 2 (8 months) • The same 10 languages as Phase 1 • 66,000 hours • Phase 3 (10 months) • The 13 remaining languages • 80,000 hours
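The phase figures above can be tallied with a short sketch. It assumes the phases run back to back and that Phases 1 and 2 cover the same 10 languages; the function and key names are illustrative only. As listed, the phases sum to 179,300 hours, somewhat below the 200,000-hour target, and the slides do not say how the difference is made up.

```python
# (name, duration in months, number of languages, hours of speech), as listed on the slide
PHASES = [
    ("Phase 1", 3.5, 10, 33_300),
    ("Phase 2", 8.0, 10, 66_000),
    ("Phase 3", 10.0, 13, 80_000),
]


def plan_summary(phases=PHASES):
    """Tally total hours, total duration and per-language hours for the plan."""
    total_hours = sum(hours for _, _, _, hours in phases)
    # Assumption: phases are sequential, so durations simply add up.
    total_months = sum(months for _, months, _, _ in phases)
    # Phases 1 and 2 cover the same 10 languages, so their hours are pooled.
    hours_per_first_language = (phases[0][3] + phases[1][3]) / phases[0][2]
    hours_per_remaining_language = phases[2][3] / phases[2][2]
    return {
        "total_hours": total_hours,
        "total_months": total_months,
        "hours_per_first_language": hours_per_first_language,
        "hours_per_remaining_language": hours_per_remaining_language,
    }


if __name__ == "__main__":
    for key, value in plan_summary().items():
        print(f"{key}: {value:,.1f}")
```

Under these assumptions the first 10 languages reach roughly the 10,000 hours per language quoted in the proposal, while the 13 remaining languages receive fewer hours each.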

Mid-Term and Final Terms • Mid-Term • Phase 1, collection of 33,300 hours of speech • Collection, Storage and Indexing of speech data for public information access • Visible research output using the speech data • Demonstrations of speech technology products • Speech recognition in 10 languages • Final Term • Phase 1 + Phase 2

Q & A

Misc.

Impact of Audio Digital Library • Availability of information in spoken language form for the illiterate and others • Promotes research in speech technology for Indian languages • Enables development of speech technology products useful to the common man • Examples: • Speech-to-speech translation systems • For information exchange • Screen readers • For the illiterate and physically challenged • Naturally speaking dialog systems • For information access over voice mode

Phase 1: Time Estimate • Phase 1: • 10 official Indian languages • Parallel collection of data • ~ 3,000 hours per language • 5,000 - 10,000 speakers • > 10 min of speech per speaker • Total: 33,300 hours • Time estimate: ~ 3.5 months for all 10 languages • 10-person team per language • Each person works • 8 hours a day • 30 min of speech recording per hour • 1-3 speakers per hour • 240 min of speech per day • Up to 24 speakers per day • Team total: 240 speakers per day • 20,000 speakers per language in 84 working days
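The arithmetic behind this estimate can be reproduced with a minimal sketch. It assumes exactly 10 minutes of speech per speaker (the slide gives this as a lower bound) and treats the per-day speaker counts as per-person and per-team figures respectively; the constant and function names are illustrative only.

```python
# Figures read off the slide; MIN_PER_SPEAKER = 10 is an assumption
# (the slide says "> 10 min" per speaker).
LANGUAGES = 10
TEAM_SIZE = 10               # persons per language
HOURS_PER_DAY = 8            # working hours per person
RECORDED_MIN_PER_HOUR = 30   # minutes of speech recorded per working hour
MIN_PER_SPEAKER = 10
WORKING_DAYS = 84


def phase1_estimate():
    """Recompute the Phase 1 throughput figures from the per-person rates."""
    min_per_person_day = HOURS_PER_DAY * RECORDED_MIN_PER_HOUR          # 240
    speakers_per_person_day = min_per_person_day // MIN_PER_SPEAKER     # 24
    speakers_per_team_day = speakers_per_person_day * TEAM_SIZE         # 240
    speakers_per_language = speakers_per_team_day * WORKING_DAYS        # 20160
    hours_per_language = min_per_person_day * TEAM_SIZE * WORKING_DAYS / 60
    total_hours = hours_per_language * LANGUAGES
    return {
        "min_per_person_day": min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": total_hours,
    }


if __name__ == "__main__":
    for key, value in phase1_estimate().items():
        print(f"{key}: {value:,.0f}")
```

With these inputs the sketch gives about 20,000 speakers and about 3,360 hours per language, in line with the slide's rounded figures of 20,000 speakers and 33,300 hours in total.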

Phase 1: Cost Estimate • Manpower cost: Rs 140 Lakhs • Equipment cost: Rs 55 Lakhs • Communication cost: Rs 40 Lakhs • Contingency (10%): Rs 25 Lakhs • Total cost: Rs 2.6 Crores (~ $565,000)

Manpower Cost • Data collection team: Rs 86 Lakhs • 10 (for data collection) x Rs 10 K per month • 10 (for data correction) x Rs 10 K per month • 1 data manager (Rs 15 K per month) • 4-month cost: Rs 8,60,000 per language • 5 engineers: Rs 4 Lakhs • B.Tech level (Rs 20,000 per month) • Gifts for speakers: Rs 50 Lakhs • Rs 25 per speaker
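The manpower figures can be checked with a short sketch. The 10 languages and the 4-month period come from the slides; the count of 200,000 speakers is an assumption, taken as 20,000 speakers per language across 10 languages from the Phase 1 time estimate. The function and key names are illustrative only.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees


def manpower_cost_rupees(languages=10, months=4, speakers=200_000):
    """Recompute the manpower cost components in rupees.

    speakers=200_000 is an assumption (20,000 speakers x 10 languages).
    """
    collectors = 10 * 10_000   # 10 people at Rs 10 K per month
    correctors = 10 * 10_000   # 10 people at Rs 10 K per month
    manager = 15_000           # 1 data manager at Rs 15 K per month
    team_per_language = (collectors + correctors + manager) * months
    team_total = team_per_language * languages
    engineers = 5 * 20_000 * months   # 5 engineers at Rs 20,000 per month
    gifts = 25 * speakers             # Rs 25 per speaker
    return {
        "team_per_language": team_per_language,
        "team_total": team_total,
        "engineers": engineers,
        "gifts": gifts,
        "total": team_total + engineers + gifts,
    }


if __name__ == "__main__":
    for key, rupees in manpower_cost_rupees().items():
        print(f"{key}: Rs {rupees / LAKH:.1f} Lakhs")
```

Under these assumptions the components come to Rs 86, 4 and 50 Lakhs, matching the Rs 140 Lakhs manpower total in the Phase 1 cost estimate.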

Machines Cost • Machines: • 30 servers: Rs 30 Lakhs • 3 servers per language • Each server has 4 ports for data collection • 30 CTI cards: Rs 20 Lakhs • Storage: 20 TB: Rs 5 Lakhs • Two copies of 20 TB

Communications Cost • Telephonic charges: Rs 20 Lakhs • Rs 1 per min (local telephonic charges) • Transportation: Rs 20 Lakhs
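The telephone charge appears to follow from the Phase 1 volume at the quoted local rate. The sketch below assumes that all 33,300 Phase 1 hours are collected over the telephone at Rs 1 per minute, which the slides do not state explicitly; the names are illustrative only.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees


def communications_cost_rupees(speech_hours=33_300, rupees_per_min=1,
                               transport_rupees=20 * LAKH):
    """Estimate communication costs in rupees.

    Assumption: every hour of Phase 1 speech is recorded over a local call.
    """
    telephone = speech_hours * 60 * rupees_per_min
    return {
        "telephone": telephone,
        "transport": transport_rupees,
        "total": telephone + transport_rupees,
    }


if __name__ == "__main__":
    for key, rupees in communications_cost_rupees().items():
        print(f"{key}: Rs {rupees / LAKH:.2f} Lakhs")
```

On that assumption the telephone charge comes to just under Rs 20 Lakhs, and the total to about Rs 40 Lakhs, consistent with the communication cost in the Phase 1 cost estimate.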
