
Problems and Prospects in Collecting Spoken Language Data

Kishore Prahallad, Suryakanth V Gangashetty, B. Yegnanarayana, Raj Reddy. IIIT Hyderabad, India; Carnegie Mellon University, USA.



Presentation Transcript


  1. Problems and Prospects in Collecting Spoken Language Data • Kishore Prahallad, Suryakanth V Gangashetty, B. Yegnanarayana, Raj Reddy • IIIT Hyderabad, India • Carnegie Mellon University, USA

  2. Outline • Need for digital library of audio and video data • Characteristics of spoken language data • Prototype data collection • IIIT Hyderabad • IIT Madras • Lessons Learnt • Proposal to collect IL data • As a part of Jim Baker’s global project

  3. Need for Digital Library of Audio & Video Data • Current and future data will be in audio and video formats • Current technology makes it possible to digitize and store such large amounts of data • Collection, storage and indexing of such data makes it possible to provide information to current and future generations • Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections • Algorithms for quicker and easier access to the information present in AV format by providing a query in text / audio / video modes • Algorithms using multi-modal data for biometric authentication • Development of multi-lingual speech synthesis and speech recognition systems

  4. Characteristics of Spoken Language Data • Message – Information to be conveyed • Speaker – Who is the speaker? • His/her background – Age, gender, literacy levels, knowledge levels, mannerisms etc. • Emotions – Anger, sadness, happiness etc. • Idiolect – An individual's distinctive style of speaking • Medium of transmission – Microphone, telephone, satellite etc. • Environment – Party, airport/station etc. • Language • Dialect – Grammar and vocabulary associated with a regional or social use of a language • Culture and civilization – The richness of vocabulary and grammar usage indicates the times of the language and the society.

  5. Characteristics of Spoken Language Data • How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond? • How was a famous poem recited or sung by its author? • How was a particular language spoken in different geographical locations of a state/country? • How has a particular language/dialect evolved over a period of time? • What were the rare languages/dialects (which are no longer in existence)? How were they spoken?

  6. Phase 0: Prototype data collection at IIIT Hyd • High quality studio recordings • 2 hrs of single speaker recordings for speech synthesis • Telugu, Hindi, Tamil and Indian-English • Developed text to speech systems in these 4 languages • Telephone and Cell-phone corpus • 150 hrs (540 speakers) • Telugu, Tamil and Marathi • Developed speech recognition systems in these 3 languages

  7. Phase 0: Prototype data collection at IIT Madras • 15 hours (72 speakers) • TV news in Tamil, Telugu and Hindi • Text-to-speech (TTS) systems • Language identification • Duration modeling for TTS systems

  8. Tools Aiding Acquisition/Correction of Speech Data • Transcription correction tool (TCT) • Spoken errors at phone, syllable, word level • Background noise, abrupt begin or end, low SNR • TCT corrects the above errors at three levels • Audio & Video Transcription Tool • Used to annotate movie databases • Correction of segment labels • Emulabel

  9. Lessons Learnt • Speech correction needs 3-6 times more effort than collection • Better to collect more data than to correct it • Needs a unified framework • Standardize processes, procedures and tools • Need larger collection of spoken and text corpora • For building practical speech systems in Indian languages

  10. Proposal for collection of larger Spoken Language Data for IL • Focus on information present in speech mode • Collect spoken language data from all Indian languages and also from neighboring countries • Collect about 200,000 (0.2 M) hours of speech • As a part of Jim Baker’s global project of collecting 1 million hours of speech

  11. New in our approach • Collection of large speech data, up to 200,000 (0.2 M) hours • All Indian languages and dialects • 23 official Indian languages • Approx. 10,000 hours per language • All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc. • All modes: microphone, clean, telephone, cellphone, satellite etc. • Standard procedure for organizing, annotating and indexing • More focus on larger collection (and on elimination rather than correction) • Make this data available for general public use

  12. Key Make-A-Difference Capability • Availability of information (stories, lectures, poems, books, articles) in spoken language • For the illiterate • For the vision impaired • Collection and storage of spoken language data of popular as well as rare languages & dialects • Promotes research and development in • Speech Technology • Speech-to-speech translation in Indian languages • Phonetic engine (language independent) • Speech synthesis (text-to-speech for Indian languages) • Speaker recognition (text independent and dependent) • Language identification • Speech enhancement • Speech signal processing • Biometrics: • Multimodal: audio-video modes • Information Access, Storage and Retrieval • Audio-video data (indexing) • Data mining (searching) • Speech coding (ultra-low bit rate coding)

  13. Implementation Plan • Phase 1: (3.5 months) • 10 languages • 33,300 hours • Phase 2: (8 months) • 10 (of phase 1) languages • 66,000 hours • Phase 3: (10 months) • 13 - remaining languages • 80,000 hours
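The phase figures above can be tallied with a few lines of arithmetic. The sketch below assumes the three phases run one after another and that their hour counts are additive rather than cumulative; the slide does not say so explicitly. The 200,000-hour target used for comparison is the figure from the proposal slide, and all names in the code are illustrative.

```python
# Phase figures are copied from the Implementation Plan slide.
# ASSUMPTION: phases are sequential and their hours add up (not cumulative).
PHASES = [
    # (name, duration in months, number of languages, hours of speech)
    ("Phase 1", 3.5, 10, 33_300),
    ("Phase 2", 8.0, 10, 66_000),
    ("Phase 3", 10.0, 13, 80_000),
]
TARGET_HOURS = 200_000  # overall target quoted in the proposal slide


def summarize(phases):
    """Return total months, total hours, and hours per language for each phase."""
    total_months = sum(months for _, months, _, _ in phases)
    total_hours = sum(hours for _, _, _, hours in phases)
    per_language = {name: hours / langs for name, _, langs, hours in phases}
    return total_months, total_hours, per_language


total_months, total_hours, per_language = summarize(PHASES)
print(f"Total duration: {total_months} months")
print(f"Total hours:    {total_hours:,}")
print(f"Gap to target:  {TARGET_HOURS - total_hours:,} hours")
for name, hours in per_language.items():
    print(f"{name}: about {hours:,.0f} hours per language")
```

Under these assumptions the plan sums to 179,300 hours over 21.5 months, somewhat below the 200,000-hour target, so the target is best read as an approximate figure.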

  14. Mid-Term and Final Terms • Mid-Term • Phase 1, collection of 33,300 hours of speech • Collection, Storage and Indexing of speech data for public information access • Visible research output using the speech data • Demonstrations of speech technology products • Speech recognition in 10 languages • Final Term • Phase 1 + Phase 2

  15. Q & A

  16. Misc….

  17. Impact of Audio Digital Library • Availability of information in spoken language form for the illiterate and others • Promotes research in speech technology for Indian languages • Enables development of speech technology products useful for the common man • Examples: • Speech-to-speech translation systems • For information exchange • Screen readers • For the illiterate and physically challenged • Naturally speaking dialog systems • For information access over voice mode

  18. Phase 1: Time Estimate • Phase 1: • 10 official Indian languages • Parallel collection of data • ~ 3000 hours per language • 5,000 - 10,000 speakers • > 10 min of speech per speaker • Total: 33,300 hours • Time estimates: (~ 3.5 months for all 10 languages) • 10-person team per language • Each person works • 8 hours a day • 30 mins of speech recording per hour • 1-3 speakers per hour • 240 mins of speech per day • 1-24 speakers per person per day • 240 speakers per team per day • 20,000 speakers per language in 84 working days
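The time estimate can be reproduced step by step. The sketch below uses only the slide's own figures (8 hours a day, 30 minutes of recording per hour, 10 minutes per speaker, 10 people per language, 84 working days, 10 languages). It assumes every collector reaches the upper bound of 3 speakers per hour; the function and variable names are illustrative.

```python
# Inputs taken from the Phase 1 time-estimate slide.
HOURS_PER_DAY = 8            # working hours per person per day
RECORDED_MIN_PER_HOUR = 30   # minutes of speech recorded per working hour
MIN_PER_SPEAKER = 10         # minutes of speech collected from each speaker
TEAM_SIZE = 10               # people collecting data per language
WORKING_DAYS = 84            # roughly 3.5 months
LANGUAGES = 10               # languages collected in parallel


def phase1_estimate():
    """Derive speaker and hour counts, assuming full utilisation (upper bound)."""
    speech_min_per_person_day = HOURS_PER_DAY * RECORDED_MIN_PER_HOUR       # 240
    speakers_per_person_day = speech_min_per_person_day // MIN_PER_SPEAKER  # 24
    speakers_per_team_day = speakers_per_person_day * TEAM_SIZE             # 240
    speakers_per_language = speakers_per_team_day * WORKING_DAYS            # 20,160
    hours_per_language = speakers_per_language * MIN_PER_SPEAKER / 60       # 3,360
    return {
        "speech_min_per_person_day": speech_min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


estimate = phase1_estimate()
for key, value in estimate.items():
    print(f"{key}: {value:,}")
```

With these inputs the calculation gives 20,160 speakers and 3,360 hours per language, or 33,600 hours in total. Rounding to 20,000 speakers per language gives about 3,333 hours per language, which matches the slide's total of 33,300 hours.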

  19. Phase 1: Cost Estimate • Man power cost: Rs 140 Lakhs • Equipment cost: Rs 55 Lakhs • Communication cost: Rs 40 Lakhs • Contingency (10%): Rs 25 Lakhs • Total cost: Rs 2.6 Crores (~ $ 565,000)
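As a quick check, the four cost heads can be summed and converted to crores. The sketch below uses the slide's figures only; the exchange rate is not given in the source and is derived here from the slide's own rupee and dollar totals. Names are illustrative.

```python
LAKH = 100_000
CRORE = 100 * LAKH

# Cost heads in lakhs of rupees, as listed on the slide.
COST_HEADS_LAKHS = {
    "manpower": 140,
    "equipment": 55,
    "communication": 40,
    "contingency": 25,
}
QUOTED_USD = 565_000  # dollar figure quoted on the slide

total_lakhs = sum(COST_HEADS_LAKHS.values())
total_crores = total_lakhs * LAKH / CRORE

# Contingency as a fraction of the other three heads (the slide labels it 10%).
base_lakhs = total_lakhs - COST_HEADS_LAKHS["contingency"]
contingency_fraction = COST_HEADS_LAKHS["contingency"] / base_lakhs

# Exchange rate implied by the slide's two totals (derived, not stated in the source).
implied_rs_per_usd = total_lakhs * LAKH / QUOTED_USD

print(f"Total: Rs {total_lakhs} lakhs = Rs {total_crores} crores")
print(f"Contingency: {contingency_fraction:.1%} of the base cost")
print(f"Implied exchange rate: Rs {implied_rs_per_usd:.1f} per USD")
```

The heads sum to Rs 260 lakhs, that is Rs 2.6 crores, and the contingency works out to a little over 10% of the remaining Rs 235 lakhs, so the "10%" on the slide appears to be rounded up.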

  20. Man-Power Cost • Data collection team: Rs 86 Lakhs • 10 (for data collection) x Rs 10 K PM • 10 (for data correction) x Rs 10 K PM • 1 data manager (Rs 15 K PM) • 4 months cost: Rs 8,60,000 per language • 5 engineers: Rs 4 Lakhs • B.Tech level (Rs 20,000 PM) • Gifts for speakers: Rs 50 Lakhs • Rs 25 per speaker
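The manpower figure of Rs 140 lakhs can be rebuilt from the per-month rates. The sketch below makes three assumptions that the slide does not state directly: the engineers are paid for the same 4 months as the collection teams, the team cost is multiplied across the 10 Phase 1 languages, and gifts go to 20,000 speakers per language (the figure from the time estimate). Names are illustrative.

```python
LAKH = 100_000
MONTHS = 4                     # costing period used on the slide
LANGUAGES = 10                 # ASSUMPTION: one team per Phase 1 language
SPEAKERS_PER_LANGUAGE = 20_000 # ASSUMPTION: taken from the time-estimate slide

# Per-language team, monthly salaries in rupees (from the slide).
collectors = 10 * 10_000
correctors = 10 * 10_000
data_manager = 1 * 15_000
team_per_language = (collectors + correctors + data_manager) * MONTHS
team_total = team_per_language * LANGUAGES

# ASSUMPTION: the 5 engineers are shared across languages and paid for 4 months.
engineers_total = 5 * 20_000 * MONTHS

gifts_total = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES

manpower_total = team_total + engineers_total + gifts_total

print(f"Team per language: Rs {team_per_language:,}")
print(f"Teams, all languages: Rs {team_total / LAKH:g} lakhs")
print(f"Engineers: Rs {engineers_total / LAKH:g} lakhs")
print(f"Gifts: Rs {gifts_total / LAKH:g} lakhs")
print(f"Manpower total: Rs {manpower_total / LAKH:g} lakhs")
```

Under these assumptions the three components come to Rs 86, 4 and 50 lakhs, which together reproduce the Rs 140 lakhs manpower head.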

  21. Machines Cost • Machines: • 30 servers: Rs 30 Lakhs • 3 servers per language • Each server has 4 ports for data collection • 30 CTI cards: Rs 20 Lakhs • Storage: 20 TB: Rs 5 Lakhs • Two copies of 20 TB

  22. Communications Cost • Telephonic charges: Rs 20 Lakhs • Rs 1 per min (local telephonic charges) • Transportation: Rs 20 Lakhs
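The telephone charge follows from the Phase 1 volume at the quoted local rate. The sketch below assumes that every recorded minute is a billed call minute, with no allowance for call setup, retries or discarded recordings; the 33,300-hour volume is the Phase 1 total from the earlier slides. Names are illustrative.

```python
LAKH = 100_000

PHASE1_HOURS = 33_300   # Phase 1 speech volume from the time estimate
RATE_RS_PER_MIN = 1     # local telephone charge quoted on the slide
TRANSPORT_LAKHS = 20    # transportation head from the slide

# ASSUMPTION: billed minutes equal recorded minutes (no overhead).
telephone_rs = PHASE1_HOURS * 60 * RATE_RS_PER_MIN
telephone_lakhs = telephone_rs / LAKH
communication_lakhs = telephone_lakhs + TRANSPORT_LAKHS

print(f"Telephone charges: Rs {telephone_lakhs:.2f} lakhs")
print(f"Communication total: Rs {communication_lakhs:.2f} lakhs")
```

This gives just under Rs 20 lakhs for telephone charges, in line with the slide, and about Rs 40 lakhs for the communication head once transportation is added.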
