
Clinical Applications of Speech Technology

  1. Clinical Applications of Speech Technology
     Phil Green, Speech and Hearing Research Group, Dept of Computer Science, University of Sheffield
     pdg@dcs.shef.ac.uk

  2. Talk Overview
     • SPandH - Speech and Hearing @ Sheffield
     • The CAST group
     • Building Automatic Speech Recognisers: conventional methodology
     • ASR for clients with speech disorders
     • Kinematic Maps
     • Voice-driven Environmental Control
     • VIVOCA
     • Customising Voices
     • Future Directions
     CAST December 2007

  3. [Diagram: SPandH and its connections: Auditory Scene Analysis, Glimpsing, Missing Data Theory, CAST; Phonetics & Linguistics, Hearing & Acoustics, Electrical Engineering & Signal Processing, Speech & Language Therapy]

  4. • Prof Pam Enderby, Institute of General Practice and Primary Care, University of Sheffield: Speech Therapy
     • Prof Phil Green, Prof Roger K Moore, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield: Speech Technology
     • Dr Stuart Cunningham, Department of Human Communication Sciences, University of Sheffield: Speech Perception, Speech Technology
     • Prof Mark Hawley, School of Health and Related Research: Assistive Technology
     Contact: pdg@dcs.shef.ac.uk

  5. Conventional Automatic Speech Recogniser Construction
     Standard technique uses generative statistical models:
     • Each speech unit is modelled by an HMM with a number of states.
     • Each state is characterised by a mixture Gaussian distribution over the components of the acoustic vector x.
     • Parameters of the distributions are estimated in training (EM: Baum-Welch).
     • Training is based on a large pre-recorded, speaker-independent speech corpus.
     • All this is the acoustic model. There will also be a language model.
     • Decoding finds the model and state sequence most likely to have generated the observation sequence X.
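
The decoding step above can be sketched as Viterbi search over HMM states, where each state j scores an acoustic vector x with a Gaussian mixture, b_j(x) = Σ_m c_jm N(x; μ_jm, Σ_jm). This is a minimal illustration only, assuming diagonal covariances, a single tiny two-state model and hand-set parameters; the function names and values are hypothetical, and Baum-Welch training and the language model are left out.

```python
import math

NEG_INF = float("-inf")  # log probability of a disallowed transition

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, mixture):
    """Log density of a Gaussian mixture given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in mixture]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def viterbi(X, log_init, log_trans, states):
    """Most likely state sequence for observations X, with its log score.
    log_init[j] = log P(start in j); log_trans[i][j] = log P(j | i);
    states[j] = Gaussian mixture for state j."""
    n = len(states)
    delta = [log_init[j] + log_gmm(X[0], states[j]) for j in range(n)]
    backptrs = []
    for x in X[1:]:
        ptr, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_gmm(x, states[j]))
        backptrs.append(ptr)
        delta = new_delta
    # Trace back from the best final state
    state = max(range(n), key=lambda j: delta[j])
    score = delta[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1], score

# Two-state left-to-right model over 1-D features (hypothetical values)
states = [[(1.0, [0.0], [1.0])], [(1.0, [5.0], [1.0])]]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
X = [[0.1], [-0.2], [4.8], [5.1]]
path, score = viterbi(X, log_init, log_trans, states)
print(path)  # [0, 0, 1, 1]
```

Working in log probabilities avoids numerical underflow on long utterances; real recognisers add beam pruning and combine acoustic and language model scores.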

  6. Dysarthria
     • Loss of control of speech articulators
     • Stroke victims, cerebral palsy, MS, ...
     • Affects 170 per 100,000 population
     • Severe cases are unintelligible to strangers [examples: "channel", "lamp", "radio"]
     • Often accompanied by physical disability

  7. STARDUST: ASR for Dysarthric Speakers
     • NHS NEAT Funding
     • Environmental control
     • Small vocabulary, isolated words
     • Speaker-dependent
     • Sparse training data
     • Variable training data

  8. STARDUST Methodology
     [Flowchart: Initial recordings → Train recogniser → Confusability analysis → Client practice for consistency → New recordings → Train recogniser]
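
The confusability analysis step might look like the following sketch. It assumes, purely for illustration, that the trained recogniser is tested on held-out recordings and the results are logged as (intended word, recognised word) pairs; the function names and the trial data are invented and do not describe STARDUST's actual procedure. Frequently confused pairs would be candidates for further client practice.

```python
from collections import Counter

def most_confusable(trials, top=3):
    """Word pairs most often confused, pooling both directions of confusion.
    trials: list of (intended_word, recognised_word) pairs."""
    pooled = Counter()
    for said, heard in trials:
        if said != heard:
            pooled[tuple(sorted((said, heard)))] += 1
    return pooled.most_common(top)

def accuracy_per_word(trials):
    """Fraction of each intended word's trials that were recognised correctly."""
    totals, correct = Counter(), Counter()
    for said, heard in trials:
        totals[said] += 1
        correct[said] += said == heard
    return {word: correct[word] / totals[word] for word in totals}

# Invented test results for three command words
trials = [("lamp", "lamp"), ("lamp", "radio"), ("radio", "lamp"),
          ("radio", "radio"), ("channel", "channel"), ("channel", "lamp")]
print(most_confusable(trials))  # [(('lamp', 'radio'), 2), (('channel', 'lamp'), 1)]
print(accuracy_per_word(trials))
```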

  9. STARDUST training results
     ECS (environmental control system) trial: halved the average time to execute a command

  10. STARDUST Consistency Training

  11. STARDUST Clinical Trial

  12. OPTACIA: Kinematic Maps
     [Diagram: Speech → Signal Processing → ANN Mapping → map window with target regions such as "s", "i", "sh"]
     • Pronunciation Training Aid
     • EC Funding
     • Speech acoustics mapped to an x,y position in the map window in real time
     • Mapping by trained Neural Net
     • Customise for exercises and clients
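
The real-time mapping above amounts to running a trained network on each incoming frame of acoustic features. The sketch shows only that forward pass, assuming a one-hidden-layer network with tanh hidden units and sigmoid outputs; the hand-set weights stand in for trained ones, and the names and window size are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_map(features, W1, b1, W2, b2):
    """Map one acoustic feature vector to (x, y) in [0, 1] x [0, 1].
    Hidden layer: tanh units; output layer: two sigmoid units."""
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    x, y = (sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(W2, b2))
    return x, y

def to_pixels(frames, params, width=400, height=300):
    """Convert a stream of feature frames to map-window pixel positions, frame by frame."""
    positions = []
    for frame in frames:
        x, y = mlp_map(frame, *params)
        positions.append((round(x * width), round(y * height)))
    return positions

# Hand-set weights standing in for a trained network: 2 inputs, 2 hidden, 2 outputs
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
          [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
print(to_pixels([[0.0, 0.0], [1.5, -1.5]], params))
```

Bounded sigmoid outputs guarantee every frame lands inside the window, which suits a display that must track speech continuously.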

  13. Example: Vowel Map

  14. SPECS: Speech-Driven Environmental Control Systems
     • NHS HTD Funding
     • Industrial exploitation
     • STARDUST on ‘balloon board’

  15. VIVOCA: Voice Input Voice Output Communication Aid
     [Diagram: Dysarthric speech → ASR → Text Generation → Speech Synthesis → Intelligible speech]
     • NHS NEAT funding
     • Assists communication with strangers:
        Client: ‘buy tea’ [unintelligible]
        VIVOCA: ‘A cup of tea with milk and no sugar please’ [intelligible synthesised speech]
     • Runs on a PDA
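
The VIVOCA pipeline can be sketched as three chained stages. This assumes, for illustration only, that text generation is a lookup from recognised keyword sequences to stored messages; the message table, function names, and the stand-in recogniser and synthesiser are hypothetical.

```python
# Hypothetical message table: recognised keyword sequence -> full message
MESSAGES = {
    ("buy", "tea"): "A cup of tea with milk and no sugar please",
}

def generate_text(keywords, messages=MESSAGES):
    """Text generation: expand recognised keywords into a full sentence,
    falling back to the keywords themselves when no message is stored."""
    return messages.get(tuple(keywords), " ".join(keywords))

def vivoca(audio, recognise, synthesise):
    """Dysarthric speech -> ASR -> text generation -> speech synthesis.
    recognise and synthesise stand in for real ASR and TTS components."""
    return synthesise(generate_text(recognise(audio)))

# Stand-ins: a 'recogniser' that always hears 'buy tea' and a 'synthesiser' that returns text
print(vivoca(b"<audio>", lambda audio: ["buy", "tea"], lambda text: text))
```

Passing the recogniser and synthesiser in as functions keeps the text-generation stage testable without audio hardware.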

  16. Voices for VIVOCA
     • It is possible to build voices from training data
     • A local voice is preferable
     • Yorkshire voices:
        • Ian MacMillan
        • Christa Ackroyd

  17. Concatenative synthesis
     Festvox: http://festvox.org/
     [Diagram: Speech recordings (input data) → Unit segmentation (e.g. "i", "a", "sh") → Unit database; Text input → Unit selection → Concatenation + smoothing → Synthesised speech]
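
Unit selection is commonly framed as finding the sequence of database units that minimises a target cost (how well each unit matches the required specification) plus a join cost (how smoothly adjacent units concatenate), solved by dynamic programming. The sketch assumes toy units described only by a pitch value and absolute-difference costs; it is not Festvox's implementation, and all names and numbers are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one unit per target, minimising total target + join cost (Viterbi).
    candidates[t] lists the database units that could realise targets[t]."""
    # best[t][k] = (cost of cheapest path ending in candidates[t][k], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for t in range(1, len(targets)):
        prev, row = candidates[t - 1], []
        for u in candidates[t]:
            k = min(range(len(prev)),
                    key=lambda k: best[t - 1][k][0] + join_cost(prev[k], u))
            cost = best[t - 1][k][0] + join_cost(prev[k], u) + target_cost(targets[t], u)
            row.append((cost, k))
        best.append(row)
    # Trace back from the cheapest final unit
    k = min(range(len(candidates[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][k][0]
    chosen = []
    for t in range(len(targets) - 1, -1, -1):
        chosen.append(candidates[t][k])
        k = best[t][k][1]
    return chosen[::-1], total

def target_cost(target, unit):
    """Toy target cost: pitch mismatch (Hz) between specification and unit."""
    return abs(target["f0"] - unit["f0"])

def join_cost(a, b):
    """Toy join cost: pitch jump (Hz) between consecutive units."""
    return abs(a["f0"] - b["f0"])

targets = [{"f0": 100}, {"f0": 110}]
candidates = [[{"id": "a1", "f0": 100}, {"id": "a2", "f0": 130}],
              [{"id": "b1", "f0": 110}, {"id": "b2", "f0": 128}]]
units, total = select_units(targets, candidates, target_cost, join_cost)
print([u["id"] for u in units], total)  # ['a1', 'b1'] 10
```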

  18. Concatenative synthesis
     • High quality
     • Natural sounding
     • Sounds like original speaker
     • Need a lot of data (~600 sentences)
     • Can be inconsistent
     • Difficult to manipulate prosody

  19. HMM synthesis
     [Diagram: phone models "y", "e", "s" joined to synthesise the word "yes"]
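
Building a word from phone HMMs can be sketched as chaining the states of each phone model and emitting each state's mean for its duration. This is a deliberate simplification, assuming one acoustic parameter per frame and fixed state durations; HMM synthesis systems such as HTS also use dynamic features to generate smooth trajectories, which this sketch omits. The model values are invented.

```python
# Hypothetical phone models: each state is (mean of one acoustic parameter, duration in frames)
MODELS = {
    "y": [(1.0, 2), (1.5, 1)],
    "e": [(3.0, 3), (2.8, 2)],
    "s": [(0.5, 4)],
}

def concatenate(phones, models=MODELS):
    """Build a word/sentence HMM by chaining each phone model's states in order."""
    return [state for phone in phones for state in models[phone]]

def generate(states):
    """Emit each state's mean once per frame of its duration (a step-wise track).
    Real HMM synthesis also uses dynamic features to smooth this trajectory."""
    return [mean for mean, frames in states for _ in range(frames)]

track = generate(concatenate(["y", "e", "s"]))
print(len(track), track)
```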

  20. HMM synthesis: adaptation
     HTS: http://hts.sp.nitech.ac.jp/
     [Diagram: Speech recordings → Training → Average speaker model; Speech recordings → Adaptation → Adapted speaker model; Text input → Synthesis → Synthesised speech]
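
Adaptation moves the average speaker model towards a new speaker using a small amount of that speaker's speech. The sketch below uses MAP (maximum a posteriori) re-estimation of Gaussian means, swapped in here for simplicity; linear-transform adaptation methods are also widely used for this and are not shown. It assumes frames have already been aligned to model states; the prior weight tau and all names and values are hypothetical.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean from the average-speaker mean (prior)
    and the frames aligned to that Gaussian. Larger tau trusts the prior more."""
    n = len(frames)
    return [(tau * m + sum(frame[d] for frame in frames)) / (tau + n)
            for d, m in enumerate(prior_mean)]

def adapt_model(model, aligned, tau=10.0):
    """model: state name -> mean vector of the average speaker model.
    aligned: state name -> list of the new speaker's frames for that state.
    States with no adaptation data keep their average-speaker means."""
    return {state: map_adapt_mean(mean, aligned.get(state, []), tau)
            for state, mean in model.items()}

average = {"e_1": [0.0, 0.0], "t_1": [1.0, 1.0]}  # hypothetical state means
speaker_frames = {"e_1": [[2.0, 4.0]] * 10}       # 10 frames seen for e_1 only
print(adapt_model(average, speaker_frames))
```

With ten frames and tau = 10, the adapted mean for e_1 moves halfway from the prior towards the speaker's data, while t_1 is unchanged because no data was seen for it.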

  21. HMM synthesis
     • Consistent
     • Intelligible
     • Easier to manipulate prosody
     • Needs relatively little adaptation data (>5 sentences)
     • Less natural than concatenative synthesis

  22. Personalisation for individuals with progressive speech disorders
     • Voice banking
     • Before deterioration
     • Capturing the essence of a voice
     • During deterioration

  23. HMM synthesis: adaptation for dysarthric speech
     HTS: http://hts.sp.nitech.ac.jp/
     [Diagram: Speech recordings → Training → Average speaker model; Speech recordings → Adaptation → Adapted speaker model, with an added input of duration, phonation and energy information; Text input → Synthesis → Synthesised speech]

  24. Future directions
     • Personal Adaptive Listeners (PALS)
     • ‘Home Service’
     • Companions

  25. The PALS Concept
     A PAL is a portable (PDA, wearable, ...) device which you own. Your PAL is like your valet:
     • It knows a lot about you
        • The way you speak, the words you like to use
        • Your interests, contacts, networks
     • You talk with it
        • The knowledge makes conversational dialogues viable
     • It does things for you
        • Bookings, appointments, reminders
        • Communication
        • Access to services
     • It learns to do a better job
        • By explicit training (this is how I refer to things, these are the names I use, ...): USER-AS-TEACHER
        • By automatic adaptation: acoustic models, language models, dialogue models
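
Automatic language-model adaptation of the kind listed for PALS could, in its simplest form, interpolate a general background model with one estimated from the user's own speech. The sketch assumes unigram models and a fixed interpolation weight lam; both are illustrative choices, and the word lists are invented.

```python
from collections import Counter

def unigram(words):
    """Maximum-likelihood unigram probabilities from a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(background, personal, lam=0.3):
    """P(w) = lam * P_personal(w) + (1 - lam) * P_background(w)."""
    vocab = set(background) | set(personal)
    return {w: lam * personal.get(w, 0.0) + (1 - lam) * background.get(w, 0.0)
            for w in vocab}

background = unigram(["call", "the", "doctor", "call", "the", "office"])
personal = unigram(["call", "mum", "call", "mum"])  # words this user actually says
adapted = interpolate(background, personal)
print(round(adapted["mum"], 3), round(adapted["call"], 3))
```

A word the background model has never seen ("mum") now gets non-zero probability, and the result is still a proper distribution because both component models sum to one.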
