
3D Articulatory Speech Synthesis

This project aims to develop a modular, data-driven, and flexible 3D articulatory speech synthesizer that draws on new imaging techniques and faster computation to produce high-quality, visually coordinated speech in real time. The approach combines user-centred design (identifying stakeholders, scenarios, and user needs), the study of anatomy, and a software architecture with distinct components for simulation and synthesis.



Presentation Transcript


  1. 3D Articulatory Speech Synthesis: Towards an Extensible Infrastructure for a Three-dimensional Face and Vocal-Tract Model for Speech Synthesis • Sid Fels (ECE), Bryan Gick (Linguistics), Eric Vatikiotis-Bateson (Linguistics), Florian Vogt (ECE), Ian Wilson (Linguistics), Carol Jaeger (ECE) • University of British Columbia, Vancouver, BC, Canada V6T 1Z4

  2. Introduction and Motivation • Want speech synthesis that is: • low bandwidth, • high quality, • visually coordinated, • physically based, • real-time • Important developments • computational speed increases • new imaging techniques • better understanding of vocal tract • Articulatory speech synthesis shows promise • Edge of what is possible now…

  3. Goals for Project • Develop 3D Articulatory Speech Synthesizer that is: • Modular • Physically based • Data driven • Accessible • Coupled • Flexible • Open to community

  4. Structure of problem: • 6 major components: • Model infrastructure • Rigid body models and configuration • Soft-tissue models • representation, configuration and simulation • Data extraction • Aeroacoustics • Human interface

  5. Approach • Top-down, user centred approach • Define stakeholders • Define scenarios • Explore literature on usage patterns • Bottom-up • Anatomy • Modelling techniques • Data available • Middle layers • Infrastructure • Model representations

  6. Top-Down: User Centred Design • Identify stakeholders/users • Speech researchers • Linguistics, Engineering, Computer Science, etc. • Industry, Academia and Government • Medical practitioners • ENT, dentist • Computer animators • Programmers • Needs assessment

  7. Top-down: User Centred Design • Identify scenarios for use • Joe Lingus • add new model component and evaluate • Deb Dentist • Surgery prediction • Visual aid for learning to speak • A/V speech engineer • Video phone creation • Add more here…

  8. Top-Down: Simple prototypes • Paper prototypes • Computer mock-up • Modules prototyped • Interpreters • Continual refinement with input from stakeholders

  9. Top-Down: product of process • Specification of requirements • Naturally dynamic • Initial design documents • GUI development • Documentation • Engagement of community

  10. Top-Down: Simple Example • Mock user interface • Animation model

  11. Top-Down: Some needs: • Interface to add new models • Control over specific modules • Acoustic output • Simulation parameters • Configuration • Initial geometry • Data extraction methods • Minimal programming overhead to get started

  12. Bottom-Up: • Studying anatomy • Gather pre-existing data sets • Establish state-of-art of modelling techniques • Middle layers • Infrastructure definition • Model representations • Soft-tissue models • Aeroacoustic methods • Etc.

  13. Software Architecture • 5 main components to deal with: 1. simulator engine, 2. three-dimensional geometry module, 3. graphical user interface (GUI) module, 4. synthesis engine, and 5. numerics engine.

  14. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  15. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  16. 3D Geometry: Scene Graph • Base model notation on Scene Graph • basis of 3D animation • nodes for specifying graphical models including • shapes, cameras, lights, properties, transformation, engines, selection, view etc. • Extend and add nodes to represent relationships • muscles, constraints, nerves, dynamics • may need multiple passes per iteration

  17. 3D Geometry: Scene Graph • Example decomposition • Head • Skull • Mouth • jaw • teeth • tongue • hyoid • cheek, other soft structures • Pharynx • Nose • Larynx • Respiratory tract
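
The scene-graph idea from slides 16 and 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Node` and `MuscleNode`, the nesting of parts under "mouth", and the single example muscle are hypothetical choices, not the project's actual node notation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Generic scene-graph node: a name plus child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def traverse(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Depth-first walk yielding (depth, node) pairs."""
        yield depth, self
        for child in self.children:
            yield from child.traverse(depth + 1)


@dataclass
class MuscleNode(Node):
    """Hypothetical extension node that relates two structures."""
    origin: str = ""
    insertion: str = ""


# Assumed nesting, loosely following the example decomposition on slide 17.
head = Node("head")
head.add(Node("skull"))
mouth = head.add(Node("mouth"))
for part in ("jaw", "teeth", "tongue", "hyoid"):
    mouth.add(Node(part))
mouth.add(MuscleNode("genioglossus", origin="jaw", insertion="tongue"))
for part in ("pharynx", "nose", "larynx"):
    head.add(Node(part))

for depth, node in head.traverse():
    print("  " * depth + node.name)
```

Keeping relationships such as muscles as ordinary nodes means a single traversal can visit geometry and dynamics together, which is one plausible reading of "extend and add nodes to represent relationships".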

  18. Using Image Data • Data from MRI, ultrasound, EMA and other imaging devices • used in real-time or offline • create geometry • provide constraints on system

  19. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  20. Graphics • Separate out rendering of model • Integrate with other 3D animation tools, e.g. Blender, Maya, 3ds Max.

  21. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  22. GUI • Separate out to make simulation code clean • Allow multiple access points to reduce dependencies • command line, GUI, scripted, stdin/stdout, files (save state) • Control module behaviour • Automate as much as possible • extensions get support for GUIs

  23. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  24. Numerics • All numerical processes separated out from simulation • allow for improvements to numerical routines • allow flexibility to switch methods • e.g. implicit vs. explicit Euler integration, FEM solver • Each pass through scene graph builds up state • numerics operate on state and return state back to scene graph
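
To make the "switch methods" point concrete, here is a minimal sketch of two interchangeable integrators that take a state and return a new state. It assumes a single undamped linear spring (acceleration = -(k/m) * x); the function names and the test parameters are illustrative, not part of the project.

```python
from typing import Callable, Tuple

State = Tuple[float, float]  # (position, velocity)
Stepper = Callable[[State, float, float], State]


def explicit_euler(state: State, h: float, k_over_m: float) -> State:
    """Forward Euler: evaluate the derivatives at the current state."""
    x, v = state
    return x + h * v, v - h * k_over_m * x


def implicit_euler(state: State, h: float, k_over_m: float) -> State:
    """Backward Euler: derivatives at the new state (2x2 system solved by hand)."""
    x, v = state
    det = 1.0 + h * h * k_over_m
    return (x + h * v) / det, (v - h * k_over_m * x) / det


def energy(state: State, k_over_m: float) -> float:
    """Total energy per unit mass of the oscillator."""
    x, v = state
    return 0.5 * v * v + 0.5 * k_over_m * x * x


def run(step: Stepper, state: State, h: float, k_over_m: float, n: int) -> State:
    """Apply the chosen stepper n times; the caller never sees which method ran."""
    for _ in range(n):
        state = step(state, h, k_over_m)
    return state


start: State = (1.0, 0.0)
for stepper in (explicit_euler, implicit_euler):
    end = run(stepper, start, h=0.1, k_over_m=100.0, n=100)
    print(stepper.__name__, energy(end, 100.0))
```

With this deliberately large time step the explicit method gains energy on every step and blows up, while the implicit method loses energy and stays bounded. That trade-off is one reason to keep the integrator swappable behind a common state-in, state-out interface.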

  25. Structure of Simulator (diagram): graphics, 3D geometry, simulator, numerics, GUI, aeroacoustics, imaging data sources

  26. Simulation • As simple as possible: • infinite loop that updates state of simulation • traverses scene graph • calculate new state • render • simulate airflow • can be thought of as an independent module as well
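
The loop on this slide might look roughly like the following. It is a sketch only: the scene is reduced to a plain dictionary, the `solve`, `render` and `airflow` callables are hypothetical stand-ins for the numerics, graphics and aeroacoustics modules, and the loop is bounded by `max_steps` so the example terminates.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


def simulate(
    scene: State,
    solve: Callable[[State], State],
    render: Callable[[State], None],
    airflow: Callable[[State], None],
    max_steps: int,
) -> State:
    """Run the update cycle max_steps times and return the final scene."""
    for _ in range(max_steps):
        state = dict(scene)       # traverse the scene graph and gather state
        new_state = solve(state)  # numerics calculate the new state
        scene.update(new_state)   # state is handed back to the scene graph
        render(scene)             # draw the updated model
        airflow(scene)            # run the aeroacoustic step
    return scene


log: List[str] = []


def open_jaw(state: State) -> State:
    """Toy solver: open the jaw by half a degree per step."""
    return {"jaw_angle": state["jaw_angle"] + 0.5}


final = simulate(
    {"jaw_angle": 0.0},
    solve=open_jaw,
    render=lambda s: log.append("render"),
    airflow=lambda s: log.append("airflow"),
    max_steps=3,
)
print(final, log)
```

Because every stage is passed in as a callable, the loop itself stays small and each module can be replaced independently.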

  27. 3D Articulator Techniques & Issues 1. Vocal Tract Model • geometric versus physical model • static versus dynamic model • parameter extraction/tuning from • real data • X-Ray, MRI, EMA, Ultrasound, Electropalatography (Stone and Lundberg, 1996) • anatomy

  28. 3D Articulatory Synthesis Model • Our direction • use physical model of vocal tract and tongue • articulator/muscle based model • match face model • use dynamic modeling of soft tissues • tongue, lips, cheeks, pharynx, etc. • include volume constraints • collision detection

  29. 3D Articulatory Vocal Tract Model • Which technique to use for soft tissues? (see Gibson and Mirtich, 1997 for review) a) Non-physical models • splines, patches • difficult to get deformations correct • may be good for representing static 3D shape of vocal tract b) Spring-mass models • use a collection of point masses connected by springs • popular in facial animations • e.g. Waters, 1987 and Lee, Terzopoulos, Waters, 1998 • may be difficult to model stiff areas well • numerical instabilities • volumetric constraints difficult to model well

  30. 3D Articulatory Vocal Tract Model c) Boundary Line Integral and Boundary Element Method (James and Pai, 1999) • Use boundary integral equation formulation • of static linear elasticity • use boundary element method to solve • limited to boundary only • how to deal with heterogeneous tissue? Tongue may be difficult • Limited to linear elasticity • should be OK for small deformations d) Continuum models and Finite Element Methods • some human tissue models have been created: • Payan et al., 2001, Gourret et al., 1989, Chen and Zeltzer, 1992, Bro-Nielsen, 1997.
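
For option (b), the core of a spring-mass model is the force each spring exerts on the two point masses it connects. The sketch below computes those forces with Hooke's law in 2D; the tuple layout for a spring and the three-point example are assumptions made for illustration, not a tissue model from the literature cited above.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Spring = Tuple[int, int, float, float]  # (index_a, index_b, rest_length, stiffness)


def spring_forces(points: Sequence[Vec], springs: Sequence[Spring]) -> List[Vec]:
    """Net Hooke's-law force on every point mass."""
    forces = [[0.0, 0.0] for _ in points]
    for a, b, rest, stiffness in springs:
        dx = points[b][0] - points[a][0]
        dy = points[b][1] - points[a][1]
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # direction undefined for coincident points
        magnitude = stiffness * (length - rest)  # positive when stretched
        fx = magnitude * dx / length
        fy = magnitude * dy / length
        forces[a][0] += fx  # a is pulled towards b
        forces[a][1] += fy
        forces[b][0] -= fx  # b gets the equal and opposite force
        forces[b][1] -= fy
    return [(f[0], f[1]) for f in forces]


# Three masses in a row, both springs stretched from rest length 1.0 to 1.5.
points = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]
springs = [(0, 1, 1.0, 10.0), (1, 2, 1.0, 10.0)]
print(spring_forces(points, springs))
```

Nothing in this formulation preserves volume, which is consistent with the slide's remark that volumetric constraints are difficult to model well with springs alone.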

  31. 3D Articulator Synthesis: Engwall / Badin • One attempt by (Engwall / Badin, 1999) • geometric model of vocal tract derived from articulator parameters • array of vertices (polygonal mesh) • symmetric around midsagittal plane • tongue model is set of filtered vertices • 5 parameters • synthesis model • acoustic tube

  32. 3D Articulatory Synthesis Model: Anatomical Models for simulation • FEM-Tongue models, e.g. Dung 2000 • Ultrasound Tongue models, e.g. Stone 1998 • Anatomical Tongue model, Takemoto 2001

  33. 3D Articulatory Synthesis Model 2) Synthesis Model • simulate propagation of air pressure waves through the 3D vocal tract model • simplified source model • fluid dynamic models • maybe modification of ray-tracing • 2D acoustic tube model • dynamic models and time domain source • classical electrical analog • enhanced with airflow model (Jackson, 2000) • 2.5D and 3D acoustic tube • with and without source models • FEM or BEM analysis for flow and turbulence
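
As a point of reference for the acoustic tube models listed here, the simplest case is a uniform lossless tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). The sketch below evaluates that formula; the 17.5 cm length and 350 m/s speed of sound are commonly used textbook values assumed for illustration, not figures from the slides.

```python
from typing import List


def uniform_tube_resonances(
    length_m: float, speed_of_sound: float = 350.0, count: int = 3
) -> List[float]:
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    if length_m <= 0.0:
        raise ValueError("tube length must be positive")
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


for n, freq in enumerate(uniform_tube_resonances(0.175), start=1):
    print(f"F{n} is about {round(freq)} Hz")
```

For these values the first three resonances come out near 500, 1500 and 2500 Hz, roughly the formants of a neutral vowel. The 2D, 2.5D and 3D tube models on the slide refine this picture by letting the cross-section vary along the tract.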

  34. 3D Articulatory Synthesis: Applications • Surgical prediction (Payan et al., 2002) • compression • videophone, multimedia data • speech research tool • text-to-speech synthesis • lip synchronization in movies (Haber et al., 2001) • new musical instruments (Vogt et al., 2001) • based on 3D models + wave propagation • Some applications may not require complete, 3D articulatory synthesis model

  35. 3D Articulatory Synthesis: Summary • Continue building framework for public domain 3D Articulatory Synthesizer • Construction of modular S/W architecture • integrating simple models of vocal tract behaviour • providing support for multiple level-of-detail modeling • 2D tube model for synthesis • Developing 3D tube models • Building anatomically based vocal tract models • development of scene graph semantics and syntax

  36. Speech Synthesis Techniques • Three main synthesis techniques: • Time domain • LPC (Markel & Gray, 1972), CELP (Schroeder & Atal), Multipulse (Atal & Remde) • CELP used in cell phones - good quality at 4.8 kbps • used in text-to-speech applications too • concatenation based systems • Frequency domain • Articulatory domain
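
Linear predictive coding (LPC), the first time-domain technique named above, fits a short all-pole predictor to the signal. A minimal sketch using the autocorrelation method and the Levinson-Durbin recursion follows; the function names and the toy test signal are illustrative assumptions, and real coders add windowing, framing and quantization that are omitted here.

```python
from typing import List, Sequence, Tuple


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Solve the LPC normal equations.

    Returns (a, err) with a[0] == 1, where the predictor is
    x_hat[n] = -sum(a[j] * x[n - j] for j in 1..order)
    and err is the remaining prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


# Toy signal obeying x[n] = 0.9 * x[n - 1]; the predictor should recover 0.9.
signal = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(signal, 2), order=2)
print(coeffs, residual_energy)
```

With the sign convention used here, the first coefficient comes out close to -0.9 and the second close to zero, since one past sample already predicts this signal almost perfectly.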

  37. Time domain Text To Speech (TTS) • TTS in time domain • concatenate prerecorded speech segments • pitch change and transitions tricky • research on different ways to do this • Here are a few examples: • CSTR Edinburgh: diphone synthesis / non-uniform unit selection 1, 2, 3 • CHATR, ATR-ITL Kyoto, Japan: non-uniform unit selection 1 • Bell Labs TTS System: LPC diphone synthesis 1, 2 • PSOLA (Verhelst, 1990) and more

  38. Frequency domain TTS • TTS using formants: • spectral changes are slow • should interpolate well • rules for transitions can’t be simply linear • change speaker characteristics easily • not so natural though • change intonation (somewhat) • Main synthesizers: (Klatt, 1980) and (Rye & Holmes, 1982) • Examples: • Infovox, Telia Promotor / KTH Stockholm 1 • Multilingual TTS system, TI Uni Duisburg 1 • DECtalk: regular 1, 2, affective modification (Cahn) 1, 2, 3

  39. Articulatory Synthesis • Parameterize human vocal tract, glottis and lungs • mechanical or electrical systems

  40. 2D Articulatory Synthesis • Articulatory model • Mermelstein, 1971 model used at Haskins • Coker, 1976 • C: Tongue body Center • H: Hyoid • J: Jaw • L: Lips • T: Tongue Tip • V: Velum

  41. 2D Articulatory Synthesis • To synthesize speech: • convert to area function • use acoustic tube model • activate with sound source • Sound/excitation source • waveform • model • Glottal and Lung model from (Flanagan, Ishizaka, and Shipley, 1975) • 2 masses and springs (oscillator)
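
Once the articulator positions have been converted to an area function, a common way to drive an acoustic tube model is to treat the tract as a chain of short uniform sections and compute a reflection coefficient at each junction. The sketch below does only that step. The sign convention (pressure wave travelling from section k towards section k + 1) and the example areas are assumptions for illustration, not values taken from the Haskins model.

```python
from typing import List, Sequence


def reflection_coefficients(areas: Sequence[float]) -> List[float]:
    """Pressure-wave reflection coefficient at each junction between sections.

    For adjacent cross-sectional areas A_k and A_(k+1):
        r_k = (A_k - A_(k+1)) / (A_k + A_(k+1))
    so a narrowing gives r_k > 0 and a widening gives r_k < 0.
    """
    if any(a <= 0.0 for a in areas):
        raise ValueError("all cross-sectional areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


# Hypothetical area function in cm^2, ordered from glottis to lips.
area_function = [0.5, 1.0, 3.0, 3.0, 1.0]
print(reflection_coefficients(area_function))
```

A uniform tube yields all-zero coefficients, so any non-zero value marks a place where the changing cross-section partially reflects the wave and shapes the resonances.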

  42. Articulatory Synthesis: Haskins • Examples: • /da/ • about 75 msec from start to end • sound • utterance • (figure: 1st frame, interpolated frames, 2nd frame)

  43. Articulatory Synthesis: Haskins • Problems: • vowel sounds OK • plosives, fricatives, liquids and aspirants not OK • where to get articulator data? • Measurements • MRI • electromagnetic articulograph • ultrasound • X-ray • acoustics • Models • only use rigid 2D models

  44. Articulatory Synthesis: History • Interesting history • sometimes hot (1700s, 2000) and sometimes not (1800s, 1970-1980s) • Important now because of “Talking Head” research • McGurk effect

  45. Articulatory Synthesis: Talking Heads • Visual and auditory signals interact • visual signal can make auditory signal hard to hear • McGurk Effect Demo • Talking heads important for: • more natural interaction • dubbing new voices • compact encoding of voice and image • Can we create good talking head from acoustic signal? • Not so easy: e.g., Bregler, Slaney and Covell • articulatory synthesis provides necessary articulatory movement with audio waveform • see “Speech Recognition and Sensory Integration”, Massaro and Stork, American Scientist, Vol. 86, 1998.

  46. Articulatory Synthesis: History • Kratzenstein resonators (1770 - Imperial Academy of St. Petersburg contest)

  47. Articulatory Synthesis: AVTs • von Kempelen’s AVT (1791)

  48. Articulatory Synthesis: more T.H. • R. R. Riesz's talking mechanism, 1937

  49. Articulatory Synthesis: electronic AVTs • The Voder (Dudley, Riesz and Watson, 1939) • Example
