
Adaptation of orofacial clones to the morphology and control strategies

Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation. Julián Andrés VALDÉS VARGAS. Jury: Michel DESVIGNES (President), Yves LAPRIE (Reviewer), Rudolph SOCK (Reviewer), Thierry LEGOU (Examiner), Pierre BADIN (Thesis Director).





Presentation Transcript


  1. Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation • Julián Andrés VALDÉS VARGAS • Jury: Michel DESVIGNES (President), Yves LAPRIE (Reviewer), Rudolph SOCK (Reviewer), Thierry LEGOU (Examiner), Pierre BADIN (Thesis Director)

  2. Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives


  4. Context • Mastery of articulators for speech production • Skill maintained/improved by the perception-action loop (Matthies et al., 1996) • Feedback in speech: auditory and proprioceptive

  5. Vision of articulators • Augmented speech → visual feedback • Display of articulators • Vision of lips and face: improves speech intelligibility (Sumby and Pollack, 1954); speech imitation is faster (Fowler et al., 2003) • Vision of hidden articulators: increases intelligibility (Badin et al., 2010)

  6. Visual articulatory feedback system • System of visual articulatory feedback (Ben Youssef et al., 2011) • Applications: speech rehabilitation; Computer Aided Pronunciation Training (CAPT) • [Diagram: speech sound signal of a given speaker → visual articulatory feedback system → clone's animation]

  7. Problem of articulatory adaptation • Animation of clone based on a single speaker • Adaptation to several speakers • Mismatch between clone's animation and real speakers • [Diagram: speech sounds of speakers 1, 2, ..., n → visual articulatory feedback system, with acoustic adaptation (Atef BEN YOUSSEF) and articulatory adaptation; animation based on entry speaker vs. animation based on reference speaker]

  8. Inter-speaker variability • Morphology: different vocal tracts; size, vertical/horizontal length ratios; shape (e.g. concave/flat palates) • Articulatory control strategies: cope with morphology → different articulatory strategies to achieve sounds considered equivalent for speech communication purposes

  9. Illustration of speaker differences • [Figure: articulations /a/, /i/, /u/ for speakers PB, AA and YL]

  10. Objectives • Articulatory adaptation (initial objective) → normalization: extraction of common components (patterns) to control the articulators of several speakers • To acquire knowledge about inter-speaker variability

  11. Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives

  12. Articulatory data • Type of data → articulatory data → building articulatory models • Inter-speaker variability: 11 French speakers (6 males and 5 females) • Articulatory phonetic coverage: 13 vowels; 10 consonants in 5 vocalic contexts (vowel-consonant-vowel); 63 articulations in total

  13. Recording methods • Several recording methods considered: • X-ray (Meyer, 1907; Mosher, 1927): difficult to accurately identify the contours • Electromagnetic articulography (EMA): no recording of the whole vocal tract • Magnetic resonance imaging (MRI) (Rokkaku et al., 1986): tomographic (imaging by sections); maintained vocal tract positions; speakers in supine position; gravitational effect is moderate (Engwall, 2003; 2006)

  14. Decision to use MRI • Whole vocal tract information, unlike EMA • Contours easier to identify compared to X-ray • No health hazard compared to X-ray • Recording parameters: midsagittal image of the vocal tract; slice thickness: 4 mm; spatial resolution: 1 mm/pixel; acquisition time: 8-16 seconds

  15. MRI recording • The speaker is asked to go through several stages: • Speakers lie in supine position • Bed shifted into the MRI machine • Setting up of alignment recording properties • Maintained pronunciation of articulations for 8-16 seconds • Speakers are asked not to move their heads

  16. Processing of MRI • Midsagittal contours manually edited • Rigid contours are drawn once for a given speaker • Positioning of palate using skull bones as reference (rotation and translation) • Positioning of jaw by means of rototranslations • Editing of deformable contours: lips, tongue, velum, etc. • Palates of all articulations are aligned, avoiding noise introduced by head movement • [Figure: edited contours for /a/, /i/, /u/]
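The rotation-and-translation step used to bring the rigid structures of each image onto a common reference can be sketched as a least-squares rigid fit. The slides do not say how the rototranslation was estimated, so the Kabsch method, the function name `rigid_align` and the synthetic landmark points below are all assumptions made for illustration.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rotation R (2x2) and translation t (2,) such that
    source @ R.T + t approximates target (Kabsch method, no scaling)."""
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    # 2x2 cross-covariance between the centred point sets
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection: force det(R) = +1
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

# Hypothetical data: six landmarks seen again after a 10-degree head rotation.
rng = np.random.default_rng(0)
reference = rng.normal(size=(6, 2))
theta = np.deg2rad(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
moved = reference @ R_true.T + np.array([0.3, -0.2])

R, t = rigid_align(moved, reference)
aligned = moved @ R.T + t   # landmarks mapped back onto the reference frame
```

The same `R` and `t` would then be applied to every contour point of that image, so that palates coincide across articulations before any statistical modelling.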

  17. Contours modelled • Upper tongue: 150 (x, y) points • Lips: 100 (x, y) points • Velum: 150 (x, y) points • Static data → articulatory study/models

  18. Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives

  19. Universal control parameters • Extraction of common set of patterns (components) • Goals: building individual-speaker articulatory models; controlling all individual articulatory models from a universal set of components • [Diagram: articulator contours of individual speakers (/a/, /i/, /u/ for speaker 1 and speaker 2) → individual articulatory models (M_speaker1, M_speaker2) with control parameters CP/a/, CP/i/, CP/u/ → universal model: universal set of components with speaker-specific weights]

  20. Method for individual models of speakers • Principal component analysis (PCA) • Dimensionality reduction → extraction of orthogonal components
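A minimal sketch of PCA on contour data, assuming each articulation is stored as one row of flattened (x, y) coordinates. The function names and the random stand-in data are hypothetical; the thesis may have used a different implementation.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA by SVD of the centred data.
    X: (n_observations, n_features), one articulation per row."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]      # orthonormal rows
    scores = Xc @ components.T          # control parameters per observation
    return mean, components, scores

def reconstruct(mean, components, scores):
    """Map component scores back to contour coordinates."""
    return mean + scores @ components

# Stand-in data: 63 articulations, 150 (x, y) points flattened to 300 values.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 300))
mean, components, scores = fit_pca(X, n_components=4)
X_hat = reconstruct(mean, components, scores)
```

Keeping only the first few rows of `Vt` is what performs the dimensionality reduction: each articulation is then described by four scores instead of 300 coordinates.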

  21. Assessment of models • Evaluation of the model for an individual speaker X • Variance explained • Root mean square error (RMSE)
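The two scores can be computed from the original contours and their model reconstruction as sketched below. The exact definitions used in the thesis are not given on the slide; here RMSE is taken over individual coordinates and variance explained is measured relative to the mean contour, both of which are assumptions.

```python
import numpy as np

def variance_explained(X, X_hat):
    """Fraction of the data variance (around the mean observation)
    that is captured by the reconstruction X_hat."""
    total = np.sum((X - X.mean(axis=0)) ** 2)
    residual = np.sum((X - X_hat) ** 2)
    return 1.0 - residual / total

def rmse(X, X_hat):
    """Root mean square reconstruction error, in the units of X (e.g. cm)."""
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))
```

A perfect reconstruction gives a variance explained of 1 and an RMSE of 0; reconstructing every observation by the mean contour gives a variance explained of 0.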

  22. Generalization properties of models • Performance of models in reconstructing data that were not used for training • Leave-one-out cross-validation procedure (a.k.a. jackknife) • Observation left out → reconstruction of the left-out observation by inverting the model → validation of generalization properties → valuable predictors retained
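The leave-one-out procedure can be sketched for a plain PCA model as follows. "Inverting the model" is interpreted here as finding the component scores that best reproduce the left-out contour, which for an orthonormal basis is a simple projection; this interpretation and the function name `loo_rmse` are assumptions.

```python
import numpy as np

def loo_rmse(X, n_components):
    """Leave-one-out RMSE of a PCA model with n_components components.
    X: (n_observations, n_features)."""
    squared_errors = []
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)          # all observations but one
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        basis = Vt[:n_components]
        # Inversion: least-squares scores of the left-out observation
        scores = (X[i] - mean) @ basis.T
        x_hat = mean + scores @ basis
        squared_errors.append(np.mean((X[i] - x_hat) ** 2))
    return float(np.sqrt(np.mean(squared_errors)))
```

Comparing this value with the RMSE obtained on the training data indicates whether a component describes genuine articulatory structure or only fits the particular observations used to build the model.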

  23. Individual tongue models • Guided PCA model (Badin & Serrurier, 2006) • 4 components extracted • First component extracted by linear regression • Jaw height (JH) as predictor → three degrees of freedom: x, y translation and rotation (Edwards & Harris, 1990) → normalized value of the y-coordinate of the lower incisor (Badin & Serrurier, 2006) • Corr(Y, θ) ≈ 0.92 • [Figure: lower incisor position (X, Y)]

  24. Individual tongue models • Other 3 components extracted by PCA from the residue: Tongue Body (TB), Tongue Dorsum (TD), Tongue Tip (TT)
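The two-stage extraction described on the tongue-model slides (regression on jaw height, then PCA on the residue) can be sketched as below. This is a simplified illustration, not the authors' procedure: the published guided PCA may restrict some components to subsets of contour points, and the names `guided_pca` and `jaw_height` are hypothetical.

```python
import numpy as np

def guided_pca(X, jaw_height, n_pca=3):
    """X: (n_observations, n_features) tongue contours.
    jaw_height: (n_observations,) y-coordinate of the lower incisor."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Normalised predictor: zero mean, unit variance
    jh = (jaw_height - jaw_height.mean()) / jaw_height.std()
    # Stage 1: linear regression of every coordinate on JH
    jh_loadings = jh @ Xc / (jh @ jh)
    residue = Xc - np.outer(jh, jh_loadings)
    # Stage 2: PCA of what JH does not explain
    _, _, Vt = np.linalg.svd(residue, full_matrices=False)
    pca_loadings = Vt[:n_pca]
    pca_scores = residue @ pca_loadings.T
    loadings = np.vstack([jh_loadings, pca_loadings])
    scores = np.column_stack([jh, pca_scores])
    return mean, loadings, scores
```

A contour is reconstructed as `mean + scores @ loadings`. Because the residue is orthogonal to the JH predictor by construction, the PCA scores are uncorrelated with jaw height, which is what allows the first component to be read as the jaw's contribution alone.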


  27. Comparison between components • Nomograms: graphical representation of components, varied between -3 and 3 • JH component (Y-Tongue = Coefficients_LR * JH): max. variance: LD; min. variance: RL, MG, AK; compensation strategy of MG • TB component: represents more variance than other components; horizontal/diagonal back-front movement • TD component: vertical/diagonal arching movement • TT component: used in different proportions according to the speaker • [Figure: nomograms for speakers LD, RL and AK]

  28. Individual lip models • 3 components extracted by guided PCA model (Badin et al., 2012) • Jaw height: more influence on LL than UL; little influence on UL for RL • Protrusion: ULP > LLP for speaker LD; LLP > ULP for speaker RL • Lip height: ULH > LLH for all speakers except LD • [Figure: nomograms for speakers LD and RL, with percentage labels 25.2%, 52.7%, 12.7%, 28.6%, 15.4%, 44.6%, 1.7%, 21.9%, 55%, 20.5%, 31%, 34.8%]

  29. Individual velum models • 2 components extracted by PCA (Serrurier & Badin, 2008): • Velum levator (oblique movement), VL • Superior pharyngeal constrictor (horizontal movement), VS • [Figure: VL and VS nomograms]

  30. Individual velum models: consonant /ʁ/ • [Figure: VL and VS components for /ʁa/, speakers AA and HL]

  31. Conclusions: individual models • Tongue PCA models: 4 components (JH, TB, TD, TT); variance explained: 93%, RMSE: 0.13 cm • Lip models: 3 components (JH, protrusion, height); variance explained: 94%, RMSE: 0.04 cm • Velum models: 2 components (VL, VS); variance explained: 90%, RMSE: 0.08 cm

  32. Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives

  33. Literature on multi-speaker models • PARAFAC models: 2 components extracted • Studies based on EMA (Hoole, 1998; Geng, 2000; Hu, 2006): 6-7 speakers, 10-15 vowels, 3-4 sensors on the tongue, 80%-96% variance explained • Study based on X-ray (Harshman, 1977): 5 speakers, 10 vowels, 13 points, 92.7% variance explained • Studies based on MRI (Hoole, 2000; Zheng, 2003; Ananth, 2010): 3-9 speakers, 7-13 vowels, 13-150 points, 71%-87% variance explained

  34. Multi-speaker decomposition methods • Extraction of common set of components • PARAFAC (Harshman, 1970): three-way factor analysis, diagonal speaker adaptation matrix
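For reference, the PARAFAC model is usually written as follows. The notation is an assumption, since the slide gives no formula: i indexes articulations, j contour coordinates, k speakers and r the R common components.

```latex
x_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}
\qquad\Longleftrightarrow\qquad
X_k \approx A\, D_k\, B^{\top},
\quad D_k = \operatorname{diag}(c_{k1}, \dots, c_{kR})
```

In the slab form on the right, X_k holds the data of speaker k, A (articulations by components) and B (coordinates by components) are shared by all speakers, and the diagonal matrix D_k plays the role of the speaker adaptation matrix: each speaker can only rescale the common components.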

  35. Multi-speaker decomposition methods • TUCKER 3 • Extension of PARAFAC • Decomposition in all modes of variation
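The Tucker3 model is commonly written as below, with notation assumed for illustration: i indexes articulations, j contour coordinates and k speakers, and each mode has its own number of components (P, Q and R).

```latex
x_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R}
g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr}
```

The core array G = (g_{pqr}) lets every component of one mode interact with every component of the others. PARAFAC is recovered as the special case P = Q = R with g_{pqr} = 1 when p = q = r and 0 otherwise, which is the sense in which Tucker3 extends it.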

  36. Multi-speaker decomposition methods • Joint PCA: two-way analysis adapted to multi-speaker models (Ananthakrishnan et al., 2010, KTH, Sweden) • All speakers' articulatory measurements for one phoneme considered as one set of data • Forces common components
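One way to read the joint PCA idea is sketched below: for each articulation, the contours of all speakers are concatenated into a single long observation, and an ordinary PCA is run on the resulting matrix. The data layout, the function names and the random stand-in data are assumptions, not details taken from the cited work.

```python
import numpy as np

def joint_pca(data, n_components):
    """data: (n_speakers, n_articulations, n_coords).
    Returns shared scores and speaker-specific loadings and mean contours."""
    n_spk, n_art, n_coord = data.shape
    # One row per articulation, all speakers' coordinates side by side
    X = data.transpose(1, 0, 2).reshape(n_art, n_spk * n_coord)
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # common to all speakers
    loadings = Vt[:n_components].reshape(n_components, n_spk, n_coord)
    speaker_means = mean.reshape(n_spk, n_coord)
    return scores, loadings, speaker_means

def reconstruct_speaker(scores, loadings, speaker_means, k):
    """Contours of speaker k driven by the shared scores."""
    return speaker_means[k] + scores @ loadings[:, k, :]

# Stand-in data: 3 speakers, 8 articulations, 5 coordinates each.
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 8, 5))
scores, loadings, speaker_means = joint_pca(data, n_components=7)
```

Because a single score vector per articulation drives every speaker's contour, the components are common by construction; the speakers differ only in their loadings and mean contours.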

  37. Comparison of performance between methods • RMSE and variance explained (VarEx): multi-speaker models (red, green, black) vs. average of individual speakers' models (blue) • [Figure: VarEx and RMSE plots]

  38. Multi-speaker tongue models • Reference PCA model with 4 components; total number of components: 11 x 4 = 44 • Student's t-test on RMSE at the 5% significance level → determines whether the RMSEs of the models are significantly different from each other • Joint PCA: 14-21 components (TUCKER) • PARAFAC: 21 components • [Figure: VarEx and RMSE plots]

  39. Multi-speaker tongue models • Individual models: reference PCA model with 44 (11 x 4) components; VarEx: 93.23%, RMSE: 0.13 cm • Multi-speaker models: joint PCA with 4 components; VarEx: 72.16%, RMSE: 0.27 cm; interpretation of components: JH, TB, TD and TT • Equivalent solution: joint PCA with 21 components; VarEx: 94.88%, RMSE: 0.12 cm; lack of interpretation from the 5th component onwards • Literature: no. of components: 2; VarEx: 71%-96%; corpus: 7-15 vowels; speakers: 3-9 • Present study: corpus: 63 articulations (vowels and consonants); speakers: 11

  40. Multi-speaker models: lips and velum • Lips and velum models comparable with tongue models • Lips: individual models: 33 components (3 x 11); multi-speaker joint PCA models: equivalent with 21 components; reduced no. of components: 3 interpretable components (JH, protrusion, lip height) • Velum: individual models: 22 components (2 x 11); multi-speaker joint PCA models: equivalent with 14 components; reduced no. of components: 2 components (oblique, horizontal)

  41. Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives

  42. Conclusions • Data: unique set of articulatory data for French; MRI of the whole vocal tract for 11 French speakers; contours; vowels and consonants; more speakers compared to the literature • Characterisation of different speakers' strategies: tongue; upper and lower lip; velum • Multi-speaker models (normalisation) of tongue, lips and velum contours: no work in the literature on lips and velum

  43. Perspectives • More speakers • Relation between articulatory strategies and acoustics • Cross-speaker velum variability: influence of the tongue movement; nasality → new modelling solutions • Non-linear methods: kernel PCA; artificial neural networks (ANN); support vector machines (SVM)

  44. Acknowledgments • Laurent Lamalle (IRMaGe, Grenoble) • Speakers • ARTIS project (GIPSA-lab, LORIA)

  45. Thank you for your attention • Questions?

  46. Grid system • Maeda S. (1979) → fixed grid • Busset J. (2013): adaptive grid system • Euclidean coordinates (intersections) • Distances and extreme angles • Polar coordinates (distances and angles for each grid line) • Beautemps et al. (2001): adapted to each articulation → Euclidean coordinates → distances and TngAdv + TngBot

  47. Corr(Y-jaw, Angle_rotation) by speaker • PB = 0.6611 • YL = 0.7385 • LH = 0.7174 • RL = 0.3946 • LD = 0.8423 • BR = 0.7764 • HL = 0.7913 • AA = 0.4952 • MG = 0.4151 • AK = 0.8317 • MGO = 0.9228

  48. Acoustic simulation • Grid system → midsagittal function • Vocal tract area function (series of areas and lengths of each sagittal section) • α, β models (Beautemps et al., 1995; Heinz & Stevens, 1965): A = area of a given grid section, d = midsagittal distance; α, β coefficients depend on subject and vocal tract location → α, β according to the reference speaker, PB • Vocal tract acoustic transfer function (Fant, 1960; Badin & Fant, 1984) • Formants
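The αβ model is usually stated as the power law A = α·d^β, which converts each midsagittal distance d into a cross-sectional area A. The sketch below applies it to a series of grid sections; the function name and the coefficient values are purely illustrative and are not those estimated for speaker PB.

```python
import numpy as np

def area_from_distance(d, alpha, beta):
    """Cross-sectional area from midsagittal distance: A = alpha * d**beta.
    d, alpha and beta may be scalars or arrays with one value per grid section."""
    return alpha * np.power(d, beta)

# Hypothetical midsagittal distances (cm) for five grid sections,
# with made-up coefficients that vary along the vocal tract.
d = np.array([0.4, 0.9, 1.5, 1.1, 0.6])
alpha = np.array([1.8, 1.6, 1.5, 1.5, 1.7])
beta = np.array([1.3, 1.4, 1.5, 1.4, 1.3])
areas = area_from_distance(d, alpha, beta)   # area function, in cm^2
```

Together with the length of each section, the resulting areas form the area function from which the acoustic transfer function and the formants are then computed.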

  49. No. of coefficients by method

  50. “Essentially, all models are wrong, but some are useful” (George Edward Pelham Box)
