Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation. Julián Andrés VALDÉS VARGAS Jury: Michel DESVIGNES (President) Yves LAPRIE (Reviewer) Rudolph SOCK (Reviewer) Thierry LEGOU (Examiner) Pierre BADIN (Thesis Director). 1.
Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives 2
Context • Mastery of articulators for speech production • Skill maintained/improved by Perception-action loop (Matthies et al., 1996) • Feedback in speech • Auditory • proprioceptive 4
Vision of articulators • Augmented speech: visual feedback • Display of articulators • Vision of lips and face • Improves speech intelligibility (Sumby and Pollack, 1954) • Speech imitation is faster (Fowler et al., 2003) • Vision of hidden articulators • Increases intelligibility (Badin et al., 2010) 5
Visual articulatory feedback system • System of visual articulatory feedback (Ben Youssef et al., 2011) • Applications • Speech rehabilitation • Computer Aided Pronunciation Training (CAPT) [Diagram: speech sound signal of a given speaker → visual articulatory feedback system → clone's animation] 6
Problem of articulatory adaptation • Animation of clone based on a single speaker • Adaptation to several speakers • Mismatch between clone's animation and real speakers [Diagram: speech sounds of speakers 1, 2, ..., n → visual articulatory feedback system, with acoustic adaptation (Atef BEN YOUSSEF) and articulatory adaptation; animation based on input speaker vs. animation based on reference speaker] 7
Inter-speaker variability • Morphology • Different vocal tracts • Size, vertical / horizontal length ratios • Shape (e.g. concave / flat palates) • Articulatory control strategies • Cope with morphology • Different articulatory strategies to achieve sounds considered equivalent for speech communication purposes 8
Illustration of speaker differences [Figure: speakers PB, AA and YL, each shown for /a/, /i/ and /u/] 9
Objectives • Articulatory adaptation (Initial objective) • normalization: extraction of common components (patterns) to control the articulators of several speakers. • To acquire knowledge about inter-speaker variability 10
Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives 11
Articulatory data • Type of data Articulatory data Building articulatory models • Inter-speaker variability: • 11 French speakers (6 males and 5 females) • Articulatory phonetic coverage: • 13 vowels • 10 consonants in 5 vocalic contexts (vowel-consonant-vowel) • 63 articulations in total 12
Recording Methods • Several recording methods considered: • X-ray (Meyer, 1907; Mosher, 1927) • Difficult to accurately identify the contours • Electro-Magnetic Articulography (EMA) • No recording of the whole vocal tract • Magnetic Resonance Imaging (MRI) (Rokkaku et al., 1986) • Tomographic (imaging by sections) • Maintained vocal tract positions • Speakers in supine position • Gravitational effect is moderate (Engwall, 2003; 2006) 13
Decision to use MRI • Whole vocal tract information (unlike EMA) • Contours easier to identify compared to X-ray • No health hazard compared to X-ray • Recording parameters: • Midsagittal image of the vocal tract • Slice thickness: 4 mm • Spatial resolution: 1 mm / pixel • Acquisition time: 8-16 seconds 14
MRI Recording • The speaker is asked to go through several stages • Speakers lie in supine position • Bed shifted into the MRI machine • Setting up of alignment recording properties • Maintained pronunciation of articulations for 8-16 seconds • Speakers are asked not to move their heads 15
Processing of MRI • Rigid contours are drawn once for a given speaker • Positioning of palate using skull bones as reference • Rotation and translation • Positioning of jaw by means of rototranslations • Editing of deformable contours: lips, tongue, velum, etc. • Palates of all articulations are aligned • Avoidance of noise introduced by head movement • Midsagittal contours manually edited [Figure: edited contours for /a/, /i/ and /u/] 16
Contours modelled • Upper tongue: 150 (x,y) points • Lips: 100 (x,y) points • Velum: 150 (x,y) points • Static data Articulatory study/models 17
Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives 18
Universal control parameters • Extraction of common set of patterns (components) • Goals: • Building individual-speaker articulatory models • Controlling all individual articulatory models from a universal set of components [Diagram: articulator contours of individual speakers (/a/, /i/, /u/ for Speaker 1 and Speaker 2) → individual articulatory models (Mspeaker1, Mspeaker2) with control parameters CP/a/, CP/i/, CP/u/ → universal model with a universal set of components and speaker-specific weights] 19
Method for individual models of speakers • Principal component analysis (PCA) • Dimensionality reduction • Extraction of orthogonal components 20
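As an illustration of the PCA step named on this slide, here is a minimal sketch using NumPy's SVD. It assumes each articulation is one row of a data matrix holding flattened (x, y) contour coordinates; the function names `fit_pca` and `reconstruct`, the matrix layout and the random data are hypothetical choices for this example, not taken from the thesis.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA of X (n_observations, n_features) via SVD of the centred data."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are orthonormal directions sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T          # control parameters per observation
    return mean, components, scores

def reconstruct(mean, components, scores):
    """Rebuild contours from the mean, the components and their scores."""
    return mean + scores @ components

# Synthetic stand-in data: 63 articulations, 150 (x, y) points flattened to 300.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 300))
mean, components, scores = fit_pca(X, n_components=4)
X_hat = reconstruct(mean, components, scores)
```

The components are orthonormal by construction, which is why projecting and reconstructing both reduce to a matrix product.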
Assessment of models • Evaluation of the model for an individual speaker X • Variance explained • Root Mean Square Error (RMSE) 21
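The two assessment measures can be sketched as follows. The exact definitions used in the thesis are not given on the slide, so this assumes the common ones: variance explained as one minus the ratio of residual to total sum of squares, and RMSE taken over all contour coordinates. The function names are hypothetical.

```python
import numpy as np

def variance_explained(X, X_hat):
    """Fraction of the variance of X (around its mean) captured by X_hat."""
    total = np.sum((X - X.mean(axis=0)) ** 2)
    residual = np.sum((X - X_hat) ** 2)
    return float(1.0 - residual / total)

def rmse(X, X_hat):
    """Root mean square error over all coordinates, in the units of X."""
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

# Tiny check: a perfect model explains everything, the mean explains nothing.
X = np.array([[0.0, 0.0], [2.0, 2.0]])
perfect = X.copy()
mean_only = np.tile(X.mean(axis=0), (2, 1))
```

With these definitions, reconstructing every observation by the mean contour gives a variance explained of 0, which makes it a natural baseline.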
Generalization properties of models • Performance of models in reconstructing data that was not used for training • Leave-one-out cross-validation procedure (a.k.a. jackknife) • Observation left out → reconstruction of the left-out observation by inverting the model → validation of generalization properties → valuable predictors retained 22
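A minimal sketch of the leave-one-out procedure for a plain PCA model is given below. It assumes that "inverting the model" means estimating the scores of the left-out observation by projection onto the orthonormal components (the least-squares solution); the function name `loo_rmse` and the synthetic data are hypothetical.

```python
import numpy as np

def loo_rmse(X, n_components):
    """Leave-one-out RMSE of a PCA model with n_components components."""
    squared_errors = []
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)          # model never sees row i
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        components = Vt[:n_components]
        x = X[i] - mean
        # Invert the model: least-squares scores, then reconstruction.
        x_hat = (x @ components.T) @ components
        squared_errors.append(np.mean((x - x_hat) ** 2))
    return float(np.sqrt(np.mean(squared_errors)))

# Synthetic data lying exactly in a 2-dimensional affine subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6)) + rng.normal(size=6)
err_1 = loo_rmse(X, 1)
err_2 = loo_rmse(X, 2)
```

On this synthetic data two components generalise perfectly while one does not, which is the kind of comparison the procedure is meant to support when deciding which components to retain.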
Individual tongue models • First component extracted by Linear regression • Jaw Height (predictor) • Three degrees of freedom: x,y translation and rotation (Edwards & Harris, 1990) • Normalized value of the y-coordinate of the lower incisor (Badin & Serrurier (2006)) • Guided PCA model (Badin & Serrurier (2006)) • 4 components extracted Corr(Y,θ) ≈ 0.92 (X,Y) 23
Individual tongue models • Other 3 components extracted by PCA from the residue: • Tongue Body (TB) • Tongue Dorsum (TD) • Tongue Tip (TT) 24
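The two-stage procedure described on these slides (linear regression on jaw height, then PCA on the residue) can be sketched as below. This is an illustrative reading rather than the exact guided PCA of Badin & Serrurier (2006); the function names, the z-score normalisation of the predictor and the synthetic data are assumptions made for the example.

```python
import numpy as np

def guided_pca(X, jaw_height, n_pca):
    """Regress contours on jaw height, then run PCA on the residue."""
    mean = X.mean(axis=0)
    Xc = X - mean
    jh = (jaw_height - jaw_height.mean()) / jaw_height.std()  # normalised predictor
    coef = (jh @ Xc) / (jh @ jh)             # regression slope per coordinate
    residue = Xc - np.outer(jh, coef)        # what jaw height does not explain
    _, _, Vt = np.linalg.svd(residue, full_matrices=False)
    components = Vt[:n_pca]                  # e.g. TB, TD, TT for the tongue
    scores = residue @ components.T
    return mean, jh, coef, components, scores, residue

def guided_reconstruct(mean, jh, coef, components, scores):
    return mean + np.outer(jh, coef) + scores @ components

# Synthetic data: one jaw-driven factor plus three other latent factors.
rng = np.random.default_rng(0)
n_obs, n_feat = 63, 40
jaw = rng.normal(size=n_obs)
X = (rng.normal(size=n_feat)
     + np.outer(jaw, rng.normal(size=n_feat))
     + rng.normal(size=(n_obs, 3)) @ rng.normal(size=(3, n_feat)))
mean, jh, coef, components, scores, residue = guided_pca(X, jaw, n_pca=3)
X_hat = guided_reconstruct(mean, jh, coef, components, scores)
```

Because the residue is orthogonal to the jaw-height predictor, the PCA components extracted afterwards are uncorrelated with jaw height across the corpus.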
Comparison between components Speaker LD Speaker RL Speaker AK Y-Tongue = Coefficients_LR * JH • JH component: • Max. variance: LD • Min. variance: RL, MG, AK • Compensation strategy of MG • TB component: • Represents more variance than other components • Horizontal/diagonal back-front movement • TD component: • vertical/diagonal arching movement • TT component: • Used in different proportion according to the speaker • Nomograms: graphical representation of components • Variation between -3 to 3 27
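A nomogram of the kind mentioned on this slide can be sketched by varying one component from -3 to 3 while all others stay at zero. The function name `nomogram`, the array layout and the synthetic arch-shaped contour are hypothetical; the range of -3 to 3 is taken from the slide and is assumed here to be in units of the normalised component.

```python
import numpy as np

def nomogram(mean_contour, loading, values=None):
    """Contours obtained by varying one component around the mean contour.

    mean_contour, loading: arrays of shape (n_points, 2).
    Returns an array of shape (n_values, n_points, 2).
    """
    if values is None:
        values = np.linspace(-3.0, 3.0, 7)   # -3, -2, ..., 3
    values = np.asarray(values, dtype=float)
    return mean_contour[None, :, :] + values[:, None, None] * loading[None, :, :]

# Synthetic example: an arch-shaped contour with a purely vertical component.
t = np.linspace(0.0, np.pi, 150)
mean_contour = np.column_stack([np.cos(t), np.sin(t)])
loading = np.column_stack([np.zeros_like(t), 0.1 * np.sin(t)])
contours = nomogram(mean_contour, loading)
```

Plotting the seven resulting contours on the same axes gives the fan-like picture typically used to interpret a component.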
Individual lips models Speaker LD Speaker RL • 3 components extracted by Guided PCA model (Badin et al., 2012) • Jaw Height • More influence on LL than UL • Little influence on UL for RL • Protrusion • ULP > LLP for speaker LD • LLP > ULP for speaker RL • Lip height • ULH > LLH for all speakers • Except for speaker LD 25.2% 52.7% 12.7% 28.6% 15.4% 44.6% 1.7% 21.9% 55% 20.5% 31% 34.8% 28
Individual velum models • 2 components extracted by PCA (Serrurier & Badin, 2008): • Velum levator (Oblique movement) - VL • Superior pharyngeal constrictor (horizontal movement) - VS VS VL 29
Individual velum models: consonant /ʁ/ [Figure: VS and VL components for /ʁa/, speakers AA and HL] 30
Conclusions: individual models • Tongue PCA models: 4 components (JH,TB,TD,TT) • Variance Explained: 93%, RMSE: 0.13 cm • Lip models: 3 components (JH, Protrusion, Height) • Variance Explained: 94%, RMSE: 0.04 cm • Velum models: 2 components (VL, VS) • Variance Explained: 90%, RMSE: 0.08 cm 31
Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives 32
Literature on multi-speaker models • PARAFAC models : 2 components extracted • Studies based on EMA (Hoole(1998), Geng(2000), Hu(2006)) • 6-7 speakers, 10-15 vowels, 3-4 sensors on the tongue, 80%-96% variance explained. • Study based on X-ray: Harshman(1977) • 5 speakers, 10 vowels, 13 points, 92.7% • Studies based on MRI (Hoole(2000), Zheng(2003), Ananth(2010)) • 3-9 speakers, 7-13 vowels, 13-150 points, 71%-87% of variance exp. 33
Multi-speaker decomposition methods • Extraction of common set of components • PARAFAC (Harshman,1970) (three-way factor analysis, diagonal speaker adaptation matrix) 34
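To make the three-way structure concrete, the sketch below fits a PARAFAC model by alternating least squares on a tensor indexed by articulation, contour coordinate and speaker. The model is X[i, j, k] ≈ Σ_r A[i, r] · B[j, r] · C[k, r], so each speaker's slice equals A · diag(C[k]) · Bᵀ, which is the diagonal speaker adaptation matrix mentioned above. The axis ordering, the function name `parafac_als`, the fixed iteration count and the synthetic data are assumptions made for illustration; a dedicated tensor library would normally be used in practice.

```python
import numpy as np

def parafac_als(X, rank, n_iter=50, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] B[j, r] C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    errors = []
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem with the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
        errors.append(float(np.sqrt(np.mean((X - X_hat) ** 2))))
    return A, B, C, errors

# Synthetic tensor: 63 articulations x 20 coordinates x 11 speakers, rank 2 plus noise.
rng = np.random.default_rng(1)
X = np.einsum('ir,jr,kr->ijk',
              rng.normal(size=(63, 2)), rng.normal(size=(20, 2)), rng.normal(size=(11, 2)))
X = X + 0.01 * rng.normal(size=X.shape)
A, B, C, errors = parafac_als(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
```

In this layout `A` holds the shared control parameters, `B` the shared components and `C` the speaker-specific weights.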
Multi-speaker decomposition methods • TUCKER 3 • Extension of PARAFAC • Decomposition in all modes of variation 35
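The relation between the two models can be illustrated with a short reconstruction sketch: Tucker3 adds a core tensor G that lets every component of one mode interact with every component of the others, and it reduces to PARAFAC when G is superdiagonal. The function names and the small synthetic factor matrices are hypothetical.

```python
import numpy as np

def tucker3_reconstruct(G, A, B, C):
    """X[i, j, k] = sum_pqr G[p, q, r] * A[i, p] * B[j, q] * C[k, r]."""
    return np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

def parafac_reconstruct(A, B, C):
    """X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

R = 2
rng = np.random.default_rng(0)
A = rng.normal(size=(5, R))
B = rng.normal(size=(4, R))
C = rng.normal(size=(3, R))

# A superdiagonal core of ones turns Tucker3 into PARAFAC.
G = np.zeros((R, R, R))
G[np.arange(R), np.arange(R), np.arange(R)] = 1.0
X_tucker = tucker3_reconstruct(G, A, B, C)
X_parafac = parafac_reconstruct(A, B, C)
```

A full core allows different numbers of components in each mode, at the cost of more parameters and a less direct interpretation.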
Multi-speaker decomposition methods • Joint PCA (two-way analysis adapted to multi-speaker models) (Ananthakrishnan et al., 2010, KTH, Sweden) • All speakers' articulatory measurements for one phoneme are considered as one set of data • Forces common components 36
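One way to read the joint PCA idea is sketched below: for each articulation, the measurements of all speakers are concatenated into a single observation vector, so that one PCA yields scores shared by every speaker and loadings specific to each. This is an interpretation of the slide, not a confirmed reproduction of the cited method; the function names, array layout and random data are hypothetical.

```python
import numpy as np

def joint_pca(data, n_components):
    """data: (n_speakers, n_articulations, n_features) -> shared scores, per-speaker loadings."""
    S, N, F = data.shape
    # One row per articulation, all speakers' coordinates side by side.
    X = np.transpose(data, (1, 0, 2)).reshape(N, S * F)
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]         # common to all speakers
    loadings = Vt[:n_components].reshape(n_components, S, F)
    return mean.reshape(S, F), scores, loadings

def reconstruct_speaker(mean, scores, loadings, k):
    """Contours of speaker k driven by the shared scores."""
    return mean[k] + scores @ loadings[:, k, :]

# Synthetic stand-in: 11 speakers, 63 articulations, 20 coordinates each.
rng = np.random.default_rng(0)
data = rng.normal(size=(11, 63, 20))
mean, scores, loadings = joint_pca(data, n_components=62)
speaker_0_hat = reconstruct_speaker(mean, scores, loadings, 0)
```

With 63 observations the centred matrix has rank at most 62, so 62 components reconstruct this synthetic data exactly; a real model would keep far fewer.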
Comparison of performance between methods • RMSE and Variance Explained (VarEx): multi-speaker models (red, green, black) vs. average of individual speakers' models (blue) [Plots: VarEx and RMSE] 37
Multi-speaker Tongue models • Reference PCA model with 4 components • Total number of components: 11 x 4 = 44 • Student's t-test for RMSE at the 5% significance level: determines whether the RMSEs of the models are significantly different from each other • Joint PCA: 14 – 21 components ( TUCKER ) • PARAFAC: 21 components [Plots: VarEx and RMSE] 38
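The significance test can be sketched as follows. The slide does not say whether a paired or unpaired test was used, so this example assumes a paired test on per-speaker RMSE values, with 11 speakers giving 10 degrees of freedom. The function name is hypothetical and the RMSE values are synthetic placeholders generated for the example, not results from the study.

```python
import numpy as np

# Two-sided 5% critical value of Student's t for 10 degrees of freedom.
T_CRIT_DF10_5PCT = 2.228

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return float(t), n - 1

# Synthetic per-speaker RMSE values (cm) for two hypothetical models.
rng = np.random.default_rng(0)
rmse_reference = 0.13 + 0.01 * rng.normal(size=11)
rmse_candidate = rmse_reference + 0.01 * rng.normal(size=11)

t, dof = paired_t_statistic(rmse_candidate, rmse_reference)
significantly_different = abs(t) > T_CRIT_DF10_5PCT
```

Under this reading, the smallest multi-speaker model whose RMSE is not significantly different from the reference would be the one reported as equivalent.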
Multi-speaker Tongue models • Individual models: reference PCA model with 44 (11 x 4) components • VarEx: 93.23 % • RMSE: 0.13 cm • Multi-speaker models: Joint PCA with 4 components • VarEx: 72.16 % • RMSE: 0.27 cm • Interpretation of components: JH, TB, TD and TT • Equivalent solution: Joint PCA with 21 components • VarEx: 94.88 % • RMSE: 0.12 cm • Lack of interpretation from the 5th component onwards • Literature • No. of components: 2 • VarEx: 71% - 96% • Corpus: 7-15 vowels • Speakers: 3-9 • Present study • Corpus: 63 articulations (vowels and consonants) • Speakers: 11 39
Multi-speaker models: lips and velum • Lips and velum models comparable with tongue models • Lips • individual models: 33 components (3 * 11) • multi-speaker joint PCA models: equivalent with 21 components • Reduced no. of components: 3 interpretable components • (JH, protrusion, lip height) • Velum • individual models: 22 components (2 * 11) • multi-speaker joint PCA models: equivalent with 14 components • Reduced no. of components: 2 components • (Oblique, horizontal) 40
Summary • Context of visual articulatory feedback • Articulatory data • Individual models and characterisation • Multi-speaker models • Conclusions and perspectives 41
Conclusions • Data • Unique set of articulatory data for French • MRI for the whole vocal tract for 11 French speakers • Contours • Vowels and consonants • More speakers compared to the literature • Characterisation of different speakers’ strategies • Tongue • Upper and lower lip • Velum • Multi-speaker models (normalisation) of tongue, lips and velum contours • No work in the literature on lips and velum 42
Perspectives • More speakers • Relation between articulatory strategies and acoustics • Cross-speaker velum variability • Influence of the tongue movement • Nasality • new modelling solutions • Non-linear methods: • Kernel PCA • Artificial Neural Networks (ANN) • Support Vector Machines (SVM) 43
Acknowledgments • Laurent Lamalle (IRMaGe, Grenoble) • Speakers • ARTIS project (GIPSA-lab, LORIA) 43
Thank you for your attention Questions? 44
Grid system • Maeda S. (1979): Fixed grid • Busset J. (2013): Adaptive grid system • Euclidean coordinates (intersections) • Distances and extreme angles • Polar coordinates (distances and angles for each grid line) • Beautemps et al. (2001): adapted to each articulation [Figure labels: Euclidean coordinates; Distances and TngAdv + TngBot] 46
Corr(Y-jaw,Angle_rotation) PB = 0.6611 YL = 0.7385 LH = 0.7174 RL = 0.3946 LD = 0.8423 BR = 0.7764 HL = 0.7913 AA = 0.4952 MG = 0.4151 AK = 0.8317 MGO = 0.9228 (X,Y) 47
Acoustic simulation • Grid system → midsagittal function • Vocal tract area function (series of areas and lengths of each sagittal section) • α, β models (Beautemps et al., 1995; Heinz & Stevens, 1965): A = α · d^β, where A = area of a given grid section and d = midsagittal distance • α, β coefficients depend on subject and vocal tract location • α, β according to speaker of reference: PB • Vocal tract acoustic transfer function (Fant, 1960; Badin & Fant, 1984) • Formants 48
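The conversion from midsagittal distances to an area function can be sketched with the power-law form usually associated with the α, β model, A = α · d^β. The coefficient values below are arbitrary placeholders chosen for the example, not the values fitted for speaker PB, and the function name is hypothetical.

```python
import numpy as np

def area_function(d, alpha, beta):
    """Cross-sectional areas A = alpha * d**beta for midsagittal distances d.

    alpha and beta may be scalars or arrays with one value per grid line,
    since the coefficients depend on the location along the vocal tract.
    """
    d = np.asarray(d, dtype=float)
    return np.asarray(alpha, dtype=float) * d ** np.asarray(beta, dtype=float)

# Placeholder distances (cm) along five grid lines and placeholder coefficients.
d = np.array([0.5, 1.0, 1.5, 1.0, 0.3])
alpha = np.array([1.5, 1.8, 2.0, 1.8, 1.5])
beta = np.array([1.4, 1.5, 1.5, 1.5, 1.4])
areas = area_function(d, alpha, beta)
```

The resulting areas, together with the lengths of the sections, form the input to the acoustic transfer function from which formants are estimated.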
“Essentially, all models are wrong, but some are useful” George Edward Pelham Box 50