Un androïde doué de parole (A speech-gifted android)
Institut de la Communication Parlée / Laplace
The goal of the project
Robotic tools (theories, algorithms, paradigms) applied to a human cognitive system (speech) instead of a human “artefact” (a “robot”).
Or: study speech as a robotic system (a speaking android).
Speech: not an information-processing system, but a sensori-motor system plugged into language. This system deals with control, learning, inversion, adaptation, multisensoriality, communication… hence robotics!
" In studying human intelligence, three common conceptual errors often occur:lreliance on monolithic internal models, lon monolithic control,land on general purpose processing. A modern understanding of cognitive science and neuroscience refutes these assumptions. « Cog » at MIT (R. Brooks) http://www.ai.mit.edu/projects/cog/methodology.html
Our alternative methodology is based on evidence from cognitive science and neuroscience which focus on four alternative attributes which we believe are critical attributes of human intelligence: embodiment and physical coupling, multimodal integration, developmental organization, and social interaction.
Talking Cog, a speaking android
ICP: speech modelling, speech robotics
Laplace: Bayesian robotics
Austin: speech ontogenesis
« Talking Cog » articulatory model
[Figure: the seven articulatory parameters: lip protrusion, lip separation, jaw height, tongue body, tongue dorsum, tongue tip, larynx height; vocal-tract configurations shown for [i], [u] and [a].]
« Talking Cog » sensors
[Figure: audition (formants F1 to F4), vision, touch.]
Learning: Bayesian inference
A sensori-motor agent learns sensori-motor relationships through active exploration: it acquires the joint distribution p(M, P) over its motor variables (M) and perceptual variables (P).
Acquire controls from percepts: p(M | P), given a perceptual input (a target?).
Regularise percepts from actions: p(P | M), given an incomplete perceptual input.
Predict one modality from another one: p(P2 | P1), with P1 orosensorial and P2 audio.
Coherently fuse two or more modalities: p(M | P1, P2), combining sensor s1 (P1) with sensor s2 (P2). All four inferences are sketched below.
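As an illustration (my own construction under toy assumptions, not the project's implementation), all four queries can be answered from one joint Gaussian learned by exploration, since conditioning a Gaussian is closed-form:

```python
# Sketch: an agent learns a joint Gaussian p(M, P1, P2) by active exploration,
# then answers the four queries by Gaussian conditioning. The linear "vocal
# tract" (0.8*M, -0.5*M) is a toy stand-in, not the real articulatory model.
import numpy as np

rng = np.random.default_rng(1)

# Exploration: babble motor commands M, observe two percepts P1 (orosensorial)
# and P2 (audio) through the toy forward model.
M = rng.normal(size=(1000, 1))
P1 = 0.8 * M + rng.normal(scale=0.1, size=M.shape)
P2 = -0.5 * M + rng.normal(scale=0.2, size=M.shape)
X = np.hstack([M, P1, P2])           # columns: 0 = M, 1 = P1, 2 = P2

mu = X.mean(axis=0)                  # learned joint p(M, P1, P2)
S = np.cov(X, rowvar=False)

def condition(obs_idx, obs_val):
    """Gaussian conditioning: mean and covariance of the remaining variables."""
    rest = [i for i in range(len(mu)) if i not in obs_idx]
    K = S[np.ix_(rest, obs_idx)] @ np.linalg.inv(S[np.ix_(obs_idx, obs_idx)])
    mu_c = mu[rest] + K @ (np.asarray(obs_val) - mu[obs_idx])
    S_c = S[np.ix_(rest, rest)] - K @ S[np.ix_(obs_idx, rest)]
    return mu_c, S_c

print(condition([1], [0.4]))           # p(M, P2 | P1): inversion + prediction
print(condition([0], [1.0]))           # p(P1, P2 | M): regularise percepts
print(condition([1, 2], [0.4, -0.2]))  # p(M | P1, P2): multimodal fusion
```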
The route towards adult speech: learning control, through exploration & imitation
0 months: imitation of the three major speech gestures
4 months: vocalisation, imitation
7 months: jaw cycles (babbling)
Later: control of the carried articulators (lips, tongue) for vowels and consonants
First experiment: simulating exploration from 4 to 7 months
Phonetic data (sounds and formants) on 4- and 7-month-old babies' vocalisations.
Acoustical framing
[Figure: true infant data against the maximal acoustical space, in the (F1, F2) plane.]
Results
[Figure: (F1, F2) vowel spaces at pre-babbling (4 months) and babbling onset (7 months), with high/low and front/back axes. Black: android capacities; colour: infant productions (central and mid-high at 4 months, central and spanning high to low at 7 months).]
Articulatory framing
Various sub-models: which one is the best?
Method: compare the real and the theoretical (F1, F2) distributions, compute P(M | f1 f2), and select the best sub-model M (sketched below).
[Figure: candidate sub-models in the (F1, F2) plane: too restricted, too wide, and the best one.]
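A minimal sketch of such a selection step, under my own assumptions (Gaussian fits, synthetic stand-in data); the slides only specify comparing the real and theoretical distributions and keeping the best sub-model M:

```python
# Score each candidate sub-model by the likelihood it assigns to the real
# infant (F1, F2) data; a too-restricted model is punished by the data's
# tails, a too-wide model by its diluted density.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def score(model_formants, real_formants):
    """Mean log-likelihood of the real data under a Gaussian fit to a sub-model's output."""
    mu = model_formants.mean(axis=0)
    cov = np.cov(model_formants, rowvar=False)
    return multivariate_normal(mu, cov).logpdf(real_formants).mean()

# Stand-in data: (F1, F2) in Hz. "Real" infant data and three hypothetical sub-models.
real = rng.normal([500, 1500], [80, 200], size=(200, 2))
candidates = {
    "too restricted": rng.normal([500, 1500], [30, 80], size=(500, 2)),
    "too wide": rng.normal([500, 1500], [300, 800], size=(500, 2)),
    "matched": rng.normal([500, 1500], [80, 200], size=(500, 2)),
}
scores = {name: score(f, real) for name, f in candidates.items()}
print(max(scores, key=scores.get), scores)  # "matched" should win
```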
Results
[Figure: best sub-models in the (F1, F2) plane. Pre-babbling (4 months): lips and tongue. Babbling onset (7 months): lips and tongue + jaw (J).]
Conclusion I
1. Acoustical framing: cross-validation of the data and the model.
2. Articulatory framing: articulatory abilities / exploration. 4 months: tongue dorsum/body + lips; 7 months: idem + jaw.
3. More on early sensori-motor maps.
Second experiment: simulating imitation at 4 months
From visuo-motor imitation at 0 months to audio-visuo-motor imitation at 4 months.
Early vocal imitation [Kuhl & Meltzoff, 1996]: 3- to 5-month-old babies hearing/seeing adult speech ([a i u]) give about 60% « good responses ».
Questions
1. Is the imitation process visual, auditory, or audio-visual?
2. How much exploration is necessary for imitation?
3. Is it possible to reproduce the experimental pattern of performances?
Testing visual imitation
[Diagram: the adult's lip area Al (Al_i, Al_a, Al_u) is inverted through the 4-month lips-tongue model; the production is then categorised from (f1, f2) as i, a or u.]
Visual imitation: simulation results
[Figure: experimental vs. simulation data: total productions of u, i, a as a function of Al.]
The experimental data do not match the response profiles predicted by visual imitation.
Testing audio imitation
[Diagram: articulatory inputs (Lh, Tb, Td) feed intermediary control variables (Xh, Yh, Al), which drive the vocal tract and yield the auditory outputs (F1, F2).]
The three intermediary control variables correspond to crucial parameters for control: they are connected to orosensorial channels and simplify the control of the seven-parameter articulatory model.
Parametrisation and decomposition
Joint probability: P(Lh Tb Td Xh Yh Al F1 F2)
Articulatory variables Lh, Tb, Td: Gaussian. Control variables Xh, Yh, Al: Laplace. Auditory variables F1, F2: Gaussian.
P(Xh Yh Al) = P(Xh) P(Yh) P(Al)
P(Lh Tb Td F1 F2 | Xh Yh Al) = P(Lh Tb Td | Xh Yh Al) P(F1 F2 | Xh Yh Al)
P(Lh | Xh Yh Al) = P(Lh | Al)
P(Tb Td | Xh Yh Al) = P(Tb Td | Xh Yh)
Dependence structure
P(Lh Tb Td Xh Yh Al F1 F2) = P(Xh) P(Yh) P(Al) P(Lh | Al) P(Tb | Xh Yh) P(Td | Xh Yh Tb) P(F1 | Xh Yh Al) P(F2 | Xh Yh Al)
Learning: description of the sensori-motor behaviour.
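To make the factorisation concrete, here is a hedged ancestral-sampling sketch; every numeric coefficient below is a placeholder assumption, since the slides fix only the dependence structure and the Laplace/Gaussian family choices:

```python
# Ancestral sampling of the factored joint, one factor at a time, in the
# order given by the dependence structure above.
import numpy as np

rng = np.random.default_rng(3)

def sample_joint():
    # Control variables: Laplace priors P(Xh) P(Yh) P(Al).
    Xh, Yh, Al = rng.laplace(size=3)
    # Articulatory variables: Gaussians conditioned on the controls.
    Lh = rng.normal(0.9 * Al, 0.1)                        # P(Lh | Al)
    Tb = rng.normal(0.5 * Xh + 0.5 * Yh, 0.1)             # P(Tb | Xh Yh)
    Td = rng.normal(0.3 * Xh + 0.3 * Yh + 0.4 * Tb, 0.1)  # P(Td | Xh Yh Tb)
    # Auditory variables: Gaussians conditioned on the controls.
    F1 = rng.normal(500 + 120 * Yh - 40 * Al, 30)         # P(F1 | Xh Yh Al)
    F2 = rng.normal(1500 - 300 * Xh - 150 * Al, 80)       # P(F2 | Xh Yh Al)
    return dict(Lh=Lh, Tb=Tb, Td=Td, Xh=Xh, Yh=Yh, Al=Al, F1=F1, F2=F2)

print(sample_joint())
```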
The challenge: given the exploration defined in Exp. 1, how much data (self-vocalisations) is necessary to learn enough to produce 60% correct responses in Exp. 2?
The idea: if the amount of learning data is small, the discretisation of the control space should be coarse. A toy sketch of this trade-off follows.
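A minimal illustration under my own assumptions (a 1-D sine stand-in for the vocal tract, piecewise-constant inverse models): for each learning-set size, grids of several resolutions are fitted and scored by held-out audio error; coarse grids should win when data are scarce:

```python
# Trade-off between learning-set size and control-space discretisation.
import numpy as np

rng = np.random.default_rng(0)

def forward(m):
    """Toy 1-D motor -> acoustic map standing in for the vocal tract."""
    return np.sin(2.5 * m)

def inversion_error(n_train, n_bins, n_test=500):
    # Learn: babble n_train motor commands, observe noisy acoustics.
    m_train = rng.uniform(-1, 1, n_train)
    a_train = forward(m_train) + rng.normal(0, 0.05, n_train)
    # Discretise the acoustic space into n_bins cells; the inverse model is
    # the mean motor command observed in each cell.
    edges = np.linspace(-1.2, 1.2, n_bins + 1)
    idx = np.clip(np.digitize(a_train, edges) - 1, 0, n_bins - 1)
    inv = np.full(n_bins, np.nan)
    for b in range(n_bins):
        if np.any(idx == b):
            inv[b] = m_train[idx == b].mean()
    filled = np.flatnonzero(~np.isnan(inv))
    for b in np.flatnonzero(np.isnan(inv)):
        inv[b] = inv[filled[np.argmin(np.abs(filled - b))]]  # nearest filled cell
    # Test: target acoustics -> inverted motor command -> produced acoustics.
    a_target = forward(rng.uniform(-1, 1, n_test))
    b_t = np.clip(np.digitize(a_target, edges) - 1, 0, n_bins - 1)
    return np.sqrt(np.mean((forward(inv[b_t]) - a_target) ** 2))

for n_train in (4, 32, 256, 2048):
    errs = {n_bins: round(float(inversion_error(n_train, n_bins)), 3)
            for n_bins in (4, 16, 64)}
    print(n_train, "learning samples ->", errs, "| best grid:", min(errs, key=errs.get))
```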
Inversion results
[Figure: RMS audio error of the inversion process (F1, F2, in Bark) as a function of the size of the control space and the size of the learning space (4, 32, 256, 2048).]
[Figure: optimal learning-space size vs. control-space size (sizes: random, 4, 32, 256, 2048).]
Simulating audio-motor imitation
[Diagram: audio targets [i a u] (F12_i, F12_a, F12_u in the (F1, F2) plane) are inverted through the 4-month lips-tongue model; productions are categorised from (f1, f2) as i, a or u.]
A sketch of this loop follows.
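A hedged sketch (the forward model and the prototype formant values are my placeholders, not the project's): inversion picks the discretised control triple whose predicted formants best match the audio target, and the android's own production is then categorised by nearest prototype:

```python
# Audio-motor imitation loop: target -> inversion -> production -> categorisation.
import numpy as np

# Rough (F1, F2) prototypes in Hz for the three vowels; placeholder values.
PROTOS = {"i": (300.0, 2300.0), "a": (700.0, 1300.0), "u": (300.0, 800.0)}

def produce(Xh, Yh, Al):
    """Placeholder linear forward model standing in for the vocal-tract synthesis."""
    return 500 + 120 * Yh - 40 * Al, 1500 - 300 * Xh - 150 * Al

def imitate(target, grid):
    # Inversion: exhaustive search over the discretised control space.
    best = min(grid, key=lambda c: np.hypot(*np.subtract(produce(*c), target)))
    f1, f2 = produce(*best)
    # Categorisation of the android's own production by nearest prototype.
    return min(PROTOS, key=lambda v: np.hypot(f1 - PROTOS[v][0], f2 - PROTOS[v][1]))

axis = np.linspace(-3, 3, 8)
grid = [(x, y, a) for x in axis for y in axis for a in axis]  # 512-cell control space
for vowel, target in PROTOS.items():
    print(vowel, "->", imitate(target, grid))
```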
Simulation results
[Figure: imitation performance for control-space sizes 2048, 256, 32 and 4, compared with infants.]
Results
[Figure: reality (audio-visual targets) vs. simulations (known auditory targets): productions.]
Conclusion II
1. 10 to 30 vocalisations are enough for an infant to learn to produce 60% good vocalisations in the audio-imitation paradigm!
2. Three major factors drive the baby android's performance: learning-set size, control-space size, and the variance distribution in the learning set (not shown here).
Final conclusions and perspectives
1. Some of the exploration and imitation behaviour of human babies has been reproduced by their android cousins (feasibility / understanding).
2. The developmental path must be further explored, and the baby android must be questioned about what it really learned and what it can do at the end of the learning process.