Multimodal Expressive Embodied Conversational Agents Université Paris 8 Catherine Pelachaud Elisabetta Bevacqua Nicolas Ech Chafai, FT Maurizio Mancini Magalie Ochs, FT Christopher Peters Radek Niewiadomski
ECAs Capabilities • Anthropomorphic autonomous figures • New form of human-machine interaction • Study of human communication, human-human interaction • ECAs ought to be endowed with dialogic and expressive capabilities • Perception: an ECA must be able to pay attention to and perceive the user and the context she is placed in.
ECAs capabilities • Interaction: • speaker and addressee emit signals • speaker perceives feedback from addressee • speaker may decide to adapt to addressee’s feedback • consider social context • Generation: expressive synchronized visual and acoustic behaviors. • produce expressive behaviors • words, voice, intonation • gaze, facial expression, gesture • body movements, body posture
Synchrony tool - BEAT • Cassell et al, Media Lab MIT • Decomposition of text into theme and rheme • Linked to WordNet • Computation of: • intonation • gaze • gesture
Virtual Training Environments MRE (J. Gratch, L. Johnson, S. Marsella…, USC)
Interactive System • Real estate agent • Gesture synchronized with speech and intonation • Small talk • Dialog partner
MAX, S. Kopp, U of Bielefeld Gesture understanding and imitation
Problem to Be Solved • Human communication is endowed with three devices to express communicative intention: • Verbs and formulas • Intonation and paralinguistics • Facial expression, gaze, gesture, body movement, posture… • Problem: For any communicative act, the Speaker has to decide: • Which nonverbal behaviors to show • How to execute them
Verbal and Nonverbal Communication • Suppose I want to advise a friend to take her umbrella because it is raining. • Which signals do I use? • Verbal signal: use of a syntactically complex sentence: Take your umbrella because it is raining • Verbal + nonverbal signals: Take your umbrella + point to the window to show the rain by a gesture or by gaze
Multimodal Signals • The whole body communicates by using: • Verbal acts (words and sentences) • Prosody, intonation (nonverbal vocal signals) • Gesture (hand and arm movements) • Facial action (smile, frown) • Gaze (eyes and head movements) • Body orientation and posture (trunk and leg movements) • All these systems of signals have to cooperate in expressing the overall meaning of the communicative act.
Multimodal Signals • Accompany flow of speech • Synchronized at the verbal level • Punctuate accented phonemic segments and pauses • Substitute for word(s) • Emphasize what is being said • Regulate the exchange of speaking turn
Synchronization • There exists an isomorphism between patterns of speech, intonation and facial actions • Different levels of synchrony: • Phoneme level (blink) • Word level (eyebrow) • Phrase level (hand gesture) • Interactional synchrony: Synchrony between speaker and addressee
Taxonomy of Communicative Functions (I. Poggi) • The speaker may provide three broad types of information: • Information about the world: deictic, iconic (adjectival),… • Information about the speaker’s mind: • belief (certainty, adjectival) • goal (performative, rheme/theme, turn-system, belief relation) • emotion • meta-cognitive • Information about the speaker’s identity (sex, culture, age…)
Multimodal Signals (Isabella Poggi) • Characterization of multimodal signals by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.: • Raised eyebrow may signal surprise, emphasis, question mark, suggestion… • Smile may express happiness, be a polite greeting, be a backchannel signal… • Two pieces of information are needed to characterize multimodal signals: • Their meaning • Their visual action
Expression meaning • deictic: this, that, here, there • adjectival: small, difficult • certainty: certain, uncertain… • performative: greet, request • topic-comment: emphasis • belief-relation: contrast,… • turn allocation: take/give turn • affective: anger, fear, happy-for, sorry-for, envy, relief, …
Expression signal • Deictic: gaze direction • Certainty: Certain: palm up open hand; Uncertain: raised eyebrow • Adjectival: small eye aperture • Belief-relation: Contrast: raised eyebrow • Performative: Suggest: small raised eyebrow, head aside; Assert: horizontal ring • Emotion: Sorry-for: head aside, inner eyebrow up; Joy: raising fist up • Emphasis: raised eyebrows, head nod, beat
Lexicon = (meaning, signal)
Representation Language • Affective Presentation Markup Language – APML • describes the communicative functions • works at meaning level and not the signal level
<APML>
  <turn-allocation type="take turn">
    <performative type="greet"> Good Morning, Angela. </performative>
    <affective type="happy"> It is so <topic-comment type="comment"> wonderful </topic-comment> to see you again. </affective>
    <certainty type="certain"> I was <topic-comment type="comment"> sure </topic-comment> we would do so, one day!</certainty>
  </turn-allocation>
</APML>
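A minimal sketch of how the APML example could be read back into (tag, type, text) triples, using only Python's standard XML parser. The function name `communicative_functions` is hypothetical and is not part of the system described on the slides; the sketch only assumes that every communicative function is an element carrying a `type` attribute, as in the example.

```python
import xml.etree.ElementTree as ET

# The APML example from the slide, whitespace-normalized.
APML_TEXT = """<APML>
  <turn-allocation type="take turn">
    <performative type="greet">Good Morning, Angela.</performative>
    <affective type="happy">It is so
      <topic-comment type="comment">wonderful</topic-comment>
      to see you again.</affective>
    <certainty type="certain">I was
      <topic-comment type="comment">sure</topic-comment>
      we would do so, one day!</certainty>
  </turn-allocation>
</APML>"""


def communicative_functions(apml_text):
    """Return (tag, type, spanned text) for every typed element, in document order."""
    root = ET.fromstring(apml_text)
    found = []
    for elem in root.iter():
        if "type" in elem.attrib:
            # itertext() yields the element's own text plus that of nested tags.
            text = " ".join("".join(elem.itertext()).split())
            found.append((elem.tag, elem.attrib["type"], text))
    return found


for tag, kind, text in communicative_functions(APML_TEXT):
    print(f"{tag:16s} {kind:10s} {text}")
```

Nested tags such as `topic-comment` come out as separate triples, which mirrors the idea that tags describe meaning over a span of text rather than signals.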
Facial Description Language • Facial expressions defined as (meaning, signal) pairs stored in a library • Hierarchical set of classes: • Facial basis (FB) class: basic facial movement • An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the '+' operator • FB={fap3=v1,…,fap69=vk}; • FB'=c1*FB1+c2*FB2; • where c1 and c2 are constants and FB1 and FB2 can be: • Previously defined FBs • FBs of the form: {fap3=v1,…,fap69=vk}
Facial basis class • Examples of facial basis class: • Eyebrow: small_frown, left_raise, right_raise • Eyelid: upper_lid_raise • Mouth: left_corner_stretch, left_corner_raise
Facial Displays • Every facial display (FD) is made up of one or more FBs: • FD=FB1 + FB2 + FB3 + … + FBn; • surprise=raise_eyebrow+raise_lid+open_mouth; • worried=(surprise*0.7)+sadness;
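The '+' and '*' operators on facial bases can be sketched as arithmetic on dictionaries of FAP values. This is only an illustration: the helper names `scale` and `combine`, the FAP indices chosen, and all numeric values are assumptions, not taken from the slides.

```python
def scale(fb, c):
    """c * FB: multiply every FAP value of a facial basis by a constant."""
    return {fap: c * value for fap, value in fb.items()}


def combine(*fbs):
    """FB1 + FB2 + ...: sum FAP values, treating FAPs absent from an FB as 0."""
    result = {}
    for fb in fbs:
        for fap, value in fb.items():
            result[fap] = result.get(fap, 0.0) + value
    return result


# Hypothetical facial bases (FAP indices and values are made up for the example).
raise_eyebrow = {"fap31": 100.0, "fap32": 100.0}
raise_lid = {"fap19": -60.0, "fap20": -60.0}
open_mouth = {"fap3": 200.0}
sadness = {"fap31": 40.0, "fap12": -30.0}

# surprise = raise_eyebrow + raise_lid + open_mouth
surprise = combine(raise_eyebrow, raise_lid, open_mouth)
# worried = (surprise * 0.7) + sadness
worried = combine(scale(surprise, 0.7), sadness)
print(worried)
```

Because a facial display is itself a dictionary of FAP values, it can be reused as an operand, which is what makes the recursive definition on the slide work.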
Facial Displays • Probabilistic mapping between the tags and signals: • E.g.: happy_for = (smile*0.5, 0.3) + (smile*0.25) + (smile*2 + raised_eyebrow, 0.35) + (nothing, 0.1) • Definition of a function class for addressee association (meaning, signal) • Class communicative function: • Certainty • Adjectival • Performative • Affective • …
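One way to read the probabilistic mapping is as a weighted choice among alternative signals for a tag. The sketch below assumes exactly that. The second alternative on the slide carries no probability, so the value 0.25 used here is an assumption chosen only so that the weights sum to 1; `pick_signal` is a hypothetical helper name.

```python
import random

# Alternatives for the happy_for tag as (signal expression, probability).
HAPPY_FOR = [
    ("smile*0.5", 0.30),
    ("smile*0.25", 0.25),  # ASSUMPTION: probability not given on the slide
    ("smile*2 + raised_eyebrow", 0.35),
    ("nothing", 0.10),
]


def pick_signal(alternatives, rng):
    """Draw one signal expression according to the listed probabilities."""
    signals = [signal for signal, _ in alternatives]
    weights = [prob for _, prob in alternatives]
    return rng.choices(signals, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so that runs are repeatable
print([pick_signal(HAPPY_FOR, rng) for _ in range(5)])
```

With this reading, the same tag does not always produce the same facial display, which avoids a mechanical look when a tag recurs.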
Gestural Lexicon • Certainty: • Certain: palm up open hand • Uncertain: showing empty hands while lowering forearms • Belief-relation: • List of items of same class: numbering on fingers • Temporal relation: fist with extended hand moves back and forth behind one’s shoulder • Turn-taking: • Hold the floor: raise hand, palm toward hearer • Performative: • Assert: horizontal ring • Reproach: extended index, palm to left, rotating up & down on wrist • Emphasis: beat
Gesture Specification Language • Scripting language for hand-arm gestures, based on formational parameters [Stokoe]: • Hand shape specified using HamNoSys [Prillwitz et al.] • Arm position: concentric squares in front of agent [McNeill] • Wrist orientation: palm and finger base orientation • Gestures are defined by a sequence of timed key poses: gesture frame • Gestures are broken down temporally into distinct (optional) phases: • Gesture phase: preparation, stroke, hold, retraction • Change of formational components over time
Gesture Temporal Course • rest position → preparation → stroke (stroke start – stroke end) → retraction → rest position
ECA Architecture • Input to the system: APML annotated text • Output of the system: Animation files and WAV file for the audio • System: • Interprets APML tagged dialogs, i.e. all communicative functions • Looks up in a library the mapping between the meaning (specified by the XML tag) and signals • Decides which signals to convey on which modalities • Synchronizes the signals with speech at different levels (word, phoneme or utterance)
Modules • APML Parser: XML parser • TTS Festival: manages the speech synthesis and gives the list of phonemes and phoneme durations • Expr2Signal Converter: given a communicative function and its meaning, this module returns the list of facial signals • Conflicts Resolver: resolves the conflicts that may happen when more than one facial signal should be activated on the same facial part • Face Generator: converts the facial signals into MPEG-4 FAP values • Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs • MPEG4 FAP Decoder: an MPEG-4 compliant Facial Animation Engine
TTS Festival • Drives the synchronization of facial expressions • Synchronization implemented at word level • Timing of facial expression connected to the text embedded between the markers • Use of the tree structure of Festival to compute expression durations
Expr2Signal Converter • Instantiation of APML tags: meaning of a given communicative function • Converts markers into facial signals • Use of a library containing the lexicon of (meaning, facial expression) pairs
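The Conflicts Resolver step can be illustrated with a small sketch. The slides do not say which rule decides between competing signals, so the numeric priority used here is purely an assumption, as are the function name `resolve_conflicts` and the example signals.

```python
def resolve_conflicts(signals):
    """Keep one signal per facial part.

    signals: list of (signal_name, facial_part, priority).
    ASSUMPTION: the signal with the highest priority wins; on a tie the
    earlier signal is kept. The real resolution rule is not given in the text.
    """
    winners = {}
    for name, part, priority in signals:
        current = winners.get(part)
        if current is None or priority > current[1]:
            winners[part] = (name, priority)
    return {part: name for part, (name, _) in winners.items()}


requested = [
    ("raised_eyebrow", "eyebrow", 1),  # e.g. from an emphasis tag
    ("frown", "eyebrow", 2),           # e.g. from an affective tag
    ("smile", "mouth", 1),
]
print(resolve_conflicts(requested))
```

Signals on different facial parts never conflict in this sketch, so a smile and a frown can be shown together.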
Gaze Model • Based on Isabella Poggi’s model of communicative functions • This model predicts what the value of gaze should be in order to have a given meaning in a given conversational context. • For example: • if the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant.
Gaze Model • Very deterministic behavior model: to every Communicative Function associated with a meaning corresponds the same signal (with probabilistic changes) • Event-driven model: the associated signals are computed only when a Communicative Function is specified • Only when a Communicative Function is specified may the corresponding behavior vary
Gaze Model • Several drawbacks, as there is no temporal consideration: • No consideration of past and current gaze behavior to compute the new one • No consideration of how long the current gaze state of S and L (speaker and listener) has lasted
Gaze Algorithm • Two steps: • Communicative prediction: • Apply the communicative function model to compute the gaze behavior so as to convey a given meaning for S and L • Statistical prediction: • The communicative gaze model is probabilistically modified by a statistical model defined with constraints: • what the communicative gaze behavior of S and L is • which gaze behavior S and L were in • the duration of the current state of S and L
Temporal Gaze Parameters • The gaze behaviors depend on the communicative functions, the general purpose of the conversation (persuasive discourse, teaching...), personality, cultural roots, social relations... • The full model is too complex, so we propose parameters that control the overall gaze behavior • $T^{\max}_{S=1,L=1}$: maximum duration the mutual gaze state may remain active. • $T^{\max}_{S=1}$: maximum duration of gaze state S=1. • $T^{\max}_{L=1}$: maximum duration of gaze state L=1. • $T^{\max}_{S=0}$: maximum duration of gaze state S=0. • $T^{\max}_{L=0}$: maximum duration of gaze state L=0.
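The interplay between the communicative prediction and the maximum-duration parameters can be sketched as follows. This is a deliberately simplified, deterministic stand-in for the statistical step: it only enforces the five maximum durations. The class and function names are hypothetical, and the choice that the speaker is the one who breaks mutual gaze is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GazeLimits:
    mutual_max: float  # max duration of the mutual gaze state S=1, L=1
    s1_max: float      # max duration of S=1
    l1_max: float      # max duration of L=1
    s0_max: float      # max duration of S=0
    l0_max: float      # max duration of L=0


def cap(predicted, current, elapsed, max_if_1, max_if_0):
    """Flip the state if the prediction would prolong it past its maximum."""
    limit = max_if_1 if current == 1 else max_if_0
    if predicted == current and elapsed >= limit:
        return 1 - current
    return predicted


def next_gaze(predicted, current, elapsed_s, elapsed_l, elapsed_mutual, limits):
    """predicted and current are (S, L) pairs of 0/1 gaze states."""
    s = cap(predicted[0], current[0], elapsed_s, limits.s1_max, limits.s0_max)
    l = cap(predicted[1], current[1], elapsed_l, limits.l1_max, limits.l0_max)
    if (s, l) == (1, 1) and current == (1, 1) and elapsed_mutual >= limits.mutual_max:
        s = 0  # ASSUMPTION: the speaker breaks an over-long mutual gaze
    return (s, l)


limits = GazeLimits(mutual_max=2.0, s1_max=5.0, l1_max=5.0, s0_max=3.0, l0_max=3.0)
# Mutual gaze has lasted 2.5 s, longer than mutual_max, so it is broken.
print(next_gaze((1, 1), (1, 1), 2.5, 2.5, 2.5, limits))
```

A fuller version would replace the hard caps with probabilities that depend on the previous state and its duration, as the statistical prediction step describes.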
Gesture Planner • Adaptive instantiation: • Preparation and retraction phase adjustments • Transition key and rest gesture insertion • Joint-chain follow-through • Forward time shifting of children joints • Stroke of gesture on stressed word • Stroke expansion • During planning phase, identify rheme clauses with closely repeated emphases/pitch accents • Indicate secondary accents by repeating the stroke of the primary gesture with decreasing amplitude
Gesture Planner • Determination of gesture: • Look in dictionary • Selection of gesture • Gestures associated with most embedded tags have priority (except beat): adjectival, deictic • Duration of gesture: • Coarticulation between successive gestures close in time • Hold for gestures belonging to tags higher up in the hierarchy (e.g. performative, belief-relation) • Otherwise go to rest position
Behavior Expressivity • Behavior is related to (Wallbott, 1998): • quality of the mental state (e.g. emotion) it refers to • quantity (somehow linked to the intensity factor of the mental state) • Behaviors encode: • content information (the ‘what is communicated’) • expressive information (the ‘how it is communicated’) • Behavior expressivity refers to the manner of execution of the behavior
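Stroke expansion, where secondary accents repeat the primary stroke with decreasing amplitude, can be sketched with a geometric decay. The decay factor and the function name are assumptions; the slides do not specify how fast the amplitude decreases.

```python
def repeated_stroke_amplitudes(primary_amplitude, n_secondary, decay=0.5):
    """Amplitudes for the primary stroke followed by n_secondary repeats.

    ASSUMPTION: each repeat is `decay` times the previous amplitude.
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return [primary_amplitude * decay ** k for k in range(n_secondary + 1)]


# One primary accent followed by two secondary accents in the same rheme.
print(repeated_stroke_amplitudes(1.0, 2))
```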
Expressivity Dimensions • Spatial: amplitude of movement • Temporal: duration of movement • Power: dynamic property of movement • Fluidity: smoothness and continuity of movement • Repetitiveness: tendency to rhythmic repeats • Overall Activation: quantity of movement across modalities
Overall Activation • Threshold filter on atomic behaviors during APML tag matching • Determines the number of nonverbal signals to be executed.
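A threshold filter of this kind can be sketched in a few lines. The per-behavior thresholds, the behavior names, and the function name `select_behaviors` are assumptions made for illustration.

```python
def select_behaviors(candidates, overall_activation):
    """Keep the behaviors whose threshold the activation level reaches.

    candidates: list of (behavior_name, threshold), thresholds in [0, 1].
    ASSUMPTION: each atomic behavior carries its own threshold.
    """
    return [name for name, threshold in candidates
            if overall_activation >= threshold]


candidates = [("beat", 0.2), ("raised_eyebrows", 0.5), ("head_nod", 0.8)]
print(select_behaviors(candidates, 0.6))  # a fairly active agent
print(select_behaviors(candidates, 0.1))  # a nearly passive agent
```

Raising the activation level lets more candidate signals through, so the same APML input yields a livelier agent.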
Spatial Parameter • Amplitude of movement controlled through asymmetric scaling of the reach space that is used to find IK goal positions • Expand or condense the entire space in front of agent
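Asymmetric scaling of the reach space can be sketched as a per-axis scale applied to a goal position before inverse kinematics. Everything numeric here is an assumption: the per-axis factors, the range of the spatial parameter, and the convention that positions are offsets from a reach-space centre.

```python
def scale_reach_point(point, spatial, axis_factors=(1.3, 0.6, 0.25)):
    """Expand (spatial > 0) or condense (spatial < 0) a goal position.

    point: (x, y, z) offset from a hypothetical reach-space centre, in cm.
    spatial: expressivity value, assumed to lie in [-1, 1].
    axis_factors: ASSUMED per-axis sensitivity; unequal values make the
    scaling asymmetric, e.g. wider sideways than forwards.
    """
    return tuple(coord * (1.0 + spatial * factor)
                 for coord, factor in zip(point, axis_factors))


neutral = scale_reach_point((10.0, 10.0, 10.0), 0.0)
expanded = scale_reach_point((10.0, 10.0, 10.0), 1.0)
print(neutral, expanded)
```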
Temporal parameter • Determine the speed of the arm movement of a gesture's meaning-carrying stroke phase • Modify speed of stroke • [Figure: stroke shift / velocity control of a beat gesture; Y position of wrist w.r.t. shoulder [cm] vs. frame #]
Fluidity • Continuity control of TCB interpolation splines and gesture-to-gesture coarticulation • Continuity of arms’ trajectory paths • Control the velocity profiles of an action • [Figure: X position of wrist w.r.t. shoulder [cm] vs. frame #]
Power • Tension and Bias control of TCB splines • Overshoot reduction • Acceleration and deceleration of limbs • Hand shape control for gestures that do not need hand configuration to convey their meaning (beats).
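For reference, the tension, continuity and bias parameters of a TCB (Kochanek-Bartels) spline enter through the tangents at each key pose. The sketch below computes the incoming and outgoing tangents for one coordinate of a key pose using the standard Kochanek-Bartels formulas. How the expressivity values are mapped onto these three parameters is not given in the slides and is not modelled here.

```python
def tcb_tangents(p_prev, p, p_next, tension=0.0, continuity=0.0, bias=0.0):
    """Incoming and outgoing tangents at key value p (one coordinate).

    With tension = continuity = bias = 0 both tangents reduce to the
    Catmull-Rom tangent (p_next - p_prev) / 2.
    """
    t, c, b = tension, continuity, bias
    d_in = p - p_prev    # chord arriving at the key
    d_out = p_next - p   # chord leaving the key
    incoming = ((1 - t) * (1 + b) * (1 - c) / 2) * d_in \
        + ((1 - t) * (1 - b) * (1 + c) / 2) * d_out
    outgoing = ((1 - t) * (1 + b) * (1 + c) / 2) * d_in \
        + ((1 - t) * (1 - b) * (1 - c) / 2) * d_out
    return incoming, outgoing


# Wrist position samples (cm) at three successive key poses.
print(tcb_tangents(0.0, 1.0, 4.0))               # neutral parameters
print(tcb_tangents(0.0, 1.0, 4.0, tension=1.0))  # full tension: zero tangents
```

Nonzero continuity makes the incoming and outgoing tangents differ, which produces a corner at the key pose; tension shortens both tangents and bias shifts the weight between the arriving and leaving chords.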