SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions
International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213

Wolfgang Wahlster
German Research Center for Artificial Intelligence, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany
phone: (+49 681) 302-5252/4162, fax: (+49 681) 302-5341
e-mail: wahlster@dfki.de, WWW: http://www.dfki.de/~wahlster
SmartKom: Merging Various User Interface Paradigms
• Graphical User Interfaces
• Gestural Interaction
• Spoken Dialogue
• Facial Expressions
• Biometrics
These paradigms are merged into one coherent style of Multimodal Interaction.
Symbolic and Subsymbolic Fusion of Multiple Modes
Input channels: facial expression recognition, speech recognition, prosody recognition, gesture recognition, lip reading.
• Symbolic fusion: graph unification, Bayesian networks
• Subsymbolic fusion: neural networks, hidden Markov models
Both feed into reference resolution and disambiguation, producing a modality-free semantic representation.
Outline of the Talk
1. Using all Human Senses for Symbiotic Man-Machine Interaction
2. SmartKom: Multimodal, Multilingual and Multidomain Dialogues
3. Modality Fusion in SmartKom
4. Multimodal Discourse Processing
5. Plan-based Modality Fission in SmartKom
6. Conclusions
SmartKom: A Highly Portable Multimodal Dialogue System
Three instantiations share one multimodal dialogue back-bone and application layer:
• SmartKom-Public: cinema, phone, fax, mail, biometrics
• SmartKom-Mobile: car and pedestrian navigation
• SmartKom-Home/Office: consumer electronics, EPG
SmartKom: Intuitive Multimodal Interaction
Project budget: € 25.5 million, funded by BMBF (Dr. Reuse) and industry
Project duration: 4 years (September 1999 – September 2003)
The SmartKom consortium (main contractor and scientific director: W. Wahlster, DFKI Saarbrücken): MediaInterface Dresden, Berkeley, European Media Lab Heidelberg, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen, and partners in Munich, Stuttgart, Ulm, and Aachen.
SmartKom's SDDP Interaction Metaphor
SDDP = Situated Delegation-oriented Dialogue Paradigm: the user specifies a goal and delegates the task to a personalized interaction agent; user and agent cooperate on problems, with the agent asking questions and presenting results while drawing on web services (Service 1, Service 2, Service 3).
Anthropomorphic interface = dialogue partner.
See: Wahlster et al. 2001, Eurospeech.
Multimodal Input and Output in the SmartKom System
Smartakus: "Where would you like to sit?"
Symbiotic Interaction with a Life-like Character
User: "I'd like to reserve tickets for this performance."
Smartakus: "Where would you like to sit?"
User: "I'd like these two seats."
User input: speech, gesture, and facial expressions.
Smartakus output: speech, gesture, and facial expressions.
Multimodal Input and Output in SmartKom: Fusion and Fission of Multiple Modalities

Modality           | Input by the User | Output by the Presentation Agent
Speech             | +                 | +
Gesture            | +                 | +
Facial Expressions | +                 | +
SmartKom's Data Collection of Multimodal Dialogs
Recording setup: a face-tracking camera with microphone, a bird's-eye camera, a SIVIT camera, a side-view camera, an LCD beamer projecting a webpage onto the screen, a microphone array, and a loudspeaker producing environmental noise, with the user standing in front of the projected display.
Personalized Interaction with WebTVs via SmartKom (DFKI with Sony, Philips, Siemens)
Example: multimodal access to electronic program guides for TV
User: Switch on the TV.
Smartakus: Okay, the TV is on.
User: Which channels are presenting the latest news right now?
Smartakus: CNN and NTV are presenting news.
User: Please record this news channel on a videotape.
Smartakus: Okay, the VCR is now recording the selected program.
Using Facial Expression Recognition for Affective Personalization
Processing ironic or sarcastic comments:
(1) Smartakus: Here you see the CNN program for tonight.
(2) User: That's great. [negative facial expression: read as ironic]
(3) Smartakus: I'll show you the program of another channel for tonight.
(2') User: That's great. [neutral facial expression: read literally]
(3') Smartakus: Which of these features do you want to see?
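A minimal Python sketch of the underlying decision, assuming the facial expression recognizer delivers a negativity score in [0, 1]; the threshold, function name, and labels are illustrative assumptions, not SmartKom's actual values:

def interpret_feedback(utterance, negativity):
    """negativity: recognizer output in [0, 1], 1 = clearly negative face."""
    if utterance == "That's great." and negativity > 0.5:
        return "ironic"   # take path (3): offer another channel instead
    return "literal"      # take path (3'): proceed with this program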
Recognizing Affect: A Negative Facial Expression of the User
[Images: the user's neutral vs. negative facial expression]
The SmartKom Demonstrator System
Multimodal control of a TV set and a VCR/DVD player, using a camera for gestural input, a camera for facial analysis, and a microphone.
Combination of Speech and Gesture in SmartKom
User: "This one I would like to see. Where is it shown?"
Multimodal Input and Output in SmartKom
Smartakus: "Please show me where you would like to be seated."
Getting Driving and Walking Directions via SmartKom
SmartKom can be used for multimodal navigation dialogues in a car:
User: I want to drive to Heidelberg.
Smartakus: Do you want to take the fastest or the shortest route?
User: The fastest.
Smartakus: Here you see a map with your route from Saarbrücken to Heidelberg.
Getting Driving and Walking Directions via SmartKom
Smartakus: You are now in Heidelberg. Here is a sightseeing map of Heidelberg.
User: I would like to know more about this church!
Smartakus: Here is some information about St. Peter's Church.
User: Could you please give me walking directions to this church?
Smartakus: In this map, I have highlighted your walking route.
SmartKom: Multimodal Dialogues with a Hybrid Navigation System
Salient Characteristics of SmartKom
• Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels
• Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input
• Context-sensitive interpretation of dialogue interaction on the basis of dynamic discourse and context models
• Adaptive generation of coordinated, cohesive, and coherent multimodal presentations
• Semi- or fully automatic completion of user-delegated tasks through the integration of information services
• Intuitive personification of the system through a presentation agent
SmartKom's Multimodal Dialogue Back-Bone
Analyzers (speech, gestures, facial expressions) feed the dialogue manager, whose core components are modality fusion, discourse modeling, action planning, and modality fission. The dialogue manager drives the generators (speech, graphics, gestures) and connects to external services. All modules communicate via blackboards, linked by data flow and context dependencies.
Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom
Modality fusion combines:
• a word hypothesis graph with acoustic scores,
• clause and sentence boundaries with prosodic scores,
• scored hypotheses about the user's emotional state, and
• a gesture hypothesis graph with scores of potential reference objects.
Mutual disambiguation reduces the overall uncertainty; the resulting intention hypothesis graph is passed to the intention recognizer, which selects the most likely interpretation.
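As a rough illustration of this score-based selection (not SmartKom's actual algorithm), the following Python sketch combines scored speech and gesture hypotheses, prunes cross-modally incompatible pairs, and ranks the survivors; the weights, the log-linear combination, and the compatibility test are illustrative assumptions:

from itertools import product
import math

def compatible(utterance, referent):
    # Toy cross-modal test: a deictic utterance needs a referent.
    return ("this" not in utterance) or (referent is not None)

def fuse(speech_hyps, gesture_hyps, weights=(0.7, 0.3)):
    """Rank joint hypotheses from scored (hypothesis, score) pairs;
    incompatible pairs are pruned first (mutual disambiguation),
    which is how the overall uncertainty is reduced."""
    joint = []
    for (utt, s_score), (obj, g_score) in product(speech_hyps, gesture_hyps):
        if compatible(utt, obj):
            score = weights[0] * math.log(s_score) + weights[1] * math.log(g_score)
            joint.append((utt, obj, score))
    return max(joint, key=lambda h: h[2])   # most likely interpretation

# e.g. fuse([("record this channel", 0.8), ("record that film", 0.6)],
#           [("NTV", 0.7), (None, 0.3)])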
SmartKom's Computational Mechanisms for Modality Fusion and Fission
M3L, a modality-free semantic representation, is processed by unification, overlay operations, constraint propagation, planning, and ontological inferences.
The Overlay Operation Versus the Unification Operation
Overlay is a nonmonotonic and noncommutative unification-like operation: it inherits (non-conflicting) background information. There are two sources of conflicts:
• conflicting atomic values: overwrite the background (old) value with the covering (new) value;
• type clash: assimilate the background to the type of the covering, then recurse.
cf. J. Alexandersson, T. Becker 2001
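A minimal Python sketch of overlay on feature structures encoded as nested dicts may help; the "type" key and the simplified assimilation step are assumptions of this sketch, not the actual typed feature structures of Alexandersson and Becker:

def overlay(covering, background):
    # Atomic values: on conflict the covering (new) value simply
    # overwrites the background (old) value -- the nonmonotonic step.
    if not isinstance(covering, dict) or not isinstance(background, dict):
        return covering
    # Type clash: assimilate the background to the covering's type.
    # Lacking a real type hierarchy, this toy version drops the old
    # type and keeps the remaining background features.
    if covering.get("type") != background.get("type"):
        background = {k: v for k, v in background.items() if k != "type"}
    merged = dict(background)   # inherit non-conflicting old information
    for feature, new_value in covering.items():
        old_value = background.get(feature)
        merged[feature] = overlay(new_value, old_value) if old_value is not None else new_value
    return merged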
Overlay Operations Using the Discourse Model: Augmentation and Validation
Each intention hypothesis (covering) from the intention hypothesis lattice is compared with a number of previous discourse states (backgrounds): consistent information is filled in, and a score is computed for each hypothesis-background pair via Overlay(covering, background). The best-scoring pair yields the selected augmented hypothesis sequence.
An Example of the Overlay Operation
U: What films are shown on TV tonight? ...
U: I'd rather go to the movies.
Via generalisation and specialisation, overlay carries "films" and "tonight" over from the old TV request into the new cinema intention.
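Run through the toy overlay sketched above (with illustrative placeholder values, not SmartKom's actual M3L structures), the example looks like this:

background = {"type": "TVBroadcast", "genre": "film", "time": "tonight"}
covering   = {"type": "CinemaVisit"}
print(overlay(covering, background))
# -> {'genre': 'film', 'time': 'tonight', 'type': 'CinemaVisit'}
# The new cinema intention inherits "film" and "tonight" from the
# earlier TV request while the domain type is switched.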
SmartKom's Three-Tiered Discourse Model
• Domain layer: domain objects (DomainObject1, DomainObject2, ...)
• Discourse layer: discourse objects (DO1, DO2, DO3, ..., DO12)
• Modality layer: linguistic objects (LO1 ... LO6), gestural objects (GO1), and visual objects (VO1)
Example: System: "This [] is a list of films showing in Heidelberg." ([] marks the accompanying pointing gesture.) User: "Please reserve a ticket for the first one." The phrases "reserve", "ticket", "first", "list", and "heidelberg" are linguistic objects linked through discourse objects to domain objects.
DO = Discourse Object, LO = Linguistic Object, GO = Gestural Object, VO = Visual Object
cf. M. Löckelt et al. 2002, N. Pfleger 2002
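A minimal sketch of the three tiers as Python data structures; the class and field names are assumptions of this sketch, not SmartKom's actual interfaces:

from dataclasses import dataclass

@dataclass
class DomainObject:            # domain layer: the application entity
    entity_key: str            # e.g. "cinema_17a"

@dataclass
class DiscourseObject:         # discourse layer: one DO per mention
    domain: DomainObject

@dataclass
class ModalityObject:          # modality layer: LO, GO, or VO
    kind: str                  # "linguistic" | "gestural" | "visual"
    surface: str               # e.g. the words "the first one"
    refers_to: DiscourseObject

# In the example above, the linguistic object "the first one" and the
# visual object for the first list entry point to the same discourse
# object and hence to the same domain object.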
Smartakus is a Self-Animated Interface Agent
Behaviour classes: presentation, navigation, idle time, system state.
Smartakus uses body language to notify the user that it is waiting for input, that it is listening, that it has trouble understanding the input, or that it is trying hard to find an answer to a question.
Some Complex Behavioural Patterns of the Interaction Agent Smartakus
M3L Representation of the Multimodal Discourse Context
Blackboard with the presentation context of the previous dialogue turn:

<?xml version="1.0"?>
<presentationContent>
  [...]
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey>cinema_17a</entityKey>
      <name>Europa</name>
      <geoCoordinate>
        <x>225</x> <y>230</y>
      </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  [...]
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x>0.5542</x> <y>0.1950</y> </leftTop>
        <rightBottom> <x>0.9892</x> <y>0.7068</y> </rightBottom>
      </boundingShape>
      <contentReference>pid1234</contentReference>
    </map>
  </panelElement>
  [...]
</presentationContent>
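This blackboard is what lets gesture analysis ground pointing input: a gesture in normalized screen coordinates can be matched against the bounding shapes of displayed panel elements, whose contentReference links back to the abstract content (here pid1234, the "Europa" movie theater). A minimal Python sketch; the function and data layout are assumptions mirroring the fragment above:

def resolve_gesture(x, y, panel_elements):
    """Return the contentReference of the panel element whose
    bounding shape contains the gesture point (x, y)."""
    for elem in panel_elements:
        (left, top), (right, bottom) = elem["leftTop"], elem["rightBottom"]
        if left <= x <= right and top <= y <= bottom:
            return elem["contentReference"]
    return None

panels = [{"leftTop": (0.5542, 0.1950),
           "rightBottom": (0.9892, 0.7068),
           "contentReference": "pid1234"}]
print(resolve_gesture(0.7, 0.4, panels))   # -> "pid1234"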
M3L Specification of a Presentation Task

<presentationTask>
  <subTask>
    <presentationGoal>
      <inform> ... </inform>
      <abstractPresentationContent>
        ...
        <result>
          <broadcast id="bc1">
            <channel> <name>EuroSport</name> </channel>
            <beginTime> <time> <at>2000-12-05T14:00:00</at> </time> </beginTime>
            <endTime> <time> <at>2000-12-05T15:00:00</at> </time> </endTime>
            <avMedium>
              <title>Sport News</title>
              <avType>sport</avType>
              ...
            </avMedium>
          </broadcast>
        </result>
      </abstractPresentationContent>
      <interactionMode>leanForward</interactionMode>
      <goalID>APGOAL3000</goalID>
      <source>generatorAction</source>
      <realizationType>GraphicsAndSpeech</realizationType>
    </presentationGoal>
  </subTask>
</presentationTask>
SmartKom's Presentation Planner
The presentation planner generates a presentation plan by applying a set of presentation strategies to the presentation goal, expanding GlobalPresent via Present, AddSmartakus, and DoLayout down to concrete Smartakus actions (EvaluatePersonaNode, PersonaAction, Inform, Speak, SendScreenCommand) and layout generation (TryToPresentTVOverview, ShowTVOverview, GenerateText, SetLayoutData).
cf. J. Müller, P. Poller, V. Tschernomas 2002
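A minimal Python sketch of such plan-based expansion: strategies rewrite a goal into subgoals until only executable presentation actions remain. The strategy names echo the slide, while the rule format and the depth-first decomposition are assumptions of this sketch:

STRATEGIES = {
    "GlobalPresent":          ["EvaluatePersonaNode", "Present", "DoLayout"],
    "Present":                ["AddSmartakus", "TryToPresentTVOverview"],
    "TryToPresentTVOverview": ["ShowTVOverview"],
    "ShowTVOverview":         ["GenerateText", "SetLayoutData"],
}

def plan(goal):
    """Depth-first expansion of a presentation goal into a
    sequence of primitive presentation actions."""
    if goal not in STRATEGIES:          # primitive action: emit it
        return [goal]
    actions = []
    for subgoal in STRATEGIES[goal]:
        actions.extend(plan(subgoal))
    return actions

print(plan("GlobalPresent"))
# -> ['EvaluatePersonaNode', 'AddSmartakus', 'GenerateText',
#     'SetLayoutData', 'DoLayout']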
SmartKom's Use of Semantic Web Technology
Three layers of annotations for personalized presentation:

Layer | Abstraction | Role
M3L   | high        | Content
XML   | medium      | Structure
HTML  | low         | Layout

cf.: Dieter Fensel, James Hendler, Henry Lieberman, Wolfgang Wahlster (eds.): Spinning the Semantic Web, MIT Press, November 2002
Conclusions
• Various types of unification, overlay, constraint processing, planning, and ontological inferences are the fundamental processes involved in SmartKom's modality fusion and fission components.
• The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results, based on a three-tiered representation of multimodal discourse.
• We have shown that a multimodal dialogue system must not only understand and represent the user's input, but also its own multimodal output.
First International Conference on Perceptive & Multimodal User Interfaces (PMUI'03)
November 5-7, 2003, Delta Pinnacle Hotel, Vancouver, B.C., Canada
Conference Chair: Sharon Oviatt, Oregon Health & Science Univ., USA
Program Chairs: Wolfgang Wahlster, DFKI, Germany; Mark Maybury, MITRE, USA
PMUI'03 is sponsored by ACM and will be co-located in Vancouver with ACM's UIST'03. This meeting follows three successful Perceptive User Interface Workshops (with PUI'01 held in Florida) and three International Multimodal Interface Conferences initiated in Asia (with ICMI'02 held in Pittsburgh).