SmartKom is a flexible and adaptive multimodal dialogue shell that enables symmetric multimodality, combining graphical user interfaces, gestural interaction, spoken dialogue, facial expressions, and biometrics.
SmartKom: Dialog-based Human Computer Interaction by Coordinated Analysis and Generation of Multiple Modalities BMBF Status Conference "Human Computer Interaction" 2003 June 3, Berlin Symmetric Multimodality in an Adaptive and Reusable Dialogue Shell Wolfgang Wahlster German Research Center for Artificial Intelligence DFKI GmbH Stuhlsatzenhausweg 3 66123 Saarbruecken, Germany phone: (+49 681) 302-5252/4162 fax: (+49 681) 302-5341 e-mail: wahlster@dfki.de WWW:http://www.dfki.de/~wahlster
SmartKom: Merging Various User Interface Paradigms Graphical User interfaces Gestural Interaction Spoken Dialogue Facial Expressions Biometrics Multimodal Interaction
The SmartKom Consortium Project duration: September 1999 – September 2003 Final presentation focusing on the mobile version: 5th September, Stuttgart Main Contractor DFKI Saarbrücken MediaInterface Saarbrücken Berkeley Dresden European Media Lab Univ. of Munich Univ. of Stuttgart Heidelberg Univ. of Erlangen Munich Stuttgart Ulm Aachen
SmartKom's Major Scientific Goals • Explore and design new symbolic and statistical methods for the seamless fusion and mutual disambiguation of multimodal input on semantic and pragmatic levels. • Generalize advanced discourse models for spoken dialogue systems so that they can capture a broad spectrum of multimodal discourse phenomena. • Explore and design new constraint-based and plan-based methods for multimodal fission and adaptive presentation layout. • Integrate all these multimodal capabilities in a reusable, efficient and robust dialogue shell that guarantees flexible configuration, domain independence and plug-and-play functionality.
Outline of the Talk 1. Towards Symmetric Multimodality 2. SmartKom: A Flexible and Adaptive Multimodal Dialogue Shell 3. Perception and Action under Multimodal Conditions 4. Multimodal Fusion and Fission in SmartKom 5. Ontological Inferences and the Three-Tiered Discourse Model of SmartKom 6. The Economic and Scientific Impact of SmartKom 7. Conclusions
SmartKom Provides Full Symmetric Multimodality Input (USER): Speech, Gestures, Facial Expressions, combined by Multimodal Fusion. Output (SYSTEM): Speech, Gestures, Facial Expressions, produced by Multimodal Fission. Symmetric multimodality means that all input modes (speech, gesture, facial expression) are also available for output, and vice versa. The modality fission component provides the inverse functionality of the modality fusion component. Challenge: A dialogue system with symmetric multimodality must not only understand and represent the user's multimodal input, but also its own multimodal output.
SmartKom Covers the Full Spectrum of Multimodal Discourse Phenomena mutual disambiguation of modalities multimodal deixis resolution and generation crossmodal reference resolution and generation multimodal anaphora resolution and generation multimodal ellipsis resolution and generation multimodal turn-taking and backchannelling Multimodal Discourse Phenomena Symmetric multimodality is a prerequisite for a principled study of these discourse phenomena.
SmartKom's Multimodal Input and Output Devices Infrared Camera for Gestural Input, Tilting CCD Camera for Scanning, Video Projector Multimodal Control of TV-Set Microphone Multimodal Control of VCR/DVD Player Camera for Facial Analysis Projection Surface 3 dual Xeon 2.8 GHz processors with 1.5 GB main memory Speakers for Speech Output
SmartKom: A Flexible and Adaptive Shell for Multimodal Dialogues SmartKom-Mobile: Mobile Travel Companion that helps with navigation SmartKom-Public: Communication Companion that helps with phone, fax, email, and authentication SmartKom-Home/Office: Infotainment Companion that helps select media content Application Layer MM Dialogue Back-Bone Public: Cinema, Phone, Fax, Mail, Biometrics Mobile: Car and Pedestrian Navigation Home: Consumer Electronics, EPG
SmartKom's SDDP Interaction Metaphor SDDP = Situated Delegation-oriented Dialogue Paradigm User: specifies goal, delegates task Personalized Interaction Agent: cooperates on problems, asks questions, presents results Webservices: Service 1, Service 2, Service 3 Anthropomorphic Interface = Dialogue Partner See: Wahlster et al. 2001, Eurospeech
SmartKom's Language Model and Lexicon are Augmented on the Fly with Named Entities SmartKom's Basic Vocabulary: 5500 Words TV Info - names of TV features - actor names, e.g. TV program of one day > 200 new words Cinema Info - movie titles - actor names, e.g. all cinemas in one city > 200 new words Geographic Info - street names - names of points-of-interest, e.g. one city > 500 new names After a short dialogue sequence the lexicon includes > 10 000 words.
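The on-the-fly augmentation described on this slide can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `DynamicLexicon`, its method names, and the whitespace tokenization are hypothetical and not taken from SmartKom; the real system also updates the recognizer's language model, which is not modeled here.

```python
class DynamicLexicon:
    """Hypothetical lexicon: a fixed base vocabulary plus named entities
    added at run time from an information source (TV, cinema, geography)."""

    def __init__(self, base_words):
        self.base = set(w.lower() for w in base_words)
        self.dynamic = {}  # new word -> name of the source that contributed it

    def augment(self, source, names):
        """Add every unknown token of the given names; return how many were new."""
        added = 0
        for name in names:
            for token in name.lower().split():
                if token not in self.base and token not in self.dynamic:
                    self.dynamic[token] = source
                    added += 1
        return added

    def __len__(self):
        return len(self.base) + len(self.dynamic)


lexicon = DynamicLexicon(["show", "the", "movie"])
new_words = lexicon.augment("cinema", ["Enemy of the State"])
```

Here `new_words` counts only tokens that were not already known, so the word "the" from the base vocabulary is not added a second time.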
The German Federal President E-mailing a Scanned Image with SmartKom’s Help Now you can remove the document.
Interactive Biometric Authentication by Hand Contour Recognition Please place your hand with spread fingers on the marked area.
SmartKom bridges the full loop from multimodal perception to physical action: My name is Norbert Reithinger. I have found the record of Norbert Reithinger. I require authentication from you. The document was successfully scanned. I require a signature authentication for Norbert Reithinger. Please sign in the write-in field. The authentication was successful. The document has now been sent. I would like to send a document to Wolfgang Wahlster. Please place the document on the marked area. I have found the record for Wolfgang Wahlster. Please remove it now. Scanning a Document and Sending the Captured Image as an Email Attachment
Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom Clause and Sentence Boundaries with Prosodic Scores Scored Hypotheses about the User‘s Emotional State Gesture Hypothesis Graph with Scores of Potential Reference Objects Word Hypothesis Graph with Acoustic Scores Modality Fusion Mutual Disambiguation Reduction of Uncertainty Intention Hypotheses Graph Intention Recognizer Selection of Most Likely Interpretation
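Mutual disambiguation can be illustrated with a toy sketch. It is not SmartKom's fusion algorithm: the hypothesis classes, the type-compatibility check standing in for unification, and the product of scores as the combined score are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SpeechHyp:
    text: str
    expected_type: str  # type of object the deictic expression must refer to
    score: float        # acoustic score


@dataclass(frozen=True)
class GestureHyp:
    obj: str
    obj_type: str
    score: float        # score of the potential reference object


def fuse(speech_hyps, gesture_hyps):
    """Pair every speech hypothesis with every gesture hypothesis, keep only
    type-compatible pairs, and rank them by the product of their scores
    (the product is an assumed combination rule)."""
    fused = []
    for s, g in product(speech_hyps, gesture_hyps):
        if s.expected_type == g.obj_type:
            fused.append((s.score * g.score, s.text, g.obj))
    return sorted(fused, reverse=True)


speech = [SpeechHyp("take this", "seat", 0.60),
          SpeechHyp("tape this", "broadcast", 0.55)]
gestures = [GestureHyp("Enemy of the State", "broadcast", 0.90),
            GestureHyp("channel logo", "channel", 0.40)]
ranked = fuse(speech, gestures)
```

In this example the acoustically weaker reading "tape this" wins, because it is the only one compatible with an object the user could have pointed at.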
M3L Representation of an Intention Lattice Fragment
User input: I would like to know more about this
<intentionLattice>
  […]
  <hypothesisSequences>
    <hypothesisSequence>
      <score> <source> acoustic </source> <value> 0.96448 </value> </score>
      <score> <source> gesture </source> <value> 0.99791 </value> </score>
      <score> <source> understanding </source> <value> 0.91667 </value> </score>
      <hypothesis>
        <discourseStatus>
          <discourseAction> set </discourseAction>
          <discourseTopic><goal> epg_info </goal></discourseTopic>
        […]
        <event id="dim868">
          <informationSearch id="dim869">
            <pieceOfInformation>
              <broadcast id="dim863">
                <avMedium>
                  <avMedium id="dim866">
                    <avType> featureFilm </avType>
                    <title> Enemy of the State </title>
                    […]
    </hypothesisSequence>
    […]
  </hypothesisSequences>
</intentionLattice>
Slide callouts, in order: Confidence in the Speech Recognition Result (acoustic score), Confidence in the Gesture Recognition Result (gesture score), Confidence in the Speech Understanding Result (understanding score), Planning Act, Object Reference
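To show how a consumer module might read the per-source scores out of such a lattice, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML string is a cut-down, well-formed stand-in for the slide's fragment: the elided parts are left out and the closing tags are assumed. The function name `read_scores` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed stand-in for the fragment on the slide
# (elided parts omitted, closing tags assumed).
M3L_FRAGMENT = """
<intentionLattice>
  <hypothesisSequences>
    <hypothesisSequence>
      <score><source> acoustic </source><value> 0.96448 </value></score>
      <score><source> gesture </source><value> 0.99791 </value></score>
      <score><source> understanding </source><value> 0.91667 </value></score>
      <hypothesis>
        <discourseStatus>
          <discourseAction> set </discourseAction>
          <discourseTopic><goal> epg_info </goal></discourseTopic>
        </discourseStatus>
      </hypothesis>
    </hypothesisSequence>
  </hypothesisSequences>
</intentionLattice>
"""


def read_scores(xml_text):
    """Return one (goal, {source: score}) pair per hypothesis sequence."""
    root = ET.fromstring(xml_text)
    result = []
    for seq in root.iter("hypothesisSequence"):
        scores = {
            s.findtext("source").strip(): float(s.findtext("value"))
            for s in seq.findall("score")
        }
        goal = seq.findtext(".//goal", default="").strip()
        result.append((goal, scores))
    return result


sequences = read_scores(M3L_FRAGMENT)
```

Each entry of `sequences` pairs the discourse goal with the three confidence values, which a later stage could combine to select the most likely interpretation.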
Fusing Symbolic and Statistical Information in SmartKom Early Fusion on the Signal Processing Level Multiple Recognizers for a Single Modality Speech Signal Microphone Face Camera Facial Expressions Speech Recognition Emotional Prosody Boundary Prosody Emotional Prosody Affective User State time-stamped and scored hypotheses - anger - joy
SmartKom‘s Computational Mechanisms for Modality Fusion and Fission Modality Fission Modality Fusion Planning Unification Overlay Operations Constraint Propagation M3L: Modality-Free Semantic Representation Ontological Inferences
The Markup Language Layer Model of SmartKom M3L: MultiModal Markup Language; OIL: Ontology Inference Layer; XMLS: XML Schema and RDFS: RDF Schema; XML: eXtensible Markup Language and RDF: Resource Description Framework; HTML: Hypertext Markup Language
Mapping Digital Content Onto a Variety of Structures and Layouts Personalization M3L Content XML2 XMLn XML1 Structure Layout HTML1m HTML21 HTML2o HTML31 HTML3p HTML11 From the “one-size fits-all“ approach of static presentations to the “perfect personal fit“ approach of adaptive multimodal presentations
The Role of the Semantic Web Language M3L • M3L (Multimodal Markup Language) defines the data exchange formats used for communication between all modules of SmartKom. • M3L is partitioned into 40 XML schema definitions covering SmartKom's discourse domains. • The XML schema event.xsd captures the semantic representation of concepts and processes in SmartKom's multimodal dialogs.
OIL2XSD: Using XSLT Stylesheets to Convert an OIL Ontology to an XML Schema
Using Ontologies to Extract Information from the Web Film.de-Movie MyOnto-Movie :o-title :title :description :title :description :director Kinopolis.de-Movie :actors :critics :name MyOnto-Person :main actor :name :birthday Mapping of Metadata
M3L as a Meaning Representation Language for the User's Input I would like to send an email to Dr. Reuse <domainObject> <sendTelecommunicationProcess> <sender>....................</sender> <receiver>..............</receiver> <document>..........</document> <email>...........</email> </sendTelecommunicationProcess> </domainObject>
Exploiting Ontological Knowledge to Understand and Answer the User's Queries Which movies with Schwarzenegger are shown on the Pro7 channel? <domainObject> <epg> <broadcastDefault> <avMedium> <actors> <actor><name>Schwarzenegger</name></actor> </actors> </avMedium> <channel><name>Pro7</name></channel> </broadcastDefault> </epg> </domainObject> <beginTime> <time> <function> <at> 2002-05-10T10:25:46 </at> </function> </time> </beginTime>
SmartKom’s Multimodal Dialogue Back-Bone Communication Blackboards Data Flow Context Dependencies Analyzers • Speech • Gestures • Facial Expressions • Speech • Graphics • Gestures Generators Dialogue Manager Modality Fusion Discourse Modeling Action Planning Modality Fission External Services
A Fragment of a Presentation Goal, as specified in M3L <presentationTask> <presentationGoal> <inform> <informFocus> <RealizationType>list </RealizationType> </informFocus> </inform> <abstractPresentationContent> <discourseTopic> <goal>epg_browse</goal> </discourseTopic> <informationSearch id="dim24"><tvProgram id="dim23"> <broadcast><timeDeictic id="dim16">now</timeDeictic> <between>2003-03-20T19:42:32 2003-03-20T22:00:00</between> <channel><channel id="dim13"/> </channel> </broadcast></tvProgram> </informationSearch> <result> <event> <pieceOfInformation> <tvProgram id="ap_3"> <broadcast> <beginTime>2003-03-20T19:50:00</beginTime> <endTime>2003-03-20T19:55:00</endTime> <avMedium> <title>Today’s Stock News</title></avMedium> <channel>ARD</channel> </broadcast>……..</event> </result> </presentationGoal> </presentationTask>
A Dynamically Generated Multimodal Presentation based on a Presentation Goal Here is a listing of tonight's TV broadcasts. Today's Stock News Yes, Dear Evening News Down to Earth The King of Queens Everybody Loves Raymond Crossing Jordan Still Standing Bonanza Mr. Personality Passions Weather Forecast Today
An Excerpt from SmartKom’s Three-Tiered Multimodal Discourse Model OO2 Broadcast of „The King of Queens“ on 20/3/2003 OO1 TV broadcasts on 20/3/2003 Domain Layer DO4 DO5 DO3 DO2 DO1 DO11 Discourse Layer DO12 DO13 Modality Layer GO1 here (pointing) VO1 <TV listing> LO1 listing LO2 tonight LO3 TV broadcast LO4 tape LO5 third one
Overlay Operations Using the Discourse Model Augmentation and Validation compare with a number of previous discourse states: fill in consistent information compute a score for each hypothesis - background pair: Overlay (covering, background) Intention Hypothesis Lattice Covering: Background: Selected Augmented Hypothesis Sequence
The Overlay Operation Versus the Unification Operation Overlay is a nonmonotonic and noncommutative unification-like operation. It inherits (non-conflicting) background information. There are two sources of conflicts: (1) conflicting atomic values: overwrite the background (old) with the covering (new); (2) type clash: assimilate the background to the type of the covering, then recurse. cf. J. Alexandersson, T. Becker 2001
Example for Overlay User: "What films are on TV tonight?" System: [presents list of films] User: "That‘s a boring program, I‘d rather go to the movies." How do we inherit “tonight” ?
Overlay Simulation Films on TV tonight Assimilation Background Go to the movies Covering
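The "tonight" example can be traced with a small sketch of overlay on typed feature structures. This is a simplified illustration, not the operation as defined by Alexandersson and Becker: the type hierarchy, the feature names, and the representation as Python dictionaries are hypothetical, and assimilation is modeled by restricting the background to the features of the most specific common supertype.

```python
# Hypothetical type hierarchy (child -> parent) and the features each type introduces.
PARENT = {"tv_program": "entertainment", "cinema": "entertainment", "entertainment": None}
FEATURES = {"entertainment": {"time"}, "tv_program": {"channel"}, "cinema": {"cinema_name"}}


def ancestors(t):
    """The type itself followed by all of its supertypes."""
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain


def common_supertype(a, b):
    """Most specific type that subsumes both a and b."""
    supertypes_of_b = set(ancestors(b))
    for t in ancestors(a):
        if t in supertypes_of_b:
            return t
    raise ValueError(f"no common supertype for {a} and {b}")


def allowed_features(t):
    feats = set()
    for a in ancestors(t):
        feats |= FEATURES.get(a, set())
    return feats


def overlay(covering, background):
    """Covering (new) wins on conflicts; non-conflicting background (old) is inherited."""
    # Type clash: assimilate the background by keeping only the features
    # that are defined for the common supertype of both structures.
    keep = allowed_features(common_supertype(covering["type"], background["type"]))
    result = {f: v for f, v in background.items() if f in keep}
    for f, v in covering.items():
        if isinstance(v, dict) and isinstance(result.get(f), dict):
            result[f] = overlay(v, result[f])  # recurse into embedded structures
        else:
            result[f] = v                      # conflicting atomic value: covering wins
    return result


background = {"type": "tv_program", "time": "tonight", "channel": "any"}
covering = {"type": "cinema"}
merged = overlay(covering, background)
```

With these assumptions, `merged` keeps "tonight" because `time` is defined on the shared supertype, while `channel` is dropped because it has no meaning for a cinema visit. Swapping the arguments gives a different result, which illustrates that the operation is not commutative.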
Overlay - Scoring • Four fundamental scoring parameters: • Number of features from Covering (co) • Number of features from Background (bg) • Number of type clashes (tc) • Number of conflicting atomic values (cv) • Codomain: [-1, 1] • A higher score indicates a better fit (a score of 1 means that overlay(c,b) coincides with unify(c,b))
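The scoring formula itself is not preserved in this transcript. One formula that uses exactly the four parameters and has the stated codomain [-1, 1] is (co + bg - (tc + cv)) / (co + bg + tc + cv); the sketch below assumes it and should be read as an illustration, not as the slide's definition. Returning 0.0 for an empty comparison is a convention of this sketch only.

```python
def overlay_score(co, bg, tc, cv):
    """Assumed score: inherited and covering features count in favour,
    type clashes and conflicting atomic values count against."""
    total = co + bg + tc + cv
    if total == 0:
        return 0.0  # nothing to compare (convention of this sketch)
    return (co + bg - (tc + cv)) / total


perfect = overlay_score(co=3, bg=2, tc=0, cv=0)   # no conflicts at all
worst = overlay_score(co=0, bg=0, tc=2, cv=2)     # only conflicts
mixed = overlay_score(co=2, bg=1, tc=1, cv=0)
```

Under this assumption a score of 1 arises only when there are no type clashes and no conflicting values, which is the case in which overlay and unification give the same result.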
SmartKom's Presentation Planner The Presentation Planner generates a Presentation Plan by applying a set of Presentation Strategies to the Presentation Goal. GlobalPresent Present AddSmartakus .... DoLayout EvaluatePersonaNode ... PersonaAction ... Inform ... Speak SendScreenCommand Smartakus Actions TryToPresentTVOverview ShowTVOverview ShowTVOverview SetLayoutData ... SetLayoutData Generation of Layout ShowTVOverview GenerateText SetLayoutData ... SetLayoutData cf. J. Müller, P. Poller, V. Tschernomas 2002
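Plan-based presentation generation of this kind can be sketched as top-down expansion of a goal through decomposition strategies. The node names below come from the slide, but the tree shape is not recoverable from the transcript, so the particular decomposition in `STRATEGIES` is a hypothetical arrangement chosen for illustration.

```python
# Hypothetical decomposition: each strategy maps a node to its ordered sub-nodes.
# Nodes without an entry are treated as primitive actions.
STRATEGIES = {
    "GlobalPresent": ["Present", "DoLayout", "EvaluatePersonaNode"],
    "Present": ["AddSmartakus", "TryToPresentTVOverview"],
    "TryToPresentTVOverview": ["ShowTVOverview"],
    "ShowTVOverview": ["GenerateText", "SetLayoutData"],
    "EvaluatePersonaNode": ["PersonaAction"],
    "PersonaAction": ["Inform"],
    "Inform": ["Speak", "SendScreenCommand"],
}


def expand(goal):
    """Depth-first expansion of a goal into an ordered list of primitive actions."""
    if goal not in STRATEGIES:
        return [goal]
    plan = []
    for subgoal in STRATEGIES[goal]:
        plan.extend(expand(subgoal))
    return plan


plan = expand("GlobalPresent")
```

A real planner would also check applicability conditions and backtrack to an alternative strategy when one fails; this sketch only shows the decomposition step.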
Adaptive Layout and Plan-Based Animation in SmartKom's Multimodal Presentation Generator
Salient Characteristics of SmartKom • Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels • Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input • Context-sensitive interpretation of dialog interaction on the basis of dynamic discourse and context models • Adaptive generation of coordinated, cohesive and coherent multimodal presentations • Semi- or fully automatic completion of user-delegated tasks through the integration of information services • Intuitive personification of the system through a presentation agent
The Economic and Scientific Impact of SmartKom
Economic Impact: 51 patents + 29 spin-off products • 13 speech recognition • 10 dialogue management • 6 biometrics • 3 video-based interaction • 2 multimodal interfaces • 2 emotion recognition
Scientific Impact: • 246 publications • 117 keynotes / invited talks • 66 masters and doctoral theses • 27 new projects use results • 5 tenured professors • 10 TV features • 81 press articles
An Example of Technology Transfer: The Virtual Mouse The virtual mouse has been installed in a cell phone with a camera. When the user holds a normal pen about 30 cm in front of the camera, the system recognizes the tip of the pen as a mouse pointer. A red point then appears at the tip on the display.
Former Employees of DFKI and Researchers from the SmartKom Consortium have Founded Five Start-up Companies Eyeled (www.eyeled.com) Location-aware mobile information systems CoolMuseum GmbH (www.coolmuseum.de) Mineway GmbH (www.mineway.de) Multimodal systems for music retrieval Sonicson GmbH (www.sonicson.com) Agent-based middleware Agent-based middleware Quadox AG (www.quadox.com)
SmartKom’s Impact on International Standardization SmartKom‘s Multimodal Markup Language M3L ISO W3C Standard for Multimodal Content Representation Scheme ISO, TC37, SC4 Standard for Natural Markup Language w3.org/TR/nl-spec
SmartKom‘s Impact on Software Tools and Resources for Research on Multimodality 1.6 Terabytes - audio transcripts - gesture and emotion labeling Software Framework MULTIPLATFORM 448 WOZ Sessions ... Germany BAS 15 Sites all over Europe ELRA Europe COMIC, EU, FP5 Conversational Multimodal Interaction with Computers World LDC
Conclusions • Various types of unification, overlay, constraint processing, planning and ontological inferences are the fundamental processes involved in SmartKom's modality fusion and fission components. • The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results based on a three-tiered representation of multimodal discourse. • We have shown that a multimodal dialogue system must not only understand and represent the user's input, but also its own multimodal output.