Multi-Modal Dialogue in Personal Navigation Systems Arthur Chan
Introduction • The term “multi-modal” • A general description of an application that can be operated in multiple input/output modes. • E.g., • Input: voice, pen, gesture, facial expression. • Output: voice, graphical output
Multi-modal Dialogue (MMD) in Personal Navigation Systems • Motivation of this presentation • Navigation systems provide MMD with • an interesting scenario • a case for why MMD is useful • Structure of this presentation • 3 system papers • AT&T MATCH • speech and pen input with pen gesture • Speechworks Walking Directions System • speech and stylus input • Univ. of Saarland REAL • speech and pen input • Both GPS and a magnetic tracker were used.
Multi-modal Language Processing for Mobile Information Access
Overall Function • A working city guide and navigation system • Easy access to restaurant and subway information • Runs on a Fujitsu pen computer • Users are free to • give speech commands • draw on the display with a stylus
Types of Inputs • Speech Input • “show cheap italian restaurants in chelsea” • Simultaneous Speech and Pen Input • Circle an area • Say “show cheap italian restaurants in this neighborhood” at the same time. • Functionalities include • Reviews • Subway routes
Input Overview • Speech Input • Uses the AT&T Watson speech recognition engine • Pen Input (electronic ink) • Allows the use of pen gestures. • Pen input can be complex • Special aggregation techniques are used for these gestures. • Inputs are combined using lattice combination.
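As a rough illustration of combining speech and pen input, the sketch below merges two scored hypothesis lists and keeps only the joint readings whose slots agree. It is a simplified stand-in that works on flat n-best lists rather than lattices, and it is not the MATCH implementation; the function `fuse`, the slot names, and all scores are hypothetical.

```python
from itertools import product

def fuse(speech_nbest, gesture_nbest):
    """Combine scored speech and gesture hypotheses into ranked joint readings.

    Each hypothesis is a (meaning_dict, log_score) pair. Two hypotheses are
    compatible when they agree on every slot that both of them fill.
    """
    joint = []
    for (s_sem, s_score), (g_sem, g_score) in product(speech_nbest, gesture_nbest):
        shared = set(s_sem) & set(g_sem)
        if any(s_sem[k] != g_sem[k] for k in shared):
            continue  # e.g. speech expects an area but the gesture is a point
        joint.append(({**s_sem, **g_sem}, s_score + g_score))
    return sorted(joint, key=lambda h: h[1], reverse=True)

# Hypothetical recognizer outputs for "show cheap italian restaurants in this
# neighborhood" spoken while circling part of the map.
speech = [
    ({"cmd": "show", "cuisine": "italian", "price": "cheap", "loc_kind": "area"}, -1.0),
    ({"cmd": "show", "cuisine": "italian", "price": "cheap", "loc_kind": "point"}, -2.5),
]
gesture = [
    ({"loc_kind": "area", "region": "circled_region_1"}, -0.5),
    ({"loc_kind": "point", "region": "tap_1"}, -1.5),
]

best_meaning, best_score = fuse(speech, gesture)[0]
```

Summing log scores treats the two recognizers as independent, which is an assumption made for brevity here.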
Pen Gesture and Speech Input • For example: • U: “How do I get to this place?” • <user circles one of the restaurants displayed on the map> • S: “Where do you want to go from?” • U: “25th St & 3rd Avenue” • <user writes 25th St & 3rd Avenue> • <system computes the shortest route>
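The slides do not say how the shortest route in this exchange is computed. One standard option is Dijkstra's algorithm over a street graph, sketched below under that assumption; the graph, the node names, and the edge costs are made up for illustration.

```python
import heapq

def shortest_route(graph, start, goal):
    """Return (total_cost, path) for the cheapest path, or (inf, []) if unreachable."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical street graph: nodes are intersections, weights are walking minutes.
graph = {
    "25th St & 3rd Ave": {"23rd St & 3rd Ave": 2, "25th St & Park Ave": 3},
    "23rd St & 3rd Ave": {"23rd St & Park Ave": 3},
    "25th St & Park Ave": {"23rd St & Park Ave": 1, "restaurant": 6},
    "23rd St & Park Ave": {"restaurant": 2},
}

cost, path = shortest_route(graph, "25th St & 3rd Ave", "restaurant")
```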
Summary • Interesting aspects of the system • Illustrates a real-life scenario where multi-modal inputs could be used • Design issue: • How should different inputs be used together? • Algorithmic issue: • How should different inputs be combined?
Overview • Work by Speechworks • Jointly conducted by speech recognition and user interface folks • Two distinct elements • Speech recognition • In an embedded domain, which speech recognition paradigm should be used? • embedded speech recognition? • network speech recognition? • distributed speech recognition? • User interface • How to “situationalize” the application?
Overall Function • Walking Directions Application • Assumes the user is walking in an unknown city • Compaq iPAQ 3765 PocketPC • Users could • Select a city and start/end addresses • Display a map • Control the display • Display directions • Display interactive directions in the form of a list of steps. • Accepts speech input and stylus input • No pen gestures.
Choice of speech recognition paradigm • Embedded speech recognition • Only simple commands could be used due to computation limits. • Network speech recognition • Bandwidth is required • The network connection is sometimes cut off • Distributed speech recognition • Client takes care of the front-end • Server takes care of decoding • Issue: higher code complexity.
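The client/server split in distributed speech recognition can be sketched as follows. This is only an illustration: the log-energy feature stands in for a real front-end, the server-side "decoder" is a placeholder threshold, and the frame size, sampling rate, threshold, and function names are all assumptions.

```python
import math
import struct

FRAME = 160  # samples per frame: 10 ms at an assumed 16 kHz sampling rate

def client_front_end(samples):
    """Client side: reduce raw audio to one log-energy value per frame."""
    feats = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[i:i + FRAME]
        energy = sum(x * x for x in frame) / FRAME
        feats.append(math.log(energy + 1e-10))
    # Only the compact feature stream is sent, not the raw audio.
    return struct.pack(f"<{len(feats)}f", *feats)

def server_decode(payload):
    """Server side: unpack the features and run a placeholder decoder."""
    n = len(payload) // 4  # 4 bytes per float32 feature
    feats = struct.unpack(f"<{n}f", payload)
    # A real server would run the recognizer here; this only labels frames.
    return ["speech" if f > -5.0 else "silence" for f in feats]

# One silent frame followed by two frames of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(2 * FRAME)]
samples = [0.0] * FRAME + tone

payload = client_front_end(samples)
labels = server_decode(payload)
```

In this toy setup the client sends 12 bytes for 480 samples, which hints at why the split reduces bandwidth at the price of more code on both sides.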
User Interface • Situationalization • Potential scenarios • Sitting at a desk • Getting out of a cab, building, or subway and preparing to walk somewhere • Walking somewhere with hands free • Walking somewhere carrying things • Driving somewhere in heavy traffic • Driving somewhere in light traffic • Being the passenger in a car • Being in a highly noisy environment.
Their conclusion • The balance of audio and visual information • Could be reduced to 4 complementary components • Single-modal • 1. Visual mode • 2. Audio mode • Multi-modal • 3. Visual dominant • 4. Audio dominant
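To make the four components concrete, here is a minimal sketch of how an application might pick one of them from the user's situation. It assumes the two multi-modal components are visual-dominant and audio-dominant. The slides give no mapping from scenarios to modes, so the two availability scores, the rule, and the example values are all hypothetical.

```python
from enum import Enum

class Mode(Enum):
    VISUAL = "visual only"
    AUDIO = "audio only"
    VISUAL_DOMINANT = "visual dominant"
    AUDIO_DOMINANT = "audio dominant"

def choose_mode(visual: float, audio: float) -> Mode:
    """Pick an output mode from two availability scores in [0, 1].

    visual: how much the user can look at the screen.
    audio: how usable the audio channel is.
    """
    if audio <= 0:
        return Mode.VISUAL  # also the fallback when neither channel is available
    if visual <= 0:
        return Mode.AUDIO
    return Mode.VISUAL_DOMINANT if visual >= audio else Mode.AUDIO_DOMINANT

# Hypothetical scores for some of the scenarios listed earlier.
scenarios = {
    "sitting at a desk": (1.0, 1.0),
    "walking somewhere carrying things": (0.3, 0.8),
    "driving somewhere in heavy traffic": (0.0, 1.0),
    "being in a highly noisy environment": (1.0, 0.0),
}
modes = {name: choose_mode(v, a) for name, (v, a) in scenarios.items()}
```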
Summary • Interesting aspects • Great discussion on • how speech recognition could be used in an embedded domain • how the user would use the dialogue application
Overview • Pedestrian Navigation System • Two components: • IRREAL: indoor navigation system • Uses a magnetic tracker • ARREAL: outdoor navigation system • Uses GPS
Speech Input/Output • Speech Input: • HTK / IBM ViaVoice Embedded and Logox were being evaluated • Speech Output: • Festival
Visual output • Both 2D and 3D spatialization supported
Interesting aspects • Tailors the system for elderly people • Speaker clustering • to improve the recognition rate for elderly people • Model selection • Chooses between two models based on likelihood • Elderly models • Normal adult models
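Likelihood-based model selection can be illustrated with a toy example: score the incoming features under each candidate model and keep the one with the higher log-likelihood. The sketch uses one-dimensional Gaussians in place of real acoustic models, and the model parameters and feature values are invented for illustration.

```python
import math

def gaussian_loglik(frames, mean, var):
    """Log-likelihood of scalar feature frames under a single Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def select_model(frames, models):
    """Return the name of the best-scoring model and all scores."""
    scores = {
        name: gaussian_loglik(frames, m["mean"], m["var"])
        for name, m in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical one-dimensional stand-ins for the two acoustic model sets.
models = {
    "elderly": {"mean": 1.0, "var": 1.0},
    "adult": {"mean": -1.0, "var": 1.0},
}
frames = [0.8, 1.3, 0.4]  # made-up feature values from one utterance

chosen, scores = select_model(frames, models)
```

In a real recognizer the scores would come from the acoustic models themselves, but the selection step has the same shape.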
Conclusion • Aspects of multi-modal dialogue • What kinds of input should be used? • How can speech and other inputs be combined and interact? • How would users use the system? • How should the system respond to users?