Multi-Modal Dialogue in Personal Navigation Systems Arthur Chan
Introduction • The term “multi-modal” • A general description of an application that can be operated through multiple input/output modes. • E.g. • Input: voice, pen, gesture, facial expression • Output: voice, graphical output
Multi-modal Dialogue (MMD) in Personal Navigation Systems • Motivation of this presentation • Navigation systems provide MMD • an interesting scenario • a case for why MMD is useful • Structure of this presentation • 3 system papers • AT&T MATCH • speech and pen input with pen gestures • Speechworks Walking Direction System • speech and stylus input • Univ. of Saarland REAL • speech and pen input • both GPS and a magnetic tracker are used
Multi-modal Language Processing for Mobile Information Access
Overall Function • A working city guide and navigation system • Easy access to restaurant and subway information • Runs on a Fujitsu pen computer • Users are free to • give speech commands • draw on the display with a stylus
Types of Inputs • Speech Input • “show cheap italian restaurants in chelsea” • Simultaneous Speech and Pen Input • Circle an area • Say “show cheap italian restaurants in neighborhood” at the same time • Functionalities include • Reviews • Subway routes
Input Overview • Speech Input • Uses the AT&T Watson speech recognition engine • Pen Input (electronic ink) • Allows the use of pen gestures • Gestures can be complex pen inputs • Special aggregation techniques are used for these gestures • Speech and pen inputs are combined using lattice combination (a sketch of this idea follows below)
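As a rough illustration of combining speech and gesture hypotheses, the following is a minimal sketch only; it is not MATCH's actual finite-state lattice combination, and the semantic-frame fields and scores are invented for the example.

```python
# Minimal sketch of combining speech and gesture hypotheses by score.
# NOT the MATCH finite-state implementation; slot names and scores are
# hypothetical, for illustration only.

from dataclasses import dataclass
from itertools import product

@dataclass
class Hypothesis:
    content: dict      # partial semantic frame
    score: float       # log-probability-like confidence

def combine(speech_hyps, gesture_hyps):
    """Cross-product the two hypothesis lists and keep combinations
    whose semantic frames are compatible (no conflicting slots)."""
    combined = []
    for s, g in product(speech_hyps, gesture_hyps):
        if all(s.content.get(k, v) == v for k, v in g.content.items()):
            frame = {**s.content, **g.content}
            combined.append(Hypothesis(frame, s.score + g.score))
    return sorted(combined, key=lambda h: h.score, reverse=True)

speech = [Hypothesis({"act": "show", "cuisine": "italian", "area": "<deictic>"}, -1.2)]
gesture = [Hypothesis({"area": "<deictic>", "region": "circled_polygon_17"}, -0.4)]
print(combine(speech, gesture)[0].content)
```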
Pen Gesture and Speech Input • For example: • U: “How do I get to this place?” • <user circles one of the restaurants displayed on the map> • S: “Where do you want to go from?” • U: “25th St & 3rd Avenue” • <user writes 25th St & 3rd Avenue> • <system computes the shortest route>
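The dialogue above binds the deictic phrase “this place” to the circled map object. The sketch below shows one way such a turn handler might look; the class and field names are invented and do not come from the MATCH system.

```python
# Hypothetical sketch of resolving a deictic expression ("this place")
# against a pen gesture, then prompting for the missing origin slot.

class DialogueState:
    def __init__(self):
        self.pending = {}            # slots collected for the current request

    def handle_turn(self, utterance, gesture=None):
        if gesture is not None and "this place" in utterance:
            # Bind the deictic phrase to the object the user circled.
            self.pending["destination"] = gesture["selected_object"]
        elif "destination" in self.pending and "origin" not in self.pending:
            # Treat the reply (spoken or handwritten) as the origin address.
            self.pending["origin"] = utterance
        if "origin" not in self.pending:
            return "Where do you want to go from?"
        return f"Routing from {self.pending['origin']} to {self.pending['destination']}."

state = DialogueState()
print(state.handle_turn("How do I get to this place?",
                        gesture={"selected_object": "Restaurant #12"}))
print(state.handle_turn("25th St & 3rd Avenue"))
```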
Summary • Interesting aspects of the system • Illustrates a real-life scenario where multi-modal inputs can be used • Design issue: • how should different inputs be used together? • Algorithmic issue: • how should different inputs be combined?
Overview • Work by Speechworks • Jointly conducted by speech recognition and user interface teams • Two distinct elements • Speech recognition • In an embedded domain, which speech recognition paradigm should be used? • embedded speech recognition? • network speech recognition? • distributed speech recognition? • User interface • How to “situationalize” the application?
Overall Function • Walking Directions Application • Assumes the user is walking in an unknown city • Compaq iPAQ 3765 PocketPC • Users can • Select a city and start/end addresses • Display a map • Control the display • Display directions • Display interactive directions as a list of steps • Accepts speech input and stylus input • No pen gestures
Choice of speech recognition paradigm • Embedded speech recognition • Only simple commands can be used due to computational limits • Network speech recognition • Bandwidth is required • The network connection may sometimes be cut off • Distributed speech recognition • The client handles the front-end • The server handles decoding • Issue: higher code complexity (see the sketch after this list)
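The distributed split can be pictured as below: the handheld client computes compact front-end features and ships only those to a server that runs the decoder. This is a hedged sketch under assumed parameters; extract_features(), client_send(), and server_decode() are placeholders, not a real DSR front-end or recognizer API.

```python
# Sketch of the distributed-recognition split: client-side front-end,
# server-side decoding. All names and values here are illustrative.

import json, socket

FRAME_SHIFT_SAMPLES = 160   # assumed 10 ms shift at 16 kHz
NUM_CEPSTRA = 13            # assumed feature dimension

def extract_features(pcm_samples):
    """Placeholder front-end: would return one NUM_CEPSTRA-dim vector
    (e.g. MFCCs) per frame, computed on the handheld client."""
    return [[0.0] * NUM_CEPSTRA for _ in range(len(pcm_samples) // FRAME_SHIFT_SAMPLES)]

def client_send(features, host="recognizer.example.com", port=5050):
    """Send compact features instead of raw audio, saving bandwidth."""
    payload = json.dumps(features).encode()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
        return sock.recv(4096).decode()   # server returns the decoded text

def server_decode(features):
    """Placeholder for the server-side decoder (acoustic + language model)."""
    return "show directions to the museum"

feats = extract_features([0] * 16000)     # one second of dummy 16 kHz audio
print(len(feats), "frames of", NUM_CEPSTRA, "coefficients")
```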
User Interface • Situationalization • Potential scenarios • Sitting at a desk • Getting out of a cab, building, or subway and preparing to walk somewhere • Walking somewhere hands-free • Walking somewhere carrying things • Driving somewhere in heavy traffic • Driving somewhere in light traffic • Being a passenger in a car • Being in a highly noisy environment
Their conclusion • The balance of audio and visual information • Can be reduced to 4 complementary components • Single-modal • 1. Visual mode • 2. Audio mode • Multi-modal • 3. Visual dominant • 4. Audio dominant
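One way to connect the scenarios on the previous slide to these four components is a simple lookup, as sketched below. The mapping itself is an illustrative guess, not the one reported by the Speechworks authors.

```python
# Hedged sketch: map a usage situation onto one of the four audio/visual
# balance components. The assignments are assumptions for illustration.

MODE_BY_SITUATION = {
    "sitting_at_desk":         "visual",           # quiet, screen available
    "walking_hands_free":      "audio",            # eyes on surroundings
    "walking_carrying_things": "audio_dominant",   # brief glances possible
    "driving_heavy_traffic":   "audio",
    "driving_light_traffic":   "audio_dominant",
    "car_passenger":           "visual_dominant",
    "noisy_environment":       "visual",           # speech I/O unreliable
}

def choose_mode(situation):
    # Fall back to a multimodal, visual-dominant presentation when the
    # situation is unknown.
    return MODE_BY_SITUATION.get(situation, "visual_dominant")

print(choose_mode("walking_hands_free"))   # -> audio
```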
Summary • Interesting aspects • Good discussion of • how speech recognition can be used in an embedded domain • how users would actually use the dialogue application
Overview • Pedestrian Navigation System • Two components: • IRREAL: indoor navigation system • Uses a magnetic tracker • ARREAL: outdoor navigation system • Uses GPS
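A combined indoor/outdoor system needs some rule for choosing which position source to trust at any moment. The sketch below shows one plausible rule; the switching logic and field names are assumptions, not taken from the REAL paper.

```python
# Minimal sketch of switching between the outdoor (GPS) and indoor
# (magnetic tracker) position sources. The rule here is an assumption.

def current_position(gps_fix, tracker_fix):
    """Prefer GPS when it has a usable fix (outdoors); otherwise fall
    back to the indoor magnetic tracker."""
    if gps_fix is not None and gps_fix.get("satellites", 0) >= 4:
        return ("ARREAL/outdoor", gps_fix["lat"], gps_fix["lon"])
    if tracker_fix is not None:
        return ("IRREAL/indoor", tracker_fix["x"], tracker_fix["y"])
    return ("unknown", None, None)

print(current_position({"satellites": 6, "lat": 49.25, "lon": 7.04}, None))
print(current_position(None, {"x": 12.3, "y": 4.7}))
```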
Speech Input/Output • Speech Input: • HTK, IBM ViaVoice Embedded, and Logox were evaluated • Speech Output: • Festival
Visual output • Both 2D and 3D spatialization are supported
Interesting aspects • Tailors the system for elderly users • Speaker clustering • to improve the recognition rate for elderly speakers • Model selection • Chooses between two models based on likelihood (sketched below) • Elderly models • Normal adult models
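The likelihood-based model selection can be pictured as scoring the utterance under both acoustic models and keeping the better one. This is a sketch only; score_with_model() is a placeholder for whatever acoustic score the recognizer exposes, and the dummy values are invented.

```python
# Sketch of likelihood-based selection between elderly and adult
# acoustic models. The scoring function and values are placeholders.

def score_with_model(audio, model):
    """Placeholder: would return the acoustic log-likelihood of the audio
    under the given model (e.g. from the recognizer's best path)."""
    return {"elderly": -1250.0, "adult": -1310.0}[model]   # dummy values

def select_model(audio, models=("elderly", "adult")):
    scored = {m: score_with_model(audio, m) for m in models}
    best = max(scored, key=scored.get)
    return best, scored[best]

print(select_model(audio=None))   # -> ('elderly', -1250.0)
```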
Conclusion • Aspects of multi-modal dialogue • What kinds of inputs should be used? • How can speech and other inputs be combined and interact? • How would users use the system? • How should the system respond to the users?