Multimodal Interfaces for Maps: Enhancing User Experience with Speech and Pointing

Speech & Multimodal
Scott Klemmer · 16 November 2006

Some HCI definitions
• Multimodal generally refers to an interface that accepts input from two or more combined modes
• Multimedia generally refers to an interface that produces output in two or more modes
• The vast majority of multimodal systems have used speech + pointing (pen or mouse) input, with graphical (and sometimes voice) output
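To make the speech + pointing combination concrete, here is a minimal sketch of how a system might pair deictic words ("this", "there") with pointing events by timestamp. The `Word` and `Point` types, the `resolve_deictics` function, the 1.5-second window, and the sample data are all assumptions made for illustration; they are not taken from any system covered in the lecture.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    time: float  # seconds since the start of the utterance (hypothetical unit)


@dataclass
class Point:
    target: str  # map object under the pen or mouse
    time: float


DEICTIC = {"this", "that", "here", "there"}


def resolve_deictics(words, points, max_gap=1.5):
    """Replace each deictic word with the pointed-at target nearest in time.

    max_gap is an assumed tolerance in seconds; a deictic word with no
    pointing event inside that window is left unresolved.
    """
    unused = list(points)
    resolved = []
    for word in words:
        if word.text in DEICTIC and unused:
            nearest = min(unused, key=lambda p: abs(p.time - word.time))
            if abs(nearest.time - word.time) <= max_gap:
                unused.remove(nearest)  # each pointing event is consumed once
                resolved.append(nearest.target)
                continue
        resolved.append(word.text)
    return resolved


# "move this there" spoken while tapping two map objects (made-up data)
words = [Word("move", 0.0), Word("this", 0.4), Word("there", 1.2)]
points = [Point("hotel", 0.5), Point("harbor", 1.3)]
command = resolve_deictics(words, points)
print(command)
```

Matching purely on nearest timestamp is the simplest possible policy; a real system would likely also check that the pointed-at object makes sense for the spoken command.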

Canonical App: Maps
• Why are maps so well-suited?
• A visual artifact for computation (Hutchins)

What is an interface?
• Is it an interface if there’s no method for a user to tell if they’ve done something?
  • What might an example be?
• Is it an interface if there’s no method for explicit user input?
  • Example: health monitoring apps

Sensor Fusion
• Multimodal = multiple human channels
• Sensor fusion = multiple sensor channels
• Example app: tracking people (1 human channel)
  • Might use: RFID + vision + keyboard activity + …
• I disagree with the Oviatt paper
  • Speech + lips is sensor fusion, not multimodality
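As a rough illustration of sensor fusion for the people-tracking example, the sketch below combines evidence from several sensors into one estimate that a person is present. It assumes the sensors are conditionally independent (a naive Bayes simplification), and the likelihood ratios are invented numbers, not measurements; the `fuse` function is hypothetical.

```python
import math


def fuse(prior, likelihood_ratios):
    """Combine independent sensor evidence in log-odds space.

    Each ratio is P(reading | person present) / P(reading | person absent).
    Returns the posterior probability that the person is present.
    """
    log_odds = math.log(prior / (1 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1 / (1 + math.exp(-log_odds))


# Invented likelihood ratios for three sensor channels
readings = {"rfid": 4.0, "vision": 2.0, "keyboard": 1.5}
posterior = fuse(0.5, readings.values())
print(round(posterior, 3))
```

All three channels observe the same single human channel (where the person is), which is what separates this from multimodal input in the sense used above.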

What constitutes a modality?
• To some extent, it’s a matter of semantics
• Is a pen a different modality than a mouse?
• Are two mice different modalities if one controls a GUI and the other controls a tablet-like UI?
• Is a captured modality the same as an input modality?
  • How does the Audio Notebook fit into this?

Input modalities
• Mouse
• Pen: recognized or unrecognized
• Speech
• Non-speech audio
• Tangible object manipulation
• Gaze, posture, body-tracking
• Each of these experiences has different implementing technologies
  • e.g., gaze tracking could be laser-based or vision-based

Output modalities
• Visual displays
  • Raster graphics, oscilloscope, paper printer, …
• Haptics: force feedback
• Audio
• Smell
• Taste

Dual Purpose Speech

Why multimodal?
• Hands busy / eyes busy
• Mutual disambiguation
• Faster input
• “More natural”
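Mutual disambiguation can be sketched as joint scoring over the n-best lists of two recognizers: each mode's hypotheses are re-ranked by whether they are compatible with the other mode's. The hypotheses, scores, compatibility table, and the `best_joint` function below are all made up for illustration, and multiplying scores is only one simple way to combine them.

```python
from itertools import product


def best_joint(speech_nbest, gesture_nbest, compatible):
    """Return (score, speech, gesture) for the best compatible pair, or None.

    Scores are multiplied, which treats the two recognizers as independent
    (an assumption of this sketch).
    """
    candidates = [
        (s_score * g_score, s, g)
        for (s, s_score), (g, g_score) in product(speech_nbest, gesture_nbest)
        if (s, g) in compatible
    ]
    return max(candidates) if candidates else None


# The speech recognizer slightly prefers "road"; the gesture recognizer prefers "point".
speech_nbest = [("road", 0.6), ("boat dock", 0.4)]
gesture_nbest = [("point", 0.7), ("line", 0.3)]

# Assumed map semantics: roads are drawn as lines, boat docks are placed at points.
compatible = {("road", "line"), ("boat dock", "point")}

best = best_joint(speech_nbest, gesture_nbest, compatible)
print(best[1], "+", best[2])
```

Here the top speech hypothesis ("road") loses to the second one once the gesture evidence is taken into account, which is the effect the term describes: neither recognizer alone has to be right for the combined interpretation to be.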

On Anthropomorphism
• The multimodal community grew out of the AI and speech communities
• Should human communication with computers be as similar as possible to human-human communication?

Multimodal Software Architectures
• OAA, AAA, OOPS

Next Time… Vision-Based Interaction
• Computer Vision for Interactive Computer Graphics, William T. Freeman, Yasunari Miyake, Ken-ichi Tanaka, David B. Anderson, Paul A. Beardsley, Chris N. Dodge, Michal Roth, Craig D. Weissman, William S. Yerazunis, Hiroshi Kage, Kazuo Kyuma
• A Design Tool for Camera-based Interaction, Jerry Alan Fails and Dan R. Olsen
