
Multimodal Input Analysis



Presentation Transcript


  1. Multimodal Input Analysis Making Computers More Humane Shubha Tandon Youlan Hu TJ Thinakaran

  2. Roadmap • Basis for multimodal interfaces and media. • Differences between multimodal and conventional interfaces. • Multimedia input analysis. • Cognitive basis of multimodal interfaces. • Architectures for information processing.

  3. Multimodal Fundamentals

  4. What is a Multimodal System? • Multimodal systems process two or more combined user input modes – such as speech, pen, touch, manual gestures, gaze, and head and body movements – in a coordinated manner with multimedia system output.

  5. Schematic Multimodal System

  6. Multimodal Systems – Why? • Provide transparent, flexible, and powerfully expressive means of HCI. • Easier to learn and use. • Robustness and stability. • When used as front-ends to sophisticated application systems, they conduct HCI in modes all users are familiar with, reducing the cost of training users. • Potentially user, task and environment adaptive.

  7. Multimodal Interface Terminology
  • Multimodal interfaces process two or more combined user input modes – such as speech, pen, touch, manual gestures, gaze, and head and body movements – in a coordinated manner with multimedia system output. They are a new class of interfaces that aim to recognize naturally occurring forms of human language and behavior, and which incorporate one or more recognition-based technologies (e.g., speech, pen, vision).
  • Active input modes are ones that are deployed by the user intentionally as an explicit command to a computer system (e.g., speech).
  • Passive input modes refer to naturally occurring user behavior or actions that are recognized by a computer (e.g., facial expressions, manual gestures). They involve user input that is unobtrusively and passively monitored, without requiring any explicit command to a computer.
  • Blended multimodal interfaces are ones that incorporate system recognition of at least one passive and one active input mode (e.g., speech and lip movement systems).
  • Temporally-cascaded multimodal interfaces are ones that process two or more user modalities that tend to be sequenced in a particular temporal order (e.g., gaze, gesture, speech), such that partial information supplied by recognition of an earlier mode (e.g., gaze) is available to constrain interpretation of a later mode (e.g., speech). Such interfaces may combine only active input modes, only passive ones, or they may be blended.

  8. Multimodal Interface Terminology
  • Mutual disambiguation involves disambiguation of signal- or semantic-level information in one error-prone input mode from partial information supplied by another. Mutual disambiguation can occur in a multimodal architecture with two or more semantically rich recognition-based input modes. It leads to recovery from unimodal recognition errors within a multimodal architecture, with the net effect of suppressing errors experienced by the user.
  • Visemes refers to the detailed classification of visible lip movements that correspond with consonants and vowels during articulated speech. A viseme-phoneme mapping refers to the correspondence between visible lip movements and audible phonemes during continuous speech.
  • Feature-level fusion is a method for fusing low-level feature information from parallel input signals within a multimodal architecture, which has been applied to processing closely synchronized input such as speech and lip movements.
  • Semantic-level fusion is a method for integrating semantic information derived from parallel input modes in a multimodal architecture, which has been used for processing speech and gesture input.

  9. Multimodal Interface Terminology
  • Frame-based integration is a pattern-matching technique for merging attribute-value data structures to fuse semantic information derived from two input modes into a common meaning representation during multimodal language processing.
  • Unification-based integration is a logic-based method for integrating partial meaning fragments derived from two input modes into a common meaning representation during multimodal language processing. Compared with frame-based integration, unification derives from logic programming, and has been more precisely analyzed and widely adopted within computational linguistics.
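Frame-based integration can be illustrated with a minimal sketch. The Python example below is not taken from any of the cited systems; the frame contents and the merge rule are assumptions chosen to show how a partial meaning frame from speech and one from a pen gesture could be fused into a single attribute-value structure.

```python
# Minimal sketch of frame-based semantic fusion (illustrative only).
# Speech supplies the action and unresolved deictic slots; the gesture
# recognizer supplies concrete referents for those slots.

def merge_frames(speech_frame, gesture_frame):
    """Merge two attribute-value frames into one meaning frame.

    Returns None if the frames assign conflicting values to an attribute,
    i.e. the partial meanings do not unify.
    """
    merged = dict(speech_frame)
    for attr, value in gesture_frame.items():
        if merged.get(attr) not in (None, value):
            return None                      # conflict: fusion fails
        merged[attr] = value                 # fill or confirm the slot
    return merged

# Spoken "move that there": object and destination are deictic placeholders.
speech = {"action": "move", "object": None, "destination": None}
# Two pen taps resolve the placeholders to concrete entities/coordinates.
gesture = {"object": "blue_triangle", "destination": (120, 45)}

print(merge_frames(speech, gesture))
# {'action': 'move', 'object': 'blue_triangle', 'destination': (120, 45)}
```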

  10. Trends… • Hardware, software and integration technology advances have fueled research in this area. • Research trends: • Earliest systems: supported speech input along with keyboard- or mouse-based GUI interfaces. • In the ’80s and ’90s, systems were developed to use spoken input as an alternative to text via keyboard, e.g. CUBRICON, XTRA, Galaxy, Shoptalk and others. • The most recent system designs are based on two parallel input streams, both capable of conveying rich semantic information. • The most advanced systems have been produced using speech and pen input, and speech and lip movement.

  11. Recent Speech and Pen Based Systems

  12. Other Systems and Future Directions • Speech and lip movement systems used to build animated characters that serve as interface design vehicles. • Use of vision-based technologies, such as interpretation of gaze, facial expressions, etc. – passive vs. active modes. • Blended multimodal interfaces with temporal cascading. • New pervasive and mobile interfaces, capable of adapting processing to user and environmental context.

  13. Advantages and Goals • Gives the user a choice of modality for conveying different types of information, the use of combined modes, and the ability to alternate between modes as required. • Potential to accommodate a broader range of users – different users like to use different modes to interact. • Prevents overuse of, and physical damage to, any single modality. • Ability to accommodate continuously changing conditions of mobile use. • Efficiency gains, especially noticeable in certain domains. • Superior error handling.

  14. Error Handling – Reasons for Improved Performance • User-centered reasons: • Users intuitively select the input mode that is less error-prone in a given lexical context. • User language is simpler when interacting multimodally – reduced complexity. • Users have a tendency to switch modes after a system recognition error – good error recovery. • System-centered reasons: • Multimodal architectures support mutual disambiguation.
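As a rough illustration of how an architecture can support mutual disambiguation, the sketch below jointly rescores the n-best lists of two error-prone recognizers so that only semantically compatible pairs survive. The hypotheses, scores, and compatibility table are invented for the example and are not from any cited system.

```python
# Toy illustration of mutual disambiguation: jointly rescore n-best lists
# from two recognizers and keep only semantically compatible pairs.

speech_nbest = [("ditch", 0.55), ("pitch", 0.45)]        # speech recognizer output
gesture_nbest = [("area_on_map", 0.6), ("icon", 0.4)]     # gesture recognizer output

# Which spoken commands make sense with which gesture interpretations (assumed).
compatible = {("ditch", "area_on_map"), ("pitch", "icon")}

def fuse(speech_hyps, gesture_hyps):
    """Return the best jointly compatible (speech, gesture) pair."""
    scored = [
        (s_score * g_score, s, g)
        for s, s_score in speech_hyps
        for g, g_score in gesture_hyps
        if (s, g) in compatible            # incompatible pairs are pruned
    ]
    return max(scored, default=None)

print(fuse(speech_nbest, gesture_nbest))
# (0.33..., 'ditch', 'area_on_map') – here each mode's top guess survives,
# but if speech had misrecognized 'ditch' as 'pitch', the gesture evidence
# would pull the joint interpretation back toward the compatible pair.
```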

  15. Differences Between Multimodal Interfaces and GUIs
  GUIs:
  • Assume that there is a single event stream that controls the event loop, with processing being sequential.
  • Assume interface actions (e.g. selection of items) are atomic and unambiguous.
  • Built to be separable from the application software and reside centrally on one machine.
  • Do not require temporal constraints; the architecture is not time-sensitive.
  Multimodal interfaces:
  • Typically process continuous and simultaneous input from parallel incoming streams.
  • Process input modes using recognition-based technology, good at handling uncertainty.
  • Have large computational and memory requirements and are typically distributed over the network.
  • Require time-stamping of input and development of temporal constraints on mode fusion operations.

  16. Put-That-There

  17. Put-That-There • One of the earliest multimodal concept demonstrations using speech and pointing. • Created by the Architecture Machine Group at MIT. • Quote from Richard Bolt: “Even after 17 years, looking at the video of the demo, you sense something special when Chris, seated before our media room screen, raises his hand, points, and says ‘Put that (pointing to a blue triangle)… there (pointing to a spot above and to the left),’ and lo, the triangle moves to where he told it to. I have yet to see an interface demo that makes its point as cleanly and succinctly as did that very first version of Put-That-There.”

  18. Media Room • Size of a personal office. • Walls (not in picture) have loudspeakers on either side of a wall-sized, frosted-glass projection screen. • TV monitors on either side of the user’s chair. • User chair – arms have a one-inch-high joystick sensitive to pressure and direction. • Near each joystick, a square-shaped touch-sensitive pad.

  19. Features of Media Room • Two spatial orders: virtual graphical space and the user’s immediate real space. • Key technologies used: • DP-100 Connected Speech Recognition System (CSRS) by NEC America, Inc. – capable of a limited amount of connected speech recognition. • ROPAMS (Remote Object Position Attitude Measurement System) for spatial position and orientation sensing – to track where the user is pointing. • Basic items the system recognizes: circles, squares, diamonds, etc. • Variable attributes: colors and sizes.

  20. Commands “Create”: “Create a blue square there.” The effect of the complete utterance is a “call” to the create routine, which needs the object to be created (with attributes) as well as x, y pointing input from the wrist-borne space sensor. “Move”: “Move the blue triangle to the right of the green square.” Pronominalized version: “Move that there.” (The user does not even have to know what “that” is.) Note: Pronominalization: • Makes utterances shorter • No need for reference objects
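The effect of a complete “Create” utterance can be pictured as a call that combines the parsed attributes with the x, y reading taken from the pointing sensor at the moment the deictic word is spoken. The code below is a hypothetical reconstruction, not the original system's implementation; names such as create_item, handle_create, and the sensor interface are assumptions.

```python
# Hypothetical sketch of how "Create a blue square there" might bottom out
# in a call needing both the spoken attributes and a pointing sample.

def read_space_sensor():
    """Stand-in for the wrist-borne space sensor; returns screen x, y."""
    return (0.42, 0.67)            # normalized display coordinates (made up)

def create_item(shape, color, size, position):
    print(f"create {size} {color} {shape} at {position}")

def handle_create(utterance_attrs):
    # The x, y are sampled when the deictic "there" is recognized.
    position = read_space_sensor()
    create_item(utterance_attrs["shape"],
                utterance_attrs["color"],
                utterance_attrs.get("size", "medium"),
                position)

# Result of parsing "Create a blue square there."
handle_create({"shape": "square", "color": "blue"})
```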

  21. Some more commands “Make that …”: “Make that blue triangle smaller” “Make that smaller” “Make that like that” – Internally, the object indicated by the second “that” is the model; the first object is deleted and replaced by a copy of the second. “Delete”: “Delete that green circle” “Delete that”

  22. Commands… Command: “Call that … the calendar” Processing steps involved: • On hearing “Call that”, the recognizer sends a code to the host system indicating a naming command. The x, y coordinates of the indicated item are noted by the host. • The host system directs the speech recognition unit to switch from recognition mode to training mode to learn the (possibly new) name to be given to the object. • After completion of naming, the recognizer is directed to go back to recognition mode. • Possible improvement: the recognizer could itself switch from recognition to training mode and back (without direction from the host system).
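The processing steps for the naming command amount to a small mode-switching protocol between the host and the recognizer. The sketch below is a loose paraphrase in Python; all class and method names are invented for illustration and do not reflect the original system's code.

```python
# Illustrative mode-switching protocol for "Call that ... the calendar".
# The recognizer runs in RECOGNITION mode until the host detects a naming
# command, is put into TRAINING mode to learn the new name, then returns.

class Recognizer:
    def __init__(self):
        self.mode = "RECOGNITION"
        self.vocabulary = set()

    def train(self, name):
        self.mode = "TRAINING"
        self.vocabulary.add(name)       # learn the (possibly new) name
        self.mode = "RECOGNITION"       # host directs it back afterwards

class Host:
    def __init__(self, recognizer):
        self.recognizer = recognizer
        self.names = {}

    def on_naming_command(self, pointed_xy, spoken_name):
        # 1. "Call that" recognized: note x, y of the indicated item.
        item = pointed_xy
        # 2. Switch the recognizer to training mode to learn the name.
        self.recognizer.train(spoken_name)
        # 3. Bind the learned name to the item.
        self.names[spoken_name] = item

host = Host(Recognizer())
host.on_naming_command(pointed_xy=(0.3, 0.8), spoken_name="the calendar")
print(host.names)
```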

  23. Possible Uses • Moving ships about in a harbor map when planning the harbor facility. • Moving battalion formations. • Facilities planning: moving rooms and hallways about.

  24. CUBRICON

  25. CUBRICON • System integrating deictic and graphic gestures with simultaneous NL for both user input and system output. • Unique interface capabilities: • Accept and understand multimedia input – references to entities in NL can include pointing. Also disambiguates unclear references and infers the intended referent. • Dynamically compose and generate multimodal language – synchronously present spoken NL, gestures and graphical expressions in output. Also distinguishes between spoken and written NL.

  26. CUBRICON Dialogue example

  27. CUBRICON Architecture – [Block diagram: input devices (speech input device, keyboard, mouse pointing device) feed the input coordinator and multimedia parser/interpreter; output devices (color-graphics display, monochrome display, speech output device) are driven by the multimedia output planner and coordinated output generator; shared knowledge sources include the lexicon, grammar, discourse model, user model, output planning strategies, and KBs of general and domain-specific knowledge; the executor/communicator links the intelligent multimedia interface to the target application system (DBMS, mission planning system).]

  28. CUBRICON – System Overview • 3 input and 3 output devices. • Primary data path: • Input coordinator: fuses the input streams. • Multimedia parser and interpreter: interprets the compound stream. • Executor/communicator to the target system: actions may include commands to the mission planning system, database queries, etc. • Multimedia output planner: plans the expression of the result of the executor module’s action. • Coordinated output generator: produces multimedia output in a coordinated, real-time manner.
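The primary data path can be pictured as a simple pipeline. The skeleton below is only an organizational sketch of the stages named on this slide; the function names, data shapes, and toy behavior are assumptions, not CUBRICON code.

```python
# Organizational sketch of the primary data path (all names illustrative).

def input_coordinator(speech_events, keyboard_events, mouse_events):
    """Fuse the three input streams into one time-ordered compound stream."""
    return sorted(speech_events + keyboard_events + mouse_events,
                  key=lambda e: e["t"])

def parse_and_interpret(compound_stream):
    """Interpret the compound stream into an application-level request."""
    return {"request": "query", "content": compound_stream}

def execute(request):
    """Forward the request to the target system (mission planner, DBMS, ...)."""
    return {"result": f"handled {request['request']}"}

def plan_output(result):
    """Decide how to express the result across the output media."""
    return [("speech", result["result"]), ("graphics", "highlight icon")]

def generate_output(plan):
    for medium, content in plan:              # coordinated, real-time output
        print(f"{medium}: {content}")

events = [{"t": 0.1, "mode": "speech", "data": "where is ..."},
          {"t": 0.2, "mode": "mouse", "data": (100, 200)}]
generate_output(plan_output(execute(parse_and_interpret(
    input_coordinator(events[:1], [], events[1:])))))
```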

  29. CUBRICON Knowledge Sources • Used for understanding input and generating output. • Knowledge Source: • Lexicon • Grammar: defines multimodal language. • Discourse Model: Representation of “attention focus space” of dialogue. Has a “focus list” and “display model” – tries to retain knowledge pertinent to the dialogue.

  30. CUBRICON Knowledge Sources • User model: has a dynamic “Entity Rating Module” to evaluate the relative importance of entities to the user’s dialogue and task – tailors output and responses to the user’s plans, goals and ideas. • Knowledge base: information about the task domain (Air Force mission planning) – concepts like SAMs, radars, air bases, and missions.

  31. Multimodal Language – Features in CUBRICON • Multimodal Language: Spoken or written NL and gestures. • Variety in objects that can be pointed to: windows, form slots, table entries, icons, points. • Variety in number of point gestures allowed per phrase. • Variety in number of multimodal phrases allowed per sentence.

  32. Examples of Referent Determination • Example 1: User: “What is the mobility of these <point>, <point>, <point>?” (Use of more than one point gesture in a phrase.) The system uses “mobility” to select from the candidate referents of the point gestures (if the gestures are ambiguous) – it uses the display model and knowledge base. Note: This takes care of pointing ambiguities. It also takes care of pointing that is inconsistent with the NL, by using information from the sentence as filtering criteria for candidate objects.
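The referent-determination step described here, using information from the sentence (e.g. “mobility”) to filter the candidate referents of an ambiguous point gesture, can be sketched as below. The candidate objects, their attributes, and the helper names are invented for illustration.

```python
# Illustrative filtering of point-gesture candidates by a sentence constraint.
# Each point gesture yields several candidate objects from the display model;
# the noun-phrase constraint ("mobility" applies only to mobile entities such
# as SAM units) filters out inconsistent candidates.

display_model_candidates = [
    [{"id": "sam-3", "type": "SAM", "has_mobility": True},
     {"id": "airbase-1", "type": "airbase", "has_mobility": False}],   # 1st <point>
    [{"id": "sam-7", "type": "SAM", "has_mobility": True}],            # 2nd <point>
    [{"id": "radar-2", "type": "radar", "has_mobility": False},
     {"id": "sam-9", "type": "SAM", "has_mobility": True}],            # 3rd <point>
]

def resolve(candidate_sets, constraint):
    """Keep, for each gesture, only the candidates satisfying the constraint."""
    return [[c["id"] for c in cands if c.get(constraint)]
            for cands in candidate_sets]

# "What is the mobility of these <point>, <point>, <point>?"
print(resolve(display_model_candidates, "has_mobility"))
# [['sam-3'], ['sam-7'], ['sam-9']]
```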

  33. Examples of Referent Determination • Example 2: User: “Enter this <point-map-icon> here <point-form-slot>.” • Uses more than one phrase per sentence. • Uses more than one CRT. • Two features are used to process this: • A display model containing semantic information about all CRTs. • All objects and concepts represented in a single knowledge representation language (SNePS knowledge base) – shared by all modules.

  34. Multimodal Language Generation • In output, NL and gestures are integrated to provide unified multimodal language. • To compose a reference for an object: • If the object is an icon on the display: points to the icon and simultaneously generates an NL expression. • If the object is part of an icon on the display: points to the “parent” icon and generates NL describing the relation of the referent to the “parent” icon.

  35. Multimodal Language Generation • Situation: the system wants to point to an object which is represented in more than one window on the CRTs: • Selects all relevant windows. • Filters out non-active or non-exposed windows. • If some exposed windows contain the object, uses weak gestures (highlighting) for all of them, selects the most important window, and gestures strongly towards it (blinks the icon plus a text box). • If there are no exposed windows, the system determines the most important de-exposed window, exposes it and points to it.
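The window-selection behavior on this slide can be summarized as a small decision procedure. The sketch below uses invented window records and a made-up importance score purely to show the shape of the logic; it is not the system's actual code.

```python
# Illustrative decision procedure for pointing at an object shown in several windows.

def point_to(object_id, windows):
    """windows: list of dicts with 'id', 'shows', 'exposed', 'importance'."""
    relevant = [w for w in windows if object_id in w["shows"]]
    exposed = [w for w in relevant if w["exposed"]]

    if exposed:
        for w in exposed:                                   # weak gesture everywhere
            print(f"weak gesture: highlight {object_id} in {w['id']}")
        best = max(exposed, key=lambda w: w["importance"])  # strong gesture in one
        print(f"strong gesture: blink {object_id} + text box in {best['id']}")
    elif relevant:
        best = max(relevant, key=lambda w: w["importance"])
        print(f"expose {best['id']} and point to {object_id} there")

point_to("airbase-1", [
    {"id": "map-window", "shows": {"airbase-1"}, "exposed": True, "importance": 0.9},
    {"id": "table-window", "shows": {"airbase-1"}, "exposed": True, "importance": 0.4},
    {"id": "detail-window", "shows": {"airbase-1"}, "exposed": False, "importance": 0.7},
])
```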

  36. When is a Graphical Representation Generated? • If the information being represented is: • Locative information • Path traversal information • Example (locative information): User: “Where is the Fritz Steel Plant?” • CUBRICON: “The Fritz Steel plant (figure object) is located here <point – highlighting/blinking icon>, 45 miles southwest of Dresden (ground object) <graphical expression – arrow between the two icons>.”

  37. Multimedia Input Analysis

  38. Multimedia Analysis • The processing and integration of multiple input modes for the communication between a user and the computer. • Examples: • Speech and pointing gestures (Put-That-There, CUBRICON, XTRA) • Eye movement-based interaction (Jacob, 1990) • Speech, gaze and hand gestures (ICONIC) • Speech and lip movement

  39. Eye Movement-Based Interaction • Highly interactive, non-WIMP, non-command • Benefits: • Extremely rapid • Natural, little conscious effort • Implicitly indicates focus of attention • WYLAIWYG (What You Look At Is What You Get)

  40. Issues of Using Eye Movement in HCI • Midas Touch problem • Eyes continually dart from point to point, unlike the relatively slow and deliberate operation of manual input devices • People are not accustomed to operating devices simply by moving their eyes; if poorly done, it could be very annoying • Need to extract useful dialogue information (fixation, intention) from noisy eye data • Need to design and study new interaction techniques • Cost of eye-tracking equipment

  41. Measuring Eye Movement • Electronic • Skin electrodes around eye • Mechanical • Non-slipping contact lens • Optical/Video - Single Point • Track some visible feature on eyeball; head stationary • Optical/Video - Two Point • Can distinguish between head and eye movements

  42. Hardware Components – A corneal reflection-plus-pupil eye tracker

  43. Types of Eye Movements • Saccade • Rapid, ballistic, vision suppressed • Interspersed with fixations • Fixation (200-600ms) • Steady, but some jitter • Other movements • Eyes always moving; stabilized image disappears

  44. Approach to Using EM • Philosophy: • Use natural eye movements as additional user input, rather than trained movements as explicit commands • Technical approach: • Process the noisy, jittery eye tracker data stream to filter it, recognize fixations, and turn them into discrete dialogue tokens that represent the user's higher-level intentions • Then, develop generic interaction techniques based on the tokens

  45. Processing the EM Data – Fixation Recognition [plot: eye position X-coordinates over time (~3 secs)] • A fixation starts when the eye position stays within 0.5° for more than 100 ms (spatial and temporal thresholds filter the jitter) • The fixation continues as long as the position stays within 1° • Failures to track the eye of up to 200 ms do not terminate the fixation
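The fixation-recognition rule stated here (start when the gaze stays within 0.5° for more than 100 ms, continue while it stays within 1°, tolerate tracking dropouts of up to 200 ms) translates naturally into a small filter over the raw sample stream. The sketch below is one interpretation of that rule, with an assumed sample format and sampling rate; it is not the original implementation.

```python
# Sketch of threshold-based fixation recognition over a raw gaze sample stream.
# Samples are (t_ms, x_deg, y_deg) or (t_ms, None, None) when tracking fails.
import math

START_RADIUS = 0.5    # degrees: samples must stay this close for > 100 ms to start
TRACK_RADIUS = 1.0    # degrees: fixation continues while samples stay this close
START_MS = 100
DROPOUT_MS = 200      # tracking failures shorter than this do not end a fixation

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def fixations(samples):
    """Yield (start_ms, end_ms, center) for each recognized fixation."""
    pending = []          # samples tentatively belonging to a starting fixation
    current = None        # (start_ms, center) of an ongoing fixation
    last_good = None      # time of the last successfully tracked sample

    for t, x, y in samples:
        if x is None:                                   # tracking failure
            if current and last_good is not None and t - last_good > DROPOUT_MS:
                yield (current[0], last_good, current[1])
                current, pending = None, []
            continue
        p = (x, y)
        last_good = t
        if current:
            if dist(p, current[1]) <= TRACK_RADIUS:     # still fixating
                continue
            yield (current[0], t, current[1])           # gaze jumped: fixation ends
            current = None
        pending.append((t, p))
        pending = [(tp, pp) for tp, pp in pending if dist(pp, p) <= START_RADIUS]
        if pending and t - pending[0][0] > START_MS:    # held still long enough
            current, pending = (pending[0][0], p), []
    if current and last_good is not None:
        yield (current[0], last_good, current[1])

demo = ([(t, 0.1, 0.1) for t in range(0, 400, 20)]
        + [(t, 5.0, 5.0) for t in range(400, 600, 20)])
print(list(fixations(demo)))   # two fixations: around (0.1, 0.1), then (5.0, 5.0)
```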

  46. Processing the EM Data – Input Tokens • The fixations are then turned into input tokens: • start of fixation • continuation of fixation (every 50 ms) • end of fixation • failure to locate eye position • entering monitored regions • The tokens form eye events, which are multiplexed into the event queue stream with other input events • The eye events also carry information about the fixated screen object
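One possible shape for this tokenization layer is sketched below: fixation records are turned into discrete tokens that carry the fixated screen object and are pushed onto the same queue as other input events. The token names and the queue interface are assumptions for illustration only.

```python
# Illustrative eye-event tokens multiplexed into a shared input event queue.
from dataclasses import dataclass
from queue import Queue
from typing import Optional

@dataclass
class EyeToken:
    kind: str                            # "FIX_START", "FIX_CONTINUE", "FIX_END", ...
    t_ms: int
    screen_object: Optional[str] = None  # the fixated screen object, if any

event_queue: Queue = Queue()             # shared with mouse/keyboard events

def emit_fixation(start_ms, end_ms, screen_object, step_ms=50):
    """Turn one recognized fixation into start/continue/end tokens."""
    event_queue.put(EyeToken("FIX_START", start_ms, screen_object))
    for t in range(start_ms + step_ms, end_ms, step_ms):    # continuation every 50 ms
        event_queue.put(EyeToken("FIX_CONTINUE", t, screen_object))
    event_queue.put(EyeToken("FIX_END", end_ms, screen_object))

emit_fixation(0, 200, "file_icon_3")
while not event_queue.empty():
    print(event_queue.get())
```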

  47. Eye as an Input Mode • Faster than manual devices • Implicitly indicates focus of attention, not just a pointing device • Less conscious/precise control • Eye moves constantly, even when user thinks he/she is staring at a single object • Eye motion is necessary for perception of stationary objects • Eye tracker is always "on" • No analogue of mouse buttons • Less accurate/reliable than mouse

  48. Eye as a Control Device • A taxonomy of approaches to eye movement-based interaction:
  • Unnatural (learned) eye movement, unnatural response: A. Command-based interfaces
  • Natural eye movement, unnatural response: B. Non-command interfaces
  • Natural eye movement, natural response: C. Virtual environments

  49. Object Selection • Select an object from among several on screen • After the user is looking at the desired object, press a button to indicate the choice • Alternative = dwell time: if the user looks at an object for a sufficiently long time, it is selected without further commands • Poor alternative = blink • Found: 150-250 ms of dwell time feels instantaneous, but provides enough time to accumulate data for accurate fixation recognition • Found: Gaze selection is faster than mouse selection
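Dwell-time selection can be expressed as a tiny rule on top of the fixation data: once an object has been fixated continuously past the dwell threshold, it is selected with no further command. The update format, threshold handling, and names below are assumptions for illustration.

```python
# Toy dwell-time selection over a stream of fixation updates.
# Each update is (t_ms, fixated_object or None); an object is selected once
# it has been fixated continuously for at least DWELL_MS.

DWELL_MS = 200          # 150-250 ms reportedly feels instantaneous

def dwell_select(updates):
    selected = []
    fix_obj, fix_since = None, None
    for t, obj in updates:
        if obj != fix_obj:                       # gaze moved to a new object
            fix_obj, fix_since = obj, t
        elif obj is not None and t - fix_since >= DWELL_MS:
            if not selected or selected[-1] != obj:
                selected.append(obj)             # fire the selection once
    return selected

updates = [(0, "icon_a"), (50, "icon_a"), (100, "icon_a"), (150, "icon_a"),
           (200, "icon_a"), (250, "icon_b"), (300, "icon_b")]
print(dwell_select(updates))   # ['icon_a'] – icon_b never reaches the threshold
```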

  50. Moving an Object • Two methods, both using eye position to select which object is to be moved • Hold the button down, “drag” the object by moving the eyes, release the button to stop dragging • Eyes select the object, but moving is done by holding the button, dragging with the mouse, then releasing the button • Found: Surprisingly, the first method works better • Use filtered “fixation” tokens, not raw eye position, for dragging
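The two dragging variants can be contrasted in a few lines of Python. Only the division of labour between eye and button is taken from the slide; the event model, class names, and dummy Icon object are invented for the sketch.

```python
# Sketch contrasting the two object-moving techniques (event model invented).

class Icon:
    def __init__(self, name): self.name, self.pos = name, (0, 0)
    def move_to(self, xy): self.pos = xy

class Mover:
    """Eye position always chooses WHICH object; the two methods differ in
    what drives the motion while the button is held."""
    def __init__(self): self.dragging = None

    def button_down(self, fixated_object):
        self.dragging = fixated_object        # both methods: eye selects the object

    # Method 1 (found to work better): the drag follows filtered fixation tokens.
    def fixation_moved(self, fixation_xy):
        if self.dragging: self.dragging.move_to(fixation_xy)

    # Method 2: the drag follows the mouse while the button is held.
    def mouse_moved(self, mouse_xy):
        if self.dragging: self.dragging.move_to(mouse_xy)

    def button_up(self): self.dragging = None

m, icon = Mover(), Icon("ship")
m.button_down(icon); m.fixation_moved((120, 80)); m.button_up()
print(icon.pos)   # (120, 80)
```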
