Component Description: Multimodal Interface
Carnegie Mellon University
Prepared by: Michael Bett (mbett@cs.cmu.edu)
3/26/99
1 - Overview
Description of the Multimodal Toolkit (MMI). What MMI is:
• Integrated Speech, Handwriting, and Gesture Recognizers
• Java-Based API
• Integrated Recording Feature
• Plug-n-Play Recognizer Interface: allows recognizers to be replaced (see the sketch after this list)
• Internet-Enabled Interface: recognizers may run remotely over the internet
• Simultaneous Multiple-User Support
• Supports Natural Interface Development
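To make the plug-n-play idea concrete, here is a minimal sketch of what a common recognizer contract could look like in Java. The interface and method names below are assumptions for illustration; they are not the toolkit's actual API.

// Hypothetical sketch of a plug-n-play recognizer contract (names are
// assumptions, not the MMI toolkit's actual API). Each recognizer, whether
// speech, handwriting, or gesture, implements the same interface, so one
// implementation can be swapped for another without changing the application.
public interface Recognizer {
    // Modality label, e.g. "speech", "handwriting", or "gesture".
    String getModality();

    // Feed raw input (audio samples, pen points, ...) to the recognizer.
    void process(byte[] input);

    // Return the recognizer's current best hypothesis as text.
    String getHypothesis();
}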
2 - Architecture Overview
• MMI is a toolkit that allows multiple modalities to be easily integrated into applications.
• Applications can mix modalities (speech, gesture, and handwriting).
The Java-based API communicates directly with each recognizer.
The multimodal applet is the user interface; the applet window presents a view onto a domain-dependent representation of application data and state, in the form of objects to be manipulated.
[Figure: sample application using multimodal error repair. The Multimodal Applet connects to the Multimodal Server, which communicates with the Janus speech recognizer, the handwriting recognizer, and the gesture recognizer, each backed by its resources (acoustic model, vocabulary, language model) and fed by speech, handwriting, and gesture input.]
3 - Component Description
The following modalities have the following levels of support in the multimodal toolkit:
4 - External Interfaces
The user defines their grammar using six probabilistically weighted node types (a construction sketch follows this list):
• A Toplevel represents an entire input model and contains one or more sequences, each of which contains exactly one AFrame.
• An AFrame represents an action frame and contains one or more sequences, each of which consists of one or more PSlots.
• A PSlot represents a parameter slot and contains one or more UnimodalNodes (at most one for each input modality).
• A UnimodalNode specifies a sub-grammar for a single input modality and has the same structure as a NonTerm, with the addition of a label specifying the modality.
• A NonTerm is a non-terminal node consisting of one or more sequences, each of which contains zero or more NonTerms or Literals.
• A Literal is a terminal node containing a text string representing one or more input tokens.
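As a rough illustration of how these node types compose, the sketch below builds a tiny grammar for a spoken "move this" command paired with a pointing gesture. The stub classes and constructors are invented so the example is self-contained; the toolkit's real signatures are not reproduced here.

// Stand-in stub classes so this sketch compiles on its own. The node type
// names come from the list above, but these constructors are assumptions.
class Literal      { Literal(String tokens) {} }
class NonTerm      { NonTerm(Object... sequence) {} }
class UnimodalNode { UnimodalNode(String modality, NonTerm grammar) {} }
class PSlot        { PSlot(UnimodalNode... alternatives) {} }
class AFrame       { AFrame(PSlot... sequence) {} }
class Toplevel     { Toplevel(AFrame... sequence) {} }

class GrammarSketch {
    // A "move" action frame: the spoken phrase fills one parameter slot,
    // and a pointing gesture fills the other.
    static Toplevel build() {
        return new Toplevel(
            new AFrame(
                new PSlot(new UnimodalNode("speech",
                    new NonTerm(new Literal("move this")))),
                new PSlot(new UnimodalNode("gesture",
                    new NonTerm(new Literal("POINT"))))));
    }
}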
4 - External Interfaces
• The Multimodal Server sends a series of points to the pen and gesture recognizers.
• The audio is sent to the speech recognizer.
• The pen, gesture, and speech recognizers return their hypotheses to the multimodal toolkit, which is responsible for integrating the results in an optimizing search, as sketched below [Minh Tue Vo, Ph.D. dissertation, CMU, 1998].
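The integration itself is described in the cited dissertation; as a loose illustration only, the sketch below combines independently scored unimodal hypotheses by picking the pair with the best joint score. The additive log-probability scoring is an assumption for this example, not the toolkit's actual search.

// Illustrative only: combine unimodal results by choosing the speech/gesture
// pair with the highest summed log-probability. The real toolkit performs a
// structured search over the multimodal grammar.
class Integrator {
    static class Hyp {
        final String text;
        final double logProb;
        Hyp(String text, double logProb) { this.text = text; this.logProb = logProb; }
    }

    static Hyp[] best(Hyp[] speech, Hyp[] gesture) {
        Hyp[] winner = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Hyp s : speech) {
            for (Hyp g : gesture) {
                double score = s.logProb + g.logProb;  // assumes modalities score independently
                if (score > bestScore) {
                    bestScore = score;
                    winner = new Hyp[] { s, g };
                }
            }
        }
        return winner;
    }
}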
5 - Existing Software “Bridges”
• The multimodal toolkit provides a Java API that allows applets or applications to incorporate multimodal functionality (a usage sketch follows).
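Here is a hedged sketch of what application-side use of such an API might look like. Every name in it (MultimodalClient, HypothesisListener, the host and port) is invented for illustration and is not the toolkit's documented interface.

// Hypothetical application-side view of a multimodal Java API. All class,
// method, host, and port names here are invented for illustration.
interface HypothesisListener {
    void onHypothesis(String hypothesis);  // integrated multimodal result
}

class MultimodalClient {
    void connect(String host, int port)    { /* open a connection to the Multimodal Server */ }
    void setGrammar(Object toplevel)       { /* register the input model (see section 4) */ }
    void addListener(HypothesisListener l) { /* deliver integrated hypotheses */ }
}

class ClientDemo {
    public static void main(String[] args) {
        MultimodalClient client = new MultimodalClient();
        client.connect("mmi-server.example.edu", 7000);  // recognizers may run remotely
        client.setGrammar(null);                         // e.g. a Toplevel built as in section 4
        client.addListener(h -> System.out.println("User intent: " + h));
    }
}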
6 - Information Flow
• Part 1 - Specify how other CPOF components can send data to and receive data from your system. Please be explicit.
Components may directly interface with the multimodal server.
• Part 2 - What are the inputs to your system? Please specify formats and protocol; provide details.
The multimodal grammar.
• Part 3 - What are the outputs of your system? Please specify format and protocol; provide details.
Hypotheses conforming to the multimodal grammar.
7 - Plug-n-Play
• Part 1 - We have not currently identified how our components interact with other CPOF components. Please present a diagram that shows this interaction. TBD
• Part 2 - Are there components in your system that are functionally “similar” to another CPOF component? TBD
• Part 3 - Do any of your components complement other CPOF components (e.g., ZUI and Sage/Visage)? TBD
8 - Operating Environments and COTS
• Multimodal Server - Required Hardware: PC or Sun; Operating System: independent; Required COTS: JDK 1.1.*; Language: Java
• Janus - Required Hardware: Sun Ultra 60; Operating System: Solaris 2.5.1; Required COTS: Tcl/Tk; Language: Tcl/Tk, C
• NPen++ - Required Hardware: Sun or PC; Operating System: Solaris 2.5.1 or Windows NT; Required COTS: none; Language: C++
• Gesture Recognizer - Required Hardware: Sun or PC; Operating System: Solaris 2.5.1 or Windows NT; Required COTS: none; Language: C++
9 - Hardware Platform Requirements
Specify the hardware required to support your system:
• MMI can run on a PC with a minimum of 32 MB of RAM and a 200 MHz processor.
• The speech recognizer requires a dual-processor Sun Ultra 60 with a minimum of 500 MB of RAM. (The recognizer currently under development will require a 500 MHz Pentium III with 128 MB minimum, 256 MB preferred.)
• Video capture cards, Sound Blaster-compatible sound cards, tabletop and lapel microphones, and pan-tilt and stationary cameras are required.