“Put-That-There”: Voice and Gesture at the Graphics Interface, by Richard A. Bolt. Presented by: Kathleen Murray, Tracey Gordon, Grainne Sharkey.
This presentation describes a system in which the user commands simple shapes about a large graphics display using voice and simultaneous pointing • This allows the free use of pronouns, which makes the interaction more natural • Gesture, aided by voice, gains precision in its power to reference, as you will see later in the presentation
Introduction • This presentation is based on work by the Architecture Machine Group at the Massachusetts Institute of Technology • They have been experimenting with the conjoint use of voice input and gesture recognition to command events on large graphics displays • The central interest is the combination of voice and gesture into one modality • The approach makes significant use of pronouns as temporary variables to reference items on the display
The interactions described in this presentation are staged in the Architecture Machine Group's “Media Room” • A physical facility • The user's terminal is literally a room into which they step
Media Room • The size of a personal office • Cabling from the devices used in the room is hidden under the floor • The walls house banks of loudspeakers on either side of a large projection screen at the front of the user • The user's chair incorporates two small joysticks, one on each arm, each sensitive to pressure and direction • Beside each joystick is a square touch-sensitive pad
Colour TV monitors are situated on either side of the user's chair, each fitted with a transparent, touch-sensitive pad
The Media Room, with its user chair, plays a key role in research into a Spatial Data Management System (SDMS) • Spatially indexing data • Derives from our everyday experience of retrieving items • Retrieval is natural and automatic • Even with a messy desk, for example, the user knows where the items are located; they have developed a mental image of the desk's layout
SDMS • Lets the user navigate to specific information • Used in the Media Room • The information appears in its entirety on one of the colour TV monitors near the user's chair • The user can move a “you-are-here” marker (a transparent rectangular overlay) over the information using the chair's right-hand joystick, or can directly touch the TV screen • The selected information is then displayed in greater detail on the large screen • The left-hand joystick zooms in on the information, as sketched below
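As a rough illustration only (not from the paper), here is a minimal Python sketch of how such pan-and-zoom navigation might be wired up; the Viewport type, stick ranges and speed constants are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Viewport:
    x: float = 0.0      # centre of the "you-are-here" marker
    y: float = 0.0
    zoom: float = 1.0   # magnification of the large-screen view

PAN_SPEED = 200.0   # world units per second at full deflection (assumed)
ZOOM_RATE = 1.5     # zoom factor per second at full deflection (assumed)

def update_viewport(view, right_stick, left_stick, dt):
    """One frame of SDMS-style navigation: the right-hand joystick pans
    the marker, the left-hand joystick zooms. Stick axes in [-1, 1]."""
    rx, ry = right_stick
    _, ly = left_stick
    view.x += rx * PAN_SPEED * dt
    view.y += ry * PAN_SPEED * dt
    view.zoom *= ZOOM_RATE ** (ly * dt)
    return view
```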
There are two spatial orders within the room: • the virtual graphical space • the user's immediate real space • These can converge to become effectively one continuous interactive space • User awareness of this common space is implicit • The user points, gestures, and references “up”, “down” and so on freely and naturally, just as in real space • This interactive situation calls on two newer technologies: • connected speech recognition • position sensing in space
Speech and Space: The Technologies • Speech recognisers fall into two categories: • those which recognise discrete or isolated utterances, where the speaker must talk to the system in a “clipped”, word-by-word style • those which recognise connected speech • Connected speech recognisers allow up to five words per spoken utterance, with no pauses between words
Speech Recognisers (Cont..) • Response time is about 300 ms, and the output is a display of the recognised text • The device's vocabulary • Held in the recogniser's active memory • A set of word reference patterns • Maximum of 120 words • In an optional ‘discrete utterance’ mode, the vocabulary may be larger, about 1000 words • The system comes with a lightweight, head-mounted microphone, which is used in the Media Room
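The paper does not say how the reference patterns are matched; as a hedged sketch, a recogniser with a bounded active vocabulary might look like this, with a simple nearest-template test standing in for the device's proprietary matching:

```python
import math

MAX_VOCABULARY = 120   # limit of the recogniser's active memory

class PatternStore:
    """Hypothetical store of word reference patterns; each pattern is
    modelled here as a plain feature vector."""
    def __init__(self):
        self.patterns = {}   # word -> feature vector

    def train(self, word, features):
        if len(self.patterns) >= MAX_VOCABULARY:
            raise RuntimeError("active vocabulary is full")
        self.patterns[word] = features

    def recognise(self, features, threshold=1.0):
        """Return the stored word nearest the incoming utterance,
        or None if nothing falls within the threshold."""
        best_word, best_dist = None, threshold
        for word, ref in self.patterns.items():
            dist = math.dist(ref, features)
            if dist < best_dist:
                best_word, best_dist = word, dist
        return best_word
```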
Space • The space position and orientation sensing technology used in the Media Room was made by Polhemus Navigation Sciences Inc. of Essex, Vermont • Called the Remote Object Position Attitude Measurement System (ROPAMS) • The essentials of the system are: • three coils are set into a plastic cube, their mountings corresponding to the x, y and z spatial axes • two such cubes are involved: one acts as a transmitter and the other functions as a sensor • the arrangement of the coils in each cube creates an antenna that is sensitive in all three orientations
Space (Cont..) • Transmitter cube • Transmits a signal to the sensor cube • If the signal isn't strong enough, an error is generated and the user must re-aim • In the Media Room, this cube sits on a block to the right of the user's chair • Sensor cube • Very lightweight • The three coils in this cube determine a point in space • In the Media Room, the space-sensing cube is attached to a wristband worn by the user • A sketch of how the sensed pointing might map to a screen cursor follows
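The paper does not give the geometry; as an assumption-laden sketch, the sensed wrist position and pointing direction could be turned into a screen cursor by intersecting the pointing ray with the screen plane:

```python
SCREEN_Z = 3.0   # assumed distance from transmitter to screen plane, metres

def cursor_position(wrist_pos, pointing_dir):
    """Intersect the pointing ray with the plane z = SCREEN_Z and return
    the (x, y) cursor location, or None when pointing away from it."""
    px, py, pz = wrist_pos
    dx, dy, dz = pointing_dir
    if dz <= 1e-6:                # ray parallel to, or away from, the screen
        return None
    t = (SCREEN_Z - pz) / dz      # ray parameter at the plane
    return px + t * dx, py + t * dy
```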
We have now described the physical aspects of the Media Room and the technologies developed to allow voice and gesture input at the graphics interface • We will now present how this room (the user's terminal) and its related technologies (connected-speech recogniser and position-sensing cube) work together to generate a natural and powerful interaction
Commands • The user is seated before the Media Room's large screen • They have the space-sensing cube attached to a wristband on their wrist, and the system's microphone is ready and listening • Some system commands that demonstrate voice and pointing are the following: • Create • Move • Make that
Create • In the demonstration system, the large screen will initially be clear or have a simple backdrop such as a map. • Simple items are placed against this background. • These items will be basic shapes, such as circles, squares or diamonds. • These items can be moved about, replicated, their attributes altered, or ordered to vanish. • Variable attributes are colour and size.
The user points to some spot on the large screen. • A small, white “x” cursor on the screen provides running visual feedback for pointing. • The user then says “Create a blue square there.” • A blue square appears at the spot where the user is pointing. • As the size of the square is not given explicitly, the default size, medium, is used. • There is no default colour or shape, so these must be specified.
The position of the feedback cursor on the screen at the moment the spoken “there” occurs becomes the spot where the created item is placed. • Thus the occurrence of the spoken “there” is functionally a “when”. • In effect, the command is a call to a create routine, which requires certain parameters to be supplied. • By the time the user says “there”, the parameter values have been supplied: the shape, colour and size of the object to be created. • The pointing position, read at the utterance of “there”, completes the parameter input, as in the sketch below.
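A minimal sketch of such a create routine (the token sets, the read_cursor callback and the dictionary representation are illustrative assumptions, not the paper's implementation):

```python
DEFAULT_SIZE = "medium"   # size has a default; colour and shape do not

def handle_create(words, read_cursor):
    """words: recognised tokens, e.g. ["create", "a", "blue", "square",
    "there"]; read_cursor: callable returning the cursor's current (x, y),
    sampled at the moment "there" is recognised."""
    colour = next((w for w in words if w in {"red", "blue", "green", "yellow"}), None)
    shape = next((w for w in words if w in {"circle", "square", "diamond", "triangle"}), None)
    size = next((w for w in words if w in {"small", "medium", "large"}), DEFAULT_SIZE)
    if colour is None or shape is None:
        raise ValueError("colour and shape have no defaults")
    x, y = read_cursor()   # the spoken "there" acts as the "when"
    return {"shape": shape, "colour": colour, "size": size, "pos": (x, y)}
```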
Move • There are a number of ways the user can move items about the screen. • Example: “Move the blue triangle to the right of the green square.” • This example relies on voice alone. • When the blue triangle is addressed, it de-saturates as immediate feedback, then disappears from its present site and re-appears centered at a spot to the right of the green square. • A reasonable placement for the exact positioning “to the right” is computed, as sketched below. • The item is now where the user ordered it to be.
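One plausible way to compute that “reasonable placement” (the GAP constant and the item representation are assumptions):

```python
GAP = 20   # assumed horizontal spacing, in screen pixels

def place_right_of(item, reference):
    """Centre `item` just to the right of `reference`. Items are dicts
    with 'pos' (centre x, y) and 'size' (width, height)."""
    rx, ry = reference["pos"]
    rw, _ = reference["size"]
    iw, _ = item["size"]
    item["pos"] = (rx + rw / 2 + GAP + iw / 2, ry)
    return item
```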
The user could also say: “Move that to the right of the green square.” • This option employs the pronoun “that” while the user simultaneously points at the intended item. • In this mode of giving the command, the user may not only omit the words “blue” and “triangle”; they need not even know what the thing is or what it is called. • “That” is thus defined as the item currently being pointed to.
The entire command “move the blue triangle to the right of the green square” can be shortened to “put that there”. • A mini-thesaurus of common synonyms, such as “move,” “put,” etc., is built into the system vocabulary (see the sketch below). • The “Copy…” command is a variant of the move action, except that the image of the item to be moved also remains in place at the original spot.
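A hedged sketch of the mini-thesaurus and of resolving the pronoun “that” against the pointing cursor (the synonym table and the nearest-centre test are illustrative guesses):

```python
SYNONYMS = {   # spoken verbs collapse onto one canonical command
    "move": "move", "put": "move", "place": "move",
    "copy": "copy", "duplicate": "copy",
}

def canonical_command(word):
    """Map a recognised verb onto its canonical command, if any."""
    return SYNONYMS.get(word.lower())

def resolve_that(display_items, cursor_xy):
    """Resolve 'that' as the item nearest the pointing cursor."""
    x, y = cursor_xy
    return min(display_items,
               key=lambda it: (it["pos"][0] - x) ** 2 + (it["pos"][1] - y) ** 2,
               default=None)
```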
Make that • The attributes of any item the user has created by voice and gesture can be modified; here, the attributes are colour and size. • Example: “Make the blue triangle smaller” • This command reduces the referenced item in size. • Example: “Make that a large blue diamond” • If this is uttered while the user is pointing at a small yellow circle, the circle is transformed into a large blue diamond.
In the command “make that like that”: • the second “that” is, functionally, a “when” at which to read the x, y coordinates of pointing. • The item indicated when the second “that” is uttered becomes the “model” for the change. • The first referenced item is replaced, in a “copy”-like fashion, by the second referenced item, as sketched below. • “Delete …” • The “delete” command allows the user to drop selected items from the display. It can be driven by voice and pointing.
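A sketch of “make that like that”, assuming a callback that samples the pointed-at item each time a “that” is recognised:

```python
def make_that_like_that(read_pointed_item):
    """Each spoken "that" is a "when": the first sample is the target,
    the second is the model whose attributes are copied across."""
    target = read_pointed_item()   # sampled at the first "that"
    model = read_pointed_item()    # sampled at the second "that"
    for attr in ("shape", "colour", "size"):
        target[attr] = model[attr]
    return target
```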
I have described how the system generates a natural and powerful interaction through the commands above. • Tracey will now discuss the naming of items and give a summary of the overall paper.
Naming • Using the ‘Call that’ command, the user can name objects that they point to on the screen • E.g.: “Call that (pointing to a blue square)…the calendar” • ‘Call that’ is processed by the recogniser • This tells the system that the naming command has been issued • The x, y coordinates of the object being pointed to are recorded by the host system • When the naming command is issued, the system switches from recogniser mode to training mode
Recogniser and training modes • Recogniser mode: • the system listens for keywords and commands that it recognises - “call that” • Training mode: • the system records new words and adds them to a file - “the calendar” • The system records the last part of the sentence • It adds this as a new entry to its file of word reference patterns
It then returns to recogniser mode, ready for the next verbal input • Switching between recogniser mode and training mode takes a finite amount of time • A brief pause is required in the spoken command line to accommodate this • The user tends to pause at that point anyway • waiting for feedback that they have actually contacted the item they are pointing to • This pause masks the system's need to pause between recognition and training modes • It does, however, suggest a breakdown in the general convenience of continuous vs. discrete speech input • A sketch of this mode switch follows
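A minimal sketch of this mode switch (the output-code scheme and the utterance callback are assumptions about how such a controller could be organised):

```python
RECOGNISE, TRAIN = "recognise", "train"

class NamingController:
    """Hypothetical controller for the 'Call that ... <name>' sequence."""
    def __init__(self):
        self.mode = RECOGNISE
        self.patterns = {}       # output code -> word reference pattern
        self.pending_xy = None   # what was pointed at when naming began

    def on_utterance(self, text, features, cursor_xy):
        if self.mode == RECOGNISE:
            if text == "call that":
                self.pending_xy = cursor_xy   # host records the x, y
                self.mode = TRAIN             # the brief pause happens here
            return None
        # Training mode: store the new word under a fresh output code,
        # then return to recogniser mode, ready for the next input.
        code = f"name-{len(self.patterns)}"
        self.patterns[code] = features
        binding = (code, self.pending_xy)
        self.mode = RECOGNISE
        return binding
```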
Intelligent systems • Hopefully the system will one day be intelligent enough to interpret as well as recognise • when it hears a certain keyword, it will automatically switch from recognition mode to training mode, without the need to pause.
For the ‘Call that’ command, the intelligent system would: • truncate the recognised keyword from the spoken sentence - e.g. “Call that” • take the rest of the sentence to be the new name • associate this new name with the object being pointed to at the time • the blue square is now ‘the calendar’ (see the sketch below) • To maintain overall coordination with the host system • the recogniser transmits ASCII codes for recognised or learned words • it also transmits any relevant ‘control’ codes
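A sketch of the hoped-for single-pass behaviour, where the keyword is truncated and the remainder of the sentence becomes the name (the string handling here is purely illustrative):

```python
KEYWORD = "call that"

def parse_naming(utterance, pointed_item):
    """E.g. parse_naming("call that the calendar", blue_square) names
    the pointed-at item "the calendar" with no mode-switch pause."""
    if not utterance.lower().startswith(KEYWORD):
        return None
    name = utterance[len(KEYWORD):].strip()   # the rest of the sentence
    pointed_item["name"] = name
    return name
```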
Advantage • eliminates the need for the user to pause within the spoken dialogue • Disadvantage • the problem of ‘coarticulation’ • when the user speaks, the sound of a word is influenced by the words that precede or follow it • it is therefore important that the user speaks very clearly, especially when adding a new word
Summary • The example used in this presentation shows simple objects being moved about a blank screen • It shows how easy it would be to use a similar system in real-world situations • moving ships about a harbour map in planning a harbour facility • moving battalion formations about as overlays on a terrain map
The main advantage of the “Put-that-there” system is that you can indicate what you want to do with items in a natural, spontaneous way • pointing to them and addressing them in spoken words - not typed symbols • The use of pronouns makes it even simpler and more natural to use • An example of a system that uses voice and gesture recognition is IBM's DreamSpace • That system “hears” the users' voice commands and “sees” their gestures
Critical Analysis • This paper suggests a system that provides a more efficient and natural method of communicating commands to a computer • The author doesn't provide an evaluation of the approach; no explicit evidence is offered to show that the system works as described • A weakness of the system is that the speech recognition mechanism needs more intelligence to provide a truly natural interaction • the user has to pause when giving certain commands to allow time for processing
The descriptions are convincing, and the underlying argument for the success of such a system seems sound • but it is impossible to fully judge the success of such a system without any proof • A weakness of the paper is that it spends too much time describing the specifics of the Media Room's hardware technologies • it doesn't seem necessary to go into as much depth as it does • the central part of the paper can be conveyed well enough without it • The paper would be most useful for people interested in enhancing the quality of human-computer interaction
How it relates to our Final Year Projects • One of the projects involves an intelligent spoken interface for an on-line banking system. • It already uses voice directed at the interface, rather than keyboard/mouse • To incorporate gesture recognition, the user could point at items on the interface to make the interaction more natural • e.g.: point to the ‘Balance’ icon and say “Show me this”
Another of the projects provides on-line property buying and selling. • It does not currently use either voice or gesture recognition, but they could be added as a more powerful and natural means of interacting with the system • the user could point to a property and say “Show me the details of this” • the user could point to the scroll arrows and say “scroll up” or “scroll down”
Another of the projects is to generate a 2D map from a 3D virtual environment. • The map will show the user's position in the environment. • The “Put-that-there” approach could be used in this project. • The user could point to an area on the map and ask “What is this?”; a description of the area being pointed to could then be given. • The user could flag different areas of the map that they wish to visit: they could point to a place where they wish to insert a flag and say “Insert a flag there”. • The map could also be used for automatic navigation, where the user goes to a place by selecting it on the map: the user points to the area they wish to go to and says “Go there”, and is then taken to that location in the virtual environment, as sketched below.
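A speculative sketch of the “Go there” interaction for this project, assuming a simple linear scaling between map pixels and world coordinates:

```python
MAP_SIZE = (512, 512)          # assumed map size in pixels
WORLD_SIZE = (100.0, 100.0)    # assumed world extent in metres

def map_to_world(map_xy):
    """Scale a pointed-at map pixel into world coordinates."""
    mx, my = map_xy
    return (mx / MAP_SIZE[0] * WORLD_SIZE[0],
            my / MAP_SIZE[1] * WORLD_SIZE[1])

def go_there(avatar, map_xy):
    """Handle 'Go there': set the avatar's navigation target to the
    world location under the pointing cursor."""
    avatar["target"] = map_to_world(map_xy)
    return avatar
```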