Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia

A MULTI-MODAL SYSTEM ICANDO: Intellectual Computer AssistaNt for Disabled Operators

Alexey Karpov, Andrey Ronzhin and Alexandre Cadiou
Email: karpov@iias.spb.su
Web: www.spiiras.nw.ru/speech

20 September 2006, Interspeech'2006-ICSLP, Pittsburgh PA, USA
Objective of the ICANDO system
Assistance in HCI for persons without hands or with arm disabilities. Instead of a keyboard and mouse, the user issues speech commands and head motions, captured (audio + video) by a USB web-camera and recognized by the system.
General Architecture
Users who have difficulties with standard PC control devices can move the screen cursor simply by moving the head, and can give speech commands instead of clicking buttons.
Pipeline: human speech → speech recognition → speech command; head motions → head tracking → cursor coordinates (x, y). The two streams are synchronized and fused into a multi-modal command sent to the GUI.
Automatic Speech Recognition module
The SIRIUS engine (SPIIRAS Interface for Recognition and Integral Understanding of Speech) is used to recognize voice commands in Russian, English and French.
• Multi-Gaussian continuous HMMs for acoustic modeling
• Mel-frequency cepstral features
• SAMPA phonetic alphabet for each language
• Morphemic language model and vocabulary for Russian
• Russian morphemes of 3 kinds: prefix + root + suffix (ending)
[A. Karpov, A. Ronzhin, Speech Interface for Internet Service Yellow Pages. In: Intelligent Information Processing and Web Mining, Advances in Soft Computing, Springer-Verlag, 2005, pp. 219-228.]
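The morphemic vocabulary idea above can be illustrated with a minimal sketch: words are recognized as prefix + root + suffix triples, so the vocabulary lists morphemes rather than full word forms. The morpheme lists and the greedy segmentation below are illustrative assumptions, not the system's actual lexicon or decoder.

```python
# Hedged sketch: morpheme-based vocabulary for Russian ASR.
# These transliterated morpheme lists are toy examples, not SIRIUS data.
PREFIXES = ["pere", "pod", "na", ""]   # "" = no prefix
ROOTS = ["pisa", "hod", "voz"]
SUFFIXES = ["t", "la", "it", ""]       # "" = no suffix/ending

def segment(word):
    """Greedily split a word into (prefix, root, suffix), or None if it
    cannot be built from the morpheme vocabulary."""
    for p in PREFIXES:
        if not word.startswith(p):
            continue
        rest = word[len(p):]
        for r in ROOTS:
            if rest.startswith(r) and rest[len(r):] in SUFFIXES:
                return (p, r, rest[len(r):])
    return None
```

With a few dozen morphemes of each kind, thousands of inflected word forms become representable, which is the motivation for a morphemic language model in a highly inflective language like Russian.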
Problem of entering text
At present a user has to type on a virtual on-screen keyboard, moving the cursor by head motion and pressing keys with the voice command "Left". A dictation system for Russian is being developed on top of the SIRIUS engine.
Head tracking systems
Hardware-based trackers: SmartNAV, InterTrax, LED-based systems.
Software-based trackers:
• Optical flow methods
• Retina filter
• etc.
Head tracking module
Intel OpenCV library (Open Source Computer Vision Library); USB web-camera Logitech QuickCam for Notebooks Pro, 640x480 at 30 fps.
The system works in 2 stages:
• calibration: Haar-based face detection, capturing the tracking points on the face;
• tracking: iterative Lucas-Kanade optical flow algorithm tracking five facial points: the center of the upper lip, the tip of the nose, the point between the eyebrows, the left eye and the right eye.
The motion of these points controls the mouse cursor on the desktop.
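The last step above, turning frame-to-frame motion of the tracked points into cursor movement, can be sketched as follows. The averaging over all five points and the fixed gain are assumptions for illustration; the real system derives velocities adaptively (next slide).

```python
# Hedged sketch: convert the displacement of tracked facial points
# between two consecutive frames into a cursor displacement.
# Averaging over points smooths out jitter in any single point.

def cursor_delta(prev_pts, curr_pts, gain=4.0):
    """Average (dx, dy) of the tracked points, scaled by a gain factor."""
    n = len(prev_pts)
    dx = sum(c[0] - p[0] for p, c in zip(prev_pts, curr_pts)) / n
    dy = sum(c[1] - p[1] for p, c in zip(prev_pts, curr_pts)) / n
    return gain * dx, gain * dy
```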
Robust tracking method
• Restoration of lost tracking points within their corresponding operation areas (rectangles), taking the remaining points into account.
• Cursor velocity adaptation based on the delta of point positions between consecutive image frames. Two cursor velocities: fast to move the cursor, slow to select an object.
Synchronization of modalities
The marker (cursor) position is latched at the beginning of the phrase input, i.e. at the moment the speech endpoint-detection algorithm is triggered.
Information fusion uses a frame method: the fields of a structure are filled with the required data, and on completion of recognition the signal for command execution is given.
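The frame method above can be sketched as a small structure: cursor coordinates are latched when speech endpointing fires, the recognized command fills the remaining field, and the command executes once the frame is complete. The class and method names are assumptions for illustration.

```python
# Hedged sketch of frame-based fusion of the two modalities.
class CommandFrame:
    def __init__(self):
        self.command = None       # filled by the speech recognizer
        self.x = self.y = None    # latched cursor coordinates

    def on_speech_start(self, cursor_x, cursor_y):
        # Latch the marker position at the beginning of the phrase,
        # i.e. when speech endpoint detection triggers.
        self.x, self.y = cursor_x, cursor_y

    def on_recognition_done(self, command):
        # On completion of recognition, fill the command field and
        # report whether the frame is ready for execution.
        self.command = command
        return self.complete()

    def complete(self):
        return self.command is not None and self.x is not None
```

This decouples the modalities in time: the user can start speaking while still moving the head, and the command applies at the position where the utterance began.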
Task performance results
Scenario: finding a weather forecast on a web portal, selecting, copying and saving the information into a text document, and printing the file. In the experiments the multi-modal way of interaction was 1.9 times slower than the traditional one. However, this decrease in interaction speed is acceptable, since the developed system is mainly intended for impaired users.
Video fragment of the system in use: www.spiiras.nw.ru/speech/demo/assistive.html
Conclusions
The result of the research is an assistive multi-modal system. The experiments have shown that, in spite of some decrease in operation speed, the multi-modal system allows working with a computer without a mouse or keyboard. ICANDO can be used for hands-free PC control by users with disabilities of the hands or arms, supporting the socio-economic integration of impaired people in the information society and increasing their independence from other people.
Acknowledgements
SIMILAR Network of Excellence, www.similar.cc. The SIMILAR NoE studies multimodal interfaces that combine speech, gestures, vision, haptic interfaces and other modalities, and brings together over 40 partners from across Europe.
Thank you for your attention!
E-mail: karpov@iias.spb.su
Web: www.spiiras.nw.ru/speech
Welcome to Saint-Petersburg!