Multimodal corpora and speech technology
Kristiina Jokinen
University of Art and Design Helsinki
kristiina.jokinen@uiah.fi
Metaphors for Human-computer interaction
• Computer as a tool
  • Passive and transparent
  • Supports the human goals, human control
• Computer as an agent
  • Intelligent software mediating interaction between the human user and an application
  • Models of beliefs, desires, intentions (BDI)
  • Complex interaction
    • Cooperation, negotiation
    • Multimodal communication
Research at UIAH
• Interact:
  • Cooperation with Finnish universities, IT companies, Association of the Deaf, Arla Institute
  • Finnish dialogue system
  • Rich interaction situation
  • Adaptive machine learning techniques
  • Agent-based architecture
  • www.mlab.uiah.fi/interact/
• DUMAS:
  • EU IST project (SICS, UIAH, UTA, UMIST, Etex, Conexor, Timehouse)
  • User modelling for AthosMail (interactive e-mail application)
  • Reinforcement learning and dialogue strategies
  • www.sics.se/~dumas/
Multimodal Museum Interfaces
• Marjo Mäenpää, Antti Raike
• Study projects
• New ways of presenting art that are both visually interesting and accessible in terms of content:
  • a virtual human (avatar) that interactively guides the user through the exhibition using both spoken and sign language
• Design for All: accessibility for virtual visitors on museum web sites
MUMIN Network
• NorFA network on MUltiModal INterfaces
• Support for contacts, cooperation, education, and research on multimodal interactive systems
• MUMIN PhD course in Tampere, 18-22 November (lectures and hands-on exercises on eye-tracking, speech interfaces, electromagnetogram, virtual world)
• More information and application forms: http://www.cst.dk/mumin
Content of the lecture
• Definitions and terminology
• Why multimodality
• Projects and tools
• Multimodal annotations
• Conclusions and references
What is multi-modality?
Mark Maybury: Dagstuhl seminar 2001
Human-computer interaction
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems
• Control: manipulation and coordination of information
• Perception: transforming sensory information into higher-level representations
Terminology
• Maybury and Wahlster (1998):
  • Medium = material object used for presenting or saving information; physical carriers (sounds, movements, natural language)
  • Code = system of symbols used for communication
  • Modality = the senses employed to process incoming information (vision, audition, olfaction, touch, taste) => perception
  • vs. a communication system, consisting of a code expressed through a certain medium => HCI
ISLE/NIMM definitions
• Medium = physical channel for information encoding: visual, audio, gestures
• Modality = a particular way of encoding information in some medium
EAGLES definitions
• Multimodal systems represent and manipulate information from different human communication channels at multiple levels of abstraction
• Multimedia systems offer more than one device for user input to the system and for system feedback to the user, e.g. microphone, speaker, keyboard, mouse, touch screen, camera; they
  • do not generate abstract concepts automatically
  • do not transform the information
• Multimodal (audio-visual) speech systems utilise the same multiple channels as human communication by integrating non-verbal cues (facial expression, eye gaze, and lip movements) with speech recognition (ASR) and speech synthesis (SS)
Why multimodality?
Why multimodal research
• Next-generation interface design will be more conversational in style
• Flexible use of input modes depending on the setting: speech, gesture, pen, etc.
• Broader range of users: ordinary citizens, children, the elderly, users with special needs
• Human communication research
  • conversation analysts (CA), psychologists
  • esp. nonverbal behaviour and speech
• Animated interface agents
Advantages of MM interfaces
• Redundant and/or complementary modalities can increase interpretation accuracy
  • e.g. combining ASR with lip-reading in noisy environments (see the sketch after this list)
• Different modalities, different benefits
  • Object references are easier by pointing than by speaking
  • Commands are easier to speak than to choose from a menu using a pointing device
  • Multimedia output is more expressive than single-medium output
• New applications
  • Some tasks are cumbersome or impossible in a single modality
  • e.g. interactive TV
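As a concrete illustration of the first point, here is a minimal sketch of score-level ("late") fusion of two modalities. The words, probabilities and weight are invented for illustration; real audio-visual recognisers fuse evidence at a much finer granularity.

    import math

    # Hypothetical log-probability scores for three word candidates,
    # one set from the acoustic recogniser, one from a lip-reading model.
    asr_scores = {"ship": math.log(0.40), "sip": math.log(0.35), "tip": math.log(0.25)}
    lip_scores = {"ship": math.log(0.20), "sip": math.log(0.15), "tip": math.log(0.65)}

    def fuse(asr, lip, w_asr=0.7):
        """Weighted log-linear fusion of the two modality scores."""
        return {w: w_asr * asr[w] + (1 - w_asr) * lip[w] for w in asr}

    fused = fuse(asr_scores, lip_scores)
    print(max(fused, key=fused.get))   # "tip": the visual evidence disambiguates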
Advantages (cont.)
• Freedom of choice
  • users differ in their modality preferences
  • users have different needs (Design for All)
• Naturalness
  • Transfer of the habits and strategies learned in human-human communication to human-computer interaction
• Adaptation to different environmental settings or evolving environments
  • switch from one modality to another depending on external conditions (noise, light, ...)
”Disadvantages”
• Coordination and combination of modalities
  • cognitive overload of the user through stimulation with too many media
• Data collection is more expensive
  • more complex technical setup
  • increased amount of data to be collected
  • interdisciplinary know-how required
• “Natural” remains a rather vague term
Projects and tools
EAGLES/ISLE initiatives
• EAGLES = Expert Advisory Group on Language Engineering Standards
  • Gibbon et al. (1997) Handbook of Standards and Resources for Spoken Language Systems
  • Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology, and Product Evaluation
• ISLE/NIMM = International Standards for Language Engineering / Natural Interaction and Multi-Modality
  • discusses annotation schemes specifically for the fields of natural interaction and multimodal research and development
  • develops guidelines for such schemes
NITE
• Dybkjær et al. (2001)
• workbench for multilevel and multimodal annotation
• general-purpose tools: stylesheets determine the look and functionality of the user’s tool
• continues the work of the MATE project
• http://nite.nis.sdu.dk/
MPI Projects
• Max Planck Institute for Psycholinguistics (MPI) in Nijmegen
• develops tools for the analysis of multimedia (esp. audiovisual) corpora
• supports scientific exploitation by linguists, anthropologists, psychologists and other researchers
• CAVA (Computer Assisted Video Analysis)
• EUDICO (European Distributed Corpora)
  • platform-independent
  • supports various storage formats
  • supports distributed operation via the internet
ATLAS/Annotation Graphs
• Framework to represent complex annotations on signals of arbitrary dimensionality
• Abstraction over the diversity of linguistic annotations, expanding on Annotation Graphs
• http://www.nist.gov/speech/atlas/
TalkBank
• Five-year interdisciplinary research project funded by NSF
• Carnegie Mellon University and the University of Pennsylvania
• Develops a number of tools and standards
• Studies human and animal communication:
  • Animal Communication
  • Classroom Discourse
  • Linguistic Exploration
  • Gesture and Sign
  • Text and Discourse
• The CHILDES database is viewed as a subset of TalkBank
• http://www.talkbank.org/
Annotation Tools
• Anvil (Michael Kipp): speech and gesture
• AGTK (Bird and Liberman): speech
• MMAX (Müller and Strube): speech, gesture
• MultiTool (GSMLC Platform for Multimodal Spoken Language Corpora): audio, video
  • http://www.ling.gu.se/gsmlc/
Some statistics on the tools
• Dybkjær et al. (2002): ISLE/NIMM Survey of Existing Tools, Standards and User Needs
• Speech is the key modality (9/10 tools)
• Gesture (7/10)
• Facial expression (3/10)
Annotation Graphs
• Bird et al. (2000)
• Formal framework for representing linguistic annotations
• Abstracts away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems
• AGTK (Annotation Graph Toolkit):
  • nodes encode time points, edges annotation labels (see the sketch below)
  • http://agtk.sourceforge.net/
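The core idea fits in a few lines. This is a simplified sketch of the annotation-graph model, not the actual AGTK API: nodes are time points, labelled edges span them, and annotations on different levels share the same nodes.

    from collections import namedtuple

    Node = namedtuple("Node", "id time")            # time in seconds
    Edge = namedtuple("Edge", "start end level label")

    nodes = [Node(0, 0.00), Node(1, 0.32), Node(2, 0.78)]
    edges = [
        Edge(0, 1, "word", "hello"),
        Edge(1, 2, "word", "there"),
        Edge(0, 2, "dialogue-act", "greeting"),     # spans both word edges
    ]

    # Words covered by the dialogue act: word edges inside its node span
    act = edges[2]
    print([e.label for e in edges
           if e.level == "word" and e.start >= act.start and e.end <= act.end])
    # -> ['hello', 'there']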
AGTK: Discourse Annotation Tool
Anvil - Annotation of Video and Language Data
• Michael Kipp (2001)
• Java-based annotation tool for video files
• Encoding of nonverbal behaviour (e.g. gesture)
• Imports annotations of speech-related phenomena (e.g. dialogue acts) on multiple layers (tracks)
• Track definitions according to a specific annotation scheme in Anvil's generic track configuration
• All data storage and exchange is in XML
Anvil – screen shot
Multimodal Annotation
Multi-media corpora
• Contain multi-media information in which various independent streams, such as speech, gesture, facial expression and eye movements, are annotated and linked
• Hugely complex due to the complicated time relationships between the annotations (see the sketch below)
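A minimal sketch of the problem: three independent, time-stamped streams, linked only through time. The data are invented; even the simple question of which events co-occur already requires interval-overlap logic, and real corpora add many more streams and partial or missing time anchors.

    # (start, end, label) triples per stream; all data invented
    speech  = [(0.00, 0.40, "put"), (0.40, 0.70, "it"), (0.70, 1.20, "there")]
    gesture = [(0.60, 1.30, "pointing")]
    gaze    = [(0.50, 1.40, "at table")]

    def overlapping(stream, start, end):
        """Annotations in `stream` that temporally overlap [start, end)."""
        return [label for (s, e, label) in stream if s < end and e > start]

    # Which gesture and gaze events co-occur with the word "there"?
    s, e, _ = speech[2]
    print(overlapping(gesture, s, e), overlapping(gaze, s, e))
    # -> ['pointing'] ['at table']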
Annotation Challenges
• Better understanding of natural communication modalities: human speech, gaze, gestures, facial expressions => how do the different modalities support input disambiguation?
• Behavioural issues: automaticity of human communication modes
• Multiparty communication
• Technical challenges:
  • Synchronisation
  • Error handling
  • Multimodal platforms, toolkits, architectures
Annotation Issues
• Phenomena
  • What is investigated: sounds, words, dialogue acts, coreference, new information, correction, feedback
• Theory
  • How to label, what categories
• Representation
  • Markup
XML representations
• XML = eXtensible Markup Language
• Becoming a standard for data representation:
  <word>happen</word>
  <word base="happen">
• Distinction between elements and attributes (see the sketch below):
  • <word> <base>happen</base> <pos>verb</pos> </word>
  • <word base="happen"> <pos>verb</pos> </word>
  • <word base="happen" pos="verb">
• XSL = eXtensible Stylesheet Language
  • XSLT: a language for transforming XML documents into other documents of any form
• XML does not support:
  • typed/grammar-based specification of attribute values
  • inference models for element values shared by more than one element
  • restricting the applicability of attributes that are mutually exclusive
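The element/attribute distinction is easy to see programmatically. A minimal sketch using Python's standard xml.etree.ElementTree, with the "happen" example from above:

    import xml.etree.ElementTree as ET

    # The same information once as attributes, once as child elements
    w1 = ET.fromstring('<word base="happen" pos="verb">happened</word>')
    w2 = ET.fromstring('<word><base>happen</base><pos>verb</pos></word>')

    print(w1.get("pos"))          # attribute access     -> verb
    print(w2.find("pos").text)    # child-element access -> verb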
Speech annotation
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems
Spoken Dialogue Annotations (an illustrative record is sketched below)
• Dialogue Acts (Communicative Acts)
  • GCSL: acceptance, acknowledgement, agreement, answer, confirmation, question, request, etc.
• Interact:
  • Feedback
    • structure, position, function
  • Turn management
    • overlap (give attention, affirmation, reminder, excuse, hesitation, disagreement, lack of hearing)
    • opening/closing an activity
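To make this concrete, one annotated utterance as a Python record. The field and tag names are illustrative, loosely following the categories above; they are not the actual Interact scheme.

    utterance = {
        "speaker": "U",
        "start": 12.3, "end": 13.1,        # seconds
        "words": "yeah that's fine",
        "dialogue_act": "acceptance",      # cf. acceptance, answer, question, ...
        "feedback": {"position": "utterance-initial", "function": "agreement"},
        "turn": "no overlap",
    }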
Interact tags (Jokinen et al. 2001)
Non-linguistic Vocalizations
• CHRISTINE corpus
• Simple descriptions: belch, clearsThroat, cough, crying, giggle, humming, laugh, laughing, moan, onTelephone, panting, raspberry, scream, screaming, sigh, singing, sneeze, sniff, whistling, yawn
• More complex descriptions: imitates woman's voice, imitating a sexy woman's voice, imitating Chinese voice, imitating drunken voice, imitating man's voice, imitating posh voice, mimicking police siren, mimicking Birmingham accent, mimicking Donald Duck, mimicking stupid man's voice, mimicking, speaking in French, spelling, whingeing, face-slapping noise, drowning noises, imitates sound of something being unscrewed and popped off, imitates vomiting, makes drunken sounds and a pretend belch, makes running noises, sharp intake of breath, click
• Non-vocal events: loud music and conversation, banging noise, break in recording, car starts up, cat noises, children shouting, dog barks, poor quality recording, traffic noise, loud music is on, microphone too far away, mouth full, telephone rings, beep, clapping, tapping on computer, television
Gesture Annotation 1
• Different types:
  • iconic, pointing, emblematic
• Different functions:
  • make speech understanding easier
  • make speech production easier
  • add semantic and discourse-level information
Gesture Annotation 2
• What to annotate (see the sketch below):
  • Time
  • Movement encoding
    • Body parts involved (head, hand, fingers)
    • Static vs. dynamic components
    • Direction, path shape, hand orientation
    • Location with respect to the body
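A minimal sketch of a record covering this checklist; the field names and example values are illustrative, not a standard scheme.

    from dataclasses import dataclass

    @dataclass
    class GestureAnnotation:
        start: float                  # seconds
        end: float
        body_parts: tuple             # e.g. ("right hand", "index finger")
        dynamic: bool                 # static posture vs. dynamic movement
        direction: str = ""           # dynamic components below
        path_shape: str = ""
        hand_orientation: str = ""
        location: str = ""            # with respect to the body

    g = GestureAnnotation(3.2, 4.0, ("right hand", "index finger"),
                          dynamic=True, direction="forward",
                          path_shape="straight", location="chest height")
    print(g.end - g.start)            # duration of the gesture in seconds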
LIMSI Coding Schema for MM Dialogues (Car Driver & Co-pilot)
• general: v stands for verbal, g for gesture, c for the human co-pilot, p for the human pilot; / and \ mark the beginning and end of a gesture; % marks a comment written by the encoder; [ and ] delimit successive segments of the itinerary ({ and } code subparts of such segments)
• time: < timecode-begin / timecode-end >
• body part: te = tête (head), ma = main (hand), mo = menton (chin), ms = mains (both hands)
• fingers: ix = index (first finger), mj = majeur (middle finger), an = annulaire (ring finger), au = auriculaire (little finger), po = pouce (thumb)
• gaze: oc = short glance at the map, ol = long glance at the map
• shape of the body part: td = tendu (tense), sp = souple (loose), cr = crochet (hook)
• global movement: mv = mouvement ample (wide movement), r = mouvements répétés (repeated movement), ( ) = static
• direction of movement: ar = arrière (backwards), tr = transversal (sideways), ci = circular
• meaning of gesture: ds = designation, ca = designation on the map, dr = direction, dc = description, pc = position
LIMSI Coding Schema
• Example:
  v(p): et maintenant? (“and now?”)
  v(c): on va, non /là-bas je/ pense, tout droit (“we go, no /over there I/ think, straight ahead”)
  g(c): ixtddr
  graphic(copilot): first finger, tense, direction
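Since the gesture codes simply concatenate the symbols from the schema above, they can be decoded mechanically. A small sketch; the greedy longest-match-first reading is my assumption, not part of the LIMSI specification.

    # Symbol table from the schema above (English glosses only)
    SYMBOLS = {
        "te": "head", "ma": "hand", "mo": "chin", "ms": "both hands",
        "ix": "first finger", "mj": "middle finger", "an": "ring finger",
        "au": "little finger", "po": "thumb",
        "oc": "short glance at the map", "ol": "long glance at the map",
        "td": "tense", "sp": "loose", "cr": "hook",
        "mv": "wide movement", "r": "repeated movement",
        "ar": "backwards", "tr": "sideways", "ci": "circular",
        "ds": "designation", "ca": "designation on the map",
        "dr": "direction", "dc": "description", "pc": "position",
    }

    def decode(code):
        out, i = [], 0
        while i < len(code):
            for n in (2, 1):                  # try the longer symbol first
                if code[i:i + n] in SYMBOLS:
                    out.append(SYMBOLS[code[i:i + n]])
                    i += n
                    break
            else:                             # unknown character
                out.append("?" + code[i])
                i += 1
        return out

    print(decode("ixtddr"))   # -> ['first finger', 'tense', 'direction']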
Gesture Coding Schemas 1
Dybkjær et al. (2002) ISLE/NIMM Survey on MM tools and resources
Gesture Coding Schemas 2
Dybkjær et al. (2002) ISLE/NIMM Survey on MM tools and resources
Facial Action Coding System (FACS)
• P. Ekman & W. Friesen (1976)
• describes visible facial movements
• anatomically based
• Action Unit (AU): an action produced by one muscle or a group of related muscles
• any expression can be described as a set of AUs (see the sketch below)
• 46 AUs defined
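Because an expression is just a set of AUs, set operations give a natural encoding. A minimal sketch; the AU numbers and names are standard FACS labels, while the "surprise" combination is an illustrative textbook-style example, not taken from the slides.

    # A few FACS Action Units (of the 46 defined)
    AU_NAMES = {
        1: "Inner Brow Raiser", 2: "Outer Brow Raiser",
        5: "Upper Lid Raiser", 26: "Jaw Drop",
    }

    raised_eyebrows = {1, 2}            # cf. the next slide
    surprise = {1, 2, 5, 26}            # illustrative AU combination

    def describe(aus):
        return [f"AU{n}: {AU_NAMES.get(n, '?')}" for n in sorted(aus)]

    print(describe(surprise))
    print(raised_eyebrows <= surprise)  # True: surprise includes the brow raise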
AUs for raising eye-brows
Dybkjær et al. (2002) ISLE/NIMM Survey of Annotation Schemes and Identification of Best Practice
Alphabet of the eyes
• I. Poggi, N. Pezzato, C. Pelachaud
• Gaze annotation
  • eyebrow movements, eyelid openness, wrinkles, eye direction, eye reddening, humidity
• E.g. eyebrows:
  • right/left: internal up/down, central up/down, external up/down
Conclusions
• Need for corpora annotated with multimodal information
• Much to do in coding MM information in all forms, at the relevant level of detail, cross-level & cross-modality
• No general coding schemas
  • coding schemas exist only for different aspects of facial expression, task-dependent gestures etc.
• No cross-modality coding schemas
• Lack of theoretical formalisation
  • how the face expresses cognitive properties
  • how gestures are used (except for sign language)
  • how they are coordinated with speech
• No general annotation tools
References
• Bernsen, N. O., Dybkjær, L. and Kolodnytsky, M. The NITE Workbench - A Tool for Annotation of Natural Interactivity and Multimodal Data. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, May 2002.
• Bird, S. and M. Liberman. A formal framework for linguistic annotation. Speech Communication, 33(1-2):23-60, 2001.
• Dybkjær, L. et al. (2002). ISLE/NIMM reports. http://isle.nis.sdu.dk/reports/wp11/
• Gibbon, D., Mertins, I. and R. Moore (eds.) Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology, and Product Evaluation. Kluwer, 2000.
• Granström, B. (ed.) Multimodality in Language and Speech Systems. Dordrecht: Kluwer, 2002.
• Kipp, M. Anvil - A Generic Annotation Tool for Multimodal Dialogue. Proceedings of Eurospeech 2001, pp. 1367-1370, Aalborg, September 2001.
• Maybury, M. T. and W. Wahlster (1998). Readings in Intelligent User Interfaces. San Francisco, CA: Morgan Kaufmann.
• Müller, C. and M. Strube. MMAX: A tool for the annotation of multi-modal corpora. In Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, pp. 45-50, 2001.
• Wahlster, W. (ed.) Dagstuhl seminar on Multimodality. http://www.dfki.de/~wahlster/Dagstuhl_Multi_Modality/