
Spoken Language Technologies: A review of application areas and research issues
Analysis and synthesis of F0 contours
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań
Humboldt-Kolleg, Słubice, 13.-15. November 2008

Introduction
The need for and increasing interest in spoken language technology (SLT) systems:
• oral information is more efficient than a written message
• speech is the easiest and fastest way of communication (man-man, man-machine)
Progress in the field:
• technological advances in computer science
• availability of specialized speech analysis and processing tools
• collection and management of large speech corpora
• investigation of acoustic dimensions of speech signals: fundamental frequency (F0), duration, intensity and spectral characteristics

The tasks of SLT systems (TTS and ASR)
Speech synthesis (TTS, text-to-speech) systems
• generate a speech signal for a given input text
• example: BOSS (Polish module developed at the Dept. of Phonetics in cooperation with IKP, Uni Bonn)
• ECESS (European Centre of Excellence in Speech Synthesis): standards for the development of language resources, tools, modules and systems
Automatic speech recognition (ASR) systems
• provide the text of the input speech signal
• example: Jurisdic (first Polish ASR system for the needs of the Police, Public Prosecutors and Administration of Justice)

Application areas
Speech synthesis
• telecommunications (access to textual information over the telephone)
• information retrieval
• measurement and control systems
• fundamental & applied research on speech and language
• a tool of communication, e.g. for the visually impaired
Speech recognition & related technologies
• text dictation
• information retrieval & management
• man-machine communication (together with speech synthesis):
  - dialogue systems
  - speech-to-speech translation
  - Computer Assisted Language Learning, CALL (e.g. the AZAR tutoring system developed within the EURONOUNCE project)

Performance
Generally, the output quality is high as regards generation/recognition of the linguistic propositional content of speech.
Speech synthesis
• high intelligibility and naturalness in limited domains (e.g. broadcast news)
Speech recognition
• the best results are obtained for small-vocabulary tasks
• state-of-the-art speaker-independent large-vocabulary continuous speech recognition (LVCSR) systems achieve a word-error rate of 3%
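The word-error rate quoted above is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words, so a rate of 3% corresponds to about 3 word errors per 100 reference words. The following is a minimal sketch of that standard definition; the function name and the whitespace tokenization are assumptions made for illustration, not part of any system mentioned in the talk.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, the rate can exceed 1.0 when the hypothesis is much longer than the reference.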

Limitations
• insufficient knowledge about methods for processing the non-verbal content of speech, i.e. affective information: speaker’s attitude, emotional state, mood, interpersonal stances & personality traits
Speech synthesis
• lack of variability in speaking style, which encodes affective information, can be detrimental to communication (e.g. in speech-to-speech translation)
• the data-driven approach to conversational, expressive speech synthesis is inflexible and quite costly
Speech recognition
• transcription of conversational and expressive speech yields a substantially higher word-error rate

Progress (1)
• the need to model the non-verbal content of speech, i.e. affective information
Applications:
• high-quality conversational and emotional speech synthesis (for dialogue or speech-to-speech translation systems)
• commerce: monitoring of agent-customer interactions, information retrieval and management (e.g. QA5)
• public security, criminology: secured area access control (speaker verification), truth-detection investigation (e.g. Computer Voice Stress Analyzer, Layered Voice Analysis)

Progress (2)
Prosodic features: fundamental frequency (F0, the central acoustic variable that underlies intonation), intensity, duration and voice quality -> encoding and decoding of affective information
Emotion: Anger, Fear, Elation
• higher mean F0
• higher F0 variability
• higher intensity
• increased speaking rate
Emotion: Sadness, Boredom
• lower mean F0
• lower F0 variability
• lower intensity
• decreased speaking rate
Intonation models:
• hierarchical, sequential, acoustic-phonetic, phonological, etc.
• linguistic variation: well handled
• affective, emotional variation: unaccounted for
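The four prosodic correlates listed for the two emotion groups can be summarized numerically and compared against a neutral baseline from the same speaker. The sketch below is only an illustration of that comparison, not a classifier from the talk: the function names, the use of 0 for unvoiced frames, the plain averaging of dB values, and the syllables-per-second rate measure are all assumptions.

```python
from statistics import mean, pstdev


def prosodic_profile(f0_hz, intensity_db, n_syllables, duration_s):
    """Summarize an utterance by the four features named on the slide.

    Assumptions of this sketch: unvoiced frames are marked as 0 (or None)
    in f0_hz, intensity is averaged directly over dB values, and speaking
    rate is measured in syllables per second.
    """
    voiced = [f for f in f0_hz if f and f > 0]
    if not voiced or not intensity_db or duration_s <= 0:
        raise ValueError("need voiced F0 frames, intensity values and a positive duration")
    return {
        "mean_f0": mean(voiced),
        "f0_variability": pstdev(voiced),
        "intensity": mean(intensity_db),
        "speaking_rate": n_syllables / duration_s,
    }


def direction_of_change(profile, neutral):
    """Return +1 (higher), -1 (lower) or 0 (equal) for each feature."""
    return {k: (profile[k] > neutral[k]) - (profile[k] < neutral[k]) for k in profile}


def matching_pattern(profile, neutral):
    """Name the slide's emotion group whose pattern fits all four features."""
    signs = set(direction_of_change(profile, neutral).values())
    if signs == {1}:
        return "anger / fear / elation"
    if signs == {-1}:
        return "sadness / boredom"
    return None  # mixed or unchanged features: neither pattern applies


if __name__ == "__main__":
    # Hypothetical values, invented for illustration only.
    neutral = prosodic_profile([120, 0, 125, 130, 0, 118], [60, 61, 59], 10, 2.5)
    excited = prosodic_profile([180, 220, 0, 260, 200], [68, 70, 69], 12, 2.0)
    print(matching_pattern(excited, neutral))
```

Note that the features alone do not separate the emotions within a group (anger and elation share the same direction on all four), which is consistent with the slide's point that affective variation is not yet accounted for by intonation models.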

The comprehensive intonation model: Components
[Diagram: analysis (encoding), intonation description, F0 generation (decoding)]
• a module of F0 contour analysis
• a module of F0 contour synthesis
• description of intonation
  - discrete tonal categories (higher-level, access to the meaning of the utterance)
  - acoustic parameters (low-level)
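The encode/describe/decode arrangement of the components can be illustrated with a toy round trip. This is a deliberately simplified sketch and not the model presented in the talk: the category labels ("rise", "fall", "level"), the two-target description and the linear interpolation are all hypothetical choices made only to show how a description can carry both a discrete category and low-level acoustic parameters.

```python
from dataclasses import dataclass


@dataclass
class IntonationEvent:
    category: str    # discrete tonal category (higher-level); labels are hypothetical
    start_hz: float  # acoustic parameters (low-level)
    end_hz: float
    n_frames: int


def analyse(f0_hz):
    """Encode: reduce a voiced F0 stretch to a category plus two F0 targets."""
    if len(f0_hz) < 2:
        raise ValueError("need at least two F0 frames")
    start, end = f0_hz[0], f0_hz[-1]
    if end > start:
        category = "rise"
    elif end < start:
        category = "fall"
    else:
        category = "level"
    return IntonationEvent(category, start, end, len(f0_hz))


def synthesise(event):
    """Decode: regenerate a contour by linear interpolation between the targets."""
    if event.n_frames < 2:
        raise ValueError("need at least two frames to interpolate")
    step = (event.end_hz - event.start_hz) / (event.n_frames - 1)
    return [event.start_hz + i * step for i in range(event.n_frames)]


if __name__ == "__main__":
    event = analyse([120.0, 130.0, 150.0, 160.0])
    print(event)
    print(synthesise(event))
```

The regenerated contour keeps the end points and the category but loses the detail in between, which is the usual trade-off between a compact description and a faithful reconstruction.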

The comprehensive intonation model: Analysis and Synthesis
Automatic analysis of F0 contours: Summary
• results comparable to inter-labeler consistency in manual annotation of intonation
• high accuracy achieved using small vectors of acoustic features
• statistical modeling techniques
• application: 1) automatic labeling of speech corpora, 2) lexical & semantic content, 3) ambiguous parses, 4) estimation of F0 targets
Automatic synthesis of F0 contours: Summary
• estimation of F0 values with a regression model
• results comparable to those reported in the literature
• natural F0 contours (similar to the original ones) for synthesis of high-quality, comprehensible speech (confirmed in perception tests)
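The idea of estimating F0 values with a regression model can be sketched with ordinary least squares on a single predictor. The slides do not specify the features or the type of regression used, so everything concrete here is an assumption: the predictor (relative position of a syllable in the phrase, from 0 to 1), the data values, and the function names are hypothetical, and the root-mean-square error is just one common way to compare predicted and observed F0 values.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Estimate an F0 value (Hz) for predictor value x."""
    slope, intercept = model
    return slope * x + intercept


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed F0 values."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


if __name__ == "__main__":
    # Hypothetical training data: F0 targets falling over the phrase.
    positions = [0.0, 0.25, 0.5, 0.75, 1.0]
    f0_targets_hz = [200.0, 190.0, 180.0, 170.0, 160.0]
    model = fit_line(positions, f0_targets_hz)
    estimates = [predict(model, x) for x in positions]
    print(model, rmse(estimates, f0_targets_hz))
```

A realistic model would use several predictors (for example, the tonal categories and positional features produced by the analysis module), but the fitting and evaluation steps follow the same pattern.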

The comprehensive intonation model: Synthesis example (1)
Audio (1): mean opinion in the perception test: no audible difference

The comprehensive intonation model: Synthesis example (2)
Audio (2): mean opinion in the perception test: very good quality

Future research
Extensive and systematic investigation of the mechanisms of voice production and perception of affective speech:
• contribution from other knowledge domains (psychology)
• affective speech data collection
• classification of affective states
• types of acoustic parameters
• measurement of affective inferences

THANK YOU FOR YOUR ATTENTION!
