Gerhard Rigoll Munich University of Technology Institute for Human-Machine Communication

The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information Gerhard Rigoll Munich University of Technology Institute for Human-Machine Communication Munich, Germany rigoll@ei.tum.de

General project dates
ALERT: Alert system for selective dissemination of multimedia information
• Official start: 01/2000, start of work: 03/2000, duration: 30 months
• Manpower effort: ~30 man-years (MY), corresponding to a budget of ~1.6 million euro EC funding
• Web site: http://alert.uni-duisburg.de

Media information flooding
[Figure: news and Internet sources; supervision by information brokers]

Media monitoring in the ALERT project
[Figure: news and Internet sources deliver information (sound, video, text); transcription and topic detection; example: today's headlines ... TAXES, ALERT MESSAGE]

General project objectives
• To develop a demo system capable of identifying specific information in multimedia data, consisting of
  • text,
  • audio and
  • video streams
• using
  • advanced speech recognition
  • video processing techniques
  • automatic topic detection algorithms
• The demonstrator shall
  • alert a user about the existence of requested information
  • send detailed information (on the client's further request)
    • extracted text
    • annotated audio/video data and video clips
  • provide functionality in French, German and Portuguese
• The demo system will be evaluated mainly by industrial partners

The ALERT consortium
[Diagram: consortium partners grouped under integration, technologies and users]

WP structure (WP0-WP4)
[Chart legend: deliverable, milestone, today]

WP structure (WP5-WP7)
[Chart legend: deliverable, milestone, today]

Collection of pilot corpus
• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  • Minimum amount: 5 hours
  • Type of data: video, audio and annotation
  • Video format: MPEG-1
  • Audio format: linear PCM, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  • Annotation based on LDC guidelines
  • Thematic orientation: news and interview shows

Collection of final databases
• Experimental results
  • recommendations for final corpus
  • quality: MP3, 32 kbps, 16 kHz, mono
• Minimum amount:
  • speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  • topic detection: 300 hours, topic-annotated
  • text corpus: 100 million words
• Full data set:
  • 1300 hours, word- or topic-annotated
  • > 10k topic-annotated summaries in German
  • text corpus: > 1 billion words
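To put the two audio formats side by side, the storage they imply can be worked out from the bitrates given on the corpus slides. The sketch below is only arithmetic; it assumes constant bitrates, 1 kbps = 1000 bit/s, and ignores container overhead. The function name is illustrative.

```python
def stream_size_bytes(bitrate_bps: float, hours: float) -> float:
    """Size in bytes of a constant-bitrate stream of the given duration."""
    return bitrate_bps * hours * 3600 / 8


# Pilot corpus audio: linear PCM, 16 kHz, 16 bits/sample, mono.
PCM_BITRATE = 16_000 * 16 * 1  # 256,000 bit/s

# Recommendation for the final corpus: MP3 at 32 kbps (assumed 32,000 bit/s).
MP3_BITRATE = 32_000

pilot_pcm_mb = stream_size_bytes(PCM_BITRATE, 5) / 1e6      # 5 hours of pilot audio
full_pcm_gb = stream_size_bytes(PCM_BITRATE, 1300) / 1e9    # full data set as PCM
full_mp3_gb = stream_size_bytes(MP3_BITRATE, 1300) / 1e9    # full data set as MP3

print(f"pilot corpus, PCM: {pilot_pcm_mb:.0f} MB")
print(f"full data set, PCM: {full_pcm_gb:.2f} GB")
print(f"full data set, MP3: {full_mp3_gb:.2f} GB")
print(f"compression factor: {PCM_BITRATE / MP3_BITRATE:.0f}x")
```

Under these assumptions the MP3 recommendation reduces audio storage by a factor of eight relative to the pilot PCM format.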

comparison of coding schemes for broadcast speech databases

Multimedia data labeling and alert generation
[Flow diagram elements: multimedia document; if video contained: video/image processing, video-based segmentation; if audio contained: speech processing, transcription (best hypothesis, word graph), segmentation; if text contained: automatic topic detection (topic keywords); match topics found against user profiles; alert specific users; multimedia document database; label database]
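The final step of this flow, matching the topics found against user profiles, can be sketched as a set intersection. This is a minimal illustration, not the project's implementation: the `Segment` and `UserProfile` types, their fields and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A labeled stretch of a multimedia document (hypothetical structure)."""
    doc_id: str
    start_s: float
    end_s: float
    topics: frozenset


@dataclass
class UserProfile:
    """Topics a user has asked to be alerted about (hypothetical structure)."""
    user: str
    topics: frozenset


def generate_alerts(segments, profiles):
    """Return one alert tuple per (user, segment) pair with shared topics."""
    alerts = []
    for seg in segments:
        for profile in profiles:
            hits = seg.topics & profile.topics
            if hits:
                alerts.append(
                    (profile.user, seg.doc_id, seg.start_s, seg.end_s, sorted(hits))
                )
    return alerts


segments = [
    Segment("news-001", 0.0, 95.0, frozenset({"taxes", "politics"})),
    Segment("news-001", 95.0, 180.0, frozenset({"sports"})),
]
profiles = [
    UserProfile("alice", frozenset({"taxes"})),
    UserProfile("bob", frozenset({"weather"})),
]

for alert in generate_alerts(segments, profiles):
    print(alert)
```

A real system would presumably also use topic confidences before alerting; the sketch treats topic labels as exact matches.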

Basic principle of video-segmentation Stochastic Video-Model (based on HMMs):
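As an illustration of how an HMM can segment a sequence of video shots, the sketch below runs standard Viterbi decoding over two content classes. Everything specific in it is an assumption made for the example: the states ("anchor", "report"), the discretised observations ("studio", "outdoor") and all probabilities are invented, and the slide does not describe the model actually used in ALERT.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations (log-domain Viterbi)."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-class model of a news programme.
states = ["anchor", "report"]
start_p = {"anchor": 0.5, "report": 0.5}
trans_p = {"anchor": {"anchor": 0.7, "report": 0.3},
           "report": {"anchor": 0.3, "report": 0.7}}
emit_p = {"anchor": {"studio": 0.9, "outdoor": 0.1},
          "report": {"studio": 0.2, "outdoor": 0.8}}

shots = ["studio", "studio", "outdoor", "outdoor", "studio"]
labels = viterbi(shots, states, start_p, trans_p, emit_p)
print(labels)

# One possible convention: a topic boundary wherever the decoded path
# returns to the anchor state.
boundaries = [i for i in range(1, len(labels))
              if labels[i] == "anchor" and labels[i - 1] != "anchor"]
print(boundaries)
```

Decoding the whole sequence at once lets the transition probabilities smooth over isolated misclassified shots, which is the usual motivation for a stochastic model over per-shot classification.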

Result of video-based segmentation

Combined video-audio-segmentation

Topic segmentation
Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %
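The two rates above can be computed from detected and reference boundary times as in the following sketch. The matching rule is an assumption: a detection counts as correct if it lies within a tolerance of an unmatched reference boundary, insertions are normalised by the number of detections and deletions by the number of reference boundaries. The slide does not state the tolerance or the normalisation used in the project, and the sample times are invented.

```python
def count_hits(detected, reference, tolerance_s=2.0):
    """Pair each reference boundary with the closest unused detection in range."""
    unused = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        candidates = [d for d in unused if abs(d - ref) <= tolerance_s]
        if candidates:
            unused.remove(min(candidates, key=lambda d: abs(d - ref)))
            hits += 1
    return hits


def precision_recall(detected, reference, tolerance_s=2.0):
    """Precision = 1 - insertion rate, recall = 1 - deletion rate."""
    if not detected or not reference:
        raise ValueError("need at least one detected and one reference boundary")
    hits = count_hits(detected, reference, tolerance_s)
    insertion_rate = (len(detected) - hits) / len(detected)
    deletion_rate = (len(reference) - hits) / len(reference)
    return 1 - insertion_rate, 1 - deletion_rate


# Invented boundary times in seconds.
detected = [10.0, 61.0, 130.0, 200.0]
reference = [10.5, 60.0, 128.0, 250.0, 300.0]
precision, recall = precision_recall(detected, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```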

French BN speech recognizer
• continuous density HMM system
• 33 phones + 3 non-speech (silence, filler words, breath)
• ~20 % WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours of acoustic training data, 350 million words of text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bi-, 15M tri-, 13M four-grams

Portuguese BN speech recognizer
• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bi-, 12M tri-, 13M four-grams
• Trained on 13 h of BN data
• Results:
  • 15xRT: F0: ~20 %, all F: ~40 %
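Combining acoustic models by a product of posterior probabilities can be sketched per frame as below. This is a generic illustration under the assumption that every model outputs a posterior distribution over the same phone set; the numbers are invented, and details of the AUDIMUS combination (weights, log-domain computation, priors) are not given on the slide.

```python
def combine_posteriors(posteriors_per_model):
    """Multiply per-phone posteriors from several models and renormalise."""
    n_phones = len(posteriors_per_model[0])
    product = [1.0] * n_phones
    for posteriors in posteriors_per_model:
        if len(posteriors) != n_phones:
            raise ValueError("all models must score the same phone set")
        for i, p in enumerate(posteriors):
            product[i] *= p
    total = sum(product)
    return [p / total for p in product]


# Invented posteriors for one frame over three phones from two models.
model_a = [0.6, 0.3, 0.1]
model_b = [0.5, 0.4, 0.1]
combined = combine_posteriors([model_a, model_b])
print([round(p, 3) for p in combined])
```

The product rewards phones on which the models agree; in practice such products are usually accumulated as sums of log probabilities to avoid underflow.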

German Baseline Speech Recognition System

German BN speech recognizer
• continuous density HMM system
• 50 phones + 17 non-speech (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model, 8M bi-, 16M trigrams

Evolution of the German system

system                                            phone models   #mixtures   WER
baseline German system, 100k, spontaneous speech  triphones      31 780      ~30 %
baseline, not trained on broadcast data           triphones      31 780      79.7 %
baseline with broadcast language model            triphones      31 780      72.3 %
acoustic models trained on broadcast data         monophones     1 722       54.3 %
acoustic models optimized on broadcast data       triphones      96 417      22.8 %
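The WER figures in this table follow the usual definition: substitutions, deletions and insertions divided by the number of reference words. A minimal sketch of that computation using word-level edit distance is given below; whitespace tokenisation and the example sentences are assumptions, and scoring tools used in evaluations typically add text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Invented example: one substitution and one insertion against four words.
wer = word_error_rate("taxes will be cut", "tax will not be cut")
print(f"WER = {wer:.1%}")
```

Because insertions are counted, WER can exceed 100 % when the hypothesis is much longer than the reference.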

Examples for German transcription results

Automatic topic detection
• Objectives:
  • to divide audio/video streams automatically into topic-specific homogeneous segments
  • automatic assignment of requested topics to distinct segments
• Test set:
  • 22 topics in 2956 training and 1284 test texts
  • deletion of 150 stop words
  • no stemming performed
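The text preparation described for the test set (stop-word deletion, no stemming) might look like the sketch below. The stop-word list here is a tiny illustrative stand-in for the 150 words mentioned on the slide, and lowercasing and the tokenisation pattern are assumptions.

```python
import re
from collections import Counter

# Illustrative stand-in; the project's 150-word stop list is not given.
STOP_WORDS = frozenset({"a", "an", "the", "is", "this", "of", "and", "to", "in"})


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenise on word characters, drop stop words; no stemming."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


def term_counts(text, stop_words=STOP_WORDS):
    """Bag-of-words counts over the remaining (unstemmed) word forms."""
    return Counter(preprocess(text, stop_words))


tokens = preprocess("This is a text containing important topics.")
print(tokens)
print(term_counts("Taxes and more taxes: the tax debate"))
```

Without stemming, inflected forms such as "tax" and "taxes" stay separate features, which keeps the pipeline simple at the cost of a larger vocabulary.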

New approach to topic detection
[Diagram elements: example text "This is a text containing important topics."; binary word vector [00.....0100....0]; word probabilities p(w1), p(w2), p(w3), ...; MMI neural net; VQ label]

Results for clean text
[Figures: comparison of feature quantization with k-means clustering and MMI neural net; comparison of new approach and standard system]

Partially corrupted text
Results with partially corrupted texts:
• some words are fragmented
• similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming

Results for corrupted text
[Figure panels: 22 topics; 173 topics]

Demonstrator specification (details)

Publications
• ICASSP 2001 (7/2001)
  • LIMSI: Automatic transcription of compressed broadcast audio
  • GMUD: New approaches to audiovisual segmentation of TV news for automatic topic retrieval
• TREC-9 (11/2000)
  • LIMSI: The LIMSI SDR system for TREC-9
• argus press (11/2000)
  • Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT [Observer Argus Media participates in the EU research project ALERT]
• ICSLP 2000 (10/2000)
  • GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
  • INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
  • INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

Publications (II)
• ICSLP 2000 (10/2000)
  • LIMSI: Fast decoding for indexation of broadcast data
  • LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription
• ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
  • INESC: Topic Detection in Read Documents
• ASR 2000 (9/2000)
  • INESC: A Decoder for Finite-State Structured Search Spaces
• ICASSP 2000 (6/2000)
  • GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

Presentations
• Schaufenster der Wissenschaft (3/2001)
  • GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten [Information from radio, television and the Internet: automatic topic detection in multimedia data]
• Euromap Informationstag (12/2000)
  • GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information [The ALERT project]
• IV Jornadas de Arquivo e Documentação (10/2000)
  • INESC: Speech recognition and topic detection applied to alert systems for broadcast news
• ASR 2000 (9/2000)
  • GMUD: ALERT System for Selective Dissemination of Multimedia Information
• Homme Technologie et Systèmes Complexes (6/2000)
  • VECSYS: Parlez Naturellement, la Machine Vous Comprend [Speak naturally, the machine understands you]
• RIAO'2000 Content-based Multimedia Information Access (4/2000)
  • VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

Outlook
• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms of unknown topics (confidence for topics)
• detection of new topics
• summarization
  • scalable summarization
  • topic-dependent summarization
