The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information
Gerhard Rigoll
Munich University of Technology, Institute for Human-Machine Communication
Munich, Germany
rigoll@ei.tum.de

General project dates
ALERT: alert system for selective dissemination of multimedia information
• Official start: 01/2000; start of work: 03/2000; duration: 30 months
• Manpower effort: ~30 MY (man-years), corresponding to a budget of ~1.6 million euro in EC funding
• Web site: http://alert.uni-duisburg.de

Media information flooding
[Figure labels: NEWS, Internet, supervision by information brokers]

Media monitoring in the ALERT project
[Figure labels: NEWS, Internet; information (sound, video, text); transcription; topic detection; today's headlines ... TAXES; ALERT MESSAGE]

General project objectives
• To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using
  • advanced speech recognition
  • video processing techniques
  • automatic topic detection algorithms
• The demonstrator shall
  • alert a user about the existence of requested information
  • send detailed information (on the client's further request):
    • extracted text
    • annotated audio/video data and video clips
  • provide functionality in French, German and Portuguese
• The demo system will be evaluated mainly by industrial partners

The ALERT consortium
[Figure labels: integration, technologies, users]

WP structure (WP0-WP4)
[Work-package chart; legend: deliverable, milestone, today]

WP structure (WP5-WP7)
[Work-package chart; legend: deliverable, milestone, today]

Collection of pilot corpus
• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  • Minimum amount: 5 hours
  • Type of data: video, audio and annotation
  • Video format: MPEG-1
  • Audio format: linear PCM, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  • Annotation based on LDC guidelines
  • Thematic orientation: news and interview shows

Collection of final databases
• Experimental results
  • recommendations for final corpus
  • quality: MP3, 32 kbps, 16 kHz, mono
• Minimum amount:
  • speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  • topic detection: 300 hours, topic-annotated
  • text corpus: 100 million words
• Full data set:
  • 1300 hours, word- or topic-annotated
  • > 10k topic-annotated summaries in German
  • text corpus: > 1 billion words

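As a rough check on what these formats imply for storage, the sketch below compares the uncompressed size of the pilot corpus audio (16 kHz, 16 bits/sample, mono PCM) with the full data set at 32 kbps MP3. The function names are illustrative, a constant bit rate is assumed for the MP3 data, and video and container overhead are ignored.

```python
def pcm_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed linear PCM audio."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * (bits_per_sample // 8) * channels


def cbr_bytes(hours, bitrate_kbps=32):
    """Size in bytes of a constant-bit-rate stream (1 kbps = 1000 bit/s)."""
    seconds = hours * 3600
    return seconds * bitrate_kbps * 1000 // 8


pilot_pcm = pcm_bytes(5)      # pilot corpus minimum: 5 hours of PCM audio
full_mp3 = cbr_bytes(1300)    # full data set: 1300 hours at 32 kbps

print(f"pilot corpus, PCM: {pilot_pcm / 1e6:.0f} MB")
print(f"full data set, MP3: {full_mp3 / 1e9:.2f} GB")
```

At these settings the PCM stream runs at 256 kbps, so the 32 kbps MP3 setting is an eightfold reduction in audio data rate.
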
Multimedia data labeling and alert generation
[Flow diagram; recoverable labels:]
• multimedia document
• if video contained: video/image processing, segmentation, video-based speech processing
• if audio contained: transcription (best hypothesis, word graph), segmentation
• if text contained: keywords
• automatic topic detection: topic
• match topics found against user profiles; alert specific users
• multimedia document database, label database

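The final step in the diagram, matching the topics found against user profiles, can be sketched as a set intersection. This is a minimal illustration, not the project's implementation: `UserProfile`, `match_alerts`, the user names and the topic names are all hypothetical, and case-insensitive exact matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical user profile: a user name and the topics of interest."""
    user: str
    topics: set = field(default_factory=set)


def match_alerts(detected_topics, profiles):
    """Return {user: sorted matching topics} for users with at least one match."""
    detected = {t.lower() for t in detected_topics}
    alerts = {}
    for profile in profiles:
        hits = detected & {t.lower() for t in profile.topics}
        if hits:
            alerts[profile.user] = sorted(hits)
    return alerts


profiles = [
    UserProfile("broker_a", {"taxes", "elections"}),
    UserProfile("broker_b", {"sports"}),
]
# Topics assigned to one segment by the topic detector (made-up example).
alerts = match_alerts(["Taxes", "weather"], profiles)
print(alerts)  # {'broker_a': ['taxes']}
```
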
Basic principle of video segmentation
Stochastic video model (based on HMMs) [figure not reproduced]

Topic segmentation
Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %

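The two relations above can be made concrete with boundary counts. The sketch below assumes that the insertion rate is measured against the detected boundaries and the deletion rate against the reference boundaries; the slide does not spell out these definitions, and the counts used are made up for illustration.

```python
def boundary_scores(n_reference, n_detected, n_correct):
    """Precision and recall for detected topic boundaries.

    Assumed definitions (not stated on the slide):
      insertion rate = falsely inserted boundaries / detected boundaries
      deletion rate  = missed boundaries / reference boundaries
    """
    if n_reference <= 0 or n_detected <= 0:
        raise ValueError("need at least one reference and one detected boundary")
    insertion_rate = (n_detected - n_correct) / n_detected
    deletion_rate = (n_reference - n_correct) / n_reference
    return {
        "insertion_rate": insertion_rate,
        "deletion_rate": deletion_rate,
        "precision": 1.0 - insertion_rate,
        "recall": 1.0 - deletion_rate,
    }


# Made-up counts for illustration only; the slide reports rates, not counts.
scores = boundary_scores(n_reference=100, n_detected=90, n_correct=80)
print({name: round(value, 3) for name, value in scores.items()})
```

Under these definitions the reported figures correspond to an insertion rate of 11.8 % and a deletion rate of 17.8 %.
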
French BN speech recognizer
• continuous density HMM system
• 33 phones + 3 non-speech units (silence, filler words, breath)
• ~20 % WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours of acoustic training data, 350 million words of text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bigrams, 15M trigrams, 13M four-grams

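The WER figures quoted for the recognizers are word error rates. A minimal sketch of the usual computation (word-level edit distance divided by the number of reference words) is given below; whitespace tokenization and the example sentences are assumptions, and the scoring tool actually used in the project is not named in the slides.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("was" -> "is") and one deletion ("today") in six words.
wer = word_error_rate("the tax reform was announced today",
                      "the tax reform is announced")
print(f"WER = {wer:.1%}")  # WER = 33.3%
```
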
Portuguese BN speech recognizer
• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bigrams, 12M trigrams, 13M four-grams
• Trained on 13 h of BN data
• Results:
  • 15xRT: F0: ~20 %, all F: ~40 %

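The combination by product of posterior probabilities can be sketched for a single frame as follows. Only the product rule comes from the slide; working in the log domain, flooring small probabilities and renormalizing the result are assumptions made for this illustration, as are the example values.

```python
import math


def combine_posteriors(streams, floor=1e-12):
    """Combine per-model phone posteriors for one frame by their product.

    streams: list of posterior vectors, one per acoustic model, all over
    the same phone classes. The product is renormalized to sum to 1.
    """
    n_classes = len(streams[0])
    if any(len(p) != n_classes for p in streams):
        raise ValueError("all models must score the same phone classes")
    # Sum of logs equals the log of the product; the floor avoids log(0).
    log_scores = [sum(math.log(max(p[k], floor)) for p in streams)
                  for k in range(n_classes)]
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Made-up posteriors from two models over three phone classes.
model_a = [0.7, 0.2, 0.1]
model_b = [0.5, 0.4, 0.1]
combined = combine_posteriors([model_a, model_b])
print([round(p, 3) for p in combined])  # [0.795, 0.182, 0.023]
```
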
German BN speech recognizer
• continuous density HMM system
• 50 phones + 17 non-speech units (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model: 8M bigrams, 16M trigrams

Evolution of the German system

System                                             Phone models   #Mixtures   WER
baseline German system, 100k, spontaneous speech   triphones      31 780      ~30 %
baseline, not trained on broadcast data            triphones      31 780      79.7 %
baseline with broadcast language model             triphones      31 780      72.3 %
acoustic models trained on broadcast data          monophones     1 722       54.3 %
acoustic models optimized on broadcast data        triphones      96 417      22.8 %

Automatic topic detection
• Objectives:
  • to divide audio/video streams automatically into topic-specific homogeneous segments
  • to assign requested topics automatically to distinct segments
• Test set:
  • 22 topics in 2956 training and 1284 test texts
  • deletion of 150 stop words
  • no stemming performed

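The preprocessing described for the test set (stop-word deletion, no stemming) can be sketched as below. The stop list shown is a tiny illustrative stand-in, since the 150-word list is not given in the slides; lowercasing and the regular-expression tokenizer are also assumptions.

```python
import re
from collections import Counter

# Illustrative stand-in; the actual 150-word stop list is not given.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def term_counts(text, stop_words=STOP_WORDS):
    """Count surface word forms after stop-word deletion (no stemming)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


counts = term_counts("The government raises taxes, and the taxes affect the economy.")
print(counts.most_common(1))  # [('taxes', 2)]
```

Because no stemming is applied, inflected forms such as "raises" are kept as they appear and counted separately from "raise".
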
New approach to topic detection
[Figure labels: example text "This is a text containing important topics."; binary vector [00...0100...0]; word probabilities p(w1), p(w2), p(w3), ...; MMI neural net; VQ label]

Results for clean text
[Chart: comparison of feature quantization with k-means clustering and MMI neural net]
[Chart: comparison of new approach and standard system]

Partially corrupted text
Results with partially corrupted texts:
• some words are fragmented
• similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming

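One way to produce text in which some words are fragmented is to split a random subset of words in two. The slides do not say how the corrupted texts were actually generated, so the sketch below is purely illustrative: `fragment_words`, the splitting rule and the `rate` parameter are hypothetical.

```python
import random


def fragment_words(text, rate=0.3, seed=0):
    """Split roughly `rate` of the words (4+ characters) into two fragments."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    out = []
    for word in text.split():
        if len(word) >= 4 and rng.random() < rate:
            cut = rng.randint(2, len(word) - 2)  # keep at least 2 characters per side
            out.extend([word[:cut], word[cut:]])
        else:
            out.append(word)
    return " ".join(out)


sample = "automatic topic detection"
print(fragment_words(sample, rate=1.0))  # every word is split into two fragments
```
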
Results for corrupted text
[Charts: results for 22 topics and for 173 topics]

Publications
• ICASSP 2001 (7/2001)
  • LIMSI: Automatic transcription of compressed broadcast audio
  • GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval
• TREC-9 (11/2000)
  • LIMSI: The LIMSI SDR system for TREC-9
• argus press (11/2000)
  • Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT [Observer Argus Media takes part in the EU research project ALERT]
• ICSLP 2000 (10/2000)
  • GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
  • INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
  • INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

Publications (II)
• ICSLP 2000 (10/2000)
  • LIMSI: Fast decoding for indexation of broadcast data
  • LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription
• ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
  • INESC: Topic Detection in Read Documents
• ASR 2000 (9/2000)
  • INESC: A Decoder for Finite-State Structured Search Spaces
• ICASSP 2000 (6/2000)
  • GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

Presentations
• Schaufenster der Wissenschaft (3/2001)
  • GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten [Information from radio, television and the Internet: automatic topic detection in multimedia data]
• Euromap Informationstag (12/2000)
  • GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information [The ALERT project]
• IV Jornadas de Arquivo e Documentação (10/2000)
  • INESC: Speech recognition and topic detection applied to alert systems for broadcast news
• ASR 2000 (9/2000)
  • GMUD: ALERT System for Selective Dissemination of Multimedia Information
• Homme Technologie et Systèmes Complexes (6/2000)
  • VECSYS: Parlez Naturellement, la Machine Vous Comprend [Speak naturally, the machine understands you]
• RIAO'2000 Content-based Multimedia Information Access (4/2000)
  • VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

Outlook
• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms for unknown topics (confidence for topics)
• detection of new topics
• summarization
  • scalable summarization
  • topic-dependent summarization