Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop
July 13, 2005
Royal College of Physicians, Edinburgh, UK
Today’s Agenda (updated: July 5, 2005)
• Meeting venue cleared at 18:00
Administrative Points
• Participants:
  • Pick up the hard copy proceedings at the front desk
• Presenters:
  • The agenda will be strictly followed
  • Time slots include Q&A time
  • Presenters should either:
    • Load their presentations on the computer at the front, or
    • Test their laptops during the breaks prior to making their presentation
• We’d like to thank:
  • The MLMI-05 organizing committee for hosting this workshop
  • Caroline Hastings for the workshop’s administration
  • All the volunteers: evaluation participants, data providers, transcribers, annotators, paper authors, presenters, and other contributors
The Rich Transcription 2005 Spring Meeting Recognition Evaluation
http://www.nist.gov/speech/tests/rt/rt2005/spring/
Jonathan Fiscus, Nicolas Radde, John Garofolo, Audrey Le, Jerome Ajot, Christophe Laprun
July 13, 2005
Rich Transcription 2005 Spring Meeting Recognition Workshop at MLMI 2005
Overview
• Rich Transcription Evaluation Series
• Research opportunities in the Meeting Domain
• RT-05S Evaluation
  • Audio input conditions
  • Corpora
  • Evaluation tasks and results
• Conclusion/Future
The Rich Transcription Task: Multiple Applications
(Diagram: component recognition technologies turn human-to-human speech into Rich Transcription, i.e. Speech-To-Text plus metadata, which feeds applications such as readable transcripts, smart meeting rooms, translation, extraction, retrieval, and summarization)
Rich Transcription Evaluation Series
• Goal:
  • Develop recognition technologies that produce transcripts which are understandable by humans and useful for downstream processes
• Domains:
  • Broadcast News (BN)
  • Conversational Telephone Speech (CTS)
  • Meeting Room speech
• Parameterized “Black Box” evaluations
  • Evaluations control input conditions to investigate weaknesses/strengths
  • Sub-test scoring provides finer-grained diagnostics
Research Opportunities in the Meeting Domain
• Provides a fertile environment to advance the state of the art in technologies for understanding human interaction
• Many potential applications
  • Meeting archives, interactive meeting rooms, remote collaborative systems
• Important Human Language Technology challenges not posed by other domains
  • Varied forums and vocabularies
  • Highly interactive and overlapping spontaneous speech
  • Far-field speech effects
    • Ambient noise
    • Reverberation
    • Participant movement
  • Varied room configurations
  • Many microphone conditions
  • Many camera views
  • Multimedia information integration
  • Person, face, and head detection/tracking
RT-05S Evaluation Tasks
• Focus on core speech technologies
  • Speech-to-Text Transcription
  • Diarization “Who Spoke When”
  • Diarization “Speech Activity Detection”
  • Diarization “Source Localization”
Five System Input Conditions
• Distant microphone conditions
  • Multiple Distant Microphones (MDM)
    • Three or more centrally located table mics
  • Multiple Source Localization Arrays (MSLA)
    • Inverted “T” topology, 4-channel digital microphone array
  • Multiple Mark III digital microphone Arrays (MM3A)
    • Linear topology, 64-channel digital microphone array
• Contrastive microphone conditions
  • Single Distant Microphone (SDM)
    • Center-most MDM microphone
    • Gauges the performance benefit of using multiple table mics
  • Individual Head Microphones (IHM)
    • Performance on clean speech
    • Similar to Conversational Telephone Speech: one speaker per channel, conversational speech
Training/Development Corpora
• Corpora provided at no cost to participants:
  • ICSI Meeting Corpus
  • ISL Meeting Corpus
  • NIST Meeting Pilot Corpus
  • Rich Transcription 2004 Spring (RT-04S) Development & Evaluation Data
  • Topic Detection and Tracking Phase 4 (TDT4) corpus
  • Fisher English conversational telephone speech corpus
  • CHIL development test set
  • AMI development test set and training set
• Thanks to ELDA and LDC for making this possible
RT-05S Evaluation Test Corpora: Conference Room Test Set
• Goal-oriented small conference room meetings
  • Group meetings and decision-making exercises
  • Meetings involved 4-10 participants
• 120 minutes – ten excerpts, each twelve minutes in duration
• Five sites donated two meetings each:
  • Augmented Multiparty Interaction (AMI) Program, Carnegie Mellon University (CMU), International Computer Science Institute (ICSI), NIST, and Virginia Tech (VT)
  • No VT data was available for system development
  • Similar test set construction was used for the RT-04S evaluation
• Microphones:
  • Participants wore head microphones
  • Microphones were placed on the table among participants
  • AMI meetings included an 8-channel circular microphone array on the table
  • NIST meetings included 3 Mark III digital microphone arrays
RT-05S Evaluation Test Corpora: Lecture Room Test Set
• Technical lectures in small meeting rooms
  • Educational events where a single lecturer is briefing an audience on a particular topic
  • Meeting excerpts involve one lecturer and up to five participating audience members
• 150 minutes – 29 excerpts from 16 lectures
• Two types of excerpts, selected by CMU:
  • Lecturer excerpts – 89 minutes, 17 excerpts
  • Question & Answer (Q&A) excerpts – 61 minutes, 12 excerpts
• All data collected at Karlsruhe University
• Sensors:
  • The lecturer and at most two other participants wore head microphones
  • Microphones were placed on the table among participants
  • A source localization array was mounted on each of the room’s four walls
  • A Mark III array was mounted on the wall opposite the lecturer
Diarization “Who Spoke When” (SPKR) Task
• Task definition:
  • Identify the number of participants in each meeting and create a list of speech time intervals for each such participant
• Several input conditions:
  • Primary: MDM
  • Contrast: SDM, MSLA
• Four participating sites: ICSI/SRI, ELISA, MQU, TNO
SPKR System Evaluation Method
• Primary metric:
  • Diarization Error Rate (DER) – the ratio of incorrectly detected speaker time to total speaker time (see the formula sketch below)
• System output speaker segment sets are mapped to reference speaker segment sets so as to minimize the total error
• Errors consist of:
  • Speaker assignment errors (i.e., detected speech not assigned to the right speaker)
  • False alarm detections
  • Missed detections
• Systems were scored using the md-eval tool
  • Forgiveness collar of +/- 250 ms around reference segment boundaries
  • DER on non-overlapping speech is the primary metric
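A minimal sketch of how these error categories combine, assuming the standard NIST DER definition; the md-eval tool additionally applies the scoring collar and the overlap restrictions noted above:

```latex
\mathrm{DER} \;=\;
\frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker error}}}
     {T_{\text{total reference speaker time}}}
```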
RT-05S SPKR Results: Primary Systems, Non-Overlapping Speech
• Conference room SDM DER is lower than MDM DER
  • A sign test indicates the differences are not significant (see the sketch below)
• The primary ICSI/SRI Lecture Room system attributed the entire duration of each test excerpt to a single speaker
  • The ICSI/SRI contrastive system had a lower DER
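The sign test mentioned above can be illustrated with a small sketch; the per-meeting DER pairs used in the evaluation are not reproduced in this text, so the numbers below are purely hypothetical:

```python
# Hypothetical illustration of a paired sign test: for each meeting, note which
# condition (e.g. SDM vs. MDM) has the lower DER, discard ties, and compute a
# two-sided binomial p-value under the null hypothesis that wins are equally likely.
from math import comb

def sign_test(pairs):
    wins_a = sum(1 for a, b in pairs if a < b)
    wins_b = sum(1 for a, b in pairs if b < a)
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# e.g. ten meetings where SDM wins on 6 and MDM wins on 4:
pairs = [(10.0, 12.0)] * 6 + [(12.0, 10.0)] * 4
print(sign_test(pairs))  # ~0.75, far from significance
```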
Lecture Room Results: Broken Down by Excerpt Type
• Lecturer excerpt DERs are lower than Q&A excerpt DERs
Historical Best System SPKR Performance on Conference Data
• 20% relative reduction for MDM
• 43% relative reduction for SDM
(see the note on relative reduction below)
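For clarity, a relative reduction of this kind is conventionally computed as the drop in DER divided by the earlier DER (presumably relative to the best prior RT-04S results, which are not reproduced in this text):

```latex
\text{relative reduction} \;=\;
\frac{\mathrm{DER}_{\text{previous}} - \mathrm{DER}_{\text{current}}}{\mathrm{DER}_{\text{previous}}}
```

For example, a drop from 25% to 20% DER would be a (25 - 20) / 25 = 20% relative reduction.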
Diarization “Speech Activity Detection” (SAD) Task
• Task definition:
  • Create a list of speech time intervals where at least one person is talking
• Dry run evaluation for RT-05S
  • Proposed by CHIL
• Several input conditions:
  • Primary: MDM
  • Contrast: SDM, MSLA, IHM
  • Systems designed for the IHM condition must detect speech while also rejecting cross-talk and breath noise; IHM systems are therefore not directly comparable to MDM or SDM systems
• Three participating sites: ELISA, Purdue, TNO
SAD System Evaluation Method
• Primary metric:
  • Diarization Error Rate (DER)
  • Same formula and software as used for the SPKR task
  • Reduced to a two-class problem: speech vs. non-speech
    • No speaker assignment errors, just false alarms and missed detections (see the sketch below)
  • Forgiveness collar of +/- 250 ms around reference segment boundaries
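A minimal frame-based sketch of this two-class scoring, assuming simple (start, end) interval lists in seconds; the actual scoring is done with md-eval, which also applies the +/- 250 ms collar:

```python
# Hypothetical sketch of two-class (speech / non-speech) DER: discretize time
# into small frames and count missed and falsely detected speech frames,
# normalized by the amount of reference speech.

def sad_der(ref_speech, sys_speech, duration, frame=0.01):
    """Return (miss, false alarm, DER) as fractions of reference speech time."""
    n = int(duration / frame)
    ref = [False] * n
    hyp = [False] * n
    for start, end in ref_speech:
        for i in range(int(start / frame), min(int(end / frame), n)):
            ref[i] = True
    for start, end in sys_speech:
        for i in range(int(start / frame), min(int(end / frame), n)):
            hyp[i] = True

    miss = sum(r and not h for r, h in zip(ref, hyp))
    false_alarm = sum(h and not r for r, h in zip(ref, hyp))
    ref_frames = sum(ref)
    return (miss / ref_frames, false_alarm / ref_frames,
            (miss + false_alarm) / ref_frames)

# Example: 60 s excerpt, one reference speech region and a slightly off detection.
print(sad_der([(5.0, 20.0)], [(6.0, 22.0)], duration=60.0))  # ~ (0.067, 0.133, 0.200)
```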
RT-05S SAD Results: Primary Systems
• DERs for conference and lecture room MDM data are similar
• Purdue did not compensate for breath noise and cross-talk
Speech-To-Text (STT) Task
• Task definition:
  • Systems output a single stream of time-tagged word tokens
• Several input conditions:
  • Primary: MDM
  • Contrast: SDM, MSLA, IHM
• Two participating sites: AMI and ICSI/SRI
STT System Evaluation Method
• Primary metric:
  • Word Error Rate (WER) – the ratio of inserted, deleted, and substituted words to the total number of words in the reference (see the sketch below)
• System and reference words are normalized to a common form
• System words are mapped to reference words using a word-mediated dynamic programming string alignment program
• Systems were scored using the NIST Scoring Toolkit (SCTK) version 2.1
  • A Spring 2005 update to the SCTK alignment tool can now score most of the overlapping speech in the distant microphone test material
    • Can handle up to 5 simultaneous speakers
    • 98% of the Conference Room test set can be scored
    • 100% of the Lecture Room test set can be scored
  • Greatly improved over the Spring 2004 prototype
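A minimal sketch of the underlying word-level alignment and WER computation, assuming plain whitespace-tokenized strings; the official SCTK scoring additionally performs text normalization and multi-speaker overlap alignment:

```python
# Hypothetical sketch of word error rate via dynamic programming string
# alignment (Levenshtein distance over words).

def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                                  # deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,     # match / substitution
                          d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1)           # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the meeting starts at ten", "the meetings start at ten"))  # 0.4
```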
RT-05S STT Results: Primary Systems (Incl. Overlaps)
• First evaluation for the AMI team
• IHM error rates for conference and lecture room data are comparable
• ICSI/SRI lecture room MSLA WER is lower than MDM/SDM WER
(Chart: WER by microphone condition, Conference Room vs. Lecture Room)
Historical STT Performance in the Meeting Domain
• Performance for ICSI/SRI has dramatically improved for all conditions
Diarization “Source Localization” (SLOC) Task
• Task definition:
  • Systems track the three-dimensional position of the lecturer (using audio input only)
  • Constrained to the lecturer subset of the Lecture Room test set
  • Evaluation protocol and metrics defined in the CHIL “Speaker Localization and Tracking – Evaluation Criteria” document
• Dry run pilot evaluation for RT-05S
  • Proposed by CHIL
  • CHIL provided the scoring software and annotated the evaluation data
• One evaluation condition:
  • Multiple source localization arrays
  • Required calibration of source localization microphone positions and video cameras
• Three participating sites: ITC-irst, KU, TNO
SLOC System Evaluation Method
• Primary metric:
  • Root Mean Squared Error (RMSE) – a measure of the average Euclidean distance between the reference speaker position and the system-determined speaker position (see the sketch below)
  • Measured in millimeters at 667 ms intervals
• IRST SLOC scoring software
• Maurizio Omologo will give further details this afternoon
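A minimal sketch of this metric, assuming matched lists of reference and system (x, y, z) positions in millimeters sampled at the same instants; the official IRST scorer also handles time alignment and missing estimates:

```python
# Hypothetical sketch of the RMSE localization metric: the root of the mean
# squared Euclidean distance between reference and system 3-D positions,
# sampled at fixed intervals (667 ms in the evaluation).
import math

def sloc_rmse(ref_positions, sys_positions):
    assert len(ref_positions) == len(sys_positions)
    sq_dists = [sum((r - s) ** 2 for r, s in zip(p_ref, p_sys))
                for p_ref, p_sys in zip(ref_positions, sys_positions)]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# Two sample instants with positions in millimeters:
print(sloc_rmse([(0, 0, 1500), (100, 0, 1500)],
                [(50, 30, 1480), (120, -20, 1520)]))  # 50.0 mm
```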
RT-05S SLOC Results: Primary Systems
• Issues:
  • What accuracy and resolution are needed for successful beamforming?
  • What will performance be for multiple speakers?
Summary
• Nine sites participated in the RT-05S evaluation
  • Up from six in RT-04S
• Four evaluation tasks were supported across two meeting sub-domains
  • The two experimental tasks, SAD and SLOC, were successfully completed
• Dramatically lower STT and SPKR error rates for RT-05S
Issues for RT-06 Meeting Eval
• Domain
  • Sub-domains
• Tasks
  • Require at least three sites per task
  • Agreed-upon primary condition for each task
• Data contributions
  • Source data and annotations
• Participation intent
• Participation commitment
• Decision-making process
  • Only sites with intent to participate will have input to the task definition