
Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop


Presentation Transcript


  1. Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop
  July 13, 2005
  Royal College of Physicians, Edinburgh, UK

  2. Today's Agenda (updated July 5, 2005)
  [Agenda table shown on slide; last entry: 18:00, meeting venue cleared]

  3. Administrative Points
  • Participants:
    • Pick up the hard copy proceedings on the front desk
  • Presenters:
    • The agenda will be strictly followed
    • Time slots include Q&A time
    • Presenters should either:
      • Load their presentations on the computer at the front, or
      • Test their laptops during the breaks prior to making their presentation
  • We'd like to thank:
    • The MLMI-05 organizing committee for hosting this workshop
    • Caroline Hastings for the workshop's administration
    • All the volunteers: evaluation participants, data providers, transcribers, annotators, paper authors, presenters, and other contributors

  4. The Rich Transcription 2005 Spring Meeting Recognition Evaluation
  http://www.nist.gov/speech/tests/rt/rt2005/spring/
  Jonathan Fiscus, Nicolas Radde, John Garofolo, Audrey Le, Jerome Ajot, Christophe Laprun
  July 13, 2005, Rich Transcription 2005 Spring Meeting Recognition Workshop at MLMI 2005

  5. Overview
  • Rich Transcription Evaluation Series
  • Research opportunities in the Meeting Domain
  • RT-05S Evaluation
    • Audio input conditions
    • Corpora
    • Evaluation tasks and results
  • Conclusion/Future

  6. The Rich Transcription Task: Multiple Applications
  [Diagram: Rich Transcription = Speech-To-Text + metadata. Component recognition technologies for human-to-human speech feed applications such as readable transcripts, smart meeting rooms, translation, extraction, retrieval, and summarization.]

  7. Rich Transcription Evaluation Series
  • Goal:
    • Develop recognition technologies that produce transcripts which are understandable by humans and useful for downstream processes
  • Domains:
    • Broadcast News (BN)
    • Conversational Telephone Speech (CTS)
    • Meeting Room speech
  • Parameterized "Black Box" evaluations
    • Evaluations control input conditions to investigate weaknesses/strengths
    • Sub-test scoring provides finer-grained diagnostics

  8. Research Opportunities in the Meeting Domain
  • Provides a fertile environment to advance the state of the art in technologies for understanding human interaction
  • Many potential applications
    • Meeting archives, interactive meeting rooms, remote collaborative systems
  • Important Human Language Technology challenges not posed by other domains
    • Varied forums and vocabularies
    • Highly interactive and overlapping spontaneous speech
    • Far-field speech effects: ambient noise, reverberation, participant movement, varied room configurations
    • Many microphone conditions
    • Many camera views
    • Multimedia information integration: person, face, and head detection/tracking

  9. RT-05S Evaluation Tasks
  • Focus on core speech technologies
    • Speech-to-Text Transcription
    • Diarization "Who Spoke When"
    • Diarization "Speech Activity Detection"
    • Diarization "Source Localization"

  10. Five System Input Conditions
  • Distant microphone conditions
    • Multiple Distant Microphones (MDM)
      • Three or more centrally located table mics
    • Multiple Source Localization Arrays (MSLA)
      • Inverted "T" topology, 4-channel digital microphone array
    • Multiple Mark III digital microphone Arrays (MM3A)
      • Linear topology, 64-channel digital microphone array
  • Contrastive microphone conditions
    • Single Distant Microphone (SDM)
      • Center-most MDM microphone
      • Gauges the performance benefit of using multiple table mics
    • Individual Head Microphones (IHM)
      • Performance on clean speech
      • Similar to Conversational Telephone Speech: one speaker per channel, conversational speech

  11. Training/Development Corpora
  • Corpora provided at no cost to participants:
    • ICSI Meeting Corpus
    • ISL Meeting Corpus
    • NIST Meeting Pilot Corpus
    • Rich Transcription 2004 Spring (RT-04S) development & evaluation data
    • Topic Detection and Tracking Phase 4 (TDT4) corpus
    • Fisher English conversational telephone speech corpus
    • CHIL development test set
    • AMI development test set and training set
  • Thanks to ELDA and LDC for making this possible

  12. RT-05S Evaluation Test Corpora: Conference Room Test Set
  • Goal-oriented small conference room meetings
    • Group meetings and decision-making exercises
    • Meetings involved 4-10 participants
  • 120 minutes: ten excerpts, each twelve minutes in duration
  • Five sites donated two meetings each:
    • Augmented Multiparty Interaction (AMI) Program, Carnegie Mellon University (CMU), International Computer Science Institute (ICSI), NIST, and Virginia Tech (VT)
    • No VT data was available for system development
  • Similar test set construction was used for the RT-04S evaluation
  • Microphones:
    • Participants wore head microphones
    • Microphones were placed on the table among participants
    • AMI meetings included an 8-channel circular microphone array on the table
    • NIST meetings included 3 Mark III digital microphone arrays

  13. RT-05S Evaluation Test Corpora: Lecture Room Test Set
  • Technical lectures in small meeting rooms
    • Educational events where a single lecturer is briefing an audience on a particular topic
    • Meeting excerpts involve one lecturer and up to five participating audience members
  • 150 minutes: 29 excerpts from 16 lectures
  • Two types of excerpts, selected by CMU:
    • Lecturer excerpts: 89 minutes, 17 excerpts
    • Question & Answer (Q&A) excerpts: 61 minutes, 12 excerpts
  • All data collected at Karlsruhe University
  • Sensors:
    • The lecturer and at most two other participants wore head microphones
    • Microphones were placed on the table among participants
    • A source localization array was mounted on each of the room's four walls
    • A Mark III array was mounted on the wall opposite the lecturer

  14. RT-05S Evaluation Participants

  15. Diarization "Who Spoke When" (SPKR) Task
  • Task definition
    • Identify the number of participants in each meeting and create a list of speech time intervals for each such participant
  • Several input conditions:
    • Primary: MDM
    • Contrast: SDM, MSLA
  • Four participating sites: ICSI/SRI, ELISA, MQU, TNO

  16. SPKR System Evaluation Method
  • Primary metric
    • Diarization Error Rate (DER): the ratio of incorrectly detected speaker time to total speaker time (a simplified illustration follows this slide)
  • System output speaker segment sets are mapped to reference speaker segment sets so as to minimize the total error
  • Errors consist of:
    • Speaker assignment errors (i.e., detected speech not assigned to the right speaker)
    • False alarm detections
    • Missed detections
  • Systems were scored using the md-eval tool
    • Forgiveness collar of +/- 250 ms around reference segment boundaries
    • DER on non-overlapping speech is the primary metric
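For intuition only, here is a minimal Python sketch of the DER computation described above. It is not the NIST md-eval scorer: it assumes system speaker labels have already been mapped to reference labels, scores on fixed 10 ms frames, ignores the +/- 250 ms collar, and does not handle overlapping speech. The segment tuples and frame step are illustrative choices, not part of the evaluation specification.

```python
# Simplified DER sketch -- for intuition only, not the NIST md-eval scorer.
# Assumes system speaker labels are already mapped to reference labels,
# ignores the +/- 250 ms collar, and does not handle overlapping speech.
STEP = 0.01  # 10 ms scoring frames (an assumption of this sketch)

def speakers_at(segments, t):
    """Set of speakers active at time t; segments are (start, end, speaker)."""
    return {spk for start, end, spk in segments if start <= t < end}

def der(reference, system, duration):
    ref_time = miss = false_alarm = speaker_error = 0.0
    t = 0.0
    while t < duration:
        ref, hyp = speakers_at(reference, t), speakers_at(system, t)
        ref_time += STEP * len(ref)
        if ref and not hyp:
            miss += STEP * len(ref)          # speech missed entirely
        elif hyp and not ref:
            false_alarm += STEP * len(hyp)   # speech detected where there is none
        elif ref and hyp and ref != hyp:
            speaker_error += STEP            # speech found, but wrong speaker
        t += STEP
    return (miss + false_alarm + speaker_error) / ref_time if ref_time else 0.0

# Two ten-second speakers; the system switches to speaker B two seconds late.
ref = [(0.0, 10.0, "A"), (10.0, 20.0, "B")]
hyp = [(0.0, 12.0, "A"), (12.0, 20.0, "B")]
print(f"DER = {der(ref, hyp, 20.0):.2f}")   # about 0.10 (2 s of speaker error over 20 s)
```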

  17. RT-05S SPKR Results: Primary Systems, Non-Overlapping Speech
  • Conference room SDM DER is lower than MDM DER
    • A sign test indicates the differences are not significant
  • The primary ICSI/SRI lecture room system attributed the entire duration of each test excerpt to a single speaker
    • The ICSI/SRI contrastive system had a lower DER

  18. Lecture Room Results: Broken Down by Excerpt Type
  • Lecturer excerpt DERs are lower than Q&A excerpt DERs

  19. Historical Best System SPKR Performance on Conference Data
  • 20% relative reduction for MDM
  • 43% relative reduction for SDM

  20. Diarization "Speech Activity Detection" (SAD) Task
  • Task definition
    • Create a list of speech time intervals where at least one person is talking
  • Dry run evaluation for RT-05S
    • Proposed by CHIL
  • Several input conditions:
    • Primary: MDM
    • Contrast: SDM, MSLA, IHM
    • Systems designed for the IHM condition must detect speech while also rejecting cross-talk speech and breath noises; IHM systems are therefore not directly comparable to MDM or SDM systems
  • Three participating sites: ELISA, Purdue, TNO

  21. SAD System Evaluation Method
  • Primary metric
    • Diarization Error Rate (DER)
    • Same formula and software as used for the SPKR task
  • Reduced to a two-class problem: speech vs. non-speech (see the sketch following this slide)
    • No speaker assignment errors, just false alarms and missed detections
  • Forgiveness collar of +/- 250 ms around reference segment boundaries
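The two-class SAD case is simpler: only missed speech and false alarms count. The sketch below is an illustrative approximation, not the NIST scorer; it shows one plausible way a +/- 250 ms no-score collar around reference boundaries could be applied, and the frame step, helper names, and example segments are assumptions of this sketch.

```python
# Two-class SAD scoring sketch (speech vs. non-speech) -- illustrative only,
# not the NIST scorer.  Frames within +/- 250 ms of a reference segment
# boundary are excluded from scoring, mirroring the forgiveness collar.
COLLAR = 0.25   # seconds
STEP = 0.01     # 10 ms scoring frames (an assumption of this sketch)

def is_speech(segments, t):
    """True if any (start, end) segment covers time t."""
    return any(start <= t < end for start, end in segments)

def in_collar(segments, t):
    """True if t lies within the collar around any reference boundary."""
    return any(abs(t - edge) <= COLLAR
               for start, end in segments for edge in (start, end))

def sad_der(reference, system, duration):
    speech_time = missed = false_alarm = 0.0
    t = 0.0
    while t < duration:
        if not in_collar(reference, t):       # no-score zone near boundaries
            ref, hyp = is_speech(reference, t), is_speech(system, t)
            if ref:
                speech_time += STEP
                if not hyp:
                    missed += STEP            # missed speech
            elif hyp:
                false_alarm += STEP           # false alarm
        t += STEP
    return (missed + false_alarm) / speech_time if speech_time else 0.0

# Reference speech from 2 s to 8 s; the system starts late and ends early.
print(round(sad_der([(2.0, 8.0)], [(3.0, 7.5)], duration=10.0), 3))
```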

  22. RT-05S SAD Results: Primary Systems
  • DERs for conference room and lecture room MDM data are similar
  • Purdue did not compensate for breath noise and cross-talk

  23. Speech-To-Text (STT) Task
  • Task definition
    • Systems output a single stream of time-tagged word tokens
  • Several input conditions:
    • Primary: MDM
    • Contrast: SDM, MSLA, IHM
  • Two participating sites: AMI and ICSI/SRI

  24. STT System Evaluation Method
  • Primary metric
    • Word Error Rate (WER): the ratio of inserted, deleted, and substituted words to the total number of words in the reference
  • System and reference words are normalized to a common form
  • System words are mapped to reference words using a word-mediated dynamic programming string alignment program (a simplified alignment sketch follows this slide)
  • Systems were scored using the NIST Scoring Toolkit (SCTK) version 2.1
    • A Spring 2005 update to the SCTK alignment tool can now score most of the overlapping speech in the distant microphone test material
      • Can now handle up to 5 simultaneous speakers
      • 98% of the Conference Room test set can be scored
      • 100% of the Lecture Room test set can be scored
    • Greatly improved over the Spring 2004 prototype
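As a minimal illustration of the WER metric, the sketch below aligns a hypothesis to a reference with a standard dynamic-programming edit distance. It is not the SCTK/sclite scorer: it skips word-form normalization, time mediation, overlapping-speaker handling, and detailed error breakdowns, and the example sentences are invented.

```python
# Word Error Rate (WER) via dynamic-programming alignment -- a minimal
# illustration of the metric, not the NIST SCTK (sclite) scorer.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits aligning the first i reference words to the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of five reference words -> WER = 0.2
print(wer("the meeting starts at noon", "the meeting starts noon"))
```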

  25. RT-05S STT Results: Primary Systems (Including Overlaps)
  • First evaluation for the AMI team
  • IHM error rates for conference room and lecture room data are comparable
  • ICSI/SRI lecture room MSLA WER is lower than the MDM/SDM WER
  [Chart: WER by microphone condition for the Conference Room and Lecture Room test sets]

  26. Historical STT Performance in the Meeting Domain
  • Performance for ICSI/SRI has dramatically improved for all conditions

  27. Diarization "Source Localization" (SLOC) Task
  • Task definition
    • Systems track the three-dimensional position of the lecturer (using audio input only)
    • Constrained to the lecturer subset of the Lecture Room test set
    • Evaluation protocol and metrics defined in the CHIL "Speaker Localization and Tracking – Evaluation Criteria" document
  • Dry run pilot evaluation for RT-05S
    • Proposed by CHIL
    • CHIL provided the scoring software and annotated the evaluation data
  • One evaluation condition
    • Multiple source localization arrays
    • Required calibration of source localization microphone positions and video cameras
  • Three participating sites: ITC-irst, KU, TNO

  28. SLOC System Evaluation Method
  • Primary metric:
    • Root Mean Squared Error (RMSE): a measure of the average Euclidean distance between the reference speaker position and the system-determined speaker position (a sketch follows this slide)
    • Measured in millimeters at 667 ms intervals
  • IRST SLOC scoring software
  • Maurizio Omologo will give further details this afternoon
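A minimal sketch of the RMSE computation, assuming reference and system positions have already been sampled at the 667 ms scoring intervals and are expressed in millimeters. This illustrates the metric only; it is not the IRST SLOC scoring software, and the example coordinates are invented.

```python
# Root Mean Squared Error (RMSE) between reference and hypothesized 3-D speaker
# positions, one (x, y, z) entry per 667 ms scoring frame, in millimeters.
# A sketch of the metric only -- not the IRST SLOC scoring software.
import math

def rmse(ref_positions, sys_positions):
    """Both inputs: equal-length lists of (x, y, z) tuples in millimeters."""
    sq_dists = [(rx - sx) ** 2 + (ry - sy) ** 2 + (rz - sz) ** 2
                for (rx, ry, rz), (sx, sy, sz) in zip(ref_positions, sys_positions)]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# Two scoring frames with hypothetical lecturer positions.
ref = [(1000.0, 2000.0, 1700.0), (1050.0, 2010.0, 1700.0)]
hyp = [(1100.0, 1950.0, 1690.0), (1000.0, 2050.0, 1710.0)]
print(f"RMSE = {rmse(ref, hyp):.1f} mm")   # roughly 92 mm for this example
```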

  29. RT-05S SLOC Results: Primary Systems
  • Issues:
    • What accuracy and resolution are needed for successful beamforming?
    • What will performance be for multiple speakers?

  30. Summary
  • Nine sites participated in the RT-05S evaluation
    • Up from six in RT-04S
  • Four evaluation tasks were supported across two meeting sub-domains
  • Two experimental tasks, SAD and SLOC, were successfully completed
  • Dramatically lower STT and SPKR error rates for RT-05S

  31. Issues for the RT-06 Meeting Evaluation
  • Domain
    • Sub-domains
  • Tasks
    • Require at least three sites per task
    • Agreed-upon primary condition for each task
  • Data contributions
    • Source data and annotations
  • Participation intent
    • Participation commitment
  • Decision-making process
    • Only sites with intent to participate will have input to the task definition
