Quick Rich Transcriptions of Arabic Broadcast News Speech Data. Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman, ELDA (Evaluations and Language resources Distribution Agency); Meghan Glenn, Stephanie Strassel, LDC (Linguistic Data Consortium).
Outline: Overview; Transcription method; Sources; Collection; Selection; Transcription; Segmentation, Sentence Units, Overlapping Speech, Markup; Quality Control; Conclusion.
Overview: Broadcast news transcripts; Arabic (MSA, MCA); sources: radio and TV, mostly Middle East; verbatim orthographic transcripts, time-aligned, with minimal markup; QRTR to reduce transcription time.
Transcription Method (1): QRTR – Quick Rich Transcription (QTR / QRTR / CTR); amount of detail in markup; number of features identified; degree of accuracy; completeness; amount of time; number of quality checks.
Sources (1): Two types of recordings: broadcast news (BN), talking-head style news reports; broadcast conversation (BC), more interactive: talk shows, interviews, call-in programs, roundtable discussions. Mainly MSA from the Middle East; MCA from North Africa and the Middle East; overlapping speech; 30–60 minutes of recordings collected from TV and radio sources.
Collection: Sources recorded from satellite; daily and weekly recordings; video stream recorded; audio extracted from video; saved in WAV or SPH, 16 bits, 16 kHz.
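The stated audio format (16 bits, 16 kHz) is easy to verify automatically. The sketch below is one possible check, not the collection pipeline itself: it uses Python's standard `wave` module, so it covers only the WAV files (SPH files would need a separate reader), and the function names are hypothetical. The channel count is not stated on the slide, so it is not checked.

```python
import wave

# Target format stated on the slide: 16-bit samples at 16 kHz.
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits
EXPECTED_SAMPLE_RATE_HZ = 16000


def matches_collection_format(path):
    """Return True if the WAV file at `path` is 16-bit, 16 kHz audio."""
    with wave.open(path, "rb") as wav:
        return (wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
                and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ)


def write_silence(path, seconds=1, rate=EXPECTED_SAMPLE_RATE_HZ):
    """Write a mono 16-bit file of silence, used only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono, for the demo file only
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
```

A file written with `write_silence(path)` passes the check, while one written with `rate=8000` does not.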
Audit: Manual audit of all programs. Procedure: listen to 30-second samples of three sections (beginning, middle, end); auditors can listen to additional segments if necessary; the auditor fills in a form for auditing the recordings; web-based auditing interface. Checks: Is there a recording? Is the audio quality OK? What is the language? Is it speech from the right program? What is the data type? What is the topic?
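The sampling step of the audit (30-second samples at the beginning, middle and end) can be sketched as a small helper that returns the time windows an auditor would listen to. This is an illustration only: the function name is hypothetical, and centring the middle sample on the midpoint of the recording is an assumption, since the slide does not say exactly where the samples are taken.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return (start, end) times in seconds for three audit samples.

    The samples sit at the beginning, the middle and the end of a
    recording that is `duration_s` seconds long.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # A recording shorter than one sample is audited in full.
    sample_s = min(sample_s, duration_s)
    starts = [
        0.0,                              # beginning
        (duration_s - sample_s) / 2.0,    # middle (assumed: centred)
        duration_s - sample_s,            # end
    ]
    return [(start, start + sample_s) for start in starts]


if __name__ == "__main__":
    # A 30-minute recording (1800 seconds).
    for start, end in audit_windows(1800):
        print(f"{start:7.1f} s to {end:7.1f} s")
```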
Selection: Recordings rejected: poor quality, wrong language. Passed audit: eligible for transcription. Criteria based on: data amount, sources, dates. 2000 hours in 24 sets; sent in packages of 20–300 hours for transcription. Period: Apr. 2004 – Aug. 2007.
Transcription: Orthographic, verbatim transcripts; Arabic script; no vowels; segmented and time-stamped; speaker names; Sentence Units; noise markers; overlapping speech; foreign language markup.
XTrans (1): Tool for broadcast news and conversation; multilingual (UTF-8); multi-platform (Windows, Linux, FreeBSD); outputs TDF format; compatible with Transcriber format.
Segmentation (1): Segment data into sections: speech delimited by pause or silence; non-speech sections: music, silence, ads, etc.; lasts 5–20 seconds. Sections are classified as: news report (BN), conversation (BC), non-news. Sections are next grouped into speaker turns: single speaker or overlapping turn; Sentence Units (SU); speaker ID or name for each turn.
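The hierarchy described here (sections containing speaker turns, which contain timed segments) can be modelled with a few small data classes. This is a minimal sketch under stated assumptions, not the XTrans data model: all class, field and function names are hypothetical, and applying the 5–20 second range to individual segments is one reading of the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    start: float          # seconds from the start of the recording
    end: float
    text: str = ""

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Turn:
    speaker: str                      # speaker ID or name
    segments: List[Segment] = field(default_factory=list)
    overlapping: bool = False         # single-speaker or overlapping turn


@dataclass
class Section:
    kind: str                         # "BN", "BC" or "non-news"
    turns: List[Turn] = field(default_factory=list)


def segments_outside_range(section, min_s=5.0, max_s=20.0):
    """Return the segments whose duration falls outside [min_s, max_s]."""
    return [seg
            for turn in section.turns
            for seg in turn.segments
            if not (min_s <= seg.duration <= max_s)]
```

A check like `segments_outside_range` could flag segments for a second look without rejecting them outright, since the slide gives the range as a description rather than a hard rule.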
Sentence Units: Group utterances into clusters of words; each cluster represents a sentence-like unit; each unit receives a label: Statement, Question, Incomplete, Non-Speech.
Overlapping Speech: Many recordings include conversations; portions of speech that are overlapping; segmented and annotated; no SU type; could be quite challenging; difficult portions annotated as non-speech.
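The four sentence-unit labels and the rule that overlapping speech carries no SU type can be combined into a simple consistency check. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes that every non-overlapping segment must carry exactly one label.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SUType(Enum):
    """The four sentence-unit labels listed on the Sentence Units slide."""
    STATEMENT = "Statement"
    QUESTION = "Question"
    INCOMPLETE = "Incomplete"
    NON_SPEECH = "Non-Speech"


@dataclass
class AnnotatedSegment:
    text: str
    overlapping: bool = False
    su_type: Optional[SUType] = None


def is_consistent(seg):
    """Check a segment against the labelling rules.

    Overlapping speech has no SU type; any other segment is assumed
    to need exactly one label.
    """
    if seg.overlapping:
        return seg.su_type is None
    return seg.su_type is not None
```

Under this reading, a difficult overlapping portion that is annotated as non-speech would be stored as an ordinary segment with `SUType.NON_SPEECH` rather than as an overlapping one.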
Markup (1): Minimal set of markers: hesitations, truncated words, mispronunciations, made-up words, noise, difficult speech.
Markup (2): Noise markers: background noise; speaker noise: laugh, cough, sneeze, lipsmack. Dialect / language markup: non-MSA (MCA), English, French, foreign language.
Quality Control: Limited quality control due to time constraints; quick verification procedure; max. 18 min per file. Focus: transcription matches speech, segmentation, speaker names, orthography. Procedure: check 3 segments of 3 min each (beginning, middle and end); transcriptions that did not pass were sent back to transcribers.
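The figures above imply some simple arithmetic: three 3-minute segments give 9 minutes of checked audio per file, so the 18-minute cap leaves at most two minutes of reviewer time per minute of audio. The sketch below works this out for the 30–60 minute recording lengths given on the Sources slide; the function name and the returned keys are hypothetical.

```python
def qc_summary(recording_min, n_segments=3, segment_min=3.0, budget_min=18.0):
    """Summarise how much of a recording the quick verification covers."""
    # Audio actually checked, capped at the length of the recording.
    checked = min(n_segments * segment_min, recording_min)
    return {
        "checked_min": checked,
        # Fraction of the recording that is listened to.
        "coverage": checked / recording_min,
        # Reviewer minutes available per minute of checked audio.
        "time_per_audio_min": budget_min / checked,
    }


if __name__ == "__main__":
    for length in (30, 60):
        summary = qc_summary(length)
        print(f"{length} min recording: {summary['checked_min']:.0f} min checked, "
              f"{summary['coverage']:.0%} coverage")
```

Under these assumptions the check covers between 15% (60-minute recording) and 30% (30-minute recording) of each file.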
Conclusion: Arabic broadcast data; >2000 hours transcribed; 330k words; useful for quantitative manual transcripts; limited timeframe; minimal but useful markup; quality control; training ASR systems for MT.