Quick Rich Transcriptions of Arabic Broadcast News Speech Data

Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman, ELDA (Evaluations and Language resources Distribution Agency); Meghan Glenn, Stephanie Strassel, LDC (Linguistic Data Consortium).

Presentation Transcript


  1. Quick Rich Transcriptions of Arabic Broadcast News Speech Data
     - Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman: ELDA (Evaluations and Language resources Distribution Agency)
     - Meghan Glenn, Stephanie Strassel: LDC (Linguistic Data Consortium)

  2. Outline
     - Overview
     - Transcription method
     - Sources
     - Collection
     - Selection
     - Transcription: Segmentation, Sentence Units, Overlapping Speech, Markup
     - Quality Control
     - Conclusion

  3. Overview
     - Broadcast news transcripts
     - Arabic (MSA, MCA)
     - Sources: radio + TV, mostly Middle East
     - Verbatim orthographic transcripts, time-aligned, minimal mark-up
     - QRTR to reduce time

  4. Transcription Method (1)
     - QRTR: Quick Rich Transcription (QTR / QRTR / CTR)
     - Amount of detail in markup
     - Number of features identified
     - Degree of accuracy
     - Completeness
     - Amount of time
     - Number of quality checks

  5. Transcription Method (2)

  6. Sources (1)
     - Two types of recordings:
       - Broadcast news (BN): talking head style news reports
       - Broadcast conversation (BC): more interactive; talk shows, interviews, call-in programs, roundtable discussions
     - Mainly MSA from Middle East
     - MCA from North Africa and Middle East
     - Overlapping speech
     - 30 – 60 minutes of recordings collected from TV and radio sources

  7. Sources (2)

  8. Collection
     - Sources recorded from satellite
     - Daily and weekly recordings
     - Records video stream
     - Audio extracted from video
     - Saved in WAV or SPH, 16 bits, 16 kHz
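As an illustration of the storage format on the Collection slide, the following minimal sketch checks whether a WAV file header reports 16-bit samples at 16 kHz. It uses only Python's standard `wave` module, which does not read SPH (NIST SPHERE) files, so those would need another reader. The function names `matches_collection_format` and `make_silent_wav` are hypothetical and are not part of the ELDA/LDC pipeline; the channel count is not stated on the slide and is therefore not checked.

```python
import io
import wave


def matches_collection_format(wav_file, sample_width_bytes=2, sample_rate_hz=16000):
    """Return True if the WAV header reports 16-bit samples at 16 kHz.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        return (wav.getsampwidth() == sample_width_bytes
                and wav.getframerate() == sample_rate_hz)


def make_silent_wav(sample_rate_hz, sample_width_bytes=2, seconds=1):
    """Build an in-memory mono WAV of silence (hypothetical test helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * sample_width_bytes * sample_rate_hz * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(matches_collection_format(make_silent_wav(16000)))
    print(matches_collection_format(make_silent_wav(8000)))
```

A check like this could run before the manual audit so that files with an unexpected header are flagged early; whether the original workflow did so is not stated in the slides.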

  9. Audit
     - Manual audit of all programs
     - Procedure:
       - Listen to 30 sec samples of 3 sections: beginning, middle, end
       - Auditors can listen to additional segments if necessary
       - Auditor fills in a form for auditing the recordings
       - Web-based auditing interface
     - Checks:
       - Is there a recording?
       - Is the audio quality ok?
       - What is the language?
       - Is it speech from the right program?
       - What is the data type?
       - What is the topic?
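The audit procedure samples 30 seconds from the beginning, middle and end of each recording. A minimal sketch of how those three windows could be computed is given here. The function name `audit_windows`, the centring of the middle sample, and the handling of recordings shorter than one sample are assumptions, not details taken from the slides.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return (start, end) times in seconds for the three audit samples.

    Assumption: the middle sample is centred on the recording midpoint and
    the final sample ends exactly at the end of the recording.
    """
    if duration_s < sample_s:
        # Recording shorter than one sample: audit the whole recording once.
        return [(0.0, float(duration_s))]
    middle_start = (duration_s - sample_s) / 2
    end_start = duration_s - sample_s
    return [
        (0.0, sample_s),
        (middle_start, middle_start + sample_s),
        (end_start, float(duration_s)),
    ]


if __name__ == "__main__":
    # A 30-minute recording (1800 seconds).
    for start, end in audit_windows(1800):
        print(f"{start:7.1f} s to {end:7.1f} s")
```

For a 30-minute recording this covers 90 seconds of audio, which is why the slide allows auditors to listen to additional segments when the three samples are inconclusive.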

  10. Selection
     - Recordings rejected: poor quality, wrong language
     - Passed audit: eligible for transcription
     - Criteria based on: data amount, sources, dates
     - 2000 hours in 24 sets
     - Sent in 20 – 300 hour packages for transcription
     - Period: Apr. 2004 – Aug. 2007

  11. Transcription
     - Orthographic, verbatim transcripts
     - Arabic script
     - No vowels
     - Segmented and time stamped
     - Speaker names
     - Sentence Units
     - Noise markers
     - Overlapping speech
     - Foreign language markup

  12. XTrans (1)
     - Tool for broadcast news and conversation
     - Multi-lingual (UTF-8)
     - Multi-platform (Windows, Linux, FreeBSD)
     - Outputs TDF format
     - Compatible with Transcriber format

  13. XTrans (2)

  14. Segmentation (1)
     - Segment data into sections:
       - Speech delimited by pause or silence
       - Non-speech sections: music, silence, ads, etc.
       - Lasts 5 – 20 seconds
     - Sections are classified as: News report (BN), Conversation (BC), Non-news
     - Sections are next grouped into speaker turns:
       - Single speaker or overlapping turn
       - Sentence Units (SU)
       - Speaker ID or name for each turn

  15. Sentence Units
     - Group utterances into clusters of words
     - Each cluster represents a sentence-like unit
     - Each unit receives a label: Statement, Question, Incomplete, Non-Speech

  16. Overlapping Speech
     - Many recordings include conversations
     - Portions of speech that are overlapping
     - Segmented and annotated
     - No SU type
     - Could be quite challenging
     - Difficult portions annotated as non-speech
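The segmentation, sentence-unit and overlapping-speech slides together imply a small record per segment: time stamps, a speaker, the text, and an SU label that is absent for overlapping speech. The sketch below models that record and flags segments that break the stated conventions. The class and field names are hypothetical and do not reflect the actual XTrans or TDF schema, and treating the 5 to 20 second range as a hard limit is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SUType(Enum):
    """Sentence-unit labels listed on the Sentence Units slide."""
    STATEMENT = "statement"
    QUESTION = "question"
    INCOMPLETE = "incomplete"
    NON_SPEECH = "non-speech"


@dataclass
class Segment:
    start_s: float
    end_s: float
    speaker: str
    text: str
    su_type: Optional[SUType] = None
    overlapping: bool = False

    def problems(self):
        """Return a list of human-readable issues; empty if none were found."""
        issues = []
        duration = self.end_s - self.start_s
        if duration <= 0:
            issues.append("end time must be after start time")
        elif not 5.0 <= duration <= 20.0:
            issues.append("duration outside the 5-20 second range")
        if self.overlapping and self.su_type is not None:
            issues.append("overlapping speech should carry no SU type")
        if not self.overlapping and self.su_type is None:
            issues.append("non-overlapping segment is missing an SU type")
        return issues


if __name__ == "__main__":
    segment = Segment(12.0, 15.0, "speaker1", "...", SUType.QUESTION)
    print(segment.problems())
```

Returning a list of issues rather than raising an exception suits a quick transcription workflow, where a reviewer would want to see every problem in a file at once.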

  17. Markup (1)
     - Minimal set of markers:
       - Hesitations
       - Truncated words
       - Mispronunciations
       - Made up words
       - Noise
       - Difficult speech

  18. Markup (2)
     - Noise markers:
       - Background noise
       - Speaker noise: laugh, cough, sneeze, lipsmack
     - Dialect / language markup:
       - Non-MSA (MCA)
       - English
       - French
       - Foreign Language

  19. Quality Control
     - Limited quality control due to time constraints
     - Quick Verification procedure
     - Max 18 min / file
     - Focus:
       - Transcription matches speech
       - Segmentation
       - Speaker names
       - Orthography
     - Procedure:
       - Checks 3 segments of 3 min: beginning, middle and end
       - Transcriptions that did not pass: sent back to transcribers
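The figures on the Quality Control slide can be worked through numerically: three 3-minute segments give 9 minutes of checked audio per file, and an 18-minute cap leaves at most 2 minutes of reviewer time per minute of checked audio. The sketch below computes this together with the fraction of a recording that is covered. It assumes that one file corresponds to one recording of 30 to 60 minutes, as given on the Sources slide; the function name `qc_coverage` is hypothetical.

```python
def qc_coverage(recording_min, segments=3, segment_min=3.0, budget_min=18.0):
    """Summarise how much of a recording the quick verification covers.

    Defaults follow the Quality Control slide: 3 segments of 3 minutes,
    with at most 18 minutes of reviewer time per file.
    """
    checked_min = segments * segment_min
    return {
        "checked_min": checked_min,
        "coverage": checked_min / recording_min,
        "budget_per_audio_min": budget_min / checked_min,
    }


if __name__ == "__main__":
    for length in (30, 60):
        print(length, qc_coverage(length))
```

Under these assumptions the check covers between 15% (60-minute recording) and 30% (30-minute recording) of each file, which matches the slide's description of quality control as limited.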

  20. Conclusion
     - Arabic broadcast data
     - >2000 hours transcribed
     - 330k words
     - Useful for quantitative manual transcripts
     - Limited timeframe
     - Minimal but useful markup
     - Quality control
     - Training ASR systems for MT

  21. Thanks for your attention
