
LREC 2008 Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen

RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus


Presentation Transcript


  1. RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus. LREC 2008. Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen

  2. Overview • Purpose of Rundkast • An overview of the database Rundkast • Structure of annotation • Orthographic transcription • Broad phonetic annotation

  3. Purpose of Rundkast Databases of broadcast news can be used for a number of research topics in speech technology, such as: • Supplement to existing databases of read speech for training and testing automatic speech recognition and speaker adaptation. • Research on recognition of spontaneous speech. • Research on automatic indexing of audio data. • Research on topic and/or speaker segmentation. • Research on speech/non-speech detection (e.g. background music). • International research cooperation involving speech technology for broadcast news applications. A corpus of this kind is necessary for language technology research, but has not been available for Norwegian.

  4. Overview of Rundkast http://www.iet.ntnu.no/projects/rundkast/ Database of 77 hours of radio broadcast news from the Norwegian Broadcasting Corporation (NRK): • Read and spontaneous speech, as well as spontaneous dialogs and multiparty discussions • Large variation between speakers, speaking styles and topics • Speaker turns may be rapid, and several speakers may talk simultaneously • Recording quality includes both studio and telephone (mobile, satellite, etc.) • Frequent occurrences of background noise, jingles, music and audio illustrations Funded by the Norwegian University of Science and Technology (NTNU)

  5. Structure of annotation Rundkast is hierarchically organized and orthographically annotated: • Name of programme, type and date • Name of speaker (if known) and dialect (5 regions) • Type of speech: spontaneity, channel, recording quality • Segmented in speaker turns of approx. 2-5 seconds • Orthographic transcription (standard Norwegian) • Labels for noise (speaker noise, background noise, etc.) • Labels for pronunciation mistakes, foreign words, unintelligible speech, etc. • ~70 hours of work per hour of recording Annotation tool: Transcriber, a "standard" tool

  6. Hierarchy of annotation levels [Figure: one episode file annotated at three levels: 1 = section (e.g. report, filler, nontrans), 2 = speaker turn (e.g. no speaker, speaker 1, speaker 2), and 3 = segment (transcribed text with codes such as [i], [b-] ... [-b], and [lp])]
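
The three annotation levels (section, speaker turn, segment) nest naturally as a tree. The following is a minimal Python sketch of that nesting, not the actual Transcriber file format; all class and field names, the programme name, and the date are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Level 3: a short stretch of speech with its orthographic transcription."""
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str


@dataclass
class SpeakerTurn:
    """Level 2: consecutive segments by one speaker (None if there is no speaker)."""
    speaker: Optional[str]
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    """Level 1: a section of the programme, e.g. a report or a filler."""
    kind: str
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    """One episode file: programme metadata plus its sections."""
    programme: str
    date: str
    sections: List[Section] = field(default_factory=list)

    def speech_seconds(self) -> float:
        """Total duration of all segments in the episode."""
        return sum(
            seg.end - seg.start
            for sec in self.sections
            for turn in sec.turns
            for seg in turn.segments
        )


# Hypothetical episode with one report section and two speaker turns.
episode = Episode(
    programme="hypothetical news programme",
    date="2008-01-01",  # placeholder date, not taken from the corpus
    sections=[
        Section("report", [
            SpeakerTurn("speaker 1", [Segment(0.0, 3.0, "..."), Segment(3.0, 7.5, "...")]),
            SpeakerTurn("speaker 2", [Segment(7.5, 10.0, "...")]),
        ]),
    ],
)
print(episode.speech_seconds())  # 10.0
```

Keeping the metadata (programme, speaker, section type) on the outer levels means each segment only needs its time span and text.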

  7. Orthographic transcription • Segments, the lowest level in the annotation hierarchy, are transcribed orthographically. • Orthographic transcription of spoken language is a challenge, especially for Norwegian: using dialect in official settings as well is increasingly accepted. • The majority of the speech in RUNDKAST does not conform to any standard pronunciation. • The aim of the conventions for the orthographic transcription in RUNDKAST is to minimize uncertainty about pronunciations and facilitate consistency.

  8. Orthographic transcription: Main conventions • Words are transcribed with the written forms closest to actual pronunciations. A limited number of interjections are allowed. • Text codes are used to mark mispronunciations, truncations, and unknown words. • Numbers and symbols are written out as words. • Abbreviations are not used. • Punctuation marks are restricted to comma, period, and question mark. • Space is used between spelled letters, also when acronyms have spelled pronunciation. • Capital letters are used in proper names, spellings, and acronyms, but not at the start of sentences.
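
Two of these conventions (numbers and symbols written out as words, punctuation restricted to comma, period, and question mark) lend themselves to an automatic check. The sketch below is one possible checker, not the project's own tooling. It assumes that text codes are written in square brackets and should be ignored, and it is deliberately strict: it would also flag hyphens and apostrophes, which the slides do not mention. The function name and the example sentences are invented for illustration.

```python
import re
import string

ALLOWED_PUNCTUATION = set(",.?")
# Assumed shape of text codes: anything in square brackets, e.g. [lp].
CODE_PATTERN = re.compile(r"\[[^\]]*\]")


def check_segment(text):
    """Return a list of convention problems found in one transcribed segment."""
    plain = CODE_PATTERN.sub(" ", text)  # ignore bracketed text codes
    problems = []
    if any(ch.isdigit() for ch in plain):
        problems.append("digits: write numbers out as words")
    bad = sorted(
        {ch for ch in plain if ch in string.punctuation and ch not in ALLOWED_PUNCTUATION}
    )
    if bad:
        problems.append("disallowed symbols: " + " ".join(bad))
    return problems


# Invented example segments, not taken from the corpus.
print(check_segment("det var tjue prosent, sa NRK."))  # []
print(check_segment("det var 20 % [lp] sa hun!"))
# ['digits: write numbers out as words', 'disallowed symbols: ! %']
```

Conventions such as "written form closest to actual pronunciation" need human judgment and are outside the scope of a check like this.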

  9. Example annotation in Transcriber

  10. Broad phonetic annotation • Part of the data was to be phonetically annotated • Intended use: low-level experiments in ASR (new methods); a smaller Norwegian counterpart to TIMIT • Auto-segmentation for e.g. unit selection TTS • Annotation to be based on existing standards, with necessary adjustments • Exploit experience and specifications from development of Norwegian speech synthesis databases • "Suitable" level of detail: acoustic boundaries should be labeled, but more phonemic than phonetic • Consistency of utmost importance!

  11. Broad phonetic annotation: Selected data • 10 speakers (5 male and 5 female) • Amount of speech per speaker: • approx. 5 min "planned" speech and 1 min spontaneous speech • discard noisy parts (as far as possible) • from more than one programme • use turn segmentation from orthographic annotation • All in all 1 hour of speech • Approximately 1000 hours of work
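
The totals on this slide follow from the per-speaker figures. A short worked calculation in Python, using only numbers stated in the slides (the variable names are ours), shows how the 1 hour of speech arises and how the phonetic annotation effort compares with the roughly 70 hours per recorded hour quoted for the orthographic annotation:

```python
speakers = 10
planned_min = 5        # approx. minutes of "planned" speech per speaker
spontaneous_min = 1    # approx. minutes of spontaneous speech per speaker

total_min = speakers * (planned_min + spontaneous_min)
total_hours = total_min / 60

phonetic_work_hours = 1000      # approximate figure from the slide
orthographic_work_per_hour = 70  # approximate figure from the annotation-structure slide

phonetic_work_per_hour = phonetic_work_hours / total_hours
ratio = phonetic_work_per_hour / orthographic_work_per_hour

print(total_min)               # 60
print(total_hours)             # 1.0
print(phonetic_work_per_hour)  # 1000.0
print(round(ratio, 1))         # 14.3
```

Since all inputs are approximate, the ratio is an order-of-magnitude indication: phonetic annotation cost roughly 14 times as much work per hour of speech as orthographic annotation.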

  12. Broad phonetic annotation: Main principles • The annotation is mainly phonemic, using the phoneme symbols closest to the perceived sound • Acoustic boundaries should be marked; some acoustically motivated symbols are included • A transcription as close as possible to the citation form is preferred • Norwegian standard SAMPA is preferred • Some English phonemes included as well as dialect variants • Example: 3 variants of the /r/-sound: /r/ (tap/trill), /R/ (uvular fricative), /r\/ (approximant)
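
As a small illustration of how the three /r/ variants could be handled downstream, the Python sketch below maps each symbol to the description given on the slide and counts the variants in a label sequence. The function name and the example label sequence are hypothetical; only the three symbols and their descriptions come from the slide.

```python
from collections import Counter

# Symbols and descriptions as listed on the slide.
R_VARIANTS = {
    "r": "tap/trill",
    "R": "uvular fricative",
    "r\\": "approximant",  # the SAMPA symbol r\ (backslash escaped in Python)
}


def count_r_variants(labels):
    """Count how often each /r/ variant occurs in a sequence of phone labels."""
    return Counter(R_VARIANTS[label] for label in labels if label in R_VARIANTS)


# Hypothetical label sequence, not taken from the corpus.
labels = ["r", "A:", "R", "r\\", "e", "r"]
counts = count_r_variants(labels)
print(counts["tap/trill"], counts["uvular fricative"], counts["approximant"])  # 2 1 1
```

Counts like these per speaker would be one way to relate the /r/ realization to the dialect region recorded in the orthographic annotation.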

  13. Broad phonetic annotation: Annotation procedure • Conversion of orthographic transcription to a format suitable for automatic transcription. • Automatic segmentation with a phonotypical transcription using a speech recognizer. • Manual correction of both segments and labels by four phonetics students using Praat. • Format check. • Control of all annotation by one supervisor.

  14. Broad phonetic annotation: Comments on deviations There are always cases of uncertainty, and these need a log. Problem: will the log be read? Solution: codes for deviations! • Additional Praat tier for deviations • Synchronous with the phoneme tier • Easy to utilize automatically • Examples: • creaky voice • unexpected voiced/unvoiced • uncertain boundary or symbol • In addition, a log file for any remaining deviations
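
Because the deviation tier is synchronous with the phoneme tier, the two can be read in lockstep. The Python sketch below shows one way this could be used automatically. It assumes each tier has already been loaded from Praat as a list of `(start, end, label)` tuples; the loading step, the function name, and the code strings `"creaky"` and `"uncertain"` are hypothetical, since the slides do not give the actual code inventory.

```python
def paired_deviations(phoneme_tier, deviation_tier):
    """Pair each phoneme with its deviation code, checking the tiers are synchronous.

    Both tiers are lists of (start, end, label) tuples. An empty label on the
    deviation tier means "no deviation noted".
    """
    if len(phoneme_tier) != len(deviation_tier):
        raise ValueError("tiers have different numbers of intervals")
    flagged = []
    for (p_start, p_end, phone), (d_start, d_end, code) in zip(phoneme_tier, deviation_tier):
        if (p_start, p_end) != (d_start, d_end):
            raise ValueError(f"boundary mismatch at {p_start}-{p_end}")
        if code:
            flagged.append((phone, code))
    return flagged


# Hypothetical tiers; times in seconds.
phoneme_tier = [(0.00, 0.08, "h"), (0.08, 0.21, "A:"), (0.21, 0.30, "r")]
deviation_tier = [(0.00, 0.08, ""), (0.08, 0.21, "creaky"), (0.21, 0.30, "uncertain")]

print(paired_deviations(phoneme_tier, deviation_tier))
# [('A:', 'creaky'), ('r', 'uncertain')]
```

A filter like this makes it easy, for example, to exclude segments with uncertain boundaries from training data, which a free-text log would not allow.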

  15. Example annotation in Praat

  16. Concluding remarks • Availability: • Planned to be included for non-commercial use in a future Norwegian language bank • Will complement other corpora also intended to be included • To be validated by Spex • Planned use at NTNU: SIRKUS project • Investigation of new paradigms for ASR • Low-level phone recognition experiments initially • Multilinguality aspects • Spoken information retrieval
