1 / 14

Speech and Music Retrieval

Explore techniques for speech and music retrieval, including transcription, annotation, and recommender systems. Learn about speech compression and three-step speech recognition. Discover key results and error analysis in transcription accuracy.

lesther
Download Presentation

Speech and Music Retrieval

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Speech and Music Retrieval INST 734 Doug Oard Module 12

  2. Agenda • Music retrieval • Speech retrieval • Interactive speech retrieval

  3. Spoken Word Collections • Broadcast programming • News, interview, talk radio, sports, entertainment • Scripted stories • Books on tape, poetry reading, theater • Spontaneous storytelling • Oral history, folklore • Incidental recording • Speeches, oral arguments, meetings, phone calls

  4. Speech Compression • Opportunity: • Human voices vary in predictable ways • Approach: • Predict what’s next, then send only any corrections • Standards: • Rule of thumb: 1 kB/sec for (highly compressed) speech

  5. Description Strategies • Transcription • Manual transcription (with optional post-editing) • Annotation • Manually assign descriptors to points in a recording • Recommender systems (ratings, link analysis, …) • Associated materials • Interviewer’s notes, speech scripts, producer’s logs • Automatic • Create access points with automatic speech processing

  6. Three-Step Speech Recognition • What sounds were made? • Convert from waveform to subword units (phonemes) • How could the sounds be grouped into words? • Identify the most probable word segmentation points • Which of the possible words were spoken? • Based on likelihood of possible multiword sequences

  7. Using Speech Recognition Phone n-grams Phone Detection Phone lattice Word Construction Transcription dictionary Word lattice One-best transcript Word Selection Language model Words

  8. Phone Lattice

  9. Phoneme Trigrams • Manage -> m ae n ih jh • Dictionaries provide accurate transcriptions • But valid only for a single accent and dialect • Rule-base transcription handles unknown words • Index every overlapping 3-phoneme sequence • m ae n • ae n ih • n ih jh

  10. Key Results from TREC/TDT • Recognition and retrieval can be decomposed • Word recognition/retrieval works well in English • Retrieval is robust with recognition errors • Up to 40% word error rate is tolerable • Retrieval is robust with segmentation errors • Vocabulary shift/pauses provide strong cues

  11. English Transcription Accuracy Training: 200 hours from 800 speakers

  12. 50% 25h Other Languages WER [%] 30 40 50 60 70 34.49% 35.51% 38.57% + stand.+LMTr+TC + adapt. 40.69% 41.15% 100h + LMTr + LMTr+TC + standard. 45.75% 45.91% + stand.+LMTr+TC 84h + LMTr 50.82% 100h + LMTr 57.92% 45h + LMTr 66.07% 20h + LMTr Polish Czech Slovak Hungarian Russian 10/01 4/02 10/02 4/03 10/03 4/04 10/04 4/05 10/05 4/06 10/06

  13. Error Analysis (2005) (ASR/Metadata) CLEF-2005 training + test – (metadata < 0.2), ASR2004A only, Title queries, Inquery 3.1p1

  14. Agenda • Music retrieval • Speech retrieval • Interactive speech retrieval

More Related