Presentation Transcript


  1. Towards Dolphin Recognition Tanja Schultz, Alan Black, Bob Frederking Carnegie Mellon University West Palm Beach, March 28, 2003

  2. Outline • Speech-to-Speech Recognition • Brief Introduction • Lab, Research • Data Requirements • Audio data • ‘Transcriptions’ • Towards Dolphin Recognition • Applications • Current Approaches • Preliminary Results

  3. Part 1 • Speech-to-Speech Recognition • Brief Introduction • Lab, Research • Data Requirements • Audio data • ‘Transcriptions’ • Towards Dolphin Recognition • Applications • Current Approaches • Preliminary Results

  4. Speech Processing Terms • Speech Recognition Converts spoken input into written text output • Natural Language Understanding (NLU) Derives the meaning of spoken or written input • (Speech-to-Speech) Translation Transforms text / speech of language A into text / speech of language B • Speech Synthesis (Text-To-Speech = TTS) Converts written text input into audible output

  5. Speech Recognition [Diagram: speech input ("h e l l o") → Preprocessing → Decoding / Search → Postprocessing → Synthesis (TTS); candidate hypotheses: Hello, Hale Bob, Hallo, …]

  6. Fundamental Equation of SR [Diagram: Acoustic Model (sound units, e.g. A-b, A-m, A-e); Pronunciation dictionary (Am = AE M, Are = A R, I = AI, you = J U, we = V E); Language Model (I am, you are, we are)] P(W|X) = [ P(X|W) · P(W) ] / P(X)
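The slide's equation is Bayes' rule. In decoding, the recognizer searches for the word sequence W that maximizes the posterior; since P(X) does not depend on W, it can be dropped, leaving the acoustic model and language model terms:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}
                       \underbrace{P(W)}_{\text{language model}}
```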

  7. SR: Data Requirements [Diagram: Audio Data → Acoustic Model (sound set; units built from sounds); Pronunciation dictionary; Text Data → Language Model]

  8. Janus Speech Recognition Toolkit (JRTk) • Unlimited and Open Vocabulary • Spontaneous and Conversational Human-Human Speech • Speaker-Independent • High Bandwidth, Telephone, Car, Broadcast • Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghai, Arabic, Turkish, Russian, Tamil, Czech • Best Performance on Public Benchmarks • DoD, (English) DARPA Hub-5 Test ‘96, ‘97 (SWB-Task) • Verbmobil (German) Benchmark ’95-’00 (Travel-Task)

  9. Mobile Device for Translation & Navigation

  10. Multi-lingual Meeting Support The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting or search a set of existing meetings for a particular speaker, topic, or idea.

  11. Multilingual Indexing of Video • View4You / Informedia: Automatically records Broadcast News and allows the user to retrieve video segments of news items for different topics using spoken language input • Non-cooperative speaker on video • Cooperative user • Indexing requires only low quality translation

  12. Part 2 • Speech-to-Speech Recognition • Brief Introduction • Lab, Research • Data Requirements • Audio data • ‘Transcriptions’ • Towards Dolphin Recognition • Applications • Current Approaches • Preliminary Results

  13. Towards Dolphin Recognition • Identification: Whose voice is this? • Verification / Detection: Is this Bob's voice? Is it Nippy's voice? • Segmentation and Clustering: Where are speaker / dolphin changes? Which segments are from the same speaker / dolphin? [Diagram: audio segments labeled Speaker A, Speaker B]

  14. Applications ‘off-line’ applications (off the water, off the boat, off season) • Data Management and Indexing • Automatic Assignment/Labeling of already recorded (archived) data • Automatic Post-Processing (Indexing) for later retrieval • Towards Important/Meaningful Units = DOLPHONES • Segmentation and Clustering of similar sounds/units • Find out about unit frequencies • Find out about correlation between sounds and other events • Whistles correlated to Family Relationship • Who belongs to whom • Find out about the family tree? • Can we find out more about social structure?

  15. Applications ‘on-line’ applications • Identification and Tracking • Who is currently speaking? • Who is around? • Towards Important/Meaningful Units • Find out about correlation between sounds and other events • Whistles correlated to Family Relationship • Who belongs to whom • Wide-range identification, tracking, and observation (since sound travels longer distances than images)

  16. Common Approaches Two distinct phases [Diagram: Training Phase — training speech for each dolphin (Nippy, Havana, …) → Feature extraction → Model training → one model per dolphin; Detection Phase — unknown audio → Feature extraction → Detection decision → Hypothesis: Havana]
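The two phases above can be sketched in a few lines. This is a minimal stand-in, assuming feature vectors (e.g. MFCC-like frames) have already been extracted per recording; the synthetic 2-D features and the choice of Gaussian mixture models are illustrative, not the authors' exact setup.

```python
# Train one model per dolphin (training phase), then score an unknown
# recording against every model and pick the best (detection phase).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in "training speech": one feature cloud per dolphin.
train = {
    "Nippy":  rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    "Havana": rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
}

# Training phase: fit one GMM per dolphin.
models = {name: GaussianMixture(n_components=2, random_state=0).fit(feats)
          for name, feats in train.items()}

# Detection phase: score unknown audio features under every dolphin model.
unknown = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
scores = {name: m.score(unknown) for name, m in models.items()}  # mean log-likelihood
hypothesis = max(scores, key=scores.get)
print(hypothesis)  # prints Havana
```

In a real system the per-dolphin models would be trained on labeled whistle recordings rather than synthetic clouds, but the train/score/argmax structure is the same.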

  17. Current Approaches Λ(X) = p(X|dolph) / p(X|¬dolph) [Diagram: Feature extraction → Dolphin model and Background model → likelihood ratio Λ] • p(X|dolph) is the likelihood of the features X = (x1, x2, …) under the dolphin model • p(X|¬dolph) is an alternative, so-called background model trained on all data except that of the dolphin in question • A likelihood ratio test is used for the detection decision
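A minimal sketch of this likelihood-ratio test, working in the log domain: accept the claim "X was produced by this dolphin" iff log p(X|dolph) − log p(X|¬dolph) exceeds a threshold. The diagonal-Gaussian stand-in models and the zero threshold are illustrative assumptions, not the actual models used.

```python
# Likelihood-ratio detection with simple diagonal-Gaussian frame models.
import numpy as np

def diag_gauss_loglik(X, mean, var):
    """Mean per-frame log-likelihood of frames X under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def detect(X, dolph, background, threshold=0.0):
    """Return (accept?, log-likelihood ratio) for the claimed dolphin."""
    llr = diag_gauss_loglik(X, *dolph) - diag_gauss_loglik(X, *background)
    return llr > threshold, llr

dolph_model = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))       # (mean, var)
background_model = (np.array([3.0, 3.0]), np.array([4.0, 4.0]))  # trained on all other dolphins

rng = np.random.default_rng(1)
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))  # frames from the claimed dolphin
accepted, llr = detect(X, dolph_model, background_model)
print(accepted)
```

Since the test frames match the dolphin model far better than the background model, the log-ratio is positive and the claim is accepted.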

  18. First Experiments - Setup • Take the data we got from Denise • Alan labeled about 160 files • Labels: • dolphin sounds ~370 tokens • electric noise (machine, clicks, others) ~180 tokens • pauses ~220 tokens • Derive Dolphin ID from file name (educated guess) (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy) • Train one model per dolphin, one ‘garbage’ model for the rest • Recognize incoming audio file; hypotheses consist of a list of dolphin and garbage models • Count the number of models per audio file and return the name of the dolphin with the highest count as the one identified
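The counting rule in the last bullet amounts to a majority vote over decoded model labels. A sketch, assuming the recognizer has already produced a per-segment label sequence (the names and the example sequence are illustrative):

```python
# Majority-vote dolphin identification over a decoded label sequence,
# ignoring segments matched by the 'garbage' model.
from collections import Counter

def identify(hypothesis_labels):
    counts = Counter(label for label in hypothesis_labels if label != "garbage")
    if not counts:
        return None  # file contained only garbage segments
    return counts.most_common(1)[0][0]

decoded = ["garbage", "Nippy", "Nippy", "garbage", "Havana", "Nippy"]
winner = identify(decoded)
print(winner)  # prints Nippy
```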

  19. First Experiments - Results

  20. Next steps • Step 1: To build a ‘real’ system we need • MORE audio data MORE audio data MORE ... • Labels (the more accurate the better) • Idea 1: Automatic labeling, live with the errors • Idea 2: Manual labeling • Idea 3: Automatic labeling and post-editing • Step 2: Given more data • Automatic clustering • Try first steps towards unit detection • Step 3: Build a working system, make it small and fast enough for deployment
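The "automatic clustering" of step 2 could start as unsupervised grouping of sound segments into candidate units. A sketch under stated assumptions: segments are already represented as fixed-length feature vectors, and k-means with a hand-picked k stands in for whatever clustering method is eventually chosen.

```python
# Cluster sound-segment features into candidate units without labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-in features: three recurring sound types, 60 segments each.
segments = np.vstack([
    rng.normal([0, 0], 0.3, size=(60, 2)),
    rng.normal([4, 0], 0.3, size=(60, 2)),
    rng.normal([2, 4], 0.3, size=(60, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(segments)
unit_ids = km.labels_             # each segment assigned a candidate unit
counts = np.bincount(unit_ids)    # how often each unit occurs
print(sorted(counts.tolist()))
```

Cluster frequencies like these would feed directly into the "find out about unit frequencies" analysis of slide 14.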
