
Are We Ready? A Look at the State of the Art in Speech-to-text Applications

Are We Ready? A Look at the State of the Art in Speech-to-text Applications. Marie Meteer August 2007. www.everyzing.com. Overview. Speech Recognition: The State of the Art A look back at where it came from Elements of the models State of the art performance





Presentation Transcript


  1. Are We Ready? A Look at the State of the Art in Speech-to-text Applications. Marie Meteer, August 2007. www.everyzing.com

  2. Overview
     • Speech Recognition: The State of the Art
        • A look back at where it came from
        • Elements of the models
        • State of the art performance
     • Applications: Making them work
        • Call Center Analytics
        • Voicemail Transcription
        • Needles in Haystacks
        • Multimedia search

  3. BBN Technologies' Speech Milestones
     [Timeline figure. Year labels: 1976, 1982, 1986, 1992, 1994, 1995, 1998, 2000, 2002, 2003, 2004, 2005. Milestones shown:]
     • Rough'n'Ready prototype system for browsing audio
     • Pioneered statistical language understanding and data extraction
     • Introduced context-dependent phonetic units
     • Early adopter of statistical hidden Markov models
     • DARPA EARS Program Award
     • Exceeded DARPA EARS targets
     • Early continuous speech recognizer using natural language understanding
     • First 40,000-word real-time speech recognizer
     • AVOKE STX 1.0 introduced
     • Audio Indexer System (1st generation)
     • Broadcast Monitoring System delivered to U.S. Gov't. (2nd generation)
     • AVOKE STX 2.0 with Domain Development Tools
     • First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer

  4. Progress in Speech Recognition, 1990s
     [Chart: Word Error Rate (%), axis from 1 to 90, by year from 1987 to 1998. Tasks plotted: Call Home, SWBD (Conversational Telephone), Broadcast News, WSJ 64K Vocab, WSJ 5K Vocab, Resource Management, Airline Task, Resource Mgt Spkr Dep., Connected Digits.]

  5. DARPA EARS for ASR Performance
     [Chart: Word Error Rate Goals. Word error rate (0 to 60) by year (2002, 2003, 2005, 2007), with a ceiling and a floor for broadcast news and for telephony. Annotation: "BBN's 2003 Performance Exceeds".]

  6. Elements of a Speech Model
     • Dictionary
        • List of all the words and their pronunciations, i.e. the sequence of "phonemes" that make up each word
        • Example: Real Networks > R-IY-L N-EH-T-W-ER-K-S
        • Dictionary tool automatically creates phonetic pronunciations for most words
     • Acoustic Model
        • Captures the relationship between the sounds and the phonemes
        • Specific to a language (e.g. English, Spanish) and a channel (e.g. telephony, broadcast)
     • Domain Model
        • Captures the sequences of words in the language using a "tri-gram" model, that is, the likelihood of a word given the two previous words
        • Can be as general as "Conversational" or as specific as "Technology"
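To make the "tri-gram" idea concrete, here is a minimal Python sketch that estimates the likelihood of a word given the two previous words by simple counting. It is an illustration only: the function names, the toy corpus, and the `<s>`/`</s>` padding tokens are assumptions, and a production domain model would add smoothing for unseen word sequences, which this sketch omits.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories from tokenized sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts


def trigram_prob(word, prev2, prev1, trigram_counts, history_counts):
    """Maximum-likelihood P(word | prev2, prev1); 0.0 if the history was never seen."""
    seen = history_counts[(prev2, prev1)]
    if seen == 0:
        return 0.0
    return trigram_counts[(prev2, prev1, word)] / seen


# Toy "domain" text (hypothetical, for illustration only).
corpus = [
    "thanks for calling real networks".split(),
    "thanks for calling technical support".split(),
    "thanks for holding".split(),
]
tri, hist = train_trigram(corpus)

# "thanks for" is followed by "calling" in two of its three occurrences.
p_calling = trigram_prob("calling", "thanks", "for", tri, hist)
p_holding = trigram_prob("holding", "thanks", "for", tri, hist)
```

Training the same counts on "Technology" text rather than "Conversational" text is what would make the model domain specific: the probabilities shift toward the word sequences that domain actually uses.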

  7. Model Requirements
     • Acoustic Data
        • Minimum of 50-100 hours of transcribed data
        • English Broadcast News model trained on 1600 hours of broadcast news data
        • Training data must be a precise transcription with the corresponding audio file (including partial words, "um", laughter, etc.)
     • Domain Modeling Data
        • Text data, either transcribed from audio or taken from the web
        • Does not have to be as precise as for acoustic modeling
        • Has to model both the vocabulary and the "style" of speaking
     • Dictionary
        • Phonetic pronunciations of all of the words

  8. Word Accuracy
     • Recognition performance varies based on audio quality and domain
     • Within News, factors include: speaker, audio quality, background music
     • Across Domains, factors include: speaking style, out-of-vocabulary rate, audio quality
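For readers who want to see how word-level performance is typically scored, the sketch below computes word error rate (WER) as the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. Word accuracy is commonly reported as 1 minus WER; whether the figures in this talk were computed exactly this way is an assumption. The function name and the example hypothesis are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


reference = "i can't make the meeting but i'm available to call in"
hypothesis = "i can make the meeting but i'm available to calling"  # made-up recognizer output

# Two substitutions (can't/can, call/calling) and one deletion (in) over 11 words.
wer = word_error_rate(reference, hypothesis)
word_accuracy = 1.0 - wer
```

Because insertions count as errors, WER can exceed 100% on very poor audio, which is one reason accuracy figures are only comparable within the same domain and audio conditions.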

  9. Document Retrieval Accuracy To correctly retrieve a document, a search term only has to be found once in the document The table below reports on document retrieval accuracy based on words occurring 2 or more times in the document compared with overall word accuracy.
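The reason retrieval accuracy can exceed word accuracy can be sketched with a small calculation. Assuming, purely for illustration, that each occurrence of a search term is recognized independently with probability equal to the word accuracy, the chance of catching the term at least once grows quickly with the number of occurrences. The independence assumption and the numbers below are not from the talk; real recognition errors tend to be correlated (for example, an out-of-vocabulary name is missed every time).

```python
def prob_term_found(word_accuracy, occurrences):
    """P(term recognized at least once), assuming independent errors per occurrence."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


# Illustrative word accuracy of 70% (hypothetical figure).
found_once = prob_term_found(0.70, 1)    # term spoken once
found_twice = prob_term_found(0.70, 2)   # term spoken twice
found_thrice = prob_term_found(0.70, 3)  # term spoken three times
```

Under these assumptions, a term spoken twice is found about 91% of the time even though only 70% of individual words are recognized correctly.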

  10. Markets and Applications
     • Consumer Search (video search)
     • Government Intelligence
     • Call Center Recording
     • Digital Asset Production
     • Broadcast Monitoring & Retrieval (audio/video publication)
     • Enterprise Search (webcasts, corp info)

  11. AVOKE Caller Experience Analytics
     Breakthrough Caller Experience Analytics
     • The Only True End-to-End Solution: from dialing to termination
     • Multiple Techniques To Extract Understanding: prompt and speech recognition, telephony data, and human annotation
     • Data-Driven Insights: with drill-down to listen for root cause
     • Zero Integration: no on-site hardware or software
     To Manage & Optimize Contact Processes
     • Improve Operational Visibility
     • Reduce Agent Time by 15-30+%
     • Boost First Call Resolution
     • Eliminate Customer Dis-Satisfiers

  12. Full Text & Keyword Search
     • Search for words spoken by callers or agents
     • View call with full text of caller and call center, including all IVR(s), queue(s) and agent(s)
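Searching for words spoken by callers or agents can be sketched as a small inverted index over speaker-labelled transcripts. This is a generic illustration of the idea, not a description of how AVOKE implements it; the data layout, function names, and sample calls are all hypothetical.

```python
from collections import defaultdict


def build_index(calls):
    """Map each lowercased word to the set of (call_id, speaker) pairs that said it."""
    index = defaultdict(set)
    for call_id, turns in calls.items():
        for speaker, text in turns:
            for token in text.lower().split():
                word = token.strip(".,?!")  # drop surrounding punctuation
                if word:
                    index[word].add((call_id, speaker))
    return index


def search(index, word, speaker=None):
    """Return sorted call ids where `word` was spoken, optionally by one speaker role."""
    hits = index.get(word.lower(), set())
    return sorted({call_id for call_id, who in hits if speaker is None or who == speaker})


# Hypothetical transcripts: each call is a list of (speaker, text) turns.
calls = {
    "call-001": [
        ("caller", "I want to cancel my account."),
        ("agent", "I can help you with that."),
    ],
    "call-002": [
        ("agent", "Would you like to cancel or upgrade?"),
        ("caller", "Upgrade, please."),
    ],
}
index = build_index(calls)

any_speaker = search(index, "cancel")
caller_only = search(index, "cancel", speaker="caller")
```

Keeping the speaker role in the index is what lets an analyst separate "the caller asked to cancel" from "the agent offered to cancel", which matter differently for root-cause analysis.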

  13. Voicemail Transcription
     • Requirements
        • Near real-time transcription
        • High accuracy, especially on names
        • Frequently very noisy conditions (e.g. a non-native speaker calling on a cell phone from a street corner in Germany)
     • Solution
        • Speech recognition automates a "first pass"
        • Human correction provides accuracy
        • Full human transcription on poor-quality calls

  14. Voicemail Solution: Human in the Loop
     • Phone message is left: "Hi Tom. I can't make the meeting but I'm available to call in. Give me a call at 101-555-1212. Thanks."
     • Speech recognizer produces a rough transcript
     • Transcribers fix the output of the speech recognizer
     • Correct transcription goes back to the server
     • Result: High Quality, Lower Cost
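The voicemail workflow above can be sketched as a simple routing step: the recognizer's first-pass draft either goes to a transcriber for correction or, for poor-quality calls, is set aside in favor of full human transcription. The talk does not say how poor-quality calls are detected; the confidence score, the 0.5 threshold, and all names below are assumptions made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    """First-pass recognizer output for one voicemail (hypothetical structure)."""
    message_id: str
    text: str
    confidence: float  # assumed recognizer score in [0, 1]; not from the talk


def route(draft, poor_quality_threshold=0.5):
    """Choose the human step that follows the automatic first pass."""
    if draft.confidence < poor_quality_threshold:
        # Draft is too unreliable to be worth fixing: transcribe from the audio.
        return "full_human_transcription"
    # Draft is usable: a transcriber only corrects the recognizer's mistakes.
    return "human_correction"


clear_call = Draft("vm-1", "hi tom i can make the meeting", confidence=0.82)
noisy_call = Draft("vm-2", "hi um the the street", confidence=0.21)

clear_route = route(clear_call)
noisy_route = route(noisy_call)
```

The cost saving comes from the first branch being the exception: correcting a mostly right draft takes less transcriber time than typing the whole message from scratch.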

  15. Custom Applications: Broadcast Monitoring
     • Real-time streaming video (<5 min delay)
     • Automatic transcription of Arabic speech from BBN Audio Indexer
     • Automatic translation of Arabic transcript from Language Weaver MT

  16. Multimedia Search
     • Problem: Search engines have historically had very little to work with in terms of properly discovering and indexing multimedia content.
     • Opportunity: The value of multimedia content is "trapped" inside the files, out of view of search engines.
     • Titles and tags miss key concepts within the files:
       "…let’s look at the overall picture not just Obama and and Clinton Brett how do you assess the overall dynamics of what's happened over the course of the last three months how big -- victory for the president how big a defeat for the Democrat well it it. He would have been a bigger defeat it was a victory. This is this is -- reprieve cents for the president it's only as bill pointed out for months worth of funding. And it's and this issue's going to come up again in the Democrats are going to continue to try to impose restrictions on the with a president for a just war -- vote to be funded completely which is what. We're just talking about so. This is just justices have a battle he wanted that's that's nice for him but there's another one coming in just a few months. And of course what we have now is this whole idea that is taken hold and it's it's out there in the in the public parlance about September being in the big month not helpful to the president's cause -- -- for prisoners efforts you know we're not going to -- all the troops on the ground until next month and then visiting get to bounce of the summer to try to fix the situation. Probably unrealistic which in September's going to be a tough month of. ..."

  17. Multimedia Consumption
     • Automatic extraction of key terms and concepts for tagging and categorization
     • Patent-pending "Snippet" navigation technology enables users to jump to relevant segments of the clip
     • Social media integration drives RSS subscription, bookmarking, etc.
     • Full text output enables related content presentation

  18. Multimedia Discovery Example: FoxSports.com
     • EveryZing Media Merchandising indexes the full contents of FoxSports multimedia files.
     • As a result, EveryZing is able to significantly increase the number of keyword results.
     • Great discovery leads to increased consumption and enhanced monetization opportunities.

  19. Summary
     • Speech recognition takes an inaccessible data structure (audio) and turns it into an accessible one (text)
     • It's far from perfect, but it's a big jump from nothing
     • Takeaway: It's the task that matters. Find the right role, and speech recognition works
     • (Corollary: A good prompt is worth two years of research)

  20. Media Merchandising Solutions. Thank you! Marie Meteer, VP of Speech and NLP. mmeteer@everyzing.com, www.everyzing.com
