Speech Recognition

Speech Recognition Yonglei Tao

Voice-Activated GPS

Voice User Interface (VUI) • A VUI allows human interaction with computers through a voice/speech platform • Basic components • System messages • Grammars • Dialog logic • Benefits • Loosens some physical constraints such as screen size • Provides tools for universal design • Disability and situational impairments • Intuitive and efficient

System Architecture

Components • Endpointing • Speech to endpointed utterance • Feature extraction • Endpointed utterance to feature vectors • Recognition • Feature vectors to word string(s) • Natural language understanding • Word string(s) to meaning(s) • Dialog management • Meaning to actions

Typical Recognition Components

Examples • Book, boot • Write, right • Flew, flu, flue • Eight books • Ate books • I scream • Ice cream

Components • Acoustic models • Internal representation of each basic sound • Dictionary • A list of words and their pronunciations • Grammar • Defines all possible strings of words the recognizer can handle • Allows a meaning to be associated with those strings • Either rule-based or statistical (created by computing the probability of words occurring in a given context)

Recognition • Recognition search • A recognizer searches the recognition model to find the best-matching word string • Confidence measures • A quantitative measure of how confident the recognizer is in the best-matching string • VUI developers can use those measures in several ways • N-Best processing • A recognizer returns several results, each with its own confidence measure
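One way a VUI developer might use confidence measures together with N-best results is sketched below. This is a minimal illustration, not the API of any particular engine: the `Hypothesis` and `NBestDemo` classes, the `pickBest` method, and the threshold value are all hypothetical, and confidence is assumed to be a number between 0 and 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** One recognition hypothesis: a word string plus an assumed confidence in [0, 1]. */
final class Hypothesis {
    final String text;
    final double confidence;

    Hypothesis(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

final class NBestDemo {
    /**
     * Returns the highest-confidence hypothesis, or empty when even the best
     * one falls below the developer-chosen rejection threshold.
     */
    static Optional<Hypothesis> pickBest(List<Hypothesis> nBest, double rejectBelow) {
        return nBest.stream()
                .max(Comparator.comparingDouble((Hypothesis h) -> h.confidence))
                .filter(h -> h.confidence >= rejectBelow);
    }

    public static void main(String[] args) {
        // Hypothetical N-best list for an utterance that sounds like "eight books".
        List<Hypothesis> nBest = new ArrayList<>();
        nBest.add(new Hypothesis("ate books", 0.41));
        nBest.add(new Hypothesis("eight books", 0.87));

        String answer = pickBest(nBest, 0.5).map(h -> h.text).orElse("<rejected>");
        System.out.println(answer);
    }
}
```

Returning an empty result instead of a low-confidence guess lets the dialog logic re-prompt the user rather than act on a likely misrecognition.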

Speech Recognition Engines • Microsoft Visual Studio & CMU Sphinx • Grammar • Android • Language model – free form for dictation or web search for short phrases • Google Web Speech API for Web Applications

BNF (Backus-Naur Form) • Notation for context-free grammars • Often used to describe the syntax of programming languages • Also specify the words and patterns of words to be listened for by a speech recognizer • EBNF (Extended Backus-Naur Form) • ABNF (Augmented Backus-Naur Form) • Basis for speech grammar specifications • ABNF for .Net • Regular grammar for Java

Basics • ::= means "is defined as" • | means "or" • < > enclose a category name • Terminal: a basic component • <X> ::= a b c (a sequence) • <Y> ::= a | b | c (alternatives: exactly one of a, b, or c) • <Z> ::= a | a <Z> (one or more)
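The three rule shapes above (a sequence, a set of alternatives, and a recursive "one or more" rule) can be checked against a list of word tokens with a few lines of Java. This is only a sketch for intuition: the `BnfShapes` class and its method names are hypothetical, and a real recognizer compiles grammars rather than hand-coding them.

```java
import java.util.Arrays;
import java.util.List;

final class BnfShapes {
    // <X> ::= a b c   (sequence: exactly the tokens a, b, c in that order)
    static boolean matchesX(List<String> tokens) {
        return tokens.equals(Arrays.asList("a", "b", "c"));
    }

    // <Y> ::= a | b | c   (alternatives: exactly one of the three tokens)
    static boolean matchesY(List<String> tokens) {
        return tokens.size() == 1
                && Arrays.asList("a", "b", "c").contains(tokens.get(0));
    }

    // <Z> ::= a | a <Z>   (recursion: one or more occurrences of a)
    static boolean matchesZ(List<String> tokens) {
        if (tokens.isEmpty() || !tokens.get(0).equals("a")) {
            return false;
        }
        // First alternative: a single "a". Second: "a" followed by another <Z>.
        return tokens.size() == 1 || matchesZ(tokens.subList(1, tokens.size()));
    }
}
```

For instance, `matchesZ` accepts `a`, `a a`, and `a a a`, but rejects an empty token list because the rule requires at least one `a`.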

Example • Grammar for a speech recognition calculator

Visual Studio Speech Recognizer

Speech Recognition with Visual Studio • Examples • http://www.phon.ucl.ac.uk/courses/spsci/compmeth/speech/recognition.html • http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx • Grammar Class • http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx • GrammarBuilder Class • http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx

Speech Recognition for Java • Sphinx 4 • A speech recognition engine written entirely in Java • Created by CMU, Sun, Mitsubishi, HP, … • Open source • Compliant with JSpeech Grammar Format • Platform- and vendor-independent • Programmer’s guide http://cmusphinx.sourceforge.net/sphinx4/ • An example https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld

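Android Speech Recognition

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.view.Menu;
import android.view.View;
import android.widget.Button;
import android.widget.TextView;
import android.widget.Toast;

import java.util.ArrayList;
import java.util.Locale;

public class MainActivity extends Activity {
    // Request code used to match the recognizer's result in onActivityResult
    private static final int VOICE_RECOGNITION = 1;

    Button speakButton;
    TextView spokenWords;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        speakButton = (Button) findViewById(R.id.button1);
        spokenWords = (TextView) findViewById(R.id.textView1);
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(R.menu.main, menu);
        return true;
    }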

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
            ArrayList<String> results;
            results = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            // TODO Do something with the recognized voice strings
            Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
            spokenWords.setText(results.get(0));
        }
        super.onActivityResult(requestCode, resultCode, data);
    }

    // Click handler for the speak button (presumably wired up through
    // android:onClick in the layout, since no listener is set in onCreate)
    public void btnSpeak(View view) {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        // Specify free form input
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
        intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
        // EXTRA_LANGUAGE expects a language tag string rather than a Locale object
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH.toString());
        startActivityForResult(intent, VOICE_RECOGNITION);
    }
}

Android and Web Speech Recognition • Android Voice Recognition Tutorial • http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html • Google Web Speech Recognition Examples • http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/ • http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api • http://apprentice.craic.com/tutorials/37

Challenges for VUI Design • People have very little patience for a "machine that does not understand" • VUIs need to respond to input reliably, or they will be rejected by their users • Designing a usable VUI requires interdisciplinary talents in computer science, linguistics, and human factors • The more closely the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction

Natural Language Understanding • Ambiguity • Refers to phrases that look distinct in print but sound similar when spoken, for example, • “Wreck a nice beach” • “Recognize speech” • As the vocabulary and grammar get larger, the potential for ambiguity increases • Short words and phrases are harder to recognize than longer ones

Language Understanding (Cont.) • Deviation • Deviating from what the developer expects • For example, an issue with the question “Is that correct?” • Expecting a simple response like “Yes”, “No”, or “Correct” • Southern speakers would respond with “Yes, ma’am” or “No, ma’am”
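One possible way to tolerate this kind of deviation is to strip polite trailers before interpreting the answer. The sketch below is an assumption-laden illustration: the `YesNoNormalizer` class, its word lists, and the `interpret` method are hypothetical and would need tuning against real user responses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class YesNoNormalizer {
    private static final Set<String> YES = new HashSet<>(Arrays.asList("yes", "correct"));
    private static final Set<String> NO = new HashSet<>(Arrays.asList("no"));
    // Hypothetical list of polite words that carry no yes/no meaning.
    private static final Set<String> POLITE = new HashSet<>(Arrays.asList("ma'am", "sir", "please"));

    /** Returns true for yes, false for no, or empty when the answer is unclear. */
    static Optional<Boolean> interpret(String utterance) {
        String cleaned = utterance.toLowerCase(Locale.ROOT)
                .replace('\u2019', '\'')      // typographic apostrophe to ASCII
                .replaceAll("[^a-z' ]", " ")  // drop punctuation such as commas
                .trim();

        // Keep only the words that are not polite trailers.
        List<String> content = new ArrayList<>();
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty() && !POLITE.contains(word)) {
                content.add(word);
            }
        }

        if (content.size() != 1) {
            return Optional.empty();
        }
        if (YES.contains(content.get(0))) {
            return Optional.of(Boolean.TRUE);
        }
        if (NO.contains(content.get(0))) {
            return Optional.of(Boolean.FALSE);
        }
        return Optional.empty();
    }
}
```

With these lists, "Yes, ma'am" is treated the same as "Yes", while an unrelated answer yields an empty result so the dialog can re-prompt.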

Discussion • What would you expect if the user asks to start Microsoft Word? • Please start word • Could you start word • Start word • Please open word • Could you open word • Open word
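The six phrasings above share one pattern: an optional polite prefix, a verb, and the word "word". A rough sketch of that pattern as a Java regular expression follows; the `StartWordGrammar` class and the rule names in the comments are hypothetical, and the square brackets in the comment use the EBNF convention for an optional element.

```java
import java.util.regex.Pattern;

final class StartWordGrammar {
    // <command> ::= [<polite>] <verb> word
    // <polite>  ::= please | could you
    // <verb>    ::= start | open
    private static final Pattern COMMAND = Pattern.compile(
            "(?:(?:please|could you)\\s+)?(?:start|open)\\s+word",
            Pattern.CASE_INSENSITIVE);

    /** Returns true only when the whole utterance matches the command rule. */
    static boolean isStartWordCommand(String utterance) {
        return COMMAND.matcher(utterance.trim()).matches();
    }
}
```

All six phrasings map to the same action, so the application only needs to know that the rule matched, not which variant was spoken. Anything outside the rule, such as "Launch word", is rejected.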

Discussion (Cont.) • If the grammar accepts only those, determine whether or not the action to open the application can be as follows:

Language Understanding (Cont.) • Keyword Extraction • Important for applications built with a speech recognizer that returns a string containing the actual words spoken by the user • Leaves the application to interpret their semantic meaning • One might say “Computer, find me some information about the flooding in Detroit recently” • Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI • The others are filler words
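A very simple form of keyword extraction removes known filler words from the recognized string. The sketch below assumes a hand-written filler list chosen to fit the example sentence; the `KeywordExtractor` class and its list are hypothetical, and a real application would likely use a larger stop-word list or a grammar with semantic tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class KeywordExtractor {
    // Hypothetical filler-word list, sized for the example sentence only.
    private static final Set<String> FILLER = new HashSet<>(Arrays.asList(
            "computer", "me", "some", "information", "about", "the", "in", "recently"));

    /** Returns the lowercased non-filler words in their original order. */
    static List<String> keywords(String utterance) {
        List<String> result = new ArrayList<>();
        // Split on anything that is not a letter, which also drops punctuation.
        for (String word : utterance.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty() && !FILLER.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

For the example sentence, this leaves the three words "find", "flooding", and "detroit" (lowercased), which the application can then map to a search action and its arguments.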

Dialog Management • Multimodality • Interaction can occur through different media • Need to consider when, and in which parts of the application, interaction should be multimodal • Grammar • There is a close relationship between what a prompt says and what the caller ends up saying to the system • Especially the words used • Configuration files • You may choose the confidence level below which the recognizer will reject the input rather than return an answer • You may also choose parameters for the endpointer, that is, how long it should listen before timing out

Dialog Management (Cont.) • Error handling • Let the user recover after errors so that the dialog gets back on track • Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application. • Voice recognition accuracy • In-grammar data • Out-of-grammar data

Error Handling • In-grammar data • Correct Accept • The recognizer returned the correct answer • False Accept • The recognizer returned the wrong answer • False Reject • The recognizer could not find a match and gave up • Out-of-grammar data • Correct Reject • The recognizer correctly rejected the input • False Accept • The recognizer returned a value that is wrong because the input is not in the grammar • How should each category be handled?
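When measuring recognition accuracy against transcribed test utterances, each result can be labeled with one of the categories above. The following is a minimal sketch under stated assumptions: the `RecognitionOutcome` class and `classify` method are hypothetical, a rejected input is represented by `null`, and the two kinds of False Accept are merged into one label.

```java
final class RecognitionOutcome {
    enum Category { CORRECT_ACCEPT, FALSE_ACCEPT, FALSE_REJECT, CORRECT_REJECT }

    /**
     * @param inGrammar whether what the user actually said is covered by the grammar
     * @param expected  the transcribed utterance (only meaningful when inGrammar is true)
     * @param returned  the recognizer's result, or null if it rejected the input
     */
    static Category classify(boolean inGrammar, String expected, String returned) {
        if (returned == null) {
            // Rejection is right only when the input was outside the grammar.
            return inGrammar ? Category.FALSE_REJECT : Category.CORRECT_REJECT;
        }
        if (inGrammar && returned.equals(expected)) {
            return Category.CORRECT_ACCEPT;
        }
        // Either the wrong in-grammar answer, or any answer for out-of-grammar input.
        return Category.FALSE_ACCEPT;
    }
}
```

Counting the labels over a test set shows whether the rejection threshold is too strict (many False Rejects) or too lenient (many False Accepts).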

Error Handling in Android
