Listening to the Gamer: Getting Speech Recognition in Games Right
Speaker Information • Jason Hewitt, Advanced Technology Group, Microsoft • Dr. Mike Froggatt, Developer Lead, Microsoft Game Studios
Audience • Are you thinking about adding speech to your game? • Are you targeting a console or the PC? • Portable platforms are also good! • Takeaway: Speech is on our consoles, and it’s easy to add.
Speech in games is not new! • It was in Unreal Tournament 2004! • It was on the PS2! • It’s there, ready to use.
Two ways to listen • General Dictation • Command and Control
Multiple Solutions • Fonix http://www.fonixspeech.com • Platforms: Win32, Xbox 360, PS3, Wii • Languages: US English, UK English, French, German, Italian, Spanish, Japanese, Korean • Voxler http://www.voxler.eu • Platforms: Win32, Xbox 360, PS3, Wii, Nintendo DS, iPhone • Languages: “All major English dialects and European languages” • NuiSpeech • Kinect only • Languages: English (US & UK), French, Japanese, Spanish (Mexico) • Preview models of French (Canada and France), German, Italian, Spanish (Spain), and English (Australia) • Designed specifically for a 10 ft experience • Others are out there
Microphones Overview • Each platform has its own microphones and capabilities, so you can either target the lowest common denominator or customize to each platform’s strengths
Speech has two Inputs • [Diagram labels: Speech Recognition Engine, Grammar, Results]
Apply Good Design Principles • Set your goals at the beginning of the project: • Don’t add speech recognition with a month to spare • Evaluate the tech • Prototype • Rethink the goals • Be consistent • Users expect that what works once will always work • Decide early whether you want to count on a Mic • Remember: requiring Kinect or Move = free Mic
To require a mic? • It’s natural! • New gameplay mechanics • Expands User Control • Controller fallbacks not necessary • But are still a plus!
Or to not require a mic? • Not everyone will have a Mic • Accessibility • Some gamers won’t or can’t talk
Menus and Pausing • Don’t add at the last second! • Think of your menu names and your grammar design at the same time • If you do implement it, let users skip menu pages • Beware of a “Pause” command • False positives can break flow • It may be best to gather the player’s intent before acting • Consider allowing users to disable it
Key Scenario Integration • Focused on scenarios in games that can provide the biggest impact • Dialog tree navigation • Merchant/shop interaction • Most ideas here can again be optional • Allow the controller to be a backup
Full Title Integration • Doesn’t mean voice only (but could) • When approaching the game’s control scheme, consider whether voice makes sense, for example: • Squad commands • Activation of controls • Help mechanisms • Volume of player’s speech levels in a horror or stealth game
What can I say? • Teaching styles • “See it! Say it!” then “Know it! Say it!” • Repeat after me • Explore on your own • Screen awareness • Off-screen awareness
Expandable Menu • [Diagram: expandable speech menu with the entries “Soldier!”, “Joe”, “Frank”, “Attack”, “Defend”, “Retreat”, “How’s the weather?”]
The Basics • XML based • All use the W3C format or a subset of it • http://www.w3.org/TR/speech-grammar/ • Multiple rules can be activated or deactivated at once • Custom pronunciations are available • This helps with in-game items • This can also help with words that are difficult to pronounce or understand
Grammar Size • Check with your middleware provider on how many phrases it recommends • Key point: going beyond the recommended number of phrases means more chances for phrases to be similar and confusable • Manage active phrases with rules • Remember you don’t need the shopkeeper recognition when fighting the dragon • Pause menu interaction should reduce the set of active rules
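The idea of managing active phrases with rules can be sketched as a small state-to-rules table. This is a minimal illustration only: the rule names, game states, and the `GrammarManager` class are hypothetical, and a real title would forward the computed sets to whatever rule enable/disable calls its speech middleware provides.

```python
# Hypothetical mapping from game state to the grammar rules that should be live.
RULES_BY_STATE = {
    "combat": {"squad_commands"},
    "shop": {"shopkeeper", "menu_navigation"},
    "paused": {"pause_menu"},
}


def rule_changes(current_active, new_state):
    """Return (to_activate, to_deactivate) for a transition into new_state."""
    wanted = RULES_BY_STATE.get(new_state, set())
    return wanted - current_active, current_active - wanted


class GrammarManager:
    """Tracks which rules are active; the engine calls themselves are omitted."""

    def __init__(self):
        self.active = set()

    def enter_state(self, state):
        to_activate, to_deactivate = rule_changes(self.active, state)
        # A real integration would pass these two sets to the speech engine here.
        self.active = (self.active - to_deactivate) | to_activate
        return to_activate, to_deactivate
```

Keeping the table in data rather than scattered through gameplay code makes it easier to see, per state, how many phrases are competing with each other.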
Evolving the Grammar • Start with a small initial word set • Do not proactively add too many recognition phrases • See through play testing where gamers go • Handle the common cases • Synonyms are a slippery slope • Especially in a See it! Say it! scenario • Multiple iterations provide better tuning
Test Each Iteration! • Record your users saying phrases both in and out of grammar • Consider automated nightly tests of each grammar iteration • Measure false negatives, false positives, and success rates • Test in-game scenarios • If two grammars are active at the same time, you must test them together
Working with Limitations • Speech is not perfect • Generally speech works best when • Background noise is minimal • Speaker enunciates • The grammar size is within the recommended limit
Working with Side Talk • There may be other noises in the room that the mic picks up • Remember you can still respond to side talk! • “Hey, you talking to me?” • “Sorry, my (language) is very limited.” • Test with a garbage rule
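A nightly test over recorded utterances could tally results along these lines. This is a sketch under stated assumptions: each recording is labelled with the command it should produce (`None` for out-of-grammar speech), a wrong command is counted as a false positive, and `score_run` is a hypothetical helper rather than part of any speech SDK.

```python
from collections import Counter


def score_run(results):
    """Tally (expected, recognized) pairs; None means 'no command'."""
    counts = Counter()
    for expected, recognized in results:
        if recognized == expected:
            # Either the right command fired, or out-of-grammar speech was rejected.
            counts["correct" if expected is not None else "correct_reject"] += 1
        elif recognized is None:
            # An in-grammar phrase was spoken but nothing was recognized.
            counts["false_negative"] += 1
        else:
            # A command fired that should not have (wrong command or noise).
            counts["false_positive"] += 1
    total = sum(counts.values())
    good = counts["correct"] + counts["correct_reject"]
    return {
        "correct": counts["correct"],
        "correct_reject": counts["correct_reject"],
        "false_negative": counts["false_negative"],
        "false_positive": counts["false_positive"],
        "success_rate": good / total if total else 0.0,
    }
```

Comparing these numbers between grammar iterations shows whether a newly added phrase or synonym made the overall set more confusable.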
Working with Failure • Even a speech recognition failure should be a success • Handle misfires and repeats as part of the game • NPCs can have headaches, migraines, or explain their misunderstanding • “Sorry, what was that? I was thinking about sheep.”
Localization • Begin localization after most design decisions are locked down • Iterate and design in your native language • Begin before it’s too late to work with translators, manuals, etc. • Be wary of text/UI translators • Spoken language can differ from written language • Recommend audio translators • Leverage your existing in-game dialogue translation team • They know the right voice to use for communication • “See it! Say it!” implementations will need to be translated by this same team • Have native speakers test • More than one native speaker is always better
Localization • Provide plenty of background on the situation to the translator; the more info, the better • You should be doing this for in-game dialogue already; your team’s localization expert will be able to provide guidance here • One word in one language may map to three in another, and vice versa, so provide context for each situation • Remember to coordinate changes across languages
Listening to the Gamer: Getting Speech Recognition in Games Right • Kinectimals Speech Post-Mortem
“If I could talk with the animals…” • Kinectimals was the standard-bearer for speech recognition at Kinect launch • Lofty goals: • Natural interaction with animal through speech • Praise, issue commands, call animal by name • Ultimately delivered robust recognition for a reasonable command set • Animal naming most challenging component to implement
Design
<grammar xml:lang="en-us" version="1.0" root="dash_commands">
  <rule id="dash_commands" scope="public">
    <one-of>
      <item>
        Hey, is this thing on? Xbox, can you hear me? Hey Jimmy! Come look at this! The Xbox understands me!
        <tag> exec "dash.xex /upgrade_to_gold_account /quiet" </tag>
      </item>
      <item>
        Oh Xbox, you’re my only friend - my girlfriend’s left me and no one understands me like you do.
        <tag> exec "halo_reach.xex" </tag>
      </item>
    </one-of>
  </rule>
</grammar>
Design Giveth… • Game design is our friend • No expectation of animals understanding speech perfectly • Player more forgiving of incorrect or failed recognition • Children interpreted failed recognition as animal “being naughty”
…Design Taketh Away • Design is our enemy • Familiar situation produces habitual response • Expectation that what a real animal responds to, the game will respond to • Commands framed with non-essential vocal noise • “Hey Skittles, sit down, please” • Speech commands often mode-less • Where to allow / disallow them?
Don’t Both Talk at Once • Narrator character introduced late in design • Gave instruction on gestures and speech commands to use • Narrator saying “Sit down” often made animal sit down • Specific hardware can help with this • Kinect has array microphone with Multichannel Echo Cancellation (MEC) • Effectiveness dependent on microphone calibration • Better to avoid issue altogether if possible • Disable speech recognition while narrator speaking • Watch out for NPC speech triggering commands during gameplay • Example: team-mates shouting “Take cover!”
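One way to keep narrator or NPC lines from triggering commands is to ignore recognition results that begin while game-generated speech is playing. The sketch below filters results on the game side rather than stopping the engine, which is a substitution for the "disable recognition" option on the slide. `SpeechGate`, its method names, and the tail period are all hypothetical.

```python
class SpeechGate:
    """Rejects utterances that start while game-generated speech is playing."""

    def __init__(self, tail_seconds=0.3):
        # Assumed grace period after a line ends, to cover room echo.
        self.tail_seconds = tail_seconds
        self._muted_until = 0.0

    def narrator_started(self, now, duration):
        """Call when a narrator/NPC line of the given length begins playing."""
        end = now + duration + self.tail_seconds
        self._muted_until = max(self._muted_until, end)

    def accept(self, utterance_start_time):
        """True if an utterance starting at this time should be processed."""
        return utterance_start_time >= self._muted_until
```

A title with team-mates shouting “Take cover!” could call `narrator_started` for those lines too, so only the colliding window is muted rather than the whole encounter.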
Final Grammar
<rule id="reserved" scope="public">
  <one-of>
    <item>
      <token sapi:display="Kinect" sapi:pron="K IH N EH K T"> kinect </token>
    </item>
  </one-of>
</rule>
• Most complex command grammar: • Concurrent detection of 16 different phrases • Mapped to 9 distinct commands (“Sit” equivalent to “Sit down”) • Name recognition also running • Some state-based selection of different grammars • However this was worst-case scenario (most rules active) • Manually specifying phonemes for a given rule can help increase recognition accuracy • May be needed for proprietary or game-specific terms like character names • Built-in text-to-phoneme rules may not work well in these cases
Playing <tag> • <tag> element allows a single semantic to be associated with multiple utterances • Also provides language invariance • Great way to encode per rule data • Accept confidence threshold, for example • Parsed at run-time, so don’t go overboard
<item><one-of><item> sit </item><item> sit down </item></one-of><tag>Sit</tag></item>
<item><one-of><item> go play<tag>conf=0.45</tag></item></one-of><tag>Dismiss</tag></item>
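To make the <tag> idea concrete, the two items from the slide can be read into a phrase table that maps each utterance to its semantic and an accept threshold. This is only an offline illustration: the wrapping `<one-of>`, the default threshold of 0.6, and the `build_phrase_table` and `interpret` helpers are assumptions, and a real engine would hand back the tag contents itself.

```python
import xml.etree.ElementTree as ET

# The two <item> entries from the slide, wrapped in a <one-of> so they parse.
FRAGMENT = """
<one-of>
  <item><one-of><item> sit </item><item> sit down </item></one-of><tag>Sit</tag></item>
  <item><one-of><item> go play<tag>conf=0.45</tag></item></one-of><tag>Dismiss</tag></item>
</one-of>
"""

DEFAULT_CONF = 0.6  # assumed default accept threshold, not from the talk


def build_phrase_table(xml_text, default_conf=DEFAULT_CONF):
    """Map each phrase to (semantic, accept threshold)."""
    table = {}
    root = ET.fromstring(xml_text)
    for command_item in root.findall("item"):
        # The outer <tag> names the semantic shared by all phrases in this item.
        semantic = command_item.findtext("tag").strip()
        for phrase_item in command_item.findall("one-of/item"):
            phrase = (phrase_item.text or "").strip()
            conf = default_conf
            # An inner <tag> of the form conf=N overrides the default threshold.
            inner = phrase_item.findtext("tag")
            if inner and inner.strip().startswith("conf="):
                conf = float(inner.strip()[len("conf="):])
            table[phrase] = (semantic, conf)
    return table


def interpret(table, phrase, confidence):
    """Return the semantic for a recognized phrase, or None if rejected."""
    entry = table.get(phrase)
    if entry is None:
        return None
    semantic, threshold = entry
    return semantic if confidence >= threshold else None
```

Because game code only ever sees `Sit` or `Dismiss`, the spoken phrases can be translated or extended without touching gameplay logic.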
Please Stop Talking • Speech is unpredictable • Valid utterances may vary widely in length • Background noise may end up being processed for recognition • Changing state of Speech Recognition engine may incur unexpected synchronization delays • Can occur when stopping recognition, changing rule states or loading new grammars • Bugs can become highly context-sensitive • May see occasional frame-outs when tested in noisy open-plan area, but not when tested in closed office • Easiest option: run all game-side speech processing on separate thread • Move off the h/w thread that the main game is using • Speech will typically not saturate a core
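The "separate thread" option can be sketched with a worker that owns all speech-side processing and a per-frame poll that never blocks. This is an illustrative pattern only: `SpeechWorker` and its methods are hypothetical, and in a native engine the worker would be a real thread scheduled away from the main game's hardware thread.

```python
import queue
import threading


class SpeechWorker:
    """Processes raw recognition results off the main game thread."""

    def __init__(self, handle_result):
        self._inbox = queue.Queue()    # raw results from the speech engine
        self._outbox = queue.Queue()   # processed events for the game loop
        self._handle_result = handle_result
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, raw_result):
        """Called from the engine callback; returns immediately."""
        self._inbox.put(raw_result)

    def _run(self):
        while True:
            raw = self._inbox.get()
            if raw is None:  # sentinel posted by stop()
                break
            self._outbox.put(self._handle_result(raw))

    def poll(self):
        """Called once per frame by the game thread; never blocks."""
        events = []
        while True:
            try:
                events.append(self._outbox.get_nowait())
            except queue.Empty:
                return events

    def stop(self):
        """Finish any queued work, then shut the worker down."""
        self._inbox.put(None)
        self._thread.join()
```

The game loop only ever calls `poll()`, so a long utterance or a slow grammar change delays speech events rather than dropping a frame.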
Name Your Animal (NYA) • Allow player to speak name they want to use • No attempt to turn spoken name into real text (for display) • Instead use a pictorial (camera capture) representation for identification • Implemented as free form speech to phoneme conversion • Then use phonemes to build a grammar rule with custom pronunciation • Name used to attract animal’s attention, just as it would be in real life • Pushes the limits of NuiSpeech
NYA Challenges • Used a special grammar for speech to phoneme conversion • Much larger than normal command grammars • 11.5MB for largest NYA grammar vs. 5KB for largest command grammar • Also requires a dynamic grammar to add the “name” rule to • So even more memory for the acoustic model • Much more sensitive to environmental noise than the normal speech commands • Naming process would sometimes drive itself to completion from noise in the room • Watched for some reserved terms (“Kinect”, “Xbox”), no attempt to catch swearing etc. • Space of potentially prohibited terms simply too large • Reject names that are too long, as they are difficult for the player to repeat successfully
NYA Flow • [Diagram: generated phoneme string “CH I Z AX N” vs. ideal string “CH I T AX”] • One utterance unlikely to be sufficient to get the “right” name • Allow a number of attempts to successfully repeat name • Hopefully deals with player trying to mislead the system • If no repeat in sensible number of attempts, prompt player to try different name • Try to avoid player getting stuck trying to repeat “problem” name • Balance ease of use when player is using the system “correctly” against rejecting noise as a naming attempt
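The repeat-and-confirm flow can be sketched as follows. Note the substitution: Kinectimals built a grammar rule from the phonemes and let the recognizer do the matching, whereas this sketch compares phoneme strings directly with an edit distance purely to illustrate the attempt limits. The function names and all three limits (`max_attempts`, `max_distance`, `max_phonemes`) are hypothetical.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete pa
                cur[j - 1] + 1,            # insert pb
                prev[j - 1] + (pa != pb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def confirm_name(first, repeats, max_attempts=3, max_distance=1, max_phonemes=8):
    """Decide the outcome of a naming session from space-separated phoneme strings."""
    candidate = first.split()
    if len(candidate) > max_phonemes:
        return "too_long"  # long names are hard to repeat reliably
    for attempt in repeats[:max_attempts]:
        if phoneme_distance(candidate, attempt.split()) <= max_distance:
            return "accepted"
    # No close repeat within the attempt budget: ask for a different name.
    return "try_different_name"
```

Tightening `max_distance` rejects more room noise but makes honest repeats fail more often, which is the balance the slide describes.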
NYA Internationalization • Separate speech to phoneme grammar for each NuiSpeech language • NYA accuracy varies across languages • US English NYA works well for languages other than English • Tested in 11 additional countries • Allowed us to support NYA in countries that weren’t supported by NuiSpeech at launch
The Challenge of Testing Speech • Human beings very good at spotting patterns • Even non-existent ones • Easy to find reasons why speech works better or worse • “Speech works better when I wear a blue shirt!” • In reality, recognition strongly influenced by exact acoustic environment • So test with lots of people, and lots of different conditions • Individual office vs. open plan • Look at whether player successfully completes tasks with speech • Not just whether individual commands are recognised (too conservative) • Watch out for commands that never seem to work, however! • Make low-level speech success / failure events visible • On-screen log is very useful
Heed the Advice of W. C. Fields • Never work with children or animals • Kinectimals had both… • Recognition confidences for children inherently lower than for adults • Can be self-conscious about “talking to the TV” leading to them not speaking clearly • If they become frustrated, they may shout or do other things that make recognition worse, not better • Tutor them through which speech commands to use, and how best to say them • Set confidence thresholds lower and accept some degree of False Accepts for adult speakers • This can be difficult since your test / development team will get a worse experience
What We Learnt • Integration of speech recognition system straightforward (even with NYA) • But testing hard and time-consuming! • Look at task completion, not purely at recognition accuracy • Players will probably not notice occasionally having to repeat commands • Contrast issuing commands to the game, versus talking to an in-game character • Issuing commands: small command set, but very high accuracy required • Talking to character: more tolerant of failed recognition, but larger command set, or even natural language expected • Naming things via speech is hard • You probably won’t have access to generic speech-to-text capabilities • If you can, use text input to acquire the name and then add it dynamically as a grammar rule • You may want a custom lexicon of common / difficult names to ensure correct phonemes used • Accept you may not be able to please everyone all the time • Weight success towards your primary audience
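The advice to take a name as text and add it as a grammar rule, backed by a custom lexicon, might look roughly like this. It is a sketch only: the lexicon entry and its phoneme string are invented for illustration, `name_rule` is a hypothetical helper, and the output is a fragment that assumes the enclosing grammar declares the `sapi` namespace used by the `sapi:display` and `sapi:pron` attributes shown in the Final Grammar slide.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical custom lexicon: typed name -> hand-tuned phoneme string.
LEXICON = {"skittles": "S K IH T AX L Z"}


def name_rule(name, lexicon=LEXICON, rule_id="animal_name"):
    """Build a grammar rule fragment for a name entered as text."""
    spoken = name.lower()
    attrs = f" sapi:display={quoteattr(name)}"
    pron = lexicon.get(spoken)
    if pron:
        # Only override pronunciation when the lexicon has an entry;
        # otherwise the engine's own text-to-phoneme rules apply.
        attrs += f" sapi:pron={quoteattr(pron)}"
    return (
        f'<rule id={quoteattr(rule_id)} scope="public">'
        f"<item><token{attrs}> {escape(spoken)} </token></item>"
        "</rule>"
    )
```

Names missing from the lexicon still get a rule, so the lexicon can start small and grow as play testing reveals which names are misrecognized.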
Thank you to… • Xbox Platform Speech Team • Kinectimals Team at Frontier • No animals were harmed in the making of this game • A few testers lost their voices however