260 likes | 328 Views
Characterizing and Processing Robot-Directed Speech. Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal MIT AI Lab Humanoid Robotics Group. Characterizing and Processing Robot-Directed Speech. Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal
E N D
Characterizing and Processing Robot-Directed Speech Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal MIT AI Lab Humanoid Robotics Group
Characterizing and ProcessingRobot-Directed Speech Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal MIT AI Lab Humanoid Robotics Group
Characterizing and Processing Robot-Directed Speech Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal MIT AI Lab Humanoid Robotics Group
Characterizing and ProcessingRobot-Directed Speech Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal MIT AI Lab Humanoid Robotics Group
Speech Recognition • What speech does the robot evoke? • How can we process that speech? • Support a growing vocabulary • Without hurting recognition accuracy
Extending Vocabulary /lft/ /ryt/ /yn/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /ylo/ /lft/ /lft/ /lft/ /lft/ /lft/ /lft/ /lft/ /lft/ /lft/ /lft/ /ryt/ /lft/ /ryt/ Extending Vocabulary /ryt/ /ryt/ /ryt/ /ryt/ /ryt/ /ryt/ /grin/ /grin/ /grin/ /grin/ /grin/ /grin/ /grin/ /grin/ /ylo/ /ryt/ /lft/ /ryt/ /ylo/ /grin/ Baseline System
Baseline System • “Magic word” to introduce vocabulary • Separation of introduction from use • Robot confused by unknown/filler words
13 children 5-10 years old 20 minute sessions Some prompting (Turkle et al) Study: Characterizing Speech
10 9 8 7 6 number of utterances per minute (approx) 5 4 3 2 1 0 0 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 children (cumulative %) Number of Utterances
100 90 80 70 60 single word utterances (%) 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 children (cumulative %) Single Word Utterances
100 90 80 70 60 utterances with enunciation (%) 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 children (cumulative %) Utterances with Enunciation
Cooperative Speech Exists • Some clean word-learning scenarios • “Tiffany” • Can we use them as leverage? • “My name is Tiffany” • What recognition rates are possible?
Study: Processing Speech • “LCSINFO” domain (Glass, Weinstein) • Provide some initial vocabulary • Build language model without transcripts
Language Model word 1 word 2 word n
OOV Out Of Vocabulary Model word 1 word 2 word n
phone 1 phone 2 phone n Out Of Vocabulary Model OOV (Bazzi)
Run recognizer Extract OOV fragments Identify rarely-used additions Identify competition Update Language Model Add to lexicon Remove from lexicon Update lexicon, baseforms Hypothesized transcript N-Best hypotheses
Keyword Error Rate (%) coverage (%)
Qualitative Results Given: 1600 Utterances “email, phone, room, office, address” Finds: 1. n ah m b er 6. p l iy z 2. w eh r ih z 7. ae ng k y uw 3. w ah t ih z 8. n ow 4. t eh l m iy 9. hh aw ax b aw 5. k ix n y uw 10. g r uw p
Conclusions • What about acoustic models? • Idiosyncratic vocabulary is important • Pick a good name for your robot!
Acknowledgements • MIT Initiative on Technology and Self • MIT Spoken Language Systems group • DARPA • NTT