Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship Clifford Nass Stanford University

Speaking is Fundamental • Fundamental means of human communication • Everyone speaks • IQs as low as 50 • Brains as small as 400 grams • Humans are built for words • Learn new word every two hours for 11 years

Listening to Speech is Fundamental • Womb: Mother’s voice differentiation • One day old: Differentiate speech vs. other sounds • Responses • Brain hemispheres • Four day olds: Differentiate native language vs. other languages • Adults: • Phoneme differentiation at 40-50 phonemes per second • Cope with cocktail parties

Listening Beyond Speech is Fundamental • Humans are acutely aware of para-linguistic cues • Gender • Personality • Accent • Emotion • Identity

Humans are Wired for Speech • Special parts of the brain devoted to • Speech recognition • Speech production • Para-linguistic processing • Voice recognition and discrimination

Therefore … Voice interfaces should be the most Enjoyable, Efficient, & Memorable method for providing and acquiring information

Are They? No! Why Not? • Machines are different from humans • Technology is insufficient • But are these good reasons?

Critical Insights • Voice = Human • Technology Voice = Human Voice • Human-Technology Interaction = Human-Human Interaction

Where’s the Leverage? • Social sciences can give us • What’s important • What’s unimportant • Understanding • Methods • Unanswered questions

Male or Female Voice? • Is gender important? • Can technology have gender?

The Case of BMW

Brains are Built to Detect Voice Gender • First human category • Infants at six months • Self-identification by 2-3 years old • Within seconds for adults • Multiple ways to recognize gender in voice • Pitch • Pitch range • Variety of other spectral characteristics

Once Person Identifies Gender by Voice • Guides every interaction • Same-gender favoritism • Trust • Comfort • Gender stereotyping

Gender and Products • Gender should match product • More appropriate • More credible • Mutual influence of voice and product gender • Female voices feminize products (and conversely) • Female products feminize voices (and conversely) • “Match principle”

Research Context • “Gender” of voice (synthetic) • Gender of user • “Gender” of product • E-Commerce website

Examples of Advertisements • “Female” voice; female product • “Male” voice; female product • “Male” voice; male product

Appropriateness of the Voice

Voice/Product Gender Influences • Female voices feminize products; male voices masculinize products • Strongest for opposite gender products • Female products feminize voices; male products masculinize voices • Strong preference when voice matches product

Results for User Gender • People trust voices that match themselves • Females conform more with “female” voices • Males conform more with “male” voices • People like voices that match themselves • Females like the “female” voice more • Males like the “male” voice more

Other Results • Participants denied stereotyping technology • Participants denied harboring stereotypes!

People stereotype voices by gender • Voice “gender” should match content “gender” • Product descriptions • Teaching • Praise • Jokes

Gender is Marked by Word Choice • Female speech • More “I,” “you,” “she,” “her,” “their,” “myself” • Less “the,” “that,” “these,” “one,” “two,” “some more” • More compliments • More apologies • More relationships between things • Less description of particular things • “They” for living things only • Voices should speak consistently with their “gender”

Selecting Voices • Voices manifest many traits • Gender • Personality • Age • Ethnicity • Voice traits should match content traits • Content • Language style • Appearance (e.g., accent and race) • Context • Voice traits should match user traits

If Only One Voice • Consider stereotypes • Masculine vs. feminine (same voice) • Boost high frequencies (feminine) • Boost low frequencies (masculine)
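One rough way to picture the frequency boosts above is a simple two-band split of the audio samples. The sketch below is an illustration only, not the processing used in the research: the one-pole filter, the `alpha` and `gain` parameters, and the function names `split_bands` and `boost` are all assumptions introduced here.

```python
def split_bands(samples, alpha=0.1):
    """Split samples into low and high bands with a one-pole low-pass filter.

    alpha (hypothetical parameter) sets the smoothing: smaller values keep
    only slower changes in the low band.
    """
    low = []
    state = samples[0] if samples else 0.0
    for x in samples:
        # Move the running state a fraction of the way toward the new sample.
        state += alpha * (x - state)
        low.append(state)
    # Whatever the low-pass filter removed is treated as the high band.
    high = [x - l for x, l in zip(samples, low)]
    return low, high


def boost(samples, band, gain=0.5, alpha=0.1):
    """Add `gain` times the chosen band ("high" or "low") back onto the signal."""
    if band not in ("high", "low"):
        raise ValueError("band must be 'high' or 'low'")
    low, high = split_bands(samples, alpha)
    extra = high if band == "high" else low
    return [x + gain * e for x, e in zip(samples, extra)]


# A fast-alternating signal gains amplitude when the high band is boosted.
brighter = boost([1.0, -1.0, 1.0, -1.0], "high", gain=0.5, alpha=0.5)
```

A production system would more likely use a proper shelving equalizer; the point here is only that one recorded or synthesized voice can be shifted in either direction by the same mechanism.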

Emotions

Emotion and Voice • Voice is the first indicator of emotion • Voice emotion has many markers • Pitch • Value • Range • Change rate • Amplitude • Value • Range • Change rate • Words per minute
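The markers listed above can be sketched as simple summary statistics over a pitch or amplitude contour. This is a minimal illustration under stated assumptions: "value" is taken as the mean, "range" as maximum minus minimum, and "change rate" as mean absolute change per second. The function names and the fixed frame spacing are hypothetical, not definitions from the talk.

```python
from statistics import mean


def contour_markers(samples, frame_seconds):
    """Summarize a contour (pitch in Hz, or amplitude) sampled at a fixed spacing.

    Assumed definitions: value = mean, range = max - min,
    change_rate = mean absolute difference between frames, per second.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if frame_seconds <= 0:
        raise ValueError("frame_seconds must be positive")
    deltas = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return {
        "value": mean(samples),
        "range": max(samples) - min(samples),
        "change_rate": mean(deltas) / frame_seconds,
    }


def words_per_minute(word_count, duration_seconds):
    """Speech rate from a word count and an utterance duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count * 60.0 / duration_seconds


# Example: a pitch contour sampled every 0.5 seconds.
pitch = contour_markers([200.0, 220.0, 200.0, 220.0], frame_seconds=0.5)
rate = words_per_minute(30, 12.0)
```

The same `contour_markers` call works for amplitude, which is why the slide lists value, range, and change rate under both pitch and amplitude.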

Emotion is always relevant • User has initial emotion • Interactions create emotions • Voice is particularly powerful • Frustration is particularly powerful

Emotion and Technology • Could technology-based voices exhibit emotion? • Could technology-based voice emotion influence people?

Research Context • Create upset or happy drivers • Have them “drive” for 15 minutes • Female voice gives information and makes suggestions • Upbeat • Subdued

Number of Accidents

Results • People speak to car much more when emotion is consistent • People like car much more when emotion is consistent

Implications • User emotion is a critical part of any interaction • Emotion must match content • Perception of voice • Trust • Intelligence • User • Performance • Comfort • Enjoyment

One Voice Emotion: Select for Goal • Overall liking • Slightly happy voice • Attention-getting • Anger • Sadness • Trust and vulnerability • Sadness (mild)

If You Can’t Manipulate Voice Emotion • Manipulate content • Manipulate music

Using the First Person: Should IT say “I”?

Should Voice Interfaces say “I”? • When should a voice interface say “I”? • Does synthetic vs. recorded speech affect the answer to the previous question?

The Importance of “I” • “I” is the most basic claim to humanity • “I think, therefore I am” • “I, Robot” • Dobby and monsters don’t say “I” • “I” is the marker of responsibility • “I made a mistake” vs. “Mistakes were made”

Research Context • Auction site • Telephone interface with speech recognition • Recorded bidding behavior • Online questionnaire

Average Bidding Price

Results • When “I”+Recorded or “No I”+Synthetic • System is higher quality • Users were much more relaxed • “No I” is more objective • “I” is more “present”

Results • “I” is right for embodiments • Robots • Characters • Autonomous intelligence (“KITT”) • “I” is wrong when voice is second fiddle to technology • Traditional car • Heavily-branded products

Design • Text-to-Speech is a machine voice • Recorded speech is a human voice • Design questions are • Not philosophical questions • Not judgment questions • Experimentally verifiable

Mistakes are Tough to Talk About

Who is Responsible for Errors? • Recognition is not perfect • When system fails, who should be assigned responsibility? • System • User • No one

Responding to Errors • Modesty • Likable • Unintelligent (people believe modesty!) • Criticism • Isn’t really constructive • Unpleasant • Intelligent • Scapegoating • Effective • Safe

System Responses to Errors • System blame (most common) • No blame • User blame

Research Context • Amazon-by-phone • Numerous planned interaction errors

Book Buying

Results • Neutral and system blame • Sell much better than user blame • Neutral blame • Easier to use than system blame • Nicer than system blame • User blame is most intelligent! • System blame is least intelligent

Results for Errors • Take responsibility when unavoidable • Increases trust • Increases liking • Weak negative effect on intelligence • Ignore errors whenever possible • Duck responsibility to third party if needed • Blame the phone line • Blame the road
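The guidance above could be sketched as a small decision rule for a dialogue system. The ordering used here (ignore first, then a third party, then system responsibility) is one reading of the slide, and the names `ErrorResponse`, `choose_error_response`, and the prompt wording are hypothetical, not taken from the study. User blame is deliberately left out, since it sold worst in the results.

```python
from enum import Enum


class ErrorResponse(Enum):
    IGNORE = "ignore"
    BLAME_THIRD_PARTY = "blame_third_party"
    TAKE_RESPONSIBILITY = "take_responsibility"


# Hypothetical prompt wording, for illustration only.
PROMPTS = {
    ErrorResponse.IGNORE: "Which title would you like?",
    ErrorResponse.BLAME_THIRD_PARTY: "There seems to be noise on the line. Which title would you like?",
    ErrorResponse.TAKE_RESPONSIBILITY: "I didn't catch that. Which title would you like?",
}


def choose_error_response(can_ignore, third_party_plausible):
    """Pick a response to a recognition error.

    Assumed priority: ignore the error whenever possible, shift
    responsibility to a third party if one is plausible, and take
    responsibility only when neither option is available.
    """
    if can_ignore:
        return ErrorResponse.IGNORE
    if third_party_plausible:
        return ErrorResponse.BLAME_THIRD_PARTY
    return ErrorResponse.TAKE_RESPONSIBILITY


def prompt_for(response):
    """Look up the (hypothetical) prompt text for a chosen response."""
    return PROMPTS[response]


# Example: the error cannot be ignored, but the caller is on a phone line.
choice = choose_error_response(can_ignore=False, third_party_plausible=True)
```

Which branch applies in practice is a design question that can be tested experimentally, in the same spirit as the studies described here.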
