Speaking while monitoring addressees for understanding

Speaking while monitoring addressees for understanding Seminar „Gaze as function of instructions - and vice versa“ Herbert H. Clark and Meredyth A. Krych Torsten Jachmann 16.12.2013

Research Question • Speaking and listening in dialog • Unilateral • Speakers and listeners act autonomously • No interaction • Bilateral • Speakers and listeners monitor their respective partner • Joint activity • What do speakers monitor? • How do they use that information?

Grounding • Level 1 • Attend to vocalization • Level 2 • Identify words, phrases and sentences • Level 3 • Understand the meaning • Level 4 • Consider answering

Grounding A: Were you there when they erected the new signs? B: Th… which new signs? (Level 3) A: Little notice boards, indicating where you had to go for everything B: No. Bilateral account

Monitoring • Voices • Attention to the partner's utterances • Faces • Gaze and facial expressions as indicators of understanding • Workspaces • Region in front of the body • Manual gestures (but also games, etc.)

Monitoring • Bodies • Head and torso movements as indicators • Shared Scenes • Scenery beyond workspace • Signals vs. Symptoms • Signals are constructed to get meaning across • Symptoms are not intentionally created

Least joint effort • Opportunistic • Selection of the available methods that take the least effort to produce • “Tailored” • Overhearers (not monitored by speaker) may misunderstand utterances

Method • Pairs of directors and builders • 76 students (34 male / 42 female) • Instructions to build 10 simple Lego models • 2 x 2 design (interactive) • 28 pairs • Additional non-interactive condition • 10 pairs • Video and audio analyses

Interactive • Mixed design • Workspace (between subjects) • Visible • Invisible • Faces (within subjects) • Visible • Invisible • No restrictions on time or talk

Non-interactive • Only one condition • Director records instructions • No time or talk constraints • Prototype can be examined for as long as desired before recording • Builders listen to instructions • No constraints on actions • Start, stop, rewind

Results • Efficiency • Turns • Gestures and grounding • Deictic expressions • Gestures by addressees • Cross-timing of actions • Timing strategies • Visual monitoring

Efficiency • Visibility of workspace improves efficiency

Efficiency Non-interactive • Building takes much longer (245 s non-interactive vs. 183 s interactive) • Strong drop in accuracy • Inadequate instructions
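To put the two building times from the slide in proportion, the following minimal sketch computes the absolute and relative difference. Only the two values (245 s and 183 s) come from the slide; the constant and function names are illustrative assumptions, and the slide does not say which interactive condition the 183 s refers to.

```python
# Building times in seconds, as given on the slide.
NON_INTERACTIVE_S = 245
INTERACTIVE_S = 183


def relative_increase(slower: float, faster: float) -> float:
    """Return how much longer `slower` is than `faster`, as a fraction of `faster`."""
    return (slower - faster) / faster


# Absolute and relative difference between the two conditions.
extra_seconds = NON_INTERACTIVE_S - INTERACTIVE_S
increase = relative_increase(NON_INTERACTIVE_S, INTERACTIVE_S)

print(f"{extra_seconds} s longer, about {increase:.0%} more time")
# prints "62 s longer, about 34% more time"
```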

Turns • Fewer SPOKEN turns by the builder when the workspace is visible

Deictic expressions • Mainly unusable when the workspace is hidden • Joint attention needed • Otherwise only refer to a previously mentioned situation

Gestures by addressees • Mostly accompanied by deictic utterances (if any) • Explicit verdict usually only on such utterances (otherwise continuing)

Cross-timing • Gestural signals • Reflect understanding at that moment

Cross-timing • Overlapping signals • Usually not in spoken dialog • Start with “sufficient information”

Cross-timing • Projecting • Prediction of following actions/instructions

Cross-timing • Initiation time • Waiting until the partner is able to attend to the following utterance

Cross-timing • Time uptake • Responses have to be timed exactly to the action and situation

Timing strategies • Self-interruption • Dealing with evidence from the addressee • Usually not continued

Timing strategies • Collaborative references • Deictic references rely on the addressee's actions

Visual monitoring • Mainly used when director reaches a problem • Eye gaze as support

Conclusion • Grounding is fundamental • Visible workspace enhances grounding speed • In task-oriented dialogs faces are not important • Compensation possible (only if any monitoring is available)

Conclusion • Updating common ground • Increments are determined jointly • Much evidence for bilateral account • Addressees provide statements about their current understanding • Speakers monitor to update and change utterances

Conclusion • Opportunistic process • Offering options • Self-interruptions • Waiting • Instant revision • Multi-modal process • Speech and gestures are combined if possible • Speech alone takes more time

Remarks • Gaze only important for certain types of tasks • Measurement of time may be outdated (“old” study) • No contradicting studies (to some extent common sense)

Gaze and Turn-Taking Behavior in Casual Conversation Interactions Kristiina Jokinen, Hirohisa Furukawa, Masafumi Nishida and Seiichi Yamamoto

Differences • Three-party dialogue • No instructional task • Stronger focus on eye gaze

Research Question • How well can eye gaze help in predicting turn taking? • What is the role of eye gaze when the speaker holds the turn? • Is the role of eye gaze as important in three-party dialogs as in two-party dialogs?

Hypothesis • In group discussions, eye gaze is important in turn management (especially in turn-holding cases) • The speaker is more influential than the other partners in coordinating interactions (selects the next speaker)

Method • Three-person conversational eye gaze corpus • Natural conversations • Balanced familiarity (50% familiar; 50% unfamiliar) • Balanced gender (male-only; female-only; mixed)

Method • 28 conversations among Japanese students in their early 20s, with three participants each • Each conversation about 10 minutes • Eye gaze recorded for one participant

Method • Eye tracker fixed on the table to preserve naturalness

Method

Used data • Estimated at the last 300 ms of an utterance if followed by a 500 ms pause
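The selection rule on this slide can be sketched in code. The following is a minimal illustration, not the authors' actual pipeline: only the 300 ms window and the 500 ms pause come from the slide. The `Utterance` type, the function name, reading the pause as "at least 500 ms", and skipping the final utterance (whose following pause cannot be measured) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

WINDOW_MS = 300     # length of the analysis window (from the slide)
MIN_PAUSE_MS = 500  # required pause (from the slide; "at least" is an assumption)


@dataclass
class Utterance:
    """Hypothetical record of one utterance with timestamps in milliseconds."""
    speaker: str
    start_ms: int
    end_ms: int


def analysis_windows(utterances: List[Utterance]) -> List[Tuple[int, int]]:
    """Return (start, end) of the last 300 ms of each utterance followed by a long enough pause."""
    ordered = sorted(utterances, key=lambda u: u.start_ms)
    windows = []
    # Pair each utterance with the next one; the final utterance is skipped
    # because the pause after it is unknown (assumption).
    for current, following in zip(ordered, ordered[1:]):
        pause = following.start_ms - current.end_ms
        if pause >= MIN_PAUSE_MS:
            # Clamp the window so it never starts before the utterance itself.
            start = max(current.start_ms, current.end_ms - WINDOW_MS)
            windows.append((start, current.end_ms))
    return windows


example = [
    Utterance("A", 0, 1200),
    Utterance("B", 1800, 2500),
    Utterance("A", 2700, 4000),
    Utterance("C", 4600, 5000),
]
print(analysis_windows(example))
# prints [(900, 1200), (3700, 4000)]
```

In the example, the second utterance is dropped because only 200 ms of silence follow it. Gaze and speech features would then be read from the returned windows.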

Used data • Dialog acts • Speech features • Values of F0, etc. • Eye gaze

Results

Conclusion • The speaker uses eye gaze to signal whether they intend to give up the turn or hold it • Fixating the listener vs. focusing attention elsewhere • Eye gaze in multi-participant conversations is as important as in two-participant conversations

Conclusion • Eye gaze is used to select the next speaker (seems to be correct) • Japanese data may interfere with the value of the speech data • Comparison study? • Listeners focus on the speaker, not vice versa

Remarks • Vague information and data presentation • Although various data exist, the interaction of factors is not presented • Some conclusions rely on the aforementioned point • Setup only takes one participant into consideration • Much of the data was unused • Lacking in quality and in the way it was created

Remarks • Study is based on data collected for another study • Setup is not optimal • Realistic design • Yet it contains biasing flaws (situation of the participants, only one eye tracker)

Comparison • Clark and Krych present interesting ideas, but eye gaze is only rarely addressed • How could this be altered? • Jokinen et al. focus on eye gaze in a (more or less) natural situation but are lacking in scientific results and setup • What points and ideas of this setup could be beneficial?
