A Perceptual Study of the Effects of Localized Audio in Increasing the Human Participation in Videoconferencing and Virtual Reality Environments John A. Greenfield 11/05/03
Contents • Problem Description • Approach • Pilot Experiments • Localization Accuracy • Visual Localization • Main Experiment • Discussion • Conclusion and Future Work
Video-conferencing History • Began in the 1960s with the AT&T Picturephone. • Until the 1990s, video-conferencing was designed for small screens. • Limited to 1-3 on-screen videos and 3-6 individuals. • Typically expensive and of limited use.
Large-Scale Videoconferencing • In the late 1990s, large-screen videoconferencing with large numbers of on-screen participants was introduced. • MBone – University College London, 1996 • MASH – UC Berkeley, 1997 • AccessGrid – Argonne National Lab, 1999 • Internet-based, inexpensive, widely used.
Virtual Reality Video-conferencing • Video-conferencing has been implemented in virtual reality environments: • Real video • Avatars based on real video • Avatars based on motion tracking • Collaborative Virtual Environments • Avatar-based, similar to a videoconference
Video-conferencing AccessGrid Session
Large-Scale Video-conferencing Features • 10-30 on-screen sub-windows • 1-10 people per sub-window • Multiple simultaneous speaking people possible • Both presentations and discussions • 200+ installed AccessGrid studios
Limitations • It is difficult to identify which on-screen image contains the currently speaking person. • Scanning all the faces to see which lips are moving can take from one second to several minutes. • This interferes with communication and comprehension.
Benefits of Identifying the Speaking Person • The listener is not frustrated by the search • Expression and gestures add information • Argyle, M. Bodily Communication. 1975. • Visual context identifies the speaking person's site • Comfort with the conversation is enhanced • Improved feeling of "presence," the sense of being there
Measuring Effectiveness • Less time to find the speaking person’s image on screen = less frustration for the listener • Less time to find image = more visual information obtained.
Identification Approaches • Display only the speaking person • Eliminates information about other participants • Multiple speaking people can be confusing • Highlight the speaking person's sub-window • Can be a distraction, and the user still needs to search • Requires looking at the screen • Multiple speaking people can be confusing • Localized sound
Localized Sound Benefits • Works for multiple simultaneous speaking people • Works even if the listener looks away from the screen • No visual distraction added • Can enhance tracking of multiple conversations – the Cocktail Party Effect • Bolia, et al. "Asymmetric Performance in the Cocktail Party Effect." Human Factors, 2001. • Can enhance comprehension of conversation • Baldis, J. "Effects of Spatial Audio on Memory, Comprehension and Preference During Desktop Conferences." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2001. • Works even if the speaking person is off-screen • Closer to real life
Hypothesis • Axiom: Localized sound makes the speaker's voice appear to come from their on-screen image. • Hypothesis: The addition of spatially localized sound to large-format video-conferencing systems will significantly reduce users' visual search times.
Localized Sound Implementation • Stereo Panning • Head-related Transfer Functions in headphones • Surround sound or 3-D panning • Wall of Sound Transducers
Localized Sound Used • Stereo panning • Sound transducers, tracked headphones, or a head-mounted display (HMD) give horizontal localization • Head-Related Transfer Function (HRTF) • Tracked headphones or an HMD give both horizontal and vertical localization
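To make the stereo-panning option concrete, here is a minimal sketch, assuming a mono voice signal and a sub-window position normalized to [-1, 1]; a constant-power pan law keeps perceived loudness steady as the source moves. The function name and parameters are illustrative, not from the study's implementation.

```python
import numpy as np

def pan_mono(signal: np.ndarray, pos: float) -> np.ndarray:
    """Pan a mono signal horizontally; pos runs from -1.0 (left) to +1.0 (right)."""
    theta = (pos + 1.0) * np.pi / 4.0        # map [-1, 1] onto [0, pi/2]
    left = np.cos(theta) * signal            # constant-power pan law:
    right = np.sin(theta) * signal           # left**2 + right**2 stays constant
    return np.stack([left, right], axis=-1)  # stereo output, shape (n_samples, 2)
```

A speaking person in column c of C on-screen columns could then be assigned pos = 2*c/(C-1) - 1, so the voice appears at the horizontal position of its sub-window.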
HRTF • Head-Related Transfer Function • The head and outer ear alter a sound's frequency content and amplitude depending on its azimuth and elevation relative to the ear. • Convolving the source with the transfer function of a standard human ear mimics the frequency and amplitude effects of sound arriving from a specific physical location. • Produces 3-D localization using tracked headphones
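A minimal sketch of the HRTF approach, assuming the head-related impulse response (HRIR) pair measured at the target azimuth and elevation has already been loaded from some standard-ear data set; the convolution step is the generic technique, not the study's actual code.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(signal: np.ndarray,
               hrir_left: np.ndarray,
               hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with the HRIR pair for the target direction."""
    left = fftconvolve(signal, hrir_left, mode="full")    # left-ear response
    right = fftconvolve(signal, hrir_right, mode="full")  # right-ear response
    return np.stack([left, right], axis=-1)  # binaural output for headphones
```

With tracked headphones, the HRIR pair is re-selected as the listener's head moves, so the voice stays anchored to the on-screen image.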
Contents • Problem Description • Approach • Pilot Experiments • Localization Accuracy • Visual Localization • Main Experiment • Discussion • Conclusion and Future Work
Approach • Implement a simplified video-conference in a virtual reality system • Fewer variables than real video-conference systems • Represent faces with simple cartoon animation • Test with human subjects: measure search time and accuracy
Scope of Experiments • Compare search times for non-localized and localized sound across three independent variables: • Level of visual complexity (# faces) • Level of visual distraction (# blinking eyes) • Level of sound distraction (second voice) • These variables are primary features of the video-conference situation
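The resulting condition space is simply the cross product of the sound condition and the three independent variables; the sketch below enumerates it, using the face counts and blink levels that appear later in the main-experiment slides (the exact level encodings are illustrative).

```python
from itertools import product

sound = ["localized", "unlocalized"]
n_faces = [12, 30, 40]             # visual complexity
blink = ["<50%", ">=50%"]          # visual distraction (blinking eyes)
audio_distracter = [False, True]   # second voice present?

conditions = list(product(sound, n_faces, blink, audio_distracter))
print(len(conditions))             # 2 * 3 * 2 * 2 = 24 conditions
```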
Contents • Problem Description • Approach • Pilot Experiments • Localization Accuracy • Visual Localization • Main Experiment • Discussion • Conclusion and Future Work
Pilot Experiments • Sound Localization • Determine accuracy of localization • Visual Localization with no distracters • Determine localization times • Visual Localization with blinking • Determine localization times with blinking eyes • Distracters added because they exist in video-conferencing application.
Sound Localization Results • Stereo: • Error more prevalent to the right than to the left • HRTF: • Error more prevalent to the left than to the right • [Chart: localization error in degrees, ±9° scale, for Stereo and HRTF]
Visual Localization Details • Animated face mouth • Sound 1 • Sound 2
Visual Localization Experiment Features • Within-subject tests • Localized and non-localized trials interspersed • Video contrast uneven across locations • Might increase delay for some locations • Mouse selection of the column • Compare search times for localized and unlocalized sound
Visual Localization Analysis • Paired t-test, two-tailed: • H0: search times for localized and unlocalized sound are equal • H1: search times differ • A p-value of 0.05 or less is considered statistically significant • 95% confidence interval
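A minimal sketch of this analysis, assuming one mean search time per subject per sound condition; the numbers below are hypothetical placeholders, not data from the study.

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject mean search times in seconds (placeholders).
localized = np.array([2.1, 1.8, 2.4, 2.0, 1.9, 2.2])
unlocalized = np.array([2.5, 2.6, 2.9, 2.3, 2.8, 2.4])

t_stat, p_value = stats.ttest_rel(localized, unlocalized)  # paired, two-tailed
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")              # significant if p <= 0.05
```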
Visual Localization Sample Size • Sample size: • No-blink: 11 subjects • Blink: 10 subjects • 20 iterations of localized sound per subject • 20 iterations of unlocalized sound per subject • 5 practice iterations
Visual Localization Experiment Results • Non-blinking • Statistically significant improvement (0.3 seconds) • Blinking • Statistically significant improvement with larger magnitude (2.7 seconds)
Conclusions from Pilot Experiments • Accuracy sufficient for the experiments • Localized sound has a significant positive effect • Particularly in the presence of visual distracters • The two pilots bracket a typical videoconference: • No-blink is less difficult • All-blinking is more difficult
Contents • Problem Description • Approach • Pilot Experiments • Localization Accuracy • Visual Localization • Main Experiment • Discussion • Conclusion and Future Work
Main Experiments • Determine dependencies on the levels and types of distracters: • Number of faces: 40, 30, 12 • Level of blinking: 50% or more, less than 50% • Audio distracter: with, without • Variable face sizes: • Large • Large & small • Very large & large & small
Main Experiments (cont.) • Stereo/HRTF: • One face size • 40, 30, and 12 faces • 2 blink levels • Audio distracter • Variable face size: • All large • Large & small • Very large & large & small • 100% blink level
Main Experiment Details • Sound 1 • Sound 2 • Audio Distracter
Main Experiment Details • Within-subjects tests • Same order of trials for all participants • Participants do not know whether the sound is localized
Main Experiment Variables • Gender • Age • Experience • # faces • Blink level • Audio Distracter • Face sizes
Main Experiment Details • For each combination of # faces and blink level: • 10 localized trials with an audio distracter • 10 unlocalized trials with an audio distracter • 10 localized trials without an audio distracter • 10 unlocalized trials without an audio distracter
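As a sketch of that trial structure: each (# faces, blink level) cell contains 40 trials, 10 in each sound-by-distracter combination. The field names are illustrative; per the previous slide, the resulting list would be presented in the same order to every participant.

```python
from itertools import product

TRIALS_PER_COMBO = 10

def trials_for_cell(n_faces: int, blink_level: str) -> list:
    """Build the 40 trials for one (# faces, blink level) cell."""
    trials = []
    for localized, audio_distracter in product([True, False], repeat=2):
        for _ in range(TRIALS_PER_COMBO):
            trials.append({
                "n_faces": n_faces,
                "blink_level": blink_level,
                "localized": localized,
                "audio_distracter": audio_distracter,
            })
    return trials

print(len(trials_for_cell(40, ">=50%")))  # 40 trials per cell
```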
Stereo Results • 31 participants • 23 inexperienced • 8 experienced (participated in the pilots) • Significant: • Localization for # faces (Sig < 0.001) • Localization for blink level (Sig < 0.001) • Overall localization (Sig < 0.001)
Stereo Results: Variable Face Size • No audio distracter, 100% blinking eyes • Significant: • Localization for face size (Sig = 0.002) • Overall localization (Sig < 0.001)