Audio in VEs Ruth Aylett
Use of audio in VEs • Important but still under-utilised channel for HCI including virtual environments. • Speech recognition for hands-free input • All computers now have sound output • At least a beep • Usually CD-quality stereo sound • Conventional stereo places a sound in any spot between the left and right loudspeakers. • In true 3D sound, the source can be placed in any location: right or left, up or down, near or far.
Potential uses • Associate sounds with particular events • Associate sounds with static objects • Associate sound with the motion of an object • Use localised sound to attract attention to an object • Use ambient sounds to add to the feeling of immersion • Use sounds to add to the feeling of realism • Use speech to communicate with devices or avatars • Use sound as a warning or alarm signal
Overall impact • High Quality audio provides: • Increased realism • Reinforces visuals • Strong immersive sense • World exists beyond part that is seen • Strong positional cues • Extra information about the environment • The shape of the world • What does ‘High Quality’ mean?
VR sound environment • VR equipment creates a difficult sound environment. • CAVE • Stand in a glass box • Pretend you can’t hear the echoes • Also hard to place speakers • Semi-immersive VR theatre • Sit in a big cylinder • Reflects sound in a very strange way.
Workbench • Not so bad but still... • Big flat screen 1 metre in front of you • Sound coming from surround speakers • Creates echoes inappropriate for scene • Much VR audio is based on very high quality headphones • Use head tracking to get position and orientation • Play to user • Problem solved! - well, no
Stereo sound • In the entertainment industry, stereo was the first successful commercial product involving spatial sound. • To place sound on the left, send its signal to the left loudspeaker; to place it on the right, send its signal to the right loudspeaker. (Figure: a mono source fed to the L and R speakers produces a phantom source between them.)
Stereo techniques • If same signal sent to both speakers*, phantom source seems to originate from point midway between them. • Crossfading signal from one speaker to the other gives impression of source moving continuously between the two positions. • Simple crossfading cannot create impression of source outside of line segment between speakers. • Can also shift the location of the phantom source by exploiting the precedence effect (delay). *and if the speakers are wired "in phase" and if the listener is more or less midway between the speakers and if the room is not too acoustically irregular
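As a concrete illustration of the crossfading idea, here is a minimal sketch (not from the original slides; the function name and pan parameter are invented) of equal-power panning of a mono signal between two speakers:

```python
# Illustrative sketch: equal-power crossfade of a mono signal
# between left and right channels. Requires numpy.
import numpy as np

def pan_stereo(mono: np.ndarray, pan: float) -> np.ndarray:
    """Pan a mono signal between two speakers.

    pan = -1.0 -> fully left, 0.0 -> phantom centre, +1.0 -> fully right.
    Equal-power gains keep perceived loudness roughly constant.
    """
    angle = (pan + 1.0) * np.pi / 4.0        # map [-1, 1] to [0, pi/2]
    left = np.cos(angle) * mono
    right = np.sin(angle) * mono
    return np.stack([left, right], axis=-1)  # shape: (samples, 2)

# Example: a 440 Hz tone placed slightly to the left of centre.
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
stereo = pan_stereo(tone, pan=-0.3)
```

Equal-power gains avoid the loudness dip at the centre that a simple linear crossfade produces; either way, the phantom source stays on the line segment between the speakers.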
A world of sound • You are surrounded by sound all the time • Silence is unheard of! • The environment affects (shapes?) the sound you hear • Size, shape, materials
Rendering sound: auralisation • To generate correct echoes we must model sound behaviour in the space • Rooms are complex • Filled with different materials • Reflective, absorbent, frequency-filtering • Just like rendering light
Simulation of room characteristics (figure: direct sound, early reflections, reverberant field)
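To make the "just like rendering light" analogy concrete, here is a minimal sketch of a first-order image-source model for a rectangular room. It is an illustrative assumption, not the method of any particular system; the absorption value and function name are made up.

```python
# Illustrative sketch: each wall of a shoebox room produces one mirrored
# "image" of the source; its extra path length gives the delay and
# 1/distance attenuation of one early reflection.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_reflections(src, lst, room, absorption=0.3):
    """Return (delay_s, gain) for the direct path and six wall reflections.

    src, lst : (x, y, z) source and listener positions in metres
    room     : (Lx, Ly, Lz) room dimensions, walls at 0 and L on each axis
    """
    src = np.asarray(src, float)
    lst = np.asarray(lst, float)
    images = [src]                           # direct sound
    for axis in range(3):
        for wall in (0.0, room[axis]):       # mirror the source across each wall
            img = src.copy()
            img[axis] = 2.0 * wall - src[axis]
            images.append(img)
    paths = []
    for i, img in enumerate(images):
        d = np.linalg.norm(img - lst)
        gain = 1.0 / max(d, 1e-6)            # spherical spreading loss
        if i > 0:
            gain *= (1.0 - absorption)       # reflected paths lose energy
        paths.append((d / SPEED_OF_SOUND, gain))
    return paths

for delay, gain in first_order_reflections((1, 2, 1.5), (4, 3, 1.5), (6, 5, 3)):
    print(f"delay {delay*1000:6.2f} ms, gain {gain:.3f}")
```

Higher-order reflections and the dense late reverberant field multiply the work rapidly, which is one reason auralisation is so much harder than it first appears.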
Putting sound into a VE • What sound? • ‘Ambient’ sounds • ‘Surround’ sound • Often use recorded sounds • Positional sounds • Designed to give a strong sense of something happening in a particular place • Also often provided by using recorded sounds
Positional sound • Using sound to create the sense of active things in the environment • Enhances presence • Enhances immersion • Need to deal with many components • Reflections (echoes) • Diffraction effects
City models • VisClim: scene in Linköping’s Storatorget • Surrounding environment • Vehicles? • Several roads nearby • People? • Many people in the square • Weather noise effects • Rainfall • Snowfall (no sound but damping effect)
Air traffic control • Simulation • No ‘ambient’ sound required • No aircraft noises • No realism wanted? • Positional warnings? • Designed to draw the user’s attention to the location of a problem • Which may be out of the field of view
Creating positional sound • Amplitude • Synchronisation • Audio delays • Frequency • Head-Related Transfer Function (HRTF)
Amplitude • Generate audio from positional sources • Calculate amplitude from distance • Include damping factors • Air conditions • Snow • Directional effect of the ears
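A minimal sketch of the idea, assuming a simple inverse-distance law plus a made-up per-metre damping term standing in for effects such as air conditions or snow:

```python
# Illustrative sketch: amplitude of a positional source derived from
# its distance, with an optional extra damping factor.
import math

def distance_gain(distance_m: float, damping_db_per_m: float = 0.0,
                  ref_distance_m: float = 1.0) -> float:
    """Inverse-distance gain relative to the level at ref_distance_m,
    plus a simple per-metre absorption term (e.g. air or snow)."""
    d = max(distance_m, ref_distance_m)
    gain = ref_distance_m / d                       # 1/d spreading loss
    gain *= 10.0 ** (-damping_db_per_m * d / 20.0)  # extra absorption
    return gain

print(distance_gain(10.0))                          # free field, 10 m away
print(distance_gain(10.0, damping_db_per_m=0.5))    # heavily damped scene
```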
Synchronisation • Ears are very precise instruments • Very good at hearing when something happens after something else • Sound travels slowly (c 340 m/sec in air): different distance to each ear • Use this to help define direction • Difference in amplitude gives only very approximate direction information
Speed effect • 30 centimetres of path difference ≈ 0.0009 seconds at 340 m/s • Humans can hear delays of ≤ 700 µs
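A small sketch of where these numbers come from, using the common spherical-head (Woodworth) approximation for the interaural time difference; the head radius is an assumed value:

```python
# Illustrative sketch: interaural time difference (ITD) from source
# azimuth, for a distant source and an idealised spherical head.
import math

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, roughly half the distance between the ears

def itd_seconds(azimuth_deg: float) -> float:
    """ITD for a far source; 0 deg = straight ahead, 90 deg = to one side."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(f"{az:3d} deg -> {itd_seconds(az) * 1e6:6.1f} microseconds")
```

At 90° to one side this gives roughly 650 µs, consistent with the ≤ 700 µs figure above.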
What is 3D sound? • Able to position sounds all around a listener. • Sounds created by loudspeakers/headphones: perceived as coming from arbitrary points in space. • Conventional stereo systems generally cannot position sounds to the side, rear, above or below. • Some commercial products claim 3D capability - e.g. stereo multimedia systems marketed as having “3D technology”. But usually untrue.
3D positional sound • Humans have stereo ears • Two sound pulse impacts • One difference in amplitude • One difference in time of arrival • How is it that a human can resolve sound in 3D? • Should only be possible in 2D?
Frequency • Frequency responses of the ears change in different directions • Role of pinnae • You hear a different frequency filtering in each ear • Use that data to work out 3D position information
Head-Related Transfer Function • Unconscious use of time delay, amplitude difference, and tonal information at each ear to determine the location of the sound. • Known as sound localisation cues. • Sound localisation by human listeners has been studied extensively. • Transformation of sound from a point in space to the ear canal can be measured accurately • Head-Related Transfer Functions (HRTFs). • Measurements are usually made by inserting miniature microphones into ear canals of a human subject or a manikin.
HRTFs • HRTFs are 3D • Depend on ear shape (Pinnae) and resonant qualities of the head! • Allows positional sound to be 3D • Computationally difficult • Originally done in special hardware (Convolvotron) • Can now be done in real-time using DSP
HRTFs • A widely used series of HRTF measurement experiments was carried out in 1994 by Bill Gardner and Keith Martin of the Machine Listening Group at the MIT Media Lab. • Data from these experiments made available for free on the web. • Picture shows Gardner and Martin with the dummy used for the experiments - called a KEMAR dummy. • A measurement signal is played by a loudspeaker and recorded by the microphones in the dummy head.
HRTFs • Recorded signals are processed by computer to derive two HRTFs (left and right ears) corresponding to the sound source location. • An HRTF typically consists of several hundred numbers • describes the time delay, amplitude, and tonal transformation for a particular sound source location to the left and right ears of the subject. • Measurement procedure repeated for many locations of the sound source relative to the head • Result: a database of hundreds of HRTFs describing the sound transformation characteristics of a particular head.
HRTFs • Mimic the process of natural hearing • reproducing sound localisation cues at the ears of the listener. • Use a pair of measured HRTFs as the specification for a pair of digital audio filters. • Sound signal processed by the digital filters and listened to over headphones • Reproduces sound localisation cues for each ear • listener should perceive sound at the location specified by the HRTFs. • This process is called binaural synthesis (binaural signals are defined as the signals at the ears of a listener).
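A minimal sketch of binaural synthesis, assuming the measured HRTFs are available in their time-domain form (head-related impulse responses, HRIRs) as numpy arrays; the function name and data are hypothetical:

```python
# Illustrative sketch: filter a mono signal with the left- and right-ear
# impulse responses for the desired source position, then play the
# two-channel result over headphones.
import numpy as np
from scipy.signal import fftconvolve

def binaural_synthesis(mono: np.ndarray,
                       hrir_left: np.ndarray,
                       hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with a pair of HRIRs; returns (samples, 2)."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=-1)

# Usage (hypothetical data): hrir_l, hrir_r measured for, say, 30 deg to
# the left at 0 deg elevation; 'source' is any mono signal at the same
# sample rate.
# stereo = binaural_synthesis(source, hrir_l, hrir_r)
```

Moving sources require switching (or interpolating) between HRTF pairs as the relative position changes, which is where head tracking and real-time DSP come in.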
HRTFs • You should be able to describe the process involved in generating true 3D audio using HRTFs.
The problem • Rendering audio is really, really hard • Much bigger problem than lighting • Material properties are more complex • Can’t fake it as easily • Properties are always a problem • Good methods exist but problem too computationally hard for these to be in general use at present
What is possible now • Constraint is real time audio rendering • Must adapt to dynamic user who moves unpredictably • Simple (reflectionless) stereo positional sound • Using amplitude • Using synchronization • Using HRTF frequency filtering • Useful for audio cues and simple environmental sounds
What about surround sound? • Principal format for digital discrete surround is the "5.1 channel" system. • The 5.1 name stands for five channels (in front: left, right and centre; behind: left surround and right surround) of full-bandwidth audio (20 Hz to 20 kHz) • The sixth channel at times contains additional bass information to maximise the impact of scenes such as explosions, etc. • This channel has a narrow frequency response (3 Hz to 120 Hz), so it is sometimes referred to as the ".1" channel.
What about surround sound? • Surround sound systems are NOT true 3D audio systems - just a collection of more speakers. • Various commercial surround sound formats - for home entertainment, Dolby is the big name. • Dolby Surround Digital. • Lots of other proprietary approaches - e.g. the BattleChair (pictured).
Dolby Headphone • Dolby Headphone: based on a proprietary algorithm, presumably similar to HRTFs, originally developed by the Australian company Lake Technology • Attempts to produce convincing surround-sound effects through ordinary stereo headphones. • Technology originally developed for VR or tele-conferencing applications but not marketed for consumer applications. • A more genuine 3D audio system was developed by the UK company Sensaura.
Voice interaction • Voice input for control • Continuous? Discrete? • Voice output for information • Positional - alerts • Non-positional - ‘voice over’ • Character-based - social channel
Voice output • Voice synthesis • Computer strings together set of phonemes (basic language sound units) • Problems with articulation: sounds robotic • Unit selection voices • Uses large database created from real voices • Plus sophisticated algorithm for putting bits together • Good results but need very large memory (1 gigabyte) to hold database • Takes lots of time and expertise to create ‘voice’
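Purely as an illustration of driving voice output from a program (not the unit-selection pipeline described above), here is a minimal sketch using the off-the-shelf pyttsx3 text-to-speech library, assuming it is installed; the warning text is invented:

```python
# Illustrative only: speak a warning string through whatever TTS voice
# the platform provides. pyttsx3 wraps the system speech engine.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)          # speaking rate, roughly words per minute
engine.say("Warning: conflict detected to your left.")
engine.runAndWait()                      # block until speech finishes
```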