
Speech and Audio Processing and Coding (cont.)

Speech and Audio Processing and Coding (cont.). Dr Wenwu Wang Centre for Vision Speech and Signal Processing Department of Electronic Engineering w.wang@surrey.ac.uk http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html. Psychoacoustics.


Presentation Transcript


  1. Speech and Audio Processing and Coding (cont.) Dr Wenwu Wang Centre for Vision Speech and Signal Processing Department of Electronic Engineering w.wang@surrey.ac.uk http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

  2. Psychoacoustics • Psychoacoustics is the study of how humans perceive sound, such as • Perception of loudness • Pitch perception • Space perception • References • B. C.J. Moore, An Introduction to the Psychology of Hearing, Academic Press, 1995. • D.M. Howard and J. Angus, Acoustics and Psychoacoustics, Focal Press, 1996. • W. A. Yost, Fundamentals of Hearing: an Introduction, Academic Press, 1994. • R. M. Warren, Auditory Perception, Cambridge Univ. Press, 1999.

  3. Inner Ear Function • The inner ear consists of the cochlea, which has a snail-like structure. • It transfers mechanical vibrations to the basilar membrane, whose movement is then converted into nerve firings by the organ of Corti, which contains a number of hair cells. • The basilar membrane carries out a frequency analysis of input sounds: it responds best to high frequencies at the (narrow and thin) base end, and to low frequencies at the (wide and thick) apex end.

  4. Inner Ear Function • The spiral nature of the cochlea • The cochlea unrolled • Vertical cross-section through the cochlea • Detailed view of the cochlea tube From: (Howard & Angus, 1996)

  5. Basilar Membrane Idealised shape of unrolled basilar membrane From: (Howard & Angus, 1996)

  6. Displacement of Basilar Membrane Idealised envelope of basilar membrane movement to sounds at five different frequencies From: (Howard & Angus, 1996)

  7. ‘Place’ Theory of Hearing • The place of maximum displacement of the basilar membrane changes with the frequency of the input. • The basilar membrane is stimulated from the base end, which responds best to high frequencies. It is important to note that its envelope of movement for a pure tone (or an individual component of a complex sound) is not symmetrical: it tails off less rapidly towards higher frequencies than towards lower frequencies. • The linear distance measured from the apex to the place of maximum basilar membrane displacement is directly proportional to the logarithm of the input frequency.

  8. Critical Bands • An illustration of the perceptual changes when playing two tones simultaneously with the frequency of a pure tone (F1) fixed and the other (F2) changing. From: (Howard & Angus, 1996)

  9. Critical Bands (cont) • Whether two frequencies can be discriminated depends on whether their basilar membrane displacements are separated. • The frequency difference between two pure tones at which a listener’s perception changes from rough and separate to smooth and separate is known as the ‘critical bandwidth’ (CB). • “The critical bandwidth is that bandwidth at which subjective responses rather abruptly change.” (Scharf, 1970) • The ‘equivalent rectangular bandwidth’ (ERB) was proposed as a practical way to use the notion of critical bandwidth. (Moore and Glasberg, 1983)

  10. Critical Bands (cont) • The relationship between the ERB and the centre filter frequency (Howard & Angus, 1996)
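The relationship between the ERB and the centre frequency can be sketched numerically. The polynomial below is the one usually attributed to Moore and Glasberg (1983); it is not given on the slides, so treat both the coefficients and the function name `erb_moore_glasberg_1983` as assumptions brought in for illustration.

```python
def erb_moore_glasberg_1983(fc_hz: float) -> float:
    """Approximate ERB (in Hz) of the auditory filter centred at fc_hz.

    Assumed formula (Moore & Glasberg, 1983), with the centre frequency
    F expressed in kHz:  ERB = 6.23*F^2 + 93.39*F + 28.52
    """
    f_khz = fc_hz / 1000.0  # the polynomial expects kHz
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52


# The ERB widens as the centre frequency rises.
for fc in (100, 500, 1000, 4000):
    print(f"fc = {fc:>4} Hz  ->  ERB ~ {erb_moore_glasberg_1983(fc):.1f} Hz")
```

The point of the sketch is only the trend: the bandwidth grows with centre frequency, which is why higher harmonics of a complex tone are harder to resolve.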

  11. Critical Bands (cont) • Semitone: the smallest musical interval between musical notes, defined as the interval between two adjacent notes in a 12-tone scale (e.g. from C to C#). It equals 100 cents (i.e. a twelfth of an octave). • Octave: the interval between two musical pitches where one has double the frequency of the other. In other words, one note is 12 semitones higher or lower than the other. For example, an A4 note is one octave higher than an A3 note, but one octave lower than an A5 note.
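The semitone and octave definitions above translate directly into frequency ratios. The following is a minimal sketch assuming equal temperament (each semitone is a ratio of 2^(1/12)); the helper names `cents` and `shift_semitones` are hypothetical.

```python
import math


def cents(f1_hz: float, f2_hz: float) -> float:
    """Interval from f1 to f2 in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f2_hz / f1_hz)


def shift_semitones(f_hz: float, n: int) -> float:
    """Frequency n equal-tempered semitones above (n > 0) or below (n < 0) f_hz."""
    return f_hz * 2.0 ** (n / 12.0)


a4 = 440.0                       # A4, as used in the loudness demos
a5 = shift_semitones(a4, 12)     # one octave up: double the frequency
a3 = shift_semitones(a4, -12)    # one octave down: half the frequency
print(a3, a5)
print(cents(a4, shift_semitones(a4, 1)))  # one semitone, about 100 cents
```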

  12. Loudness Perception • The ear’s sensitivity to sounds of different frequencies extends over a wide range of sound pressure. The minimum sound pressure that can be detected by the human hearing system, around 4kHz, is approximately $10^{-5}$ Pa, while the maximum (i.e. the threshold of pain) is 20Pa. • For convenience, in practice, the sound pressure level (SPL) is usually expressed in decibels (dB) relative to $p_{ref} = 2\times10^{-5}$ Pa: $\mathrm{SPL} = 20\log_{10}(p/p_{ref})$ dB, where $p$ is the measured sound pressure. • For example, the threshold of hearing at 1 kHz is, in fact, $2\times10^{-5}$ Pa. In dB, it equals $20\log_{10}\big(2\times10^{-5}/(2\times10^{-5})\big) = 0$ dB. • The threshold of pain is 20Pa, which in dB equals $20\log_{10}\big(20/(2\times10^{-5})\big) = 120$ dB.
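The decibel conversion used in these examples can be checked with a few lines of code. This is a small sketch assuming the standard reference pressure of 20 µPa; the function name `spl_db` is hypothetical.

```python
import math

P_REF_PA = 20e-6  # assumed reference pressure: 20 micropascals (2e-5 Pa)


def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to P_REF_PA."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


print(spl_db(20e-6))  # threshold of hearing at 1 kHz: about 0 dB
print(spl_db(20.0))   # threshold of pain: about 120 dB
print(spl_db(1e-5))   # minimum detectable pressure near 4 kHz: about -6 dB
```

The negative value for the last case simply reflects that the ear is most sensitive around 4 kHz, where pressures below the 1 kHz reference can still be detected.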

  13. Loudness Perception (cont.) • The perceived loudness of an acoustic sound is related to its amplitude (but not a simple one-to-one relationship), as well as the context and nature of the sound. • As the sensitivity of our hearing system varies as the frequency changes, it is possible for a sound with a larger pressure amplitude to be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [recall the equal loudness contour of the human auditory system shown in the first lecture]

  14. Demos for Loudness Perception • Resources: Audio Box CD from Univ. of Victoria • Decibels vs Loudness: A 440Hz tone (i.e. note A4) is reduced by 1dB at each step. A 440Hz tone (i.e. note A4) is reduced by 3dB at each step. A 440Hz tone (i.e. note A4) is reduced by 5dB at each step. • Intensity vs Loudness: Various frequencies are played at a constant SPL. A reference tone is played and then the same tone is played 5dB higher; this is followed by the reference tone and then the tone 8dB higher, and finally by the reference tone and then the tone 10dB higher.

  15. Pitch Perception

  16. Pitch • What is pitch? Pitch • is “the attribute of auditory sensation in terms of which sounds may be ordered on a musical scale extending from low to high” (American Standards Association, 1960) • is a “subjective” attribute, and cannot be measured directly. Therefore, a specific pitch value is usually given as the frequency of a pure tone that has the same subjective pitch as the sound. In other words, the measurement of pitch requires a human listener (the “subject”) to make a perceptual judgement. This is in contrast to the measurement in the laboratory of, for example, the fundamental frequency of a complex tone, which is an “objective” measurement. (Howard & Angus, 1996) • is related to the repetition rate of the waveform of a sound; it therefore corresponds to the frequency of a pure tone and to the fundamental frequency of a complex tone. In general, sounds with a periodic acoustic pressure variation over time are perceived as pitched sounds, and sounds with a non-periodic acoustic pressure waveform as non-pitched sounds. (Howard & Angus, 1996)

  17. Pitch • Comparison of pitched and non-pitched sounds (Howard & Angus, 1996)

  18. Pitch • Examples of pitched (see the figures in “Musical Notes and its Fundamental Frequencies”) and non-pitched sounds (see the figure below, the waveform and spectrum of a drum being brushed, Howard & Angus, 1996)

  19. Existing Pitch Perception Theories • ‘Place’ theory • Spectral analysis is performed on the stimulus in the inner ear: different frequency components of the input sound excite different places or positions along the basilar membrane, and hence neurones with different centre frequencies. • ‘Temporal’ theory • Pitch corresponds to the time pattern of the neural impulses evoked by the stimulus. Nerve firings tend to occur at a particular phase of the stimulating waveform, and thus the intervals between successive neural impulses approximate integral multiples of the period of the stimulating waveform.

  20. Place Theory Three methods are commonly used for finding the value of f0 based on a place analysis of the frequency components of the input sound: • Method 1: locate the f0 component itself. • Method 2: find the minimum frequency difference between adjacent harmonics, i.e. (n+1)*f0 – n*f0 = f0. • Method 3: find the highest common factor of the frequency components that are present in the input sound.
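The three methods can be compared on a few simple spectra. The sketch below assumes component frequencies given as whole numbers of Hz (so that an exact highest common factor exists) and uses f0 = 100Hz purely as an example; the function names are hypothetical and this is not a model of the auditory system itself.

```python
from functools import reduce
from math import gcd


def method1_lowest_component(components_hz):
    """Method 1: take the lowest component present as f0."""
    return min(components_hz)


def method2_min_spacing(components_hz):
    """Method 2: take the minimum spacing between adjacent components as f0."""
    ordered = sorted(components_hz)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


def method3_highest_common_factor(components_hz):
    """Method 3: take the highest common factor of all components as f0."""
    return reduce(gcd, components_hz)


spectra = {
    "all harmonics":         [100, 200, 300, 400],
    "f0 removed":            [200, 300, 400],
    "odd harmonics, no f0":  [300, 500, 700],
}
for name, comps in spectra.items():
    print(name,
          method1_lowest_component(comps),
          method2_min_spacing(comps),
          method3_highest_common_factor(comps))
```

Only Method 3 returns 100Hz for all three spectra; Method 1 fails once f0 is removed, and Method 2 fails when adjacent harmonics are missing.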

  21. Place Theory (cont) • Method 1: • Suggests that the pitch of a sound corresponds to the place stimulated by the lowest frequency component, i.e. the fundamental frequency f0. • Assumes that f0 is always present in the sound. For example, as stated by Ohm: “a pitch corresponding to a certain frequency can only be heard if the acoustic wave contains power at that frequency”. • Exceptional case: • Schouten (1940) demonstrated that even when f0 is removed from a pulse wave, its pitch remains the same. • Therefore, f0 does not have to be present for pitch perception, and the lowest frequency component is not the basis for pitch perception.

  22. Place Theory (cont) • Method 2: • Suggests that, whether or not the fundamental frequency f0 is present, adjacent harmonics, provided that they exist, should be used as the basis for pitch perception. • For most musical sounds, adjacent harmonics are indeed present. • Exceptional case: • As shown in the figure below, when f0 is present (or absent), the differences between adjacent frequencies are f0, 2f0, 2f0, etc. (or 3f0, 2f0, 2f0, etc.), while the perceived pitch does not change. (Howard & Angus, 1996)

  23. Place Theory • Method 3: • The highest common factor is the highest value appearing in all rows of the place analysis table below, where as an example, f0 = 100Hz. • It can address the exceptional cases in both Method 1 and Method 2. (Howard & Angus, 1996)

  24. Place Theory • Method 3: • Another example, shown by Schouten, uses the analysis table to interpret pitch perception for a non-harmonic sound. For a sound whose component frequencies were 1040Hz, 1240Hz and 1440Hz, the pitch was found to be approximately 207Hz. Using Method 2, the pitch would be the spacing between these components, i.e. 200Hz. • Using the processing table (shown on the next page), the highest common factor would be approximately 207Hz, which is the average of 208Hz, 207Hz and 206Hz, of which the components are the 5th, 6th and 7th harmonics respectively. The pitch perceived in such a situation is referred to as “residue pitch”, “pitch of the residue”, or “virtual pitch”. Strictly, the fundamental frequency of these components is 40Hz, of which they are the 26th, 31st and 36th harmonics respectively. It seems that the pitch found by the auditory system is based on the adjacent harmonics that are present in these frequencies.
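The arithmetic behind Schouten's example can be reproduced in a few lines. This is only a sketch of the numbers quoted above: the harmonic numbers 5, 6 and 7 are taken from the slide, and averaging the three candidate values is the simple estimate the slide describes, not a model of auditory processing.

```python
from functools import reduce
from math import gcd

components_hz = [1040, 1240, 1440]
harmonic_numbers = [5, 6, 7]  # adjacent harmonic numbers assumed on the slide

# One row of the analysis table: each component divided by its harmonic number.
candidates = [f / n for f, n in zip(components_hz, harmonic_numbers)]
residue_pitch = sum(candidates) / len(candidates)

# The exact highest common factor, i.e. the true fundamental frequency.
exact_f0 = reduce(gcd, components_hz)
exact_harmonics = [f // exact_f0 for f in components_hz]

print([round(c, 1) for c in candidates])  # roughly 208, 207, 206
print(round(residue_pitch, 1))            # close to the reported 207Hz pitch
print(exact_f0, exact_harmonics)          # 40 and harmonics 26, 31, 36
```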

  25. Place Theory (Howard & Angus, 1996)

  26. Problems with the Place Theory • Although it provides a basis for understanding how f0 is found in terms of frequency analysis, it does not explain (Howard & Angus, 1996): • The discrimination of frequency differences in pitch perception. [To discuss] • The pitch perception of sounds with frequency components that cannot be resolved by the place mechanism of the basilar membrane. [In general, no harmonic above about the 5th to 7th is resolved for any fundamental frequency, because the critical bandwidth at the centre frequencies of these harmonics is greater than the fundamental frequency.] • The pitch perceived for some sounds which have non-harmonic (i.e. continuous) spectra. [For example, most listeners would rate ‘ss’ in “sea” as having a higher pitch than ‘sh’ in “shell”, as the energy of ‘sh’ is biased more towards the lower frequencies, with a peak around 2.5kHz, compared with a peak around 5kHz for ‘ss’. The figure is shown on the next page.] • Pitch perception for sounds with a fundamental frequency less than 50Hz. [This is because the pattern of vibration on the basilar membrane does not seem to change in that region.]

  27. ‘ss’ versus ‘sh’

  28. Frequency Discrimination • The frequency difference limen (DL), sometimes called the just noticeable difference (JND), is the smallest detectable change in frequency. Two methods have been used to measure the DL: • DLF - The subject is asked to judge which of two frequencies has the higher pitch. This method was used by Henning (1970), Moore (1973), etc. It was found that, expressed in Hz, the DL is smallest at low frequencies and increases monotonically with increasing frequency; expressed as a proportion of the centre frequency, it tends to be smallest for middle frequencies, and larger for very high and very low frequencies. • FMDL - Tones which are frequency modulated (FM) at a low rate (typically 2-4Hz) are used for the measurement. This method was used by Shower & Biddulph (1931). The FMDL seems to vary less with frequency than the DLF, and both get smaller as the sound level increases.

  29. Frequency Discrimination • The frequency discrimination thresholds change with the centre frequencies, plotted as log(threshold) versus square root of centre frequency below: Frequency discrimination threshold measured by several different authors, all measured DLFs except S & B who measured FMDLs (figure first published by Wier et al, 1977, and reproduced in Moore, 1995)

  30. Temporal Theory • This theory is based on the fact that the waveform of an acoustic signal with a strong pitch is periodic. • This theory suggests that it is the detailed nature of the actual waveform that excites the different places along the basilar membrane. Therefore, it depends on the timing of neural firings generated in the organ of Corti, in response to vibrations of the basilar membrane. • It can be simulated by a bank of band-pass filters whose centre frequencies and bandwidths vary according to the critical bandwidth of the human hearing system. • The nerve fibres fire at all places along the basilar membrane, and a given nerve fibre may only fire at one phase or instant in each cycle of the stimulating waveform. This process is known as phase locking. • Due to phase locking, the time between firings for any particular nerve will always be an integer multiple of periods of the stimulus. At each place, there are a number of nerves involved. (Howard & Angus, 1996)
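The consequence of phase locking described above can be illustrated with a toy simulation. This is a deliberately simplified sketch, not a physiological model: a single fibre fires at one fixed phase of a cycle, and whether it fires in a given cycle is random. The parameters `p_fire`, `phase` and `seed` are hypothetical values chosen for illustration.

```python
import random


def simulate_firing_times(f_hz, n_cycles, p_fire=0.3, phase=0.25, seed=0):
    """Firing times (in s) of one phase-locked fibre over n_cycles of a tone.

    The fibre can only fire at a fixed phase within a cycle and does so
    with probability p_fire in each cycle (an assumed, simplified rule).
    """
    rng = random.Random(seed)
    period = 1.0 / f_hz
    return [(k + phase) * period for k in range(n_cycles) if rng.random() < p_fire]


def intervals_in_periods(times, f_hz):
    """Intervals between successive firings, in units of the stimulus period."""
    return [(b - a) * f_hz for a, b in zip(times, times[1:])]


f0 = 261.6  # C4, the note used in the band-pass filtering example
times = simulate_firing_times(f0, n_cycles=200)
ratios = intervals_in_periods(times, f0)
print([round(r) for r in ratios[:10]])  # every interval is a whole number of periods
```

Although the cycles in which the fibre fires are random, every interval comes out as an integer multiple of the period, so the period itself can in principle be recovered from the firing pattern.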

  31. Simulation of Temporal Theory • Band-pass filtering of note C4 played on a violin, whose f0 is 261.6Hz. (Howard & Angus, 1996)

  32. Simulation of Temporal Theory • The first six harmonics (around 260, 520, 780, 1040, 1300 and 1560Hz) are well resolved by the band-pass filters, and can therefore be explained by the place theory. • The output waveforms of the filters whose centre frequencies lie above the sixth harmonic are not sinusoidal, because the harmonics there are not resolved individually: the filter bandwidth is greater than the fundamental frequency. • When two components close in frequency are combined, they produce a beat waveform. If both components are adjacent harmonics of some fundamental frequency, the beat frequency is equal to f0, as shown in the filter outputs above 1.5kHz in the figure on the previous page. • The minimum time between firings (i.e. one period of the stimulus) can be inferred from the filter output (it is the period of the lower harmonics and the period of the input wave itself). • Note that, although the nerve does not necessarily fire in every cycle, and the cycle in which it fires tends to be random, phase locking means that the time between firings for any particular nerve will always be an integer multiple of the period of the stimulating waveform. (Howard & Angus, 1996)
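Why two unresolved adjacent harmonics beat at f0 can be seen from a standard trigonometric identity. The short derivation below assumes, for simplicity, two equal-amplitude cosine components at the n-th and (n+1)-th harmonics; real filter outputs contain more components with unequal amplitudes.

```latex
\begin{align*}
x(t) &= \cos(2\pi n f_0 t) + \cos\big(2\pi (n+1) f_0 t\big) \\
     &= 2\cos(\pi f_0 t)\,\cos\big(2\pi (n + \tfrac{1}{2}) f_0 t\big)
\end{align*}
% The slowly varying envelope is |2\cos(\pi f_0 t)|, which repeats every 1/f_0 seconds,
% so the beat frequency equals the spacing between the two components, i.e. f_0.
```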

  33. Nerve Firing • An illustration of nerve firing along the basilar membrane for the first 16 harmonics of an input sound. (Howard & Angus, 1996)

  34. Problems with Temporal Theory • Although it provides a basis for understanding how the fundamental period could be found from an analysis of the timing of nerve firings from all places across the basilar membrane, it cannot explain the following: • Pitch perception of sounds whose f0 is higher than 5kHz. [This is because phase locking breaks down above 5kHz.] • In practice, this means there will be only approximately two harmonics to be analysed, due to the limitation of the human hearing system (i.e. the upper limit of 20kHz). (Howard & Angus, 1996)

  35. Contemporary Theory • Neither of the theories is perfect for explaining the mechanism of human pitch perception. A combination of both theories will benefit the analysis of pitch perception, as a model proposed by Moore (1982) for complex tones, shown below. (Howard & Angus, 1996)

  36. Musical Intervals (Melody) • One tone evokes a pitch; a sequence of tones with appropriate frequencies can evoke the perception of a musical interval (or melody). • A sequence of tones below 5kHz evokes a sense of melody, while a sequence of tones above 5kHz does not evoke a clear sense of melody, although different frequencies can be heard. (Moore, 1989) • For example, two tones which are separated in frequency by an interval of one octave (i.e. one has twice the frequency of the other) sound similar. Hence, they are judged to have the same name on the musical scale (for example, C or D). • It appears that the musical interval of an octave is only clearly perceived when both tones are below 5kHz. Above 5kHz, a sequence of pure tones does not produce a clear sense of melody, as shown by Attneave and Olson (1971).

  37. Pitch versus Sound Level • The pitch of a pure tone is determined mainly by its frequency, but also, slightly, by its sound level. • On average, the pitch of tones below about 2kHz decreases with increasing sound level, while the pitch of tones above about 4kHz increases with increasing sound level. (Moore, 1989) • For tones between 1 and 2kHz, changes in pitch with level are generally less than 1%, while for tones of lower and higher frequencies, the changes can be larger (up to 5%). (Verschuure and van Meeteren, 1975; Moore, 1989)

  38. Musical Notes • Notes played by musical instruments have different pitches. • As the sensitivity of our hearing system varies as the frequency changes, it is possible for a sound with a larger pressure amplitude to be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [recall the equal loudness contour of the human auditory system shown in the first lecture]

  39. Musical Note and its Fundamental Frequency (Waveform of A4) (Howard & Angus, 1996)

  40. Musical Note and its Harmonics (Spectrum of A4)

  41. Musical Note and its Harmonics • The shape of the waveform and the spectrum of each of the notes played by the four different instruments shown on the previous page are different, even though they are all perceived as note A4 (i.e. they have the same fundamental frequency). It is the so-called “timbre” that distinguishes the four different musical instruments. • The frequency components of notes produced by any pitched instrument are called harmonics, which are integer multiples of the fundamental frequency f0. Therefore, the first harmonic is the fundamental frequency f0, the second harmonic is 2f0, the third is 3f0, etc. • Another term used by many authors is “overtones”. The first overtone is the first frequency component above f0, which is the second harmonic, i.e. 2f0. For example, for the note A4 played by the violin, f0=440.5Hz, so the first harmonic is 440.5Hz and the first overtone is 881.0Hz.
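The off-by-one between harmonic and overtone numbering is easy to get wrong, so a tiny sketch may help. The function names `harmonic` and `overtone` are hypothetical; the 440.5Hz value is the violin A4 fundamental quoted above.

```python
def harmonic(f0_hz: float, n: int) -> float:
    """Frequency of the n-th harmonic (n = 1 is the fundamental itself)."""
    return n * f0_hz


def overtone(f0_hz: float, k: int) -> float:
    """Frequency of the k-th overtone, i.e. the (k + 1)-th harmonic."""
    return harmonic(f0_hz, k + 1)


f0 = 440.5  # violin A4 fundamental from the slide
print(harmonic(f0, 1))  # first harmonic: 440.5 Hz
print(overtone(f0, 1))  # first overtone (second harmonic): 881.0 Hz
```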

  42. Demos for Pitch Perception • Resources: Audio Box CD from Univ. of Victoria. These three demos show how pitch is perceived for signals of different durations. In each track, bursts of sound are played. Three different pitches are played in these three tracks.

  43. Space Perception

  44. Sound Localisation • Sound localisation refers to judgements of the direction and distance of a sound source, usually achieved through the use of two ears (binaural hearing). • It helps humans and animals to locate the sounds of threats and to avoid such threats. • It helps humans and animals direct visual attention. • It helps humans and animals focus attention on sounds from specific directions by excluding other interfering sounds in a noisy and reverberant environment. • Blind people, in particular, can use information from echoes and reflections to estimate the distance of sound sources. • Although binaural hearing is crucial for sound localisation, monaural perception is similarly effective in some cases, such as the detection of signals in quiet, intensity discrimination, and frequency discrimination.

  45. Localisation Cues • There are two important cues that enable us to localise sounds: • interaural time difference • interaural intensity difference

  46. Interaural Time Difference (ITD) • The two ears are separated by the width of the head. For an average head, the distance between the ears is about 18cm. As such, there will be a time difference between the sound reaching the ear near the source and the one further away. This difference is called the interaural time difference (ITD). • A simple and rough model for the ITD, which ignores the path the sound travels around the head, is: $\mathrm{ITD} = \frac{d \sin\theta}{c}$, where ITD is the interaural time difference (in s), $d$ is the distance between the ears (in m), $\theta$ is the angle of arrival of the sound from the median (in radians), and $c$ is the speed of sound (in m/s).
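The simple model, ITD = d·sin(θ)/c, can be evaluated for a few angles as a quick sanity check. This is a minimal sketch: the 0.18 m ear spacing is the average quoted above, while the speed of sound of 344 m/s is an assumed typical value that the slides do not state. The function name `itd_simple` is hypothetical.

```python
import math


def itd_simple(theta_rad: float, d_m: float = 0.18, c_ms: float = 344.0) -> float:
    """ITD (in s) from the simple model d*sin(theta)/c.

    d_m  : distance between the ears in metres (0.18 m average head)
    c_ms : speed of sound in m/s (assumed value, not given in the slides)
    """
    return d_m * math.sin(theta_rad) / c_ms


for deg in (0, 30, 60, 90):
    itd_us = itd_simple(math.radians(deg)) * 1e6
    print(f"{deg:>2} degrees -> ITD ~ {itd_us:.0f} microseconds")
```

In this model the ITD is zero for a source straight ahead and largest for a source directly to one side.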

  47. Interaural Time Difference (ITD) (Howard & Angus, 1996)

  48. Interaural Time Difference (ITD) • However, in reality, the sound has to travel around the head in order to reach the further ear. • A more accurate model, which assumes that the head is spherical, is: $\mathrm{ITD} = \frac{r(\theta + \sin\theta)}{c}$, where ITD is the interaural time difference (in s), $r$ is half the distance between the ears (in m), $\theta$ is the angle of arrival of the sound from the median (in radians), and $c$ is the speed of sound (in m/s). • From this equation, it can be shown that the maximum ITD occurs at 90 degrees ($\theta = \pi/2$): $\mathrm{ITD}_{max} = \frac{r(\pi/2 + 1)}{c}$, which for the average head ($r = 0.09$ m) is approximately 0.67 ms, taking $c \approx 344$ m/s.
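The spherical-head model, ITD = r(θ + sin θ)/c, can be evaluated in the same way to find the maximum ITD at 90 degrees. As before this is only a sketch: r = 0.09 m is half the 18cm average ear spacing, and the speed of sound of 344 m/s is an assumed typical value. The function name `itd_spherical` is hypothetical.

```python
import math


def itd_spherical(theta_rad: float, r_m: float = 0.09, c_ms: float = 344.0) -> float:
    """ITD (in s) from the spherical-head model r*(theta + sin(theta))/c.

    r_m  : half the distance between the ears in metres
    c_ms : speed of sound in m/s (assumed value, not given in the slides)
    """
    return r_m * (theta_rad + math.sin(theta_rad)) / c_ms


itd_max = itd_spherical(math.pi / 2)  # source directly to one side
print(f"Maximum ITD ~ {itd_max * 1e6:.0f} microseconds")
```

The extra θ term accounts for the arc the sound travels around the head, so this model gives a larger maximum than the straight-line model with the same head size.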

  49. Interaural Time Difference (ITD) (Howard & Angus, 1996)

  50. ITD as a Function of Angle (Howard & Angus, 1996)
