Media Processing – Audio Part
Dr Wenwu Wang
Centre for Vision, Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
Approximate outline • Week 6: Fundamentals of audio • Week 7: Audio acquisition, recording, and standards • Week 8: Audio processing, coding, and standards • Week 9: Audio perception and audio quality assessment • Week 10: Audio production and reproduction
Speech codec, audio coding quality evaluation, and audio perception Concepts and topics to be covered: • Speech coding • Waveform coder, vocoder, and hybrid coder • Frequency domain and time domain coders • Audio file formats • Digital container format • Audio quality measurement • Subjective assessment: listening tests • Objective assessment: perceptual objective measurement • Objective perceptual measurements • Masked threshold, internal representation • PEAQ, PESQ • Audio perception • Loudness perception, pitch perception, space perception, timbre perception
Speech coding strategies • Speech coding schemes can be broadly divided into three main categories: vocoders, waveform coders, and hybrid coders. • The aim is to analyse the signal, remove the redundancies, and efficiently code the non-redundant parts of the signal in a perceptually acceptable manner. • SBC = subband coding, ATC = adaptive transform coding, MBE = multiband excitation, APC = adaptive predictive coding, RELP = residual-excited linear predictive coding (LPC), MPLPC = multi-pulse LPC, CELP = code-excited LPC, SELP = self-excited LPC. Source: Kondoz, 2001
Waveform coders • Such coders attempt to preserve the general shape of the signal waveform, and hence are not speech specific. • They generally operate on a sample-by-sample basis. Their performance is usually measured by SNR, as quantisation is the major source of distortion. • They usually operate above 16 kb/s. For example, the first speech coding standard, PCM, operates at 64 kb/s; a later standard, adaptive differential PCM (ADPCM), operates at 32 kb/s.
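As a quick check on these figures, the bit rate of a waveform coder is simply the sampling rate multiplied by the bits per sample. A minimal sketch, assuming the standard 8 kHz telephony sampling rate:

```python
# Bit rate of a waveform coder = sampling rate x bits per sample.

def bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Return the coder bit rate in kb/s."""
    return sample_rate_hz * bits_per_sample / 1000.0

print(bit_rate_kbps(8000, 8))  # PCM: 64.0 kb/s
print(bit_rate_kbps(8000, 4))  # ADPCM at 4 bits/sample: 32.0 kb/s
```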
Voice coders (vocoders) • A vocoder consists of an analyser and a synthesiser. The analyser extracts from the original speech a set of parameters representing the speech production model, which are then transmitted. The synthesiser then reconstructs the speech from the transmitted parameters. The synthesised speech is often crude. • Vocoders are very speech specific and do not attempt to preserve the waveform of speech. • Vocoders typically operate below 4.8 kb/s. Their quality is usually measured subjectively, using the mean opinion score (MOS) test or the diagnostic acceptability measure (DAM), which covers both the perceptual quality of the signal and of the background, including intelligibility, pleasantness, and overall acceptability. • Such coders are mainly targeted at non-commercial applications, e.g. secure military systems.
Hybrid coders • The hybrid scheme attempts to combine the advantages of the waveform coder and the vocoder. • It can be broadly categorised into frequency-domain and time-domain methods. • The basic idea of frequency-domain coding is to divide the speech spectrum into frequency bands or components using a filter bank or a block transform analysis. After encoding and decoding, these components are used to resynthesise the input waveform by filter bank summation or an inverse block transform. • Time-domain coding is usually based on linear prediction. The statistical characteristics of speech signals can be modelled very accurately by a source-filter model, which assumes speech is produced by filtering an excitation signal with a linear time-varying filter. For voiced speech the excitation signal is a periodic impulse train; for unvoiced speech it is a random noise signal.
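To make the source-filter idea concrete, the sketch below estimates linear-prediction coefficients for a frame of signal using the autocorrelation method and the Levinson-Durbin recursion; the residual left after inverse filtering approximates the excitation signal. This is a minimal illustration with a synthetic frame, not the analysis stage of any particular standard:

```python
import numpy as np

def lpc(frame: np.ndarray, order: int) -> np.ndarray:
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns coefficients a[1..order] of the predictor
    x_hat[n] = sum_k a[k] * x[n - k]."""
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a, err = np.zeros(order), r[0]
    for i in range(order):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1]
        a, err = a_new, err * (1 - k * k)
    return a

# Toy "voiced" frame: a decaying resonance as a stand-in for real speech.
n = np.arange(240)
x = np.sin(2 * np.pi * 0.05 * n) * 0.99 ** n * np.hanning(240)
a = lpc(x, order=10)

# Prediction residual (approximate excitation) via the inverse filter.
pred = np.convolve(x, np.concatenate(([0.0], a)))[:len(x)]
residual = x - pred
print("residual/signal energy ratio:", (residual @ residual) / (x @ x))
```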
Hybrid coders (cont) • An example of a frequency-domain hybrid coder: a typical sub-band coder (broadband analysis). Source: Kondoz, 2001
Hybrid coders (cont) • An example of a frequency-domain hybrid coder: an adaptive transform coder (narrowband analysis), in which a different bit depth can be applied to each sub-band. Source: Kondoz, 2001
Hybrid coders (cont) • An example of a time-domain hybrid coder: an adaptive predictive coder. Source: Kondoz, 2001
Quality of speech coding schemes Source: Kondoz, 2001
Difference between audio codecs and audio file formats • A codec is an algorithm that performs the encoding and decoding of the raw audio data. The audio data itself is usually stored in a file with a specific audio file format. • There are three major kinds of file format: • Uncompressed audio formats, such as WAV, AIFF, AU, or PCM. • Lossless compressed audio formats, such as FLAC, WavPack, Apple Lossless, MPEG-4 ALS, Windows Media Audio (WMA) Lossless. • Lossy compressed audio formats, such as MP3, Ogg Vorbis, AAC, WMA Lossy.
Difference between audio codecs and audio file formats (cont) • Most audio file formats support only one type of audio data (created with an audio coder); however, there are multimedia digital container formats (such as AVI) that may contain multiple types of audio and video data. • A digital container format is a meta-file format in which different types of data elements and metadata exist together in a computer file. • Formats exclusive to audio include, e.g., WAV and XMF. • Formats that can contain multiple types of data include, e.g., Ogg and MP4.
Coding dilemma • Practical audio codec design is always a trade-off between two important factors: • Data rate and system complexity constraints • Audio quality
Objective quality measurement of coded audio • Traditional objective measures: the quality of audio is measured using objective performance indices in which psychoacoustic effects are ignored, e.g.: • Signal-to-noise ratio (SNR) • Total block distortion (TBD) • Perceptual objective measures: the quality of audio is predicted using a specific model of hearing.
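A minimal sketch of the traditional SNR measure (TBD follows the same pattern, computed block by block), with a crude uniform quantiser standing in for a codec:

```python
import numpy as np

def snr_db(reference: np.ndarray, coded: np.ndarray) -> float:
    """Signal-to-noise ratio in dB between a reference and its coded version."""
    noise = reference - coded
    return 10.0 * np.log10((reference @ reference) / (noise @ noise))

# Example: quantise a 440 Hz sine to 8 bits and measure the SNR.
t = np.linspace(0, 1, 8000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
x_hat = np.round(x * 2 ** 7) / 2 ** 7   # crude 8-bit uniform quantiser
print(f"SNR = {snr_db(x, x_hat):.1f} dB")
```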
Subjective quality measurement of coded audio • Human listening tests: When a highly accurate assessment is needed, formal listening tests will be required to judge the perceptual quality of audio.
Experiment of the “13 dB miracle” • J. Johnston and K. Brandenburg, then at Bell Labs, presented two signals with the same SNR of 13 dB: one with added white noise, the other with injected noise that was perceptually shaped (so that the distortion was partially or completely masked by the signal components). Despite the identical SNR, the perceived quality was very different: the latter was judged a good-quality signal, while the former was rather annoying.
Factors to consider in assessing audio coder quality • Audio material • Different material stresses different aspects of a coder. For example, transient sounds can be used to test the coder's ability to code transient signals. • Data rate • Decreasing the data rate is likely to reduce the quality of a codec, so the data rate should be taken into account when comparing the quality of audio codecs.
Impairments versus transparency • The audio quality of a coding system can be assessed in terms of impairment, i.e. the perceived difference between the output of a system under test and a known reference signal. • The coding system under test is said to be transparent when even listeners who are expert in identifying impairments cannot distinguish between the reference and the test signals. • To determine whether (or how nearly) the coding system is transparent, we can present both the test and reference signals to listeners in random order and ask them to pick out the test signal. If the listeners are wrong roughly 50% of the time, the system is transparent.
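Whether listeners pick out the test signal significantly more often than the 50% chance level can be checked with a simple binomial test; a sketch assuming SciPy ≥ 1.7 (the function name and significance threshold are illustrative):

```python
from scipy.stats import binomtest

def consistent_with_transparency(correct: int, trials: int,
                                 alpha: float = 0.05) -> bool:
    """True if the listeners' hit rate is not significantly above
    the 50% chance level, i.e. the coder may be transparent."""
    result = binomtest(correct, trials, p=0.5, alternative='greater')
    return result.pvalue >= alpha

print(consistent_with_transparency(16, 30))  # ~53% correct -> True
print(consistent_with_transparency(26, 30))  # ~87% correct -> False (audible)
```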
Coding margin • The coding margin is a measure of how far the coder is from the onset of audible impairments. • It can be estimated using listening tests in which the data rate of a coder is gradually reduced until listeners can detect the test signal with statistically significant accuracy when the reference signal is also present. • In practice, people are interested in the degree of impairment below the region of transparency. In most cases a coder in or near the transparent region is preferred (i.e. the impairments are very small). The well-known five-grade impairment scale and the formal listening test process, discussed later, are designed (by the ITU-R) for such situations.
Listening tests for audio codec quality assessment • Main features of a listening test for coded audio with small impairments (described in more detail in the standard [ITU-R BS.1116]): • Five-grade impairment scale • Test method • Training and grading • Expert listeners and critical material • Listening conditions • Data analysis of listening results
Five-grade impairment scale • According to the standard ITU-R BS.562-3, any perceived difference between the reference signal and the output of the system under test should be interpreted as a perceptual impairment, measured by the following discrete five-grade scale: Source: Bosi & Goldberg, 2002
Five-grade impairment scale (cont) • Correspondence between the five-grade impairment scale and five-grade quality scale: Source: Bosi & Goldberg, 2002
Five-grade impairment scale (cont) • For convenience of data analysis, the subjective difference grade (SDG) is usually used. The SDG is the difference between the listener's grades for the coded signal and for the reference signal, i.e. SDG = grade of coded signal – grade of reference signal. For example, if the reference is graded 5.0 and the coded signal 3.8, the SDG is –1.2. • The SDG is negative when the listener successfully distinguishes the reference from the coded signal, and positive when the listener erroneously identifies the coded signal as the reference. Source: Bosi & Goldberg, 2002
Test method • The most widely accepted method for testing coding systems with small impairments is the so-called “double-blind, triple-stimulus with hidden reference” method. • Triple stimulus. The listener is presented with three signals: the reference signal and two test signals, A and B. One of the test signals is identical to the reference signal. • Double blind. Neither the listener nor the test administrator knows beforehand which test signal is which. Test signals A and B are assigned randomly by some entity other than the test administrator. • Hidden reference. The hidden reference (the test signal that is identical to the reference signal) provides an easy means of checking that the listener does not consistently make mistakes. • This test method has been employed worldwide and provides a very sensitive, accurate, and stable way of assessing small impairments in coded audio.
Training and grading • A listening test usually consists of two phases: a training phase and a formal grading phase. • Training phase. This is carried out prior to the formal grading phase. It allows the listening panel to become familiar with the test environment, the grading process, and the codec impairments. This phase can substantially reduce the effect of so-called informational masking, the phenomenon whereby the threshold of a complex maskee masked by a complex masker can decrease by on the order of 40 dB after training. Note that a small unfamiliar distortion is much more difficult to assess than a small familiar one. • Grading phase. In this phase, the listener is presented with a grading sheet; see the example used for the development of the MPEG AAC coder in the figure on the next page.
Training and grading (cont) • Example of a grading sheet from a listening test Source: Bosi & Goldberg, 2002
Expert listeners and critical material • Expert listeners are listeners who have recent and extensive experience of assessing impairments of the type being studied in the test. The expert listening panel is typically selected using pre-screening (e.g. an audiometric test) and post-screening (e.g. determining whether the listener can consistently identify the hidden reference) procedures. • Critical material should be sought for each codec to be tested, even though it is impossible to compile a complete list of difficult material for perceptual audio codecs. Such material may include synthetic signals designed to deliberately break the system under test, or any potential broadcast material that stresses it.
Listening conditions • The listening conditions and the equipment need to be precisely specified so that others can reliably reproduce the test. The listening conditions include the characteristics of the listening room (such as its geometric properties, reverberation time, early reflections, and background noise), the characteristics and arrangement of the loudspeakers in the listening room, and the reference listening area. (See the multichannel loudspeaker configuration in [ITU-R BS.1116].) Source: Bosi & Goldberg, 2002
Data analysis • The ANOVA (analysis of variance) method is most commonly used for analysing the test results. The SDG (subjective difference grade) is an appropriate basis for a detailed statistical analysis. • The resolution achieved by the listening test is reflected in the confidence interval, which contains the SDG values with a specific degree of confidence, 1 − α, where α represents the probability that inaudible differences are labelled as audible. The figure below shows an example of formal listening test results from [ISO/IEC MPEG N1420]. Source: Bosi & Goldberg, 2002
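As an illustration of the kind of summary statistics involved, the sketch below computes a panel's mean SDG and a 95% confidence interval based on the t-distribution (the grades are made up, not real test data):

```python
import numpy as np
from scipy import stats

# Hypothetical SDGs from a panel of eight listeners for one codec/item pair.
sdg = np.array([-0.8, -1.2, -0.5, -1.0, -0.7, -0.9, -1.1, -0.6])

mean = sdg.mean()
sem = stats.sem(sdg)   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sdg) - 1, loc=mean, scale=sem)
print(f"mean SDG = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```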
MUSHRA method • MUSHRA (Multiple Stimulus with Hidden Reference and Anchors) is recommended in [ITU-R BS.1534] as a guideline for the assessment of audio systems of intermediate quality, i.e. for ranking systems in the region far from transparency. In this case, the seven-grade comparison scale is recommended. • The presence of the anchor(s), a low-pass-filtered version of the reference signal, is meant as an aid in weighting the relative annoyance of the various artefacts. Source: Bosi & Goldberg, 2002
Advantages and disadvantages of formal subjective listening tests • Advantages • Good reliability • Disadvantages • High cost • Time consuming
Objective perceptual measurements of audio quality • Aim • To predict the basic audio quality using objective measurements based on psychoacoustic principles. • PEAQ (Perceptual Evaluation of Audio Quality) • Adopted in [ITU-R BS.1387], it is based on a refinement of generally accepted psychoacoustic models, together with new cognitive components accounting for the higher-level processes involved in the judgement of audio quality.
Two basic approaches used in objective perceptual measurements • The masked threshold method (based on the estimation and accurate modelling of masking) • The internal representation method (based on the estimation of the excitation patterns of the cochlea in the human ear) (Figure: block diagrams of the masked threshold method and the internal representation method.) Source: Bosi & Goldberg, 2002
PEAQ • PEAQ has two versions: basic (using only the DFT) and advanced (using both the DFT and a filter bank). The basic model is fast and suitable for real-time applications, while the advanced model is computationally more expensive but provides more accurate results. • In the advanced version, the peripheral ear is modelled both through a DFT and through a bank of forty pairs of linear-phase filters with centre frequencies and bandwidths corresponding to the auditory filter bandwidths. • The model output variables (MOVs) are based partly on the masked threshold method and partly on the internal representation method. • MOVs include the partial loudness of linear and nonlinear distortions, noise-to-mask ratios, alteration of temporal envelopes, harmonic errors, probability of error detection, and the proportion of signal frames containing audible distortions. • The selected MOVs are mapped to an objective difference grade (ODG) via an artificial neural network; the ODG is a prediction of the SDG. • The correlation between SDG and ODG proved to be very good, with no statistically significant difference between them.
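The network and its weights are defined in [ITU-R BS.1387]; the sketch below only illustrates the general shape of such a MOV-to-ODG mapping. The dimensions and weights are placeholders, not the standard's values:

```python
import numpy as np

def movs_to_odg(movs, w_in, b_in, w_out, b_out):
    """Map model output variables (MOVs) to a single objective difference
    grade (ODG) with a one-hidden-layer network. All weights here are
    random placeholders, NOT the values specified in BS.1387."""
    hidden = 1.0 / (1.0 + np.exp(-(w_in @ movs + b_in)))  # sigmoid layer
    return float(w_out @ hidden + b_out)                  # scalar ODG

rng = np.random.default_rng(0)
movs = rng.normal(size=11)                   # e.g. 11 MOVs (basic version)
w_in, b_in = rng.normal(size=(3, 11)), rng.normal(size=3)
w_out, b_out = rng.normal(size=3), 0.0
print("ODG (illustrative):", movs_to_odg(movs, w_in, b_in, w_out, b_out))
```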
PEAQ (cont) • Psychoacoustic model of the advanced version of PEAQ. Source: Bosi & Goldberg, 2002
Coding artifacts • Pre-echo • For sharp transient signals, pre-echo is caused by the spreading of quantisation noise into a time region where it is not masked. It can be reduced by block switching (see the sketch below). • Aliasing • Aliasing can occur when PQMF and MDCT are combined with coarse quantisation, but it is not a problem under normal conditions. • Birdies • These can occur at low data rates: the bit allocation changes from block to block in the highest frequency bands, causing some spectral coefficients to appear and disappear. • Reverberation • This can occur at low data rates when a large block size is employed for the filter bank. • Multichannel artefacts • Loss of, or shifts in, the stereo image can introduce artefacts related to binaural masking.
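To illustrate how block switching counters pre-echo, a coder can monitor short-term energy within a frame and fall back to short blocks when an attack is detected, so quantisation noise cannot spread ahead of the transient. A minimal sketch (the sub-block count and threshold are arbitrary choices, not taken from any standard):

```python
import numpy as np

def choose_block_length(frame: np.ndarray, n_sub: int = 8,
                        threshold: float = 10.0) -> str:
    """Return 'short' when one sub-block's energy jumps well above the
    mean energy of the preceding sub-blocks (a transient attack)."""
    sub = frame.reshape(n_sub, -1)
    energy = (sub ** 2).sum(axis=1) + 1e-12   # avoid division by zero
    for i in range(1, n_sub):
        if energy[i] / energy[:i].mean() > threshold:
            return 'short'
    return 'long'

# A frame that is silent and then hits a sharp, castanet-like attack.
frame = np.zeros(1024)
frame[700:] = np.random.default_rng(1).normal(scale=0.5, size=324)
print(choose_block_length(frame))  # -> 'short'
```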
PESQ • PESQ (Perceptual Evaluation of Speech Quality), described in [ITU-T Rec. P.862] and launched in 2000, is a family of algorithms for the objective measurement of speech quality that predict the results of subjective listening tests on telephony systems. • PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or network element. The resulting quality score is analogous to the subjective mean opinion score (MOS) measured using listening tests according to [ITU-T P.800]. • PESQ takes into account coding distortions, errors, packet loss, fixed and variable delay, and filtering in analogue network components. The user interfaces have been designed to provide simple access to the algorithm, either directly from the analogue connection or from speech files recorded elsewhere.
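For experimentation, an open-source wrapper around the ITU-T reference code exists; the sketch below assumes the third-party Python `pesq` and `soundfile` packages and uses illustrative file names:

```python
import soundfile as sf   # assumed available for reading WAV files
from pesq import pesq    # third-party wrapper around the ITU-T P.862 code

# Hypothetical file names; both signals must share the same sample rate
# (8 kHz for narrowband mode, 16 kHz for wideband mode).
ref, fs = sf.read('reference.wav')
deg, _ = sf.read('degraded.wav')

score = pesq(fs, ref, deg, 'nb')   # 'nb' = P.862 narrowband, 'wb' = P.862.2
print(f"PESQ score (MOS-like): {score:.2f}")
```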
Audio Perception • Loudness perception • Pitch perception • Space perception • Timbre perception
Inner Ear Function • The inner ear consists of the cochlea, which has a snail-like structure. • It transfers mechanical vibrations to movement of the basilar membrane, which is then converted into nerve firings by the organ of Corti, which consists of a number of hair cells. • The basilar membrane carries out a frequency analysis of input sounds: it responds best to high frequencies at the (narrow and thin) base end and to low frequencies at the (wide and thick) apex end.
Inner Ear Function • The spiral nature of the cochlea • The cochlea unrolled • Vertical cross-section through the cochlea • Detailed view of the cochlea tube Source: Howard & Angus, 1996
Loudness Perception • The ear's sensitivity to sounds of different frequencies varies over a wide range of sound pressure level (SPL). The minimum sound pressure that can be detected by the human hearing system, around 4 kHz, is approximately 10⁻⁵ Pa, while the maximum (i.e. the threshold of pain) is about 20 Pa. • For convenience, SPL is usually expressed in decibels (dB) relative to the reference pressure p_ref = 2 × 10⁻⁵ Pa: SPL (dB) = 20 log10(p / p_ref), where p is the measured sound pressure. • For example, the threshold of hearing at 1 kHz is p = 2 × 10⁻⁵ Pa, which in dB equals 20 log10((2 × 10⁻⁵) / (2 × 10⁻⁵)) = 0 dB. • The threshold of pain, 20 Pa, in dB equals 20 log10(20 / (2 × 10⁻⁵)) = 120 dB.
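A quick numerical check of these figures (the 2 × 10⁻⁵ Pa reference pressure is the standard one):

```python
import math

P_REF = 2e-5  # reference pressure in Pa (20 micropascals)

def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to 20 uPa."""
    return 20.0 * math.log10(pressure_pa / P_REF)

print(spl_db(2e-5))  # threshold of hearing at 1 kHz: 0 dB
print(spl_db(20.0))  # threshold of pain: 120 dB
```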
Loudness Perception (cont.) • The perceived loudness of an acoustic sound is related to its amplitude (but not a simple one-to-one relationship), as well as the context and nature of the sound. • As the sensitivity of our hearing system varies as the frequency changes, it is possible for a sound with a larger pressure amplitude to be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [recall the equal loudness contour of the human auditory system shown in the first lecture]
Demos for Loudness Perception • Resources: Audio Box CD from Univ. of Victoria • Decibels vs loudness • A 440 Hz tone (i.e. note A4), reduced by 1 dB each step • A 440 Hz tone (i.e. note A4), reduced by 3 dB each step • A 440 Hz tone (i.e. note A4), reduced by 5 dB each step • Intensity vs loudness • Various frequencies played at a constant SPL • A reference tone is played, then the same tone 5 dB higher; then the reference tone and the tone 8 dB higher; finally the reference tone and the tone 10 dB higher
Pitch Perception • What is pitch? Pitch • is “the attribute of auditory sensation in terms of which sounds may be ordered on a musical scale extending from low to high” (American Standard Association, 1960); • is a “subjective” attribute that cannot be measured directly. A specific pitch value is therefore usually given as the frequency of a pure tone that has the same subjective pitch as the sound. In other words, the measurement of pitch requires a human listener (the “subject”) to make a perceptual judgement. This is in contrast to a laboratory measurement of, for example, the fundamental frequency of a complex tone, which is an “objective” measurement (Howard & Angus, 1996); • is related to the repetition rate of the waveform of a sound; it therefore corresponds to the frequency of a pure tone and to the fundamental frequency of a complex tone. In general, sounds whose acoustic pressure varies periodically with time are perceived as pitched, while sounds with non-periodic pressure waveforms are perceived as non-pitched (Howard & Angus, 1996).
Existing Pitch Perception Theories • ‘Place’ theory • Spectral analysis is performed on the stimulus in the inner ear: different frequency components of the input sound excite different places along the basilar membrane, and hence neurones with different centre frequencies. • ‘Temporal’ theory • Pitch corresponds to the time pattern of the neural impulses evoked by the stimulus. Nerve firings tend to occur at a particular phase of the stimulating waveform, so the intervals between successive neural impulses approximate integer multiples of the period of the stimulating waveform. • ‘Contemporary’ theory • Neither theory alone fully explains the mechanism of human pitch perception; a combination of the two benefits the analysis of pitch perception.
Contemporary Theory (Moore, 1982)
Demos for Pitch Perception • Resources: Audio Box CD from Univ. of Victoria • These three demos show how pitch is perceived for different signal durations. In each track, short bursts of sound are played; a different pitch is played in each of the three tracks.
Space Perception • Sound localisation refers to judgements of the direction and distance of a sound source, usually achieved through the use of two ears (binaural hearing), exploiting: • interaural time differences • interaural intensity differences • Although binaural hearing is crucial for sound localisation, monaural perception is similarly effective in some cases, such as the detection of signals in quiet, intensity discrimination, and frequency discrimination.
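A simple geometric model of the interaural time difference is Woodworth's spherical-head approximation; the sketch below assumes a head radius of 8.75 cm and a speed of sound of 343 m/s:

```python
import math

def itd_seconds(azimuth_deg: float, head_radius_m: float = 0.0875,
                speed_of_sound: float = 343.0) -> float:
    """Woodworth's spherical-head approximation of the interaural time
    difference for a distant source (0 deg = straight ahead,
    90 deg = directly to one side)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(f"azimuth {az:2d} deg -> ITD = {itd_seconds(az) * 1e6:.0f} us")
```

At 90° azimuth this gives roughly 650 µs, the typical maximum ITD for an adult head.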