Digital Audio Compression CIS 465 Spring 2013
Speech Compression • Compression of voice data • We have previously mentioned several methods that are used to compress voice data • mu-law and A-law companding • ADPCM and delta modulation • These are examples of methods which work in the time domain (as opposed to the frequency domain) • Often they are not even considered compression methods
Speech Compression • Although the previous techniques are generally applied to speech data, they are not designed specifically for such data • Vocoders, by contrast, are designed specifically for speech • They cannot be used with other analog signals • They model speech so that its salient features can be captured in as few bits as possible • Linear Predictive Coders (LPC) model the speech waveform in time (a small sketch of the idea follows) • Channel vocoders and formant vocoders are other examples • In electronic music, vocoders allow a voice to modulate a musical source (e.g. a synthesizer)
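To make the linear-prediction idea concrete, here is a minimal Python/NumPy sketch that fits predictor coefficients with the autocorrelation method. The order-8 predictor and the synthetic "speech-like" signal are illustrative assumptions; real LPC vocoders add windowing, the Levinson-Durbin recursion, pitch/voicing analysis, and quantization of the coefficients.

```python
import numpy as np

def lpc_coefficients(signal, order=8):
    """Minimal LPC sketch: solve the autocorrelation normal equations.

    Models each sample as a weighted sum of the previous `order` samples.
    This is only the prediction step of a vocoder, not a full coder.
    """
    x = np.asarray(signal, dtype=float)
    # Autocorrelation values r[0] .. r[order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz system R a = r[1:] gives the predictor coefficients a
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Example with a synthetic quasi-periodic signal standing in for voiced speech.
t = np.arange(2000)
speech_like = np.sin(2 * np.pi * 0.01 * t) + 0.1 * np.random.randn(2000)
print(lpc_coefficients(speech_like))
```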
General Audio Compression • If we want to compress general audio (not just speech), different techniques are needed • In particular, music compression is a more general form of audio compression • We make use of psychoacoustical modeling • This enables perceptual encoding based upon an analysis of how the ear and brain perceive sound • Perceptual encoding exploits audio elements that the human ear cannot hear well
Psychoacoustics • If you have been listening to very loud music, you may have trouble afterwards hearing soft sounds (that normally you could hear) • Temporal masking • A loud sound at one frequency (a lead guitar) may drown out a sound at another frequency (the singer) • Frequency masking
Equal-Loudness Relations • If we play two pure tones (sinusoidal sound waves) with the same amplitude but different frequencies • One may sound louder than the other • The ear does not hear low or high frequencies as well as mid-range ones (speech) • This can be shown with equal-loudness curves, which plot, for each frequency, the true loudness required to produce the same perceived loudness
Threshold of Hearing • The following image is a plot of the threshold of human hearing for pure tones – at loudness below the curve, we don’t hear a tone
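For reference, a commonly cited analytic approximation of this threshold-in-quiet curve (due to Terhardt) can be evaluated directly. The sketch below shows the general shape in dB SPL; it is an approximation from the perceptual-coding literature, not necessarily the exact curve plotted in the slide.

```python
import numpy as np

def threshold_in_quiet_db(freq_hz):
    """Terhardt's approximation of the threshold in quiet (dB SPL).

    Pure tones below this level are inaudible; the dip around 2-4 kHz
    is where the ear is most sensitive.
    """
    f = np.asarray(freq_hz, dtype=float) / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

for f in (100, 1000, 3500, 15000):
    print(f, "Hz ->", round(float(threshold_in_quiet_db(f)), 1), "dB SPL")
```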
Threshold of Hearing • A loud sound can mask other sounds at nearby frequencies as shown below
Frequency masking • We can determine how a pure tone at a particular frequency affects our ability to hear tones at nearby frequencies • Then, if a signal can be decomposed into frequencies, for those frequencies that are only partially masked, only the audible part will be used to set the quantization noise thresholds
Critical Bands • Human hearing range divides into critical bands • The human auditory system cannot resolve sounds better than within about one critical band when other sounds are present • Critical bandwidth represents the ear's resolving power for simultaneous tones • At lower frequencies the bands are narrower than at higher frequencies • Each critical band corresponds to a region of the inner ear (the basilar membrane) that responds to a particular range of frequencies
Critical Bands • Generally, the audio frequency range for hearing (20 Hz – 20 kHz) can be partitioned into about 24 critical bands (25 are typically used for coding applications) • The previous slide does not show several of the highest frequency critical bands • The critical band at the highest audible frequency is over 4000 Hz wide • The ear is not very discriminating within a critical band
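One common analytic mapping from frequency to critical-band number is Zwicker's Bark-scale approximation, sketched below. Actual coders use tabulated band edges, so treat this as an illustration of the idea rather than the tables MPEG uses.

```python
import numpy as np

def hz_to_bark(freq_hz):
    """Zwicker's approximation mapping frequency (Hz) to the Bark scale.

    One Bark corresponds to roughly one critical band; 20 Hz - 20 kHz
    spans about 24-25 Bark, matching the ~24 critical bands above.
    """
    f = np.asarray(freq_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

for f in (100, 500, 1000, 4000, 10000, 20000):
    print(f, "Hz ->", round(float(hz_to_bark(f)), 1), "Bark")
```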
Temporal Masking • A loud tone causes the hearing receptors in the inner ear to become saturated, and they require time to recover • This leads to the temporal masking effect • After the loud tone we cannot immediately hear another tone – post-masking • The length of the masking depends on the duration of the masking tone • A masking tone can also block sounds played just before – pre-masking (shorter time)
Temporal Masking • MPEG audio compression takes advantage of both temporal and frequency masking to transmit masked frequency components using fewer bits
MPEG Audio Compression • MPEG (Moving Picture Experts Group) is a family of standards for compression of both audio and video data • MPEG-1 (1991) CD quality audio • MPEG-2 (1994) Multi-channel surround sound • MPEG-4 (1998) Also includes MIDI, speech, etc. • MPEG-7 (2003) Not compression – searching • MPEG-21 (2004) Not compression – digital rights management
MPEG Audio Compression • MPEG-1 defined three downward compatible layers of audio compression • Each layer offers more complexity in the psychoacoustic model used and hence better compression • Increased complexity leads to increased delay • Compatibility is achieved by shared file header information • Layer 1 – used for the Digital Compact Cassette (DCC) • Layer 2 – proposed for digital audio broadcasting • Layer 3 – music (MPEG-1 Layer 3 == mp3)
MPEG Audio Compression • MPEG audio compression relies on quantization, masking, and critical bands • The encoder uses a bank of 32 filters to decompose the signal into sub-bands • Uniform width – not exactly aligned to critical bands • Overlapping • A Fourier transform is used for the psychoacoustic model • Layer 3 adds a modified DCT (MDCT) to the sub-band filtering, so Layers 1 and 2 work in the temporal domain and Layer 3 in the frequency domain
MPEG Audio Compression • PCM input is filtered into 32 bands • The PCM is also FFT-transformed for the psychoacoustic model • Windows of samples (384, 576, or 1152) are coded at a time
MPEG Audio Compression • Since the sub-bands overlap, aliasing may occur • This is overcome by the use of a quadrature mirror filter bank • Attenuation slopes of adjacent bands are mirror images
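To illustrate the "mirror image" property, here is a toy two-band quadrature mirror pair (not MPEG's 32-band polyphase filterbank): the high-pass filter is the low-pass filter with alternating signs, so its magnitude response mirrors the low-pass response about the half-band frequency. The low-pass taps below are a hypothetical example, not coefficients from any standard.

```python
import numpy as np

# Hypothetical symmetric low-pass prototype and its mirrored high-pass partner.
h0 = np.array([0.026, -0.078, 0.267, 0.602, 0.267, -0.078, 0.026])
h1 = h0 * (-1.0) ** np.arange(len(h0))  # flip sign of every other tap

# Sample the magnitude responses on a few frequencies in [0, pi].
w = np.linspace(0, np.pi, 5)
H0 = np.abs([np.sum(h0 * np.exp(-1j * wi * np.arange(len(h0)))) for wi in w])
H1 = np.abs([np.sum(h1 * np.exp(-1j * wi * np.arange(len(h1)))) for wi in w])
print(np.round(H0, 3))        # low-pass: large near 0, small near pi
print(np.round(H1[::-1], 3))  # same values reversed: the mirror image
```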
MPEG Audio Algorithm • The PCM audio data is assembled into frames • Header – sync code of 12 1s • SBS format – describes how many sub-band samples (SBS) are in the frame • The SBS themselves (384 in Layer 1, 1152 in Layers 2 and 3) • Ancillary data – e.g. multi-lingual data or surround-sound data
MPEG Audio Algorithm • The sampling rate determines the frequency range • That range is divided up into 32 overlapping bands • The frames are sent through a corresponding 32-filter filter bank • If X is the number of samples per frame, each filter produces X/32 samples • These are still samples in the temporal domain
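A toy sketch of this 32-band split and 32:1 decimation is shown below. It uses a crude FFT-based band split rather than the actual MPEG polyphase/QMF filterbank, so treat it only as an illustration of the sample counts involved.

```python
import numpy as np

def toy_subband_analysis(frame, num_bands=32):
    """Toy sub-band split: NOT the real MPEG polyphase filterbank.

    Splits a PCM frame into uniform frequency bands via an FFT mask,
    then keeps one output sample per band per num_bands input samples,
    mirroring the X/32 decimation described in the slide.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame)
    edges = np.linspace(0, len(spectrum), num_bands + 1, dtype=int)
    subbands = []
    for b in range(num_bands):
        mask = np.zeros_like(spectrum)
        mask[edges[b]:edges[b + 1]] = spectrum[edges[b]:edges[b + 1]]
        band_signal = np.fft.irfft(mask, n)   # band-limited time signal
        subbands.append(band_signal[::num_bands])  # decimate by 32
    return np.array(subbands)

# Example: a 384-sample Layer 1 frame yields 12 samples in each of 32 bands.
frame = np.random.randn(384)
print(toy_subband_analysis(frame).shape)  # (32, 12)
```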
MPEG Audio Algorithm • The Fourier transform is performed on a window of samples surrounding the samples in the frame (either 1024 or 2*1024 samples) • This feeds into the psychoacoustic model (along with the subband samples) • Analyze tonal and nontonal elements in each band • Determine spreading functions (how much each band affects another)
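A classic analytic form of such a spreading function is Schroeder's, sketched below: it approximates how much a masker raises the threshold in neighbouring critical bands, falling off more slowly toward higher frequencies than toward lower ones. MPEG's psychoacoustic models use their own tabulated variants, so this is only an illustration of the shape.

```python
import numpy as np

def schroeder_spread_db(delta_bark):
    """Schroeder spreading function (dB) vs. masker distance in Bark."""
    dz = np.asarray(delta_bark, dtype=float)
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

# Roughly 0 dB at the masker, steep drop toward lower bands (negative dz),
# gentler drop toward higher bands (positive dz).
for dz in (-3, -1, 0, 1, 3):
    print(dz, "Bark ->", round(float(schroeder_spread_db(dz)), 1), "dB")
```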
MPEG Audio Algorithm • Find the masking threshold and signal-to-mask ratios (SMRs) for each band • The scale factor for each band is the maximum amplitude of the samples in that band • The bit-allocation algorithm takes the SMRs and scale factors and determines how many bits can be allocated (quantization granularity) for each band • In MP3, bits can be moved from band to band as needed to ensure a minimum amount of compression while achieving higher quality
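The sketch below shows the flavor of SMR-driven bit allocation as a simple greedy loop: keep giving an extra bit to whichever band currently has the worst mask-to-noise ratio, assuming each added bit buys roughly 6 dB of SNR. The step size, bit budget, and example SMR values are illustrative assumptions, not the exact MPEG-1 allocation tables.

```python
import numpy as np

def toy_bit_allocation(smr_db, total_bits, step_db=6.0):
    """Toy greedy bit allocation driven by signal-to-mask ratios."""
    smr_db = np.asarray(smr_db, dtype=float)
    bits = np.zeros(len(smr_db), dtype=int)
    for _ in range(total_bits):
        snr_db = bits * step_db          # rough SNR from bits granted so far
        mnr_db = snr_db - smr_db         # mask-to-noise ratio per band
        bits[int(np.argmin(mnr_db))] += 1  # help the worst-off band
    return bits

# Example: bands with high SMR (poorly masked) end up with more bits.
smr = np.linspace(30, -10, 32)  # hypothetical SMRs for 32 bands, in dB
print(toy_bit_allocation(smr, total_bits=96))
```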
MPEG Audio Algorithm • Layer 1 has 12 samples encoded per band per frame • Layer 2 has 3 groups of 12 (36 samples) per band per frame • Layer 3 has non-equal frequency bands • Layer 3 also performs a Modified DCT (MDCT) on the filtered data, so we are in the frequency (not time) domain • Layer 3 does non-uniform quantization followed by Huffman coding • All of these modifications make for better (if more complex) performance for MP3
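The MDCT itself is easy to state directly; the sketch below is the textbook O(N²) definition mapping 2N time samples to N frequency coefficients. A real Layer 3 encoder additionally applies a window, overlaps consecutive blocks by 50%, and uses fast algorithms, so this is only the core transform.

```python
import numpy as np

def mdct(block):
    """Modified DCT: 2N time samples -> N frequency coefficients."""
    two_n = len(block)
    n_half = two_n // 2
    out = np.empty(n_half)
    for k in range(n_half):
        phase = (np.pi / n_half) * (np.arange(two_n) + 0.5 + n_half / 2) * (k + 0.5)
        out[k] = np.sum(block * np.cos(phase))
    return out

# A 36-sample long block (as in Layer 3) yields 18 coefficients per sub-band.
print(mdct(np.random.randn(36)).shape)  # (18,)
```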
Stereo Encoding • MPEG codes stereo data in several different ways • Joint stereo • Intensity stereo • Etc. • We are not discussing these
MPEG File Format • MPEG audio files do not have a global file header (so you can start playing/processing anywhere in the file) • They consist of a sequence of frames • Each frame has a header followed by audio data
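Because every frame begins with the sync pattern of set bits described earlier, a decoder can locate frame boundaries by scanning for it. The sketch below is a toy scanner that only checks the sync bits; a real player would also validate the version, layer, and bitrate fields that follow before trusting a match. The filename in the usage comment is hypothetical.

```python
def find_frame_syncs(data: bytes):
    """Return byte offsets where a 12-bit MPEG audio frame sync appears."""
    offsets = []
    for i in range(len(data) - 1):
        # 0xFF followed by a byte whose top four bits are set = 12 one-bits.
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            offsets.append(i)
    return offsets

# Hypothetical usage:
# with open("song.mp3", "rb") as f:
#     print(find_frame_syncs(f.read())[:5])
```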
MPEG File Format • ID3 is a metadata container most often used in conjunction with the MP3 audio file format • It allows information such as the title, artist, album, track number, year, genre, and other information about the file to be stored in the file itself • ID3v1 tags occupy the last 128 bytes of the file
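Reading an ID3v1 tag is a matter of slicing fixed-width fields out of those last 128 bytes, as sketched below (the filename in the usage comment is hypothetical; later ID3v2 tags live at the start of the file and have a different, variable-length format).

```python
def read_id3v1(path):
    """Read an ID3v1 tag from the last 128 bytes of an MP3 file.

    Layout: 'TAG' marker (3 bytes), title (30), artist (30), album (30),
    year (4), comment (30), genre index (1). Returns None if no tag found.
    """
    with open(path, "rb") as f:
        f.seek(-128, 2)          # 128 bytes from the end of the file
        block = f.read(128)
    if block[:3] != b"TAG":
        return None
    text = lambda b: b.rstrip(b"\x00 ").decode("latin-1", errors="replace")
    return {
        "title":   text(block[3:33]),
        "artist":  text(block[33:63]),
        "album":   text(block[63:93]),
        "year":    text(block[93:97]),
        "comment": text(block[97:127]),
        "genre":   block[127],
    }

# Hypothetical usage: print(read_id3v1("song.mp3"))
```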
Bit Rates • Audio (or video) compression schemes can be characterized as either constant bit rate (CBR) or variable bit rate (VBR) • In general, higher compression can be achieved with VBR (at the cost of added complexity in encoding and decoding) • MPEG-1 Layers 1 and 2 are CBR only • MP3 can be either VBR or CBR • Average Bit Rate (ABR) is a compromise
MPEG-2 AAC • MPEG-2 (which is used for encoding DVDs) has an audio component as well • MPEG-2 AAC (Advanced Audio Coding) standard was aimed at transparent sound reproduction for theatres • 320 kbps for five channels (left, right, center, left-surround and right-surround) • 5.1 channel systems include a low-frequency enhancement channel (“woofer”) • AAC can also deliver high-quality stereo sound at bitrates less than 128 kbps
MPEG-2 AAC • AAC is the default audio format for (e.g.) YouTube, iPod (iTunes), PS3, Nintendo DSi, etc. • Compared to MP3 • More sampling frequencies • More channels • More efficient, simpler filterbank (pure MDCT) • Arbitrary bit rates and variable frame lengths • Etc.
MPEG-4 Audio • MPEG-4 audio integrates a number of audio components into one standard • Speech compression • Text-to-speech • MIDI • MPEG-4 AAC (similar to MPEG-2 AAC) • Alternative coders (perceptual coders and structured coders)