Digital Audio Compression CIS 465 Spring 2013
Speech Compression • Compression of voice data • We have previously mentioned several methods that are used to compress voice data • mu-law and A-law companding • ADPCM and delta modulation • These are examples of methods which work in the time domain (as opposed to the frequency domain) • Often they are not even considered compression methods
Speech Compression • Although the previous techniques are generally applied to speech data, they are not designed specifically for such data • Vocoders, by contrast, are designed specifically for speech • They cannot be used with other analog signals • They model speech so that its salient features can be captured in as few bits as possible • Linear Predictive Coders (LPC) model the speech waveform in time (a small sketch of the idea follows) • Channel vocoders and formant vocoders are other examples • In electronic music, vocoders allow a voice to modulate a musical source (e.g. a synthesizer)
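To make the linear-prediction idea concrete, here is a minimal Python/NumPy sketch that fits predictor coefficients with the autocorrelation method. The order-8 predictor and the synthetic "speech-like" signal are illustrative assumptions; real LPC vocoders add windowing, the Levinson-Durbin recursion, pitch/voicing analysis, and quantization of the coefficients.

```python
import numpy as np

def lpc_coefficients(signal, order=8):
    """Minimal LPC sketch: solve the autocorrelation normal equations.

    Models each sample as a weighted sum of the previous `order` samples.
    This is only the prediction step of a vocoder, not a full coder.
    """
    x = np.asarray(signal, dtype=float)
    # Autocorrelation values r[0] .. r[order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz system R a = r[1:] gives the predictor coefficients a
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Example with a synthetic quasi-periodic signal standing in for voiced speech.
t = np.arange(2000)
speech_like = np.sin(2 * np.pi * 0.01 * t) + 0.1 * np.random.randn(2000)
print(lpc_coefficients(speech_like))
```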
General Audio Compression • If we want to compress general audio (not just speech), different techniques are needed • In particular, music compression is a more general form of audio compression • We make use of psychoacoustical modeling • This enables perceptual encoding based upon an analysis of how the ear and brain perceive sound • Perceptual encoding exploits audio elements that the human ear cannot hear well
Psychoacoustics • If you have been listening to very loud music, you may have trouble afterwards hearing soft sounds (that normally you could hear) • Temporal masking • A loud sound at one frequency (a lead guitar) may drown out a sound at another frequency (the singer) • Frequency masking
Equal-Loudness Relations • If we play two pure tones (sinusoidal sound waves) with the same amplitude but different frequencies • One may sound louder than the other • The ear does not hear low or high frequencies as well as mid-range ones (speech) • This can be shown with equal-loudness curves, which plot, for each frequency, the true loudness required to produce the same perceived loudness
Threshold of Hearing • The following image is a plot of the threshold of human hearing for pure tones – at loudness below the curve, we don’t hear a tone
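For reference, a commonly cited analytic approximation of this threshold-in-quiet curve (due to Terhardt) can be evaluated directly. The sketch below shows the general shape in dB SPL; it is an approximation from the perceptual-coding literature, not necessarily the exact curve plotted in the slide.

```python
import numpy as np

def threshold_in_quiet_db(freq_hz):
    """Terhardt's approximation of the threshold in quiet (dB SPL).

    Pure tones below this level are inaudible; the dip around 2-4 kHz
    is where the ear is most sensitive.
    """
    f = np.asarray(freq_hz, dtype=float) / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

for f in (100, 1000, 3500, 15000):
    print(f, "Hz ->", round(float(threshold_in_quiet_db(f)), 1), "dB SPL")
```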
Threshold of Hearing • A loud sound can mask other sounds at nearby frequencies as shown below
Frequency masking • We can determine how a pure tone at a particular frequency affects our ability to hear tones at nearby frequencies • Then, if a signal can be decomposed into frequencies, for those frequencies that are only partially masked, only the audible part will be used to set the quantization noise thresholds
Critical Bands • Human hearing range divides into critical bands • The human auditory system cannot resolve sounds better than within about one critical band when other sounds are present • Critical bandwidth represents the ear's resolving power for simultaneous tones • At lower frequencies the bands are narrower than at higher frequencies • Each critical band corresponds to a region of the inner ear (the basilar membrane) that responds to a particular range of frequencies
Critical Bands • Generally, the audio frequency range for hearing (20 Hz – 20 kHz) can be partitioned into about 24 critical bands (25 are typically used for coding applications) • The previous slide does not show several of the highest frequency critical bands • The critical band at the highest audible frequency is over 4000 Hz wide • The ear is not very discriminating within a critical band
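One common analytic mapping from frequency to critical-band number is Zwicker's Bark-scale approximation, sketched below. Actual coders use tabulated band edges, so treat this as an illustration of the idea rather than the tables MPEG uses.

```python
import numpy as np

def hz_to_bark(freq_hz):
    """Zwicker's approximation mapping frequency (Hz) to the Bark scale.

    One Bark corresponds to roughly one critical band; 20 Hz - 20 kHz
    spans about 24-25 Bark, matching the ~24 critical bands above.
    """
    f = np.asarray(freq_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

for f in (100, 500, 1000, 4000, 10000, 20000):
    print(f, "Hz ->", round(float(hz_to_bark(f)), 1), "Bark")
```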
Temporal Masking • A loud tone causes the hearing receptors in the inner ear to become saturated, and they require time to recover • This leads to the temporal masking effect • After the loud tone we cannot immediately hear another tone – post-masking • The length of the masking depends on the duration of the masking tone • A masking tone can also block sounds played just before – pre-masking (shorter time)
Temporal Masking • MPEG audio compression takes advantage of both temporal and frequency masking to transmit masked frequency components using fewer bits
MPEG Audio Compression • MPEG (Moving Picture Experts Group) is a family of standards for compression of both audio and video data • MPEG-1 (1991) CD quality audio • MPEG-2 (1994) Multi-channel surround sound • MPEG-4 (1998) Also includes MIDI, speech, etc. • MPEG-7 (2003) Not compression – searching • MPEG-21 (2004) Not compression – digital rights management
MPEG Audio Compression • MPEG-1 defined three downward compatible layers of audio compression • Each layer offers more complexity in the psychoacoustic model used and hence better compression • Increased complexity leads to increased delay • Compatibility is achieved by shared file header information • Layer 1 – used for the Digital Compact Cassette (DCC) • Layer 2 – proposed for digital audio broadcasting • Layer 3 – music (MPEG-1 Layer 3 == mp3)
MPEG Audio Compression • MPEG audio compression relies on quantization, masking, and critical bands • The encoder uses a bank of 32 filters to decompose the signal into sub-bands • Uniform width – not exactly aligned to critical bands • Overlapping • A Fourier transform is used for the psychoacoustic model • Layer 3 adds a modified DCT (MDCT) to the sub-band filtering, so Layers 1 and 2 work in the temporal domain and Layer 3 in the frequency domain
MPEG Audio Compression • PCM input is filtered into 32 bands • The PCM is also FFT-transformed for the psychoacoustic model • Windows of samples (384, 576, or 1152) are coded at a time
MPEG Audio Compression • Since the sub-bands overlap, aliasing may occur • This is overcome by the use of a quadrature mirror filter bank • Attenuation slopes of adjacent bands are mirror images
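To illustrate the "mirror image" property, here is a toy two-band quadrature mirror pair (not MPEG's 32-band polyphase filterbank): the high-pass filter is the low-pass filter with alternating signs, so its magnitude response mirrors the low-pass response about the half-band frequency. The low-pass taps below are a hypothetical example, not coefficients from any standard.

```python
import numpy as np

# Hypothetical symmetric low-pass prototype and its mirrored high-pass partner.
h0 = np.array([0.026, -0.078, 0.267, 0.602, 0.267, -0.078, 0.026])
h1 = h0 * (-1.0) ** np.arange(len(h0))  # flip sign of every other tap

# Sample the magnitude responses on a few frequencies in [0, pi].
w = np.linspace(0, np.pi, 5)
H0 = np.abs([np.sum(h0 * np.exp(-1j * wi * np.arange(len(h0)))) for wi in w])
H1 = np.abs([np.sum(h1 * np.exp(-1j * wi * np.arange(len(h1)))) for wi in w])
print(np.round(H0, 3))        # low-pass: large near 0, small near pi
print(np.round(H1[::-1], 3))  # same values reversed: the mirror image
```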
MPEG Audio Algorithm • The PCM audio data is assembled into frames • Header – sync code of 12 1s • SBS format – describes how many sub-band samples (SBS) are in the frame • The SBS themselves (384 in Layer 1, 1152 in Layers 2 and 3) • Ancillary data – e.g. multi-lingual data or surround-sound data
MPEG Audio Algorithm • The sampling rate determines the frequency range • That range is divided up into 32 overlapping bands • The frames are sent through a corresponding 32-filter filter bank • If X is the number of samples per frame, each filter produces X/32 samples • These are still samples in the temporal domain
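A toy sketch of this 32-band split and 32:1 decimation is shown below. It uses a crude FFT-based band split rather than the actual MPEG polyphase/QMF filterbank, so treat it only as an illustration of the sample counts involved.

```python
import numpy as np

def toy_subband_analysis(frame, num_bands=32):
    """Toy sub-band split: NOT the real MPEG polyphase filterbank.

    Splits a PCM frame into uniform frequency bands via an FFT mask,
    then keeps one output sample per band per num_bands input samples,
    mirroring the X/32 decimation described in the slide.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame)
    edges = np.linspace(0, len(spectrum), num_bands + 1, dtype=int)
    subbands = []
    for b in range(num_bands):
        mask = np.zeros_like(spectrum)
        mask[edges[b]:edges[b + 1]] = spectrum[edges[b]:edges[b + 1]]
        band_signal = np.fft.irfft(mask, n)   # band-limited time signal
        subbands.append(band_signal[::num_bands])  # decimate by 32
    return np.array(subbands)

# Example: a 384-sample Layer 1 frame yields 12 samples in each of 32 bands.
frame = np.random.randn(384)
print(toy_subband_analysis(frame).shape)  # (32, 12)
```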
MPEG Audio Algorithm • The Fourier transform is performed on a window of samples surrounding the samples in the frame (either 1024 or 2*1024 samples) • This feeds into the psychoacoustic model (along with the subband samples) • Analyze tonal and nontonal elements in each band • Determine spreading functions (how much each band affects another)
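A classic analytic form of such a spreading function is Schroeder's, sketched below: it approximates how much a masker raises the threshold in neighbouring critical bands, falling off more slowly toward higher frequencies than toward lower ones. MPEG's psychoacoustic models use their own tabulated variants, so this is only an illustration of the shape.

```python
import numpy as np

def schroeder_spread_db(delta_bark):
    """Schroeder spreading function (dB) vs. masker distance in Bark."""
    dz = np.asarray(delta_bark, dtype=float)
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

# Roughly 0 dB at the masker, steep drop toward lower bands (negative dz),
# gentler drop toward higher bands (positive dz).
for dz in (-3, -1, 0, 1, 3):
    print(dz, "Bark ->", round(float(schroeder_spread_db(dz)), 1), "dB")
```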
MPEG Audio Algorithm • Find the masking threshold and signal-to-mask ratios (SMRs) for each band • The scale factor for each band is the maximum amplitude of the samples in that band • The bit-allocation algorithm takes the SMRs and scale factors and determines how many bits can be allocated (quantization granularity) for each band • In MP3, bits can be moved from band to band as needed to ensure a minimum amount of compression while achieving higher quality
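The sketch below shows the flavor of SMR-driven bit allocation as a simple greedy loop: keep giving an extra bit to whichever band currently has the worst mask-to-noise ratio, assuming each added bit buys roughly 6 dB of SNR. The step size, bit budget, and example SMR values are illustrative assumptions, not the exact MPEG-1 allocation tables.

```python
import numpy as np

def toy_bit_allocation(smr_db, total_bits, step_db=6.0):
    """Toy greedy bit allocation driven by signal-to-mask ratios."""
    smr_db = np.asarray(smr_db, dtype=float)
    bits = np.zeros(len(smr_db), dtype=int)
    for _ in range(total_bits):
        snr_db = bits * step_db          # rough SNR from bits granted so far
        mnr_db = snr_db - smr_db         # mask-to-noise ratio per band
        bits[int(np.argmin(mnr_db))] += 1  # help the worst-off band
    return bits

# Example: bands with high SMR (poorly masked) end up with more bits.
smr = np.linspace(30, -10, 32)  # hypothetical SMRs for 32 bands, in dB
print(toy_bit_allocation(smr, total_bits=96))
```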
MPEG Audio Algorithm • Layer 1 has 12 samples encoded per band per frame • Layer 2 has 3 groups of 12 (36 samples) per band per frame • Layer 3 has non-equal frequency bands • Layer 3 also performs a Modified DCT (MDCT) on the filtered data, so we are in the frequency (not time) domain • Layer 3 does non-uniform quantization followed by Huffman coding • All of these modifications make for better (if more complex) performance for MP3
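The MDCT itself is easy to state directly; the sketch below is the textbook O(N²) definition mapping 2N time samples to N frequency coefficients. A real Layer 3 encoder additionally applies a window, overlaps consecutive blocks by 50%, and uses fast algorithms, so this is only the core transform.

```python
import numpy as np

def mdct(block):
    """Modified DCT: 2N time samples -> N frequency coefficients."""
    two_n = len(block)
    n_half = two_n // 2
    out = np.empty(n_half)
    for k in range(n_half):
        phase = (np.pi / n_half) * (np.arange(two_n) + 0.5 + n_half / 2) * (k + 0.5)
        out[k] = np.sum(block * np.cos(phase))
    return out

# A 36-sample long block (as in Layer 3) yields 18 coefficients per sub-band.
print(mdct(np.random.randn(36)).shape)  # (18,)
```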
Stereo Encoding • MPEG codes stereo data in several different ways • Joint stereo • Intensity stereo • Etc. • We are not discussing these
MPEG File Format • MPEG audio files do not have a global file header (so you can start playing/processing anywhere in the file) • They consist of a sequence of frames • Each frame has a header followed by audio data
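Because every frame begins with the sync pattern of set bits described earlier, a decoder can locate frame boundaries by scanning for it. The sketch below is a toy scanner that only checks the sync bits; a real player would also validate the version, layer, and bitrate fields that follow before trusting a match. The filename in the usage comment is hypothetical.

```python
def find_frame_syncs(data: bytes):
    """Return byte offsets where a 12-bit MPEG audio frame sync appears."""
    offsets = []
    for i in range(len(data) - 1):
        # 0xFF followed by a byte whose top four bits are set = 12 one-bits.
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            offsets.append(i)
    return offsets

# Hypothetical usage:
# with open("song.mp3", "rb") as f:
#     print(find_frame_syncs(f.read())[:5])
```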
MPEG File Format • ID3 is a metadata container most often used in conjunction with the MP3 audio file format • It allows information such as the title, artist, album, track number, year, genre, and other information about the file to be stored in the file itself • ID3v1 tags occupy the last 128 bytes of the file
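Reading an ID3v1 tag is a matter of slicing fixed-width fields out of those last 128 bytes, as sketched below (the filename in the usage comment is hypothetical; later ID3v2 tags live at the start of the file and have a different, variable-length format).

```python
def read_id3v1(path):
    """Read an ID3v1 tag from the last 128 bytes of an MP3 file.

    Layout: 'TAG' marker (3 bytes), title (30), artist (30), album (30),
    year (4), comment (30), genre index (1). Returns None if no tag found.
    """
    with open(path, "rb") as f:
        f.seek(-128, 2)          # 128 bytes from the end of the file
        block = f.read(128)
    if block[:3] != b"TAG":
        return None
    text = lambda b: b.rstrip(b"\x00 ").decode("latin-1", errors="replace")
    return {
        "title":   text(block[3:33]),
        "artist":  text(block[33:63]),
        "album":   text(block[63:93]),
        "year":    text(block[93:97]),
        "comment": text(block[97:127]),
        "genre":   block[127],
    }

# Hypothetical usage: print(read_id3v1("song.mp3"))
```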
Bit Rates • Audio (or video) compression schemes can be characterized as either constant bit rate (CBR) or variable bit rate (VBR) • In general, higher compression can be achieved with VBR (at the cost of added complexity in encoding and decoding) • MPEG-1 Layers 1 and 2 are CBR only • MP3 can be either VBR or CBR • Average Bit Rate (ABR) is a compromise
MPEG-2 AAC • MPEG-2 (which is used for encoding DVDs) has an audio component as well • MPEG-2 AAC (Advanced Audio Coding) standard was aimed at transparent sound reproduction for theatres • 320 kbps for five channels (left, right, center, left-surround and right-surround) • 5.1 channel systems include a low-frequency enhancement channel (“woofer”) • AAC can also deliver high-quality stereo sound at bitrates less than 128 kbps
MPEG-2 AAC • AAC is the default audio format for (e.g.) YouTube, iPod (iTunes), PS3, Nintendo DSi, etc. • Compared to MP3 • More sampling frequencies • More channels • More efficient, simpler filterbank (pure MDCT) • Arbitrary bit rates and variable frame lengths • Etc.
MPEG-4 Audio • MPEG-4 audio integrates a number of audio components into one standard • Speech compression • Text-to-speech • MIDI • MPEG-4 AAC (similar to MPEG-2 AAC) • Alternative coders (perceptual coders and structured coders)