To MP3 and beyond! - or - The story of a compiler test program gone large. James D. (jj) Johnston, Chief Scientist, DTS, Inc.
A bit of Chronology • 1976 – 16 kHz, 2-8 bit ADPCM in analog hardware • 1977 – 32 kHz, 2-12 bit ADPCM in analog hardware • 1978 – 56 kb/s 2-band sub-band coding for commentary (7 kHz bandwidth) in analog hardware • 1979 – 56 kb/s 2-band sub-band coding for commentary, in digital form, using QMFs
What does all that mean? Well, this was my first signal processor: [circuit schematic: an ADC/DAC pair in a feedback loop with an RC integrator] Yes, that's an analog DPCM encoder. Yes. Really!!!
What next? Well, for the next trick, we added an analog divider in front of the ADC and an analog multiplier in front of the DAC. This allowed us to change the step size of the ADC/DAC pair. Since the divider and multiplier had exponential inputs for the "control" voltage, this conveniently allowed us to implement Jayant/Gersho adaptive quantization, i.e.: Delta(t+1) = Delta(t)^b * M(q(t)), where Delta is the step size, b is a "leak" factor that allows the system to converge on startup or after a system error, M() is a set of multipliers indexed by the absolute value of the quantized value, q(t) is the quantized value, and t is the time index. In the hardware, this was implemented as d(t+1) = b*d(t) + m(q(t)), where each quantity (except b) is the log of its linear-domain counterpart.
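A minimal Python sketch of that adaptation rule, wrapped around an illustrative 2-bit quantizer. The leak factor and multiplier table here are made-up values for illustration, not the constants from the 1977 hardware:

```python
import math

# Sketch of Jayant/Gersho adaptive step-size control around a 2-bit
# midrise quantizer. LEAK ("b") and MULT ("M") are illustrative values,
# not the ones used in the actual hardware.
LEAK = 0.98                 # pulls the step size back toward nominal
MULT = {1: 0.85, 2: 1.5}    # M(|q|): shrink on inner codes, grow on outer

def quantize(x, delta):
    """2-bit midrise quantizer: codes in {-2, -1, 1, 2}."""
    q = int(math.floor(x / delta)) + (1 if x >= 0 else 0)
    q = max(-2, min(2, q))
    return q if q != 0 else 1

def adapt(delta, q):
    """Delta(t+1) = Delta(t)^b * M(|q(t)|), done in the linear domain.
    The hardware worked in logs: d(t+1) = b*d(t) + m(q(t))."""
    return (delta ** LEAK) * MULT[abs(q)]
```

The key behavior: outer (large) codes grow the step size, inner (small) codes shrink it, so the quantizer tracks the signal envelope.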
Big boards • Lots of resistors • Lots of things to adjust • And you had to adjust them • Constantly • Did I mention they were a bit touchy?
Then we built sub-band coders using analog filters • Imagine a 4-band SBC, using elliptical filters for each bandpass filter. • Imagine the number of resistors and capacitors on the board. • These devices worked, amazingly enough, and demonstrated that integer band sampling and SBCs were a practical concept.
Even more resistors • Even more adjustments • The Nicolet spectrum analyzer sat next to the hardware. • So did the white-noise generator • Did I mention all the little pots to adjust? • And, the little screwdriver sat there, too. Always. • We put a lanyard on that screwdriver and attached it to the board.
SBCs and rate gain • The first reason to do an SBC was "rate gain": the idea of doing some frequency diagonalization and then coding each band appropriately. • Yes, this worked, but it very quickly became evident that there were more advantages than expected.
56kb/s Commentary coder • It was a 2-band coder, originally using 14kHz sampling, in each of two bands, 4 bit ADPCM in each. • It was implemented in analog hardware. • Imagine a card with a 14 pole-pair (yes, 28th order) bandpass, and a 7 pole-pair lowpass. All elliptic filters, too. • That’s a lot of poles. And a lot of 1% resistors. And 1% isn’t nearly good enough. • It worked. It proved the concept. It required constant adjustment.
Enter the QMF • That's "quadrature mirror filter", by Esteban and Galand • It's digital. • It cancels aliasing. (No, there was no aliasing cancellation in the analog filters; rather, the sampling rates were arranged to provide guard bands.) • There were no "nice" filter design methods for QMFs then. • So we designed some filters, using brute-force design methods. • That's still my most cited paper! That was 1979!
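To make the aliasing-cancellation idea concrete, here is a toy 2-band QMF in Python using the trivial 2-tap (Haar) filter pair, for which reconstruction happens to be exact. Real QMF designs (Esteban/Galand style) use much longer filters:

```python
import math

# Toy 2-band QMF: split a signal into low and high bands at half rate,
# then reconstruct. The 2-tap filter pair h0 = [s, s], h1 = [s, -s]
# (s = 1/sqrt(2)) is the simplest case; the aliasing introduced by the
# 2:1 downsampling cancels exactly in the synthesis stage.
def qmf_analysis(x):
    s = 1.0 / math.sqrt(2.0)
    low  = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def qmf_synthesis(low, high):
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append(s * (l + h))   # even-sample reconstruction
        out.append(s * (l - h))   # odd-sample reconstruction
    return out
```

With longer filters the aliasing still cancels, but a small overall amplitude distortion remains; that is the trade-off the brute-force filter designs optimized.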
1980 – 192kb/s design for 6-band quasi-octave band 15 kHz coder • Not enough memory space to try it • 1981-82 – Hardware realization of 2-band SBC in AT&T DSP2. • Oops, something goes way wrong with high-pass types of signals, even though SNR is great. • First heard, the phrase “upward spread of masking” – Joe Hall
What was going wrong? • Well, it was pretty clear, actually: • When you had a signal with low-frequency energy, the coder worked great. • When you had no low-frequency energy (say below 500Hz), it sounded awful. • Yes, the predictor was adaptive, and was working correctly. • Yes, the SNR was as expected. • Enter “Upward spread of masking”
(time passes, working on other things) • 1984-85 – The Alliant FX-series minicomputers arrive • Hey, now we have memory space! • Somebody needs to test the programming environment. • Atal, Hall, Schroeder 1979 (trying masking model on speech coding, which flops, not just due to a lack of flops) • I’m detailed to “break the compiler” for the Alliant computers.
A digression or two: • Joe Hall, one of the authors, pointed out the asymmetry of tone vs. noise masking • Noise masks well. • Tones mask poorly. • In retrospect, this might have been the most useful bit of information.
The moral of the story? • 15 dB of noise-masking-tone performance is massive overkill; the real number is around 5.5 dB at best. • 30 dB for tone-masking-noise is about the right number. • Music is not like speech. Voiced speech is nearly always about halfway between, needing about 15 dB SNR to sound clean; unvoiced speech needs about 5 dB. • Conveniently for most speech coders, 10 dB is about the difference in LPC gain between voiced and unvoiced speech. • So with either shaped ADPCM or standard ADPCM, you get the same results with and without perceptual considerations. • That's why Atal/Hall/Schroeder didn't show much improvement: the proper adjustment almost always happens naturally.
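That asymmetry fits in a one-line rule of thumb. This is a deliberately toy model using only the offsets quoted on this slide, not a full psychoacoustic model:

```python
def masked_threshold_db(masker_db, tonal):
    """Toy masked-threshold rule using the slide's numbers:
    tone maskers mask poorly (tone-masking-noise needs ~30 dB offset),
    noise maskers mask well (noise-masking-tone needs only ~5.5 dB)."""
    return masker_db - (30.0 if tonal else 5.5)
```

So an 80 dB noise masker hides noise up to 74.5 dB, while an 80 dB tone only hides noise up to 50 dB: a 24.5 dB swing depending on tonality alone.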
Back to the test program • 1984-85 – I wrote a series of test programs • Perceptual Noise Insertion – inserting noise on a Bark-by-Bark basis according to a masking threshold • Perceptual Entropy – measuring the amount of information in a signal quantized at the same masking threshold • (finally) Perceptual transform coding (PXFM), actually coding 32 kHz sampled signals at 128 kb/s. • This used an overlapped FFT filterbank, not an MDCT • It did long, long, LONG (2048 shift) blocks. Yes, there was a bit of pre-echo. Quite a bit. Lots, in fact. • Nonetheless, at 128 kb/s it was a lot better on anything but special signals than previous attempts, including the 192 kb/s SBC.
Where did we get the data? • We used 4 clips, taken from LP, transferred to a good cassette, and then put through a 12 bit floating-point ADC (Three Rivers) that was state-of-the-art at the time. • That works out to about 2 hours of fiddling about for each 10 seconds of sound. • Hence, the small test suite. • Later, when CDs arrived, test data was much easier to get.
1986 – First (informal) listening tests. • 1986 – Compiler (F8x) working like a champ. • 1987-88 – Worked on video. Audio work not released for publication (no patent budget). • 1988 – Publication finally allowed. • Perceptual Entropy talk at ICASSP • Next booth over was Karlheinz Brandenburg’s paper on OCF • We could have traded papers and gone on like we wrote each other’s paper. • This was the “birth” of MP3 in a very real way.
The Test Program • It was written in something called "Fortran 8x" • The good news, you could do things like: • real x(1024),y(1024) • y=x*x • sfm=(sum(x)/1024)/exp(sum(log(x))/1024) • The bad news: • No malloc, no pointers • Consider doing a Huffman codebook. Better yet, don't.
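For readers who don't speak Fortran 8x, a rough Python equivalent of that last line: the spectral flatness measure, i.e. the ratio of the arithmetic mean to the geometric mean of a (power) spectrum:

```python
import math

def sfm(x):
    """Spectral flatness measure: arithmetic mean / geometric mean.
    Equals 1.0 for a perfectly flat (noise-like) spectrum and grows
    large for peaky (tonal) spectra. Assumes all entries are > 0."""
    n = len(x)
    arith = sum(x) / n
    geo = math.exp(sum(math.log(v) for v in x) / n)
    return arith / geo
```

The whole-array expressions in Fortran 8x (y=x*x, sum(), log() over arrays) made this a one-liner, which is exactly the kind of thing a compiler test program wants to exercise.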
Anyway, that’s how perceptual coding got started. What’s the big deal? • Well, the answer is another demo: • I’m going to play 3 tracks, one each of • The original • The original, with perceptually added noise at 13.6dB SNR • The original, with sample-modulated white noise at 13.6dB SNR • I’ll bet you can pick one of those out, even over a bad PA system! • You can, just barely, pick out the other in a quiet office in headphones.
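A sketch of how the unshaped condition in that demo could be produced: scale white noise so the result sits at a chosen SNR (13.6 dB here). The perceptual version would instead shape the same total noise power to stay under a masking threshold; only the flat-noise case is shown:

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add uniform white noise scaled to hit a target SNR in dB."""
    rng = random.Random(seed)
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = p_sig / (10 ** (snr_db / 10))  # SNR = 10*log10(Psig/Pnoise)
    g = math.sqrt(3 * p_noise)               # uniform on [-g, g] has power g^2/3
    return [s + rng.uniform(-g, g) for s in signal]
```

At 13.6 dB the flat noise is plainly audible, which is the point of the demo: the same noise power, perceptually shaped, all but disappears.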
Ok, we have this great technology • At long last, 4 years late, we get to actually write the paper. • That makes no money • So? • (Management) I know, let's make it a STANDARD! • I/O, I/O, it's off to disc I go. • No, wait, it's off to MPEG-1.
Standards, or the FUN STUFF * *Warning, sarcasm included.
MPEG-1 • We have 16 proponents. • So, into 4 groups they go. • “SBCs with lots of bands” • “SBCs” • “Transform Codecs” • “Something else I’ve forgotten” • Now, you have a very short period of time to build hardware that combines all 4 of the group members’ ideas. Or else out you go.
Our Result? ASPEC • Audio Spectro-Perceptual Entropy Coding • Yeah, that’s a whopper. It’s also after the fact, ASPEC was what one partner demanded. • What is ASPEC? • PXFM perceptual model, more or less • OCF filterbank (MDCT, good choice) • Block switching from 3rd contributor • “That looks like a good idea” from the 4th.
Hardware? Remember? • Well, they changed the interface rules a few weeks before the test. • Some of the interfaces did not work very well, ours included. • In particular we introduced horrible jitter into the audio signal.
The TEST • Well, there were several parts to the test. • Audio quality vs. bit rate • Complexity • To make a long story short, ASPEC, including horrid jitter, was clearly the best of the audio quality bunch. • But it was “too complicated” • According to their “estimated hardware” an ASPEC encoder (simpler than an AAC encoder for comparison) would take up a bit over 1 square INCH of silicon. • Yes, that was very, very silly. • Hence, it was “too complicated” and rejected outright.
On to MPEG-1 Standards • This led to a 2-part MPEG-1 Audio standard, Layers 1 and 2 • Ok, time to play standards-wars. • Countries filed objections • Words were spoken in haste • Words were spoken in anger • Beer mugs were waved about • Lawyers were consulted
Hey, let’s have a “Layer 3 and 4” • But wait, that has to use the filterbank in Layers 1 and 2. • Ok, that just won’t work. Not enough signal-processing gain. • So, use a “HYBRID” filter bank • Work goes forward. • Layer 3 was born • There were several Layer 3’s and 4’s. The final Layer 3 was always the highest numbered.
An aside • There is a mathematical way to describe a hybrid (i.e. multistage) filterbank • It has a longer impulse response than it needs to have • It has worse frequency rejection than it ought to have • It takes more flops than doing it “right”. • The hybrid bank, of course, was not “too complicated” compared to the MDCT.
Anyhow, on to Layer 3 • And there you have it, after much more shuffling, jumps to the left, steps to the right, and the continual involvement of loud voices, harsh words, angry gestures, and waved beer mugs, not to mention papers from lawyers and standards bodies. • There's also a debacle involving joint stereo coding, but we'll leave that for another day, or not.
So, what’s the deal with AAC? • Well, while MP3 was finishing • A different group, consisting almost exclusively of Layer 1 and 2 people, who were not busied out writing Layer 3, decided: • MPEG-2 audio had to be backward compatible • That means that one sends a downmix (LxRx) in the original MPEG-1 standard • One then sends difference channels, effectively, in the “auxiliary” data. • This only costs an extra 40kb/s, according to proponents.
This led JJ to a realization • Either the math that said 40kb/s was right - or - • Linear algebra works • So, how many people HAVE used MPEG2 BC codecs?
What did AT&T do? • I dropped out of the standards process. • Aníbal Ferreira and I worked on a new codec, using a non-hybrid filterbank, a stereo coding algorithm that made sense, and a bitstream that didn't build in assumptions about audio signal spectra. • This was called "PAC", for "Perceptual Audio Coder"
Whence MPEG? • Well, along those lines, some folks in MPEG insisted on a test • We will provide backward compatible (BC) codecs • We'll let in ONE non-BC codec • Well, they had to let in 2 of them. • The NBC (non backward compatible) codecs trounced the BC codecs. • In that test result, PAC was first, a US competitor's a somewhat distant second, but a firm, clear second at that, and all of the BC codecs, at higher rates, even, dead, solid, last. • Linear algebra does seem to work.
The NBC project • So, the extremely competitive groups who didn’t get along that came in 1st and 2nd on the test were put together with the group responsible for 3rd place, and told that they might be allowed to propose something new, called (working title, obviously) the NBC Codec. • Yes, that’s about as chaotic as it sounds. • And then we added one more partner.
Some Events • PAC wins the “NBC” startup test • Independent (non-BC) ASPEC wins the “reference model” (MP3 being ASPEC with the hybrid filter bank) • Nearly all of the features of PAC then replace nearly all of the features of ASPEC • And then it was all renamed AAC.
THE END? • The morals of the story: • You never know what is going to happen when you try to break the compiler for a newly delivered computer. • As Allan Sherman said in "Peter and the Commissar": "I really hate to say it, 'cause it truly isn't pretty, but a Camel was a Horse that was made by a committee."