Audio Fingerprinting Wes Hatch MUMT-614 Mar.13, 2003
What is Audio Fingerprinting? • a small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came
Applications • Broadcast monitoring • playlist generation • royalty collection • ad verification • Connected Audio • general term for consumer applications • Other • Napster: use of fingerprinting systems to prohibit the transmission of copyrighted materials • Finding desired content efficiently in “an overwhelming amount of audio material”
“Benefits” • Automated search for illegal content on the Internet • examines the real audio information rather than just tag information • For the consumer • makes the meta-data of songs in a library consistent, allowing for easy organization • can guarantee that what is downloaded is actually what it says it is • will allow consumers to record signatures of sound and music on small handheld devices
Two principal components • Compute the fingerprint • Compare it to a database of previously computed fingerprints • A text example: “…in a box. I will not eat them with a fox. I…”
Details to worry about • Robustness (to noise, distortion) • Reliability • Fingerprint size (reduced dimensionality) • Granularity • Search speed and scalability • Computational efficiency • Resulting features must be informative about the audio content • Semantic or non-semantic features? • Hash table or vector representation?
Computing the fingerprint • Compare to hash functions…? • compare computed hash value with that stored in a database • Drawback • need to worry about perceptual similarity and not mathematical similarity • PCM audio vs. MP3: both sound alike but mathematically (i.e. in spectral content) are quite different • perceptual similarity is not transitive • so it is not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
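The drawback above can be illustrated with a minimal sketch. It assumes 16-bit PCM samples and uses SHA-256 as a stand-in for "a hash function"; the sample values and the helper names `pcm_bytes` and `hamming_hex` are invented for illustration and do not come from the slides.

```python
import hashlib
import struct

def pcm_bytes(samples):
    """Pack integer samples as 16-bit little-endian PCM."""
    return struct.pack("<%dh" % len(samples), *samples)

def hamming_hex(a, b):
    """Number of differing bits between two equal-length hex digests."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")

# A short synthetic waveform (assumed values, purely illustrative).
original = [0, 1000, 2000, 1000, 0, -1000, -2000, -1000] * 4

# Change one sample by a single quantisation step: the two signals are
# perceptually the same, but no longer bit-identical.
perturbed = list(original)
perturbed[5] += 1

h_orig = hashlib.sha256(pcm_bytes(original)).hexdigest()
h_pert = hashlib.sha256(pcm_bytes(perturbed)).hexdigest()

print("hashes equal:", h_orig == h_pert)
print("differing bits:", hamming_hex(h_orig, h_pert), "of 256")
```

A conventional hash gives unrelated values for the two inputs, so an exact hash match cannot serve as a test for perceptual similarity.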
Techniques (general) • Any ‘x’ number of seconds may be used to compute the fingerprint • Audio gets separated into frames • Features computed for each frame: • Fourier coefficients • MFCC, LPC • Spectral flatness • sharpness • “features mapped into a more compact representation by using …HMM, or quantization”
Techniques (Haitsma, Kalker) • one 32-bit sub-fingerprint every 11.6 ms • A block consists of 256 sub-fingerprints • Corresponds to a granularity of only 3 seconds • Large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time • worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
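The timing figures on this slide follow from one another. The short sketch below reproduces the arithmetic, taking the 11.6 ms spacing, the 31/32 overlap and the 256-sub-fingerprint block as given; the frame length is derived here rather than stated on the slide, and the variable names are illustrative.

```python
HOP_S = 0.0116          # one sub-fingerprint every 11.6 ms (from the slide)
OVERLAP = 31 / 32       # fraction of each frame shared with the next (from the slide)
SUBFP_PER_BLOCK = 256   # sub-fingerprints per block (from the slide)

# If consecutive frames overlap by 31/32, the hop is 1/32 of the frame length.
frame_s = HOP_S / (1 - OVERLAP)

# A block covers roughly 256 hops, which is the "3 seconds" granularity.
block_s = HOP_S * SUBFP_PER_BLOCK

# A query can start at most half a hop away from the nearest database frame.
worst_offset_s = HOP_S / 2

print("frame length   ~ %.3f s" % frame_s)
print("block duration ~ %.2f s" % block_s)
print("worst offset   = %.1f ms" % (worst_offset_s * 1000))
```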
Techniques (Haitsma, Kalker) • Data from each frame is sent through a filterbank • 33 filters, logarithmically spaced (to correspond roughly to the Bark scale) • between 300 and 2000 Hz • phase is neglected (perceptual reasons)
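The slides do not say how 33 filter outputs become a 32-bit sub-fingerprint. The sketch below assumes the rule described in the Haitsma and Kalker paper: each bit is the sign of the energy difference between neighbouring bands, differenced again against the previous frame. The band energies are assumed to be already computed, and the function names `band_edges` and `sub_fingerprint` are hypothetical.

```python
def band_edges(lo=300.0, hi=2000.0, n_bands=33):
    """Edges of n_bands logarithmically spaced bands between lo and hi (Hz)."""
    return [lo * (hi / lo) ** (i / n_bands) for i in range(n_bands + 1)]

def sub_fingerprint(prev_energies, cur_energies):
    """Derive a 32-bit integer from two consecutive frames of 33 band energies.

    Bit m is 1 when the difference between bands m and m+1 in the current
    frame exceeds the same difference in the previous frame (assumed rule).
    """
    assert len(prev_energies) == len(cur_energies) == 33
    bits = 0
    for m in range(32):
        diff = ((cur_energies[m] - cur_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits

# Synthetic energies (illustrative only): a flat frame followed by a frame
# whose energy falls steadily with band index, so every bit is set.
flat = [0.0] * 33
falling = [float(33 - m) for m in range(33)]
print(hex(sub_fingerprint(flat, falling)))
```

Because only the signs of energy differences are kept, the bits are unaffected by overall gain changes, which is one plausible reason such a representation tolerates distortion.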
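Techniques (Burges, Platt) • audio is downsampled to 11.025 kHz and split into frames that overlap by a factor of 2 (each frame shares half its samples with the next) • a modulated complex lapped transform (MCLT) is then applied to each frame. A 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient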
Techniques (Burges, Platt) • Use prior knowledge to define form of the feature extractor • Features computed by a “linear, convolutional” neural network • convert signal into a feature vector • uses Principal Component Analysis (PCA) to find a set of projections • generates a vector of 128 values for every 11.6 ms interval • dimensional-reduction method (i.e. lots of math)
Techniques (Burges, Platt) • 3 layers of Oriented PCA (OPCA) • operates on a frame of 128 values • layer 1: generates 10 values for each frame • layer 2: takes 42 ‘layer 1 outputs’ and produces 20 values • layer 3: takes 40 ‘layer 2 outputs’ and produces 64 values (11K inputs --> 64 outputs)
Searching the Database • Look for the most similar (not necessarily exact) fingerprint • 10,000 five-minute songs ≈ 250 million sub-fingerprints • brute force computes the bit-error rate for every possible position in the database • brute force takes in excess of 20 minutes on a very fast PC
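A minimal sketch of the brute-force search described above, assuming sub-fingerprints are stored as 32-bit integers in one flat list. The function names and the toy database are illustrative; the first lines only check the slide's database-size estimate.

```python
import random

# Size estimate: 10,000 songs x 5 minutes, one sub-fingerprint per 11.6 ms.
total_sub_fps = round(10_000 * 5 * 60 / 0.0116)
print("sub-fingerprints in database ~", total_sub_fps)  # on the order of 250 million

def bit_errors(a, b):
    """Number of differing bits between two 32-bit sub-fingerprints."""
    return bin(a ^ b).count("1")

def brute_force_search(db, query):
    """Slide the query over every position in db; return (position, BER) of the best match."""
    n = len(query)
    best_pos, best_ber = -1, 1.0
    for pos in range(len(db) - n + 1):
        errors = sum(bit_errors(db[pos + i], query[i]) for i in range(n))
        ber = errors / (32 * n)
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber

# Toy database of random sub-fingerprints (assumed data, not real audio).
rng = random.Random(0)
db = [rng.getrandbits(32) for _ in range(2000)]

# Query: 16 consecutive sub-fingerprints with three single-bit errors added.
query = db[700:716]
for i in (2, 7, 11):
    query[i] ^= 1 << 5

print(brute_force_search(db, query))
```

The cost grows with the product of database length and query length, which is why this approach does not scale to hundreds of millions of sub-fingerprints.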
Searching the Database • make the assumption that at least 1 (of the 256) sub-fingerprints is error-free • then, use a hash table (as opposed to a more memory-intensive look-up table) • 800,000 times faster
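The hash-table idea can be sketched as follows, under the stated assumption that at least one sub-fingerprint in the query is error-free. Each query value is looked up directly, and only the resulting candidate positions are checked with the bit-error rate. The function names are hypothetical, and the 0.35 acceptance threshold is an assumption taken from the Haitsma and Kalker paper, not from the slides.

```python
import random
from collections import defaultdict

def build_index(db):
    """Map each 32-bit sub-fingerprint value to the positions where it occurs."""
    index = defaultdict(list)
    for pos, sub_fp in enumerate(db):
        index[sub_fp].append(pos)
    return index

def indexed_search(db, index, query, threshold=0.35):
    """Return (position, BER) of the best candidate below threshold, or None."""
    n = len(query)
    best = None
    for i, sub_fp in enumerate(query):
        for hit in index.get(sub_fp, ()):
            start = hit - i  # align the query so that query[i] sits on the hit
            if start < 0 or start + n > len(db):
                continue
            errors = sum(bin(db[start + j] ^ query[j]).count("1") for j in range(n))
            ber = errors / (32 * n)
            if ber < threshold and (best is None or ber < best[1]):
                best = (start, ber)
    return best

# Toy database of random sub-fingerprints (assumed data, not real audio).
rng = random.Random(0)
db = [rng.getrandbits(32) for _ in range(2000)]
index = build_index(db)

# Query with errors in some sub-fingerprints, but at least one left intact.
query = db[700:716]
for i in (2, 7, 11):
    query[i] ^= 1 << 5

print(indexed_search(db, index, query))
```

If every sub-fingerprint in the query contains an error, no lookup hits the right position and the search returns nothing, so the speed-up is bought with that assumption.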
Results • false-positive rate of 3.6 × 10⁻²⁰ (Haitsma, Kalker) • “low” false-positive and false-negative rates on tests with a large (500,000) set of input traces (Burges, Platt) • didn’t test on time compression, expansion • can withstand distortions occurring from transmission over mobile phones