
Audio Fingerprinting

Audio Fingerprinting. Wes Hatch, MUMT-614, Mar. 13, 2003. What is audio fingerprinting? A small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came. Applications: broadcast monitoring.


Presentation Transcript


  1. Audio Fingerprinting • Wes Hatch • MUMT-614 • Mar. 13, 2003

  2. What is Audio Fingerprinting? • a small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came

  3. Applications • Broadcast monitoring • playlist generation • royalty collection • ad verification • Connected Audio • general term for consumer applications • Other • Napster: use of fingerprinting systems to prohibit the transmission of copyrighted materials • Finding desired content efficiently in “an overwhelming amount of audio material”

  4. “Benefits” • Automated search for illegal content on the Internet • examines the real audio information rather than just tag information • For the consumer • makes the meta-data of songs in a library consistent, allowing for easy organization • can guarantee that what is downloaded is actually what it says it is • will allow consumers to record signatures of sound and music on small handheld devices

  5. Two principal components • Compute the fingerprint • Compare it to a database of previously computed fingerprints • A text example: “…in a box. I will not eat them with a fox. I…”

  6. Details to worry about • Robustness (to noise, distortion) • Reliability • Fingerprint size (reduced dimensionality) • Granularity • Search speed and scalability • Computational efficiency • Resulting features must be informative about the audio content • Semantic or non-semantic features? • Hash table or vector representation?

  7. Computing the fingerprint • Compare to hash functions…? • compare the computed hash value with the one stored in a database • Drawback • need to worry about perceptual similarity, not mathematical similarity • PCM audio vs. MP3: both sound alike but mathematically (i.e. in spectral content) are quite different • perceptual similarity is not transitive • so it is not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
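
The drawback on slide 7 can be illustrated with a minimal sketch. It assumes a made-up "signal" of 16-bit PCM samples and uses SHA-256 as a stand-in for any conventional hash function (the slides do not name one). Changing a single sample by one quantization step, which would be inaudible, yields an unrelated digest, so exact hash comparison cannot capture perceptual similarity.

```python
import hashlib
import struct

def pcm_digest(samples):
    """Return the SHA-256 hex digest of a list of signed 16-bit PCM samples."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return hashlib.sha256(raw).hexdigest()

def hamming_bits(hex_a, hex_b):
    """Count the differing bits between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Hypothetical signal: a repeating ramp standing in for real audio.
original = [i * 37 % 2000 - 1000 for i in range(1024)]

# Perceptually identical copy: one sample changed by one quantization step.
perturbed = list(original)
perturbed[500] += 1

d1 = pcm_digest(original)
d2 = pcm_digest(perturbed)
differing = hamming_bits(d1, d2)

print("digests equal:", d1 == d2)
print("differing bits out of 256:", differing)
```

A fingerprint, by contrast, should change only slightly under such a perturbation, which is why the later slides compare fingerprints by bit-error rate rather than by equality.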

  8. Techniques (general) • Any ‘x’ number of seconds may be used to compute the fingerprint • Audio gets separated into frames • Features computed for each frame: • Fourier coefficients • MFCC, LPC • Spectral flatness • sharpness • “features mapped into a more compact representation by using …HMM, or quantization”
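
The framing and per-frame feature steps on slide 8 can be sketched as follows, using spectral flatness as the example feature. This is an illustrative sketch only: the frame length, hop size, test signals, and the naive DFT are assumptions chosen to keep the example dependency-free, not values from the slides.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) via a naive DFT; fine for short frames."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum

def spectral_flatness(power, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0..1)."""
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)

# Hypothetical test signals: a pure tone and a single impulse.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(128)]
impulse = [1.0] + [0.0] * 63

tone_frames = frames(tone, frame_len=64, hop=32)   # 50% overlap (assumed)
tone_flatness = spectral_flatness(power_spectrum(tone_frames[0]))
impulse_flatness = spectral_flatness(power_spectrum(impulse))

print(len(tone_frames), tone_flatness, impulse_flatness)
```

A tonal frame gives a flatness near 0 and a noise-like or impulsive frame gives a value near 1, which is the kind of compact, content-dependent number a fingerprint can be built from.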

  9. Techniques (Haitsma, Kalker) • one 32-bit sub-fingerprint every 11.6 ms • A block consists of 256 sub-fingerprints • Corresponds to a granularity of only 3 seconds • Large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time • worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
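
The numbers on slide 9 are related by simple arithmetic, worked through below. The three constants come from the slide; the frame length is derived here from the hop and the overlap and is not stated in the slides, so treat it as an inference.

```python
SUB_FP_INTERVAL_S = 0.0116  # one sub-fingerprint every 11.6 ms (slide 9)
SUB_FPS_PER_BLOCK = 256     # sub-fingerprints per block (slide 9)
OVERLAP = 31 / 32           # overlap between consecutive frames (slide 9)

# Granularity: audio duration covered by one fingerprint block.
block_duration_s = SUB_FPS_PER_BLOCK * SUB_FP_INTERVAL_S

# With 31/32 overlap, each frame advances by 1/32 of its length,
# so the frame length is 32 times the sub-fingerprint interval.
frame_length_s = SUB_FP_INTERVAL_S / (1 - OVERLAP)

# Worst case: query frame boundaries fall halfway between database boundaries.
worst_case_offset_s = SUB_FP_INTERVAL_S / 2

print(f"block duration:    {block_duration_s:.2f} s")
print(f"frame length:      {frame_length_s * 1000:.1f} ms")
print(f"worst-case offset: {worst_case_offset_s * 1000:.1f} ms")
```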

  10. Techniques (Haitsma, Kalker) • Data from each frame is sent through a filterbank • 33 filters, logarithmically spaced (to correspond roughly to the Bark scale) • between 300 and 2000 Hz • phase is neglected (for perceptual reasons)
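
A minimal sketch of the filterbank layout on slide 10 and of how 33 band energies can yield a 32-bit sub-fingerprint. The slides do not say how the bits are derived; the sketch assumes the sign-of-energy-difference rule (across neighbouring bands and consecutive frames) associated with the Haitsma and Kalker scheme. The function names and the energy values are hypothetical.

```python
def band_edges(f_lo=300.0, f_hi=2000.0, n_bands=33):
    """Edges of n_bands logarithmically spaced bands between f_lo and f_hi."""
    ratio = f_hi / f_lo
    return [f_lo * ratio ** (i / n_bands) for i in range(n_bands + 1)]

def sub_fingerprint(prev_energies, cur_energies):
    """Pack one bit per adjacent band pair into an integer (32 bits for 33 bands).

    Assumed rule: a bit is 1 when the energy difference between bands m and
    m+1 increased relative to the previous frame, otherwise 0.
    """
    bits = 0
    for m in range(len(cur_energies) - 1):
        diff = ((cur_energies[m] - cur_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits

edges = band_edges()

# Hypothetical band energies for two consecutive frames.
prev = [1.0] * 33
cur = [float(i % 2) for i in range(33)]
fp = sub_fingerprint(prev, cur)

print(len(edges) - 1, "bands from", edges[0], "to", round(edges[-1], 1), "Hz")
print(f"sub-fingerprint: {fp:032b}")
```

Because only the sign of each difference is kept, the bits are insensitive to overall level changes, and phase never enters the computation.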

  11. System overview

  12. Techniques (Burges, Platt) • downsampled to 11.025 kHz, split into frames with overlap of 2 • MCLT is then applied to each frame. A 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient

  13. Techniques (Burges, Platt) • Use prior knowledge to define the form of the feature extractor • Features computed by a “linear, convolutional” neural network • converts the signal into a feature vector • uses Principal Component Analysis (PCA) to find a set of projections • generates a vector of 128 values for every 11.6 ms interval • dimensional-reduction method (i.e. lots of math)

  14. Techniques (Burges, Platt) • 3 layers of Oriented PCA (OPCA) • operates on a frame of 128 values • layer 1: generates 10 values for each frame • layer 2: takes 42 ‘layer 1 outputs’ and produces 20 values • layer 3: takes 40 ‘layer 2 outputs’ and produces 64 values (11K inputs --> 64 outputs)

  15. Searching the Database • Look for the most similar (not necessarily exact) fingerprint • 10,000 5-min. songs → 250 million sub-fingerprints • brute force takes in excess of 20 minutes on a very fast PC • brute force computes bit-error rate for every possible position in the database
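
The brute-force search on slide 15 can be sketched as below: slide a 256-sub-fingerprint query over every database position and keep the position with the lowest bit-error rate. The database here is random data and the distortion is simulated by flipping bits, both purely for illustration; the function names are hypothetical.

```python
import random

BITS_PER_SUB_FP = 32

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equal-length blocks."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (BITS_PER_SUB_FP * len(block_a))

def brute_force_search(database, query):
    """Compute the bit-error rate at every position; return the best match."""
    best_pos, best_ber = -1, 1.0
    for pos in range(len(database) - len(query) + 1):
        ber = bit_error_rate(database[pos:pos + len(query)], query)
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber

# Database size implied by the slide: 10,000 five-minute songs at 11.6 ms each.
n_sub_fps = round(10_000 * 5 * 60 / 0.0116)  # roughly 250 million

# Toy stand-in database of random 32-bit sub-fingerprints.
rng = random.Random(0)
database = [rng.getrandbits(32) for _ in range(2000)]

# Query: a block from position 700 with one bit flipped in every 4th entry.
true_pos = 700
query = list(database[true_pos:true_pos + 256])
for i in range(0, 256, 4):
    query[i] ^= 1 << (i % 32)

pos, ber = brute_force_search(database, query)
print(n_sub_fps, pos, ber)
```

Every candidate position costs a full 8192-bit comparison, which is why this approach becomes impractical at hundreds of millions of positions.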

  16. Searching the Database • make the assumption that at least 1 (of the 256) sub-fingerprints is error-free • then, use a hash table (as opposed to a more memory-intensive look-up table) • 800,000 times faster
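
A minimal sketch of the hash-table search on slide 16, assuming at least one query sub-fingerprint is error-free. Each sub-fingerprint value maps to the positions where it occurs, so only those candidate alignments are checked by bit-error rate. The acceptance threshold of 0.35, the function names, and the random data are assumptions for illustration and do not come from the slides.

```python
import random
from collections import defaultdict

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two blocks of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))

def build_index(database):
    """Map each sub-fingerprint value to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, sub_fp in enumerate(database):
        index[sub_fp].append(pos)
    return index

def indexed_search(database, index, query, threshold=0.35):
    """Check only alignments suggested by exact sub-fingerprint hits."""
    n = len(query)
    for offset, sub_fp in enumerate(query):
        for hit in index.get(sub_fp, ()):
            start = hit - offset
            if start < 0 or start + n > len(database):
                continue
            ber = bit_error_rate(database[start:start + n], query)
            if ber < threshold:  # assumed acceptance threshold
                return start, ber
    return None

# Toy database and a distorted query taken from position 700.
rng = random.Random(0)
database = [rng.getrandbits(32) for _ in range(2000)]
index = build_index(database)

query = list(database[700:700 + 256])
for i in range(0, 256, 4):       # corrupt every 4th sub-fingerprint
    query[i] ^= 1 << (i % 32)

result = indexed_search(database, index, query)
print(result)
```

If every sub-fingerprint in the query were corrupted, this search would return `None`, which is the price of the error-free assumption.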

  17. Results • false-positive rate of 3.6 × 10^-2 (Haitsma, Kalker) • On tests with a large (500,000) set of input traces • has a “low” false-positive and false-negative rate (Burges, Platt) • didn’t test on time compression or expansion • can withstand distortions occurring from transmission over mobile phones
