A Novel Approach for Recognizing Auditory Events & Scenes Ashish Kapoor
Problem Description • How can we represent arbitrary environments, so that we can: • Label scene elements • Classify environments • Synthesize environmental sounds • Example: Coffee Shop • Basic spectral texture • Glasses clinking, doors opening, etc.
Outline of Our Approach • Create a palette of sounds • Epitomes (Jojic et al.) for audio • Given an audio segment, generate distributions over the palette • Use the distributions for classification, detection, etc.
Representation: Palette of Sounds • [Diagram labels: World, Palette, Features to Represent, Input Audio]
Epitomes for Images • Epitome model: Jojic, Frey, and Kannan, ICCV 2003 • Originally developed for images
Epitomes for Audio • Audio is a 1-D signal • 2-D representation, but little vertical self-similarity • Lots of redundancy (silence, repeated background) • Much longer inputs, so a bigger ratio of input size to epitome size • Hours of data => a 10-30 second epitome (see the sketch below)
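The bullets above motivate the audio adaptation; as a concrete illustration, here is a minimal sketch (not the authors' code) of turning a 1-D signal into the 2-D time-frequency patches an audio epitome would be trained on, assuming a log-spectrogram front end with full-height patches. The frame length, hop, and patch width are illustrative values, not ones from the talk.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Short-time log-magnitude spectrum: one column per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.log(np.abs(np.fft.rfft(frames, axis=0)) + 1e-8)

def extract_patches(spec, patch_width=8):
    """Full-height patches (all frequency bins) sliding along time,
    since the 2-D representation has little vertical self-similarity."""
    return [spec[:, t : t + patch_width]
            for t in range(spec.shape[1] - patch_width + 1)]

# Example: 2 seconds of noise at 16 kHz stands in for real audio.
audio = np.random.randn(32000)
patches = extract_patches(log_spectrogram(audio))
```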
Informative Sampling of Patches • Original epitome: take patches at random • Our approach: try to maximize coverage • Reduce the sampling likelihood of patches similar to those we have already covered (see the sketch below) • [Figure: probability of patch selection f as a function of time t, with an already-selected patch at t*]
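A minimal sketch of the informative-sampling idea: patches similar to ones already chosen receive a lower selection probability, pushing the sample toward coverage of the whole input. The squared-distance similarity and the temperature parameter are illustrative assumptions, not the exact criterion from the talk.

```python
import numpy as np

def informative_sample(patches, n_select, temperature=1.0, rng=None):
    """Greedily pick patches, down-weighting those close to ones already covered."""
    if rng is None:
        rng = np.random.default_rng(0)
    flat = np.stack([p.ravel() for p in patches])            # (N, D)
    chosen = [int(rng.integers(len(flat)))]                  # seed with a random patch
    for _ in range(n_select - 1):
        # distance of every patch to its nearest already-chosen patch
        d = np.min(((flat[:, None, :] - flat[None, chosen, :]) ** 2).sum(-1), axis=1)
        p = np.exp(d / (temperature * d.max() + 1e-12))      # well-covered => low probability
        p[chosen] = 0.0                                      # never re-pick the same patch
        p /= p.sum()
        chosen.append(int(rng.choice(len(flat), p=p)))
    return [patches[i] for i in chosen]
```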
Examples: Toy Sequence • 600-frame (10 sec) epitome from 3,700 frames (2 min) • [Figure panels: Random Sampling vs. Informative Sampling]
Random vs. Informative Sampling • Simulation on the toy dataset • 2-second epitome • Likelihood vs. number of patches • Averaged over 10 runs
Examples: Outdoor Sequence • 1,800-frame (1 min) epitome from 15,000 frames (8 min)
Classification of Events/Scenes • [Figure panels: Cafe, Highway] • Look at distributions over the epitome • Given an audio segment to classify: • For all the patches in the audio, recover the transformations given the epitome • Look at the distribution of the transformations to classify (see the sketch below) • Class-conditional distributions over transformations, e.g. P(T|e,c=1) for speech and P(T|e,c=2) for cars; classify by finding the c' that best explains the segment under P(T|e,c=c')
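A minimal sketch of this classification step under simplifying assumptions: each patch of the test segment is mapped to its best-matching epitome position (standing in for the transformation T), and the histogram of positions is scored against class-conditional distributions P(T|e,c). Nearest-position matching and the multinomial log-likelihood score are illustrative choices, not the talk's exact inference.

```python
import numpy as np

def patch_to_position(patch, epitome):
    """Index of the epitome time-position whose window best matches the patch (SSD)."""
    w = patch.shape[1]
    errs = [((epitome[:, t:t + w] - patch) ** 2).sum()
            for t in range(epitome.shape[1] - w + 1)]
    return int(np.argmin(errs))

def classify(patches, epitome, class_dists):
    """class_dists: {class_name: P(T | e, c)}, each an array over epitome positions."""
    counts = np.zeros(epitome.shape[1])
    for patch in patches:
        counts[patch_to_position(patch, epitome)] += 1
    scores = {c: np.sum(counts * np.log(dist + 1e-12))    # multinomial log-likelihood
              for c, dist in class_dists.items()}
    return max(scores, key=scores.get)
```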
Experiments • 3 different environments: Highway, Kitchen, Outdoor Parking • 6 minutes of data to train a 30-second epitome • 4 events to detect (manually segmented): • Speech (22 examples) • Car (17 examples) • Utensil: knife chopping vegetables (29 examples) • Bird chirp (24 examples) • None of the above (30 examples)
[Figure panels: Car, Speech, Knife/Utensil, Chirp]
Detection Example • Speech Detection (hard case) • Very noisy environment (148th Ave) • Only 5 labeled examples of speech
Performance Comparison • Mixture of Gaussians (see the sketch below) • For each audio segment to classify: classify every frame using the mixture, then vote among the results • Nearest Neighbor • Same voting method as for the mixture of Gaussians • Computationally too expensive!
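A minimal sketch of the mixture-of-Gaussians baseline described above, with assumed details (scikit-learn's GaussianMixture, diagonal covariances, one mixture per class): every frame is labeled by the highest-likelihood class model, and the segment takes the majority vote.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_baseline(frames_by_class, n_components=4):
    """frames_by_class: {class_name: (n_frames, n_features) array of training frames}."""
    return {c: GaussianMixture(n_components, covariance_type='diag').fit(X)
            for c, X in frames_by_class.items()}

def classify_segment(models, segment_frames):
    """Label every frame by its best class model, then vote among the results."""
    classes = sorted(models)
    loglik = np.stack([models[c].score_samples(segment_frames) for c in classes], axis=1)
    votes = np.argmax(loglik, axis=1)              # best class per frame
    return classes[np.bincount(votes).argmax()]    # majority vote over frames
```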
[Figure panels: Knife/Utensil, Speech, Car, Chirp]
Contributions • Framework for Acoustic Event Detection and Scene Classification • Epitomes for Audio • Informative Sampling (Can be applied to any domain) • Distributions over epitomic indexes for discrimination
Future Work • Informative Sampling • Maximizing the Minimum Likelihood • Discriminative Epitomes • Novel Scene Classification • Rich Representation using Epitomes • Boosting, other ensemble techniques • Hierarchical Acoustic Sound Analysis • Same Model for: • Acoustic Event Detection, Scene Classification & Synthesis • Clustering mechanisms for scene retrieval
Acknowledgments • Sumit Basu • Nebojsa Jojic • My friends and fellow interns