Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
Yun Wang, Leonardo Neves, Florian Metze
3/23/2016
Multimedia Event Detection
• Goal: content-based retrieval of videos
• Example events (from TRECVID MED): “birthday party”, “parade”, “changing a vehicle tire”
Multimedia Event Detection
• Sources of information:
  • Visual
  • Speech
  • Non-speech audio (the focus of this work)
Conventional Pipeline
Low-level features (e.g. MFCCs; local context only)
→ bag of audio words, GMM supervector, or i-vector (frame order disregarded)
→ Event
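A minimal sketch of such a bag-of-audio-words front end, assuming librosa and scikit-learn (the slide names no toolkit); the codebook size and MFCC settings are illustrative:

```python
# Bag-of-audio-words sketch: quantize MFCC frames against a learned
# codebook and count codeword occurrences, discarding frame order.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def bag_of_audio_words(wav_paths, num_words=256):
    # Per-clip MFCC matrices of shape (frames, n_mfcc); local context only.
    mfccs = [librosa.feature.mfcc(y=librosa.load(p, sr=16000)[0],
                                  sr=16000).T for p in wav_paths]
    # Learn a codebook over all frames pooled from all clips.
    codebook = KMeans(n_clusters=num_words).fit(np.vstack(mfccs))
    # One histogram per clip: frame order is disregarded.
    return [np.bincount(codebook.predict(m), minlength=num_words)
            for m in mfccs]
```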
Noisemes
• Semantically meaningful sound units
• Examples: speech, singing, music, laughter, engine noise
• Can be long-lasting or transient
• Allow for fine-grained audio scene understanding
Proposed Pipeline
openSMILE features (983 dimensions)
→ deep RNN (frame level)
→ noiseme confidence vectors
→ deep RNN (clip level)
→ Event
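In code, the data flow looks roughly like this (a sketch only; `extract_opensmile`, `frame_rnn`, and `event_rnn` are hypothetical names for the components defined on the following slides):

```python
# Hypothetical end-to-end flow for scoring one clip against one event.
features = extract_opensmile(clip)                  # (T, 983) frame features
confidences = frame_rnn.predict(features[None])[0]  # (T, num_noisemes)
score = event_rnn.predict(confidences[None])[0, 0]  # P(event | clip)
```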
Step 1: Frame-Level Noiseme Classification
The “Noiseme” Corpus*
• 388 clips, 7.9 hours
• Hand-annotated with 48 noisemes
• Merged into 17 classes (+ background)
• 30% overlap
• Split: 60% training, 20% validation, 20% test
* S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: manual annotation of environmental noise in audio streams”, Technical Report CMU-LTI-12-07, Carnegie Mellon University, 2012.
Baseline
• Evaluation criterion: frame accuracy
• Linear SVM: 41.5%
• Feed-forward DNN:
  • 2 hidden layers
  • 500 ReLU units per layer
  • Softmax output
• Accuracy: 45.1%
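The feed-forward baseline is straightforward to sketch in Keras (an assumption; the slides do not name a toolkit). The 983-dimensional input and 18 output classes come from the surrounding slides; optimizer and loss are illustrative:

```python
import tensorflow as tf

NUM_FEATURES = 983   # openSMILE feature dimension
NUM_CLASSES = 18     # 17 noisemes + background

# Two hidden layers of 500 ReLU units, softmax output per frame.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```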
Recurrent Neural Networks • Hidden unit: ReLU or LSTM cell
Evaluation
• Bidirectionality helps
• LSTM cells are not necessary: the bidirectional ReLU RNN performs best, reaching 47.0% frame accuracy
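A sketch of that bidirectional ReLU RNN, again in Keras; the hidden-layer width and the use of a single recurrent layer are assumptions:

```python
import tensorflow as tf

# Bidirectional RNN with ReLU hidden units; per-frame softmax over
# the 17 noisemes + background.
brnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 983)),  # (time, openSMILE features)
    tf.keras.layers.Bidirectional(
        tf.keras.layers.SimpleRNN(500, activation="relu",
                                  return_sequences=True)),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(18, activation="softmax")),
])
```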
Step 2: Clip-Level Event Detection
Noiseme Confidence Vectors
• Generated with the ReLU BRNN trained in Step 1
• One 18-dimensional confidence vector per frame (17 noisemes + background)
TRECVID 2011 MED Corpus
• 3,104 training clips, 6,642 test clips
• 15 events
• Evaluation criterion: mean average precision (MAP)
• Average precision (AP) for one event: the mean of the precision at each correct detection in the ranked list
  • Example: ranking ✓✗✓✗✓ gives AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
• MAP = mean of AP across all events
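The AP example on this slide can be reproduced in a few lines of Python (a sketch; not tied to the official TRECVID scorer):

```python
def average_precision(relevant):
    """AP for one event: mean precision at each correct detection.
    `relevant` is the ranked list of booleans (True = correct)."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([True, False, True, False, True]))
# (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
# MAP = mean of the per-event APs over all 15 events.
```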
RNN Models
• One RNN per event (binary detection)
• Unidirectional LSTM
• Sigmoid output at the last frame only
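One per-event detector might look like this in Keras (a sketch; the LSTM width is an assumption):

```python
import tensorflow as tf

NUM_NOISEMES = 18  # dimensionality of the noiseme confidence vectors

# Unidirectional LSTM; only the final hidden state feeds the single
# sigmoid output, i.e. the event score is read at the last frame.
detector = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_NOISEMES)),
    tf.keras.layers.LSTM(64),                        # last state only
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(event | clip)
])
detector.compile(optimizer="adam", loss="binary_crossentropy")
```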
Multi-Resolution Training
• Noiseme confidence sequences downsampled to various lengths
• MAP:
  • 4.0% @ length = 1 (feed-forward baseline)
  • 4.6% @ length = 32
  • 3.2% @ length = 512
• LSTMs can exploit temporal information, but only over short sequences
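One way to obtain the different sequence lengths is to mean-pool the noiseme confidence sequence into a fixed number of segments (an assumption about how the downsampling was implemented):

```python
import numpy as np

def downsample(seq, target_len):
    """Mean-pool a (T, D) sequence into (target_len, D).
    Assumes T >= target_len so every segment is non-empty."""
    bounds = np.linspace(0, len(seq), target_len + 1).astype(int)
    return np.stack([seq[b:e].mean(axis=0)
                     for b, e in zip(bounds[:-1], bounds[1:])])

# target_len = 1 averages the whole clip (the feed-forward baseline).
```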
Follow-Up Work*
• SVM baseline: 7.1% MAP (χ²-RBF kernel)
• Recurrent SVMs: 8.8% MAP
* Y. Wang and F. Metze, “Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection”, submitted to ICMR 2016.
Conclusion
• Temporal information helps!
  • Frame-level noiseme classification accuracy: 45.1% → 47.0%
  • Clip-level event detection: 4.0% → 4.6% MAP
• Clip-level event detection still needs improvement
  • Follow-up: recurrent SVMs
Thanks! Any questions?