Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
Yun Wang, Leonardo Neves, Florian Metze
3/23/2016
Multimedia Event Detection
• Goal: content-based retrieval of videos
• Example events (from TRECVID MED): “birthday party”, “parade”, “changing a vehicle tire”
Multimedia Event Detection
• Sources of information:
  • Visual
  • Speech
  • Non-speech audio (the focus of this work)
Conventional Pipeline
Low-level features (e.g. MFCCs; local context only)
→ bag of audio words, GMM supervector, or i-vector (frame order disregarded)
→ Event
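A minimal sketch of such a bag-of-audio-words front end, assuming librosa and scikit-learn (the slide names no toolkit); the codebook size and MFCC settings are illustrative:

```python
# Bag-of-audio-words sketch: quantize MFCC frames against a learned
# codebook and count codeword occurrences, discarding frame order.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def bag_of_audio_words(wav_paths, num_words=256):
    # Per-clip MFCC matrices of shape (frames, n_mfcc); local context only.
    mfccs = [librosa.feature.mfcc(y=librosa.load(p, sr=16000)[0],
                                  sr=16000).T for p in wav_paths]
    # Learn a codebook over all frames pooled from all clips.
    codebook = KMeans(n_clusters=num_words).fit(np.vstack(mfccs))
    # One histogram per clip: frame order is disregarded.
    return [np.bincount(codebook.predict(m), minlength=num_words)
            for m in mfccs]
```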
Noisemes
• Semantically meaningful sound units
• Examples: speech, singing, music, laughter, engine noise
• Can be long-lasting or transient
• Allow for fine-grained audio scene understanding
Proposed Pipeline
openSMILE features (983 dimensions)
→ deep RNN (frame level)
→ noiseme confidence vectors
→ deep RNN (clip level)
→ Event
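In code, the data flow looks roughly like this (a sketch only; `extract_opensmile`, `frame_rnn`, and `event_rnn` are hypothetical names for the components defined on the following slides):

```python
# Hypothetical end-to-end flow for scoring one clip against one event.
features = extract_opensmile(clip)                  # (T, 983) frame features
confidences = frame_rnn.predict(features[None])[0]  # (T, num_noisemes)
score = event_rnn.predict(confidences[None])[0, 0]  # P(event | clip)
```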
Step 1: Frame-Level Noiseme Classification
The “Noiseme” Corpus*
• 388 clips, 7.9 hours
• Hand-annotated with 48 noisemes
• Merged into 17 classes (+ background)
• 30% overlap
• Split: 60% training, 20% validation, 20% test
* S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: manual annotation of environmental noise in audio streams”, Technical Report CMU-LTI-12-07, Carnegie Mellon University, 2012.
Baseline
• Evaluation criterion: frame accuracy
• Linear SVM: 41.5%
• Feed-forward DNN:
  • 2 hidden layers
  • 500 ReLU units per layer
  • Softmax output
• Accuracy: 45.1%
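The feed-forward baseline is straightforward to sketch in Keras (an assumption; the slides do not name a toolkit). The 983-dimensional input and 18 output classes come from the surrounding slides; optimizer and loss are illustrative:

```python
import tensorflow as tf

NUM_FEATURES = 983   # openSMILE feature dimension
NUM_CLASSES = 18     # 17 noisemes + background

# Two hidden layers of 500 ReLU units, softmax output per frame.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```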
Recurrent Neural Networks • Hidden unit: ReLU or LSTM cell
Evaluation
• Bidirectionality helps
• LSTM cells are not necessary: the bidirectional ReLU RNN performs best, reaching 47.0% frame accuracy
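A sketch of that bidirectional ReLU RNN, again in Keras; the hidden-layer width and the use of a single recurrent layer are assumptions:

```python
import tensorflow as tf

# Bidirectional RNN with ReLU hidden units; per-frame softmax over
# the 17 noisemes + background.
brnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 983)),  # (time, openSMILE features)
    tf.keras.layers.Bidirectional(
        tf.keras.layers.SimpleRNN(500, activation="relu",
                                  return_sequences=True)),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(18, activation="softmax")),
])
```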
Step 2: Clip-Level Event Detection
Noiseme Confidence Vectors
• Generated with the ReLU BRNN trained in Step 1
• One 18-dimensional confidence vector per frame (17 noisemes + background)
TRECVID 2011 MED Corpus
• 3,104 training clips, 6,642 test clips
• 15 events
• Evaluation criterion: mean average precision (MAP)
• Average precision (AP) for one event: the mean of the precision at each correct detection in the ranked list
  • Example: ranking ✓✗✓✗✓ gives AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
• MAP = mean of AP across all events
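The AP example on this slide can be reproduced in a few lines of Python (a sketch; not tied to the official TRECVID scorer):

```python
def average_precision(relevant):
    """AP for one event: mean precision at each correct detection.
    `relevant` is the ranked list of booleans (True = correct)."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([True, False, True, False, True]))
# (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
# MAP = mean of the per-event APs over all 15 events.
```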
RNN Models
• One RNN per event (binary detection)
• Unidirectional LSTM
• Sigmoid output at the last frame only
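One per-event detector might look like this in Keras (a sketch; the LSTM width is an assumption):

```python
import tensorflow as tf

NUM_NOISEMES = 18  # dimensionality of the noiseme confidence vectors

# Unidirectional LSTM; only the final hidden state feeds the single
# sigmoid output, i.e. the event score is read at the last frame.
detector = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_NOISEMES)),
    tf.keras.layers.LSTM(64),                        # last state only
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(event | clip)
])
detector.compile(optimizer="adam", loss="binary_crossentropy")
```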
Multi-Resolution Training
• Noiseme confidence sequences downsampled to various lengths
• MAP:
  • 4.0% @ length = 1 (feed-forward baseline)
  • 4.6% @ length = 32
  • 3.2% @ length = 512
• LSTMs can exploit temporal information, but only over short sequences
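One way to obtain the different sequence lengths is to mean-pool the noiseme confidence sequence into a fixed number of segments (an assumption about how the downsampling was implemented):

```python
import numpy as np

def downsample(seq, target_len):
    """Mean-pool a (T, D) sequence into (target_len, D).
    Assumes T >= target_len so every segment is non-empty."""
    bounds = np.linspace(0, len(seq), target_len + 1).astype(int)
    return np.stack([seq[b:e].mean(axis=0)
                     for b, e in zip(bounds[:-1], bounds[1:])])

# target_len = 1 averages the whole clip (the feed-forward baseline).
```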
Follow-Up Work*
• SVM baseline: 7.1% MAP (χ²-RBF kernel)
• Recurrent SVMs: 8.8% MAP
* Y. Wang and F. Metze, “Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection”, submitted to ICMR 2016.
Conclusion
• Temporal information helps!
  • Frame-level noiseme classification accuracy: 45.1% → 47.0%
  • Clip-level event detection: 4.0% → 4.6% MAP
• Clip-level event detection still needs improvement
  • Follow-up: recurrent SVMs
Thanks! Any questions?