
Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks


Presentation Transcript


  1. Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks. Yun Wang, Leonardo Neves, Florian Metze. 3/23/2016

  2. Multimedia Event Detection • Goal: content-based retrieval • Example events (shown as images on the slide)

  3. Multimedia Event Detection • Sources of information: visual, speech, and non-speech audio

  4. Conventional Pipeline • Low-level features (e.g. MFCCs), capturing only local context • Aggregated into a bag of audio words, a GMM supervector, or an i-vector, which disregards temporal order • Classified into an event

  5. Noisemes • Semantically meaningful sound units • Examples shown on the slide • Can be long-lasting or transient • Allow for fine-grained audio scene understanding

  6. Proposed Pipeline • openSMILE features (983 dimensions) → deep RNN → noiseme confidence vectors → deep RNN → event
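A minimal sketch, in Python, of how the two stages compose at inference time. `noiseme_brnn` and `event_rnn` stand in for the two trained networks, and the function name and threshold are hypothetical.

```python
import numpy as np

def detect_event(clip_features, noiseme_brnn, event_rnn, threshold=0.5):
    """clip_features: array of shape (num_frames, 983) openSMILE features for one clip."""
    # Stage 1: frame-level noiseme confidence vectors, shape (num_frames, num_noisemes)
    confidences = noiseme_brnn.predict(clip_features[np.newaxis])[0]
    # Stage 2: clip-level event probability from the confidence-vector sequence
    score = float(event_rnn.predict(confidences[np.newaxis])[0, 0])
    return score >= threshold
```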

  7. Step 1: Frame-Level Noiseme Classification

  8. The “Noiseme” Corpus • 388 clips, 7.9 hours • Hand-annotated with 48 noisemes • Merged into 17 (+background) • 30% overlap • 60% training, 20% validation, 20% test * S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: manual annotation of environmental noise in audio streams”, technical report CMU-LTI-12-07, Carnegie Mellon University, 2012.

  9. Baseline • Evaluation criterion: frame accuracy • Linear SVM: 41.5% • Feed-forward DNN (2 hidden layers, 500 ReLU units per layer, softmax output): 45.1%
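A minimal Keras sketch of the feed-forward baseline described above: a 983-dimensional openSMILE frame in, two hidden layers of 500 ReLU units, and a softmax over 17 noisemes plus background. The optimizer and other training details are assumptions not stated on the slide.

```python
from tensorflow.keras import layers, models

dnn = models.Sequential([
    layers.Input(shape=(983,)),              # one openSMILE feature frame
    layers.Dense(500, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(18, activation="softmax"),  # 17 noisemes + background
])
dnn.compile(optimizer="adam",                # assumption; optimizer not given on the slide
            loss="categorical_crossentropy",
            metrics=["accuracy"])            # frame accuracy
```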

  10. Recurrent Neural Networks • Hidden unit: ReLU or LSTM cell

  11. Bidirectional RNNs
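A minimal Keras sketch covering slides 10 and 11: a bidirectional recurrent layer whose hidden unit is either a plain ReLU recurrence or an LSTM cell, with a softmax output at every frame. The hidden size and the frame-level softmax head are assumptions.

```python
from tensorflow.keras import layers, models

def build_noiseme_brnn(cell="relu"):
    # Hidden unit: plain ReLU recurrence or an LSTM cell (slide 10)
    if cell == "lstm":
        recurrent = layers.LSTM(500, return_sequences=True)
    else:
        recurrent = layers.SimpleRNN(500, activation="relu", return_sequences=True)
    return models.Sequential([
        layers.Input(shape=(None, 983)),                # sequence of openSMILE frames
        layers.Bidirectional(recurrent),                # forward + backward context (slide 11)
        layers.TimeDistributed(layers.Dense(18, activation="softmax")),  # per-frame noiseme posteriors
    ])
```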

  12. Evaluation • Bidirectionality helps • LSTM cells not necessary

  13. Step 2: Clip-Level Event Detection

  14. Noiseme Confidence Vectors • Generated with the ReLU BRNN trained in Step 1

  15. TRECVID 2011 MED Corpus • 3,104 training clips, 6,642 test clips • 15 events • Evaluation criterion: mean average precision (MAP) • Average precision (AP) for one event: for the ranked list ✓✗✓✗✓, AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.756 • MAP = mean of AP across all events
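The AP computation from the slide, spelled out in Python: precision is taken at each rank where a relevant clip ("✓") appears and averaged over the relevant clips; MAP then averages AP over the 15 events.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: list of booleans, one per retrieved clip, in ranked order."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Slide example: ✓✗✓✗✓ -> (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(average_precision([True, False, True, False, True]))

def mean_average_precision(per_event_rankings):
    return sum(average_precision(r) for r in per_event_rankings) / len(per_event_rankings)
```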

  16. RNN Models • One RNN for each event • Unidirectional LSTM • Sigmoid output at last frame only
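A minimal Keras sketch of one per-event detector as described on the slide: a unidirectional LSTM over the noiseme confidence sequence, with a sigmoid output taken at the last frame only. The hidden size is an assumption.

```python
from tensorflow.keras import layers, models

def build_event_rnn(num_noisemes=18, hidden=64):
    return models.Sequential([
        layers.Input(shape=(None, num_noisemes)),  # noiseme confidence vector per frame
        layers.LSTM(hidden),                       # unidirectional; returns the last state only
        layers.Dense(1, activation="sigmoid"),     # P(event | clip)
    ])

# One such model is trained per event (15 events in TRECVID 2011 MED).
```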

  17. Multi-Resolution Training

  18. Multi-Resolution Training

  19. Multi-Resolution Training

  20. Multi-Resolution Training • MAP: • 4.0% @ length = 1 (Feed-forward baseline) • 4.6% @ length = 32 • 3.2% @ length = 512 • LSTM can use temporal information, but only for short sequences
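A hedged sketch of one plausible reading of multi-resolution training: average-pool each clip's frame-level confidence sequence down to a target length (1, 32, or 512) before feeding the event LSTM. The exact pooling scheme used in the talk may differ.

```python
import numpy as np

def pool_to_length(confidences, target_len):
    """Average-pool (num_frames, num_noisemes) down to (target_len, num_noisemes).
    Assumes num_frames >= target_len."""
    num_frames = confidences.shape[0]
    edges = np.linspace(0, num_frames, target_len + 1).astype(int)
    return np.stack([confidences[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:])])

# length = 1 collapses the clip to a single mean vector, matching the
# feed-forward baseline on the slide; the best MAP is reported at length = 32.
```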

  21. Follow-Up Work • SVM baseline: 7.1% • Using the chi2-RBF kernel • Recurrent SVMs: 8.8% * Y. Wang and F. Metze, “Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection”, submitted to ICMR 2016.

  22. Conclusion • Temporal information helps! • Frame-level noiseme classification accuracy: 45.1% → 47.0% • Clip-level event detection: 4.0% → 4.6% MAP • Clip-level event detection still needs improvement • Recurrent SVMs

  23. Thanks! Any questions?
