WP4 – Sound Object Representation
Enabling Access to Sound Archives through Integration, Enrichment and Retrieval
Introduction to Workpackage – Overview
• Objectives:
  • How to represent audio for the purposes of efficient querying.
  • Segmentation of audio streams.
  • Distinct objects may then be recognized using musical instrument identification and speaker identification techniques.
  • Identification of higher-level features
    • Speech related: gender, emotion, laughter and language
    • Music related: tempo, beat detection, rhythm…
• Tasks:
  • T4.1 Audio stream segmentation: speech/music separation…
  • T4.2 Source separation: instrument identification, speaker identification
  • T4.3 Sound object identification
  • T4.5 Transcription
    • Music transcription
    • High-level speech phonetics and characteristics
Deliverables and Milestones
• Deliverables
  • D4.1 Prototype segmentation, separation and speaker/instrument identification system (Month 14)
  • D4.2 Prototype transcription system (Month 27)
  • D4.3 Final report on sound object representations (Month 30)
• Milestones and expected results
  • M4.1 – Month 6: Speech/music separation methods implemented and tested
  • M4.2 – Month 10: Initial results on identification of sound objects, prototype segmenter and separator
  • M4.3 – Month 18: Identification of speech characteristics from segmented, separated audio streams
  • M4.4 – Month 24: Transcription of monophonic music from segmented, separated audio streams
  • M4.5 – Month 28: Testing and evaluation of complete system
Workpackage Progress – Speech Related
• Prototype for speaker segmentation is ready.
• Preliminary prototype for SID is ready.
• Pre-processing module implemented for ED and SID: energy-based Voice Activity Detector.
• ED, Laughter DLL is ready (NICE’s API).
• LID algorithm evaluated on an English UK corpus, achieving over 85% accuracy.
• Trained on a testbed representing at least 10 (European) languages.
• Ongoing research on speaker identification (outlier detection and exclusion; how to deal with multiple speakers).
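The energy-based voice activity detector mentioned above can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the 160-sample frame length and the -30 dB relative threshold are all assumptions chosen for illustration.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def energy_vad(samples, frame_len=160, rel_threshold_db=-30.0):
    """Flag a frame as active when its energy is within
    rel_threshold_db of the loudest frame (assumed decision rule)."""
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    peak = max(energies)
    if peak == 0.0:
        return [False] * len(energies)
    threshold = peak * 10.0 ** (rel_threshold_db / 10.0)
    return [e >= threshold for e in energies]

# Synthetic signal: quiet, loud, quiet (two frames each).
quiet = [0.001 * math.sin(0.3 * n) for n in range(320)]
loud = [0.5 * math.sin(0.3 * n) for n in range(320)]
flags = energy_vad(quiet + loud + quiet)
print(flags)
```

A threshold relative to the loudest frame keeps the sketch independent of recording level; a deployed detector would more likely estimate a noise floor instead.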
Contributions and Connections with Other Workpackages
• This WP provides many inputs to other WPs and relies on few outputs from other WPs
• WP2
  • The sound objects extracted in WP4 populate the ontology devised in WP2
• WP3
  • Sound object recognition used to enable enhanced retrieval
  • Retrieval of speakers
  • Retrieval of key speech and music features
• WP5
  • Sound objects used both in archiving and as access tools
  • Source separation
  • Audio enhancement
Upcoming Work Plan Months 12-24 – Speech Related
• Speaker Identification
  • Retrieval of speakers (for use in WP3)
  • Research on outlier detection and exclusion
  • Research on new scoring methods
  • How to deal with multiple targets in speaker identification?
• ED, Laughter and Gender
  • VAMP API
  • Ongoing research on robust methods
• LID
  • Build and implement a robust model for English UK
Music Transcription
• Reasonable detection accuracy in:
  • Onset detection
  • Tempo detection
  • Key detection
  • Monophonic pitch detection
• Unsolved or unexplored research areas:
  • Ornamentation detection
  • Time signature detection
  • Segmentation:
    • Bar line detection
    • Music structure detection
Music Transcription: Ornamentation detection
[Figure: ornamentation examples labelled ROLL, CUT and STRIKE]
Gainza, M. and E. Coyle. Automating Ornamentation Transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07)
Music Transcription: Time Signature Detection
• Music is highly repetitive: chorus, phrases, bars…
• The method utilises a multi-resolution audio similarity matrix to detect repetitive musical bars by building templates of time signature candidates
• The method depends only on musical structure, and does not depend on the presence of percussive instruments or strong musical accents
Music Transcription: Time Signature Detection Gainza, M. and E. Coyle. Time Signature Detection by Using a Multi-Resolution Audio Similarity Matrix. In Audio Engineering Society 122nd Convention. 2007. Vienna.
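The idea of finding repeating bars in an audio similarity matrix can be sketched in a heavily simplified form. The code below is not the multi-resolution method from the cited paper: it assumes one feature vector per beat is already available, uses cosine similarity, and scores each candidate bar length by the mean of the matching off-diagonal. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(frames):
    """Pairwise similarity between all feature frames."""
    return [[cosine(u, v) for v in frames] for u in frames]

def lag_score(asm, lag):
    """Mean similarity between frames that are `lag` apart."""
    n = len(asm)
    vals = [asm[i][i + lag] for i in range(n - lag)]
    return sum(vals) / len(vals)

def best_bar_length(frames, candidates):
    """Pick the candidate bar length (in beats) with the highest lag score."""
    asm = similarity_matrix(frames)
    return max(candidates, key=lambda lag: lag_score(asm, lag))

# Toy input: per-beat features that repeat every 3 beats.
pattern = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
beats = pattern * 8
best = best_bar_length(beats, candidates=[2, 3, 4])
print(best)
```

In this toy setting any multiple of the true bar length would score equally well, which is one reason a practical system needs more than a single lag score per candidate.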
Music Transcription: Bar line Segmentation
[Figure: block diagram with labels Song, ASM, Bar length, Anacrusis, Onset detector, [p1, p2... pn], Bar line prediction, Bar line alignment, [b1, b2... bn]]
• Detects the musical bar length and the anacrusis using the Audio Similarity Matrix
• Predicts and aligns the position of future bars by using an onset detector
Gainza, Mikel; Barry, Dan; Coyle, Eugene. Automatic Bar Line Segmentation. In Audio Engineering Society 123rd Convention, New York, 2007
Music Transcription: Bar line Segmentation
[Figure: labels Anacrusis and Bar length]
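The predict-then-align step can be sketched as follows, assuming the bar length and anacrusis have already been estimated and that onset times are available in seconds. The snapping rule and the `tolerance` parameter are assumptions made for illustration, not details taken from the cited paper.

```python
def align_bar_lines(onsets, anacrusis, bar_length, duration, tolerance):
    """Predict bar lines at regular intervals and snap each prediction
    to the nearest onset when one lies within `tolerance` seconds.
    The next prediction starts from the aligned position."""
    bar_lines = []
    position = anacrusis  # the first bar line follows the anacrusis
    while position < duration:
        if onsets:
            nearest = min(onsets, key=lambda t: abs(t - position))
            if abs(nearest - position) <= tolerance:
                position = nearest
        bar_lines.append(position)
        position += bar_length
    return bar_lines

# Hypothetical onset times (seconds) with slight tempo drift.
onsets = [0.5, 1.0, 1.5, 2.05, 2.55, 3.1, 3.6, 4.15]
bars = align_bar_lines(onsets, anacrusis=0.5, bar_length=1.5,
                       duration=5.0, tolerance=0.1)
print(bars)
```

Predicting each bar from the previously aligned position, instead of from a fixed grid, lets small timing deviations accumulate into the estimate and not into the error.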
Music Transcription: Music Structure Segmentation
[Figure: block diagram with labels Song, ADDRESS, Azimugram, S A,T, PCA, ICA, Orthogonality enforcement, N basis func, B1,T, Segments]
• There are many mid-level representations: spectrogram, chromagram, MFCC…
• Novel mid-level representation: the Azimugram, a time-azimuth representation of a stereo field
• The system is based on the assumption that each section type (e.g. chorus) has a unique source location-intensity profile.
Music Transcription: Music Structure Segmentation
[Figure: panels Audio Signal, Azimugram and Segmentation, with sections labelled Intro, Verse and Chorus]
Barry, Dan; Gainza, Mikel; Coyle, Eugene. Music Structure Segmentation using the Azimugram in conjunction with Principal Component Analysis. In AES 123rd Convention, New York, 2007
Upcoming Work Plan Months 12-24
• Assess the robustness of the ornamentation detector for a variety of instruments
• Dynamically adapt time signature and bar line detections to tempo variations
• Assess the best mid-level representation for music segmentation
• Combine the music structure and bar line segmentation systems, so that each segment is aligned to the bar lines
• Incorporate knowledge of music structure (e.g. 8 bars per section…)
• Migrate all MATLAB applications to C++
ALL – Workpackage progress
Silence to silence segmentation – ALL
• Start–stop segmentation
• Threshold algorithm – ALL use this; it is sufficient for speech. Wave energy under the threshold value is treated as silence.
• Multi-threshold: there are different threshold values for different situations.
• Trained HMM: manually segmented samples are used for the training.
Usage
• Preparation phase for the manual segmentation of the training corpus
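The single-threshold start–stop segmentation described above can be sketched as below. It assumes frame energies have already been computed; the function name and the example values are hypothetical.

```python
def start_stop_segments(energies, threshold):
    """Return (start, stop) frame index pairs for runs of frames whose
    energy is at or above the threshold; `stop` is exclusive.
    Frames below the threshold are treated as silence."""
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                      # silence -> sound
        elif energy < threshold and start is not None:
            segments.append((start, i))    # sound -> silence
            start = None
    if start is not None:                  # signal ends while still active
        segments.append((start, len(energies)))
    return segments

energies = [0.0, 0.1, 0.9, 0.8, 0.05, 0.0, 0.7, 0.6]
segments = start_stop_segments(energies, threshold=0.5)
print(segments)
```

A multi-threshold variant could call the same function with a threshold selected per recording condition.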
ALL – Workpackage progress
Speech – non-speech segmentation – ALL
• Trained HMM with Gaussian mixture distribution
• Trained for:
  • Speech
  • Music
  • Singing
  • Whistle
  • ….
• Using 26-dimensional MFCC feature vectors
Usage
• Speech – non-speech segmentation filters the input for the speech recognition
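Once per-class models are trained, an HMM segmenter labels frames by decoding the most likely state sequence. The sketch below shows only that decoding step (Viterbi) for two states. It assumes the per-frame log-likelihoods have already been produced by the class models, and the transition and likelihood values are invented for illustration.

```python
import math

def viterbi(log_likes, log_trans, log_init):
    """Most likely state sequence.
    log_likes[t][s]: log-likelihood of frame t under state s.
    log_trans[p][s]: log-probability of moving from state p to state s."""
    n_states = len(log_init)
    score = [log_init[s] + log_likes[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_likes)):
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states),
                       key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_likes[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# States: 0 = speech, 1 = music (hypothetical probabilities).
stay, switch = math.log(0.9), math.log(0.1)
log_trans = [[stay, switch], [switch, stay]]
log_init = [math.log(0.5), math.log(0.5)]
speech = [math.log(0.8), math.log(0.2)]
noisy = [math.log(0.4), math.log(0.6)]
music = [math.log(0.2), math.log(0.8)]
frames = [speech] * 3 + [noisy] + [speech] * 2 + [music] * 4
labels = viterbi(frames, log_trans, log_init)
print(labels)
```

The high self-transition probability is what smooths over the single ambiguous frame, so the segmenter does not flip between classes on every frame.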