WP4

WP4 – Sound Object Representation Enabling Access to Sound Archives through Integration, Enrichment and Retrieval

Introduction to Workpackage – Overview
• Objectives:
  • How to represent audio for the purposes of efficient querying.
  • Segmentation of audio streams.
  • Distinct objects may then be recognized using musical instrument identification and speaker identification techniques.
  • Identification of higher level features
    • Speech related: gender, emotion, laughter and language
    • Music related: tempo, beat detection, rhythm…
• Tasks:
  • T4.1 Audio stream segmentation: speech/music separation…
  • T4.2 Source separation: instrument identification, speaker identification
  • T4.3 Sound object identification
  • T4.5 Transcription
    • Music transcription
    • High level speech phonetics & characteristics

Deliverables and Milestones
• Deliverables
  • D4.1 Prototype segmentation, separation and speaker/instrument identification system (Month 14)
  • D4.2 Prototype transcription system (Month 27)
  • D4.3 Final report on sound object representations (Month 30)
• Milestones and expected results
  • M4.1 – Month 6: Speech/music separation methods implemented and tested
  • M4.2 – Month 10: Initial results on identification of sound objects, prototype segmenter and separator
  • M4.3 – Month 18: Identification of speech characteristics from segmented, separated audio streams
  • M4.4 – Month 24: Transcription of monophonic music from segmented, separated audio streams
  • M4.5 – Month 28: Testing and evaluation of complete system

Workpackage Progress – Speech Related
• Prototype for speaker segmentation is ready.
• Preliminary prototype for speaker identification (SID) is ready.
• Pre-processing module implemented for emotion detection (ED) and SID: energy-based Voice Activity Detector.
• ED and laughter detection DLL is ready (NICE's API).
• Language identification (LID) algorithm evaluated on an English UK corpus, achieving over 85% accuracy.
  • Trained on a testbed representing at least 10 (European) languages
• Ongoing research on speaker identification (outlier detection and exclusion, handling of multiple speakers).
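As an illustration of the energy-based Voice Activity Detector mentioned above, the following is a minimal sketch only, not the project's implementation. The function names, the frame length of 160 samples and the -40 dB threshold are assumptions chosen for the example; samples are assumed to be floats in the range [-1, 1].

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def energy_vad(samples, frame_len=160, threshold_db=-40.0):
    """Return one boolean per frame: True where the frame energy,
    expressed in dB relative to full scale (1.0), exceeds threshold_db.
    Both default values are illustrative assumptions."""
    eps = 1e-12  # avoids log10(0) on digital silence
    flags = []
    for energy in frame_energies(samples, frame_len):
        level_db = 10.0 * math.log10(energy + eps)
        flags.append(level_db > threshold_db)
    return flags
```

Only the frames flagged True would be passed on to the ED and SID stages; a practical detector would typically add smoothing (hangover) so that short pauses inside an utterance are not cut out.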

Contributions and Connections with Other Workpackages
• This WP provides many inputs to other WPs and relies on few outputs from other WPs
• WP2
  • The sound objects extracted in WP4 populate the ontology devised in WP2
• WP3
  • Sound object recognition used to enable enhanced retrieval
    • Retrieval of speakers
    • Retrieval of key speech and music features
• WP5
  • Sound objects used both in archiving and as access tools
  • Source separation
  • Audio enhancement

Upcoming Work Plan Months 12-24 – Speech Related
• Speaker Identification
  • Retrieval of speakers (for use in WP3)
  • Research on outlier detection and exclusion
  • Research on new scoring methods
  • How to deal with multiple targets in speaker identification?
• ED, Laughter and Gender
  • VAMP API
  • Ongoing research on robust methods.
• LID
  • Build and implement a robust model for English UK.

Demonstration: Speaker Identification

Demonstration: Speaker Segmentation

Music Transcription
• Reasonable detection accuracy in:
  • Onset detection
  • Tempo detection
  • Key detection
  • Monophonic pitch detection
• Unsolved or unexplored research areas:
  • Ornamentation detection
  • Time signature detection
  • Segmentation:
    • Bar line detection
    • Music structure detection

Music Transcription: Ornamentation Detection
[Figure: ornament examples labelled ROLL, CUT, STRIKE]
Gainza, M. and E. Coyle. Automating Ornamentation Transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07).

Music Transcription: Time Signature Detection
• Music is highly repetitive: chorus, phrases, bars…
• The method utilises a multi-resolution audio similarity matrix to detect repetitive musical bars by building templates of time signature candidates
• The method depends only on musical structure, not on the presence of percussive instruments or strong musical accents
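To make the idea of an audio similarity matrix concrete, here is a simplified, single-resolution sketch. It is not the multi-resolution method of the cited paper. It assumes the audio has already been reduced to one feature vector per frame; the function names and the use of cosine similarity are assumptions made for illustration. A repetition period (for example a candidate bar length in frames) shows up as a high average similarity along the diagonal at that lag.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(frames):
    """Pairwise similarity between all frames: asm[i][j] compares frame i with frame j."""
    return [[cosine(u, v) for v in frames] for u in frames]

def lag_score(asm, lag):
    """Average similarity between frames that are `lag` frames apart.
    Requires 0 < lag < len(asm)."""
    n = len(asm)
    values = [asm[i][i + lag] for i in range(n - lag)]
    return sum(values) / len(values)

def best_lag(asm, candidates):
    """Candidate lag with the highest average similarity (first one wins ties)."""
    return max(candidates, key=lambda lag: lag_score(asm, lag))
```

For a toy sequence that repeats every three frames, `best_lag(asm, [2, 3, 4])` selects 3. Real audio would need robust features and handling of multiples of the true period, which this sketch ignores.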

Music Transcription: Time Signature Detection
Gainza, M. and E. Coyle. Time Signature Detection by Using a Multi-Resolution Audio Similarity Matrix. In Audio Engineering Society 122nd Convention. 2007. Vienna.

Music Transcription: Bar Line Segmentation
[Figure: block diagram. Labels: Song, ASM, Onset detector, Bar length, Anacrusis, Bar line prediction, Bar line alignment, [p1, p2... pn], [b1, b2... bn]]
• Detects the musical bar length and the anacrusis using the audio similarity matrix (ASM)
• Predicts and aligns the position of future bars by using an onset detector
Gainza, M., D. Barry and E. Coyle. Automatic Bar Line Segmentation. In Audio Engineering Society 123rd Convention. 2007. New York.
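The predict-then-align step can be sketched as follows. This is an illustrative simplification under stated assumptions, not the algorithm of the cited paper: the bar length and anacrusis are taken as already estimated (in seconds), predictions are placed on a fixed grid, and each prediction is snapped to the nearest detected onset if one lies within a tolerance. The function names and the `tolerance` parameter are hypothetical.

```python
def predict_bar_lines(bar_length, anacrusis, duration):
    """Evenly spaced bar line times, starting at the end of the anacrusis
    and continuing while they fall before `duration` (all in seconds)."""
    positions = []
    k = 0
    while anacrusis + k * bar_length < duration:
        positions.append(anacrusis + k * bar_length)
        k += 1
    return positions

def align_to_onsets(predicted, onsets, tolerance):
    """Move each predicted bar line to the nearest onset when that onset is
    within `tolerance` seconds; otherwise keep the prediction unchanged."""
    aligned = []
    for p in predicted:
        nearest = min(onsets, key=lambda o: abs(o - p)) if onsets else None
        if nearest is not None and abs(nearest - p) <= tolerance:
            aligned.append(nearest)
        else:
            aligned.append(p)
    return aligned
```

With a bar length of 2.0 s, an anacrusis of 0.5 s and onsets at 0.48 s and 2.56 s, the first two predicted bar lines (0.5 s and 2.5 s) would move to those onsets when the tolerance is 0.1 s. A system that tracks tempo changes would instead re-predict each bar from the previously aligned one.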

Music Transcription: Bar Line Segmentation
[Figure: example showing the anacrusis and the bar length]

Music Transcription: Music Structure Segmentation
[Figure: block diagram. Labels: Azimugram, S A,T, N basis func, B1,T, Segments, Song, ADDRESS, PCA, ICA, Orthogonality enforcement]
• There are many mid-level representations: spectrogram, chromagram, MFCC…
• Novel mid-level representation: the Azimugram, a time-azimuth representation of the stereo field
• The system is based on the assumption that each section type (e.g. chorus) has a unique source location-intensity profile.

Music Transcription: Music Structure Segmentation
[Figure: audio signal, Azimugram and resulting segmentation, with sections labelled Intro, Verse, Chorus]
Barry, D., M. Gainza and E. Coyle. Music Structure Segmentation using the Azimugram in conjunction with Principal Component Analysis. In Audio Engineering Society 123rd Convention. 2007. New York.

Upcoming Work Plan Months 12-24
• Assess the robustness of the ornamentation detector for a variety of instruments
• Dynamically adapt time signature and bar line detection to tempo variations
• Assess the best mid-level representation for music segmentation
• Combine the music structure and bar line segmentation systems, so that each segment is aligned to the bar lines
• Incorporate knowledge of music structure (e.g. 8 bars per section…)
• Migrate all MATLAB applications to C++

ALL – Workpackage Progress
Silence-to-silence segmentation – ALL
• Start–stop segmentation
  • Threshold algorithm: ALL uses this; it is sufficient for speech. Wave energy under the threshold value is treated as silence.
  • Multi-threshold: different threshold values are used for different situations
  • Trained HMM: a manually segmented sample is used for training
Usage
• Preparation phase for the manual segmentation of the training corpus
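The single-threshold variant described above can be sketched as follows. This is a minimal illustration, assuming the signal has already been reduced to one energy value per frame; the function name and the half-open (start, stop) frame-index convention are assumptions, not details from the slides.

```python
def start_stop_segments(energies, threshold):
    """Return (start, stop) frame-index pairs for runs of non-silent frames.
    A frame whose energy is under `threshold` counts as silence;
    `stop` is the index of the first silent frame after the run."""
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                      # a sounding run begins
        elif energy < threshold and start is not None:
            segments.append((start, i))    # the run ends at the first silent frame
            start = None
    if start is not None:                  # the signal ends while still sounding
        segments.append((start, len(energies)))
    return segments
```

The multi-threshold variant could reuse the same loop with a threshold chosen per recording condition; how those thresholds are selected is not described in the source.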

ALL – Workpackage Progress
Speech – non-speech segmentation – ALL
• Trained HMM with Gaussian mixture distributions
  • Trained for:
    • Speech
    • Music
    • Singing
    • Whistle
    • …
  • Using 26-dimensional MFCC feature vectors
Usage
• Speech – non-speech segmentation filters the input for the speech recognition
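Once an HMM of this kind has been trained, labelling a recording amounts to finding the most likely state sequence. The sketch below shows generic Viterbi decoding under stated assumptions: per-frame emission probabilities are supplied directly (in the real system they would come from Gaussian mixtures evaluated on MFCC vectors), and the helper names, state numbering and probabilities are hypothetical.

```python
import math

def log_probs(rows):
    """Convert a nested list of non-zero probabilities to natural logs."""
    return [[math.log(p) for p in row] for row in rows]

def viterbi(log_emissions, log_trans, log_init):
    """Most likely state path.
    log_emissions[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:     log-probability of moving from state p to state s
    log_init[s]:         log-probability of starting in state s"""
    n_states = len(log_init)
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for obs in log_emissions[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + obs[s])
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With two states (0 = speech, 1 = music) and "sticky" transitions, a single ambiguous frame inside a speech run stays labelled as speech, which is the smoothing behaviour that makes an HMM preferable to frame-by-frame classification.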
