
Julien PINQUIER, Christine SÉNAC and Régine ANDRÉ-OBRECHT



Speech and Music Classification in Audio Documents

Julien PINQUIER, Christine SÉNAC and Régine ANDRÉ-OBRECHT
Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS – INP – UPS, FRANCE
118, Route de Narbonne, 31062 Toulouse Cedex 04
{pinquier, senac, obrecht}@irit.fr

[General schema of the classification system: the signal undergoes acoustic preprocessing (18 cepstral coefficients for the speech pair, 29 spectral coefficients for the music pair); the four GMMs (speech and non-speech models with 32 Gaussian laws each, music and non-music models with 16 Gaussian laws each) are trained by VQ assignment and EM from a manual indexing; classification on the maximum log-likelihood is followed by a smoothing step.]

Abstract
In order to index the soundtrack of multimedia documents efficiently, it is necessary to extract elementary and homogeneous acoustic segments. In this paper, we explore such a prior partitioning, which consists in detecting the two basic components: speech and music. The originality of this work is that music and speech are not considered as two classes of a single classifier: two classification systems are defined independently, a speech/non-speech one and a music/non-music one. This approach makes it possible to characterize and discriminate each component better: in particular, two different feature spaces are necessary, as are two pairs of Gaussian mixture models. The acoustic signal is then classified into four types of segments: speech, music, speech-music and other. The experiments are performed on the soundtracks of audio-video documents (films, TV sport broadcasts). The performance demonstrates the interest of this approach, called the Differentiated Modeling Approach.

Introduction
• Audio indexing: finding appropriate audio descriptors in the signal to better characterize each document.
• Detect the two basic components (speech and music).
• The indexing system is based on a differentiated modeling.
• The modeling is based on Gaussian Mixture Models (GMMs).
• Goal: detection of the two basic classes (speech and music) with equal performance.
• How: extraction of elementary and homogeneous acoustic components for each class.

Differentiated modeling
• Speech: formantic structure.
• Music: harmonic structure.
• 1 class = {representation space, class model, non-class model}
• Speech class = {cepstral space, speech model, non-speech model}
• Music class = {spectral space, music model, non-music model}

Experiment
• Corpus (114 mn): championship of figure skating, sport report and TV movie.
• Contents: speech, music and 'mixed' zones.
• Speech: phone calls, outdoor recordings, crowd…
• Music: stringed and wind instruments, bass, drums…
• Main speakers: 9 (3 women and 6 men).
• Training corpus: 74 mn; test corpus: 40 mn.
• Models: Speech, Non-Speech, Music and Non-Music.

Results

                     Speech   Music
  Delay < 20 cs (1)  96 %     91 %
  Accuracy (2)       99 %     93 %

• (1) Delay measured between the manual and automatic labels.
• (2) Accuracy = (length of test corpus – length of insertions – length of deletions) / (length of test corpus).

Example of speech and music classification
• (a) Manual indexing
• (b) Speech / Non-Speech indexing
• (c) Music / Non-Music indexing
• (d) Fusion of the speech and music classifications
• Labels: P (speech), M (music), NP (non-speech), NM (non-music) and – (other).

ICASSP'2002, Orlando (Florida)

Acknowledgment
We would like to thank the CNRS (French national center for scientific research) for its support of this work under the RAIVES project.
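The training stage described above (VQ initialisation followed by EM, one class/non-class GMM pair per component) can be sketched as follows. This assumes scikit-learn is available and uses random stand-ins for the manually indexed training frames; only the component counts and feature dimensions come from the poster, everything else is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for manually indexed training frames: cepstral vectors
# for the speech pair, spectral vectors for the music pair.
speech_frames     = rng.normal(0.0, 1.0, size=(500, 18))  # 18 cepstral coeff.
non_speech_frames = rng.normal(2.0, 1.0, size=(500, 18))
music_frames      = rng.normal(0.0, 1.0, size=(500, 29))  # 29 spectral coeff.
non_music_frames  = rng.normal(2.0, 1.0, size=(500, 29))

def train_gmm(frames, n_components):
    """k-means initialisation (playing the role of VQ) followed by EM."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          init_params="kmeans",
                          random_state=0)
    gmm.fit(frames)
    return gmm

# The four models: 32 Gaussian laws for the speech pair, 16 for the music pair.
speech_gmm     = train_gmm(speech_frames, 32)
non_speech_gmm = train_gmm(non_speech_frames, 32)
music_gmm      = train_gmm(music_frames, 16)
non_music_gmm  = train_gmm(non_music_frames, 16)
```

Diagonal covariances keep the parameter count manageable for small training sets; the original system may well have used a different covariance structure.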
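The classification, smoothing and fusion steps can be sketched as below: each frame is assigned to a class when its class log-likelihood exceeds the non-class one, the binary decisions are smoothed, and the two streams are fused into the four segment types. The majority-vote smoothing and the log-likelihood values are illustrative stand-ins, not the poster's actual settings.

```python
import numpy as np

def smooth(decisions, window=5):
    """Majority vote over a sliding window (a simple smoothing stand-in)."""
    out, half = [], window // 2
    for i in range(len(decisions)):
        seg = decisions[max(0, i - half): i + half + 1]
        out.append(int(sum(seg) > len(seg) / 2))
    return out

def fuse(speech_dec, music_dec):
    """Map the (speech, music) binary decisions to the 4 segment types."""
    labels = {(1, 1): "speech-music", (1, 0): "speech",
              (0, 1): "music", (0, 0): "other"}
    return [labels[(s, m)] for s, m in zip(speech_dec, music_dec)]

# Per-frame log-likelihood differences (class minus non-class); a frame
# belongs to the class when the difference is positive.
speech_llr = np.array([1.2, 0.8, -0.1, 1.0, 1.1, -0.9, -1.0, -0.8])
music_llr  = np.array([-0.5, -0.7, -0.2, -0.9, 0.6, 1.1, 0.9, 1.2])

speech_dec = smooth([int(x > 0) for x in speech_llr])
music_dec  = smooth([int(x > 0) for x in music_llr])
segments = fuse(speech_dec, music_dec)  # a speech run followed by a music run
```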
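The accuracy measure defined in (2) is plain arithmetic on segment durations; a minimal sketch with illustrative durations (in seconds, not the actual corpus figures):

```python
def accuracy(test_len, insertions, deletions):
    """Accuracy = (test corpus - insertions - deletions) / test corpus."""
    return (test_len - insertions - deletions) / test_len

# e.g. a 2400 s test corpus with 100 s of insertions and 68 s of deletions:
acc = accuracy(2400.0, 100.0, 68.0)  # 0.93, i.e. 93 %
```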
