Automatic Description and Classification of Instrumental Sounds Geoffroy Peeters, Ircam (Analysis/Synthesis Team) peeters@ircam.fr
1. Introduction
• Musical instrument sound classification
  • numerous studies on sound classification
  • few of them address the generalization of sound sources (recognition of the same source possibly recorded in different conditions, with various instrument manufacturers and players)
• Evaluation of system performance
  • training on a subset of the database, evaluation on the rest of the database
  • this does not prove any applicability to the classification of sounds that do not belong to the database
  • Martin [1999]: 76% (family), 39% for 14 instruments
  • Eronen [2001]: 77% (family), 35% for 16 instruments
• Goal of this study: classification on a large database
• How? A new classification system:
  • extract a large set of features
  • new feature selection algorithm
  • compare flat and hierarchical Gaussian classifiers
Outline: [Feature extraction] → Feature selection → Feature transform → Classification → Evaluation → Confusion matrix → Which features → Class organization
2. Feature extraction
• Features for sound recognition come from the speech recognition community, previous studies on musical instrument sound classification, and results of psycho-acoustical studies
  • each feature set is supposed to perform well for a specific task
• Principle:
  • 1) extract a large set of features
  • 2) filter the feature set a posteriori with a feature selection algorithm
2. Feature extraction: Audio feature taxonomy
• Global descriptors
• Instantaneous descriptors
• Temporal modeling: mean, variance, modulation (pitch, energy)
2. Feature extraction: Audio feature taxonomy
• DT: temporal descriptors
• DE: energy descriptors
• DS: spectral descriptors
• DH: harmonic descriptors
• DP: perceptual descriptors
2. Feature extraction: DT/DE, temporal/energy descriptors (sound → envelope / energy)
• DT.log-attack time
• DT.temporal increase
• DT.temporal decrease
• DT.temporal centroid
• DT.effective duration
• DT.zero-crossing rate
• DT.auto-correlation
• DE.total energy
• DE.energy of harmonic part
• DE.energy of noise part
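A minimal sketch of how a few of these DT descriptors can be computed, assuming a mono signal x at sampling rate sr; the envelope smoothing and the 10%/90% attack thresholds are illustrative assumptions, not the exact settings of the study:

```python
import numpy as np

def temporal_descriptors(x, sr):
    env = np.abs(x)                              # crude amplitude envelope
    win = max(1, int(0.01 * sr))                 # ~10 ms smoothing window
    env = np.convolve(env, np.ones(win) / win, mode="same")
    t = np.arange(len(env)) / sr

    # DT.log-attack time: log10 of the time to rise from 10% to 90% of max
    t_start = t[np.argmax(env >= 0.1 * env.max())]
    t_end = t[np.argmax(env >= 0.9 * env.max())]
    log_attack = np.log10(max(t_end - t_start, 1e-4))

    # DT.temporal centroid: energy-weighted mean time of the envelope
    centroid = np.sum(t * env) / np.sum(env)

    # DT.effective duration: time during which the envelope exceeds 40% of max
    eff_dur = np.sum(env >= 0.4 * env.max()) / sr

    # DT.zero-crossing rate, in crossings per second
    zcr = np.sum(np.abs(np.diff(np.sign(x)))) / 2 / (len(x) / sr)

    return dict(log_attack=log_attack, centroid=centroid,
                eff_dur=eff_dur, zcr=zcr)
```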
2. Feature extraction: DS, spectral descriptors (sound → window → FFT)
• DS.centroid, DS.spread, DS.skewness, DS.kurtosis
• DS.slope, DS.decrease, DS.roll-off
• DS.variation
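The four moment descriptors treat the magnitude spectrum as a distribution over frequency. A minimal sketch for one frame (the Hann window and the 95% roll-off threshold are assumptions):

```python
import numpy as np

def spectral_moments(frame, sr):
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = mag / (mag.sum() + 1e-12)          # spectrum as a distribution

    centroid = np.sum(freqs * p)                              # DS.centroid
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))   # DS.spread
    skewness = np.sum(((freqs - centroid) ** 3) * p) / (spread ** 3 + 1e-12)
    kurtosis = np.sum(((freqs - centroid) ** 4) * p) / (spread ** 4 + 1e-12)

    # DS.roll-off: frequency below which 95% of the spectral energy lies
    cum = np.cumsum(mag ** 2)
    rolloff = freqs[np.searchsorted(cum, 0.95 * cum[-1])]

    return centroid, spread, skewness, kurtosis, rolloff
```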
2. Feature extraction: DH, harmonic descriptors (sound → window → FFT → sinusoidal model)
• DH.Centroid, DH.Spread, DH.Skewness, DH.Kurtosis
• DH.Slope, DH.Decrease, DH.Roll-off
• DH.Variation
• DH.Fundamental frequency
• DH.Noisiness, DH.OddEvenRatio, DH.Inharmonicity
• DH.Tristimulus
• DH.Deviation
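A minimal sketch of three DH descriptors, assuming the partial frequencies f and amplitudes a have already been estimated by the sinusoidal model (the partial-tracking step is not shown, and the exact formulas of the study may differ in detail):

```python
import numpy as np

def harmonic_descriptors(f, a, f0):
    """f, a: arrays of harmonic partial frequencies/amplitudes; f0: fundamental."""
    h = np.arange(1, len(f) + 1)          # harmonic ranks 1..H
    energy = a ** 2
    tot = energy.sum() + 1e-12

    # DH.OddEvenRatio: energy of odd-rank vs even-rank harmonics
    odd_even = energy[::2].sum() / (energy[1::2].sum() + 1e-12)

    # DH.Inharmonicity: energy-weighted deviation of the partials
    # from exact integer multiples of f0, normalized by f0
    inharm = (2.0 / f0) * np.sum(np.abs(f - h * f0) * energy) / tot

    # DH.Tristimulus: energy shares of partial 1, partials 2-4, partials 5+
    tristimulus = (energy[0] / tot,
                   energy[1:4].sum() / tot,
                   energy[4:].sum() / tot)

    return odd_even, inharm, tristimulus
```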
2. Feature extraction: DP, perceptual descriptors / DV, various descriptors (sound → window → FFT → perception: mid-ear filtering, Bark scale, Mel scale)
• DP.Centroid, DP.Spread, DP.Skewness, DP.Kurtosis
• DP.Slope, DP.Decrease, DP.Roll-off
• DP.Variation
• DP.Loudness, DP.RelativeSpecificLoudness
• DP.Sharpness, DP.Spread
• DP.Roughness, DP.FluctuationStrength
• DV.MFCC, DV.Delta-MFCC, DV.Delta-Delta-MFCC
• DV.SpectralFlatness, DV.SpectralCrest
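DV.SpectralFlatness and DV.SpectralCrest are simple to state: flatness is the geometric-to-arithmetic mean ratio of the power spectrum (near 1 for noise, near 0 for tonal sounds), crest is the peak-to-mean ratio. A minimal sketch on one frequency band (in practice both are computed per band, e.g. per octave, which is an assumption here):

```python
import numpy as np

def flatness_crest(mag):
    """mag: FFT magnitude spectrum (or one band of it)."""
    power = mag ** 2 + 1e-12
    geo_mean = np.exp(np.mean(np.log(power)))   # geometric mean
    ari_mean = np.mean(power)                   # arithmetic mean
    flatness = geo_mean / ari_mean              # ~1 = noise-like, ~0 = tonal
    crest = power.max() / ari_mean              # peakiness of the band
    return flatness, crest
```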
2. Feature extraction: Audio feature design
• No consensus on the amplitude and frequency scales to use
• All features are computed using the following scales:
  • frequency scale: linear / log / Bark bands
  • amplitude scale: linear / power / log
  • note: log(0.0) = -infinity → amplitudes are floored according to 24-bit precision
• Features must be independent of the recording level
  • normalization in the linear and power scales
  • normalization in the logarithmic scale
• Features must be independent of the sampling rate
  • maximum frequency taken into account: 11025/2 Hz
  • resampling (for the zero-crossing rate and auto-correlation)
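One plausible reading of these design notes, as a sketch: dB conversion floored at the 24-bit quantization level (so log(0) never occurs), plus level normalizations that are a division in the linear/power scales and a subtraction in the log scale. The exact constants and normalization conventions of the study are assumptions here:

```python
import numpy as np

def to_db(mag, n_bits=24):
    floor = 2.0 ** (-(n_bits - 1))       # smallest 24-bit amplitude step
    return 20 * np.log10(np.maximum(mag, floor))   # avoids log(0) = -inf

def normalize_level_linear(mag):
    # linear/power scale: a gain change is multiplicative, so divide it out
    return mag / (mag.sum() + 1e-12)

def normalize_level_db(mag_db):
    # log scale: a gain change is an additive offset, so subtract it
    return mag_db - mag_db.max()
```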
Outline: Feature extraction → [Feature selection] → Feature transform → Classification → Evaluation → Confusion matrix → Which features → Class organization
3. Feature selection algorithm (FSA)
• Problem with using a large number of features:
  • some features can be irrelevant for the given task
  • overfitting of the model to the training set (especially with LDA)
  • the classification models become difficult for humans to interpret
• Goal of the feature selection algorithm: find the minimal set of features such that
  • criterion 1) the features are informative with respect to the classes
  • criterion 2) the features provide non-redundant information
• Forms of feature selection algorithms:
  • embedded: the FSA is part of the classifier
  • filter: the FSA is distinct from the classifier and applied before it
  • wrapper: the FSA makes use of the classification results
3. Feature selection algorithm: IRMFSP
• Inertia Ratio Maximization using Feature Space Projection
• Criterion 1: features informative with respect to the classes
  • principle: "feature values for sounds belonging to a specific class should be separated from the values for all the other classes"
  • measure: for a specific feature i, the ratio r_i = B_i / T_i of the between-class inertia B to the total inertia T
• Criterion 2: features providing non-redundant information
  • orthogonalize the feature space after the selection of each new feature (Gram-Schmidt orthogonalization)
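A minimal sketch of IRMFSP under the two criteria above: at each step, select the feature maximizing the between-class / total inertia ratio, then Gram-Schmidt-orthogonalize all remaining feature columns against it. This is a reconstruction from the description on this slide, not the reference implementation:

```python
import numpy as np

def irmfsp(X, y, n_select):
    """X: (n_sounds, n_features) feature matrix, y: class labels."""
    X = X.astype(float).copy()
    classes = np.unique(y)
    selected = []
    for _ in range(n_select):
        mu = X.mean(axis=0)
        # criterion 1: per-feature between-class / total inertia ratio
        B = sum((y == k).mean() * (X[y == k].mean(axis=0) - mu) ** 2
                for k in classes)
        T = ((X - mu) ** 2).mean(axis=0) + 1e-12
        r = B / T
        r[selected] = -np.inf                     # never reselect a feature
        i = int(np.argmax(r))
        selected.append(i)
        # criterion 2: project the selected direction out of the feature space
        v = X[:, i] / (np.linalg.norm(X[:, i]) + 1e-12)
        X -= np.outer(v, v @ X)                   # Gram-Schmidt step
    return selected
```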
3. Feature selection algorithm: IRMFSP
• Example: sustained / non-sustained sound separation
  • computation of the B/T ratio for each feature
  • feature with the lowest ratio (r = 6.9e-6): specific loudness m8, mean
  • feature with the highest ratio (r = 0.58): energy temporal decrease
• First three selected dimensions:
  • 1st dim: temporal decrease
  • 2nd dim: spectral centroid
  • 3rd dim: temporal increase
Outline: Feature extraction → Feature selection → [Feature transform] → Classification → Evaluation → Confusion matrix → Which features → Class organization
4. Feature transformation: LDA
• Linear Discriminant Analysis: find the linear combination of features that maximizes the discrimination between classes: F → F'
• Total inertia T, between-class inertia B
• Transform the initial feature space F by a matrix U so as to maximize the ratio of between-class to total inertia, $\mathrm{tr}(U^T B U) / \mathrm{tr}(U^T T U)$
• Solution: the eigenvectors of $T^{-1} B$, associated to the eigenvalues (discriminative power)
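A minimal sketch of this transform: build T and B from the data, take the eigenvectors of T⁻¹B ranked by eigenvalue, and project. The small ridge added to T is an assumption to keep it invertible:

```python
import numpy as np

def lda_transform(X, y, n_dims):
    mu = X.mean(axis=0)
    Xc = X - mu
    # total inertia matrix (regularized so solve() is well-posed)
    T = Xc.T @ Xc / len(X) + 1e-9 * np.eye(X.shape[1])
    B = np.zeros_like(T)
    for k in np.unique(y):
        d = X[y == k].mean(axis=0) - mu
        B += (y == k).mean() * np.outer(d, d)     # between-class inertia
    # eigenvectors of T^{-1} B, ranked by discriminative power
    eigval, eigvec = np.linalg.eig(np.linalg.solve(T, B))
    order = np.argsort(eigval.real)[::-1]
    return Xc @ eigvec[:, order[:n_dims]].real
```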
Outline: Feature extraction → Feature selection → Feature transform → [Classification] → Evaluation → Confusion matrix → Which features → Class organization
5. Class modeling: flat classifiers
• Flat Gaussian classifier (F-GC)
  • "flat" = all classes considered at the same level
  • training: model each class k by a multi-dimensional Gaussian pdf (mean vector, covariance matrix)
  • evaluation: Bayes formula
• Flat KNN classifier (F-KNN)
  • instance-based algorithm
  • assign to the input sound the majority class among its K nearest neighbors in the feature space
  • the Euclidean distance raises the question of axis weighting
  • applied to the output of the LDA (implicit weighting of the axes)
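A minimal sketch of the F-GC idea: one multivariate Gaussian per class, Bayes rule at evaluation time. The covariance regularization and the use of class frequencies as priors are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

class FlatGaussianClassifier:
    def fit(self, X, y):
        self.models = {}
        for k in np.unique(y):
            Xk = X[y == k]
            cov = np.cov(Xk.T) + 1e-6 * np.eye(X.shape[1])  # regularized
            self.models[k] = ((y == k).mean(),              # prior p(k)
                              multivariate_normal(Xk.mean(axis=0), cov))
        return self

    def predict(self, x):
        # Bayes: argmax_k p(x|k) p(k); the evidence p(x) is constant over k
        return max(self.models,
                   key=lambda k: self.models[k][0] * self.models[k][1].pdf(x))
```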
5. Class modeling: hierarchical classifiers
• Hierarchical Gaussian classifier (H-GC)
  • training: a tree of flat Gaussian classifiers; each node has its own FSA, feature transform, and F-GC
  • the tree structure is supervised (unlike decision trees)
  • only the subset of sounds belonging to the classes of the current node is used
  • evaluation: the local probability decides which branch of the tree to follow
• Advantages of H-GC
  • learning facilities: it is easier to learn differences within a small subset of classes
  • reduced class confusion: benefits from the higher recognition rate at the upper levels of the tree
• Hierarchical KNN classifier (H-KNN)
• Decision trees: Binary Entropy Reduction Tree (BERT), C4.5, Partial Decision Tree (PART)
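A minimal sketch of the H-GC evaluation walk, assuming the F-GC sketched above as the node classifier; the Node structure and field names here are illustrative assumptions, not the study's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    classifier: object = None                        # e.g. a FlatGaussianClassifier
    transform: Callable = lambda x: x                # node-local FSA + LDA projection
    children: dict = field(default_factory=dict)     # branch label -> Node
    label: Optional[str] = None                      # set on leaf nodes only

def classify_hierarchical(node, x):
    # follow the most probable branch at each node until a leaf is reached
    while node.children:
        branch = node.classifier.predict(node.transform(x))
        node = node.children[branch]
    return node.label
```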
Outline: Feature extraction → Feature selection → Feature transform → Classification → [Evaluation] → Confusion matrix → Which features → Class organization
6. Evaluation: Taxonomy used
• Three levels:
  • T1: sustained / non-sustained sounds
  • T2: instrument families
  • T3: instrument names
6. Evaluation: Test set
• 6 databases:
  • Ircam Studio OnLine (1323 sounds, 16 instruments)
  • Iowa University database (816 sounds, 12 instruments)
  • McGill University database (585 sounds, 23 instruments)
  • Microsoft "Musical Instruments" CD-ROM (216 sounds, 20 instruments)
  • two commercial databases: Pro (532 sounds, 20 instruments) and Vi (691 sounds, 18 instruments)
  • total: 4163 sounds
• Notes:
  • 27 instruments have been considered
  • a large pitch range has been considered (4 octaves on average)
  • no muted, martelé, or staccato sounds
6. Evaluation: Evaluation process
• 1) Random 66%/33% partition of the database (50 sets)
• 2) One to One (O2O) [Livshin 2003]: each database is used in turn to classify all the other databases
• 3) Leave One Database Out (LODO) [Livshin 2003]: all databases except one are used in turn to classify the remaining one
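A minimal sketch of the three protocols, assuming per-sound database labels db alongside the features X and classes y; train_eval stands for any train-then-score routine (e.g. fitting the F-GC above and returning accuracy) and is an assumed callable:

```python
import numpy as np

def random_66_33(X, y, train_eval, n_sets=50, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_sets):
        idx = rng.permutation(len(y))
        cut = int(0.66 * len(y))
        tr, te = idx[:cut], idx[cut:]
        scores.append(train_eval(X[tr], y[tr], X[te], y[te]))
    return np.mean(scores)

def one_to_one(X, y, db, train_eval):
    # each database classifies every other one (6 * 5 = 30 experiments)
    names = np.unique(db)
    return np.mean([train_eval(X[db == a], y[db == a], X[db == b], y[db == b])
                    for a in names for b in names if a != b])

def leave_one_database_out(X, y, db, train_eval):
    # all databases but one train; the left-out one is classified
    return np.mean([train_eval(X[db != d], y[db != d], X[db == d], y[db == d])
                    for d in np.unique(db)])
```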
6. Evaluation: Results, O2O and LODO
• O2O (mean value over the 30 (6×5) experiments)
• Discussion:
  • low recognition rate for O2O compared to the 66%/33% partition → a generalization problem?
  • the system mainly learns the instrument instance instead of the instrument (each database contains a single instance of each instrument)
• LODO (mean value over the 6 left-out databases)
  • goal: increase the number of instances of each instrument
  • how: by combining several databases
Outline: Feature extraction → Feature selection → Feature transform → Classification → Evaluation → [Confusion matrix] → Which features → Class organization
6. Evaluation: Confusion matrix
• Low confusion between sustained / non-sustained sounds
• The largest confusions occur inside each instrument family
• The lowest recognition rates correspond to the smallest training sets
• Confusion between piano and guitar/harp
• Cross-family confusions:
  • cornet → bassoon
  • cornet → English horn
  • flute → clarinet
  • oboe → flute
  • trombone → flute
Outline: Feature extraction → Feature selection → Feature transform → Classification → Evaluation → Confusion matrix → [Which features] → Class organization
6. Evaluation: Main selected features
• by the FSA (IRMFSP)
• by decision tree (C4.5)
• by decision tree with grouped decisions (PART)
Outline: Feature extraction → Feature selection → Feature transform → Classification → Evaluation → Confusion matrix → Which features → [Class organization]
7. Instrument Class Similarity?
• Goal: check that the proposed tree structure corresponds to a natural class organization (most studies use Martin's hierarchy)
• How?
  • 1) check the groupings among the decision tree leaves
  • 2) MDS on acoustic features [Herrera, AES 114th]
• Compute the dissimilarity between each pair of classes
  • how? compute the between-group F-matrix between the class models
• Observe the dissimilarity between the classes
  • how? MDS (multi-dimensional scaling) analysis
  • MDS preserves the distances between the data as much as possible and allows representing them in a lower-dimensional space
  • MDS is usually applied to dissimilarity judgements (timbre similarity); here it is applied to acoustic features
  • MDS with Kruskal's STRESS formula 1 scaling method, 3-dimensional space
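A minimal sketch of this analysis step: a precomputed class-to-class dissimilarity matrix (here the between-group F-matrix) embedded in 3 dimensions. scikit-learn's metric MDS (SMACOF) is used as a stand-in for the Kruskal STRESS-1 scaling of the study:

```python
import numpy as np
from sklearn.manifold import MDS

def embed_classes(dissim, n_dims=3, seed=0):
    """dissim: (n_classes, n_classes) symmetric dissimilarity matrix."""
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(dissim)   # (n_classes, 3) class coordinates
```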
7. Instrument Class Similarity
• Clusters?
  • non-sustained sounds
  • bowed-string sounds
  • brass sounds (TRPU?)
  • a mix between single/double-reed and brass instruments
7. Instrument Class Similarity
• Dimension 1: separates sustained from non-sustained sounds
  • negative values: PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP
  • → attack time, decrease time
• Dimension 2: brightness
  • dark sounds: TUBB, BSN, TBTB, FHOR
  • bright sounds: PICC, CLA, FLUT
  • problem with DBL?
• Dimension 3: ?
  • separation of the bowed strings (VLN, VLA, CELL, DBL)
  • amount of modulation?
Conclusion
• State of the art:
  • Martin [1999]: 76% (family), 39% for 14 instruments
  • Eronen [2001]: 77% (family), 35% for 16 instruments
• This study: 85% (family), 64% for 23 instruments
  • the increased recognition rates are mainly explained by the use of new features
• Perspectives:
  • derive the tree structure automatically (analysis of decision trees?)
  • test other classification algorithms (GMM, SVM, …)
  • test the system on other sound classes (non-instrumental sounds, sound FX)
  • extend the system to musical phrases
  • extend the system to polyphonic sounds
  • extend the system to multi-source sounds
• Links: http://www.cuidado.mu http://www.cs.waikato.ac.nz/ml/weka/