Instrument Identification in Polyphonic Audio Sarah Smith: Presentation for ECE 492
Defining the task • GOAL: given a (single-channel) recording of polyphonic music, identify the instruments that are playing • Is full source separation necessary? • How do we define 'instrument'? • Grand piano vs. upright piano? • Electric vs. acoustic bass?
Why is this useful? • Annotation in musical databases • Enables searching by instrument • Makes it possible to group similar types of ensembles • Can be combined with source separation to extract individual instruments • For editing or remixing • Removing a solo over its accompaniment
What are the challenges? • More parts => more possible combinations • 10 possible instruments => 210 possible quartets (10 choose 4; see the check below) • Harder to differentiate similar (or identical) instruments • String quartet • Often faces many of the same challenges as source separation
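A quick sanity check of the combinatorics claim (choosing 4 distinct instruments out of 10, order irrelevant):

```python
import math
print(math.comb(10, 4))  # 210
```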
Possible approaches • Start by performing source separation, then identify an instrument to match each source • Relies heavily on accurate streaming in the source separation stage • Each separated source can then be identified independently • Reinterpret the task as "ensemble identification" and identify the group as a whole • This greatly expands the number of possibilities • Does not rely on source separation techniques
Instrument Recognition in Polyphonic Music Based on Automatic Taxonomies Slim Essid, Gaël Richard, and Bertrand David
Proposed System • Develop a taxonomy of musical instrument classes • Using the identified features, cluster instrument classes based on their separation in the feature space • Develop a system of binary classifiers to identify the instrumentation of a test piece • Feature selection used at each node to determine the optimal features for classification
Audio Features • In order to find the optimal feature set, calculate everything and then choose the best ones • Combination of ~100 spectral, temporal, and statistical features • Could then use PCA or similar to reduce the dimensionality of the feature space • This results in basis vectors that are combinations of many calculated features.
Creating a taxonomy • Perform principal component analysis (PCA) on the extracted features to identify the key components • Calculate a distance between each pair of instrument classes • Similar classes are clustered together (see the clustering sketch below) • Iterate through multiple levels of the taxonomy
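A minimal sketch of the clustering step, not taken from the paper: given a precomputed symmetric matrix of pairwise class distances (e.g. the divergence or Bhattacharyya distances from the next slide), agglomerative clustering yields the taxonomy levels. The class labels and distance values here are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

classes = ['Bs', 'Dr', 'Gt', 'Pn', 'Tr', 'Ts']
# Illustrative pairwise class distances (symmetric, zero diagonal):
D = np.array([
    [0.0, 2.1, 1.0, 1.4, 2.5, 2.3],
    [2.1, 0.0, 2.0, 1.9, 2.6, 2.4],
    [1.0, 2.0, 0.0, 1.2, 2.2, 2.1],
    [1.4, 1.9, 1.2, 0.0, 2.0, 1.9],
    [2.5, 2.6, 2.2, 2.0, 0.0, 0.8],
    [2.3, 2.4, 2.1, 1.9, 0.8, 0.0],
])
Z = linkage(squareform(D), method='average')  # linkage expects the condensed form
# Cutting the tree at successive heights gives the levels of the taxonomy:
print(dict(zip(classes, fcluster(Z, t=2, criterion='maxclust'))))
```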
Defining a distance metric • Using the principal-component features, a distance can be calculated between each pair of instrument classes • Divergence distance • Bhattacharyya distance
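For reference, the Bhattacharyya distance between two Gaussian class models $\mathcal{N}(\mu_1,\Sigma_1)$ and $\mathcal{N}(\mu_2,\Sigma_2)$ has the standard closed form below; that the paper uses exactly this Gaussian variant is an assumption.

```latex
D_B = \frac{1}{8}\,(\mu_1-\mu_2)^{\top}\,\Sigma^{-1}\,(\mu_1-\mu_2)
    + \frac{1}{2}\,\ln\frac{|\Sigma|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}},
\qquad \Sigma = \frac{\Sigma_1+\Sigma_2}{2}
```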
Resulting taxonomy • Bs = Double bass • Dr = Drums • Eg = Electro-acoustic guitar • Gt = Spanish guitar • Pn = Piano • Pr = Percussion • Tr = Trumpet • Ts = Tenor sax • Vf, Vm = Female, male voice • V = Voice • W = Wind instrument • M = Melody (W, Vm, or Eg)
Learned Classifiers • Given the taxonomy, an unknown ensemble can be identified by applying a series of classifiers, one at each node • Optimizing each classifier individually is a lot of work • Want an optimization method that can be easily generalized
Feature Selection • Starting with the full set of features (D), we want to choose a subset (d) • Choose an optimization criterion and search the feature set to find the best features • Desirable characteristics of a chosen feature: • Varies between instrument classes • Varies little within an instrument class
Feature Optimization • Variables: • M = number of classes (= 2 for pairwise feature selection) • N_m = number of training instances of class m • N = total number of training feature vectors • m(i) = mean of feature i over the whole training set • m_m(i) = mean of feature i for class m • x_{n_m}(i) = value of feature i in the n_m-th feature vector of class m • Each feature is scored by the ratio of its variation between classes to its variation within classes (up to normalization constants):

$$F(i) = \frac{\sum_{m=1}^{M} N_m \,\big(m_m(i) - m(i)\big)^2}{\sum_{m=1}^{M} \sum_{n_m=1}^{N_m} \big(x_{n_m}(i) - m_m(i)\big)^2}$$
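A minimal numpy sketch of this score, not the paper's code: it ranks candidate features by the between-class/within-class variance ratio defined above. Shapes and data are illustrative.

```python
import numpy as np

def fisher_ratio(X, y):
    """X: (N, D) feature matrix, y: (N,) class labels -> (D,) per-feature scores."""
    mu = X.mean(axis=0)                           # overall mean m(i)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for m in np.unique(y):
        Xm = X[y == m]
        mu_m = Xm.mean(axis=0)                    # class mean m_m(i)
        between += len(Xm) * (mu_m - mu) ** 2     # variation between classes
        within += ((Xm - mu_m) ** 2).sum(axis=0)  # variation within the class
    return between / within                       # high = discriminative feature

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # ~100 candidate features
y = rng.integers(0, 2, 200)       # M = 2 for pairwise selection
best_d = np.argsort(fisher_ratio(X, y))[::-1][:10]  # keep the d best features
print(best_d)
```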
Testing the classifier • Data used included jazz ensembles ranging from solo to quartet • At each level of the taxonomy, a "one vs. one" binary classifier chooses which cluster the sound belongs to • For nodes with more than two possible classes, a majority vote is used to decide (see the sketch below)
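A sketch of the node classifier in scikit-learn terms (an assumption, not the authors' implementation): OneVsOneClassifier trains one binary classifier per class pair and decides by majority vote, matching the scheme described above. The training data here is synthetic.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((300, 20))    # feature vectors at this taxonomy node
y_train = rng.integers(0, 3, 300)  # three candidate clusters at the node

# One binary SVM per class pair; prediction is the majority-vote winner:
node_clf = OneVsOneClassifier(SVC(kernel='rbf')).fit(X_train, y_train)
print(node_clf.predict(rng.random((1, 20))))
```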
Correct common confusions • Use knowledge about which classes are commonly confused to go back and reevaluate classification • Example: solo drums are often classified as larger ensembles that include drums • If a sample is classified as an ensemble with drums, check to see if the second best fit is solo drums
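A minimal sketch of that correction rule, assuming per-class scores are available for a sample; the class names and the threshold-free logic are illustrative, not the paper's exact procedure.

```python
def correct_drum_confusion(scores):
    """scores: dict of class name -> classifier score (higher is better)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    # Ensembles that include drums are often confused with solo drums,
    # so fall back to solo drums when it is the runner-up:
    if 'drums' in best and best != 'solo_drums' and second == 'solo_drums':
        return 'solo_drums'
    return best

print(correct_drum_confusion(
    {'piano_drums': 0.41, 'solo_drums': 0.39, 'sax_trio': 0.20}))  # -> solo_drums
```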
Overall Performance • The proposed system achieves 53% accuracy, versus 47% for the baseline model • The baseline (no hierarchy) model performs better on a certain subset of classes • Classification accuracy at the top level of the taxonomy was 65%
Summary • Treat each possible ensemble as an instrument for the purposes of classification • The taxonomy structure gives classification into broad groupings of ensembles • The binary classifier chooses between a smaller number of options at each node • However, a missed top-level classification cannot be corrected later
Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation Toni Heittola, Anssi Klapuri, Tuomas Virtanen
Proposed Method • First perform source separation on the input audio (NMF + Source Filter Model) • Run instrument detection on each of the separated streams (GMM)
Source Filter Model • Sound coming from an instrument can be viewed as a harmonic excitation (source) filtered according to the acoustical properties of the instrument (filter) • Example: speech • Source: glottal pulse • Filter: vocal tract
Source Filter Model Applied • Excitation spectrum contains the fundamental frequency plus its integer-multiple overtones, all with equal weights • Filter modeled with filter-bank coefficients on the mel scale
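A tiny sketch of such an excitation spectrum, assuming equal-weight harmonics placed on an FFT bin grid (an illustration, not the paper's exact construction):

```python
import numpy as np

def excitation_spectrum(f0, sr=44100, n_fft=4096):
    """Equal-weight comb at f0 and its integer multiples, on the FFT bin grid."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    e = np.zeros_like(freqs)
    for h in np.arange(f0, sr / 2, f0):        # fundamental + overtones
        e[np.argmin(np.abs(freqs - h))] = 1.0  # equal weights
    return e

print(excitation_spectrum(440.0).sum())  # number of harmonics below Nyquist
```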
Creating an NMF Model • A naïve NMF approach to both separating the audio streams and labeling instruments would require a large number of parameters • # of basis vectors = (# of instruments) × (# of pitches) • Each basis vector has the same dimension as the FFT • If we limit ourselves to cases described by the source-filter model, we only need • # of basis vectors = (# of instruments) + (# of pitches) • Each instrument vector is characterized by ~30 filter coefficients • Each excitation is determined by a single pitch value (see the parameter-count sketch below)
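A back-of-the-envelope comparison of the parameter counts implied by the two formulations; the sizes below (FFT length, 30 filter coefficients, etc.) are illustrative assumptions.

```python
n_instruments, n_pitches = 10, 60
fft_bins, n_filter_coeffs = 2049, 30

# Naive NMF: one full-spectrum basis vector per (instrument, pitch) pair:
naive = n_instruments * n_pitches * fft_bins
# Source-filter model: ~30 coefficients per instrument filter,
# plus a single pitch value per excitation:
source_filter = n_instruments * n_filter_coeffs + n_pitches

print(naive, source_filter)  # 1229400 vs. 360
```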
Signal Model • The observed magnitude spectrum in frame t is modeled as a sum over notes n and instruments i, with the excitation spectrum (depends on the note) shaped by the filter model (depends on the instrument):

$$\hat{x}_t(k) = \sum_{n}\sum_{i} g_{n,i,t}\, e_{n,t}(k)\, H_i(k), \qquad H_i(k) = \sum_{j} c_{i,j}\, a_j(k)$$

where $a_j(k)$ are the individual filter frequency responses, $c_{i,j}$ the filter-bank coefficients, $H_i(k)$ the instrument filter, and $g_{n,i,t}$ the mixture weights.
Estimating the Coefficients • The filter-bank responses a_j(k) and the excitation spectra e_{n,t}(k) are known • Multipitch estimation is used to find the excitation spectra • A multiplicative update can then be used to find values for c_{i,j} and g_{n,i,t} within the mixture (a sketch follows)
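A minimal numpy sketch of such multiplicative updates, assuming a least-squares cost; this illustrates the technique and is not the authors' algorithm or code. Here the excitations E stand in for the output of multipitch estimation, and all array names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 257, 50        # frequency bins, time frames
N, I, J = 12, 3, 20   # pitches, instruments, filter-bank bands
eps = 1e-12

X = rng.random((K, T))     # observed magnitude spectrogram
A = rng.random((J, K))     # fixed filter-bank responses a_j(k)
E = rng.random((N, T, K))  # fixed excitations e_{n,t}(k) (from multipitch estimation)
C = rng.random((I, J))     # filter coefficients c_{i,j} (learned)
G = rng.random((N, I, T))  # mixture weights g_{n,i,t} (learned)

def model(C, G):
    H = C @ A  # instrument filters H_i(k), shape (I, K)
    # x_hat[k,t] = sum over n,i of g[n,i,t] * e[n,t,k] * H[i,k]
    return np.einsum('nit,ntk,ik->kt', G, E, H), H

for _ in range(50):
    Xhat, H = model(C, G)
    # Gain update: g <- g * (sum_k X*e*H) / (sum_k x_hat*e*H)
    G *= np.einsum('kt,ntk,ik->nit', X, E, H) / \
         (np.einsum('kt,ntk,ik->nit', Xhat, E, H) + eps)
    Xhat, _ = model(C, G)
    # Filter update: c <- c * (sum_{k,t,n} X*g*e*a) / (sum_{k,t,n} x_hat*g*e*a)
    C *= np.einsum('kt,nit,ntk,jk->ij', X, G, E, A) / \
         (np.einsum('kt,nit,ntk,jk->ij', Xhat, G, E, A) + eps)
```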
Optional Streaming Model • The basic model is very general • Any instrument can play any or all of the notes in a single frame. • Each note can be played by more than one instrument • If we know that each instrument only plays one note, then we can form streams and identify the instrument for each stream
Identifying the individual instruments • Using the source separation output, reconstruct the sound for each identified instrument source • After extracting MFCCs from the separated tracks, a Gaussian mixture model can be used to identify the instrument associated with each track
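A minimal sketch of the recognition stage with librosa and scikit-learn: MFCCs from each separated track are scored against one GMM per instrument. The file names, GMM size, and training setup are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

# Train one GMM per instrument on isolated recordings (hypothetical files):
train_files = {'piano': ['piano_train.wav'], 'trumpet': ['trumpet_train.wav']}
gmms = {}
for name, files in train_files.items():
    feats = np.vstack([mfcc_frames(f) for f in files])
    gmms[name] = GaussianMixture(n_components=8, covariance_type='diag').fit(feats)

# Label a separated track by the highest average log-likelihood:
track = mfcc_frames('separated_stream0.wav')
print(max(gmms, key=lambda name: gmms[name].score(track)))
```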
Evaluation • Test signals were synthesized as random combinations of instruments, each playing a random series of notes • Polyphony ranges from one to six parts • The number of instruments is known
Results • Accuracy evaluated using the F-measure (defined below) • Separation quality measured using SNR (dB) • [Figure: classification accuracy and separation SNR by degree of polyphony]
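For reference, the balanced F-measure combines precision $P$ and recall $R$ (that the paper uses the balanced variant is an assumption):

```latex
F = \frac{2PR}{P + R}
```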
Observations • The source-filter model improves instrument separation • It provides additional information that can be used in streaming • Randomly selected notes and instruments tend to be easier to separate than real music • Only 62% accuracy for the monophonic case is rather low
Comparing the approaches • Decomposing the task into source separation + instrument ID breaks the problem down into previously addressed subproblems • Identifying the whole ensemble at once doesn't require information about the score or the number of parts • But it is difficult to extend to arbitrary ensembles
Conclusions • Much work remains to be done in this area • Existing approaches achieve only about 50–70% accuracy on this task, even with a large training set • Similarly trained humans would likely perform much better • Both approaches would struggle with groups of similar instruments (e.g., a string quartet)