A MISSING FEATURE APPROACH TO INSTRUMENT IDENTIFICATION IN POLYPHONIC MUSIC
Jana Eggink and Guy J. Brown
University of Sheffield
Automatic Music Transcription
• input: audio recording
• output: score or other symbolic representation
• needed (for every note; see the sketch after this list):
  • pitch
  • start and duration
  • instrument
  • extras: key (C major), meter (4/4), bars, loudness, expression...
• useful for:
  • musicologists
  • musicians
  • music information retrieval
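As a concrete illustration of the per-note information listed above, here is a minimal data sketch in Python; the type and field names (NoteEvent, pitch, onset, ...) are hypothetical and not taken from the original system.

```python
# A minimal sketch of the per-note information a transcription needs.
# All names here are illustrative assumptions, not the authors' design.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    pitch: float      # fundamental frequency in Hz (or a MIDI note number)
    onset: float      # start time in seconds
    duration: float   # length in seconds
    instrument: str   # e.g. "oboe", "violin"

# A tiny two-note "score" as a list of note events.
score = [
    NoteEvent(pitch=440.0, onset=0.0, duration=0.5, instrument="flute"),
    NoteEvent(pitch=415.3, onset=0.5, duration=1.0, instrument="clarinet"),
]
```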
Instrument Identification
possible clues:
• method of excitation (hitting, blowing, plucked or bowed strings) causes:
  • noise during the onset
  • delayed entry of individual partials during the onset
  • spectral fluctuations during the steady state
• resonance properties of the instrument body mostly affect the steady state:
  • energy distribution among high and low partials (see the sketch after this list)
  • formant regions
  • spectral bandwidth
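To make one of these steady-state clues concrete, here is a minimal sketch of how energy distribution among low and high partials could be measured; the function name, the split point (first 5 partials), and the amplitude values are illustrative assumptions.

```python
# A minimal sketch of one steady-state clue: the share of energy carried
# by the low partials versus the high ones. Inputs are hypothetical
# partial amplitudes, not measurements from the paper.
import numpy as np

def low_partial_energy_ratio(partial_amplitudes, n_low=5):
    """Ratio of energy in the lowest n_low partials to the total energy."""
    energy = np.square(np.asarray(partial_amplitudes, dtype=float))
    return energy[:n_low].sum() / energy.sum()

# A tone with strong upper partials (lower ratio) vs. a tone dominated
# by its lowest partials (higher ratio); amplitudes are made up.
print(low_partial_energy_ratio([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]))
print(low_partial_energy_ratio([1.0, 0.3, 0.1, 0.05, 0.02]))
```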
Example Spectrograms
[figure: spectrograms of an oboe tone and a cello tone]
Human Instrument Identification
• listeners use different clues from the onset and the steady state; an individual clue such as the static spectrum can be enough to identify some, but not all, instruments
• the onset seems most relevant for discriminating instrument families
• performance is better on musical phrases than on single tones
• experts are better than non-experts
Computer Instrument Identification

JC Brown et al. (2001):
• GMM classifier
• frame-based cepstral coefficients
• 4 woodwinds (flute, clarinet, oboe, saxophone)
• realistic, monophonic phrases
• computer: 60% correct on average, 80% with the best parameter choice
• humans: 85%

KD Martin (1999):
• hierarchical classification scheme
• different features, both temporal and spectral
• 27 different instruments
• realistic, monophonic phrases and single notes
• computer: 48% instrument correct, 75% instrument family
• humans: 57% instrument correct, 95% instrument family
Polyphonic

Kashino & Murase (1999):
• time-domain approach
• example waveforms stored for each note of each instrument
• best match found using adaptive filtering techniques
• iterative subtraction scheme
• 3 instruments: flute, violin, piano
• specially made recording
• F0s and onset times supplied
• 68% correct (max. polyphony 3)

Kinoshita et al. (1999):
• frequency-domain approach
• features measuring temporal variation at the onset and spectral energy distribution
• colliding partials are identified and the corresponding feature values are (mostly) ignored
• 3 instruments: clarinet, violin, piano
• random chord combinations made from 2 isolated tones
• 70% correct (78% if correct F0s were supplied)
Our System
• the missing feature approach works for speech recognition in the presence of noise
• GMMs trained with spectral features perform well on realistic monophonic music
• GMMs have also been used in combination with a missing feature approach for speaker identification in noise
⇒ use a GMM classifier in combination with a missing feature approach for instrument recognition in realistic, polyphonic music
F0-Analysis
• iterative approach based on harmonic sieves (Scheffers, 1983): each candidate F0 defines a sieve of harmonic positions, and the best fitting sieve determines the F0
[figure: a badly fitting sieve vs. the best fitting sieve, which determines the F0]
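Here is a minimal sketch of the harmonic-sieve idea for a single F0; the scoring rule (matched peak amplitude minus a penalty for empty sieve slots), the tolerance, and the toy peak list are illustrative assumptions, and the paper's iterative multi-F0 procedure is not reproduced.

```python
# A minimal sketch of harmonic-sieve F0 estimation (after Scheffers, 1983):
# each candidate F0 places a sieve at its harmonic positions; the sieve
# that captures the most spectral-peak amplitude wins. Penalizing empty
# sieve slots keeps subharmonics (e.g. F0/2) from tying with the true F0.
import numpy as np

def sieve_score(f0, peak_freqs, peak_amps, tol=0.03, miss_penalty=0.5):
    """Amplitude captured by the harmonic sieve of f0, minus a penalty
    for every sieve slot (within the peak range) that finds no partial."""
    score, h = 0.0, 1
    while h * f0 <= peak_freqs.max() * (1 + tol):
        target = h * f0
        dist = np.abs(peak_freqs - target) / target  # relative distance
        if dist.min() < tol:
            score += peak_amps[dist.argmin()]
        else:
            score -= miss_penalty
        h += 1
    return score

def estimate_f0(peak_freqs, peak_amps, candidates):
    scores = [sieve_score(f0, peak_freqs, peak_amps) for f0 in candidates]
    return candidates[int(np.argmax(scores))]

# Peaks of an idealized 220 Hz tone; 110 Hz is a "bad fitting sieve"
# because its odd harmonics find no support.
freqs = np.array([220.0, 440.0, 660.0, 880.0])
amps = np.array([1.0, 0.8, 0.6, 0.4])
print(estimate_f0(freqs, amps, candidates=[110.0, 220.0, 440.0]))  # -> 220.0
```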
Missing Feature Estimation
• finding reliable and unreliable features is one of the main problems
• instrument tones have an approximately harmonic overtone series
• based on the extracted F0s, all frequency regions where a partial from a non-target tone is found are marked as unreliable and excluded from the recognition process (see the sketch below)
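A minimal sketch of this masking step follows; the function name, the 60 Hz band width, and the number of harmonics considered per interfering tone are illustrative assumptions.

```python
# A minimal sketch of reliability-mask estimation: given the F0s of the
# interfering (non-target) tones, mark every frequency band that contains
# one of their partials as unreliable.
import numpy as np

def reliability_mask(band_centers, interfering_f0s,
                     bandwidth=60.0, n_harmonics=20):
    """Boolean mask: True = reliable band, False = band overlapped by a
    partial of a non-target tone."""
    mask = np.ones(len(band_centers), dtype=bool)
    for f0 in interfering_f0s:
        for h in range(1, n_harmonics + 1):
            overlapped = np.abs(band_centers - h * f0) < bandwidth / 2
            mask[overlapped] = False
    return mask

bands = np.arange(30.0, 6000.0, 60.0)   # 60 Hz-wide, linearly spaced bands
mask = reliability_mask(bands, interfering_f0s=[415.3])
print(mask.sum(), "of", len(bands), "bands remain reliable")
```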
Features
• local spectral features are required for the missing feature approach
• frame-based (exact onset detection is hard in polyphonic music)
• energy in narrow frequency bands (60 Hz wide)
• linear spacing, corresponding to the linear spacing of partials
(a feature-extraction sketch follows below)
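Here is a minimal sketch of extracting such frame-based band energies from a short-time spectrum; the frame length, sample rate, windowing, and log compression are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch of frame-based spectral features: the energy of one
# windowed frame summed in linearly spaced 60 Hz-wide frequency bands.
import numpy as np

def band_energies(frame, sr, bandwidth=60.0):
    """Energy per linearly spaced frequency band for one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = np.arange(0.0, sr / 2, bandwidth)  # lower edge of each band
    return np.array([spectrum[(freqs >= lo) & (freqs < lo + bandwidth)].sum()
                     for lo in edges])

sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440 * t)                   # synthetic 440 Hz tone
features = np.log(band_energies(frame, sr) + 1e-12)   # log band energies
print(features.shape)
```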
Example Features with Mask
[figure: band-energy features for the target tone (violin D), the non-target tone (oboe G sharp), and their mixture, each shown with the reliability mask applied]
GMMs
• approximate a distribution by a combination of individual Gaussians
[figure: a 2-dimensional distribution modeled by a GMM consisting of 3 individual Gaussians]
• means and covariances trained by the EM algorithm
(a training sketch follows below)
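As a concrete example of EM training, here is a minimal sketch using scikit-learn's GaussianMixture on synthetic data; the library choice, diagonal covariances, component count, and data are our assumptions, not the authors' exact training setup.

```python
# A minimal sketch of fitting a GMM by EM with scikit-learn. Diagonal
# covariances match the feature-independence assumption used on the
# next slide; the 2-D training data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D feature vectors drawn from three clusters.
data = np.vstack([rng.normal(loc=m, scale=0.5, size=(200, 2))
                  for m in ([0, 0], [3, 1], [1, 4])])

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(data)                    # means and covariances trained by EM
print(gmm.means_.round(1))       # recovered cluster means
```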
GMMs with Missing Features

The probability density function (pdf) of an observed D-dimensional spectral feature vector $\mathbf{x}$ is modeled as:

$$p(\mathbf{x}) = \sum_{i=1}^{N} p_i \, \Phi_i(\mathbf{x}; \boldsymbol{\mu}_i, \Sigma_i)$$

Assuming feature independence, this can be rewritten as:

$$p(\mathbf{x}) = \sum_{i=1}^{N} p_i \prod_{j=1}^{D} \Phi(x_j; \mu_{ij}, \sigma_{ij}^2)$$

Approximating the pdf from reliable data only leads to:

$$p(\mathbf{x}) \approx \sum_{i=1}^{N} p_i \prod_{j \in M'} \Phi(x_j; \mu_{ij}, \sigma_{ij}^2)$$

where $N$ = number of Gaussians in the mixture model, $p_i$ = mixture weight, $\Phi_i$ = Gaussian with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$, $\mu_{ij}$ and $\sigma_{ij}^2$ = mean and variance of feature $j$ in component $i$, and $M'$ = subset of reliable features in mask $M$.
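A minimal sketch of the last formula follows: with diagonal covariances the per-component density factorizes over dimensions, so the product can simply be restricted to the reliable dimensions in $M'$. All parameter values are illustrative.

```python
# A minimal sketch of the missing-feature GMM likelihood: evaluate the
# log-density using only the dimensions marked reliable in the mask.
import numpy as np

def log_gauss_1d(x, mean, var):
    """Log of a univariate Gaussian density."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def mf_log_likelihood(x, mask, weights, means, variances):
    """log p(x) over reliable dimensions only.

    weights: (N,); means, variances: (N, D); x, mask: (D,)
    """
    per_dim = log_gauss_1d(x[None, :], means, variances)   # (N, D)
    per_comp = (per_dim * mask[None, :]).sum(axis=1)       # reliable dims only
    return np.logaddexp.reduce(np.log(weights) + per_comp)

# Two-component, three-dimensional toy model.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
variances = np.ones((2, 3))
x = np.array([0.1, 5.0, -0.2])           # dimension 1 is corrupted
mask = np.array([True, False, True])     # ...so it is excluded
print(mf_log_likelihood(x, mask, weights, means, variances))
```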
Results: Monophonic
• GMMs trained for 5 instruments: flute, clarinet, oboe, violin, cello
• realistic monophonic phrases (3-4 per instrument): 83% correct
• single notes: 66% instrument correct, 85% instrument family correct
Random 2-Tone Chords
• correct F0s were provided
• 49% instrument correct, 72% instrument family correct
Realistic Duet Recording
• duet for flute and clarinet in A by H. Villa-Lobos
• F0s extracted by the system
[figure: system output, fundamental frequency (Hz) over time (frames), compared with the original score]
F0s according to the score, in Hz:
• flute: 415 - 415 - 415 - 622 - 622
• clarinet in A: 208 - 185 - 175 - 277 - 294 - 247 - 220 - 208
Conclusions
• looks promising for small ensembles
• works with realistic stimuli

Future Work
• include temporal information
• idea: one HMM for every instrument tone
• combined either with a missing feature approach comparable to the one used here, or with spectral subtraction based on templates