Automatic Transcription System of Kashino et al. MUMT 611 Doug Van Nort
Objective • To give an overview of this particular technique for automatic transcription • Original implementation: ICMC 1993
Introduction • Sound Source Separation System • Extracting a sound source in the presence of multiple sources • Physical vs. perceptual sound source • Physical: the actual source itself • Perceptual: what humans hear as a single source • Ex: a piano recording played over a loudspeaker (the loudspeaker is the physical source; the piano is the perceptual one)
Perceptual Sound Source Separation • Creating a system that simulates the human perceptual system • Extraction of parameters based on a perceptual model, then grouping of those parameters based on certain criteria
This PSSS System • Kashino et al. • U. of Tokyo • OPTIMA: Organized Processing Towards Intelligent Music Scene Analysis • First to use human auditory separation rules
This PSSS System • Input: a mono audio signal; output: multiple MIDI channels (plus a graphic display) • Given a signal S(t), comprised of a mix of M sound sources • Assume S(t) = {F1(t), …, FL(t)} • Where Fj(t) = {pj(t), fj(t), ψj(t)} • pj(t) = power of the spectral peak • fj(t) = frequency of the spectral peak • ψj(t) = bandwidth of the spectral peak • Wish to: • Extract the Fj(t) from S(t) • Cluster the Fj(t) into groups which (ultimately) represent different sound sources (a data-structure sketch follows)
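A minimal sketch of this representation, assuming Python; the class name FrequencyComponent and its fields are illustrative, not taken from the paper:

```python
# Hypothetical container for one extracted spectral-peak track F_j(t).
from dataclasses import dataclass
import numpy as np

@dataclass
class FrequencyComponent:
    power: np.ndarray      # p_j(t): power of the spectral peak over time
    freq: np.ndarray       # f_j(t): centre frequency of the peak over time
    bandwidth: np.ndarray  # psi_j(t): bandwidth of the peak over time

# The analysed signal S(t) then reduces to a list of such tracks, which
# the later stages cluster into perceptual sound sources.
components: list[FrequencyComponent] = []
```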
System Overview • Extraction of Frequency Components • Analysis is performed first • All signals are 16-bit / 48 kHz • A bank of 2nd-order IIR bandpass filters (spaced on a log frequency scale) is applied • Peak Selection/Tracking: • "pinching plane" method • Regression planes, calculated via least squares • In other words, minimization of the sum of squared errors in the z direction (power), leaving x and y (time and frequency) fixed • A normal vector is calculated for each plane; the angle between normals gives ψj(t), and the direction vector gives fj(t) and pj(t) • The first regression-plane analysis sets the threshold by which other potential peaks are measured (see the sketch below)
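A sketch of these two analysis steps, assuming Python/SciPy. The filter count, frequency range, and Q are assumptions (the slides only specify 2nd-order IIR bandpass filters on a log frequency scale), and regression_plane shows only the least-squares plane fit, not the full pinching-plane peak tracker:

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

fs = 48000                                     # 16-bit / 48 kHz input
centres = np.geomspace(55.0, 8000.0, num=96)   # log-spaced centres (assumed)
Q = 30.0                                       # assumed quality factor

def filterbank(signal):
    """Split the signal with a bank of 2nd-order IIR bandpass filters."""
    bands = []
    for f0 in centres:
        b, a = iirpeak(f0, Q=Q, fs=fs)         # 2nd-order IIR resonator
        bands.append(lfilter(b, a, signal))
    return np.stack(bands)                     # (n_filters, n_samples)

def regression_plane(t, f, z):
    """Fit z = a*t + b*f + c by least squares, minimising error in the
    z (power) direction only, with t (time) and f (frequency) fixed."""
    A = np.column_stack([t, f, np.ones_like(t)])
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
    n = np.array([a, b, -1.0])                 # normal vector of the plane
    return n / np.linalg.norm(n)
```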
Bottom-Up Clustering of Frequency Components • Grouping frequency components based on perceptual criteria • The goal is to group sounds that humans hear as one • Calculations are made for harmonic mistuning and onset asynchrony between pairwise frequency components, which are then evaluated for the probability of auditory separation • The probability functions are based on approximations of psychoacoustic experiments • Given probability functions p1 and p2, the integrated probability of auditory separation is m = 1 - (1 - p1)(1 - p2) • This follows Dempster's rule of combination • m is used as the distance measure in clustering (a sketch follows)
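A sketch of this grouping step, assuming Python/SciPy. Only the combination rule m = 1 - (1 - p1)(1 - p2) comes from the slides; the pairwise probability matrices, the average-linkage choice, and the threshold are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def separation_distance(p1, p2):
    """Integrated probability of auditory separation (Dempster's rule),
    used directly as the clustering distance."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def group_components(p1_matrix, p2_matrix, threshold=0.5):
    """Agglomerative clustering of frequency components from pairwise
    mistuning (p1) and onset-asynchrony (p2) separation probabilities."""
    d = separation_distance(p1_matrix, p2_matrix)
    np.fill_diagonal(d, 0.0)                       # zero self-distance
    z = linkage(squareform(d, checks=False), method='average')
    return fcluster(z, t=threshold, criterion='distance')
```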
Clustering for Source Identification • Identify sound sources by the global characteristics of clusters • The goal is to group sounds from the same source (thus it uses direct signal attributes, apart from any psychoacoustic metric) • Ideally, each cluster contains a single note
Clustering for Source Identification • Uses a distance function to determine the source: • D = c1·fp + c2·fq + c3·ta + c4·ts • Where: • fp = peak power ratio of the second harmonic to the fundamental component • fq = peak power ratio of the third harmonic to the fundamental component • ta = attack time • ts = sustain time (a sketch follows)
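A sketch of this distance, assuming Python; the weights c1..c4 are free parameters of the system, and both the values below and the direct use of the raw feature values are placeholders:

```python
def source_distance(fp, fq, ta, ts, c=(1.0, 1.0, 1.0, 1.0)):
    """D = c1*fp + c2*fq + c3*ta + c4*ts, combining harmonic power
    ratios (fp, fq) with attack and sustain times (ta, ts)."""
    c1, c2, c3, c4 = c
    return c1 * fp + c2 * fq + c3 * ta + c4 * ts
```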
Tone Model Based Processing • The unit of input is a "processing scope" • A processing scope consists of one cluster, or several clusters if they share a frequency component • A tone model is a 2D matrix in which each row is a frequency component over time (columns represent time); each element is a 2D vector of normalized power and frequency • "Mixture hypotheses" are generated from the tone models and matched against a processing scope to find the closest fit • The distance function minimizes the power difference at each time/frequency location (see the sketch below) • Effective in recognizing chords • But: it is model-based
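A sketch of the matching step, assuming Python/NumPy. The sum of squared power differences below is a stand-in for the paper's exact distance, and both surfaces are assumed to be sampled on the same time/frequency grid:

```python
import numpy as np

def hypothesis_distance(scope, mixture):
    """Power difference between an observed processing scope and the
    spectrum predicted by one mixture hypothesis (n_freqs x n_frames)."""
    return np.sum((scope - mixture) ** 2)

def best_hypothesis(scope, hypotheses):
    """Pick the mixture hypothesis (e.g. a chord candidate) whose
    predicted power surface is closest to the observed scope."""
    return min(hypotheses, key=lambda h: hypothesis_distance(scope, h))
```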
Automatic Tone Modeling • Automatic acquisition of tone models from the analysed signal • Based on the "old-plus-new heuristic" [Bregman 90] • When a complex sound changes, whatever can be interpreted as a continuation of the old sound is treated as old; whatever remains is perceived as a new sound (a sketch follows)
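A sketch of the heuristic, assuming Python/NumPy: subtract the spectrum predicted from already-known ("old") tones and treat any positive residual as new sound to be modelled:

```python
import numpy as np

def new_part(observed, predicted_old):
    """Old-plus-new residual: whatever is not explained as a continuation
    of known tones is kept as a candidate new sound. Both arguments are
    power spectra on the same frequency grid."""
    return np.maximum(observed - predicted_old, 0.0)
```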
A Few Problems and Limitations • Octaves: not handled well • Psychoacoustic models: not tested over a large enough subject group • Detuning: the allowance in the probability function (2.6%) may not leave enough room for the variance of real instruments • Lots of free parameters: seemingly a lot of tuning involved
Conclusion • Works well for 3-note polyphony • Anssi Klapuri's claim: an 18-note range, working for flute, piano, and trumpet • Groundbreaking in its use of a perceptual-system model • Based on auditory scene analysis • Lots of free parameters • Seemingly a lot of tuning involved