
Automatic Transcription System of Kashino et al.


Presentation Transcript


  1. Automatic Transcription System of Kashino et al. MUMT 611 Doug Van Nort

  2. Objective • To give an overview of this particular technique for automatic transcription • Original implementation: ICMC 1993

  3. Introduction • Sound Source Separation System • Extracting a sound source in the presence of multiple sources • Physical vs. perceptual sound source • Physical: the actual source itself • Perceptual: what humans hear as a single source • Ex: a piano heard through a loudspeaker (the loudspeaker is the physical source, the piano the perceptual one)

  4. Perceptual Sound Source Separation • Creating a system that simulates the human perceptual system • Extraction of parameters based on a perceptual model, then grouping of those parameters based on certain (perceptual) criteria

  5. This PSSS System • Kashino et al. • U. of Tokyo • OPTIMA: Organized Processing Towards Intelligent Music Scene Analysis • First to use human auditory separation rules

  6. This PSSS System • Suppose: input = mono audio signal, output = multiple MIDI channels (and a graphic display) • Given a signal S(t), comprised of a mix of M sound sources • Assume S(t) = {F1(t), …, FL(t)} • where Fj(t) = {pj(t), fj(t), ψj(t)} • pj(t) = power of the spectral peak • fj(t) = frequency of the spectral peak • ψj(t) = bandwidth of the spectral peak • We wish to: • Extract the Fj(t) from S(t) • Cluster the Fj(t) into groups which (ultimately) represent different sound sources
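A minimal sketch of the data structures implied by this slide (the class and field names are mine, not from the paper): each extracted frequency component Fj(t) carries power, frequency, and bandwidth trajectories sampled per analysis frame.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FreqComponent:
    power: np.ndarray      # p_j(t): spectral-peak power per frame
    freq: np.ndarray       # f_j(t): spectral-peak frequency per frame (Hz)
    bandwidth: np.ndarray  # psi_j(t): spectral-peak bandwidth per frame

# The task then amounts to extracting a list of FreqComponent objects from the
# mono signal S(t) and clustering them so that each cluster (ultimately)
# corresponds to one sound source / note.
```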

  7. System Overview • Extraction of frequency components • Front-end analysis is performed first • All signals are 16 bit / 48 kHz • A bank of 2nd-order IIR bandpass filters (on a log frequency scale) is implemented • Peak selection/tracking: • “Pinching plane” method • Regression planes, calculated via least squares • In other words, minimization of the sum of squares in the z direction (power), leaving x and y (time and frequency) fixed • A normal vector is calculated for each plane; the angle between planes gives ψj(t), and the direction vector gives fj(t) and pj(t) • The first regression-plane analysis sets the threshold against which other potential peaks are measured
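A rough sketch of the front-end filterbank described above: a bank of 2nd-order IIR bandpass filters with centre frequencies spaced on a log scale. This is not the authors' exact design; the frequency range, band count, and Q value here are placeholder assumptions.

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

def log_filterbank(signal, fs=48000, f_lo=55.0, f_hi=8000.0, n_bands=60, Q=30.0):
    """Return log-spaced centre frequencies and an (n_bands, len(signal))
    array of bandpass outputs."""
    centres = np.geomspace(f_lo, f_hi, n_bands)   # log-spaced centre freqs
    outputs = np.empty((n_bands, len(signal)))
    for i, fc in enumerate(centres):
        b, a = iirpeak(fc, Q, fs=fs)              # 2nd-order IIR bandpass
        outputs[i] = lfilter(b, a, signal)
    return centres, outputs
```

Spectral peaks would then be tracked over the time/frequency surface of these band outputs, e.g. by the regression-plane ("pinching plane") fit mentioned on the slide.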

  8. Pinching Planes

  9. Bottom-Up Clustering of Frequency Components • Grouping frequency components based on perceptual criteria • The goal is to group sounds humans hear as one • Calculations are made for harmonic mistuning and onset asynchrony between pairwise frequency components, then evaluated for the probability of auditory separation • The probability functions are based on approximations of psychoacoustic experiments • Given probability functions p1 and p2, the integrated probability of auditory separation is m = 1 - (1 - p1)(1 - p2) • This follows Dempster's rule of combination • m is used as the distance measure in clustering
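A small sketch of the probability-combination step, assuming p_mistune and p_onset are the (psychoacoustically derived) probabilities that two frequency components are heard as separate, based on harmonic mistuning and onset asynchrony respectively; the combined value serves as the pairwise clustering distance.

```python
def separation_distance(p_mistune, p_onset):
    # Integrated probability of auditory separation: m = 1 - (1 - p1)(1 - p2)
    # (the combination rule cited on the slide).
    return 1.0 - (1.0 - p_mistune) * (1.0 - p_onset)

# Example: strong harmonicity and nearly synchronous onsets give a small
# distance, so the two components fuse into one cluster.
print(separation_distance(0.1, 0.05))  # -> 0.145
```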

  10. Clustering for Source Identification • Identify sound sources by the global characteristics of clusters • The goal is to group sounds from the same source (thus it uses direct signal attributes rather than any psychoacoustic criterion) • Ideally, each cluster contains a single note

  11. Clustering for Source Identification • Uses a distance function to determine the source • D = c1·fp + c2·fq + c3·ta + c4·ts • where: • fp = peak power ratio of the second harmonic to the fundamental component • fq = peak power ratio of the third harmonic to the fundamental component • ta = attack time • ts = sustain time
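A hedged sketch of how this weighted distance might compare two clusters' feature vectors (fp, fq, ta, ts). The weights c1..c4 and the absolute-difference form are assumptions; the paper tunes the coefficients.

```python
def source_distance(features_a, features_b, c=(1.0, 1.0, 1.0, 1.0)):
    """features_* = (fp, fq, ta, ts): power ratios of the 2nd and 3rd
    harmonics to the fundamental, attack time, and sustain time."""
    # Weighted sum of per-feature differences, D = sum_i c_i * |a_i - b_i|
    return sum(ci * abs(a - b) for ci, a, b in zip(c, features_a, features_b))
```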

  12. Tone-Model-Based Processing • The unit of input is a "processing scope" • A processing scope consists of one cluster, or several clusters if they share a frequency component • A tone model is a 2D matrix, with each row being a frequency component over time (columns represent time); each element is a 2D vector of normalized power and frequency • "Mixture hypotheses" are generated from the tone models and matched against a processing scope to find the closest fit • The distance function minimizes the power difference at each time/frequency location • Effective in recognizing chords • But it is model-based
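A simplified sketch of the hypothesis-matching idea (the structure and names are mine): each tone model is reduced here to a time-by-frequency grid of normalized power, a mixture hypothesis is the sum over a subset of tone models, and the hypothesis whose power surface best matches the processing scope is selected.

```python
import numpy as np
from itertools import combinations

def best_mixture(scope_power, tone_models, max_notes=3):
    """scope_power: (T, F) observed power grid of the processing scope;
    tone_models: dict mapping a model name to a (T, F) power grid."""
    best, best_err = None, np.inf
    names = list(tone_models)
    for k in range(1, max_notes + 1):
        for subset in combinations(names, k):
            mix = sum(tone_models[n] for n in subset)     # mixture hypothesis
            err = np.sum((scope_power - mix) ** 2)        # power-difference distance
            if err < best_err:
                best, best_err = subset, err
    return best, best_err
```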

  13. Automatic Tone Modeling • Automatic acquisition of tone models from the analysed signal • Based on the "old-plus-new heuristic" [Bregman 90] • When a complex sound changes, the part that matches the preceding sound is interpreted as the continuing (old) sound, and whatever remains is perceived as a new sound
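A minimal sketch of the old-plus-new idea as it could apply to tone-model acquisition: subtract the power already explained by known (old) tone models from the observed scope; a significant residual is treated as a new sound and becomes a candidate tone model. The threshold and function names are my own assumptions.

```python
import numpy as np

def extract_new_tone(scope_power, explained_power, threshold=1e-3):
    residual = np.clip(scope_power - explained_power, 0.0, None)
    if residual.max() > threshold:
        return residual / residual.max()   # normalized candidate tone model
    return None                            # nothing new in this scope
```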

  14. Hierarchy of Perceptual Sound Events

  15. A Few Problems and Limitations • Octaves are not handled well • Psychoacoustic models: not tested over a large enough group of listeners • Detuning: the probability function (2.6% allowance) may not leave enough room for the variance of real instruments • Lots of free parameters: seemingly a lot of tuning involved

  16. Conclusion • Works well for 3-note polyphony • Anssi Klapuri's claim: an 18-note range, works for flute, piano, and trumpet • Groundbreaking in that it used a model of the perceptual system • Based on auditory scene analysis • Lots of free parameters • Seemingly a lot of tuning involved
