
On the use of statistical tools for audio processing



  1. On the use of statistical tools for audio processing • Mathieu Lagrange and Juan José Burred • Analyse / Synthèse Team, IRCAM • Mathieu.lagrange@ircam.fr École centrale de Nantes Filière ISBA

  2. Outline • Introduction • Context and challenges • Past and Present • Speech • Model • Applications (coding, speaker recognition, speech recognition) • Audio (Music) • Sound models • Applications (classification, similarity) • Limitations • Sound source separation • Paradigms, tasks and applications • Mixing Models • Methods for the underdetermined case • Clustering of Spectral Audio (CoSA) • Auditory Scene Analysis (ASA) • Clustering

  3. Outline • Introduction • Context and challenges

  4. Technological Context « We are drowning in information and starving for knowledge » R. Rogers • Needs: • Measurement • Transmission • Access • Aims of a numerical representation: • Precision • Efficiency • Relevance • Means: • Mechanics and biology • Psychoacoustics • Cognition

  5. Challenges « Forty-two! yelled Loonquawl. Is that all you've got to show for seven and a half million years' work? » D. Adams Music is great to study as it is both: • an object: an arrangement of sounds and silences over time • a function: a more or less codified form of expression of: • individual feelings (mood) • collective feelings (party, singing, dance)

  6. Audio Processing: Past and Present

  7. Outline • Introduction • Context and challenges • Past and Present • Speech • Model • Applications (coding, speaker recognition, speech recognition)

  8. Speech signal • The speech signal is produced when the air flow coming from the lungs goes through the vocal cords and the vocal tract. • The size and shape of the vocal tract, as well as the vocal cord excitations, change relatively slowly. • The speech signal can therefore be considered quasi-stationary over short periods of about 20 ms. • Types of speech production: • Voiced: <a>, <e>, … • Unvoiced: <s>, <ch>, … • Plosives: <pe>, <ke>

  9. Source / Filter Model • In the case of an idealized voiced speech signal, the vocal cords produce a perfectly periodic harmonic signal. • The influence of the vocal tract can be considered as a filtering with a given frequency response, whose maxima are called formants.

  10. Source / Filter Coding • Algorithm: • Voiced / unvoiced detection; • Voiced case: the source signal is approximated with a Dirac comb: • a Dirac comb whose successive impulses are spaced by T has as its spectrum another Dirac comb whose impulses are spaced by 1/T. • Parameters: T, gain • Unvoiced case: the source signal is approximated by a stochastic signal: • Parameter: gain. • The source signal is then filtered. • Parameters: filter coefficients.
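As a rough illustration of the decoder side, a voiced frame can be resynthesized by exciting an all-pole vocal-tract filter with a Dirac comb; the sampling rate, pitch period and filter coefficients below are made-up example values, not taken from the slides:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling rate (Hz), assumed
T = 0.005                      # pitch period of 5 ms -> 200 Hz fundamental
gain = 0.8
a = [1.0, -1.3, 0.7]           # illustrative AR (vocal tract) coefficients

# Voiced excitation: a Dirac comb with period T
n = np.arange(int(0.02 * fs))  # one 20 ms frame
source = np.zeros_like(n, dtype=float)
source[::int(T * fs)] = gain

# Unvoiced excitation would instead be white noise:
# source = gain * np.random.randn(len(n))

# Filter the excitation with the all-pole vocal-tract filter
speech = lfilter([1.0], a, source)
```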

  11. « Code-Excited Linear Predictive » (CELP) For each frame of 20 ms: • Auto-regressive coefficients are computed such that the prediction error is minimized over the entire duration of the frame. • Quantized coefficients and an index encoding the error signal are transmitted.
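A minimal sketch of the linear-prediction step, assuming the autocorrelation method (solved here directly via the Toeplitz normal equations rather than the Levinson-Durbin recursion usually used in practice):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Estimate AR coefficients minimizing the prediction error over the frame."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Solve the Toeplitz normal equations R a = r
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Prediction error (residual): s(n) - sum_k a_k s(n-k)
    residual = frame - np.convolve(np.r_[0, a], frame)[:len(frame)]
    return a, residual
```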

  12. « Code-Excited Linear Predictive » (CELP) (Figure: CELP decoder block diagram — codebook index, residual, AR coefficients, output signal.)

  13. Speaker Recognition • Classical pattern recognition problem • Specific problems: • Open set / closed set: rejection problem • Identification / verification • Text dependency • Method: • Feature extraction: model each speech utterance with Mel-Frequency Cepstral Coefficients (MFCCs) and their derivatives. • Classification: • Text independent: Vector Quantization codebooks or Gaussian Mixture Models (GMMs) • Text dependent: Dynamic Time Warping (DTW) or Hidden Markov Models (HMMs) (Figure: HMM with states s1, s2, s3.)

  14. Speech recognition • An Automatic Speech Recognition system is typically decomposed into: • Feature extraction: MFCCs • Acoustic models: HMMs trained for a set of phones • Each phone is modelled with 3 states • Pronunciation dictionary: converts a series of phones into a word • Language model: predicts the likelihood of specific words occurring one after another with n-grams (Fig. from HTK documentation)
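As a toy illustration of the language-model component, a maximum-likelihood bigram (n = 2) estimate over a made-up corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus, made up
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); no smoothing."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2/3: 'the' occurs 3 times, followed twice by 'cat'
```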

  15. MFCCs rules ? Mel Frequency Cepstral Coefficients are commonly derived as follows: • Take the Fourier transform of (a windowed excerpt of) a signal. • Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows. • Take the logs of the powers at each of the mel frequencies. • Take the discrete cosine transform (DCT) of the list of mel log powers, as if it were a signal. • The MFCCs are the amplitudes of the resulting spectrum.
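A compact numpy sketch of these five steps; the filterbank size, frame length and number of kept coefficients are common choices, not prescribed above:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs=16000, n_filters=26, n_ceps=13):
    # 1. Fourier transform of a windowed excerpt of the signal
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2. Triangular filters with centers equally spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor(edges / fs * len(frame)).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 3. Log of the power in each mel band
    logmel = np.log(fbank @ spec + 1e-10)
    # 4./5. DCT of the mel log powers; the MFCCs are the first coefficients
    return dct(logmel, norm='ortho')[:n_ceps]

coeffs = mfcc(np.random.randn(512))  # one 32 ms frame at 16 kHz
```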

  16. MFCCs Rules ?

  17. Issues with MFCC computation steps • The mel frequency warping: • highly criticized from a perceptual point of view (Greenwood) • conceptually: a periodicity analysis over data that are not periodic anymore (Camacho) • The cepstral coefficients are COSINE coefficients: • they cannot shift with speaker size to capture the shift in formant frequencies that occurs as children grow up and their vocal tracts get longer • Not a sound representation: • no way to provide enhancements such as speaker and channel adaptation, background noise suppression, or source separation

  18. Potentials of the DCT step • Pols observed that the main components capture most of the variance using a few smooth basis functions, smoothing away the pitch ripples • The principal components of a collection of vowel spectra on a warped frequency scale are not far from the cosine basis functions • It decorrelates the features. • This is important because MFCCs are in most cases modelled by Gaussians with diagonal covariance matrices

  19. Outline • Introduction • Context and challenges • Past and Present • Speech • Model • Applications (coding, speaker recognition, speech recognition) • Audio (Music) • Sound models • Applications (classification, similarity) • Limitations

  20. Sound Models • Major classes of sounds • Transients (castanets, …) • Pseudo periodic (flute, …) • Stochastic (waves, …) • Models • Impulsive noise • Sum of sinusoids • Wide band noise

  21. Classification • Method: [Tzanetakis’02] • Agree on a mutually exclusive set of tags (the ontology) • Extract features from audio (MFCCs and variations) • Train statistical models: • Due to the high dimensionality of the feature vectors, discriminative approaches are preferred (SVMs) • Segmentation • Smoothing decisions using dynamic programming (DP) (Fig. from [Ramona07])

  22. Multi-Class Discriminative Classification • Usually performed by combining binary classifiers • Two approaches (see the sketch below): • One-vs-all: for each class, build a classifier for that class versus the rest • Often very imbalanced classifiers (use asymmetric regularization) • All-vs-all: build a classifier for each pair of classes • A priori a large number of classifiers to build, but the pairwise classifiers are faster and the classification problems are balanced (easier to find the best regularization)
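Both schemes as a short scikit-learn sketch on synthetic data (the dataset and the choice of a linear SVM are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5)

# One-vs-all: 3 classifiers, each trained on an imbalanced one-vs-rest split
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)

# All-vs-all (one-vs-one): k(k-1)/2 = 3 pairwise classifiers, each balanced and smaller
ava = OneVsOneClassifier(LinearSVC()).fit(X, y)
```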

  23. Multi-Label Discriminative Classification • Each object may be tagged with several labels • Computational approaches: • Label power sets • Binary relevance (equivalent to one-vs-all) • Multiple criteria: • « Flattening » the ontology • Research trend: considering the ontology structure to benefit from the co-occurrence of labels across different semantic criteria

  24. Music Similarity • Question to solve: « Given a seed song, provide us with the entries of the database which are the most similar » • Annotation type: artist / album • Method: [Aucouturier’04] • Songs are modeled as GMMs of MFCCs • The proximity between GMMs is used as the similarity measure: • Likelihood (requires access to the MFCCs) • Sampling (see the sketch below)
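A sketch of the sampling-based comparison, assuming each song is given as a matrix of MFCC frames; the component counts and sample sizes are arbitrary example values:

```python
from sklearn.mixture import GaussianMixture

def song_similarity(mfccs_a, mfccs_b, n_components=8, n_samples=2000):
    """Approximate symmetrized log-likelihood similarity between two songs,
    each represented by its matrix of MFCC frames (n_frames x n_coeffs)."""
    ga = GaussianMixture(n_components).fit(mfccs_a)
    gb = GaussianMixture(n_components).fit(mfccs_b)
    sa, _ = ga.sample(n_samples)        # draw samples instead of keeping the MFCCs
    sb, _ = gb.sample(n_samples)
    return gb.score(sa) + ga.score(sb)  # higher = more similar
```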

  25. Cover Version Detection • Question to solve: « Given a seed song, provide us with the entries of the database which are cover versions » • Annotation: canonical song • Method: [Serra’08] • Songs are modeled as time series of chroma features • Computation of the similarity matrix between the two time series • Similarity is measured using dynamic programming local alignment (Fig. from [Serra’08])
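A sketch of a Smith-Waterman-style local alignment over a precomputed frame-to-frame similarity matrix S (positive entries for similar chroma frames, negative otherwise); the gap penalty is an arbitrary example value:

```python
import numpy as np

def local_alignment_score(S, gap=0.5):
    """Local alignment over a similarity matrix S between two chroma series."""
    n, m = S.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0,
                          H[i - 1, j - 1] + S[i - 1, j - 1],  # match / mismatch
                          H[i - 1, j] - gap,                  # gap in one series
                          H[i, j - 1] - gap)                  # gap in the other
    return H.max()  # a high score suggests the songs share an aligned section
```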

  26. Limitations • Description of audio and music • Polyphonic signals • Multiple spectral shapes varying in various ways • Statistical modeling • Curse of dimensionality • A sense of structure relevant at multiple temporal scales

  27. Outline • Introduction • Context and challenges • Past and Present • Speech • Model • Applications (coding, speaker recognition, speech recognition) • Audio (Music) • Sound models • Applications (classification, similarity) • Limitations • Sound source separation • Paradigms, tasks and applications • Mixing Models • Methods for the underdetermined case

  28. Sound Source Separation • “Cocktail party effect” • E. C. Cherry, 1953. • The ability to concentrate attention on a specific sound source within a mixture, • even when the interfering energy is close to the energy of the desired source. • “Prince Shotoku Challenge” • The legendary Japanese prince Shotoku (6th century AD) could listen to and understand the petitions of ten people simultaneously. • Concentrating attention on several sources at the same time! • “Prince Shotoku Computer” (Okuno et al., 1997) • Both allegories imply an extra step of semantic understanding of the sources, beyond mere acoustical isolation.

  29. The paradigms of Musical Source Separation • (based on [Scheirer00]) • Understanding without separation • E.g. music genre classification • “Glass ceiling” of traditional methods (MFCC+GMM) [Aucouturier&Pachet04] • Separation for understanding • First (partially) separate, then extract features • Source separation as a way to break the glass ceiling? • Separation without understanding • BSS: Blind Source Separation (ICA, ISA, NMF) • Blind means: only very general statistical assumptions are made. • Understanding for separation • Supervised source separation (based on a training database)

  30. Required sound quality • Audio Quality Oriented (AQO) • Aimed at full unmixing at the highest possible quality. • Applications: • Unmixing, remixing, upmixing • Hearing aids • Post-production • Significance Oriented (SO) • Separation quality just enough to facilitate semantic analysis of complex signals. • Less demanding, more realistic. • Applications: • Music Information Retrieval • Polyphonic Transcription • Object-based audio coding

  31. Musical Source Separation Tasks • Classification according to the nature of the mixtures: • Classification according to available a priori information:

  32. Linear mixing model • Only amplitude scaling before mixing (summing) • Linear stereo recording setups: XY Stereo MS Stereo Close miking Direct injection
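Writing $x_i$ for the mixtures, $s_j$ for the $N$ sources and $a_{ij}$ for the mixing gains (notation assumed here, not fixed by the slides), the instantaneous linear model reads:

$x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t)$, i.e. $\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t)$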

  33. Delayed mixing model • Amplitude scaling and delay before mixing • Delayed stereo recording setups: Close miking with delay Direct injection with delay AB Stereo Mixed Stereo
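With the same notation, each source now reaches sensor $i$ with its own delay $\tau_{ij}$ in addition to the gain:

$x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t - \tau_{ij})$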

  34. Convolutive mixing model • Filtering between sources and sensors • Convolutive stereo recording setups: Close miking with reverb Direct injection with reverb Reverberant environment Binaural
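Here each scalar gain becomes an impulse response $a_{ij}(\tau)$ (room response, head filtering), so mixing turns into a convolution:

$x_i(t) = \sum_{j=1}^{N} (a_{ij} * s_j)(t) = \sum_{j=1}^{N} \sum_{\tau} a_{ij}(\tau)\, s_j(t - \tau)$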

  35. Some terminology • System of linear equations: X = AS • Usual algebraic methods from high school: X known, A known, S unknown • But in source separation: unknown variables (S, sources) AND unknown coefficients (A, mixing matrix) • Algebra terminology is retained for source separation: • More equations (mixtures) than unknowns (sources): overdetermined • Same number of equations (mixtures) as unknowns (sources): determined (square A) • Fewer equations (mixtures) than unknowns (sources): underdetermined • The underdetermined case is the most demanding, but also the most important for music! • Music is (still) mostly in stereo, usually with more than 2 instruments • Overdetermined and determined situations are mainly of interest for arrays of sensors or microphones (localization, tracking)

  36. Binaural Case (1) • Goal: find a mask M that retrieves one source when used to filter a given time-frequency representation. • DUET (Degenerate Unmixing Estimation Technique) [Yilmaz&Rickard04] • Histogram of Interchannel Intensity Differences (IID) and Interchannel Phase Differences (IPD) • A binary mask is created by selecting bins around the histogram peaks. • Drawback of t-f masking: “musical noise” or “burbling” artifacts • ∘ is the Hadamard (element-wise) product (Fig. from [Vincent06]) (Fig. from [Yilmaz&Rickard04])
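A rough sketch of the DUET idea: compute per-bin interchannel level and phase differences, then keep the bins whose level difference falls near one histogram peak (the peak range is assumed to be found beforehand, e.g. by inspecting the histogram):

```python
import numpy as np
from scipy.signal import stft, istft

def duet_features(left, right, fs=44100):
    """Per-bin interchannel level and phase differences, as used by DUET."""
    _, _, L = stft(left, fs)
    _, _, R = stft(right, fs)
    eps = 1e-10
    iid = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level difference (dB)
    ipd = np.angle(L * np.conj(R))                              # phase difference
    return L, R, iid, ipd

def extract_source(left, right, iid_range, fs=44100):
    """Binary mask keeping bins whose IID falls near one histogram peak."""
    L, R, iid, ipd = duet_features(left, right, fs)
    mask = (iid > iid_range[0]) & (iid < iid_range[1])
    _, source = istft(L * mask, fs)
    return source
```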

  37. Binaural Case (2) • Human-assisted time-frequency masking [Vinyes06] • Human-assisted selection of the time-frequency bins out of the DUET-like histogram for creating the unmixing mask • Implemented as a VST plugin (“Audio Scanner”)

  38. Monaural Case • Classification according to a priori knowledge • Supervised • Based on training the model with a sound example database • Better quality in more demanding situations, at the cost of less generality • Unsupervised • Classification according to model type • Adaptive basis decompositions (ISA, NMF, NSC) • Sinusoidal modeling • Classification according to mixture type • Monaural systems • Hybrid systems combining advanced source models with spatial diversity

  39. Independent Subspace Analysis • Application of ISA to audio: Casey and Westner, 2000. • Application of ICA to the spectrogram of a mono mixture. • Each independent component corresponds to an independent subspace of the spectrogram. (Fig. from [Casey&Westner00]) • Component-to-source clustering • The extracted components usually do not directly correspond to the sources. • They must be clustered together according to some similarity criterion. • Casey & Westner use a matrix of Kullback-Leibler divergences called the ixegram.
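A minimal ISA-style sketch using scikit-learn's FastICA on a magnitude spectrogram; the component-to-source clustering step (the ixegram) is omitted, and the number of components is an arbitrary example value:

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import FastICA

def isa_components(x, fs=44100, n_components=4):
    """ICA applied to the magnitude spectrogram of a mono mixture.
    Returns spectral basis vectors and their temporal activations."""
    _, _, X = stft(x, fs)
    V = np.abs(X)                              # magnitude spectrogram (freq x time)
    ica = FastICA(n_components=n_components)
    activations = ica.fit_transform(V.T)       # (time, n_components)
    spectra = ica.mixing_                      # (freq, n_components) spectral shapes
    return spectra, activations
```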

  40. ICA for Audio (Figs from Virtanen)

  41. Nonnegative Matrix Factorization • Matrix factorization (V ≈ WH) imposing non-negativity. • Needed when using magnitude or power spectrograms. • NMF does not aim at statistical independence, but: • It has been proven that, under some conditions, non-negativity is sufficient for separation. • NMF yields components that very closely correspond to the sources. • To date, there is no exact theoretical explanation of why this is so! • Use for transcription: • P. Smaragdis and J.C. Brown. Non-Negative Matrix Factorization for Polyphonic Music Transcription. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2003. • Use for separation: • B. Wang and M. D. Plumbley. Musical Audio Stream Separation by Non-Negative Matrix Factorization. Proc. UK Digital Music Research Network (DMRN) Summer Conf., 2005.
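A minimal sketch of NMF with the classic Lee & Seung multiplicative updates for the Euclidean cost, applied to a non-negative matrix V such as a magnitude spectrogram (one of several possible update rules, not necessarily the one used in the cited papers):

```python
import numpy as np

def nmf(V, n_components=4, n_iter=200, eps=1e-10):
    """Factorize a non-negative matrix V (freq x time) as V ~ W @ H."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, n_components)) + eps   # spectral basis vectors
    H = rng.random((n_components, T)) + eps   # temporal activations
    for _ in range(n_iter):
        # Multiplicative updates: non-negativity is preserved automatically
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```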

  42. NMF for Audio (Figs from Virtanen)

  43. NMF for Vision • By representing signals as a purely additive sum of non-negative sources, we get a parts-based representation [Lee’99]

  44. Nonnegative Sparse Coding (Figures: mixtures and their separated components.) • Combination of non-negativity and sparsity constraints in the factorization. • [Virtanen03]: NSC is optimized with an additional criterion of temporal continuity (see below). • Measured by the absolute value of the overall amplitude difference between consecutive frames. • [Virtanen04]: Convolutive Sparse Coding • Improved temporal accuracy by modeling the sources as the convolution of spectrograms with a vector of onsets.
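Writing $g_j(t)$ for the gain (activation) of component $j$ in frame $t$, the temporal continuity term has roughly the form:

$c_{tc} = \sum_{j} \sum_{t} \lvert g_j(t) - g_j(t-1) \rvert$

which is added to the reconstruction and sparsity costs during optimization.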

  45. Sinusoidal Methods (Figures: mixture and separated sources.) • Sinusoidal modeling: detection and tracking of the sinusoidal partial peaks on the spectrogram. • Based on Auditory Scene Analysis (ASA) cues of good continuation, common fate and smoothness of sinusoidal tracks. • Overall, very good reduction of interfering sources, but moderate timbral quality. • Appropriate for significance-oriented applications • [Virtanen&Klapuri02]: model of spectral smoothness of harmonic sounds • Based on basis decomposition of harmonic structures • Additive resynthesis of partial parameters • [Every&Szymanski06] • Spectral subtraction instead of additive resynthesis (Fig. from [Every06])

  46. Supervised Methods (1) (Figures: separation of chords and inharmonic separation — mixtures and separated sources.) • Use of a training database to create a set of source models, each one modeling a specific instrument. • Better separation as a trade-off for generality. • Supervised sinusoidal methods • [Burred&Sikora07] • The source models are compact descriptions of the spectral envelope and its temporal evolution. • The detailed temporal evolution makes it possible to relax harmonicity constraints, and thus separation of chords and inharmonic sounds is possible.

  47. Supervised Methods (2) (Figures: mixtures and separated sources.) • Bayesian networks • [Vincent06] • Multilayered model describing note probabilities (state layer), spectral decomposition (source layer) and spatial information (mixture layer). • Trained on a database of isolated notes. • Allows separation of sounds with reverb. • Learnt priors for Wiener-based separation • [Ozerov05] • Single-channel • GMM models of the singing voice and the accompaniment.

  48. Conclusions • Still far from a fully general, audio-quality-oriented system. • More realistic: significance-oriented • Separation good enough to facilitate content analysis • Methods based on adaptive models and time-frequency masking: • Handle more realistic mixtures, but with more artifacts and interferences • Methods based on sinusoidal modeling: • More artificial timbre, but fewer interferences. • Current polyphony limitations: • Mono signals: up to 3 or 4 instruments • Stereo signals: up to 5 or 6 instruments

  49. Outline • Introduction • Context and challenges • Past and Present • Speech • Model • Applications (coding, speaker recognition, speech recognition) • Audio (Music) • Sound models • Applications (classification, similarity) • Limitations • Sound source separation • Paradigms, tasks and applications • Mixing Models • Methods for the underdetermined case • Clustering of Spectral Audio (CoSA) • Auditory Scene Analysis (ASA) • Clustering

  50. Binary Masking with Oracle • Binary masking is an effective way of performing the separation • Using an oracle makes it possible to assess the relevance of Fourier spectrograms as an atomic representation • The binary mask is set to 1 if the source of interest is dominant in the considered frequency bin, and 0 otherwise (Figure: oracle masking pipeline — STFT, |STFT|, mask, ISTFT.)
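A sketch of oracle binary masking, assuming the true source signals are available to build the mask (which is exactly what the oracle setting provides):

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_separate(sources, fs=44100):
    """Oracle binary masking: each STFT bin of the mixture is assigned to the
    source whose true magnitude dominates in that bin."""
    specs = np.array([stft(s, fs)[2] for s in sources])  # (n_src, freq, time)
    mix = specs.sum(axis=0)                              # mixture STFT (linear mixing)
    dominant = np.argmax(np.abs(specs), axis=0)          # winning source per bin
    estimates = []
    for i in range(len(sources)):
        mask = (dominant == i)                           # binary mask: 1 where dominant
        estimates.append(istft(mix * mask, fs)[1])
    return estimates
```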
