AUDIO TONALITY MODE CLASSIFICATION WITHOUT TONIC ANNOTATIONS
Zhiyao Duan 1,2, Lie Lu 1, and Changshui Zhang 2
1. Microsoft Research Asia (MSRA), China. 2. Department of Automation, Tsinghua University, China.
• Summary
  • Tonality mode classification for popular songs, where only the mode is labeled in the training data.
  • Traditional key-finding algorithms often rely on tonic annotations of the training songs.
  • Keys of popular songs are hard to obtain.
  • It is easier to label the mode than the key of a song.
  • Mode is more important than tonic.
  • An alignment approach transposes chroma features to a reference (but unknown) tonic.
  • Three methods for mode learning:
    • Single Profile Correlation (SPC)
    • Multiple Profile Correlation (MPC)
    • Support Vector Machine (SVM)
  • Key: C-major, a-minor, Eb-major, etc.
  • Mode: major/minor
  • Tonic: C, C#, D, etc.
• Algorithm Flow [figure]
• Feature Extraction and Alignment
  • Chroma feature extraction:
    • Divide a song into excerpts (15 s, 30 s, or the whole song).
    • In each frame (130 ms with a 10 ms shift) of an excerpt, a 48-bin CQT covering the frequency range from 130 Hz (C3) to 1975 Hz (B6) is calculated.
    • For each excerpt, a 12-d chroma vector is calculated from the average CQT vector.
    • Each chroma vector is normalized.
  • Alignment:
    • Goal: transpose the chroma vectors within each mode to a reference (but unknown) tonic.
    • Criterion: maximize the overall correlation between the aligned vectors and their average vector q. The formulation uses the inner product and the norm of vectors; the transposition of a chroma vector, obtained by circularly shifting its items j positions to the left; the i-th aligned vector; and the average vector q.
    • A greedy method for alignment:
      • Step 1: Initialization.
      • Step 2: Align and update one by one.
      • Step 3: After N updates, the resulting average vector is used to initialize again, and Step 2 is performed once more.
    • The calculated average vector is stable when the order of the training chroma vectors is randomly changed.
• Learning and Classification
  • Single Profile Correlation (SPC):
    • In training, each mode is represented by one chroma profile, using a 12-d or 7-d feature.
    • Each element of the 7-d profile corresponds to a diatonic note of the 12-d profile.
    • In testing, circularly shift the chroma vector of an excerpt 12 times.
    • Correlate the shifted vectors against the major/minor profiles. The most highly correlated profile indicates the mode.
    • Majority voting over excerpts gives the song mode.
  • Multiple Profile Correlation (MPC):
    • In training, K profiles (12-d or 7-d) represent a mode, using a K-kernel Gaussian Mixture Model.
    • In testing, circularly shift the chroma vector of an excerpt 12 times to generate 12 vectors.
    • Correlate the shifted vectors with the major/minor profiles (Eq. (6)). The maximum or the weighted summation of the correlations defines the confidence score. The highest confidence score indicates the mode.
    • Majority voting over excerpts gives the song mode.
  • Support Vector Machine (SVM):
    • In training, train an SVM using the training chroma vectors.
    • In testing, circularly shift the chroma vector of an excerpt 12 times to generate 12 vectors.
    • Classify each shifted vector; the label of the one with the highest classification confidence is assigned to the excerpt.
    • Majority voting over excerpts gives the song mode.
  • The issue of inter-class alignment:
    • The alignment between the vectors of the two modes must be considered in the training phase.
    • Three alignment methods:
      • (a) Make the major profile and the minor profile have the same tonic (see Fig. 2(a)). (bad)
      • (b) Make major and minor "relative" (see Fig. 2(b)), such as C-major and a-minor. (bad)
      • (c) Make the profiles of major and minor correlate least, or lie furthest apart, as in Fig. 2(c). (good)
    • Explanation:
      • For Methods (a) and (b), the distance between the major profile and the minor profile in the training set is small, i.e., the training samples of the major and minor modes are close to each other in the feature space. Therefore, it is hard for the SVM to find a good classification surface between the two modes.
      • The decisive (shifted) chroma vector among the 12 shifted vectors is the one furthest from the classification surface. This makes the distribution of the decisive test vectors different from that of the training vectors in Methods (a) and (b).
      • For Method (c), this alignment, together with the inner-class alignment, can be seen as analogous to minimizing the intra-class distance while maximizing the inter-class distance.
• Experiments
  • Materials:
    • 4,528 songs (2,786 major and 1,742 minor).
    • Various genres including rock, electronica, folk, country, jazz, etc.
    • Songs with ambiguous modes or major-minor modulations were discarded.
    • Training set: 25%, test set: 75%.
  • Results:
    • In SPC, profiles learned from aligned features > profiles learned without alignment, or Krumhansl's profiles.
    • MPC > SPC.
    • 15 s and 30 s excerpts > the whole song.
    • 7 diatonic elements > 12 elements.
    • The least-correlation (or maximum-apart) criterion works best.
    • 15 s and 30 s excerpts > the song level.
    • SVM > profile correlation methods.
    • The best result using SVM is up to 78.2%.
• Future Work
  • How can a key-independent feature be designed for mode classification?
  • How can temporal information be exploited to improve the building of the mode model?
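
The greedy alignment and the Single Profile Correlation (SPC) test procedure described in this poster can be sketched in Python with NumPy. This is only an illustrative sketch, not the authors' implementation. Several details are assumptions made here because the poster does not specify them: the function names (`greedy_align`, `classify_excerpt`, `classify_song`), seeding the average with the first training vector, aligning against the running sum of the vectors aligned so far, using the normalized inner product as the correlation, and the hypothetical `passes` parameter that controls how often the alignment step is repeated.

```python
from collections import Counter

import numpy as np


def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def shift_left(v, j):
    """Transpose a chroma vector by circularly shifting its items j positions to the left."""
    return np.roll(v, -j)


def best_shift(v, ref):
    """Return the shifted copy of v whose inner product with ref is largest."""
    scores = [np.dot(shift_left(v, j), ref) for j in range(len(v))]
    return shift_left(v, int(np.argmax(scores)))


def greedy_align(chromas, passes=2):
    """Align vectors one by one to a running reference, then repeat from the new average."""
    X = [normalize(c) for c in chromas]
    q = X[0]  # assumption: the first training vector seeds the average
    for _ in range(passes):
        running = q.copy()  # reference used for alignment in this pass
        aligned = []
        for x in X:
            a = best_shift(x, running)  # align the vector to the current reference
            aligned.append(a)
            running = running + a  # update the reference with the aligned vector
        q = normalize(np.mean(aligned, axis=0))  # average vector after this pass
    return np.array(aligned), q


def classify_excerpt(chroma, profiles):
    """Shift the excerpt through all 12 positions and pick the best-correlated mode."""
    x = normalize(chroma)
    scores = {
        mode: max(np.dot(shift_left(x, j), normalize(p)) for j in range(len(x)))
        for mode, p in profiles.items()
    }
    return max(scores, key=scores.get)


def classify_song(excerpt_chromas, profiles):
    """Majority vote over the excerpt-level mode decisions."""
    votes = Counter(classify_excerpt(c, profiles) for c in excerpt_chromas)
    return votes.most_common(1)[0][0]
```

Under these assumptions, `greedy_align` would be run once per mode on that mode's training chroma vectors, and the two resulting average vectors would be passed to `classify_song` as `{"major": q_major, "minor": q_minor}`. The inter-class alignment between the two profiles is not covered by this sketch.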