Group members: 江啟賓, 張展華, 陳威呈, 胡家豪

Understandable Models of Music Collections Based On Exhaustive Feature Generation With Temporal Statistics. Fabian Moerchen, Ingo Mierswa, Alfred Ultsch. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), pp. 882-891

Outline • Introduction • Related Work • Audio Feature Generation • Semantic Audio Features • Evaluation • Conclusions • Discussions • Applications

Introduction(1/2) • Music data confronts data mining with a new scalability challenge: music databases store millions of records, and each item contains up to several million values. • Extracting features from the audio signal strongly compresses the data set at hand. • Artist and genre classification or retrieval of similar music can then be performed with machine learning methods that use these features. • Many researchers use features motivated by heuristics on music structure and by psychoacoustic analysis of the frequency and modulation of sound, but not all of these features are relevant for a particular task. • The result of applying signal processing and statistical methods cannot easily be explained to the common user of music applications. In short, it is really hard to understand!

Introduction(2/2) Contribution: • The authors use logistic regression in order to obtain concise and interpretable features summarizing a subset of the complicated features generated directly from polyphonic audio.

Related Work • Stacking → building the new features → decision model → prediction • Mel Frequency Cepstral Coefficients (MFCC) • Support Vector Machines (SVM) • Linear Discriminant Analysis or linear predictive coefficients (LPC)

Audio Feature Generation(1/5) • The raw audio data of polyphonic music is not suited for direct analysis with data mining algorithms. • Various sound impressions • Extracting audio features on short time windows • Short-term features • Long-term features
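A minimal sketch of short-time windowing, assuming a mono signal given as a list of samples. The window size, the hop size, and the RMS energy feature are illustrative choices, not values or features taken from the paper.

```python
def frames(signal, size, hop):
    """Split a signal into windows of `size` samples, advancing by `hop` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def rms(frame):
    """Root-mean-square energy of one window (one possible short-term feature)."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def short_term_series(signal, size=4, hop=2):
    """One feature value per window; this series is what long-term statistics summarize."""
    return [rms(frame) for frame in frames(signal, size, hop)]
```

Each short-term feature yields one such time series per song.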

Audio Feature Generation(2/5) • The authors used four disjoint data sets to evaluate their method. • Sampling frequency of 22 kHz • Lead-in and lead-out effects

Audio Feature Generation(3/5) Short-term features: including some variants obtained by preprocessing the features, e.g., the logarithm of the Chroma features, a total of 140 short-term features were generated.

Audio Feature Generation(4/5) • Long-term features: • The authors used the first four moments and robust variants of them, obtained by removing the largest and smallest 2.5% of the data prior to estimation. • These ten statistics are also applied to the first and second order differences and to the first and second order absolute differences, generating 40 additional features.
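The temporal statistics could be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: it covers only the eight statistics named on the slide (four moments plus their trimmed variants), so it produces 8 × 5 = 40 values rather than the slide's 10 × 5, and it reads "second order absolute difference" as the absolute value of the second difference.

```python
def moments(values):
    """Mean, standard deviation, skewness and kurtosis (population estimates)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return mean, 0.0, 0.0, 0.0
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    return mean, std, skew, kurt


def trimmed(values, fraction=0.025):
    """Drop the smallest and largest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k > 0 else ordered


def differences(values, order, absolute=False):
    """Differences of the given order; optionally their absolute values (assumed reading)."""
    out = list(values)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return [abs(v) for v in out] if absolute else out


def long_term_features(series):
    """Summarize one short-term feature series (at least 3 values) by temporal statistics."""
    variants = {
        "raw": list(series),
        "diff1": differences(series, 1),
        "diff2": differences(series, 2),
        "absdiff1": differences(series, 1, absolute=True),
        "absdiff2": differences(series, 2, absolute=True),
    }
    names = ("mean", "std", "skew", "kurt")
    features = {}
    for variant, values in variants.items():
        for name, value in zip(names, moments(values)):
            features[f"{variant}_{name}"] = value
        for name, value in zip(names, moments(trimmed(values))):
            features[f"{variant}_robust_{name}"] = value
    return features
```

Applying such a function to every short-term feature series is what produces the cross-product of short-term and long-term features.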

Audio Feature Generation(5/5) • The cross-product of short-term and long-term feature functions amounts to 140 × 284 = 39,760 long-term audio features. • Obviously, this can take a lot of computation time and memory.

Semantic Audio Feature(1/6) • Almost 40,000 features are too many to be understood. • The goal of this section is to simplify the features and eliminate the irrelevant ones. • The authors' idea is to adapt Stacking.

Semantic Audio Feature(2/6) • “In contrast to Stacking we do not learn the same concept on different subsamples but different concepts on the same sample.”

Semantic Audio Feature(3/6) • D are the data sets describing these different concepts. • Dk ∩ Dl ≠ ∅ ⇒ Dk = Dl and D = ∪k Dk (Figure: the data sets D1, D2, D3, D4.)

Semantic Audio Feature(4/6) • The authors applied a robust z-transformation to each long-term feature and trained a logistic regression learner for each of the K classification tasks. • Since the values are already normalized, it is not necessary to apply post-processing scaling schemes after learning a classification function.
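The slide does not define the robust z-transformation. One common variant, shown here purely as an assumption, centers on the median and scales by the median absolute deviation (MAD), with the factor 1.4826 making the MAD comparable to a standard deviation for normally distributed data.

```python
import statistics


def robust_z(values, eps=1e-12):
    """Median/MAD standardization (an assumed variant, not confirmed by the slides)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale < eps:
        # Constant (or nearly constant) feature: nothing to scale.
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]
```

Unlike the ordinary z-transformation, a single outlier barely changes the center and scale used here.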

Semantic Audio Feature(5/6) Using Laplace priors for the influence of each feature leads to a built-in feature selection that reduces runtime and avoids over-fitting of the final model.
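A Laplace prior on the weights makes the maximum a posteriori estimate equivalent to logistic regression with an L1 penalty, which drives the weights of weak features to exactly zero. The sketch below illustrates this with plain proximal gradient descent on a toy data set; the solver, the penalty `lam`, the step size, and the data are all assumptions for illustration (the slides do not say which solver the authors used), and the bias term is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w, t):
    """Proximal step of the L1 penalty: shrink toward zero, clip small weights to 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.1, steps=500):
    """L1-penalized logistic regression without bias, via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data: feature 0 separates the classes, feature 1 is only weakly related.
X = [(-2.0, 0.1), (-1.0, -0.1), (1.0, -0.1), (2.0, 0.2)]
y = [0, 0, 1, 1]
weights = fit_l1_logistic(X, y)
```

In this toy run the weight of the second feature stays at exactly zero, which is the built-in feature selection the slide refers to.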

Semantic Audio Feature(6/6) • Using these likelihood predictions as the new feature set therefore reduces the number of features from almost 40,000 to K (K < 10).
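The reduction can be pictured as applying the K trained models to one long feature vector and keeping only their outputs. In this sketch the model format (a sparse weight dictionary plus a bias per concept) and all concept names other than Metal are hypothetical.

```python
import math


def semantic_features(x, models):
    """Map a long-term feature vector to one probability per concept.

    `models` maps a concept name to (weights, bias), where `weights` is a
    sparse dict {feature_index: weight} holding only the selected features.
    """
    features = {}
    for concept, (weights, bias) in models.items():
        score = bias + sum(w * x[i] for i, w in weights.items())
        features[concept] = 1.0 / (1.0 + math.exp(-score))
    return features
```

A song described by almost 40,000 numbers is then described by K probabilities, each answering a question such as "how much does this sound like Metal?".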

Evaluation(1/7) • Analysis of semantic audio features • Genre classification • Interpretability

Evaluation(2/7) Analysis of semantic audio features: • The logistic regression learning of the genre ground truth worked very well within the RADIO and GTZAN data sets. • For both the training part and the disjoint test part of the data, the separation of Metal from the remaining music is clearly visible.

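Recall vs. Precision • R: the set of relevant items; A: the set of retrieved items; Ra: the relevant items that were retrieved • Recall = |Ra| / |R|: the fraction of relevant items which have been retrieved • Precision = |Ra| / |A|: the fraction of retrieved items which are relevant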
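As a small worked example of the two definitions, assuming items are identified by hashable IDs (the function and argument names are illustrative):

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set of relevant and a set of retrieved items."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # Ra: relevant items that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, with relevant items {1, 2, 3, 4} and retrieved items {3, 4, 5}, two of the three retrieved items are relevant (precision 2/3) and two of the four relevant items were found (recall 1/2).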

Evaluation(3/7) • The precision and recall values as measured on the test set are listed. • The features columns show the number of features picked out of the almost 40,000 candidate features.

Evaluation(4/7) • The left table lists the long-term features picked for 5 or 6 of the 7 models. • In the right table, the authors investigated which features had the largest absolute weights in the logistic regression models, indicating their relative importance in the decision for a genre.

Evaluation(5/7) Genre classification: SVM, KNN, C4.5

Evaluation(6/7)

Evaluation(7/7) Interpretability:

Discussions • If a user provides a categorization of some music he or she knows well, the method could generate personalized features that describe, for example, how much a song sounds like other music that makes the user happy. • One advantage of logistic regression is that the numerical values do not need preprocessing for methods relying on distance calculations, like k-nearest neighbor classification, k-Means clustering, or visualization with Emergent Self-Organizing Maps (ESOM). • The number of candidate features is limited only by the computational resources. Using more long-term features, the accuracy of the models can still be increased, but the computation is quite time consuming.

Discussions • In some of the figures, it is hard to understand what the x-axis and y-axis mean. • Some reference URLs are no longer available, for example: http://marsyas.sf.net. • Long-term feature • C4.5 decision tree

Conclusions • Exhaustive feature generation is used to capture many different aspects of the raw audio data that cannot be used directly. • This can be seen as a meta learning technique loosely related to stacking. • The resulting low-dimensional vector-based representations can efficiently be used for music mining tasks such as genre classification, recommendation, or visualization of music collections.

Applications • Text mining with large feature sets corresponding to words occurring in documents, or video mining, where many features could be derived by combining short-term and long-term descriptions as was done for music. • Other possible applications include news, music recognition services such as Shazam/SoundHound/Track ID, and the stock market.

Thanks for listening
