An evolutionary approach for song genre classification

Rotem Golan, Department of Computer Science, Ben-Gurion University of the Negev, Israel

Outline • Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

Competition overview • A database of 60 music performers has been prepared for the competition. • The material is divided into six categories: classical music, jazz, blues, pop, rock and heavy metal. • For each of the performers 15-20 music pieces have been collected. • All music pieces are partitioned into 20 segments and parameterized. • The feature vector consists of 191 parameters.

Competition overview (Cont.) • Our goal is to estimate the music genre of newly given fragments of music tracks. • Input: • A training set of 12,495 vectors and their genre • A test set of 10,269 vectors without their genre • Output: 10,269 labels (Classical, Jazz, Rock, Blues, Metal or Pop). One for each vector in the test set. • The metric used for evaluating the solutions is standard accuracy, i.e. the ratio of the correctly classified samples to the total number of samples.

A long story • You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. • You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. • John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. • Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether. • Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

A short representation

Observations • In our algorithm, all the values of the network are known except the genre value, which we would like to estimate. • The variables in our algorithm are continuous rather than discrete (except the genre variable). • We divide the possible values of each variable into fixed-size intervals. • The number of intervals changes throughout the evolution. • We refer to this process as the discretization of the variable. • We refer to the Conditional Probability Table of each variable (node) as its CPT.
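The fixed-size discretization described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `discretize` and the choice to take the range `[lo, hi]` from the training data and to clamp out-of-range values are assumptions.

```python
def discretize(value, lo, hi, n_bins):
    """Map a continuous value to a bin index in [0, n_bins - 1].

    Assumption: [lo, hi] is the range of the variable observed in the
    training set and is split into n_bins equal-width intervals.
    """
    if hi <= lo:
        return 0  # degenerate range: everything falls into one bin
    width = (hi - lo) / n_bins
    index = int((value - lo) / width)
    # Clamp values outside [lo, hi) into the first or last bin.
    return min(max(index, 0), n_bins - 1)
```

With `n_bins` stored per variable, changing one entry of the discretization array changes how finely that variable is binned.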

Naïve Bayesian Network

Bayesian Network construction • Once we have determined the chosen variables (amount and choice), their fixed discretization and the structure of the graph, we can easily compute the CPT values for each of the nodes in the graph (according to the training set). • For each vector in the training set, we update all the network’s CPTs by increasing the appropriate entry by one. • After this process, we divide each value by the sum of its row (normalization).
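For the Naïve Bayesian case, the count-then-normalize construction can be sketched like this. The data layout (one table per feature, one row per genre) and the name `build_cpts` are assumptions made for illustration; the input rows are assumed to be already discretized into bin indices.

```python
def build_cpts(rows, labels, n_bins):
    """Return (prior, cpts) estimated from discretized training vectors.

    rows:   list of vectors of bin indices (one index per feature)
    labels: genre of each vector
    n_bins: number of bins of each feature
    cpts[f][genre][b] approximates P(feature f = b | genre).
    """
    genres = sorted(set(labels))
    prior_counts = {g: 0 for g in genres}
    counts = [{g: [0] * n_bins[f] for g in genres} for f in range(len(n_bins))]
    # Counting pass: increase the appropriate entry of every table by one.
    for row, genre in zip(rows, labels):
        prior_counts[genre] += 1
        for f, b in enumerate(row):
            counts[f][genre][b] += 1
    # Normalization pass: divide each entry by the sum of its row.
    prior = {g: c / len(labels) for g, c in prior_counts.items()}
    cpts = [{g: [c / sum(r) for c in r] for g, r in table.items()}
            for table in counts]
    return prior, cpts
```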

Exact Inference in Bayesian Networks • For each vector in the test set, we compute six different probabilities (multiplying the appropriate entries of all the network’s CPTs) and choose the highest one as the genre of this vector. • Each probability corresponds to a different assumption on the genre variable value (Rock, Pop, Blues, Jazz, Classical and Metal).
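A minimal sketch of this inference step for a Naïve Bayesian Network is given below. It sums logarithms instead of multiplying probabilities, which selects the same genre while avoiding numeric underflow; that choice, the `floor` used for zero entries, and the table layout (`cpts[f][genre][b]`) are assumptions, not details taken from the slides.

```python
import math

def predict_genre(row, prior, cpts, floor=1e-6):
    """Return the genre with the highest probability for one discretized vector.

    prior[genre]       ~ P(genre)
    cpts[f][genre][b]  ~ P(feature f = b | genre)
    floor              hypothetical stand-in for zero CPT entries
    """
    best_genre, best_score = None, -math.inf
    for genre, p in prior.items():
        # One score per assumption on the genre value.
        score = math.log(p)
        for f, b in enumerate(row):
            score += math.log(max(cpts[f][genre][b], floor))
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre
```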

Preprocessing • I divided the training set into two sets. • A training set – used for constructing each Bayesian Network in the population. • A validation set – used for computing the fitness of each network in the population. • These sets have the same number of vectors for each category (Rock vectors, Pop vectors, etc.)
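One way to obtain such a split is to divide each genre separately, as sketched here. The 50/50 fraction, the fixed seed and the name `stratified_split` are assumptions; the slides do not say how the split was actually drawn.

```python
import random
from collections import defaultdict

def stratified_split(labels, validation_fraction=0.5, seed=0):
    """Return (train_indices, validation_indices), splitting each genre separately."""
    by_genre = defaultdict(list)
    for i, genre in enumerate(labels):
        by_genre[genre].append(i)
    rng = random.Random(seed)
    train, validation = [], []
    for genre in sorted(by_genre):
        indices = by_genre[genre]
        rng.shuffle(indices)
        cut = int(len(indices) * validation_fraction)
        validation.extend(indices[:cut])   # this genre's share of the validation set
        train.extend(indices[cut:])        # the rest builds the networks
    return train, validation
```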

The three dimensions of the evolutionary algorithm • The three dimensions are: • Variables amount. • Variables choice. • Fixed discretization of the variables. • Every network in the population is a Naïve Bayesian Network, which means that its structure is already determined.

Fitness function • In order to compute the fitness of a network, we estimate the genre of each vector in the validation set, and compare it to its known genre. • The metric used for computing the fitness is standard accuracy, i.e. the ratio of the correctly classified vectors to the total number of vectors in the validation set.
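The fitness computation amounts to a few lines. In this sketch `predict` is a hypothetical callable standing in for a network's inference routine.

```python
def fitness(predict, validation_rows, validation_labels):
    """Standard accuracy of `predict` on the validation set."""
    correct = sum(
        1 for row, genre in zip(validation_rows, validation_labels)
        if predict(row) == genre
    )
    return correct / len(validation_labels)
```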

Selection • In each generation, we choose population_size/2 different networks at most. • We prefer networks that have the highest fitness and are distinct from each other. • After choosing these networks we use them to build a fully sized population by mutating each one of them. • We use bitwise mutation to do so. • Notice that we may use a mutated network to generate a new mutated network.
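The selection scheme above might be sketched as follows. Networks are represented by any values that support `==` (tuples in the test), and `fitness_of` and `mutate` are hypothetical callables; how ties and duplicates were handled in the original program is not stated, so this is only one plausible reading.

```python
import random

def select_parents(population, fitness_of, population_size):
    """Pick at most population_size // 2 distinct networks, best fitness first."""
    limit = population_size // 2
    chosen = []
    for candidate in sorted(population, key=fitness_of, reverse=True):
        if len(chosen) >= limit:
            break
        if candidate not in chosen:  # keep the chosen networks distinct
            chosen.append(candidate)
    return chosen

def refill(parents, mutate, population_size, rng):
    """Grow the population back to full size by mutation.

    Parents are drawn from the growing population, so a mutated network
    may itself be mutated again.
    """
    population = list(parents)
    while len(population) < population_size:
        population.append(mutate(rng.choice(population), rng))
    return population
```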

Mutation • Bitwise mutation. • Parent: • BitSet • Dis • Child: • BitSet • Dis
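A sketch of bitwise mutation on the two parts of an individual: the `BitSet` marking the chosen variables and the `Dis` array holding the number of intervals per variable. The mutation rate, the resampling of `Dis` entries within a range, and the repair step that keeps at least one variable selected are assumptions for illustration.

```python
import random

def mutate(bitset, dis, rate, dis_range, rng):
    """Return a mutated (bitset, dis) child; the parent is left unchanged."""
    lo, hi = dis_range
    # Flip each bit independently with probability `rate`.
    child_bits = [(not b) if rng.random() < rate else b for b in bitset]
    # Assumption: a mutated Dis entry is redrawn uniformly from [lo, hi].
    child_dis = [rng.randint(lo, hi) if rng.random() < rate else d for d in dis]
    # Assumption: a network must keep at least one variable.
    if not any(child_bits):
        child_bits[rng.randrange(len(child_bits))] = True
    return child_bits, child_dis
```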

Crossover • Single point crossover. • Parent 1: • Parent 2: • Child 1: • Child 2:
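Single point crossover on two equal-length sequences (for example two `BitSet`s or two discretization arrays) can be sketched as below. The function name and the requirement of at least two positions are assumptions of this sketch.

```python
import random

def single_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length lists at one random cut point."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    point = rng.randint(1, len(parent1) - 1)  # cut strictly inside the list
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```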

Results (Cont.) • Model - Naive Bayesian • Population size - 40 • Generations - 400 • Variables - [1,191] • discretization - [5,15] • First population score - 0.7878 • Best score - 0.8415 • Test Set score - 0.7323 • Website’s score: • Preliminary result - 0.7317 • Final result - 0.73024 • “Zeroes” = cpt_min/10

Observation • Notice that there is a gap of roughly 0.11 between my best score (0.8415) and the website’s score (0.7317). • We will discuss this issue (overfitting) later on.

Adding the fourth dimension • The fourth dimension is the structure of the Bayesian Network. • Now, the population includes different Bayesian Networks, meaning networks with different structures, variables choice, variables amount and discretization array.

Evolution operations • The selection process is the same as in the previous algorithm. • The crossover and mutation are similar. • First, we start like the previous algorithm (Handling the BitSet and the discretization array) • Then, we add all the edges we can from the parent (mutation) or parents (crossover) to the child’s graph. • Finally, we make sure that the child’s graph is a connected acyclic graph.

Results • Model - Bayesian Network • Population size – 20 • Generations – Crashed on generation 104 • Variables - [1,191] • discretization - [2,6] • First population score - 0.4920 • Best score - ~0.8559 • Website’s score : • It Crashed

Memory problems • The program was executed on amdsrv3, with a 4.5 GB memory limit. • Even though the discretization interval is [2,6], the program crashed due to a Java heap space error. • As a result I decided to decrease the population size from 20 to 10.

Results (Cont.) • Model - Bayesian Network • Population size – 10 • Generations – 800 • Variables - [1,191] • discretization - [2,10] • First population score - 0.5463 • Best score - 0.8686 • Website’s score: • Preliminary score - 0.7085

Results (Cont.) • Model - Bayesian Network • Population size – 10 • Generations – 800 • Variables - [1,191] • discretization - [2,20] • First population score - 0.5978 • Best score - 0.8708 • Website’s score: • Preliminary score - 0.6972

Overfitting • As we increase the discretization interval, my score increases and the website’s score decreases. • One explanation is that increasing the search space may cause the algorithm to find patterns that correlate strongly with the specific input data I received, while having no correlation at all with real-life data. • One possible solution is using k-fold cross validation.
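As a sketch of how k-fold cross validation could replace the single validation set, the fitness of a network can be averaged over k train/validation splits. `evaluate` is a hypothetical callable that builds the network on the training indices and returns its accuracy on the validation indices; the data is assumed to be shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def cross_validated_fitness(evaluate, n_samples, k):
    """Average the score returned by `evaluate` over all k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / k
```

Every vector is used for validation exactly once, so a network can no longer score well by fitting one particular validation subset.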

Final competition scores

My previous score

ECOC • ECOC stands for Error Correcting Output Codes. • ECOC is a technique for using binary classification algorithms to solve multi-class problems. • Each class is assigned a unique code word – a binary string of length n. • A set of n binary classifiers is then trained, one for each bit position in the code word. • New instances are classified by evaluating all n binary classifiers to generate a new n-bit string, which is then compared to all the code words using Hamming distance.
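The decoding step can be sketched as follows. The code words in the usage note are made up for illustration and are not the ones used in the competition.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(bits, code_words):
    """Return the class whose code word is closest to `bits` in Hamming distance.

    code_words maps each class name to its n-bit code word (as a string).
    """
    return min(code_words, key=lambda genre: hamming(bits, code_words[genre]))
```

For example, with the hypothetical code words `{"rock": "00000", "pop": "11100", "jazz": "00111"}`, the classifier output `"10000"` is decoded as `"rock"` even though one bit is wrong.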

ECOC properties • The ECOC codes are generated so that their pairwise Hamming distances are maximized. • In general, a code with minimum pairwise Hamming distance $d$ is able to correct up to $\lfloor (d-1)/2 \rfloor$ individual bit (classifier) errors. • In our case:

ECOC (Cont.)

Entropy

Recursive minimal entropy partitioning (Fayyad & Irani - 1993) • The goal of this algorithm is to discretize all numeric attributes in the dataset into nominal attributes. • The discretization is performed by selecting a bin boundary minimizing the entropy in the induced partitions. • The method is then applied recursively to both new partitions until a stopping criterion is reached. • Fayyad and Irani make use of the Minimum Description Length principle to determine a stopping criterion.

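RMEP (Cont.) • Given a set of instances $S$, a feature $A$, and a partition boundary $T$, the class information entropy of the partition induced by $T$ is given by: $E(A, T; S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$, where $S_1$ and $S_2$ are the subsets of $S$ on either side of $T$. • For a given feature $A$, the boundary $T_{\min}$ which minimizes the entropy function over all possible partition boundaries is selected as a binary discretization boundary.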
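The boundary search can be sketched as below: the class information entropy of a two-way partition is the size-weighted average of the entropies of its two sides, and the best boundary is the candidate that minimizes it. Using midpoints between consecutive distinct values as candidates is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels; 0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_entropy(values, labels, boundary):
    """Weighted entropy of the two partitions induced by `boundary`."""
    left = [g for v, g in zip(values, labels) if v <= boundary]
    right = [g for v, g in zip(values, labels) if v > boundary]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_boundary(values, labels):
    """Boundary minimizing the partition entropy.

    Assumes the feature has at least two distinct values.
    """
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: partition_entropy(values, labels, t))
```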

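RMEP (Cont.) • The stopping criterion is: stop partitioning $S$ when $\mathrm{Gain}(A, T; S) < \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N}$, where $N = |S|$, $\mathrm{Gain}(A, T; S) = \mathrm{Ent}(S) - E(A, T; S)$, $\Delta(A, T; S) = \log_2(3^k - 2) - [k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)]$, and $k$, $k_1$, $k_2$ are the numbers of classes present in $S$, $S_1$, $S_2$. • Since the partitions along each branch of the recursive discretization are evaluated independently using this criterion, some areas in the continuous spaces will be partitioned very finely whereas others (which have relatively low entropy) will be partitioned coarsely.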
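The MDL-based test of Fayyad and Irani (1993), as it is commonly stated, accepts a split only when the information gain exceeds a threshold that depends on the number of instances and on the classes present on each side. The sketch below follows that common statement; the function name and the list-based interface are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels; 0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_accepts_split(labels, left, right):
    """True if splitting `labels` into `left` and `right` passes the MDL test.

    Assumes len(labels) >= 2 and that left + right is a partition of labels.
    """
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) / n * ent1 + len(right) / n * ent2)
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    # Recursion continues only while the gain beats the MDL threshold.
    return gain > (math.log2(n - 1) + delta) / n
```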

Results • These improvements resulted in a standard accuracy score of for the test set. • This score is for a population of networks and generations

Example of a decision tree

C4.5 algorithm • Check for base cases, which are: • All instances are from the same class • For each attribute $a$ • Find the maximal gain ratio from splitting on $a$ using all possible thresholds • Let $a_{best}$ be the attribute with the highest gain ratio • Create a decision node that splits on $a_{best}$ • Recurse on the sublists obtained by splitting on $a_{best}$, and add those nodes as children of the node

Splitting criteria • Notice that the split information tends to increase with the number of outcomes of a split. • As a result, splitting on a variable with maximal bin boundaries will get a low gain ratio.
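C4.5 normalizes the information gain of a split by its split information, which penalizes splits with many outcomes. The following sketch illustrates that effect; the function names and the list-of-subsets interface are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Information gain of a split divided by its split information.

    `subsets` is the list of label lists produced by the split.
    """
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets if s)
    return gain / split_info if split_info > 0 else 0.0
```

Splitting four instances `a, a, b, b` into two pure halves gives a gain ratio of 1.0, while splitting them into four singletons yields the same gain but a gain ratio of only 0.5.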

Results of C4.5 alone • As noted above, C4.5 uses binary discretization. • Using C4.5 alone (after ECOC) yields a standard accuracy score of on the test set.

A new prediction model • For each instance in the test set • Let be a -bit string • For to • If • Else • Let be the closest genre code word to according to their Hamming distance • Return

Results • The new prediction model yields a standard accuracy score of for the test set which puts me in the place (estimation) out of active teams. • Registered teams: (with members) • Active teams: • Total number of submitted solutions: • This score is for a population of networks and generations

Results (Cont.)

My new score
