Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)
Overview • Input to the software is a set of files • Output is a hierarchical clustering shown as an unrooted binary tree • This is a case of unsupervised learning • (example follows)
Process Overview • 1. File translations, if necessary, for example from MIDI to “player-piano” type format. • 2. Calculation of Normalized Compression Distance, or NCD. • 3. Representation as an unrooted binary tree.
What’s Unique? • This clustering system is unique in that it can be described as feature-free • There are no parameters to tune, and no domain-specific knowledge went into it. • Using general-purpose data compressors gives us a parameterized family of features automatically for each domain
Featureless Clustering • Having no parameters and no customized features makes the method convenient to develop as well as to use • Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions • So how does it work?
MIDI Translation • In order to restrict the information entering the algorithm, we remove undesirable MIDI fields such as artist or composer name, headers, and other non-musical data. • We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.
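As an illustration only (not the authors' actual preprocessing code), a minimal Python sketch of this kind of MIDI stripping, assuming the third-party mido library: it drops meta messages (composer and artist names, text headers, and other non-musical data), keeps the per-track note on/off events with their timing, and flattens note velocity so individual note volume carries no information.

# Hypothetical sketch of the MIDI stripping step, using the mido library.
# Delta times of dropped messages are simply discarded in this simplification.
import mido

def strip_midi(in_path, out_path):
    src = mido.MidiFile(in_path)
    dst = mido.MidiFile(ticks_per_beat=src.ticks_per_beat)
    for track in src.tracks:
        new_track = mido.MidiTrack()
        for msg in track:
            if msg.is_meta:
                continue  # drop composer/artist names, headers, other text
            if msg.type in ('note_on', 'note_off'):
                # keep pitch and timing, flatten individual note volume
                velocity = 64 if msg.type == 'note_on' else 0
                new_track.append(msg.copy(velocity=velocity))
        dst.tracks.append(new_track)
    dst.save(out_path)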
Gene sequence translation • Genetic sequences are represented in ASCII over the four-letter alphabet A, T, G, C • Almost no translation is needed at all
Image Translation • Black and white images are converted to ASCII using spaces for black and # for white • Newlines are used to separate rows
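A minimal sketch of the image conversion described above, assuming the bitmap is already available as a 2-D array of 0/1 pixel values (that array format, and the helper name, are illustrative assumptions):

# Hypothetical sketch: black-and-white bitmap -> ASCII text,
# a space for each black pixel, '#' for each white pixel, newlines between rows.
def image_to_ascii(pixels):
    # pixels: list of rows, each row a list of 0 (black) / 1 (white)
    return "\n".join(
        "".join('#' if p else ' ' for p in row)
        for row in pixels
    )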
NCD • Once a group of songs has been acquired and translated, a quantity is computed on each pair in the group • Normalized Compression Distance measures how different two files are from one another.
NCD • NCD is based on an earlier idea called Normalized Information Distance. • NID uses a mathematical abstraction called Kolmogorov complexity, often abbreviated K, as its compressor. • K represents a perfect data compressor, and is therefore uncomputable.
NCD • Since we cannot compute K, we approximate it using real general-purpose file compressors like gzip, bzip2, winzip, ppmz, and others • NCD depends on the particular compressor used, so different compressors may give different results for the same pair of objects
NCD • C(x) means “the compressed size of x” • C(xy) means “the compressed size of the concatenation of x and y” • 0 <= NCD(x,y) <= 1 (roughly)
NCD • NCD measures how similar or different two strings (or equivalently, files) are. • NCD(x,x) = 0, because nothing is different from itself • NCD(x,y) = 1 means that x and y are completely unrelated • In practice, values are usually less extreme
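Concretely, the Clustering by Compression paper defines NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)). A minimal Python sketch of computing it, with bz2 from the standard library standing in for the general-purpose compressor (any of the compressors listed above could be substituted; this is not the authors' CompLearn implementation):

# Sketch of NCD using bz2 as the stand-in compressor.
import bz2

def C(data: bytes) -> int:
    # compressed size of data, in bytes
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Identical inputs give a value near 0; unrelated inputs approach 1.
# With real compressors the value can slightly exceed these bounds, hence "roughly".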
NCD • Computing NCD of every song with every other song yields a 2-dimensional symmetric distance matrix • Next step is transforming this array of distances into something easier to grasp • We use the Quartet Method to construct an unrooted binary tree from the NCD matrix
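Continuing the sketch above, the symmetric distance matrix is built by applying ncd() to every pair; the in-memory list of translated files is an assumption for illustration:

# Build the symmetric NCD distance matrix for a list of translated files
# (each entry is the file's contents as bytes). Uses the ncd() sketch above.
def ncd_matrix(objects):
    n = len(objects)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j])
            dist[i][j] = dist[j][i] = d
    return dist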
Quartet Method • Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction popular for the last 30 years • The input is a matrix of distances (NCD) • The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections. • Tree is just one visualization of NCD matrix
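For a single quartet {u, v, w, x} there are three possible unrooted pairings, and in the paper's quartet cost each pairing is scored as the sum of the distances between its two paired leaves. A minimal sketch of scoring one quartet from the NCD matrix follows; the full tree construction (scoring every quartet embedded in a candidate tree and searching over topologies) is omitted here:

# Cost of the three possible topologies of one quartet {u, v, w, x},
# using the quartet cost C(uv|wx) = d(u,v) + d(w,x).
# 'dist' is the symmetric NCD matrix from the previous sketch.
def quartet_costs(dist, u, v, w, x):
    return {
        'uv|wx': dist[u][v] + dist[w][x],
        'uw|vx': dist[u][w] + dist[v][x],
        'ux|vw': dist[u][x] + dist[v][w],
    }

def best_quartet(dist, u, v, w, x):
    costs = quartet_costs(dist, u, v, w, x)
    return min(costs, key=costs.get)  # lowest-cost pairing for this quartet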
Newer developments • Since the original Algorithmic Clustering of Music paper, we have developed further the underlying mathematical formalisms upon which the method is based, in a new paper, Clustering by Compression • We’ve included experiments from many other areas: biology, astronomy, images…
Current and future work • This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces • We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm
New directions • Combination of NCD and Support Vector Machine (SVM) learning to provide scalable generalization in a wide class of domains, both musical and otherwise • Application of our techniques to real outstanding questions within the musical community
Contact and more info • Related papers and information: http://www.cwi.nl/~cilibrar • Software: http://complearn.sourceforge.net/ • Rudi.Cilibrasi@cwi.nl • Paul.Vitanyi@cwi.nl • Ronald.de.Wolf@cwi.nl