Similarity Analysis by Data Compression
Peter Grünwald, CWI, Amsterdam
Petri Myllymäki, University of Helsinki, CoSCo
Research carried out by Rudi Cilibrasi, Teemu Roos, Hannes Wettig.
CWI is the National Centre of Mathematics and Computer Science in the Netherlands. CoSCo is the Complex Systems Computation Research Group.
Data Compression…
• Consider two files A and B
• Compress each with your favourite general-purpose data compressor, e.g. gzip
• Let L(A) and L(B) be the compressed lengths (in bits) of A and B, respectively
…and Similarity
• Suppose we want to compress both A and B.
• We can either first compress A and then B
  • Resulting length: L(A) + L(B)
• Or we can glue A and B together and compress the resulting file AB
  • Resulting length: L(AB)
CLAIM: if (and only if) A and B are ‘similar’, then L(AB) << L(A) + L(B)
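The claim is easy to try out. The following is a minimal sketch, assuming zlib (the DEFLATE algorithm that gzip uses) as the compressor. The helper names `compressed_bits` and `random_bytes` and the synthetic data are assumptions made for illustration and do not come from the talk.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """L(data): compressed length in bits, using zlib at its highest level."""
    return 8 * len(zlib.compress(data, 9))


def random_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes, a synthetic stand-in for a real file."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# A: a synthetic "file"; B: a near-copy of A; C: unrelated to A.
A = random_bytes(2000, seed=1)
B = A[:1000] + b"a small edit" + A[1000:]
C = random_bytes(2000, seed=2)

for name, other in (("B (similar)", B), ("C (unrelated)", C)):
    separate = compressed_bits(A) + compressed_bits(other)
    glued = compressed_bits(A + other)
    print(f"A with {name}: separate = {separate} bits, glued = {glued} bits")
```

With the near-copy, the glued file should compress to far fewer bits than the two files compressed separately; with the unrelated file, gluing saves almost nothing.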
“Domain-Independent” Notion of Similarity
• Consider the same ASCII text in many different languages, e.g. the Declaration of Human Rights
  • English is close to German
  • English is reasonably close to French
  • German is farther from French
  • All three are far from, say, Polish
• Consider DNA of different species
  • Human is very close to Chimpanzee, somewhat less close to Gorilla, even less close to Baboon… and very far from Wheat
• Consider MIDI files of popular songs…
Background
• For a given compressor with length function L, define the Normalized Compression Distance as
    NCD(A, B) = (L(AB) − min{L(A), L(B)}) / max{L(A), L(B)}
• If L is taken to be Kolmogorov complexity, this becomes a “universal metric”
  • essentially, whenever two objects are close according to some computable distance function, they will be close according to NCD as well
• For practical applications, use a computationally practical general-purpose compressor
  • gzip, bzip, ppm etc.
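A minimal sketch of the distance, assuming the standard NCD definition from the literature (spelled out in the docstring) and zlib as the practical compressor. The names `clen`, `ncd` and `random_bytes` and the synthetic inputs are illustrative assumptions. Lengths are measured in bytes rather than bits, which does not matter because the units cancel in the ratio.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes, with zlib standing in for L."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """NCD(a, b) = (L(ab) - min(L(a), L(b))) / max(L(a), L(b))."""
    la, lb, lab = clen(a), clen(b), clen(a + b)
    return (lab - min(la, lb)) / max(la, lb)


def random_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes used as synthetic test objects."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


x = random_bytes(2000, seed=1)
near_copy = x[:1000] + b"a small edit" + x[1000:]
unrelated = random_bytes(2000, seed=2)

print("NCD(x, near_copy) =", round(ncd(x, near_copy), 3))
print("NCD(x, unrelated) =", round(ncd(x, unrelated), 3))
```

Values near 0 indicate similar objects and values near 1 indicate unrelated ones. With a real compressor the value is only an approximation and can land slightly outside the interval [0, 1].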
Applications
• For a set of N possibly related files, compute the N² pairwise normalized compression distances
• To visualize, create a binary tree such that close objects are close to each other on the tree
  • e.g. using the quartet puzzling method
You can do this at home! You cannot do this at home!
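The first step, the N × N distance matrix, can be sketched as follows. This is an illustration under assumptions: zlib as the compressor, toy "documents" drawn from two invented vocabularies, and hypothetical helper names (`distance_matrix`, `nearest`, `toy_text`). The tree-building step is not sketched here; the nearest-neighbour lookup is only a quick sanity check on the matrix.

```python
import random
import zlib
from typing import Dict, List


def clen(data: bytes) -> int:
    """Compressed length in bytes (zlib, level 9)."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance between a and b."""
    la, lb = clen(a), clen(b)
    return (clen(a + b) - min(la, lb)) / max(la, lb)


def distance_matrix(files: Dict[str, bytes]) -> List[List[float]]:
    """All N x N pairwise NCD values; rows and columns follow dict order."""
    items = list(files.values())
    return [[ncd(x, y) for y in items] for x in items]


def nearest(names: List[str], D: List[List[float]], i: int) -> str:
    """Name of the object closest to object i (excluding i itself)."""
    j = min((k for k in range(len(names)) if k != i), key=lambda k: D[i][k])
    return names[j]


def toy_text(vocab: List[str], n_words: int, seed: int) -> bytes:
    """Random word sequence over a vocabulary (made-up sample data)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


vocab_x = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
vocab_y = ["uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho"]
x1 = toy_text(vocab_x, 500, seed=1)
y1 = toy_text(vocab_y, 500, seed=2)
half_x, half_y = len(x1) // 2, len(y1) // 2
files = {
    "x1": x1,
    "x2": x1[:half_x] + b" inserted words " + x1[half_x:],  # near-copy of x1
    "y1": y1,
    "y2": y1[:half_y] + b" palabras nuevas " + y1[half_y:],  # near-copy of y1
}

names = list(files)
D = distance_matrix(files)
for i, name in enumerate(names):
    print(name, "is closest to", nearest(names, D, i))
```

Both orderings of each pair are computed because a practical compressor does not give exactly the same length for AB and BA.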
Pump-Priming
• Pre-Pump Priming:
  • Theory developed and tested on several data sets at CWI; featured in New Scientist, Pour La Science, Izvestija…
  • Successes include: SARS is CORONA
• Pump Priming:
  • Development of the popular open-source package CompLearn (www.complearn.org, Rudi Cilibrasi)
  • Application of CompLearn and other compression-based methods to stemmatology
Compression-Based Methods in Stemmatic Analysis Legend of St. Henry of Finland, Manuscript H, Helsinki University Library
Before Gutenberg...
• Historical manuscripts were repeatedly copied by hand
• Typical ‘errors’ include misspellings, omissions, changes of word order, etc.
Manuscript Evolution
• The texts spread out in a number of copies, following a tree-like graph
• Typically only a fraction of the manuscripts survive to the present day
Stemmatic Analysis
• Stemmatology: “Discipline that attempts to reconstruct the transmission of a text on the basis of relations between the various surviving manuscripts.”
• Cf. Phylogenetics: “The study of evolutionary relatedness among various groups of organisms.”
    Stemmatology        Phylogenetics
    manuscript          individual
    written text        DNA
    copying             reproduction
    modification        mutation
    ‘contamination’     horizontal transfer
Compression-Based Approach
• Most existing approaches (distance-based methods, parsimony methods, Bayesian methods, etc.) are based on methods developed for biological phylogeny
• Pascal pump priming: a compression-based approach for stemmatic analysis
• Cost function: the amount of information required to describe B given A.
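One common way to approximate "the information required to describe B given A" with an off-the-shelf compressor is L(AB) − L(A). The sketch below uses that approximation with zlib; this is an assumption, since the slide does not say how the cost is computed. Summing the cost over parent-to-child edges in `stemma_cost` is likewise an illustrative assumption, and the "manuscripts" are made-up toy texts.

```python
import random
import zlib
from typing import Dict, List, Tuple

VOCAB = ["sanctus", "rex", "terra", "populus", "fides", "verbum", "lux", "pax",
         "mare", "caelum", "domus", "via", "vita", "tempus", "nomen", "liber"]


def clen(data: bytes) -> int:
    """Compressed length in bytes (zlib, level 9)."""
    return len(zlib.compress(data, 9))


def cond_cost(b: bytes, a: bytes) -> int:
    """Approximate cost of describing b given a, as L(ab) - L(a)."""
    return clen(a + b) - clen(a)


def stemma_cost(edges: List[Tuple[str, str]], texts: Dict[str, bytes]) -> int:
    """Total cost of a candidate tree: sum of cond_cost over (parent, child) edges."""
    return sum(cond_cost(texts[child], texts[parent]) for parent, child in edges)


def toy_text(n_words: int, seed: int) -> bytes:
    """Random word sequence over VOCAB (made-up sample data, not real Latin)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(VOCAB) for _ in range(n_words)).encode()


def scribal_copy(text: bytes, n_changes: int, seed: int) -> bytes:
    """Copy a text while replacing a few randomly chosen words."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_changes):
        words[rng.randrange(len(words))] = rng.choice(VOCAB).encode()
    return b" ".join(words)


original = toy_text(600, seed=1)
texts = {
    "original": original,
    "copy": scribal_copy(original, n_changes=8, seed=2),
    "unrelated": toy_text(600, seed=3),
}

print("cost(copy | original)      =", cond_cost(texts["copy"], original))
print("cost(unrelated | original) =", cond_cost(texts["unrelated"], original))
print("cost of the one-edge stemma original -> copy:",
      stemma_cost([("original", "copy")], texts))
```

A copy with a handful of changes should be much cheaper to describe given its exemplar than an unrelated text of the same length, which is what makes the cost usable for comparing candidate stemmata.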
Constructing the stemma
• Dynamic programming for handling the missing nodes
• With 52 existing documents, the number of trees is about 2.7 × 10^78, hence simulated annealing search
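The size of the search space can be checked with a short calculation. The slide does not say how the figure was obtained; assuming it counts unrooted bifurcating trees with the 52 documents as labelled leaves, the count is the double factorial (2n − 5)!! = 1 · 3 · 5 ··· (2n − 5), which gives a number of that magnitude. The function name is hypothetical.

```python
def count_unrooted_binary_trees(n_leaves: int) -> int:
    """Number of unrooted bifurcating trees with n labelled leaves: (2n - 5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n <= 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 52):
    print(n, "leaves:", f"{count_unrooted_binary_trees(n):.1e}", "trees")
```

Exhaustive enumeration is out of the question at this size, which is why a stochastic search such as simulated annealing is used instead.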
How Does It Work?
• Actually, surprisingly well!
• In Helsinki, we have started a 2-year project with the historians, funded by the Emil Aaltonen Foundation, to study this approach further
The Pascal Computer-Assisted Stemmatology Challenge
• Data set #1: Heinrichi data, collected specifically for this challenge
• Data set #2: The Parzival data. The text is the beginning of the German poem Parzival by Wolfram von Eschenbach (translated to English by A.T. Hatto). Data kindly provided to us by M. Spencer and H. F. Windram
• Data set #3: Notre Besoin. The text is from Stig Dagerman's Notre besoin de consolation est impossible à rassasier, Paris: Actes Sud, 1952 (translated to French from Swedish by P. Bouquet). Data kindly provided to us by Caroline Macé.
Challenge results
• No clear overall winner over all data sets
• CompLearn performed very well on Parzival but poorly on Heinrichi. Why? More research is required
• Nice side result: the Heinrichi data set is internationally quite unique; a platform for future collaboration with other sciences?
Future work
• Analysis of Challenge results
• New Challenge?
• Application to the Finnish Cultural Foundation to fund a two-year European research network on stemmatology
  • built around a series of 4-5 international workshops gathering top experts of the field
  • names in the application represent various disciplines including historical studies, theology, philology, computer science, mathematics and biology
• Workshop on information-theoretic approaches to modeling in Helsinki?
  • July 2008, during ICML, UAI & COLT