Use of Kolmogorov distance identification of web page authorship, topic and domain

David Parry, Auckland University of Technology, New Zealand. Dave.parry@aut.ac.nz

Overview • Problem Statement • Kolmogorov distance • Experimental methods • Results • Clustering • Conclusions

Problem statement • It is often desirable for information retrieval systems to calculate a measure of similarity between documents. • Similarity measures generally rely on some sort of parsing, or understanding of documents, but effective parsing often depends on detailed knowledge of document structure.

General-purpose similarity • Acts on any string of data points. • Useful for: • Clustering • Verification • Filtering • Motif analysis • Exception detection.

Use of the “zip” technique • In 2002 Benedetto, Caglioti, & Loreto used the “Zip” compression algorithm to identify the language of documents. • The technique involved concatenating a file in a known language with a file in an unknown language and comparing the lengths of the zipped results. • The shortest concatenated zip file occurred when the known file was written in the same language as the unknown file.
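
The concatenate-and-compress idea can be sketched in a few lines. This is a minimal illustration only, not the code used by Benedetto et al.: it assumes Python's `zlib` (DEFLATE) as a stand-in for "Zip", and the helper names `extra_bytes` and `guess_language` and the two tiny sample corpora are hypothetical. It scores each known file by how many extra compressed bytes the unknown text costs when appended to it.

```python
import zlib
from typing import Mapping


def extra_bytes(known: bytes, unknown: bytes) -> int:
    """Compressed bytes added when `unknown` is appended to `known`."""
    return len(zlib.compress(known + unknown, 9)) - len(zlib.compress(known, 9))


def guess_language(unknown: bytes, corpora: Mapping[str, bytes]) -> str:
    """Return the label of the known file that the unknown text extends most cheaply."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], unknown))


# Hypothetical toy corpora; real use would need far longer reference files.
corpora = {
    "english": b"the cat sat on the mat and the dog lay by the door "
               b"while the rain fell on the roof of the house ",
    "italian": b"il gatto era seduto sul tappeto e il cane stava vicino alla porta "
               b"mentre la pioggia cadeva sul tetto ",
}
unknown = b"the dog sat by the door of the house and the rain fell on the mat "

print(guess_language(unknown, corpora))
```

Subtracting the compressed length of the known file alone is a design choice made here so that reference files of different sizes can be compared. The slide itself only says that the lengths of the zipped concatenations were compared.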

Extensions to this technique • This approach was also used for author confirmation. • A hierarchical clustering algorithm was used to construct language trees.

Kolmogorov Distance • Li, Chen, Li, Ma, & Vitányi, 2003. • Let C(A|B) be the compressed size of A using the compression dictionary built while compressing B, and vice versa for C(B|A). • Let C(A) and C(B) be the compressed lengths of A and B using their own compression dictionaries. • The Kolmogorov distance between A and B, D(A,B), is given by:

Modified approach • Obtain the two files, file1 and file2. • Concatenate them in two ways: file1 + file2 = file12 and file2 + file1 = file21. • Calculate the compressed length of file1 as zip1, file2 as zip2, file12 as zip12 and file21 as zip21. • The Kolmogorov distance (D) is then given by:
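
The formula for D is not preserved in this transcript, so the sketch below should be read as an illustration of the four compressed lengths, not as the author's method. It assumes Python's `zlib` as the compressor. In place of D it substitutes the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), averaged over the two concatenation orders. That is a related but different measure, and the function names are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Size in bytes of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


def zip_lengths(file1: bytes, file2: bytes) -> dict:
    """The four compressed lengths named on the slide."""
    return {
        "zip1": compressed_len(file1),
        "zip2": compressed_len(file2),
        "zip12": compressed_len(file1 + file2),  # file12 = file1 + file2
        "zip21": compressed_len(file2 + file1),  # file21 = file2 + file1
    }


def illustrative_distance(file1: bytes, file2: bytes) -> float:
    """ASSUMPTION: NCD averaged over both concatenation orders.

    A stand-in for the slide's D, whose formula is not available here.
    """
    z = zip_lengths(file1, file2)
    smaller = min(z["zip1"], z["zip2"])
    larger = max(z["zip1"], z["zip2"])
    ncd12 = (z["zip12"] - smaller) / larger
    ncd21 = (z["zip21"] - smaller) / larger
    return (ncd12 + ncd21) / 2
```

Using both zip12 and zip21 makes the value independent of the order in which the two files are passed. DEFLATE only looks back 32 KB, so for files larger than that window the second file can reuse little of the first, and a compressor with a larger window may be a better choice.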

Experiments • Author Identification from an online discussion board • Domain detection from sets of WWW pages • Topic detection from a collection of related WWW pages.

Methods • Load files from the WWW. • Compare each test file with 10 others, one of which is {by the same author, from the same domain, on the same topic}. • Use the modified Kolmogorov distance algorithm. • Select the combination with the shortest distance.
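
The selection step can be sketched as follows. This is an assumption-laden outline, not the experimental code: `ncd` is the standard normalized compression distance used as a placeholder for the modified Kolmogorov distance, and `closest_candidate` and `hit_rate` are hypothetical helper names. Each trial pairs one test file with its candidate files and the index of the candidate known to be related.

```python
import zlib
from typing import Callable, Iterable, Sequence, Tuple


def ncd(a: bytes, b: bytes) -> float:
    """Placeholder distance (normalized compression distance), not the slide's D."""
    ca, cb = len(zlib.compress(a, 9)), len(zlib.compress(b, 9))
    cab = len(zlib.compress(a + b, 9))
    return (cab - min(ca, cb)) / max(ca, cb)


def closest_candidate(test: bytes, candidates: Sequence[bytes],
                      distance: Callable[[bytes, bytes], float] = ncd) -> int:
    """Index of the candidate with the shortest distance to the test file."""
    distances = [distance(test, c) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)


def hit_rate(trials: Iterable[Tuple[bytes, Sequence[bytes], int]]) -> float:
    """Fraction of trials in which the closest candidate is the related one."""
    outcomes = [closest_candidate(test, cands) == related
                for test, cands, related in trials]
    return sum(outcomes) / len(outcomes)
```

With 10 candidates per test file and one related file among them, a selector that picks at random would score a hit rate of about 0.10, which is the baseline the results are compared against.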

Analysis • Chi-squared was used to analyse the results. • This is not really an IR system, as exactly 1 document is “retrieved” from each set of 10. • Precision can be related to the percentage of trials in which the lowest Kolmogorov distance is found for the desired outcome.

Results – Authorship • 160 initial documents, 1600 total.

| Percent in sample | Status | Percent with shortest KD |
| --- | --- | --- |
| 90% | Author1 <> Author2 | 51.88% |
| 10% | Author1 = Author2 | 48.13% |

• Using chi-squared, this result is significant at the p<0.001 level (SPSS 11): χ²(1, N=160) = 258, p<0.001.

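Web domains sampled

| Domain Name | Number of Pages | Average File Length |
| --- | --- | --- |
| AUT | 2192 | 58518 |
| OBGYN | 203 | 25937 |
| Microsoft | 442 | 882771 |
| Hon | 19 | 21600 |
| Apple | 588 | 37319 |
| Guardian | 234 | 38326 |
| Total | 3678 | 177411.8 (mean of the six domain averages) |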

Results – Web domain • 80 seed files, from 6 domains.

| Percent in sample | Status | Percent with lowest KD |
| --- | --- | --- |
| 90% | Different domain | 18.75% |
| 10% | Same domain | 81.25% |

• Using chi-squared, this result is significant at the p<0.001 level: χ²(1, N=80) = 451, p<0.001.

Results – Topics

| Source | Occurrences with shortest distance | Percent in sample |
| --- | --- | --- |
| Different topic domain | 17.89% | 90% |
| Same topic domain | 82.11% | 10% |

• χ²(1, N=665) = 3839, p<0.001.
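
The chi-squared values on the results slides can be checked with a one-degree-of-freedom goodness-of-fit calculation against the 10%/90% split expected by chance. The sketch below is only a check of the arithmetic, not the SPSS procedure that was used. The hit counts (77 of 160, 65 of 80, 546 of 665) are assumptions inferred from the reported percentages, and `chi_squared_vs_chance` is a hypothetical helper name.

```python
def chi_squared_vs_chance(hits: int, trials: int, chance: float = 0.10) -> float:
    """Chi-squared statistic (1 degree of freedom) for hits vs. misses against chance."""
    misses = trials - hits
    expected_hits = trials * chance
    expected_misses = trials * (1 - chance)
    return ((hits - expected_hits) ** 2 / expected_hits
            + (misses - expected_misses) ** 2 / expected_misses)


# Hit counts are ASSUMED from the slide percentages, not taken from the raw data.
print(round(chi_squared_vs_chance(77, 160), 1))   # authorship: 48.13% of 160 -> 258.4
print(round(chi_squared_vs_chance(65, 80), 2))    # web domain: 81.25% of 80  -> 451.25
print(round(chi_squared_vs_chance(546, 665), 1))  # topics: 82.11% of 665     -> 3841.6
```

Under these assumed counts the authorship and web domain values agree with the reported 258 and 451, while the topics value comes out slightly above the reported 3839. All three are far beyond 10.83, the critical value for p<0.001 with one degree of freedom.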

Conclusions • The modified Kolmogorov distance algorithm is capable of identifying related documents more often than chance. • This distance measure does not rely on parsing or semantic analysis. • This method may have application as part of an IR system.
