Measuring Contribution of HTML Features in Web Document Clustering
Oldemar Rodríguez, School of Mathematics, UCR and Predisoft
Esteban Meneses, Computing Research Center, ITCR
Motivation Which HTML feature contributes most to good clustering results? Reference: "Using symbolic objects to cluster web documents", 15th World Wide Web Conference (2006).
HTML Document Clustering Find meaningful groups from a web document collection. Effectively represent web document clusters for further analysis.
Classical Representations Different approaches for representing a web document. <5,22,19,4,...,38>
Vectorial Representation Every document is represented by a vector in n-dimensional space. Bag of words scheme. Each variable represents the relative weight of a term in the document.
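The bag of words scheme can be illustrated with a minimal sketch. It assumes that the "relative weight" of a term is its relative frequency in the document; the slides do not state the weighting actually used, and the helper names `tokenize` and `bag_of_words` are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one relative-frequency vector per document.

    Assumption: the weight of a term is count / document length.
    """
    token_lists = [tokenize(doc) for doc in documents]
    # One dimension per distinct term in the whole collection.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


docs = ["web document clustering", "clustering web pages of the web"]
vocabulary, vectors = bag_of_words(docs)
```

Word order is lost in this representation: both documents become points in the same 6-dimensional space, one coordinate per vocabulary term.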
Symbolic Objects Real-life objects are too complex to be represented by points in a vectorial space [Bock & Diday, 2000]. Symbolic objects overcome this limitation by representing concepts rather than individuals. In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
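As a rough illustration of how individuals can be aggregated into a concept, the sketch below summarizes a group of documents with an interval, a set, and a histogram variable. The field names (`terms`, `tags`, `length`) and the function `summarize` are hypothetical and are not taken from the authors' system.

```python
from collections import Counter


def summarize(documents):
    """Aggregate individual documents into one symbolic description (a concept).

    Each document is a dict with a list of "terms" and a set of HTML "tags".
    Assumes a non-empty list of documents.
    """
    lengths = [len(d["terms"]) for d in documents]
    term_counts = Counter(t for d in documents for t in d["terms"])
    total = sum(term_counts.values())
    return {
        # Interval variable: range of document lengths in the group.
        "length": (min(lengths), max(lengths)),
        # Set variable: every HTML tag seen in the group.
        "tags": set().union(*(d["tags"] for d in documents)),
        # Histogram variable: relative frequency of each term in the group.
        "terms": {t: c / total for t, c in term_counts.items()},
    }


group = [
    {"terms": ["web", "cluster"], "tags": {"title", "a"}},
    {"terms": ["web", "html", "page"], "tags": {"title", "h1"}},
]
concept = summarize(group)
```

The resulting description is a single row of a symbolic data table: it describes the group as a whole, not any one document.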
Symbolic Data Table From relational databases to symbolic databases. Data: millions… (multivariate numeric analysis). Concepts: hundreds… (multivariate symbolic analysis).
Symbolic Data Base Relational database: 100% knowledge, 15 gigabytes. Symbolic database: 90% knowledge, 10.3 megabytes.
Symbolic Representations A complex representation that takes into account: term frequency, word order and phrases.
But there are some problems…
Theorem: Fisher's Equality Total inertia = between-class inertia + within-class inertia
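For classical numeric data, the equality can be checked numerically. The sketch below assumes equal weights 1/n for all points and squared Euclidean distance; the function names are hypothetical.

```python
def mean_vector(points):
    """Center of gravity of a set of points: the vector of coordinate means."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(points, labels):
    """Return (total, between-class, within-class) inertia with weights 1/n."""
    n = len(points)
    g = mean_vector(points)  # global center of gravity
    total = sum(sq_dist(p, g) for p in points) / n

    classes = {}
    for p, lab in zip(points, labels):
        classes.setdefault(lab, []).append(p)

    between = 0.0
    within = 0.0
    for members in classes.values():
        g_k = mean_vector(members)  # center of gravity of the class
        between += len(members) / n * sq_dist(g_k, g)
        within += sum(sq_dist(p, g_k) for p in members) / n
    return total, between, within


points = [(0, 0), (2, 0), (10, 0), (12, 0)]
labels = [0, 0, 1, 1]
total, between, within = inertias(points, labels)
```

For this toy data the total inertia is 26, split into 25 between classes and 1 within classes. The computation relies on each class having a well-defined mean vector.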
Problems in the symbolic case: a class is represented by its center of gravity, that is, by its vector of means. What is the center of gravity?
Evaluation Criteria
• Rand Index
• Mutual Information
• F-Measure
• Entropy
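Two of the criteria listed above can be sketched as follows, using their common textbook definitions. The slides do not give formulas, so the exact variants used by the authors may differ, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def rand_index(truth, predicted):
    """Fraction of document pairs on which the two partitions agree.

    A pair agrees if both partitions put it in the same group,
    or both put it in different groups. Assumes at least two documents.
    """
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def clustering_entropy(truth, predicted):
    """Size-weighted average entropy of the true classes inside each cluster.

    Lower is better; 0 means every cluster contains a single class.
    """
    n = len(truth)
    clusters = {}
    for t, p in zip(truth, predicted):
        clusters.setdefault(p, []).append(t)

    result = 0.0
    for members in clusters.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        result += size / n * h
    return result


truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
ri = rand_index(truth, predicted)
ent = clustering_entropy(truth, predicted)
```

In this example one document is misplaced, so the Rand index drops to 10/15 and the entropy rises above zero.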
Conclusions Symbolic representations are richer and more flexible than classical representations. The text in the HTML document seems to be the most important factor for clustering HTML documents.