Measuring Contribution of HTML Features in Web Document Clustering

Oldemar Rodríguez (School of Mathematics, UCR and Predisoft) and Esteban Meneses (Computing Research Center, ITCR). Motivation: Which HTML feature is the most important to provide good clustering results?

Presentation Transcript


  1. Measuring Contribution of HTML Features in Web Document Clustering. Oldemar Rodríguez, School of Mathematics, UCR and Predisoft. Esteban Meneses, Computing Research Center, ITCR.

  2. Motivation

  3. Motivation Which HTML feature is the most important to provide good clustering results? Using symbolic objects to cluster web documents. 15th World Wide Web Conference (2006)

  4. HTML Document Clustering Find meaningful groups from a web document collection. Effectively represent web document clusters for further analysis.

  5. HTML Document

  6. Classical Representations Different approaches for representing a web document. <5,22,19,4,...,38>

  7. Vectorial Representation Every document is represented by a vector in n-dimensional space. Bag-of-words scheme. Each variable represents the relative weight of a term in the document.
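The bag-of-words scheme on this slide can be illustrated with a minimal sketch. It assumes that "relative weight" means term count divided by document length; the slides do not say which weighting was actually used, and the helper names `tokenize` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters and digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors); each vector holds relative term weights."""
    counts = [Counter(tokenize(d)) for d in docs]
    # One dimension per distinct term in the whole collection.
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty documents
        vectors.append([c[t] / total for t in vocab])
    return vocab, vectors


docs = ["web document clustering", "symbolic web objects web"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Word order is lost in this representation: any permutation of the same words yields the same vector.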

  8. Symbolic Objects Real-life objects are too complex to be represented by points in a vectorial space [Bock & Diday, 2000]. Symbolic objects overcome this limitation by representing concepts rather than individuals. In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
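As a rough illustration of how individuals can be summarized into a concept, the sketch below aggregates several document records into one symbolic description with an interval, a histogram, and a set variable. The field names (`length`, `lang`, `tags`) and the function `to_symbolic` are hypothetical and are not taken from the presentation.

```python
from collections import Counter


def to_symbolic(rows):
    """Aggregate individual records (dicts) into one symbolic description."""
    lengths = [r["length"] for r in rows]
    langs = Counter(r["lang"] for r in rows)
    n = len(rows)
    return {
        # Interval variable: the range of values observed in the group.
        "length": (min(lengths), max(lengths)),
        # Histogram variable: relative frequency of each category.
        "lang": {k: v / n for k, v in sorted(langs.items())},
        # Set variable: every distinct value seen in the group.
        "tags": set().union(*(r["tags"] for r in rows)),
    }


rows = [
    {"length": 120, "lang": "en", "tags": {"title", "a"}},
    {"length": 300, "lang": "es", "tags": {"title", "b"}},
    {"length": 80, "lang": "en", "tags": {"a"}},
    {"length": 210, "lang": "en", "tags": {"title"}},
]
concept = to_symbolic(rows)
print(concept)
```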

  9. Symbolic Data Table

  10. Symbolic Data Table From relational data bases to symbolic data bases. Millions of data records: multivariate numeric analysis. Hundreds of concepts: multivariate symbolic analysis.

  11. Symbolic Data Base Relational data base: 100% knowledge, 15 gigabytes. Symbolic data base: 90% knowledge, 10.3 megabytes.

  12. Symbolic Representations A complex representation that takes into account: term frequency, word order and phrases.

  13. The K-Means Clustering Method

  14. But there are some problems...

  15. Distance Measures

  16. Theorem: Fisher's Equality • Total inertia = between-class inertia + within-class inertia
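The decomposition can be checked numerically. The sketch below assumes classical numeric vectors, squared Euclidean distance, and equal weights 1/n for every point; the function names and the toy clusters are illustrative only.

```python
def mean(points):
    # Center of gravity: the coordinate-wise average of the points.
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sqdist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(clusters):
    """Return (total, between, within) inertia with equal point weights 1/n."""
    points = [p for c in clusters for p in c]
    n = len(points)
    g = mean(points)  # global center of gravity
    total = sum(sqdist(p, g) for p in points) / n
    between = sum(len(c) * sqdist(mean(c), g) for c in clusters) / n
    within = sum(sqdist(p, mean(c)) for c in clusters for p in c) / n
    return total, between, within


clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 4.0), (12.0, 6.0), (11.0, 5.0)],
]
total, between, within = inertias(clusters)
print(total, between, within)
```

Because the total inertia is fixed for a given data set, minimizing the within-class inertia is equivalent to maximizing the between-class inertia.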

  17. Represent a class by its center of gravity, that is, by its vector of means. Problems in the symbolic case: what is the center of gravity?

  18. What is the center of gravity?

  19. Evaluation Criteria • Rand Index • Mutual Information • F-Measure • Entropy
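Two of these criteria can be sketched in a few lines. The code below assumes the pair-counting definition of the Rand index and takes entropy to be the size-weighted average of the class entropy inside each cluster (lower is better); the presentation does not state which exact variants were used, so both definitions are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def rand_index(truth, pred):
    """Fraction of item pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs
    )
    return agree / len(pairs)


def cluster_entropy(truth, pred):
    """Size-weighted average of the class entropy within each cluster."""
    n = len(truth)
    result = 0.0
    for c in set(pred):
        members = [t for t, p in zip(truth, pred) if p == c]
        m = len(members)
        h = -sum((k / m) * log2(k / m) for k in Counter(members).values())
        result += (m / n) * h
    return result


truth = ["a", "a", "b", "b"]  # reference classes
pred = [0, 0, 1, 0]           # cluster assignments to evaluate
print(rand_index(truth, pred))
print(cluster_entropy(truth, pred))
```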

  20. Experiments

  21. Experiments

  22. Experiments

  23. Experiments

  24. Conclusions Symbolic representations are richer and more flexible than classical representations. The text in the HTML document seems to be the most important factor for clustering HTML documents.

  25. Thank you!
