
Cluster Analysis of Homogeneous XML Documents for Theme Validation

This study presents experiments in clustering homogeneous XML documents in order to validate an existing typology. Techniques such as selective use of the XML structure and linguistic term selection were combined to evaluate the quality of the clustering against a predefined typology. The methodology relies on the XML structure of the documents and on syntactic typing and stemming with TreeTagger to select the terms used for clustering. The resulting clusters were evaluated with the F-measure and the corrected Rand index. The results highlight how strongly clustering quality depends on the selected content, which also suggests where the document structure could be improved. Future work includes measuring cluster stability under varying parameters and applying the clustering techniques to other data collections.

Presentation Transcript


  1. Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre (INRIA, Axis project). E-mail: firstname.surname@inria.fr. Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology. I-Know 2005

  2. Scientific Activity Report at Inria I-Know 2005

  3. Homogeneous presentation I-Know 2005

  4. Some RA figures: 146 files, 229,000 text lines, 14.8 MB of data, one DTD, optional sections, free style and content. I-Know 2005

  5. Grouping by Themes (2003) I-Know 2005

  6. Grouping by Themes (2004) I-Know 2005

  7. Problem: the presentation by research themes varies over time and is not politically neutral (funding, evaluation). Is there any natural grouping? What is the role of the different parts of the report in highlighting the themes? I-Know 2005

  8. Methodology: select specific parts of each report using the XML structure; select significant words using a tool for syntactic typing and stemming (TreeTagger); cluster the documents into disjoint clusters; evaluate those clusters. I-Know 2005
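A minimal sketch of the first step (selection by XML structure), assuming the activity reports are XML files in a reports/ directory and that the DTD uses section elements named presentation and foundations; the actual tag names and file layout are assumptions, not taken from the slides.

```python
# Sketch: extract the text of chosen sections of each activity report.
# Element names ("presentation", "foundations") are assumed, the real DTD may differ.
import xml.etree.ElementTree as ET
from pathlib import Path

def extract_text(report_path, section_tags=("presentation",)):
    """Concatenate the text of the requested sections of one activity report."""
    root = ET.parse(report_path).getroot()
    chunks = []
    for tag in section_tags:
        for node in root.iter(tag):
            chunks.append(" ".join(node.itertext()))
    return " ".join(chunks)

# Example: build the corpus for the T-PF experiment (presentation + foundations).
corpus = {p.stem: extract_text(p, ("presentation", "foundations"))
          for p in Path("reports").glob("*.xml")}
```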

  9. Various experiments: K-F: keywords from the section foundations; K-all: all keywords; T-P: text in the section presentation; T-PF: text in the sections presentation and foundations; T-C: names of conferences, workshops, congresses, etc. in the bibliography. I-Know 2005

  10. XML + TreeTagger: sample output for project A3, section presentation (token / POS tag / lemma): a3/JJ/<unknown>, designs/NNS/design, methods/NNS/method, and/CC/and, tools/NNS/tool, used/VVN/use, by/IN/by, compilers/NNS/compiler, or/CC/or, users/NNS/user, for/IN/for, code/NN/code, analysis/NN/analysis. I-Know 2005
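A minimal sketch of the linguistic selection step, assuming output lines shaped like the sample above (project, section, token, POS tag, lemma) and assuming that only nominal lemmas are kept; the exact set of retained part-of-speech tags is not given on the slide.

```python
# Sketch: keep only the lemmas of nouns (tags NN, NNS) from lines shaped like
#   <project> <section> <token> <POS> <lemma>
# The set of POS tags actually retained in the paper is an assumption here.
KEEP_TAGS = {"NN", "NNS"}

def selected_lemmas(lines):
    terms = {}                      # project id -> list of selected lemmas
    for line in lines:
        project, _section, _token, pos, lemma = line.split()
        if pos in KEEP_TAGS and lemma != "<unknown>":
            terms.setdefault(project, []).append(lemma)
    return terms

example = ["A3 presentation methods NNS method",
           "A3 presentation and CC and",
           "A3 presentation code NN code"]
print(selected_lemmas(example))     # {'A3': ['method', 'code']}
```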

  11. Clustering Method. The objective of the third step is to cluster the documents into a set of disjoint classes, using the vocabularies selected for the five experiments. We use a partitioning method close to the k-means algorithm, where the distance between documents is based on word frequencies. I-Know 2005
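A minimal sketch of this step. The paper uses a partitioning method close to k-means on word-frequency vectors; this sketch substitutes scikit-learn's standard KMeans and raw count weighting, and the number of clusters (4, as on the next slide) is an assumption.

```python
# Sketch: build word-frequency vectors per project and partition them with k-means.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def cluster_reports(terms_by_project, k=4):
    projects = sorted(terms_by_project)
    docs = [" ".join(terms_by_project[p]) for p in projects]
    X = CountVectorizer().fit_transform(docs)          # documents x terms counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return dict(zip(projects, labels))                 # project id -> cluster id
```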

  12. K-F-a experiment: list of representative keywords. Class 1: 3d approximation, computer, differential, environment, modeling, processing, programming, vision. Class 2: computing, equation, grid, problem, transformation. Class 3: code, design, event, network, processor, time, traffic. Class 4: calculus, database, datum, image, indexing, information, integration, knowledge, logic, mining, pattern, recognition, user, web. For each cluster, a list of its most representative words can be associated; those words can be interpreted as a summary of the class. I-Know 2005
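A possible way to derive such per-cluster keyword lists, assuming representative terms are ranked by their mean frequency inside each cluster; the actual criterion used in the paper is not stated on the slide.

```python
# Sketch: rank terms by mean within-cluster frequency (an assumed criterion).
import numpy as np

def representative_terms(X, vocabulary, labels, top_n=8):
    """X: documents-by-terms count matrix; vocabulary: column names (e.g. from the vectorizer)."""
    X = np.asarray(X.todense()) if hasattr(X, "todense") else np.asarray(X)
    labels = np.asarray(labels)
    summaries = {}
    for c in np.unique(labels):
        mean_freq = X[labels == c].mean(axis=0)
        top = np.argsort(mean_freq)[::-1][:top_n]
        summaries[c] = [vocabulary[i] for i in top]
    return summaries
```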

  13. Distribution of clusters compared to the 2003 themes I-Know 2005

  14. Distribution of the 2003 themes compared to clusters I-Know 2005

  15. Partition of projects I-Know 2005

  16. Partition of projects I-Know 2005

  17. External Evaluation. The quality of the clusters can be evaluated by comparing the resulting clusters with the two lists of themes used by INRIA. n_ik is the number of research projects whose report is classed in cluster U_i and allocated to group C_k (theme k); n_i. is the number of research reports in cluster U_i; n_.k is the number of research projects allocated to group C_k; n is the total number of research projects analysed. I-Know 2005

  18. Two evaluation measures. The F-measure proposed by (Jardine and Rijsbergen, 1963) combines the precision and recall measures between U_i and C_k: recall is defined by R(i,k) = n_ik / n_i. and precision by P(i,k) = n_ik / n_.k. The F-measure between the a priori partition of the INRIA projects into K theme groups and the partition U produced by the clustering method aggregates these values over all clusters and groups. The corrected Rand index (CR), proposed by Hubert and Arabie (1985), is used to compare the two partitions. I-Know 2005
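A minimal sketch of both measures computed from the contingency table defined on the previous slide, assuming the standard cluster-level F-measure aggregation (weighted best F(i,k) per group) and the usual Hubert and Arabie corrected Rand index; the example counts are illustrative only, not results from the paper.

```python
# Sketch: external evaluation from the contingency table n[i][k]
# (rows = clusters U_i, columns = INRIA theme groups C_k).
import numpy as np
from math import comb

def f_measure(n):
    n = np.asarray(n, dtype=float)
    total = n.sum()
    ni, nk = n.sum(axis=1), n.sum(axis=0)          # n_i. and n_.k
    recall = n / ni[:, None]                       # R(i,k) = n_ik / n_i.
    precision = n / nk[None, :]                    # P(i,k) = n_ik / n_.k
    denom = np.where(precision + recall > 0, precision + recall, 1)
    f = 2 * precision * recall / denom             # F(i,k)
    return float((nk / total * f.max(axis=0)).sum())

def corrected_rand(n):
    n = np.asarray(n, dtype=int)
    total = int(n.sum())
    sum_ik = sum(comb(int(x), 2) for x in n.ravel())
    sum_i = sum(comb(int(x), 2) for x in n.sum(axis=1))
    sum_k = sum(comb(int(x), 2) for x in n.sum(axis=0))
    expected = sum_i * sum_k / comb(total, 2)
    return (sum_ik - expected) / (0.5 * (sum_i + sum_k) - expected)

table = [[10, 2, 1], [0, 8, 3], [1, 1, 9]]         # illustrative counts only
print(f_measure(table), corrected_rand(table))
```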

  19. Results I-Know 2005

  20. Conclusion: combination of selection by structure and by linguistic terms; evaluation of the clustering against an existing typology. The quality of the clustering strongly depends on the parts selected from the activity reports (which in turn gives an indication of where the reports could be improved). Future work: measuring the stability of the clusters when K varies; evolution of the classes over time; experiments with other collections. I-Know 2005
