
A New Evolving Tree for Text Document Clustering and Visualization

Wui Lee Chang (1), Kai Meng Tay (1*), Chee Peng Lim (2). (1) Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia. (2) Centre for Intelligent Systems Research, Deakin University, Australia. (1*) kmtay@feng.unimas.my


Presentation Transcript


  1. A New Evolving Tree for Text Document Clustering and Visualization. Wui Lee Chang (1), Kai Meng Tay (1*), Chee Peng Lim (2). (1) Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia. (2) Centre for Intelligent Systems Research, Deakin University, Australia. (1*) kmtay@feng.unimas.my. WSC 17 (2012)

  2. Presentation Outline • Introduction • Problem Statements • Motivations and Objectives • Preliminary • Evolving Tree • A General Application framework for Evolving Systems • The Proposed Procedure • Experimental results • Concluding Remarks

  3. Introduction: Clustering • To group data into groups/clusters based on their similarity levels. • Examples are the Self-Organizing Map (SOM), K-means and Fuzzy C-means.

  4. Introduction: Textual Document Clustering • To cluster/group sets of textual documents based on their similarity levels. • To ease information retrieval. • Examples: • the naive Bayes-based document clustering model [21], • WEBSOM [22], and • a support vector machine-based approach for imbalanced text document classification [23]. [21] Lewis, D.: Naïve Bayes at forty: The independence assumption in information retrieval. In: ECML (1998) [22] Azcarraga, A.P., Yap, T.J., Tan, J., Chua, T.S.: Evaluating keyword selection methods for WEBSOM text archives. In: IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 3, pp. 380-383 (2004) [23] Liu, T., Loh, H.T., Sun, A.: Imbalanced text classification: A term weighting approach. In: Expert Systems with Applications, vol. 36, pp. 690-701 (2009)

  5. Problem Statement: 1 • Traditional textual document clustering uses off-line learning. • Weakness: the model needs to be re-learned whenever a new document is fed. • An adaptive or evolving model can be an alternative to traditional methods. • Evolving increases the learning flexibility. • WEBSOM focuses on off-line learning.

  6. Problem Statement: 2 • For SOM (or WEBSOM): • it is difficult to determine the map size before learning [19]. • The map size also affects the learning time [19]. [19] Pakkanen, J., Iivarinen, J., Oja, E.: The Evolving Tree – Analysis and Applications. In: IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 591-603 (2006)

  7. Motivations and Objectives • To construct an adaptive textual document clustering tool based on Evolving Tree (ETree). • To apply a general application framework for Evolving Systems [24]. • To analyze the adaptive activity of the proposed method with UNIMAS ENCON 2008 articles. [24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)

  8. Preliminary: Evolving Tree (ETree) • Forms a tree structure that contains a root node, trunk nodes and leaf nodes. • The root node is the first node created in the tree. • Trunk nodes connect the leaf nodes. • Leaf nodes are the clusters formed. • The tree is able to expand hierarchically to scale with the data. • The hierarchical structure helps to keep the complexity under control.

  9. Preliminary: Evolving Tree (ETree) • Each node is indexed. • Each node is attributed with a best matching unit (BMU) hit counter. • The splitting threshold is predetermined. • The number of children nodes created by a split is predetermined.
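
The node attributes listed on this slide can be pictured with a small data structure. The following is only a minimal sketch: the class name `Node`, its field names, and the two constant values are illustrative assumptions, since the slides state only that the threshold and the number of children are predetermined.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed values for illustration; the slides only say both are predetermined.
SPLIT_THRESHOLD = 3
N_CHILDREN = 2


@dataclass(eq=False)
class Node:
    """One ETree node (hypothetical layout, not the authors' code)."""
    index: int                      # node index
    weight: List[float]             # prototype (weight) vector of the node
    hits: int = 0                   # BMU hit counter
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def is_leaf(self) -> bool:
        # A node without children is a leaf, i.e. a cluster.
        return not self.children


# The root is the first node created in the tree.
root = Node(index=0, weight=[0.0, 0.0])
```

A freshly created root has no children, so it is also the only leaf until the first split.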

  10. Preliminary: Evolving Tree (ETree): The Learning Algorithm • Finding the BMU. • Updating the leaf nodes. • Expanding the tree.

  11. Preliminary: Evolving Tree (ETree): Finding the BMU [Figure: tree drawn over four layers of depth (Layer 1 to Layer 4) with the BMU marked]
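
A common way to find the BMU in an Evolving Tree is a top-down search: start at the root and, at every trunk node, descend into the child closest to the input until a leaf is reached. The sketch below assumes that scheme and Euclidean distance; the dictionary layout and the helper names `make`, `euclidean` and `find_bmu` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_bmu(node, x):
    """Descend from `node` to a leaf, always taking the closest child."""
    while node["children"]:
        node = min(node["children"], key=lambda c: euclidean(c["weight"], x))
    return node


def make(weight, children=()):
    """Build a node as a plain dict (illustrative layout only)."""
    return {"weight": weight, "children": list(children)}


# A small tree: the root has one trunk child (with two leaves) and one leaf child.
tree = make([0.5, 0.5], [
    make([0.0, 0.0], [make([0.0, 0.1]), make([0.2, 0.0])]),
    make([1.0, 1.0]),
])

bmu = find_bmu(tree, [0.9, 0.8])
```

Only one branch is followed per layer, so the search cost grows with the tree depth rather than with the total number of leaves.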

  12. Preliminary: Evolving Tree (ETree): Updating Leaf Nodes • Leaf nodes are updated with the Kohonen learning rule. • The update is scaled by a neighbourhood function.
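
The formulas on this slide did not survive extraction. As a hedged illustration, the sketch below uses the textbook Kohonen rule, w ← w + α·h·(x − w), with a Gaussian neighbourhood h = exp(−d² / (2σ²)), where d is assumed to be the distance between a leaf and the BMU measured along the tree. The function names and the values of `alpha` and `sigma` are assumptions.

```python
import math


def neighbourhood(tree_dist, sigma):
    """Gaussian neighbourhood value for a leaf at `tree_dist` from the BMU."""
    return math.exp(-(tree_dist ** 2) / (2.0 * sigma ** 2))


def kohonen_update(weight, x, alpha, h):
    """Move `weight` towards the input `x` by the fraction alpha * h."""
    return [w + alpha * h * (xi - w) for w, xi in zip(weight, x)]


x = [1.0, 0.0]

# The BMU itself is at tree distance 0, so h = 1 and it moves the most.
bmu_w = kohonen_update([0.0, 0.0], x, alpha=0.5, h=neighbourhood(0, sigma=1.0))

# A leaf three hops away from the BMU moves only slightly.
far_w = kohonen_update([0.0, 0.0], x, alpha=0.5, h=neighbourhood(3, sigma=1.0))
```

With these assumed settings the BMU moves halfway to the input, while the distant leaf barely changes.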

  13. [Figure: tree diagram with nodes labelled 1, 2 and 3 and the BMU marked]

  14. A General Application framework for Evolving Systems [24] [24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)

  15. The Proposed Procedure [Flowchart stages: Fetching on-line article; Updating terms of articles; ETree; Refining trained model]

  16. The Proposed Procedure: Preprocessing Text • A new article is fed and given a label. • The abstract of the article is extracted. • Stop words (119 words) are removed from the abstract. • Numerals and symbols are also removed. • A corpus is formed, indexed by article id and by term symbol. • Each corpus entry is further associated with several attributes.
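
The preprocessing steps can be sketched in a few lines. This is an illustration only: the slides mention a list of 119 stop words but do not reproduce it, so `STOP_WORDS` below is a short stand-in, and the function name `preprocess` is hypothetical.

```python
import re

# Stand-in stop-word list; the 119-word list used in the talk is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "for", "in", "to", "with"}


def preprocess(abstract):
    """Lower-case the abstract, drop numerals and symbols, remove stop words."""
    # Keeping only runs of letters discards digits and punctuation in one step.
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [t for t in tokens if t not in STOP_WORDS]


terms = preprocess("An evolving tree (ETree) for clustering 27 articles in 2008!")
```

Here `terms` holds the remaining words in their original order, ready to be merged into the corpus.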

  17. The Proposed Procedure: Term Weighting • Term weighting is performed on the new article only. • Inverse document frequency (idf) computes the importance of a word/term based on its occurrence in the corpus.
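
The exact weighting formula is not legible on this slide. As a sketch, the code below uses the standard definition idf(t) = log(N / df(t)), where N is the number of articles and df(t) is the number of articles containing term t; whether the talk used this exact variant is an assumption, as are the function name and the toy data.

```python
import math


def idf(term, documents):
    """Standard inverse document frequency of `term` over a list of term sets."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / df)


# Toy corpus: each article is reduced to its set of terms.
docs = [
    {"tree", "cluster"},
    {"tree", "fuzzy"},
    {"tree", "signal"},
    {"cluster", "map"},
]

rare = idf("fuzzy", docs)    # occurs in 1 of 4 articles
common = idf("tree", docs)   # occurs in 3 of 4 articles
```

A term that appears in few articles receives a larger weight than one that appears in most of them.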

  18. The Proposed Procedure: Similarity Match Histogram • Training vectors are formed through binary descriptions of the new article's terms. • The dimension of the training vector is then normalized with that of the trained articles.

  19. The Proposed Procedure: Similarity Match Histogram • Compute the Euclidean distance between the training vector and a node's vector. • Compute the overall distance from these Euclidean distances. • Find the BMU from the overall distances.
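
The matching step can be illustrated as follows. This is a simplified sketch, not the authors' procedure: it assumes a binary term-presence vector over a shared vocabulary, plain Euclidean distance, and a BMU chosen as the candidate with the smallest distance. The vocabulary, the leaf names and the function names are made up for the example.

```python
import math


def binary_vector(terms, vocabulary):
    """1.0 where a vocabulary term occurs in the article, else 0.0."""
    present = set(terms)
    return [1.0 if t in present else 0.0 for t in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vocabulary = ["cluster", "fuzzy", "signal", "tree"]
new_article = binary_vector(["tree", "cluster"], vocabulary)

# Hypothetical leaf prototypes expressed over the same vocabulary.
leaves = {
    "leaf_a": [1.0, 0.0, 0.0, 1.0],
    "leaf_b": [0.0, 1.0, 1.0, 0.0],
}

bmu = min(leaves, key=lambda name: euclidean(new_article, leaves[name]))
```

Because every vector is built over the same vocabulary, all vectors share one dimension and can be compared directly.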

  20. The Proposed Procedure: Expanding Tree • If a node's BMU hit counter equals the splitting threshold, then the node is split into the predetermined number of children nodes.
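
The splitting rule can be sketched as a counter check performed each time a leaf is chosen as BMU. Initialising the children as copies of the parent's weight vector is an assumption made for this example, and the function name `record_hit` and the dictionary layout are hypothetical.

```python
def record_hit(node, split_threshold, n_children):
    """Count one BMU hit and split the leaf when the threshold is reached."""
    node["hits"] += 1
    if not node["children"] and node["hits"] == split_threshold:
        # Assumed initialisation: each child starts as a copy of its parent.
        node["children"] = [
            {"weight": list(node["weight"]), "hits": 0, "children": []}
            for _ in range(n_children)
        ]
    return node


leaf = {"weight": [0.2, 0.7], "hits": 0, "children": []}

record_hit(leaf, split_threshold=3, n_children=2)
record_hit(leaf, split_threshold=3, n_children=2)
children_before = len(leaf["children"])   # still a leaf after two hits

record_hit(leaf, split_threshold=3, n_children=2)
children_after = len(leaf["children"])    # split on the third hit
```

After the split the former leaf acts as a trunk node and its children become the new leaves.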

  21. Experimental results: Observation • and are more similar to each other, as compared to .

  22. Experimental results: Observation [Figure: resulting tree structure with root node, trunk nodes and leaf nodes labelled]

  23. Experimental results: Complexity Control [Plot: time (s) against label for articles]

  24. Experimental results: Tree structures with different

  25. Concluding Remarks • With the proposed approach, articles from ENCON 2008 could be clustered and visualized as a tree structure. • In short, the proposed approach constitutes a new decision support tool for conference organizers. • Besides, the proposed procedure could be applied to a larger number of articles, with an expected increase in computational complexity.

  26. Future Work • An ETree with dynamic settings will be developed. • Other potential applications of ETree (e.g., image and signal processing) will be further investigated.

  27. Thank you for your attention
