A New Evolving Tree for Text Document Clustering and Visualization. 1 Wui Lee Chang, 1* Kai Meng Tay, 2 Chee Peng Lim 1 Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia 2 Centre for Intelligent Systems Research, Deakin University, Australia. 1* kmtay@feng.unimas.my.
A New Evolving Tree for Text Document Clustering and Visualization. 1 Wui Lee Chang, 1* Kai Meng Tay, 2 Chee Peng Lim 1 Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia 2 Centre for Intelligent Systems Research, Deakin University, Australia. 1* kmtay@feng.unimas.my. WSC 17 (2012)
Presentation Outline • Introduction • Problem Statements • Motivations and Objectives • Preliminary • Evolving Tree • A General Application framework for Evolving Systems • The Proposed Procedure • Experimental results • Concluding Remarks
Introduction: Clustering • To group sets of data into groups/clusters based on their similarity levels. • Examples are the Self-Organizing Map (SOM), K-means, and Fuzzy C-means.
Introduction: Textual Document Clustering • To cluster/group sets of textual documents based on their similarity levels. • To ease information retrieval. • Examples: • the naive Bayes-based document clustering model [21], • WEBSOM [22], and • support vector machine-based classification of imbalanced text documents [23]. [21] Lewis, D.: Naïve Bayes at forty: The independence assumption in information retrieval. In: ECML (1998) [22] Azcarraga, A.P., Yap, T.J., Tan, J., Chua, T.S.: Evaluating keyword selection methods for WEBSOM text archives. In: IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 3, pp. 380-383 (2004) [23] Liu, T., Loh, H.T., Sun, A.: Imbalanced text classification: A term weighting approach. In: Expert Systems with Applications, vol. 36, pp. 690-701 (2009)
Problem Statement: 1 • Traditional textual document clustering uses off-line learning. • Weakness: the model must be re-learned whenever a new document is fed. • An adaptive or evolving feature model can be an alternative to traditional methods. • Evolving increases the learning flexibility. • WEBSOM focuses on off-line learning.
Problem Statement: 2 • For SOM (or WEBSOM): • It is difficult to determine the map size before learning [19]. • The map size also affects the learning time [19]. [19] Pakkanen, J., Iivarinen, J., Oja, E.: The Evolving Tree – Analysis and Applications. In: IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 591-603 (2006)
Motivations and Objectives • To construct an adaptive textual document clustering tool based on Evolving Tree (ETree). • To apply a general application framework for Evolving Systems [24]. • To analyze the adaptive activity of the proposed method with UNIMAS ENCON 2008 articles. [24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)
Preliminary: Evolving Tree (ETree) • Forms a tree structure that contains a root node, trunk nodes and leaf nodes. • The root node is the first node created in the tree. • Trunk nodes connect the leaf nodes. • Leaf nodes are the clusters formed. • Able to expand hierarchically (form a tree structure) to scale with the data. • The hierarchical structure helps to control complexity.
Preliminary: Evolving Tree (ETree) • Each node has an index. • Each node carries a best matching unit (BMU) hit counter. • The splitting threshold is predetermined. • The number of children nodes created by a split is predetermined.
Preliminary: Evolving Tree (ETree)--The Learning Algorithm • Finding the BMU. • Updating the leaf nodes. • Expanding the tree.
Preliminary: Evolving Tree (ETree)--Finding BMU [Figure: a tree drawn over four layers (Layer 1 to Layer 4, axis: tree depth) with the BMU marked]
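The BMU search can be sketched as a greedy top-down descent: start at the root and keep moving to the closest child until a leaf is reached. This is a minimal illustration, not the authors' implementation; the `Node` class, the `find_bmu` function and the toy weights are hypothetical names and values chosen for the example.

```python
import math

class Node:
    """A tree node holding a prototype (weight) vector and its children."""
    def __init__(self, weight, children=None):
        self.weight = weight
        self.children = children or []

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_bmu(root, x):
    """Descend from the root, always following the closest child, until a leaf."""
    node = root
    while node.children:
        node = min(node.children, key=lambda child: euclidean(child.weight, x))
    return node

# Toy tree: the root has one trunk node (with two leaves) and one direct leaf.
leaf_a = Node([0.0, 0.0])
leaf_b = Node([1.0, 1.0])
leaf_c = Node([5.0, 5.0])
trunk = Node([0.5, 0.5], [leaf_a, leaf_b])
root = Node([2.0, 2.0], [trunk, leaf_c])

bmu = find_bmu(root, [0.9, 1.2])
```

Because only the children of one node are compared at each layer, the search cost grows with the tree depth rather than with the total number of leaves.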
Preliminary: Evolving Tree (ETree)--Updating Leaf Nodes • Leaf nodes are updated with the Kohonen learning rule. • The update is scaled by a neighbourhood function.
[Figure: tree nodes labelled 1, 2 and 3 around the BMU]
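For reference, the Kohonen learning rule in its textbook form is given below. The symbols are the conventional ones and are an assumption here, since the slide's own notation did not survive: w_i is the weight vector of leaf i, x is the input, c is the BMU, alpha is the learning rate, sigma is the neighbourhood width, and d(c, i) is the distance between the BMU and leaf i, which in a tree would naturally be counted along the tree's edges.

```latex
w_i(t+1) = w_i(t) + h_{ci}(t)\,\bigl[x(t) - w_i(t)\bigr],
\qquad
h_{ci}(t) = \alpha(t)\,\exp\!\left(-\frac{d(c,i)^2}{2\sigma^2(t)}\right)
```

With a Gaussian neighbourhood of this kind, the BMU itself (d = 0) receives the largest update and leaves further away in the tree are moved less.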
A General Application framework for Evolving Systems [24] [24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)
The Proposed Procedure [Flowchart blocks: Fetching on-line article; Updating terms of articles; ETree; Refining trained model]
The Proposed Procedure: Preprocessing Text • A new article is fed in and given a label. • The abstract of the article is extracted. • Stop words (a list of 119 words) are removed from the abstract. • Numerals and symbols are also removed. • The remaining terms are stored in a corpus, indexed by article id and by term symbol. • Each term is further associated with several attributes.
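The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word set below is a small hypothetical stand-in for the 119-word list mentioned on the slide, and `preprocess` is a name invented for the example.

```python
import re

# Hypothetical stand-in for the 119-word stop list used in the slides.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def preprocess(abstract):
    """Lower-case the text, drop numerals and symbols, then remove stop words."""
    letters_only = re.sub(r"[^a-z\s]", " ", abstract.lower())
    return [word for word in letters_only.split() if word not in STOP_WORDS]

terms = preprocess("An ETree-based tool for clustering 120 articles (in 2008).")
```

Here `terms` holds the surviving words in their original order; hyphenated words are split because the hyphen is treated as a symbol.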
The Proposed Procedure: Term Weighting • Term weighting is applied to the new article only. • Inverse document frequency (idf) measures the importance of a word/term based on its occurrence in the corpus.
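As a worked illustration, the sketch below uses the common textbook definition idf = log(N / df), where N is the number of articles and df is the number of articles containing the term. The exact variant used in the presentation is not recoverable from the slide, so this formula, the function name and the toy corpus are all assumptions.

```python
import math

def inverse_document_frequency(term, corpus):
    """Textbook idf: log(N / df) over a corpus given as a list of term sets."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / df)

# Toy corpus: each article is represented by the set of its terms.
corpus = [
    {"tree", "cluster", "text"},
    {"tree", "fuzzy"},
    {"tree", "signal"},
    {"image", "signal"},
]

idf_tree = inverse_document_frequency("tree", corpus)      # in 3 of 4 articles
idf_signal = inverse_document_frequency("signal", corpus)  # in 2 of 4 articles
idf_fuzzy = inverse_document_frequency("fuzzy", corpus)    # in 1 of 4 articles
```

A term that appears in few articles gets a larger weight than one that appears in most of them, so `idf_fuzzy` exceeds `idf_signal`, which exceeds `idf_tree`.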
The Proposed Procedure: Similarity Match Histogram • Training vectors are formed as binary descriptions of the new article's terms. • The dimension of the training vector is then normalized to that of the trained article.
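One possible reading of this step is sketched below: each article becomes a 0/1 vector over the known terms, and older vectors are zero-padded when new terms enlarge the vocabulary so that all vectors share one dimension. The slide's symbols are missing, so this interpretation, the helper names `binary_vector` and `pad_to`, and the toy vocabulary are assumptions rather than the authors' procedure.

```python
def binary_vector(terms, vocabulary):
    """Return 1 where a vocabulary term occurs in the article, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]

def pad_to(vector, dimension):
    """Zero-pad an older vector so that it matches a grown vocabulary."""
    return vector + [0] * (dimension - len(vector))

# An article already seen, described over the vocabulary known at that time.
vocabulary = ["tree", "cluster", "text"]
old_vector = binary_vector(["tree", "text"], vocabulary)

# A new article introduces the term "fuzzy", so the vocabulary grows by one.
vocabulary = vocabulary + ["fuzzy"]
new_vector = binary_vector(["cluster", "fuzzy"], vocabulary)
old_vector = pad_to(old_vector, len(vocabulary))
```

After padding, both vectors have the same length and can be compared with a distance measure such as the Euclidean distance.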
The Proposed Procedure: Similarity Match Histogram • Compute the Euclidean distances between the vectors. • Compute the overall distance from them. • Find the BMU.
The Proposed Procedure: Expanding Tree • If a leaf node's BMU hit counter equals the splitting threshold, the node is split into the predetermined number of children nodes.
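The splitting rule can be sketched as follows. The class and function names are hypothetical, and initializing each child with a copy of the parent's weight vector is an assumption made for the example, not something stated on the slide.

```python
class Leaf:
    """A node with a weight vector, a BMU hit counter and (initially) no children."""
    def __init__(self, weight):
        self.weight = list(weight)
        self.hits = 0
        self.children = []

def register_hit(node, threshold, n_children):
    """Count one BMU hit; split the node when the counter equals the threshold.

    Returns True if the node was split on this call.
    """
    node.hits += 1
    if node.hits != threshold:
        return False
    # Assumption: children start as copies of the parent's weight vector.
    node.children = [Leaf(node.weight) for _ in range(n_children)]
    return True

node = Leaf([0.2, 0.8])
results = [register_hit(node, threshold=3, n_children=2) for _ in range(3)]
```

Once a node has children it acts as a trunk node, so later BMU searches pass through it and end at one of its leaves.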
Experimental results: Observation • and are more similar to each other, as compared to .
Experimental results: Observation [Figure: the resulting tree; legend: root node, trunk node, leaf node]
Experimental results: Complexity Control [Plot: axes labelled "Time (s)" and "Label for articles"]
Concluding Remarks • With the proposed approach, articles from ENCON 2008 could be clustered and visualized as a tree structure. • In short, the proposed approach constitutes a new decision support tool for conference organizers. • Besides, the proposed procedure could be applied to a larger number of articles, with an expected increase in computational complexity.
Future Work • An ETree with dynamic settings will be developed. • Other potential applications of ETree (e.g., image and signal processing) will be further investigated.