
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining


  1. A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz

  2. Topics • Introduction • Algorithm • Performance • Observation • Conclusion and Future Work

  3. Introduction

  4. Basic Concepts • Text Mining : Detection of trends or patterns in text data • Clustering : Grouping or classifying documents based on similarity of content

  5. Clustering • Manual Vs Automated • Supervised Vs Unsupervised • Hierarchical Vs Partitional

  6. Clustering • Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents • Method : Nonnegative Matrix Factorization or NMF

  7. Vector Space Model of Text Data • Documents represented as n-dimensional vectors • n : terms in the dictionary • vector component : importance of term • Document collection represented as term-by-document matrix

  8. Term-by-Document Matrix • Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the) • Document 1 : a quick brown fox • Document 2 : jumped over the lazy dog

  9. Term-by-Document Matrix
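
The matrix for the two example documents on slide 8 can be sketched in Python as follows. This is an illustration only: it assumes raw term counts as the "importance" weights (the slides do not say which weighting was used), and the names `dictionary`, `documents`, and `term_by_document` are hypothetical.

```python
import numpy as np

# Dictionary and documents taken from slide 8.
dictionary = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
documents = ["a quick brown fox", "jumped over the lazy dog"]


def term_by_document(dictionary, documents):
    """Build a terms x documents matrix of raw term counts (assumed weighting)."""
    index = {term: i for i, term in enumerate(dictionary)}
    V = np.zeros((len(dictionary), len(documents)))
    for j, doc in enumerate(documents):
        for word in doc.split():
            V[index[word], j] += 1
    return V


V = term_by_document(dictionary, documents)
```

Each column of `V` is one document vector and each row corresponds to one dictionary term, so every entry is nonnegative by construction.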

  10. Clustering Method : NMF • Low rank approximation of large sparse matrices • Preserves data nonnegativity • Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)

  11. Other Methods • Other rank reduction methods : • Principal Component Analysis (PCA) • Vector Quantization (VQ) • Produce basis vectors with negative entries • Additive and Subtractive combinations of basis vectors yield original document vectors

  12. NMF • Produces nonnegative basis vectors • Additive combination of basis vectors yields original document vector

  13. Term-by-Document Matrix (all entries nonnegative)

  14. NMF • Basis vectors interpreted as semantic features or topics • Documents clustered on the basis of shared features
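
One simple way to turn shared features into clusters is to assign each document to the topic on which it has the largest weight. The sketch below assumes a nonnegative topics x documents weight matrix (called `H` here) whose column j holds the weights of document j. The "largest weight" rule and the name `assign_clusters` are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np


def assign_clusters(H):
    """Return, for each document (column of H), the index of its largest weight."""
    return np.argmax(H, axis=0)


# Hypothetical weights for 3 documents over 2 topics.
H = np.array([[0.9, 0.1, 0.4],
              [0.2, 0.8, 0.5]])
labels = assign_clusters(H)
```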

  15. NMF • Demonstrated by Xu et al. (2003): • Outperforms Singular Value Decomposition (SVD) • Comparable to Graph Partitioning methods

  16. Algorithm

  17. NMF : Definition Given • S : Document collection • V (m × n) : term-by-document matrix • m : Number of terms in the dictionary • n : Number of documents in S

  18. NMF : Definition NMF is defined as: • Low rank approximation of V (m × n) in terms of some metric • Factor V into the product WH • W (m × k) : Contains basis vectors • H (k × n) : Contains coefficients of the linear combinations • k : Selected number of topics or basis vectors, k << min(m,n)

  19. NMF : Common Approach • Minimize objective function:
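
The formula on this slide was not captured in the transcript. A common choice, consistent with the || V - WH || quantity that appears on slide 20, is the squared Frobenius norm of the residual. It is given here as an assumption rather than as the slide's exact content:

```latex
\min_{W \ge 0,\; H \ge 0} \; \| V - WH \|_F^2
  = \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl( V_{ij} - (WH)_{ij} \bigr)^2
```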

  20. NMF : Existing Methods Multiplicative Method (MM) [ by Lee and Seung ] • Based on multiplicative update rules • || V - WH || is monotonically non-increasing, and constant if and only if W and H are at a stationary point • Version of Gradient Descent (GD) optimization scheme
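
A minimal NumPy sketch of the Lee and Seung multiplicative update rules for the Frobenius-norm objective is shown below. The random initialization, the fixed iteration count, and the small constant `eps` added to the denominators to avoid division by zero are assumptions of this sketch, and the function name `nmf_multiplicative` is hypothetical.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative V (m x n) by W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))  # nonnegative random start (assumption)
    H = rng.random((k, n))
    errors = []
    for _ in range(n_iter):
        # Elementwise updates: nonnegative factors stay nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius norm of residual
    return W, H, errors
```

Because every update multiplies by a ratio of nonnegative quantities, no explicit projection step is needed to keep W and H nonnegative. The recorded `errors` list can be inspected to observe the non-increasing behaviour of || V - WH ||.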

  21. NMF : Existing Methods Sparse Encoding [ by Hoyer ] • Based on study of neural networks • Enforces statistical sparsity of H • Minimizes sum of non-zeros in H

  22. NMF : Existing Methods Sparse Encoding [ by Mu, Plemmons and Santago ] • Similar to Hoyer’s method • Enforces statistical sparsity of H using a regularization parameter • Minimizes number of non-zeros in H

  23. NMF : Proposed Algorithm Hybrid Method: • W approximated using Multiplicative Method • H calculated using a Constrained Least Squares (CLS) model as the metric • Penalizes the number of non-zeros • Similar to the method by Mu, Plemmons and Santago • Called GD-CLS

  24. GD-CLS
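
The content of this slide did not survive in the transcript. The following is a rough sketch of how a hybrid scheme of the kind described on slide 23 might look. It assumes that the CLS step is a regularized least-squares solve, (WᵀW + λI) H = WᵀV, followed by setting negative entries of H to zero, and that the columns of W are rescaled to unit length after each update. These details, the parameter name `lam`, and the function name `gd_cls` are assumptions, not the author's confirmed algorithm.

```python
import numpy as np


def gd_cls(V, k, lam=0.1, n_iter=100, eps=1e-9, seed=0):
    """Hybrid sketch: multiplicative update for W, regularized least squares for H."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Multiplicative update for W (as in the Multiplicative Method).
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale columns of W to unit length (assumption of this sketch).
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps
        # CLS step (assumed form): solve (W^T W + lam * I) H = W^T V.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        # Enforce nonnegativity by zeroing negative entries.
        H = np.maximum(H, 0.0)
    return W, H
```

In this form λ controls how strongly large entries of H are penalized. Solving one small k x k system per iteration replaces the multiplicative update of H.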

  25. Performance

  26. Text Collections Used • Two benchmark topic detection text collections: • Reuters : Collection of documents on assorted topics • TDT2 : Transcripts from news media

  27. Text Collections Used

  28. Accuracy Metric • Defined by: • di : Document number i • ∂(di) = 1 if the topic labels match • ∂(di) = 0 otherwise • k = 2, 4, 6, 8, 10, 15, 20 • λ = 0.1, 0.01, 0.001
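
The formula that followed "Defined by:" was not captured in the transcript. One natural reading, given the indicator ∂(di), is that AC is the fraction of documents whose assigned topic label matches the true label. The sketch below implements that reading and assumes cluster indices have already been mapped to topic labels. The function name `clustering_accuracy` and the sample labels are hypothetical.

```python
def clustering_accuracy(assigned_topics, true_topics):
    """Fraction of documents whose assigned topic equals the true topic."""
    if len(assigned_topics) != len(true_topics):
        raise ValueError("label lists must have the same length")
    n = len(true_topics)
    if n == 0:
        raise ValueError("at least one document is required")
    # The indicator is 1 when the labels match and 0 otherwise.
    matches = sum(1 for a, t in zip(assigned_topics, true_topics) if a == t)
    return matches / n


ac = clustering_accuracy(
    ["earn", "earn", "cocoa", "interest"],
    ["earn", "interest", "cocoa", "interest"],
)
```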

  29. Results for Reuters and Results for TDT2 [plots not included in transcript]

  30. Observations

  31. Observations : AC • AC inversely proportional to k • Nature of the collection affects AC • Reuters : earn, interest, cocoa • TDT2 : Asian economic crisis, Oprah lawsuit

  32. Observations : λ parameter • AC declines as λ increases (mostly effective for homogeneous text collections) • CPU time declines as λ increases

  33. Observations : Cluster size • Imbalance in cluster sizes has adverse effect

  34. Conclusion & Future Work GD-CLS can be used to effectively cluster text data. Further development involves: • Smart updating • Use in Bioinformatics • Develop user-interface • Convert to C++
