A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz
Topics • Introduction • Algorithm • Performance • Observation • Conclusion and Future Work
Basic Concepts • Text Mining : Detection of trends or patterns in text data • Clustering : Grouping or classifying documents based on similarity of content
Clustering • Manual vs. Automated • Supervised vs. Unsupervised • Hierarchical vs. Partitional
Clustering • Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents • Method : Nonnegative Matrix Factorization or NMF
Vector Space Model of Text Data • Documents represented as n-dimensional vectors • n : number of terms in the dictionary • Each vector component : importance of the corresponding term • A document collection is represented as a term-by-document matrix
Term-by-Document Matrix • Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the) • Document 1 : a quick brown fox • Document 2 : jumped over the lazy dog
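The two example documents above can be turned into a term-by-document matrix with a few lines of code; this is a minimal sketch using plain term counts (the slide does not specify a weighting scheme):

```python
import numpy as np

# Dictionary terms from the slide, in alphabetical order.
terms = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
docs = ["a quick brown fox", "jumped over the lazy dog"]

# Term-by-document matrix: entry (i, j) counts occurrences of term i in document j.
V = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        V[terms.index(word), j] += 1

print(V.shape)  # (9, 2)
```

Real collections use weighted entries (e.g. tf-idf) rather than raw counts, but the matrix shape and interpretation are the same.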
Clustering Method : NMF • Low rank approximation of large sparse matrices • Preserves data nonnegativity • Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)
Other Methods • Other rank reduction methods : • Principal Component Analysis (PCA) • Vector Quantization (VQ) • Produce basis vectors with negative entries • Additive and Subtractive combinations of basis vectors yield original document vectors
NMF • Produces nonnegative basis vectors • Additive combinations of basis vectors yield the original document vectors
NMF • Basis vectors interpreted as semantic features or topics • Documents clustered on the basis of shared features
NMF • Demonstrated by Xu et al. (2003): • Outperforms Singular Value Decomposition (SVD) • Comparable to graph-partitioning methods
NMF : Definition Given • S : document collection • Vm×n : term-by-document matrix • m : number of terms in the dictionary • n : number of documents in S
NMF : Definition NMF is defined as: • A low-rank approximation of Vm×n with respect to some metric • Factor V into the product WH, so that V ≈ WH • Wm×k : contains the basis vectors • Hk×n : contains the linear combination coefficients • k : selected number of topics or basis vectors, k << min(m, n)
NMF : Common Approach • Minimize the objective function: || V - WH ||_F^2 , subject to W ≥ 0 and H ≥ 0 (elementwise)
NMF : Existing Methods Multiplicative Method (MM) [ by Lee and Seung ] • Based on multiplicative update rules • Under the updates, || V - WH || is monotonically non-increasing, and is constant iff W and H are at a stationary point • A version of the Gradient Descent (GD) optimization scheme
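The multiplicative update rules of Lee and Seung can be sketched in a few lines; this toy run uses a small random nonnegative matrix as a stand-in for a term-by-document matrix (the sizes and the eps guard are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 9, 6, 2
V = rng.random((m, n))    # synthetic nonnegative data matrix
W = rng.random((m, k))    # random nonnegative initialization
H = rng.random((k, n))
eps = 1e-9                # guard against division by zero

err0 = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Lee & Seung multiplicative update rules: elementwise multiply,
    # so W and H stay nonnegative throughout.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)
```

Because every update is an elementwise multiplication by a nonnegative ratio, nonnegativity of W and H is preserved automatically, and the residual || V - WH || never increases.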
NMF : Existing Methods Sparse Encoding [ by Hoyer ] • Based on the study of neural networks • Enforces statistical sparsity of H • Minimizes the sum of the entries of H together with the approximation error
NMF : Existing Methods Sparse Encoding [ by Mu, Plemmons and Santago ] • Similar to Hoyer’s method • Enforces statistical sparsity of H via a regularization parameter • Minimizes the number of non-zeros in H
NMF : Proposed Algorithm Hybrid method: • W is approximated using the Multiplicative Method • H is computed from a Constrained Least Squares (CLS) model • The CLS penalty term discourages non-zeros in H • Similar to the method of Mu, Plemmons and Santago • The hybrid is called GD-CLS
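The hybrid iteration above can be sketched as follows. This is an illustrative reading of GD-CLS, not the authors' reference code: the `gd_cls` helper name, the ridge-style λ penalty on H, and the projection of negative entries to zero are assumptions about how the CLS step is realized.

```python
import numpy as np

def gd_cls(V, k, lam=0.01, iters=100, seed=0):
    """Hypothetical sketch of GD-CLS: multiplicative update for W,
    penalized least squares (CLS) for H, with H projected to be nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-9
    I = np.eye(k)
    for _ in range(iters):
        # W step: Lee & Seung multiplicative rule (keeps W nonnegative).
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # H step: solve (W^T W + lam*I) H = W^T V, a least-squares
        # problem with a lambda penalty that discourages large/dense H.
        H = np.linalg.solve(W.T @ W + lam * I, W.T @ V)
        # Enforce the nonnegativity constraint on H.
        H[H < 0] = 0.0
    return W, H
```

Each column of H then gives the topic weights for one document, and a document can be assigned to the cluster with the largest weight in its column.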
Text Collections Used • Two benchmark topic detection text collections: • Reuters : Collection of documents on assorted topics • TDT2 : Transcripts from news media
Accuracy Metric • Defined by: AC = ( Σ_{i=1..n} δ(di) ) / n • di : document number i • δ(di) = 1 if the computed and the true topic labels match • δ(di) = 0 otherwise • Parameter settings tested: k = 2, 4, 6, 8, 10, 15, 20 and λ = 0.1, 0.01, 0.001
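A minimal sketch of the accuracy computation, assuming cluster labels have already been mapped to topic labels (the label arrays below are made-up illustrative data):

```python
import numpy as np

# Hypothetical cluster assignments vs. true topic labels for 8 documents,
# after mapping each cluster to its best-matching topic label.
predicted = np.array([0, 0, 1, 1, 1, 0, 1, 0])
true      = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# delta(d_i) = 1 when the labels match, 0 otherwise; AC is the mean.
ac = np.mean(predicted == true)
print(ac)  # 0.75
```

In practice the mapping from cluster indices to topic labels must itself be chosen (e.g. by best agreement), since NMF does not know the topic names.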
Observations : AC • AC decreases as k increases • The nature of the collection affects AC; sample topics: • Reuters : earn, interest, cocoa • TDT2 : Asian economic crisis, Oprah lawsuit
Observations : λ parameter • AC declines as λ increases (the penalty is mostly effective for homogeneous text collections) • CPU time also declines as λ increases
Observations : Cluster size • Imbalance in cluster sizes has an adverse effect on AC
Conclusion & Future Work GD-CLS can be used to cluster text data effectively. Further development involves: • Smart updating • Application in bioinformatics • Developing a user interface • Converting the code to C++