A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz
Topics • Introduction • Algorithm • Performance • Observation • Conclusion and Future Work
Basic Concepts • Text Mining : Detection of trends or patterns in text data • Clustering : Grouping or classifying documents based on similarity of content
Clustering • Manual vs. Automated • Supervised vs. Unsupervised • Hierarchical vs. Partitional
Clustering • Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents • Method : Nonnegative Matrix Factorization or NMF
Vector Space Model of Text Data • Documents represented as n-dimensional vectors • n : number of terms in the dictionary • Each vector component : importance of the corresponding term • A document collection is represented as a term-by-document matrix
Term-by-Document Matrix • Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the) • Document 1 : a quick brown fox • Document 2 : jumped over the lazy dog
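The two example documents above can be turned into a term-by-document matrix with a few lines of code; this is a minimal sketch using plain term counts (the slide does not specify a weighting scheme):

```python
import numpy as np

# Dictionary terms from the slide, in alphabetical order.
terms = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
docs = ["a quick brown fox", "jumped over the lazy dog"]

# Term-by-document matrix: entry (i, j) counts occurrences of term i in document j.
V = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        V[terms.index(word), j] += 1

print(V.shape)  # (9, 2)
```

Real collections use weighted entries (e.g. tf-idf) rather than raw counts, but the matrix shape and interpretation are the same.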
Clustering Method : NMF • Low rank approximation of large sparse matrices • Preserves data nonnegativity • Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)
Other Methods • Other rank reduction methods : • Principal Component Analysis (PCA) • Vector Quantization (VQ) • Produce basis vectors with negative entries • Additive and Subtractive combinations of basis vectors yield original document vectors
NMF • Produces nonnegative basis vectors • Additive combinations of basis vectors yield the original document vectors
NMF • Basis vectors interpreted as semantic features or topics • Documents clustered on the basis of shared features
NMF • Demonstrated by Xu et al. (2003): • Outperforms Singular Value Decomposition (SVD) • Comparable to graph-partitioning methods
NMF : Definition Given • S : document collection • Vm×n : term-by-document matrix • m : number of terms in the dictionary • n : number of documents in S
NMF : Definition NMF is defined as: • A low-rank approximation of Vm×n with respect to some metric • Factor V into the product WH, so that V ≈ WH • Wm×k : contains the basis vectors • Hk×n : contains the linear combination coefficients • k : selected number of topics or basis vectors, k << min(m, n)
NMF : Common Approach • Minimize the objective function: || V - WH ||_F^2 , subject to W ≥ 0 and H ≥ 0 (elementwise)
NMF : Existing Methods Multiplicative Method (MM) [ by Lee and Seung ] • Based on multiplicative update rules • Under the updates, || V - WH || is monotonically non-increasing, and is constant iff W and H are at a stationary point • A version of the Gradient Descent (GD) optimization scheme
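The multiplicative update rules of Lee and Seung can be sketched in a few lines; this toy run uses a small random nonnegative matrix as a stand-in for a term-by-document matrix (the sizes and the eps guard are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 9, 6, 2
V = rng.random((m, n))    # synthetic nonnegative data matrix
W = rng.random((m, k))    # random nonnegative initialization
H = rng.random((k, n))
eps = 1e-9                # guard against division by zero

err0 = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Lee & Seung multiplicative update rules: elementwise multiply,
    # so W and H stay nonnegative throughout.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)
```

Because every update is an elementwise multiplication by a nonnegative ratio, nonnegativity of W and H is preserved automatically, and the residual || V - WH || never increases.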
NMF : Existing Methods Sparse Encoding [ by Hoyer ] • Based on the study of neural networks • Enforces statistical sparsity of H • Minimizes the sum of the entries of H together with the approximation error
NMF : Existing Methods Sparse Encoding [ by Mu, Plemmons and Santago ] • Similar to Hoyer’s method • Enforces statistical sparsity of H via a regularization parameter • Minimizes the number of non-zeros in H
NMF : Proposed Algorithm Hybrid method: • W is approximated using the Multiplicative Method • H is computed from a Constrained Least Squares (CLS) model • The CLS penalty term discourages non-zeros in H • Similar to the method of Mu, Plemmons and Santago • The hybrid is called GD-CLS
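The hybrid iteration above can be sketched as follows. This is an illustrative reading of GD-CLS, not the authors' reference code: the `gd_cls` helper name, the ridge-style λ penalty on H, and the projection of negative entries to zero are assumptions about how the CLS step is realized.

```python
import numpy as np

def gd_cls(V, k, lam=0.01, iters=100, seed=0):
    """Hypothetical sketch of GD-CLS: multiplicative update for W,
    penalized least squares (CLS) for H, with H projected to be nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-9
    I = np.eye(k)
    for _ in range(iters):
        # W step: Lee & Seung multiplicative rule (keeps W nonnegative).
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # H step: solve (W^T W + lam*I) H = W^T V, a least-squares
        # problem with a lambda penalty that discourages large/dense H.
        H = np.linalg.solve(W.T @ W + lam * I, W.T @ V)
        # Enforce the nonnegativity constraint on H.
        H[H < 0] = 0.0
    return W, H
```

Each column of H then gives the topic weights for one document, and a document can be assigned to the cluster with the largest weight in its column.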
Text Collections Used • Two benchmark topic detection text collections: • Reuters : Collection of documents on assorted topics • TDT2 : Transcripts from news media
Accuracy Metric • Defined by: AC = ( Σ_{i=1..n} δ(di) ) / n • di : document number i • δ(di) = 1 if the computed and the true topic labels match • δ(di) = 0 otherwise • Parameter settings tested: k = 2, 4, 6, 8, 10, 15, 20 and λ = 0.1, 0.01, 0.001
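A minimal sketch of the accuracy computation, assuming cluster labels have already been mapped to topic labels (the label arrays below are made-up illustrative data):

```python
import numpy as np

# Hypothetical cluster assignments vs. true topic labels for 8 documents,
# after mapping each cluster to its best-matching topic label.
predicted = np.array([0, 0, 1, 1, 1, 0, 1, 0])
true      = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# delta(d_i) = 1 when the labels match, 0 otherwise; AC is the mean.
ac = np.mean(predicted == true)
print(ac)  # 0.75
```

In practice the mapping from cluster indices to topic labels must itself be chosen (e.g. by best agreement), since NMF does not know the topic names.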
Observations : AC • AC decreases as k increases • The nature of the collection affects AC; sample topics: • Reuters : earn, interest, cocoa • TDT2 : Asian economic crisis, Oprah lawsuit
Observations : λ parameter • AC declines as λ increases (the penalty is mostly effective for homogeneous text collections) • CPU time also declines as λ increases
Observations : Cluster size • Imbalance in cluster sizes has an adverse effect on AC
Conclusion & Future Work GD-CLS can be used to cluster text data effectively. Further development involves: • Smart updating • Application in bioinformatics • Developing a user interface • Converting the code to C++