Advanced Term Filtering Techniques for Text Mining Efficiency

Term Filtering with Bounded Error Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn lwthucs@gmail.com Presented in ICDM’2010, Sydney, Australia

Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

Introduction • Term-document correlation matrix • : a co-occurrence matrix, • Applied in various text mining tasks • text classification [Dasgupta et al., 2007] • text clustering [Bleiet al., 2003] • information retrieval [Wei and Croft, 2006]. • Disadvantage • high computation-cost and intensive memory access. the number of times that term occurs in document How can we overcome the disadvantage?

Introduction (cont'd) • Three general methods: • To improve the algorithm itself • High performance platforms • multicore processors and distributed machines [Chu et al., 2006] • Feature selection approaches • [Yi, 2003] • We follow this line

Introduction (cont'd) • Basic assumption • Each term captures more or less information. • Our goal • A subspace of features (terms) • Minimal information loss. • Conventional solution: • Step 1: Measure each individual term or the dependency to a group of terms. • Information gain or mutual information. • Step 2: Remove features with low scores in importance or high scores in redundancy. • Limitations • A generic definition of information loss for a single term and a set of terms in terms of any metric? • The information loss of each individual document, and a set of documents? • Why should we consider the information loss on documents?

Introduction (cont'd) • Consider a simple example: • Question: any loss of information if A and B are substituted by a single term S? • Consider only the information loss for terms • YES!Consider that is defined by some state-of-the-art pseudometrics, cosine similarity. • Consider both • NO!Because both documents emphasize A more than B. It’s information loss of documents! A A B A A B Doc 1 Doc 2 • What we have done? • - Information loss for both terms and documents • - Term Filtering with Bounded Error problem • - Develop efficient algorithms

Problem Definition - Superterm • Want • less information loss • smaller term space • Superterm:. the number of occurrences of superterm in document • Group terms into superterms!

Problem Definition – Information loss A transformed representation for document after merging • Information loss during term merging? • user-specified distance measures • Euclidean metric and cosine similarity. • superterm substitutes a set of terms • information loss of terms . • information loss of documents . Can be chosen with different methods: winner-take-all, average-occurrence, etc. A A B A A B Doc 1 Doc 2 Doc 1 Doc 1 Doc 2 Doc 2 S A S B

Problem Definition – TFBE • Term Filtering with Bounded Error (TFBE) • To minimize the size of supertermssubject to the following constraints: (1) , (2) for all, , (3) for all , . Mapping function from terms to superterms Bounded by user-specified errors

Lossless Term Filtering • Special case • NO information loss of any term and document • Theorem: The exactoptimal solution of the problem is yielded by grouping terms of the samevector representation. • Algorithm • Step 1: find “local” superterms for each document • Same occurrences within the document • Step 2: add, split, or remove superterms from global superterm set • = 0 and = 0 • This case is applicable? • YES!On Baidu Baike dataset (containing 1,531,215 documents) • The vocabulary size 1,522,576 -> 714,392 • > 50%

#iterations << |V| Lossy Term Filtering • General case • A greedy algorithm • Step 1: Consider the constraint given by . • Similar to Greedy Cover Algorithm • Step 2: Verify the validity of the candidates with . • Computational Complexity •  • Parallelize the computation for distances  • > 0 and > 0 • NP-hard! • Further reduce the size of candidate superterms? • Locality-sensitive hashing (LSH) 

a projection vector (drawn from the Gaussian distribution) Efficient Candidate Superterm Generation the width of the buckets (assigned empirically) • Basic idea: • Hash function for the Euclidean distance [Datar et al., 2004]. • LSH (Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html) Same hash value (with several randomized hash projections) Now search -near neighbors in all the buckets! Close to each other (in the original space)

Experiments • Settings • Datasets • Academic: ArnetMiner(10,768 papers and 8,212 terms) • 20-Newsgroups (18,774 postings and 61,188 terms) • Baselines • Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA) • Supervised method (only on classification): Chi-statistic (CHIMAX)

Experiments (cont'd) • Filtering results on Euclidean metric • Findings • Errors ↑, term radio ↓ • Same bound for terms, bound for documents↑, term ratio ↓ • Same bound for documents, bound for terms ↑, term ratio ↓

Experiments (cont'd) • The information loss in terms of error Avg. errors Max errors

Experiments (cont'd) • Efficiency Performance of Parallel Algorithm and LSH* * Optimal resulting from different choice of tuned

Experiments (cont'd) • Applications • Clustering (Arnet) • Classification (20NG) • Document retrieval (Arnet)

Conclusion • Formally define the problem and perform a theoretical investigation • Develop efficient algorithms for the problems • Validate the approach through an extensive set of experiments with multiple real-world text mining applications

Thank you!

References • Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022. • Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y. and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06 pp. 281-288,. • Dasgupta, A., Drineas, P., Harb, B., Josifovski, V. and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD'07 pp. 230-239,. • Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG'04 pp. 253-262,. • Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06 pp. 178-185,. • Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03 pp. 43-50,.

Advanced Term Filtering Techniques for Text Mining Efficiency

Advanced Term Filtering Techniques for Text Mining Efficiency

Presentation Transcript

Cross-Selling with Collaborative Filtering

Filtering MAP Data With Excel

Error Detection with Error Rules

Greedy Routing with Bounded Stretch

Audio/voice Recorder with filtering

Exploring an Environment with Bounded Visibility

Bounded Delay Scheduling with Packet Dependencies

Filtering Spam With

Error Correction in Computationally Bounded Channels

Cross-Selling with Collaborative Filtering

Image Filtering with GLSL

Packet Scheduling with Bounded Buffers

Internet Filtering with DansGuardian

Graph Summarization with Bounded Error

Estimation of the derivatives of a digital function with a convergent bounded error

Perturbative gadgets with constant-bounded interactions

Assumptions About the Error Term e

Collaborative filtering with privacy

Collaborative filtering with temporal dynamics

Route filtering: Handle with Care

Filtering Spam With

Collaborative Filtering with Temporal Dynamics