1 / 33

Term Filtering with Bounded Error

Term Filtering with Bounded Error. Zi Yang , Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China { yangzi , tangjie , ljz }@ keg.cs.tsinghua.edu.cn lwthucs@gmail.com

cleave
Download Presentation

Term Filtering with Bounded Error

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Term Filtering with Bounded Error Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn lwthucs@gmail.com Presented in ICDM’2010, Sydney, Australia

  2. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  3. Introduction • Term-document correlation matrix • : a co-occurrence matrix, • Applied in various text mining tasks • text classification [Dasgupta et al., 2007] • text clustering [Bleiet al., 2003] • information retrieval [Wei and Croft, 2006]. • Disadvantage • high computation-cost and intensive memory access. the number of times that term occurs in document

  4. Introduction • Term-document correlation matrix • : a co-occurrence matrix, • Applied in various text mining tasks • text classification [Dasgupta et al., 2007] • text clustering [Bleiet al., 2003] • information retrieval [Wei and Croft, 2006]. • Disadvantage • high computation-cost and intensive memory access. the number of times that term occurs in document How can we overcome the disadvantage?

  5. Introduction (cont'd) • Three general methods: • To improve the algorithm itself • High performance platforms • multicore processors and distributed machines [Chu et al., 2006] • Feature selection approaches • [Yi, 2003] • We follow this line

  6. Introduction (cont'd) • Basic assumption • Each term captures more or less information. • Our goal • A subspace of features (terms) • Minimal information loss. • Conventional solution: • Step 1: Measure each individual term or the dependency to a group of terms. • Information gain or mutual information. • Step 2: Remove features with low scores in importance or high scores in redundancy.

  7. Introduction (cont'd) • Basic assumption • Each term captures more or less information. • Our goal • A subspace of features (terms) • Minimal information loss. • Conventional solution: • Step 1: Measure each individual term or the dependency to a group of terms. • Information gain or mutual information. • Step 2: Remove features with low scores in importance or high scores in redundancy. • Limitations • A generic definition of information loss for a single term and a set of terms in terms of any metric? • The information loss of each individual document, and a set of documents? • Why should we consider the information loss on documents?

  8. Introduction (cont'd) • Consider a simple example: • Question: any loss of information if A and B are substituted by a single term S? • Consider only the information loss for terms • YES!Consider that is defined by some state-of-the-art pseudometrics, cosine similarity. • Consider both • NO!Because both documents emphasize A more than B. It’s information loss of documents! A A B A A B Doc 1 Doc 2

  9. Introduction (cont'd) • Consider a simple example: • Question: any loss of information if A and B are substituted by a single term S? • Consider only the information loss for terms • YES!Consider that is defined by some state-of-the-art pseudometrics, cosine similarity. • Consider both • NO!Because both documents emphasize A more than B. It’s information loss of documents! A A B A A B Doc 1 Doc 2 • What we have done? • - Information loss for both terms and documents • - Term Filtering with Bounded Error problem • - Develop efficient algorithms

  10. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  11. Problem Definition - Superterm • Want • less information loss • smaller term space • Superterm:. the number of occurrences of superterm in document • Group terms into superterms!

  12. Problem Definition – Information loss A transformed representation for document after merging • Information loss during term merging? • user-specified distance measures • Euclidean metric and cosine similarity. • superterm substitutes a set of terms • information loss of terms . • information loss of documents . A A B A A B Doc 1 Doc 2 Doc 1 Doc 1 Doc 2 Doc 2 S A S B

  13. Problem Definition – Information loss A transformed representation for document after merging • Information loss during term merging? • user-specified distance measures • Euclidean metric and cosine similarity. • superterm substitutes a set of terms • information loss of terms . • information loss of documents . Can be chosen with different methods: winner-take-all, average-occurrence, etc. A A B A A B Doc 1 Doc 2 Doc 1 Doc 1 Doc 2 Doc 2 S A S B

  14. Problem Definition – TFBE • Term Filtering with Bounded Error (TFBE) • To minimize the size of supertermssubject to the following constraints: (1) , (2) for all, , (3) for all , . Mapping function from terms to superterms Bounded by user-specified errors

  15. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  16. Lossless Term Filtering • Special case • NO information loss of any term and document • Theorem: The exactoptimal solution of the problem is yielded by grouping terms of the samevector representation. • Algorithm • Step 1: find “local” superterms for each document • Same occurrences within the document • Step 2: add, split, or remove superterms from global superterm set • = 0 and = 0

  17. Lossless Term Filtering • Special case • NO information loss of any term and document • Theorem: The exactoptimal solution of the problem is yielded by grouping terms of the samevector representation. • Algorithm • Step 1: find “local” superterms for each document • Same occurrences within the document • Step 2: add, split, or remove superterms from global superterm set • = 0 and = 0 • This case is applicable? • YES!On Baidu Baike dataset (containing 1,531,215 documents) • The vocabulary size 1,522,576 -> 714,392 • > 50%

  18. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  19. #iterations << |V| Lossy Term Filtering • General case • A greedy algorithm • Step 1: Consider the constraint given by . • Similar to Greedy Cover Algorithm • Step 2: Verify the validity of the candidates with . • Computational Complexity •  • Parallelize the computation for distances  • > 0 and > 0 • NP-hard!

  20. #iterations << |V| Lossy Term Filtering • General case • A greedy algorithm • Step 1: Consider the constraint given by . • Similar to Greedy Cover Algorithm • Step 2: Verify the validity of the candidates with . • Computational Complexity •  • Parallelize the computation for distances  • > 0 and > 0 • NP-hard! • Further reduce the size of candidate superterms? • Locality-sensitive hashing (LSH) 

  21. Efficient Candidate Superterm Generation • Basic idea: • Hash function for the Euclidean distance [Datar et al., 2004]. • LSH (Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html) Same hash value (with several randomized hash projections) Close to each other (in the original space)

  22. a projection vector (drawn from the Gaussian distribution) Efficient Candidate Superterm Generation the width of the buckets (assigned empirically) • Basic idea: • Hash function for the Euclidean distance [Datar et al., 2004]. • LSH (Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html) Same hash value (with several randomized hash projections) Close to each other (in the original space)

  23. a projection vector (drawn from the Gaussian distribution) Efficient Candidate Superterm Generation the width of the buckets (assigned empirically) • Basic idea: • Hash function for the Euclidean distance [Datar et al., 2004]. • LSH (Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html) Same hash value (with several randomized hash projections) Now search -near neighbors in all the buckets! Close to each other (in the original space)

  24. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  25. Experiments • Settings • Datasets • Academic: ArnetMiner(10,768 papers and 8,212 terms) • 20-Newsgroups (18,774 postings and 61,188 terms) • Baselines • Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA) • Supervised method (only on classification): Chi-statistic (CHIMAX)

  26. Experiments (cont'd) • Filtering results on Euclidean metric • Findings • Errors ↑, term radio ↓ • Same bound for terms, bound for documents↑, term ratio ↓ • Same bound for documents, bound for terms ↑, term ratio ↓

  27. Experiments (cont'd) • The information loss in terms of error Avg. errors Max errors

  28. Experiments (cont'd) • Efficiency Performance of Parallel Algorithm and LSH* * Optimal resulting from different choice of tuned

  29. Experiments (cont'd) • Applications • Clustering (Arnet) • Classification (20NG) • Document retrieval (Arnet)

  30. Outline • Introduction • Problem Definition • Lossless Term Filtering • Lossy Term Filtering • Experiments • Conclusion

  31. Conclusion • Formally define the problem and perform a theoretical investigation • Develop efficient algorithms for the problems • Validate the approach through an extensive set of experiments with multiple real-world text mining applications

  32. Thank you!

  33. References • Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022. • Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y. and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06 pp. 281-288,. • Dasgupta, A., Drineas, P., Harb, B., Josifovski, V. and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD'07 pp. 230-239,. • Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG'04 pp. 253-262,. • Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06 pp. 178-185,. • Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03 pp. 43-50,.

More Related