Multitype Features Coselection for Web Document Clustering

Advisor: Dr. Hsu • Presenter: Wen-Cheng Tsai • Authors: Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma • TKDE, 2006

Outline • Motivation • Objective • Method • Experiments • Conclusion • Personal Comments

Motivation • Compared with unsupervised selection, supervised feature selection is more successful at filtering out noise in most cases. • Because clustering lacks label information, it can hardly exploit supervised selection. • Some studies have proposed solving this problem with a “pseudoclass.”

Objective • We propose a novel feature coselection method for web document clustering, called Multitype Features Coselection for Clustering (MFCC). • MFCC uses intermediate clustering results in one type of feature space to help the selection in other types of feature space (e.g., content, URL, anchor text, user access log). • MFCC effectively reduces the noise introduced by the “pseudoclass” and improves clustering performance.
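The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cluster ids from one space (here, content) act as pseudoclass labels, and it scores binary features in another space (here, URL tokens) by information gain. The slides do not say which criterion MFCC uses, so information gain is an assumption, and the names `information_gain`, `select_features`, `content_clusters`, and `url_tokens` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels; 0.0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0


def information_gain(has_feature, pseudo_labels):
    """Information gain of a binary feature with respect to pseudoclass labels."""
    present = [y for f, y in zip(has_feature, pseudo_labels) if f]
    absent = [y for f, y in zip(has_feature, pseudo_labels) if not f]
    n = len(pseudo_labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(pseudo_labels) - conditional


def select_features(docs, pseudo_labels, k):
    """Keep the k features with the highest score; docs is a list of token sets."""
    vocab = sorted(set().union(*docs))
    scored = [(information_gain([t in d for d in docs], pseudo_labels), t) for t in vocab]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties broken by name
    return [t for _, t in scored[:k]]


# Hypothetical toy data: cluster ids from the content space serve as pseudoclasses
# for selecting features in the URL space.
content_clusters = [0, 0, 1, 1]
url_tokens = [{"sports", "www"}, {"sports", "www"}, {"news", "www"}, {"news", "www"}]
selected = select_features(url_tokens, content_clusters, k=2)
```

In this toy case the token "www" occurs in every document, so it carries no information about the pseudoclasses and is dropped, while "news" and "sports" are kept.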

Method • MFCC vs. IF (Iterative Feature selection) • Feature spaces: content, URL, anchor text, user access log • Five fusion models • Feature selection criteria
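The slides mention fusion models without naming them. As a hedged illustration only, the sketch below combines the scores a feature receives from pseudoclasses in several other spaces, by averaging or by taking the maximum. Both rules are assumptions chosen for illustration and are not claimed to be among the five models in the paper; `fuse_scores` and the score tables are hypothetical.

```python
def fuse_scores(score_tables, how="average"):
    """Combine per-space {feature: score} dicts into one {feature: score} dict.

    A feature missing from a table is treated as having score 0.0 there.
    """
    features = set().union(*score_tables)
    fused = {}
    for f in features:
        values = [table.get(f, 0.0) for table in score_tables]
        fused[f] = max(values) if how == "max" else sum(values) / len(values)
    return fused


# Hypothetical scores for content features, computed against pseudoclasses
# obtained from clustering in two other spaces.
from_url_clusters = {"football": 0.9, "the": 0.1}
from_anchor_clusters = {"football": 0.5, "the": 0.3}

averaged = fuse_scores([from_url_clusters, from_anchor_clusters])
maximum = fuse_scores([from_url_clusters, from_anchor_clusters], how="max")
```

Averaging rewards features that every space agrees on, while taking the maximum keeps a feature if any single space finds it discriminative.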

Experiments • Information on the data sets • Evaluation measurements: Entropy, Error rate, F1-measure • [Figure: worked example of error rate and entropy; a = 7/12, b = 6/12, c = 5/12]
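Two of the measurements listed above can be sketched as follows. The slides do not give formulas, so the definitions here are common ones and may differ from those in the paper: entropy is the size-weighted average of the class entropy inside each cluster, and the F1 score takes, for each true class, the best F1 over all clusters, weighted by class size. The function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(clusters, classes):
    """Size-weighted average of the class entropy (bits) inside each cluster."""
    n = len(classes)
    total = 0.0
    for c in set(clusters):
        members = [y for k, y in zip(clusters, classes) if k == c]
        h = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in Counter(members).values())
        total += (len(members) / n) * h
    return total


def clustering_f1(clusters, classes):
    """For each true class take the best F1 over clusters, weighted by class size."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    score = 0.0
    for y, n_y in Counter(classes).items():
        best = 0.0
        for c, n_c in cluster_sizes.items():
            overlap = sum(1 for k, t in zip(clusters, classes) if k == c and t == y)
            if overlap:
                precision, recall = overlap / n_c, overlap / n_y
                best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_y / n) * best
    return score


# Hypothetical toy labels: one clustering matches the classes, one mixes them.
true_classes = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
mixed = [0, 1, 0, 1]
```

Lower entropy and higher F1 both indicate clusters that agree better with the true classes: the `perfect` assignment scores entropy 0 and F1 of 1, while `mixed` scores entropy 1 bit and F1 of 0.5.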

Experiments

Experiments • The performance of MFCC evaluated by Error Rate • The performance of MFCC evaluated by F1-Measure • The performance of MFCC evaluated by Entropy

Conclusion • We have proposed MFCC, a novel algorithm that exploits different types of features to perform web document clustering. • The better feature set coselected by heterogeneous features will produce better clusters in each space. • The better intermediate result will further improve coselection in the next iteration.

Personal Comments • Advantages: filters out noise; improves clustering performance • Disadvantage: no examples • Application: information retrieval
