Multitype Features Coselection for Web Document Clustering
Advisor: Dr. Hsu
Presenter: Wen-Cheng Tsai
Authors: Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma
TKDE, 2006
Outline
• Motivation
• Objective
• Method
• Experiments
• Conclusion
• Personal Comments
Motivation
• In most cases, supervised feature selection filters out noise more successfully than unsupervised selection.
• Because clustering lacks label information, it can hardly exploit supervised selection.
• Some studies have proposed solving this problem with "pseudoclass" labels.
Objective
• We propose a novel feature coselection method for web document clustering, called Multitype Features Coselection for Clustering (MFCC).
• MFCC uses intermediate clustering results in one type of feature space to help feature selection in the other types of feature space (e.g., content, URL, anchor text, user access log).
• MFCC effectively reduces the noise introduced by "pseudoclass" and improves clustering performance.
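The coselection idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the chi-square scoring, the averaging fusion, and the names `chi2_scores` and `coselect` are all assumptions chosen for illustration. Cluster ids obtained in several feature spaces are treated as pseudo-class labels, each labeling scores the features of one target space, and the fused scores decide which features are kept.

```python
from collections import defaultdict


def chi2_scores(docs, labels):
    """Score each feature by its largest chi-square statistic over the pseudo-classes.

    docs   : list of sets, the features present in each document (one feature space)
    labels : list of cluster ids used as pseudo-class labels, one per document
    """
    n = len(docs)
    scores = {}
    for feature in set().union(*docs):
        best = 0.0
        for cls in set(labels):
            # 2x2 contingency table: (feature present?) x (document in cls?)
            a = sum(1 for d, l in zip(docs, labels) if feature in d and l == cls)
            b = sum(1 for d, l in zip(docs, labels) if feature in d and l != cls)
            c = sum(1 for d, l in zip(docs, labels) if feature not in d and l == cls)
            e = n - a - b - c  # feature absent and document outside cls
            denom = (a + c) * (b + e) * (a + b) * (c + e)
            if denom:  # skip degenerate tables (e.g. a feature present everywhere)
                best = max(best, n * (a * e - c * b) ** 2 / denom)
        scores[feature] = best
    return scores


def coselect(docs, labelings, top_k):
    """Average the scores obtained under each labeling and keep the top_k features.

    labelings: one list of cluster ids per feature space (assumed fusion: mean score).
    """
    fused = defaultdict(float)
    for labels in labelings:
        for feature, score in chi2_scores(docs, labels).items():
            fused[feature] += score / len(labelings)
    ranked = sorted(fused, key=lambda f: (-fused[f], f))
    return ranked[:top_k]


# Hypothetical toy data: content features of four pages, plus cluster ids
# produced in the content space and in a second (e.g. URL) space.
content_docs = [
    {"java", "code", "the"},
    {"java", "jvm", "the"},
    {"soccer", "goal", "the"},
    {"soccer", "league", "the"},
]
content_clusters = [0, 0, 1, 1]
url_clusters = [0, 0, 0, 1]
print(sorted(coselect(content_docs, [content_clusters, url_clusters], top_k=3)))
```

In an iterative setting, the selected features would be used to recluster each space, and the new cluster ids would feed the next round of selection. The word "the" appears in every document, so it carries no information about any pseudo-class and scores zero.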
Method
• MFCC vs. IF (iterative feature selection)
• Feature spaces: content, URL, anchor text, user access log
• Five fusion models
• Feature selection criteria
Experiments
• Information on the data sets
• Evaluation measurements: entropy, error rate, F1-measure
[Figure: worked example of error rate and entropy; a = 7/12, b = 6/12, c = 5/12]
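As a rough illustration of two of these measurements, the sketch below computes a size-weighted cluster entropy and an error rate defined as one minus purity. These are common textbook definitions and are assumptions here; the paper's exact formulas may differ. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _group_by_cluster(clusters, classes):
    """Collect the true class labels of the documents assigned to each cluster."""
    members = defaultdict(list)
    for cluster_id, true_class in zip(clusters, classes):
        members[cluster_id].append(true_class)
    return members


def clustering_entropy(clusters, classes):
    """Size-weighted average entropy (base 2) of the true classes within each cluster."""
    n = len(clusters)
    total = 0.0
    for labels in _group_by_cluster(clusters, classes).values():
        size = len(labels)
        h = -sum((m / size) * math.log2(m / size) for m in Counter(labels).values())
        total += (size / n) * h
    return total


def error_rate(clusters, classes):
    """Assumed definition: fraction of documents outside their cluster's majority class."""
    n = len(clusters)
    correct = sum(
        max(Counter(labels).values())
        for labels in _group_by_cluster(clusters, classes).values()
    )
    return 1 - correct / n


# Hypothetical example: six documents, two clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(clustering_entropy(clusters, classes), error_rate(clusters, classes))
```

Lower values are better for both measures: a clustering whose clusters each contain a single class has entropy 0 and error rate 0.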
Experiments
• Performance of MFCC evaluated by error rate
• Performance of MFCC evaluated by F1-measure
• Performance of MFCC evaluated by entropy
Conclusion
• We have proposed MFCC, a novel algorithm that exploits different types of features to perform web document clustering.
• A better feature set coselected by heterogeneous features produces better clusters in each space.
• Better intermediate results in turn improve coselection in the next iteration.
Personal Comments
• Advantages
  • Filters out noise
  • Improves clustering performance
• Disadvantage
  • No examples
• Application
  • Information retrieval