Multitype Features Coselection for Web Document Clustering

Advisor: Dr. Hsu. Presenter: Wen-Cheng Tsai. Authors: Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma. TKDE, 2006. Outline: Motivation, Objective, Method, Experiments, Conclusion, Personal Comments.


Presentation Transcript


  1. Multitype Features Coselection for Web Document Clustering • Advisor: Dr. Hsu • Presenter: Wen-Cheng Tsai • Authors: Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma • TKDE, 2006

  2. Outline • Motivation • Objective • Method • Experiments • Conclusion • Personal Comments

  3. Motivation • Compared with unsupervised selection, supervised feature selection is more successful at filtering out noise in most cases. • Because clustering has no label information, it can hardly exploit supervised selection. • Some studies have proposed solving this problem with "pseudoclasses": intermediate cluster assignments used in place of true class labels.

  4. Objective • We propose a novel feature coselection method for web document clustering, called Multitype Features Coselection for Clustering (MFCC). • MFCC uses intermediate clustering results in one type of feature space to help the selection in other types of feature space (e.g., content, URL, anchor text, user access log). • MFCC effectively reduces the noise introduced by "pseudoclasses" and improves clustering performance.
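The coselection idea on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the chi-square criterion, the fusion by summing scores, the function names `chi_square_scores` and `coselect`, and the toy data are all assumptions made for illustration. It shows a single selection step in which cluster labels obtained in one feature space (here, URL) act as pseudoclasses for ranking features in another space (here, content).

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each feature by its best chi-square value against the pseudoclasses.

    docs   -- list of sets; each set holds the features present in one document
    labels -- list of cluster ids (pseudoclasses), one per document
    """
    n = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter(f for doc in docs for f in doc)
    joint = Counter((f, c) for doc, c in zip(docs, labels) for f in doc)

    scores = {}
    for feature, df in doc_freq.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = joint[(feature, cls)]   # in class, has feature
            b = df - a                  # outside class, has feature
            c = size - a                # in class, lacks feature
            d = n - a - b - c           # outside class, lacks feature
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[feature] = best
    return scores


def coselect(spaces, cluster_labels, target, k):
    """Pick the top-k features of `target` using pseudoclasses from the other spaces.

    spaces         -- dict: space name -> list of feature sets (one per document)
    cluster_labels -- dict: space name -> intermediate cluster labels in that space
    Scores from the other spaces are fused by summation (an assumed fusion rule).
    """
    fused = Counter()
    for name, labels in cluster_labels.items():
        if name == target:
            continue
        for feature, score in chi_square_scores(spaces[target], labels).items():
            fused[feature] += score
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(fused, key=lambda f: (-fused[f], f))[:k]


# Hypothetical toy data: four documents, two feature spaces.
spaces = {
    "content": [{"goal", "the"}, {"goal", "the"}, {"vote", "the"}, {"vote", "the"}],
    "url": [{"sports"}, {"sports"}, {"politics"}, {"politics"}],
}
cluster_labels = {"content": [0, 1, 0, 1], "url": [0, 0, 1, 1]}
selected = coselect(spaces, cluster_labels, target="content", k=2)
```

In this toy case the word "the" appears in every document, so it cannot separate the URL-based pseudoclasses and scores zero, while "goal" and "vote" are kept. In a full iterative procedure the selected features would be used to recluster each space, and the new labels would drive the next selection round.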

  5. Method • MFCC vs. IF (iterative feature selection) • Feature spaces: content, URL, anchor text, user access log • Five fusion models • Feature selection criteria

  6. Experiments • Information on the data sets • Evaluation measurements: entropy, error rate, F1-measure • [Figure: worked example of error rate and entropy; values shown: a = 7/12, b = 6/12, c = 5/12]
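Two of the evaluation measurements named on this slide can be sketched as follows. The slide does not give the formulas, so the definitions here are assumptions based on common usage: entropy is the size-weighted average of the class entropy inside each cluster, and error rate is the fraction of documents that do not belong to the majority class of their cluster. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def _group_by_cluster(clusters, classes):
    """Collect the true class labels of the documents in each cluster."""
    groups = defaultdict(list)
    for cluster_id, class_id in zip(clusters, classes):
        groups[cluster_id].append(class_id)
    return groups


def clustering_entropy(clusters, classes):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(classes)
    total = 0.0
    for members in _group_by_cluster(clusters, classes).values():
        size = len(members)
        h = -sum((c / size) * log2(c / size) for c in Counter(members).values())
        total += (size / n) * h
    return total


def majority_error_rate(clusters, classes):
    """Fraction of documents outside their cluster's majority class."""
    n = len(classes)
    correct = sum(
        Counter(members).most_common(1)[0][1]
        for members in _group_by_cluster(clusters, classes).values()
    )
    return 1 - correct / n


# Hypothetical example: six documents, two clusters, true classes "x" and "y".
clusters = [0, 0, 0, 1, 1, 1]
classes = ["x", "x", "y", "y", "y", "y"]
entropy = clustering_entropy(clusters, classes)
error = majority_error_rate(clusters, classes)
```

Both measures are zero for a clustering that matches the true classes exactly and grow as clusters become more mixed.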

  7. Experiments

  8. Experiments • Performance of MFCC evaluated by error rate • Performance of MFCC evaluated by F1-measure • Performance of MFCC evaluated by entropy

  9. Conclusion • We have proposed MFCC, a novel algorithm that exploits different types of features to perform web document clustering. • The better feature set coselected by heterogeneous features produces better clusters in each space. • The better intermediate results in turn improve coselection in the next iteration.

  10. Personal Comments • Advantages: filters out noise; improves clustering performance • Disadvantage: no examples • Application: information retrieval
