
PEBL: Web Page Classification without Negative Examples

PEBL: Web Page Classification without Negative Examples. Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang. IEEE Transactions on Knowledge and Data Engineering, Jan. 2004.


Presentation Transcript


  1. PEBL: Web Page Classification without Negative Examples Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, JAN 2004

  2. Outline • Problem statement • Motivation • Related work • Main contribution • Technical details • Experiments • Summary

  3. Problem Statement • Classify web pages into “user-interesting” classes. • E.g., a “Home-Page Classifier” or a “Call for Papers Classifier.” • Negative samples are not given. • Only positive and unlabeled samples are available.

  4. Motivation • Collecting negative samples can be delicate and laborious: • Negative samples must uniformly represent the universal set. • Manually collected negative training examples can be biased. • Predefined classes usually do not match users’ diverse and changing search targets.

  5. Challenges • Collecting unbiased unlabeled data from the universal set. • Random sampling of web pages on the Internet. • Achieving classification accuracy from positive and unlabeled data as high as from fully labeled data. • The PEBL framework (Mapping-Convergence algorithm using SVM).

  6. Related Work • Semisupervised Learning • Requires a sample of labeled (+/-) and unlabeled data • EM algorithm • Transductive SVM • Single-Class Learning or Classification • Rule-based (k-DNF) • Not tolerant of sparse, high-dimensional data. • Requires knowledge of the proportion of positive instances in the universal set. • Probability-based • Requires prior probabilities for each class. • Assumes linear separability. • OSVM, Neural Networks

  7. Main Contribution • Collecting only positive samples speeds up the process of building classifiers. • The universal set of unlabeled samples can be reused for training different classifiers. • This supports example-based querying on the Internet. • PEBL achieves accuracy as high as that of a typical framework without loss of efficiency in testing.

  8. SVM Overview
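[Slide 8 appears to be a figure; as background, the soft-margin SVM formulation that PEBL builds on solves:]

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
  \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad
  y_i\bigl(\mathbf{w}\cdot\Phi(\mathbf{x}_i) + b\bigr) \ge 1 - \xi_i,
  \quad \xi_i \ge 0,
```

where the Gaussian (RBF) kernel $K(\mathbf{x}, \mathbf{x}') = \exp\!\bigl(-\lVert \mathbf{x} - \mathbf{x}' \rVert^2 / 2\sigma^2\bigr)$ mentioned on slide 13 implicitly defines the feature map $\Phi$.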

  9. Mapping-Convergence Algorithm • Mapping Stage • A weak classifier (Φ1) draws an initial approximation of “strong” negative data. • Φ1 must not generate false negatives. • Convergence Stage • Iterates with a second base classifier (Φ2) that maximizes the margin to make a progressively better approximation of the negative data. • Φ2 must maximize the margin.
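[The two stages above can be sketched as a loop. This is an illustrative sketch only: `strong_negatives` stands in for the weak classifier Φ1, and scikit-learn's `SVC` with an RBF kernel stands in for Φ2 (the paper itself uses LIBSVM); all names are assumptions, not from the paper.]

```python
import numpy as np
from sklearn.svm import SVC


def mapping_convergence(P, U, strong_negatives, max_iter=20):
    """Sketch of the M-C loop.

    P: positive samples, U: unlabeled samples (2-D arrays).
    strong_negatives: weak classifier (Phi1) returning a boolean mask
    over U marking the initial "strong" negatives; it must not
    produce false negatives.
    """
    mask = strong_negatives(U)          # mapping stage (Phi1)
    N = U[mask]                         # accumulated negative training data
    rest = U[~mask]                     # still-unlabeled samples
    clf = None
    for _ in range(max_iter):           # convergence stage (Phi2)
        clf = SVC(kernel="rbf")         # Gaussian kernel, as on slide 13
        X = np.vstack([P, N])
        y = np.concatenate([np.ones(len(P)), -np.ones(len(N))])
        clf.fit(X, y)
        if len(rest) == 0:
            break
        pred = clf.predict(rest)
        new_neg = rest[pred == -1]
        if len(new_neg) == 0:           # converged: no new negatives found
            break
        N = np.vstack([N, new_neg])     # grow the negative approximation
        rest = rest[pred == 1]
    return clf
```

Each iteration moves newly classified negatives out of the unlabeled pool, so the negative set only grows and the loop terminates once no new negatives appear.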

  10. Mapping Stage • Comparing the frequency of features within the positive and unlabeled samples gives us a list of positive features. • Filtering out all samples that contain positive features leaves behind just the “strong” negative samples.
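[The frequency-based filter described on this slide can be sketched as follows. Documents are simplified to sets of feature tokens, and a feature counts as "positive" when it occurs more frequently in the positive documents than in the unlabeled ones; function names are illustrative, not from the paper.]

```python
def positive_features(pos_docs, unl_docs):
    """Features occurring more often in positive than in unlabeled docs."""
    vocab = set().union(*pos_docs, *unl_docs)

    def freq(f, docs):
        # fraction of documents containing feature f
        return sum(f in d for d in docs) / len(docs)

    return {f for f in vocab if freq(f, pos_docs) > freq(f, unl_docs)}


def strong_negatives(pos_docs, unl_docs):
    """Unlabeled docs containing no positive feature at all."""
    pf = positive_features(pos_docs, unl_docs)
    return [d for d in unl_docs if not (d & pf)]
```

Because a document is kept only if it shares no feature with the positive-feature list, the filter errs on the side of discarding: what survives is very unlikely to be positive, matching the "no false negatives" requirement on Φ1.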

  11. Mapping-Convergence Algorithm

  12. Mapping-Convergence Algorithm

  13. Experiments • LIBSVM for the SVM implementation. • Gaussian kernels for better text-categorization accuracy. • Experiment 1: The Internet • 2,388 pages from DMOZ as the unlabeled dataset • 368 personal homepages, 449 non-homepages • 192 college admission pages, 450 non-admission pages • 188 resume pages, 533 non-resume pages

  14. Experiments • Experiment 2: University CS Departments • 4,093 pages from WebKB as the unlabeled dataset • 1,641 student pages, 662 non-student pages • 504 project pages, 753 non-project pages • 1,124 faculty pages, 729 non-faculty pages • Precision-recall (P-R) breakeven point is used as the performance measure. • Compared against: • TSVM: traditional SVM trained with labeled negatives • OSVM: one-class SVM • UN: treating all unlabeled data as negative instances
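[The P-R breakeven point named as the performance measure above can be computed with the standard ranking definition: precision (which then equals recall) when exactly as many items are predicted positive as there are true positives. This sketch and its names are illustrative, not taken from the paper.]

```python
def pr_breakeven(scores, labels):
    """P-R breakeven point for a ranked list.

    scores: classifier outputs (higher = more positive).
    labels: 1 for true positives, 0 otherwise.
    At rank n_pos, precision = tp/n_pos = recall, so the two curves cross.
    """
    n_pos = sum(labels)
    # take the n_pos highest-scoring items and count true positives among them
    top = sorted(zip(scores, labels), reverse=True)[:n_pos]
    return sum(y for _, y in top) / n_pos
```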

  15. Experiments

  16. Experiments

  17. Summary • Classifying web pages of an interesting class normally requires laborious preprocessing. • The PEBL framework eliminates the need for negative training samples. • The M-C algorithm achieves accuracy as high as a traditional SVM. • Training time incurs only an additional multiplicative logarithmic factor over standard SVM training.

  18. End of Show
