PEBL: Web Page Classification without Negative Examples Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, JAN 2004
Outline • Problem statement • Motivation • Related work • Main contribution • Technical details • Experiments • Summary
Problem Statement • Classify web pages into "user-interesting" classes. • E.g., a "Home-Page Classifier" or a "Call-for-Papers Classifier". • Negative samples are not given explicitly. • Only positive and unlabeled samples are available.
Motivation • Collecting negative samples can be delicate and arduous. • Negative samples must uniformly represent the universal set. • Manually collected negative training examples may be biased. • Predefined classes usually do not match users' diverse and changing search targets.
Challenges • Collecting unbiased unlabeled data from the universal set: random sampling of web pages on the Internet. • Achieving classification accuracy from positive and unlabeled data as high as from fully labeled data: the PEBL framework (Mapping-Convergence algorithm using SVM).
Related Work • Semisupervised learning • Requires samples of labeled (+/-) and unlabeled data • EM algorithm • Transductive SVM • Single-class learning or classification • Rule-based (k-DNF) • Not tolerant of sparse, high-dimensional data • Requires knowing the proportion of positive instances in the universal set • Probability-based • Requires prior probabilities for each class • Assumes linear separability • OSVM, neural networks
Main Contribution • Collecting only positive samples speeds up the process of building classifiers. • The universal set of unlabeled samples can be reused for training different classifiers. • This supports example-based queries on the Internet. • PEBL achieves accuracy as high as that of a typical framework, without loss of efficiency in testing.
Mapping-Convergence (M-C) Algorithm • Mapping stage • A weak classifier Φ1 draws an initial approximation of "strong" negative data. • Φ1 must not generate false negatives. • Convergence stage • Iterates with a second base classifier Φ2 that maximizes the margin, making progressively better approximations of the negative data. • Φ2 must maximize the margin. • (A minimal sketch of the loop follows below.)
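A minimal sketch of the M-C loop, assuming scikit-learn's SVC as a stand-in for the paper's SVM implementation and a caller-supplied strong_negatives function for the mapping stage; the function names and the stopping rule are assumptions for illustration, not the paper's code:

```python
import numpy as np
from sklearn.svm import SVC

def mapping_convergence(P, U, strong_negatives, max_iter=20):
    """P: positive samples, U: unlabeled samples (feature matrices).
    strong_negatives(P, U) -> boolean mask over U selecting the
    initial 'strong' negatives (the mapping stage, Φ1)."""
    neg_mask = strong_negatives(P, U)
    N = U[neg_mask]                        # accumulated negatives
    remaining = U[~neg_mask]               # still-unlabeled pool
    clf = None
    for _ in range(max_iter):
        X = np.vstack([P, N])
        y = np.hstack([np.ones(len(P)), -np.ones(len(N))])
        clf = SVC(kernel="rbf").fit(X, y)  # Φ2: margin-maximizing SVM
        if len(remaining) == 0:
            break
        pred = clf.predict(remaining)
        new_neg = remaining[pred == -1]    # samples Φ2 labels negative
        if len(new_neg) == 0:              # boundary stable: converged
            break
        N = np.vstack([N, new_neg])        # grow the negative set
        remaining = remaining[pred != -1]
    return clf
```

Each iteration retrains on the enlarged negative set, so the boundary converges toward the true positive/negative separation.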
Mapping Stage • Checking the frequency of features within the positive and unlabeled samples yields a list of positive features. • Filtering out all samples that contain positive features leaves behind only the "strong" negative samples (sketched below).
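One possible reading of this filtering step, assuming binary document-term matrices and a simple frequency-comparison heuristic for "positive features"; the paper's exact threshold rule may differ:

```python
import numpy as np

def strong_negatives(P, U):
    """P, U: binary document-term matrices (samples x features).
    Returns a boolean mask over U marking 'strong' negatives."""
    freq_P = (P > 0).mean(axis=0)          # feature frequency in positives
    freq_U = (U > 0).mean(axis=0)          # feature frequency in unlabeled
    positive_feats = freq_P > freq_U       # features typical of positives
    has_pos_feat = (U[:, positive_feats] > 0).any(axis=1)
    return ~has_pos_feat                   # keep samples with none of them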
Experiments • LIBSVM for the SVM implementation (toy call below). • Gaussian kernels for better text-categorization accuracy. • Experiment 1: the Internet • 2,388 pages from DMOZ as the unlabeled dataset • 368 personal homepages, 449 non-homepages • 192 college admission pages, 450 non-admission pages • 188 resume pages, 533 non-resume pages
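For concreteness, a toy LIBSVM call with a Gaussian (RBF) kernel ('-t 2'), assuming the libsvm Python bindings; the data and the gamma value are placeholders, not the paper's settings:

```python
from libsvm.svmutil import svm_train, svm_predict

y = [1, 1, -1, -1]                                 # toy labels
x = [{1: 0.9}, {1: 0.7}, {2: 0.8}, {2: 0.6}]       # sparse feature dicts
model = svm_train(y, x, '-t 2 -g 0.5')             # RBF kernel, gamma=0.5
p_labels, p_acc, p_vals = svm_predict(y, x, model)
```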
Experiments • Experiment 2: university CS departments • 4,093 pages from WebKB as the unlabeled dataset • 1,641 student pages, 662 non-student pages • 504 project pages, 753 non-project pages • 1,124 faculty pages, 729 non-faculty pages • The precision-recall (P-R) breakeven point is used as the performance measure (computed as in the sketch below). • Compared against • TSVM: traditional SVM trained with manually labeled data • OSVM: one-class SVM • UN: treating all unlabeled data as negative instances
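A small sketch of the P-R breakeven computation: precision equals recall exactly when the classifier predicts as many positives as truly exist, so it suffices to score the test set and take the top-ranked slice of that size (the ranking-by-decision-value setup is an assumption for illustration):

```python
import numpy as np

def pr_breakeven(scores, labels):
    """scores: SVM decision values; labels: +1 positive, -1 otherwise."""
    n_pos = int((labels == 1).sum())
    top = np.argsort(-scores)[:n_pos]     # highest-scoring n_pos samples
    tp = int((labels[top] == 1).sum())    # true positives among them
    return tp / n_pos                     # precision == recall here
```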
Summary • Classifying web pages of a user-interesting class normally requires laborious preprocessing to collect negative training data. • The PEBL framework eliminates the need for negative training samples. • The M-C algorithm achieves accuracy as high as that of a traditional SVM. • Training adds a multiplicative logarithmic factor on top of SVM training time.