Using Web Structure for Classifying and Describing Web Pages. Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake. International World Wide Web Conference, 2002. Presented by Zaihan Yang, CSE Web Mining.
Introduction • Aim • Classification of web pages • Description of web pages (to name clusters of web pages) • Using web structure • Extracting patterns from hyperlinks on the web. • A hyperlink provides: • The destination page • The associated anchortext describing the link
Typical Text-based Classification • Uses the words (or phrases) of the target document, keeping the most significant features. • Not always effective. • E.g., the home page of General Motors (www.gm.com) does not state that it is a car company. • Text sources considered: • Full text • Anchortext • Extended anchortext • A combination
Virtual Document • A virtual document is: • A collection of anchortexts or extended anchortexts from links pointing to the target document. • Anchortext: • The words occurring inside of a link • Extended anchortext: • The set of rendered words occurring up to 25 words before and after an associated link (as well as the anchortext itself).
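The definitions above can be sketched in a few lines of Python. This is only an illustration: the tokenizer, the `(before, anchor, after)` tuple layout, and the sample inbound links are assumptions made for the example, not part of the original work.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a simple stand-in for 'rendered words'."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def extended_anchortext(before, anchor, after, window=25):
    """Anchortext plus up to `window` words on each side of the link."""
    b = tokenize(before)
    a = tokenize(after)
    return b[max(0, len(b) - window):] + tokenize(anchor) + a[:window]

def virtual_document(inlinks, window=25, extended=True):
    """Concatenate (extended) anchortexts of all links pointing to one target."""
    words = []
    for before, anchor, after in inlinks:
        if extended:
            words.extend(extended_anchortext(before, anchor, after, window))
        else:
            words.extend(tokenize(anchor))
    return words

# Hypothetical inbound links to www.gm.com: (text before, anchortext, text after).
inlinks = [
    ("General Motors is a car company; see", "GM home page", "for its vehicle lineup"),
    ("Popular automobile makers include", "General Motors", "and others"),
]
anchor_doc = virtual_document(inlinks, extended=False)
extended_doc = virtual_document(inlinks, window=3)  # small window for readability
```

In this toy example the word "car" appears only in the extended-anchortext virtual document, which mirrors the General Motors motivation given earlier.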
Main Method • Full-text classifier • Virtual-document classifier • Two improvement methods • Naming a cluster • Main procedure: Datasets → Features → Expected entropy loss ranking → Train SVM
Datasets • Positive: a set of web pages downloaded from various Yahoo! categories. • Negative: random documents from outside Yahoo! • WebKB dataset • Features: • All words and two- or three-word phrases • E.g., "My favorite game is scrabble." • Possible features: my, my favorite, my favorite game, favorite, favorite game, etc.
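The feature definition above (all words plus two- and three-word phrases) could be sketched as follows. The lowercasing and punctuation stripping are assumptions for the example; the slides do not say how the text was normalized.

```python
def ngram_features(text, max_n=3):
    """Return the set of all 1- to max_n-word phrases in `text`."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats

features = ngram_features("My favorite game is scrabble.")
```

For the five-word example sentence this yields five single words, four two-word phrases, and three three-word phrases.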
Dimensionality Reduction • Removes useless features. • Two-step process: • First, remove every feature that does not occur in a specified percentage of documents, i.e. remove feature f if (|Af|/|A| < T+) and (|Bf|/|B| < T-) • A: the set of positive examples. • B: the set of negative examples. • Af: documents in A that contain feature f. • Bf: documents in B that contain feature f. • T+: threshold for positive features. • T-: threshold for negative features. • Second, rank the remaining features by expected entropy loss.
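A minimal sketch of the first (thresholding) step, assuming each document is represented as a set of features. The toy documents and the threshold values of 0.5 are made up for illustration.

```python
def threshold_filter(positives, negatives, t_pos, t_neg):
    """Keep features unless they are rare in BOTH sets.

    positives, negatives: non-empty lists of feature sets (A and B).
    A feature f is removed when |Af|/|A| < t_pos and |Bf|/|B| < t_neg.
    """
    vocab = set().union(*positives, *negatives)
    kept = set()
    for f in vocab:
        frac_pos = sum(f in doc for doc in positives) / len(positives)
        frac_neg = sum(f in doc for doc in negatives) / len(negatives)
        if not (frac_pos < t_pos and frac_neg < t_neg):
            kept.add(f)
    return kept

# Hypothetical toy data.
A = [{"car", "gm"}, {"car", "truck"}, {"car"}]
B = [{"game"}, {"game", "rare"}, {"news"}, {"game"}]
kept = threshold_filter(A, B, t_pos=0.5, t_neg=0.5)
```

Here "car" survives because it is common among the positives and "game" survives because it is common among the negatives; everything else falls below both thresholds.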
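Expected Entropy Loss • Notation: C is the positive class, ¬C its complement, f the event that the feature is present, ¬f that it is absent; lg is log base 2. • The prior entropy of the class distribution: e = -Pr(C) lg Pr(C) - Pr(¬C) lg Pr(¬C) • The posterior entropy of the class when the feature is present: e_f = -Pr(C|f) lg Pr(C|f) - Pr(¬C|f) lg Pr(¬C|f) • The posterior entropy of the class when the feature is absent: e_¬f = -Pr(C|¬f) lg Pr(C|¬f) - Pr(¬C|¬f) lg Pr(¬C|¬f) • The expected entropy loss: e - (e_f Pr(f) + e_¬f Pr(¬f))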
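As a worked sketch, expected entropy loss is the prior class entropy minus the expected posterior entropy after observing whether the feature is present. The code below estimates every probability directly from document counts, which is an assumption of this example; the function and argument names are illustrative.

```python
from math import log2

def entropy(p):
    """Binary entropy (in bits) of a class that has probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def expected_entropy_loss(n_pos, n_neg, n_pos_f, n_neg_f):
    """n_pos, n_neg: sizes of the positive and negative sets.
    n_pos_f, n_neg_f: how many documents in each set contain the feature."""
    total = n_pos + n_neg
    n_f = n_pos_f + n_neg_f          # documents with the feature
    n_not_f = total - n_f            # documents without the feature
    prior = entropy(n_pos / total)
    e_f = entropy(n_pos_f / n_f) if n_f else 0.0
    e_not_f = entropy((n_pos - n_pos_f) / n_not_f) if n_not_f else 0.0
    pr_f = n_f / total
    return prior - (pr_f * e_f + (1 - pr_f) * e_not_f)

perfect = expected_entropy_loss(10, 10, 10, 0)   # feature in every positive, no negative
useless = expected_entropy_loss(10, 10, 5, 5)    # feature equally common in both sets
```

A feature that perfectly separates two equally sized sets removes the full bit of prior entropy, while a feature that is equally common in both sets removes none.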
Train SVM • A set of data points: {(x1,y1),…, (xN,yN)} • xi is an input and yi is a target output (1 or -1). • Separating hyperplane: • w•φ(x) + b = 0 • w•φ(xi) + b ≥ 1 if yi = 1 • w•φ(xi) + b ≤ -1 if yi = -1 • where w and b are found by minimizing ‖w‖² subject to these constraints • Output: f(x) = Σi αi yi K(xi, x) + b • Kernel function: K(xi, xj) = φ(xi)•φ(xj)
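To make the real-valued output concrete, here is a small sketch that evaluates a kernel decision function of the standard form f(x) = Σ αi yi K(xi, x) + b. It does not train anything: the two support vectors, the multipliers, and the linear kernel are hand-picked assumptions for a two-point toy problem.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v, i.e. phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))

def svm_output(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Real-valued SVM output; its sign is the predicted class."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

# Hypothetical solution for the points (1, 1) -> +1 and (-1, -1) -> -1.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

on_margin_pos = svm_output((1.0, 1.0), support_vectors, labels, alphas, b)
on_margin_neg = svm_output((-1.0, -1.0), support_vectors, labels, alphas, b)
near_boundary = svm_output((0.2, 0.2), support_vectors, labels, alphas, b)
```

With these values the two training points sit exactly on the margins (outputs +1 and -1), and a point close to the hyperplane gets an output of small magnitude.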
Improvement-Uncertainty Sampling • The output of an SVM classifier is a real number between -∞ and +∞. • An output in the interval (-1,1) is less certain than one in (-∞,-1) or (1,+∞). • The region (-1,1) is called the “uncertain region”. • Uncertainty sampling • A human judges the documents that fall in the “uncertain region”.
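The routing step might look like the sketch below, which sends outputs inside (-1, 1) to a human and labels the rest by sign. Treating outputs of exactly ±1 as certain is an assumption of this example, and the document ids and scores are invented.

```python
def split_by_certainty(outputs):
    """outputs: dict mapping document id -> real-valued SVM output.

    Returns (auto_labels, needs_review): documents outside the uncertain
    region are labeled by sign; the rest are queued for a human judge.
    """
    auto_labels, needs_review = {}, []
    for doc, out in outputs.items():
        if -1.0 < out < 1.0:
            needs_review.append(doc)
        else:
            auto_labels[doc] = 1 if out > 0 else -1
    return auto_labels, needs_review

# Hypothetical classifier outputs.
outputs = {"a.html": 2.3, "b.html": -0.4, "c.html": -1.7, "d.html": 0.9}
auto_labels, needs_review = split_by_certainty(outputs)
```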
Improvement-Combination • Combines results from the extended-anchortext (extended-AT) classifier with the less accurate full-text classifier. • Decision procedure for a web page: • Run the extended-AT classifier. • Is its result negative but uncertain? • No: return the result of the extended-AT classifier. • Yes: run the full-text classifier. • Is the full-text result positive and |output| > |outputAT|? • Yes: positive. • No: negative.
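The combination rule can be written as a small function. This sketch assumes "negative but uncertain" means an extended-AT output strictly between -1 and 0, and that an output of exactly 0 counts as negative; neither boundary choice is stated on the slide.

```python
def combine(output_at, output_full):
    """Combine extended-anchortext and full-text SVM outputs.

    Returns +1 (positive) or -1 (negative).
    """
    negative_but_uncertain = -1.0 < output_at < 0.0
    if not negative_but_uncertain:
        # Trust the extended-AT classifier as is.
        return 1 if output_at > 0 else -1
    # Otherwise let a sufficiently confident positive full-text result win.
    if output_full > 0 and abs(output_full) > abs(output_at):
        return 1
    return -1
```

For example, `combine(-0.3, 0.8)` returns 1 because the full-text classifier is positive and more confident, while `combine(-2.0, 3.0)` returns -1 because the extended-AT result is negative and outside the uncertain region. The rule can only turn negatives into positives, never the reverse.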
Name the Cluster • Use the top-ranked features extracted from the extended-anchortext virtual documents to name a cluster. • Assumptions: • The words near the anchortext describe the target document. • The features ranked highest by expected entropy loss are those that occur in many positive examples and few negative ones.
Results-classifying • Anchortext alone is comparable to the full text for classification. • Classification accuracy improves significantly when the extended anchortext is used instead of the document full text. • The combination method is highly effective at improving positive-class accuracy, but reduces negative-class accuracy. • Uncertainty sampling required examining only 8% of the documents on average, while improving average positive-class accuracy by almost 10 percentage points.
Results-clustering • For describing a cluster, the full text appears comparable to the extended anchortext. • The anchortext alone appears to do a poor job of describing the category.
Future Work • To include other features of the inbound web pages besides extended anchortext. • To examine the effects of the number of inbound links. • To examine the nature of the category by expanding this to thousands of categories. • To study the effects of the positive set size.