Web Page Categorization without the Web Page

Author: Min-Yen Kan • WWW 2004

Basic Idea • Web page categorization is usually treated as text categorization • Some approaches retrieve the whole document • This yields URLs of additional documents • Could result in cyclic or non-terminating crawling • Instead, glean information from intuitive URLs • Avoid the page-retrieval bottleneck

An Example • http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html • Classify the above webpage into one of the following categories: • Course • Faculty • Project • Student

Approach • Two-phase URL segmentation • First phase • Baseline • scheme://host/path-elements/document.extension • Further segmentation, e.g. faculty-info → faculty info • Refined • Break the URL wherever a transition between uppercase, lowercase and digits is observed
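The first phase can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `baseline_segments` and `refined_segments` are hypothetical, and the refined splitter assumes that a capital letter followed by lowercase letters (as in "Info") stays together as one word rather than being split at the case change.

```python
import re
from urllib.parse import urlsplit


def baseline_segments(url):
    """Baseline: split scheme, host, path elements, document and extension
    at every non-alphanumeric character (so faculty-info -> faculty, info)."""
    parts = urlsplit(url)
    raw = f"{parts.scheme} {parts.netloc} {parts.path}"
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok]


def refined_segments(url):
    """Refined: additionally break each baseline token at transitions
    between uppercase runs, lowercase runs and digit runs.
    Assumption: a capitalised word such as 'Info' is kept whole."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+"
    out = []
    for tok in baseline_segments(url):
        out.extend(re.findall(pattern, tok))
    return out


url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
print(baseline_segments(url))
print(refined_segments(url))
```

On the example URL from the earlier slide, the refined pass separates "CS415" into "CS" and "415", which the baseline leaves as a single token.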

Approach • Second phase • Information content reduction • Examines all possible partitions of the segment • Sums the information content (IC) of the tokens in each partition • Picks the partition with the lowest total IC • Title token based finite state transducer • What about acronyms? • Non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
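The information content reduction step can be illustrated as follows. This is a sketch under stated assumptions: IC is taken to be the negative log probability of a token, the `TOKEN_PROB` table and `UNSEEN_PROB` floor are invented toy values (a real system would estimate them from a corpus), and the helper names are hypothetical.

```python
import math

# Hypothetical unigram probabilities, for illustration only.
TOKEN_PROB = {"home": 1e-3, "page": 1e-3, "homepage": 1e-7, "me": 1e-3, "ho": 1e-6}
UNSEEN_PROB = 1e-12  # assumed floor for tokens missing from the table


def ic(token):
    """Information content of a token: -log2 of its probability."""
    return -math.log2(TOKEN_PROB.get(token, UNSEEN_PROB))


def partitions(segment):
    """Yield every way to cut `segment` into contiguous non-empty pieces."""
    if not segment:
        yield []
        return
    for i in range(1, len(segment) + 1):
        head = segment[:i]
        for rest in partitions(segment[i:]):
            yield [head] + rest


def best_partition(segment):
    """Return the partition whose summed token IC is lowest."""
    return min(partitions(segment), key=lambda p: sum(ic(t) for t in p))


print(best_partition("homepage"))
```

Enumerating every partition is exponential in the segment length (2^(n-1) candidates for n characters), which is tolerable for short URL segments; a dynamic-programming search would be the usual choice for longer strings.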

An Example • nytimes → New York Times • ФNewYorkTimes • Score of 12 and outputs |n|y|times R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4

Experiments • Dataset used: WebKB (4,167 pages) • Pages classified as student, faculty, course or project • Classifier used: SVM • Compared with: FOIL-PILFS (based on inductive logic programming) • Features evaluated: (U)RL {Ub, Ur, Ui, Uf}, (A)nchor text, (T)itle text and page te(X)t

Conclusion • URLs contain tokens effective for classification • It's faster • Careful URL segmentation boosts classification • URL segmentation is more powerful than expansion • Can assist source-based classification to a limited extent • The FST cannot expand what it hasn't seen • Cryptic URLs are hard to tackle
