Web Page Categorization without the Web Page Author: Min-Yen Kan WWW-2004
Basic Idea • Web page categorization is treated much like text categorization • Some approaches retrieve the whole document • This yields URLs of additional documents • Could result in cyclic or non-terminating crawling • Instead, glean information from intuitive URLs • Avoids the bottleneck
An Example • http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html • Classify the above webpage into one of the following categories: • Course • Faculty • Project • Student
Approach • Two-phase URL segmentation • First phase • Baseline • scheme://host/path-elements/document.extension • Further segmentation, e.g., faculty-info → faculty info • Refined • Break a segment wherever a transition between uppercase, lowercase, and digits is observed
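The first-phase segmentation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `baseline_segment` and `refined_segment` are hypothetical, splitting at every non-alphanumeric character is an assumption about the baseline, and keeping a leading capital attached to its word (so `Info` is not split into `I` and `nfo`) is also an assumption that the slides do not state.

```python
import re
from urllib.parse import urlsplit


def baseline_segment(url):
    """Baseline: split scheme, host, and path at non-alphanumeric characters."""
    parts = urlsplit(url)
    raw = f"{parts.scheme} {parts.netloc} {parts.path}"
    # Any run of punctuation or whitespace acts as a delimiter (assumption).
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok]


# Runs of capitals not followed by lowercase, capitalized words,
# lowercase runs, and digit runs. Treating "Info" as one token is an
# assumption made for this sketch.
_CASE_RUNS = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+")


def refined_segment(url):
    """Refined: additionally break tokens at case and letter/digit transitions."""
    tokens = []
    for tok in baseline_segment(url):
        tokens.extend(_CASE_RUNS.findall(tok))
    return tokens


if __name__ == "__main__":
    url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
    print(baseline_segment(url))
    # ['http', 'cs', 'cornell', 'edu', 'Info', 'Courses', 'Current', 'CS415', 'CS415', 'html']
    print(refined_segment(url))
    # ['http', 'cs', 'cornell', 'edu', 'Info', 'Courses', 'Current', 'CS', '415', 'CS', '415', 'html']
```

On the Cornell example, the refined pass separates `CS` from `415`, so the course-like token `CS` becomes available as a feature on its own.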
Approach • Second phase • Information content reduction • Examines all possible partitions of a segment • Sums the information content (IC) of the parts in each partition • Picks the partition with the lowest total IC • Title-token-based finite-state transducer • What about acronyms? • A non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
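The information content reduction step can be illustrated with a small sketch. It assumes that the IC of a token is the negative log2 of its relative frequency in some reference corpus; the toy counts in `COUNTS`, the pseudo-count `UNSEEN_COUNT` for unseen tokens, and the function names are all hypothetical and chosen only for illustration. Enumerating every partition is exponential in the segment length, which is acceptable for short URL segments.

```python
import math
from itertools import combinations

# Hypothetical toy token counts; a real system would use corpus statistics.
COUNTS = {"home": 50, "page": 40, "me": 30, "age": 20, "the": 60}
TOTAL = sum(COUNTS.values())
UNSEEN_COUNT = 0.01  # assumed pseudo-count so unseen tokens get a high, finite IC


def ic(token):
    """Information content: -log2 of the token's (smoothed) relative frequency."""
    return -math.log2(COUNTS.get(token, UNSEEN_COUNT) / TOTAL)


def partitions(segment):
    """Yield every way to cut a non-empty segment into contiguous parts."""
    n = len(segment)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [segment[i:j] for i, j in zip(bounds, bounds[1:])]


def best_partition(segment):
    """Return the partition whose parts have the lowest summed IC."""
    return min(partitions(segment), key=lambda parts: sum(ic(p) for p in parts))


if __name__ == "__main__":
    print(best_partition("homepage"))  # ['home', 'page']
    print(best_partition("xyz"))       # ['xyz']
```

With these toy counts, `homepage` is split because two frequent tokens together carry less information than one unseen token, while `xyz` is left whole because every split only adds more unseen parts.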
An Example • nytimes → New York Times • ФNewYorkTimes • Score of 12 and outputs |n|y|times • R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4
Experiments • Dataset used: WebKB (4167 pages) • Pages classified as student, faculty, course, or project • Classifier used: SVM • Compared with: FOIL-PILFS (based on inductive logic programming) • Evaluation based on (U)RL {Ub, Ur, Ui, Uf}, (A)nchor text, (T)itle text, and page te(X)t
Conclusion • URLs contain tokens effective for classification • It's faster • Careful URL segmentation boosts classification • URL segmentation is more powerful than expansion • Can assist source-based classification to a limited extent • FST cannot expand what it hasn’t seen • Cryptic URLs are hard to tackle