Fast Webpage classification using URL features

Authors: Min-Yen Kan and Hoang Oanh Nguyen Thi • Conference: CIKM 2005 • Reporter: Yi-Ren Yeh

Outline • Introduction • URL Feature Extraction • Recursive segmentation • Using URL feature classes • Experimental results • Conclusion

Introduction • A web page's uniform resource locator (URL) is the least expensive feature to obtain • It is also one of the more informative sources for classification • The authors approach webpage classification using only the URL • Extract features from the URL • Apply machine learning algorithms to those features

URL Baseline Segmentation • Segment the URL at non-alphanumeric characters and at URI-escaped entities (e.g., '%20') to create smaller tokens • Baseline segmentation is straightforward to implement and typically yields 4-7 tokens
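The baseline segmentation above can be sketched in a few lines of Python. This is only an illustration: the regular expression, the lowercasing step, the function name baseline_segment, and the example URL are assumptions of this sketch, not details given on the slide.

```python
import re

# Treat a URI-escaped entity (e.g. %20) or any single non-alphanumeric
# character as a separator; consecutive separators are merged into one run.
# The pattern is this sketch's reading of the slide, not the authors' code.
_SEPARATORS = re.compile(r"(?:%[0-9A-Fa-f]{2}|[^A-Za-z0-9])+")

def baseline_segment(url: str) -> list[str]:
    """Return the lowercased alphanumeric tokens of a URL (lowercasing is assumed)."""
    return [tok.lower() for tok in _SEPARATORS.split(url) if tok]

# Hypothetical URL used only for illustration.
print(baseline_segment("http://example.edu/cs101/home%20page.html"))
```

Checking for the escaped entity before the single character keeps the digits of '%20' from leaking into the token list as a spurious "20" token.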

Recursive Segmentation • Concatenated words (e.g., activatealert) are especially prevalent in website domain names • Segmenting these tokens into their component words is likely to increase performance • In addition to the baseline, the paper segments tokens by information content (entropy) reduction • A token T can be split into n partitions t1, ..., tn if the partitioning has lower entropy than T itself, where ti denotes the ith partition of T

Recursive Segmentation • A partitioning with lower entropy than the others is a more probable parse of the token • Such entropies can be estimated from token frequencies collected in a large corpus • A tree partitioning strategy, O(n log n), replaces the search over all 2^(|T|-1) possible partitions
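A minimal sketch of the recursive idea follows, assuming the entropy of a token is approximated by its information content, -log2 P(token), estimated from corpus counts. The counts in COUNTS, the corpus size TOTAL, the pseudo-count UNSEEN, and the function names are all hypothetical, and the split criterion is one reading of the slide rather than the authors' exact formula.

```python
import math

# Hypothetical token counts standing in for frequencies from a large corpus.
COUNTS = {"activate": 50, "alert": 80, "act": 120, "web": 300, "page": 250}
TOTAL = 1000    # assumed corpus size
UNSEEN = 0.01   # assumed pseudo-count for tokens missing from COUNTS

def info(token: str) -> float:
    """Information content -log2 P(token) under the toy counts."""
    return -math.log2(COUNTS.get(token, UNSEEN) / TOTAL)

def segment(token: str) -> list[str]:
    """Split a token in two wherever that lowers total information, then recurse."""
    best_cost, best_i = info(token), None
    for i in range(1, len(token)):
        cost = info(token[:i]) + info(token[i:])
        if cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:          # no binary split beats the unsplit token
        return [token]
    return segment(token[:best_i]) + segment(token[best_i:])

print(segment("activatealert"))
```

Each call considers only binary splits and recurses on the two halves, which is what avoids enumerating every possible partition of the token.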

URI Components and Length features • First split the URL according to the URI syntax: scheme :// host / path / document . extension ? query # fragment • A token that occurs in different parts of a URL may contribute differently to classification • The authors extend the feature set by qualifying each token with the component in which it occurs • The absence of certain components can influence classification as well
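One way to qualify tokens with their URI component is sketched below using Python's standard urllib.parse.urlsplit. The "component:token" naming, the ABSENT marker, the function name component_features, and the example URL are assumptions of this sketch; the slides do not specify how the authors encode these features.

```python
import re
from urllib.parse import urlsplit

def component_features(url: str) -> list[str]:
    """Return tokens prefixed by their URI component, plus absence markers."""
    parts = urlsplit(url)
    # Separate the last path segment (the document) from the directories.
    path, _, document = parts.path.rpartition("/")
    name, dot, ext = document.rpartition(".")
    if not dot:                      # no extension: keep the whole segment as the name
        name, ext = document, ""
    components = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": path,
        "document": name,
        "extension": ext,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    features = []
    for comp, text in components.items():
        tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
        if tokens:
            features += [f"{comp}:{t}" for t in tokens]
        else:
            features.append(f"{comp}:ABSENT")   # a missing component is a feature too
    return features

# Hypothetical URL used only for illustration.
print(component_features("http://www.cs.example.edu/courses/cs101/index.html"))
```

With this encoding the token "cs101" in the path and the same token in a query string become distinct features, and a bare domain name is distinguishable by its run of ABSENT markers.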

Orthographic Features • Using only the surface form of a token presents challenges for generalization • e.g., 2002 vs. 2003 are distinct tokens although both are four-digit numbers • Add features for tokens containing capital letters and/or numbers, differentiating these tokens by their length • These features are added both as general, URL-wide features and as URI component-specific ones
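The generalization step can be illustrated as follows. The feature names ("Numeric4", "Caps3", and so on), the optional component prefix, and the function name orthographic_feature are hypothetical choices made for this sketch; the slides only state that such tokens are differentiated by length.

```python
from typing import Optional

def orthographic_feature(token: str, component: str = "") -> Optional[str]:
    """Map a token with capitals and/or digits to a shape-plus-length feature.

    Returns None for tokens that contain neither capitals nor digits.
    """
    has_upper = any(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if not (has_upper or has_digit):
        return None
    kind = ("Caps" if has_upper else "") + ("Numeric" if has_digit else "")
    feature = f"{kind}{len(token)}"
    # An optional prefix gives the URI component-specific variant.
    return f"{component}:{feature}" if component else feature

print(orthographic_feature("2002"), orthographic_feature("2003"))
print(orthographic_feature("CS101", component="path"))
```

Under this scheme 2002 and 2003 collapse to the same feature, so a classifier that saw one year in training can still use the other at test time.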

Sequential Features • Token n-grams may also help classification • The authors use 2-, 3-, and 4-grams • The sequential order among tokens also matters • e.g., "web spider" vs. "spider web" • Model left-to-right precedence between tokens
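Both kinds of sequential feature can be sketched briefly. The feature string formats and the function names are assumptions, and the precedence sketch pairs every earlier token with every later one, which may be broader than what the authors actually use.

```python
from itertools import combinations

def ngram_features(tokens: list[str], sizes=(2, 3, 4)) -> list[str]:
    """Contiguous token n-grams for each requested size."""
    return [
        "_".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]

def precedence_features(tokens: list[str]) -> list[str]:
    """One 'a>b' feature for every pair where a appears to the left of b."""
    return [f"{a}>{b}" for a, b in combinations(tokens, 2)]

tokens = ["web", "spider", "net"]   # hypothetical token sequence
print(ngram_features(tokens))
print(precedence_features(tokens))
```

Unlike n-grams, the precedence features also capture order between tokens that are not adjacent, so "web>spider" and "spider>web" remain distinct even when other tokens sit between the two.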

Evaluate on Multi-class Classification • Employ a subset of WebKB containing 4,167 pages • Four classes (student, faculty, course, and project) • Use SVM and maximum entropy classification methods • The macro-averaged F measure is used for evaluation
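For reference, a macro-averaged F measure gives every class equal weight regardless of its size. The sketch below assumes the balanced F1 variant (equal weight on precision and recall), which the slide does not state explicitly; the function name macro_f1 and the label lists are made up for illustration.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Average the per-class F1 scores with equal weight per class."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, not results from the paper.
gold = ["student", "student", "faculty", "course"]
pred = ["student", "faculty", "faculty", "course"]
print(round(macro_f1(gold, pred), 4))
```

Because each class counts equally, a classifier cannot score well on this measure by doing well only on the largest class.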

Results on WebKB

Evaluate on Hierarchical Categorization • Evaluate on the Open Directory Project (ODP) • Use the snapshot dated 3 August 2004, which encompasses over 4.4M URLs categorized into 17 first-level and 508 second-level categories • The authors use 100,000 randomly chosen ODP URLs to assemble a testing (and training) corpus for the two-level, hierarchical experiments • Only 360 second-level categories are used

Results on ODP

Conclusion • The authors have extended previous work and added features to model URL component length, content, orthography, token sequence, and precedence • They also evaluate these features over a large set of tasks, including relevance, categorization, and PageRank prediction • These features do not perform as well on typical website entry points (i.e., just the domain name), because they attempt to leverage the internal path structure of the URL
