Features and Algorithms Web Page Classification • Paper by: XIAOGUANG QI and BRIAN D. DAVISON • Presentation by: Jason Bender
Outline • Introduction to Classification • Background • Classification Types • Classification Methods • Applications • Features • Algorithms • Evolution of Websites
What is web page classification? • The process of assigning a web page to one or more predefined category labels (e.g. news, sports, business…) • Classification is generally posed as a supervised learning problem • A set of labeled data is used to train a classifier, which is then applied to label future examples
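The train-then-label workflow above can be sketched with a tiny naive Bayes text classifier. This is only an illustration: the slides do not name a specific learner, and the training sentences, the `train` and `classify` helpers, and the category names are all made up for this example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """labeled_pages: list of (text, label). Returns a simple naive Bayes model."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of training pages
    vocab = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability for the given text."""
    word_counts, label_counts, vocab = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data standing in for labeled web pages.
training = [
    ("team wins the match with a late goal", "sports"),
    ("player scores in the final game", "sports"),
    ("stock market shares fall as profit drops", "business"),
    ("company reports quarterly profit and revenue", "business"),
]
model = train(training)
print(classify(model, "the team scores a goal"))
```

A real system would train on many labeled pages and richer features; the point here is only the split between a training step and a labeling step.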
Background - Classification Types • Supervised learning problem broken into sub problems: • Subject Classification • Functional Classification • Sentiment Classification • Other types of Classification
Subject Classification • Concerned with subject or topic of the web page • Judging whether a page is about arts, business, sports, etc… Functional Classification • Role that the page is playing • Deciding a page to be a personal homepage, course page, admissions page, etc…
Sentiment Classification • Focuses on the opinion that is presented in a web page Other types of Classification • Such as genre classification and search engine spam classification
Background - Classification Methods • Binary vs. Multiclass • Single Label vs. Multi Label • Soft vs. Hard • Flat vs. Hierarchical
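A small sketch may help separate some of these distinctions. It assumes a classifier has already produced one non-negative score per category; the function names, the example scores, and the 0.3 threshold are hypothetical choices for illustration.

```python
def soft_assignment(scores):
    """Soft classification: turn raw scores into a likelihood per category."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

def hard_single_label(scores):
    """Hard, single-label classification: pick exactly one category."""
    return max(scores, key=scores.get)

def hard_multi_label(scores, threshold=0.3):
    """Hard, multi-label classification: keep every category above a threshold."""
    probs = soft_assignment(scores)
    return sorted(label for label, p in probs.items() if p >= threshold)

# Hypothetical classifier scores for one page.
scores = {"sports": 5.0, "news": 4.0, "business": 1.0}
print(soft_assignment(scores))
print(hard_single_label(scores))
print(hard_multi_label(scores))
```

Binary classification is the special case with two categories, and flat classification treats the categories as unrelated, in contrast to the hierarchical approach covered later.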
Applications • Why is classification important and how can we use it efficiently?
Constructing, maintaining, or expanding web directories • Web directories provide an efficient way to browse for information within a predefined set of categories • Example: • Open Directory Project • Currently constructed by human effort • 78,940 editors of ODP
Improving the quality of search results • A big problem with search results is ambiguity in search queries
Helping question answering systems • Classification systems can be used to help improve the quality of answers • Example: Wolfram Alpha Other applications • Contextual advertising
Features • What features can we extract from a web page to use to help classify it?
Features - Introduction • Because of features such as the hyperlink <a> … </a>, webpage classification is vastly different from other forms of classification such as plaintext classification. • Features organized into two groups: • On-page features – directly located on page • Neighbor features – found on related pages
On Page Features • Textual Contents & Tags • Bag-of-words • N-gram feature • Rather than analyzing individual words, group them into sequences of n consecutive words. • Ex: New York vs. new … … York • Yahoo! has used a 5-gram feature • HTML tags – title, heading, metadata, main text • URL
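The difference between bag-of-words and n-gram features can be sketched as follows. The whitespace tokenizer and the helper names are simplifying assumptions, not something the slides specify.

```python
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def bag_of_words(text):
    """Count individual words, ignoring their order."""
    return Counter(tokenize(text))

def ngrams(text, n):
    """Count sequences of n consecutive words."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "I love New York"
b = "a new plan for York"
# Both texts contain "new" and "york" as single words...
print(bag_of_words(a)["new"], bag_of_words(b)["new"])
# ...but only the first contains the bigram "new york".
print(ngrams(a, 2)["new york"], ngrams(b, 2)["new york"])
```

With bag-of-words alone the two texts look similar on these terms; the bigram feature tells them apart.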
On Page Features • Visual Analysis • Each page has two representations • Text via HTML • Visual via the browser • Each page can be represented as a visual adjacency multigraph
Features of Neighbors • What happens when a page’s features are missing or are unrecognizable?
Features of Neighbors • Assumptions • If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page. • Linked pages are more likely to have terms in common
Features of Neighbors • Neighbor Selection • Focus on pages within 2 steps of target • 6 types: parent, child, sibling, spouse, grandparent, and grandchild
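The six neighbor types can be derived from a list of hyperlinks. The sketch below assumes the usual reading of the terms (a sibling shares a parent with the target, a spouse shares a child with it); the `neighbors` function and the page names are hypothetical.

```python
from collections import defaultdict

def neighbors(links, target):
    """links: iterable of (source, destination) hyperlinks between pages."""
    children_of = defaultdict(set)
    parents_of = defaultdict(set)
    for src, dst in links:
        children_of[src].add(dst)
        parents_of[dst].add(src)

    def step(pages, relation):
        # Follow one link step from every page in `pages`, excluding the target itself.
        result = set()
        for page in pages:
            result |= relation.get(page, set())
        result.discard(target)
        return result

    parents = step({target}, parents_of)
    children = step({target}, children_of)
    return {
        "parent": parents,                           # pages linking to the target
        "child": children,                           # pages the target links to
        "sibling": step(parents, children_of),       # other children of the parents
        "spouse": step(children, parents_of),        # other parents of the children
        "grandparent": step(parents, parents_of),
        "grandchild": step(children, children_of),
    }

links = [("hub", "t"), ("hub", "s"), ("t", "c"), ("p2", "c"), ("root", "hub"), ("c", "gc")]
print(neighbors(links, "t"))
```

Every page returned is at most two link steps away from the target, which matches the selection rule on the slide.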
Features of Neighbors • Labels • Anchor Text • Surrounding Anchor Text • By using the anchor text, surrounding text, and page title of a parent page in combination with text from target page, classification can be improved.
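As a rough illustration of collecting anchor text, the sketch below uses Python's standard `html.parser` to pull out the text of links in a parent page that point at the target. The class name, the example URLs, and the exact-match rule on `href` are assumptions made for this example; real pages would need URL normalization.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collects the text inside <a> elements whose href equals target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.in_target_link = False
        self.anchor_texts = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self.in_target_link = True
            self._buffer = []

    def handle_data(self, data):
        if self.in_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_target_link:
            # Join the collected pieces and collapse runs of whitespace.
            self.anchor_texts.append(" ".join("".join(self._buffer).split()))
            self.in_target_link = False

def anchor_texts(parent_html, target_url):
    parser = AnchorTextParser(target_url)
    parser.feed(parent_html)
    parser.close()
    return parser.anchor_texts

parent_html = ('<p>See <a href="http://example.com/scores">live football scores</a> '
               'and <a href="http://example.com/other">other</a>.</p>')
print(anchor_texts(parent_html, "http://example.com/scores"))
```

The extracted strings could then be appended to the target page's own text before feature extraction, which is one way to combine the two sources.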
Features of Neighbors • Implicit Links • Connections between pages that appear in the results of the same query and are both clicked by users
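Implicit links can be derived from a search click log. The sketch assumes a log of (query, clicked URL) records; the log format, the function name, and the sample entries are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """click_log: iterable of (query, clicked_url) records.

    Returns the set of URL pairs that were both clicked for the same query.
    """
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

log = [
    ("jaguar speed", "cats.example/jaguar"),
    ("jaguar speed", "zoo.example/big-cats"),
    ("jaguar price", "cars.example/jaguar"),
]
print(implicit_links(log))
```

The two pages clicked for "jaguar speed" become implicit neighbors even though neither may link to the other.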
Algorithms • What are the algorithmic approaches to webpage classification? • Dimension reduction • Relational learning • Hierarchical classification • Information combination
Dimension Reduction • Boost classification by emphasizing the features that are most useful for classification • Feature Weighting • Reduces the dimensions of the feature space • Reduces computational complexity • Classification can become more accurate as a result of the reduced space
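One way to picture feature weighting is to count each term with a weight that depends on the HTML tag it appears in. The weights below (title 3.0, h1 2.0, everything else 1.0) are invented for illustration and are not values from the paper; the parser class is likewise a hypothetical helper built on the standard `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; a real system would tune or learn these.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0}
DEFAULT_WEIGHT = 1.0

class WeightedTermParser(HTMLParser):
    """Accumulates a weighted term count based on the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among the currently open tags.
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
                     default=DEFAULT_WEIGHT)
        for word in data.lower().split():
            self.weights[word] += weight

def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.weights

page = ("<html><head><title>Football scores</title></head>"
        "<body><h1>Scores</h1><p>football news today</p></body></html>")
print(weighted_terms(page))
```

Terms that appear in the title or a heading end up with larger values than terms that only appear in body text.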
Dimension Reduction • Methods • Use first fragment • K-nearest neighbor algorithm • Weighted features • Weighted HTML Tags • Metrics • Expected mutual information • Mutual information
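As an illustration of the metrics, the sketch below computes expected mutual information between the presence of a term and membership in a category, using one common textbook definition (a sum over the four present/absent and in/out combinations). The document representation, the function name, and the sample data are assumptions for this example.

```python
import math

def expected_mutual_information(docs, term, category):
    """docs: list of (set_of_terms, label). Result is in bits."""
    n = len(docs)
    total = 0.0
    for term_present in (True, False):
        for in_category in (True, False):
            joint = sum(1 for terms, label in docs
                        if (term in terms) == term_present
                        and (label == category) == in_category)
            p_t = sum(1 for terms, _ in docs if (term in terms) == term_present) / n
            p_c = sum(1 for _, label in docs if (label == category) == in_category) / n
            p_joint = joint / n
            if p_joint > 0:  # empty cells contribute nothing to the sum
                total += p_joint * math.log2(p_joint / (p_t * p_c))
    return total

# Hypothetical documents represented as sets of terms.
docs = [
    ({"goal", "the"}, "sports"),
    ({"goal", "the", "team"}, "sports"),
    ({"the", "profit"}, "business"),
    ({"the", "market"}, "business"),
]
print(expected_mutual_information(docs, "goal", "sports"))
print(expected_mutual_information(docs, "the", "sports"))
```

A term that occurs in every document, such as "the" here, carries no information about the category and is a natural candidate to drop when reducing the feature space.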
Relational Learning • Relaxation Labeling
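The slide gives no detail on relaxation labeling, so the following is only a much-simplified sketch of the general idea: start from content-based class probabilities and repeatedly re-estimate each page's distribution using its neighbors' current estimates. The mixing rule, the `alpha` parameter, and the page names are assumptions, not the procedure from the paper.

```python
def relaxation_labeling(initial, links, alpha=0.5, iterations=10):
    """initial: page -> {label: probability} from a content-only classifier.
    links: page -> list of neighboring pages.
    """
    current = {page: dict(dist) for page, dist in initial.items()}
    for _ in range(iterations):
        updated = {}
        for page, own in initial.items():
            nbrs = [n for n in links.get(page, []) if n in current]
            if not nbrs:
                updated[page] = dict(own)  # no neighbors: keep the content estimate
                continue
            mixed = {}
            for label in own:
                nbr_avg = sum(current[n].get(label, 0.0) for n in nbrs) / len(nbrs)
                # Blend the page's own evidence with its neighbors' current labels.
                mixed[label] = alpha * own[label] + (1 - alpha) * nbr_avg
            total = sum(mixed.values())
            updated[page] = {label: value / total for label, value in mixed.items()}
        current = updated
    return current

initial = {
    "x": {"sports": 0.5, "arts": 0.5},   # content alone is undecided
    "a": {"sports": 0.9, "arts": 0.1},
    "b": {"sports": 0.9, "arts": 0.1},
}
links = {"x": ["a", "b"]}
print(relaxation_labeling(initial, links)["x"])
```

In this toy run the undecided page "x" drifts toward "sports" because both of its neighbors are confidently labeled that way.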
Hierarchical Classification • Based on “divide and conquer” • Classification problems are split into a hierarchical set of subproblems. • Error Minimization • When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
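The top-down walk with a fallback to the parent category can be sketched as below. The category tree, the 0.6 confidence threshold, and the `toy_scorer` stand-in for per-category classifiers are all hypothetical.

```python
def classify_hierarchical(page, tree, scorer, root="root", threshold=0.6):
    """Walk down the category tree, stopping when no child is confident enough.

    tree: category -> list of child categories.
    scorer(page, category): confidence in [0, 1] that the page belongs to category.
    """
    node = root
    while tree.get(node):
        scores = {child: scorer(page, child) for child in tree[node]}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break  # uncertain at this level: keep the assignment one level up
        node = best
    return node

TREE = {"root": ["sports", "business"], "sports": ["soccer", "tennis"]}

def toy_scorer(page, category):
    # Stand-in for a trained classifier: the page carries precomputed confidences.
    return page.get(category, 0.0)

print(classify_hierarchical({"sports": 0.9, "soccer": 0.8}, TREE, toy_scorer))
print(classify_hierarchical({"sports": 0.9, "soccer": 0.5, "tennis": 0.4}, TREE, toy_scorer))
```

The second page is assigned to "sports" rather than a leaf because neither "soccer" nor "tennis" clears the threshold.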
Information Combination • Combine several methods into one • Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
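One simple way to let several classifiers reach a collective decision is weighted voting. The slides do not say which combination rule is used, so voting is an assumption here, as are the source names and weights.

```python
from collections import Counter

def combine_by_vote(predictions, weights=None):
    """predictions: source name -> predicted label.
    weights: optional source name -> vote weight (defaults to 1.0 each).
    """
    votes = Counter()
    for source, label in predictions.items():
        votes[label] += (weights or {}).get(source, 1.0)
    # Sorting first makes ties resolve alphabetically, so the result is deterministic.
    return max(sorted(votes), key=votes.get)

# Hypothetical outputs of classifiers trained on different information sources.
predictions = {"content": "sports", "anchor_text": "sports", "url": "news"}
print(combine_by_vote(predictions))
print(combine_by_vote(predictions, weights={"url": 5.0}))
```

Giving one source a larger weight lets it override the others, which is a crude stand-in for trusting more reliable classifiers more.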
Conclusion • Webpage classification is a type of supervised learning problem aiming to categorize a webpage into a predefined set of categories. • In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier
Evolution of Websites • Apple in 1998
Evolution of Websites • Apple in 2008
Evolution of Websites • Nike in 2000
Evolution of Websites • Nike in 2008
Evolution of Websites • Yahoo in 1996
Evolution of Websites • Yahoo in 2008
Evolution of Websites • Microsoft in 1998
Evolution of Websites • Microsoft in 2008
Evolution of Websites • MTV in 1998
Evolution of Websites • MTV in 2008
Sources • Web Page Classification: Features and Algorithms by Xiaoguang Qi & Brian D. Davison • Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic • The Evolution of Websites http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx