Web Page Classification

Features and Algorithms. Web Page Classification. Paper by: XIAOGUANG QI and BRIAN D. DAVISON. Presentation by: Jason Bender. Outline. Introduction to Classification Background Classification Types Classification Methods Applications Features Algorithms Evolution of Websites.

Presentation Transcript


  1. Web Page Classification: Features and Algorithms • Paper by: XIAOGUANG QI and BRIAN D. DAVISON • Presentation by: Jason Bender

  2. Outline • Introduction to Classification • Background • Classification Types • Classification Methods • Applications • Features • Algorithms • Evolution of Websites

  3. What is web page classification? • The process of assigning a web page to one or more predefined category labels (ex: news, sports, business…) • Classification is generally posed as a supervised learning problem • Set of labeled data is used to train a classifier which is applied to label future examples
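
The supervised setup on this slide can be sketched in a few lines. This is a minimal illustration, assuming a toy naive Bayes classifier over word counts; the training pages, labels, and function names are invented for the example and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Count words per category from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_pages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the category with the highest naive Bayes log-score."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out a category.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data: the "set of labeled data" used for training.
training = [
    ("team wins the championship game", "sports"),
    ("player scores in the final match", "sports"),
    ("stocks fall as markets close", "business"),
    ("company reports quarterly earnings", "business"),
]
model = train(training)
print(classify("the team scores in the game", model))
```

The trained model is then applied to pages it has not seen, which is the "label future examples" step.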

  4. Background - Classification Types • Supervised learning problem broken into sub problems: • Subject Classification • Functional Classification • Sentiment Classification • Other types of Classification

  5. Subject Classification • Concerned with the subject or topic of the web page • Judging whether a page is about arts, business, sports, etc. Functional Classification • Concerned with the role that the page plays • Deciding whether a page is a personal homepage, course page, admissions page, etc.

  6. Sentiment Classification • Focuses on the opinion that is presented in a web page Other types of Classification • Such as genre classification and search engine spam classification

  7. Background - Classification Methods • Binary vs. Multiclass • Single Label vs. Multi Label • Soft vs. Hard • Flat vs. Hierarchical

  8. Binary vs. Multiclass Classification

  9. Single-Label vs. Multi-Label Classification

  10. Soft vs. Hard Classification

  11. Flat vs. Hierarchical Classification
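
The distinctions on slides 8 to 10 can be made concrete with a small sketch. It assumes the usual reading of the terms: hard classification commits to one category, soft classification reports a likelihood per category, and multi-label classification may assign several categories at once. The scores and the threshold are made up for illustration.

```python
def hard_label(scores):
    """Hard, single-label classification: commit to the single best category."""
    return max(scores, key=scores.get)

def soft_labels(scores):
    """Soft classification: normalize scores into a distribution over categories."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

def multi_labels(scores, threshold):
    """Multi-label classification: keep every category whose share passes the threshold."""
    return sorted(label for label, p in soft_labels(scores).items() if p >= threshold)

# Hypothetical raw scores for one page over three categories (a multiclass problem).
scores = {"sports": 6.0, "news": 3.0, "business": 1.0}
print(hard_label(scores))
print(soft_labels(scores))
print(multi_labels(scores, threshold=0.25))
```

A binary problem is the special case with two categories, for example "sports" versus "not sports".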

  12. Applications • Why is classification important and how can we use it efficiently?

  13. Constructing, maintaining, or expanding web directories • Web directories provide an efficient way to browse for information within a predefined set of categories • Example: • Open Directory Project • Currently constructed by human effort • 78,940 editors of ODP

  14. Improving the quality of search results • A major problem with search results is query ambiguity

  15. Helping question answering systems • Can use classification systems to help improve the quality of answers • Example: Wolfram Alpha Other applications • Contextual advertising

  16. Features • What features can we extract from a web page to use to help classify it?

  17. Features - Introduction • Because of features such as the hyperlink <a> … </a>, webpage classification is vastly different from other forms of classification such as plaintext classification. • Features organized into two groups: • On-page features – directly located on page • Neighbor features – found on related pages

  18. On Page Features • Textual Contents & Tags • Bag-of-words • N-gram feature • Rather than analyzing individual words, group them into sequences of n consecutive words. • Ex: New York vs. new ….. ….. York • Yahoo! has used a 5-gram feature • HTML tags – title, heading, metadata, main text • URL
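
The "New York" example can be shown with a short sketch. It assumes simple whitespace tokenization, and the two sample strings are invented: both contain the same words, so their bags of words are identical, but only one contains the bigram "new york".

```python
from collections import Counter

def bag_of_words(text):
    """Count individual words, ignoring their order."""
    return Counter(text.lower().split())

def ngrams(text, n):
    """Count runs of n consecutive words."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "flights to New York"
b = "new flights to York"

print(bag_of_words(a) == bag_of_words(b))  # same words, order ignored
print("new york" in ngrams(a, 2))          # adjacent in a
print("new york" in ngrams(b, 2))          # separated in b
```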

  19. On Page Features • Visual Analysis • Each page has two representations • Text via HTML • Visual via the browser • Each page can be represented as a visual adjacency multigraph

  20. Features of Neighbors • What happens when a page’s features are missing or are unrecognizable?

  21. Features of Neighbors • Assumptions • If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page. • Linked pages are more likely to have terms in common

  22. Features of Neighbors • Neighbor Selection • Focus on pages within 2 steps of target • 6 types: parent, child, sibling, spouse, grandparent, and grandchild
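
The six neighbor types can be sketched on a small link graph. This assumes the common definitions: parents link to the target, children are linked from it, siblings share a parent with it, spouses share a child with it, and grandparents and grandchildren are two steps away. The graph and page names are hypothetical.

```python
def neighbors(links, target):
    """Find the six neighbor types; links maps each page to the set of pages it links to."""
    def children(page):
        return set(links.get(page, ()))

    def parents(page):
        return {src for src, outs in links.items() if page in outs}

    def expand(pages, step):
        # Take one more step from each page, leaving out the target itself.
        result = set()
        for p in pages:
            result |= step(p)
        return result - {target}

    parent, child = parents(target), children(target)
    return {
        "parent": parent,
        "child": child,
        "sibling": expand(parent, children),
        "spouse": expand(child, parents),
        "grandparent": expand(parent, parents),
        "grandchild": expand(child, children),
    }

# Hypothetical link graph.
links = {
    "root": {"hub"},
    "hub": {"target", "sib"},
    "target": {"kid"},
    "other": {"kid"},
    "kid": {"leaf"},
}
for kind, pages in neighbors(links, "target").items():
    print(kind, sorted(pages))
```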

  23. Features of Neighbors • Labels • Anchor Text • Surrounding Anchor Text • By using the anchor text, surrounding text, and page title of a parent page in combination with text from target page, classification can be improved.
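
One way to collect anchor text from a parent page is sketched below, using Python's standard `html.parser`. The HTML snippet, the URL, and the helper `anchor_text_for` are assumptions made for illustration; the sketch covers anchor text only, not the surrounding text or the parent's title.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None

def anchor_text_for(parent_html, target_url):
    """Return the anchor texts of all links in parent_html that point at target_url."""
    parser = AnchorTextParser()
    parser.feed(parent_html)
    parser.close()
    return [text for href, text in parser.anchors if href == target_url]

# Hypothetical parent page and target page text.
parent_html = '<p>Read the <a href="http://example.com/scores">latest football scores</a> here.</p>'
target_text = "Results and fixtures"
combined = " ".join([target_text] + anchor_text_for(parent_html, "http://example.com/scores"))
print(combined)
```

The combined string can then be fed to the same text features used for on-page content.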

  24. Features of Neighbors • Implicit Links • Connections between pages that appear in the results of the same query and are both clicked by users
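
Implicit links can be derived from a click log, as in the rough sketch below. It assumes the log is a list of (query, clicked URL) pairs; the log entries are invented, and a real system would likely add click-count thresholds.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Pair up pages that were clicked for the same query."""
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sort so each undirected pair has a single canonical form.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

# Hypothetical click log.
log = [
    ("jaguar speed", "a.com"),
    ("jaguar speed", "b.com"),
    ("jaguar car", "c.com"),
    ("jaguar car", "a.com"),
]
print(sorted(implicit_links(log)))
```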

  25. Algorithms • What are the algorithmic approaches to webpage classification? • Dimension reduction • Relational learning • Hierarchical classification • Information combination

  26. Dimension Reduction • Boost classification by emphasizing certain features that are more useful in classification • Feature Weighting • Reduces the dimensions of feature space • Reduces computational complexity • Classification more accurate as a result of reduced space

  27. Dimension Reduction • Methods • Use first fragment • K-nearest neighbor algorithm • Weighted features • Weighted HTML Tags • Metrics • Expected mutual information • Mutual information
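
The two metrics can be sketched as follows, assuming the standard definitions: mutual information scores one term against one category, and expected mutual information averages over term presence or absence and all categories. The tiny document set is invented; in practice, terms would be ranked by such a score and only the top-scoring ones kept.

```python
import math

def mutual_information(docs, term, category):
    """Pointwise MI: log2 of P(term, category) / (P(term) * P(category))."""
    n = len(docs)
    p_t = sum(term in words for words, _ in docs) / n
    p_c = sum(label == category for _, label in docs) / n
    p_tc = sum(term in words and label == category for words, label in docs) / n
    if p_tc == 0:
        return float("-inf")
    return math.log2(p_tc / (p_t * p_c))

def expected_mutual_information(docs, term):
    """Sum p(x, c) * log2(p(x, c) / (p(x) * p(c))) over term presence x and categories c."""
    n = len(docs)
    labels = {label for _, label in docs}
    total = 0.0
    for present in (True, False):
        p_x = sum((term in words) == present for words, _ in docs) / n
        for c in labels:
            p_c = sum(label == c for _, label in docs) / n
            p_xc = sum((term in words) == present and label == c for words, label in docs) / n
            if p_xc > 0:
                total += p_xc * math.log2(p_xc / (p_x * p_c))
    return total

# Hypothetical pages as (set of words, category) pairs.
docs = [
    ({"goal", "team"}, "sports"),
    ({"goal", "match"}, "sports"),
    ({"stock", "team"}, "business"),
    ({"stock", "market"}, "business"),
]
print(mutual_information(docs, "goal", "sports"))
print(expected_mutual_information(docs, "goal"))
print(expected_mutual_information(docs, "team"))
```

Here "goal" separates the two categories perfectly while "team" appears equally in both, so "team" would be the first candidate to drop.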

  28. Relational Learning • Relaxation Labeling

  29. Hierarchical Classification • Based on “divide and conquer” • Classification problems split into hierarchical set of sub problems. • Error Minimization • When a lower level category is uncertain of whether page belongs or not, shift assignment one level up.
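
The "shift one level up" rule can be sketched as a walk down a category tree. The tree, the confidence scores, and the threshold below are hypothetical; the scores stand in for the output of per-category classifiers that are not shown.

```python
def classify_hierarchically(page_scores, tree, root, min_confidence):
    """Walk down the category tree and stop at the parent when no child is confident enough.

    page_scores maps a category to a confidence in [0, 1];
    tree maps a category to its list of subcategories.
    """
    current = root
    while tree.get(current):
        best = max(tree[current], key=lambda c: page_scores.get(c, 0.0))
        if page_scores.get(best, 0.0) < min_confidence:
            break  # uncertain at the lower level: keep the assignment one level up
        current = best
    return current

# Hypothetical category tree and confidence scores.
tree = {"root": ["sports", "business"], "sports": ["football", "tennis"]}
uncertain = {"sports": 0.9, "business": 0.1, "football": 0.45, "tennis": 0.4}
confident = {"sports": 0.9, "business": 0.1, "football": 0.8, "tennis": 0.1}

print(classify_hierarchically(uncertain, tree, "root", min_confidence=0.6))
print(classify_hierarchically(confident, tree, "root", min_confidence=0.6))
```

Each step only has to choose among a node's children, which is the divide-and-conquer part.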

  30. Information Combination • Combine several methods into one • Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
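
A simple form of combination is majority voting, sketched below. The three keyword-based classifiers are stand-ins for classifiers trained on different sources (URL, title, body text); their rules and the sample page are invented, and voting is only one of several possible combination schemes.

```python
from collections import Counter

# Hypothetical stand-ins for classifiers trained on different information sources.
def url_classifier(page):
    return "sports" if "sports" in page["url"] else "business"

def title_classifier(page):
    return "business" if "market" in page["title"].lower() else "sports"

def body_classifier(page):
    return "sports" if "team" in page["body"].lower() else "business"

def combine(page, classifiers):
    """Let each classifier vote and return the most common label."""
    votes = Counter(clf(page) for clf in classifiers)
    return votes.most_common(1)[0][0]

page = {
    "url": "http://example.com/sports/article",
    "title": "Market update",
    "body": "The team won the match",
}
print(combine(page, [url_classifier, title_classifier, body_classifier]))
```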

  31. Conclusion • Webpage classification is a type of supervised learning problem aiming to categorize a webpage into a predefined set of categories. • In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier

  32. Evolution of Websites • Apple in 1998

  33. Evolution of Websites • Apple in 2008

  34. Evolution of Websites • Nike in 2000

  35. Evolution of Websites • Nike in 2008

  36. Evolution of Websites • Yahoo in 1996

  37. Evolution of Websites • Yahoo in 2008

  38. Evolution of Websites • Microsoft in 1998

  39. Evolution of Websites • Microsoft in 2008

  40. Evolution of Websites • MTV in 1998

  41. Evolution of Websites • MTV in 2008

  42. Sources • Web Page Classification: Features and Algorithms by Xiaoguang Qi & Brian D. Davison • Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic • The Evolution of Websites http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx
