Artface: Automated Web Client Expectation Classifier

Artface uses Open Directory (DMOZ) categories to estimate what web clients expect when they arrive at a site. It fetches, parses, and classifies the referring page of each local hit with a Naïve Bayesian classifier, giving a fairly accurate approximation of visitor expectations that could be used in an automated manner.

ceverett

Presentation Transcript


  1. Artface (Automated reorganization to fit approximate client expectations) • Mike Venzke

  2. Artface Goals • Provide a method for determining the approximate expectation of a web client • Examine feasibility of using this information in an automated manner

  3. Description • Using Open Directory categories, create a model for classifying web pages. • Fetch, parse, and classify the referring page of local web hits. • The result is an approximation of the expectations people have when they go to different parts of your website.
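The "referring page of local web hits" has to come from somewhere, and web server logs are the usual source. The slides do not say which log format was used, so the following is only a minimal sketch that assumes Apache-style combined log lines; `LOG_PATTERN` and `parse_hit` are hypothetical names introduced for illustration.

```python
import re

# Assumption: combined log format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$'
)


def parse_hit(line):
    """Return (local_path, referrer_url), or None if there is no usable referrer."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # line is not in the assumed format
    referer = match.group("referer")
    if referer in ("", "-"):
        return None  # direct hit: no referring page to classify
    return match.group("path"), referer
```

Each returned pair links a local page to the external page that sent the visitor there; the referrer URL is what would then be fetched and classified.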

  4. Classification Categories • Used DMOZ categories • Already classified web pages; provides good training data. • Went 3 levels deep in directory • Wanted to get approximate expectation, not so specific that very similar items are considered different. • Time and constraints
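Going "3 levels deep" amounts to cutting each directory path down to its first three components, so that very similar items share one label. A minimal sketch follows, assuming slash-separated category paths and assuming the `Top` root does not count as a level (the slides do not specify either); `truncate_category` is a hypothetical helper.

```python
def truncate_category(path, depth=3):
    """Keep only the first `depth` levels of a slash-separated category path."""
    parts = [part for part in path.split("/") if part]
    # Assumption: a leading "Top" root is not counted as one of the levels.
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return "/".join(parts[:depth])
```

With this, a page filed under `Top/Computers/Programming/Languages/Python` and one filed under `Top/Computers/Programming/Languages/Perl` both train the same class.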

  5. Page Fetching • Used Python's SGMLParser (from the sgmllib module) • Good at parsing out irrelevant data • Fast enough • Easy to use
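The parsing step can be illustrated in current Python, where `sgmllib` no longer exists. The sketch below swaps in the standard library's `html.parser.HTMLParser` as a stand-in; it is not the original Artface code, and `TextExtractor` and `extract_text` are hypothetical names. It keeps visible text and drops script and style content, which is one reasonable reading of "parsing out irrelevant data".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The HTML itself could be obtained with something like `urllib.request.urlopen(url).read()`, with error handling for pages that fail to load.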

  6. Classification • Rainbow – LGPL’d Naïve Bayesian text classifier • Used ~ 9000 documents as training data, with expanded category as classification. • ~7000 test pages taken from web logs of www.cs.rpi.edu and www.linenplace.com
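Rainbow's internals are not described in the slides, so the following is not its implementation. It is a minimal sketch of the general technique, a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, with the category path used as the class label. The tokenizer and the `NaiveBayes` class are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training documents
        self.vocabulary = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log P(category) + sum of log P(word | category), add-one smoothed
            score = math.log(docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category
```

In the setup the slide describes, `train` would be called once per DMOZ-listed training document and `classify` once per fetched referring page.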

  7. Data Results • Fairly accurate results • http://webgraph.canbelearned.com

  8. Automation Possibilities • Determine ‘good’ categories by self-site classification or user input • Track traffic from ‘good’ categories and provide higher-level links to local pages. • Set of bad categories is small and generally universal. • Take action against local sites based on how they’re being used, not what they have.

  9. Automation Possibilities (cont'd) • Provide custom pages based on what the user expected, rather than what the page contains. • May not have found what they wanted. • May be interested in a broader topic.

  10. Process Enhancement Ideas • More training data • Use all levels of DMOZ data, but push classification up to threshold level. • Handle more page errors • Scripting, authentication errors provide false data. • Remove or special-parse ‘classless’ information pages • Search engines
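One possible reading of "push classification up to threshold level" is to classify against full-depth categories and then fold each result into its ancestor at a fixed depth. The sketch below assumes the classifier yields a score per full category path and that scores can simply be summed; both are assumptions, and `roll_up` and `best_category` are hypothetical helpers.

```python
from collections import defaultdict


def roll_up(scores, depth=3):
    """Sum the scores of deep categories into their ancestor at `depth` levels."""
    rolled = defaultdict(float)
    for path, score in scores.items():
        ancestor = "/".join(path.split("/")[:depth])
        rolled[ancestor] += score
    return dict(rolled)


def best_category(scores, depth=3):
    """Return the threshold-level category with the highest combined score."""
    rolled = roll_up(scores, depth)
    return max(rolled, key=rolled.get) if rolled else None
```

Under this reading, two closely related deep categories that each score modestly can together outweigh a single unrelated category, which matches the stated goal of not treating very similar items as different.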