Artface aims to approximate what web clients expect by classifying the referring pages of website hits against Open Directory (DMOZ) categories. It fetches, parses, and categorizes each referring page with a Naïve Bayesian classifier, and examines whether the results can be used in an automated manner.
Artface (Automated reorganization to fit approximate client expectations) Mike Venzke
Artface Goals • Provide a method for determining the approximate expectation of a web client • Examine feasibility of using this information in an automated manner
Description • Using Open Directory categories, create a model for classifying web pages. • Fetch, parse, and classify the referring page of local web hits. • The result is an approximation of what people expect when they arrive at different parts of your website.
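The first step of the description, collecting the referring page for each local hit, can be sketched as follows. This is a minimal sketch that assumes the web logs are in Apache "combined" format; the slides do not say which log format was used, and the names `LOG_RE` and `referrers_by_path` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: Apache "combined" log format (the slides do not name a format).
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '            # host, identity, user, timestamp
    r'"(?:\S+) (?P<path>\S+)[^"]*" '       # request line: method, path, protocol
    r'\d{3} \S+ '                          # status code, response size
    r'"(?P<referrer>[^"]*)"'               # referring page
)

def referrers_by_path(log_lines):
    """Group referring URLs by the local path that was requested."""
    hits = defaultdict(list)
    for line in log_lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        referrer = match.group("referrer")
        if referrer and referrer != "-":   # "-" means no referrer was sent
            hits[match.group("path")].append(referrer)
    return dict(hits)
```

Each referrer URL collected this way would then be fetched, parsed, and classified in the later steps.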
Classification Categories • Used DMOZ categories • Already classified web pages; provides good training data. • Went 3 levels deep in directory • Wanted to get approximate expectation, not so specific that very similar items are considered different. • Time and constraints
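Cutting category paths to three levels could look like the sketch below. It assumes DMOZ-style paths separated by "/" and assumes that an optional leading "Top" component does not count as a level; the slides do not specify either detail, and `truncate_category` is a hypothetical helper.

```python
def truncate_category(path, depth=3, root="Top"):
    """Keep only the first `depth` levels of a slash-separated category path."""
    parts = [p for p in path.split("/") if p]
    # Assumption: a leading "Top" is the directory root, not a level of its own.
    if parts and parts[0] == root:
        parts = parts[1:]
    return "/".join(parts[:depth])
```

With this, two pages filed under different fourth-level subcategories of the same third-level category receive the same label, which matches the stated goal of not treating very similar items as different.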
Page Fetching • Used the SGMLParser class from Python's sgmllib module • Good at filtering out irrelevant data • Fast enough • Easy to use
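The parsing step can be sketched roughly as below. The `sgmllib` module that provides SGMLParser was removed in Python 3, so this sketch swaps in the standard library's `html.parser.HTMLParser`, which has a similar callback interface. Treating script and style content as the "irrelevant data" is an assumption, and `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}  # assumption: what counts as irrelevant data

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)

def extract_text(html):
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The HTML itself could be fetched with `urllib.request.urlopen`; that part is left out here so the sketch runs without network access.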
Classification • Rainbow – LGPL’d Naïve Bayesian text classifier • Used ~ 9000 documents as training data, with expanded category as classification. • ~7000 test pages taken from web logs of www.cs.rpi.edu and www.linenplace.com
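To illustrate the kind of model involved, here is a small multinomial Naïve Bayes classifier in plain Python. It is not Rainbow and does not reproduce Rainbow's tokenization or smoothing; whitespace tokenization and add-one (Laplace) smoothing are assumptions made for the sketch, and the class name `NaiveBayes` is hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = category, score
        return best
```

In the setup described on the slide, each training document would be labeled with its expanded category, and the text of each fetched referring page would be passed to `classify`.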
Data Results • Fairly accurate results • http://webgraph.canbelearned.com
Automation Possibilities • Determine ‘good’ categories by self-site classification or user input • Track traffic from ‘good’ categories and provide higher-level links to local pages. • Set of bad categories is small and generally universal. • Take action against local sites based on how they’re being used, not what they have.
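One way the traffic-tracking idea might be sketched: given hits already labeled with the category of their referring page, count which local pages each 'good' category sends visitors to, so the most visited pages can be offered as higher-level links. The input shape (pairs of local path and referrer category) and the function name `top_pages_for_category` are assumptions.

```python
from collections import Counter

def top_pages_for_category(hits, good_categories, limit=3):
    """Map each good referrer category to its most visited local pages.

    `hits` is an iterable of (local_path, referrer_category) pairs.
    """
    per_category = {}
    for path, category in hits:
        if category in good_categories:
            per_category.setdefault(category, Counter())[path] += 1
    return {
        category: [path for path, _ in counts.most_common(limit)]
        for category, counts in per_category.items()
    }
```

Hits from categories outside the good set are simply ignored here; handling the small set of 'bad' categories would be a separate policy decision.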
Automation Possibilities (cont'd) • Provide custom pages based on what the user expected, rather than what the page contains. • May not have found what they wanted. • May be interested in a broader topic.
Process Enhancement Ideas • More training data • Use all levels of DMOZ data, but push classification up to threshold level. • Handle more page errors • Scripting and authentication errors produce false data. • Remove or special-parse 'classless' information pages • e.g. search engines
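The idea of using all DMOZ levels but pushing the classification up to a threshold level might be sketched as below. The slides do not say how the push-up would work; this sketch assumes the classifier yields a probability-like score per full-depth category and sums the scores of all categories that share an ancestor at the threshold depth. The name `roll_up` is hypothetical.

```python
from collections import defaultdict

def roll_up(scores, depth=3):
    """Sum per-category scores into their ancestor at `depth` levels.

    Assumption: scores are probability-like, so adding them is meaningful.
    """
    totals = defaultdict(float)
    for category, score in scores.items():
        ancestor = "/".join(category.split("/")[:depth])
        totals[ancestor] += score
    return dict(totals)
```

The predicted category would then be the ancestor with the highest total, for example `max(totals, key=totals.get)`.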