Automatic Classification of Bookmarked Web Pages
Chris Staff
First Talk, February 2007
Overview
• General Principles
• Reading List
• Tasks involved
• Schedule
General Principles
• Email: cstaff@cs.um.edu.mt
• Web site: http://www.cs.um.edu.mt/~cstaff
• Plagiarism
• Referencing
• ACM Digital Library: membership for students from Malta
Reading List
• Abrams, D., Baecker, R.: How people use WWW bookmarks. In: CHI '97: CHI '97 Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, ACM Press (1997) 341-342
• Bugeja, I.: Managing WWW browser's bookmarks and history (a Firefox extension). Final year project report, Department of Computer Science & AI, University of Malta (2006). http://hyper.iannet.org/hyperBkreport.pdf
• Cockburn, A., McKenzie, B.: What do web users do? An empirical analysis of web use. Int. J. Hum.-Comput. Stud. 54(6) (2001) 903-922
• Staff, C.: Automatic Classification of Web Pages into Bookmark Categories. Submitted to UM'07 (2007)
• Staff, C.: CSA3200 User Adaptive Systems Lecture Notes (2006). Follow the link from http://www.cs.um.edu.mt/~cstaff/
• Mozilla Developer Center: Building an Extension (2006). http://developer.mozilla.org/en/docs/Building_an_Extension
Classifying Bookmarks
• When a user bookmarks a page (or adds a page to Favorites), we want to recommend the best existing category
• Improvement over simply recommending the last category saved to
• Improvement over simply offering the 'category root'
Tasks
• Representation of bookmark categories
• Two clustering/similarity algorithms
• Extra utility
• User interface
• Evaluation
• Write up report
Tasks Overview
• We are going to implement a number of algorithms to help with the overall task
• Some of these will be used while the user is browsing
• Others will be used to classify pages 'off-line' (especially for the existing bookmark files)
• We're going to have a 'standard test bed' for conducting the evaluation
Tasks Overview
• Represent bookmark categories
  • We're starting with populated bookmark files, so use the 'How Did I Find That?' approach
  • Plus another, individual approach
• When a page is to be bookmarked
  • If the referrer page is available, identify the topic of the page
  • Otherwise, identify the page topic using the 'How Did I Find That?' approach
  • Compare the current topic to the bookmark category representations (see the sketch below)
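A minimal sketch of this decision flow (in JavaScript, the extension's likely language), assuming hypothetical helpers topicFromReferrer, topicHowDidIFindThat and similarity; the actual 'How Did I Find That?' algorithm is described in the reading list and is not reproduced here:

// Hypothetical sketch of the classification decision flow.
// topicFromReferrer, topicHowDidIFindThat and similarity are placeholders for
// the two topic-identification approaches and the chosen similarity measure.
function recommendCategory(page, referrer, categories) {
  // Identify the topic of the page to be bookmarked
  var topic = referrer ? topicFromReferrer(page, referrer)
                       : topicHowDidIFindThat(page);

  // Compare the topic to every bookmark category representation
  var best = null, bestScore = -Infinity;
  for (var name in categories) {
    var score = similarity(topic, categories[name]);
    if (score > bestScore) {
      bestScore = score;
      best = name;
    }
  }
  return { category: best, score: bestScore };
}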
Tasks Overview
• User Interface
  • To replace the built-in 'Bookmark This Page' menu item and keyboard command
  • To display a new dialog box offering users the recommended category and the last category used, and allowing them to select some other category or create a new category (see the sketch below)
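A hypothetical sketch of opening the replacement dialog from chrome code; the chrome:// URL and the params object are assumptions about this extension's layout, and only window.openDialog itself is a standard Firefox API:

// Hypothetical sketch: open a replacement bookmarking dialog from chrome code.
// The dialog URL and the params object are assumptions about this project's
// extension layout, not part of any existing API beyond window.openDialog.
function showBookmarkDialog(recommended, lastUsed, allCategories) {
  var params = {
    recommended: recommended,     // category suggested by the classifier
    lastUsed: lastUsed,           // category the user saved to last time
    categories: allCategories,    // full list, so the user can pick any other
    result: null                  // filled in by the dialog on OK
  };
  window.openDialog("chrome://acofbwp/content/classify.xul",
                    "acofbwp-classify", "modal,centerscreen", params);
  return params.result;           // chosen category name, or null if cancelled
}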
Tasks Overview
• Evaluation
  • Will be standard and automated
  • For testing purposes, download test_eval.zip from the home page
  • Contains 2x8 bookmark files (.html) and one URL file (.txt)
  • The bookmark files are 'real' files collected one year ago
  • The URL file contains a number of lines with the following format (see the parsing sketch below):
    • Bookmark file ID, URL of bookmarked page, home category, exact entry from the bookmark file (with date created, etc.)
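A minimal parsing sketch, assuming the four fields are comma-separated and that the exact bookmark entry is the last field (so any commas inside it must be re-joined):

// Hypothetical sketch: parse one line of the evaluation URL file.
// Assumes the format "bookmark-file ID, URL, home category, exact bookmark entry"
// with the exact entry (which may itself contain commas) as the last field.
function parseUrlLine(line) {
  var parts = line.split(",");
  return {
    fileId: parts[0].trim(),
    url: parts[1].trim(),
    homeCategory: parts[2].trim(),
    // Re-join everything after the third comma: the raw bookmark entry
    entry: parts.slice(3).join(",").trim()
  };
}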
Tasks Overview
• Evaluation (continued)
  • A further challenge is to 're-create' each bookmark file in the order that it was created by its user (see the sketch below)
  • Eventually, close to the end of the APT, the evaluation test data sets will be made available
    • About 20 unseen bookmark files and one URL file
    • Same format as before
  • You'll get the bookmark files early to prepare representations, but the classification run will be part of a demo session
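One plausible way to recover creation order (an assumption, not a prescribed method) is to sort entries on the ADD_DATE attribute that Netscape/Mozilla bookmark .html files record for each anchor element; the regular expression below is a simplification that assumes HREF appears before ADD_DATE:

// Rough sketch: recover the creation order of bookmarks from a Netscape/Mozilla
// bookmark .html file by sorting on each entry's ADD_DATE attribute (a Unix
// timestamp). Entries without a matching ADD_DATE are simply skipped here.
function bookmarksInCreationOrder(html) {
  var entries = [];
  var re = /<A\s+[^>]*HREF="([^"]*)"[^>]*ADD_DATE="(\d+)"[^>]*>([^<]*)<\/A>/gi;
  var m;
  while ((m = re.exec(html)) !== null) {
    entries.push({ url: m[1], added: parseInt(m[2], 10), title: m[3] });
  }
  entries.sort(function (a, b) { return a.added - b.added; });
  return entries;
}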
Tasks Overview
• Write up report
  • We'll spend some time looking at the structure of a scientific report, how to write a literature review, how to present evaluation results, etc.
Task: Representing Bookmark Categories
• We need to identify what a category or collection of bookmarks is about, so that we can check whether a new page could belong to that category
• Ideally, we find out what is similar between the different documents in the category (especially if we know which link a user followed to reach a child page!)
• In the absence of this information:
  • One algorithm will be based on 'How Did I Find That?'
  • A second algorithm is up to you (one common starting point is sketched below)
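As one common starting point for the second, individual approach (an assumption, not the prescribed method), a category can be represented as a term-frequency vector aggregated over the text of its member pages:

// Hypothetical sketch: represent a bookmark category as a term-frequency
// vector aggregated over the (plain-text) content of its member pages.
// tokenize() is a naive placeholder; a real version would strip HTML,
// remove stop words and possibly stem.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
    return t.length > 2;
  });
}

function categoryVector(pageTexts) {
  var vector = {};
  pageTexts.forEach(function (text) {
    tokenize(text).forEach(function (term) {
      vector[term] = (vector[term] || 0) + 1;
    });
  });
  return vector;   // term -> frequency across all pages in the category
}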
Task: Two Clustering/Similarity Algorithms
• Once we have represented the categories, we can 'send' the page to be bookmarked to the best category
• Similar to 'information filtering' or 'clustering'
• Which similarity measure or clustering algorithm should we use? (one candidate is sketched below)
• One way of representing the page to be classified will be based on 'How Did I Find That?'
• The other way is researched/developed by you
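One candidate measure (an assumption; choosing the measure is part of the task) is cosine similarity between the page's term vector and each category vector:

// Hypothetical sketch: cosine similarity between two term-frequency vectors
// (e.g. the page to be bookmarked and a category representation).
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0, term;
  for (term in a) {
    normA += a[term] * a[term];
    if (term in b) dot += a[term] * b[term];
  }
  for (term in b) normB += b[term] * b[term];
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}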
Task: Extra Utility
• How can the classification of web pages to be bookmarked be improved?
• What particular interests do you have, and how can they be used to improve classification?
• E.g., synonym detection, automatic reorganisation of bookmarks, ...
Task: User Interface
• You can use XUL to 'extend' Mozilla Firefox
  • http://www.xulplanet.com/tutorials/xultu/
• Use Ian Bugeja's HyperBK as a framework (with due referencing and acknowledgement, of course): https://addons.mozilla.org/firefox/2539/
• Programs are likely to be JavaScript (see the overlay sketch below)
• Your extension will then be portable
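A rough sketch of an overlay script that re-routes the built-in bookmark command to the extension's own handler; the command id 'Browser:AddBookmarkAs' and the function acofbwpClassifyAndBookmark are assumptions, so check the actual browser.xul of the targeted Firefox version (HyperBK shows a working example):

// Rough sketch of an overlay script that re-routes the built-in
// "Bookmark This Page" command to the extension's own handler.
// "Browser:AddBookmarkAs" is an assumption about the command id in the
// targeted Firefox version; acofbwpClassifyAndBookmark is defined elsewhere.
window.addEventListener("load", function () {
  var cmd = document.getElementById("Browser:AddBookmarkAs");
  if (cmd) {
    // Replace the default handler with the classify-then-bookmark entry point
    cmd.setAttribute("oncommand", "acofbwpClassifyAndBookmark();");
  }
}, false);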
Task: User Interface
• You can use Ian's interface, but it may need some work to tweak it:
  • To support some of the new functionality that you're adding (e.g. choice of algorithms)
  • And to fix some of the usability problems with the dialog box
Task: Evaluation
• ACofBWP will be evaluated!
• But you must build a version of the program that can be called in batch mode: it must accept a directory containing bookmark files and a URL file, run in two modes (classify and reconstruct), and report faithfully on its performance (a possible shape is sketched below)
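A hypothetical shape for the batch driver in classify mode, reusing the parseUrlLine and recommendCategory sketches above; readUrlFileLines, loadCategoryRepresentations and fetchPage are placeholders for whatever I/O layer the batch version uses:

// Hypothetical sketch of the batch "classify" mode: for every line of the URL
// file, recommend a category and compare it to the recorded home category.
// Precision here is the fraction of bookmarks whose recommended category
// matches the home category in the URL file.
function runClassifyMode(bookmarkDir, urlFile) {
  var lines = readUrlFileLines(urlFile);
  var correct = 0, totalTime = 0;

  lines.forEach(function (line) {
    var item = parseUrlLine(line);                         // sketch above
    var categories = loadCategoryRepresentations(bookmarkDir, item.fileId);
    var start = new Date().getTime();
    var result = recommendCategory(fetchPage(item.url), null, categories);
    totalTime += new Date().getTime() - start;
    if (result.category === item.homeCategory) correct++;
  });

  // Report precision and average per-page overhead, as required for grading
  return {
    precision: correct / lines.length,
    avgMillisPerPage: totalTime / lines.length
  };
}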
Task: Write Up Report
• At least one tutorial will be dedicated to good report-writing practice: how to write a literature review, how to build and write references, and how to present evaluation results
Grading Structure
• 10% for obtaining an average of at least 0.8 precision in the evaluation (for random bookmark classification, using either implemented approach)
• 10% for incurring a maximum 2-second overhead on average to classify a page (must faithfully report the time overhead)
• Max. 10% for the extra utility
• 40% Report
• 15% Presentation
• 15% Artifact Design/Implementation
Future Opportunities
• FYP supervision
• Opportunity to co-author a research paper that will be submitted to a leading IR/AH/UM conference (irrespective of FYP)
Pitfalls
• Utilities must be lightweight
  • Mostly those that are interactive, or that are invoked while the user is browsing
• Should all of a document be used to contribute to a category representation / be used in a similarity measure?
Schedule
• Until w.c. 6th March inclusive: Discussion, talks once a week
• w.c. 19th March: Submit TOC/chapter overview for feedback (optional)
• w.c. 23rd Apr: Demo 1 (optional)
• 23rd Apr-7th May: Submit one chapter of your choice for feedback (optional)
• w.c. 7th May: Demo 2 (optional)
• 14th May: Evaluation collection will be made available
• 25th May: Submit APT report
• June: Demo and evaluation under exam conditions