
Automatic Classification of Bookmarked Web Pages



Presentation Transcript


  1. Automatic Classification of Bookmarked Web Pages
     Chris Staff
     First Talk, February 2007

  2. Overview
     • General Principles
     • Reading List
     • Tasks involved
     • Schedule

  3. General Principles
     • Email: cstaff@cs.um.edu.mt
     • Web site: http://www.cs.um.edu.mt/~cstaff
     • Plagiarism
     • Referencing
     • ACM Digital Library: membership for students from Malta

  4. Reading List
     • Abrams, D., Baecker, R.: How people use WWW bookmarks. In: CHI '97: CHI '97 Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, ACM Press (1997) 341-342
     • Bugeja, I.: Managing WWW browser's bookmarks and history (a Firefox extension). Final year project report, Department of Computer Science & AI, University of Malta, 2006. http://hyper.iannet.org/hyperBkreport.pdf
     • Cockburn, A., McKenzie, B.: What do web users do? An empirical analysis of web use. Int. J. Hum.-Comput. Stud. 54(6) (2001) 903-922
     • Staff, C.: Automatic Classification of Web Pages into Bookmark Categories. Submitted to UM'07, 2007
     • Staff, C.: CSA3200 User Adaptive Systems Lecture Notes, 2006. Follow the link from http://www.cs.um.edu.mt/~cstaff/
     • Mozilla Development Center: Building an Extension, 2006. http://developer.mozilla.org/en/docs/Building_an_Extension

  5. Classifying Bookmarks
     • When a user bookmarks a page (or adds a page to Favorites), we want to recommend the best existing category
     • An improvement over simply recommending the last category saved to
     • An improvement over simply offering the 'category root'

  6. Tasks
     • Representation of bookmark categories
     • Two clustering/similarity algorithms
     • Extra utility
     • User interface
     • Evaluation
     • Write up report

  7. Tasks Overview
     • We are going to implement a number of algorithms to help with the overall task
     • Some of these will be used while the user is browsing
     • Others will be used to classify pages 'off-line' (especially for the existing bookmark files)
     • We're going to have a 'standard test bed' for conducting the evaluation

  8. Tasks Overview
     • Represent bookmark categories
       • We're starting with populated bookmark files, so use the 'How Did I Find That?' approach
       • Plus another, individual approach
     • When a page is to be bookmarked
       • If the referrer page is available, identify the topic of the page (a minimal sketch of one topic representation follows)
       • Otherwise, identify the page topic using the 'How Did I Find That?' approach
     • Compare the current topic to the bookmark category representations
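
As a concrete starting point, a page's 'topic' can be approximated by a bag-of-words term-frequency vector. The JavaScript sketch below is a generic stand-in, not the 'How Did I Find That?' representation described in the paper; the tokenisation and the (abbreviated) stop-word list are assumptions.

    // Reduce a page's visible text to a term-frequency vector that stands in
    // for its "topic". Tokenisation and stop-word list are simplifications.
    var STOP_WORDS = { "the": 1, "a": 1, "an": 1, "and": 1, "of": 1, "to": 1, "in": 1 };

    function topicVector(text) {
      var vector = {};
      var tokens = text.toLowerCase().match(/[a-z]+/g) || [];
      for (var i = 0; i < tokens.length; i++) {
        var term = tokens[i];
        if (!STOP_WORDS[term]) {
          vector[term] = (vector[term] || 0) + 1;
        }
      }
      return vector;
    }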

  9. Tasks Overview
     • User Interface
       • To replace the built-in 'Bookmark this Page' menu item and keyboard command
       • To display a new dialog box that offers the recommended category and the last category used, and allows the user to select some other category or create a new one

  10. Tasks Overview
     • Evaluation
       • Will be standard and automated
       • For testing purposes, download test_eval.zip from the home page
       • Contains 2x8 bookmark files (.html) and one URL file (.txt)
       • Bookmark files are 'real' files collected one year ago
       • The URL file contains a number of lines with the following format (a parsing sketch follows):
         • bookmark file ID, URL of bookmarked page, home category, exact entry from bookmark file (with date created, etc.)
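
Assuming the fields are comma-separated (the delimiter is an assumption; check the actual file), a line can be split like this:

    // Parse one line of the URL file. Only the first three fields are split
    // off, because the final field (the verbatim bookmark entry) may itself
    // contain commas.
    function parseUrlLine(line) {
      var parts = line.split(", ");
      return {
        fileId: parts[0],                 // bookmark file ID
        url: parts[1],                    // URL of the bookmarked page
        homeCategory: parts[2],           // category the page was saved in
        entry: parts.slice(3).join(", ")  // exact entry from the bookmark file
      };
    }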

  11. Tasks Overview
     • Evaluation (continued)
       • A further challenge is to 're-create' each bookmark file in the order in which it was created by its user (a sketch follows)
       • Eventually, close to the end of the APT, the evaluation test data sets will be made available
         • About 20 unseen bookmark files and one URL file
         • Same format as before
       • You'll get the bookmark files early to prepare representations, but the classification run will be part of a demo session
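
Netscape-format bookmark files stamp each entry with an ADD_DATE attribute (seconds since the epoch), so creation order can be recovered by sorting on it. A minimal sketch; the regular expression is a simplification that ignores folder structure:

    // Recover creation order from a Netscape-format bookmark file: extract
    // each <A> entry's HREF, ADD_DATE and title, then sort by ADD_DATE.
    function entriesInCreationOrder(html) {
      var entries = [];
      var re = /<A\s+HREF="([^"]*)"[^>]*ADD_DATE="(\d+)"[^>]*>([^<]*)<\/A>/gi;
      var m;
      while ((m = re.exec(html)) !== null) {
        entries.push({ url: m[1], added: parseInt(m[2], 10), title: m[3] });
      }
      entries.sort(function (a, b) { return a.added - b.added; });
      return entries;
    }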

  12. Tasks Overview
     • Write up report
       • We'll spend some time looking at the structure of a scientific report, how to write a literature review, how to present evaluation results, etc.

  13. Task: Representing Bookmark Categories
     • We need to identify what a category or collection of bookmarks is about, so that we can check whether a new page could belong to that category
     • Ideally, we find out what is similar between the different documents in the category (especially if we know which link a user followed to reach a child page!)
     • In the absence of this information, use:
       • One algorithm based on 'How Did I Find That?'
       • A second algorithm that is up to you (a simple baseline is sketched below)
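
One simple baseline, reusing topicVector() from the earlier sketch, represents a category as the summed term frequencies of its member pages (a centroid). Again, this is a generic stand-in, not the 'How Did I Find That?' representation:

    // Represent a bookmark category by the summed term-frequency vectors of
    // its member pages (a simple centroid); topicVector() is defined above.
    function categoryRepresentation(memberTexts) {
      var centroid = {};
      for (var i = 0; i < memberTexts.length; i++) {
        var v = topicVector(memberTexts[i]);
        for (var term in v) {
          centroid[term] = (centroid[term] || 0) + v[term];
        }
      }
      return centroid;
    }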

  14. Task: Two Clustering/Similarity Algorithms
     • Once we have represented the categories, we can 'send' the page to be bookmarked to the best category
     • Similar to 'information filtering' or 'clustering'
     • Which similarity measure or clustering algorithm should be used? (one common choice is sketched below)
     • One way of representing the page to be classified will be based on 'How Did I Find That?'
     • The other way is researched/developed by you
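
Cosine similarity is one plausible measure; the slide deliberately leaves the actual choice open. A sketch matching a page vector against the category centroids built above:

    // Cosine similarity between two sparse term vectors.
    function cosine(a, b) {
      var dot = 0, na = 0, nb = 0, term;
      for (term in a) {
        na += a[term] * a[term];
        if (b[term]) { dot += a[term] * b[term]; }
      }
      for (term in b) { nb += b[term] * b[term]; }
      return (na && nb) ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
    }

    // Pick the best-matching category for a page; categories maps
    // category names to their centroid vectors.
    function bestCategory(pageVector, categories) {
      var best = null, bestScore = -1;
      for (var name in categories) {
        var score = cosine(pageVector, categories[name]);
        if (score > bestScore) { bestScore = score; best = name; }
      }
      return { name: best, score: bestScore };
    }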

  15. Task: Extra Utility
     • How can the classification of web pages to be bookmarked be improved?
     • What particular interests do you have, and how can they be used to improve classification?
     • E.g., synonym detection, automatic reorganisation of bookmarks, …

  16. Task: User Interface
     • You can use XUL to 'extend' Mozilla Firefox
       • http://www.xulplanet.com/tutorials/xultu/
     • Use Ian Bugeja's HyperBK as a framework (with due referencing and acknowledgement, of course): https://addons.mozilla.org/firefox/2539/
     • Programs are likely to be JavaScript
     • Your extension will then be portable

  17. Task: User Interface
     • You can use Ian's interface, but it may need some tweaking:
       • To support some of the new functionality that you're adding (e.g., the choice of algorithms)
       • To fix some of the usability problems with the dialog box

  18. Task: Evaluation
     • ACofBWP will be evaluated!
     • You must build a version of the program that can be called in batch mode (a skeleton is sketched below). It must:
       • accept a directory containing bookmark files and a URL file
       • run in two modes (classify and reconstruct)
       • report faithfully on its performance
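
A minimal command-line skeleton (runnable with Node.js; the argument layout and report format are assumptions, not a prescribed interface):

    // Batch-mode entry point: node acofbwp.js <classify|reconstruct> <dir>
    var fs = require("fs");

    function main(argv) {
      var mode = argv[0];   // "classify" or "reconstruct"
      var dir = argv[1];    // directory of bookmark files plus the URL file
      var files = fs.readdirSync(dir);
      var bookmarkFiles = files.filter(function (f) { return /\.html$/i.test(f); });
      var urlFile = files.filter(function (f) { return /\.txt$/i.test(f); })[0];

      var started = Date.now();
      // ... build category representations, then classify or reconstruct ...
      var elapsed = (Date.now() - started) / 1000;

      // Faithful reporting: mode, inputs and wall-clock time.
      console.log(mode + ": " + bookmarkFiles.length + " bookmark files, URL file " +
                  urlFile + ", " + elapsed.toFixed(2) + "s");
    }

    main(process.argv.slice(2));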

  19. Task: Write Up Report
     • At least one tutorial will be dedicated to good report-writing practice: how to write a literature review, how to build and write references, and how to present evaluation results

  20. Grading Structure
     • 10% for obtaining an average precision of at least 0.8 on the evaluation (for random bookmark classification, using either implemented approach)
     • 10% for incurring a maximum 2-second overhead on average to classify a page (the time overhead must be faithfully reported)
     • Max. 10% for the extra utility
     • 40% Report
     • 15% Presentation
     • 15% Artifact Design/Implementation
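
For the precision target, one natural reading (an assumption; confirm the intended measure) is the fraction of test bookmarks placed in their recorded home category, averaged over bookmark files:

    // Average precision over bookmark files, where each file contributes
    // correct/total classifications.
    function averagePrecision(resultsPerFile) {
      var sum = 0;
      for (var i = 0; i < resultsPerFile.length; i++) {
        sum += resultsPerFile[i].correct / resultsPerFile[i].total;
      }
      return sum / resultsPerFile.length;
    }
    // e.g. averagePrecision([{correct: 8, total: 10}, {correct: 9, total: 10}]) ≈ 0.85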

  21. Future Opportunities
     • FYP supervision
     • Opportunity to co-author a research paper to be submitted to a leading IR/AH/UM conference (irrespective of the FYP)

  22. Pitfalls
     • Utilities must be lightweight
       • Mostly those that are interactive, or that are invoked while the user is browsing
     • Should all of a document be used to contribute to a category representation, or be used in a similarity measure?

  23. Schedule
     • Until w.c. (week commencing) 6th March inclusive: discussion, talks once per week
     • w.c. 19th March: submit TOC/chapter overview for feedback (optional)
     • w.c. 23rd April: Demo 1 (optional)
     • 23rd April - 7th May: submit one chapter of your choice for feedback (optional)
     • w.c. 7th May: Demo 2 (optional)
     • 14th May: the evaluation collection will be made available
     • 25th May: submit the APT report
     • June: demo and evaluation under exam conditions
