Automatic Classification of Bookmarked Web Pages
Chris Staff
Second Talk, February 2007
Tasks
• Representation of bookmark categories
• Two clustering/similarity algorithms
• Extra utility
• User interface
• Evaluation
• Write up report
Overview
• Represent bookmark categories
• We're starting with populated bookmark files, so use the 'How Did I Find That?' approach
• Plus another, individual approach
• When a page is to be bookmarked:
• If the referrer page is available, identify the topic of the page from it
• Otherwise, identify the page topic using the 'How Did I Find That?' approach
• Compare the current topic to the bookmark category representations (a sketch of this flow follows below)
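A minimal sketch of this decision flow, assuming page topics and category representations are available as plain sets of terms; the term-overlap score and every name here are illustrative placeholders, not the actual HDIFT machinery:

```python
def classify_bookmark(page_terms, referrer_terms, categories):
    """
    page_terms: terms extracted from the page itself;
    referrer_terms: terms from the referrer page, or None if unavailable;
    categories: {category_name: set_of_terms} built off-line (e.g. via HDIFT).
    Returns the name of the best-matching category.
    """
    # Prefer the referrer page as the source of the page's topic, as on the slide.
    topic = referrer_terms if referrer_terms else page_terms

    # Compare the topic to every category representation; a simple term-overlap
    # score stands in here for whichever similarity measure is eventually chosen.
    def overlap(a, b):
        return len(a & b) / (len(a | b) or 1)

    return max(categories, key=lambda name: overlap(topic, categories[name]))


# Toy usage:
cats = {"ir": {"retrieval", "query", "index"}, "cooking": {"recipe", "oven"}}
print(classify_bookmark({"query", "ranking"}, None, cats))   # -> "ir"
```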
Overview
• The representation of the bookmark categories and the representation of the page to be bookmarked need to be compatible, so that we can compare them
• Clustering techniques/similarity measures will tell us cluster membership/degree of similarity
Clustering/Similarity
• Given sets of documents D1…Dn and an individual document di, how can we tell to which Dj di "belongs"?
• What features of the documents in Dj and of document di should we use?
The Vector Space Model of IR
• In the VSM, documents are plotted into vector space
• The nearest neighbours belong to the same cluster
• A query is plotted into vector space too, and its nearest neighbours are the relevant documents
• Typically, a document is represented by term features
G. Salton and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.
See also p18-wong (Generalised Vector Space Model).pdf; refs [1], [2], [3] there point to the original work
Vector Space Model
• Documents are represented as m-dimensional vectors or "bags of words"
• m is the size of the vocabulary
• wk = 1 indicates the term is present in the document
• wk = 0 indicates the term is absent
• dj = <1,0,0,1,...,0,0> (a small example follows below)
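A toy illustration of these binary document vectors; the vocabulary and the document are made up for the example:

```python
# Binary "bag of words" vector over a fixed vocabulary.
vocabulary = ["bookmark", "category", "cluster", "page", "query", "vector"]

def to_binary_vector(text, vocab):
    """1 if the vocabulary term occurs in the document, 0 otherwise."""
    terms = set(text.lower().split())
    return [1 if t in terms else 0 for t in vocab]

doc = "Cluster the bookmarked page into a category"
print(to_binary_vector(doc, vocabulary))   # -> [0, 1, 1, 1, 0, 0]
```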
Vector Space Model
• The query is then plotted into m-dimensional space and its nearest neighbours are the most relevant documents
• However, the result set is usually presented as a list ranked by similarity to the query
Context…
• What's "180374"?
• What's "Chris"?
• What's "R2D2"?
• What's "B4Y2"?
• Why were we able to tell what the first three might be, but not the last one?
Context…
• Information is data in context (what it is)
• Knowledge is information in context (how and when to use it)
Context and HDIFT
• Web-based documents may contain more than one topic
• How can we identify what the user might be interested in?
• We want to identify a document's context so that we can understand what a user might be interested in
• What can provide a context?
How Did I Find That?
• HDIFT is a technique for finding documents related to a user's bookmark categories
• It represents each category in a novel way
• It extracts a query from the category representation (centroid); see the sketch below
• It submits the query to a 3rd-party search engine
• It shows the top-10 results to the user
• This is the first technique that you must use
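A small sketch of the query-extraction step, assuming the centroid is available as a term-to-weight mapping; the real HDIFT/SWISH-E representation may differ, and the weights below are invented:

```python
def centroid_to_query(centroid, k=10):
    """Join the k highest-weighted centroid terms into a query string."""
    top_terms = sorted(centroid, key=centroid.get, reverse=True)[:k]
    return " ".join(top_terms)

# Toy centroid:
centroid = {"retrieval": 0.9, "bookmark": 0.7, "clustering": 0.4, "web": 0.2}
print(centroid_to_query(centroid, k=3))   # -> "retrieval bookmark clustering"
```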
How Did I Find That?
• Building a category representation
• Instead of using term features from the documents in the category, use terms from the parents of the documents
• Why?
• How many parents should we use?
• If we know the parent, then the referrer page only
• Otherwise, ?
How Did I Find That?
• HDIFT uses 20 parents obtained using the 'link:' operator in Google
• Find the "context block" in each parent that contains the link to the bookmarked doc
• There may be more than one link! (see the extraction sketch below)
• Write each context block to file
• Once all the context blocks of all the bookmarks in a category have been obtained, use SWISH-E to index them
• Create the centroid representation
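A rough sketch of extracting context blocks from one parent page, taking the enclosing paragraph of every link that points at the bookmarked URL. BeautifulSoup is used here only as an example HTML parser; it is not prescribed by the slides, and the notion of "context block" is simplified to the enclosing paragraph:

```python
from bs4 import BeautifulSoup

def context_blocks(parent_html, bookmarked_url):
    """Return the text of each paragraph in the parent that links to the bookmarked URL."""
    soup = BeautifulSoup(parent_html, "html.parser")
    blocks = []
    for a in soup.find_all("a", href=bookmarked_url):
        block = a.find_parent("p") or a.parent   # fall back to the link's direct parent
        blocks.append(block.get_text(" ", strip=True))
    return blocks

# Toy parent page with one link to the bookmarked document:
html = '<p>Good notes on <a href="http://example.org/ir">IR models</a> here.</p>'
print(context_blocks(html, "http://example.org/ir"))
```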
How Did I Find That?
• Each category will have an HDIFT centroid representation
• When the user bookmarks a page, use two approaches to determine the context of the page
• One will use the referrer page only
• The other will generate a representation of the document based on its parents (i.e., HDIFT)
How Did I Find That?
• Each representation of the bookmarked page is used to determine category membership
• Obviously, for evaluation, you won't have access to the parent of the page to be bookmarked (classified), as the page has been removed from the category containing it (see the evaluation sketch below)
• What difference, if any, is there between the two document representations?
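One way to set up the evaluation hinted at here is a leave-one-out run over an existing bookmark file: each page is removed from its category, re-classified, and the prediction compared with the category it came from. `build_representation` and `classify` are placeholders for whichever of the two approaches is being tested:

```python
def leave_one_out_accuracy(bookmarks, build_representation, classify):
    """bookmarks: {category_name: [page, ...]}  ->  fraction classified correctly."""
    correct = total = 0
    for category, pages in bookmarks.items():
        for page in pages:
            # Remove the held-out page before building the category representations.
            held_out = {c: [p for p in ps if p is not page]
                        for c, ps in bookmarks.items()}
            representations = build_representation(held_out)
            if classify(page, representations) == category:
                correct += 1
            total += 1
    return correct / total if total else 0.0


# Toy usage with trivially simple stand-ins (pages as sets of terms):
bm = {"a": [{"x"}, {"x", "y"}], "b": [{"z"}]}
rep = lambda cats: {c: set().union(*ps) if ps else set() for c, ps in cats.items()}
clf = lambda page, reps: max(reps, key=lambda c: len(page & reps[c]))
print(leave_one_out_accuracy(bm, rep, clf))
```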
2nd category representation method
• This is largely up to you to come up with
• What's important is that you try to identify what is likely to be important to the user
• Why did the user put those documents into the same category?
• Why is the user bookmarking the current page?
• Feel free to use any clustering algorithm/similarity measure you like
Judging similarity
• Typically, we use the cosine similarity measure, which measures the cosine of the angle between two documents in vector space (Euclidean distance between the document vectors is an alternative measure)
Cosine Similarity Measure
• [Formula on this slide taken from 'IR vector space model.pdf'; reproduced below]
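For reference, the standard cosine similarity formula the slide refers to, written with the same m-dimensional term weights wk used earlier:

```latex
\[
  \cos(\vec{d_j}, \vec{q})
    = \frac{\vec{d_j} \cdot \vec{q}}{\lVert \vec{d_j} \rVert \, \lVert \vec{q} \rVert}
    = \frac{\sum_{k=1}^{m} w_{k,j}\, w_{k,q}}
           {\sqrt{\sum_{k=1}^{m} w_{k,j}^{2}} \; \sqrt{\sum_{k=1}^{m} w_{k,q}^{2}}}
\]
```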
Cosine Similarity Measure
• For the approach based on HDIFT, please use the CSM (a minimal implementation sketch follows below)
• For the other approach, you can use anything you like, e.g., information filtering or clustering techniques
• Including the CSM, if you like
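A minimal implementation sketch of the CSM over term-to-weight dictionaries; the example vectors and their weights are invented:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

page = {"retrieval": 1.0, "bookmark": 0.5}
category_centroid = {"retrieval": 0.8, "clustering": 0.6, "bookmark": 0.4}
print(round(cosine_similarity(page, category_centroid), 3))   # -> 0.83
```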
Indexing Documents
• In HDIFT, I use SWISH-E to index documents to obtain a cluster centroid, and I extract the top-ranking keywords to form a query
• You are required to impose no more than a 2-second overhead, on average, to classify a page when it is bookmarked
• The categories can be indexed, represented, and re-indexed when the cluster membership changes, off-line or in the background (see the sketch below)
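One possible way to keep re-indexing out of the 2-second budget is to rebuild category centroids on a background worker thread; this queue-based sketch is only an illustration, and `build_centroid` stands in for SWISH-E or whatever indexer is chosen:

```python
import queue
import threading

reindex_queue = queue.Queue()

def reindex_worker(build_centroid, centroids):
    """Rebuild centroids in the background whenever a category changes."""
    while True:
        category, documents = reindex_queue.get()
        centroids[category] = build_centroid(documents)   # potentially slow; off the classify path
        reindex_queue.task_done()

centroids = {}
threading.Thread(target=reindex_worker,
                 args=(lambda docs: set().union(*docs) if docs else set(), centroids),
                 daemon=True).start()

# When a page is added to a category, just enqueue the rebuild and return immediately:
reindex_queue.put(("information-retrieval", [{"bookmark", "query"}, {"vector"}]))
reindex_queue.join()   # in the real tool this wait would not be needed
print(centroids)
```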
Indexing Documents
• You *do not need* to use SWISH-E, though you may need to use something, and there is nothing to stop you from using SWISH-E :-)
• You may develop your own lightweight indexing system, use another 3rd-party (ideally open-source) system, etc.