Neighborhood-based Tag Prediction
Adriana Budura (adriana.budura@epfl.ch)
joint work with: Sebastian Michel, Philippe Cudré-Mauroux, Karl Aberer
ESWC’09
Outline
• Motivation
• Principles of Tag Propagation
• Scoring Model
• Top-k Tag Inference
• Experimental Results
• Conclusions
Motivation
• Tagging portals («Web 2.0»)
  • users attach keywords (tags) to resources: flickr, del.icio.us, citeulike, …
• Tags:
  • unstructured textual information
  • reflect the meaning of resources for users
  -> a powerful tool to improve search
• BUT: we need many tags, and users are lazy
• Therefore: Automatic Tag Inference
Neighborhood-based Tag Prediction
• IDEA: copy tags from other resources
  • semantically related resources -> related tags
• How to discover semantically similar resources?
  • resources are connected via links (e.g., HTML links, citations)
  • the neighborhood of a resource captures its context (e.g., citations in "Related Work")
  • propagate tags along the edges of the graph
• How relevant is a tag found in the neighborhood?
Computational Model
• 3 concepts:
  • Documents
    • the resources for which we infer tags; uniquely identifiable
    • in our scenario: scientific publications, Web pages
  • Tags
    • keywords attached to the resources
  • Document neighborhoods
    • documents connected by users -> graph
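The three concepts can be illustrated with a small data structure. This is only a sketch: the `Document` class, the `links` adjacency map, the helper `neighbor_tags`, and the toy data are hypothetical names chosen for illustration, not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A uniquely identifiable resource, e.g. a publication or a Web page."""
    doc_id: str
    tags: set = field(default_factory=set)  # keywords attached to the resource


# Hypothetical toy data: three documents and the links between them.
documents = {
    "d_init": Document("d_init", {"IR"}),
    "d1": Document("d1", {"IR", "ranking"}),
    "d2": Document("d2", {"P2P", "distributed"}),
}
# Adjacency map: document id -> ids of directly linked documents.
links = {"d_init": ["d1"], "d1": ["d_init", "d2"], "d2": ["d1"]}


def neighbor_tags(doc_id):
    """Collect the tags found on the direct neighbors of doc_id."""
    found = set()
    for neighbor in links.get(doc_id, []):
        found |= documents[neighbor].tags
    return found
```

Here `neighbor_tags("d_init")` only looks one hop away; a larger neighborhood would follow the links further.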
How relevant is a tag found in the neighborhood?
• Neighborhood defines context (far away -> less related): Tag Distance
• Enough support in the neighborhood: Tag Occurrence
• Some tags are more likely to occur together: Tag Co-Occurrence
• Similar documents are likely to share the same tags: Document-Document Similarity
Principles of Tag Propagation
[Figure: e.g., citation graph of publications around d_init; documents carry tags such as ranking, IR, TopK, PageRank, P2P, distributed; annotated with Tag Occurrence, Doc-Doc Similarity, Tag Co-Occurrence, Tag Distance]
Overview
• Motivation
• Principles of Tag Propagation
• Scoring Model
• Top-k Tag Inference
• Experimental Results
• Conclusions
(1) Tag Co-Occurrence
• How relevant is a tag t for d_init, based on the tags already assigned to d_init?
• conditional probability of t given a tag already assigned to d_init
• d_init can have more than one initial tag => we aggregate over the set of tags T(d_init)
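A minimal sketch of the co-occurrence principle, estimating the conditional probability from counts over a tagged corpus. The slide does not preserve the aggregation formula, so averaging over T(d_init) is an assumption made here; all function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_model(tagged_docs):
    """Count single tags and unordered tag pairs over a tagged corpus."""
    tag_count, pair_count = Counter(), Counter()
    for tags in tagged_docs:
        tags = set(tags)
        tag_count.update(tags)
        pair_count.update(frozenset(p) for p in combinations(sorted(tags), 2))
    return tag_count, pair_count


def cond_prob(t, given, tag_count, pair_count):
    """Estimate P(t | given) as count(t and given) / count(given)."""
    if tag_count[given] == 0:
        return 0.0
    return pair_count[frozenset((t, given))] / tag_count[given]


def cooccurrence_score(t, initial_tags, tag_count, pair_count):
    """ASSUMPTION: aggregate over T(d_init) by averaging the conditionals."""
    if not initial_tags:
        return 0.0
    total = sum(cond_prob(t, g, tag_count, pair_count) for g in initial_tags)
    return total / len(initial_tags)


# Hypothetical toy corpus: each set is the tag set of one document.
corpus = [
    {"IR", "ranking"},
    {"IR", "ranking", "TopK"},
    {"IR", "P2P"},
    {"P2P", "distributed"},
]
tag_count, pair_count = cooccurrence_model(corpus)
```

With this corpus, "ranking" co-occurs with "IR" in two of the three documents tagged "IR", and never with "P2P", so its averaged score for the initial tag set {"IR", "P2P"} is one third.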
(2) Doc-Doc Similarity
• How relevant is a tag t (coming from a document d) for d_init, based on the similarity between d and d_init?
• similarity computed in the vector space model
• for documents that are several hops away, we aggregate
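A sketch of document-document similarity in the vector space model. The slide does not preserve the formula, so two assumptions are made here: cosine similarity over sparse term-weight vectors, and aggregation over several hops by multiplying the similarities along the path. The names `cosine`, `path_similarity`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def path_similarity(path, vectors):
    """ASSUMPTION: multiply the similarities of consecutive documents on the path."""
    sim = 1.0
    for a, b in zip(path, path[1:]):
        sim *= cosine(vectors[a], vectors[b])
    return sim


# Hypothetical term vectors for three documents.
vectors = {
    "d_init": {"tag": 1.0, "search": 1.0},
    "d1": {"tag": 1.0, "search": 1.0},
    "d2": {"tag": 1.0},
}
two_hop = path_similarity(["d_init", "d1", "d2"], vectors)
```

Under the product assumption, a document reached over a weakly similar intermediate document contributes less than one reached over a strongly similar one.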
(3) Tag Distance / (4) Tag Occurrence
• Tag Occurrence
  • What is the popularity (support) of a tag in the neighborhood?
  • expressed as a sum over all scores for a tag t
• Tag Distance
  • the distance between d_init and a document d with tag t: the shortest path
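Both quantities can be gathered in one breadth-first pass over the graph. This sketch simplifies tag occurrence to a plain count of documents carrying the tag (the slide sums scores instead) and measures distance in hops; the function names, the `max_hops` cut-off, and the toy graph are hypothetical.

```python
from collections import Counter, deque


def hop_distances(links, start, max_hops):
    """Breadth-first search: shortest hop distance from start, up to max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if dist[cur] == max_hops:
            continue  # do not expand beyond the neighborhood radius
        for nxt in links.get(cur, []):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def tag_support_and_distance(links, tags, start, max_hops=2):
    """Per tag: number of neighborhood documents carrying it, and nearest distance."""
    dist = hop_distances(links, start, max_hops)
    support, nearest = Counter(), {}
    for doc, d in dist.items():
        if doc == start:
            continue
        for t in tags.get(doc, ()):
            support[t] += 1  # SIMPLIFICATION: count instead of summing scores
            nearest[t] = min(d, nearest.get(t, d))
    return support, nearest


# Hypothetical toy graph and tag assignments.
links = {"d_init": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
tags = {"a": {"IR"}, "b": {"IR", "P2P"}, "c": {"P2P"}}
support, nearest = tag_support_and_distance(links, tags, "d_init")
```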
Putting it All Together
• Combined scoring function: sum of partial scores for each occurrence of a tag t in the neighborhood of d_init
[Figure: d_init and its neighborhood, annotated with Tag Occurrence; Doc-Doc Similarity, Tag Distance; Tag Co-Occurrence]
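The combined formula itself is not preserved in this transcript. The following is therefore only an illustrative sketch of "a sum of partial scores per occurrence": the way the partial score multiplies similarity, co-occurrence, and a distance damping factor is an assumption, as are the name `combined_score` and the `damping` parameter.

```python
def combined_score(occurrences, damping=0.5):
    """Sum the partial scores of all occurrences of one tag.

    occurrences: list of (doc_similarity, cooccurrence, distance) tuples,
    one per document in the neighborhood that carries the tag.
    ASSUMPTION: partial score = similarity * co-occurrence * damping^(distance - 1),
    so that occurrences farther away from d_init count less.
    """
    return sum(
        sim * cooc * damping ** (dist - 1)
        for sim, cooc, dist in occurrences
    )


# Hypothetical example: one occurrence at distance 1, one at distance 2.
score = combined_score([(0.8, 0.5, 1), (0.4, 0.5, 2)])
```

Summing over occurrences covers tag occurrence (more supporting documents, higher score), while each summand reflects similarity, distance, and co-occurrence.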
Overview
• Motivation
• Principles of Tag Propagation
• Scoring Model
• Top-k Tag Inference
• Experimental Results
• Conclusions
Inferring tags for a document
• traverse the graph of documents and gather tags for the initial document
• do not visit the whole neighborhood -> need smart graph traversal
• the scoring model can compute a score for "every" tag -> top-k tags are enough
• … when should we stop?
Graph Traversal
• Precomputed: tags + scores for each document
  • Doc 1: P2P 0.3; Tag 0.28; Social 0.25; Paper 0.2; 2009 0.1
  • Doc 2: Social 0.4; Search 0.33; Budura 0.25; Tag 0.2; Paper 0.2
• Select the next document based on the doc-doc similarity
[Figure: d_init linked to Doc 1 and Doc 2]
Top-K Graph Traversal
• List of all neighbors sorted by doc-doc similarity
• Select the best document (Doc x) and mark it as visited
• Tag lists of the visited documents:
  • P2P 0.3; Tag 0.28; Social 0.25; Paper 0.2; 2009 0.1
  • Social 0.4; Search 0.33; Budura 0.25; Tag 0.2; Paper 0.2
• Merged top-k list at d_init: Social 0.65; Tag 0.48; Paper 0.4; P2P 0.3; ...
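The traversal can be sketched as a best-first search that always visits the most similar unvisited document next and adds its precomputed tag scores to a running total. The tag lists below are the ones shown on the slide; the function name `traverse`, the document ids, and the similarity values 0.9 and 0.7 are hypothetical.

```python
import heapq
from collections import defaultdict


def traverse(neighbors, tag_lists, start, max_visits):
    """Visit documents best-first by doc-doc similarity and sum their tag scores."""
    scores = defaultdict(float)
    visited = {start}
    # heapq is a min-heap, so similarities are negated to pop the best first.
    frontier = [(-sim, doc) for doc, sim in neighbors.get(start, [])]
    heapq.heapify(frontier)
    visits = 0
    while frontier and visits < max_visits:
        _, doc = heapq.heappop(frontier)
        if doc in visited:
            continue
        visited.add(doc)
        visits += 1
        for tag, s in tag_lists.get(doc, {}).items():
            scores[tag] += s
        for nxt, sim in neighbors.get(doc, []):
            if nxt not in visited:
                heapq.heappush(frontier, (-sim, nxt))
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Similarities are hypothetical; tag lists are taken from the slide.
neighbors = {"d_init": [("doc1", 0.9), ("doc2", 0.7)]}
tag_lists = {
    "doc1": {"P2P": 0.3, "Tag": 0.28, "Social": 0.25, "Paper": 0.2, "2009": 0.1},
    "doc2": {"Social": 0.4, "Search": 0.33, "Budura": 0.25, "Tag": 0.2, "Paper": 0.2},
}
ranked = traverse(neighbors, tag_lists, "d_init", max_visits=2)
```

After both documents are visited, the merged totals match the slide: Social 0.65, Tag 0.48, Paper 0.4, P2P 0.3. The fixed `max_visits` budget stands in for a real stopping criterion.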
Top-k Tag Inference
• Fagin et al.: NRA algorithm
• for each candidate tag:
  • worst_score = actual score
  • best_score = worst_score + (m - m') * best_to_come_score
• prune a tag when best_score < score of the tag currently at rank k
• stop when k tags have been seen and no candidate tags are left
• the final "score" mass of each tag is unknown -> consider ONLY m occurrences for each tag
[Figure: score intervals between worst score (w) and best score (b) for the tag at top-k position k, a candidate tag, and an expelled tag]
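The pruning and stopping rules can be sketched as a single check over the tags seen so far. Assumptions in this sketch: m' is read as the number of occurrences already seen for a tag, and the extra check that a completely unseen tag can no longer reach the top-k follows general NRA reasoning rather than the slide. The name `nra_step` and the example values are hypothetical.

```python
def nra_step(seen, best_to_come, m, k):
    """Apply NRA-style bounds to the tags seen so far.

    seen: tag -> list of partial scores observed so far (m' = list length).
    Returns (top_k, candidates, done).
    """
    worst = {t: sum(s) for t, s in seen.items()}
    best = {
        t: worst[t] + max(m - len(s), 0) * best_to_come
        for t, s in seen.items()
    }
    ranked = sorted(worst, key=worst.get, reverse=True)
    top_k = ranked[:k]
    if len(top_k) < k:
        return top_k, ranked[k:], False  # fewer than k tags seen: keep going
    threshold = worst[top_k[-1]]  # score of the tag currently at rank k
    # A tag survives as a candidate only if its best score can still reach rank k.
    candidates = [t for t in ranked[k:] if best[t] >= threshold]
    # ASSUMPTION (not on the slide): an unseen tag can gain at most m * best_to_come.
    unseen_can_enter = m * best_to_come >= threshold
    return top_k, candidates, not candidates and not unseen_can_enter


# Hypothetical state after visiting two documents, with m = 2 and k = 2.
seen = {
    "Social": [0.25, 0.4],
    "Tag": [0.28, 0.2],
    "Paper": [0.2, 0.2],
    "P2P": [0.3],
}
top_k, candidates, done = nra_step(seen, best_to_come=0.1, m=2, k=2)
```

With best_to_come = 0.1, "P2P" can reach at most 0.4, below the rank-2 score of 0.48, so it is pruned and traversal can stop. With best_to_come = 0.2 its best score is 0.5, so it stays a candidate and traversal continues.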
Overview
• Motivation
• Principles of Tag Propagation
• Scoring Model
• Top-k Tag Inference
• Experimental Results
• Conclusions
Experimental Setup
• Datasets
  • del.icio.us (120K bookmarks)
  • CiteULike/CiteSeer (2200 crawled PDFs)
• Measures of interest:
  • Precision (user study)
  • Relative precision (computed based on already assigned tags)
  • Cost (number of visited neighbors)
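The slide does not define relative precision beyond "computed based on already assigned tags". One plausible reading, shown here purely as an assumption, is precision at k where the tags that users have already assigned serve as the relevant set. The function name and the example tags are hypothetical.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted tags that appear in the relevant set."""
    top = predicted[:k]
    if not top:
        return 0.0
    return sum(1 for t in top if t in relevant) / len(top)


# Hypothetical example: ranked predictions vs. tags users already assigned.
predicted = ["Social", "Tag", "Paper", "Search"]
already_assigned = {"Social", "Paper", "Web"}
relative_precision = precision_at_k(predicted, already_assigned, k=4)
```

For the user-study precision, the same function would apply with `relevant` replaced by the tags that human judges accepted.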
Experimental Results: CiteULike
• 30 initial documents
• manual precision evaluation (user study)
Experimental Results: Del.icio.us
• 120 initial documents
• relative precision evaluation
Conclusions
• Tag inference over the edges of resource graphs
• 4 principles of tag propagation
• Scoring model
• Top-k tag inference with modest access to the resource graph