Evaluating Similarity Measures for Emergent Semantics of Social Tagging Authors: Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho, Gerd Stumme Presenter: Zhi Qiao
1. Introduction • 2. Similarity Measuring Framework • a. Representation • b. Aggregation Methods • c. Similarity Measures • 3. Evaluation • a. Predicting Tag Relations • b. Evaluation via External Grounding • 4. Conclusion and Scalability
Web 1.0: users could only view web pages, not contribute to their content. Web 2.0 allows users to interact and collaborate with each other, e.g., to collectively classify and find information through tagging. From Web 1.0 to Web 2.0
Folksonomy: social bookmarking systems and their emergent information structures, in which users create or share tags to annotate resources. Example from delicious.com
PageRank: textual analysis of content, taking into account the hyperlinks created by authors as implicit endorsements between pages. Folksonomies grant us access to a semantically richer source of social annotation. They allow us to extend the assessment of what a page is about from content analysis algorithms to the collective "wisdom of the crowd": if many people agree that a page is about programming, then with high probability it is about programming, even if its content does not include the word "programming". "The wisdom of the crowd"
Since tags can be easily created by users and require no special knowledge, they come with potential problems: • Lack of structure • Lack of global coherence • Ambiguity • Use of different languages • Spelling mistakes What are the potential problems for tagging?
Should a relationship be stronger if many people agree that two objects are related than if only a few people do? • Which weighting schemes best regulate the influence of an individual? • How does the sparsity of annotations affect the accuracy of these measures? • Are the same measures most effective for both resource and tag similarity? • Which aggregation schemes retain the most reliable semantic information? • Which lend themselves to incremental computation? Some open questions to address
Triple Annotation Representation • A folksonomy F is a set of triples (u,r,t) • User u annotating resource r with tag t Similarity framework
A post (u,r,(t1,…tn)) is transformed into a set of triples {(u,r,t1)…(u,r,tn)}. Define similarity measures σ(x,y), where x and y can be two resources or two tags.
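A minimal sketch of this transformation in Python (the function and example values are illustrative, not from the paper):

    # Expand a post (u, r, (t1, ..., tn)) into the triples {(u, r, ti)}.
    def post_to_triples(user, resource, tags):
        return {(user, resource, t) for t in tags}

    F = post_to_triples("alice", "cnn.com", ["news", "politics"])
    # -> {("alice", "cnn.com", "news"), ("alice", "cnn.com", "politics")}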
For evaluation purposes, we focus on resource-resource and tag-tag similarity. Therefore we need to aggregate across users. Aggregation Methods
Projection: The simplest aggregation approach is to project across users, obtaining a unique set of (r,t) pairs; this amounts to checking whether a triple with that pair is stored in the database relation F. The result is represented by a matrix W with binary elements (0 or 1): a 0 in a matrix element means that no user associated that resource with that tag, whereas a 1 means that at least one user has performed the indicated association.
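A hedged sketch of projection aggregation (a dict of sets stands in for the binary matrix W; names are illustrative):

    from collections import defaultdict

    # Collapse the triples across users: W[r] holds every tag that at
    # least one user attached to resource r (a 1 in the binary matrix).
    def project(triples):
        W = defaultdict(set)
        for user, resource, tag in triples:
            W[resource].add(tag)
        return W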
Distributional: A more sophisticated form of aggregation stems from considering distributional information associated with the set-membership relationships between resources and tags. One way to achieve distributional aggregation is to make set membership fuzzy, weighted by the Shannon information (log-odds) extracted from the annotations. Intuitively, a shared tag may signal a weak association if it is very common. Thus we use the information of a tag x, defined as -log p(x), where p(x) is the fraction of resources annotated with x. Another approach is to count the users who agree on a certain resource-tag annotation while projecting across users.
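A sketch of the -log p(x) weighting under the definition above (assuming p(x) is computed over the projected resource-tag pairs):

    import math
    from collections import defaultdict

    # Weight each aggregated (resource, tag) pair by the Shannon
    # information -log p(tag), where p(tag) is the fraction of resources
    # annotated with that tag. Very common tags get weights near zero,
    # signaling weak associations.
    def distributional_weights(triples):
        resources = set()
        resources_of = defaultdict(set)  # tag -> resources carrying it
        for user, resource, tag in triples:
            resources.add(resource)
            resources_of[tag].add(resource)
        n = len(resources)
        return {(r, t): -math.log(len(rs) / n)
                for t, rs in resources_of.items() for r in rs}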
Macro-Aggregation: Treats each user's annotation set independently first, then aggregates across users. The per-user binary matrix representations Wu are used to compute a "local" similarity σu(x,y). Finally, we macro-aggregate by summing across users to obtain the "global" similarity σ(x,y) = Σu σu(x,y).
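A sketch of macro-aggregation, using the matching measure as the "local" similarity for concreteness (the paper applies the same scheme to each measure):

    from collections import defaultdict

    # sigma(x, y) = sum over users u of sigma_u(x, y), where sigma_u is
    # computed only on u's own annotations.
    def macro_matching(triples, x, y):
        per_user = defaultdict(lambda: defaultdict(set))
        for user, resource, tag in triples:
            per_user[user][resource].add(tag)
        return sum(len(ann.get(x, set()) & ann.get(y, set()))
                   for ann in per_user.values())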
Collaborative: • So far we have only considered feature-based representations. • In collaborative filtering, the fact that one or more users annotate two objects is seen as evidence of association. • We add a special tag tu to all resources annotated by u (the probability of observing this tag on any of u's resources is 1, therefore it carries no information value according to Shannon's information, -log p(x)).
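A sketch of this collaborative augmentation (the tag name "t_<user>" is an illustrative convention):

    # Add a special tag t_u to every resource annotated by user u, so
    # that two resources bookmarked by the same user share a feature.
    # Within u's own annotations p(t_u) = 1, so t_u carries no
    # information value under -log p(x).
    def add_user_tags(triples):
        return set(triples) | {(u, r, "t_" + u) for (u, r, t) in triples}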
For projection: Similarity Measures: Matching
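The formula on this slide did not survive extraction; under projection, matching reduces to counting shared binary features, sketched here:

    # Matching similarity under projection: the number of features
    # (e.g. tags) that the two objects have in common.
    def matching(X, Y):
        return len(set(X) & set(Y))

    matching({"news", "politics"}, {"news", "sports"})  # -> 1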
Ex: p(cnn) = 1/3 (only "news" attached, out of 3 tags). For distributional: Similarity Measures: Matching
Compute similarity for each individual user, then aggregate across users. • Difference: adding the user tag t* • p(cnn|alice) = 1/3 -> p(cnn|alice) = 1/4 • Collaborative filtering indicates higher similarity than macro-aggregation. For macro and collaborative aggregation: Similarity Measures: Matching
Overlap • Jaccard • Dice • Cosine • Mutual Information Similarity Measures: ……
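Hedged sketches of these set-based measures on binary feature sets X and Y (standard definitions; the slide lists only the names):

    import math

    def overlap(X, Y):
        return len(X & Y) / min(len(X), len(Y))

    def jaccard(X, Y):
        return len(X & Y) / len(X | Y)

    def dice(X, Y):
        return 2 * len(X & Y) / (len(X) + len(Y))

    def cosine(X, Y):
        return len(X & Y) / math.sqrt(len(X) * len(Y))

Mutual information has no such one-line set form: it compares the joint distribution of the two objects' features against the product of the marginals, which is why the discussion slide calls it the most expensive measure.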
Predicting Tag Relations • BibSonomy.org allows users to input directed relations, such as tagging -> web2.0, between pairs of tags. • It contains many tags, but this user relation data is very sparse. • Shortcomings: loses information, is sensitive to small changes, and misses hierarchical relations. Evaluation
Only looks at the rank of similarities, not the actual similarity values • Uses WordNet and the Open Directory Project as external grounding • A higher rank correlation is interpreted as better agreement with the grounding, and thus as evidence of a better similarity measure. Evaluation via External Grounding
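A sketch of the rank-based comparison using Kendall's tau as the rank correlation (the similarity values here are made up for illustration):

    from scipy.stats import kendalltau

    # Correlate the ranking induced by a folksonomy measure with the
    # ranking from the external grounding over the same pairs.
    folksonomy_sims = [0.9, 0.2, 0.5, 0.7]  # sigma(x, y) for four pairs
    grounding_sims = [0.8, 0.1, 0.6, 0.9]   # grounding values, same pairs

    tau, p = kendalltau(folksonomy_sims, grounding_sims)
    # higher tau -> better agreement with the grounding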
Mutual information is the best measure for extracting semantic similarity information from a folksonomy • Macro-aggregation is less effective than micro-aggregation (projection and distributional) (Why?) • In spite of macro-aggregation's shortcomings, collaborative filtering extracts much useful information • Mutual information is also the most expensive measure to compute • Macro and collaborative aggregation allow for incremental computation because each user's representation is maintained separately, so they can scale. DISCUSSION AND SCALABILITY