Instance-based mapping between thesauri and folksonomies • Christian Wartena, Rogier Brussee • Telematica Instituut
Outline • Interoperability of Keywords • Wikipedia and del.icio.us • Keyword similarity • Experiment • Conclusion
Interoperability of Keywords • Documents (pictures, movies, …) are annotated with keywords for organization and retrieval. • Different collections and communities use different sets of keywords. • The set of selectable keywords is often organized in and delimited by a thesaurus. • The set of freely generated end-user keywords (“tags”) forms a folksonomy. • Approach: align keywords and tags by comparing their usage. • Tested on del.icio.us tags and Wikipedia categories.
del.icio.us and Wikipedia • del.icio.us • Social bookmarking site. • In most cases the tags of a bookmark can be interpreted as labels for the bookmarked URL. • Many Wikipedia articles are tagged by del.icio.us users. • Wikipedia • Articles are labeled with one or more categories by the article authors. • Categories are organized hierarchically. • Categories are organized deliberately, as in a thesaurus. • New categories are introduced after discussion among active Wikipedians.
Keyword alignment • Problem • Given a keyword k in a system A, what is the most similar keyword k′ in system B? • Given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)? • Approach • Interpret similarity as similarity of usage. • Compute similarity of usage on a common sub-collection. • Evaluation • Compare results to human judgment of similarity.
Keyword similarity • Basic assumption: similarity is similarity of usage. • If two keywords have similar usage they will give similar results in retrieval tasks. • Two keywords have similar usage if they • have a similar distribution over documents (measured by the divergence, i.e. relative entropy, of the distributions, or by the cosine) • often co-occur (measured by the Jaccard coefficient)
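The two baseline measures named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `{doc_id: count}` representation and the toy data are all assumptions made here. The Jaccard coefficient only looks at whether two keywords share documents, while the cosine also uses how often each keyword was assigned.

```python
from math import sqrt


def jaccard(docs_a, docs_b):
    """Jaccard coefficient of two collections of document ids."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(counts_a, counts_b):
    """Cosine of two sparse count vectors given as {doc_id: count} dicts."""
    dot = sum(c * counts_b.get(d, 0) for d, c in counts_a.items())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy data: how often a tag / a category is assigned per document.
tag_python = {"d1": 3, "d2": 1}
cat_programming = {"d1": 1, "d2": 1, "d3": 1}

print(jaccard(tag_python, cat_programming))  # 2 shared documents out of 3
print(cosine(tag_python, cat_programming))
```

Passing a dict to `jaccard` works because iterating a dict yields its keys, so the assignment counts are deliberately ignored there.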
New measure for keyword similarity • Keywords have similar usage if they co-occur with similar frequency with all other keywords. • We use the frequency with which a tag/keyword is assigned to a document. • We include co-occurrence information with other terms. • This helps to cope with sparse data. • In other words: • Terms are similar if they have similar co-occurrence patterns. • Similar to the Tag Context Similarity in Cattuto et al.’s presentation tomorrow (Semantic Social Networks Session)
Formalization: Distribution of co-occurring terms • $p_z(t) = \sum_d q(t|d)\, Q(d|z)$ • where • q(t|d) is the keyword distribution of document d • Q(d|z) is the document distribution of keyword z • “The fraction of z’s that is found in d” • p_z is a weighted average of the keyword distributions of the documents. • The weight is the relevance of d for z, given by the probability Q(d|z).
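A minimal sketch of this weighted average, assuming the formula p_z(t) = Σ_d q(t|d) Q(d|z) and assuming both distributions are estimated from raw assignment counts n(d, t): q(t|d) = n(d, t) / Σ_t' n(d, t') and Q(d|z) = n(d, z) / Σ_d' n(d', z). The function name, the nested-dict data layout and the toy data are illustrative choices, not taken from the slides.

```python
from collections import defaultdict


def cooccurrence_distribution(annotations, z):
    """Return p_z as {term: probability}.

    annotations: {doc_id: {term: count}}, the number of times each term
    was assigned to each document (hypothetical layout).
    """
    total_z = sum(terms.get(z, 0) for terms in annotations.values())
    if total_z == 0:
        raise ValueError(f"term {z!r} is never assigned")
    p_z = defaultdict(float)
    for terms in annotations.values():
        n_dz = terms.get(z, 0)
        if n_dz == 0:
            continue  # Q(d|z) = 0, so the document contributes nothing
        weight = n_dz / total_z          # Q(d|z)
        doc_total = sum(terms.values())  # normaliser for q(t|d)
        for t, n in terms.items():
            p_z[t] += weight * n / doc_total
    return dict(p_z)


# Hypothetical toy collection.
annotations = {
    "d1": {"python": 2, "code": 2},
    "d2": {"python": 2, "snake": 2},
    "d3": {"cooking": 4},
}
p = cooccurrence_distribution(annotations, "python")
print(p)  # python: 0.5, code: 0.25, snake: 0.25
```

In this sketch z itself stays in the support of p_z; whether to exclude it is a design choice the slides leave open ("all (other) keywords").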
Distance of keywords • For each keyword there is a distribution over all (other) keywords. • Similarity is expressed by the divergence of these distributions. • Kullback-Leibler divergence: $D(p\,\|\,q) = \sum_t p(t) \log \frac{p(t)}{q(t)}$ • Interpretation: the bits per keyword saved by compressing a subcollection with keyword distribution p using a code based on p instead of one based on a general distribution q.
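The Kullback-Leibler divergence can be sketched in a few lines, using the standard definition D(p‖q) = Σ_t p(t) log₂(p(t)/q(t)) with base-2 logarithms so that the result is in bits. The dict representation and the toy distributions are assumptions for illustration.

```python
from math import log2


def kl_divergence(p, q):
    """D(p || q) in bits for distributions given as {term: probability}."""
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return float("inf")  # p puts mass where q has none
        total += p_t * log2(p_t / q_t)
    return total


# Hypothetical toy distributions over two keywords.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: the divergence is asymmetric
```

The asymmetry and the infinite value for non-overlapping support are the practical reasons to prefer a symmetrised, bounded variant when comparing sparse keyword distributions.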
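Distance of keywords (cont’d) • Jensen-Shannon divergence: $JSD(p\,\|\,q) = \tfrac{1}{2} D(p\,\|\,m) + \tfrac{1}{2} D(q\,\|\,m)$ • Mean distribution: $m = \tfrac{1}{2}(p + q)$ • The Jensen-Shannon divergence is symmetric. • The Jensen-Shannon divergence is the square of a non-negative distance that satisfies the triangle inequality.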
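A small sketch of the Jensen-Shannon divergence, using the standard definition JSD(p‖q) = ½ D(p‖m) + ½ D(q‖m) with m = ½(p + q) and base-2 logarithms. The function name and the toy distributions are illustrative assumptions.

```python
from math import log2, sqrt


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; p and q are {term: probability}."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2 over the union of both supports.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        # m[t] > 0 whenever r[t] > 0, so the ratio is always defined.
        return sum(r_t * log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


# Hypothetical toy distributions.
p = {"a": 1.0}
q = {"b": 1.0}
r = {"a": 0.5, "b": 0.5}

print(jensen_shannon(p, q))  # disjoint supports give the maximum of 1 bit
print(jensen_shannon(p, r) == jensen_shannon(r, p))  # symmetric
# The square root behaves like a distance: the direct route is never longer.
print(sqrt(jensen_shannon(p, q))
      <= sqrt(jensen_shannon(p, r)) + sqrt(jensen_shannon(r, q)))
```

Unlike the Kullback-Leibler divergence, this stays finite when the two keywords share no co-occurring terms, which matters for sparse tag data.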
Alignment • Consider a collection of documents annotated with different sets of keywords. • Represent a keyword by a distribution over terms from both collections. • For each term find the closest term from the other collection.
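The alignment step can be sketched as a nearest-neighbour search: each term from one vocabulary is mapped to the term from the other vocabulary whose co-occurrence distribution is closest. The sketch below assumes the Jensen-Shannon divergence as the distance and assumes the distributions have already been computed. The names `align`, `tags` and `categories` and all numbers are hypothetical.

```python
from math import log2


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; p and q are {term: probability}."""
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl_to_mean(r):
        return sum(r_t * log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


def align(source, target):
    """Map each source term to (closest target term, divergence).

    source, target: {term: co-occurrence distribution}, where every
    distribution ranges over the terms of BOTH vocabularies.
    """
    mapping = {}
    for s, p in source.items():
        best = min(target, key=lambda t: jensen_shannon(p, target[t]))
        mapping[s] = (best, jensen_shannon(p, target[best]))
    return mapping


# Hypothetical, hand-made distributions.
tags = {
    "py": {"python": 0.6, "code": 0.4},
    "yummy": {"food": 0.7, "recipe": 0.3},
}
categories = {
    "Programming languages": {"python": 0.5, "code": 0.5},
    "Cuisine": {"food": 0.6, "recipe": 0.4},
}
for tag, (category, distance) in align(tags, categories).items():
    print(tag, "->", category, round(distance, 4))
```

Returning the divergence together with the match keeps the distance available as a quality indicator for each mapping. The exhaustive search is quadratic in the vocabulary sizes, which is acceptable for a sketch but may need indexing for large collections.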
Experiment I • Mapping between Teleblik keywords and user tags • Educational videos. • Professional keywords from a public broadcasting archive. • Tags assigned in an experiment by high school students. • Data • 100 videos • 12,414 tags • 4,348 different tags • 269 different keywords
Experiment II • Mapping between del.icio.us tags and Wikipedia categories • del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.) • Data • 58,345 Wikipedia articles • 500,618 tag and category annotations • 42,425 different Wikipedia categories • 49,603 different tags • Mappings computed for tags occurring on at least 10 documents. • Mappings for 2,355 tags • Mappings for 1,827 categories • Using co-occurrence data with all 49,603 tags/categories
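The frequency threshold described above (mappings only for terms occurring on at least 10 documents, while the co-occurrence statistics still use all terms) can be sketched as a document-frequency filter. The function name, the data layout and the toy data are assumptions.

```python
from collections import Counter


def frequent_terms(annotations, min_docs=10):
    """Terms assigned to at least `min_docs` distinct documents.

    annotations: {doc_id: {term: count}} (hypothetical layout).
    """
    # Iterating a per-document dict yields each term once per document.
    doc_freq = Counter(t for terms in annotations.values() for t in terms)
    return {t for t, n in doc_freq.items() if n >= min_docs}


# Hypothetical toy data: "common" is on 10 documents, "rare" on one.
annotations = {f"d{i}": {"common": 1} for i in range(10)}
annotations["d0"]["rare"] = 5

print(frequent_terms(annotations))  # only "common" passes the threshold
```

Only the terms to be mapped are filtered; the full `annotations` would still be used when computing co-occurrence distributions, so rare terms continue to contribute context.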
Evaluation of mapping • Manual evaluation • Classification of a sample of mappings into: • Broader term • Narrower term • Related term • Unrelated • Source term is not a keyword (e.g. “to read”) • Meaning unknown
Distance vs. mapping quality • Pairs with a small distance are rated better than pairs with a large distance. • Evaluation of the mappings with the smallest and the largest distance • a) Categories to tags • b) Tags to categories
Effect of keyword frequency • No correlation was found between keyword frequency and the divergence to the best mapping.
Comparison with Jaccard coefficient • Evaluation of the mapping using two different distance measures. • The classes “broader”, “narrower” and “related” are merged. • Results for • a) Categories to tags • b) Tags to categories
Discussion of results • The method works very well in the test • Good mapping results • Distance is a good indication of quality • Insensitive to frequency (up to a certain degree) • Better than Jaccard, because it uses: • co-occurrence with other tags (“tag context”) • the frequency with which a tag is assigned to a document. • Frequency information is typical for user-generated tags. • We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies). • The distance measure also works well for clustering tags.
Future work • Evaluate relatedness using external sources (e.g. WordNet) • Compare to other distance measures • We used documents annotated completely according to two annotation schemes. • How large must the overlap be to obtain decent results? • We can create a partial overlap of disjoint document sets by partially identifying the keywords. • Detect asymmetry in relations (broader vs. narrower term)
Conclusion • Using co-occurrence patterns is a fruitful approach. • Frequent terms from folksonomies behave similarly to carefully assigned keywords, • because a usage-based similarity measure yields good mappings. • Folksonomy seems to work!