Explore aligning keywords from different systems for efficient organization and retrieval. Experiment, analyze, and evaluate keyword similarity using instances from del.icio.us and Wikipedia. Discover methods to enhance mapping quality and overcome keyword divergence challenges.
Instance-based mapping between thesauri and folksonomies Christian Wartena Rogier Brussee Telematica Instituut
Outline • Interoperability of Keywords • Wikipedia and del.icio.us • Keyword similarity • Experiment • Conclusion
Interoperability of Keywords • Documents (pictures, movies, …) are annotated with keywords for organization and retrieval. • Different collections and communities use different sets of keywords. • The set of selectable keywords is often organized in and delimited by a thesaurus. • The set of freely generated end-user keywords (“tags”) forms a folksonomy. • Align keywords/tags by comparing usage. • Tested on del.icio.us tags and Wikipedia categories.
del.icio.us and Wikipedia • Del.icio.us • Social bookmarking site • Bookmarks in most cases can be interpreted as labels or tags for the bookmarked URL. • Many Wikipedia articles are tagged by del.icio.us users. • Wikipedia • Articles are labeled with one or more categories by the article authors. • Categories are organized hierarchically. • Categories are organized deliberately, as in a thesaurus. • New categories are introduced after discussions between active Wikipedians.
Keyword alignment • Problem • Given a keyword k in a system A, what is the most similar keyword k’ in system B? • Given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)? • Approach • Interpret similarity as similarity of usage. • Compute similarity of usage on a common sub-collection. • Evaluation • Compare results to human judgment of similarity.
Keyword similarity • Basic assumption: similarity is similarity of usage. • If two keywords have similar usage they will give similar results in retrieval tasks. • Two keywords have similar usage if they • Have a similar distribution over documents • Divergence (relative entropy) of distributions • Cosine • Often co-occur • Jaccard coefficient
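The two baseline notions named on this slide, co-occurrence (Jaccard coefficient) and similarity of the distribution over documents (cosine), can be illustrated with a minimal sketch. The data and the function names `jaccard` and `cosine` are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from math import sqrt

# Hypothetical toy data: keyword -> {document id: times the keyword was assigned}
assignments = {
    "python": {"d1": 3, "d2": 1},
    "programming": {"d1": 2, "d2": 2, "d3": 1},
    "cooking": {"d4": 5},
}

def jaccard(a, b):
    """Jaccard coefficient of the sets of documents carrying each keyword."""
    docs_a, docs_b = set(assignments[a]), set(assignments[b])
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the two keyword-over-document count vectors."""
    va, vb = assignments[a], assignments[b]
    dot = sum(va[d] * vb[d] for d in va.keys() & vb.keys())
    norm = sqrt(sum(x * x for x in va.values())) * sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0

print(jaccard("python", "programming"), cosine("python", "programming"))
```

Note that the Jaccard coefficient only looks at whether a keyword occurs on a document, while the cosine also uses how often it was assigned.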
New measure for keyword similarity • Keywords have similar usage if they co-occur with similar frequency with all other keywords. • We use the frequency with which a tag/keyword is assigned to a document. • We include co-occurrence information with other terms. • Helps to cope with sparse data • In other words: • Terms are similar if they have similar co-occurrence patterns • Similar to Tag Context Similarity of Cattuto et al.’s presentation tomorrow (Semantic Social Networks Session)
Formalization: Distribution of co-occurring terms • $\bar{p}_z(t) = \sum_d q(t|d)\, Q(d|z)$ • where • q(t|d) is the keyword distribution of d • Q(d|z) is the document distribution of z • “The fraction of z’s that is found in d” • Weighted average of the keyword distributions of documents • The weight is the relevance of d for z given by the probability Q(d|z)
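The weighted average described on this slide can be sketched as follows. The sketch assumes that q(t|d) is the relative frequency of t among all assignments on d, and that Q(d|z) is the number of times z is assigned to d divided by the total number of assignments of z (the "fraction of z's found in d"). The data and the name `cooccurrence_distribution` are hypothetical.

```python
from collections import defaultdict

# Hypothetical counts n(d, t): how often term t was assigned to document d.
counts = {
    "d1": {"python": 3, "programming": 1},
    "d2": {"python": 1, "programming": 1, "tutorial": 2},
}

def cooccurrence_distribution(z, counts):
    """Distribution over terms co-occurring with z: sum_d q(t|d) * Q(d|z)."""
    n_z = sum(doc.get(z, 0) for doc in counts.values())  # all assignments of z
    if n_z == 0:
        raise ValueError(f"term {z!r} does not occur in the collection")
    p = defaultdict(float)
    for doc in counts.values():
        if z not in doc:
            continue
        weight = doc[z] / n_z        # Q(d|z): fraction of z's found in d
        n_d = sum(doc.values())      # total number of assignments on d
        for t, n in doc.items():
            p[t] += weight * n / n_d  # q(t|d) = n / n_d (assumed definition)
    return dict(p)

print(cooccurrence_distribution("python", counts))
```

Because each q(·|d) sums to one and the weights Q(d|z) sum to one, the result is again a probability distribution over terms.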
Distance of keywords • For each keyword there is a distribution over all (other) keywords. • Similarity is expressed by the divergence of these distributions. • Kullback-Leibler divergence: $D(p \,\|\, q) = \sum_t p(t) \log_2 \frac{p(t)}{q(t)}$ • Bits per keyword saved by compressing a subcollection with keyword distribution p using a code based on p instead of one based on a general distribution q.
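As a worked illustration of the Kullback-Leibler divergence, here is a minimal sketch using base-2 logarithms so that the result is in bits. Distributions are represented as plain dictionaries from term to probability, which is an assumption of this sketch rather than something stated in the talk.

```python
from math import log2

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Returns infinity if p puts mass on a term where q has none.
    """
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue  # terms with zero probability contribute nothing
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return float("inf")
        total += p_t * log2(p_t / q_t)
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that the Kullback-Leibler divergence is not symmetric.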
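Distance of keywords (cont’d) • Jensen-Shannon divergence: $JSD(p \,\|\, q) = \tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)$, where D is the Kullback-Leibler divergence • Mean distribution: $m = \tfrac{1}{2}(p + q)$ • Jensen-Shannon divergence is symmetric. • Jensen-Shannon divergence is the square of a non-negative distance satisfying the triangle inequality.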
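A minimal sketch of the Jensen-Shannon divergence, computed as the average Kullback-Leibler divergence of each distribution to their mean. Distributions are again plain dictionaries, and the function name `jensen_shannon` is chosen for illustration only.

```python
from math import log2, sqrt

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2; it is positive wherever p or q is.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(a):
        return sum(a_t * log2(a_t / m[t]) for t, a_t in a.items() if a_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(jensen_shannon(p, q), sqrt(jensen_shannon(p, q)))
```

Because the mean distribution is non-zero wherever either input is, the value is always finite, unlike the plain Kullback-Leibler divergence. The square root printed second is the quantity that behaves as a distance.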
Alignment • Consider a collection of documents annotated with different sets of keywords. • Represent a keyword by a distribution over terms from both collections. • For each term find the closest term from the other collection.
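The three alignment steps can be sketched end to end on a toy collection. Everything here is an illustrative assumption: the data, the `tag:` and `cat:` prefixes used to tell the two vocabularies apart, the relative-frequency definition of the per-document keyword distribution, and the choice of Jensen-Shannon divergence as the distance.

```python
from collections import defaultdict
from math import log2

# Hypothetical doubly annotated collection: document -> {term: assignment count}.
counts = {
    "d1": {"tag:python": 3, "tag:code": 1, "cat:Programming languages": 1},
    "d2": {"tag:python": 2, "cat:Programming languages": 1},
    "d3": {"tag:recipes": 4, "tag:food": 1, "cat:Cooking": 1},
    "d4": {"tag:food": 2, "cat:Cooking": 1},
}

def distribution(z):
    """Co-occurrence distribution of z over terms from both vocabularies."""
    n_z = sum(doc.get(z, 0) for doc in counts.values())
    p = defaultdict(float)
    for doc in counts.values():
        if z in doc:
            weight = doc[z] / n_z      # share of z's assignments on this document
            n_d = sum(doc.values())
            for t, n in doc.items():
                p[t] += weight * n / n_d
    return p

def jsd(p, q):
    """Jensen-Shannon divergence (bits) between two term distributions."""
    total = 0.0
    for t in set(p) | set(q):
        m = 0.5 * (p.get(t, 0.0) + q.get(t, 0.0))
        for x in (p.get(t, 0.0), q.get(t, 0.0)):
            if x > 0:
                total += 0.5 * x * log2(x / m)
    return total

def align(source_prefix, target_prefix):
    """Map each source term to the closest term of the other vocabulary."""
    terms = {t for doc in counts.values() for t in doc}
    dists = {t: distribution(t) for t in terms}
    targets = sorted(t for t in terms if t.startswith(target_prefix))
    return {
        s: min(targets, key=lambda t: jsd(dists[s], dists[t]))
        for s in sorted(terms) if s.startswith(source_prefix)
    }

print(align("tag:", "cat:"))
print(align("cat:", "tag:"))
```

In this toy collection the programming documents and the cooking documents share no terms, so `tag:python` is mapped to `cat:Programming languages` and `tag:food` to `cat:Cooking`.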
Experiment I • Mapping between Teleblik keywords and user tags • Educational videos. • Professional keywords from public broadcasting archive. • Tags assigned in an experiment by high school students. • Data • 100 videos • 12,414 tags • 4,348 different tags • 269 different keywords
Experiment II • Mapping between del.icio.us tags and Wikipedia categories • Del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.) • Data • 58,345 Wikipedia articles • 500,618 tags and category annotations • 42,425 different Wikipedia categories • 49,603 different tags • Mappings computed for tags occurring on at least 10 documents. • Mappings for 2,355 tags • Mappings for 1,827 categories • Using co-occurrence data with all 49,603 tags/categories
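The frequency cut-off mentioned on this slide (terms occurring on at least 10 documents) amounts to a document-frequency filter. A minimal sketch follows; the function name `frequent_terms` and the input layout are hypothetical, not the authors' actual preprocessing.

```python
from collections import Counter

def frequent_terms(annotations, min_docs=10):
    """Return the terms that occur on at least `min_docs` distinct documents.

    `annotations` maps a document id to an iterable of the terms assigned to it.
    """
    doc_freq = Counter()
    for terms in annotations.values():
        doc_freq.update(set(terms))  # count each term once per document
    return {t for t, n in doc_freq.items() if n >= min_docs}

# Hypothetical toy collection: "wiki" is on 10 documents, "rare" on one.
annotations = {f"d{i}": ["wiki", "rare"] if i == 0 else ["wiki"] for i in range(10)}
print(frequent_terms(annotations))
```

The filter only decides which terms receive a mapping; as the slide notes, the co-occurrence data itself still uses all tags and categories.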
Evaluation of mapping • Manual evaluation • Classification of a sample of mappings into: • Broader term • Narrower • Related term • Unrelated • Source term is not a keyword (e.g. “to read”) • Meaning unknown
Distance vs. mapping quality • Pairs with a small distance are evaluated better than pairs with large distance. • Evaluation of mappings with smallest and largest distance • a) Categories to tags • b) Tags to categories
Effect of keyword frequency • No correlation between keyword frequency and divergence with best mapping found.
Comparison with Jaccard-coefficient • Evaluation of mapping using two different distance measures. • Categories broader, narrower and related are merged • Results for • a) Categories to tags • b) Tags to categories
Discussion of results • Method works very well in the test • Good mapping results • Distance is a good indication of quality • Insensitive to frequency (up to a certain degree) • Better than Jaccard, because it uses: • co-occurrence with other tags (‘tag context’) • frequency with which a tag is assigned to a document. • Frequency information is typical for user-generated tags. • We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies). • Distance measure also works well for clustering tags.
Future work • Evaluating relatedness using external sources (e.g. WordNet) • Compare to other distance measures • We used documents annotated completely according to two annotation schemes. • How large does the overlap have to be to obtain decent results? • We can create partial overlap of disjoint document sets by a partial identification of the keywords. • Detect asymmetry in relations (broader vs. narrower term)
Conclusion • Using co-occurrence patterns is a fruitful approach. • Frequent terms from folksonomies behave similarly to carefully assigned keywords, because a usage-based similarity measure yields good mappings. • Folksonomy seems to work!