Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata

Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata Anon Plangprasopchok1, Kristina Lerman1, Lise Getoor2 1USC Information Sciences Institute 2Department of Computer Science, University of Maryland SIGKDD 2010 2010. 12. 17. Summarized and Presented by Kim Chung Rim, IDS Lab., Seoul National University

Contents Introduction Goal Challenges Learning Folksonomies SAP: Growing a Tree Evaluation Conclusion & Discussion

Introduction • The social Web has changed the way people create and use information • Users organize their content via tags • Sites such as Flickr, Del.icio.us and YouTube allows users to tag their contents with descriptive keywords

Social Metadata • captures collective knowledge of Social Web users • Once mined from many users, • Such collective knowledge will add a rich semantic layer to the content of Social Web • Could be used for information discovery

Structured Social Metadata • Some Social Websites allow users to organize tags hierarchically • Delicious • Users can group related tagsinto bundles • Flickr • Users can group related photos into sets and related sets into collections

Folksonomy • Folksonomy – a common taxonomy that is learned from social metadata. • There exist predefined hierarchies such as WordNet or Linnaean classification system. • Benefits of Folksonomy over already existing hierarchies • represent collective agreement of many individuals • Are relatively inexpensive to obtain • Can adapt to evolving vocabularies and community’s information needs • Directly tied to the annotated content

Goal animal animal fish cat dog fish cat dog reptile • dog • animal biggle jindo biggle jindo reptile cat • The goal of this paper is to • Extract users’ knowledge (folk knowledge) from user-generated semantics • To build up a tree of communal taxonomy

Challenges • Sparseness • Social metadata is very sparse – only 3.74 tags per photo • Nosiy vocabulary • Spelling errors and meaningless tags – not sure, mykid • Ambiguity • Individual tag is often ambiguous – jaguar

Challenges • Structural noise and conflicts • Users use different term to build their own hierarchy • Varying granularity level • Users’ level of expertise and expressiveness differ

Learning Folksonomies - intuition animal • animal animal fish cat dog reptile cat fish cat dog reptile animal • dog animal fish cat dog biggle jindo fish cat dog biggle jindo roots with similar term should be aggregated horizontally A Root and a leaf with similar term should be aggregated vertically

Learning Folksonomy • Relational Clustering • Two nodes should be clustered if their similarity is above a certain threshold  • Similarity is computed using local similarity & structural similarity • localSim(a,b) measures the similarity of node names and their tag distribution • structSim(a,b) measures the similarity of leaves •  is a weight for adjusting contributions from localSim and structSim( )

Learning Folksonomy - terms • A personal hierarchy is called a sapling • A sapling is composed of a root node and its leaves • Tag statistic of a node x is defined as • Where and are tag and its frequency • is aggregated from all • = {(canada, 1), (vacation, 1), … } • = {(bc,1), (canada, 2), (vacation, 1), … }

Local Similarity • The local similarity of nodes a and b is composed of • Name similarity • Tag distribution similarity • nameSim can be any string similarity metric, which returns a value from 0 to 1 • tagSim can be any function for measuring the similarity of distributions – in this paper, tagSim counts the number of common tags n, in the top K tag of a and b. if n ≥ J (a certain threshold), it returns 1, else it returns n/J.

Structural Similarity • Compute structural similarity in two cases • Similarity between two root nodes • Similarity between a root node and a leaf node

Root-to-Root Similarity • Similarity based on • How many of their children have common name • The tag distribution similarity of those that do not have the same name • Where , • is an aggregation of tag distributions of all

Root-to-Leaf Similarity • Takes into account when a sapling contains the same leaf as the compared node • When merging UK->{Scotland, Glasgow, Edinburgh, London} and Scotland->{Glasgow, Shetland}, it can be said that Scotland and UK are similar because UK contains a shortcut to Glasgow • Measures overlap between siblings of and children of

Relational Clustering animal • animal animal fish cat dog reptile cat fish cat dog reptile • When the nodeSim(a,b) exceeds a certain threshold, two nodes are clustered(merged)

SAP: Growing a Tree canada canada canada canada canada victoria victoria victoria victoria Horizontal aggregation victoria Vertical aggregation victoria Using Incremental Relational Clustering,

Handling Problems • Shortcuts • Take the longer route • Noisy vocabularies • If a name is used by less than 1% of the user who contributed in the sapling, it is considered as noisy and discarded

Evaluation • 20,759 saplings from 7,121 users are selected from Flickr • Values for several parameters are chosen

Evaluation Methodologies • Against the reference hierarchy (ODP) • Taxonomic Overlap – measuring structure similarity between two trees. Measures how many ancestor and descendant nodes overlap to those in the reference tree for each node and take the harmonic mean (fmTO) • Lexical Recall – measuring how well an approach can discover concepts in the reference hierarchy (LR) • Structural evaluation • Area Under Tree – a measurement to define how bushy and deep a tree is (AUT) • Manual evaluation (for concepts not in ODP) • Asking 5 judges whether the path is correct

Evaluation metric - AUT animal fish cat dog biggle jindo • AUT for below tree is: • 1*(3+1)/2 + 1*(3+2)/2 = 4.5

Baseline of comparison • SIG – past work of the authors • Assume nodes having the same name refer to the same concept • Keep the relations between node pairs • Combine all relations into a tree

Results

Results Simple comparison

Some of the Learned Folksonomies

Conclusion & Discussion This work can create more accurate and more detailed folksonomies than the current state-of-the-art approach by exploiting structural information This work is more scalable – it incrementally grows the folksonomies Lacks comparison with other state-of-the-art approaches

Thank you

Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata