Explore methods of inducing hierarchy from user-contributed tags, including pattern-based clustering, contextual features, and tag hierarchy challenges. Compare graph-based and hierarchical clustering approaches and propose improvements for tree deduction and taxonomy evolution.
Tag Cube: Acquiring Latent Conceptual Structures from Folksonomy Data Junjie Yao, Yuxin Huang 2010-05-13
Delicious Snapshot • Design blog software music tools reference art video programming webdesign web2.0 mac howto linux tutorial web free news photography shopping blogs css imported education travel javascript food games • Development inspiration politics flash apple tips java google osx business windows iphone science productivity books toread health funny internet wordpress ajax ruby research humor fun technology search opensource • Photoshop media recipes cool work article marketing security mobile jobs rails lifehacks tutorials resources php social download diy ubuntu freeware portfolio photo movies writing graphics youtube audio online
[Figure: users produce, annotate, organize, discover, and consume web content; the user-contributed description metadata is leveraged for classification, organization, search, and recommendation]
Automatic Taxonomy Induction • Decomposed into two subtasks: • term extraction • relation formation • Relation formation is the focus of most research, with two families of approaches: pattern-based and clustering-based. • Clustering-based approaches cannot generate relations as accurately as pattern-based ones. • Their performance is largely influenced by the types of features used.
Clustering features • Contextual • Co-occurrence • Syntactic dependency Usually, we need the Is-a and sibling relations
Tree, relation construction • Directed Maximum Spanning Tree. Or Prefix Tree. • Hierarchical Agglomerative Clustering. • Pachinko Allocation models. LDA based extension algorithm.
Tag hierarchy challenges: • Sparse and short context: • compared to textual documents • Low quality and noise: • Typo, synonym, describer instead of categorizer • Scalability and data size.
Inducing Hierarchy from Tags • Existing approaches • Graph based [Mika05] • build a network of associated tags (node = tag, edge = co-occurrence of tags) • suggest applying betweenness centrality and set theory to determine broader/narrower relations • Hierarchical Clustering [Brooks06; Heymann06+] • Tags that appear more frequently are likely to have higher centrality and thus to be more abstract. • Probabilistic subsumption [Sanderson99+; Schmitz06] • x is broader than y if x subsumes y • x subsumes y if p(x|y) > t and p(y|x) < t
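The subsumption rule above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: it assumes each bookmarking post is a set of tags, estimates p(x|y) as the fraction of posts containing y that also contain x, and uses an arbitrary threshold t = 0.8. The function name `subsumption_pairs` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def subsumption_pairs(posts, t=0.8):
    """Return sorted (broader, narrower) pairs where x subsumes y,
    i.e. p(x|y) > t and p(y|x) < t, estimated from post co-occurrence."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in posts:
        unique = set(tags)
        tag_count.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_count[(a, b)] += 1

    relations = []
    for (a, b), n_ab in pair_count.items():
        # Test both directions of every co-occurring pair.
        for x, y in ((a, b), (b, a)):
            p_x_given_y = n_ab / tag_count[y]
            p_y_given_x = n_ab / tag_count[x]
            if p_x_given_y > t and p_y_given_x < t:
                relations.append((x, y))
    return sorted(relations)


# Toy data (assumed, for illustration only).
posts = [
    {"browser", "firefox"},
    {"browser", "chrome"},
    {"browser", "firefox", "web"},
    {"browser"},
]
print(subsumption_pairs(posts))
```

On this toy data the rule also reports "firefox" as broader than "web", simply because "web" is rare. That kind of frequency-driven error is what the next slide discusses.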
Difficulties… • Some difficulties when using tags to induce hierarchy • Notation: A → B (A is broader than B, i.e., a hypernym relation) • Example pairs: Washington → United States; Car → Automobile; Obama → President; Chrome → Browser; Insect → Hongkong; Color → Brazilian • Issues illustrated: specificity and rarity; frequency and abstractness; tags are from different facets • The above relations were induced using the subsumption approach on tags [Sanderson99+, Schmitz06]
Our Previous Work • Graph based representation • build a graph of associated tags (node = tag, edge = co-occurrence of tags in bookmarking actions) • Apply association rules and confidence levels to determine broader/narrower relations • Greedy algorithm for tree induction. • Evaluation • Empirical; lacks ground truth • Temporal Fusion • Consecutive smoothing.
Review Comment Positive: • Interesting work, evolutionary aspect • Remarkable data size Negative: • heuristic contribution of apparent marginal value • weak evaluation, non-convincing at this point Overall: solution is intuitively reasonable but heuristic. The technical depth of this work is limited and the evaluation is insufficient to support the proposed solution.
Proposed Improvements • Rich similarity measurements • Better partial order determination • Efficient and accurate tree deduction • Reasonable evaluation and practical application scenarios • Incorporate burst detection into taxonomy evolution?
Why cubes? • OLAP, Text Cube • Cube organization and searching in folksonomy/social media? • Multiple categories and roll-up/drill-down characteristics differ from a single taxonomy.
Similarity Measurements • Co-action • <user,resource> • Co-resource • <resource> • Co-user • <user> • Clustering or pLSA to cope with sparse features.
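The three contexts above can be compared with a small sketch. It assumes tagging data arrives as (user, resource, tag) triples, represents each tag as a count vector over the chosen context, and uses cosine similarity; the slide does not fix the similarity function, so cosine is an assumption here, as are the function names and toy data.

```python
from collections import Counter
from math import sqrt


def context_vectors(triples, mode):
    """Build one count vector per tag over the chosen context:
    'co-action' -> <user, resource>, 'co-resource' -> <resource>,
    'co-user' -> <user>."""
    key = {
        "co-action": lambda user, resource: (user, resource),
        "co-resource": lambda user, resource: resource,
        "co-user": lambda user, resource: user,
    }[mode]
    vectors = {}
    for user, resource, tag in triples:
        vectors.setdefault(tag, Counter())[key(user, resource)] += 1
    return vectors


def cosine(v, w):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0


# Toy tagging triples (assumed, for illustration only).
triples = [
    ("u1", "r1", "firefox"),
    ("u1", "r1", "browser"),
    ("u2", "r1", "browser"),
    ("u2", "r2", "chrome"),
    ("u2", "r2", "browser"),
    ("u1", "r3", "chrome"),
]
for mode in ("co-action", "co-resource", "co-user"):
    vecs = context_vectors(triples, mode)
    print(mode, round(cosine(vecs["firefox"], vecs["chrome"]), 3))
```

In this toy data "firefox" and "chrome" never share a resource or a bookmarking action, so only the co-user context gives them non-zero similarity. Coarser contexts are less sparse, which is the motivation for clustering or pLSA on the finer ones.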
Partial Order Determination • Define new ontology metrics to replace frequency. • Take various aspects into consideration: • Resource coverage • E.g. browser, firefox, chrome • Temporal duration • E.g. president, obama • Abstractness • Terms appearing in tags, corresponding resource description. • Pair training • Graph centrality? • Term Association
Tree Deduction • Given the tagging terms and the similarity/partial-order relations, deduce the underlying taxonomy: • Prefix tree or incremental clustering. • The optimization objective: • Minimize the number of conflicting order pairs and the weighted layer penalty. • Iteratively update until convergence. • Incrementally expand the taxonomy tree.
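One possible reading of this objective, sketched under explicit assumptions: the slide does not define "conflict" or "layer penalty", so here a (broader, narrower) pair counts as a conflict when both tags are in the tree and the broader tag is not an ancestor of the narrower one, and the layer penalty is a hypothetical `layer_weight` times the total node depth. Only the incremental expansion step is shown; the iterative update is omitted.

```python
def ancestors(parent, node):
    """Ancestors of node in a child -> parent map, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def cost(parent, order_pairs, layer_weight=0.1):
    """Assumed objective: number of violated order pairs plus weighted depth."""
    in_tree = set(parent) | set(parent.values())
    conflicts = sum(
        1
        for broader, narrower in order_pairs
        if broader in in_tree
        and narrower in in_tree
        and broader not in ancestors(parent, narrower)
    )
    layer_penalty = layer_weight * sum(len(ancestors(parent, n)) for n in parent)
    return conflicts + layer_penalty


def expand(parent, root, new_tag, order_pairs, layer_weight=0.1):
    """Greedily attach new_tag under the existing node that minimizes cost."""
    candidates = [root] + list(parent)
    best = min(
        candidates,
        key=lambda p: cost({**parent, new_tag: p}, order_pairs, layer_weight),
    )
    parent[new_tag] = best
    return best


# Toy partial-order pairs (broader, narrower), assumed for illustration.
order_pairs = [
    ("web", "browser"),
    ("browser", "firefox"),
    ("browser", "chrome"),
    ("web", "firefox"),
]
taxonomy = {}
for tag in ("browser", "firefox", "chrome"):
    expand(taxonomy, "web", tag, order_pairs)
print(taxonomy)
```

The layer penalty keeps the greedy step from stacking tags into one deep chain when a shallower placement satisfies the same order pairs.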
Evaluation Under Ground Truth • It’s easier to compare when specifying a “root concept” and “leaf concepts”, i.e., specifying a certain subtree to compare. • [Figure: reference hierarchy (ODP); relations (right after tokenization); induce (remove noise + link); induced hierarchy]
Metrics • Taxonomic Overlap [adapted from Maedche02+] • measures structural similarity between two trees • for each node, determines how many ancestor and descendant nodes overlap with those in the reference tree. • Lexical Recall • measures how well an approach can discover the concepts that exist in the reference hierarchy (coverage)
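Both metrics can be sketched as follows. The slide does not give the exact adaptation of taxonomic overlap, so this version is an assumption: for each concept present in both trees it takes the Jaccard overlap of its ancestor-plus-descendant sets and averages the result. Trees are child -> parent maps; all names and the toy hierarchies are hypothetical.

```python
def concepts(parent):
    """All nodes of a tree given as a child -> parent map."""
    return set(parent) | set(parent.values())


def ancestors(parent, node):
    """Set of all ancestors of node."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def related(parent, node):
    """Ancestors and descendants of node (the node itself is excluded)."""
    descendants = {n for n in parent if node in ancestors(parent, n)}
    return ancestors(parent, node) | descendants


def lexical_recall(induced, reference):
    """Fraction of reference concepts that the induced tree contains."""
    ref = concepts(reference)
    return len(concepts(induced) & ref) / len(ref)


def taxonomic_overlap(induced, reference):
    """Mean Jaccard overlap of ancestor/descendant sets over shared concepts."""
    shared = concepts(induced) & concepts(reference)
    if not shared:
        return 0.0
    scores = []
    for node in shared:
        a, b = related(induced, node), related(reference, node)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy hierarchies (assumed, for illustration only).
reference = {"browser": "web", "firefox": "browser", "chrome": "browser", "css": "web"}
induced = {"browser": "web", "firefox": "browser", "chrome": "web"}
print(lexical_recall(induced, reference))
print(round(taxonomic_overlap(induced, reference), 3))
```

Here the induced tree misses "css" (lowering recall) and attaches "chrome" one level too high (lowering overlap), so the two metrics capture different kinds of error.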
Application Scenario • Organize and represent resources into the tag cube. • A new way of similarity measurement based on tree distance. • Resource search & recommendation
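A tree-distance similarity could look like the sketch below. It assumes the taxonomy is a child -> parent map, measures distance as the number of edges on the path through the lowest common ancestor, and converts it to a similarity with 1 / (1 + distance). That conversion is an illustrative choice, not something the slides specify.

```python
def path_to_root(parent, node):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(parent, a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(parent, a))}
    for steps_from_b, n in enumerate(path_to_root(parent, b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def tree_similarity(parent, a, b):
    """Assumed mapping from distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + tree_distance(parent, a, b))


# Toy taxonomy (assumed, for illustration only).
tag_tree = {"browser": "web", "firefox": "browser", "chrome": "browser", "css": "web"}
print(tree_distance(tag_tree, "firefox", "chrome"))
print(tree_distance(tag_tree, "firefox", "css"))
print(tree_similarity(tag_tree, "firefox", "firefox"))
```

Resources could then be compared through the tree positions of their tags, for example by averaging pairwise tag similarities.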
Evaluation • ODP dataset • The overlap between our Delicious dataset and ODP is around 50,000 pages. • Ground truth: page distance in the original ODP tree • Baseline 0: tag vector based similarity measurement • Baseline x: other taxonomy induction methods.
Cube Materialization • Tree compression • MDL principle.
References • [1] D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In Proc. of VLDB, pages 970–981, 2005. • [2] Y. Cao, H. Duan, C. Lin, Y. Yu, and H. Hon. Recommending questions using the MDL-based tree cut model. In Proc. of WWW, pages 81–90, 2008. • [3] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Int. Res., 24(1):305–339, 2005. • [4] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text cube: Computing IR measures for multidimensional text database analysis. In Proc. of ICDM, pages 905–910, 2008. • [5] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme. Evaluating similarity measures for emergent semantics of social tagging. In Proc. of WWW, pages 641–650, 2009. • [6] M. Sanderson and B. Croft. Deriving concept hierarchies from text. pages 206–213, 1999. • [7] H. Yang and J. Callan. A metric-based framework for automatic taxonomy induction. In Proc. of ACL, pages 271–279, 2009. • [8] X. Yin and S. Shah. Building taxonomy of web search intents for name entity queries. In Proc. of WWW, pages 1001–1010, 2010. • [9] Y. Yu, C. X. Lin, Y. Sun, C. Chen, J. Han, B. Liao, T. Wu, C. Zhai, D. Zhang, and B. Zhao. iNextCube: information network-enhanced text cube. Proc. VLDB Endow., 2(2):1622–1625, 2009.