Acclimatizing Taxonomic Semantics for Hierarchical Content Categorization. Lei Tang, Jianping Zhang and Huan Liu
Taxonomies and Hierarchical Models • Web pages can be organized as a tree-structured taxonomy (Yahoo!, Google directory). • Parental control: Web filters block children’s access to undesirable web sites. • Parents want accurate content categorization at different levels of granularity. • Service providers want to see the decision path behind each blocking/non-blocking decision so that they can fine-tune the filter. • Hierarchical model: exploits the taxonomy in the classification strategy or in the loss function.
Quality of Taxonomy • Most hierarchical models use a predefined taxonomy, which is typically semantically sound. • A librarian is often employed to construct the semantic taxonomy. • Is a semantically sound taxonomy always good? • Subjectivity can result in different taxonomies. • Semantics change for specific data.
A Motivating Example • [Figure: placement of the category “Hurricane”. Normally: under Geography. During Katrina: associated with the Federal Emergency Management Agency, under Politics.]
A “Bayesian” View • Predefined taxonomy (prior knowledge): stagnant by nature. • Data: reflect the dynamic change of semantics. • The two can be inconsistent. • Goal: a data-driven taxonomy.
“Start from Scratch”: Clustering • Throw away the predefined taxonomy information and cluster based on the labeled data. • Two categories: divisive (top-down) or agglomerative (bottom-up). • Usually requires human experts to specify parameters such as the maximum height of the tree or the number of nodes in each branch. • These parameters are difficult to specify without looking at the data.
Optimal Hierarchy • Optimal hierarchy: the hierarchy with the largest likelihood. • How to estimate the likelihood? • A hierarchical model’s performance and the likelihood are positively related. • Use the hierarchical model’s performance statistics on a validation set to gauge the likelihood. • A brute-force approach that enumerates all taxonomies is infeasible.
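One hedged way to write the objective on this slide, assuming D denotes the labeled data and H ranges over candidate hierarchies with the same leaf categories (the symbols D and H_opt are assumptions, not taken from the slides):

```latex
H_{\mathrm{opt}} = \arg\max_{H} \; p(D \mid H)
```

In the approach described here, p(D | H) is not computed directly; the validation performance of a hierarchical classifier trained with H stands in for it.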
Constrained Optimal Hierarchy • A predefined taxonomy can help. • Assumption: the optimal hierarchy lies in the neighborhood of the predefined taxonomy H0. • Constrained optimal hierarchy H’ for H0: H’ results from applying a series of elementary operations to H0 until no further likelihood increase is observed.
Elementary Operations • ‘Promote’, ‘Demote’ and ‘Merge’ • All the leaf nodes remain unchanged. • [Figure: example hierarchies (H1) to (H4) over numbered nodes, illustrating the three operations.]
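The slide names the three operations without defining them. A minimal sketch, assuming these readings: ‘promote’ moves a node up to its grandparent, ‘demote’ moves a node under one of its internal siblings, and ‘merge’ groups two siblings under a new internal node. The parent-map representation and all function names are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical representation: node name -> parent name (the root maps to None).
Tree = Dict[str, Optional[str]]


def leaves(tree: Tree) -> Set[str]:
    """Nodes that are nobody's parent."""
    parents = set(tree.values())
    return {n for n in tree if n not in parents}


def promote(tree: Tree, node: str) -> Tree:
    """Assumed semantics: `node` becomes a child of its grandparent."""
    parent = tree[node]
    if parent is None or tree[parent] is None:
        raise ValueError("node has no grandparent")
    new = dict(tree)
    new[node] = tree[parent]
    if parent not in new.values():
        # The old parent lost its only child; drop it so the leaf set is unchanged.
        del new[parent]
    return new


def demote(tree: Tree, node: str, sibling: str) -> Tree:
    """Assumed semantics: `node` becomes a child of an internal sibling."""
    if node == sibling or tree[node] != tree[sibling]:
        raise ValueError("demote needs two distinct siblings")
    if sibling not in tree.values():
        raise ValueError("sibling must be an internal node")
    new = dict(tree)
    new[node] = sibling
    return new


def merge(tree: Tree, a: str, b: str, new_name: str) -> Tree:
    """Assumed semantics: siblings `a` and `b` get a new common parent."""
    if a == b or tree[a] != tree[b]:
        raise ValueError("merge needs two distinct siblings")
    if new_name in tree:
        raise ValueError("new node name already in use")
    new = dict(tree)
    new[new_name] = tree[a]
    new[a] = new_name
    new[b] = new_name
    return new


example = {"root": None, "A": "root", "B": "root",
           "1": "A", "2": "A", "3": "B", "4": "B"}
```

Each function returns a new parent map and leaves its input untouched, so every neighbor of a hierarchy can be generated from the same starting point.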
Search in Hierarchy Space • Given a predefined taxonomy, find its best constrained optimal hierarchy. • Search in the hierarchy space. • [Figure: hierarchy space around H0, with hierarchies H01 to H04, H11 to H13, H21 to H24 and H31 to H33.]
Finding Best COH • Greedy search. • At each step, follow the track with the largest likelihood increase to search for the best hierarchy. • [Figure: hierarchy space around H0, with hierarchies H01 to H04, H11 to H13, H21 to H24 and H31 to H33.]
Framework (a wrapper approach) • Given: H0, training data T, validation data V. • Step 1: Generate the neighbor hierarchies of H0. • Step 2: For each neighbor hierarchy, train hierarchical classification models on T. • Step 3: Evaluate the hierarchical classifiers on V. • Step 4: Pick the best neighbor hierarchy as the new H0. • Step 5: Repeat from step 1 until no improvement is observed.
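The loop above can be sketched as a generic greedy search. This is a minimal sketch, not the authors' implementation: `neighbors` and `score` are hypothetical callables supplied by the caller, where `score(h)` is assumed to train a hierarchical classifier with hierarchy `h` on T and return its performance on V.

```python
from typing import Callable, Iterable, Tuple, TypeVar

H = TypeVar("H")  # any hierarchy representation


def greedy_search(
    h0: H,
    neighbors: Callable[[H], Iterable[H]],
    score: Callable[[H], float],
) -> Tuple[H, float]:
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    best, best_score = h0, score(h0)
    while True:
        # Steps 1 to 3: generate the neighbors and evaluate each one.
        candidates = [(score(h), h) for h in neighbors(best)]
        if not candidates:
            return best, best_score
        top_score, top = max(candidates, key=lambda c: c[0])
        # Step 5: stop when the best neighbor is no better than the current one.
        if top_score <= best_score:
            return best, best_score
        # Step 4: the best neighbor becomes the new starting hierarchy.
        best, best_score = top, top_score


# Toy stand-in: "hierarchies" are integers and the score peaks at 3.
result = greedy_search(0, lambda h: [h - 1, h + 1], lambda h: -(h - 3) ** 2)
```

Because every call to `score` implies training and evaluating a classifier, the cost of one search step grows with the number of neighbors, which is why the following slides restrict how neighbors are generated.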
Hierarchy Neighbors • Elementary operations can be applied to any node in the tree. • The number of neighbors of a hierarchy can be huge. • Most operations would be evaluated repeatedly. • [Figure: two hierarchies, H1 and H2, that differ only in one node (2 versus 2’).]
Finding Neighbors • In each search step, check nodes one by one rather than all nodes at the same time. • ‘Merge’ and ‘Demote’ only consider the node most similar to the current one. • Nodes at higher levels have a larger effect on classification. • Top-down traversal: generate neighbors by applying all possible elementary operations to the shallowest nodes first.
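A minimal sketch of the two restrictions above: visiting nodes shallowest-first, and pairing each node only with its most similar sibling. The children-map representation and the `similarity` callable are assumptions; the slides do not say which similarity measure is used.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Sequence


def top_down_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Breadth-first order of all non-root nodes, shallowest first."""
    order: List[str] = []
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


def most_similar_sibling(
    node: str,
    siblings: Sequence[str],
    similarity: Callable[[str, str], float],
) -> Optional[str]:
    """The single sibling considered for 'merge' or 'demote' with `node`."""
    others = [s for s in siblings if s != node]
    if not others:
        return None
    return max(others, key=lambda s: similarity(node, s))


# Hypothetical toy hierarchy.
children = {"root": ["A", "B"], "A": ["1", "2"], "B": ["3"]}
order = top_down_order(children, "root")  # ['A', 'B', '1', '2', '3']
```

Limiting ‘merge’ and ‘demote’ to one partner per node keeps the number of candidate hierarchies per node small instead of growing with the number of siblings.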
Further Consideration • If a node is improperly placed under a parent, we need to ‘promote’ it first. • Two types of top-down traversal: • one that uses only the ‘Promote’ operation to generate neighbors; • one that uses only the ‘Demote’ and ‘Merge’ operations to generate neighbors. • Repeat the two-traversal procedure until no improvement is observed. • [Figure: tree with nodes Root, Geography, Politics and Hurricane.]
Experiment Setting • 10-fold cross-validation. • Naïve Bayes classifier (multinomial). • Information gain is used to select features. • Because documents are scarce in each class, the training data are also used to estimate the likelihood of a hierarchy.
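For reference, a minimal sketch of information gain for a single binary feature (term present or absent) with respect to the class labels. The slides do not give the exact formulation used in the experiments, so the binary-feature form and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the label distribution."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(has_term: Sequence[bool], labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy after splitting documents on term presence."""
    with_term = [y for x, y in zip(has_term, labels) if x]
    without_term = [y for x, y in zip(has_term, labels) if not x]
    n = len(labels)
    conditional = sum(
        (len(part) / n) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional


labels = ["soc", "soc", "kids", "kids"]
perfect = information_gain([True, True, False, False], labels)   # term separates the classes
useless = information_gain([True, False, True, False], labels)   # term carries no class signal
```

Feature selection would then keep the terms with the highest scores; how many terms are kept is not stated in the slides.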
Data Sets • Data: Soc and Kids. • Human-labeled web pages with a predefined taxonomy.
Over-fitting? • As we optimize the hierarchy just based on training data, it’s possible to over-fit the data.
Robust Method • Instead of multiple traversals (iterations), run the two-traversal procedure only once.
Conclusions • A semantically sound taxonomy does not necessarily lead to good classification performance. • Given a predefined taxonomy, we can adapt it into a data-driven taxonomy for more accurate classification. • The taxonomy generated by our method outperforms both the human-constructed taxonomy and the taxonomy generated by “starting from scratch”.
Future Work • This is initial work on combining “noisy” prior knowledge with data. • How can we implement an efficient filter model that finds a good taxonomy by exploiting the predefined taxonomy? • Feature selection could reduce the difference between taxonomies. How can the taxonomy information be used for feature selection?
Questions? Thanks!