Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet
Svetlana Strunjaš-Yoshikawa
Joint with Fred Annexstein and Kenneth Berman
{strunjs,annexste,berman}@ececs.uc.edu
University of Cincinnati
Introduction
• Consider the Lowest Common Ancestor Query Problem
• Find the most specific common generalization, or least common subsumer, among 2 or more terms or attributes in large hierarchical or classification data sets
• Constraint: evaluate queries without indirection
• Goal: compact labeling schemes for taxonomies
Introduction (cont'd)
• Applications
- Fast classification of sets and similarity, e.g., prediction sets similar to Google Sets (given "Bush" and "Clinton", it predicts all other US presidents)
- Fast answers to ancestor queries in XML search, e.g., testing whether 2 terms share a parent node without loading the XML file (see [1], [2])
- Fast navigation through voluminous web taxonomies (see [3])
Data Model
• Structural properties found in well-known web taxonomies:
- large variance in out-degree (Δ), i.e., some nodes have many subclasses
- small range and variance of in-degree (δ)
- small (logarithmic) depth (σ)
- small number (>1) of paths from the root
• See the paper for a table of statistical values for the WordNet, ODP, and Math taxonomies
Our Approach
• Given: large, rooted web taxonomies, represented abstractly as a directed acyclic graph (DAG) with the above statistics
• Problem: label each node of the DAG so that all local path information for each taxonomy element is preserved in the encoding
• Our labeling scheme is a variable-length, prefix-based scheme, built in two stages
Our Approach (cont'd)
1. Greedy Dewey Labeling for Trees (TGDL)
- Identifies a breadth-first tree T in the DAG
- Encodes path information for the paths in T
- Labels each node with the concatenation of the edge labels on its path
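The first stage can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy `dag`, and the choice to number children in adjacency-list order are all assumptions, and the edge-label sequence 0, 1, 00, 01, 10, 11, ... is inferred from the ULE table and the FBLE example in this deck. Labels are kept as tuples of edge labels so that the delimiter question, treated on the following slides, stays separate.

```python
from collections import deque

def gdl_edge_label(k):
    """k-th shortest binary string (k = 1, 2, ...): 0, 1, 00, 01, 10, 11, 000, ...
    Assumption: this enumeration is inferred from the slides' examples."""
    return bin(k + 1)[3:]  # binary of k+1 with the '0b' prefix and leading 1 removed

def bfs_tree(dag, root):
    """Return {node: [tree children]} for a breadth-first tree of the DAG."""
    children = {root: []}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in dag.get(u, []):
            if v not in children:  # first visit: (u, v) becomes a tree edge
                children[v] = []
                children[u].append(v)
                queue.append(v)
    return children

def tgdl_labels(dag, root):
    """Label each node with the tuple of edge labels on its tree path from the root."""
    children = bfs_tree(dag, root)
    labels = {root: ()}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        # Assumption: children are numbered in adjacency-list order.
        for k, v in enumerate(children[u], start=1):
            labels[v] = labels[u] + (gdl_edge_label(k),)
            queue.append(v)
    return labels

def tree_lca_label(a, b):
    """Longest common prefix of two labels: the label of their lowest common ancestor in T."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# Hypothetical toy taxonomy; ("pet", "dog") ends up as a non-tree edge.
dag = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat", "pet"],
    "plant": ["tree"],
    "pet": ["dog"],
}
labels = tgdl_labels(dag, "entity")
print(labels["pet"])                                   # edge labels on the path entity -> animal -> pet
print(tree_lca_label(labels["dog"], labels["cat"]))    # label of "animal"
```

Because a node's label extends its tree parent's label, a lowest-common-ancestor query within T reduces to a longest-common-prefix comparison of two labels, with no lookup in the graph itself.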
Analysis of the Length for TGDL Labels
• Performed in 2 steps
• First step: assume that the delimiting labels are empty; each node v is then labeled with at most [formula missing in source] bits
• Second step: estimate the upper bound on node label length under different edge delimiting schemes
Delimiting schemes
• Delimiters encode the length of each tree-edge label
• Two approaches tested:
- Unary Length Encoding
- Fixed Binary Length Encoding
Unary Length Encoding (ULE)
• Comparable to the Elias gamma code:
  n   Gamma     ULE
  1   1         10
  2   010       11
  3   011       0100
  4   00100     0101
  5   00101     0110
  6   00110     0111
  7   00111     001000
  8   0001000   001001
• ULE prepends a zero prefix of |e|-1 bits to an edge label e whose GDL label has length |e|
• In the table, each ULE code consists of the zero prefix, a 1 bit, and then the label bits
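The table can be reproduced with a short sketch. It assumes, based only on the rows of the table, that the n-th GDL edge label is the n-th shortest binary string (0, 1, 00, 01, ...) and that the zero prefix is terminated by a single 1 bit; the function names are illustrative and not taken from the paper.

```python
def elias_gamma(n):
    """Elias gamma code of n >= 1: len(bin(n)) - 1 zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gdl_edge_label(k):
    """Assumed GDL edge label of the k-th child: 0, 1, 00, 01, 10, 11, 000, ..."""
    return bin(k + 1)[3:]

def ule(k):
    """ULE code of the k-th edge label: |e|-1 zeros, a terminating 1, then e itself."""
    e = gdl_edge_label(k)
    return "0" * (len(e) - 1) + "1" + e

def ule_decode(bits):
    """Split a concatenation of ULE codes back into edge labels.
    Assumes well-formed input; no error handling for truncated strings."""
    labels, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":   # count the unary zero prefix
            zeros += 1
            i += 1
        i += 1                  # skip the terminating 1
        length = zeros + 1      # the prefix has |e| - 1 zeros
        labels.append(bits[i:i + length])
        i += length
    return labels

for n in range(1, 9):
    print(n, elias_gamma(n), ule(n))
```

Under these assumptions the ULE code of n is the gamma code of n+1 with one leading zero removed, which is why the two columns have similar lengths.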
Unary Length Encoding (ULE) Analysis
Theorem: For an arbitrary node v in a tree T, the upper bound on the TGDL label length with ULE delimiters is [formula missing in source] bits, where
- [symbol missing in source] is the depth of v in T
- n is the number of nodes in T
Fixed Binary Length Encoding (FBLE)
• For an edge e, the delimiter is the binary representation of the length of GDL(e)
• The delimiter is encoded with a fixed number of bits [formula missing in source]
- Δ is the maximum node out-degree in T
- 4 bits are used in our application
FBLE example
- 4 bits encode the delimiters for any T with maximum out-degree < 2^16
- Let e be an edge in T with a given GDL label, e.g., GDL(e) = 0000111111
- The label has length 10, so FBLE produces the delimiter 1010, and the label for e is 10100000111111
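The example can be checked with a small sketch. The function names and the `width` parameter are illustrative assumptions; the 4-bit default follows the slide.

```python
def fble_encode(edge_label, width=4):
    """Prefix an edge label with its length, written in `width` binary digits."""
    length = len(edge_label)
    if not 0 < length < 2 ** width:
        raise ValueError("label length does not fit in the delimiter")
    return format(length, f"0{width}b") + edge_label

def fble_decode(bits, width=4):
    """Split a concatenation of FBLE-delimited edge labels back into the labels.
    Assumes well-formed input."""
    labels, i = [], 0
    while i < len(bits):
        length = int(bits[i:i + width], 2)  # fixed-width length field
        i += width
        labels.append(bits[i:i + length])
        i += length
    return labels

print(fble_encode("0000111111"))  # the slide's example: delimiter 1010, then the label
print(fble_decode(fble_encode("0") + fble_encode("01")))
```

A 4-bit delimiter covers label lengths 1 through 15. If edge labels are assigned as the shortest available binary strings, as the examples in this deck suggest, that gives 2 + 4 + ... + 2^15 = 2^16 - 2 distinct labels per node, which matches the out-degree limit stated on the slide.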
Fixed Binary Length Encoding (FBLE) Analysis
• For an arbitrary node v in a tree T, the upper bound on the TGDL label length with FBLE delimiters is [formula missing in source] bits
Our Approach (cont'd)
2. Extended Greedy Dewey Labeling for DAGs (EGDL)
- Augments the codes generated in step 1
- Used for inferring paths that are not part of the breadth-first tree
- Adds the TGDL node label pairs of non-tree edges
EGDL Labeling - Example
[Figure: example EGDL labels .01*.0.01, .01*.0.0, .0.01*.0.01]
Experimental Results - Label Lengths
[Tables: encoding length; WordNet 2.1 statistics]
References
[1] Budanitsky, A., Hirst, G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, 2001.
[2] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[3] Christophides, V., Plexousakis, D. On Labeling Schemes for the Semantic Web. In Proceedings of the 12th International Conference on World Wide Web, pages 544–555, Budapest, Hungary.
[4] Abiteboul, S., Kaplan, H., Milo, T. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547–556, Washington, D.C., 2001.
[5] Strunjas-Yoshikawa, S., Annexstein, F., Berman, K. Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science, Merin, Czech Republic, January 21-27, 2006.