TOSS: An Extension of TAX with Ontologies and Similarity Queries

TOSS: An Extension of TAX with Ontologies and Similarity Queries • Edward Hung, Yu Deng, V.S. Subrahmanian • Department of Computer Science, University of Maryland, College Park

Outline • Introduction • Ontologies and Integration • Similarity Enhanced Ontology (SEO) • TOSS Algebra • Implementation and Experiments • Related Work

XML: EXtensible Markup Language • a markup language much like HTML, derived from SGML • designed to describe data for easy transmission and manipulation over the web • XML tags are not predefined → you are free to define your own tags

XML Example in DBLP • <?xml version="1.0"?> <inproceedings> <author>Paolo Ciancarini</author> <author>Andrea Giovannini</author> <author>Davide Rossi</author> <title>Mobility and Coordination for distributed Java Applications</title> <pages>402-426</pages> <year>1999</year> <booktitle>Advances in Distributed Systems</booktitle> </inproceedings> • [Figure callouts label a tag name and a value]

DBLP

SIGMOD

Motivating Examples and TAX • DBLP and SIGMOD bibliographies in XML • [Jagadish et al., TAX: A Tree Algebra for XML, in DBPL, 2001] • one of the best algebras developed for XML databases • selection, projection, product, etc. • uses pattern trees to find embeddings (matchings)

TAX: Pattern tree • A pattern tree is a pair P = (T, F), where T = (V, E) is an object-labeled and edge-labeled tree such that: • Each object in V has a distinct integer as its label • Each edge is labeled either pc (for parent-child) or ad (for ancestor-descendant) • F is a selection condition applicable to objects • [Figure legend: pc = parent-child, ad = ancestor-descendant; callouts label a tag name and a value]

TAX: Embedding • Suppose SDB is a semistructured database and P = (T, F) is a pattern tree. An embedding of P into SDB is a total mapping h: P → ⋃_{(V, E) ∈ SDB} V from the nodes of T to those in SDB s.t.: • h preserves the structure of T, i.e., whenever (u, v) is a pc (resp. ad) edge in T, h(v) is a child (resp. descendant) of h(u) in SDB • The image under the mapping h satisfies the selection condition F
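The embedding definition above can be sketched as a brute-force search. This is a minimal illustration, not the TAX or TOSS implementation: the `Node` class, the `embeddings` function, and the encoding of pattern edges as `(parent_label, child_label, kind)` triples are all assumptions made for this example.

```python
from itertools import product

class Node:
    """Hypothetical data-tree node: a tag, optional content, and children."""
    def __init__(self, tag, content=None, children=()):
        self.tag, self.content, self.children = tag, content, list(children)

def descendants(node):
    """Yield every proper descendant of node in preorder."""
    for child in node.children:
        yield child
        yield from descendants(child)

def embeddings(root, pattern_edges, condition):
    """Yield each total mapping h (pattern label -> data node) that
    respects every pc/ad edge and satisfies the selection condition.

    pattern_edges: list of (parent_label, child_label, kind),
                   with kind either "pc" or "ad".
    condition:     callable taking h and returning True or False.
    """
    labels = sorted({label for u, v, _ in pattern_edges for label in (u, v)})
    nodes = [root, *descendants(root)]
    # Try every assignment of data nodes to pattern labels (exponential,
    # acceptable only for a toy example).
    for image in product(nodes, repeat=len(labels)):
        h = dict(zip(labels, image))
        structure_ok = all(
            any(h[v] is c for c in h[u].children) if kind == "pc"
            else any(h[v] is d for d in descendants(h[u]))
            for u, v, kind in pattern_edges
        )
        if structure_ok and condition(h):
            yield h

paper = Node("inproceedings", children=[
    Node("author", "Paolo Ciancarini"),
    Node("author", "Andrea Giovannini"),
    Node("title", "Mobility and Coordination for distributed Java Applications"),
    Node("year", "1999"),
])
edges = [(1, 2, "pc")]
cond = lambda h: h[1].tag == "inproceedings" and h[2].tag == "author"
found = [h[2].content for h in embeddings(paper, edges, cond)]
```

Here `found` holds one author name per embedding of the two-node pattern. A real system would prune the search using the tree structure instead of enumerating all assignments.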

TAX: Witness tree • Each embedding h of a pattern tree P into SDB induces a witness tree of the embedding, denoted hSDB(P), defined as: • A node n of SDB is in the witness tree if n = h(u) for some node u in the pattern tree P • For any pair of nodes n, m in the witness tree, whenever m is the closest ancestor (among the nodes in the witness tree) of n in SDB, the witness tree contains the edge (m, n) • The witness tree preserves the order between nodes in SDB, i.e., for any two nodes m, n in hSDB(P), whenever m precedes n in the preorder enumeration of SDB, it does so in that of hSDB(P) as well

[Figure: a pattern tree, a data instance, and the witness trees of the embeddings. Legend: pc = parent-child, ad = ancestor-descendant; callouts label a tag name and a value]

TAX: Selection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and SL is any set of nodes. A selection query σ_{P,SL}(SDB) returns all witness trees w.r.t. pattern tree P and SDB. In addition, if a node n in SL appears in one of these witness trees, then all descendants of n are also added to that witness tree.

[Figure, shown in three builds: a pattern tree, a data instance, and the selection witness trees → selection result. Legend: pc = parent-child, ad = ancestor-descendant; callouts label a tag name and a value]

TAX: Projection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and PL is a projection list (a list of node labels appearing in P). A projection query π_{P,PL}(SDB) returns tree(s) consisting of all nodes n selected from SDB s.t. for every node n in the result, there exist some witness tree hSDB(P) and some n’ ∈ PL where h(n’) = n.

[Figure: projection example with a pattern tree and a data instance]

Product • The product of two instances (two sets of trees) contains, for each pair of trees (one from each instance), a tree whose root is a new node (called tax_prod_root) and whose children are the roots of the two trees in the pair. • [Figure: two instances combined by ×, each result tree rooted at tax_prod_root]
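The product can be sketched in a few lines. This is only an illustration: representing a tree as a `(label, children)` tuple and the function name `tax_product` are assumptions for this example, and the sample tags are placeholders.

```python
def tax_product(instance_a, instance_b):
    """Pair every tree of instance_a with every tree of instance_b
    under a fresh root labeled tax_prod_root.

    A tree is a hypothetical (label, children) tuple.
    """
    return [
        ("tax_prod_root", [tree_a, tree_b])
        for tree_a in instance_a
        for tree_b in instance_b
    ]

# Placeholder instances with empty trees, just to show the shape.
dblp = [("inproceedings", []), ("article", [])]
sigmod = [("article", [])]
pairs = tax_product(dblp, sigmod)
```

The result has one tree per pair, so its size is the product of the two instance sizes.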

SIGMOD Problems! DBLP

Problems • Lack of lexical semantics in answering queries • Find papers written by “J. Ullman”: • J.D. Ullman? Jeffrey Ullman? • similar values/tags • Find papers with at least one author from “U.S. government”: • U.S. Census Bureau? U.S. Army? • values/tags with relationships described by ontologies

Problems • TAX returns correct results → high precision • but often misses some correct results → poor recall • Quality = (recall × precision)^(1/2) → low quality • Goal of our TOSS system: • extend and enhance the semantics of TAX to return high-quality answers using ontology and similarity measures
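As a worked illustration of the quality measure, the sketch below computes precision, recall, and their geometric mean from two sets of answers. The function name `quality` and the paper identifiers are made up for this example.

```python
import math

def quality(returned, relevant):
    """Geometric mean of precision and recall for a query answer."""
    returned, relevant = set(returned), set(relevant)
    if not returned or not relevant:
        return 0.0
    hits = len(returned & relevant)
    precision = hits / len(returned)   # fraction of returned answers that are correct
    recall = hits / len(relevant)      # fraction of correct answers that were returned
    return math.sqrt(precision * recall)

# Every returned paper is correct, but half of the correct papers are missed.
score = quality({"p1", "p2"}, {"p1", "p2", "p3", "p4"})
```

With precision 1.0 and recall 0.5 the score is the square root of 0.5, roughly 0.71, which shows how missed answers pull quality down even when everything returned is correct.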

Our approach • capture inter-term lexical relationships by ontology and integrate ontologies of different DBs • use existing similarity measures to enhance the integrated ontology • TOSS: extend TAX algebra to query with ontology and similarity

Architecture STORY, PARQ

Ontology • a set S • S = {article, author, title} • a partially ordered set (S, ≤S) • part_of relation ≤S = {(author, article), (title, article), (title, title), (author, author), (article, article)} • a hierarchy (H, ≤H) is the Hasse diagram for (S, ≤S) • a DAG with a minimal set of edges s.t. there’s a path from u to v iff u ≤S v • H = {article, author, title} • ≤H = {(author, article), (title, article)}
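The step from the partial order to its Hasse diagram can be sketched as follows. This is a minimal illustration that assumes the input relation is already reflexive and transitive; the function name `hasse_edges` is hypothetical.

```python
def hasse_edges(order):
    """Return the covering pairs of a partial order.

    order: set of (u, v) pairs meaning u ≤ v, assumed reflexive and transitive.
    A pair (u, v) is kept only if no third element w lies strictly between them.
    """
    strict = {(u, v) for u, v in order if u != v}
    elements = {x for pair in order for x in pair}
    return {
        (u, v)
        for u, v in strict
        if not any((u, w) in strict and (w, v) in strict for w in elements)
    }

part_of = {
    ("author", "article"), ("title", "article"),
    ("title", "title"), ("author", "author"), ("article", "article"),
}
edges = hasse_edges(part_of)
```

For the slide's example, `edges` contains exactly the two pairs listed for ≤H: the reflexive pairs are dropped and nothing else is implied by transitivity.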

Ontology • Suppose Σ is some finite set of strings and S is some set. An ontology w.r.t. Σ is a partial mapping Θ from Σ to hierarchies for S • Σ = {part_of} • Θ(part_of) = (H, ≤H) • [Figure: the hierarchy with edges author part_of article and title part_of article]

Ontology Integration SIGMOD DBLP

Ontology Integration SIGMOD DBLP IC (interoperation constraints)

Ontology Integration

Ontology Integration Hierarchy graph associated with SIGMOD and DBLP

Ontology Integration Fusion of ontologies of SIGMOD and DBLP

Similarity Enhanced Ontology • A string similarity measure dS is any function which takes two strings X, Y and returns a non-negative real number such that • ∀X, dS(X,X) = 0 • ∀X,Y, dS(X,Y) = dS(Y,X)

Similarity Enhanced Ontology • Any string similarity measure can be used, such as Levenshtein distance, Monge-Elkan distance, Jaro metric, Jaccard similarity, etc. (Cohen et al., "A comparison of string metrics for matching names and records", 1st Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003) • For example: Levenshtein distance assigns a unit cost to every edit operation. • dS(“relation”, “relational”) = 2
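The Levenshtein example can be checked with the standard dynamic-programming formulation. This sketch is a textbook version with unit costs for insertion, deletion, and substitution, not code from the TOSS system.

```python
def levenshtein(x, y):
    """Edit distance with unit cost per insertion, deletion, or substitution."""
    # prev[j] is the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

distance = levenshtein("relation", "relational")
```

Here `distance` is 2, matching the slide: two insertions ("a" and "l") turn "relation" into "relational".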

Similarity Enhanced Ontology • A similarity measure is any function which takes nodes A, B as input and returns a non-negative real number such that • d(A,B) = min_{X∈S, Y∈T} dS(X,Y), where dS is a string similarity measure and S, T are the sets of strings contained in nodes A, B. • In an integrated ontology, a node may contain one or more strings, and we want to decide whether two nodes are sufficiently similar. Since the strings in one node (in the original ontology) are equivalent, we take the minimum of the distances over all string pairs from the two nodes.
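The node-level measure can be sketched as a minimum over all string pairs. The function names are hypothetical, and the character-set Jaccard distance used here is only a simple stand-in for whichever string measure dS is chosen; it is 0 for identical strings and symmetric, as required.

```python
from itertools import product

def jaccard_distance(x, y):
    """Stand-in string measure: 1 minus the Jaccard overlap of character sets."""
    a, b = set(x), set(y)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def node_distance(strings_a, strings_b, string_distance):
    """d(A, B): the smallest string distance over all pairs (X, Y),
    with X drawn from node A and Y from node B."""
    return min(string_distance(x, y) for x, y in product(strings_a, strings_b))

node_a = {"J. Ullman", "Jeffrey Ullman"}
node_b = {"J.D. Ullman"}
d = node_distance(node_a, node_b, jaccard_distance)
```

Because the strings within one node are treated as equivalent, the closest pair ("J. Ullman" and "J.D. Ullman" here) determines the distance between the two nodes.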

Similarity Enhanced Ontology • A string similarity measure dS is strong iff for all strings X, Y, Z, • dS(X, Y) + dS(Y, Z) ≥ dS(X, Z)

Similarity Enhanced Ontology

Similarity Enhanced Ontology • Suppose H is an integrated hierarchy, d is a similarity measure and ε ≥ 0. (H’, μ) is a similarity enhancement of H w.r.t. d, ε iff H’ is a hierarchy and μ is a function from H to 2^H’ such that: • the original partial orderings in H are preserved, and no unwarranted orderings are included • all nodes mapped into the same node are similar to each other (within the threshold ε) • two strings are similar iff they are jointly present in some node in (H’, μ) • there is no redundant node whose string set is a subset of that of some other node

Similarity Enhanced Ontology An example ontology Its similarity enhancement

Similarity Enhanced Ontology • (H, d, ) is similarity consistent iff there exists a similarity enhancement of H w.r.t. d, . • Theorem • If (H, d, ) is similarity consistent, then all similarity enhancements of H are equivalent.

Similarity Enhanced Ontology

SEO Semistructured Instance • A semistructured instance is defined as I = (V, E, t), where t associates a type in T with each attribute (tag and/or content) of each object o in V.

TOSS Algebra • A simple selection condition has the form X op Y • op ∈ { =, ≠, <, ≤, >, ≥, ~, instance_of, isa, part_of, subtype_of, above, below }, and X, Y are terms, i.e., attributes (tag, content), types, or typed values v:τ with v ∈ dom(τ). • A selection condition is a simple selection condition, OR a conjunction/disjunction of two selection conditions, OR a negation of a selection condition

TOSS Algebra • The pattern tree to find the titles of all papers in DBLP related to Microsoft (independently of the field in which Microsoft appears): #1.tag = inproceedings & #2.tag = title & #3.tag part_of inproceedings & #3.content ~ “Microsoft”
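To make the example condition concrete, the sketch below checks it against one candidate assignment of nodes. Everything here is an assumption for illustration: nodes are plain dicts, `part_of` is read as reachability in a hierarchy given as (lower, upper) edges, the sample hierarchy and the "affiliation" tag are invented, and the `similar` argument is a hypothetical stand-in because the slide does not show how TOSS resolves the ~ operator.

```python
def part_of(edges, x, y):
    """True if x equals y or a chain of (lower, upper) edges leads from x up to y."""
    seen, frontier = {x}, [x]
    while frontier:
        current = frontier.pop()
        if current == y:
            return True
        for lower, upper in edges:
            if lower == current and upper not in seen:
                seen.add(upper)
                frontier.append(upper)
    return False

def satisfies(h, edges, similar):
    """Evaluate the slide's condition for a mapping h from labels 1..3 to nodes."""
    return (
        h[1]["tag"] == "inproceedings"
        and h[2]["tag"] == "title"
        and part_of(edges, h[3]["tag"], "inproceedings")
        and similar(h[3]["content"], "Microsoft")
    )

# Invented hierarchy: affiliation is part of author, which is part of inproceedings.
hierarchy = {
    ("author", "inproceedings"),
    ("title", "inproceedings"),
    ("affiliation", "author"),
}
# Stand-in similarity test: case-insensitive equality.
same_ignoring_case = lambda a, b: a.lower() == b.lower()

h = {
    1: {"tag": "inproceedings", "content": None},
    2: {"tag": "title", "content": "A hypothetical paper title"},
    3: {"tag": "affiliation", "content": "microsoft"},
}
matched = satisfies(h, hierarchy, same_ignoring_case)
```

The point of the sketch is that node #3 is not tied to a fixed tag: any field that is part of an inproceedings record qualifies, which is what "independently of the field in which Microsoft appears" expresses.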
