TOSS: An Extension of TAX with Ontologies and Similarity Queries Edward Hung, Yu Deng, V.S. Subrahmanian Department of Computer Science University of Maryland, College Park
Outline • Introduction • Ontologies and Integration • Similarity Enhanced Ontology (SEO) • TOSS Algebra • Implementation and Experiments • Related Work
XML: EXtensible Markup Language • a markup language much like HTML, derived from SGML • designed to describe data so that it is easy to transmit and manipulate over the web • XML tags are not predefined: you are free to define your own tags
XML Example in DBLP (slide callouts mark a tag name and a value)
<?xml version="1.0"?>
<inproceedings>
  <author>Paolo Ciancarini</author>
  <author>Andrea Giovannini</author>
  <author>Davide Rossi</author>
  <title>Mobility and Coordination for distributed Java Applications</title>
  <pages>402-426</pages>
  <year>1999</year>
  <booktitle>Advances in Distributed Systems</booktitle>
</inproceedings>
Motivating Examples and TAX • DBLP and SIGMOD bibliographies in XML • [Jagadish et al., TAX: A Tree Algebra for XML, in DBPL, 2001] • one of the best algebras developed for XML databases • operators: selection, projection, product, etc. • uses pattern trees to find embeddings (matches)
TAX: Pattern tree • A pattern tree is a pair P = (T, F), where T = (V, E) is an object-labeled and edge-labeled tree such that: • each object in V has a distinct integer as its label • each edge is labeled either pc (for parent-child) or ad (for ancestor-descendant) • F is a selection condition applicable to objects [Figure: example pattern tree. Legend: tag name, value; pc: parent-child; ad: ancestor-descendant]
TAX: Embedding • Suppose SDB is a semistructured database and P = (T, F) a pattern tree. An embedding of P into SDB is a total mapping $h : P \to \bigcup_{(V,E) \in SDB} V$ from the nodes of T to those in SDB s.t.: • h preserves the structure of T, i.e., whenever (u, v) is a pc (resp. ad) edge in T, h(v) is a child (resp. descendant) of h(u) in SDB • the image under the mapping h satisfies the selection condition F
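The two embedding conditions can be sketched in Python. This is a minimal brute-force illustration, not the TAX or TOSS implementation: the `Node` class, the `embeddings` function and the dictionary encoding of pattern edges are assumptions made for this example, and the data is the DBLP record shown earlier in the deck.

```python
class Node:
    """Toy data-tree node (hypothetical encoding): a tag, optional content, children."""
    def __init__(self, tag, content=None, children=()):
        self.tag = tag
        self.content = content
        self.children = list(children)

def descendants(node):
    """Yield all proper descendants of `node` in preorder."""
    for child in node.children:
        yield child
        yield from descendants(child)

def embeddings(pattern_root, pattern_edges, condition, data_root):
    """Enumerate total mappings h: pattern label -> data node.

    pattern_edges: dict parent_label -> list of (child_label, "pc" or "ad")
    condition:     function h -> bool, playing the role of F
    """
    def extend(label, image):
        # All mappings of the pattern subtree rooted at `label`, given h(label) = image.
        partials = [{label: image}]
        for child_label, kind in pattern_edges.get(label, []):
            # pc edges may only map to children, ad edges to any proper descendant.
            candidates = image.children if kind == "pc" else list(descendants(image))
            partials = [
                {**partial, **sub}
                for partial in partials
                for cand in candidates
                for sub in extend(child_label, cand)
            ]
        return partials

    results = []
    for start in [data_root, *descendants(data_root)]:
        results.extend(h for h in extend(pattern_root, start) if condition(h))
    return results

paper = Node("inproceedings", children=[
    Node("author", "Paolo Ciancarini"),
    Node("author", "Andrea Giovannini"),
    Node("author", "Davide Rossi"),
    Node("title", "Mobility and Coordination for distributed Java Applications"),
    Node("year", "1999"),
])

# Pattern: #1 -pc-> #2 and #1 -pc-> #3,
# with F = (#1.tag = inproceedings & #2.tag = author & #3.tag = title).
edges = {1: [(2, "pc"), (3, "pc")]}
F = lambda h: (h[1].tag == "inproceedings"
               and h[2].tag == "author"
               and h[3].tag == "title")

found = embeddings(1, edges, F, paper)
for h in found:
    print(h[2].content, "|", h[3].content)
```

Each author yields its own embedding here, so one data tree can produce several embeddings of the same pattern.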
TAX: Witness tree • Each embedding h of a pattern tree P into SDB induces a witness tree of the embedding, denoted h^SDB(P), defined as follows: • a node n of SDB is in the witness tree if n = h(u) for some node u in the pattern tree P • for any pair of nodes n, m in the witness tree, whenever m is the closest ancestor of n in SDB among the nodes in the witness tree, the witness tree contains the edge (m, n) • the witness tree preserves the order between nodes in SDB, i.e., for any two nodes m, n in h^SDB(P), whenever m precedes n in the preorder enumeration of SDB, it does so in that of h^SDB(P) as well
[Figure: a pattern tree, a data instance, and the witness trees of the embeddings. Legend: tag name, value; pc: parent-child; ad: ancestor-descendant]
TAX: Selection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and SL any set of nodes. A selection query σ_{P,SL}(SDB) returns all witness trees w.r.t. pattern tree P and SDB. In addition, if a node n in SL appears in one of these witness trees, then all descendants of n are also added to that witness tree.
[Figure: a pattern tree, a data instance, the selection witness trees, and the selection result. Legend: tag name, value; pc: parent-child; ad: ancestor-descendant]
TAX: Projection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and PL a projection list (a list of node labels appearing in P). A projection query π_{P,PL}(SDB) returns the tree(s) consisting of all nodes n selected from SDB s.t. for every node n in the result, there exist an embedding h with witness tree h^SDB(P) and a label n′ ∈ PL where h(n′) = n.
[Figure: a pattern tree, a data instance, and the projection result]
Product • The product of two instances (two sets of trees) contains, for each pair of trees (one from each instance), a tree whose root is a new node (called tax_prod_root) and whose children are the roots of the two trees in the pair. [Figure: two instances combined by × under a tax_prod_root node]
Problems! [Figure panels: SIGMOD, DBLP]
Problems • Lack of lexical semantics in answering queries • Find papers written by “J. Ullman”: should J.D. Ullman or Jeffrey Ullman match? This calls for similar values/tags • Find papers with at least one author from “U.S. government”: should U.S. Census Bureau or U.S. Army match? This calls for values/tags with relationships described by ontologies
Problems • TAX returns correct results → high precision • but often misses some correct results → poor recall • $\text{Quality} = (\text{recall} \times \text{precision})^{1/2}$ → low quality • Goal of our TOSS system: • extend and enhance the semantics of TAX to return high-quality answers using ontologies and similarity measures
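A short Python sketch shows why high precision with poor recall still gives low quality under this measure. It reads the quality formula as the square root of recall times precision; the function name and the counts in the usage line are hypothetical and are not results from the TOSS experiments.

```python
import math

def quality(num_returned, num_correct_returned, num_correct_total):
    """Return (precision, recall, quality), with quality = sqrt(recall * precision).

    Assumes num_returned > 0 and num_correct_total > 0.
    """
    precision = num_correct_returned / num_returned
    recall = num_correct_returned / num_correct_total
    return precision, recall, math.sqrt(recall * precision)

# Hypothetical query: 4 answers returned, all correct, but 16 correct answers exist.
precision, recall, q = quality(4, 4, 16)
print(precision, recall, q)
```

With these made-up counts, precision is perfect but the quality is only 0.5, because three quarters of the correct answers were missed.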
Our approach • capture inter-term lexical relationships by ontology and integrate ontologies of different DBs • use existing similarity measures to enhance the integrated ontology • TOSS: extend TAX algebra to query with ontology and similarity
Architecture STORY, PARQ
Ontology • a set S, e.g. S = {article, author, title} • a partially ordered set (S, ≤S), e.g. the part_of relation ≤S = {(author, article), (title, article), (title, title), (author, author), (article, article)} • a hierarchy (H, ≤H) is the Hasse diagram for (S, ≤S): a DAG with a minimal set of edges s.t. there is a path from u to v iff u ≤S v • here H = {article, author, title} and ≤H = {(author, article), (title, article)}
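The step from the partial order to its Hasse diagram can be illustrated with a small Python sketch. The function `hasse_edges` is a name made up for this example; it keeps a pair (u, v) only when u is strictly below v and no third element lies strictly between them, which is one straightforward way to obtain the minimal edge set.

```python
def hasse_edges(elements, leq):
    """Return the covering pairs of a finite partial order.

    leq is a set of pairs (u, v) meaning u <= v; it is assumed to be
    reflexive, antisymmetric and transitive.
    """
    strict = {(u, v) for (u, v) in leq if u != v}
    return {
        (u, v)
        for (u, v) in strict
        # Drop (u, v) if some w satisfies u < w < v: the path through w implies it.
        if not any((u, w) in strict and (w, v) in strict for w in elements)
    }

# The part_of example from the slide.
S = {"article", "author", "title"}
leq = {
    ("author", "article"), ("title", "article"),
    ("title", "title"), ("author", "author"), ("article", "article"),
}
print(sorted(hasse_edges(S, leq)))
# prints [('author', 'article'), ('title', 'article')]
```

On the slide's example the reflexive pairs disappear and the two part_of edges remain, matching the ≤H given there.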
Ontology • Suppose Σ is some finite set of strings and S is some set. An ontology w.r.t. Σ is a partial mapping Θ from Σ to hierarchies for S • e.g. Σ = {part_of} • Θ(part_of) = (H, ≤H) [Figure: hierarchy with edges author part_of article and title part_of article]
Ontology Integration [Figure: ontologies of SIGMOD and DBLP]
Ontology Integration [Figure: ontologies of SIGMOD and DBLP with IC (interoperation constraints)]
Ontology Integration [Figure: hierarchy graph associated with SIGMOD and DBLP]
Ontology Integration [Figure: fusion of the ontologies of SIGMOD and DBLP]
Similarity Enhanced Ontology • A string similarity measure dS is any function which takes two strings X, Y and returns a non-negative real number such that • ∀X, dS(X, X) = 0 • ∀X, Y, dS(X, Y) = dS(Y, X)
Similarity Enhanced Ontology • Any string similarity measure can be used, such as Levenshtein distance, Monge-Elkan distance, Jaro metric, Jaccard similarity, etc. (Cohen et al., "A comparison of string metrics for matching names and records", 1st Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003) • For example: Levenshtein distance assigns a unit cost to every edit operation. • dS(“relation”, “relational”) = 2
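The Levenshtein example can be checked with a short Python sketch. This is the standard dynamic-programming formulation with unit cost for insertion, deletion and substitution; the function name is chosen for this example.

```python
def levenshtein(x, y):
    """Edit distance between strings x and y with unit cost per edit operation."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("relation", "relational"))  # prints 2
```

The two appended characters "al" account for the distance of 2; the function also returns 0 for identical strings and is symmetric, as the definition of a string similarity measure requires.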
Similarity Enhanced Ontology • A similarity measure is any function which takes nodes A, B as input and returns a non-negative real number such that • $d(A, B) = \min_{X \in S,\, Y \in T} d_S(X, Y)$, where dS is a string similarity measure and S, T are the sets of strings contained in nodes A, B. • In an integrated ontology, nodes may contain one or more strings, and we want to decide whether two nodes are sufficiently similar. Since the strings in one node (in the original ontology) are equivalent, we take the minimum of the distances over all string pairs from the two nodes.
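The node-level measure can be sketched in Python as a minimum over all string pairs. To show that any string measure can be plugged in, this example uses a Jaccard distance over character bigrams; that choice, the function names and the node contents are assumptions made for illustration and are not taken from the TOSS system.

```python
def bigrams(s):
    """Set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_distance(x, y):
    """1 - |A ∩ B| / |A ∪ B| over character bigrams; 0 for equal strings, symmetric."""
    a, b = bigrams(x), bigrams(y)
    if not a and not b:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 0.0 if x == y else 1.0
    return 1.0 - len(a & b) / len(a | b)

def node_distance(strings_a, strings_b, d_s=jaccard_distance):
    """d(A, B): minimum of d_s(X, Y) over all X in node A and Y in node B."""
    return min(d_s(x, y) for x in strings_a for y in strings_b)

# Hypothetical node contents from an integrated ontology.
node_a = {"booktitle", "conference"}
node_b = {"conferences"}
d = node_distance(node_a, node_b)
print(round(d, 3))
```

Here "booktitle" shares no bigram with "conferences", but "conference" shares nine of ten, so the minimum makes the two nodes close even though only one pair of strings is similar.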
Similarity Enhanced Ontology • A string similarity measure dS is strong iff for all strings X, Y, Z, • dS(X, Y) + dS(Y, Z) ≥ dS(X, Z)
Similarity Enhanced Ontology • Suppose H is an integrated hierarchy, d is a similarity measure and ε ≥ 0. (H′, μ) is a similarity enhancement of H w.r.t. d, ε iff H′ is a hierarchy and μ is a function from H to 2^{H′} such that: • the original partial orderings in H are preserved, and no unwarranted orderings are included • all nodes mapped into the same node are similar to each other (within the threshold ε) • two strings are similar iff they are jointly present in some node in (H′, μ) • there is no redundant node whose string set is a subset of that of some other node
Similarity Enhanced Ontology An example ontology Its similarity enhancement
Similarity Enhanced Ontology • (H, d, ε) is similarity consistent iff there exists a similarity enhancement of H w.r.t. d, ε. • Theorem: If (H, d, ε) is similarity consistent, then all similarity enhancements of H are equivalent.
SEO Semistructured Instance A semistructured instance is defined as I = (V, E, t) where t associates a type in T with each attribute (tag and/or content) of each object o in V.
TOSS Algebra • A simple selection condition has the form X op Y • op ∈ { =, ≠, <, ≤, >, ≥, ~, instance_of, isa, part_of, subtype_of, above, below}, and X, Y are terms, i.e., attributes (tag, content), types, or typed values v:τ with v ∈ dom(τ). • A selection condition is a simple selection condition, OR a conjunction/disjunction of two selection conditions, OR a negation of a selection condition
TOSS Algebra • The pattern tree to find the titles of all papers in DBLP related to Microsoft (independently of the field in which Microsoft appears): #1.tag = inproceedings & #2.tag = title & #3.tag part_of inproceedings & #3.content ~ “Microsoft”
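How such a condition might be evaluated on one candidate mapping can be sketched in Python. Everything concrete here is an assumption made for illustration: the part_of edges, the `affiliation` tag, the contents of the similarity-enhanced node, and the reading of X ~ Y as "X and Y occur jointly in some node of the similarity-enhanced ontology". It is not the TOSS implementation.

```python
# Hypothetical part_of Hasse edges (part, whole) for a DBLP-like ontology.
PART_OF_EDGES = {
    ("author", "inproceedings"),
    ("title", "inproceedings"),
    ("affiliation", "author"),
}

# Hypothetical nodes of a similarity-enhanced ontology: sets of similar strings.
SEO_NODES = [{"Microsoft", "Microsoft Corp.", "Microsoft Corporation"}]

def part_of(x, y):
    """True iff x <= y in the part_of order: x == y or a path leads from x up to y."""
    if x == y:
        return True
    return any(part_of(whole, y) for part, whole in PART_OF_EDGES if part == x)

def similar(x, y):
    """X ~ Y, read here as: equal, or jointly present in some SEO node."""
    return x == y or any(x in node and y in node for node in SEO_NODES)

def condition(h):
    """F for the pattern tree above; h maps pattern labels to {'tag', 'content'} dicts."""
    return (h[1]["tag"] == "inproceedings"
            and h[2]["tag"] == "title"
            and part_of(h[3]["tag"], "inproceedings")
            and similar(h[3]["content"], "Microsoft"))

h_match = {
    1: {"tag": "inproceedings", "content": None},
    2: {"tag": "title", "content": "A hypothetical paper title"},
    3: {"tag": "affiliation", "content": "Microsoft Corp."},
}
h_miss = {**h_match, 3: {"tag": "affiliation", "content": "IBM"}}
print(condition(h_match), condition(h_miss))  # prints True False
```

The match succeeds even though the string sits in a nested affiliation field and is not spelled exactly "Microsoft", which is the kind of answer an exact-match condition on a fixed tag would miss.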