Cooperative Query Answering for Semistructured Data

Speakers: Chuan Lin & Xi Zhang
By Michael Barg and Raymond K. Wong

Outline
• Motivations
• Overview
• Basic Concepts
• Cooperative Query Processing
• Experiment

Motivations
• XML data
  • Same semantic content
  • Very different structures

Example: same semantics, different structures
• User Query: “insurance claims” related to “smoking” for “woman”
• [Figure: two XML trees. Court Transcript: plaintiff, woman, insurance claim, smoking. Insurance Record: insurance claim, insurer, smoking, woman]

Motivations
• No exact query result
• Data: personnel
• User Query: “phone number” of “Bob”
• User Query: Who is the new “sales manager”?
• [Figure: personnel tree with nodes sales manager, assistant sales manager, salesman, Joe, Bob, and phone number]

Overview
• Goal: return approximate answers for XML queries
  • “Approximate”: semantically and structurally similar
• Solution: return a set of results ranked by an overall score
  • The score indicates how well the subgraph containing the result satisfies the query criteria

Basic Concepts: Query Tree
• Query: /restaurant[.//Soho]/phone_number
• [Figure: query tree over restaurant, soho, and phone_number; the result term is marked r and edge ends are marked h and t]
• For each edge:
  • “head”: the end closer to the nearest result term
  • “tail”: the other end
  • In case of a tie, the “head” is the end closer to the root
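The head/tail rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the query tree is given as a list of (parent, child) edges plus a set of result terms, and the names `nearest_result_distance` and `orient_edges` are hypothetical, not taken from the paper.

```python
from collections import deque

def nearest_result_distance(adj, start, result_terms):
    """Breadth-first search: number of edges from start to the closest result term."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in result_terms:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("query tree contains no result term")

def orient_edges(edges, result_terms):
    """Return (head, tail) for every (parent, child) edge of the query tree."""
    adj = {}
    for parent, child in edges:
        adj.setdefault(parent, []).append(child)
        adj.setdefault(child, []).append(parent)
    oriented = []
    for parent, child in edges:
        d_parent = nearest_result_distance(adj, parent, result_terms)
        d_child = nearest_result_distance(adj, child, result_terms)
        # The child is the head only if it is strictly closer to a result term;
        # on a tie the parent wins because it is closer to the root.
        if d_child < d_parent:
            oriented.append((child, parent))
        else:
            oriented.append((parent, child))
    return oriented

# Query tree for /restaurant[.//Soho]/phone_number, result term phone_number
edges = [("restaurant", "soho"), ("restaurant", "phone_number")]
print(orient_edges(edges, {"phone_number"}))
```

For this query the edge to soho gets restaurant as its head, and the edge to phone_number gets phone_number as its head.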

Basic Concepts: Converging Order
• Order of edges considered in query processing
• Edges converge on a result term

Basic Concepts: Similarity
• Semantically similar topologies
• [Figure: five topologies (a) to (e) relating restaurant and soho, some with additional nodes shopping_center, eating_places, or address]

Basic Concepts: Similarity (cont.)
• Deviation Proximity (DP)
  • Measures how far one structure deviates from a desired structure
• Given:
  • ra: data node with value a
  • rb: data node with value b
  • Q(a,b): query tree edge
• DP: the distance from the actual position of rb to the nearest position, r’b, that satisfies the topological relationship specified by Q(a,b)
• Topological relationship: parent-child or ancestor-descendant

Deviation Proximity
• Q(restaurant, soho) requires a parent-child relationship
• [Figure: five topologies over restaurant, soho, eating_places, shopping_center, and address, with the nearest satisfying position marked (soho’)]
• DP(restaurant, soho) for the five topologies, in figure order: 0, 1, 2, 3, 3

Deviation Proximity
• Q(restaurant, soho) requires an ancestor-descendant relationship
• [Figure: five topologies over restaurant, soho, eating_places, shopping_center, and address, with the nearest satisfying position marked (soho’)]
• DP(restaurant, soho) for the five topologies, in figure order: 0, 0, 2, 3, 3
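One way to read the DP definition is sketched below in Python. It rests on assumptions that are not spelled out on the slides: distance is counted in tree edges, a node already below ra only needs to move up to the child position on its own path, and any other node must travel to ra and then one step down to a new child position. The document tree is a hypothetical child-to-parent dictionary, and the four small trees are invented for illustration; they are not the figure's exact panels, although this reading does produce the values 0, 1, 2, and 3 for them.

```python
def path_to_root(parent, node):
    """Return [node, parent(node), ..., root] using a child -> parent mapping."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path

def tree_distance(parent, u, v):
    """Number of edges on the unique tree path between u and v."""
    steps_from_v = {n: i for i, n in enumerate(path_to_root(parent, v))}
    for i, n in enumerate(path_to_root(parent, u)):
        if n in steps_from_v:  # first common ancestor
            return i + steps_from_v[n]
    raise ValueError("nodes are not in the same tree")

def deviation_proximity(parent, ra, rb, relation):
    """Edges rb must move to satisfy `relation` with respect to ra (assumed reading)."""
    up_from_b = path_to_root(parent, rb)
    if ra in up_from_b and ra != rb:
        depth = up_from_b.index(ra)  # rb sits `depth` levels below ra
        if relation == "ancestor-descendant":
            return 0                 # already a descendant
        return depth - 1             # move up to the child position on the path
    # rb is not below ra: travel to ra, then one edge down to a child position
    return tree_distance(parent, ra, rb) + 1

# Hypothetical example trees (child -> parent)
child = {"restaurant": None, "soho": "restaurant"}
grandchild = {"restaurant": None, "eating_places": "restaurant", "soho": "eating_places"}
reversed_pair = {"soho": None, "restaurant": "soho"}
siblings = {"shopping_center": None, "restaurant": "shopping_center", "soho": "shopping_center"}

for tree in (child, grandchild, reversed_pair, siblings):
    print(deviation_proximity(tree, "restaurant", "soho", "parent-child"),
          deviation_proximity(tree, "restaurant", "soho", "ancestor-descendant"))
```

Under this reading the two relationships differ only when soho is already a non-child descendant of restaurant, which matches the single column where the two slides disagree (1 versus 0).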

Cooperative Query Processing
• Input: a Query Tree QT, an XML Document Tree DT
• Output: ordered list of <r_result_term, score>
• Components:
  • Structural proximity calculation
  • Progressive score

Cooperative Query Processing (cont.)
• Progressively match edges in QT against DT
• Consider edges in converging order
• For each edge QT(a,b), where a is the head and b is the tail, get a list of <ra, score>
  • ra is a node in DT with value a
  • score is the progressive score of ra with respect to the nearest rb
  • Graph encodings are used to calculate the structural proximity of ra and rb

Structural Proximity Calculation
• Encodings and Compressed Arrays
  • Compact
  • Preserve relationship to a larger graph
  • Facilitate distance calculations
• Proximity Searching

Encodings and Compressed Arrays
• Basic Concepts:
  • Common Node
  • Terminal Node
  • Annotated Node
• Path representation
  • Representing Single Path
  • Representing Multiple Paths
  • Representing Multiple Elements
• Compressed Arrays
  • Each encoding is a path or multi-path for a node or a set of nodes

Encodings and Compressed Arrays

Representing Single Path 1.1.1  y1 1.2.1.1.1.1  y2

Representing Multiple Paths 1.3  B .B.2.1.1  C .3  C .C.2  y3

Representing Multiple Elements .A.1.1y1 1 A .2.1.1.1.1 y2 .3  B.B.2.1.1  C.3  C.C.2  y3

Compressed Arrays

Drawback of Encoding • 1A.A.1B.B.1D.2E.?.2C.C.1F.2G

Proximity Searching
• Multi-Element Comparison
• Input:
  • A compressed array, caN, containing the multi-element encoding of the Near Set
  • A compressed array, caF, containing the multi-path encoding or path encoding of all paths from the root to the specified element of the Find Set, EF
• Output:
  • dist, the length of the shortest path from EF to the closest element in the Near Set

Proximity Searching
• [Figure: example search showing MinDist = 5, MinDist = 4, MinDist = 2]
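The input/output contract of proximity searching can be illustrated with a much simpler stand-in for the encodings. The Python sketch below does not implement compressed arrays or multi-element encodings; it assumes each element is identified by a plain root-to-node path of child ordinals (a Dewey-style label, chosen here only for illustration) and computes the shortest tree distance from one Find Set element to the closest Near Set element. The function names are hypothetical.

```python
def common_prefix_len(p, q):
    """Length of the shared root-side prefix of two root-to-node paths."""
    n = 0
    for x, y in zip(p, q):
        if x != y:
            break
        n += 1
    return n

def path_distance(p, q):
    """Tree edges between two nodes: go up to the common ancestor, then down."""
    k = common_prefix_len(p, q)
    return (len(p) - k) + (len(q) - k)

def min_dist(near_paths, find_path):
    """Shortest distance from the Find Set element to any Near Set element."""
    return min(path_distance(find_path, p) for p in near_paths)

# Illustrative paths only; each tuple lists child ordinals from the root
near_set = [(1, 1, 1), (1, 2, 1, 1)]
find_element = (1, 2, 3)
print(min_dist(near_set, find_element))  # 3
```

Here the first Near Set element is 4 edges away and the second is 3, so the minimum is 3. An empty Near Set makes `min` raise `ValueError`, which a fuller implementation would need to handle.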

Progressive Score
• Accumulative Deviation Proximity (DP)
• Calculated from structural proximity
• Boolean operator at Query Tree branches (node a with children b and c):
  • prog(a) = prog(b) + prog(c)
  • prog(a) = min(prog(b), prog(c))
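The two branch formulas can be evaluated bottom-up over a query tree. The Python sketch below assumes, as a labeled guess, that the sum applies at AND branches and the minimum at OR branches; the slide gives the two formulas but this transcript does not preserve which operator goes with which. Leaves are plain numbers standing for already-computed scores, branches are hypothetical `(op, children)` tuples, and how the DP of each individual edge enters a leaf score is left outside the sketch.

```python
def prog(node):
    """Progressive score of a query tree node, computed bottom-up."""
    if isinstance(node, (int, float)):
        return node  # leaf: score already known
    op, children = node
    scores = [prog(child) for child in children]
    if op == "and":
        # assumed mapping: prog(a) = prog(b) + prog(c)
        return sum(scores)
    if op == "or":
        # assumed mapping: prog(a) = min(prog(b), prog(c))
        return min(scores)
    raise ValueError(f"unknown branch operator: {op!r}")

print(prog(("and", [1, 2])))               # 3
print(prog(("or", [1, 2])))                # 1
print(prog(("and", [("or", [3, 1]), 2])))  # 3
```

With this mapping every AND child adds its deviation to the total, while an OR branch contributes only its best (smallest) alternative.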

Experiment
• XML: [document figure not preserved]
• Query: //restaurant/soho
• Query Result: <soho, 2>, <soho, 3>, <soho, 4>

Thank you!

Questions & Answers
