A Graph Method for Keyword-based Selection of the top-K Databases
Quang Hieu Vu¹, Beng Chin Ooi¹, Dimitris Papadias², Anthony K. H. Tung¹
¹ National University of Singapore
² Hong Kong University of Science and Technology
Outline
• Motivation
• Problem definition
• An existing approach
• System architecture
• Query processing
• Experimental study
• Conclusion
Motivation
• Challenge: to issue a query in a DBMS, users need to know
  • The database schema
  • A data manipulation language (e.g. SQL)
• In distributed systems: heterogeneity of database schemas
• Solution: Keyword Search (KS)
  • The basic unit of information is a tuple
  • Each result of a query is a set of tuples that
    • Contain all or most of the query keywords
    • Can be joined together in a meaningful way (via Primary Key – Foreign Key relationships)
Problem definition
• Given a set of relational databases stored at different nodes in a distributed system, and a keyword query
• Select the top-K databases most likely to contribute results (K is an input parameter)
• Purpose: to minimize the total cost of processing the query without sacrificing precision and recall
An existing approach: M-KS [1]
• Each DBMS builds a keyword relationship matrix (KRM) that acts as its summary
• For each pair of terms (ti, tj), the KRM has an entry recording how often the two terms are related at each distance
• Two terms have a relationship if
  • They are in the same tuple: relationship distance = 0
  • They are in different tuples that can be joined together via d join operations: relationship distance = d
[1] Bei Yu et al. Effective keyword-based selection of relational databases. SIGMOD’07
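The KRM definition above can be illustrated with a minimal Python sketch. It is not the M-KS implementation: the toy data, the names `join_distances` and `build_krm`, the `max_d` cut-off, and the choice to count one occurrence per pair of tuples are all assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import combinations, product

def join_distances(joins, start, max_d):
    """BFS over the tuple join graph: {tuple_id: number of joins from start}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if dist[cur] == max_d:
            continue  # do not expand beyond the maximum distance
        for nxt in joins.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

def build_krm(tuples, joins, max_d=2):
    """Return {(ti, tj): Counter({d: frequency})} with ti < tj."""
    krm = {}
    for a in tuples:
        for b, d in join_distances(joins, a, max_d).items():
            if b < a:
                continue  # visit each unordered pair of tuples once
            if a == b:
                # Same tuple: every pair of its terms is related at distance 0.
                pairs = combinations(sorted(tuples[a]), 2)
            else:
                pairs = ((ti, tj) for ti, tj in product(tuples[a], tuples[b])
                         if ti != tj)
            for ti, tj in pairs:
                key = tuple(sorted((ti, tj)))
                krm.setdefault(key, Counter())[d] += 1
    return krm

# Hypothetical toy database: three tuples joined in a chain t1 - t2 - t3.
tuples = {
    "t1": {"graph", "keyword"},
    "t2": {"database"},
    "t3": {"keyword", "search"},
}
joins = {"t1": ["t2"], "t2": ["t1", "t3"], "t3": ["t2"]}
krm = build_krm(tuples, joins)
```

In this toy data, "graph" and "keyword" are related at distance 0 (both in t1) and at distance 2 (t1 and t3 are two joins apart), so their entry holds one count for each of those distances.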
An example of KRM
[Figure: a database and the KRM of the database]
Disadvantages of M-KS
• Uses only binary relationships between terms to eliminate non-promising databases
  • Yields numerous false positives
• Records only the frequency of term co-occurrences
  • Unsuitable for ranking based on IR measures
• Is designed to support only AND semantics
  • Real applications usually support queries under OR semantics
G-KS
• G-KS summarizes the terms and their relationships in each DBMS using a keyword relationship graph (KRG)
• A node corresponds to a term and has a weight
• If two terms have a relationship at distance d, there is an edge between their corresponding nodes in the graph, and d is marked on the edge
• When two terms can be connected through multiple paths of different distances, each distinct value of d is recorded
• Every distance value in the graph is associated with a weight
An example of KRG
[Figure: a database and the KRG of the database]
Graph Compression
• Observation: a large percentage of terms in a DBMS appear only once. If such terms occur in the same tuple
  • They have the same weight
  • They have the same set of connections to other nodes, and these connections are of equal weight
• Graph compression: 2 types of nodes
  • Single nodes: contain one term
  • Compound nodes: consist of multiple terms
• The weights of a compound node and of its edges are computed using any one of the included terms
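The grouping rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: `make_nodes` and the toy tuples are hypothetical, each tuple is modelled as a set of terms, and "appears only once" is read as "occurs in exactly one tuple".

```python
from collections import Counter

def make_nodes(tuples):
    """Map every term to its KRG node, represented as a frozenset of terms."""
    # Number of tuples each term occurs in.
    freq = Counter(term for terms in tuples.values() for term in terms)
    node_of = {}
    for terms in tuples.values():
        # Terms of this tuple that occur nowhere else share one compound node.
        once = frozenset(t for t in terms if freq[t] == 1)
        for t in terms:
            node_of[t] = once if freq[t] == 1 else frozenset([t])
    return node_of

# Hypothetical toy database with six distinct terms.
tuples = {
    "t1": {"graph", "keyword", "selection"},
    "t2": {"keyword", "database"},
    "t3": {"database", "ranking", "summary"},
}
node_of = make_nodes(tuples)
nodes = set(node_of.values())
```

Here "graph" and "selection" collapse into one compound node, as do "ranking" and "summary", while "keyword" and "database" stay single nodes, so six terms are represented by four nodes.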
An example of a compressed KRG
[Figure: a database and the compressed KRG of the database]
Graph construction
• Create nodes
  • Compound nodes for terms that occur only once in the database and are in the same tuple
  • Single nodes for other terms
• Create edges
  • Nodes representing terms in the same tuple: an edge at distance 0
  • Nodes representing terms in two tuples that can be connected by d join operations: an edge at distance d
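The edge-creation step might look like the following sketch. It assumes that the term-to-node mapping and the join distances between tuples are already available as inputs; `build_edges`, `tuple_dist`, and the toy data are hypothetical names, and node and edge weights are left out because the slides do not give their formulas.

```python
def build_edges(tuples, tuple_dist, node_of):
    """Return {frozenset({node_a, node_b}): set of distances d}."""
    edges = {}
    for (a, b), d in tuple_dist.items():
        for ti in tuples[a]:
            for tj in tuples[b]:
                na, nb = node_of[ti], node_of[tj]
                if na != nb:  # terms inside the same node need no edge
                    # Record every distinct distance seen for this node pair.
                    edges.setdefault(frozenset((na, nb)), set()).add(d)
    return edges

# Hypothetical toy input: two tuples that are one join apart.
tuples = {
    "t1": {"graph", "selection", "keyword"},
    "t2": {"keyword", "database"},
}
GS = frozenset({"graph", "selection"})  # compound node
K = frozenset({"keyword"})              # single node
D = frozenset({"database"})             # single node
node_of = {"graph": GS, "selection": GS, "keyword": K, "database": D}
# Distance 0 pairs a tuple with itself; distance 1 is a single join.
tuple_dist = {("t1", "t1"): 0, ("t2", "t2"): 0, ("t1", "t2"): 1}

edges = build_edges(tuples, tuple_dist, node_of)
```

The edge between "keyword" and "database" carries both distances 0 and 1, because the two terms share tuple t2 and "keyword" also occurs in t1, which is one join away from t2.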
Join keyword tree (JKT)
• Given a subgraph SG of a KRG, JKT(SG) is a tree satisfying
  • Each tree vertex maps to a non-empty set of nodes of SG, and the tree vertices collectively contain all nodes in SG
  • Each edge connecting two vertices is associated with a single distance d
• Mapping rules
  • If two SG nodes map to the same tree vertex, there must exist a relationship at distance 0 between them in SG
  • If two SG nodes map to different tree vertices, there must exist a relationship at distance d' between them in SG, where d' is the sum of the distances on the path connecting the two tree vertices
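The conditions above can be checked mechanically for a given tree. The sketch below only validates a proposed JKT and does not search for one. It assumes that every SG node is placed in exactly one tree vertex, and that SG edges are stored as sets of distances; `is_valid_jkt`, `tree_path_lengths`, and the toy graph are hypothetical.

```python
from itertools import combinations

def tree_path_lengths(n, tree_edges):
    """All-pairs path lengths for vertices 0..n-1, or None if not a tree."""
    if len(tree_edges) != n - 1:
        return None
    adj = {i: [] for i in range(n)}
    for (i, j), d in tree_edges.items():
        adj[i].append((j, d))
        adj[j].append((i, d))
    lengths = {}
    for src in range(n):
        dist = {src: 0}
        stack = [src]
        while stack:
            cur = stack.pop()
            for nxt, d in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + d
                    stack.append(nxt)
        if len(dist) != n:
            return None  # disconnected, so not a tree
        lengths[src] = dist
    return lengths

def is_valid_jkt(sg_nodes, sg_edges, vertices, tree_edges):
    """Check the JKT conditions for one proposed tree over SG."""
    placed = [node for vertex in vertices for node in vertex]
    if any(not v for v in vertices) or sorted(placed) != sorted(sg_nodes):
        return False
    lengths = tree_path_lengths(len(vertices), tree_edges)
    if lengths is None:
        return False
    vertex_of = {node: i for i, v in enumerate(vertices) for node in v}
    for u, v in combinations(sg_nodes, 2):
        # Same vertex gives 0; otherwise the summed distance on the tree path.
        needed = lengths[vertex_of[u]][vertex_of[v]]
        if needed not in sg_edges.get(frozenset((u, v)), set()):
            return False
    return True

# Hypothetical SG with three nodes; each edge lists its recorded distances.
sg_nodes = ["a", "b", "c"]
sg_edges = {
    frozenset(("a", "b")): {0},
    frozenset(("a", "c")): {1},
    frozenset(("b", "c")): {1, 2},
}
good = is_valid_jkt(sg_nodes, sg_edges, [{"a", "b"}, {"c"}], {(0, 1): 1})
bad = is_valid_jkt(sg_nodes, sg_edges, [{"a", "c"}, {"b"}], {(0, 1): 1})
```

The first tree is accepted because "a" and "b" are related at distance 0 and both reach "c" at distance 1. The second is rejected because "a" and "c" share a vertex but have no relationship at distance 0 in SG.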
Example of a JKT
[Figure: mapping from SG to JKT(SG)]
Example of a JKT
[Figure: JKT(SG) and the database]
Candidate graph (CG)
• Given a query q and a KRG, CG(KRG, q) is a subgraph SG of the KRG satisfying
  • SG includes all nodes of the KRG containing the query keywords, and only these nodes
  • SG is complete
  • There exists at least one JKT(SG)
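The first two conditions are straightforward to sketch in Python. The function below selects the nodes that contain a query keyword and checks that the induced subgraph is complete. It deliberately does not test the third condition (existence of a JKT), nor whether every keyword is matched. `candidate_subgraph` and the toy KRG are hypothetical, with nodes modelled as frozensets of terms.

```python
from itertools import combinations

def candidate_subgraph(nodes, edges, query):
    """Return (sg_nodes, sg_edges) if SG exists and is complete, else None."""
    keywords = set(query)
    # Condition 1: exactly the nodes that contain at least one query keyword.
    sg_nodes = [n for n in nodes if n & keywords]
    if not sg_nodes:
        return None
    # Condition 2: every pair of selected nodes must be connected by an edge.
    sg_edges = {}
    for u, v in combinations(sg_nodes, 2):
        key = frozenset((u, v))
        if key not in edges:
            return None  # SG is not complete
        sg_edges[key] = edges[key]
    return sg_nodes, sg_edges

# Hypothetical compressed KRG with one compound node and two single nodes.
GS = frozenset({"graph", "selection"})
K = frozenset({"keyword"})
D = frozenset({"database"})
nodes = [GS, K, D]
edges = {
    frozenset((GS, K)): {0, 1},
    frozenset((K, D)): {0, 1},
    frozenset((GS, D)): {1},
}

found = candidate_subgraph(nodes, edges, ["graph", "database"])
sparse = {k: v for k, v in edges.items() if k != frozenset((GS, D))}
missing = candidate_subgraph(nodes, sparse, ["graph", "database"])
```

For the query "graph database", the selected nodes are the compound node holding "graph" and the single node "database". Once the edge between them is removed, the subgraph is no longer complete and the function returns None.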
Important theorems
• Theorem 1: if a database contains a result with all keywords of a query q, then the corresponding KRG must have a candidate graph CG(KRG, q)
• Theorem 2: the existence of a candidate graph CG(KRG, q) in a KRG does not guarantee that the corresponding database has results for q
• Note: CG(KRG, q) indicates a high probability that the database has results
Experimental study
• Use the DBLP dataset to generate 81 databases according to bibliography types
• Compare G-KS vs M-KS
• Effectiveness measurement
  • Use a brute-force (BF) method that sends the query to all databases and selects the top-K databases from the returned results
  • Let K_BF(M) and K_G-KS(M) be the total number of results with M keywords in BF and G-KS
  • Recall: K_G-KS(M) / K_BF(M)
  • Precision: K_G-KS(M) / K
Conclusion
• G-KS: a method that selects the top-K databases for processing a relational keyword search query
• G-KS summarizes each database as a keyword relationship graph where
  • Nodes correspond to terms
  • Edges capture distance relationships between terms
• IR techniques are applied to weight nodes and edges
• An algorithm is designed to consider all keywords as a whole in query processing in order to minimize the number of false positives
Thank you! Questions & Answers