A Graph Method for Keyword-based Selection of the top-K Databases

Quang Hieu Vu¹, Beng Chin Ooi¹, Dimitris Papadias², Anthony K. H. Tung¹
¹ National University of Singapore
² Hong Kong University of Science and Technology

Outline
• Motivation
• Problem definition
• An existing approach
• System architecture
• Query processing
• Experimental study
• Conclusion

Motivation
• Challenge: to issue a query in a DBMS, users need to know
  • Database schema
  • Data manipulation language (e.g. SQL)
• In distributed systems: heterogeneity of different database schemas
• Solution: Keyword Search (KS)
  • The basic unit of information is a tuple
  • Each result of a query is a set of tuples that
    • Contain all or most query keywords
    • Can be joined together in a meaningful way (via Primary Key – Foreign Key relationships)

Problem definition
• Given a set of relational databases stored at different nodes in a distributed system and a keyword query
• Select the top-K databases most likely to contribute results (K is an input parameter)
• Purpose: to minimize the total cost of processing the query without sacrificing precision and recall
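The selection step itself can be sketched as follows, assuming each database has already been assigned a relevance score for the query. The database names and scores are made up for illustration, and how G-KS derives such scores is not shown in this transcript.

```python
import heapq

def select_top_k(scores, k):
    """Return the k database names with the highest scores, best first.

    `scores` maps a database name to a hypothetical relevance score for
    one keyword query. Ties are broken by name so the result is deterministic.
    """
    # nsmallest on (-score, name) yields highest score first, then name order.
    return heapq.nsmallest(k, scores, key=lambda db: (-scores[db], db))

# Hypothetical scores for three databases.
example_scores = {"db_articles": 0.9, "db_books": 0.2, "db_theses": 0.6}
top2 = select_top_k(example_scores, 2)
```

Only the K selected databases would then receive the query, which is where the cost saving comes from.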

An existing approach: M-KS [1]
• Each DBMS builds a keyword relationship matrix (KRM) that acts as its summary
• For each pair of terms (ti, tj), the KRM has an entry recording how frequently the two terms are related at each distance
• Two terms have a relationship if
  • They are in the same tuple: relationship distance = 0
  • They are in different tuples that can be joined via d join operations: relationship distance = d
[1] Bei Yu et al. Effective keyword-based selection of relational databases. SIGMOD’07
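A rough sketch of how such a matrix could be filled in is given below. It is not the M-KS implementation. It assumes tuples are given as sets of terms, joins as pairs of tuple ids, the distance between two tuples is the shortest join path, and each tuple pair is counted once per term pair. All names (`tuple_distances`, `build_krm`, the sample data) are hypothetical.

```python
from collections import defaultdict, deque
from itertools import product

def tuple_distances(tuple_ids, joins, max_d):
    """BFS over the tuple join graph; returns {(u, v): d} for all d <= max_d."""
    adj = defaultdict(set)
    for u, v in joins:
        adj[u].add(v)
        adj[v].add(u)
    dist = {}
    for src in tuple_ids:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            if seen[cur] == max_d:      # do not expand beyond max_d joins
                continue
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        for dst, d in seen.items():
            dist[(src, dst)] = d
    return dist

def build_krm(tuples, joins, max_d):
    """krm[(ti, tj)][d] = number of tuple pairs at distance d relating ti and tj."""
    krm = defaultdict(lambda: defaultdict(int))
    for (u, v), d in tuple_distances(tuples, joins, max_d).items():
        if u > v:                       # visit each unordered tuple pair once
            continue
        # Unordered term pairs with one term in u and the other in v.
        pairs = {tuple(sorted((ti, tj)))
                 for ti, tj in product(tuples[u], tuples[v]) if ti != tj}
        for pair in pairs:
            krm[pair][d] += 1
    return krm

# Hypothetical three-tuple database: t1 joins t2, t2 joins t3.
tuples = {"t1": {"graph", "keyword"}, "t2": {"database"}, "t3": {"search"}}
joins = [("t1", "t2"), ("t2", "t3")]
krm = build_krm(tuples, joins, max_d=2)
```

Here "graph" and "keyword" share a tuple (distance 0), while "graph" and "search" are two joins apart (distance 2).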

An example of KRM
[Figure: a database and the KRM of the database]

Disadvantages of M-KS
• Uses only binary relationships between terms to eliminate non-promising databases
  • Yields numerous false positives
• Records only the frequency of term co-occurrences
  • Unsuitable for ranking based on IR measures
• Is designed to support only AND semantics
  • Real applications usually support queries under OR semantics

G-KS
• G-KS summarizes the terms and their relationships in each DBMS using a keyword relationship graph (KRG)
• A node corresponds to a term and has a weight.
• If two terms have a relationship at distance d, there is an edge between their corresponding nodes in the graph. The distance d between them is marked on the edge.
• When two terms can be connected through multiple paths of variable distances, each distinct value of d is recorded.
• Every distance value in the graph is associated with a weight.
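One possible in-memory layout for a KRG is sketched below: weighted nodes, plus edges that carry one weight per distinct distance. The class name, method names, and all numeric weights are placeholders; the weighting formulas from the slides are not reproduced in this transcript.

```python
from dataclasses import dataclass, field

@dataclass
class KRG:
    # term -> node weight
    node_weight: dict = field(default_factory=dict)
    # frozenset({term_a, term_b}) -> {distance d: weight of that distance}
    edges: dict = field(default_factory=dict)

    def add_node(self, term, weight):
        self.node_weight[term] = weight

    def add_relationship(self, a, b, d, weight):
        """Record that terms a and b are related at distance d."""
        self.edges.setdefault(frozenset((a, b)), {})[d] = weight

    def distances(self, a, b):
        """All distinct distances recorded between a and b (empty if unrelated)."""
        return set(self.edges.get(frozenset((a, b)), {}))

# Placeholder weights, for illustration only.
krg = KRG()
krg.add_node("graph", 0.7)
krg.add_node("database", 0.5)
krg.add_relationship("graph", "database", 1, 0.4)
krg.add_relationship("graph", "database", 3, 0.1)  # a second, longer path
```

Keying edges by an unordered pair keeps the graph undirected, and the inner dictionary holds every distinct distance between the same two terms.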

An example of KRG
[Figure: a database and the KRG of the database]

Weight of a node

Weight of an edge

Graph Compression
• Observation: a large percentage of terms in a DBMS appear only once. If such terms occur in the same tuple
  • They have the same weight
  • They have the same set of connections to other nodes, and these connections are of equal weight
• Graph compression: 2 types of nodes
  • Single nodes: contain one term
  • Compound nodes: consist of multiple terms
• The weight of a compound node, as well as the weights of its edges, is computed using any of the included terms
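The grouping rule can be sketched as below, assuming each tuple is given as a set of terms and that "appears only once" means the term occurs in exactly one tuple. The function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def compress_nodes(tuples):
    """Split terms into compound nodes and single nodes.

    Terms that occur in exactly one tuple and share that tuple are merged
    into one compound node; every other term becomes a single node.
    """
    freq = Counter(term for terms in tuples.values() for term in terms)
    once_by_tuple = defaultdict(set)
    for tid, terms in tuples.items():
        for term in terms:
            if freq[term] == 1:
                once_by_tuple[tid].add(term)
    # A compound node needs at least two terms.
    compound = [frozenset(g) for g in once_by_tuple.values() if len(g) > 1]
    grouped = set().union(*compound)
    single = [frozenset([t]) for t in freq if t not in grouped]
    return compound, single

# Hypothetical tuples: "tung" and "papadias" each occur once, both in t3.
tuples = {
    "t1": {"vu", "graph", "keyword"},
    "t2": {"ooi", "keyword"},
    "t3": {"tung", "papadias", "graph"},
}
compound, single = compress_nodes(tuples)
```

In this sample, "vu" and "ooi" also occur once, but each is alone in its tuple among the once-occurring terms, so they stay single nodes.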

An example of a compressed KRG
[Figure: a database and the compressed KRG of the database]

Graph construction
• Create nodes
  • Compound nodes for terms that occur only once in the database and are in the same tuple
  • Single nodes for other terms
• Create edges
  • Nodes representing terms in the same tuple: an edge at distance 0
  • Nodes representing terms in two tuples that can be connected by d join operations: an edge at distance d

Join keyword tree (JKT)
• Given a subgraph SG of a KRG, JKT(SG) is a tree satisfying
  • Each tree vertex maps to a non-empty set of nodes of SG, and the tree vertices collectively contain all nodes in SG
  • Each edge connecting two vertices is associated with a single distance d
• Mapping rules
  • If two SG nodes map to the same tree vertex, there must exist a relationship at distance 0 between them in SG
  • If two SG nodes map to different tree vertices, there must exist a relationship at distance d' between them in SG, where d' is the sum of the distances on the path connecting the two tree vertices
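The definition can be read as a checklist, which the following sketch turns into a validator for a proposed tree. It only verifies a given mapping and does not search for one. The representation (a `vertex_of` mapping, `(vertex, vertex, d)` tree edges, a dictionary of distance sets for SG) and all names are assumptions made for illustration.

```python
from itertools import combinations

def is_valid_jkt(sg_nodes, sg_dist, vertex_of, tree_edges):
    """Check whether (vertex_of, tree_edges) is a JKT of the subgraph SG.

    sg_dist:    frozenset({a, b}) -> set of relationship distances in SG
    vertex_of:  SG node -> tree vertex id
    tree_edges: list of (vertex, vertex, distance)
    """
    if set(vertex_of) != set(sg_nodes):
        return False                    # every SG node must be mapped
    vertices = set(vertex_of.values())  # non-empty by construction
    adj = {v: [] for v in vertices}
    for a, b, d in tree_edges:
        if a not in adj or b not in adj:
            return False                # edge touches a vertex with no nodes
        adj[a].append((b, d))
        adj[b].append((a, d))
    if len(tree_edges) != len(vertices) - 1:
        return False                    # a tree has |V| - 1 edges

    def path_lengths(src):
        """Sum of edge distances from src to every reachable vertex."""
        out, stack = {src: 0}, [src]
        while stack:
            cur = stack.pop()
            for nxt, d in adj[cur]:
                if nxt not in out:
                    out[nxt] = out[cur] + d
                    stack.append(nxt)
        return out

    lengths = {v: path_lengths(v) for v in vertices}
    if any(len(l) != len(vertices) for l in lengths.values()):
        return False                    # not connected, hence not a tree
    # Mapping rules: same vertex needs distance 0, otherwise the path sum.
    for a, b in combinations(sg_nodes, 2):
        need = lengths[vertex_of[a]][vertex_of[b]]
        if need not in sg_dist.get(frozenset((a, b)), set()):
            return False
    return True

# Hypothetical SG: a and b share a tuple; c is one join away from both.
sg_nodes = ["a", "b", "c"]
sg_dist = {
    frozenset(("a", "b")): {0},
    frozenset(("a", "c")): {1, 2},
    frozenset(("b", "c")): {1},
}
vertex_of = {"a": "v1", "b": "v1", "c": "v2"}
ok = is_valid_jkt(sg_nodes, sg_dist, vertex_of, [("v1", "v2", 1)])
bad = is_valid_jkt(sg_nodes, sg_dist, vertex_of, [("v1", "v2", 2)])
```

The second call fails because b and c are only related at distance 1 in SG, so a tree edge of distance 2 between their vertices violates the second mapping rule.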

Example of a JKT
[Figure: mapping from SG to JKT(SG)]

Example of a JKT
[Figure: JKT(SG) and the database]

Candidate graph (CG)
• Given a query q and a KRG, CG(KRG, q) is a subgraph SG of the KRG satisfying
  • SG includes all nodes of the KRG containing the query keywords, and only these nodes
  • SG is complete
  • There exists at least one JKT(SG)
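The first two conditions are cheap to test and can serve as a pre-check, sketched below. The third condition, the existence of a JKT, is deliberately left out, so passing this check is necessary but not sufficient. Nodes are modelled as frozensets of terms so that compound nodes fit. Requiring every keyword to be present assumes AND semantics. All names are hypothetical.

```python
from itertools import combinations

def candidate_nodes(krg_nodes, query):
    """KRG nodes (frozensets of terms) that contain at least one query keyword."""
    keywords = set(query)
    return [n for n in krg_nodes if n & keywords]

def passes_precheck(krg_nodes, krg_edges, query):
    """Conditions 1 and 2 only: keyword nodes exist and are pairwise connected.

    krg_edges is a set of frozenset({node_a, node_b}) pairs.
    """
    nodes = candidate_nodes(krg_nodes, query)
    covered = set().union(*nodes) & set(query)
    if covered != set(query):
        return False        # assumption: AND semantics, every keyword must appear
    # Completeness: every pair of selected nodes must share an edge.
    return all(frozenset((a, b)) in krg_edges
               for a, b in combinations(nodes, 2))

# Hypothetical KRG with three single nodes and one edge.
g = frozenset({"graph"})
k = frozenset({"keyword"})
d = frozenset({"database"})
krg_nodes = [g, k, d]
krg_edges = {frozenset((g, k))}

hit = passes_precheck(krg_nodes, krg_edges, ["graph", "keyword"])
miss = passes_precheck(krg_nodes, krg_edges, ["graph", "database"])
```

A database whose KRG fails this pre-check can be dropped without attempting the more expensive JKT search.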

Important theorems
• Theorem 1: if a database contains a result with all keywords of a query q, then the corresponding KRG must have a candidate graph CG(KRG, q)
• Theorem 2: the existence of a candidate graph CG(KRG, q) in KRG does not guarantee that the corresponding database has results for q
• Note: CG(KRG, q) indicates a high probability of the database having results.

Query processing

Experimental study
• Use the DBLP dataset to generate 81 databases according to bibliography types
• Compare G-KS vs M-KS
• Effectiveness measurement
  • Use a brute-force (BF) method that sends the query to all databases and selects the top-K databases from the returned results
  • Let K_BF(M) and K_G-KS(M) be the total number of results with M keywords in BF and G-KS
  • Recall: K_G-KS(M) / K_BF(M)
  • Precision: K_G-KS(M) / K

Pre-processing cost

Effect of varying #keywords in a query

Effect of varying top-K selected DBs

Effect of varying max relationship dist.

Conclusion
• G-KS: A method that selects the top-K databases for processing a relational keyword search query
• G-KS summarizes each database as a keyword relationship graph where
  • Nodes correspond to terms
  • Edges capture distance relationships between terms
• IR techniques are applied to weight nodes and edges
• An algorithm is designed to consider all keywords as a whole in query processing in order to minimize the number of false positives

Thank you! Questions & Answers
