140 likes | 294 Views
Effective Keyword-Based Selection of Relational Databases. By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung. Presented by Deborah Kallina. Introduction. The database selection problem for relational data sources Demand for keyword-based searches of structured databases
E N D
Effective Keyword-Based Selection of Relational Databases By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung Presented by Deborah Kallina
Introduction • The database selection problem for relational data sources • Demand for keyword-based searches of structured databases • What about searches over multiple structured databases?
Current Approach • Document collections are summarized with keyword frequency lists • Inadequate because… • Relational tables are typically normalized • Results must consider number of join operations Example: query Q = {multimedia, database, VLDB}
The Usefulness of a Database in Answering a Keyword Query • A database is useful if it has high quality results to the keyword query. • What are “high quality” results? • The results contain all the query keywords. • There is a keyword relationship – the query keywords can be connected meaningfully in the database.
This paper proposes… • A method that summaries the relationships between keywords in a database. • A ranking method based on the keyword summaries in order to select the most useful databases for a given keyword query.
Why Summaries? • We have an equation which measures the usefulness of a database DB to query Q • But calculating this score isn’t feasible for every database in the system! • So we estimate the usefulness of the database using a summary of the relationships between all pairs of keywords in the database.
Results of a Keyword Query • A minimal tree of tuples containing all the query keywords * tuples in the tree are connected according to their relationships defined in the database schema • The more distant the relationships, the weaker the relationships connecting the tuples. • distance - the number of join operations in a joining sequence of tuples Examples: • t1 t2 t3, distance = 2 • t4, distance = 0 • DB2 of Figure 1, Q = {multimedia, binder} • t1 t5 t12or t4 t9 t12, distance = 2 • t15 t10 t1 t5 t12, distance = 4
The Matrices • Each relational database DB has several types of matrices • one D matrix – represents the presence or absence of each keyword in each tuple in DB • multiple T matrices – represents the relationship between tuples in a relational database at each distance • construct the Wd matrices – the frequencies of keyword pairs at distance d;Example: Figure 3
Calculate the Key Relationship Matrix (KRM) • wd(ki, kj) is the frequency of d-distance joining sequences to connect the pair of keywords ki and kj • is the maximum number of join operations (user-specified) Example: Figure 4 Q = {database:VLDB} for DB2 in Figure 3, the only non-zero is when d = 2 1/(2+1) * 1 occurrence = 0.33
Implementation in SQL • The generation of the matrices: D, T1, T2, … T and W0, W1, … , W , for each DB, can be performed conveniently inside the local RDBMS using SQL. • Tables are created for each matrix. • Tables can be maintained dynamically when tuples are added to the database or old tuples are deleted.
Data Sources Selection Using KRM • Estimate the score of the data source DB to Q through the scores of the individual pairs. • With the KR-summary, we can effectively rank a set of databases. A higher score of the data source DB to Q indicates a better ranking.
Definitions & Metrics for Comparing Experimental Results • real ranking – the “real” score of each database based on Equation 2-1 • recall – compares the accumulated score of the top l databases selected based on the summaries of the source databases against the total available score when we select top l databases according to real ranking (summaries vs. real ranking) • precision – measures the fraction of the top l selected databases which have high quality results
Experimental Results Observations: • The selection performance of KR-summaries generally gets better as gets larger. • The precision and recall performance for different values of tends to cluster into groups. • There are big gaps in both precisions and recalls between KR-summaries when 0 ≤ ≤ 1 and when is greater.
Conclusion “The estimation method is effective in assigning high ranks to useful databases, although less relevant or irrelevant databases might be selected.”