DBXplorer: A System for Keyword-Based Search over Relational Databases

DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal Surajit Chaudhuri Gautam Das Presented by Bhushan Pachpande

Contents • Introduction • Overview of DBXplorer • Symbol Table Design - Publish • Keyword Search • Support for Generalized Matches • Conclusion

Introduction • Internet search engines have popularized keyword-based search. • Searching on traditional database management system is done through customized applications which are closely tied to the database schema. • Traditional database management systems do not support keyword-based search. • e.g. search the Microsoft intranet on ‘Jim Gray’ to obtain matched rows, i.e., rows in the database where ‘Jim Gray’ occur. • In this paper, DBXplorer, an efficient and scalable keyword search utility for relational databases, is described. • The goal is to enable such searches without necessarily requiring the users to know the schema of the respective databases.

Example Searching for a book • Keywords “Programming” by “Ritchie” • Less probability of presence of both keywords in single row of table • For result, rows need to be generated by joining tables on the fly (all possible combinations) Authors AuthorsBooks Books BookStores Store

Overview of DBXplorer • Objective - Given a set of query keywords, DBXplorer returns all rows (either from single tables, or by joining tables connected by foreign-key joins) such that the each row contains all keywords. • Applying IR techniques from the documents world to databases is difficult, because of • Database normalization – by which logical units of information may be fragmented and scattered across several tables • Matching row may be obtained by joining several tables on the fly • IR techniques use Inverted Lists = Symbol Table in databases • Symbol Table - Stores the information about keywords at different granularities (column/row), i.e. for each keyword stores the list of all rows

Overview of DBXplorer - Publish • Enabling keyword search in DBXplorer requires 2 steps • Publish • enables database for keywords by building Symbol table and associated structures • Search • retrieves matching rows from published database • Steps in Publish • Step 1: A database is identified, along with the set of tables and columns within the database to be published. • Step 2: Auxiliary tables [like Symbol table] are created for supporting keyword searches.

Overview of DBXplorer - Search • Steps in Search • Looks up symbol table to find tables / columns which contain keywords • Identify all possible joins (subsets of table if joined) whose rows might contain required keywords • Generate SQL statements for each join (gives rows which contain all keywords), rank rows and return

Architecture of DBXplorer • The publish component provides interfaces to • select a database, • select tables/columns within the database to publish, and • modify/remove/maintain the publication. • For a given set of keywords, the search component provides interfaces to • retrieve matching databases from a set of published databases, and • selectively identify tables, columns/rows that need to be searched within each database identified in step (1). Architecture of DBXplorer

Symbol Table Design • Exact match problem considered only • Symbol Table (S) - stores the information about keywords at different granularities (column/row), i.e. for each keyword stores the list of all rows • Column Level granularity (Pub-Col) • For every keyword S maintains list of all database columns (i.e. table.column) • Cell level granularity (Pub-cell) • For every keyword S maintains list of all database cells (i.e. table.column.rowid)

Symbol Table - Design factors • Space and Time Requirements • Size: pub-col are smaller than pub-cell since repetition of keyword in a column does not increases entries in case of pub-col • Time: pub-col takes less time to build • Keyword search performance • Depends on efficient generation and execution of SQL statement (built from symbol table entries) • pub-cell returns more number of SQL statements than pub-col as for a keyword in column there are multiple entries • Ease of maintenance • Insert/Update: required for insertion of distinct new value in case of pub-col while pub-cell needs for every update/insert • Same for delete

Storing Symbol Table • Store symbol tables (pub-col) in database as (keyword hash , column Id) • FK Compression (Foreign Key) • If there is foreign key relationship between c1 and c2, store only c1 • CP Compression • Partition H into a minimum number of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). • Compress each clique. • Stores symbol table (pub-cell) in database as (keyword hash, list of all cellids) v2 v3 v4 c1 c2 x Uncompressed hash table ColumnsMap table Compressed hash table

Search - Enumerating Join Trees • Step1 - Looks up symbol table to find tables / columns which contain keywords • Step2 - Enumerate join trees • Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. • The resulting relation will contain all potential rows having all keywords specified in the query. keywords If we view the schema graph G as an undirected graph, this step enumerates join trees, i.e., sub-trees of G such that • the leaves belong to MatchedTables • together, the leaves contain all keywords of the query Join Trees

Search – Identify matching rows • The input to this final search step is the enumeratedjoin trees. • Each join tree is then mapped to a single SQLstatement that joins the tables as specified in the tree, andselects those rows that contain all keywords. • The retrieved rows are ranked before being output. • Rows ranked by number of joins involved (ties broken arbitrarily) (same as keywords occurring close to one another in documents are ranked higher) Join Trees

Generalized Matches – Token Matches • Token matches - the keyword in the querymatches only a token or sub-string of an attribute value (e.g., retrieve rows of address by specifying only a street name). • Pub-Prefix method • B+ tree indexes can be used toretrieve rows whose cell matches a given prefix string • This clause is of the form WHERE T.C LIKE ‘P%K%’ • During publishing of a database, for every keyword K,the entry (hash(K), T.C, P) is kept in the symbol table ifthere exists a string in column T.C which • contains atoken K, and • has prefix P

Generalized Matches - Token Matches Let the hash values of thesearchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2and 3 respectively Pub-Prefix table Database table T Consider searching keyword “string” Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)

Conclusion • This paper discusses DBXplorer, a system that enables keyword-based search in relational databases. • DBXplorer uses symbol table alternatives to store the location of keywords in database. • DBXplorer support exact matches and generalized matches upto some extent.

DBXplorer: A System for Keyword-Based Search over Relational Databases