DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das

DBXplorer: A System for Keyword-Based Search over Relational DatabasesSanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang 04-04-2006

Outline • Introduction • Overview of DBXplorer • Publish • Search • Generalized Matches • Conclusion

Introduction • Internet search engines have popularized keyword-based search. • Traditional database management systems do not support keyword-based search. • e.g. search the Microsoft intranet on ‘Jim Gray’ to obtain matched rows, i.e., rows in the database where ‘Jim Gray’ occur. • In this paper, DBXplorer, an efficient and scalable keyword search utility for relational databases, is described. • The goal is to enable such searches without necessarily requiring the users to know the schema of the respective databases.

Overview of DBXplorer • Given a set of query keywords, DBXplorer returns all rows (either from single tables, or by joining tables connected by foreign-key joins) such that the each row contains all keywords. • Two Steps: • Publish • Search

Overview of DBXplorer - Publish • Publish - a preprocessing step that enables databases for keyword search by building the symbol table and associated structures • Step 1: A database is identified, along with the set of tables and columns within the database to be published. • Step 2: Auxiliary tables are created for supporting keyword searches. • The most important structure is a symbol table S that is used at search time to efficiently determine the locations of query keywords in the database (i.e., the tables, columns, rows they occur in).

Overview of DBXplorer - Search • Search - gets matching rows from the published databases. • Step 1: Searching the symbol table • Step 2: Enumerating Join Trees. • Step 3: Identify matching rows. The final rows are ranked and presented to the user.

Overview of DBXplorer • The publish component provides interfaces to • select a database, • select tables/columns within the database to publish, and • modify/remove/maintain the publication. • For a given set of keywords, the search component provides interfaces to • retrieve matching databases from a set of published databases, and • selectively identify tables, columns/rows that need to be searched within each database identified in step (1). Architecture of DBXplorer

Publish – Symbol Table • The symbol table is the key data structure used to look upthe respective locations of query keywords in thedatabase. • An important design consideration is decidingthe location granularity • (a) column levelgranularity (Pub-Col) (i.e., list of table.column ) • (b) cell levelgranularity (Pub-Cell)(i.e., listof table.column.rowid). • The Pub-Col symbol table alternative is almost alwaysbetter than the Pub-Cell table, unless certain columns donot have indexes.

Store symbol tables (Pub-Col) in databases – compression algorithms FK-Comp CP-Comp (a)Partition H into a minimum number of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). (b) Compress each clique. Publish – Symbol Table v2 v3 v4 c1 c2 x Uncompressed hash table (map the symbol table S to a bipartite graph H having two node sets, HashVal and ColId, where every row (v, c) in table S corresponds to an edge in H.) ColumnsMap table Compressed hash table

Search • Step 1 – Search the symbol table (using generated SQL) toidentify the database tables, columns/cells that contain atleast one of the keywords in the query. • Step2 – enumerate Join Trees • Step3 – identify matching rows

Search – Enumerate Join Trees • Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. • The resulting relationwill contain all potential rows having all keywordsspecified in the query. • If we view theschema graph G asan undirected graph, this stepenumerates join trees, i.e., sub-trees of G such that: • the leaves belong to MatchedTables and • together, theleaves contain all keywords of the query Join Trees

Search – Identify matching rows • The input to this final search step is the enumeratedjoin trees. • Each join tree is then mapped to a single SQLstatement that joins the tables as specified in the tree, andselects those rows that contain all keywords. • The retrieved rows are ranked before being output.

Generalized Matches – Token Matches • Token matches – the keyword in the querymatches only a token or sub-string of an attribute value (e.g., retrieve rows of address by specifying only a street name). • Pub-Prefix method – based on the followingcrucial observation: B+ tree indexes can be used toretrieve rows whose cell matches a given prefix string. • During publishing of a database, for every keyword K,the entry (hash(K), T.C, P) is kept in the symbol table ifthere exists a string in column T.C which • (a) contains atoken K, and • (b) has prefix P.

Generalized Matches – Token Matches Let the hash values of thesearchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2and 3 respectively Pub-Prefix table Database table T

Conclusion • This paper discusses DBXplorer, a system that enables keyword-based search in relational databases. • DBXplorer uses symbol table alternatives to store the location of keywords in database. • DBXplorer support exact matches and generalized matches. • DBXplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser front-end.

DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das