1 / 15

DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das

DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das. Cathy Wang 04-04-2006. Outline. Introduction Overview of DBXplorer Publish Search Generalized Matches Conclusion. Introduction.

varuna
Download Presentation

DBXplorer: A System for Keyword-Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DBXplorer: A System for Keyword-Based Search over Relational DatabasesSanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang 04-04-2006

  2. Outline • Introduction • Overview of DBXplorer • Publish • Search • Generalized Matches • Conclusion

  3. Introduction • Internet search engines have popularized keyword-based search. • Traditional database management systems do not support keyword-based search. • e.g. search the Microsoft intranet on ‘Jim Gray’ to obtain matched rows, i.e., rows in the database where ‘Jim Gray’ occur. • In this paper, DBXplorer, an efficient and scalable keyword search utility for relational databases, is described. • The goal is to enable such searches without necessarily requiring the users to know the schema of the respective databases.

  4. Overview of DBXplorer • Given a set of query keywords, DBXplorer returns all rows (either from single tables, or by joining tables connected by foreign-key joins) such that the each row contains all keywords. • Two Steps: • Publish • Search

  5. Overview of DBXplorer - Publish • Publish - a preprocessing step that enables databases for keyword search by building the symbol table and associated structures • Step 1: A database is identified, along with the set of tables and columns within the database to be published. • Step 2: Auxiliary tables are created for supporting keyword searches. • The most important structure is a symbol table S that is used at search time to efficiently determine the locations of query keywords in the database (i.e., the tables, columns, rows they occur in).

  6. Overview of DBXplorer - Search • Search - gets matching rows from the published databases. • Step 1: Searching the symbol table • Step 2: Enumerating Join Trees. • Step 3: Identify matching rows. The final rows are ranked and presented to the user.

  7. Overview of DBXplorer • The publish component provides interfaces to • select a database, • select tables/columns within the database to publish, and • modify/remove/maintain the publication. • For a given set of keywords, the search component provides interfaces to • retrieve matching databases from a set of published databases, and • selectively identify tables, columns/rows that need to be searched within each database identified in step (1). Architecture of DBXplorer

  8. Publish – Symbol Table • The symbol table is the key data structure used to look upthe respective locations of query keywords in thedatabase. • An important design consideration is decidingthe location granularity • (a) column levelgranularity (Pub-Col) (i.e., list of table.column ) • (b) cell levelgranularity (Pub-Cell)(i.e., listof table.column.rowid). • The Pub-Col symbol table alternative is almost alwaysbetter than the Pub-Cell table, unless certain columns donot have indexes.

  9. Store symbol tables (Pub-Col) in databases – compression algorithms FK-Comp CP-Comp (a)Partition H into a minimum number of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). (b) Compress each clique. Publish – Symbol Table v2 v3 v4 c1 c2 x Uncompressed hash table (map the symbol table S to a bipartite graph H having two node sets, HashVal and ColId, where every row (v, c) in table S corresponds to an edge in H.) ColumnsMap table Compressed hash table

  10. Search • Step 1 – Search the symbol table (using generated SQL) toidentify the database tables, columns/cells that contain atleast one of the keywords in the query. • Step2 – enumerate Join Trees • Step3 – identify matching rows

  11. Search – Enumerate Join Trees • Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. • The resulting relationwill contain all potential rows having all keywordsspecified in the query. • If we view theschema graph G asan undirected graph, this stepenumerates join trees, i.e., sub-trees of G such that: • the leaves belong to MatchedTables and • together, theleaves contain all keywords of the query Join Trees

  12. Search – Identify matching rows • The input to this final search step is the enumeratedjoin trees. • Each join tree is then mapped to a single SQLstatement that joins the tables as specified in the tree, andselects those rows that contain all keywords. • The retrieved rows are ranked before being output.

  13. Generalized Matches – Token Matches • Token matches – the keyword in the querymatches only a token or sub-string of an attribute value (e.g., retrieve rows of address by specifying only a street name). • Pub-Prefix method – based on the followingcrucial observation: B+ tree indexes can be used toretrieve rows whose cell matches a given prefix string. • During publishing of a database, for every keyword K,the entry (hash(K), T.C, P) is kept in the symbol table ifthere exists a string in column T.C which • (a) contains atoken K, and • (b) has prefix P.

  14. Generalized Matches – Token Matches Let the hash values of thesearchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2and 3 respectively Pub-Prefix table Database table T

  15. Conclusion • This paper discusses DBXplorer, a system that enables keyword-based search in relational databases. • DBXplorer uses symbol table alternatives to store the location of keywords in database. • DBXplorer support exact matches and generalized matches. • DBXplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser front-end.

More Related