620 likes | 742 Views
Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013) . Dilbert Strip . Motivation. Keyword search We have SQL, why keyword-querying? SQL - not appropriate for naive users
E N D
Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013)
Motivation • Keyword search • We have SQL, why keyword-querying? • SQL - not appropriate for naive users • So many online databases (imdb, citeseer, bseindia …) – user cannot keep track of schema for all of these
Simple Approaches • Using Form interfaces • Require separate form for each type of query – confusing • Not suitable for ad-hoc queries – how many forms will you provide? • How about Google? • Export data from db to documents and do keyword- querying on these • Suffers from duplication overheads • Google wants all keywords in one document. DB is often normalized, so need to join tables and store as documents • Multiple combinations of tables to join. Not scalable …
Differences from Web Search • Related data split across multiple tuples due to normalization • Different keywords may match tuples from different relations • What joins are to be computed can only be decided on the fly • Need to find result containing all keywords and rank them somehow Writes (AuthorId, PaperId) Paper (PaperId, PaperName) Author (AuthorId, AuthorName) Cites (Citing, Cited) The DBLP Bibliography Schema
Systems for DB search • BANKS (Browsing and Keyword Search) – IITB (ICDE ’02) • DBXplorer – Microsoft Research (ICDE ’02) • ObjectRank – IBM, UCSD, FIU (VLDB ’04) • Bidirectional BANKS – IITB (VLDB ’05)
Systems for DB search • BANKS (Browsing and Keyword Search) – IITB (ICDE ’02) • DBXplorer – Microsoft Research (ICDE ’02) • ObjectRank – IBM, UCSD, FIU (VLDB ’04) • Bidirectional BANKS – IITB (VLDB ’05) will cover in depth
The BANKS system • BANKS Architecture • Available on the web • http://www.cse.iitb.ac.in/banks • Connects to database using JDBC • JDBC metadata features used to provide schema browsing • Preprocesses db User BANKS JDBC HTTP Web-server Database
BANKS01:Keyword Search MO:MultiQuery Optimizn paper Charuta:BANKS01 writes Charuta S. Sudarshan Prasan Roy author Basic Model • Database: modeled as a graph • Nodes = tuples • Edges = references between tuples • foreign key (assume for this talk), inclusion dependencies, .. • Edges are directed. PaperId:PaperName AuthorID:PaperId AuthorId DBLP example
The BANKS Answer Model • Query: set of search terms {t1, t2, .., tn} • For each search term ti we find set of nodes Si matching ti • Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy) • Answer: rooted, directed tree connecting nodes matching keywords • Root node has special significance, may be restricted to some relations • E.g. relations representing entities, not relationships • May include intermediate nodes not in any Si (Steiner Tree) • Multiple answers • Ranking based on proximity + prestige
MultiQuery Optimization Prasan Roy S. Sudarshan Answer Example Query: sudarshanroy Paper • We would like to find sets of (closely) connected tuples that match all given keywords Writes Writes Author Author
Edge Directionality • Directed tree will miss desired answers. For eg: Query = DBXplorerObjectRank • So, for each forward edge, BANKS adds a back edge CitedBy Cited Cites BANKS DBXPlorer ObjectRank Cites Cited CitedBy BANKS Cites DBXPlorer ObjectRank Cites
Edge Directionality • What if we ignore directionality? • Some popular tuples are connected to many other tuples • E.g. Students -> departments -> university • Problem: A popular tuple would create misleading shortcuts between tuples • E.g. every student would be closely linked with every other student via the department/university • Solution: define different forward and backward edge weights • Forward edges: In the direction of the foreign key reference
3 1 3 1 3 1 Edge Weight • Weight of forward edge based on schema • e.g. citation link weights > “writes” link weights • Weight of backward edge = indegree of edges pointing to the node
Edge Weight Scaling • Normalize edge score Escore(e) • Make edge weight scale-free by dividing edge weigth by wmin • Problem: Some backward edges have unduly large weights • Depress the scale by defining Escore(e) as log(1+w(e)/wmin ) • Overall Escore E = 1 / (1 + e Escore(e))
Node Weight • Set weight of a Node = Indegree of the node • As per prestige rankings nodes with multiple pointers to them get a higher prestige • So, higher node weight corresponds to higher prestige • Problem: Nodes with many in-edges result in skewed answers • Subdue extreme node weights by using log(1+indegree) • Node score Nscore = Average of node scores (root-node-weight + leaf-node-weights)
Combining Scores • Combining two independent metrics: node weight and edge weight • Normalize each to 0-1 • Combine using weighting factor • Additive: (1- ) Escore + Nscore • Multiplicative: Escore * Nscore • Performance study to compare alternatives and to find reasonable values for
First Step – Symbol Table • The first step is to build a symbol table • This table is in the db and is not normalized • Example:
Searching for Best Answers • Backward Expanding Search Algorithm: • Assume: graph fits in memory • Idea: find vertices from which a forward path exists to at least one node from each Si. • Run concurrent single source shortest path algorithm from each node matching a keyword • Create an iterator for each node matching a keyword • Traverse the graph edges in reverse direction • Output a node whenever it is on the intersection of the sets of nodes reached from each keyword • Answer trees may not be generated in relevance order
Query: sudarshan roy paper MultiQuery Optimization writes S. Sudarshan Prasan Roy authors Backward Expanding Search Iterators
BANKS Query Result Example • Result of “Sudarshan Roy”
Result Ordering • Answers need not be always in Relevance order This tree is output Better Root Missed 2 2 2 5 2 1
Result Ordering (contd…) • Solution: • Generate all connection trees and then sort them • Increases computation costs and leads to a greatly increased time to generate initial results • Create a small heap ordered on the relevance of the trees • Output highest ranked tree from heap to user when heap is full • What about duplicate results? • Maintain a list of generated results for duplicate detection • Discard result according to relevance
Experience and Performance • BANKS provides keyword search coupled with extensive browsing facilities • Schema browsing + data browsing • Graphical display of data • Implemented using Java + servlets • Keyword search response times typically 1 to 3 seconds on • DBLP database with 100,000 tuples/300,000 edges • P3 600 MHz, 512 MB RAM
Anecdotes • “Mohan” • Returns C. Mohan at top based on prestige (number of papers written) • “Transaction” • Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations) • “Sunita Seltzer” • No common papers, but both have papers with Stonebraker: system finds this connection
EdgeLog Effect of Parameters • Log scaling of edge weights worked well • (1- ) E + N versus E Nmade little difference • Best with = .2 (subdue node weights but not entirely)
Motivation • BANKS performs poorly if • Keyword matches lot of nodes (so lot of Dijkstra sources) • Search hits a node with large fan – in. … Wastes time Sudarshan Roy
New Ideas – Forward Search • Why only backward, lets search forward too : How about fwd Searching ? … Sudarshan Roy
New Ideas - Activation • Activation :- Cannot forward search from each node. • Spread activation from keyword nodes to others. • Activation is like Page Rank with decay. High Activation close to many keywords.
Activation Spreading • Spreading Activation • Node with highest activation explored first • Activation spread to neighbors (μ = 0.3) • Gives low activation to neighbors of hubs
Modifications to Model • Graph model stays the same. • BANKS is concerned with search more than how to tune parameters or define node – weights / edge – weights. • BANKS code : Tree Node – Score, N = Tree Edge – Score, E = Total Score = ENl (l = 0.2)
The New Algorithm • Need two priority queues : Qin - do backward search from these nodes Qout - do forward search from these nodes • Each node, n keeps 3 variables per keyword, ti • sp [i] : Node to got to from n for shortest-path to ti • distance [i] : Length of the shortest-path from n to ti • Activation [i] : Activation to n from keyword ‘ti’
The New Algorithm – continued… • Set initial activation keyword nodes and add to Qin for backward-search. • At each step, pick node with maximum activation i.e. if (Qin.getMaxActivation > Qout. getMaxActivation)) // use node from Qin else // use node from Qout • If node from Qin,do backward search and add itself toQout. (newly explored nodes into Qin) • If node from Qout, do forward search • If node has reached from all keyword, generate result-tree. [answer is buffered as results can be out of order]
Explanation with example Qin Qout N100 N4 Roy Sudarshan N1 … N3 N2 Roy Sudarshan
Explanation with example Qin Qout N100 Roy Sudarshan N4 N1 N2 N1 … N3 N2 Roy Sudarshan
Explanation with example Qin Qout N100 N2 N3 … N100 N1 Roy Sudarshan N4 N1 … N3 N2 Roy Sudarshan Result Found !
Generation of top-k results • If we know the score of next-best answer, all buffered answers with better score can be output. • Need upper bounds
Computation of upper bound • For each keyword ti, we have explored nodes upto some length – say li. • So, next – best – score (approx.) = • This is not a true upper bound, but works quite well and is simple !
Are we losing answers ? • BANKS – I used many Dijkstra states, BANKS – II uses 2 only – forward and backward search-states. • The result is that we can now lose answers !
Ny K1 K2 Ny Nx K2 K1 Answer Loss Example Ny K1 Nx K2 K1 This is the generated answer. This answer is lost.
But, we will generate this tree rooted at Nx: • So, a rotated tree with same nodes but different root is often generated ! NY K2 NX K1
Metrics of Performance • Manually obtain best relevant answers. • Determine 2 times : • Time taken to produce last relevant answer. • Time taken to output last relevant answer. • Search algorithms • MI-Bkwd: original backward search • Iterator for every node matching a keyword • SI-Bkwd: backward search with single backward iterator • Bidirec: bidirectional search • Datasets • DBLP, IMDB ~ 2 million nodes, 9 million edges • US Patent DB ~ 4 million nodes, 15 million edges
Graph - I • MI-Bkwd versus SI-Bkwd • SI-Bkwd gain increases with origin size, # keywords
Graph - II • SI-Bkwd versus Bidirec • Bidirec gain increases with origin size, # keywords
A Critique • BANKS needs a lot of memory. • Need to cluster and keep parts of graph on disk. • Work is in progress
DBXplorer : (Microsoft Research) • Use symbol – table to determine which tables to join. • Generate all possible table – join combinations : Figure : T1, T2, T3, T4 and T5 are tables
Cool ideas in DBXplorer • Symbol table need not be at tuple level. If column has an index, column – level symbol table is ok. • Table Compression : • e.g. : KeywordsColumnsKeywordsColumns K1 K1 C1 C1 K2 K2 X C2 K3 C2 K3 K4 K4 K5 K5 Intermediate Column