Keyword Search Over Graph Databases
Institute of Computer Science and Technology of Peking University (北京大学计算机科学技术研究所)
Instructor: Lei Zou
DB/IR: Different Positions
• Structured data vs. unstructured data
• Schema vs. non-schema
• Query processing vs. ranking
In the Past
• IR systems: unstructured search (keywords) over unstructured data (documents)
• DB systems: structured search (SQL, XQuery) over structured data (records)
Now
[Figure: the same search/data grid. Search is unstructured (keywords) or structured (SQL, XQuery); data is structured (records) or unstructured (documents).]
• IR systems
• DB systems
• Keyword search over relational graphs
• Digital libraries, enterprise BI, Web 2.0
• Querying entities and relations from information extraction
• Text data, schema later, ranking
In the Future
• Search and data are served by one integrated DB & IR system (models, algorithms, platform)
1. Approximate matching and record linkage: “Beijing University” & “PKU”
2. Too-many-answers ranking: answers are ranked instead of returned as “set”-style outputs
3. Schema relaxation and heterogeneity: no-schema approaches, schema mapping, and data integration
4. Information extraction and entity (and relationship) search: e.g., find all books written by Mark Twain
5. Uncertain data management: some data are uncertain, such as data extracted by IE
The First Step: IR-Style Search Over Database Systems
• SQL is hard to learn
• Users have no idea about the schema
• Keywords
Full-Text Keyword Search
[Figure: database tuples, some of which contain the keywords “Michelle” and “XML”]
• Find a set of tuples that contain the keywords.
• Full-text index (Oracle, DB2, SQL Server, …)
• These systems support a CONTAINS(A, w) predicate that retrieves the set of tuples containing keyword w in attribute A.
SELECT student_id, student_name FROM students WHERE CONTAINS( address, 'beijing' )
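The CONTAINS lookup above is typically served by a full-text (inverted) index. The sketch below is only a minimal illustration in Python, not how any of the named database systems implement it. The `rows` data, `build_index`, and `contains` are hypothetical names introduced here.

```python
from collections import defaultdict

def build_index(rows, attribute):
    """Map each lower-cased token of `attribute` to the ids of rows containing it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for token in row[attribute].lower().split():
            index[token.strip(",.")].add(row_id)
    return index

def contains(index, word):
    """Return the ids of rows whose indexed attribute contains `word`."""
    return index.get(word.lower(), set())

# Hypothetical student rows keyed by student_id.
rows = {
    1: {"address": "Haidian, Beijing"},
    2: {"address": "Pudong, Shanghai"},
    3: {"address": "Chaoyang, Beijing"},
}
address_index = build_index(rows, "address")
matching_ids = contains(address_index, "beijing")
```

The lookup is a single dictionary access, which is why full-text predicates avoid scanning every tuple.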
Structural Keyword Search in RDB
[Figure: interconnected tuples, some of which contain “Michelle” or “XML”]
• Tuples are connected through foreign key references in an RDB.
• Find a set of interconnected tuples that contain the keywords.
RDB Example
[Figure: four relations, Paper (p1–p4), Author (a1–a3), Write (w1–w6), and Cite (c1–c5)]
Tuple Connections (Database Graph)
[Figure: the tuples of the example drawn as a graph; some nodes contain “Michelle” or “XML”]
• Tuples are connected through foreign key references
What is the relationship between Michelle and XML?
[Figure: the database graph with the “Michelle” and “XML” tuples highlighted]
[Figure: several joining networks of tuples that connect “Michelle” and “XML”]
• How are keywords connected?
• Minimal total joining network of tuples
• Total: all keywords are contained in the network
• Minimal: the network is no longer total if any tuple is removed
• Tmax: maximum number of nodes allowed
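The total and minimal conditions can be checked mechanically. The sketch below is an illustration under assumptions: the tuple ids, the `text` mapping from tuple to keywords, and the function names are hypothetical. "Minimal" is read as "removing any single tuple leaves something that is no longer a connected network covering all keywords".

```python
def is_connected(nodes, edges):
    """True if `nodes` is non-empty and connected using only edges inside it."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return seen == nodes

def is_total(nodes, text, keywords):
    """True if every keyword occurs in at least one tuple of the network."""
    return all(any(k in text[n] for n in nodes) for k in keywords)

def is_minimal_total_network(nodes, edges, text, keywords, t_max):
    nodes = set(nodes)
    if len(nodes) > t_max:
        return False
    if not (is_connected(nodes, edges) and is_total(nodes, text, keywords)):
        return False
    # Minimal: dropping any one tuple must break connectivity or totality.
    for n in nodes:
        rest = nodes - {n}
        if is_connected(rest, edges) and is_total(rest, text, keywords):
            return False
    return True

# Hypothetical fragment of a database graph (foreign key references as edges).
edges = [("a1", "w1"), ("w1", "p1"), ("p1", "c1"), ("c1", "p2")]
text = {"a1": {"Michelle"}, "p1": {"XML"}, "w1": set(), "c1": set(), "p2": set()}
keywords = ["Michelle", "XML"]
```

With this data, the network a1 – w1 – p1 passes the check, while adding c1 and p2 fails it because p2 can be removed without losing a keyword.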
Outline
• Schema-Based: DISCOVER Algorithm
• Non-Schema-Based: BLINKS Algorithm
DISCOVER: Keyword Search in Relational Databases • Vagelis Hristidis University of California, San Diego • Yannis Papakonstantinou University of California, San Diego
Motivation • Keyword Search is the dominant information discovery method in documents • Increasing amount of data stored in databases
Motivation • Currently, information discovery in databases requires: • Knowledge of the schema • Knowledge of a query language (e.g., SQL) • Knowledge of the role of the keywords • DISCOVER eliminates these requirements
Keyword Query - Semantics • Keywords are: • in same tuple • in same relation • connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • weighted distance
Result of Keyword Query • The result is a tree T of tuples where: • each edge corresponds to a primary-foreign key relationship • every keyword is contained in a tuple of T (total) • no tuple of T is redundant (minimal)
Example - Schema • Subset of the TPC-H schema: ORDERS –(n:1)– CUSTOMER –(n:1)– NATION
Example – Keyword Query
• Query: “Smith, Miller”
• Results: [result listing not preserved]
• Smaller sizes usually denote a tighter association between keywords
Architecture
[Figure: architecture diagram; the only surviving label is “User”]
Candidate Networks Generator - Challenges • A keyword may appear in multiple tuples • The number of candidate networks can be very large (sometimes unbounded)
Candidate Network - Example
[Figure: tuple set graph with nodes ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION, joined by n:1 edges]
• CN1: O^Smith – C – O^Miller, size=2
• CN2: O^Smith – C – N – C – O^Miller, size=4
• CN3: O^Smith, C, O^Miller, C, size=3 [join structure not recoverable]
• CN4: O^Smith – C – O – C – O^Miller, size=4
• Pruning condition: R^K – S – R^L. In c1 – o – c2, c1 and c2 must be the same tuple, because the primary-to-foreign-key relationship goes from CUSTOMER to ORDERS, so CN4 is pruned.
Candidate Networks Generator - Algorithm
• Traverse the tuple set graph breadth first
• Q ← tuple sets containing keyword k1
• For each network n of tuple sets in Q do
• If pruning_condition(n), drop n
• else if is_CN(n), output n
• else expand n by one tuple set in all possible directions in the tuple set graph and insert the expansions into Q [e.g., if n is O^Smith – C, then we add to Q: O^Smith – C – O^Miller, O^Smith – C – O, O^Smith – C – N]
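The breadth-first generator can be sketched as follows. This is a simplified illustration, not the DISCOVER implementation. It makes several assumptions: the schema is the ORDERS/CUSTOMER/NATION subset, each tuple set carries at most one keyword, the size bound `max_joins` counts joins, and isomorphic networks are de-duplicated with a canonical tree encoding (an addition made here). All names are hypothetical.

```python
from collections import deque

# Assumed schema: (S, R) means S holds a foreign key to R (S n:1 R),
# with at most one foreign key between any pair of relations.
FK = {("ORDERS", "CUSTOMER"), ("CUSTOMER", "NATION")}
# Tuple sets as (relation, keyword); "" marks a free tuple set.
TUPLE_SETS = [("ORDERS", "Smith"), ("ORDERS", "Miller"),
              ("ORDERS", ""), ("CUSTOMER", ""), ("NATION", "")]

def adjacent(r1, r2):
    return (r1, r2) in FK or (r2, r1) in FK

def neighbours(n_nodes, edges):
    adj = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def canonical(nodes, edges):
    """Order-independent encoding of a labelled tree, used to skip duplicates."""
    adj = neighbours(len(nodes), edges)
    def enc(v, parent):
        return (nodes[v], tuple(sorted(enc(c, v) for c in adj[v] if c != parent)))
    return min(enc(root, None) for root in range(len(nodes)))

def pruning_condition(nodes, edges):
    """R - S - R where S has a foreign key to R: both R tuples must coincide."""
    for s, nbrs in neighbours(len(nodes), edges).items():
        rels = [nodes[n][0] for n in nbrs if (nodes[s][0], nodes[n][0]) in FK]
        if len(rels) != len(set(rels)):
            return True
    return False

def is_cn(nodes, edges, keywords):
    """Total (all keywords covered) and minimal (no leaf can be removed)."""
    covered = [kw for _, kw in nodes if kw]
    if set(covered) != set(keywords):
        return False
    adj = neighbours(len(nodes), edges)
    leaves = [i for i in adj if len(adj[i]) <= 1]
    return all(nodes[i][1] and covered.count(nodes[i][1]) == 1 for i in leaves)

def candidate_networks(keywords, max_joins):
    queue = deque(((ts,), ()) for ts in TUPLE_SETS if ts[1] == keywords[0])
    seen, result = set(), []
    while queue:
        nodes, edges = queue.popleft()
        if pruning_condition(nodes, edges):
            continue
        if is_cn(nodes, edges, keywords):
            result.append((nodes, edges))
            continue
        if len(edges) == max_joins:
            continue
        # Expand by one tuple set in every direction allowed by the schema.
        for i, (rel, _) in enumerate(nodes):
            for ts in TUPLE_SETS:
                if adjacent(rel, ts[0]):
                    grown = (nodes + (ts,), edges + ((i, len(nodes)),))
                    key = canonical(*grown)
                    if key not in seen:
                        seen.add(key)
                        queue.append(grown)
    return result

cns = candidate_networks(["Smith", "Miller"], max_joins=4)
```

Under these assumptions the generator yields the two networks with sizes 2 and 4 from the example, and the network through a free ORDERS tuple set is dropped by the pruning condition.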
Candidate Networks Generator is Complete and Non-Redundant • Prove that the generated set of candidate networks is • Complete: every solution is generated by some CN • Non-redundant: there is a database instance where removing a CN causes a solution to be lost
Size of Candidate Networks May Be Unbounded • Size is unbounded iff the schema graph G has one of the following properties: • There is a node of G that has at least two incoming edges. [e.g., PARTSUPP → LINEITEM ← ORDERS] • G has a directed cycle. [e.g., ancestor schemas]
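Both properties are easy to test on a schema graph. The sketch below assumes edges are directed from the primary-key relation to the foreign-key relation; the function name and the example schemas are illustrative.

```python
def cn_size_unbounded(edges):
    """edges: iterable of (src, dst) directed schema-graph edges."""
    indegree, succ = {}, {}
    for u, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    # Property 1: some node has at least two incoming edges.
    if any(d >= 2 for d in indegree.values()):
        return True
    # Property 2: a directed cycle, found by depth-first search colouring.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in succ}

    def has_cycle(u):
        colour[u] = GREY
        for v in succ[u]:
            if colour[v] == GREY or (colour[v] == WHITE and has_cycle(v)):
                return True
        colour[u] = BLACK
        return False

    return any(colour[n] == WHITE and has_cycle(n) for n in succ)

chain = [("NATION", "CUSTOMER"), ("CUSTOMER", "ORDERS")]
two_incoming = [("PARTSUPP", "LINEITEM"), ("ORDERS", "LINEITEM")]
self_reference = [("PERSON", "PERSON")]  # hypothetical ancestor-style schema
```

The chain schema is bounded, while the other two schemas each satisfy one of the two properties.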
Execution Plan - Challenges • Generated SQL queries are expensive due to joins • Reusability opportunities
Execution Plan
• Each CN corresponds to a SQL statement
• CN1: O^Smith – C – O^Miller; CN2: O^Smith – C – N – C – O^Miller
• Execution plan:
• CN1 ← O^Smith ⋈ C ⋈ O^Miller
• CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
Reuse Common Subexpressions - Example
• Execution plan:
• CN1 ← O^Smith ⋈ C ⋈ O^Miller
• CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
• Optimized execution plan:
• Temp ← O^Smith ⋈ C
• CN1 ← Temp ⋈ O^Miller
• CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
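The saving in the example can be counted directly. The sketch below treats each CN as a linear join sequence and assumes every join costs 1. It only factors out a shared prefix, which is far simpler than a real optimizer; the function names and the `Temp` label are illustrative.

```python
def total_joins(plans):
    """A plan with k operands needs k - 1 joins."""
    return sum(len(plan) - 1 for plan in plans)

def factor_common_prefix(plans, temp_name="Temp"):
    """Materialize the longest operand prefix shared by all plans, if any."""
    prefix = []
    for operands in zip(*plans):
        if len(set(operands)) != 1:
            break
        prefix.append(operands[0])
    if len(prefix) < 2:  # a single operand is not a join, nothing to share
        return plans
    return [prefix] + [[temp_name] + plan[len(prefix):] for plan in plans]

naive = [
    ["O_Smith", "C", "O_Miller"],            # CN1
    ["O_Smith", "C", "N", "C", "O_Miller"],  # CN2
]
optimized = factor_common_prefix(naive)
joins_before = total_joins(naive)
joins_after = total_joins(optimized)
```

Here the naive plan needs 6 joins and the plan with the shared intermediate result needs 5.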
Optimal Reuse of Common Subexpressions is NP-Complete • Simple cost model: each join has cost 1 • Prove that finding the optimal common subexpressions is NP-complete. Proof: by reduction from the string compression problem
Cost Model and Greedy Optimization Algorithm • Actual cost model: the cost of a join is the size of its result • Greedy algorithm: in each iteration, build the intermediate result of size 1 (1 join) that maximizes a score combining join frequency and result size
Tuning of Greedy Algorithm • a: frequency factor (favors reusability) • b: size factor (favors small intermediate results) • a = 1 • b between 0 and 0.3
Outline
• Schema-Based: DISCOVER Algorithm
• Non-Schema-Based: BLINKS Algorithm
Growing Uses of Graph-Structured Data
[Figures: RDF example, biopathway example, relational DB]
• Data produced directly based on graphs
• W3C standards: XML, RDF, and OWL
• Bioinformatics: BioCyc, BioMaze, etc.
• Data that can be made into graphs by restoring implicit connections among data items
• Relational data → graph
• Unstructured and heterogeneous data → graph
• Personal information management (PIM)
• Information extraction from the Web
Keyword Search on Graphs
• New and popular query paradigm
• Simple, user-friendly query interface
• Queryability on graphs without an obvious schema
• Problem definition
• Data: a weighted directed graph G = (V, E), where each node v ∈ V may be labeled with text
• Query: a keyword query q = (w1, …, wm)
• Answer: an answer to q is a pair <r, (n1, …, nm)>, where r and the ni’s are nodes satisfying:
• Coverage: ni contains keyword wi for every i
• Connectivity: a directed path exists from r to ni for every i
• A top-k query returns the k distinct root nodes with the best scores, where each root is ranked by its best-scoring answer
• Example: q = (c, d); T1 = <3, (3, 6)>, T2 = <2, (12, 4)>, …
Roadmap • Define scoring function • Interplay between semantics and ease of evaluation • Propose better graph search strategies • Worst-case performance guarantee • Combine graph indexing with search • Simple, but impractical, single-level index • Practical bi-level index in BLINKS • Partitioning-based indexing
Scoring Function
[Figure: an answer with root r, matches n1, n2, n3, and the paths from the root to the matches]
• Score definition
• For an answer T = <r, (n1, …, nm)> to a query q = (w1, …, wm), the score is defined as S(T) = f( Sr(r) + Σi Sn(ni, wi) + Σi Sp(r, ni) )
• Considers both content and graph structure
• Match-distributive property
• The contribution of matches and root-match paths can be computed in a distributive manner by summing over all matches
• Allows pre-computation of the best path, independently for each node/keyword
• Graph-distance property
• The contribution of a root-match path, Sp(r, ni), is defined to be the shortest-path distance from r to ni
• To simplify presentation, we focus on the path contribution Sp(r, ni)
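Because the path contribution is a sum of shortest-path distances, the best answer for a root can be assembled one keyword at a time. The sketch below is a naive illustration rather than the indexed BLINKS algorithm. It keeps only the path term Sp, treats a smaller sum as a better score, and uses a hypothetical graph, labels, and function names.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` along forward edges."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def best_answer(graph, labels, root, query):
    """Pick, per keyword, the reachable match closest to `root`."""
    dist = dijkstra(graph, root)
    matches, total = [], 0
    for word in query:
        candidates = [(dist[n], n) for n in dist if word in labels.get(n, ())]
        if not candidates:
            return None  # connectivity or coverage fails for this root
        d, n = min(candidates)
        matches.append(n)
        total += d
    return total, (root, tuple(matches))

def top_k(graph, labels, query, k):
    nodes = set(graph) | {v for nbrs in graph.values() for v, _ in nbrs}
    answers = []
    for root in nodes:
        answer = best_answer(graph, labels, root, query)
        if answer is not None:
            answers.append(answer)
    return sorted(answers)[:k]

# Hypothetical weighted directed graph: node -> [(neighbour, weight)].
graph = {1: [(2, 1), (3, 1)], 2: [(4, 1)], 3: [(4, 2)], 5: [(3, 1)]}
labels = {2: {"c"}, 3: {"c"}, 4: {"d"}}
best_two = top_k(graph, labels, ("c", "d"), k=2)
```

Each entry of `best_two` has the form (path sum, (root, matches)), so distinct roots are ranked by their best-scoring answer.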
Graph Search Strategies
[Figure: clusters of visited nodes for keywords w1 and w2 expanding through the graph and meeting at intersections]
• Backward search [Bhalotia et al., ICDE’02]
• Start from keyword nodes (nodes containing at least one query keyword)
• In each search step, choose an incoming edge to a previously visited node and follow the edge backward to visit its source node
• Discover an answer root r if r is visited from every keyword
• Bidirectional search [Kacholia et al., VLDB’05]
• Explore the graph by following forward edges as well
• Choose which node to visit by heuristic activation factors
• Conceptually, expand clusters of visited nodes for each keyword
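Backward search can be sketched as one Dijkstra-style expansion per keyword over reversed edges, reporting a node as an answer root once every keyword cluster has reached it. The sketch below is an illustration under assumptions, not the algorithm of either cited paper. It always expands the globally closest frontier node, and the graph, labels, and function name are hypothetical.

```python
import heapq

def backward_search(graph, labels, query):
    """Return {root: sum of shortest distances from root to each keyword}."""
    # Reverse adjacency, so that edges can be followed backward.
    reverse = {}
    for u, nbrs in graph.items():
        for v, w in nbrs:
            reverse.setdefault(v, []).append((u, w))
    settled = [dict() for _ in query]  # one cluster of visited nodes per keyword
    heap = []
    for i, word in enumerate(query):
        for node, words in labels.items():
            if word in words:
                heapq.heappush(heap, (0, i, node))
    roots = {}
    while heap:
        d, i, u = heapq.heappop(heap)
        if u in settled[i]:
            continue
        settled[i][u] = d
        # u is an answer root once it has been visited from every keyword.
        if all(u in cluster for cluster in settled):
            roots[u] = sum(cluster[u] for cluster in settled)
        for v, w in reverse.get(u, []):
            if v not in settled[i]:
                heapq.heappush(heap, (d + w, i, v))
    return roots

# Hypothetical weighted directed graph: node -> [(neighbour, weight)].
graph = {1: [(2, 1), (3, 1)], 2: [(4, 1)], 3: [(4, 2)], 5: [(3, 1)]}
labels = {2: {"c"}, 3: {"c"}, 4: {"d"}}
roots = backward_search(graph, labels, ("c", "d"))
```

Node 4 never appears as a root in this example because no path leads from it to a node containing "c".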