1 / 55

Keyword Search Over Graph Databases

北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University. Keyword Search Over Graph Databases. Instructor: Lei Zou. DB/IR: Different Positions. Structured data vs unstructured data Schema vs non-schema Query processing vs ranking. In the Past. Search.

jun
Download Presentation

Keyword Search Over Graph Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University Keyword Search Over Graph Databases Instructor: Lei Zou

  2. DB/IR: Different Positions Structured data vs unstructured data Schema vs non-schema Query processing vs ranking

  3. In the Past Search Unstructured (keywords) IR Systems Structured (SQL,Xquery) DB Systems Data Structured (records) Unstructured (documents)

  4. Now Search Unstructured (keywords) IR Systems Keyword search over relational graphs Digital Libraries Enterprise BI Web 2.0 DB Systems Structured (SQL,Xquery) Querying entities and relations from information extraction Text data Schema latter ranking Data Structured (records) Unstructured (documents)

  5. In the future Search Integrated DB & IR System (Models, Algorithms, Platform) Data

  6. 1. Approximate matching and record linkage “Beijing University” & “PKU” 2. Too-many-answers ranking Answers are ranked instead of “set”-style outputs 3. Schema relaxation and heterogeneity No Schema approach & Schema Mapping & Data Integration 4. Information Extraction and Entity (and relationship) Search Find all books written by Mark Twain 5. Uncertain Data Management Some data are uncertain, such as extracted data by IE

  7. The fist Step IR-style Search Over Database System Hard to Learn SQL No idea abut the Schema keywords

  8. 3 1 2 2 1 3 2 3 3 3 3 3 3 1 4 4 2 4 4 4 4 4 4 3 2 1 2 Full-Text Keyword Search Michelle XML XML Michelle Michelle XML Database Tuples • Find a set of tuples that contain the keywords. • Full text index (Oracle, DB2, SQL Server …) • SQL supports “contain (A, w)” clause to retrieve the set of tuples that contain keyword w in the attribute A.

  9. SELECT student_id, student_name FROM students WHERE CONTAINS( address, 'beijing' )

  10. 2 1 2 2 2 2 3 3 3 3 4 4 4 4 2 2 Structural Keyword Search in RDB Michelle Michelle Michelle XML XML XML • Tuples are connected through foreign key references in an RDB. • Find a set of interconnected tuples that contain the keywords

  11. c3 c2 c5 c4 c1 p4 p3 p2 p1 3 3 3 3 a1 a3 a2 4 4 4 w4 w3 w1 w5 w2 w6 RDB Example Paper Author Write Cite

  12. c4 c5 c2 c1 c3 p3 p1 p2 p4 3 3 3 3 a1 a2 a3 4 4 4 w1 w2 w3 w6 w4 w5 Tuple Connections (Database Graph) Michelle XML XML Michelle Tuples are connected through foreign key references

  13. c3 c5 c4 c2 c1 p3 p2 p4 p1 3 3 3 3 a1 a3 a2 4 4 4 w3 w2 w1 w6 w5 w4 What is the relationship between Michelle and XML? Michelle XML Michelle XML XML Michelle

  14. c1 c2 p1 p3 p3 p2 p1 p1 p2 3 3 3 3 3 3 3 a1 a3 4 4 w6 w1 w2 Michelle Michelle XML Michelle XML … Michelle XML XML • How are keywords connected? • Minimal total joining network of tuples • Total: all keywords will be contained • Minimal: it is not total if any tuple is removed • Tmax: maximum number of nodes allowed

  15. Outline • Schema BasedDiscover Algorithm • Non-Schema Based Blink Algorithm

  16. Outline • Schema BasedDiscover Algorithm • Non-Schema Based Blink Algorithm

  17. DISCOVER: Keyword Search in Relational Databases • Vagelis Hristidis University of California, San Diego • Yannis Papakonstantinou University of California, San Diego

  18. Motivation • Keyword Search is the dominant information discovery method in documents • Increasing amount of data stored in databases

  19. Motivation • Currently, information discovery in databases requires: • Knowledge of schema • Knowledge of a query language (eg: SQL) • Knowledge of the role of the keywords • DISCOVER eliminates these requirements

  20. Keyword Query - Semantics • Keywords are: • in same tuple • in same relation • connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • weighted distance

  21. Result of Keyword Query • Result is tree T of tuples where: • each edge corresponds to a primary-foreign key relationship • every keyword contained in a tuple of T (total) • no tuple of T is redundant (minimal)

  22. Example - Schema Subset of TPC-H schema n:1 n:1 ORDERS CUSTOMER NATION

  23. Example - Data

  24. Example – Keyword Query Query: “Smith, Miller”

  25. Example – Keyword Query Query: “Smith, Miller” Results:

  26. Example – Keyword Query Query: “Smith, Miller” Results: Smaller sizes usually denote tighter association between keywords

  27. Architecture User

  28. Architecture

  29. Candidate Networks Generator - Challenges • A keyword may appear in multiple tuples • # candidate networks can be too big (sometimes unbounded)

  30. Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller

  31. Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN1: OSmith C  OMiller size=2

  32. Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN1: OSmith C  OMiller size=2 CN2: OSmith C  N  C  OMiller size=4

  33. Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN3: OSmith C  OMiller Csize=3

  34. Candidate Network - Example ORDERS Smith n:1 ------------------------------------------------- c1 – o – c2 c1 c2 , because primary to foreign key from CUSTOMER to ORDERS Pruning Condition: RKSRL n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN4: OSmithC  O  C OMiller size=4

  35. Candidate Networks Generator - Algorithm • Traverse tuple set graph breadth first • Q  tuple sets containing keyword k1 • For each network n of tuple sets in Q do • If pruning_condition(n) drop n • else if is_CN(n) output n • else expand n by one tuple set to all possible directions in tuple set graph and insert expansions to Q [eg: if n is OSmith C then we add to Q OSmith C  OMiller, OSmith C  O, OSmith C  N ]

  36. Candidate Networks Generator is Complete and Non-Redundant • Prove that the set of Candidate Networks generated is • Complete: All solutions generated by a CN • Non-redundant: There is database instance, where by removing a CN a solution is lost

  37. Size of Candidate Networks may be Unbounded • Size is unbounded iff schema graph G has one of the following properties: • There is a node of G that has at least two incoming edges. [eg: PARTSUPPLINEITEMORDERS] • G has a directed cycle. [eg: ancestor schemas]

  38. Architecture

  39. Execution Plan - Challenges • Generated SQL queries are expensive due to joins • Reusability opportunities

  40. Execution Plan • Each CN corresponds to a SQL statement • CN1: OSmith C  OMiller CN2: OSmith C  N  C  OMiller • Execution Plan CN1  OSmith C  OMiller CN2  OSmith C  N  C  OMiller

  41. Reuse Common Subexpressions - Example • Execution Plan CN1  OSmith C OMiller CN2  OSmith C N  C  OMiller • Optimized Execution Plan Temp OSmith C CN1  Temp OMiller CN2  Temp N  C  OMiller

  42. Optimal Reuse of Common Subexpressions is NP-Complete • Simple Cost Model: each join has cost 1 • Prove that finding Optimal Common Subexpressions is NP-Complete. Proof: Reduce string compression problem

  43. Cost Model and Greedy Optimization Algorithm • Actual Cost Model: cost of a join is size of result • Greedy algorithm: In each iteration build intermediate result of size 1 (1 join) that maximizes

  44. Tuning of Greedy Algorithm • a: frequency factor • favors reusability • b: size factor • favors small intermediate results • a=1 • 0b0.3

  45. Outline • Schema BasedDiscover Algorithm • Non-Schema Based Blink Algorithm

  46. RDF example Biopathway example Relational DB Growing Uses of Graph-Structured Data • Data produced directly based on graphs • W3C standards: XML, RDF and OWL • Bioinformatics: BioCyc, BioMaze, etc. • Data that can be made graphs by restoring implicit connections among data items • Relational data  Graph • Unstructured and heterogeneous data  Graph • Personal information management (PIM) • Information extraction from Web

  47. Keyword Search on Graphs • New and popular query paradigm • Simple, user-friendly query interface • Queryability on graphs without obvious schema • Problem definition • Data: a weighted directed graph G=(V,E), where each node vV may be labeled with text • Query: a keyword query q = (w1, …, wm) • Answer: an answer to q is a pair r,(n1,…,nm), where r and ni’s are nodes satisfying: • Coverage: ni contains keyword wi for every i • Connectivity: a directed path exists from r to ni for every i • Top-k Query returns k distinct root nodes with the highest best scores associated to each root q = (c,d) T1 = <3, (3,6)>T2 = <2, (12,4)>…

  48. Roadmap • Define scoring function • Interplay between semantics and ease of evaluation • Propose better graph search strategies • Worst-case performance guarantee • Combine graph indexing with search • Simple, but impractical, single-level index • Practical bi-level index in BLINKS • Partitioning-based indexing

  49. answer root matches r n2 n1 n3 Scoring Function • Score definition • For an answer T= r,(n1,…,nm) to a query q = (w1, …, wm), the score is defined asS(T) = f( Sr(r) +  Sn(ni, wi) +  Sp(r, ni)) • Considers both content and graph structure • Match-distributive property • Contribution of matches and root-match paths can be computed in a distributive manner by summing over all matches • Allow pre-computation of best path, independently for each node/keyword • Graph-distance property • The contribution of a root-match path, Sp(r,ni), is defined to be shortest-path distance from r to ni • To simplify presentation, we focus on the path contribution Sp(r,ni) paths from root to matches

  50. Intersection Intersection Graph Search Strategies • Backward search [Bhalotia et al., ICDE’02] • Starting from keyword nodes (containing at least one query keyword) • In each search step, choose an incoming edge to a previously visited node and follow the edge backward to visit its source node • Discover an answer root r if r is visited from every keyword • Bidirectional search [Kacholia et al., VLDB’05] • Explore the graph by following forward edges as well • Choose which node to visit by heuristic activation factors w1 Conceptually, expand clusters of visited nodes for each keyword w2 Graph

More Related