Keyword Search Over Graph Databases

北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University Keyword Search Over Graph Databases Instructor: Lei Zou

DB/IR: Different Positions Structured data vs unstructured data Schema vs non-schema Query processing vs ranking

In the Past Search Unstructured (keywords) IR Systems Structured (SQL,Xquery) DB Systems Data Structured (records) Unstructured (documents)

Now Search Unstructured (keywords) IR Systems Keyword search over relational graphs Digital Libraries Enterprise BI Web 2.0 DB Systems Structured (SQL,Xquery) Querying entities and relations from information extraction Text data Schema latter ranking Data Structured (records) Unstructured (documents)

In the future Search Integrated DB & IR System (Models, Algorithms, Platform) Data

1. Approximate matching and record linkage “Beijing University” & “PKU” 2. Too-many-answers ranking Answers are ranked instead of “set”-style outputs 3. Schema relaxation and heterogeneity No Schema approach & Schema Mapping & Data Integration 4. Information Extraction and Entity (and relationship) Search Find all books written by Mark Twain 5. Uncertain Data Management Some data are uncertain, such as extracted data by IE

The fist Step IR-style Search Over Database System Hard to Learn SQL No idea abut the Schema keywords

3 1 2 2 1 3 2 3 3 3 3 3 3 1 4 4 2 4 4 4 4 4 4 3 2 1 2 Full-Text Keyword Search Michelle XML XML Michelle Michelle XML Database Tuples • Find a set of tuples that contain the keywords. • Full text index (Oracle, DB2, SQL Server …) • SQL supports “contain (A, w)” clause to retrieve the set of tuples that contain keyword w in the attribute A.

SELECT student_id, student_name FROM students WHERE CONTAINS( address, 'beijing' )

2 1 2 2 2 2 3 3 3 3 4 4 4 4 2 2 Structural Keyword Search in RDB Michelle Michelle Michelle XML XML XML • Tuples are connected through foreign key references in an RDB. • Find a set of interconnected tuples that contain the keywords

c3 c2 c5 c4 c1 p4 p3 p2 p1 3 3 3 3 a1 a3 a2 4 4 4 w4 w3 w1 w5 w2 w6 RDB Example Paper Author Write Cite

c4 c5 c2 c1 c3 p3 p1 p2 p4 3 3 3 3 a1 a2 a3 4 4 4 w1 w2 w3 w6 w4 w5 Tuple Connections (Database Graph) Michelle XML XML Michelle Tuples are connected through foreign key references

c3 c5 c4 c2 c1 p3 p2 p4 p1 3 3 3 3 a1 a3 a2 4 4 4 w3 w2 w1 w6 w5 w4 What is the relationship between Michelle and XML? Michelle XML Michelle XML XML Michelle

c1 c2 p1 p3 p3 p2 p1 p1 p2 3 3 3 3 3 3 3 a1 a3 4 4 w6 w1 w2 Michelle Michelle XML Michelle XML … Michelle XML XML • How are keywords connected? • Minimal total joining network of tuples • Total: all keywords will be contained • Minimal: it is not total if any tuple is removed • Tmax: maximum number of nodes allowed

Outline • Schema BasedDiscover Algorithm • Non-Schema Based Blink Algorithm

DISCOVER: Keyword Search in Relational Databases • Vagelis Hristidis University of California, San Diego • Yannis Papakonstantinou University of California, San Diego

Motivation • Keyword Search is the dominant information discovery method in documents • Increasing amount of data stored in databases

Motivation • Currently, information discovery in databases requires: • Knowledge of schema • Knowledge of a query language (eg: SQL) • Knowledge of the role of the keywords • DISCOVER eliminates these requirements

Keyword Query - Semantics • Keywords are: • in same tuple • in same relation • connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • weighted distance

Result of Keyword Query • Result is tree T of tuples where: • each edge corresponds to a primary-foreign key relationship • every keyword contained in a tuple of T (total) • no tuple of T is redundant (minimal)

Example - Schema Subset of TPC-H schema n:1 n:1 ORDERS CUSTOMER NATION

Example - Data

Example – Keyword Query Query: “Smith, Miller”

Example – Keyword Query Query: “Smith, Miller” Results:

Example – Keyword Query Query: “Smith, Miller” Results: Smaller sizes usually denote tighter association between keywords

Architecture User

Architecture

Candidate Networks Generator - Challenges • A keyword may appear in multiple tuples • # candidate networks can be too big (sometimes unbounded)

Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller

Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN1: OSmith C  OMiller size=2

Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN1: OSmith C  OMiller size=2 CN2: OSmith C  N  C  OMiller size=4

Candidate Network - Example ORDERS Smith n:1 n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN3: OSmith C  OMiller Csize=3

Candidate Network - Example ORDERS Smith n:1 ------------------------------------------------- c1 – o – c2 c1 c2 , because primary to foreign key from CUSTOMER to ORDERS Pruning Condition: RKSRL n:1 n:1 CUSTOMER NATION ORDERS n:1 ORDERS Miller CN4: OSmithC  O  C OMiller size=4

Candidate Networks Generator - Algorithm • Traverse tuple set graph breadth first • Q  tuple sets containing keyword k1 • For each network n of tuple sets in Q do • If pruning_condition(n) drop n • else if is_CN(n) output n • else expand n by one tuple set to all possible directions in tuple set graph and insert expansions to Q [eg: if n is OSmith C then we add to Q OSmith C  OMiller, OSmith C  O, OSmith C  N ]

Candidate Networks Generator is Complete and Non-Redundant • Prove that the set of Candidate Networks generated is • Complete: All solutions generated by a CN • Non-redundant: There is database instance, where by removing a CN a solution is lost

Size of Candidate Networks may be Unbounded • Size is unbounded iff schema graph G has one of the following properties: • There is a node of G that has at least two incoming edges. [eg: PARTSUPPLINEITEMORDERS] • G has a directed cycle. [eg: ancestor schemas]

Architecture

Execution Plan - Challenges • Generated SQL queries are expensive due to joins • Reusability opportunities

Execution Plan • Each CN corresponds to a SQL statement • CN1: OSmith C  OMiller CN2: OSmith C  N  C  OMiller • Execution Plan CN1  OSmith C  OMiller CN2  OSmith C  N  C  OMiller

Reuse Common Subexpressions - Example • Execution Plan CN1  OSmith C OMiller CN2  OSmith C N  C  OMiller • Optimized Execution Plan Temp OSmith C CN1  Temp OMiller CN2  Temp N  C  OMiller

Optimal Reuse of Common Subexpressions is NP-Complete • Simple Cost Model: each join has cost 1 • Prove that finding Optimal Common Subexpressions is NP-Complete. Proof: Reduce string compression problem

Cost Model and Greedy Optimization Algorithm • Actual Cost Model: cost of a join is size of result • Greedy algorithm: In each iteration build intermediate result of size 1 (1 join) that maximizes

Tuning of Greedy Algorithm • a: frequency factor • favors reusability • b: size factor • favors small intermediate results • a=1 • 0b0.3

Outline • Schema BasedDiscover Algorithm • Non-Schema Based Blink Algorithm

RDF example Biopathway example Relational DB Growing Uses of Graph-Structured Data • Data produced directly based on graphs • W3C standards: XML, RDF and OWL • Bioinformatics: BioCyc, BioMaze, etc. • Data that can be made graphs by restoring implicit connections among data items • Relational data  Graph • Unstructured and heterogeneous data  Graph • Personal information management (PIM) • Information extraction from Web

Keyword Search on Graphs • New and popular query paradigm • Simple, user-friendly query interface • Queryability on graphs without obvious schema • Problem definition • Data: a weighted directed graph G=(V,E), where each node vV may be labeled with text • Query: a keyword query q = (w1, …, wm) • Answer: an answer to q is a pair r,(n1,…,nm), where r and ni’s are nodes satisfying: • Coverage: ni contains keyword wi for every i • Connectivity: a directed path exists from r to ni for every i • Top-k Query returns k distinct root nodes with the highest best scores associated to each root q = (c,d) T1 = <3, (3,6)>T2 = <2, (12,4)>…

Roadmap • Define scoring function • Interplay between semantics and ease of evaluation • Propose better graph search strategies • Worst-case performance guarantee • Combine graph indexing with search • Simple, but impractical, single-level index • Practical bi-level index in BLINKS • Partitioning-based indexing

answer root matches r n2 n1 n3 Scoring Function • Score definition • For an answer T= r,(n1,…,nm) to a query q = (w1, …, wm), the score is defined asS(T) = f( Sr(r) +  Sn(ni, wi) +  Sp(r, ni)) • Considers both content and graph structure • Match-distributive property • Contribution of matches and root-match paths can be computed in a distributive manner by summing over all matches • Allow pre-computation of best path, independently for each node/keyword • Graph-distance property • The contribution of a root-match path, Sp(r,ni), is defined to be shortest-path distance from r to ni • To simplify presentation, we focus on the path contribution Sp(r,ni) paths from root to matches

Intersection Intersection Graph Search Strategies • Backward search [Bhalotia et al., ICDE’02] • Starting from keyword nodes (containing at least one query keyword) • In each search step, choose an incoming edge to a previously visited node and follow the edge backward to visit its source node • Discover an answer root r if r is visited from every keyword • Bidirectional search [Kacholia et al., VLDB’05] • Explore the graph by following forward edges as well • Choose which node to visit by heuristic activation factors w1 Conceptually, expand clusters of visited nodes for each keyword w2 Graph

Keyword Search Over Graph Databases

Keyword Search Over Graph Databases

Presentation Transcript

Keyword++: A Framework to Improve Keyword Search Over Entity Databases

XRANK: Ranked Keyword Search Over XML Documents

DBXplorer : A System For Keyword-Based Search Over Relational Databases

Keyword-based Search and Exploration on Databases

Graph databases

DBXplorer: A System for Keyword-Based Search over Relational Databases.

DBXplorer: A System for Keyword-Based Search over Relational Databases

DBXplorer: A System for Keyword B ased Search over Relational Databases

Benchmarking traversal operations over graph databases

Keyword Search over Relational Tables and Streams

Efficient Keyword Search over Virtual XML Views

Efficient Keyword Search Over Virtual XML Views

Keyword Search in Databases using PageRank

Correlation Search in Graph Databases

Efficient IR-Style Keyword Search over Relational Databases

Efficient Keyword Search over Virtual XML Views

Secure Conjunctive Keyword Search Over Encrypted Data

XRANK: Ranked Keyword Search over XML Documents

DISCOVER: Keyword Search in Relational Databases

Efficient IR-Style Keyword Search over Relational Databases

Efficient Keyword Search across Heterogeneous Relational Databases

Keyword Search on Graph-Structured Data