Efficient Keyword Search across Heterogeneous Relational Databases
Mayssam Sayyadian, AnHai Doan (University of Wisconsin - Madison)
Hieu LeKhac (University of Illinois - Urbana)
Luis Gravano (Columbia University)
Key Message of Paper
• Precise data integration is expensive
• But we can do IR-style data integration very cheaply, with no manual cost!
  • just apply automatic schema/data matching
  • then do keyword search across the databases
  • no need to verify anything manually
• Already very useful
Builds upon keyword search over a single database ...
Keyword Search over a Single Relational Database
• A growing field, numerous recent works
  • DBXplorer [ICDE02], BANKS [ICDE02]
  • DISCOVER [VLDB02]
  • Efficient IR-style keyword search in databases [VLDB03]
  • VLDB-05, SIGMOD-06, etc.
• Many related works over XML / other types of data
  • XKeyword [ICDE03], XRank [SIGMOD03]
  • TeXQuery [WWW04]
  • ObjectRank [SIGMOD06]
  • TopX [VLDB05], etc.
• More are coming at SIGMOD-07 ...
A Typical Scenario
Q = [Michael Smith Cisco]

Complaints:
  tid  id    emp-name       comments
  u1   c124  Michael Smith  Repair didn't work
  u2   c124  John           Deferred work to John Smith

Customers:
  tid  custid  name   contact        addr
  t1   c124    Cisco  Michael Jones  …
  t2   c533    IBM    David Long     …
  t3   c333    MSR    David Ross     …

Foreign-key join: Complaints.id = Customers.custid

Ranked list of answers:
  score = .8:  u1 (c124, Michael Smith, Repair didn't work) ⋈ t1 (c124, Cisco, Michael Jones, …)
  score = .7:  u2 (c124, John, Deferred work to John Smith) ⋈ t1 (c124, Cisco, Michael Jones, …)
Our Proposal: Keyword Search across Multiple Databases
Query: [Cisco Jack Lucas]

Service-DB:
  Complaints:
    tid  id    emp-name       comments
    u1   c124  Michael Smith  Repair didn't work
    u2   c124  John           Deferred work to John Smith
  Customers:
    tid  custid  name   contact        addr
    t1   c124    Cisco  Michael Jones  …
    t2   c533    IBM    David Long     …
    t3   c333    MSR    Joan Brown     …

HR-DB:
  Employees:
    tid  empid  name
    v1   e23    Mike D. Smith
    v2   e14    John Brown
    v3   e37    Jack Lucas
  Groups:
    tid  eid  reports-to
    x1   e23  e37
    x2   e14  e37

Sample answer, spanning both databases:
  t1 (Cisco) ⋈ u1 (Michael Smith) ≈ v1 (Mike D. Smith) ⋈ x1 (e23, e37) ⋈ v3 (Jack Lucas)
⇒ IR-style data integration
A Naïve Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB ⇒ apply current keyword search techniques
Just like in traditional data integration, this is too much manual work
Kite Solution
• Automatically find FK joins / matching data instances across databases
  ⇒ no manual work is required from the user
(Same example databases as before: Service-DB with Complaints and Customers, HR-DB with Employees and Groups)
Automatically Find FK Joins across Databases
• Current solutions analyze data values (e.g., Bellman)
  • limited accuracy
  • e.g., "waterfront" with values yes/no vs. "electricity" with values yes/no: the value distributions match perfectly, yet the columns are unrelated
• Our solution: data analysis + schema matching
  • improves accuracy drastically (by as much as 50% F-1)
• Automatic join/data matching can be wrong ⇒ incorporate confidence scores into answer scores
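A minimal sketch of this idea in Python: score candidate joins by combining a data-analysis signal (value overlap) with a schema-matching signal (column-name similarity). The functions and weights are illustrative assumptions, not Kite's actual algorithms.

    # Combine data analysis with schema matching when scoring candidate joins.
    from difflib import SequenceMatcher

    def value_overlap(col_a, col_b):
        """Jaccard overlap of the two columns' value sets (data analysis)."""
        a, b = set(col_a), set(col_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def name_similarity(name_a, name_b):
        """Crude schema matching: string similarity of column names."""
        return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

    def join_confidence(name_a, col_a, name_b, col_b, w_data=0.5, w_schema=0.5):
        """Confidence that (name_a, name_b) is a foreign-key join.
        Data analysis alone gives boolean columns like waterfront/electricity
        a perfect overlap score; the schema-matching term pulls them apart."""
        return w_data * value_overlap(col_a, col_b) + \
               w_schema * name_similarity(name_a, name_b)

    # The spurious boolean match gets a low combined score ...
    print(join_confidence("waterfront", ["yes", "no"], "electricity", ["yes", "no"]))
    # ... while a genuine FK pair scores high on both signals.
    print(join_confidence("cust-id", ["c124", "c533"], "custid", ["c124", "c533", "c333"]))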
Incorporate Confidence Scores into Answer Scores
• Recall: answer example in single-DB settings
  score = .8:  u1 (c124, Michael Smith, Repair didn't work) ⋈ t1 (c124, Cisco, Michael Jones, …)
• Recall: answer example in multiple-DB settings
  t1 ⋈ u1 ≈ v1 ⋈ x1 ⋈ v3, with confidence score 0.9 for the FK join and 0.7 for the data matching
• Combined score:

  score(A, Q) = [α·score_kw(A, Q) + β·score_join(A, Q) + γ·score_data(A, Q)] / size(A)
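A small sketch of this scoring formula; the α, β, γ values and the example numbers are illustrative assumptions.

    def answer_score(score_kw, score_join, score_data, size,
                     alpha=1.0, beta=1.0, gamma=1.0):
        """score(A, Q) = [α·score_kw + β·score_join + γ·score_data] / size(A).

        score_kw   : IR-style keyword relevance of the joined tuples
        score_join : confidence of the automatically discovered FK joins
        score_data : confidence of the automatic data-instance matches
        size       : number of tuples in the answer (longer joins rank lower)
        """
        return (alpha * score_kw + beta * score_join + gamma * score_data) / size

    # The multi-DB answer t1 ⋈ u1 ≈ v1 ⋈ x1 ⋈ v3: five tuples, FK-join
    # confidence 0.9, data-match confidence 0.7, plus a keyword score.
    print(answer_score(score_kw=0.8, score_join=0.9, score_data=0.7, size=5))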
Summary of Trade-Offs
• Precise data integration
  • the holy grail: supports full SQL queries
• IR-style data integration, naïve way
  • manually identify FK joins, matching data
  • still too expensive
• IR-style data integration, using Kite
  • automatic FK join finding / data matching
  • cheap
  • only approximates the "ideal" ranked list found by the naïve way
Kite Architecture
Offline preprocessing (over databases D1 … Dn):
• Index Builder ⇒ one IR index per database (IR index1 … IR indexn)
• Foreign-Key Join Finder (data-based join finder + schema matcher) ⇒ foreign-key joins
• Data instance matcher
Online querying, e.g., Q = [Smith Cisco]:
• Condensed CN Generator
• Top-k Searcher: refinement rules (Partial, Full, Deep) ⇒ distributed SQL queries over D1 … Dn
Online Querying
What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers
Create Answer Templates
1. Find tuples that contain query keywords
   • use each DB's IR index
   • example: Q = [Smith Cisco]
   • tuple sets (RQ = tuples of relation R that contain query keywords; R{} = all of R):
     Service-DB: ComplaintsQ = {u1, u2}, CustomersQ = {t1}
     HR-DB: EmployeesQ = {v1}, GroupsQ = {}
2. Create the tuple-set graph from the schema graph
   • nodes: tuple sets; edges: foreign-key joins J1 … J4
     J1: Customers ⋈ Complaints (within Service-DB)
     J2, J3: automatically discovered joins between Complaints and Employees (across databases)
     J4: Employees ⋈ Groups (within HR-DB)
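A toy sketch of step 1, assuming a simple in-memory inverted index; real systems use the DBMS's IR index.

    from collections import defaultdict

    def build_ir_index(relation):
        """Map each keyword to the ids of tuples containing it."""
        index = defaultdict(set)
        for tid, text in relation.items():
            for word in text.lower().split():
                index[word].add(tid)
        return index

    def tuple_set(index, query_keywords):
        """RQ: tuples of R that contain at least one query keyword."""
        result = set()
        for kw in query_keywords:
            result |= index.get(kw.lower(), set())
        return result

    complaints = {"u1": "Michael Smith Repair didn't work",
                  "u2": "John Deferred work to John Smith"}
    customers = {"t1": "Cisco Michael Jones", "t2": "IBM David Long",
                 "t3": "MSR Joan Brown"}

    q = ["Smith", "Cisco"]
    print(tuple_set(build_ir_index(complaints), q))  # {'u1', 'u2'}
    print(tuple_set(build_ir_index(customers), q))   # {'t1'}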
Create Answer Templates (cont.)
• Search the tuple-set graph to generate answer templates
  • also called Candidate Networks (CNs)
• Each answer template = one way to join tuple sets to form an answer
• Sample CNs over the tuple-set graph above:
  CN1: CustomersQ -J1- ComplaintsQ
  CN2: CustomersQ -J1- Complaints{} -J2- EmpsQ
  CN3: EmpsQ -J4- Groups{} -J4- Emps{} -J2- ComplaintsQ
  CN4: EmpsQ -J4- Groups{} -J4- Emps{} -J3- ComplaintsQ
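A rough sketch of CN enumeration as a search over the tuple-set graph. The graph encoding (alternating node, join, node) is a simplified assumption; real CN generators also prune duplicate and non-minimal networks.

    def generate_cns(edges, keyword_nodes, max_size):
        """Return join paths that start and end in keyword tuple sets."""
        adjacency = {}
        for a, join, b in edges:
            adjacency.setdefault(a, []).append((join, b))
            adjacency.setdefault(b, []).append((join, a))
        cns = []
        # Depth-first search over paths, growing one join at a time
        frontier = [[n] for n in keyword_nodes]
        while frontier:
            path = frontier.pop()
            if len(path) > 1 and path[-1] in keyword_nodes:
                cns.append(path)
            if (len(path) + 1) // 2 < max_size:  # count of tuple-set nodes
                for join, nxt in adjacency.get(path[-1], []):
                    if nxt not in path:  # avoid revisiting a tuple set
                        frontier.append(path + [join, nxt])
        return cns

    edges = [("CustomersQ", "J1", "ComplaintsQ"),
             ("ComplaintsQ", "J2", "EmpsQ"),
             ("EmpsQ", "J4", "Groups{}")]
    for cn in generate_cns(edges, {"CustomersQ", "ComplaintsQ", "EmpsQ"}, max_size=3):
        print(" - ".join(cn))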
Materialize Answer Templates to Generate Answers
• By generating and executing SQL queries
  CN: CustomersQ -J1- ComplaintsQ   (CustomersQ = {t1}, ComplaintsQ = {u1, u2})
  SQL: SELECT *
       FROM Customers C, Complaints P
       WHERE C.cust-id = P.id
         AND (C.tuple-id = t1)
         AND (P.tuple-id = u1 OR P.tuple-id = u2)
• Naïve solution
  • materialize all answer templates, score, rank, then return answers
• Current solutions
  • find only the top-k answers
  • materialize only certain answer templates
  • make decisions using refinement rules + statistics
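A sketch of how such SQL might be generated from a CN; the CN encoding and the helper cn_to_sql are hypothetical, and Kite additionally distributes such queries across databases.

    def cn_to_sql(relations, join_conditions, tuple_sets):
        """Build one SELECT that joins the CN's relations and restricts
        each keyword relation to its tuple set."""
        aliases = {rel: rel[0] + str(i) for i, rel in enumerate(relations)}
        from_clause = ", ".join(f"{rel} {aliases[rel]}" for rel in relations)
        where = [cond.format(**aliases) for cond in join_conditions]
        for rel, tids in tuple_sets.items():
            ors = " OR ".join(f"{aliases[rel]}.tuple-id = '{t}'" for t in tids)
            where.append(f"({ors})")
        return f"SELECT * FROM {from_clause} WHERE " + " AND ".join(where)

    sql = cn_to_sql(
        relations=["Customers", "Complaints"],
        join_conditions=["{Customers}.cust-id = {Complaints}.id"],
        tuple_sets={"Customers": ["t1"], "Complaints": ["u1", "u2"]},
    )
    print(sql)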
Challenges for Kite Setting
• More databases ⇒ way too many answer templates to generate
  • can take hours on just 3-4 databases
• Materializing an answer template takes way too long
  • requires SQL query execution across multiple databases
  • invoking each database incurs large overhead
• Difficult to obtain reliable statistics across databases
• See paper for our solutions (or backup slides)
Empirical Evaluation Domains
• DBLP: two overlapping bibliography databases (DBLP 1, DBLP 2)
  • relations include AU (id, name), AR (id, title), CNF (id, name), AR (aid, biblo), CITE (id1, id2), PU (aid, uid)
• Inventory: sample inventory schema (Inventory 1, …)
  • relations include AUTHOR, ARTIST, BOOK, CD, WAREHOUSE, WH2BOOK, WH2CD
Runtime Performance (1)
Compared methods: mHybrid (Hybrid algorithm adapted to run over multiple databases), Kite without condensed CNs, Kite without adaptive rule selection and without rule Deep, Kite without rule Deep, and the full-fledged Kite algorithm.
• Runtime vs. maximum CCN size
  • Inventory: 2-keyword queries, k = 10, 5 databases
  • DBLP: 2-keyword queries, k = 10, 2 databases
• Runtime vs. # of databases
  • Inventory: maximum CCN size = 4, 2-keyword queries, k = 10
[plots: time (sec) vs. max CCN size and vs. # of DBs]
Runtime Performance (2)
• Runtime vs. # of keywords in the query (|q|)
  • DBLP: max CCN = 6, k = 10, 2 databases
  • Inventory: max CCN = 4, k = 10, 5 databases
• Runtime vs. # of answers requested (k)
  • Inventory: 2-keyword queries, max CCN = 4, 5 databases
[plots: time (sec) vs. |q| and vs. k]
Query Result Quality
• Pr@k = the fraction of the returned top-k answers that appear in the "ideal" ranked list
[plots: Pr@k vs. k, for OR-semantics and AND-semantics queries]
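The metric is straightforward to compute; a minimal sketch (the list contents are illustrative):

    def precision_at_k(returned, ideal, k):
        """Fraction of the top-k returned answers that appear in the ideal
        ranked list (the one the naive manual approach would produce)."""
        top_k = returned[:k]
        return sum(1 for a in top_k if a in set(ideal)) / k

    returned = ["a1", "a2", "a3", "a4"]
    ideal = ["a1", "a3", "a5", "a2"]
    print(precision_at_k(returned, ideal, k=4))  # 0.75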
Summary
• Kite executes IR-style data integration
  • performs some automatic preprocessing
  • then immediately allows keyword querying
• Relatively painless
  • no manual work!
  • no need to create a global schema, nor to understand SQL
• Can be very useful in many settings: on-the-fly, best-effort, for non-technical people
  • enterprises, on the Web: need only a few answers
  • emergency (e.g., hospital + police): need answers quickly
Future Directions
• Incorporate user feedback ⇒ interactive IR-style data integration
• More efficient query processing
  • large # of databases, network latency
• Extend to other types of data
  • XML, ontologies, extracted data, Web data
IR-style data integration is feasible and useful
  • extends current work on keyword search over databases
  • raises many opportunities for future work
Condensing Candidate Networks
• In multi-database settings ⇒ unmanageable number of CNs
• Many CNs share the same tuple sets and differ only in the associated joins
• Group CNs into condensed candidate networks (CCNs): one network per node sequence, whose edges carry sets of joins
• Example: CN3 and CN4 above differ only in using J2 vs. J3, so they condense into a single CCN:
  EmpsQ -J4- Groups{} -J4- Emps{} -{J2, J3}- ComplaintsQ
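A sketch of the condensing step, assuming the toy CN encoding from the earlier sketches (alternating tuple sets and joins):

    from collections import defaultdict

    def condense(cns):
        """Group CNs with identical node sequences; merge joins per edge."""
        groups = defaultdict(list)
        for cn in cns:
            nodes = tuple(cn[0::2])          # tuple sets at even positions
            groups[nodes].append(cn[1::2])   # joins at odd positions
        ccns = []
        for nodes, join_lists in groups.items():
            merged = [set(js) for js in zip(*join_lists)]  # join set per edge
            ccns.append((list(nodes), merged))
        return ccns

    cn3 = ["EmpsQ", "J4", "Groups{}", "J4", "Emps{}", "J2", "ComplaintsQ"]
    cn4 = ["EmpsQ", "J4", "Groups{}", "J4", "Emps{}", "J3", "ComplaintsQ"]
    for nodes, joins in condense([cn3, cn4]):
        print(nodes, joins)  # one CCN; the last edge carries {'J2', 'J3'}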
Top-k Search
Main ideas for top-k keyword search:
• No need to materialize all CNs
• Sometimes, even partially materializing a CN is enough
• Estimate score intervals for CNs, then do branch-and-bound search
• Example: CNs P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9]; after a few iterations the answers P2 (score 0.9) and R2 (score 0.85) are found, so the top-2 min score is 0.85 and CN Q (upper bound 0.7) is pruned without ever being materialized
Kite approach: materialize CNs using refinement rules
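A sketch of score-interval branch-and-bound over CNs, mirroring the example above; materialize() is a stand-in assumption for actual SQL execution.

    import heapq

    def top_k(cn_intervals, materialize, k):
        """cn_intervals: {cn: (lower, upper)}; materialize(cn) -> scores."""
        results = []  # min-heap of the best k answer scores seen so far
        # Refine CNs in order of their upper bounds (most promising first)
        for cn, (low, high) in sorted(cn_intervals.items(),
                                      key=lambda x: -x[1][1]):
            kth = results[0] if len(results) == k else float("-inf")
            if high <= kth:
                continue  # prune: this CN cannot beat the current top-k
            for score in materialize(cn):
                if len(results) < k:
                    heapq.heappush(results, score)
                elif score > results[0]:
                    heapq.heapreplace(results, score)
        return sorted(results, reverse=True)

    answers = {"P": [0.9, 0.7], "Q": [0.6], "R": [0.85, 0.5]}
    intervals = {"P": (0.6, 1.0), "Q": (0.5, 0.7), "R": (0.4, 0.9)}
    print(top_k(intervals, lambda cn: answers[cn], k=2))  # [0.9, 0.85]; Q pruned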
Top-k Search Using Refinement Rules
• In single-database settings: select rules based on database statistics
• In multi-database settings: inaccurate statistics ⇒ inappropriate rule selection
Refinement Rules
• Full:
  • exhaustively extract all answers from a CN (fully materialize it)
  • too much data to move around the network (data transfer cost)
• Partial:
  • try to extract the most promising answer from a CN
  • invokes remote databases for only one answer (high cost of database invocation)
• Deep:
  • a middle-ground approach
  • once a table in a remote database is invoked, extract all answers involving that table
  • takes into account database invocation cost
Adaptive Search
• Question: which refinement rule to apply next?
• In single-database settings: based on database statistics
• In multi-database settings: inaccurate statistics
• Kite approach: adaptively select rules

  goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)

• cost(rule, cn): optimizer's estimated cost for the SQL statements
• benefit(rule, cn): reduced if a rule is applied for a while without making any progress
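A sketch of selecting the next (rule, CN) pair by goodness score; the benefit-decay scheme and the numbers are illustrative assumptions (Kite's actual benefit/cost estimators are in the paper).

    def goodness(rule, cn, est_cost, benefit, stall_count, decay=0.5):
        """goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn),
        with benefit reduced each time the rule stalls (no progress)."""
        return benefit[rule, cn] * (decay ** stall_count[rule, cn]) \
               - est_cost[rule, cn]

    def pick_rule(rules, cns, est_cost, benefit, stall_count):
        """Greedily pick the (rule, CN) pair with the highest goodness."""
        return max(((r, c) for r in rules for c in cns),
                   key=lambda rc: goodness(rc[0], rc[1],
                                           est_cost, benefit, stall_count))

    rules, cns = ["Partial", "Full", "Deep"], ["CCN1", "CCN2"]
    est_cost = {(r, c): 1.0 for r in rules for c in cns}
    benefit = {(r, c): 2.0 for r in rules for c in cns}
    stalls = {(r, c): 0 for r in rules for c in cns}
    stalls[("Full", "CCN1")] = 3  # Full stalled on CCN1 -> benefit decays
    print(pick_rule(rules, cns, est_cost, benefit, stalls))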
Other Experiments
• Schema matching drastically improves the join discovery algorithm (accuracy in F1)
• Kite also improves the single-database keyword search algorithm mHybrid
[plots: accuracy (F1); Kite over a single database, time (sec) vs. max CCN size]