1 / 52

Learning to Create Data-Integration Queries

Learning to Create Data-Integration Queries Partha Pratim Talukdar , Marie Jacob, Muhammad Salman Mehmood , Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha VLDB2008 Seminar Presented by Noel Gunasekar CSE Department – SUNY Buffalo. Learning to Create Data-Integration Queries.

thina
Download Presentation

Learning to Create Data-Integration Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Learning to Create Data-Integration QueriesParthaPratimTalukdar, Marie Jacob, Muhammad SalmanMehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, SudiptoGuhaVLDB2008 Seminar Presented byNoel GunasekarCSE Department – SUNY Buffalo

  2. Learning to Create Data-Integration Queries Introduction Motivation Example Existing solutions Q-System Solution Q-System Architecture Query and Query Answers Executing Query Learning From feedback Conclusion Experimental results Future Work Learning to Create Data Integration Queries

  3. Introduction Learning to Create Data Integration Queries

  4. Motivation • Need for non-expert user to pose queries across multiple data resources. • Non-expert user - Not familiar with querying languages • Multiple resource - Databases, Data warehouses, Virtual integrated schemas Learning to Create Data Integration Queries

  5. Bio-Science Field • Many "standardized" databases with overlapping and cross-referenced information • Each site is being independently extended, corrected, and analyzed • Differing levels of data quality/confidence Protein Databases Protein DataBase: - PDB information and service listings at Brookhaven National Laboratory [ BNL ] PIR: - Protein Identification Resource database at [ JHU ] PRF: - Protein Research Foundation database at GenomeNet SwissProt - Protein database at ExPASy [ Switzerland ] Learning to Create Data Integration Queries

  6. Example • “What are the proteins and genes associated with the disease Narcolepsy?” • Life Sciences researcher querying on data-sources like genomics, disease studies and pharmacology. genomics Life Sciences Researcher Disease Studies http://www.expasy.org/uniprot/P04049 Pharmacology Learning to Create Data Integration Queries

  7. ExistingSolution • Using keyword based queries on Web-Forms • Match the keywords with terms in the tuples and form the query by joining different databases using foreign-keys • Cost for the query is fixed and doesn’t accommodate the context of the query http://www.expasy.org/uniprot/P0C852 Learning to Create Data Integration Queries

  8. Proposed Solution - Q System Automatically generate Web-Forms for given set of keywords Pose queries across multiple data resources using the generated web-form Learning to Create Data Integration Queries

  9. Proposed Solution - Q System Create re-usable web-form User (Author) Q System Keywords Protein, gene, disease Reusable Web-Form For querying Use web-form for Querying Users (Author + others) Query Results Parameters Reusable Web-Form For querying Learning to Create Data Integration Queries

  10. Q System Architecture Learning to Create Data Integration Queries

  11. Architecture of Q System • Four Components • Initial Schema Loader • Query Template Creation • Query Execution • Learning Through Feedback Learning to Create Data Integration Queries

  12. Architecture of Q System Learning to Create Data Integration Queries

  13. Initial Setup • Schema Loader • Input • Given a set of data sources with its own schema • Foreign Keys and Links • Schema Mappings • Record Link • Output • Schema Graph Learning to Create Data Integration Queries

  14. b c 0.07 0.04 0.1 d Initial Setup Example Schema Graph Node: Databases and their attributes (UniProt database, Entrez GeneInfo db, term) Edge: Relation based on foreign keys/cross-references (UniProt to PIR) Cost: Reliability, completeness Learning to Create Data Integration Queries

  15. Query Template Creation Learning to Create Data Integration Queries

  16. Query Template Creation - Example Input: “protein”, “plasma membrane”, “gene” and “disease” Output: Learning to Create Data Integration Queries

  17. a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Template Creation - Example Query Keywords a, e, f Schema Graph Q1 Q2

  18. Query Formulation • Trees can be easily written as executable queries: Steiner Tree a b 0.1 0.1 d f 0.1 0.1 e Conjunctive query: a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)

  19. View Refinement

  20. Web-Form

  21. Query Execution Learning to Create Data Integration Queries

  22. Input: Web-Form

  23. Output: Result Answers Q1 Q1 Q1,2 Q2 Q2 Q2 System determines “producer” queries using provenance

  24. Query Execution • Query Processing Engine with • Support for querying remote data sources • Record data provenance • Solution: • ORCHESTRA • http://www.cis.upenn.edu/~zives/orchestra/ Learning to Create Data Integration Queries

  25. Orchestra Project • The ORCHESTRA project focuses on the challenges of data sharing scenarios in the sciences • Bioinformatics Scenario - many "standardized" databases with overlapping information, similar but not identical data and differing levels of data quality/confidence • Each site is being independently extended, corrected, and analyzed • ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, with disagreeing users Learning to Create Data Integration Queries

  26. Orchestra Project – Data Provenance http://www.cis.upenn.edu/~zives/research/exchange.pdf Learning to Create Data Integration Queries

  27. Learning through Feedback Learning to Create Data Integration Queries

  28. Learning through Feedback • Input: Ranked Results + provenance Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries

  29. Learning through Feedback • User provides feedback Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries

  30. Query Formulation - Recap a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Keywordsa, e, f Schema Graph Q1 Q2

  31. Learning through Feedback a b a b c 0.1 0.1 0.07 0.05 0.04 Q1 0.1 Q2 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.39 Rank = 2 Cost = 0. 41 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.4 e e • Change weights so Q2 is “cheaper” than Q1 a b c 0.05 0.07 0.1 0.04 0.1 d f 0.1 0.1 e

  32. Iteration! Learning to Create Data Integration Queries

  33. Q-System: Challenges • Computation of ranked queries which in turn produce ranked tuples: K-Best Steiner Tree Generation • Predicting new query rankings based on user feedback over tuples, and also generalizing feedback: Learning • Maintaining associations between tuples and queries: Query answers with provenance • Everything at interactive speed!

  34. f= 1 if the edge connects Term and Synonym tables, else 0 Term Synonym f 1 w8 Cost of a Query • Query Cost = Sum of edge costs in the tree. • Edge Cost = Sum of weights of features defined over it. • Features are properties of the edges, e.g., nodes connected • Each feature has a corresponding weight. • Feature example:

  35. a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Steiner Trees: Finding Lowest-Cost Queries • A tree of minimal cost in a graph (G) which includes all the required nodes (S). • Cost of a Steiner Tree is the sum of costs of edges present in the tree. • Steiner Tree is generalization of Minimum Spanning Tree (MST) [equivalent when S = all vertices in G].

  36. K-Best Steiner Tree Algorithms • Exact (practical for ~100 nodes and edges). • Integer Linear Program (ILP) based formulation for finding K-best Steiner Trees in a graph. • The ILP uses ideas from multi-commodity network flows • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Significantly faster; in practice, often gives optimal solution.

  37. Multi-Commodity Flow Problem

  38. MIP for min-cost Steiner Tree

  39. MIP for K min-cost Steiner Tree

  40. Constraints C1 :Flow of commodity k starts at root r C2 : Flow of commodity k terminates at node k C3 : Conservation of flow at Steiner nodes C4 : Flow of an edge allowed only if that edge is included ( Yij = 1 ) C5 : Non-negativity constraint C6 : Defines value for Y C7 : Ensures no incoming active edge into the root C8 : Ensures that all nodes have at most one incoming active edge C9 : Flow of at least one commodity on all edges in I C10 : Ensures no flow on edges in X

  41. a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Finding K-Best Steiner Trees 2-best Steiner trees connecting terminal nodes.

  42. K-Best Steiner Tree Algorithms • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Send “m” shortest path graph as input. • Shortest path between each pair of nodes in S • Significantly faster; in practice, often gives optimal solution.

  43. Tuples Query a b Q d f T e . . . . . . a b c Q* d f T* e Q Challenge : Getting User Feedback Top . . . Bottom T and T* differ in 3 edges. This difference is termed loss: L(T, T*)

  44. Learning: Update Weights Edge Cost: 0.07 Term(T1) Term2Term w8 = 0.06 w25 = 0.01 Edge Cost Update Edge Cost: 0.05 Term(T1) Term2Term w8 = 0.04 w25 = 0.01

  45. Re-ranked Steiner Trees a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Weight Update a b c a b 0.1 0.05 0.1 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Rank 1 Rank 2

  46. Experimental Results • The Key Questions • Can the algorithm start with uninitialized weights and learn expert (“gold standard”) ranking of queries? • Can the results be generated at interactive speeds? • Does the approach scale to larger graphs?

  47. # Gold queries absent in top-3 predictions Results: Learning Expert Weights • Graph: Start with the BioGuide bio sources, with 28 vertices and 96 edges. • Goal: Learn the queries corresponding to the expert-set weights in BioGuide • Methodology: • All weights are set to default. • Sequence of 25 searches • For each, user feedback identifies & promotes a tuple from the gold standard answer. • After 40-60% searches with feedback, system finds the top query immediately. • For each individual search, a single feedback is enough to learn the top query.

  48. Results: Time to generate K-best Queries • Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007). It is possible to generate the top query in < 1 sec and top 5 queries in about 2 sec, all within interactive range. Query execution is pipelined.

  49. Results: Scalability to Larger Graphs • Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL. Queries (K) It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).

  50. Experimental Results • “Gold Standard” • Using BioGuide – a biomedical information integration system • BioGuide generates the schema graph based on keywords • The edge cost in the schema graph are manually assigned by experts • This expert given schema graph is called the “gold standard” • Experiment involves in comparing the result produced by the q system with the results produced by the gold standard. Learning to Create Data Integration Queries

More Related