520 likes | 722 Views
Learning to Create Data-Integration Queries Partha Pratim Talukdar , Marie Jacob, Muhammad Salman Mehmood , Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha VLDB2008 Seminar Presented by Noel Gunasekar CSE Department – SUNY Buffalo. Learning to Create Data-Integration Queries.
E N D
Learning to Create Data-Integration QueriesParthaPratimTalukdar, Marie Jacob, Muhammad SalmanMehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, SudiptoGuhaVLDB2008 Seminar Presented byNoel GunasekarCSE Department – SUNY Buffalo
Learning to Create Data-Integration Queries Introduction Motivation Example Existing solutions Q-System Solution Q-System Architecture Query and Query Answers Executing Query Learning From feedback Conclusion Experimental results Future Work Learning to Create Data Integration Queries
Introduction Learning to Create Data Integration Queries
Motivation • Need for non-expert user to pose queries across multiple data resources. • Non-expert user - Not familiar with querying languages • Multiple resource - Databases, Data warehouses, Virtual integrated schemas Learning to Create Data Integration Queries
Bio-Science Field • Many "standardized" databases with overlapping and cross-referenced information • Each site is being independently extended, corrected, and analyzed • Differing levels of data quality/confidence Protein Databases Protein DataBase: - PDB information and service listings at Brookhaven National Laboratory [ BNL ] PIR: - Protein Identification Resource database at [ JHU ] PRF: - Protein Research Foundation database at GenomeNet SwissProt - Protein database at ExPASy [ Switzerland ] Learning to Create Data Integration Queries
Example • “What are the proteins and genes associated with the disease Narcolepsy?” • Life Sciences researcher querying on data-sources like genomics, disease studies and pharmacology. genomics Life Sciences Researcher Disease Studies http://www.expasy.org/uniprot/P04049 Pharmacology Learning to Create Data Integration Queries
ExistingSolution • Using keyword based queries on Web-Forms • Match the keywords with terms in the tuples and form the query by joining different databases using foreign-keys • Cost for the query is fixed and doesn’t accommodate the context of the query http://www.expasy.org/uniprot/P0C852 Learning to Create Data Integration Queries
Proposed Solution - Q System Automatically generate Web-Forms for given set of keywords Pose queries across multiple data resources using the generated web-form Learning to Create Data Integration Queries
Proposed Solution - Q System Create re-usable web-form User (Author) Q System Keywords Protein, gene, disease Reusable Web-Form For querying Use web-form for Querying Users (Author + others) Query Results Parameters Reusable Web-Form For querying Learning to Create Data Integration Queries
Q System Architecture Learning to Create Data Integration Queries
Architecture of Q System • Four Components • Initial Schema Loader • Query Template Creation • Query Execution • Learning Through Feedback Learning to Create Data Integration Queries
Architecture of Q System Learning to Create Data Integration Queries
Initial Setup • Schema Loader • Input • Given a set of data sources with its own schema • Foreign Keys and Links • Schema Mappings • Record Link • Output • Schema Graph Learning to Create Data Integration Queries
b c 0.07 0.04 0.1 d Initial Setup Example Schema Graph Node: Databases and their attributes (UniProt database, Entrez GeneInfo db, term) Edge: Relation based on foreign keys/cross-references (UniProt to PIR) Cost: Reliability, completeness Learning to Create Data Integration Queries
Query Template Creation Learning to Create Data Integration Queries
Query Template Creation - Example Input: “protein”, “plasma membrane”, “gene” and “disease” Output: Learning to Create Data Integration Queries
a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Template Creation - Example Query Keywords a, e, f Schema Graph Q1 Q2
Query Formulation • Trees can be easily written as executable queries: Steiner Tree a b 0.1 0.1 d f 0.1 0.1 e Conjunctive query: a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)
Query Execution Learning to Create Data Integration Queries
Output: Result Answers Q1 Q1 Q1,2 Q2 Q2 Q2 System determines “producer” queries using provenance
Query Execution • Query Processing Engine with • Support for querying remote data sources • Record data provenance • Solution: • ORCHESTRA • http://www.cis.upenn.edu/~zives/orchestra/ Learning to Create Data Integration Queries
Orchestra Project • The ORCHESTRA project focuses on the challenges of data sharing scenarios in the sciences • Bioinformatics Scenario - many "standardized" databases with overlapping information, similar but not identical data and differing levels of data quality/confidence • Each site is being independently extended, corrected, and analyzed • ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, with disagreeing users Learning to Create Data Integration Queries
Orchestra Project – Data Provenance http://www.cis.upenn.edu/~zives/research/exchange.pdf Learning to Create Data Integration Queries
Learning through Feedback Learning to Create Data Integration Queries
Learning through Feedback • Input: Ranked Results + provenance Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries
Learning through Feedback • User provides feedback Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries
Query Formulation - Recap a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Keywordsa, e, f Schema Graph Q1 Q2
Learning through Feedback a b a b c 0.1 0.1 0.07 0.05 0.04 Q1 0.1 Q2 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.39 Rank = 2 Cost = 0. 41 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.4 e e • Change weights so Q2 is “cheaper” than Q1 a b c 0.05 0.07 0.1 0.04 0.1 d f 0.1 0.1 e
Iteration! Learning to Create Data Integration Queries
Q-System: Challenges • Computation of ranked queries which in turn produce ranked tuples: K-Best Steiner Tree Generation • Predicting new query rankings based on user feedback over tuples, and also generalizing feedback: Learning • Maintaining associations between tuples and queries: Query answers with provenance • Everything at interactive speed!
f= 1 if the edge connects Term and Synonym tables, else 0 Term Synonym f 1 w8 Cost of a Query • Query Cost = Sum of edge costs in the tree. • Edge Cost = Sum of weights of features defined over it. • Features are properties of the edges, e.g., nodes connected • Each feature has a corresponding weight. • Feature example:
a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Steiner Trees: Finding Lowest-Cost Queries • A tree of minimal cost in a graph (G) which includes all the required nodes (S). • Cost of a Steiner Tree is the sum of costs of edges present in the tree. • Steiner Tree is generalization of Minimum Spanning Tree (MST) [equivalent when S = all vertices in G].
K-Best Steiner Tree Algorithms • Exact (practical for ~100 nodes and edges). • Integer Linear Program (ILP) based formulation for finding K-best Steiner Trees in a graph. • The ILP uses ideas from multi-commodity network flows • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Significantly faster; in practice, often gives optimal solution.
Constraints C1 :Flow of commodity k starts at root r C2 : Flow of commodity k terminates at node k C3 : Conservation of flow at Steiner nodes C4 : Flow of an edge allowed only if that edge is included ( Yij = 1 ) C5 : Non-negativity constraint C6 : Defines value for Y C7 : Ensures no incoming active edge into the root C8 : Ensures that all nodes have at most one incoming active edge C9 : Flow of at least one commodity on all edges in I C10 : Ensures no flow on edges in X
a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Finding K-Best Steiner Trees 2-best Steiner trees connecting terminal nodes.
K-Best Steiner Tree Algorithms • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Send “m” shortest path graph as input. • Shortest path between each pair of nodes in S • Significantly faster; in practice, often gives optimal solution.
Tuples Query a b Q d f T e . . . . . . a b c Q* d f T* e Q Challenge : Getting User Feedback Top . . . Bottom T and T* differ in 3 edges. This difference is termed loss: L(T, T*)
Learning: Update Weights Edge Cost: 0.07 Term(T1) Term2Term w8 = 0.06 w25 = 0.01 Edge Cost Update Edge Cost: 0.05 Term(T1) Term2Term w8 = 0.04 w25 = 0.01
Re-ranked Steiner Trees a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Weight Update a b c a b 0.1 0.05 0.1 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Rank 1 Rank 2
Experimental Results • The Key Questions • Can the algorithm start with uninitialized weights and learn expert (“gold standard”) ranking of queries? • Can the results be generated at interactive speeds? • Does the approach scale to larger graphs?
# Gold queries absent in top-3 predictions Results: Learning Expert Weights • Graph: Start with the BioGuide bio sources, with 28 vertices and 96 edges. • Goal: Learn the queries corresponding to the expert-set weights in BioGuide • Methodology: • All weights are set to default. • Sequence of 25 searches • For each, user feedback identifies & promotes a tuple from the gold standard answer. • After 40-60% searches with feedback, system finds the top query immediately. • For each individual search, a single feedback is enough to learn the top query.
Results: Time to generate K-best Queries • Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007). It is possible to generate the top query in < 1 sec and top 5 queries in about 2 sec, all within interactive range. Query execution is pipelined.
Results: Scalability to Larger Graphs • Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL. Queries (K) It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).
Experimental Results • “Gold Standard” • Using BioGuide – a biomedical information integration system • BioGuide generates the schema graph based on keywords • The edge cost in the schema graph are manually assigned by experts • This expert given schema graph is called the “gold standard” • Experiment involves in comparing the result produced by the q system with the results produced by the gold standard. Learning to Create Data Integration Queries