560 likes | 685 Views
Additional DI Paradigms. Peer to Peer Databases. Database Approaches to Semantic Integration. Data warehouse Design a single schema Do physical DB design Map data into warehouse schema Periodically update warehouse. Virtual data integration (EII) Design mediated schema
E N D
Database Approaches to Semantic Integration Data warehouse • Design a single schema Do physical DB design • Map data into warehouse schema • Periodically update warehouse Virtual data integration (EII) • Design mediated schema • Map sources to mediated schema • Queries are rewritten and answered on demand from sources
A Single Centralized Schema is a Bottleneck! Challenging to form a single schema for all domain data • People don’t agree on how concepts should be represented • Data warehouse: physical design is a strong consideration • Mediated schema very different from original users’ schemas Difficult to evolve this single schema as needs change • May “break” existing queries • Must build consensus for any schema changes
Peer Data Management: Decentralized Mediation for Ad Hoc Extensibility DB Projects Stanford IIT Mumbai UPenn UW Data integration: 1 mediated schema, m mappings to sources Peer data management system (PDMS): • n mediated “peer schemas,” as few as (n - 1) mappings between them – evaluated transitively • m mappings to sources
Answering Queries in a PDMS:Transitively Evaluating Mappings Start with schema being queried • Look up mappings to neighbors; expand • Continue iteratively until queries only over sources Mappings in a PDMS may be a combination of LAV, GAV techniques: • General form p1a(X, Y), p1b(Y,Z), … = p2a(Y, X), p2b(X, Z), …(see paper for examination of what is actually tractable) • Requires unfolding and AQUV We use a rule-goal “tree” to expand the mappings • Extend some of the ideas of MiniCon to avoid unnecessary expansions • Challenges to avoid redundancy – see paper for optimizations
Example of Query Answering Query: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) r1 r0 Author (a,w) SameProject (a1,a2,p) ProjMember (a1,p) CoAuthor (a1,a2) Sched(f,s,e) r3 r2 S2 S1 Mappings between peers’ schemas: r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p) r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w) Mappings to data sources: r2: S1(a,p,s) ProjMember(a,p), Sched(f,s,end) r3: CoAuthor(f1,f2) :- S2(f1,f2)
Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q
Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
Mappings between peers’ schemas: r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p) r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w) Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
Mappings between peers’ schemas: r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p) r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w) Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w) r0 r1 r1
Mappings to data sources: r2: S1(a,p,s) ProjMember(a,p), Sched(a,s,end) r3: CoAuthor(f1,f2) = S2(f1,f2) Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w) r0 r1 r1 ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1)
Mappings to data sources: r2: S1(a,p,s) ProjMember(a,p), Sched(a,s,end) r3: CoAuthor(f1,f2) = S2(f1,f2) Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w) r0 r1 r1 ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1) r3 r3 r2 r2
Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w) r0 r1 r1 ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1) r3 r3 r2 r2 S1(a1,p,_) S1(a2,p,_) S2(a2,a1) S2(a1,a2)
Example Rule-Goal Tree Expansion q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w) q SameProject(a1,a2,p) Author(a1,w) Author(a2,w) r0 r1 r1 ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1) r3 r3 r2 r2 S1(a1,p,_) S1(a2,p,_) S2(a2,a1) S2(a1,a2) Q’(a1,a2) :- S1(a1,p,_), S1(a2,p,_), S2(a1,a2) S1(a1,p,_), S1(a2,p,_), S2(a2,a1)
ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data Zachary Ives, Nitin Khandelwal, Aneesh Kapur, University of Pennsylvania Murat Cakir, Drexel University
Data Exchange among Bioinformatics Warehouses & Biologists Different bioinformatics institutes, research groups store their data in separate warehouses with related, “overlapping” data • Each source is independently updated, curated locally • Updates are published periodically in some “standard” schema • Each site wants to import these changes, maintain a copy of all data • Individual scientistsalso import the data and changes, and would like to share their derived results • Caveat: not all sites agreeon the facts! Often, no consensus on the “right” answer!
A Clear Need for a General Infrastructure for Data Exchange Bioinformatics exchange is done with ad hoc, custom tools – or manually – or not at all! • (NOT an instance of file sync, e.g., Intellisync, Harmony; or groupware) It’s only one instance of managing the exchange of independently modified data, e.g.: • Sharing subsets of contact lists (colleagues with different apps) • Integrating and merging multiple authors’ bibTeX, EndNote files • Distributed maintenance of sites like DBLP, SIGMOD Anthology This problem has many similarities to traditional DBs/data integration: • Structured or semi-structured data • Schema heterogeneity, different data formats, autonomous sources • Concurrent updates • Transactional semantics
Challenges in Developing Collaborative Data Sharing “Middleware” • How do we coordinate updates between conflicting collaborators? • How do we support rapid & transient participation, as in the Web or P2P systems? • How do we handle the issues of exchanging updates across different schemas? • These issues are the focus of our work on the ORCHESTRA Collaborative Data Sharing System
Our Data Sharing Model • Participants create & independently update local replicas of an instance of a particular schema • Typically stored in a conventional DBMS • Periodically reconcile changes with those of other participants • Updates are accepted based on trust/authority – coordinated disagreement • Changes may need to be translated across mappings between schemas • Sometimes only part of the information is mapped
Conclusions and Future Work ORCHESTRA focuses on trying to coordinate disagreement, rather than enforcing agreement • Significantly different from prior data sharing and synchronization efforts • Allows full autonomy of participants – offers scalability, flexibility Central ideas: • A new data model that supports “coordinated disagreement” • Global reconciliation and support for transient membership via P2P distributed hash substrate • Update translation using extensions to peer data management and view update/maintence Currently working on integrated system, performance optimization
The MOBS Project • Learn from multitude of users to improve tool accuracy, thus significantly reducing builder workload • MOBS = MassCollaborationto Build Systems Questions Answers
Mass Collaboration • Build software artifacts • Linux, Apache server, other open-source software • Knowledge bases, encyclopedia • wikipedia.com • Review & technical support websites • amazon.com, epinions.com, quiq.com, • Detect software bugs • PLDI 03 & 05 • Label images on the Web • Improve search engines, recommender systems Why not data integration systems?
Example: Duplicate Data Matching • Hard for machine, but easy for human • Serious problem in many settings (e.g., epinions.com) Dell laptop X200 with mouse ... Mouse for Dell laptop 200 series ... Dell X200. Mouse at reduced price ...
Key Challenges • How to modify tools to learn from users? • gather more training data, learn domain constraints, verify intermediate and final predictions • ask questions that are hard for machine, easy for human • How to combine noisy user answers • build user models, learn them via interaction with users • Novel form ofactive learning • with multiple noisy oracles • How to obtain user participation? • data experts, often willing to help (e.g., Illinois Fire Service) • may be asked to help (e.g., epinions) • volunteer (e.g., online communities), "payment" schemes
Key Challenges (Cont.) • How to convince the database community? • many love the idea, some skeptical • need solid papers, real systems, more evangelizing • Current status • built prototype, experimented with 3-132 users, for source discovery and schema matching • improve accuracy by 9-60%, reduced workload by 29-88% • built two simple DI systems on Web almost exclusively with users • building a real system on top of DBworld • See [McCann et al., WebDB-03, ICDE-05, AAAI Spring Symposium-05, Tech Report-06]
Simplify Mediated Schema Keyword Search over Multiple Databases • Novel problem • Extremely useful for urgent / one-time DI needs • also when users are SQL-illiterate (e.g., Electronic Medical Records) • Solution [Tech Report 05, forthcoming] • combines IR, schema matching, data matching, and AI planning Can do joins across data sources
Simplify Wrappers Structured Queries over Text/Web Data • Novel problem • attracts attention from database / AI / Web researchers at Columbia, IBM TJ Watson/Almaden, UCLA, IIT-Bombay • Handles urgent / one-time DI need [Tech Report 05b] • combines database, IR, and machine learning techniques • Handles long-term DI needs [Tech Report 05c] SELECT ... FROM ... WHERE ... E-mails, text, Web data, news, etc.
Beyond Data Integration:Community Information Management Web pages Researcher Homepages Conference Pages DB Group Pages DBworld mailing list etc. * * * * Text documents * * * * * Jim Gray Jim Gray Keyword search SQL querying Question answering Browse, mining * * giving-talk * SIGMOD-04 SIGMOD-04 * * * DBLP
The Big Picture [Speculative Mode] Many apps will - increasingly involve all three - want keyword search + SQL + etc. Structured data (relational, XML) Unstructured data (text, Web, email) Database: SQL IR/Web/AI/Mining: keyword, QA Multitude of users Semantic Web Industry/Real World
(Almost) Hands-OffInformation Integration for the Life Sciences Ulf Leser, Felix Naumann
Aladin • Basic idea • Urgent need for data integration in the life sciences • Life science databases have certain characteristics • Life science database users have certain intentions • These can be exploited to automate integration • ALmost Automatic Data INtegrationfor the Life Sciences • Minimize manual effort • Keep quality of integrated data as high as possible • Use domain-specific heuristics
Export schema Export schema Federated schema Federated schema Export schema Export schema Export schema Component schema Component schema Component schema Local schema Local schema Local schema Data Data Data Source Source Source Integration? • Database integration • Schema level • Data integration • Data level
Two Cultures of Integration • Schema-driven (computer scientists) • Much smaller than data, (hopefully) well-defined elements • Resolve redundancy and heterogeneity at the schema level • High degree of automation once system is set-up • Focus on methods - you rarely publish a “data paper” • Data-driven (biologists) • Value is in the data, abstraction is a result of analysis • Don‘t bother with schemas • Abstraction is volatile and depends on experimental technique • Manual integration at data level, constant high effort • You rarely publish a (database) “method paper”
Two Cultures: TAMBIS & SWISS-PROT • Database of protein sequences • Papers, pers. comm., ext. databases, … • Large effort: 30+ data curators • Gold standard database • Mostly perceived and used as a book • Semantic middleware • 6 sources, 1200 concepts • Ever adopted in any other project? • Integrated schema difficult to understand • No agreement on “global” concepts • Data provenance
Linking Associated Objects • Schema-driven • Too abstract; tends to blur data provenance • Data-driven • Costly and time-consuming; inadequate use of DB technology • Alternative: Concentrate on object links • Example: SRS • Maps a flat-file into a semi-structured, “one class” representation • Never mixes data from diff. sources • Use cross-references for navigation and joins
Aladin’s Scenario • Assumptions • Integration of many, many biological databases • As little manual interventions as possible • Do not merge data from different databases • Challenges • Push automation as far as possible without lowering quality of integrated data too much • Systematically evaluate quality of automatic integration • Why will it work? • Integrate by generating / finding links between objects • Exploit characteristics of life science databases
Properties – and how to use them • Data sources have only one “type” of object • Objects have nested, semi-structured annotations • Detect hierarchical structure • Objects have stable, unique accession numbers • Databases heavily cross-reference each other • Detect objects • Detect existing cross-references • Objects have rich annotations (often free text, sequences) • Detect further associations based on “similarity”
SCOP Class Fold Superfamily SwissProt Description Domains Feature DSSP Secondarystructure elements KEGG Pathway Enzyme EC Number Columba: Multidimensional Integration • Interdisciplinary project • Integrates 15 sources annotating protein structures • Sources are dimensions for PDB entries • Neither data nor schema integration - links • Advantages • Users recognize their sources • Intuitive query concept • “Relatively” easy to maintain/extend CATH Class Architecture Topology Homolog. sf PDB PDB_ID Compounds ChainsLigands GeneOntology Terms TermRelations Ontologies
Columba Experiences • = Aladin’s assumptions • Relational approach feasible: Sources are downloadable, parsers exist • Databases are collections of each one type • Hierarchical structure, only 1:n relationships • Objects have unique accession numbers • Importance of and lack of cross references • Lessons learned • Schema reengineering is extremely time-consuming • Although we will only use a small part at the end • There is more demand than resources • Why not be less specific about which data to integrate, but much faster?
PubMed KEGG Genbank Brenda OMIM PDB SWISSPROT Materialized Integration DataWarehouse BIND
PubMed KEGG Genbank Brenda OMIM PDB SWISSPROT Materialized Integration Aladin BIND
Five Steps to Integration Source-specific • Download source, parse, import into RDBMS • Guess primary objects • Guess (hierarchically structured) annotation Across data sources • Guess cross-references • Objects sharing some piece of information • Guess duplicates • Highly similar objects
Overview – Steps 1-3 • Steps 2 and 3 • Guess primary objects • Guess accession number • Guess / find FK constraints • Step 1 • Parse and import • Arbitrary target schema • With or without FK constraints
Overview – Steps 4+5 • Step 5 • Guess duplicates • Different degrees of “duplicateness” • Step 4 • Guess existing cross-refs • Compute new cross-refs
1. Download, parse, import • Q: Is that possible in an automatic way? • Q: What is the target schema? • Answers • Here, some manual work is involved, but … • Parsers are almost always available (BioXXX) • Aladin doesn‘t mind the target schema • Target schemas are completely source-specific • … may or may not contain FK constraints (MySql is …!) • But: Universal relation won’t work
2. Guess Primary Objects • Q: What’s a primary object? • Q: How do you find them? • Answers • A database is a collection of objects of one type • Many biological databases started as books • These primary objects have stable accession numbers • Accession numbers look very much the same • P0496, DXS231, 1DXX, … • Analyze length, composition, variation, uniqueness, NOT NULL • But: Databases may have more than one primary type