Data Integration in the Life Sciences Kenneth Griffiths and Richard Resnick
Tutorial Agenda
• 1:30 – 1:45 Introduction
• 1:45 – 2:00 Tutorial Survey
• 2:00 – 3:00 Approaches to Integration
• 3:00 – 3:05 Bio Break
• 3:05 – 4:00 Approaches to Integration (cont.)
• 4:00 – 4:15 Question and Answer
• 4:15 – 4:30 Break
• 4:30 – 5:00 Metadata Session
• 5:00 – 5:30 Domain-specific example (GxP)
• 5:30 Wrap-up
Life Science Data
Recent focus on genetic data:
“genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.”
- The Pharmaceutical Research and Manufacturers of America, http://www.phrma.org/genomics/lexicon/g.html
• Study of genes and their function
• Understanding molecular mechanisms of disease
• Development of drugs, vaccines, and diagnostics
The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure
… and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization
Understanding Mechanisms of Disease
• Metabolic and regulatory pathway induction
Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feed back into ongoing Genomics research and discovery.
The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources
What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
• prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
• inappropriate milk production
• disruption of the menstrual cycle
• can lead to conception difficulty
Understanding transcription factors for prolactin production
• EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
• SEQUENCE: “Show me all genes that are homologous to known transcription factors”
• LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
• Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
Approaches to Integration In order to ask this type of question across multiple domains, data integration at some level is necessary. When discussing the different approaches to data integration, a number of key issues need to be addressed: • Accessing the original data sources • Handling redundant as well as missing data • Normalizing analytical data from different data sources • Conforming terminology to industry standards • Accessing the integrated data as a single logical repository • Metadata (used to traverse domains)
Approaches to Integration (cont.) So if one agrees that the preceding issues are important, where are they addressed? In the client application, the middleware, or the database? Where they are addressed can make a huge difference in usability and performance. Currently there are a number of approaches for data integration: • Federated Databases • Data Warehousing • Indexed Data Sources • Memory-mapped Data Structures
Federated Database Approach
[Architecture diagram: an Integrated Application (answering Q1 ∩ Q2 ∩ Q3) sits on middleware (CORBA, DCOM, etc.), which federates the source applications: SeqWeb over GenBank and proprietary sequence data (SEQUENCE), a TxP App over cDNA µArray and Oligo TxP databases (EXPRESSION), and PubMed plus a proprietary app over Medline (LITERATURE). Each source answers only its own query: “Show me all genes that are homologous to known transcription factors”, “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal cells”, “Show me all genes in the public literature that are putatively related to hyperprolactinemia”.]
Advantages to Federated Database Approach
• quick to configure
• architecture is easy to understand - no knowledge of the domain is necessary
• achieves a basic level of integration with minimal effort
• can wrap and plug in new data sources as they come into existence
Problems with Federated Database Approach • Integration of queries and query results occurs at the integrated application level, requiring complex low-level logic to be embedded at the highest level • Naming conventions across systems must be adhered to or query results will be inaccurate - imposes constraints on original data sources • Data sources are not necessarily clean; integrating dirty data makes integrated dirty data. • No query optimization across multiple systems can be performed • If one source system goes down, the entire integrated application may fail • Not readily suitable for data mining, generic visualization tools • Relies on CORBA or other middleware technology, shown to have performance (and reliability?) problems
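To make the first problem concrete, here is a minimal sketch of application-level result merging in a federated setup; the fetch functions and gene names are invented stand-ins for wrapped sources, not real APIs.

    # Minimal sketch: in a federated setup the client application itself must
    # reconcile results from each source. The fetch_* functions and gene names
    # below are invented stand-ins for wrapped sources.
    def fetch_expression_hits():        # >3-fold differential (EXPRESSION source)
        return {"PRL", "GH1", "POU1F1"}

    def fetch_homology_hits():          # homologous to known TFs (SEQUENCE source)
        return {"POU1F1", "STAT5B"}

    def fetch_literature_hits():        # linked to hyperprolactinemia (LITERATURE source)
        return {"PRL", "POU1F1", "DRD2"}

    # The intersection is only correct if every source uses the same gene names;
    # if one source said "PIT-1" instead of "POU1F1", a true hit would silently
    # disappear. There is also no way to optimize the query across sources.
    answer = fetch_expression_hits() & fetch_homology_hits() & fetch_literature_hits()
    print(answer)    # {'POU1F1'}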
Solving Federated Database Problems
[Architecture diagram: the same federated layout, with a Semantic Cleaning Layer and a Relationship Service inserted between the middleware (CORBA, DCOM, etc.) and the Integrated Application, above the source applications (SeqWeb, TxP App, PubMed, Proprietary App) and their databases (GenBank, proprietary cDNA µArray db, Oligo TxP DB, Medline) across the SEQUENCE, EXPRESSION, and LITERATURE domains.]
Data Warehousing for Integration
Data warehousing is a process as much as it is a repository. There are a few primary concepts behind data warehousing:
• ETL (Extraction, Transformation, Load)
• Component-based (datamarts)
• Typically utilizes a dimensional model
• Metadata-driven
Data Warehousing
[Diagram: source data flows through E (Extraction), T (Transformation), and L (Load) into the data warehouse, which is built from integrated data marts.]
Data-level Integration Through Data Warehousing
[Diagram: the SEQUENCE, EXPRESSION, and LITERATURE sources (SeqWeb over GenBank and proprietary sequence data; the TxP App over cDNA µArray DB and Oligo TxP DB; PubMed and a proprietary app over Medline) feed a data staging layer (ETL), which loads the data warehouse. A metadata layer sits between the warehouse and the presentation applications, any of which can ask the combined question (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”]
Data Staging • Storage area and set of processes that • extracts source data • transforms data • cleans incorrect data, resolves missing elements, standards conformance • purges fields not needed • combines data sources • creates surrogate keys for data to avoid dependence on legacy keys • builds aggregates where needed • archives/logs • loads and indexes data • Does not provide query or presentation services
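As a rough illustration of one staging pass, the sketch below conforms names, resolves a missing value, and assigns surrogate keys; the inline rows, synonym map, and key scheme are all invented, and a real staging area would usually be an RDBMS.

    # Minimal sketch of one staging pass: extract, conform names, resolve a
    # missing element, assign surrogate keys, and hand off clean rows for loading.
    import csv, io

    raw_extract = io.StringIO("gene,ratio\nPIT-1,4.2\nPRL,\nGH1,1.1\n")  # extracted source rows
    SYNONYMS = {"PIT-1": "POU1F1"}                 # conform terminology to a standard
    surrogate_keys = {}

    def surrogate(gene):                           # avoid dependence on legacy keys
        return surrogate_keys.setdefault(gene, len(surrogate_keys) + 1)

    clean_rows = []
    for row in csv.DictReader(raw_extract):        # transform
        gene = SYNONYMS.get(row["gene"], row["gene"])    # clean / conform names
        if not row["ratio"]:                       # resolve missing elements (here: drop)
            continue
        clean_rows.append((surrogate(gene), gene, float(row["ratio"])))

    print(clean_rows)    # the load step would write these rows into the warehouse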
Data Staging (cont.) • Sixty to seventy percent of development is here • Engineering is generally done using database automation and scripting technology • Staging environment is often an RDBMS • Generally done in a centralized fashion and as often as desired, having no effect on source systems • Solves the integration problem once and for all, for most queries
Warehouse Development and Deployment
Two development paradigms:
• Top-down warehouse design: conceptualize the entire warehouse, then build; tends to take 2 years or more, and requirements change too quickly
• Bottom-up design and deployment: pivoted around completely functional subsections of the warehouse architecture; takes 2 months and enables modular development
Warehouse Development and Deployment (cont.) • The Data Mart: • “A logical subset of the complete data warehouse” • represents a completable project • by itself is a fully functional data warehouse • A Data Warehouse is the union of all constituent data marts. • Enables bottom-up development
Warehouse Development and Deployment (cont.) Examples of data marts in Life Science: • Sequence/Annotation - brings together sequence and annotation from public and proprietary dbs • Expression Profiling datamart - integrates multiple TxP approaches (cDNA, oligo) • High-throughput screening datamart - stores HTS information on proprietary high-throughput compound screens • Clinical trial datamart - integrates clinical trial information from multiple trials • All of these data marts are pieced together along conformed entities as they are developed, bottom up
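As a sketch of how conformed entities tie marts together, the example below gives a sequence/annotation mart, an expression mart, and a literature mart one shared gene dimension, so the tutorial's combined question becomes a single query; the schema and column names are invented for illustration.

    # Sketch: three data marts share one conformed gene dimension, so the
    # combined question (Q1 and Q2 and Q3) is a single query. Schema, names,
    # and columns are invented.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
      CREATE TABLE gene_dim        (gene_key INTEGER PRIMARY KEY, symbol TEXT);        -- conformed entity
      CREATE TABLE annotation_fact (gene_key INTEGER, homologous_to_tf INTEGER);        -- sequence/annotation mart
      CREATE TABLE expression_fact (gene_key INTEGER, fold_change REAL);                -- expression-profiling mart
      CREATE TABLE literature_fact (gene_key INTEGER, hyperprolactinemia_hit INTEGER);  -- literature mart
    """)

    combined = db.execute("""
      SELECT g.symbol
      FROM gene_dim g
      JOIN annotation_fact a USING (gene_key)
      JOIN expression_fact e USING (gene_key)
      JOIN literature_fact l USING (gene_key)
      WHERE a.homologous_to_tf = 1 AND e.fold_change > 3 AND l.hyperprolactinemia_hit = 1
    """).fetchall()
    print(combined)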
Advantages of Data-level Integration Through Data Warehousing • Integration of data occurs at the lowest level, eliminating the need for integration of queries and query results • Run-time semantic cleaning services are no longer required - this work is performed in the data staging environment • FAST! • Original source systems are left completely untouched, and if they go down, the Data Warehouse still functions • Query optimization across multiple systems’ data can be performed • Readily suitable for data mining by generic visualization tools
Issues with Data-level Integration Through Data Warehousing • ETL process can take considerable time and effort • Requires an understanding of the domain to represent relationships among objects correctly • More scalable when accompanied by a Metadata repository which provides a layer of abstraction over the warehouse to be used by the application. Building this repository requires additional effort.
Indexing Data Sources • Indexes and links a large number of data sources (e.g., files, URLs) • Data integration takes place by using the results of one query to link and jump to a keyed record in another location • Users have the ability to develop custom applications by using a vendor-specific language
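A toy sketch of the idea: each source is indexed by key and byte offset, and integration is just following a cross-reference key from one index into another. The file name, keys, and record text below are invented.

    # Toy sketch of index-and-link integration. Each source is a flat file
    # indexed by accession -> (file, byte offset); integration happens by
    # following cross-reference keys from one index into another.
    records = {"PM123456": "TI  Prolactinoma and dopamine agonists ...\n"}
    literature_index = {}
    with open("medline_dump.txt", "wb") as f:         # build a tiny literature source
        for key, text in records.items():
            literature_index[key] = ("medline_dump.txt", f.tell())
            f.write(text.encode())

    cross_refs = {"U23456": ["PM123456"]}             # sequence record -> linked PubMed ids

    def fetch(index, key):
        path, offset = index[key]
        with open(path, "rb") as f:
            f.seek(offset)                            # jump straight to the keyed record
            return f.readline().decode()

    # "Integration" = use the result of one lookup to key into another source.
    for pmid in cross_refs["U23456"]:
        print(fetch(literature_index, pmid))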
Indexed Data Source Architecture
[Diagram: an index traversal support mechanism sits on top of individually indexed data sources, e.g. sequence indexed data sources, GxP indexed data sources, and SNP information.]
Indexed Data Sources: Pros and Cons
Advantages
• quick to set up
• easy to understand
• achieves a basic level of integration with minimal effort
Disadvantages
• does not clean and normalize the data
• does not have a way to directly integrate data from relational DBMSs
• difficult to browse and mine
• sometimes requires knowledge of a vendor-specific language
Memory-mapped Integration • The idea behind this approach is to integrate the actual analytical data in memory and not in a relational database system • Performance is fast since the application retrieves the data from memory rather than disk • True data integration is achieved for the analytical data but the descriptive or complementary data resides in separate databases
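A minimal sketch of the idea, with an invented in-memory expression matrix; the descriptive lookups are deliberately kept outside the matrix to show where the separation lies.

    # Minimal sketch: analytical values live in one in-memory matrix, while
    # descriptive data (gene names, tissue types) stays in separate databases.
    # The matrix values and lookup table below are invented for illustration.
    from array import array

    n_genes, n_samples = 3, 2
    expression = array("d", [1.2, 0.4,      # gene 0
                             4.1, 3.9,      # gene 1
                             0.9, 1.1])     # gene 2   (row-major matrix held in RAM)

    def value(gene_idx, sample_idx):
        return expression[gene_idx * n_samples + sample_idx]   # fast: pure memory access

    # Descriptive data is NOT in the matrix; it needs a separate lookup (in
    # practice a relational database reached through a CORBA-style layer),
    # which is where the performance and extensibility problems appear.
    gene_names = {0: "PRL", 1: "POU1F1", 2: "GH1"}
    print(gene_names[1], value(1, 0))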
Memory Map Architecture
[Diagram: analytical data from Sequence DB #1 and Sequence DB #2 is loaded into the memory-mapped integrated data; sample/source and other descriptive information remains in separate databases, reached through a data integration layer over CORBA.]
Memory Maps: Pros and Cons
Advantages
• true “analytical” data integration
• quick access
• cleans analytical data
• simple matrix representation
Disadvantages
• typically does not put non-analytical data (gene names, tissue types, etc.) through the ETL process
• not easily extensible when adding new databases with descriptive information
• performance hit when accessing anything outside of memory (tough to optimize)
• scalability restricted by memory limitations of machine
• difficult to mine due to complicated architecture
The Need for Metadata For all of the previous approaches, one underlying concept plays a critical role to their success: Metadata. Metadata is a concept that many people still do not fully understand. Some common questions include: • What is it? • Where does it come from? • Where do you keep it? • How is it used?
Metadata “The data about the data…” • Describes data types, relationships, joins, histories, etc. • A layer of abstraction, much like a middle layer, except... • Stored in the same repository as the data, accessed in a consistent “database-like” way
Metadata (cont.) • Back-end metadata - supports the developers • Source system metadata: versions, formats, access stats, verbose information • Business metadata: schedules, logs, procedures, definitions, maps, security • Database metadata - data models, indexes, physical & logical design, security • Front-end metadata - supports the scientist and application • Nomenclature metadata - valid terms, mapping of DB field names to understandable names • Query metadata - query templates, join specifications, views, can include back-end metadata • Reporting/visualization metadata - template definitions, association maps, transformations • Application security metadata - security profiles at the application level
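A small sketch of front-end metadata at work: a nomenclature map and a query template that a generic application reads at run time. The table, column, and template names are invented.

    # Sketch of front-end metadata: a nomenclature map and a query template
    # stored alongside the data and read by a generic application.
    nomenclature = {
        "EXPR_FACT.RATIO": "Fold change (hyperprolactinemic vs. normal)",
        "GENE_DIM.SYMBOL": "Gene symbol",
    }
    query_templates = {
        "differential_genes":
            "SELECT GENE_DIM.SYMBOL FROM EXPR_FACT JOIN GENE_DIM USING (GENE_KEY) "
            "WHERE EXPR_FACT.RATIO > :threshold",
    }

    # A generic UI can label columns and build queries from the metadata alone,
    # so the application keeps working as tables and fields are added.
    for field, label in nomenclature.items():
        print(f"{label}  [{field}]")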
Metadata Benefits • Enables the application designer to develop generic applications that grow as the data grows • Provides a repository for the scientist to become better informed on the nature of the information in the database • Is a high-performance alternative to developing an object-relational layer between the database and the application • Extends gracefully as the database extends
Integration Technologies • Technologies that support integration efforts • Data Interchange • Object Brokering • Modeling techniques
Data Interchange • Standards for inter-process and inter-domain communication • Two types of data • Data – the actual information that is being interchanged • Metadata – the information on the structural and semantic aspects of the Data • Examples: • EMBL format • ASN.1 • XML
XML Emerges
• Allows uniform description of data and metadata
• Metadata described through DTDs
• Data conforms to metadata description
• Provides an open source solution for data integration between components
• Lots of support in CompSci community (proportional to cardinality of Perl modules developed)
• XML::CGI - a module to convert CGI parameters to and from XML
• XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an Object Oriented data structure with a DOM Level 1 compliant interface.
• XML::Dumper - a simple package to experiment with converting Perl data structures to XML and converting XML to Perl data structures.
• XML::Encoding - a subclass of XML::Parser, parses encoding map XML files.
• XML::Generator - an extremely simple module to help in the generation of XML.
• XML::Grove - provides simple objects for parsed XML documents. The objects may be modified but no checking is performed.
• XML::Parser - a Perl extension interface to James Clark's XML parser, expat.
• XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
• XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.
XML in Life Sciences • Lots of momentum in Bio community • GFF (Gene Finding Features) • GAME (Genomic Annotation Markup Elements) • BIOML (BioPolymer markup language) • EBI’s XML format for gene expression data • … • Will be used to specify ontological descriptions of Biology data
XML – DTDs
• Interchange format defined through a DTD – Document Type Definition:

    <!ELEMENT bioxml-game:seq_relationship (bioxml-game:span, bioxml-game:alignment?)>
    <!ATTLIST bioxml-game:seq_relationship
        seq  IDREF #IMPLIED
        type (query | subject | peer | subseq) #IMPLIED >

• And data conforms to the DTD:

    <seq_relationship seq="seq1" type="query">
      <span>
        <begin>10</begin>
        <end>15</end>
      </span>
    </seq_relationship>
    <seq_relationship seq="seq2" type="subject">
      <span>
        <begin>20</begin>
        <end>25</end>
      </span>
      <alignment>
        query:   atgccg
                 ||| ||
        subject: atgacg
      </alignment>
    </seq_relationship>
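For illustration, here is a short sketch of consuming such instance data with Python's standard-library XML parser; the embedded document is a simplified copy of the example above (namespace prefixes omitted).

    # Sketch: parsing seq_relationship instance data with the standard library.
    import xml.etree.ElementTree as ET

    doc = """<relationships>
      <seq_relationship seq="seq1" type="query">
        <span><begin>10</begin><end>15</end></span>
      </seq_relationship>
      <seq_relationship seq="seq2" type="subject">
        <span><begin>20</begin><end>25</end></span>
      </seq_relationship>
    </relationships>"""

    for rel in ET.fromstring(doc).findall("seq_relationship"):
        span = rel.find("span")
        print(rel.get("seq"), rel.get("type"),
              span.findtext("begin"), span.findtext("end"))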
XML Summary
Benefits
• Metadata and data have same format
• HTML-like
• Broad support in CompSci and Biology
• Sufficiently flexible to represent any data model
• XSL style sheets map from one DTD to another
Drawbacks
• Doesn't allow for abstraction or partial inheritance
• Interchange can be slow in certain data migration tasks
Object Brokering • The details of data can often be encapsulated in objects • Only the interfaces need definition • Forget DTDs and data description • Mechanisms for moving objects around based solely on their interfaces would allow for seamless integration
Enter CORBA • Common Object Request Broker Architecture • Applications have access to method calls through IDL stubs • Makes a method call which is transferred through an ORB to the Object implementation • Implementation returns result back through ORB
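A schematic sketch of the stub/ORB/implementation call pattern; the classes below are mock stand-ins written for illustration, not a real ORB or generated IDL stubs.

    # Schematic sketch of the CORBA call pattern: the client calls a stub, the
    # stub hands the request to an "ORB", and the ORB dispatches it to the
    # object implementation, which returns the result back through the ORB.
    class SequenceServiceImpl:                       # object implementation
        def homologs(self, gene):
            return ["POU1F1"] if gene == "PRL" else []

    class Orb:                                       # stands in for the ORB
        def __init__(self):
            self.objects = {}
        def register(self, name, obj):
            self.objects[name] = obj
        def invoke(self, name, method, *args):       # transfer request to implementation
            return getattr(self.objects[name], method)(*args)

    class SequenceServiceStub:                       # client-side stub: interface only
        def __init__(self, orb):
            self.orb = orb
        def homologs(self, gene):                    # no knowledge of the implementation
            return self.orb.invoke("SequenceService", "homologs", gene)

    orb = Orb()
    orb.register("SequenceService", SequenceServiceImpl())
    print(SequenceServiceStub(orb).homologs("PRL"))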
CORBA IDL • IDL – Interface Definition Language • Like C++/Java headers, but with slightly more type flexibility
CORBA Summary
Benefits
• Distributed
• Component-based architecture
• Promotes reuse
• Doesn't require knowledge of implementation
• Platform independent
Drawbacks
• Distributed
• Level of abstraction is sometimes not useful
• Can be slow to broker objects
• Different ORBs do different things
• Unreliable?
• OMG website is brutal
Modeling Techniques • E-R Modeling • Optimized for transactional data • Eliminates redundant data • Preserves dependencies in UPDATEs • Doesn’t allow for inconsistent data • Useful for transactional systems • Dimensional Modeling • Optimized for queryability and performance • Does not eliminate redundant data, where appropriate • Constraints unenforced • Models data as a hypercube • Useful for analytical systems
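As a sketch of the dimensional style for expression data, the example below builds a small star schema and runs one analytical query; table and column names are invented.

    # Sketch of a dimensional (star schema) layout for expression results, as
    # opposed to a fully normalized E-R design. Names are illustrative only.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
      -- dimensions carry (possibly redundant) descriptive attributes
      CREATE TABLE gene_dim   (gene_key INTEGER PRIMARY KEY, symbol TEXT,
                               chromosome TEXT, is_transcription_factor INTEGER);
      CREATE TABLE sample_dim (sample_key INTEGER PRIMARY KEY, tissue TEXT,
                               disease_state TEXT);
      -- the fact table is a cell of the hypercube: one measurement per
      -- (gene, sample) coordinate
      CREATE TABLE expression_fact (gene_key INTEGER, sample_key INTEGER,
                                    ratio REAL);
    """)

    # Analytical queries constrain dimensions and aggregate facts.
    rows = db.execute("""
      SELECT g.symbol, AVG(f.ratio)
      FROM expression_fact f
      JOIN gene_dim g   USING (gene_key)
      JOIN sample_dim s USING (sample_key)
      WHERE s.disease_state = 'hyperprolactinemic' AND g.is_transcription_factor = 1
      GROUP BY g.symbol
      HAVING AVG(f.ratio) > 3
    """).fetchall()
    print(rows)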