The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective

Ugis Sarkans, European Bioinformatics Institute

Outline
• Microarray data and standards overview
• ArrayExpress overall principles
• ArrayExpress architecture
• AE repository
• AE data warehouse
• Future plans and conclusions

Gene expression data and annotation
[Figure: a gene expression matrix (genes by samples) with its annotations. Labels: sample annotations (problem 1), gene expression levels (problem 2), gene annotations.]

Platform comparison (Tan et al, PNAS, 2003) ‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margareth Cam, NIH)

[Figure: microarray experiment workflow. Labels: sample, RNA extract, labelled nucleic acid, hybridisation, array, array design, microarray, normalization, integration, gene expression data matrix (genes), experiment. A protocol is attached to each step.]

Different processing levels of MA data
[Figure, panels A to D. Labels: array scans, quantitations, spots, genes, samples.]

MGED standards
• MIAME – minimum information about a microarray experiment
• MAGE-OM and MAGE-ML – microarray gene expression object model and mark-up language
• MO – microarray ontology
• Data normalisation and transformations (and quality control)

UML packages of MAGE
[Figure: packages grouped under the headings "results", "what was done", "what was used" and "miscellaneous". Packages shown: HigherLevelAnalysis, Experiment, BioMaterial, BioAssayData, BioAssay, Array, QuantitationType, ArrayDesign, AuditAndSecurity, Measurement, DesignElement, Protocol, Description, BQS, BioSequence, BioEvent.]

MAGE – an example diagram

ArrayExpress aims
• An archive for microarray data supporting scientific publications
• Providing easy access to public gene expression and other microarray data in a structured format
• Facilitating the sharing of microarray designs and protocols
• Facilitating the establishment of infrastructure for microarray data sharing

AE users
• Experimentalists
• “Single-gene” biologists
• Bioinformaticians; genome-wide studies
• Bioinformaticians – algorithm developers
• Software developers

ArrayExpress infrastructure
[Figure: ArrayExpress at EBI and its surroundings. Components shown:
• Submissions: MIAMExpress (www); MAGE-ML from array manufacturers (Affymetrix, Agilent), external MIAMExpress installations (Camb. U., EMBL) and other microarray databases (SMD, TIGR, Utrecht, RZPD)
• Submission tracking/curation tool
• ArrayExpress repository (MAGE-ML in and out)
• Analysis warehouse (BioMart), with www access for queries and analysis
• Data analysis: Expression Profiler and data analysis software (R/Bioconductor, J-Express, Resolver)
• External databases (EMBL, UniProt, Ensembl)]

AE: overall principles
• Adherence to community standards
• Data captured in a granular, formalized manner
• Modern but proven software technologies
• Incremental development

AE design considerations
• Separate data archiving from the query-optimized data warehouse
• Generate default implementation, then refine
  • ~2 full-time developers
  • pressure to bring system online quickly
• Use object abstraction layer
  • deal with performance overhead on case-by-case basis

Repository architecture overview
[Figure: components shown: MAGE-ML documents and the MAGE-ML DTD; the curation environment with the MAGE validator (error.log); the MAGE loader and MAGE unloader; MAGE-OM objects with Castor object/relational mapping onto an Oracle DB; Java servlets on Tomcat with Velocity web page templates.]

AE schema
• Why auto-generated?
  • AE must be able to import any valid MAGE-ML and not lose information
  • good for navigating through data in terms of the object model
  • if some queries don’t work well, add something to the schema
    • Experiment-Biomaterial, Experiment-Protocol links
  • so far works for 400 GB of data
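The "generate a default schema, then refine" approach can be illustrated with a small sketch. This is not the ArrayExpress generator (the slides name Castor for the object/relational mapping). The three classes, their attributes and the table layout are simplified assumptions, and SQLite stands in for Oracle only to check that the generated DDL is valid.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ModelClass:
    """One class of a hypothetical, heavily simplified object model."""
    name: str
    attributes: dict = field(default_factory=dict)    # attribute name -> SQL type
    associations: list = field(default_factory=list)  # names of referenced classes


def generate_ddl(classes):
    """Default mapping: one table per class, one column per attribute,
    one foreign-key column per association."""
    statements = []
    for cls in classes:
        columns = ["id INTEGER PRIMARY KEY"]
        columns += [f"{attr} {sql_type}" for attr, sql_type in cls.attributes.items()]
        columns += [f"{target.lower()}_id INTEGER REFERENCES {target}(id)"
                    for target in cls.associations]
        statements.append(f"CREATE TABLE {cls.name} (\n  " + ",\n  ".join(columns) + "\n);")
    return statements


def shortcut_link_ddl(source, target):
    """Manual refinement: a link table joining two classes directly."""
    return (
        f"CREATE TABLE {source}_{target} (\n"
        f"  {source.lower()}_id INTEGER REFERENCES {source}(id),\n"
        f"  {target.lower()}_id INTEGER REFERENCES {target}(id)\n"
        ");"
    )


# Assumed toy model: BioMaterial is not linked to Experiment directly.
model = [
    ModelClass("Experiment", {"name": "VARCHAR(255)"}),
    ModelClass("BioAssay", {"name": "VARCHAR(255)"}, ["Experiment"]),
    ModelClass("BioMaterial", {"name": "VARCHAR(255)"}),
]
ddl = generate_ddl(model)
link = shortcut_link_ddl("Experiment", "BioMaterial")

# Sanity check: the generated statements are executable SQL.
conn = sqlite3.connect(":memory:")
for statement in ddl + [link]:
    conn.execute(statement)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

The generated tables follow the model one to one, which keeps any valid input loadable. The shortcut link table is the kind of hand-added refinement the slide mentions for queries that the default mapping serves poorly.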

Auto-generated web pages

To ontologize or not to ontologize
[Figure: two panels, "At the beginning" and "At the end".]

Model vs. ontology
• Model – stable; ontologies – flexible
• Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard
• Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
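One way to picture this split: classes and associations are fixed in code, while descriptive attributes are category/value pairs checked against an ontology that is plain data. The sketch below is only an illustration under that assumption. `ALLOWED_TERMS`, its terms, and the simplified `BioMaterial` and `OntologyEntry` classes are hypothetical stand-ins, not MAGE-OM or the MGED Ontology.

```python
from dataclasses import dataclass, field

# Ontology side: flexible. Categories and terms are data, not code.
# (Hypothetical terms, chosen only for the example.)
ALLOWED_TERMS = {
    "Organism": {"Homo sapiens", "Mus musculus"},
    "Sex": {"male", "female", "unknown"},
}


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class BioMaterial:
    # Model side: stable. The class and its association to OntologyEntry
    # stay the same however many attribute types the ontology gains.
    name: str
    characteristics: list = field(default_factory=list)

    def annotate(self, category, value):
        """Attach a characteristic after checking it against the ontology."""
        if value not in ALLOWED_TERMS.get(category, set()):
            raise ValueError(f"{value!r} is not a known term for {category!r}")
        self.characteristics.append(OntologyEntry(category, value))


sample = BioMaterial("liver sample 1")
sample.annotate("Organism", "Mus musculus")

# A new attribute type needs an ontology update only; the model is untouched.
ALLOWED_TERMS["DevelopmentalStage"] = {"embryo", "adult"}
sample.annotate("DevelopmentalStage", "adult")
```

Adding a new association between classes, by contrast, would mean changing the dataclasses themselves, and with them any schema and loader derived from the model.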

>15 000 000 000 data points
• Experiment 1
  • type
  • performer
  • ….
• Hybridization data 1
  • Experimental factors
  • Quantitation type definitions
  • …
[Figure label: NetCDF]

Data warehouse schema

What BioMart gives to AEDW
• Query language abstraction
• Joins automatically generated
• Schema optimized for performance
• Clear database integration roadmap
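A rough idea of what "query language abstraction" with "joins automatically generated" can mean: the caller names attributes and filters, and a small builder works out which tables to join. The sketch is not BioMart's API or schema. The table names, the `ATTRIBUTE_TABLE` mapping and the `gene_id` key are invented for the example, and SQLite is used only to run it.

```python
import sqlite3

MAIN_TABLE = "expression_main"  # hypothetical central table
KEY = "gene_id"                 # hypothetical key shared by all tables

# Which table holds each queryable attribute (hypothetical layout).
ATTRIBUTE_TABLE = {
    "gene_name": "gene_dm",
    "sample_name": "expression_main",
    "value": "expression_main",
}


def build_query(attributes, filters):
    """Return (sql, params); joins are derived from the attributes used."""
    needed = {ATTRIBUTE_TABLE[a] for a in list(attributes) + list(filters)}
    select = ", ".join(f"{ATTRIBUTE_TABLE[a]}.{a}" for a in attributes)
    parts = [f"SELECT {select} FROM {MAIN_TABLE}"]
    for table in sorted(needed - {MAIN_TABLE}):
        parts.append(f"JOIN {table} ON {table}.{KEY} = {MAIN_TABLE}.{KEY}")
    if filters:
        conditions = " AND ".join(f"{ATTRIBUTE_TABLE[a]}.{a} = ?" for a in filters)
        parts.append(f"WHERE {conditions}")
    return " ".join(parts), list(filters.values())


# Tiny in-memory database to exercise the builder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expression_main (gene_id INTEGER, sample_name TEXT, value REAL);
CREATE TABLE gene_dm (gene_id INTEGER, gene_name TEXT);
INSERT INTO expression_main VALUES (1, 'liver', 2.5);
INSERT INTO expression_main VALUES (2, 'liver', 0.1);
INSERT INTO gene_dm VALUES (1, 'TP53');
INSERT INTO gene_dm VALUES (2, 'BRCA1');
""")

sql, params = build_query(["sample_name", "value"], {"gene_name": "TP53"})
rows = conn.execute(sql, params).fetchall()
```

The caller never writes a join. Filtering on `gene_name` is enough for the builder to pull in `gene_dm`, and a query that touches only the main table gets no join at all.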

ArrayExpress environment

Future plans
• Data management environment automation
• Flexible data warehouse interface
• Programmatic interface (HTTP/XML based)
• Distributed infrastructure??

Distributed data infrastructure
[Figure: users query ArrayExpress; a query broker finds the resource among several local databases; the data are delivered to the user.]
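The broker idea (users query, the broker finds the resource, a local database delivers the data) might look like this in miniature. The slides present distributed infrastructure as an open question, so this is purely illustrative: the `LocalDatabase` and `QueryBroker` classes, the accession strings and the linear lookup are all hypothetical.

```python
class LocalDatabase:
    """Stands in for one site's local microarray database."""

    def __init__(self, name, experiments):
        self.name = name
        self._experiments = experiments  # accession -> data

    def has(self, accession):
        return accession in self._experiments

    def fetch(self, accession):
        return self._experiments[accession]


class QueryBroker:
    """Routes a query to whichever registered database holds the data."""

    def __init__(self, databases):
        self._databases = list(databases)

    def find_resource(self, accession):
        # Ask each registered database in turn; return the first that has it.
        for db in self._databases:
            if db.has(accession):
                return db
        return None

    def query(self, accession):
        db = self.find_resource(accession)
        if db is None:
            raise KeyError(f"no registered database holds {accession}")
        return db.name, db.fetch(accession)


broker = QueryBroker([
    LocalDatabase("site-1", {"exp-A": {"samples": 4}}),
    LocalDatabase("site-2", {"exp-B": {"samples": 12}}),
])
site, data = broker.query("exp-B")
```

A real broker would more likely keep an index of which site holds what instead of polling every database, but the division of roles is the same.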

Conclusions
• Conceptual object modeling works well for complex life sciences domains
• Many software infrastructure components can be auto-generated from object models
• A range of approaches can be used for modeling, e.g., UML framework + ontologies
• Repository and data warehouse – different aims and different implementation principles

Acknowledgements
• MGED collaborators
  • Stanford, TIGR, Affymetrix, EMBL, ….
• BioMart team
• Gonzalo Garcia Lara – web interface
• Ahmet Oezcimen – DBA
• Anjan Sharma – curation tool
• Sergio Contrino, Richard Coulson – data warehouse
• Niran Abeygunawardena – webmaster
• Mohammadreza Shojatalab – MIAMExpress
• Misha Kapushesky – Expression Profiler
• Curation team:
  • Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
• Domain-specific projects:
  • Susanna Sansone, Philippe Rocca-Serra
• Alvis Brazma
