The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective Ugis Sarkans European Bioinformatics Institute
Outline • Microarray data and standards overview • ArrayExpress overall principles • ArrayExpress architecture • AE repository • AE data warehouse • Future plans and conclusions
Gene expression data and annotation [Figure: gene expression matrix with Genes and Samples as its two axes, holding gene expression levels (problem 2), accompanied by sample annotations (problem 1) and gene annotations]
Platform comparison (Tan et al, PNAS, 2003) ‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margareth Cam, NIH)
[Figure labels: Experiment; Sample; RNA extract; labelled nucleic acid; hybridisation; array; Microarray; Array design; Protocol (repeated at each step); normalization; integration; Gene expression data matrix; genes]
Different processing levels of MA data [Figure: panels A, B, C, D; labels: Array scans, Quantitations, Spots, Genes, Samples]
MGED standards • MIAME – minimum information about a microarray experiment • MAGE-OM and MAGE-ML – microarray gene expression object model and mark-up language • MO – microarray ontology • Data normalisation and transformations (and quality control)
UML Packages of MAGE [Figure: packages grouped under the headings "results", "what was done", "what was used" and "miscellaneous"; packages shown: HigherLevelAnalysis, Experiment, BioMaterial, BioAssayData, BioAssay, Array, QuantitationType, ArrayDesign, AuditAndSecurity, Measurement, DesignElement, Protocol, Description, BQS, BioSequence, BioEvent]
ArrayExpress aims • An archive for microarray data supporting scientific publications • Providing easy access to public gene expression and other microarray data in a structured format • Facilitating the sharing of microarray designs and protocols • Facilitating the establishment of infrastructure for microarray data sharing
AE users • Experimentalists • “Single-gene” biologists • Bioinformaticians; genome-wide studies • Bioinformaticians – algorithm developers • Software developers
ArrayExpress infrastructure [Figure labels: EBI; Submissions; ArrayExpress; Array Manufacturers (Affymetrix, Agilent); www; MIAMExpress; MAGE-ML; External MIAMExpress installations (Camb. U., EMBL); Submission tracking/curation tool; Other Microarray Databases (SMD, TIGR, Utrecht, RZPD); ArrayExpress repository; Queries, analysis; Analysis Warehouse (BioMart); Data Analysis Software (R/Bioconductor, J-Express, Resolver); Expression Profiler; External Databases (EMBL, UniProt, Ensembl); Data analysis]
AE: overall principles • Adherence to community standards • Data captured in a granular, formalized manner • Modern but proven software technologies • Incremental development
AE design considerations • Separate data archiving from the query-optimized data warehouse • Generate default implementation, then refine • ~2 full-time developers • pressure to bring system online quickly • Use object abstraction layer • deal with performance overhead on case-by-case basis
Repository architecture overview [Figure labels: MAGE-ML documents; MAGE-ML DTD; MAGE validator; error.log; MAGE loader; MAGE-OM; object/relational mapping (Castor); Oracle DB; MAGE unloader; Tomcat; Java servlets; Velocity; Web page templates; Curation environment]
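The validate, load and unload steps named in the diagram can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the real system is Java-based and validates against the MAGE-ML DTD, whereas this sketch uses Python's ElementTree, which does no DTD validation. The XML fragment, the two required attributes and the function names validate, load and unload are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical, heavily simplified stand-in for a MAGE-OM class."""
    identifier: str
    name: str


def validate(root):
    """Return a list of error messages (the real validator writes an error log)."""
    errors = []
    for i, node in enumerate(root.iter("Experiment")):
        for attr in ("identifier", "name"):
            if attr not in node.attrib:
                errors.append(f"Experiment #{i}: missing '{attr}'")
    return errors


def load(root):
    """Turn XML elements into objects (which an O/R mapper would then persist)."""
    return [Experiment(n.attrib["identifier"], n.attrib["name"])
            for n in root.iter("Experiment")]


def unload(experiments):
    """Serialize objects back to XML."""
    root = ET.Element("MAGE-ML")
    for e in experiments:
        ET.SubElement(root, "Experiment", identifier=e.identifier, name=e.name)
    return root


# Simplified, made-up document; real MAGE-ML is far more deeply nested.
DOC = '<MAGE-ML><Experiment identifier="EX:1" name="Example"/></MAGE-ML>'
root = ET.fromstring(DOC)
errors = validate(root)
experiments = load(root)
round_trip = ET.tostring(unload(experiments), encoding="unicode")
```

Loading the unloaded document again should give the same objects, which is the "do not lose information" property the repository is aiming for.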
AE schema • Why auto-generated? • AE must be able to import any valid MAGE-ML and not lose information • good for navigating through data in terms of object model • if some queries don’t work well, add something to the schema • Experiment-Biomaterial, Experiment-Protocol links • so far works for 400 GB of data
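The idea of generating the schema from the object model, then adding shortcut links by hand where queries need them, can be sketched as below. This is an illustration under stated assumptions: the two classes are simplified stand-ins for MAGE-OM classes, table_ddl and link_ddl are hypothetical helpers, and SQLite stands in for the Oracle database. It does not reproduce what Castor or the actual AE generator emit.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Experiment:          # simplified stand-in for a MAGE-OM class
    identifier: str
    name: str


@dataclass
class BioMaterial:         # simplified stand-in for a MAGE-OM class
    identifier: str
    name: str


SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


def table_ddl(cls):
    """Derive one table per class, one column per attribute."""
    columns = ["id INTEGER PRIMARY KEY"]
    columns += [f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls)]
    return f"CREATE TABLE {cls.__name__.lower()} ({', '.join(columns)})"


def link_ddl(left, right):
    """Hand-added shortcut link table between two generated tables."""
    l, r = left.__name__.lower(), right.__name__.lower()
    return (f"CREATE TABLE {l}_{r} ({l}_id INTEGER REFERENCES {l}(id), "
            f"{r}_id INTEGER REFERENCES {r}(id))")


conn = sqlite3.connect(":memory:")
for cls in (Experiment, BioMaterial):
    conn.execute(table_ddl(cls))                    # default, generated schema
conn.execute(link_ddl(Experiment, BioMaterial))     # refinement added afterwards
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

The generated part stays faithful to the model, so nothing in a valid document is dropped. The link table is the kind of later addition the slide mentions for queries that would otherwise need a long chain of joins.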
To ontologize or not to ontologize At the beginning: At the end:
Model vs. ontology • Model – stable; ontologies – flexible • Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard • Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
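A minimal sketch of this split, assuming a toy ontology held in a dictionary: classes and their associations are fixed in code (the model), while the attributes of a sample are category/value pairs checked against the ontology. The ONTOLOGY contents and the simplified class definitions are illustrative assumptions, not the MGED Ontology or the full MAGE-OM.

```python
from dataclasses import dataclass, field

# Hypothetical ontology: categories and their allowed values.
# It can grow without any change to the classes below.
ONTOLOGY = {
    "Organism": {"Homo sapiens", "Mus musculus"},
    "DiseaseState": {"normal", "diabetes"},
}


@dataclass
class OntologyEntry:
    category: str
    value: str

    def __post_init__(self):
        # Attribute names and values are validated against the ontology.
        if self.value not in ONTOLOGY.get(self.category, set()):
            raise ValueError(f"{self.category}={self.value!r} is not in the ontology")


@dataclass
class BioMaterial:
    name: str
    characteristics: list = field(default_factory=list)   # OntologyEntry items


@dataclass
class Experiment:
    name: str
    biomaterials: list = field(default_factory=list)      # fixed association


sample = BioMaterial("liver sample", [OntologyEntry("Organism", "Mus musculus")])
experiment = Experiment("Example experiment", [sample])

# Adding a new kind of attribute is an ontology change only.
ONTOLOGY["Sex"] = {"female", "male"}
sample.characteristics.append(OntologyEntry("Sex", "female"))
```

Adding the "Sex" attribute touched only the ontology. Adding a new association, for example between Experiment and a new class, would require changing the model and everything generated from it.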
>15 000 000 000 data points • Experiment1 • type • performer • …. • Hybridization data 1 • Experimental factors • Quantitation type definitions • … NetCDF
What BioMart gives to AEDW • Query language abstraction • Joins automatically generated • Schema optimized for performance • Clear database integration roadmap
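The first two points can be illustrated with a small query builder: the caller names attributes and filters, and the joins are derived from a table that records where each attribute lives. This is a sketch only. The table and column names imitate a main-table-plus-dimension-table layout but are made up for the example, build_query is a hypothetical helper rather than a BioMart API, and SQLite stands in for the warehouse database.

```python
import sqlite3

MAIN = "experiment__main"
KEY = "experiment_id_key"

# Which table holds which attribute (assumed layout for this example).
ATTRIBUTE_TABLE = {
    "accession": MAIN,
    "species": MAIN,
    "factor_name": "experiment__factor__dm",
}


def build_query(attributes, filters):
    """Build SQL from attribute and filter names; joins are added automatically."""
    needed = {ATTRIBUTE_TABLE[a] for a in list(attributes) + list(filters)} - {MAIN}
    select = ", ".join(f"{ATTRIBUTE_TABLE[a]}.{a}" for a in attributes)
    sql = f"SELECT {select} FROM {MAIN}"
    for table in sorted(needed):
        sql += f" JOIN {table} ON {table}.{KEY} = {MAIN}.{KEY}"
    params = []
    if filters:
        conditions = []
        for name, value in filters.items():
            conditions.append(f"{ATTRIBUTE_TABLE[name]}.{name} = ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment__main (experiment_id_key INTEGER PRIMARY KEY,
                               accession TEXT, species TEXT);
CREATE TABLE experiment__factor__dm (experiment_id_key INTEGER, factor_name TEXT);
INSERT INTO experiment__main VALUES (1, 'E-EXAMPLE-1', 'Homo sapiens'),
                                    (2, 'E-EXAMPLE-2', 'Mus musculus');
INSERT INTO experiment__factor__dm VALUES (1, 'disease state'), (2, 'time');
""")

sql, params = build_query(["accession"], {"factor_name": "time"})
rows = conn.execute(sql, params).fetchall()
```

The caller never writes a join. Asking for an attribute or filter that lives in a dimension table is enough for the builder to add the join on the shared key.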
Future plans • Data management environment automation • Flexible data warehouse interface • Programmatic interface (HTTP/XML based) • Distributed infrastructure??
Distributed data infrastructure [Figure labels: Users; query; ArrayExpress; deliver data; Query broker; find resource; a local database (×3)]
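One possible reading of the broker idea is sketched below: the broker keeps a registry of local databases, passes a query to each, and reports which resources hold matching data. The slides mark the distributed infrastructure as an open question, so every name here (LocalDatabase, QueryBroker, search, find) and the in-memory data are assumptions for illustration, not a description of an existing system.

```python
class LocalDatabase:
    """Hypothetical local resource holding experiment accessions by species."""

    def __init__(self, name, experiments):
        self.name = name
        self._experiments = experiments          # accession -> species

    def search(self, species):
        return sorted(acc for acc, sp in self._experiments.items() if sp == species)


class QueryBroker:
    """Forwards a query to every registered resource and collects the hits."""

    def __init__(self, resources):
        self.resources = list(resources)

    def find(self, species):
        hits = {}
        for db in self.resources:
            found = db.search(species)
            if found:                            # report only resources with data
                hits[db.name] = found
        return hits


broker = QueryBroker([
    LocalDatabase("site-a", {"A-1": "Homo sapiens", "A-2": "Mus musculus"}),
    LocalDatabase("site-b", {"B-1": "Mus musculus"}),
    LocalDatabase("site-c", {"C-1": "Danio rerio"}),
])
result = broker.find("Mus musculus")
```

In this sketch the broker only locates the data. Whether the data is then delivered by the local database or through ArrayExpress is a design decision the slide leaves open.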
Conclusions • Conceptual object modeling works well for complex life sciences domains • Many software infrastructure components can be auto-generated from object models • A range of approaches can be used for modeling, e.g., UML framework + ontologies • Repository and data warehouse – different aims and different implementation principles
Acknowledgements • MGED collaborators • Stanford, TIGR, Affymetrix, EMBL, …. • BioMart team • Gonzalo Garcia Lara - web interface • Ahmet Oezcimen - DBA • Anjan Sharma - curation tool • Sergio Contrino, Richard Coulson – data warehouse • Niran Abeygunawardena – webmaster • Mohammadreza Shojatalab – MIAMExpress • Misha Kapushesky – Expression Profiler • Curation team: • Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner • Domain-specific projects: • Susanna Sansone, Philippe Rocca-Serra • Alvis Brazma