DESPRAD subproject Alvis Brazma EMBL-EBI Hinxton, October 20, 2003
DESPRAD – Development and Establishment of Standards and Prototype Repository for Array Data
Participants • EBI • UMC Utrecht • University of Bergen • RZPD • Cambridge University • EMBL Heidelberg • University of Marseille (CIML) • University of Madrid (CMB)
Three major sets of WPs: • Developing standards and an international infrastructure for microarray data sharing (WP1 – WP4) • Establishing a public repository for microarray data – ArrayExpress (WP4 – WP9) • Research in gene expression data analysis and gene networks (WP9 – WP12)
ArrayExpress goals • Serving as an archival repository for microarray data supporting publications • Providing easy access to microarray data in a structured and standardised format for research community • Facilitating the sharing of microarray designs and protocols
ArrayExpress approach • To collect the information the user needs to interpret the data • To represent that information in a structured way, potentially allowing automated analysis and mining • To work towards a community agreement on representing microarray data in a standard way – founding of the MGED society
1. Standards • Founding the Microarray Gene Expression Data (MGED) society • Development of the standards • MIAME • MAGE • MGED ontology
Sharing microarray data – which data? (Figure, panels A–D; labels: array scans, quantitations, spots, genes, samples)
Gene expression matrix (genes × samples) with sample annotations and gene annotations • Annotations – problem 1 • Gene expression levels – problem 2
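The three parts named on this slide (expression levels, sample annotations, gene annotations) can be held together in one small structure. The following is only a sketch: the class name `ExpressionMatrix`, its field names, and all of the data are hypothetical and are not taken from ArrayExpress or MAGE.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionMatrix:
    """Genes x samples matrix plus the two kinds of annotation."""
    genes: list[str]
    samples: list[str]
    levels: list[list[float]]  # levels[i][j]: genes[i] measured in samples[j]
    gene_annotations: dict[str, dict[str, str]] = field(default_factory=dict)
    sample_annotations: dict[str, dict[str, str]] = field(default_factory=dict)

    def level(self, gene: str, sample: str) -> float:
        """Look up one expression level by gene and sample name."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]

    def samples_where(self, key: str, value: str) -> list[str]:
        """Return samples whose annotation `key` equals `value`."""
        return [s for s in self.samples
                if self.sample_annotations.get(s, {}).get(key) == value]


# Hypothetical toy data, for illustration only.
matrix = ExpressionMatrix(
    genes=["geneA", "geneB"],
    samples=["s1", "s2", "s3"],
    levels=[[0.1, 2.5, -1.0],
            [1.2, 0.0, 0.4]],
    gene_annotations={"geneA": {"description": "hypothetical kinase"}},
    sample_annotations={
        "s1": {"treatment": "control"},
        "s2": {"treatment": "heat shock"},
        "s3": {"treatment": "control"},
    },
)

controls = matrix.samples_where("treatment", "control")
print(controls, [matrix.level("geneA", s) for s in controls])
```

Without the sample annotations, the query for control samples could not be answered from the numbers alone, which is why annotation is listed as a problem in its own right.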
MGED Society • The Microarray Gene Expression Data Society is an international organisation for facilitating the sharing of functional genomics and proteomics array data • MGED 1, Hinxton, November 1999 • MGED 2, Heidelberg, May 2000 • MGED 3, Stanford University, April 2001 • MGED 4, Boston, February 2002 • MGED 5, Tokyo, September 2002 • MGED 6, Aix-en-Provence, September 2003 • MGED 7, Toronto, September 2004 • Board of directors – EBI, Stanford, UCB, TIGR, Affymetrix, Rosetta,…
Figure: a microarray experiment. Each sample yields an RNA extract, which is labelled (labelled nucleic acid) and hybridised to an array (array design); normalization and integration then produce the gene expression data matrix (genes) for the experiment. Each step is described by a protocol.
The first database model - developed in collaboration with DKFZ in 1999
MGED standards – MAGE-ML
The organisations and software supporting MAGE-ML include • Affymetrix • Agilent • Biodiscovery (Imagene5.5) • BASE (Open source project coordinated at Lund) • Iobion (Gene Traffic) • Manchester University (MAXDB) • Molmine (J-Express) • NCI • NIEHS • Rosetta Biosoftware (Rosetta Resolver) • RZPD • Sanger Institute LIMS (MIDAS) • Silicon Genetics (GeneNet) • Stanford University (SMD) • TIGR (MADAM) • UC at Berkeley • University of Pennsylvania (RAD) • UMC Utrecht
Data in ArrayExpress (Figure: number of hybridisations over time, 2002–2004; labelled data points: 6, ~100, ~250, 1172, ~3000)
ArrayExpress content (experiments), by experiment (+1 Drosophila experiment)
Expression Profiler (component interface) • 1 – SUBSELECT • 2 – CLUSTER
ArrayExpress web-page hits • 2002 – 49 245 • 2003 – 274 983 (by 12 September)
ArrayExpress components • Submissions: large-scale microarray facilities (MAGE-ML); smaller labs over the Internet through MIAMExpress, the online submission tool • Queries, analysis: Expression Profiler, the online analysis tool (www); export to local analysis tools (MAGE-ML)
MIAMExpress • Online since December 1, 2002 • 2002 – 15 951 hits • 2003 – 112 871 hits by 12 September • So far ~20 submissions completed through MIAMExpress, i.e., about 25% of all experiments in ArrayExpress • MIAMExpress is open source software, installed in at least 15 labs (including EMBL, RZPD, Leipzig, Leuven, Vancouver, VIB) • Tox-MIAMExpress – a specialised version for toxicology
ArrayExpress infrastructure • Submissions: MIAMExpress (MySQL) over the web; MIAMExpress local installations (Cambridge,…); MAGE-ML pipelines from local databases and LIMS (EMBL, TIGR) and from array manufacturers (Affymetrix, Agilent) • ArrayExpress: repository (Oracle) • Access: query interface (Tomcat) for queries over the web; MAGE-ML retrieval to local databases (RZPD, Stanford) and to desktop data analysis software; Expression Profiler over the web
ArrayExpress development • Submissions → curation → Repository (MAGE-OM model): simple queries (species, author, lab, array types, etc.) • Repository → curation → Warehouse (simple gene-centric model): more complex queries (genes, expression levels, etc.) • Database integration: Ensmart, hyperlinks to other databases, links back to the evidence
Figure: gene expression data matrix (genes × samples) holding gene expression levels, with sample annotations and gene annotations
ArrayExpress development • Submissions → curation → Repository (MAGE-OM model): simple queries (species, author, lab, array types, etc.) • Repository → curation → Warehouse (simple gene-centric model): more complex queries (genes, expression levels, etc.) • Warehouse → curation → Gene Expression Atlas: summarised information about which gene is expressed where • Database integration: Ensmart, hyperlinks to other databases, links back to the evidence
New in ArrayExpress • Password protected logins • Can be used to support anonymous refereeing of microarray papers • Discussions with Nature
Data growth in ArrayExpress (Figure: number of hybridisations by year, 2002–2004, axis 1000–4000, with a "?" at 4000)
Distributed data collection (Figure: small labs, national microarray centres, EMBL, Stanford, Sanger, TIGR and ArrayExpress)
Data analysis tools • Expression Profiler – complete redevelopment of the earlier tool – new interface, new functionality, XML-based modularity – beta version will be ready in month 24 • J-Express (developed in Bergen) – talk by Inge Jonassen
Research • Microarray-based gene network analysis – 2 publications out, 1 in print, 1 submitted • S. pombe gene expression data analysis (in collaboration with the Sanger Institute) – publication in preparation • New algorithms for clustering and cluster comparison – 2 publications in preparation
Transcription factor binding network • Chromatin IP experiments on a chip (ChIP-on-chip) • Using microarrays to find genomic (intragenic) sequences, a few hundred bp long, where a particular transcription factor is likely to bind • ChIP by Lee et al. (Science 2002) – binding site location data in the yeast genome for 107 transcription factors (of about 250 yeast transcription factors in total) • Identified around 4500 binding locations
Gene disruption network (Figure: genes A, B, C and D as nodes A–D; disruptions ΔA, ΔB, ΔC)
Data for over 200 gene disruptions in yeast (Hughes et al., Cell 102, 2000)
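One simple way to turn disruption data into a network is to draw an arc from a disrupted gene to every other gene whose expression changes noticeably in that mutant. The sketch below assumes log2 expression ratios and a cut-off of 1.0; both the cut-off and the function name `disruption_network` are illustrative assumptions, not the method used in the project, and the data are made up.

```python
def disruption_network(log_ratios, threshold=1.0):
    """Build arcs (disrupted gene -> affected gene).

    log_ratios[h][g] is the log2 expression ratio of gene g in the
    strain where gene h is disrupted, relative to wild type.
    The threshold is an assumed, illustrative cut-off.
    """
    arcs = set()
    for disrupted, profile in log_ratios.items():
        for gene, ratio in profile.items():
            # Skip the disrupted gene itself; keep strong changes only.
            if gene != disrupted and abs(ratio) >= threshold:
                arcs.add((disrupted, gene))
    return arcs


# Hypothetical toy data, for illustration only.
log_ratios = {
    "geneA": {"geneA": -4.0, "geneB": 1.6, "geneC": 0.2},
    "geneB": {"geneB": -3.5, "geneC": -1.1, "geneD": 0.3},
}

arcs = disruption_network(log_ratios)
print(sorted(arcs))
```

Arcs built this way mix direct and indirect effects: a gene may respond because it is several regulatory steps downstream of the disrupted one.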
Three networks in yeast • ChIP network (Lee et al) • Mutation network (Hughes et al) • In silico network – matching 38 experimentally known transcription factor binding sites (Pilpel et al) against yeast genome sequence
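The in silico network can be pictured as a string search: a transcription factor gets an arc to a gene when its binding site occurs in the sequence upstream of that gene. The sketch below uses exact matching on both strands; real binding-site models are usually degenerate, and the motif, sequences, and names (`TF_X`, `gene1`, `gene2`) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif):
    """Return (position, strand) for exact motif matches on either strand."""
    sequence, motif = sequence.upper(), motif.upper()
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:  # palindromic motifs are reported once, as "+"
            hits.append((i, "-"))
    return hits


def in_silico_arcs(upstream, motifs):
    """Arc (factor -> gene) when the factor's motif occurs upstream of the gene."""
    return {(tf, gene)
            for tf, motif in motifs.items()
            for gene, seq in upstream.items()
            if find_sites(seq, motif)}


# Hypothetical toy data, for illustration only.
upstream = {"gene1": "AATGACTCGGGAGTCATT", "gene2": "ACGTACGTACGT"}
motifs = {"TF_X": "TGACTC"}

sites = find_sites(upstream["gene1"], motifs["TF_X"])
arcs = in_silico_arcs(upstream, motifs)
print(sites, arcs)
```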
Intersection of the networks • Red – 39 arcs present in all networks • Green – arcs present in at least 2 networks and adjacent to one of SWI4, SWI6 or MBP1
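If each network is stored as a set of directed arcs (source, target), the red and green arc sets described on this slide reduce to set operations. The following sketch assumes that representation; the helper names and the target genes `G1`…`G5` and factors `TF1`…`TF3` are hypothetical toy data, not results from the three yeast networks.

```python
from collections import Counter

HUBS = {"SWI4", "SWI6", "MBP1"}


def arcs_in_all(networks):
    """Arcs present in every network."""
    return set.intersection(*networks)


def arcs_in_at_least(networks, k):
    """Arcs present in at least k of the networks."""
    counts = Counter(arc for net in networks for arc in net)
    return {arc for arc, n in counts.items() if n >= k}


def adjacent_to(arcs, nodes):
    """Arcs with at least one endpoint in `nodes`."""
    return {(s, t) for s, t in arcs if s in nodes or t in nodes}


# Hypothetical toy networks, for illustration only.
chip = {("SWI4", "G1"), ("MBP1", "G2"), ("TF1", "G3"), ("TF2", "G4")}
disruption = {("SWI4", "G1"), ("MBP1", "G2"), ("TF1", "G3")}
in_silico = {("SWI4", "G1"), ("TF1", "G3"), ("TF3", "G5")}
networks = [chip, disruption, in_silico]

red = arcs_in_all(networks)
green = adjacent_to(arcs_in_at_least(networks, 2), HUBS)
print(sorted(red), sorted(green))
```

Because each network is a set, an arc is counted at most once per network, so the counter value is the number of networks that contain it.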
How do the ChIP-chip and disruption networks relate? (Figure: within the set of all genes, the regulation set of a transcription factor t and the effectual set of a disrupted gene h)
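One way to compare the two sets is to count their shared genes. The sketch below reads the "regulation set of t" as the targets of t in the ChIP network and the "effectual set of h" as the genes affected when h is disrupted; that reading, the use of a Jaccard index as the overlap measure, and all names and data are assumptions made for illustration.

```python
def targets(arcs, source):
    """Genes reached by arcs leaving `source`."""
    return {t for s, t in arcs if s == source}


def overlap(chip_arcs, disruption_arcs, t, h):
    """Compare the regulation set of t with the effectual set of h.

    Returns (number of shared genes, Jaccard index of the two sets).
    """
    regulation_set = targets(chip_arcs, t)
    effectual_set = targets(disruption_arcs, h)
    shared = regulation_set & effectual_set
    union = regulation_set | effectual_set
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


# Hypothetical toy data, for illustration only.
chip_arcs = {("t", "g1"), ("t", "g2"), ("t", "g3")}
disruption_arcs = {("h", "g2"), ("h", "g3"), ("h", "g4")}

shared_count, jaccard = overlap(chip_arcs, disruption_arcs, "t", "h")
print(shared_count, jaccard)
```

Taking t and h to be the same gene asks whether the genes a factor binds are also the genes that respond when that factor is disrupted.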