MIAME and ArrayExpress - a standard for microarray data annotation and a database to store it


Presentation Transcript


  1. MIAME and ArrayExpress - a standard for microarray data annotation and a database to store it. Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute, Hinxton

  2. Three parts of my talk • Microarray data standards • Ontologies for gene expression data • ArrayExpress - a public database for microarray data • Analysis tools at the EBI

  3. The size of the datasets • Experiments: • ~100 000 different transcripts in human • ~320 cell types • 2000 compounds • 3 time points • 2 concentrations • 2 replicates • Data • 8 x 10^11 data points • 1 x 10^15 bytes = 1 petabyte for Affymetrix (data from Jerry Lanfear)
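
The data-point estimate on this slide is the product of the listed experimental factors. A minimal check in Python, assuming every transcript is measured in every combination of the other factors (the variable names are illustrative, not from the talk):

```python
# Experimental factors as listed on the slide.
factors = {
    "transcripts": 100_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

# Multiply all factors to get the total number of measurements.
data_points = 1
for count in factors.values():
    data_points *= count

print(f"{data_points:.2e} data points")  # 7.68e+11, i.e. roughly 8 x 10^11

# Assumption (not stated on the slide): the 1 petabyte figure is taken
# as 1e15 bytes, which gives the implied raw storage per data point.
bytes_per_point = 1e15 / data_points
print(f"about {bytes_per_point:.0f} bytes per data point")
```

Under that assumption the petabyte figure corresponds to roughly a kilobyte of raw data per data point, which is plausible if the raw scans are stored and not only the final expression values.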

  4. Microarray data • Microarrays are widely used in experiments and already producing massive amounts of data • These data have to be stored in a well organised and standard way, if they are to be accessed and analysed by the wide research community • There is a general consensus that there is a need for a public repository for microarray data • It is much less clear what exactly should be stored in such a repository

  5. A gene expression database from the data analyst’s point of view • Gene expression matrix (genes x samples) holding the gene expression levels • Sample annotations • Gene annotations

  6. Three parts of a gene expression database • Gene annotation – can be given by links to gene sequence databases and GO (function, process, cell compartment) – not perfect, but let's not worry about it • Sample annotation – we do not have any external databases for sample description (except species taxonomy) – problem 1 • Gene expression matrix – what are the measurement units for gene expression levels? – problem 2
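
The three parts named on this slide can be pictured as one small data structure. The sketch below is only an illustration: the class name, field names, gene and sample identifiers, and values are all made up, and it is not the ArrayExpress schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExpressionDataset:
    genes: List[str]            # row labels of the matrix
    samples: List[str]          # column labels of the matrix
    matrix: List[List[float]]   # matrix[i][j]: level of gene i in sample j
    gene_annotation: Dict[str, dict] = field(default_factory=dict)
    sample_annotation: Dict[str, dict] = field(default_factory=dict)

    def level(self, gene: str, sample: str) -> float:
        """Look up one expression level by gene and sample name."""
        return self.matrix[self.genes.index(gene)][self.samples.index(sample)]


# Hypothetical two-gene, two-sample dataset.
ds = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["sample1", "sample2"],
    matrix=[[1.5, 2.0], [0.1, 0.4]],
    gene_annotation={"geneA": {"database_link": "hypothetical accession"}},
    sample_annotation={"sample1": {"organism": "Mus musculus"}},
)

print(ds.level("geneB", "sample2"))
```

Note that nothing in the structure says what units the matrix values are in, which is exactly problem 2 on the slide.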

  7. Problem/consideration 1 – sample annotation • Gene expression data only have meaning in the context of detailed sample descriptions • If the data is going to be interpreted by independent parties, sample information has to be searchable and in the database • Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc) are needed for unambiguous sample description

  8. Sample annotation – what can be done? • Few CVs and ontologies for sample description are available (species taxonomy, model organisms) • Some use of free-text descriptions is unavoidable (curation workload) • Existing efforts to create such ontologies should be coordinated (MGED ontology working group) • Use existing ontologies and CVs wherever possible

  9. Problem 2 – the lack of gene expression measurement units • What we would like to have • gene expression levels expressed in some standard units (e.g. molecules per cell) • reliability measure associated with each value (e.g. standard deviation) • What have we got • each experiment using different units • no reliability information
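
As an illustration of the "reliability measure associated with each value" mentioned above, the sketch below summarises replicate measurements of one gene as a mean and a sample standard deviation. The function name and the numbers are made up, and the values are in arbitrary units, since no standard unit exists.

```python
import statistics


def summarise_replicates(values):
    """Return (mean, sample standard deviation) for replicate measurements."""
    if len(values) < 2:
        raise ValueError("need at least two replicates to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)


# Made-up replicate measurements for one gene in one sample.
replicates = [10.0, 12.0, 14.0]
mean, sd = summarise_replicates(replicates)
print(f"expression level: {mean} +/- {sd}")
```

A single measurement cannot yield such an estimate, which is one reason replicate measurements matter for comparing data between experiments.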

  10. Comparing expression data (figure; labels: "cm", "inc")

  11. Comparing expression data (figure; labels: "?", "?")

  12. Comparing expression data (figure)

  13. What to do in the absence of standard measurement units? • Record raw, intermediate and final analysis data together with the detailed annotation of how the analysis has been performed • This effectively passes on the responsibility about interpreting the final analysis data to the user

  14. Three levels of microarray data processing • Raw data: array scans • Quantitation matrices: spot quantitations (spots x quantitations) • Gene expression data: gene expression levels (genes x samples)
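
One way to picture recording all three processing levels together with a note on how each was obtained is sketched below. The class, the level names, the file names and the protocol texts are hypothetical and are not part of MIAME or ArrayExpress.

```python
from dataclasses import dataclass

LEVELS = ["raw", "quantitation", "expression"]


@dataclass
class ProcessingStep:
    level: str      # one of LEVELS
    data_ref: str   # hypothetical file name or identifier
    protocol: str   # free-text description of how the data were obtained


def build_record(steps):
    """Check that all three levels are present and return them in processing order."""
    by_level = {step.level: step for step in steps}
    missing = [level for level in LEVELS if level not in by_level]
    if missing:
        raise ValueError(f"missing processing levels: {missing}")
    return [by_level[level] for level in LEVELS]


steps = [
    ProcessingStep("expression", "levels.txt", "hypothetical normalisation procedure"),
    ProcessingStep("raw", "scan.tiff", "hypothetical scanner settings"),
    ProcessingStep("quantitation", "spots.txt", "hypothetical image analysis settings"),
]
record = build_record(steps)
print([step.level for step in record])
```

Rejecting an incomplete record is a design choice for the sketch: it makes it impossible to store final values without the data and protocol they were derived from.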

  15. Measurement units • In the long term: • standard controls for experiments (on chips and in the samples) should be introduced • replicate measurements will become the norm • Temporary solution: • storing intermediate analysis results (including the images) and annotations of how they were obtained • Standards within experiments themselves (standard controls and protocols)

  16. Standards for microarray data • Standards are needed to build a well organised microarray database • Standards for annotation • Standards for data exchange • Standards for controls in the experiment and data normalisation • www.dnachip.org/mged/normalization.html

  17. How to create microarray data standards • Understand thoroughly what minimum information about a microarray experiment is needed to interpret it unambiguously, and how this information is structured (objects and relationships) • Create a technical data format able to capture this information • Find appropriate controlled vocabularies

  18. Standardisation of microarray data and annotations – MGED group • The goal of the group is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalisation methods. • Includes most of the world's largest microarray laboratories and companies (TIGR, Affymetrix, Stanford, Sanger, Agilent, etc.) • www.mged.org

  19. MGED • MGED 2 meeting in Heidelberg in 2000, MGED 3 in Stanford in 2001, both ~ 300 participants • Minimum Information About a Microarray Experiment – MIAME version 1.0 posted • Collaboration with OMG on data formats: MAML + GEML = MAGE-ML and MAGE-OM • MGED 4 meeting in February 2002, Boston • MGED will become an ISCB Special Interest Group

  20. MIAME – Minimum Information About a Microarray Experiment • 6 parts of a microarray experiment: Experiment, Array, Sample, Hybridisation, Data, Normalisation • External links: Publication, Source (e.g., Taxonomy), Gene (e.g., EMBL) • www.mged.org

  21. MIAME Section on Sample Source and Treatment • sample source and treatment ID as used in section 1 • organism (NCBI taxonomy) • additional "qualifier, value, source" list; the list includes: • cell source - provider • type (if derived from primary sources (s)) • sex • age • growth conditions • development stage • organism part (tissue) • animal/plant strain or line • genetic variation (e.g., gene knockout, transgenic variation) • individual • individual genetic characteristics (e.g., disease alleles, polymorphisms) • disease state or normal • target cell type • cell line and source (if applicable) • in vivo treatments (organism or individual treatments) • in vitro treatments (cell culture conditions) • treatment type (e.g., small molecule, heat shock, cold shock, food deprivation) • compound • is additional clinical information available (link) • separation technique (e.g., none, trimming, microdissection, FACS) • laboratory protocol for sample treatment……

  22. What is an ontology? • An ontology is a specification of concepts that includes the relationships between those concepts. • Provides semantics and constraints • Allows for computational inferences and reliable comparisons

  23. MGED Biomaterial Ontology • Under construction by Chris Stoeckert • Using OilEd (may use others) • Motivated by MIAME and coordinated with the database model • Extend classes, provide constraints, define terms, provide terms to use, develop CVs for submissions (EBI)

  24. Use case scenario

  25. Ontology Example • Concept=Age def=in standard units referenced to an identifiable time point from (class) developmental stage • Age=6 {units=days}, • {dev_stage}=dauer • Hierarchy=Dev_stage->larva->dauer
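
The example on this slide can be sketched as a small term hierarchy plus one annotated sample. This is a toy illustration of the idea, not the MGED ontology itself: the dictionary layout and function names are made up, and only the terms shown on the slide are used.

```python
# Child -> parent links for the hierarchy Dev_stage -> larva -> dauer.
PARENT = {"larva": "Dev_stage", "dauer": "larva"}


def ancestors(term):
    """Return the chain of parents of a term, nearest first."""
    path = []
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path


def is_a(term, ancestor):
    """True if the term equals the ancestor or sits below it in the hierarchy."""
    return term == ancestor or ancestor in ancestors(term)


# Age is given in standard units and referenced to a developmental stage.
sample = {"Age": {"value": 6, "units": "days"}, "dev_stage": "dauer"}

print(ancestors(sample["dev_stage"]))      # ['larva', 'Dev_stage']
print(is_a(sample["dev_stage"], "larva"))  # True
```

The hierarchy is what allows the kind of inference an ontology provides: a query for larval samples can also return a sample annotated only as "dauer".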

  26. Excerpts from a Sample Description, courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences • Organism: Mus musculus [ NCBI taxonomy browser ] • Cell source: in-house bred mice (contact: person@somewhere.ac.uk) • Sex: female [ MGED ] • Age: 3 - 4 weeks after birth [ MGED ] • Growth conditions: normal • controlled environment • 20 - 22 °C average temperature • housed in cages according to EU legislation • specified pathogen free conditions (SPF) • 14 hours light cycle • 10 hours dark cycle • Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ] • Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ] • Strain or line: C57BL/6 [ International Committee on Standardized Genetic Nomenclature for Mice ] • Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [ International Committee on Standardized Genetic Nomenclature for Mice ] • Treatment: in vivo [ MGED ] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse • Compound: drug [ MGED ] synthetic glucocorticoid Dexamethasone, dissolved in PBS
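
A few entries from this sample description, written as the "qualifier, value, source" triplets that the MIAME sample section asks for. The class and helper names are illustrative assumptions; the values are copied from the slide.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    qualifier: str
    value: str
    source: str   # controlled vocabulary or ontology; empty for free text


sample = [
    Annotation("Organism", "Mus musculus", "NCBI taxonomy browser"),
    Annotation("Sex", "female", "MGED"),
    Annotation("Organism part", "thymus", 'GXD "Mouse Anatomical Dictionary"'),
    Annotation("Growth conditions", "normal", ""),  # no source given on the slide
]


def free_text(annotations):
    """Return the qualifiers whose values do not come from a named source."""
    return [a.qualifier for a in annotations if not a.source]


print(free_text(sample))  # ['Growth conditions']
```

Listing the entries without a source shows where free text is still being used and where a curator would have to step in.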

  27. ArrayExpress conceptual model • Experiment, Array, Sample, Hybridisation, Data, Normalisation • External links: Publication, Source (e.g., Taxonomy), Gene (e.g., EMBL)

  28. ArrayExpress object model

  29. ArrayExpress – the state of the art • ArrayExpress Object model supporting MIAME requirements developed • Data model implemented in Oracle • Data loader from MAML file format • Expression Profiler – data analysis tool already available

  30. ArrayExpress – plans and schedule • EU grant – new staff being recruited • A web based query interface - under development • A web based submission tool – under test • Participation in OMG – MAGE-OM & MAGE-ML • MAGE-ML will replace MAML in October • Full scale database operation expected to start at the beginning of 2002 • Expression Profiler to link to ArrayExpress

  31. Microarray data analysis • Expression Profiler – a web based gene expression data analysis tool: www.ebi.ac.uk/microarray/

  32. Expression Profiler - web based tool for microarray data analysis http://www.ebi.ac.uk/microarray/ • Expression data • EPCLUST: cluster expression profiles • GENOMES: sequence, function, annotation • URLMAP: provide links • External data, tools (pathways, function, etc.) • SPEXS (Sequence Pattern Exhaustive Search): novel patterns • PATMATCH: known patterns

  33. Conclusions • Microarray standardisation is a challenge and an imperative • Join MGED to contribute to this process: www.mged.org • Participate in the development of ontologies and controlled vocabularies • Send me your protocols • Make your data available • Give feedback on MIAME; it is up for discussion

  34. Acknowledgments • Microarray Informatics Team, EBI: Alvis Brazma, Katja Kivinen, Helen Parkinson, Olga Perez, Johan Rung, Ugis Sarkans, Thomas Schlitt, Mohammad Shojatalab, Lev Soinov, Koichi Tazaki, Jaak Vilo • Industry Support team, EBI: Alan Robinson • MGED steering committee • MIAME working group • Chris Stoeckert, U. Penn. and MGED

  35. Useful URLs • www.mged.org • www.tigr.org • www.ebi.ac.uk/array • www.geneontology.org • www.hgmp.mrc.ac.uk • www.dnachip.org/mged/normalization.html • parkinson@ebi.ac.uk
