
MIAME/Env: Enhancing Microarray Data Annotation for Environmental Researchers

This article highlights the importance of adhering to MIAME standards in microarray data annotation and introduces MIAME/Env, an initiative to extend the standards for environmental genomic data. Researchers are encouraged to use the software maxdLoad2 for MIAME/Env annotation and submit their data to public repositories like ArrayExpress to get accession numbers for publication. Compliance with MIAME standards facilitates data sharing and ensures the integrity and reproducibility of experiments.

anauman

Presentation Transcript


  1. Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
      Annotation and Analysis of Microarray Data: A primer for NERC researchers

  2. Data and the NERC
      • Data is an asset
      • Data may have unforeseen uses
      • Analysis loses information
      • Bulk analysis and data mining need “uniform” data
      • Data stored without adequate annotation is useless
      • Data rescue is expensive and unreliable

  3. Metadata and Microarrays
      • Sequence data is static
      • Post-genomic data is very state-dependent
      • Transcriptome = no. of cell types × no. of environmental conditions
      • Annotation matters
      • Data comparisons matter
      • We need to take lessons from the gene naming debacle, where a single gene goes by many names:
        • Protein-tyrosine phosphatase, non-receptor type 6; Protein-tyrosine phosphatase 1C; PTP-1C; Hematopoietic cell protein-tyrosine phosphatase; SH-PTP1; Protein-tyrosine phosphatase SHP-1
        • LARD; death receptor 3 beta; WSL-1R protein; lymphocyte associated receptor of death; death receptor 3
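
To make the naming problem concrete, here is a minimal Python sketch (Python rather than the R used later in the slides) of resolving synonyms to one canonical name before comparing data sets. The `SYNONYMS` table and `canonical_name` helper are hypothetical, and the choice of which name counts as canonical is an arbitrary assumption made for this example.

```python
# Hypothetical synonym table built from the names on the slide. Which name
# is treated as canonical is an arbitrary choice made for this example.
SYNONYMS = {
    "Protein-tyrosine phosphatase, non-receptor type 6": [
        "Protein-tyrosine phosphatase 1C",
        "PTP-1C",
        "Hematopoietic cell protein-tyrosine phosphatase",
        "SH-PTP1",
        "Protein-tyrosine phosphatase SHP-1",
    ],
    "death receptor 3": [
        "LARD",
        "death receptor 3 beta",
        "WSL-1R protein",
        "lymphocyte associated receptor of death",
    ],
}

# Reverse index: every known name (lower-cased) points at its canonical name.
CANONICAL = {}
for canonical, aliases in SYNONYMS.items():
    for name in [canonical, *aliases]:
        CANONICAL[name.lower()] = canonical


def canonical_name(name):
    """Return the canonical name, or None if the name is not in the table."""
    return CANONICAL.get(name.strip().lower())


print(canonical_name("PTP-1C"))
print(canonical_name("LARD"))
```

A controlled vocabulary plays the same role at annotation time: the name is fixed once, when the data is described, rather than reconciled afterwards.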

  4. Metadata standards and data repositories
      • A repository needs to keep all relevant metadata associated with a data set
      • To be easily submitted, and to be searchable, data must adhere to standards, both in content and format
      • Thus, we have to decide:
        • What should be captured and how?
        • What format should data be in for submission?

  5. What is MIAME?
      • MIAME is the internationally adopted standard for the Minimal Information About a Microarray Experiment.
      • The result of an effort driven by MGED (www.mged.org) to codify the description of a microarray experiment.
      • MIAME aims to define the core that is common to most experiments.
      • Ultimately, it tries to specify the collection of information that would be needed to allow somebody to completely reproduce an experiment that was performed elsewhere.

  6. The Six Parts of MIAME
      • Experimental design: the set of hybridization experiments as a whole
      • Array design: each array used and each element (spot, feature) on the array
      • Samples: samples used, extract preparation and labeling
      • Hybridizations: procedures and parameters
      • Measurements: images, quantification and specifications
      • Normalization controls: types, values and specifications
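
As a rough illustration of how the six parts can serve as a completeness checklist, the Python sketch below reports which sections of an annotation record are absent or empty. The section keys, the `missing_miame_sections` helper and the example record are all hypothetical; they are not part of MIAME, maxdLoad2 or ArrayExpress.

```python
# Hypothetical section keys, one per MIAME part listed above.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)


def missing_miame_sections(annotation):
    """Return the MIAME sections that are absent or empty in a record."""
    return [s for s in MIAME_SECTIONS if not annotation.get(s)]


# Made-up record for illustration only.
record = {
    "experimental_design": "time course, 3 replicates",
    "array_design": "custom cDNA array",
    "samples": "",  # present but empty, so reported as missing
    "hybridizations": "overnight hybridization",
}
print(missing_miame_sections(record))
```

A real submission tool checks far more than the presence of six fields, but the principle is the same: missing metadata is caught before submission rather than after.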

  7. MIAME definitions
      • Available from www.mged.org
      • All details mentioned in MIAME should be captured
      • Latest draft: Version 1.1 (Draft 5, March 5, 2002)
      • See also: A. Brazma et al., Nature Genetics, vol. 29 (December 2001), pp. 365–371

  8. But…
      • Environmental genomics is a diverse, heterogeneous discipline, often involving multi-factorial experiments that can have an almost infinite number of experimental parameters.
      • Describing this sort of data is hard.
      • MIAME does not have the required vocabulary.
      • However, NERC has made a commitment to making MIAME compliance a de facto standard within its Science Programmes.
      • NERC has invested in reconciling these…

  9. MIAME/Env
      • MIAME/Env is an initiative spearheaded by the EGTDC to extend MIAME standards for annotation of environmental genomic data
      • Includes the development of controlled vocabularies / ontologies to describe environmental genomic experiments
      • MIAME/Env is developed with the support of the MGED society and in collaboration with MIAME/Tox and members of the EBI

  10. Microarray Annotation for Environmental Researchers
      • Use the standard
        • The MIAME/Env model is developed in communication with EG-funded researchers to ensure that environmental genomics experiments and data can be adequately described to MIAME standards
      • Use the software
        • maxdLoad2 is software developed by EGTDC partners that facilitates:
          • MIAME/Env annotation
          • Export in an appropriate format for submission to ArrayExpress

  11. Do I have to?
      Simple answer: YES!!!
      More specifically:
      • You need to adhere to metadata standards to submit to a public repository
      • You need to submit to a public repository (e.g. ArrayExpress) to get an accession number for your data
      • You need to have an accession number for your data in order to publish on it in major journals
      The final word:
      • NERC requires grant holders to comply with MIAME standards for microarray data

  12. Benefits of using a data repository
      End users/Researchers:
      • Facilitates data sharing
      • Catalogued / Backed-up
      • Pervasive advertisement for your work
      Bioinformaticians/Developers:
      • Access to data for analysis and algorithm development
      • Improves search capabilities
      • Encourages development of more capable software for annotation, analysis and submission

  13. Bio-Linux: The EGTDC distribution system for bioinformatics solutions
      • Key bioinformatics software and documentation in a Linux environment
      • Aim: to maximise the benefits of a pre-installed analysis system.
      • provision of key software
      • tools for automation of analysis and other customisations
      • computing power
      • ensure that what is provided can be reasonably maintained and supported

  14. Software on Bio-Linux
      Includes programs for:
      • Sequence analysis
      • Similarity searching
      • Sequence alignment
      • Phylogenetics
      • Genome annotation and analysis
      • ESTs
      • Transcriptomics

  15. Bio-Linux Transcriptomics
      • Databases: maxdLoad2, GeNet access
      • Analysis: maxdView, GeneSpring, R/BioConductor
      • MIAME/Env annotation and MAGE-ML export: maxdLoad2

  16. [Workflow diagram]
      • Stages shown: Raw Data, Expression measures (not normalised), Quality Control, Normalisation, Analysis, Presentation
      • Tools shown: R/BioConductor, proprietary software (e.g. Affymetrix), GeneSpring, maxdView, other analysis programs
      • Annotation and storage: maxdLoad2 (MIAME/Env Annotation), GeNet, ArrayExpress

  17. [Workflow diagram, as on slide 16, with the added label: Bio-Linux]

  18. Transcriptomics Databases: Tools on Bio-Linux
      • maxdLoad2
      • GeNet access

  19. maxdLoad2 Navigator: top-level user interface

  20. GeNet: via GeneSpring, via web interface

  21. GeNet and maxdLoad2
      Both are databases designed to handle transcriptomic data. Differences:
      GeNet
      • Centralised repository
      • Geared towards use as an analysis and sharing tool as well as a storage area
      • Partial MIAME compliance is possible, but not the default
      • Great for sharing data and analyses
      maxdLoad2
      • Local repository
      • More like a LIMS for transcriptomic data
      • Geared towards MIAME-compliant annotation, storage and export to public databases

  22. Transcriptomic Analysis: Tools on Bio-Linux
      • maxdView
      • GeneSpring
      • R/BioConductor

  23. Which software should I use??
      • Commercial vs. open source: GeneSpring (commercial) vs. maxdView and R/BioConductor (open source)
      • Ease of use: GeneSpring > maxdView > R/BioConductor
      • Fine-tuned control: R/BioConductor > maxdView > GeneSpring

  24. Why use just one??
      • E.g. fine-tuned control (R/BioConductor) + ease of use (GeneSpring)
      • Pre-analysis choices (R/BioConductor) + easy but fine-tuned manipulation (maxdView)
      • Alternatively: maxdView + GeneSpring
      • All of them…

  25. GeneSpring
      Benefits:
      • Graphical interface
      • Choices of views
      • Venn diagram visualisations
      • Intuitive interface for filtering
      • Extensive documentation
      • Context-dependent help

  26. maxdView
      Benefits:
      • Graphical interface
      • Quality control options
      • Many analyses possible via menus or “calculator”
      • Strong filtering capabilities
      • Context-dependent help

  27. R/BioConductor
      Command line package
      Benefits:
      • flexible
      • many, many functions to choose from
      • take advantage of the full functionality of the R stats package
      • high degree of control
      • great plotting facilities
      • promotes thinking about data
      • lots of documentation and help available
      • automation possibilities
      • some graphical facilities available

  28. Documentation and Tutorials

  29. Overview of Microarray Analysis Steps
      • Step 1: Load Data (text, GPR file, etc.)
      • Step 2: Quality Control
      • Step 3: Apply Filters
      • Step 4: Normalise
      • Step 5: Analyse

  30. Raw Data translation into Expression measures (not normalised)
      • The raw microarray data scanned from images needs to be translated into some measurement of expression.
      • The measurement used depends on the technology – e.g. relative measures (cDNA chips), or absolute measures (e.g. GeneChip).
      • The measurements calculated depend on the algorithm used (e.g. MAS 5.0 vs. RMA for GeneChips).
      • Background correction happens at this point.
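
For the relative-measure case (two-channel cDNA chips), a minimal Python sketch of turning raw spot intensities into an expression measure might look like the following. It assumes simple subtraction of a local background estimate followed by a log2 ratio; this is one common convention, not the MAS 5.0 or RMA algorithms named above, and the function name, parameters and numbers are hypothetical.

```python
import math


def log2_ratio(fg_red, bg_red, fg_green, bg_green, floor=1.0):
    """Background-subtract both channels, then return log2(red / green).

    Corrected intensities are clamped to `floor` so the logarithm is defined
    even when the background estimate exceeds the foreground.
    """
    red = max(fg_red - bg_red, floor)
    green = max(fg_green - bg_green, floor)
    return math.log2(red / green)


# Made-up intensities: (900 - 100) / (300 - 100) = 4, and log2(4) = 2.
print(log2_ratio(900, 100, 300, 100))
```

The clamping step is a design choice with consequences: other background-correction strategies treat low-intensity spots differently, which is one reason the slide stresses that the measure depends on the algorithm used.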

  31. Import

  32. Export

  33. Quality Control
      Very important! Generating high quality microarray data requires rigorous quality control measures at each individual step of the process:
      • experimental design of the study
      • the generation of samples
      • extraction of RNA
      • labeling of the probe
      • microarray hybridization
      • analysis
      Systematic, reproducible errors can be minimized by applying various normalisations… BUT: you should not try to rescue low quality hybridizations with mathematical techniques!

  34. Quality Control
      Do the arrays look alright? Look at the actual image scans – are there quality issues to be addressed on any of the chips?

  35. Quality Control
      Does the data have the distribution you expect? The common array analysis functions assume that most genes will not change in expression level and that your data is lognormal.
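
One quick, informal check of the lognormal assumption is to compare the skewness of the intensities before and after a log transform: roughly lognormal data is strongly right-skewed on the raw scale and close to symmetric after taking logs. The Python sketch below uses a moment-based skewness estimate and made-up numbers; the `skewness` helper is hypothetical and is not a function of any of the tools discussed here.

```python
import math


def skewness(values):
    """Moment-based sample skewness (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 4, 8, 16]  # made-up intensities
logged = [math.log2(v) for v in raw]
print(skewness(raw), skewness(logged))
```

A histogram or density plot gives the same information visually and is usually the better first look; the number is mainly useful when screening many arrays at once.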

  36. Quality Control Figure and text from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/Method_qc2.html

  37. Quality Control

  38. Quality Control
      Does the data have the distribution you expect? This plot is the result of running the Benford Analyser on data (pre-normalisation) in maxdView.
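
The slides do not describe how the Benford Analyser works internally. Assuming it compares the leading-digit frequencies of the data with those predicted by Benford's law, P(d) = log10(1 + 1/d), the comparison can be sketched in Python as follows. All function names and the example values are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit frequencies under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First non-zero digit of a non-zero number."""
    for ch in repr(abs(float(x))):
        if ch.isdigit() and ch != "0":
            return int(ch)
    raise ValueError("value has no leading digit")


def observed_frequencies(values):
    """Observed leading-digit frequencies, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


values = [1, 12, 130, 2, 25, 3, 0.4, 7, 19, 1.5]  # made-up intensities
observed = observed_frequencies(values)
for d, p in benford_expected().items():
    print(d, round(p, 3), observed[d])
```

Large departures from the expected frequencies are a prompt to look more closely at the data, not proof of a problem on their own.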

  39. Quality Control
      Fit your data and take a look at the reconstructed image surface using R/BioConductor:
      > library(affyPLM)
      > pset <- fitPLM(myData)
      > image(pset)

  40. Quality Control
      Check out the density curves of the PM data using R/BioConductor:
      > hist(myData, col = pops2, type = "l")

  41. Normalisation
      General advice:
      • Apply normalisations that make sense for your data
      • Use plotting facilities to view your data before and after normalisation to check the result

  42. Normalisation: GeneSpring, maxdView

  43. Normalisation: R/BioConductor
      > pops2 <- pData(myData)[,2]
      > boxplot(myData, col = pops2 + 1)   # pre-normalisation
      > eset <- expresso(myData, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method = "medianpolish")
      > boxplot(eset, col = pops2 + 1)     # post-normalisation
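
To show what the quantile method selected on the slide does conceptually, here is a plain-Python sketch of quantile normalisation: every array is forced onto the same reference distribution, taken as the mean of the k-th smallest values across arrays. This is a simplified illustration, not the BioConductor implementation; in particular, ties are broken by position rather than averaged, and the function name is hypothetical.

```python
def quantile_normalise(arrays):
    """Quantile-normalise equal-length lists of intensities (one per array).

    Ties are broken by original position rather than averaged.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalised = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalised.append(out)
    return normalised


# Two made-up arrays of three probes each.
print(quantile_normalise([[5, 2, 3], [4, 1, 6]]))
```

After normalisation every array contains exactly the same set of values, only in a different order, which is why the post-normalisation boxplots line up.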

  44. Filters
      • A Filter is a rule applied to each Spot
      • Spots which do not pass through the filter are ignored in downstream steps
      • Filters are useful for reducing the complexity of analyses or visualisations by discarding uninteresting Spots. They can also be used to locate Spots which match particular criteria.
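
The idea of a filter as a reusable rule applied to each Spot can be sketched in Python as below. The spot fields (`intensity`, `error`), the thresholds and the `make_filter` helper are hypothetical and only stand in for whatever criteria a given tool offers.

```python
def make_filter(min_intensity, max_error):
    """Build a rule that a spot passes when it is bright and precise enough."""
    def passes(spot):
        return spot["intensity"] >= min_intensity and spot["error"] <= max_error
    return passes


# Made-up spots for illustration only.
spots = [
    {"name": "spot_A", "intensity": 1200.0, "error": 0.05},
    {"name": "spot_B", "intensity": 80.0, "error": 0.02},
    {"name": "spot_C", "intensity": 950.0, "error": 0.40},
]

keep = make_filter(min_intensity=100.0, max_error=0.10)
kept = [s["name"] for s in spots if keep(s)]
print(kept)
```

Because the rule is an ordinary function, it can be stored and applied again to another data set, and the same rule can be used either to discard spots or to locate the ones that match.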

  45. GeneSpring Filter on Error

  46. maxdView MultiFilter

  47. R/BioConductor
      > library(genefilter)
      You have to define your filter and then apply it. Filters can be saved and used again.

  48. Statistics and clustering
      Most statistical tests have underlying assumptions – know what these are and whether they are valid for your data!
      GeneSpring, maxdView and R/BioConductor all provide facilities to run various statistical analyses and clustering algorithms. R provides the most extensive choice.
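
As an example of how assumptions are built into a test, the Python sketch below computes Welch's t statistic for one gene measured in two groups. Welch's variant is chosen here because it does not assume equal variances in the two groups; it still assumes approximately normal data, which is one reason for working on log-transformed values. The slides do not say which variant each tool uses, and the `welch_t` helper and the data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up log-expression values for one gene in two conditions.
t, df = welch_t([1, 2, 3], [2, 4, 6])
print(t, df)
```

Each group needs at least two replicates, and the statistic is undefined when both groups have zero variance. Running this across thousands of genes also raises multiple-testing issues that a single t value does not capture.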

  49. GeneSpring

  50. maxdView TTest
