
MIAME and Ontologies for Sample Description

Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute. EMBO Course, October 2001.


Presentation Transcript


  1. MIAME and Ontologies for Sample Description • Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute • EMBO Course, October 2001

  2. Talk Structure • Standards • Ontologies for gene expression data • ArrayExpress - a public database for microarray data and integration of ontologies • Submission and annotation tool • Practical this afternoon

  3. Problems of microarray data analysis • Size of the datasets • Different platforms: nylon, glass • Different technologies on platforms: oligo/spotted • Referencing external databases which are not stable • Sample annotation • Array annotation • Need for LIMS systems and the need for bioinformaticians

  4. Problems with standards • Standards are not sexy, sad but true • Must be useful, consensual, easy to understand and implement • Need an incentive (journals) • “Most scientists are anarchists, including me” Alvis Brazma

  5. General MIAME principles • Recorded info should be sufficient to interpret and replicate the experiment • Information should be structured so that querying and automated data analysis and mining are feasible

  6. Definition of an ontology • An ontology is an explicit specification of some topic. • It is a formal and declarative representation which includes the vocabulary (or names) for referring to the terms in that subject area and the logical statements that describe what the terms are, how they are related to each other, and how they can or cannot be related to each other. • Ontologies provide a vocabulary for representing and communicating knowledge about some topic and a set of relationships that hold among the terms in that vocabulary. MGED

  7. Ontologies vs. controlled vocabularies • A controlled vocabulary (cv) is a restricted set of terms used to describe something; in the simplest case it could be a list • An ontology describes the relationships between the terms in a structured way
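
The distinction on this slide can be sketched in a few lines of Python. This is only an illustration: the terms, the `part_of` relationship and the function names are assumptions made up for the example and are not taken from the MGED ontology.

```python
# Controlled vocabulary: a flat set of permitted terms (illustrative values only).
ORGANISM_PART_CV = {"thymus", "liver", "spleen"}

# Ontology: the same terms plus typed relationships between them.
# Each term maps to a list of (relationship, target term) pairs (hypothetical).
ONTOLOGY = {
    "thymus": [("part_of", "immune system")],
    "spleen": [("part_of", "immune system")],
    "liver": [("part_of", "digestive system")],
    "immune system": [("part_of", "organism")],
    "digestive system": [("part_of", "organism")],
}


def is_valid_term(term):
    """A cv can only answer: is this term allowed?"""
    return term in ORGANISM_PART_CV


def ancestors(term, relationship="part_of"):
    """An ontology can also answer: what is this term related to?"""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        for rel, target in ONTOLOGY.get(current, []):
            if rel == relationship and target not in found:
                found.append(target)
                stack.append(target)
    return found
```

With the cv alone, a query for "immune system" samples cannot find thymus samples; with the relationships, `ancestors("thymus")` reaches "immune system", so such a query becomes possible.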

  8. Why we need standards: a real example • Free text entries in databases cause problems • A good ontology: NCBI taxonomy; about 10 people maintain it and it still has problems • Example: many genes and proteins can have the same name, even in well-annotated organisms • Many important projects have no coordination of standards • Whose responsibility is this?

  9. SRS, EMBL gene=“ssp1” S.pombe • AL441624 S.pombe chromosome I cosmid c110. • AL136235 S.pombe chromosome I cosmid c664. • AL159180 S.pombe chromosome I P1 p14E8. • X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1) • AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76. • AL049609 S.pombe chromosome III cosmid c297.

  10. Genbank search “S.pombe, ssp1” • 1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76 • 2: AL441624 S.pombe chromosome I cosmid c110 • 3: AL159180 S.pombe chromosome I P1 p14E8 • 4: AL049609 S.pombe chromosome III cosmid c297 • 5: AL136235 S.pombe chromosome I cosmid c664 • 6: D45882 Yeast ssp1 gene for protein kinase, complete cds • 7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)

  11. Gene synonyms • Problem: a name can identify many genes, even in a well-annotated organism like S.pombe • Ssp1 = SPAC664.11, SPAC110.04c, SPCC297.03

  12. Possible Solutions • Build your own gene synonym table and maintain it • Or find some other stable identifier than gene name (CDS number?) • Have a nomenclature body which assigns names and resolves disputes (HUGO) • Use GO terms and their IDs (already done for S.pombe)
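
The first option, a locally maintained synonym table, might look roughly like the sketch below. The three identifiers are the ones given for Ssp1 on the gene synonyms slide; the table layout and the function names are assumptions for illustration.

```python
from collections import defaultdict

# (gene name, systematic identifier) rows; identifiers from the S.pombe example.
SYNONYM_ROWS = [
    ("ssp1", "SPAC664.11"),
    ("ssp1", "SPAC110.04c"),
    ("ssp1", "SPCC297.03"),
]


def build_synonym_table(rows):
    """Index identifiers by lower-cased gene name."""
    table = defaultdict(set)
    for name, identifier in rows:
        table[name.lower()].add(identifier)
    return table


def resolve(table, name):
    """Return the sorted identifiers for a name; more than one means ambiguous."""
    return sorted(table.get(name.lower(), set()))


def is_ambiguous(table, name):
    return len(resolve(table, name)) > 1
```

Here `resolve(build_synonym_table(SYNONYM_ROWS), "Ssp1")` returns all three identifiers, which is exactly the ambiguity the slide describes; the table has to be kept up to date by hand, which is the cost of this option.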

  13. A gene expression database from the data analyst’s point of view • [Figure: gene expression matrix of gene expression levels, with axes Genes and Samples, linked to gene annotations and sample annotations]

  14. Gene Annotation • Can be given by links to gene sequence databases, and GO can be used on the analysis side (function, process, cell compartment) • MIAME is flexible and allows many kinds of sequence identifiers, or even the sequence itself • In some cases it’s more useful to include a real sequence than an inaccurate id • In the end we will need a mapping from a gene list to all the spots on all arrays; this is non-trivial given the problems with names
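
A minimal sketch of the gene-list-to-spots mapping mentioned in the last bullet, assuming each spot is already annotated with a stable sequence identifier. The array and spot names are hypothetical; the sequence identifiers are reused from the S.pombe example.

```python
from collections import defaultdict

# (array id, spot id, sequence identifier); array and spot ids are made up.
SPOTS = [
    ("array-A", "spot-1", "SPAC664.11"),
    ("array-A", "spot-2", "SPCC297.03"),
    ("array-B", "spot-7", "SPAC664.11"),
]


def spots_for_genes(gene_ids, spots):
    """Map each requested identifier to every (array, spot) that carries it."""
    index = defaultdict(list)
    for array_id, spot_id, sequence_id in spots:
        index[sequence_id].append((array_id, spot_id))
    # Identifiers with no spot are kept with an empty list so gaps stay visible.
    return {gene: list(index.get(gene, [])) for gene in gene_ids}
```

The lookup itself is simple once identifiers are stable; the hard part the slide points at is getting from ambiguous gene names to those identifiers in the first place.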

  15. Sample annotation • Gene expression data only have meaning in the context of detailed sample descriptions • If the data is going to be interpreted by independent parties, sample information has to be searchable and in the database • Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc) are needed for unambiguous sample description

  16. Standardisation of microarray data and annotations: MGED group • The goal of the group is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalisation methods. • Includes most of the world’s largest microarray laboratories and companies (TIGR, Affymetrix, Stanford, Sanger, Agilent, etc.) • www.mged.org

  17. Example of a class, subclass relationship • Class def: African elephant • Sub-class of: elephant • Slot constraint: comes-from has-filler Africa • Just a way to say that African elephants are a type of elephant that come from Africa • Ian Horrocks, creator of OilEd
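
The same idea can be mimicked in Python to show what the class definition buys you. This is a toy model, not OIL semantics: the `OntologyClass` type, its methods and the slot name `comes_from` are assumptions for illustration.

```python
class OntologyClass:
    """Toy class definition: a name, an optional parent, and slot fillers."""

    def __init__(self, name, parent=None, slots=None):
        self.name = name
        self.parent = parent
        self.slots = dict(slots or {})

    def is_subclass_of(self, other):
        """Walk up the parent chain (a class counts as a subclass of itself)."""
        current = self
        while current is not None:
            if current is other:
                return True
            current = current.parent
        return False

    def slot_filler(self, slot):
        """Return the nearest filler for a slot, inherited from parents if needed."""
        current = self
        while current is not None:
            if slot in current.slots:
                return current.slots[slot]
            current = current.parent
        return None


elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant", parent=elephant, slots={"comes_from": "Africa"}
)
```

A query for all elephants would now include African elephants, while a query on `comes_from` separates them from other elephants.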

  18. Why do we need an ontology for the database • To perform structured queries • To accurately compare data • To avoid problems with free text searching • To avoid excessive curation workload in future

  19. Sample annotation: what can be done? • Build an ontology for gene expression data (MGED) • Use existing ontologies and link them in • Incorporate the ontology into the database • Develop internal editing tools for the ontology • Develop browser or other interface for the ontology and link to LIMS • Some use of free text descriptions is unavoidable (curation workload)

  20. MAGE-OM/MGED semantics • Biomaterial: The [source of the] nucleic acid used to generate labelled material for the microarray experiment. • Biosource: The primary source of the nucleic acid used to generate labelled material for the microarray experiment. • Biosample: The biosource after any treatment.

  21. Use case scenarios • Return a summary of all experiments that use a specified type of biosource (primary source). • Group the experiments according to treatment. • Return a summary of all experiments done examining effects of a specified treatment • Group the experiments according to biosource. • Return a summary of all experiments measuring the expression of a specified gene. • Indicate when experiments confirm results, provide new information, or conflict.
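
The first two use cases are the same filter-then-group query with the roles of biosource and treatment swapped. A rough sketch over in-memory records, where the experiment ids, field names and values are all hypothetical:

```python
from collections import defaultdict

# Hypothetical experiment records; a real system would query the database.
EXPERIMENTS = [
    {"id": "E-1", "biosource": "thymus", "treatment": "dexamethasone"},
    {"id": "E-2", "biosource": "thymus", "treatment": "heat shock"},
    {"id": "E-3", "biosource": "liver", "treatment": "dexamethasone"},
    {"id": "E-4", "biosource": "thymus", "treatment": "dexamethasone"},
]


def summarise(experiments, filter_key, filter_value, group_key):
    """Select experiments where filter_key == filter_value, grouped by group_key."""
    groups = defaultdict(list)
    for experiment in experiments:
        if experiment[filter_key] == filter_value:
            groups[experiment[group_key]].append(experiment["id"])
    return dict(groups)
```

`summarise(EXPERIMENTS, "biosource", "thymus", "treatment")` covers the first scenario and `summarise(EXPERIMENTS, "treatment", "dexamethasone", "biosource")` the second. Both only work if biosource and treatment are recorded with consistent terms, which is the case for the ontology.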

  22. MIAME – Minimum Information About a Microarray Experiment • 6 parts of a microarray experiment • [Figure: Experiment, Sample, Hybridisation, Array, Data, Normalisation, with external links to Source (e.g., Taxonomy), Gene (e.g., EMBL) and Publication] • www.mged.org

  23. MGED Biomaterial (sample) Ontology • Under construction by Chris Stoeckert • Using OilEd (though other tools exist) • Motivated by MIAME and coordinated with the database model • We will extend classes, provide constraints, define terms, provide new terms and develop cvs for submissions (EBI)

  24. MIAME Descriptions • Experimental design: the set of hybridisation experiments as a whole • Array design: each array used and each element (spot) on the array • Samples: samples used, extract preparation and labelling • Hybridisations: procedures and parameters • Measurements: images, quantitation, specifications • Normalisation controls: types, values, specifications

  25. Part of the MGED biomaterial ontology • class Age • documentation: The time period elapsed since an identifiable point in the life cycle of an organism. If a developmental stage is specified, the identifiable point would be the beginning of that stage. Otherwise the identifiable point must be specified, such as planting. • type: primitive • superclasses: BiosourceProperty • constraints: slot-constraint has_measurement has-value Measurement; slot-constraint initial_time_point has-value one-of (planting beginning_of_stage) • used in slots: initial_time_point
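
To make the two slot constraints concrete, here is a sketch of how a submission tool might enforce them. The slot names and the two permitted initial time points come from the excerpt above; the `Measurement` fields and the validation approach are assumptions.

```python
from dataclasses import dataclass

# The one-of list shown in the excerpt; the full ontology may allow more values.
ALLOWED_INITIAL_TIME_POINTS = ("planting", "beginning_of_stage")


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # free-text unit in this sketch, e.g. "weeks"


@dataclass(frozen=True)
class Age:
    has_measurement: Measurement
    initial_time_point: str

    def __post_init__(self):
        # slot-constraint has_measurement has-value Measurement
        if not isinstance(self.has_measurement, Measurement):
            raise TypeError("has_measurement must be a Measurement")
        # slot-constraint initial_time_point has-value one-of (...)
        if self.initial_time_point not in ALLOWED_INITIAL_TIME_POINTS:
            raise ValueError(
                f"initial_time_point must be one of {ALLOWED_INITIAL_TIME_POINTS}"
            )
```

For example, `Age(Measurement(3.0, "weeks"), "planting")` is accepted, while an initial time point outside the list raises `ValueError`, so an age can never be stored without saying what it is measured from.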

  26. MIAME Section on Sample Source and Treatment • organism (NCBI taxonomy) • cell source - provider • cell type (if derived from primary source(s)) • sex • age • growth conditions • development stage • organism part (tissue) • animal/plant strain or line • genetic variation (e.g., gene knockout, transgenic variation) • individual • individual genetic characteristics (e.g., disease alleles, polymorphisms) • disease state or normal • target cell type • cell line and source (if applicable) • in vivo treatments (organism or individual treatments) • in vitro treatments (cell culture conditions) • treatment type (e.g., small molecule, heat shock, cold shock, food deprivation) • compound • is additional clinical information available (link) • separation technique (e.g., none, trimming, microdissection, FACS) • laboratory protocol for sample treatment……
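
One way to picture how such a field list interacts with controlled vocabularies is a checker that sorts each submitted field into accepted, rejected, or free text needing curation. The vocabulary values below are the examples given on the slide, treated here as if they were complete lists, which is an assumption; the function and report layout are also hypothetical.

```python
# Example values from the slide, used as stand-in vocabularies (assumption).
FIELD_VOCABULARIES = {
    "treatment type": {"small molecule", "heat shock", "cold shock", "food deprivation"},
    "separation technique": {"none", "trimming", "microdissection", "FACS"},
}


def check_sample(sample):
    """Classify each field of a sample description against the vocabularies."""
    report = {"accepted": [], "rejected": [], "free_text": []}
    for field, value in sample.items():
        vocabulary = FIELD_VOCABULARIES.get(field)
        if vocabulary is None:
            # No vocabulary for this field yet: a curator has to read it.
            report["free_text"].append(field)
        elif value in vocabulary:
            report["accepted"].append(field)
        else:
            report["rejected"].append(field)
    return report
```

Every field that lands in `free_text` is curation workload, so the more fields that gain a vocabulary, the less manual checking a submission needs.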

  27. Examples of usable external ontologies • NCBI taxonomy database • Jackson Lab mouse strains and genes • Edinburgh mouse atlas anatomy • HUGO nomenclature for Human genes • Chemical and compound Ontologies - Merck index • TAIR • Flybase • GO

  28. Excerpts from a Sample Description, courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences • Organism: Mus musculus [NCBI taxonomy browser] • Cell source: in-house bred mice (contact: person@somewhere.ac.uk) • Sex: female [MGED] • Age: 3-4 weeks after birth [MGED] • Growth conditions: normal; controlled environment; 20-22 °C average temperature; housed in cages according to EU legislation; specified pathogen free (SPF) conditions; 14 hours light cycle; 10 hours dark cycle • [Developmental stage]: stage 28 (juvenile (young) mice) [GXD "Mouse Anatomical Dictionary"] • Organism part: thymus [GXD "Mouse Anatomical Dictionary"] • Strain or line: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice] • Genetic variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrains 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J, N, Ola. [International Committee on Standardized Genetic Nomenclature for Mice] • Treatment: in vivo [MGED] [intraperitoneal] injection of [Dexamethasone] into mice, 10 microgram per 25 g bodyweight of the mouse • Compound: drug [MGED], synthetic [glucocorticoid] [dexamethasone], dissolved in PBS

  29. [Figure: diagram with components External Ontologies; MGED/ArrayExpress Ontology; Production Curation Tool and Browser; Public Browser; MAGE-ML; Data checking vs. ArrayExpress ontology; Submission tool; LIMS]

  30. Introduction to the database • ArrayExpress is implemented in Oracle • The submission tool is a different implementation of the ArrayExpress model in MySQL • Faster, easier to update • Short term solution to the problem of data submission

  31. ArrayExpress conceptual model • [Figure: Experiment, Sample, Hybridisation, Array, Data, Normalisation, with external links to Source (e.g., Taxonomy), Gene (e.g., EMBL) and Publication]

  32. Submission tool • Includes all MIAME concepts • Uses as much CV as possible • Future versions: organism-specific pages and related linked ontologies • Allows user-driven ontology development • Will be developed according to user needs • Will also need to be an update tool

  33. [Figure: system diagram with components Submitter LIMS; User Login; Large Scale Submissions (MAGE-ML format); Submission tool (Array Submission, Protocol Sub., Experiment submission, Browse Arrays, Browse Protocols); Curation Database; ArrayExpress Database (MAGE-OM Model); Query Interface for Public Data (Browse Arrays, Browse Protocols); External Applications; External Databases, EMBL, Ontology Resources… etc; Data File Export; Analysis Tools; Expression Profiler]

  34. Submission Tool Front End Overview • [Figure: flow diagram of submission tool pages: Login/contact info; Array Submission; Pending array submissions; Browse existing arrays from ArrayExpress; Protocol sub; Protocol top page; Browse Public/User protocols; Add new protocol; Laboratory Protocols (Extract, Label, Hyb, Scan, Other); Experiment submission; Pending exp. submissions; Experiment details; Authors; Sample details, n/n samples (Qualifier, V, S); Hybs, prot./sample hybs; Overview (Samples, Array, Hybs, Protocols, Authors, Data files, each with a "change" link); Data upload; Submit]

  35. Submission tool at present • [Figure: flow diagram of currently implemented pages: Login/contact info; Array Submission; Protocol sub; Protocol top page; Add new protocol; Laboratory Protocols (Extract, Label, Hyb, Scan, Other); Experiment submission; Pending exp. submissions; Experiment details; Sample details, n/n samples (Qualifier, V, S); Hybs, prot./sample hybs]

  36. Expected Users • Users with limited local bioinformatics support • Users of bought-in arrays without LIMS • Small scale users with self-made arrays • Array submissions are expected from manufacturers (MAGE-ML format)

  37. Design Considerations • Speed and ease of use • Requires ability to browse existing protocols and array designs in ArrayExpress • Requirement for curator control over submissions • Submissions tracking and unique IDs • Need for a prototype quickly • Future use as a LIMS • Flexibility

  38. Features of submission tool • Creates a user login account instead of on-the-fly submissions so sessions can be saved • Allows existing protocols to be copied and saved and linked to more than one hyb/expt • Forms the basis of a LIMS using the ArrayExpress model • Will be available as a stand alone tool for local installation • Is open source and free • Will be supported

  39. State of the Art • MIAME publication in press, Nature Genetics • Curation database in place • Help linked to test version • Public version by Spring 2002 • Prototype submission tool under user test (practical this afternoon)

  40. Work still to do • Appearance • Help, fully contextual • Overview and data upload • Protocol storage • Update considerations • Ontology integration • Curation tools • Tracking • Copy across functions • Recruit curation staff • Lie awake and worry….. etc etc etc

  41. Fun Stuff • We need a name for the submission tool • ExpressIn has been suggested • If you suggest the one that we use there will be a (very) small prize

  42. How can you get involved? • Send me your protocols • MGED 4 meeting in February 2002, Boston • Join MGED and come to the meeting • Provide some feedback after the practical session - tell us what you want • www.mged.org

  43. Acknowledgments • Whole Microarray Informatics Team, EBI, esp. Alvis Brazma, Mohammad Shojatalab and Ugis Sarkans • Industry Support team, EBI • MGED steering committee • MIAME working group • Chris Stoeckert, U. Penn. and members of MGED
