Plant Ontologies – Industrial Science meets Renaissance Concepts Dave Selinger Computational Biologist Pioneer Hi-Bred, DuPont Agriculture and Nutrition
Outline • What is the nature of the problem that a Plant Anatomy Ontology can solve? • What is an Ontology? • How do you make a Plant Anatomy Ontology? • Does it really solve the problem?
Industrial Science • Not science in industry, but the industrialization of data creation, i.e. the ‘omics revolutions. • High-throughput data • Sequencing • Expression • Medium-throughput data • Proteomics • Metabolomics • Low-throughput data • Gene/protein function • Phenotype
The double-edged sword of Industrial Science • Industrial science means lots of cheap data • Sequencing << $0.01/base • $10,000 prokaryotic genomes are reality • $10,000 eukaryotic genomes will be reality in the next five years • Expression <$0.50/gene • And much of this data is available for free after it is produced! • Lots of data means that you can’t sit down with your lab notebook and analyze the data by hand. • Databases, software for searching and comparing • Whole new areas of research devoted to finding meaningful patterns in lots of data.
Organizing information • Information is not knowledge. • But knowledge can be acquired from information. • But only with a lot of effort, see the second law of thermodynamics • Central challenge with Industrial science is organizing the information. • The organization of the information determines what you can discover. • Experimental design • Good design will produce a contrast that will support or refute a hypothesis. • Statistical rigor • Is the signal higher than the noise? • How conclusive will the discoveries be?
Context • How do we compare across experiments? • Not too hard if one person did all the experiments and kept careful notes. • If multiple people, then we need to define what was done, what the analysis was, and what the sample was. • What was done – e.g. MIAME standard for describing the technical details of an expression experiment. • Analysis – e.g. ANOVA, SAM, etc. • Sample – ?
Renaissance concepts (historically Enlightenment) • Things can be systematically described and classified • Organisms - Linnaeus, Species Plantarum, 1753 • Linnaeus’ problem is much the same as the sample description problem • Variable specificity • California Laurel or Oregon Myrtlewood? • Kernel or seed? • In addition, a term like kernel assumes that all parts are present, but this assumption could be wrong
Ontologies to the rescue? • Ontology = the study of being (Philosophy) • The specification of a conceptualization of a domain of interest (Computer Science) • Original and continuing computer science interest was Artificial Intelligence. • How can a computer make inferences? • Need to define meanings, e.g. which sense of the word “can” is intended. • Structure and relationships in an ontology allow a computer to make inferences. • Mary is the mother of Bill. Is Mary a parent of Bill? • Mother IsA Parent, so yes • Parts of an ontology • Concepts -> objects, real and abstract, processes, functions • Partitions -> rules that can classify concepts • Attributes -> properties of a concept, can have individual and class attributes • Relationships -> is a, part of
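The Mary and Bill inference can be sketched in a few lines. This is only an illustration: the `IS_A` table, the extra "father" and "relative" concepts, and the function names are assumptions made for the example, not part of any ontology described in the talk.

```python
# Hypothetical mini-ontology: each concept maps to its direct "is a" parents.
IS_A = {
    "mother": {"parent"},
    "father": {"parent"},
    "parent": {"relative"},
}

def is_a(concept, target):
    """True if `concept` is `target` or reaches it by following is-a links."""
    if concept == target:
        return True
    return any(is_a(parent, target) for parent in IS_A.get(concept, ()))

# A stated fact: (subject, role, object).
fact = ("Mary", "mother", "Bill")

def holds(fact, role):
    """A fact stated with a specific role also holds for any more general role."""
    return is_a(fact[1], role)

print(holds(fact, "parent"))  # Mary is the mother of Bill, so she is a parent of Bill
```

The computer never needs to be told "Mary is a parent of Bill"; the answer follows from the single stated fact plus the is-a structure.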
Does an ontology make sense? • The value of ontologies is a current debate among information scientists. • One group advocates that ontologies are necessary for computers to understand content. • Semantic web -> an extension of the current HTML/XML based web to something with ontological inference • Others argue that ontologies are not needed and are not practical • Complexity is OK; just use a Google-like search to connect concepts. • However, some problems, like organismal classification and the periodic table, are very amenable to an ontological approach. • Formal categories and stable entities • Expert users and catalogers
Forms of ontologies • Ontologies can take several forms (data structures) • Controlled vocabulary (List) • Terms but no relationships • Enforces systematic naming • Hierarchy (tree structure) => Taxonomy • Terms and “is a” relationship • Children are unique and have a single parent • Directed acyclic graph => Gene Ontology • Multiple relationship types • Children with multiple parents
Features of Trees • Because each child node has only one parent • There is an unambiguous path to the root from each leaf • Child nodes can be easily grouped at any level of the structure • Trees can express only one organizing principle • Work well for taxonomy (at least eukaryotic taxonomy) • Organizing principle is classification by similarity • All terms have an “is a” relationship to the next level term • Organisms were classified before evolution was hypothesized, but the classification matches the evolutionary relationships • Similar example would be the periodic table of the elements • Classification can facilitate discovery of underlying principles
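A minimal sketch of the two tree properties above, assuming a deliberately simplified taxonomy (ranks are skipped and "Plants" stands in for the root). The `PARENT` table and function names are illustrative only.

```python
# In a tree every term has exactly one parent, so a plain dict is enough.
PARENT = {
    "Zea mays": "Zea",
    "Zea": "Poaceae",
    "Glycine max": "Glycine",
    "Glycine": "Fabaceae",
    "Poaceae": "Plants",
    "Fabaceae": "Plants",
}

def path_to_root(term):
    """Follow the single parent link upward; the path is unambiguous."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def group_at(level_term):
    """All terms that sit below `level_term`, i.e. grouping at that level."""
    return sorted(t for t in PARENT if level_term in path_to_root(t)[1:])

print(path_to_root("Zea mays"))
print(group_at("Poaceae"))
```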
A tree based Anatomy Ontology • Developed by Winston Hide’s group at SANBI and Electric Genetics • Single concept, orthogonal trees • Cells • Tissues • Organs • Disease state • Each tree is independent, but has related dimensions describing a sample • Set operations, intersection or union, between trees allows specific queries.
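The set-operation queries can be sketched as follows. The sample identifiers and the terms in each index are made up for illustration; only the idea of one independent index per tree comes from the slide.

```python
# Hypothetical indexes: one per independent tree, mapping a term to sample ids.
BY_ORGAN = {"liver": {"s1", "s2"}, "lung": {"s3", "s4"}}
BY_CELL = {"epithelial cell": {"s1", "s3"}, "fibroblast": {"s2", "s4"}}
BY_STATE = {"normal": {"s1", "s4"}, "tumor": {"s2", "s3"}}

# Intersection narrows a query: samples that are liver AND epithelial cell.
liver_epithelial = BY_ORGAN["liver"] & BY_CELL["epithelial cell"]

# Union widens a query: samples that are tumor OR lung.
tumor_or_lung = BY_STATE["tumor"] | BY_ORGAN["lung"]

print(sorted(liver_epithelial))
print(sorted(tumor_or_lung))
```

Because the trees are orthogonal, each dimension can be curated on its own and combined only at query time.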
Features of DAGs • A tree is a special case of the DAG class • Children can have multiple parents. • Allows multiple classifications of the same child • E.g. a guard cell is both part of a leaf and is an epidermal cell. • Allows for more than a binary classification of a concept • If this results from poor definition of the concept, then it is not good. • Multiple parentage fits a “normalized” data model • Like a normalized relational database, a DAG can minimize duplication of objects (concepts).
Sample DAG • Root • Cooking • Spices • Bay leaf • Laurus nobilis • Umbellularia californica (California laurel) • Trees • Lauraceae • Laurus • Laurus nobilis • Umbellularia • Umbellularia californica
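The sample DAG can be written down as a child-to-parents table, with the species given under their botanical names. The traversal below is a sketch; the `PARENTS` layout and the `ancestors` helper are assumptions made for the example.

```python
# Each term maps to the set of its direct parents; two species have two parents.
PARENTS = {
    "Cooking": {"Root"},
    "Trees": {"Root"},
    "Spices": {"Cooking"},
    "Bay leaf": {"Spices"},
    "Lauraceae": {"Trees"},
    "Laurus": {"Lauraceae"},
    "Umbellularia": {"Lauraceae"},
    "Laurus nobilis": {"Bay leaf", "Laurus"},
    "Umbellularia californica": {"Bay leaf", "Umbellularia"},
}

def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = list(PARENTS.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS.get(node, ()))
    return found

print(sorted(ancestors("Laurus nobilis")))
```

The same species is found whether a query starts from "Spices" or from "Lauraceae", without storing it twice.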
Constructing the Pioneer Plant Ontology • Decided to produce a DAG • Used DAG-Edit (editor developed for GO) • Developed our own web based viewing tool • AmiGO was too complicated to re-use. Other public browsers did not have the functionality we wanted. • Decided to focus on Corn and Soybeans • Used Kiesselbach’s 1949 monograph, The Structure and Reproduction of Corn, as the primary source. • Used Iowa State University Ag Extension publications for the development stages of corn and soybeans • Added information from a botany textbook to cover missing terms from soybean.
To collaborate or not to collaborate? • Advantage of just using the Pioneer Ontology was that it served our needs and was focused on corn and soybeans, our major crops. • Disadvantage was that it was not synchronized with the public ontology • We would not be able to easily integrate public tissue classifications with ours • We would not be able to easily take advantage of improvements to the public ontology • Presumably the public ontology would be more “botanically correct” than ours.
Plant Ontology Consortium • Focused on model organisms • Arabidopsis • Rice, with other grasses (e.g. corn) described using the rice terms • Used a DAG approach • Multiple concepts • Structure (cells, tissues, sporophyte and gametophyte) • Development • Used DAG-Edit and other GO approaches • Most terms have multiple parents • Same software and data structures as GO
Plant Ontology • Domain = Plant anatomy and development • Concepts • Plant parts (leaf, root, flower, meristem, etc.) • Life cycle stages (sporophyte, gametophyte) • Developmental stages (V1, flowering, R1, etc.) • Relationships between concepts • “A kind of” (Is a) • A prop root is a root • “A part of” (part of) • A root cap is part of a root • In addition, for plant anatomy a “develops from” relation is needed • For example the relationship between stomatal guard cells and the guard mother cell • Guard cells develop from guard mother cells
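The three relationship types can be sketched as labeled edges. The examples come from the slides (prop root, root cap, guard cell); the edge-list layout, relation identifiers, and `related` helper are assumptions for illustration and do not reflect the consortium's actual file format.

```python
# Each edge is (child term, relationship type, parent term).
EDGES = [
    ("prop root", "is_a", "root"),
    ("root cap", "part_of", "root"),
    ("guard cell", "develops_from", "guard mother cell"),
    ("guard cell", "is_a", "epidermal cell"),
    ("guard cell", "part_of", "leaf"),
]

def related(term, relation):
    """Terms that `term` points to through one relationship type."""
    return sorted(obj for subj, rel, obj in EDGES if subj == term and rel == relation)

print(related("guard cell", "develops_from"))
print(related("guard cell", "part_of"))
```

Keeping the relationship type on each edge lets a query follow only "part of" links, only "is a" links, or all of them.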
Adapting the POC ontology for Pioneer’s needs • Problem is that it has many more terms than required for our experiments • Some terms describe tissues or cells that are not practical to collect (e.g. antipodal cells) • Some terms describe parts not found in corn (e.g. nectary) • Another problem is that we collect samples that are convenient subdivisions of structures • Tip and base of an immature ear. Each differs from a whole immature ear in terms of what it contains. • Basal endosperm – morphologically distinct from starchy endosperm, but not found in the ontology
Our current solution • Add additional terms to the POC ontology • Use a different id system • easily distinguished from POC terms • will not be overwritten by on-going public curation efforts. • Label experiments with the terms from the ontology. • Create a Custom ontology • Query the whole ontology with the terms used in the labeling and keep only • terms that are used to label an experimental sample • Parent terms of used terms. • Can be readily rebuilt if new experiments or terms are added.
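A sketch of the custom-ontology step: keep every term used to label a sample, plus all of its parent terms up to the root. The term names, the flat parent structure, and the `LOCAL:` prefix for locally added terms are hypothetical stand-ins for the real ontology and id system.

```python
# Hypothetical full ontology: term -> set of direct parents.
FULL = {
    "plant structure": set(),
    "ear": {"plant structure"},
    "immature ear": {"ear"},
    "LOCAL:tip of immature ear": {"immature ear"},  # locally added term
    "endosperm": {"plant structure"},
    "nectary": {"plant structure"},
    "antipodal cell": {"plant structure"},
}

def build_custom(full, used_terms):
    """Return the sub-ontology containing the used terms and all their ancestors."""
    keep = set(used_terms)
    stack = list(used_terms)
    while stack:
        for parent in full[stack.pop()]:
            if parent not in keep:
                keep.add(parent)
                stack.append(parent)
    return {term: parents for term, parents in full.items() if term in keep}

custom = build_custom(FULL, {"LOCAL:tip of immature ear", "endosperm"})
print(sorted(custom))
```

Since the custom ontology is derived purely from the full ontology and the labels in use, it can be rebuilt whenever new experiments or terms are added.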
What can you do with the ontology? • Provides a grouping mechanism • Summarize expression for a tissue • Compare expression between tissues • Make complex queries that involve multiple tissues • Provides a systematic label for annotating genes • Where is the gene expressed? • Query annotation of genes based on terms • Provides a description of the complexity of tissue samples • Leaf sample is composed of multiple cell types with different roles • Cell types can be shared between tissues or structures
Comparing by tissue • The ontology provides the groupings, but how should each group be summarized? • Mean? • Median? • Maximum value? • Significance of differences? • Each group will be much more variable than a set of samples from a controlled experiment. • But you may be able to eliminate the inevitable false discoveries that appear when looking at large numbers of genes.
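The candidate summaries can be compared side by side once samples are grouped by their tissue label. The expression values below are invented for one hypothetical gene; the sketch uses only Python's standard `statistics` module and does not address significance testing.

```python
from statistics import mean, median

# Hypothetical (tissue label, expression value) pairs for a single gene.
SAMPLES = [("leaf", 10.0), ("leaf", 14.0), ("leaf", 30.0), ("root", 2.0), ("root", 4.0)]

def summarize(samples):
    """Group values by tissue term and report mean, median and maximum."""
    groups = {}
    for tissue, value in samples:
        groups.setdefault(tissue, []).append(value)
    return {
        tissue: {"mean": mean(values), "median": median(values), "max": max(values)}
        for tissue, values in groups.items()
    }

summary = summarize(SAMPLES)
print(summary["leaf"])
```

In the leaf group the mean (18.0) is pulled up by one high sample while the median (14.0) is not, which is the kind of difference the choice of summary has to account for.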
Annotating genes • This is the primary use for TAIR and Gramene • Potentially label most genes with tissues of expression • However, need to differentiate presence from preferential expression. • A gene may be present in many tissues, but highly expressed in a few • Another gene may be present in the same tissues, but similarly expressed in all of them. • Might need to precompute and indicate which tissues the gene is significantly preferentially expressed in. • Might be able to use the RMS differences between expression in each tissue as a measure of consistency.
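The slide does not define the RMS measure precisely. One possible reading, assumed here, is the root mean square of all pairwise differences between a gene's per-tissue expression values: zero when expression is identical everywhere, and larger as expression becomes more tissue-specific. The values are invented.

```python
from itertools import combinations
from math import sqrt

def rms_difference(by_tissue):
    """RMS of pairwise differences between tissues (needs at least two tissues)."""
    pairs = list(combinations(by_tissue.values(), 2))
    return sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

# Hypothetical per-tissue expression for two genes present in the same tissues.
uniform = {"leaf": 5.0, "stem": 5.0, "root": 5.0}
preferential = {"leaf": 1.0, "stem": 1.0, "root": 7.0}

print(rms_difference(uniform))
print(rms_difference(preferential))
```

Both genes would be annotated as present in leaf, stem and root, but only the second has a high score, which separates presence from preferential expression.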
Complexity • Genes may appear to differ between tissues for trivial reasons • Example: Gene appears to be preferentially expressed in stem versus leaf tissue. • If gene is really specific to vascular tissue and stem has more… • Gene is expressed late in development, adjacent leaves and stems may differ in development. • Ontology can guide further experiments • Compare vascular and non-vascular tissue from both leaf and stem. • Compare multiple leaf and stem samples from different positions (developmental stages).
Conclusions • The Plant Ontology classifies experiments and genes based on anatomical and developmental concepts. • Now that we have significant data, can we, like Darwin, discern the underlying mechanisms for how anatomical and developmental differences occur? • The Plant Ontology will be successful and used long term if it facilitates these kinds of investigations.
Acknowledgements • Pioneer • Henry Mirsky • Lane Arthur • Bob Merrill • POC • Doreen Ware (Gramene) • Katica Ilic (TAIR)