
The philosophy of biocuration and its use to analyse the fission yeast genome content

The philosophy of biocuration and its use to analyse the fission yeast genome content. Valerie Wood. What is Biocuration. Two main aspects to fission yeast curation


Presentation Transcript


  1. The philosophy of biocuration and its use to analyse the fission yeast genome content. Valerie Wood

  2. What is Biocuration? Two main aspects to fission yeast curation: 1. Literature curation: involves reading the full text of publications and associating novel biological information with the appropriate genes or features. 2. Sequence analysis: to infer biological information for unpublished genes.

  3. The Challenges • We need to make annotations as specific (complete depth) and as comprehensive (complete breadth) as possible. • We need to group similar annotations consistently so users can: • Access required information on a gene-by-gene basis • Analyse their own datasets, e.g. enrichment • Search for candidate genes of interest • Access similar features in other organisms

  4. Data gathering for genes of interest: traditionally a small number of genes • requires detailed literature searching • time-consuming. Example annotation lists per gene: Gene 1: RNA recognition motif, mRNA export, protein phosphorylation, nuclear, mitotic cell cycle, phosphorylated, ... • Gene 2: SAP domain, mRNA export, nucleolar, RNA elongation (pol II), … • Gene 3: mRNA export, transcription (pol II), … • Gene 4: mRNA export, transcription, polyadenylation, … • Gene 5: mRNA export, RNA elongation, … • Gene 6: mRNA export, rRNA transcription, DNA topological change, … • Gene 5000: cell cycle, chromosome segregation, kinetochore assembly, protein localization, … Not Scalable!

  5. Grouping by “feature”: mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5 • nucleolar: Gene 10, Gene 15, Gene 18, … • phosphorylated: Gene 1, Gene 7, Gene 10, … • transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, … • cell cycle: Gene 1, Gene 7, Gene 8, … • RNA recognition motif: Gene 1, Gene 7, Gene 8, … By establishing links between similar features we can begin to identify trends (enrichments and depletions) in the thousands of genes typically obtained in functional genomics datasets.
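The regrouping described on this slide amounts to inverting a gene-to-features table into a feature-to-genes index. The following is a minimal sketch of that idea; the gene names, the feature lists and the function name `group_by_feature` are all illustrative assumptions, not data or code from the presentation.

```python
from collections import defaultdict

# Hypothetical per-gene annotation lists, loosely modelled on the slide's example.
gene_annotations = {
    "Gene 1": ["RNA recognition motif", "mRNA export", "phosphorylated"],
    "Gene 2": ["SAP domain", "mRNA export", "nucleolar"],
    "Gene 3": ["mRNA export", "transcription"],
}


def group_by_feature(annotations):
    """Invert a gene -> features mapping into a feature -> set-of-genes mapping."""
    groups = defaultdict(set)
    for gene, features in annotations.items():
        for feature in features:
            groups[feature].add(gene)
    return dict(groups)


feature_groups = group_by_feature(gene_annotations)
print(sorted(feature_groups["mRNA export"]))
```

Once the index exists, asking "which genes share this feature?" is a single lookup, which is what makes genome-scale comparisons practical.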

  6. The literature corpus. What is the size of the ‘annotation problem’? Searching “Fission yeast OR pombe” gives 9264; adding “cell cycle” gives 2871. Solutions: • More curators • Community curation. Problems: • Funders don’t want to fund curation • Can we make the community curate?

  7. Grant • Additional curators (2) to ensure comprehensive and deep curation of the literature • Software to support curation activities (including community curation) • A computational infrastructure to integrate and display the curated data with the HTP data within Ensembl

  8. Need to make an intuitive web-based user interface where the community can add “consistent” and comprehensive curation. Watch this space! http://www.sanger.ac.uk/Projects/S_pombe/

  9. Ontologies • Ontologies provide a “controlled vocabulary” for biological knowledge • Consistent, unambiguous descriptions • Species independent, interpreted identically both within and between genomes, therefore enabling cross-species comparisons • Provide a way to capture and represent biological knowledge in a computable form • Ability to annotate to different levels of granularity depending on what is known or what can be inferred • Ontologies include: • A vocabulary of terms (names for concepts) • Definitions • Defined logical relationships to each other

  10. Disambiguation and Grouping. The same name can refer to different concepts: bud initiation? tooth bud initiation, cell bud initiation, plant bud initiation. Conversely, different names are used for the same concepts: MVB sorting, multivesicular body sorting, late endosome to vacuole transport; alternative names are exact synonyms. This principle applies to any type of curation, for example when describing phenotypes, similar cells can be described as “skittle”, “bottle” or “dumbbell”.
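A controlled vocabulary handles both problems with a lookup from every accepted name to one term, and no entry for names that are ambiguous on their own. The sketch below assumes a hand-written synonym table; which of the three names is the preferred one is an arbitrary choice made here for illustration, and `normalise_label` is a hypothetical helper.

```python
# Hypothetical synonym table: every exact synonym points at one preferred name.
# The choice of preferred name is an assumption made for this example.
EXACT_SYNONYMS = {
    "mvb sorting": "late endosome to vacuole transport",
    "multivesicular body sorting": "late endosome to vacuole transport",
    "late endosome to vacuole transport": "late endosome to vacuole transport",
}


def normalise_label(label):
    """Return the preferred term name, or None if the label is not a known synonym."""
    key = " ".join(label.lower().split())  # ignore case and repeated whitespace
    return EXACT_SYNONYMS.get(key)


print(normalise_label("MVB  sorting"))
print(normalise_label("bud initiation"))  # ambiguous on its own, so not in the table
```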

  11. Demonstrating ontology principles with GO. GO is 3 ontologies: • F: molecular function (activity, GTPase, transporter, receptor) • P: biological process (cell division, transcription, gluconeogenesis) • C: cellular component (location or complex)

  12. DAG Structure • DAG (Directed Acyclic Graph): many-to-many parental relationship; each child may have one or more parents • Hierarchy: one-to-many parental relationship; each child has only one parent
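The difference between the two structures can be checked mechanically when terms are stored as a child-to-parents mapping. This is a small sketch under that assumption; the term names and the edges between them are illustrative only, and both function names are hypothetical.

```python
# child -> set of parents; the terms and edges are illustrative assumptions.
hierarchy = {"cell": set(), "chloroplast": {"cell"}}
dag = {
    "cell": set(),
    "membrane": set(),
    "chloroplast": {"cell"},
    "chloroplast membrane": {"membrane", "chloroplast"},  # two parents
}


def is_strict_hierarchy(parents):
    """True if every term has at most one parent."""
    return all(len(p) <= 1 for p in parents.values())


def is_acyclic(parents):
    """Depth-first search; meeting a term that is still being visited means a cycle."""
    state = {}  # term -> "visiting" or "done"

    def visit(term):
        if state.get(term) == "done":
            return True
        if state.get(term) == "visiting":
            return False
        state[term] = "visiting"
        ok = all(visit(p) for p in parents.get(term, ()))
        state[term] = "done"
        return ok

    return all(visit(term) for term in parents)


print(is_strict_hierarchy(hierarchy), is_strict_hierarchy(dag), is_acyclic(dag))
```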

  13. Relationships between terms (figure): the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships

  14. Inheritance (figure: gene A). An important feature of GO is that broader parents give rise to more specific children. When a gene is directly annotated to a term (e.g. DNA replication), it is automatically indirectly annotated to all of its parent terms. This allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred.
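Indirect annotation can be computed by walking from each directly annotated term up to the root. The sketch below uses a toy child-to-parents graph; the intermediate term names are assumptions chosen for illustration and are not taken from the real GO graph.

```python
# Toy child -> parents mapping; illustrative only, not the real GO graph.
PARENTS = {
    "DNA replication": {"DNA metabolic process"},
    "DNA metabolic process": {"metabolic process"},
    "metabolic process": {"biological process"},
    "biological process": set(),
}


def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def all_annotations(direct_terms, parents):
    """Direct annotations plus every term they imply through inheritance."""
    result = set(direct_terms)
    for term in direct_terms:
        result |= ancestors(term, parents)
    return result


gene_a = all_annotations({"DNA replication"}, PARENTS)
print(sorted(gene_a))
```

A query for any of the broader terms then finds gene A even though the curator only recorded the most specific one.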

  15. Ontologies... • Provide a standard for annotation • Have 2 components: the ontology and the annotations • Allow experimental work to be evaluated in the context of other experimental data which may be annotated at different levels of granularity • Allow biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large-scale experiments) • Become increasingly powerful as the ontologies and annotations are refined

  16. Other annotation types • products (special case, unique descriptors) • annotation status • species distribution • orthology • phenotype data, will use PATO • protein modifications, will use MOD • metabolites, will use ChEBI (Chemical Entities of Biological Interest) • sequence features, will use SO • protein-protein interactions, will use MI and BioGrid. Increasingly, features will be described using “cross products” derived from multiple ontologies: e.g. “response to a specific drug” will be made with the GO biological process term “response to drug” and a drug from ChEBI; e.g. phenotypes are typically annotated using a PATO “quality” term combined with a wild-type GO process (e.g. conjugation, defective; crossover formation, abolished)

  17. The curation process and annotation status

  18. GO Curation Strategy. Manual Curation: • Emphasis on Primary Literature (1829 publications, 17655 annotations) • Manual inspection of sequence similarity (9708 annotations). Computational Mappings: • Inferred electronically (4127 annotations). No data for F, P or C: 2542. Total: 34032

  19. Evidence Codes Used (annotation counts for Oct 07 / Dec 08 / June 09) • IDA, inferred from direct assay: 8618 / 8889 / 9076 • IPI, inferred from physical interaction: 776 / 991 / 1083 • IGI, inferred from genetic interaction: 901 / 1129 / 1164 • TAS, traceable author statement: 1089 / 1091 / 1106 • IC, inferred by curator: 1073 / 1164 / 1264 • ISS, inferred from sequence similarity: 9045 / 9706 / 9708 • IMP, inferred from mutant phenotype: 1912 / 2328 / 2455 • NAS, non-traceable author statement: 522 / 595 / 617 • IEA, inferred from electronic annotation: 6397 / 4620 / 4127 • ND, no data (root node annotations): 2542 • IEP: 185 • RCA: 702 • Total: 30333 / 31676 / 34032

  20. GO annotation progress. Molecular Function: 9049 • Biological Process: 10985 • Cellular Component: 13998 • Total: 34032. 30,616 annotations to 3080 terms (06/06/07) • 31,676 annotations to 3263 terms (13/12/08) • 34,035 annotations to 3361 terms (16/06/09)

  21. Analysing the curated data

  22. GO aspect coverage Total 5025 All 3 aspects unknown 118

  23. Protein Annotation Status (pie chart segments): 2133 (43.0 %) • 1817 (36.7 %) • 639 (12.9 %) • 312 (6.3 %) • 56 (1.1 %) • Total 4957

  24. The conserved “unknown” unknowns: 639 • 98 Bacteria, Fungi, Plant • 196 Fungi only • 346 to Metazoa; of these, 235 are 1:1; of these, 131 are nuclear • Over 100 Nature papers?

  25. This is the 53 at the top of the list Splicing?

  26. Kim D-U, Hayles J, Kim D et al (manuscript submitted)

  27. “Slimming” • High level view of GO (genes annotated to granular terms are mapped to higher level terms) • Allows users to group genes into broader categories to assess their distribution; useful for large-scale, genome-wide analyses or smaller gene sets • Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high level processes) • You can create and use your own GO slim with high level terms of interest • CARE: not a gene product count, as gene products have multiple annotations (will explain this in the workshop)

  28. Process Super Slim. Added: 8454, i.e. more than the number of genes. The categories are not mutually exclusive, so it doesn’t make sense to put them in a pie chart and show them as percentages. It is also important to show which genes are not annotated (root node annotations) and which genes are not in the slim set but are annotated to other terms.
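The counting caveat can be reproduced on a toy example: map each gene's annotations up to the slim terms, count genes per slim term, and list the genes that reach no slim term. Everything in this sketch (the graph, the slim set, the gene names and the helper names) is a hypothetical stand-in for the real GO data.

```python
# Toy child -> parents graph and slim set; illustrative assumptions only.
PARENTS = {
    "kinetochore assembly": {"cell cycle"},
    "rRNA transcription": {"transcription"},
    "mRNA export": {"transport"},
    "cell cycle": {"biological process"},
    "transcription": {"biological process"},
    "transport": {"biological process"},
    "biological process": set(),
}
SLIM = {"cell cycle", "transcription"}
ANNOTATIONS = {
    "gene1": {"kinetochore assembly", "rRNA transcription"},
    "gene2": {"kinetochore assembly"},
    "gene3": {"mRNA export"},         # annotated, but outside the slim
    "gene4": {"biological process"},  # root node annotation only
}


def with_ancestors(term, parents):
    """The term itself plus everything reachable through parent links."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def slim_counts(annotations, slim, parents):
    """Genes per slim term, plus the genes that map to no slim term at all."""
    counts = {term: 0 for term in slim}
    not_in_slim = []
    for gene, terms in annotations.items():
        expanded = set().union(*(with_ancestors(t, parents) for t in terms))
        hits = expanded & slim
        for term in hits:
            counts[term] += 1
        if not hits:
            not_in_slim.append(gene)
    return counts, not_in_slim


counts, not_in_slim = slim_counts(ANNOTATIONS, SLIM, PARENTS)
print(counts, sum(counts.values()), not_in_slim)
```

Here two genes fall in the slim, yet the per-term counts sum to three because gene1 is counted under both terms, which is why the totals should not be read as a gene count or drawn as a pie chart.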

  29. Term Enrichment • Finding significantly enriched terms shared among a list of genes • Discover what these genes may have in common • Statistical measure of how likely it is that your differentially regulated genes fall into that category by chance
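One common way to compute such a measure is a one-sided hypergeometric test: given N genes in the genome, K of them annotated to a term, and a study list of n genes of which k carry the term, it gives the probability of seeing k or more by chance. The slides do not say which statistic was used, so the choice of test and the function name `enrichment_p_value` are assumptions; correction for testing many terms is not shown.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(population N, annotated K, sample n).

    k: genes in the study list annotated to the term
    n: size of the study list
    K: genes in the whole population annotated to the term
    N: size of the whole population
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers: 10 genes, 5 annotated to the term, a 5-gene list in which all 5 carry it.
print(enrichment_p_value(5, 5, 5, 10))
```

A small p-value suggests the term is over-represented in the list relative to the genome background.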

  30. This is a comparative enrichment analysis (fission yeast vs. budding yeast). It shows processes enriched in the essential gene set and in the non-essential gene set. The enrichment also identified many child terms which were enriched, but the results were presented as a “slim” of the high level terms, and the complete term lists are presented in supplementary data. Kim D-U, Hayles J, Kim D et al (manuscript submitted)

  31. Acknowledgements • Martin Aslett (WT Sanger UK) • Midori Harris and the GO editorial team (EBI UK) • Jacky Hayles (CRUK) and the deletion project consortium (Kwang Lae-Hoe)

  32. UPDATE Data mining, complex What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
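With annotations indexed by term, finding overlaps reduces to set intersection. The following sketch assumes a term-to-genes index with made-up gene identifiers and category names; `genes_with_all` is a hypothetical helper, not part of any curation tool mentioned in the talk.

```python
# Hypothetical term -> genes index; identifiers and categories are made up.
INDEX = {
    "conserved in Metazoa": {"g1", "g2", "g3", "g4"},
    "1:1 orthologue": {"g1", "g2", "g5"},
    "nuclear": {"g2", "g3", "g5"},
    "process unknown": {"g1", "g2", "g6"},
}


def genes_with_all(terms, index):
    """Genes that appear under every one of the given terms."""
    gene_sets = [index[term] for term in terms]
    return set.intersection(*gene_sets) if gene_sets else set()


candidates = genes_with_all(
    ["conserved in Metazoa", "1:1 orthologue", "nuclear", "process unknown"], INDEX
)
print(candidates)
```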

  33. Additional points • A gene product can have several functions, cellular locations and be involved in many processes • Annotation of a gene product to one ontology is independent from its annotation to other ontologies • Annotations are only to terms reflecting a normal activity or location • Usage of ‘unknown’ GO terms

  34. Modifying the interpretation of an annotation: the Qualifier column • 1. NOT: a gene product is NOT associated with the GO term; used to document conflicting claims in the literature • 2. Contributes to: distinguishes between individual subunit functions and whole complex functions; used with the GO Function Ontology • 3. Colocalizes with: transiently or peripherally associated with an organelle or complex; used with the GO Component Ontology

  35. Electronic Annotations • Fatty acid biosynthesis (Swiss-Prot Keyword) maps to GO: fatty acid biosynthesis (GO:0006633) • EC:6.4.1.2 (EC number) maps to GO: acetyl-CoA carboxylase activity (GO:0003989) • IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) maps to GO: acetyl-CoA carboxylase activity (GO:0003989)

  36. Unknown vs. Unannotated • Direct root node annotations are used when the curator has determined that there is no existing literature to support an annotation: • Biological process GO:0000004 • Molecular function GO:0005554 • Cellular component GO:0008372 • This is NOT the same as having no annotation at all • No annotation means that no one has looked yet
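The distinction can be made explicit in code by treating "only the root node ID" and "no terms at all" as different states. This sketch uses the three IDs listed on the slide; the per-gene data layout (aspect letter mapped to a set of GO IDs) and the function name `aspect_status` are assumptions made for the example.

```python
# Root node IDs as listed on the slide, keyed by aspect letter.
ROOT_UNKNOWN = {"P": "GO:0000004", "F": "GO:0005554", "C": "GO:0008372"}


def aspect_status(gene_terms, aspect):
    """Classify one aspect of one gene as 'annotated', 'unknown' or 'unannotated'.

    gene_terms: dict mapping aspect letter -> set of GO IDs (assumed layout).
    """
    terms = gene_terms.get(aspect, set())
    if not terms:
        return "unannotated"  # no one has looked yet
    if terms == {ROOT_UNKNOWN[aspect]}:
        return "unknown"      # a curator looked and found nothing to annotate
    return "annotated"


example_gene = {"P": {"GO:0006633"}, "F": {"GO:0005554"}}
print([aspect_status(example_gene, a) for a in ("P", "F", "C")])
```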

  37. GO aspect coverage (old) • Function 3542 (includes protein binding) 993 • Biological Process 4019 • Cellular Component 4821 • Venn diagram region counts: 18, 191, 54, 3279 (3455), 679, 672, 14 • Total 5004 (5780 S. cerevisiae) • All three aspects unknown 105 (564 S. cerevisiae)
