The philosophy of biocuration and its use to analyse the fission yeast genome content
Valerie Wood
What is Biocuration?
Two main aspects to fission yeast curation:
1. Literature curation: involves reading the full text of publications and associating novel biological information with the appropriate genes or features
2. Sequence analysis: to infer biological information for unpublished genes
The Challenges
• We need to make annotations as specific (complete depth) and as comprehensive (complete breadth) as possible.
• We need to group similar annotations consistently so users can:
  • Access required information on a gene-by-gene basis
  • Analyse their own datasets, e.g. enrichment
  • Search for candidate genes of interest
  • Access similar features in other organisms
Data gathering for genes of interest
• traditionally small number of genes
• requires detailed literature searching
• time-consuming
Gene 1: RNA recognition motif, mRNA export, protein phosphorylation, nuclear, mitotic cell cycle, phosphorylated ....
Gene 2: SAP domain, mRNA export, nucleolar, RNA elongation (pol II) …
Gene 3: mRNA export, transcription (pol II) …
Gene 4: mRNA export, transcription, polyadenylation …
Gene 5: mRNA export, RNA elongation …
Gene 6: mRNA export, rRNA transcription, DNA topological change …
Gene 5000: cell cycle, chromosome segregation, kinetochore assembly, protein localization …
Not Scalable!
Grouping by “feature”
mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5
nucleolar: Gene 10, Gene 15, Gene 18 …
phosphorylated: Gene 1, Gene 7, Gene 10 …
transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5 ..
Cell cycle: Gene 1, Gene 7, Gene 8 …
RNA recognition motif: Gene 1, Gene 7, Gene 8 …
By establishing links between similar features we can begin to identify trends (enrichments and depletions) in the thousands of genes typically obtained in functional genomics datasets
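The switch from gene-by-gene records to feature-based groups can be sketched as a simple index inversion. This is only an illustration: the gene names and feature lists are abbreviated from the slide's example, and `group_by_feature` is a hypothetical helper name, not part of any curation tool.

```python
from collections import defaultdict

# Per-gene annotations, abbreviated from the slide's example (illustrative only).
gene_features = {
    "Gene 1": ["RNA recognition motif", "mRNA export", "phosphorylated"],
    "Gene 2": ["SAP domain", "mRNA export", "nucleolar"],
    "Gene 3": ["mRNA export", "transcription (pol II)"],
}


def group_by_feature(gene_features):
    """Invert a gene -> features mapping into feature -> sorted list of genes."""
    grouped = defaultdict(set)
    for gene, features in gene_features.items():
        for feature in features:
            grouped[feature].add(gene)
    return {feature: sorted(genes) for feature, genes in grouped.items()}


feature_genes = group_by_feature(gene_features)
print(feature_genes["mRNA export"])
```

Once the data is keyed by feature, asking "which genes share this feature?" is a single lookup rather than a scan over every gene record.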
The literature corpus
What is the size of the ‘annotation problem’?
• Fission yeast OR pombe gives 9264
• Adding “cell cycle” gives 2871
Solutions
• More curators
• Community curation
Problems
• Funders don’t want to fund curation
• Can we make the community curate?
Grant
• Additional curators (2) to ensure comprehensive and deep curation of the literature
• Software to support curation activities (including community curation)
• A computational infrastructure to integrate and display the curated data with the HTP data within Ensembl
We need to make an intuitive web-based user interface where the community can add “consistent” and comprehensive curation.
Watch this space! http://www.sanger.ac.uk/Projects/S_pombe/
Ontologies
• Ontologies provide a “controlled vocabulary” for biological knowledge
• Consistent, unambiguous descriptions
• Species independent, interpreted identically both within and between genomes, therefore enabling cross-species comparisons
• Provide a way to capture and represent biological knowledge in a computable form
• Ability to annotate to different levels of granularity depending on what is known or what can be inferred
• Ontologies include:
  • A vocabulary of terms (names for concepts)
  • Definitions
  • Defined logical relationships to each other
Disambiguation and Grouping
The same name can describe different concepts: bud initiation? (tooth bud initiation, cell bud initiation, plant bud initiation)
Conversely, different names are used for the same concept: MVB sorting, multivesicular body sorting, late endosome to vacuole transport. Alternative names are exact synonyms.
This principle applies to any type of curation. For example, when describing phenotypes, similar cells can be described as “skittle”, “bottle” or “dumbbell”.
Demonstrating ontology principles with GO
GO is 3 ontologies:
F molecular function (activity, GTPase, transporter, receptor)
P biological process (cell division, transcription, gluconeogenesis)
C cellular component (location or complex)
DAG Structure
DAG (Directed Acyclic Graph):
• Many-to-many parental relationship
• Each child may have one or more parents
Hierarchy:
• One-to-many parental relationship
• Each child has only one parent
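The difference between the two structures can be made concrete with a child-to-parents mapping. The sketch below is a minimal illustration: the term names are placeholders, and `is_strict_hierarchy` and `is_acyclic` are hypothetical helper names.

```python
# Child -> list of parents. Term names are illustrative placeholders.
hierarchy = {
    "membrane": ["cell"],
    "chloroplast": ["cell"],
}
dag = {
    "membrane": ["cell"],
    "chloroplast": ["cell"],
    "chloroplast membrane": ["membrane", "chloroplast"],  # two parents
}


def is_strict_hierarchy(parents):
    """True if every child has at most one parent (one-to-many structure)."""
    return all(len(p) <= 1 for p in parents.values())


def is_acyclic(parents):
    """True if following child -> parent links never returns to a term on the path."""
    visiting, done = set(), set()

    def visit(term):
        if term in done:
            return True
        if term in visiting:
            return False  # reached a term already on the current path
        visiting.add(term)
        ok = all(visit(parent) for parent in parents.get(term, []))
        visiting.discard(term)
        if ok:
            done.add(term)
        return ok

    return all(visit(term) for term in parents)


print(is_strict_hierarchy(hierarchy), is_strict_hierarchy(dag))
print(is_acyclic(dag))
```

A strict hierarchy is a special case of a DAG, so anything written to walk a DAG also works on a hierarchy, but not the other way round.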
Relationships between terms
[Figure: the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]
Inheritance
An important feature of GO is that broader parents give rise to more specific children. When a gene (gene A in the figure) is directly annotated to a term (e.g. DNA replication), it is automatically indirectly annotated to all of its parent terms. This allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred.
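Indirect annotation amounts to walking up the graph from each directly annotated term and collecting every ancestor. The following is a minimal sketch under stated assumptions: the ontology fragment is simplified and hypothetical, and `ancestors` and `propagate` are illustrative names rather than functions of any GO tool.

```python
# Simplified, hypothetical ontology fragment: child -> parents.
PARENTS = {
    "DNA replication": ["DNA metabolic process"],
    "DNA metabolic process": ["metabolic process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def propagate(direct, parents):
    """Expand direct gene -> terms annotations to include all ancestor terms."""
    expanded = {}
    for gene, terms in direct.items():
        all_terms = set(terms)
        for term in terms:
            all_terms |= ancestors(term, parents)
        expanded[gene] = all_terms
    return expanded


annotations = propagate({"gene A": {"DNA replication"}}, PARENTS)
print(sorted(annotations["gene A"]))
```

Because the walk uses a visited set, it handles terms with several parents without counting any ancestor twice.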
Ontologies.....
• Provide a standard for annotation
• Have 2 components: the ontology and the annotations
• Allow experimental work to be evaluated in the context of other experimental data which may be annotated at different levels of granularity
• Allow biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large-scale experiments)
• Become increasingly powerful as the ontologies and annotations are refined
Other annotation types
• products (special case, unique descriptors)
• annotation status
• species distribution
• orthology
• phenotype data, will use (PATO)
• protein modifications, will use (MOD)
• metabolites, will use (ChEBI, chemical entities of biological importance)
• sequence features, will use (SO)
• protein-protein interactions, will use (MI) and BioGrid
Increasingly, features will be described using “cross products” derived from multiple ontologies:
• e.g. “response to a specific drug” will be made with the GO biological process term “response to drug” and a drug from ChEBI
• e.g. phenotypes are typically annotated using a PATO “quality” term combined with a wild-type GO process (e.g. conjugation, defective; crossover formation, abolished)
GO Curation Strategy
Manual Curation
• Emphasis on Primary Literature: 1829 publications, 17655 annotations
• Manual inspection of sequence similarity: 9708 annotations
Computational Mappings
• Inferred electronically: 4127 annotations
No data for F, P or C: 2542
Total 34032
Evidence Codes Used

Code  Description                          Oct 07  Dec 08  June 09
IDA   inferred from direct assay             8618    8889     9076
IPI   inferred from physical interaction      776     991     1083
IGI   inferred from genetic interaction       901    1129     1164
TAS   traceable author statement             1089    1091     1106
IC    inferred by curator                    1073    1164     1264
ISS   inferred from sequence similarity      9045    9706     9708
IMP   inferred from mutant phenotype         1912    2328     2455
NAS   non-traceable author statement          522     595      617
IEA   from electronic annotation             6397    4620     4127
Total                                       30333   31676    34032

Single values shown without a date column: ND (no data, root node annotations) 2542; IEP 185; RCA 702
GO annotation progress
Molecular Function: 9049
Biological Process: 10985
Cellular Component: 13998
Total 34032
30,616 annotations to 3080 terms (06/06/07)
31,676 annotations to 3263 terms (13/12/08)
34,035 annotations to 3361 terms (16/06/09)
GO aspect coverage Total 5025 All 3 aspects unknown 118
Protein Annotation Status
[Pie chart of five annotation status categories]
• 2133 (43.0 %)
• 1817 (36.7 %)
• 639 (12.9 %)
• 312 (6.3 %)
• 56 (1.1 %)
Total 4957
The conserved “unknown” unknowns
639
• 98 Bacteria, Fungi, Plant
• 196 Fungi only
• 346 to Metazoa
  • of these, 235 1:1
  • of these, 131 nuclear
Over 100 Nature papers?
This is the 53 at the top of the list Splicing?
“Slimming”
• High-level view of GO (genes annotated to granular terms are mapped to higher-level terms)
• Allows users to group genes into broader categories to assess their distribution; useful for large-scale, genome-wide analyses or smaller gene sets
• Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high-level processes)
• You can create and use your own GO slim with high-level terms of interest
• CARE: not a gene product count, as gene products have multiple annotations (will explain this in the workshop)
Process Super Slim
• The slim term counts added up to 8454, i.e. more than the number of genes. The categories are not mutually exclusive, therefore it doesn’t make sense to put them in a pie chart and show them as percentages
• It is also important to show which genes are not annotated (root node annotations)
• and which genes are not in the slim set but are annotated to other terms
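The caveat about slim counts can be reproduced with a toy example. This is a sketch under stated assumptions: the ontology fragment, slim terms and gene names are all made up for illustration, and `to_slim` is a hypothetical helper, not the behaviour of any particular GO slim tool.

```python
# Made-up ontology fragment (child -> parents) and slim set, for illustration only.
PARENTS = {
    "kinetochore assembly": ["cell cycle"],
    "transcription (pol II)": ["transcription"],
    "cell cycle": ["biological process"],
    "transcription": ["biological process"],
    "biological process": [],
}
SLIM_TERMS = {"cell cycle", "transcription"}


def ancestors(term, parents):
    """Collect every term reachable by following parent links from `term`."""
    found, stack = set(), list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def to_slim(gene_annotations, parents, slim_terms):
    """Map each gene to the slim terms it reaches; also return genes reaching none."""
    slim_genes = {term: set() for term in slim_terms}
    unmapped = set()
    for gene, terms in gene_annotations.items():
        hits = set()
        for term in terms:
            hits |= ({term} | ancestors(term, parents)) & slim_terms
        for hit in hits:
            slim_genes[hit].add(gene)
        if not hits:
            unmapped.add(gene)
    return slim_genes, unmapped


gene_annotations = {
    "gene A": {"kinetochore assembly", "transcription (pol II)"},
    "gene B": {"kinetochore assembly"},
    "gene C": {"biological process"},  # root node annotation only
}
slim_genes, unmapped = to_slim(gene_annotations, PARENTS, SLIM_TERMS)
summed_counts = sum(len(genes) for genes in slim_genes.values())
genes_in_slim = len(set().union(*slim_genes.values()))
print(summed_counts, genes_in_slim, sorted(unmapped))
```

Here gene A falls under both slim terms, so the summed per-term counts exceed the number of distinct genes covered, which is why percentages of a whole would be misleading.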
Term Enrichment
• Finding significantly enriched terms shared among a list of genes
• Discover what these genes may have in common
• Statistical measure of how likely it is that your differentially regulated genes fall into that category by chance
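One common way to compute such a statistic is a one-sided hypergeometric test. The slides do not say which test was used, so this choice is an assumption, as are the function name `enrichment_pvalue` and the example numbers.

```python
from math import comb


def enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: total number of genes considered
    annotated:  genes in the population annotated to the term
    sample:     size of the gene list being tested
    hits:       genes in the list annotated to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5000 genes, 100 annotated to a term,
# and a list of 50 genes of which 10 carry the term.
p = enrichment_pvalue(5000, 100, 50, 10)
print(f"{p:.3g}")
```

In practice many terms are tested at once, so a multiple-testing correction would normally be applied to the resulting p-values; that step is omitted here.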
This is a comparative enrichment analysis (fission yeast vs. budding yeast). It shows processes enriched in the essential gene set and in the non-essential gene set. The enrichment also identified many enriched child terms, but the results were presented as a “slim” of the high-level terms, and the complete term lists are presented in supplementary data.
Kim D-U, Hayles J, Kim D et al. (manuscript submitted)
Acknowledgements • Martin Aslett (WT Sanger UK) • Midori Harris and the GO editorial team (EBI UK) • Jacky Hayles (CRUK) and the deletion project consortium (Kwang Lae-Hoe)
UPDATE Data mining, complex What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
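Finding overlaps between terms of interest reduces to set operations on per-term gene sets. The sketch below uses the truncated gene lists from the "Grouping by feature" slide as illustrative data; `genes_with_all` is a hypothetical helper name.

```python
# Per-term gene sets, taken from the truncated lists on the grouping slide.
term_genes = {
    "mRNA export": {"Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"},
    "nucleolar": {"Gene 10", "Gene 15", "Gene 18"},
    "phosphorylated": {"Gene 1", "Gene 7", "Gene 10"},
    "Cell cycle": {"Gene 1", "Gene 7", "Gene 8"},
}


def genes_with_all(terms, term_genes):
    """Genes annotated to every one of the given terms (set intersection)."""
    return set.intersection(*(term_genes[term] for term in terms))


export_and_phospho = genes_with_all(["mRNA export", "phosphorylated"], term_genes)
export_not_cell_cycle = term_genes["mRNA export"] - term_genes["Cell cycle"]
print(sorted(export_and_phospho), sorted(export_not_cell_cycle))
```

Set difference covers the complementary question: genes with one feature but lacking another.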
Additional points • A gene product can have several functions, cellular locations and be involved in many processes • Annotation of a gene product to one ontology is independent from its annotation to other ontologies • Annotations are only to terms reflecting a normal activity or location • Usage of ‘unknown’ GO terms
Modifying the interpretation of an annotation: the Qualifier column
1. NOT
  • a gene product is NOT associated with the GO term
  • to document conflicting claims in the literature
2. Contributes to
  • distinguishes between individual subunit functions and whole complex functions
  • used with GO Function Ontology
3. Colocalizes with
  • transiently or peripherally associated with an organelle or complex
  • used with GO Component Ontology
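The practical consequence of qualifiers is that an annotation record cannot be read as a plain gene-term pair. A minimal sketch follows, with several assumptions: the record layout, the underscore spellings of the qualifiers, the rule that NOT may be used with any aspect, and the gene and term names are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Aspects each qualifier may be used with. The restriction of the last two
# follows the slide; allowing NOT on every aspect is an assumption.
ALLOWED_ASPECTS = {
    "NOT": {"F", "P", "C"},
    "contributes_to": {"F"},
    "colocalizes_with": {"C"},
}


@dataclass(frozen=True)
class Annotation:
    gene: str
    term: str
    aspect: str  # "F", "P" or "C"
    qualifier: Optional[str] = None


def is_valid(annotation):
    """True if the qualifier (if any) is permitted for the annotation's aspect."""
    if annotation.qualifier is None:
        return True
    return annotation.aspect in ALLOWED_ASPECTS.get(annotation.qualifier, set())


def asserted_terms(annotations, gene):
    """Terms positively associated with a gene, skipping NOT annotations."""
    return {
        a.term for a in annotations if a.gene == gene and a.qualifier != "NOT"
    }


records = [
    Annotation("gene X", "GTPase activity", "F"),
    Annotation("gene X", "receptor activity", "F", "NOT"),
    Annotation("gene X", "kinetochore", "C", "colocalizes_with"),
]
print(sorted(asserted_terms(records, "gene X")))
```

Ignoring the qualifier column would wrongly count the NOT record as a positive annotation in any downstream grouping or enrichment.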
Electronic Annotations
• Fatty acid biosynthesis (Swiss-Prot Keyword) -> GO:Fatty acid biosynthesis (GO:0006633)
• EC:6.4.1.2 (EC number) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
• IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
Unknown vs. Unannotated
• Direct root node annotations are used when the curator has determined that there is no existing literature to support an annotation.
  • Biological process GO:0000004
  • Molecular function GO:0005554
  • Cellular component GO:0008372
• NOT the same as having no annotation at all
• No annotation means that no one has looked yet
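The distinction can be expressed as a three-way status per aspect. The sketch below uses the root term IDs listed on the slide; the function name `aspect_status` and the status labels are hypothetical.

```python
# Root ("unknown") term IDs as listed on the slide, keyed by aspect.
ROOT_TERMS = {
    "P": "GO:0000004",  # biological process
    "F": "GO:0005554",  # molecular function
    "C": "GO:0008372",  # cellular component
}


def aspect_status(term_ids, aspect):
    """Classify a gene's annotations for one aspect.

    'unannotated': no annotation at all (no one has looked yet)
    'unknown':     only the root term (a curator looked and found nothing)
    'annotated':   at least one term other than the root
    """
    terms = set(term_ids)
    if not terms:
        return "unannotated"
    if terms == {ROOT_TERMS[aspect]}:
        return "unknown"
    return "annotated"


print(aspect_status([], "P"))
print(aspect_status(["GO:0000004"], "P"))
print(aspect_status(["GO:0006633"], "P"))
```

Keeping "unknown" separate from "unannotated" lets coverage statistics distinguish curated absence of data from genes that still need attention.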
GO aspect coverage (old)
Function 3542 (includes protein binding) 993
Biological Process 4019
Cellular Component 4821
[Venn diagram region counts: 18, 191, 54, 3279 (3455), 679, 672, 14]
Total 5004 (5780 S. cerevisiae)
All three aspects unknown 105 (564 S. cerevisiae)