510 likes | 612 Views
Pre-SIG Genome Annotation Database Operations. Suzanna Lewis. FlyBase/Berkeley Drosophila Genome Project. Gene Ontology Consortium. Having it all. Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features
E N D
Pre-SIG Genome AnnotationDatabase Operations Suzanna Lewis FlyBase/Berkeley Drosophila Genome Project Gene Ontology Consortium
Having it all • Complete: every occurrence is found • Precise: every occurrence is accurate • Comprehensive: all types of features • Richly described: biological functional data
Having it all • Complete: every occurrence is found • Precise: every occurrence is accurate • Comprehensive: all types of features • Richly described: biological functional data
Contradictions and Complications • Assembly errors • Missed gene merges • Missed gene splits • Complex adjustments • Dicistronic genes • Overlaps and intersections
Having it all • Complete: every occurrence is found • Precise: every occurrence is accurate • Comprehensive: all types of features • Richly described: biological functional data deleted: 41 new: 179 merges: 31 splits: 26 reinstated: 32
Broad Institute TIGR JGI Baylor College of Medicine Washington University FlyBase Ensembl GMOD contributors meeting March 2004
The Essentials • Visualization and manual editors
The Essentials • Visualization and manual editors • Combiners
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences • High-quality assemblies
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences • High-quality assemblies • Annotation standards and verification
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences • High-quality assemblies • Annotation standards and verification • Evidence tracking and versioning
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences • High-quality assemblies • Annotation standards and verification • Evidence tracking and versioning • Open source software components and standards are critical to long term success
The Essentials • Visualization and manual editors • Combiners • Full-length cDNA sequences • High-quality assemblies • Annotation standards and verification • Evidence tracking and versioning • Open source software components and standards are critical to long term success
Annotation verification • Community input • On-line error reporting • Curation of the literature • Confirmation by comparison to cDNA sequences
SWISSPROT Comparison • Perfect match • 100% identity over 100% of lengths • Single AA substitutions • 99% identity over 100% of lengths • The above account for 75% of all genes with a SWISSPROT cognate (2,771 out of 3,687).
SWISSPROT Comparison • Significant mismatch • spans of 40 residues or 20% peptide length, with at least 97% sequence identity • No match • poor or empty matches • These remaining differences were due to lingering annotation errors or errors in the reported DNA sequence from SWISSPROT
Having it all • Complete: every occurrence is found • Precise: every occurrence is accurate • Comprehensive: all types of features • Richly described: biological functional data
Protein-coding 13,410 Annotation of all types of features tRNA 291 microRNA 23 snRNA 32 snoRNA 29 Pseudogenes 19 Non-coding RNA 36 Transposons 1,572 Promoters (thousands) TSS (thousands) P element Insertions (thousands) Total 15,412
Having it all • Complete: every occurrence is found • Precise: every occurrence is accurate • Comprehensive: all types of features • Richly described: biological functional data
How to find what you need By database ID? By function? By name? FlyBase SGD MGI Cappuccino BNI1 Formin 2 FBgn0000256 S0005215 MGI:1859252 Actin binding Actin binding Actin binding
But in 1998 there was a problem… • None of the organism databases used standard terminology to describe biological function.
For example Protein synthesis translation You want all gene products that are involved in bacterial protein synthesis, But the sequences are significantly different from those in humans. • It will be difficult for you—and even harder for a computer—to find functionally equivalent gene products.
How to best describe biology? • Natural language • Highly expressive • Ambiguous in meaning • Hard to compute on • Structured representation • Limited in expressivity • Precise • May be computed on We needed to find a middle ground, that supports and enables both.
The aims of GO • To develop comprehensive shared vocabularies. • Use the vocabularies to describe the gene products held in different databases. • To provide access to the vocabularies, the annotations, and associated data. • To provide software tools to assist biological researchers.
The early key decisions • The vocabulary itself requires a serious and ongoing effort. • Carefully define every concept • Initially keep things as simple as possible and only use a minimally sufficient data representation. • Focus initially on molecular aspects that are shared between many organisms.
A sequence is not equal to a gene • Physically a gene is composed of sequences. • DNA, RNA, and protein • Different strains, ESTs, cDNAs, alleles… • A fully characterized gene has multiple sequence references
GO is NOT a gene nomenclature system • Communities decide upon the ‘official’ gene name or symbol and their community databases maintain these data. • Sequence repositories (I.e. Genbank/EMBL/DDBJ/SwissProt) provide sequence identifiers and protein names • Proteins may be named differently than genes • e.g. HUGO and UniProt IDs
GO encompasses descriptions for all functional molecular entities • A gene product may be either a functional RNA or a protein • Protein • tRNA • miRNA • snRNA • rRNA • …
The breakdown of work • Task 1 • Building the ontology: a computable description of the biological world • Task 2 • Describing your gene—annotation • Protein structure • Phenotype • Expression data • Function, process, localization…
Vocabulary and relationships Collect what is known from the literature Look up concept to accurately express biology Your gene product Choose approved name and synonyms Refer to representative sequences Gene nomenclature decisions Sequence DB
GO databases: distributed and centralized • Support cross-database queries • By having a mutual understanding of the definition and meaning of any word used to describe a gene product • Provide database access to a common repository of annotations • By submitting a summary of gene products that have been annotated
If we build it… • What is a term? • Definition of term concepts • How to represent and manage the concepts • Biological scope • Annotation
What is a ‘term’? • Must have a stable ID • May have synonyms • Have relationships to other terms • Can be made obsolete • Can be split or merged • Must have a definition
Definitions • Purpose is to remove ambiguity of interpretation and alternate meanings • All definitions are supported by cross-references to the source(s)
Annotating gene products • Expert curation accepted from any group that can provide an ID and evidence • Each annotation must be supported by evidence, including a cross-reference • http://www.geneontology.org/GO.evidence.html • Gene can be annotated with multiple terms • Annotation is at finest possible granularity • Guidelines • http://www.geneontology.org/doc/GO.usage.html
GO functional analysis • Sequence similarity • Literature harvesting • Motif analysis • Expression studies • Interaction studies
Current work • Low coverage genome sequencing • Multiple species genome comparisons • SO—the Sequence Ontology
e.g. What is a pseudogene? • Human • Sequence similar to known protein but contains frameshift(s) and/or stop codons which disrupts the ORF. • Neisseria • A gene that is inactive - but may be activated by translocation (e.g. by gene conversion) to a new chromosome site. • - note some would call such a gene a “cassette” in yeast.
SO is useful if you want to: • Annotate sequence using consistent terminology for the same features across genomes. • Enable practical querying and comparisons between sequence databases. • Describe and propagate features at all levels of the sequence from genomic to mature protein.
Thank You to…www.flybase.org • Curators-Berkeley • Sima Misra, Josh Kaminker, Simon Prochnik, Chris Smith, Jon Tupy • Curators-Harvard • Lynn Crosby, Bev Matthews, Kathy Campbell, Pavel Hradecky, Yanmei Huang, Leyla Bayraktaroglu • Curators-Cambridge • Gillian Millburn, Rachel Drysdale, Chihiro Yamada • Curators-SWISSPROT • Eleanor Whitfield Software-Berkeley Chris Mungall, Ben Berman, Joe Carlson, Mark Gibson, Nomi Harris, George Hartzell, Brad Marshall, John Richter, ShengQiang Shu Software-Harvard David Emmert Software-Cambridge Aubrey de Grey Software-Ensembl Michelle Clamp, Vivek Iyer, Steve Searle
and Thanks to…www.geneontology.org David Hill Joel Richardson GO Curators Michael Ashburner Judith Blake J. Michael Cherry • Christopher Mungall • John Day-Richter • Brad Marshall • Karen Eilbeck • Mark Yandell • George Hartzell