380 likes | 394 Views
Prokaryotic Annotation at TIGR. Michelle Gwinn Giglio June, 2005. Prokaryotic Annotation at TIGR. we work in a high-throughput environment our team of 7 annotators finish 1500-2000 genes per month there is a constant backlog of genomes waiting for manual annotation
E N D
Prokaryotic Annotation at TIGR Michelle Gwinn Giglio June, 2005
Prokaryotic Annotation at TIGR • we work in a high-throughput environment • our team of 7 annotators finish 1500-2000 genes per month • there is a constant backlog of genomes waiting for manual annotation • most of our genomes have little, or no, experimentally characterized proteins • we rely heavily on sequence similarity methods to determine the functions of proteins in our genomes • nearly all of our projects receive complete manual annotation prior to publication (only a very few have been released with automatic annotation)
GO Annotation at TIGR • our manual annotation process is the same whether we add GO terms to our proteins or not • using GO to categorize our proteins allows us to capture information that we have discovered in the manual annotation process that would otherwise be lost • GO offers a system for the unambiguous communication of annotation information in a format amenable to computer searching and easy exchange.
Some History • TIGR always recognized the importance of grouping genes according to the functions and processes in which they were involved. • with the first prokaryotic genome published, H. influenzae, we adapted Monica Riley’s E. coli role categories • We have continued to modify that role scheme and still assign TIGR roles today • In 1998 we recognized that it would be really useful to have a set of role categories that could be used by all species and we started a project in that direction • Also in 1998, I met Michael Ashburner and Suzi Lewis and learned of their efforts with GO, we decided to stop our project and wait to use GO • During 2000-2001, TIGR’s genome V. cholerae was annotated to GO • In 2002 TIGR joined the GO consortium • Currently, TIGR has 11 prokaryotic genomes deposited with the GO repository (and many more with manual GO annotation, waiting internally at TIGR for publication.)
Adding GO Annotation to our system…. • required us all to learn the GO system, its rules, data formats, etc. • required significant changes to our tools and databases for the visualization and storage of GO data • took time, however, there are vastly more resources available today then there were 5 years ago when we were making the shift, when GO was still quite young
The Goal of the Annotation Process • determine the function of the protein if possible • assign annotation to the protein: common name, gene symbol, EC number, TIGR role, GO terms, comments as needed • store evidence for the annotation (something we always did) • annotation should only be as specific as evidence supports, err on the side of undercalling rather than overcalling
How do we determine the functions of the proteins? • The best thing is to do an experiment on the protein - not really possible for us to do • shared sequence implies shared function • we are well aware of cases where one amino acid change results in change of function • all of our functional assignments must be considered putative until experimentally confirmed • collect and evaluate information from many sequence based search and prediction tools • BER (BLAST-extend-repraze) • HMM (Hidden Markov Model) • TMHMM (Transmembrane HMM) • SignalP (Signal peptide) • PROSITE • InterPro • Paralogous families • Genome Properties • The ISS annotations in TIGR data are sequence evaluations performed by us, not from authors in the literature • We use our annotation tool Manatee to view information and make annotations, you will see screen shots in following slides
BER searches • TIGR’s pairwise alignment tool • initial BLAST to collect proteins with any similarity to the search protein • modified Smith-Waterman alignment generated between search protein and each BLAST result • result is a file containing one pairwise alignment for each match protein from the BLAST • view alignments in our Manatee annotation tool • we do the 2-step process because BLAST is fast and Smith-Waterman is slow, so it saves CPU time to only do the Smith-Waterman alignments on things that have any hope of matching
BER in pictures genome.pep niaa (non-identical amino acid) BLAST mini-db for protein #1 mini-db for protein #2 mini-db for protein #3000 mini-db for protein #3 ... , , Significant hits put into mini-dbs for each protein modified Smith- Waterman Alignment File of pairwise alignments
Are all matches with equal alignment quality of equal value to annotation? • NO! • we want to see matches of our genome proteins to proteins from other species which have been experimentally characterized in that other species • only such “characterized matches” can be used as evidence for functional annotation • to help in our annotation process we have created a database storing accessions of proteins known to be experimentally characterized (does not contain all such proteins, but we add to it constantly) • our tools highlight experimentally characterized proteins to help annotators see them
HMMs • Hidden Markov Model • statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed) which share sequence and functional similarity • at TIGR, each HMM is assigned to a category (called “isology type”) which describes the type of relationship the proteins in the model have to each other • equivalog • superfamily • subfamily • domain • one can search proteins against HMMs, they receive a score indicating how well they match the model • by comparing this score to the cutoff scores assigned to each model, one can determine whether or not the search protein is a member of the group defined by the HMM • “trusted cutoff’ - proteins scoring above this score are considered a member of the group defined by the HMM • “noise cutoff” - proteins scoring below this score are considered NOT to be a member of the group defined by the HMM • for proteins scoring between trusted and noise, the HMM evidence is not sufficient to determine whether the protein is a member of the functional group or not
Annotation is attached to HMMs • TIGR00433 • isology: equivalog • name: biotin synthase • EC: 2.8.1.6 • gene symbol: bioB • TIGR role: 77 (Biotin biosynthesis) • GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis) • PF04055 • isology: domain • name: radical SAM domain protein • EC: not applicable • gene symbol: not applicable • TIGR role: 703 (enzymes of unknown specificity) • GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)
Things to ask yourself when using HMMs • Does my protein score above the trusted cutoff? • What isology type is the HMM? • What annotation on the HMM can I use for my protein?
Genome Properties • Used to get “the big picture” of an organism. Specifically to record and/or predict the presence/absence of: • metabolic pathways • biotin biosynthesis • cellular structures • outer membrane • traits • anaerobic vs. aerobic • optimal growth temperature • Particular property has a given “state” in each organism, for example: • YES - the property is definitely present • NO - the property is definitely not present • Some evidence - the property may be present and more investigation is required to make a determination • The state of some properties can be determined computationally • metabolic pathway • the property is defined be several reaction steps which are modeled by HMMs • HMM matches to steps in pathway indicate that the organism has the property • Other property’s states must be entered manually (growth temp, anaerobic/aerobic, etc.) • data for a particular genome viewable in Manatee • links from HMM section on the Gene Curation Page • links from gene list for role category • entire list of properties and states can be viewed • Searchable across genomes on the Comprehensive Microbial Resource (CMR) site
Goals • assign annotation to each protein • name, gene symbol, EC number, TIGR role, GO terms • confirm coordinates of gene • avoid transitive annotation
AutoAnnotate • computationally gives preliminary annotation to each protein • adds GO terms with IEA • from HMM match • from BER match • AutoAnnotate designed for a system in which all annotations are manually reviewed • If automatic annotation was the endpoint for our projects, we would have to change AutoAnnotate to be more strict and conservative in its decisions
Knowledge about function reflected in specificity of protein names • high confidence - • “adenylosuccinate lyase”, purB, 4.3.2.2 • general function, lacks specificity • “carbohydrate kinase, FGGY family • no gene symbol, partial EC number • family designation • “Cbby family protein” • homolog designation • “recA homolog” • hypotheticals • “hypothetical protein” • “conserved hypothetical protein” • “putative recA” • used sparingly
Knowledge about function reflected in specificity of GO terms Sample GO trees Function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity glucokinase activity fructokinase activity Process metabolism carbohydrate metabolism monosaccharide metabolism hexose metabolism glucose metabolism fructose metabolism pentose metabolism ribose metabolism available evidence for 3 genes #1 -HMM for ribokinase -match to an experimentally characterized ribokinase #2 -HMM for kinase -match to experimentally characterized glucokinase and fructokinase #3 -HMM for kinase
translation disruptions • authentic frameshift • authentic point mutation • degenerate • truncation • deletion • insertion • interruption • fusion • fragment Get GO terms No GO terms TIGR role 270 - “disrupted reading frame”
Assigning GO terms • Once we have found out all that we can about a protein, we assign GO terms to describe the protein • things that facilitate finding a term • fast/easy ontology search tools • tools that make term suggestions • tools that format the evidence for you • tools that reduce copy/paste/typing as much as possible
Tools that suggest terms • Mapping files • ec2go • tigrfams2go • interpro2go • Manatee suggestions • Matches to V. cholerae, B. anthracis • Genome Properties • HMMs • Automated assignments • From HMMs and good pairwise matches • Viewed as suggestions, not final annotation
Our Manatee Tool • Prevents assignment of GO terms that are non-existent or obsolete • Knows the correct format for the evidence fields • Allows addition of terms and evidence with one click • Uses correct abbreviations • Rarely a need to copy and paste • In many cases the term you need is suggested on the page somewhere already
Clicking on the various GO suggestions around the Manatee Gene Curation page puts the correct info into the correct fields in the correct format without the need to copy and paste.
Searching for terms in Manatee • Searches of ontologies • go_id search (returns tree, term info) • GO term keyword search (searches synonyms too) • Searches of annotations • Protein name keyword search • go_id search (returns lists of proteins assigned that term) • Correlations (input a go_id and receive a list of terms assigned in conjunction with input term and the percent of occurrence of each correlation) • EC number search (input EC #, return go_id) • GO BLAST page (searches all proteins annotated to GO)
Keeping up with GO content • TIGR downloads the newest version of the ontologies nightly into our db for use by our tools • Periodically we check our annotations for the presence of obsolete or secondary ids and we send updates
Changing GO content • TIGR has been contributing requests for ontology content changes continuously since we joined GO (close to 200 submissions) • The SourceForge submission system works very well • Most requests are handled within a few days, some more complicated things may take a few weeks, the rare really complicated thing may take a few months (again, that’s very rare, see PAMGO example) • Initially there were some aspects of the ontologies that were incorrect for proks (ex. ATPsynthase), these have been fixed as they were discovered.
Future directions • Develop prok GO slim • Use it where we now use TIGR roles • Cease use of TIGR roles • Add more functionality to CMR GO tools • More refined searches • Search across all TIGR GO data, not just prokaryotic • Use accumulated prok GO data more effectively to predict annotations for new proteins
More about Manatee • TIGR’s main manual annotation tool • web based • Displays all known information about a protein • interface for entry of annotation information into the database • open source, freely available on SourceForge for downloading and local use (manatee.sourceforge.net) • TIGR offers a hands on 3-day annotation course, 4 times per year which details our annotation process, the use of Manatee, installation of Manatee, and the use of the CMR • Taught by Michelle Gwinn Giglio, Tanja Davidson, and Todd Creasy • Next class June 28-30, Aug. 23-25
Annotation Engine • clients send us a DNA sequence • we run our entire pipeline up to the point where manual curation starts • we return a MySQL database and associated files with all of the data so the client can do manual annotation of the genome • the client can install Manatee locally and run it using the MySQL database • the data is kept completely confidential if that is the desire of the client • this service allows researchers access to TIGR’s infrastructure and tools, saving the need to expend the time and expense (which they might not have) to create infrastructure of their own
It’s all a team effort • Owen White • Prokaryotic annotation team • Bill Nelson, Bob Dodson, Scott Durkin, Sean Daugherty, Ramana Madupu, Lauren Brinkac, Steven Sullivan, Sagar Kothari • Todd Creasy, Tanja Davidson • Eukaryotic annotation team • All of our tool developers • GO group (last but definitely not least)