Explore how Gene Ontology (GO) standardizes gene product descriptions, aiding efficient data retrieval and analysis in biological research fields. From inception to present day, understand GO's significance in unifying diverse databases and promoting consistent terminology. Dive into GO's structured vocabularies which categorize gene products based on molecular functions, biological processes, and cellular components, enabling precise queries and annotations across various organisms. Discover the principles guiding GO’s organization and its pivotal role in characterizing gene behaviors comprehensively within a cellular context.
Biologists waste time searching for all available information about each small area of research. The search is hampered further by variations in the terminology in common use at any given time, which inhibit effective searching by computers as well as people. E.g., in a search for new targets for antibiotics, you want all gene products involved in bacterial protein synthesis that have significantly different sequence or structure from those in humans. If one DB says these molecules are involved in 'translation' and another uses 'protein synthesis', it is difficult for you - and even harder for a computer - to find functionally equivalent terms. GO is an effort to address the need for consistent descriptions of gene products in different DBs. The project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include several of the world's major repositories for plant, animal and microbial genomes. See the GO web page for a full list of member organizations. VERTIGO (Vertical Gene Ontology)
GO has three structured, controlled vocabularies (ontologies) describing gene products (the RNA or protein resulting after transcription) by their species-independent, associated biological processes (BP), cellular components (CC) and molecular functions (MF). There are three separate aspects to this effort. The GO consortium 1. writes and maintains the ontologies themselves; 2. makes associations between the ontologies and genes / gene products in the collaborating DBs; 3. develops tools that facilitate the creation, maintenance and use of ontologies. The use of GO terms by several collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that you can query them at different levels: e.g., 1. use GO to find all gene products in the mouse genome that are involved in signal transduction, 2. zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product.
GO is not a database of gene sequences or a catalog of gene products. GO describes how gene products behave in a cellular context. GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not sufficient. Reasons include: Knowledge changes and updates lag behind. Curators evaluate data differently: it is not enough to agree to use the word 'kinase'; we must also state how and why we use 'kinase', and apply it consistently. Only in this way can we hope to compare gene products and determine whether they are related. GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO. GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus. The 3 organizing GO principles: molecular function, biological process, cellular component. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. E.g., the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.
The three organizing principles of the GO (Molecular Function): Molecular function describes e.g., catalytic or binding activities, at the molecular level. GO molecular function terms represent activities rather than the entities (molecules / complexes) that perform actions, and do not specify where or when, or in what context, the action takes place. Molecular functions correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; Examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding. It is easy to confuse a gene product with its molecular function, and for that reason many GO molecular functions are appended with the word "activity". The documentation on gene products explains this confusion in more depth.
Organizing GO principles (Biological Process; Cellular Component): A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cellular physiological process or signal transduction. Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step. A biological process is not equivalent to a pathway. We are specifically not capturing or trying to represent any of the dynamics or dependencies that would be required to describe a pathway. A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
What does the Ontology look like? GO terms are organized in structures called directed acyclic graphs (DAGs), which differ from hierarchies in that a child (more specialized term) can have many parents (less specialized terms). For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the true path rule: if the child term describes the gene product, then all its parent terms must also apply to that gene product.
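The automatic annotation described above can be sketched in a few lines of Python. This is only an illustration, not how any GO tool is implemented: the dictionary PARENTS, the helper names, and the extra grandparent term monosaccharide metabolism are assumptions made for the example.

```python
# Toy fragment of the process ontology: child term -> set of parent terms.
# "monosaccharide metabolism" is added here only to show a second level.
PARENTS = {
    "hexose biosynthesis": {"hexose metabolism", "monosaccharide biosynthesis"},
    "hexose metabolism": {"monosaccharide metabolism"},
    "monosaccharide biosynthesis": {"monosaccharide metabolism"},
    "monosaccharide metabolism": set(),
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_terms(direct_term, parents=PARENTS):
    """True path rule: an annotation to a term also applies to all ancestors."""
    return {direct_term} | ancestors(direct_term, parents)


if __name__ == "__main__":
    for term in sorted(implied_terms("hexose biosynthesis")):
        print(term)
```

Because a DAG allows several parents, the walk keeps a set of visited terms so that a shared ancestor is reported only once.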
It is easy to confuse a gene product and its molecular function, because very often these are described in exactly the same words. For example, 'alcohol dehydrogenase' can describe what you can put in an Eppendorf tube (the gene product) or it can describe the function of this stuff. There is, however, a formal difference: a single gene product might have several molecular functions, and many gene products can share a single molecular function. For example, there are many gene products that have the function 'alcohol dehydrogenase'. Some, but by no means all, of these are encoded by genes with the name alcohol dehydrogenase. A particular gene product might have both the functions 'alcohol dehydrogenase' and 'acetaldehyde dismutase', and perhaps other functions as well. It's important to grasp that, whenever we use terms such as alcohol dehydrogenase activity in GO, we mean the function, not the entity; for this reason, most GO molecular function terms are appended with the word 'activity'. Many gene products associate into entities that function as complexes, or 'gene product groups', which often include small molecules. They range in complexity from the relatively simple (for example, hemoglobin contains the gene products alpha-globin and beta-globin, and the small molecule heme) to complex assemblies of numerous different gene products, e.g., the ribosome. At present, small molecules are not represented in GO. In the future, we might be able to create cross products by linking GO to existing databases of small molecules such as Klotho , LIGAND
How do the terms in GO become associated with their appropriate gene products? Collaborating databases annotate their gene products (or genes) with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information can be found in the GO Annotation Guide. If you browse any of the contributing databases, you'll find that each gene or gene product has a list of associated GO terms. Each database also publishes a table of these associations, and these are freely available from the GO ftp site. You can also browse the ontologies using a range of web-based browsers. A full list of these, and other tools for analyzing gene function using GO, is available on the GO Tools page . In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view of gene functions. Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis or reproduction. See the GO Slim Guide for more information.
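As a rough sketch of the GO slim idea, the Python below maps each gene's annotations up to a small set of high-level terms and reports what proportion of the gene set falls under each. The gene names, the term receptor tyrosine kinase signaling, the tiny ontology fragment and the function names are all hypothetical; real GO slims are distributed as ontology files.

```python
from collections import Counter

# Hypothetical ontology fragment: child term -> set of parent terms.
PARENTS = {
    "receptor tyrosine kinase signaling": {"signal transduction"},
    "hexose biosynthesis": {"biosynthesis"},
    "signal transduction": set(),
    "biosynthesis": set(),
    "reproduction": set(),
}

# The "slim": a handful of high-level terms chosen for the overview.
SLIM = {"signal transduction", "biosynthesis", "reproduction"}

# Hypothetical annotations: gene -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"receptor tyrosine kinase signaling"},
    "geneB": {"hexose biosynthesis", "signal transduction"},
    "geneC": {"reproduction"},
    "geneD": {"hexose biosynthesis"},
}


def with_ancestors(term):
    """Return the term together with all of its ancestors."""
    result = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def slim_terms(terms):
    """Map a gene's detailed terms onto the slim terms they fall under."""
    found = set()
    for term in terms:
        found |= with_ancestors(term) & SLIM
    return found


def slim_proportions(annotations):
    """Fraction of annotated genes that fall under each slim term."""
    counts = Counter()
    for terms in annotations.values():
        counts.update(slim_terms(terms))
    total = len(annotations)
    return {term: counts[term] / total for term in SLIM}


if __name__ == "__main__":
    for term, share in sorted(slim_proportions(ANNOTATIONS).items()):
        print(f"{term}: {share:.0%}")
```

A gene is counted once per slim term, so a gene with several detailed annotations under signal transduction still contributes a single count to that category.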
All data from the GO project is freely available. You can download the ontology data in a number of different formats, including XML and MySQL, from the GO Downloads page. For more information on the syntax of these formats, see the GO File Format Guide. If you need lists of the genes or gene products that have been associated with a particular GO term, the Current Annotations table tracks the number of annotations and provides links to the gene association files for each of the collaborating databases. GO allows us to annotate genes and their products with a limited set of attributes. For example, GO does not allow us to describe genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their involvement in disease. It is not necessary for GO to do these things because other ontologies are being developed for these purposes. The GO consortium supports the development of other ontologies and makes its tools for editing and curating ontologies freely available. A list of freely available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the Open Biomedical Ontologies (OBO) website. A larger list, which includes the ontologies listed at OBO and also other controlled vocabularies that do not fulfil the OBO criteria, is available at the Ontology Working Group page of the Microarray Gene Expression Data Society (MGED).
Cross-products: The existence of several ontologies will also allow us to create 'cross-products' that maximize the utility of each ontology while avoiding redundancy. For example, by combining the developmental terms in the GO process ontology with a second ontology that describes Drosophila anatomical structures, we could create an ontology of fly development. We could repeat this process for other organisms without having to clutter up GO with large numbers of species-specific terms. Similarly, we could create an ontology of biosynthetic pathways by combining biosynthesis terms in the GO process ontology with a chemical ontology. Mappings to other classification systems GO is not the only attempt to build structured controlled vocabularies for genome annotation. Nor is it the only such series of catalogs in current use. We have attempted to make translation tables between these catalogs and GO. We caution that these mappings are neither complete nor exact; they are to be used as a guide. One reason for this is absence of definitions from many of the other catalogs and of a complete set of definitions in GO itself. More information on the syntax of these mappings can be found in the GO File Format Guide. Contributing to GO The GO project is constantly evolving, and we welcome feedback from all users. If you need a new term or definition, or would like to suggest that we reorganize a section of one of the ontologies, please do so through our online request-tracking system, which is hosted by SourceForge.net. Errors or omissions in annotations are reported to GO annotation mailing list. You can also send questions or suggestions to the GOHELP. More information on mailing lists is available from the mailing lists page.
What is a GO term? The purpose of GO is to define particular attributes of gene products. A term is simply the text string used to describe an entry in GO, e.g. cell, fibroblast growth factor receptor binding or signal transduction. A node refers to a term and all its children. GO does not contain the following: Gene products: e.g. cytochrome c is not in GO; attributes of it, e.g., oxidoreductase activity, are. Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene. Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see OBO web site for more information). Protein domains or structural features. Protein-protein interactions.
General conventions when adding terms: The following stylistic points should be applied to all aspects of the ontologies. Spelling conventions: Where there are differences in the accepted spelling between British and US usage, use the US form, e.g. polymerizing, signaling, rather than polymerising, signalling. There is a dictionary of 'words' used in GO terms in the file GODict.DAT. Abbreviations: Avoid abbreviations unless they're self-explanatory. Use full element names, not symbols. Use hydrogen for H+. Use copper and zinc rather than Cu and Zn. Use copper(II), copper(III), etc., rather than cuprous, cupric, etc. For biomolecules, spell out the term in full wherever practical: use fibroblast growth factor, not FGF. Greek symbols: Spell out Greek symbols in full: e.g. alpha, beta, gamma. Case: GO terms are all lower case except where demanded by context, e.g. DNA, not dna. Singular/plural: Use the singular, except where a term is only used in the plural (e.g. caveolae). Be descriptive: Be reasonably descriptive, even at the risk of verbal redundancy. Remember, DBs that refer to GO terms might list only the finest-level terms associated with a particular gene product. If the parent is aromatic amino acid family biosynthesis, then the child should be aromatic amino acid family biosynthesis, anthranilate pathway, not just anthranilate pathway. Anatomical qualifiers: Do not use anatomical qualifiers in the cellular process and molecular function ontologies. For example, GO has the molecular function term DNA-directed DNA polymerase activity but neither nuclear DNA polymerase nor mitochondrial DNA polymerase. These terms with anatomical qualifiers are not necessary because annotators can use the cellular component ontology to attribute location to gene products, independently of process or function.
Synonyms: When several words or phrases could be used as the term name, one form is chosen as the term name whilst the other possible names are added as synonyms. Despite the name, GO synonyms are not always 'synonymous' in the strictest sense of the word, as they do not always mean exactly the same as the term they are attached to. Instead, a GO synonym may be broader or narrower than the term string; it may be a related phrase; it may be alternative wording or spelling, or use a different system of nomenclature; or it may be a true synonym. This flexibility allows GO synonyms to serve as valuable search aids, as well as being useful for applications such as text mining and semantic matching. Having a single, broad relationship between a GO term and its synonyms is adequate for most search purposes, but for other applications such as semantic matching, the inclusion of a more formal relationship set is valuable. Thus, GO records a relationship type for each synonym, stored in the OBO format flat file. Synonym types: The synonym relationship types are: the term is an exact synonym (ornithine cycle is an exact synonym of urea cycle); the terms are related (cytochrome bc1 complex is related to ubiquinol-cytochrome-c reductase activity); the synonym is broader than the term name (cell division is a broad synonym of cytokinesis); the synonym is narrower or more precise (pyrimidine-dimer repair by photolyase is a narrow synonym of photoreactive repair); the synonym is related to, but not exact, broader or narrower (virulence has the synonym type other related to the term pathogenesis).
These types form a loose hierarchy:
related
  [i] exact synonym
  [i] broad synonym
  [i] narrow synonym
  [i] other related synonym
The default relationship is related to, as all synonyms are in some way related to the term name, but more specific relationships are assigned where possible. The synonym type other related is used where the relationship between a term and its synonym is NOT exact, narrower or broader. In some cases, broader and narrower synonyms are created in the place of new parent or child terms because some synonym strings may not be valid GO terms but may still be useful for search purposes. This may be because the synonym is the name of a gene product, e.g. ubiquitin-protein ligase activity has the narrower synonym E3, as E3 is a specific gene product with ubiquitin-protein ligase activity. Adding synonyms: When you add a synonym using DAG-Edit, choose a type from the pull-down selector (see the DAG-Edit user guide for more information). DAG-Edit will incorporate the synonym type into the OBO format flat file when you save. The default synonym type is the broadest, 'synonym' (equivalent to 'related' above). The number of synonyms for a term is not limited, and the same text string can be used for more than one GO term. Add synonyms if you edit a term name but the old name is still a valid synonym; for example, if you change respiration to cellular respiration, keep respiration as a synonym. This helps other users find familiar terms. Add synonyms if the term has (or contains) a commonly used abbreviation. For example, FGF binding could be used as a synonym for fibroblast growth factor binding. Do not add a synonym if the only difference is case (e.g. start vs. START). Synonyms, like term names, are all lower case except where demanded by context (e.g. DNA, not dna). The synonyms found in GO and their relationships to the term string with which they are associated are available as a text file. Details on the file format can be found in the accompanying ReadMe file.
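To make the role of synonym types in searching concrete, here is a minimal Python sketch of a synonym index built from the examples given above. The Scope enum, SYNONYMS list and function names are invented for the illustration, and the data is held in memory rather than read from the OBO flat file.

```python
from collections import defaultdict
from enum import Enum


class Scope(Enum):
    """Synonym relationship types named in the text (all are 'related')."""
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    OTHER_RELATED = "other related"


# (term name, synonym, type) triples taken from the examples in the text.
SYNONYMS = [
    ("urea cycle", "ornithine cycle", Scope.EXACT),
    ("cytokinesis", "cell division", Scope.BROAD),
    ("photoreactive repair", "pyrimidine-dimer repair by photolyase", Scope.NARROW),
    ("pathogenesis", "virulence", Scope.OTHER_RELATED),
]


def build_index(synonyms):
    """Index synonyms case-insensitively; one string may point to many terms."""
    index = defaultdict(list)
    for term, synonym, scope in synonyms:
        index[synonym.casefold()].append((term, scope))
    return index


def lookup(index, text, exact_only=False):
    """Return term names whose synonym matches, optionally exact ones only."""
    hits = index.get(text.casefold(), [])
    if exact_only:
        hits = [hit for hit in hits if hit[1] is Scope.EXACT]
    return [term for term, _ in hits]


if __name__ == "__main__":
    idx = build_index(SYNONYMS)
    print(lookup(idx, "Ornithine Cycle"))
    print(lookup(idx, "cell division", exact_only=True))
```

A loose search would accept any hit, while an application such as semantic matching might pass exact_only=True so that broad or narrow synonyms are not treated as equivalent to the term.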
Synonym continued
Rules For Synonyms:
Acronyms are exactly synonymous with the full name (if the acronym is not used in any other sense elsewhere).
'Jargon' type phrases are exactly synonymous with the full name (if the phrase is not used in any other sense elsewhere).
proton is exactly synonymous with hydrogen in most senses EXCEPT where hydrogen means H2 (i.e. gas).
Include implicit information when making a decision; take into account which ontology the term is in - e.g. an entry term that ends in 'factor' is not synonymous with a molecular function.
ligand is NOT exactly synonymous with binding (ligand is an entity, binding an action).
XXX receptor ligand is NOT exactly synonymous with XXX (XXX is only one of the potential ligands, so XXX receptor ligand is broader than XXX).
XXX complex is NOT exactly synonymous with XXX (XXX is ambiguous - it could describe the activity of XXX).
porter and transporter are NOT exactly synonymous (transporter is broader).
symporter/antiporter and transporter are NOT exactly synonymous (transporter is broader).
Cross-referencing other databases
General database cross references (general dbxrefs) should be used whenever a GO term has an identical meaning to an object in another database. Some examples of common general dbxrefs in GO:
Ontology    Database                            Sample dbxref
Function    Enzyme Commission                   EC:3.5.1.6
Function    Transport Protein Database          TC:2.A.29.10.1
Function    Biocatalysis/Biodegradation DB      UM-BBD_enzymeID:e0310
Function    Biocatalysis/Biodegradation DB      UM-BBD_pathwayID:dcb
Function    MetaCyc Metabolic Pathway DB        MetaCyc:XXXX-RXN
Process     MetaCyc Metabolic Pathway DB        MetaCyc:2ASDEG-PWY
Component   None
The GO.xrf_abbs file is maintained by the BioMOBY project, so to make changes to the file, you need to use their web form.
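The sample dbxrefs above all follow a DB:accession pattern, which a short Python sketch can split and check. The function name and the KNOWN_DBS set are assumptions for this example; the set lists only the abbreviations shown in the samples, whereas the authoritative list lives in the GO.xrf_abbs file.

```python
# Database abbreviations that appear in the sample dbxrefs above.
# A real check would load the abbreviations from GO.xrf_abbs instead.
KNOWN_DBS = {"EC", "TC", "UM-BBD_enzymeID", "UM-BBD_pathwayID", "MetaCyc"}


def parse_dbxref(text):
    """Split 'DB:accession' at the first colon and validate both halves."""
    db, sep, accession = text.partition(":")
    if not sep or not db or not accession:
        raise ValueError(f"not a DB:accession cross reference: {text!r}")
    if db not in KNOWN_DBS:
        raise ValueError(f"unknown database abbreviation: {db!r}")
    return db, accession


if __name__ == "__main__":
    for sample in ("EC:3.5.1.6", "TC:2.A.29.10.1", "MetaCyc:2ASDEG-PWY"):
        print(parse_dbxref(sample))
```

Splitting at the first colon only keeps any later punctuation inside the accession intact.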
Understanding relationships in GO: The GO ontologies are structured as a directed acyclic graph (DAG), which means that a child (more specialized) term can have multiple parents (less specialized terms). This makes GO a powerful system to describe biology, but creates some pitfalls for curators. Keeping the following guidelines in mind should help you to avoid these problems. A child term can have one of two different relationships to its parent(s): is_a or part_of. The same term can have different relationships to different parents; for example, the child 'GO term 3' may be an is_a of the parent 'GO term 1' and a part_of the parent 'GO term 2'. In GO, an is_a relationship means that the term is a subclass of its parent. For example, mitotic cell cycle is_a cell cycle. A subclass should not be confused with an 'instance', which is a specific example. For example, clogs are a subclass or is_a of shoes, while the shoes I have on my feet now are an instance of shoes. GO, like most ontologies, does not use instances. The is_a relationship is transitive, which means that if 'GO term A' is a subclass of 'GO term B', and 'GO term B' is a subclass of 'GO term C', then 'GO term A' is also a subclass of 'GO term C'. For example: Terminal N-glycosylation is a subclass of terminal glycosylation. Terminal glycosylation is a subclass of protein glycosylation. Therefore, terminal N-glycosylation is a subclass of protein glycosylation.
part_of in GO is more complex. There are 4 basic levels of restriction for a part_of relationship: 1st type has no restrictions - no inferences can be made from the relationship between parent and child other than that parent may have child as a part, and the child may or may not be a part of the parent. 2nd type, 'necessarily is_part', means that wherever the child exists, it is as part of the parent. To give a biological example, replication fork is part_of chromosome, so whenever replication fork occurs, it is as part_of chromosome, but chromosome does not necessarily have part replication fork. 3rd type, 'necessarily has_part', is the exact inverse of type two; wherever the parent exists, it has the child as a part, but the child is not necessarily part of the parent. For example, nucleus always has_part chromosome, but chromosome isn't necessarily part_of nucleus. 4th type, is a combination of both two and three, 'has_part' and 'is_part'. An example of this is nuclear membrane is part_of nucleus. So nucleus always has_part nuclear membrane, and nuclear membrane is always part_of nucleus. The part_of relationship used in GO is usually type two, 'necessarily is_part'. Note that part_of types 1 and 3 are not used in GO, as they would violate the true path rule. Like is_a, part_of is transitive, so that if 'GO term A' is part_of 'GO term B', and 'GO term B' is part_of 'GO term C', 'GO term A' is part_of 'GO term C': E.g., Laminin-1 is part_of basal lamina. Basal lamina is part_of basement membrane. Laminin-1 is part_of basement membrane. The ontology editing tool DAG-Edit, from version 1.411 on, allows you to specify the necessity of relationships. The part_of relationship used in GO, necessarily is_part, would correspond to part_of, [inverse] necessarily true. For more information, see the DAG-Edit user guide. For info on how these relationships are represented in the GO flat files, see the GO File Format Guide. 
For technical info on the relationships used in GO and OBO, see the OBO relationships ontology.
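The transitivity of is_a and part_of can be illustrated with a small Python sketch that follows edges of one relationship type at a time, using the two worked examples from the text. The EDGES list and the closure function are assumptions made for this illustration and do not reflect how DAG-Edit or the GO flat files store relationships.

```python
# (child, relationship, parent) triples from the examples in the text.
EDGES = [
    ("terminal N-glycosylation", "is_a", "terminal glycosylation"),
    ("terminal glycosylation", "is_a", "protein glycosylation"),
    ("laminin-1", "part_of", "basal lamina"),
    ("basal lamina", "part_of", "basement membrane"),
]


def closure(term, relation, edges=EDGES):
    """Follow edges of a single relationship type upward, transitively."""
    reached = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, rel, parent in edges:
            if child == current and rel == relation and parent not in reached:
                reached.add(parent)
                frontier.append(parent)
    return reached


if __name__ == "__main__":
    print(sorted(closure("terminal N-glycosylation", "is_a")))
    print(sorted(closure("laminin-1", "part_of")))
```

The sketch deliberately keeps the two relationship types apart; how a mixed path of is_a and part_of links should be interpreted is outside what it shows.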
The true path rule states that "the pathway from a child term all the way up to its top-level parent(s) must always be true". One of the implications of this is that the type of part_of relationship used in GO, outlined more fully in the part_of relationship section above, is restricted to those types where a child term must always be part_of its parent. Often, annotating a new gene product reveals relationships in an ontology that break the true path rule, or species specificity becomes a problem. In such cases, the ontology must be restructured by adding more nodes and connecting terms such that any path upwards is true. When a term is added to the ontology, the curator needs to add all of the parents and children of the new term. This becomes clear with an example: consider how chitin metabolism is represented in the process ontology. Chitin metabolism is a part of cuticle synthesis in fly and is also part of cell wall organization in yeast. This was once represented in the process ontology as:
cuticle synthesis
[i]chitin metabolism
cell wall biosynthesis
[i]chitin metabolism
---[i]chitin biosynthesis
---[i]chitin catabolism
The problem with this organization becomes apparent when one tries to annotate a specific gene product from one species. A fly chitin synthase could be annotated to chitin biosynthesis, and appear in a query for genes annotated to cell wall biosynthesis (and its children), which makes no sense because flies don't have cell walls.
This is the revised ontology structure which ensures that the true path rule is not broken:
chitin metabolism
[i]chitin biosynthesis
[i]chitin catabolism
[i]cuticle chitin metabolism
---[i]cuticle chitin biosynthesis
---[i]cuticle chitin catabolism
[i]cell wall chitin metabolism
---[i]cell wall chitin biosynthesis
---[i]cell wall chitin catabolism
The parent chitin metabolism now has the child terms cuticle chitin metabolism and cell wall chitin metabolism, with the appropriate catabolism and synthesis terms beneath them. With this structure, all the daughter terms can be followed up to chitin metabolism, but cuticle chitin metabolism terms do not trace back to cell wall terms, so all the paths are true. In addition, gene products such as chitin synthase can be annotated to nodes of appropriate granularity in both yeast and flies, and queries will yield the expected results.
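The effect of the restructuring on queries can be reproduced with a toy Python model. Everything here is illustrative: the dictionaries, gene labels and function names are invented, and the links from cuticle chitin metabolism to cuticle synthesis and from cell wall chitin metabolism to cell wall biosynthesis are assumptions added so that the same query can be run against both structures.

```python
# Original structure: chitin metabolism sits under both fly and yeast parents.
OLD_PARENTS = {
    "chitin metabolism": {"cuticle synthesis", "cell wall biosynthesis"},
    "chitin biosynthesis": {"chitin metabolism"},
    "chitin catabolism": {"chitin metabolism"},
}

# Revised structure. The links to "cuticle synthesis" and
# "cell wall biosynthesis" are assumptions made for this sketch.
NEW_PARENTS = {
    "chitin biosynthesis": {"chitin metabolism"},
    "chitin catabolism": {"chitin metabolism"},
    "cuticle chitin metabolism": {"chitin metabolism", "cuticle synthesis"},
    "cuticle chitin biosynthesis": {"cuticle chitin metabolism"},
    "cuticle chitin catabolism": {"cuticle chitin metabolism"},
    "cell wall chitin metabolism": {"chitin metabolism", "cell wall biosynthesis"},
    "cell wall chitin biosynthesis": {"cell wall chitin metabolism"},
    "cell wall chitin catabolism": {"cell wall chitin metabolism"},
}

# Hypothetical annotations: gene product -> directly annotated term.
OLD_ANNOTATIONS = {"fly chitin synthase": "chitin biosynthesis"}
NEW_ANNOTATIONS = {
    "fly chitin synthase": "cuticle chitin biosynthesis",
    "yeast chitin synthase": "cell wall chitin biosynthesis",
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_under(query, parents, annotations):
    """Genes annotated to the query term or to any of its descendants."""
    return sorted(
        gene
        for gene, term in annotations.items()
        if term == query or query in ancestors(term, parents)
    )


if __name__ == "__main__":
    print(genes_under("cell wall biosynthesis", OLD_PARENTS, OLD_ANNOTATIONS))
    print(genes_under("cell wall biosynthesis", NEW_PARENTS, NEW_ANNOTATIONS))
```

With the old structure the fly gene product is returned for a cell wall query; with the revised structure it is not, while a query on chitin metabolism still finds both gene products.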
Dependent ontology terms: Some GO terms imply the presence of others. Examples from the process ontology include the following: If either X biosynthesis or X catabolism exists, then the parent X metabolism must also exist. If regulation of X exists, then the process X must also exist. Potentially any process in the ontology can be regulated. Note: X may refer to a phenotype (for example, cell size in regulation of cell size); in these cases, X should not be added to the ontology. Species specificity: GO nodes should aggressively avoid using species-specific definitions. Nevertheless, many functions, processes and components are not common to all life forms. Our current convention is to include any term that can apply to more than one taxonomic class of organism. Within the ontologies, there are cases where a word or phrase has different meanings when applied to different organisms. For example, embryonic development in insects is very different from embryonic development in mammals. Such terms are distinguished from one another by their definitions and by the sensu designation (sensu means 'in the sense of'), as in the term embryonic development (sensu Insecta). Nodes should be divided into sensu sub-trees where the children are or are likely to be different. Using the sensu designation in a term does not exclude that term from being used to annotate species outside that designation; e.g., a 'sensu Drosophila' term might reasonably be used to annotate a mosquito gene product. A GO node should never be more species-specific than any of its children. Child nodes can be at the same level of species specificity as the parent node(s), or more specific. When adding more species-specific nodes, curators should make sure that non-species-specific parents exist (or add them if necessary). E.g., take the process of sporulation.
This occurs in both bacteria and fungi, but bacterial sporulation is quite a different process to fungal sporulation, so we therefore add two children to sporulation, sporulation (sensu Bacteria) and sporulation (sensu Fungi). If we now want to add a term to represent the assembly of the spore wall in fungi, we cannot just add spore wall assembly as a direct child of sporulation (sensu Fungi) as such a term could conceivably refer to the assembly of spore walls in bacteria. We have to name the child term spore wall assembly (sensu Fungi) to ensure that it is as species-specific as the parent term.
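A curator-side check for the sporulation example might look like the Python sketch below. It is a simplification under stated assumptions: it only tests whether a child of a sensu parent carries a sensu designation at all, and it does not compare taxa, which would need a taxonomy. The regular expression and function names are invented for the illustration.

```python
import re

# Matches a trailing "(sensu Taxon)" designation in a term name.
SENSU = re.compile(r"\(sensu ([^)]+)\)$")


def sensu_of(term):
    """Return the taxon named in a sensu designation, or None if absent."""
    match = SENSU.search(term)
    return match.group(1) if match else None


def violations(child_to_parents):
    """List (child, parent) pairs where a sensu parent has a plain child."""
    bad = []
    for child, parents in child_to_parents.items():
        for parent in parents:
            if sensu_of(parent) is not None and sensu_of(child) is None:
                bad.append((child, parent))
    return bad


if __name__ == "__main__":
    proposed = {
        "sporulation (sensu Fungi)": {"sporulation"},
        "spore wall assembly": {"sporulation (sensu Fungi)"},
    }
    print(violations(proposed))
```

Renaming the flagged child to spore wall assembly (sensu Fungi) clears the violation, which matches the naming decision described in the text.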
References and Evidence
Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis. The annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. A simple controlled vocabulary is used to record evidence:
IMP  inferred from mutant phenotype
IGI  inferred from genetic interaction <database:gene_symbol[allele_symbol]>
IPI  inferred from physical interaction [with <database:protein_name>]
ISS  inferred from sequence similarity [with <database:sequence_id>]
IDA  inferred from direct assay
IEP  inferred from expression pattern
IEA  inferred from electronic annotation [with <database:id>]
TAS  traceable author statement
NAS  non-traceable author statement
ND   no biological data available
RCA  inferred from reviewed computational analysis
IC   inferred by curator [from <GO:id>]
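The evidence vocabulary lends itself to a simple lookup table. The Python sketch below encodes the twelve codes listed above and records which of them are shown with a 'with' or 'from' cross reference; the names EVIDENCE_CODES, WITH_OR_FROM and check_evidence are made up for this example, and the rule that other codes take no such cross reference is an assumption read off the list rather than a stated policy.

```python
# Evidence codes and their meanings, as listed in the text.
EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence similarity",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IEA": "inferred from electronic annotation",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
    "RCA": "inferred from reviewed computational analysis",
    "IC": "inferred by curator",
}

# Codes that the list shows with a 'with' or 'from' cross reference.
WITH_OR_FROM = {"IGI", "IPI", "ISS", "IEA", "IC"}


def check_evidence(code, with_field=""):
    """Validate a code (and optional with/from value); return its meaning."""
    if code not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {code!r}")
    if with_field and code not in WITH_OR_FROM:
        # Assumption for this sketch: other codes carry no with/from value.
        raise ValueError(f"{code} is not listed with a with/from field")
    return EVIDENCE_CODES[code]


if __name__ == "__main__":
    print(check_evidence("IDA"))
    print(check_evidence("IC", with_field="GO:0000001"))
```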
Annotation File Format Collaborating databases export to GO a tab delimited file, known informally as a "gene association file" of links between database objects and GO terms. Despite the jargon, the database object may represent a gene or a gene product (transcript or protein). Columns in the file are described below, a table showing the columns in order, with examples, is available. The entry in the DB_Object_ID field (see below) of the association file is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID in DB_Object_ID field). The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (gene symbol, for example). It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated). The object type (gene, transcript, protein, protein_structure, or complex) listed in the DB_Object_Type field MUST match the database entry identified by DB_Object_ID. Note that DB_Object_Type refers to the database entry (i.e. does it represent a gene, protein, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. For example, if your database entry represents a gene, then 'gene' goes in the DB_Object_Type column, even if the annotation is to a component term relevant to the localization of a protein product of the gene. The text entered in the DB_Object_Name and DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. 
For example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.
The flat file format comprises 15 tab-delimited fields. Each entry below states whether the field is mandatory.
DB
  refers to the database contributing the gene_association file; the value must be present in the file of database abbreviations. [Database abbreviations explanation]
  this field is mandatory, cardinality 1
DB_Object_ID
  unique identifier in DB for the item being annotated
  this field is mandatory, cardinality 1
DB_Object_Symbol
  (unique and valid) symbol to which DB_Object_ID is matched
  can use ORF name for an otherwise unnamed gene or protein
  if gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol
  this field is mandatory, cardinality 1
Qualifier
  flags that modify the interpretation of an annotation
  one (or more) of NOT, contributes_to, colocalizes_with
  this field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to)
GOid
  GO identifier for the term attributed to the DB_Object_ID
  this field is mandatory, cardinality 1
DB:Reference
  one or more unique identifiers for a single source cited as an authority for the attribution of the GOid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
  Note that only one reference can be cited on a single line in the gene_association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database. Note that if the model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed ID is also used.
  this field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD:8789|PMID:2676709)
Evidence
  one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA
  this field is mandatory, cardinality 1
With (or) From
  one of: DB:gene_symbol; DB:gene_symbol[allele_symbol]; DB:gene_id; DB:protein_name; DB:sequence_id; GO:GO_id
  this field is not mandatory (except in the case of the IC evidence code), cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. CGSC:pabA|CGSC:pabB)
  Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). For example, it can identify another gene product to which the annotated gene product is similar (ISS) or with which it interacts (IPI). More information on the meaning of 'with/from' column entries is available in the evidence documentation entries for the relevant codes.
Cardinality = 0 is not recommended, but is permitted because cases can be found in the literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with'. Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI, cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).
Note that a gene ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.
'GO:GO_id' is used only when the evidence code is 'IC', and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code IC. The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations. The 'with' column may not be used with the evidence codes IDA, TAS, NAS, or ND.
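The rules stated above for the 'with' column (mandatory for IC, not allowed with IDA, TAS, NAS or ND, GO IDs only with IC, pipe-separated entries) can be written as a small check. This is a sketch under those stated rules only; the function name `check_with_field` and the identifier `GO:0000001` in the usage lines are placeholders, not values taken from any database.

```python
# Rules taken from the text above; names are illustrative only.
WITH_REQUIRED = {"IC"}                       # 'with' is mandatory for IC
WITH_FORBIDDEN = {"IDA", "TAS", "NAS", "ND"}  # 'with' may not be used


def check_with_field(evidence: str, with_field: str) -> list:
    """Return a list of problems found; an empty list means no rule is broken."""
    # Entries are pipe-separated; an empty column means cardinality 0.
    entries = [e for e in with_field.split("|") if e]
    problems = []
    if evidence in WITH_REQUIRED and not entries:
        problems.append("'with' is mandatory for evidence code " + evidence)
    if evidence in WITH_FORBIDDEN and entries:
        problems.append("'with' may not be used with evidence code " + evidence)
    if evidence != "IC" and any(e.startswith("GO:") for e in entries):
        problems.append("GO IDs in 'with' are only used with evidence code IC")
    return problems


if __name__ == "__main__":
    # GO:0000001 is a placeholder identifier used for illustration.
    print(check_with_field("IC", "GO:0000001"))
    print(check_with_field("IDA", "FB:FBgn1111111"))
```

Cardinality 0 for IGI, IPI and ISS is deliberately not reported here, since the text permits it while recommending against it.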
Aspect
  one of P (biological process), F (molecular function) or C (cellular component)
  this field is mandatory, cardinality 1
DB_Object_Name
  name of gene or gene product
  this field is not mandatory, cardinality 0, 1 [white space allowed]
Synonym
  Gene_symbol [or other text]. We strongly recommend that gene synonyms be included in the gene association file, as this aids the searching of GO.
  this field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene)
DB_Object_Type
  what kind of thing is being annotated; one of gene, transcript, protein, protein_structure, complex
  this field is mandatory, cardinality 1
Taxon
  taxonomic identifier(s). For cardinality 1, the ID of the species encoding the gene product. Cardinality 2 is to be used only in conjunction with terms that have the term 'interaction between organisms' as an ancestor: the first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.
  this field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
Date
  date on which the annotation was made; format is YYYYMMDD
  this field is mandatory, cardinality 1
Assigned_by
  the database which made the annotation; one of the values in the table of database abbreviations. [Database abbreviations explanation]
  Used for tracking the source of an individual annotation. The default value is the value entered in column 1 (DB). The value will differ from column 1 for any annotation that is made by one database and incorporated into another.
  this field is mandatory, cardinality 1
Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GOid (where dbname is always GO), DB:Reference, With, Taxon (where dbname is always taxon).
For GO ids, do not repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).
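To make the 15-column layout concrete, here is a minimal sketch of reading one line of a gene association file, assuming the column order and cardinalities given in the field descriptions. The lower-case column names, the function `parse_association_line` and the example record are all made up for illustration; the example values are placeholders, not real annotations.

```python
# Column order follows the field descriptions above; names are illustrative.
COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]
# Fields whose cardinality may exceed 1 (pipe-separated).
MULTI_VALUED = {"qualifier", "reference", "with_from", "synonym", "taxon"}
# Fields described as mandatory.
MANDATORY = {
    "db", "db_object_id", "db_object_symbol", "go_id", "reference",
    "evidence", "aspect", "db_object_type", "taxon", "date", "assigned_by",
}


def parse_association_line(line: str) -> dict:
    """Split one tab-delimited line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected 15 fields, found %d" % len(fields))
    record = {}
    for name, value in zip(COLUMNS, fields):
        if name in MULTI_VALUED:
            record[name] = [v for v in value.split("|") if v]
        else:
            record[name] = value
    # Empty strings and empty lists both count as missing.
    missing = [name for name in COLUMNS if name in MANDATORY and not record[name]]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    if record["go_id"].startswith("GO:GO:"):
        raise ValueError("the 'GO:' prefix must not be repeated")
    return record


# Placeholder record for illustration only (not a real annotation).
EXAMPLE_LINE = "\t".join([
    "SGD", "S0000001", "ACT1", "", "GO:0000000", "SGD:8789|PMID:2676709",
    "IDA", "", "C", "actin", "YFL039C|ABY1|END7|actin gene", "gene",
    "taxon:1", "20020903", "SGD",
])

if __name__ == "__main__":
    print(parse_association_line(EXAMPLE_LINE))
```

Multi-valued columns come back as lists, so a column with cardinality 0 is an empty list rather than an empty string.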
Computational Annotation Methods
This section includes descriptions of automated annotation methods used by participating databases (descriptions have been provided by each group listed).
EBI GOA Electronic Annotation
The large-scale assignment of GO terms to UniProt Knowledgebase entries involves electronic techniques. This strategy exploits existing properties within database entries, including keywords, Enzyme Commission (EC) numbers and cross-references to InterPro (a database of protein motifs), which are manually mapped to GO. SWISS-PROT keyword and InterPro to GO mappings are maintained in-house and shared on the GO home page for local database updates. Electronically combining these mappings with a table of matching UniProt Knowledgebase entries generates a table of associations. For each GOA association, an evidence code is provided that summarizes how the association was made. Associations made electronically are labeled 'inferred from electronic annotation' (IEA).
Evelyn Camon, 2002-09-03
MGI Electronic Annotation Methods
Every object in the MGI databases (markers, seqids, references, etc.) has an MGI: accession ID. See details in GO
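The EBI GOA strategy described above (combining keyword-to-GO mappings with a table of entries to produce IEA associations) can be sketched as a simple join. This is an illustration of the idea, not GOA's actual pipeline; the function name, the entry ID, the keywords and the GO ID below are all placeholders.

```python
def electronic_annotations(entries, keyword2go):
    """Join entry keywords against a keyword-to-GO mapping.

    entries:    dict mapping an entry ID to a set of keywords (hypothetical shape)
    keyword2go: dict mapping a keyword to a list of GO IDs (hypothetical shape)
    Returns a sorted list of (entry_id, go_id, "IEA") tuples without duplicates.
    """
    associations = set()
    for entry_id, keywords in entries.items():
        for keyword in keywords:
            for go_id in keyword2go.get(keyword, []):
                # Every association made this way carries the IEA code.
                associations.add((entry_id, go_id, "IEA"))
    return sorted(associations)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    entries = {"ENTRY_1": {"keyword_a", "keyword_b"}}
    keyword2go = {"keyword_a": ["GO:0000001"], "keyword_c": ["GO:0000002"]}
    print(electronic_annotations(entries, keyword2go))
```

The same join would apply to EC numbers or InterPro cross-references by swapping in the corresponding mapping table.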
TIGR ISS Annotation (Arabidopsis, T. brucei)
For TIGR Arabidopsis or T. brucei annotations using 'Inferred from Sequence Similarity' (ISS) evidence, the reference is usually 'TIGR_Ath1:annotation' for Arabidopsis (author: TIGR Arabidopsis annotation team) and 'TIGR_Tba1:annotation' for T. brucei (author: TIGR Trypanosoma brucei annotation team), which are defined as follows:
name: TIGR annotation based upon multiple sources of similarity evidence
description: TIGR_Ath1:annotation or TIGR_Tba1:annotation denotes a curator's interpretation of a combination of evidence. Our internal software tools present us with a great deal of evidence based on domains, sequence similarities, signal sequences, paralogous proteins, etc. The curator interprets the body of evidence to make a decision about a GO assignment when an external reference is not available. The curator places one or more accessions that informed the decision in the "with" field. This means that we have used many sequence similarity hits, etc., to make our decision. However, we choose only 1-3 pieces of information as "with" information, as it is not practical to enter and submit many entries for each annotation. We also have internal calculations of paralogy and new domains we are identifying which have not yet been published, but which help inform our decisions.