Automatic annotation of biological sequences: ontologies and pipeline systems
Arthur Gruber
Instituto de Ciências Biomédicas, Universidade de São Paulo
AG-ICB-USP
Sequence annotation
• Annotation is the process of adding information to a DNA sequence.
• The information usually has DNA coordinates.
• Features can be repeats, genes, promoters, protein domains, and so on.
• Features can be linked to other databases, e.g. Pfam/PubMed
Public databases
• GenBank, EMBL and DDBJ.
• All databases update each other automatically

Feature table
• http://www.ncbi.nlm.nih.gov/projects/collab/FT/
• Format definition
• Covers DDBJ/EMBL/GenBank
• Defines all accepted annotation terms and their hierarchy

Annotation file
Contains:
• A header with:
  • Information about the sequence
  • Organism
  • Authors
  • References
  • Comments
• A feature table containing:
  • Sequence features and coordinates
Header (EMBL)
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
SV   AL031747.8
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
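The two-letter line codes in the EMBL header lend themselves to simple parsing. The following is a minimal sketch, not an official parser: the function name `parse_embl_header` is hypothetical, and it assumes each header line starts with a two-letter code followed by spaces, with `XX` used only as a spacer.

```python
from collections import defaultdict


def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code.

    Hypothetical helper for illustration; it ignores XX spacer lines
    and keeps repeated codes (e.g. DT, RA, RL) in order of appearance.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) != 2 or code == "XX":
            continue  # skip spacers and anything that is not a line code
        fields[code].append(value.strip())
    return dict(fields)


# A shortened copy of the header shown on the slide.
header = """\
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""

fields = parse_embl_header(header)
```

Here `fields["DT"]` holds both date lines and `fields["AC"]` holds the accession line, which is usually enough to index a collection of entries by accession.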
NCBI Header
LOCUS       PFMAL1P4   66442 bp   DNA   linear   INV 02-DEC-2004
DEFINITION  Plasmodium falciparum DNA from MAL1P4, complete sequence.
ACCESSION   AL031747 AL844501
VERSION     AL031747.9  GI:23477012
KEYWORDS    HTG; rifin; telomere; var; var-like hypothetical protein.
SOURCE      Plasmodium falciparum 3D7
  ORGANISM  Plasmodium falciparum 3D7
            Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
REFERENCE   1
  AUTHORS   Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D.,
  TITLE     Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13
  JOURNAL   Nature 419 (6906), 527-531 (2002)
   PUBMED   12368867
REFERENCE   2
  AUTHORS   Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C., Harris,B.,
            Harris,D., Lawson,D., Quail,M., Rajandream,M., Hall,N. and Barrell,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-SEP-1998) P.falciparum Genome Sequencing Consortium,
            The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge
            CB10 1SA, UK
COMMENT     On Oct 2, 2002 this sequence version replaced gi:7670004. For more
            information about this sequence or the Malaria Project, see
            http://www.sanger.ac.uk/Projects/P_falciparum.
Feature
• A region of DNA that has been annotated with a key and qualifiers
• Keys: CDS, intron, miscellaneous, etc.
• Qualifier: notes or extra information about a feature, e.g. exon (key) with /gene="adh" (qualifier)
Feature keys
attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal
Feature qualifier
Additional information about a feature
/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label
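Because every qualifier follows the `/name=value` pattern (or a bare `/name` flag such as `/pseudo`), a feature's qualifiers can be collected into a dictionary. This is a rough sketch under simplifying assumptions: the regular expression and the function name `parse_qualifiers` are illustrative only, and the sketch does not handle escaped quotes or the compound values used by `/transl_except` and `/codon`.

```python
import re

# /name            -> flag qualifier (e.g. /pseudo)
# /name="text"     -> quoted value, may contain spaces and slashes
# /name=value      -> unquoted value up to the next whitespace
QUALIFIER_RE = re.compile(r'/(\w+)(?:=("[^"]*"|\S+))?')


def parse_qualifiers(text):
    """Return {qualifier name: [values]}; flag qualifiers map to [True]."""
    qualifiers = {}
    for name, value in QUALIFIER_RE.findall(text):
        parsed = True if value == "" else value.strip('"')
        qualifiers.setdefault(name, []).append(parsed)
    return qualifiers


example = (
    '/gene="MAL1P4.01" /codon_start=1 /pseudo '
    '/db_xref="GI:7670005" /db_xref="UniProtKB/TrEMBL:Q9NFB6"'
)
quals = parse_qualifiers(example)
```

Values are kept as lists because a qualifier such as `/db_xref` may legitimately appear more than once on the same feature.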
Features (EMBL)
Features (NCBI)
FEATURES             Location/Qualifiers
     source          1..66442
                     /organism="Plasmodium falciparum 3D7"
                     /mol_type="genomic DNA"
                     /isolate="3D7"
                     /db_xref="taxon:36329"
                     /chromosome="1"
     repeat_region   1..583
                     /note="telomeric repeat"
     repeat_region   584..1641
                     /note="14bp repeat"
     gene            join(29733..34985,36111..37349)
                     /gene="MAL1P4.01"
                     /note="synonyms: PFA0005w, VAR"
     CDS             join(29733..34985,36111..37349)
                     /gene="MAL1P4.01"
                     /note="Subtelomeric var gene Pfam hit to PF03011 Similar to
                     Plasmodium falciparum VaR, mal1p4.01 vaR SWALL:Q9NFB6
                     (EMBL:AL031747) (2163 aa) fasta scores: E(): 0, 100% id in
                     2163 aa"
                     /codon_start=1
                     /product="erythrocyte membrane protein 1 (PfEMP1)"
                     /protein_id="CAB89209.1"
                     /db_xref="GI:7670005"
                     /db_xref="GOA:Q9NFB6"
                     /db_xref="UniProtKB/TrEMBL:Q9NFB6"
                     /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY
                     KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNK
                     KIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ
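The location strings in the feature table, such as `join(29733..34985,36111..37349)`, can be turned into coordinate pairs with a few lines of code. This is a simplified sketch: the function names are hypothetical, and it deliberately ignores `complement(...)`, partial markers (`<`, `>`) and single-base positions, all of which the full feature table format allows.

```python
import re


def parse_location(location):
    """Return (start, end) pairs, 1-based and inclusive, in written order."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of bases covered by all segments of the location."""
    return sum(end - start + 1 for start, end in parse_location(location))


cds_location = "join(29733..34985,36111..37349)"
segments = parse_location(cds_location)
cds_length = spliced_length(cds_location)
```

For the CDS on the slide the two exons add up to 6492 bases, i.e. 2164 codons, which agrees with the 2163 aa quoted in the /note once the stop codon is counted.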
CDS features
• CDS stands for coding sequence and is used to denote genes and pseudogenes.
• These features are automatically translated on submission and the protein is added to the protein databases.
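The automatic translation of a CDS can be illustrated with a short sketch. It assumes the standard genetic code only (a real pipeline would honour `/transl_table` and `/transl_except`), the function name `translate_cds` is hypothetical, and the example DNA string is made up so that it encodes the first residues of the translation shown on the NCBI features slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna, codon_start=1):
    """Translate a spliced CDS; codon_start is 1-based like /codon_start."""
    dna = dna.upper()[codon_start - 1:]
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    protein = "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X marks ambiguous codons
        for i in range(0, usable, 3)
    )
    # Remove a single terminal stop, as protein database entries do.
    return protein[:-1] if protein.endswith("*") else protein


# Hypothetical sequence chosen for illustration only.
peptide = translate_cds("ATGGTTACTCAATCTTAA")
```

The `codon_start` argument mirrors the `/codon_start` qualifier, so a partial CDS that begins mid-codon can be translated in the correct frame.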
/note
• The note field contains all the evidence for a gene call, plus anything else:
  • Similarity (fasta or blast)
  • Domain/motif information (Pfam, TMHMM, etc.)
  • Unusual features (repeats, aa richness)
/product
• The name of the gene product, e.g. alcohol dehydrogenase
• Unless there is proof, we must qualify the name:
  • Putative
  • Possible
• Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein"
• This is the only piece of annotation added to the protein databases.
Naming protocols
• Hypothetical protein: unknown function and no homology
• Conserved hypothetical protein: unknown function WITH homology
• Alcohol dehydrogenase like: looks a bit like it, but may not be
• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase
• Alcohol dehydrogenase: previously characterised and shown to be alcohol dehydrogenase in this organism
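The naming ladder above can be written down as a small lookup, which is how an annotation pipeline might apply it consistently. This is only a sketch: the evidence labels (`"none"`, `"homology_unknown_function"`, and so on) and the function name `product_name` are invented for illustration, and deciding which level a given hit falls into remains the annotator's judgement.

```python
def product_name(function, evidence):
    """Build a conservative /product name from a function and evidence level.

    `evidence` is one of the hypothetical labels below, ordered from
    weakest to strongest support.
    """
    if evidence == "none":
        return "hypothetical protein"
    if evidence == "homology_unknown_function":
        return "conserved hypothetical protein"
    if evidence == "weak_similarity":
        return f"{function}-like protein"
    if evidence == "strong_similarity":
        return f"putative {function}"
    if evidence == "characterised":
        return function
    raise ValueError(f"unknown evidence level: {evidence!r}")


names = [
    product_name("alcohol dehydrogenase", level)
    for level in (
        "none",
        "homology_unknown_function",
        "weak_similarity",
        "strong_similarity",
        "characterised",
    )
]
```

Raising an error for an unrecognised level, rather than falling back to the bare function name, keeps the sketch on the conservative side.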
/gene
• The gene name, e.g. ADH1
• Only transfer a gene name if it is meaningful
• Never transfer a gene name like PfB0024.
• Is it a gene family? Make sure two genes have the same name.

Transitive Annotation
• AKA annotation catastrophe
• Junk in = Junk out
• Mis-annotations spread through incorrect database submissions.
How can we standardize the annotation terms?
Through a dynamic controlled vocabulary
So what does that mean?
From a practical view, an ontology is a representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things (for example, "is part of").
Ontology Structure
Directed Acyclic Graph (DAG): multiple parentage allowed
[Figure: DAG with the nodes cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane]
GO topology
• The ontologies are structured as directed acyclic graphs
• These are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).
• For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process.
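A DAG with multiple parentage is easy to represent as a child-to-parents mapping. The sketch below uses the hexose example from the slide; the shared grandparent term, monosaccharide metabolic process, is added here as an assumption to make the graph non-trivial, and the function name `ancestors` is illustrative.

```python
# Child term -> list of parent terms (is_a links only).
# The last two levels are an assumption added for illustration.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:  # a DAG can reach one term by several paths
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


hexose_ancestors = ancestors("hexose biosynthetic process")
```

The `found` set is what distinguishes DAG traversal from tree traversal: the shared grandparent is reached through both parents but reported once.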
True Path Violations Create Incorrect Definitions
"...the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Figure: chromosome part_of nucleus]

True Path Violations
"...the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Figure: mitochondrial chromosome is_a chromosome]

True Path Violations
"...the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Figure: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus]
A mitochondrial chromosome is not part of a nucleus!

True Path Violations
"...the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Figure: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion]
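The chromosome example can be checked mechanically by asking what a term is implied to be part of once its is_a parents are taken into account. This is a minimal sketch, not how the GO tools themselves work: the function name `inferred_part_of` is hypothetical, and it applies only one inference rule (if X is_a Y and Y part_of Z, then X part_of Z).

```python
def inferred_part_of(term, is_a, part_of):
    """Wholes that `term` is implied to be part of.

    Walks up the is_a links from `term` and collects the part_of
    targets of every class met on the way.
    """
    wholes, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        wholes.update(part_of.get(current, ()))
        stack.extend(is_a.get(current, ()))
    return wholes


# Structure that violates the true path rule.
bad_is_a = {"mitochondrial chromosome": ["chromosome"]}
bad_part_of = {"chromosome": ["nucleus"]}

# Corrected structure from the last slide.
good_is_a = {
    "nuclear chromosome": ["chromosome"],
    "mitochondrial chromosome": ["chromosome"],
}
good_part_of = {
    "nuclear chromosome": ["nucleus"],
    "mitochondrial chromosome": ["mitochondrion"],
}

bad = inferred_part_of("mitochondrial chromosome", bad_is_a, bad_part_of)
good = inferred_part_of("mitochondrial chromosome", good_is_a, good_part_of)
```

In the first structure the mitochondrial chromosome inherits "part_of nucleus", which is false; moving the part_of links down to the specific chromosome types removes the false inference.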
GO Definitions: each GO term has 2 definitions
• A written definition, by a biologist: necessary and sufficient conditions (not computable)
• The graph structure: necessary conditions, formal (computable)
Term-term relationship
• is_a
• The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B
• For example, nuclear chromosome is_a chromosome.
GO:0043232 : intracellular non-membrane-bound organelle
  GO:0005694 : chromosome
    GO:0000228 : nuclear chromosome

Term-term relationship
• part_of
• C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present
• For example, periplasmic flagellum part_of periplasmic space
GO:0044464 : cell part
  GO:0042995 : cell projection
    GO:0019861 : flagellum
      GO:0009288 : flagellin-based flagellum
        GO:0055040 : periplasmic flagellum
GO:0042597 : periplasmic space
  GO:0055040 : periplasmic flagellum
Current Ontologies
• Molecular function: tasks performed by a gene product
• Biological process: broad biological goals accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations and macromolecular complexes
Search result for toxin

Relationships in GO
• "is_a"
• "part_of"

GO paths to terms

GO definitions

Pyruvate dehydrogenase
Why the interest in GO?
• Universal ontology
• Functional classification scheme with many different levels in a DAG
• Widespread interest from the scientific community
• Mappings to Swiss-Prot (SP) keywords already exist, as do gene product annotations for some organisms
GO Evidence codes
• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern
• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis
• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available
• Automatically-assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation
• Obsolete Evidence Codes
  • NR: Not Recorded
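When annotations are processed in bulk, the evidence code is typically used as a filter. The sketch below encodes the groups listed on the slide and separates electronic annotations from the rest; the annotation tuples and gene product names (`geneA`, `geneB`, `geneC`) are invented for illustration, and the helper names are hypothetical.

```python
# Evidence code groups exactly as listed on the slide.
EVIDENCE_GROUPS = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Computational Analysis": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "Author Statement": ["TAS", "NAS"],
    "Curator Statement": ["IC", "ND"],
    "Automatically-assigned": ["IEA"],
    "Obsolete": ["NR"],
}

# Reverse lookup: code -> group name.
CODE_TO_GROUP = {
    code: group for group, codes in EVIDENCE_GROUPS.items() for code in codes
}


def evidence_group(code):
    """Return the group name for an evidence code (case-insensitive)."""
    return CODE_TO_GROUP[code.upper()]


def drop_electronic(annotations):
    """Keep annotations whose code is neither automatic nor obsolete."""
    excluded = {"Automatically-assigned", "Obsolete"}
    return [a for a in annotations if evidence_group(a[2]) not in excluded]


# (gene product, GO term, evidence code); gene names are hypothetical.
annotations = [
    ("geneA", "GO:0005694", "IDA"),
    ("geneB", "GO:0005694", "IEA"),
    ("geneC", "GO:0000228", "ISS"),
]
reviewed = drop_electronic(annotations)
```

Filtering on the group rather than on individual codes keeps the rule readable and means a newly added code only has to be entered in one place.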
Current Mappings to GO
• Consortium mappings: MGD, SGD, FlyBase
• Swiss-Prot keywords
• EC numbers
• InterPro entries
• Medline ID
• Commercial companies: CompuGen, Proteome
EC number-to-GO
SP keyword-to-GO
GO doesn’t cover…
• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
• Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
• Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).
• Protein domains or structural features.
• Protein-protein interactions.
• Environment, evolution and expression.
• Anatomical or histological features above the level of cellular components, including cell types.

Sequence Ontology
• The four major aspects of the complete Sequence Ontology are:
  • located sequence features, for objects that can be located on a sequence in coordinates
  • sequence attributes, for describing the properties of features
  • consequences of mutation, for annotating the effects of a mutation
  • chromosome variation, for describing large-scale variations