
Automatic annotation of biological sequences: ontologies and pipeline systems

Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo (AG-ICB-USP).


Presentation Transcript


  1. Automatic annotation of biological sequences: ontologies and pipeline systems. Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo. AG-ICB-USP

  2. Sequence annotation
     • Annotation is the process of adding information to a DNA sequence.
     • The information is usually tied to DNA coordinates.
     • Features can be repeats, genes, promoters, protein domains, and so on.
     • Features can be linked to other databases, e.g. Pfam or PubMed.

  3. Public databases
     • GenBank, EMBL and DDBJ.
     • The three databases exchange data and update each other automatically.

  4. Feature table
     • http://www.ncbi.nlm.nih.gov/projects/collab/FT/
     • Format definition
     • Covers DDBJ/EMBL/GenBank
     • Defines all accepted annotation terms and their hierarchy

  5. Annotation file
     Contains:
     • A header with:
       • Information about the sequence
       • Organism
       • Authors
       • References
       • Comments
     • A feature table containing:
       • Sequence features and coordinates

  6. Header (EMBL)
     ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
     XX
     AC   AL031747;
     XX
     SV   AL031747.8
     XX
     DT   24-SEP-1998 (Rel. 57, Created)
     DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
     XX
     DE   Plasmodium falciparum DNA from MAL1P4
     XX
     KW   HTG; rifin; telomere; var; var-like hypothetical protein.
     XX
     OS   Plasmodium falciparum (malaria parasite P. falciparum)
     OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
     XX
     RN   [1]
     RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
     RA   Quail M., Rajandream M., Barrell B.;
     RT   ;
     RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
     RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
     RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
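The two-letter line codes make an EMBL header straightforward to read programmatically. The following is a minimal Python sketch, not a full parser: it assumes every line starts with a two-letter code followed by spaces and that `XX` lines are only spacers. The function name `parse_embl_header` is made up for this illustration.

```python
from collections import defaultdict

def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code.

    Assumption: every line looks like 'CC   content'; 'XX' lines are
    blank separators and are skipped.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if len(line) < 2:
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX":
            continue
        fields[code].append(content)
    return dict(fields)

# A few lines copied from the header shown on the slide.
sample = """\
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
"""

header = parse_embl_header(sample)
accession = header["AC"][0].rstrip(";")
keywords = [k.strip() for k in header["KW"][0].rstrip(".").split(";")]
print(accession)  # AL031747
print(keywords)
```

For real records, an established parser (for example Biopython's `SeqIO` with the `"embl"` format) is likely a better choice than hand-rolled code like this.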

  7. NCBI Header
     LOCUS       PFMAL1P4 66442 bp DNA linear INV 02-DEC-2004
     DEFINITION  Plasmodium falciparum DNA from MAL1P4, complete sequence.
     ACCESSION   AL031747 AL844501
     VERSION     AL031747.9 GI:23477012
     KEYWORDS    HTG; rifin; telomere; var; var-like hypothetical protein.
     SOURCE      Plasmodium falciparum 3D7
       ORGANISM  Plasmodium falciparum 3D7
                 Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
     REFERENCE   1
       AUTHORS   Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D.,
       TITLE     Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13
       JOURNAL   Nature 419 (6906), 527-531 (2002)
       PUBMED    12368867
     REFERENCE   2
       AUTHORS   Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C., Harris,B.,
                 Harris,D., Lawson,D., Quail,M., Rajandream,M., Hall,N. and Barrell,B.
       TITLE     Direct Submission
       JOURNAL   Submitted (24-SEP-1998) P.falciparum Genome Sequencing Consortium,
                 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge
                 CB10 1SA, UK
     COMMENT     On Oct 2, 2002 this sequence version replaced gi:7670004.
                 For more information about this sequence or the Malaria Project,
                 see http://www.sanger.ac.uk/Projects/P_falciparum.

  8. Feature
     • A region of DNA that has been annotated with a key and qualifiers
     • Keys: CDS, intron, miscellaneous, etc.
     • Qualifier: notes or extra information about a feature, e.g. exon (key) with /gene="adh" (qualifier)

  9. Feature keys
     attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

  10. Feature qualifier
     Additional information about a feature:
     /allele="text"
     /citation=[number]
     /codon=(seq:"text",aa:<amino_acid>)
     /codon_start=<1
     /db_xref="<database>:<identifier>"
     /EC_number="text"
     /evidence=<evidence_value>
     /exception="text"
     /function="text"
     /gene="text"
     /label=feature_label
     /map="text"
     /note="text"
     /number=unquoted
     /product="text"
     /protein_id="<identifier>"
     /pseudo
     /standard_name="text"
     /translation="text"
     /transl_except=(pos:<base_range>,aa:<amino_acid>)
     /transl_table
     /usedin=accnum:feature_label
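Qualifiers come in three shapes: quoted values, unquoted values, and bare flags such as /pseudo. A small Python sketch of how such a string could be split into name/value pairs follows. It assumes the qualifiers have already been joined onto one line and that quoted values contain no embedded double quotes; `parse_qualifiers` is a name invented here, and the sample string combines qualifiers from the slides purely for illustration.

```python
import re

# /name="quoted text"  or  /name=unquoted  or  a bare /name flag
QUALIFIER_RE = re.compile(r'/(\w+)(?:=(?:"([^"]*)"|(\S+)))?')

def parse_qualifiers(text):
    """Return a list of (name, value) pairs; bare flags get the value True.

    A list is used instead of a dict because a qualifier such as
    /db_xref may legitimately appear more than once on a feature.
    """
    pairs = []
    for match in QUALIFIER_RE.finditer(text):
        name, quoted, bare = match.groups()
        if quoted is not None:
            value = quoted
        elif bare is not None:
            value = bare
        else:
            value = True
        pairs.append((name, value))
    return pairs

# Illustrative string assembled from qualifiers shown on the slides.
sample = ('/gene="MAL1P4.01" /codon_start=1 /pseudo '
          '/db_xref="GI:7670005" /db_xref="UniProtKB/TrEMBL:Q9NFB6"')
qualifiers = parse_qualifiers(sample)
for name, value in qualifiers:
    print(name, value)
```

Note that the slash inside `"UniProtKB/TrEMBL:Q9NFB6"` is not mistaken for a new qualifier, because the whole quoted value is consumed in one match.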

  11. Features (EMBL)

  12. Features (NCBI)
     FEATURES             Location/Qualifiers
          source          1..66442
                          /organism="Plasmodium falciparum 3D7"
                          /mol_type="genomic DNA"
                          /isolate="3D7"
                          /db_xref="taxon:36329"
                          /chromosome="1"
          repeat_region   1..583
                          /note="telomeric repeat"
          repeat_region   584..1641
                          /note="14bp repeat"
          gene            join(29733..34985,36111..37349)
                          /gene="MAL1P4.01"
                          /note="synonyms: PFA0005w, VAR"
          CDS             join(29733..34985,36111..37349)
                          /gene="MAL1P4.01"
                          /note="Subtelomeric var gene Pfam hit to PF03011 Similar
                          to Plasmodium falciparum VaR, mal1p4.01 vaR SWALL:Q9NFB6
                          (EMBL:AL031747) (2163 aa) fasta scores: E(): 0, 100% id
                          in 2163 aa"
                          /codon_start=1
                          /product="erythrocyte membrane protein 1 (PfEMP1)"
                          /protein_id="CAB89209.1"
                          /db_xref="GI:7670005"
                          /db_xref="GOA:Q9NFB6"
                          /db_xref="UniProtKB/TrEMBL:Q9NFB6"
                          /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY
                          KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNK
                          KIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ
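Feature locations such as `join(29733..34985,36111..37349)` are easy to turn into coordinate pairs. The Python sketch below handles only plain `a..b` spans and `join(...)` of such spans; `complement(...)`, partial markers (`<`, `>`) and remote references are deliberately ignored. The helper names are invented for this example.

```python
import re

def parse_location(location):
    """Return a list of (start, end) pairs, 1-based and inclusive.

    Only 'a..b' spans, optionally wrapped in join(...), are handled
    in this sketch.
    """
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in spans]

def total_length(spans):
    """Sum of the span lengths in base pairs (ends are inclusive)."""
    return sum(end - start + 1 for start, end in spans)

cds = parse_location("join(29733..34985,36111..37349)")
print(cds)                # [(29733, 34985), (36111, 37349)]
print(total_length(cds))  # 6492
```

The two exons add up to 6492 bp, which is 2164 codons: 2163 residues plus a stop codon, in agreement with the "2163 aa" quoted in the /note of the CDS feature.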

  13. CDS features
     • CDS stands for coding sequence and is used to denote genes and pseudogenes.
     • These features are automatically translated on submission, and the resulting protein is added to the protein databases.
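What "automatically translated" amounts to can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the databases' actual procedure: it uses the standard genetic code only (no /transl_table or /transl_except handling) and stops at the first stop codon. The short DNA string is hypothetical, chosen so that it encodes the first four residues (MVTQ) of the translation shown on the previous slide.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds, codon_start=1):
    """Translate a coding sequence, stopping at the first stop codon.

    codon_start (1, 2 or 3) gives the position of the first base of
    the first complete codon.
    """
    seq = cds.upper().replace("U", "T")[codon_start - 1:]
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical sequence, not taken from the database record.
print(translate("ATGGTTACACAATAA"))  # MVTQ
```

The `codon_start` argument plays the same role as the /codon_start qualifier listed among the feature qualifiers.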

  14. /note
     • The note field contains all the evidence for a gene call, plus anything else:
       • Similarity (fasta or blast)
       • Domain/motif information (Pfam, TMHMM, etc.)
       • Unusual features (repeats, amino acid richness)

  15. /product
     • The name of the gene product, e.g. alcohol dehydrogenase
     • Unless there is proof, we must qualify the name:
       • Putative
       • Possible
     • Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein"
     • This is the only piece of annotation that is added to the protein databases.

  16. Naming protocols
     • Hypothetical protein: unknown function and no homology
     • Conserved hypothetical protein: unknown function WITH homology
     • Alcohol dehydrogenase-like: looks a bit like it, but may not be
     • Putative alcohol dehydrogenase: probably an alcohol dehydrogenase
     • Alcohol dehydrogenase: previously characterised and shown to be alcohol dehydrogenase in this organism
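The naming protocol is essentially a lookup from evidence level to wording, which can be sketched as a small Python function. The evidence labels (`'none'`, `'homology_only'`, `'weak'`, `'strong'`, `'experimental'`) and the function name are invented for this illustration; the slides do not define thresholds for when similarity counts as weak or strong, so that judgement stays with the annotator.

```python
def product_name(function=None, evidence="none"):
    """Pick a /product string following the naming protocol.

    evidence is one of (labels invented for this sketch):
      'none'          no similarity to anything
      'homology_only' similar to other proteins, but no known function
      'weak'          looks a bit like a protein of known function
      'strong'        probably has the named function
      'experimental'  characterised in this organism
    """
    if evidence == "none":
        return "hypothetical protein"
    if evidence == "homology_only":
        return "conserved hypothetical protein"
    if function is None:
        raise ValueError("a function name is required for this evidence level")
    if evidence == "weak":
        return f"{function}-like protein"
    if evidence == "strong":
        return f"putative {function}"
    if evidence == "experimental":
        return function
    raise ValueError(f"unknown evidence level: {evidence!r}")

for level in ("none", "homology_only", "weak", "strong", "experimental"):
    print(level, "->", product_name("alcohol dehydrogenase", level))
```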

  17. /gene
     • The gene name, e.g. ADH1
     • Only transfer a gene name if it is meaningful
     • Never transfer a gene name like PfB0024.
     • Is it a gene family? Make sure two genes have the same name.

  18. Transitive Annotation
     • AKA annotation catastrophe
     • Junk in = junk out
     • Mis-annotations spread through incorrect database submissions.

  19. How can we standardize the annotation terms?

  20. Through a dynamic controlled vocabulary


  22. So what does that mean?
     • From a practical view, an ontology is the representation of something we know about.
     • "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things (for example, "is part of").

  23. Ontology Structure
     • Directed Acyclic Graph (DAG): multiple parentage allowed
     • Diagram nodes: cell, membrane, chloroplast, mitochondrial membrane, chloroplast membrane

  24. GO topology
     • The ontologies are structured as directed acyclic graphs.
     • These are similar to hierarchies, but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).
     • For example, hexose biosynthetic process has two parents: hexose metabolic process and monosaccharide biosynthetic process.
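Multiple parentage is what distinguishes a DAG from a simple tree, and it can be shown with a small Python sketch. The two parents of hexose biosynthetic process come from the slide; the shared grandparent (monosaccharide metabolic process) is added here as an assumption to make the graph worth traversing. The names `PARENTS` and `ancestors` are illustrative.

```python
# child term -> set of parent terms (a DAG: a child may have several parents)
PARENTS = {
    "hexose biosynthetic process": {
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    },
    # The links below are assumed for illustration only.
    "hexose metabolic process": {"monosaccharide metabolic process"},
    "monosaccharide biosynthetic process": {"monosaccharide metabolic process"},
    "monosaccharide metabolic process": set(),
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("hexose biosynthetic process")))
```

The `seen` set matters because, in a DAG, the same ancestor can be reached along more than one path; without it the shared grandparent would be visited twice.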

  25. True Path Violations Create Incorrect Definitions
     "...the pathway from a child term all the way up to its top-level parent(s) must always be true."
     Diagram: chromosome part_of nucleus

  26. True Path Violations
     "...the pathway from a child term all the way up to its top-level parent(s) must always be true."
     Diagram: mitochondrial chromosome is_a chromosome

  27. True Path Violations
     "...the pathway from a child term all the way up to its top-level parent(s) must always be true."
     Diagram: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus
     A mitochondrial chromosome is not part of a nucleus!

  28. True Path Violations
     "...the pathway from a child term all the way up to its top-level parent(s) must always be true."
     Diagram: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion
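The true path rule can be checked mechanically by following every path upward and seeing what it implies. The Python sketch below encodes the broken and the repaired diagrams as (child, relation, parent) triples. It assumes a simple composition rule: a path made only of is_a links yields an is_a, and any path that includes a part_of link yields a part_of. The names `BROKEN`, `FIXED` and `inferred_relations` are invented for this example.

```python
# (child, relation, parent) triples, following the slide diagrams
BROKEN = [
    ("chromosome", "part_of", "nucleus"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
]
FIXED = [
    ("nuclear chromosome", "is_a", "chromosome"),
    ("nuclear chromosome", "part_of", "nucleus"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]

def inferred_relations(term, edges):
    """Follow every path upward from term; return {(relation, ancestor)}.

    Assumed composition rule: only-is_a paths give is_a, any path
    containing a part_of link gives part_of.
    """
    results = set()
    stack = [(term, "is_a")]
    while stack:
        node, rel_so_far = stack.pop()
        for child, rel, parent in edges:
            if child != node:
                continue
            combined = "part_of" if "part_of" in (rel_so_far, rel) else "is_a"
            if (combined, parent) not in results:
                results.add((combined, parent))
                stack.append((parent, combined))
    return results

print(sorted(inferred_relations("mitochondrial chromosome", BROKEN)))
print(sorted(inferred_relations("mitochondrial chromosome", FIXED)))
```

With the broken graph, the traversal infers that a mitochondrial chromosome is part_of the nucleus, which is false; with the repaired graph it is part_of the mitochondrion only.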

  29. GO Definitions: each GO term has 2 definitions
     • A definition written by a biologist: necessary and sufficient conditions; a written definition (not computable)
     • Graph structure: necessary conditions; formal (computable)

  30. Term-term relationship
     • is_a
     • The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B.
     • For example, nuclear chromosome is_a chromosome.
       GO:0043232 : intracellular non-membrane-bound organelle
         GO:0005694 : chromosome
           GO:0000228 : nuclear chromosome

  31. Term-term relationship
     • part_of
     • C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present.
     • For example, periplasmic flagellum part_of periplasmic space.
       GO:0044464 : cell part
         GO:0042995 : cell projection
           GO:0019861 : flagellum
             GO:0009288 : flagellin-based flagellum
               GO:0055040 : periplasmic flagellum
         GO:0042597 : periplasmic space
           GO:0055040 : periplasmic flagellum

  32. Current Ontologies
     • Molecular function: tasks performed by a gene product
     • Biological process: broad biological goals accomplished by ordered assemblies of molecular functions
     • Cellular component: subcellular structures, locations and macromolecular complexes


  34. Search result for toxin

  35. Relationships in GO
     • "is_a"
     • "part_of"

  36. GO paths to terms

  37. GO definitions

  38. Pyruvate dehydrogenase

  39. Why the interest in GO?
     • Universal ontology
     • Functional classification scheme with many different levels in a DAG
     • Widespread interest from the scientific community
     • Mappings to Swiss-Prot (SP) keywords already exist, as do gene product annotations for some organisms

  40. GO Evidence codes
     • Experimental Evidence Codes
       • EXP: Inferred from Experiment
       • IDA: Inferred from Direct Assay
       • IPI: Inferred from Physical Interaction
       • IMP: Inferred from Mutant Phenotype
       • IGI: Inferred from Genetic Interaction
       • IEP: Inferred from Expression Pattern
     • Computational Analysis Evidence Codes
       • ISS: Inferred from Sequence or Structural Similarity
       • ISO: Inferred from Sequence Orthology
       • ISA: Inferred from Sequence Alignment
       • ISM: Inferred from Sequence Model
       • IGC: Inferred from Genomic Context
       • RCA: Inferred from Reviewed Computational Analysis
     • Author Statement Evidence Codes
       • TAS: Traceable Author Statement
       • NAS: Non-traceable Author Statement
     • Curator Statement Evidence Codes
       • IC: Inferred by Curator
       • ND: No biological Data available
     • Automatically-assigned Evidence Codes
       • IEA: Inferred from Electronic Annotation
     • Obsolete Evidence Codes
       • NR: Not Recorded
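Evidence codes are often used to filter annotations, for instance to keep only experimentally supported ones when guarding against transitive annotation errors. A Python sketch of the grouping above and a simple filter follows. The category labels, the function name and the sample annotations (gene names `geneA` to `geneC`) are hypothetical; only the codes and GO identifiers come from the slides.

```python
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author": {"TAS", "NAS"},
    "curator": {"IC", "ND"},
    "automatic": {"IEA"},
    "obsolete": {"NR"},
}
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}

def filter_annotations(annotations, allowed=("experimental",)):
    """Keep (gene, go_id, code) tuples whose code falls in an allowed category."""
    return [a for a in annotations if CODE_TO_CATEGORY.get(a[2]) in allowed]

# Hypothetical annotations, for illustration only.
annotations = [
    ("geneA", "GO:0005694", "IDA"),
    ("geneB", "GO:0005694", "IEA"),
    ("geneC", "GO:0000228", "ISS"),
]
print(filter_annotations(annotations))
print(filter_annotations(annotations, allowed=("experimental", "computational")))
```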

  41. Current Mappings to GO
     • Consortium mappings: MGD, SGD, FlyBase
     • Swiss-Prot keywords
     • EC numbers
     • InterPro entries
     • Medline ID
     • Commercial companies: CompuGen, Proteome


  45. InterPro-to-GO

  46. EC number-to-GO

  47. SP keyword-to-GO
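Mappings such as InterPro-to-GO, EC number-to-GO and SP keyword-to-GO are typically distributed as plain-text files. The Python sketch below reads one such file under an assumed line layout, `<db>:<id> [name] > GO:<term name> ; GO:<id>`, with comment lines starting with `!`. Both the layout and the sample lines are assumptions made for illustration and should be checked against the actual mapping files; `parse_external2go` is an invented name.

```python
from collections import defaultdict

def parse_external2go(lines):
    """Map external identifiers to lists of GO IDs.

    Assumed layout: '<db>:<id> [name] > GO:<term name> ; GO:<id>'.
    Lines starting with '!' are treated as comments.
    """
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        left, _, right = line.partition(" > ")
        if not right:
            continue  # no mapping on this line
        external_id = left.split()[0]
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[external_id].append(go_id)
    return dict(mapping)

# Illustrative lines in the assumed layout.
sample_lines = [
    "! comment line",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
]
ec2go = parse_external2go(sample_lines)
print(ec2go)
```

A list of GO IDs per identifier is used because one external entry may map to several GO terms.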

  48. GO doesn't cover…
     • Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
     • Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
     • Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).
     • Protein domains or structural features.
     • Protein-protein interactions.
     • Environment, evolution and expression.
     • Anatomical or histological features above the level of cellular components, including cell types.

  49. Sequence Ontology
     The four major aspects of the complete Sequence Ontology are:
     • located sequence features, for objects that can be located on a sequence in coordinates;
     • sequence attributes, for describing the properties of features;
     • consequences of mutation, for annotating the effects of a mutation;
     • chromosome variation, to describe large-scale variations.
