Bioinformatics and Genome Annotation

Bioinformatics and Genome Annotation. Shane C Burgess. http://www.agbase.msstate.edu/. NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY July 17, 2000.


Presentation Transcript


  1. Bioinformatics and Genome Annotation Shane C Burgess http://www.agbase.msstate.edu/

  2. NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (July 17, 2000). Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

  3. Biocomputing: computational biology & bioinformatics. Gene Ontology Consortium members.

  4. Dr Fiona McCarthy, Dr Susan Bridges, Dr Teresia Buza, Dr Nan Wang, Cathy Grisham, Dr Divya Pedinti, Philippe Chouvarine, Lakshmi Pillai

  5. Sequencing is getting cheaper, and biocomputing becomes more of an issue. [Figure: cost of a human or similar-sized genome. Source: Richard Gibbs, Baylor College of Medicine.]

  6. Complexity • Sequence itself, and sequence from all its compatriots and assorted microbes • SNPs • Transcripts (all of them… don’t forget alternative splicing and starts) • CNVs • Epigenetic changes to DNA • Proteome (expression, epigenetics, PTMs, location, flux, enzyme kinetics) • Metabolites • Phenotypes • Drugs • B. Statistical: 1. Multiple testing problem. 2. Search space. • Both have potential computationally intensive solutions (Monte Carlo/resampling/permutation/bootstrap and target/decoy). • C. Information: publications are no longer the sole source of “valid” or “legitimate” information. • Trusted databases, and not just publications, are used as research sources; not just data but also community annotations etc. • D. Biocomputing issues: LOCAL – storage, compute power (CPU days), RAM; DISTANT – linking, data movement, cyberinfrastructure (hard, soft and human). • E. How and who?
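
The resampling idea named under "B. Statistical" can be sketched as follows. This is a minimal illustration, not a method taken from the slides: the function names `permutation_p_value` and `bonferroni` are hypothetical, the test statistic (absolute difference in means) is an assumption, and the target/decoy approach is not shown.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

def bonferroni(p_values):
    """Simplest multiple-testing correction: scale each p-value by the test count."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Each permutation costs one pass over the data, which is why these solutions are described as computationally intensive once they are repeated for every gene product in a dataset.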

  7. Titus Brown, Michigan State University

  8. Storage costs. A. Simple Storage Service (S3), e.g. Amazon: for the first 50 TB, 15 US cents/GB ($7,500 per 50 TB), plus charges for data transfer and operations. VS. buy, store and scale as needed, e.g. Web Object Scaler (WOS). Immediate or “longer” term solution. Source: “Putting Genomes in the Cloud. Making data sharing faster, easier and more scalable.” By M. May, May 18, 2010.
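
The storage figure on this slide can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, 1 TB is taken as 1,000 GB, the billing period is whatever the slide's rate implies (it is not stated), and data transfer and operation charges are left out.

```python
def s3_storage_cost_usd(terabytes, cents_per_gb=15, gb_per_tb=1000):
    """Storage-only cost at a flat per-GB rate (transfer and operations excluded)."""
    total_cents = terabytes * gb_per_tb * cents_per_gb
    return total_cents / 100

# 50 TB at 15 US cents/GB reproduces the $7,500 quoted on the slide.
print(s3_storage_cost_usd(50))
```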

  9. 10 Gigabits (Gb)/second

  10. Annotation: Nomenclature, Structural & Functional. Structural Annotation: • Open reading frames (ORFs) predicted during genome assembly • predicted ORFs require experimental confirmation. Functional Annotation: • annotation of gene products = Gene Ontology (GO) annotation • initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid) • functional literature exists for many genes/proteins prior to genome sequencing • Gene Ontology annotation does not rely on a completed genome sequence. Nomenclature.

  11. Livestock Gene Nomenclature: Jim Reecy et al., International Society for Animal Genetics, 26th – 30th July 2010, Edinburgh. Chicken Gene Nomenclature: • 1995: chicken gene nomenclature will follow HGNC guidelines • 2007: chicken biocurators begin assigning standardized nomenclature • 2008: first CGNC report; NCBI begins using standardized nomenclature & CGNC links • 2010: first dedicated chicken gene nomenclature biocurator; NCBI/AgBase/Marcia Miller – structural annotation & nomenclature for MHC regions (chr 16) • Chicken gene nomenclature database – UK & US databases sharing and coordinating data.

  12. http://edit-genenames.roslin.ac.uk/ Available via BirdBase & AgBase

  13. Experimental structural genome annotation: proteogenomic mapping

  14. Problems with Current Structural Annotation Methods • EST evidence is biased toward the ends of genes • Computational gene finding programs: • Misidentify some genes, especially short genes • Overlook exons • Incorrectly demarcate gene boundaries, especially splice junctions

  15. Proteogenomic Mapping • Combines genomic and proteomic data for structural annotation of genomes • First reported by Jaffe et al. at Harvard in 2004 in bacteria • McCarthy et al. 2006 first applied in chicken (one of the first uses in a eukaryote; the other two in human). • Improves genome structural annotation based on expressed protein evidence • Confirms existence of predicted protein-coding gene • Identifies exons missed by gene finder • Corrects incorrect boundaries of previously identified genes • Identifies new genes that the gene finding programs missed

  16. The CCV genome was sequenced in 1992, but only 12 of the 76 predicted ORFs had been confirmed to exist as proteins. Proteogenomic mapping confirmed 37 of the 76 and identified 17 novel ORFs that were not predicted.

  17. Structural Annotation of the Chicken Genome • Location of genes on the genome • Computational gene finding programs such as Gnomon (NCBI) are based on Markov models and also use: • ESTs • Known proteins • Sequence conservation

  18. ePST Generation Process: map the peptide nucleotide sequence to the chromosome. [Diagram labels: peptide nucleotide sequence, chromosome.]

  19. [Flowchart] Biological sample → trypsin digestion → LC ESI-MS/MS → data. Search against protein database → peptide matches → confirm predicted protein-coding gene. Search against genome translated in six reading frames → peptide matches → generate ePSTs (expressed Peptide Sequence Tags) from peptides matching the genome only → novel protein-coding gene; correction / validation of genome annotation.
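
The "search against genome translated in 6 reading frames" step can be sketched as below. This is a simplification under stated assumptions: real searches match MS/MS spectra with dedicated search engines, whereas this sketch only looks up an already identified peptide string; the function names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return {(strand, offset): protein} for the three frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames

def find_peptide(dna, peptide):
    """List (strand, frame offset, nucleotide start) for every exact peptide match.

    The nucleotide start is an index into the strand being read, so for '-'
    it refers to the reverse-complemented sequence.
    """
    hits = []
    for (strand, offset), protein in six_frame_translations(dna).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((strand, offset, offset + 3 * start))
            start = protein.find(peptide, start + 1)
    return hits
```

Peptides that match the translated genome but no entry in the protein database are the ones carried forward to ePST generation.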

  20. ePST Generation Process: locate the first downstream in-frame stop codon or canonical splice junction. [Diagram labels: peptide nucleotide sequence, stop codon, chromosome.]

  21. ePST Generation Process: locate the upstream canonical splice junction or in-frame stop. [Diagram labels: peptide nucleotide sequence, stop codon, chromosome.]

  22. ePST Generation Process: find the first start codon between the in-frame stop and the peptide. [Diagram labels: peptide nucleotide sequence, stop codon, start codon, chromosome.]

  23. ePST Generation Process: use the splice junction or in-frame start as the beginning of the ePST. [Diagram label: chromosome.]

  24. ePST Generation Process: translate the ePST coding nucleotide sequence into the Expressed Peptide Sequence Tag (ePST) amino acid sequence. [Diagram labels: chromosome, ePST coding nucleotide sequence.]
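
The ePST generation steps on slides 18 to 24 can be sketched as follows. This is an illustrative reduction, not the authors' implementation: it handles the forward strand only, ignores canonical splice junctions entirely, treats ATG as the only start codon, and falls back to the codon after the upstream stop when no start is found. The name `build_epst` and its coordinate convention are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def translate(dna):
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def build_epst(chromosome, pep_start, pep_end):
    """Extend a mapped peptide into an ePST.

    pep_start and pep_end are 0-based, half-open nucleotide coordinates of the
    peptide on the forward strand; their difference must be a multiple of 3.
    Returns (begin, end, amino acid sequence).
    """
    # Walk downstream codon by codon to the first in-frame stop.
    end = pep_end
    while end + 3 <= len(chromosome) and chromosome[end:end + 3] not in STOP_CODONS:
        end += 3
    # Walk upstream codon by codon to the nearest in-frame stop.
    begin = pep_start
    while begin - 3 >= 0 and chromosome[begin - 3:begin] not in STOP_CODONS:
        begin -= 3
    # Use the first start codon between the upstream stop and the peptide, if any.
    for pos in range(begin, pep_start + 1, 3):
        if chromosome[pos:pos + 3] == "ATG":
            begin = pos
            break
    return begin, end, translate(chromosome[begin:end])

# Peptide "AK" (GCTAAA) mapped to positions 9..15 of a toy sequence.
print(build_epst("TAAGCCATGGCTAAACCCTAGGG", 9, 15))
```

A fuller version would also stop at canonical splice junctions in both directions, as the slides describe.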

  25. Functional annotation

  26. [Figure: counts (axes labeled “No.” and “No. × 10^6”) plotted against year; the plotted data are not recoverable from the transcript.]

  27. Functional Understanding. Ontologies: GO Cellular Component, GO Biological Process, GO Molecular Function. Canonical and other networks: BRENDA, Pathway Studio 5.0, Ingenuity Pathway Analyses, Cytoscape, Interactome Databases.

  28. Biological interpretation Gene Ontology Network Modeling Derived Implied Physiology (= Cellular Component + Biological Process + Molecular Function)

  29. What is the Gene Ontology? “a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing” • the de facto standard for functional annotation • assigns functions to gene products at different levels, depending on how much is known about a gene product • is used for a diverse range of species • structured to be queried at different levels, e.g. find all the chicken gene products in the genome that are involved in signal transduction, then zoom in on all the receptor tyrosine kinases • each human-readable GO function has a digital tag to allow computational analysis of large datasets. COMPUTATIONALLY AMENABLE ENCYCLOPEDIA OF GENE FUNCTIONS AND THEIR RELATIONSHIPS

  30. GO is the “encyclopedia” of gene functions captured, coded and put into a directed acyclic graph (DAG) structure. In other words, by collecting all of the known data about gene product biological processes, molecular functions and cell locations, GO has become the master “cheat-sheet” for our total knowledge of the genetic basis of phenotype. Because every GO annotation term has a unique digital code, we can use computers to mine the GO DAGs for granular functional information. Instead of having to plough through thousands of papers at the library, make notes, and then decide what the differential gene expression from your microarray experiment means as a net effect, the aim is for GO to have all the biological information captured, and then to retrieve it, compile it with your quantitative gene product expression data, and provide a net effect.
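
Mining a DAG of coded terms can be sketched as follows. The term IDs, the graph and the gene names are all made up for illustration (they are not real GO identifiers), and the function names are hypothetical. The point shown is that a gene product annotated to a granular term is also retrieved by a query on any ancestor term.

```python
# Toy DAG: each term maps to its parent terms. IDs are hypothetical, not real GO IDs.
TOY_PARENTS = {
    "TOY:0001": [],                        # root, e.g. a top-level process
    "TOY:0002": ["TOY:0001"],              # e.g. signal transduction
    "TOY:0003": ["TOY:0002"],              # e.g. receptor tyrosine kinase signaling
    "TOY:0004": ["TOY:0001"],              # e.g. metabolic process
    "TOY:0005": ["TOY:0003", "TOY:0004"],  # two parents: a DAG, not a tree
}

# Toy annotations: gene product -> directly annotated terms.
TOY_ANNOTATIONS = {
    "geneA": {"TOY:0003"},
    "geneB": {"TOY:0004"},
    "geneC": {"TOY:0005"},
}

def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def gene_products_for(term, annotations, parents):
    """Gene products annotated to `term` or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    )

# Query at a broad level, then zoom in on a more granular term.
print(gene_products_for("TOY:0002", TOY_ANNOTATIONS, TOY_PARENTS))
print(gene_products_for("TOY:0005", TOY_ANNOTATIONS, TOY_PARENTS))
```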

  31. Use GO for……. • Determining which classes of gene products are over-represented or under-represented. • Grouping gene products. • Relating a protein’s location to its function. • Focusing on particular biological pathways and functions (hypothesis-testing).
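
One common way to decide whether a class of gene products is over-represented is a hypergeometric (one-sided Fisher) test. The slides do not name a specific test, so this choice is an assumption, as is the function name. A minimal sketch:

```python
from math import comb

def enrichment_p_value(population, annotated, sample, sample_annotated):
    """P(at least `sample_annotated` annotated gene products in the sample).

    population:       gene products in the background set
    annotated:        background gene products carrying the GO term
    sample:           gene products in the list of interest
    sample_annotated: gene products in that list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(sample_annotated, upper + 1)
    )
    return tail / total

# 3 of 10 background gene products carry the term; all 3 sampled ones do.
print(enrichment_p_value(10, 3, 3, 3))
```

Under-representation uses the lower tail instead, and testing many GO terms at once brings back the multiple testing problem.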

  32. “GO Slim”: Many people use “GO Slims”, which capture only high-level terms that are more often than not extremely poorly informative and not suitable for hypothesis-testing. In contrast, we need to use the deep, granular, information-rich data suitable for hypothesis-testing.

  33. Sourcing and displaying GO annotations: secondary and tertiary sources.

  34. GO Consortium: Reference Genome Project • Limited resources to GO annotate gene products for every genome • rely on computational GO annotations • most robust method is to transfer GO between orthologs • Reference genome project: goal is to produce a “gold standard” manually biocurated GO annotation dataset for orthologous genes • 12 reference genomes – chicken is only agricultural species • Chicken RGP contributions provided via USDA CSREES MISV-329140 http://www.geneontology.org/GO.refgenome.shtml

  35. RGP & Taxonomy checks • Transferring GO annotation between orthologs requires: • determining orthologs – computational prediction followed by manual curation • developing ‘sanity’ checks to ensure transferred functions make sense phylogenetically (eg. no lactating chickens!)
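
A taxonomy "sanity" check on transferred annotations can be sketched as below. The constraint table, the taxon groupings and the function names are hypothetical placeholders for illustration; they are not the GO Consortium's actual taxon constraint files or any AgBase interface.

```python
# Hypothetical constraints: term -> taxon groups in which the term may be used.
TOY_ONLY_IN_TAXON = {"lactation": {"mammals"}}

# Hypothetical taxon membership for each species.
TOY_TAXON_GROUPS = {
    "chicken": {"birds", "vertebrates"},
    "cow": {"mammals", "vertebrates"},
}

def passes_taxon_check(term, species):
    """A term passes if it has no constraint or the species is in an allowed group."""
    allowed = TOY_ONLY_IN_TAXON.get(term)
    return allowed is None or bool(allowed & TOY_TAXON_GROUPS[species])

def transfer_annotations(ortholog_terms, target_species):
    """Split an ortholog's terms into those kept and those rejected for the target."""
    kept, rejected = [], []
    for term in ortholog_terms:
        if passes_taxon_check(term, target_species):
            kept.append(term)
        else:
            rejected.append(term)
    return kept, rejected

# No lactating chickens: the term is rejected when transferred to chicken.
print(transfer_annotations(["lactation", "signal transduction"], "chicken"))
```

Rejected terms would go back to a biocurator rather than being silently dropped, in keeping with the manual curation step described above.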

  36. Further taxon checking comments may be added here, or contact the AgBase database.

  37. AgBase Quality Checks & Releases. [Flowchart components: AgBase biocurators; AgBase biocuration interface; AgBase database; EBI GOA Project (UniProt db, QuickGO browser); GO Consortium database and public databases (AmiGO browser); each database supplies GO analysis tools and microarray developers. Releases pass through a ‘sanity’ check, and through a ‘sanity’ check & GOC QC.] ‘Sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.

  38. Comparing AgBase & EBI-GOA Annotations. [Bar chart: gene products annotated (axis 0 to 14,000) by project (AgBase and EBI-GOA, for chick and cow), broken down into computational, manual - sequence, and manual - literature annotations.] Complementary to EBI-GOA: GenBank proteins not represented in UniProt & EST sequences on arrays.

  39. Contribution to GO Literature Biocuration. [Charts for chicken and cow. Contributors shown: EBI GOA, AgBase, EBI-IntAct, Roslin, HGNC, UCL-Heart project, MGI, Reactome. Values shown: 97.82%, < 0.50%, 88.78%, < 1.50%.]

  40. [Flowchart] INPUT: functional genomics data (e.g. microarray data). Tools: ArrayIDer, GORetriever, GOanna, GOanna2ga, GA2GEO, GAQ Score, GOModeler, GOSlimViewer. Branches: gene products with GO annotations; gene products with NO GO annotations; GOanna BLAST output and manual interpretation of GOanna output; gene products with orthologs and GO annotations; gene products with NO orthologs OR with orthologs but NO GO annotations; biocuration from literature (biocurated annotations from literature or specialist knowledge); NO literature or specialist knowledge that can be used to make GO annotations (must wait on experimental evidence or new electronic inference). Output: comprehensive GO annotation, then data visualization. Generic (existing GO analysis programs): qualitative data presentation; analysis can only be changed if the user has programming skills. Specific: user-defined, hypothesis-driven, quantitative data presentation.

  41. 2010 GO Training Opportunities - on site training by request/interest - webinar: notification via ANGENMAP & GO discussion groups To request a workshop contact Fiona McCarthy fmccarthy@cvm.msstate.edu OR agbase@cse.msstate.edu

  42. Workshop Surveys. [Charts: number of people by year workshops were offered (2007, 2008, 2009), annual and cumulative; GO training 2009 survey responses as % of respondents (strongly disagree, disagree, uncertain, agree, strongly agree) for the statements: I would recommend this workshop; I am confident I can get GO questions answered; I am confident in using GO for modeling; Topics were well explained; Topics covered were relevant.] Workshop hosts: ISU – Dr Susan Lamont; NCSU – Dr Hsiao-Ching Liu.
