Curiosity, the Internet, & Some Biology Knowledge = Undergraduate Contributions to Genomics
Brad Goodner, Hiram College
• 8 undergrad authors involved in PFGE mapping & transposon mutagenesis
• 11 undergrad authors involved in library construction, gap closure, and extensive annotation
• >200 undergrads involved in annotation
• 6 undergrad authors involved in PFGE mapping, transposon mutagenesis, extensive annotation, & comparative genomics
• 450 undergrads from 7 institutions involved in genome annotation
Why Genomics?
So many questions! So many great tools! Too few researchers!
Genome projects as of 3/07/2012:
Domain      Completed   Ongoing
Archaea           152       214
Bacteria         2843      7969
Eucarya           173      2385
(+ 1970 metagenome samples)
So much data!
*** So much for me & my students to do!!! ***
Basics of a Genome Project: The Sequence is Not the End of the Road
Genome → Random Pieces → Shotgun Genomic Libraries (or Sequencing without Cloning) → 8-20X Sequencing Coverage → Overlaps in Small Pieces to Form Contigs → Gap Closure (Join Large Pieces into Sequenced Genome) → Annotation → Functional Genomics
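To make the "8-20X coverage" and "gap closure" steps concrete, here is a minimal sketch of the usual coverage arithmetic. It is not taken from the slides: it assumes the standard definition of coverage (reads × read length / genome size) and the simplified Lander-Waterman model, which ignores the minimum overlap needed to join two reads. The function names and the example genome and read sizes are illustrative only.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: total bases sequenced divided by genome size."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(depth):
    """Simplified Lander-Waterman estimate of the fraction of bases no read covers."""
    return math.exp(-depth)

def expected_contigs(num_reads, depth):
    """Simplified Lander-Waterman estimate of the number of contigs (ignores minimum overlap)."""
    return num_reads * math.exp(-depth)

if __name__ == "__main__":
    genome_size = 5_000_000   # hypothetical 5 Mbp bacterial genome
    read_length = 700         # hypothetical read length in bp
    for depth in (8, 20):
        num_reads = round(depth * genome_size / read_length)
        print(depth, num_reads,
              expected_uncovered_fraction(depth),
              expected_contigs(num_reads, depth))
```

Even under this idealized model, random shotgun reads at the low end of the range leave some gaps, which is one reason gap closure is a separate step before annotation.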
Annotation Pipeline
• Gene finding & operon prediction
• Blast & global sequence alignments
• Protein domain prediction
• Protein localization prediction
• Functional prediction
• Functional call, linkage to experimental data, & testable hypotheses (community involvement)
Why Involve Students in Genome Annotation?
• Most democratic of biology subdisciplines
• Sequence data are now crucial not only to understanding gene/protein function, but also to medicine, agriculture, biotechnology, and our basic understanding of evolution
• In most automated genome annotations, 35% of gene annotations are wrong in some way & things get missed
• The logic of bioinformatic algorithms illustrates key principles of biology
Why Annotate with Students? Most of what we know comes from a relatively small subset of life’s diversity (models). To what extent do these models adequately reflect genomic diversity?
Why Annotate with Students? Genomic Encyclopedia of Bacteria & Archaea (GEBA) is a massive JGI genome sequencing effort to fill in many of the missing or under-sampled branches of the Bacteria & Archaea trunks on the Tree of Life. *T.P. Curtis, W.T. Sloan, and J.W. Scannell. 2002. Estimating prokaryotic diversity and its limits. Proc Natl Acad Sci USA 99: 10494-10499.
Why Annotate with Students? First 56 GEBA genomes* filled in several missing or under-sampled branches of the Bacteria trees & showed that there is a lot of genomic diversity out there to be discovered. * D. Wu, P. Hugenholtz, K. Mavromatis, et al., 2009. A phylogeny-driven genomic encyclopedia of Bacteria and Archaea. Nature 462: 1056-1060.
Making a Difference: Undergrads & Gene/Genome Annotation
• Genes as phylogenetic data
• Finding genes
• Basic gene information
• Pathway/process-driven questions
• Hypothetical genes
• Genome-wide questions
• Comparative genomics
• Metagenomics
Finding Genes: Mistakes are Rarer, but Still Possible
Which of these ORFs is biologically real? One? Two? More? None? But not all!(?)
Approaches: similarity-based gene calls; ab initio gene calls
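The question of which ORFs are real starts from a list of candidate ORFs. The sketch below shows one simple way to enumerate them in all six reading frames. It assumes ATG as the only start codon and the standard stop codons, which is a simplification (bacteria also use alternative starts such as GTG and TTG). The function names and the `min_codons` cutoff are illustrative, not part of any particular gene finder.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end, frame) for ATG...stop ORFs on one strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None                   # look for the next ORF in this frame

def find_orfs(seq, min_codons=30):
    """Return candidate ORFs on both strands as (strand, start, end, frame).

    Minus-strand coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = [("+", s, e, f) for s, e, f in orfs_in_strand(seq, min_codons)]
    hits += [("-", s, e, f)
             for s, e, f in orfs_in_strand(reverse_complement(seq), min_codons)]
    return hits
```

Lowering `min_codons` returns more candidates, including small ORFs that may or may not be real, which is exactly the judgment call the slide raises.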
Finding Genes: Similarity-based Gene Calls
• Start from a database of known genes from other organisms
• Run similarity comparisons against it
• Only take those that pass a predetermined threshold (e.g., minimum % identity, maximum E value)
Similarity-based methods can miss:
• small ORFs that are real
• novel ORFs
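As a rough sketch of the thresholding step, the code below keeps only ORFs with at least one database hit that passes both cutoffs. The `Hit` record, the function name, and the default cutoffs (30% identity, E value of 1e-5) are assumptions chosen for illustration, not values from the slides or from any specific pipeline.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    orf_id: str
    subject_id: str
    percent_identity: float
    evalue: float

def call_genes(hits, min_identity=30.0, max_evalue=1e-5):
    """Return {orf_id: best passing Hit}; ORFs with no passing hit are not called."""
    called = {}
    for hit in hits:
        if hit.percent_identity >= min_identity and hit.evalue <= max_evalue:
            best = called.get(hit.orf_id)
            if best is None or hit.evalue < best.evalue:
                called[hit.orf_id] = hit       # keep the most significant hit
    return called

if __name__ == "__main__":
    hits = [
        Hit("orf1", "known_gene_A", 62.0, 1e-40),
        Hit("orf1", "known_gene_B", 35.0, 1e-8),
        Hit("orf2", "known_gene_C", 24.0, 0.5),   # fails both cutoffs
    ]
    print(sorted(call_genes(hits)))
```

An ORF like `orf2` is dropped here even if it is real, which illustrates how small or novel ORFs get missed by similarity alone.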
Finding Genes: Ab initio Gene Calls
• Start with a few known or highly probable genes
• Train a model on the frequency of single nucleotides, dimers, trimers, … N-mers found in real genes
• Run putative genes through the model & only take those that score high (higher probability that the gene is real)
Ab initio methods can miss:
• small ORFs that are real
• real ORFs that came from elsewhere
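The train-then-score idea can be sketched with a very small k-mer model. This is a deliberately simplified stand-in: it uses plain overlapping k-mer frequencies with a pseudocount and a log-odds score against a background model, whereas real ab initio gene finders use more elaborate Markov models. All names and the toy training sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def kmer_log_probs(seqs, k=3, pseudocount=1.0):
    """Train a model: log frequency of every DNA k-mer seen in the training sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: math.log((counts[m] + pseudocount) / total) for m in kmers}

def log_odds(seq, gene_model, background_model, k=3):
    """Score a putative gene: positive means it looks more like the training genes."""
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in gene_model:                 # skip k-mers with ambiguous bases
            score += gene_model[kmer] - background_model[kmer]
    return score

if __name__ == "__main__":
    known_genes = ["ATGGCGGCGGCGGCGTAA"]       # toy "real gene" training set
    background = ["ATATATATATATATATAT"]        # toy non-coding training set
    gene_model = kmer_log_probs(known_genes)
    bg_model = kmer_log_probs(background)
    print(log_odds("GCGGCGGCG", gene_model, bg_model))
    print(log_odds("ATATATATA", gene_model, bg_model))
```

A gene acquired from another organism can have a different composition from the training genes and so score low, which matches the slide's caution about real ORFs that came from elsewhere.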
Basic Gene Information: Correct Start Codon?
PlanctomycesL    MIDKVAKDSEMIGIVDYGMGNLRSVQKGFEKVGSTAHIVSTPAEIAAAD
Rhodopirellula   ----------MITIVDYQMGNLRSVQKAVERSGVEAEITSDASQIAAAE
Pelobacter       ----------MIVIIDYGMGNLRSVQKGFEKVGYSARVTDDPAVVAQAD
Desulfuromonas   ----------MITIIDYGMGNLRSVQKGFEKVGYTAQVTDDPRVVEKAE
Blastopirellula  ----------MITIIDYQMGNLRSVQKAIEKVGHQAVISSDAQEIAQAD
PlanctomycesM    ----------MITIVDYGMGNLRSVQKAFEKVGAEAEICADPDKIAKAS
Heliobacterium   ----------MIAIIDYGMGNLRSVQKGLEKAGYAGFVTSDPEAVRSAP
Geobacter        ----------MIAIIDYGMGNLRSVQKGFERIGFAAEVTADPARILAAE
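One way to check a start codon call against homologs, sketched here under simplifying assumptions: for every methionine in the query protein, compare the residues that follow it with the N-termini of the homologs and keep the best match. The comparison is ungapped and position-by-position, which is cruder than a real alignment. The function names and the 20-residue window are illustrative choices, while the sequences are the ones shown on the slide.

```python
def window_identity(a, b):
    """Fraction of identical residues at matching positions of two windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_start(query, homologs, window=20):
    """Return (offset, mean_identity) for the Met in `query` whose downstream
    window best matches the first `window` residues of the homologs."""
    best = None
    for offset, residue in enumerate(query):
        tail = query[offset:offset + window]
        if residue != "M" or len(tail) < window:
            continue
        score = sum(window_identity(tail, h[:window]) for h in homologs) / len(homologs)
        if best is None or score > best[1]:
            best = (offset, score)
    return best

query = "MIDKVAKDSEMIGIVDYGMGNLRSVQKGFEKVGSTAHIVSTPAEIAAAD"   # PlanctomycesL
homologs = [
    "MITIVDYQMGNLRSVQKAVERSGVEAEITSDASQIAAAE",   # Rhodopirellula
    "MIVIIDYGMGNLRSVQKGFEKVGYSARVTDDPAVVAQAD",   # Pelobacter
    "MITIIDYGMGNLRSVQKGFEKVGYTAQVTDDPRVVEKAE",   # Desulfuromonas
    "MITIIDYQMGNLRSVQKAIEKVGHQAVISSDAQEIAQAD",   # Blastopirellula
    "MITIVDYGMGNLRSVQKAFEKVGAEAEICADPDKIAKAS",   # PlanctomycesM
    "MIAIIDYGMGNLRSVQKGLEKAGYAGFVTSDPEAVRSAP",   # Heliobacterium
    "MIAIIDYGMGNLRSVQKGFERIGFAAEVTADPARILAAE",   # Geobacter
]

if __name__ == "__main__":
    print(best_start(query, homologs))
```

For these sequences the best-matching methionine is the internal one at offset 10, suggesting the annotated start may lie ten codons too far upstream. That is a hypothesis to check against the DNA (for example, for a ribosome binding site), not a proof.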
Pathway/Process-driven Questions: Students as Agents of Discovery
Example: Looking for genes encoding F0 & F1 components of ATP synthase in the aerobic N-fixer Azotobacter vinelandii
Found 2 operons
At Hiram, we typically use a pathway/process-driven annotation approach that is tied to course topics (e.g., gene structure, genome organization, lateral gene transfer); it leads to a much richer annotation tied to biological knowledge for the organism.
The two operons have different gene orders & evolutionary histories!
Gene orders (figure labels): ebgadBCAI and bd/e1ACBag
What is the role of two ATP synthase operons in a highly aerobic organism that carries out the very O-sensitive process of N fixation?
How common is it to find >1 ATP synthase operon?
• 6 undergrad authors (5 from Hiram, 1 from SPU) participated in PFGE mapping, transposon mutagenesis, and extensive annotation
• 153 undergrads acknowledged for their participation in deep genome annotation as part of courses
Pathway/Process-driven Questions: One Finding Leads to New Questions
How common is it to find >1 ATP synthase operon?
Looked for the protein domain found in the A subunit of ATP synthase (Pfam00119)
• About 150 genomes (~7%) have > 2 copies of the A subunit gene
• Examples:
  • almost all of genus Burkholderia have 2 different operons, one on chromosome 1 and another on chromosome 2
  • Pelobacter has 3 operons (1 split in 2 pieces), with 2 operons due to a clear duplication (74-100% identity) and the other clearly different
Why have this redundancy and diversity?
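A survey like this boils down to counting, per genome, how many distinct genes carry the domain. The sketch below assumes the domain hits have already been exported as (genome, gene, domain) tuples. That table layout, the function name, and the toy genome and gene names are hypothetical and do not reflect the output format of any particular database.

```python
from collections import defaultdict

def genomes_with_multiple_copies(domain_hits, domain_id, min_copies=2):
    """Return {genome: number of distinct genes carrying `domain_id`}
    for genomes with at least `min_copies` such genes."""
    genes_by_genome = defaultdict(set)
    for genome, gene, domain in domain_hits:
        if domain == domain_id:
            genes_by_genome[genome].add(gene)  # a set avoids double-counting a gene
    return {g: len(genes) for g, genes in genes_by_genome.items()
            if len(genes) >= min_copies}

if __name__ == "__main__":
    toy_hits = [                                # made-up rows for illustration
        ("genome_A", "gene_1", "Pfam00119"),
        ("genome_A", "gene_2", "Pfam00119"),
        ("genome_B", "gene_3", "Pfam00119"),
        ("genome_B", "gene_3", "Pfam00119"),    # duplicate row for the same gene
        ("genome_C", "gene_4", "other_domain"),
    ]
    multi = genomes_with_multiple_copies(toy_hits, "Pfam00119")
    print(multi, len(multi) / 3)
```

Dividing the number of such genomes by the number of genomes surveyed gives the kind of percentage quoted on the slide.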
Pathway/Process-driven Questions: Students Finding Holes in Annotation
Looking for 10 genes of glycolysis & 3 additional genes for gluconeogenesis
• Agrobacterium & Chromohalobacter genomes lack FBPase
• There are 6 different protein families that have FBPase activity
• There must be a 7th way (& maybe an 8th way) as well
Pathway/Process-driven Questions: Students Finding Potential Redundancy
Gene ID   Protein Name                    Key Domains                Nearby Genes of Interest
glnA      glutamine synthetase            GS & adenylation domains   glnL next
Atu0193   Glutamine synthetase            GS domains                 FAD-oxidoreductase next
Atu0602   Glutamine synthetase            GS domains                 FAD-oxidoreductase next, in operon with zwf, pgl, edd
Atu1770   Glutamine synthetase type I     GS & adenylation domains   glnB upstream
Atu2142   Glutamine synthetase            GS domains                 amino acid permease upstream
Atu2416   Glutamine synthetase type II    GS domains                 GS translation inhibitor
Atu4230   Glutamine synthetase type III   GS domains                 gltB
Hypothetical Genes: The Need to Bring a Lot of Information Together
Columns: Genome; # ORFs; % with functional prediction; % without prediction, but with similarity
Genome                          # ORFs   % functional   % similarity only
E. coli K12 DH10B                 4126           84.8                15.2
Conexibacter woesei DSM14684      5950           74.4                25.4
Staphylococcus aureus JH9         2753           73.4                26.6
Agrobacterium tumefaciens C58     5402           64.4                34.9
Solibacter usitatus Ellin6076     7940           61.7                37.5
Vibrio cholerae O1 el tor         3835           59.6                40.2
Planctomyces limnophilus          4304           53.6                34.8
What about transmembrane domains, conserved small domains (e.g., PFAMs), etc.?
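The two percentage columns do not always add up to 100. A small sketch of the arithmetic follows, under the assumption (not stated on the slide) that the remainder represents ORFs with neither a functional prediction nor detectable similarity. The variable and function names are illustrative, and the numbers are the ones from the table.

```python
ROWS = [
    # (genome, number of ORFs, % with functional prediction, % with similarity only)
    ("E. coli K12 DH10B", 4126, 84.8, 15.2),
    ("Conexibacter woesei DSM14684", 5950, 74.4, 25.4),
    ("Staphylococcus aureus JH9", 2753, 73.4, 26.6),
    ("Agrobacterium tumefaciens C58", 5402, 64.4, 34.9),
    ("Solibacter usitatus Ellin6076", 7940, 61.7, 37.5),
    ("Vibrio cholerae O1 el tor", 3835, 59.6, 40.2),
    ("Planctomyces limnophilus", 4304, 53.6, 34.8),
]

def remainder(rows):
    """Return {genome: (% unaccounted for, approximate ORF count)}.

    Assumes the unaccounted-for share is ORFs with no prediction and no similarity.
    """
    result = {}
    for genome, n_orfs, pct_functional, pct_similarity in rows:
        pct_rest = round(100.0 - pct_functional - pct_similarity, 1)
        result[genome] = (pct_rest, round(n_orfs * pct_rest / 100))
    return result

if __name__ == "__main__":
    for genome, (pct, count) in remainder(ROWS).items():
        print(f"{genome}: {pct}% (~{count} ORFs)")
```

On these numbers Planctomyces limnophilus stands out with by far the largest remainder, the kind of gene set where transmembrane and small-domain predictions may be the only available clues.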
GEBA Genomes: Some Real Opportunities
US Dept. of Energy Joint Genome Institute
First 56 GEBA genomes* filled in several missing or under-sampled branches of the Bacteria tree & showed that there is a lot of genomic diversity out there to be discovered.
* D. Wu, P. Hugenholtz, K. Mavromatis, et al., 2009. A phylogeny-driven genomic encyclopedia of Bacteria and Archaea. Nature 462: 1056-1060.
Division Fusobacteria
- found in soils & aquatic habitats, but more so inside animals
- only a few have been cultured
- best known is Fusobacterium from our oral cavity
Genome-level Questions: Streptobacillus moniliformis & Rat Bite Fever
S. moniliformis: fastidious, non-motile, facultative anaerobe, fermentative
Rat bite fever: hemorrhagic rash, fever, migratory polyarthritis
Genome (figure labels): 1.66 Mbp, 10.7 kbp, 1568 genes, 1511 ORFs
Genome-level Questions: Streptobacillus moniliformis & Rat Bite Fever
• No catalase, but 1 SOD
• No genes for flagellar components
Genome-level Questions: Streptobacillus moniliformis & Rat Bite Fever
S. moniliformis: fastidious, non-motile, facultative anaerobe, fermentative
How does it make a living?
[Figure labels: lactate, ATP, ADP, ETC]
Genome-level Questions: Streptobacillus moniliformis & Rat Bite Fever
[Figure labels: purines, pyrimidines, uracil → DNA, RNA; fatty acids (?); sugars; polar amino acids, glutamate, branched-chain amino acids, amino acids, dipeptides, oligopeptides → PROTEINS; ClpXP]
Genome-level Questions: Streptobacillus moniliformis & Rat Bite Fever
Secretion systems:
• Type II secretion: YES
• Type III secretion: NO
• Type IV secretion: YES (likely conjugation)
• Type VI secretion: NO
Adhesion? 25 members of the OM trimeric YadA domain protein family
Rat bite fever: hemorrhagic rash, fever, migratory polyarthritis
Connective tissue/ECM: hyaluronan, chondroitin, heparin sulfate, collagen
Host defenses: Ig's, antimicrobial peptides, immunomodulators, etc.
Enzymes (figure labels): hyaluronate lyase; polysaccharide lyases; heparinase; sulfatase; 2 pullulanases; 2 peptidase 32 collagenases; 5 IgA endopeptidases; 2 peptidase 32 collagenases; O-sialoglycoprotein endopeptidase; phospholipase D
From What Can You Isolate a Metagenomic DNA Sample? Almost any environmental sample you can imagine! HiramGenomicsStore.com
Why Isolate Metagenomic DNA?
Culturing the microbes in any environmental sample will only recover a small percentage of the organisms.
[Figure: CULTURING. Data from Annual Rev. Micro. 39: 321-46.]
Why Isolate Metagenomic DNA?
As long as we can break open the toughest cells, then we can represent the entire sample in our isolated metagenomic DNA.
[Figure: DNA ISOLATION]
Metagenomics: Lots of Interesting Questions
• Sequence all the DNA (unbiased, but expensive)
OR
• Isolate by PCR and sequence 1 or more conserved genes that act as measures of evolutionary history & diversity (e.g., the 16S/18S rRNA gene found in all organisms)
OR
• Use PCR to look for specific groups of microbes