Understanding Genomics: Unlocking the Secrets of Life

Scotty Merrell B4137 295-1584 dmerrell@usuhs.mil

Structural Genomics (What is Genomics?)

Genomics defined: "the study of functions and interactions of all the genes in the genome, including their interactions with environmental factors."

Genomics has led to a new scientific vocabulary: transcriptome, proteome, secretome, virulome, metabolome.

Why the huge interest in Genomics?
• Provides a comprehensive list of genes (and the proteins they encode) for the entire organism
• Provides a starting point from which a genome-wide understanding of systems and networks can be initiated
• Provides a global picture of genome organization
• Allows for the identification of gene families and their distribution between phylogenetic lineages, and permits insight into gene and genomic evolution on an unprecedented scale
• Permits comparison of the global genetic composition of different organisms that occupy the same niche or different niches
• Provides an inventory of genes required for housekeeping functions
  - Understanding differences in the genetic basis of these functions in different phylogenetic lineages is central to understanding life itself

Practical applications of data generated by genomics:
• Comprehensive study of microbial pathogenesis and the interaction between pathogens and their hosts
• Identification of sensitive and specific molecular targets suitable for microbial identification, typing, and for use as markers of anti-microbial resistance
• Discovery of microbial molecular markers associated with substantial variance in the risk and severity of disease
• Selection of potential candidates for the rational development of new therapeutic agents and vaccines
• Identification of genes encoding systems that are unique to bacteria or a particular pathogen

The creation of the field of Genomics was made possible by the development of new technologies that made it possible to sequence entire genomes.

The old way, aka “back when I was a kid”
Ingredients: primer, nucleotides, polymerase, radionucleotides
Dideoxy termination system: While the DNA polymerase will add a dideoxynucleotide complementary to the template strand, it cannot further extend that product after the addition of a dideoxynucleotide. This biochemistry is used to produce populations of products specifically terminated at either A, G, C or T residues. These are labeled in some way and visualized after separation by electrophoresis.

This figure shows the structure of a dideoxynucleotide (notice the H atom attached to the 3' carbon). Also depicted in this figure are the ingredients for a Sanger reaction. Notice the different lengths of labeled strands produced in this reaction.

One method for labeling is to use radioactive nucleotides (P32 or P33 or S35) to label the oligonucleotide primer. Four reactions are performed (one each for A,G,C and T), and electrophoresed side by side in a denaturing polyacrylamide gel. The products are separated by size at base resolution and the sequence read from the pattern of bands on the gel.
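The four-reaction readout described above can be sketched in a few lines. This is a minimal illustration only: the function names `sanger_lanes` and `read_gel` and the template `GATTACA` are hypothetical, the primer is ignored, and every possible termination product is assumed to be present and cleanly resolved.

```python
def sanger_lanes(template):
    """Return {base: fragment lengths} for the four dideoxy reactions.

    The polymerase synthesizes the complement of the template, so a
    fragment of length n ends with the n-th base of the new strand.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # Read the template backwards so the new strand is written 5'->3'.
    new_strand = "".join(complement[b] for b in reversed(template))
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)  # the ddNTP for `base` terminates here
    return lanes


def read_gel(lanes):
    """Read the sequence from the shortest band to the longest band."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes)            # {'A': [4, 5], 'C': [7], 'G': [2], 'T': [1, 3, 6]}
print(read_gel(lanes))  # TGTAATC
```

Each lane holds only the fragment lengths that end in its base, which is why the four lanes must be read side by side to recover the order of bases.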

[Figure: sequencing gel with lanes labeled G, A, T, C]

Today: The availability of multiple dyes with different emission spectra led to the development of the four-dye, one-lane system. Four aliquots of primer end-labeled with the four different dyes are used to perform the A, G, C and T reactions. These are pooled and run in a single lane of a gel. The sequencer reads the gel by using a spectrophotometer to distinguish between the different dye spectra, and thus the different bases. This system has been further improved by the development of dye-labeled terminators (dideoxynucleotides) that simultaneously terminate and fluorescently tag a product. These reactions can be performed in a single tube and run in a single lane. Currently, the four-dye systems can routinely read >600 bases/lane, and the four-lane one-dye systems can read over 1 kb per reaction.

The two newest sequencing techniques include:

Pyrosequencing is a method of DNA sequencing based on the "sequencing by synthesis" principle, developed initially by Mostafa Ronaghi and co-workers in the late 1990s, then further by Biotage. The method is based on a chemiluminescent enzymatic reaction, which is triggered when a molecular recognition event occurs. Essentially, the method allows sequencing of a single strand of DNA by synthesizing the complementary strand along it. Each time a nucleotide (A, C, G or T) is incorporated into the growing chain, a cascade of enzymatic reactions is triggered, which results in a light signal.

454 Sequencing is a massively parallel sequencing-by-synthesis (SBS) system capable of sequencing roughly 20 megabases of raw DNA sequence per 4.5-hour run of the current sequencing machine, the GS20. The system relies on fixing nebulized and adapter-ligated DNA fragments to small DNA-capture beads in a water-in-oil emulsion. The DNA fixed to these beads is then amplified by PCR. Finally, each DNA-bound bead is placed into a ~44 μm well on a PicoTiterPlate, a fiber optic chip. A mix of enzymes such as polymerase, sulfurylase, and luciferase is also packed into the well. The PicoTiterPlate is then placed into the GS20 for sequencing. At this stage, the four nucleotides (TAGC) are washed in series over the PicoTiterPlate. During the nucleotide flow, each of the hundreds of thousands of beads with millions of copies of DNA is sequenced in parallel.
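The flow-by-flow readout can be illustrated with a toy simulation of a single well. This is a sketch under simplifying assumptions: `simulate_flows` and `call_bases` are hypothetical names, the light signal is taken to be exactly proportional to the number of bases incorporated in a flow, and there is no noise or incomplete extension.

```python
from itertools import cycle


def simulate_flows(new_strand, flow_order="TAGC"):
    """Return [(nucleotide, signal)] for each flow washed over one well.

    `new_strand` is the strand being synthesized. The signal for a flow
    is the number of consecutive bases incorporated during that flow,
    so a homopolymer run gives one taller peak rather than several.
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        run = 0
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
    return signals


def call_bases(signals):
    """Convert the flow signals back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


flows = simulate_flows("TTAGGGC")
print(flows)              # [('T', 2), ('A', 1), ('G', 3), ('C', 1)]
print(call_bases(flows))  # TTAGGGC
```

A flow with signal 0 simply means the washed nucleotide was not the next base, which is why the order of the washes matters for interpreting the peaks.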

So you want to sequence your favorite bug: How would you do this? (what do you need?)

So you want to sequence your favorite bug: Shotgun Sequencing The shotgun part comes from the way the clone is prepared for sequencing: it is randomly sheared into small pieces (usually about 1 kb) and subcloned into a "universal" cloning vector. The library of subfragments is sampled at random, and a number of sequence reads generated (using a universal primer directing sequencing from within the cloning vector). These sequence reads are then assembled into contigs, and the complete sequence of the clone generated.

1. Genomic DNA is sheared or restricted to yield random fragments of the required size.
2. The fragments are cloned in a universal vector.
3. Sequencing reactions are performed with a universal primer on a random selection of the clones in the shotgun library.
4. These sequencing reads are assembled into contigs, identifying gaps (where there is no sequence available) and single-stranded regions (where there is sequence for only one strand).
5. The gaps and single-stranded regions are then targeted for sequencing to produce the full sequenced molecule.
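The assembly of reads into contigs can be sketched with a simple greedy overlap merge. This is an illustration, not how production assemblers work: the names `overlap` and `greedy_assemble` and the example reads are hypothetical, reads are assumed to be error-free and all from the same strand, and repeats are ignored.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the remaining contigs; more than one contig means there is
    a gap (no read spans the junction).
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]
print(greedy_assemble(reads))                 # ['ATGCGTACCTGATTAG']
print(greedy_assemble(["AAACCC", "GGGTTT"]))  # two contigs: a gap remains
```

When the reads do not overlap, the function returns more than one contig, which corresponds to the gaps that are later targeted for directed sequencing.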

Where we are today: The Comprehensive Microbial Resource (CMR) contains 401 organisms: 384 completed genomes, 17 incomplete; 28 Archaea, 3 Viruses and 353 Bacteria. Plus human, mouse, some plants, etc. http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl

Requirements for this? Culturability vs. nonculturability: what about microbes that we don’t know how to grow?

Microbial Diversity: Venter, J.C. et al. Environmental Genome Shotgun Sequencing of the Sargasso Sea. Published online in Science March 4, 2004. (this was done without culturing the bacteria, etc.)

In the Sargasso Sea, they found 1800 species of microbes, including 150 new species of bacteria, and over 1.2 million new genes. Although they don’t know what most of these genes do, the research is a first step to understanding more about life in the Sargasso Sea and the larger ocean. It also highlights the fact that we know relatively little about microbial diversity. It’s estimated that we’ve been able to culture less than 1% of microbes.

More requirements for shotgun sequencing? Gene must be clonable! What about genes that are toxic?

Once the sequence is completed, what now?

ATGAAAAGATTAGAAACTTTGGAATCCATTTTAGAGCGCTTGAGAATGTCTATCAAAAAAAACGGACTCAAAAATTCAAAACAGAGAGAAGAAGTGGTGAGCGTTTTGTATCGCAGCGGCACACACCTAAGCCCTGAAGAAATCACGCATTCTATCCGCCAAAAGGACAAAAACACTAGCATTTCTTCAGTCTATCGCATTTTGAATTTCTTAGAAAAAGAAAATTTTATCTGTGTTTTAGAAACTTCAAAAAGCGGTCGGCGCTATGAAATTGCGGCTAAAGAACACCATGATCACATCATTTGTTTGCATTGCGGTAAGATCATTGAATTTGCAGACCCTGAAATTGAAAACCGCCAGAATGAAGTCGTTAAAAAATATCAAGCCAAGCTGATTAGCCATGACATGAAAATGTTTGTGTGGTGTAAAGAATGCCAAGAGAGTGAATGTTAA

Annotation and assigning gene function based on homology
1. Finding ORFs (frame, start, stop)
http://www.tigr.org/tigr-scripts/CMR2/GenePage.spl?locus=HP1027
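Finding ORFs by frame, start and stop can be sketched as a scan of the three forward reading frames. This is a minimal illustration with assumptions: `find_orfs` and the `min_codons` parameter are hypothetical, only ATG is treated as a start codon (bacteria also use alternative starts such as GTG and TTG), and the reverse strand would need a second pass on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=10):
    """Return (frame, start, end) for each ATG...stop ORF on the forward strand.

    `start` and `end` are 0-based positions such that seq[start:end]
    runs from the ATG through the stop codon. `min_codons` counts the
    codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


example = "CC" + "ATG" + "AAA" * 3 + "TAA" + "G"
print(find_orfs(example, min_codons=2))  # [(2, 2, 17)]
print(example[2:17])                     # ATGAAAAAAAAATAA
```

Real gene finders add further evidence, such as ribosome binding sites and codon usage, before calling an ORF a putative gene.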

You’ve found an ORF. Now what?
Look for homologs (hopefully with a known function)
Tool: BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
• blastn: nucleotide vs nucleotide
• blastp: protein vs protein
• blastx: translated query vs protein
• cdart: shows conserved domains
• etc.

Steps in the BLAST algorithm (blastp)
1. The sequence is filtered to remove low-complexity regions.
2. A list of words of length 3 in the query protein sequence is made (length 11-12 for DNA sequences).
3. The words are evaluated for matches with any other combination of 3 amino acids, using the BLOSUM62 scoring matrix as default. Matches of PQG to PEG would score 15, to PRG 14, to PSG 13 and to PQA 12.
4. For DNA words, a match score of +5 and a mismatch score of -4 is used, corresponding to the changes expected in sequences separated by a PAM distance of 40.
5. A cutoff score T, called the neighborhood word score threshold, is selected to reduce the number of matches.
6. The above procedure is repeated for each 3-letter word in the query sequence. For a sequence of length 250 amino acids, the total number of words to search for is approximately 50 × 250 = 12,500.
7. The words are organized into an efficient search tree for comparing them rapidly to the database sequences.
8. Each database sequence is scanned for an exact match to one of the 50 high-scoring amino acid words corresponding to the first query sequence position.
9. In BLAST2 or gapped BLAST, short matched regions called HSPs (high-scoring segment pairs) lying on the same diagonal and within a certain distance of each other are extended in each direction as long as the score keeps rising.
10. HSPs with a score greater than a cutoff score S are kept.
11. In earlier versions of BLAST and some of the later ones, the statistical significance of each HSP score is determined. If two or more HSP regions are found, thereby providing additional evidence that the query and database sequences are related, these scores are combined to form a combined score.
12. In BLAST 2, a local gapped alignment of the sequences is made and the significance of the score is determined.
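Steps 2, 3 and 5 can be checked by hand with a short sketch. The table below holds only the handful of BLOSUM62 entries needed for the PQG example, not the full matrix; the function names and the threshold value T = 13 are assumptions chosen for illustration.

```python
# Subset of BLOSUM62: only the pairs needed for the PQG example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0,
    ("G", "A"): 0,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def query_words(seq, k=3):
    """Step 2: list every overlapping word of length k in the query."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_score(query_word, candidate):
    """Step 3: score a candidate word against a query word, position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Step 5: keep only candidate words scoring at least the threshold T."""
    scored = [(word, word_score(query_word, word)) for word in candidates]
    return [(word, score) for word, score in scored if score >= threshold]


print(query_words("PQGEF"))  # ['PQG', 'QGE', 'GEF']
for word in ["PQG", "PEG", "PRG", "PSG", "PQA"]:
    print(word, word_score("PQG", word))  # 18, 15, 14, 13, 12
print(neighborhood("PQG", ["PQG", "PEG", "PRG", "PSG", "PQA"], threshold=13))
```

With T = 13, PQA (score 12) drops out of the neighborhood while the other four words are kept, which shows how raising T shrinks the word list that must be searched.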

BLAST output: putative conserved domains have been detected.

Sequences producing significant alignments:                      Score (bits)   E value
gi|16766184|ref|NP_461799.1| (NC_003197) protein tyrosine p...   1068           0.0
gi|16761655|ref|NP_457272.1| (NC_003198) tyrosine phosphata...   992            0.0
gi|13096377|pdb|1G4U|S Chain S, Crystal Structure Of The Sa...   761            0.0
gi|16974849|pdb|1JYO|E Chain E, Structure Of The Salmonella...   207            2e-52
gi|809148|pdb|1YPT|B Chain B, Protein-Tyrosine Phosphatase ...   95             1e-18
gi|1943402|pdb|1YTW| Yersinia Ptpase Complexed With Tungst...    95             1e-18
gi|1353120|sp|P08538|YOPH_YERPS PROTEIN-TYROSINE PHOSPHATAS...   95             2e-18
gi|10955583|ref|NP_052424.1| (NC_002120) Yop effector YopH ...   95             2e-18
gi|14579369|gb|AAK69246.1|AF336309_41 (AF336309) Yop effect...   95             2e-18
gi|16082755|ref|NP_395201.1| (NC_003131) putative protein-t...   94             2e-18
gi|79206|pir||S01054 virulence protein Yop2b - Yersinia pse...   94             2e-18
gi|1065228|pdb|1YTS| Molecule: Yersinia Protein Tyrosine P...    91             3e-17
gi|464498|sp|P34137|PTP1_DICDI PROTEIN-TYROSINE PHOSPHATASE...   62             2e-08
gi|348540|pir||A44267 protein-tyrosine-phosphatase (EC 3.1....   62             2e-08
gi|15077066|gb|AAK83052.1|AF288366_2 (AF288366) ADP-ribosyl...   58             2e-07
gi|16082697|ref|NP_395143.1| (NC_003131) putative outer mem...   57             3e-07
gi|10955586|ref|NP_052427.1| (NC_002120) Yop effector YopE ...   57             3e-07
gi|141105|sp|P08008|YOPE_YERPS OUTER MEMBRANE VIRULENCE PRO...   57             5e-07
gi|155548|gb|AAA27674.1| (M34280) virulence determinant (yo...   56             9e-07
gi|5572701|dbj|BAA82559.1| (AB019126) sPTPR2B [Ephydatia fl...   49             1e-04
gi|2120612|pir||JC6026 ADP-ribosyltransferase (EC 2.4.2.-) ...   49             1e-04

gi|809147|pdb|1YPT|A Chain A, Protein-Tyrosine Phosphatase (Yersinia) (E.C.3.1.3.48) (Yop51,Pasteurella X,Ptpase,Yop51delta162) (Catalytic Domain, Residues 163 - 468) Mutant With Cys 235 Replaced By Arg (C235r)
Length = 305
Score = 95.1 bits (235), Expect = 1e-18
Identities = 66/212 (31%), Positives = 103/212 (48%), Gaps = 17/212 (8%)

Query: 340 GKPVALAGSYPKNTPDALEAHMKMLLEKECSCLVVLTSEDQMQAKQ--LPPYFRGSYTFG 397
           G +A YP + LE+H +ML E L VL S ++ ++ +P YFR S T+G
Sbjct: 89  GNTRTIACQYPLQS--QLESHFRMLAENRTPVLAVLASSSEIANQRFGMPDYFRQSGTYG 146

Query: 398 EVHTNSQKVSSASQGEAI--DQYNMQL-SCGEKRYTIPVLHVKNWPDHQPLPS--TDQLE 452
           + S+ G+ I D Y + + G+K ++PV+HV NWPD + S T L
Sbjct: 147 SITVESKMTQQVGLGDGIMADMYTLTIREAGQKTISVPVVHVGNWPDQTAVSSEVTKALA 206

Query: 453 YLADRVKNSNQN-----GAPGRSSSDKHLPMIHCLGGVGRTGTMAAALVLKDNPHSNL-- 505
           L D+ + +N G+ + K P+IHC GVGRT + A+ + D+ +S L
Sbjct: 207 SLVDQTAETKRNMYESKGSSAVADDSKLRPVIHCRAGVGRTAQLIGAMCMNDSRNSQLSV 266

Query: 506 EQVRADFRDSRNNRMLEDASQF-VQLKAMQAQ 536
           E + + R RN M++ Q V +K + Q
Sbjct: 267 EDMVSQMRVQRNGIMVQKDEQLDVLIKLAEGQ 298


This procedure is conducted for every ORF in a newly sequenced genome, and all the putative genes are sorted into different functional groups.

#    Gene Role                                                     # of Genes   % out of 1586 Genes
1    Amino acid biosynthesis                                       42           2.64%
2    Biosynthesis of cofactors, prosthetic groups, and carriers    57           3.59%
3    Cell envelope                                                 102          6.43%
4    Cellular processes                                            125          7.88%
5    Central intermediary metabolism                               24           1.51%
6    DNA metabolism                                                90           5.67%
7    Energy metabolism                                             99           6.24%
8    Fatty acid and phospholipid metabolism                        25           1.57%
9    Hypothetical proteins - Conserved                             185          11.6%
10   Hypothetical Proteins                                         495          31.2%
11   Mobile and extrachromosomal element functions                 17           1.07%
12   Protein fate                                                  42           2.64%
13   Protein synthesis                                             98           6.17%
14   Purines, pyrimidines, nucleosides, and nucleotides            38           2.39%
15   Regulatory functions                                          25           1.57%
16   Transcription                                                 10           0.63%
17   Transport and binding proteins                                88           5.54%
18   Unknown function                                              24           1.51%

This can be represented diagrammatically

The two V. cholerae chromosomes

Circular representation of the V. cholerae genome. The two chromosomes, large and small, are depicted. From the outside inward: the first and second circles show predicted protein-coding regions on the plus and minus strand, by role, according to the color code in Fig. 1 (unknown and hypothetical proteins are in black). The third circle shows recently duplicated genes on the same chromosome (black) and on different chromosomes (green). The fourth circle shows transposon-related (black), phage-related (blue), VCRs (pink) and pathogenesis genes (red). The fifth circle shows regions with significant χ² values for trinucleotide composition in a 2,000-bp window. The sixth circle shows percentage G+C in relation to mean G+C for the chromosome. The seventh and eighth circles are tRNAs and rRNAs, respectively.

The ability to sequence the entire genome of an organism has fueled a revolution in science
• Genomics provides a huge amount of data
• Vast sequence data has fueled new, large-scale, high-throughput technologies
• New technologies are revolutionizing (for better or worse) experimental strategies
• Experiments are commonly designed to examine an organism's phenotype on a genome-wide or system-wide scale (holistic operation of biological systems)
• This approach will influence the way biological questions are phrased: from “What is the function of this protein?” to “What role does the sequence play in one or more biological processes operational under X conditions?”
• Old method: phenotype to genotype
• New method: genotype to phenotype

Properties of an ideal gene classification system:
• Group genes together that share a common ancestor
• Provide scaffolding for study of the distribution of genes between organisms and distant phylogenetic lineages
  * has practical application (identify genes encoding biochemical pathways unique to bacteria)
• Provide a rapid functional annotation framework for new genome sequences

[Diagram: ancestral gene X1 gives rise to X1 in Bug A and X1 in Bug B; a gene duplication gives X1 and X2]
• Genes X1 in A and B are orthologs; these genes encode the same function
• After gene duplication, X1 and X2 are paralogs; paralogs are free to evolve new functions
• Note: X1 in A and B are also homologs. A homolog is a gene similar in structure and evolutionary origin to a gene in another species

Types of structural information you can get from Genomics and annotation Helicobacter pylori 26695: Pseudo-2D Gel

%GC (why might this be important/interesting?)

Hydrophobicity The GES scale is used to identify nonpolar transbilayer helices. The curve is the average of a residue-specific hydrophobicity scale over a window of 20 residues. When the line is in the upper half of the frame (positive), it indicates a hydrophobic region and when it is in the lower half (negative), a hydrophilic region. In the graph below the X-axis represents the length of the protein in amino acids (aa), while the Y-axis represents the GES score. The blue line shows the GES pattern of the entire protein, while the two dashed red lines represent the putative (lower line) and certain (upper line) cutoffs for potential membrane spanning domains.

Predicted Secondary Structure

Genome Region comparison

Sequenced organisms differ greatly in the size of their genomes, the number of genes encoded therein, and the composition of the ORFs they encode.

Genome size is affected by environment

• The wide range in genome sizes within a single phylogenetic lineage suggests that these genomes are dynamic and in constant flux
• Because the vast majority of a bacterial chromosome consists of coding sequences, changes in genome size reflect differences in gene content. The variation among bacterial genome sizes, ranging from 0.6 to 9 Mb, reflects differences in biochemical capabilities and, hence, in the range of environments available to particular microbial lineages.
• What is the source of genome variability?
• Insight into mechanisms of genome evolution is provided by genomics

Principles and features of Horizontal Gene Transfer (HGT)
• First recognized in multi-drug resistant pathogenic bacteria
• HGT is the non-vertical transmission of genetic material
• Mechanisms of HGT
  * Transduction
  * Transformation
  * Conjugation
• Maintenance of HT loci
  * Episomal replication
  * Homologous recombination
  * Illegitimate recombination
  * Integration (catalyzed by phage and IS element integrases/resolvases)
• Features of HGT:
  * HT loci have limited distribution within a single phylogenetic lineage
  * HT loci encode phenotypes associated with unrelated species
  * Sequence composition of HT loci is most similar to the composition of the donor genome
    - AT richness
    - codon usage
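The sequence-composition feature can be illustrated by scanning a genome for windows whose G+C content departs from the genome-wide average. This is a rough sketch under assumptions: the function names, the window size and the tolerance are hypothetical, and the example sequence is synthetic. Real analyses also use codon usage and statistical tests.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Return [(start, GC fraction)] for each full window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def atypical_windows(seq, window, step, tolerance):
    """Windows whose GC fraction differs from the whole-sequence mean by more than `tolerance`."""
    mean_gc = gc_fraction(seq)
    return [
        (start, gc)
        for start, gc in gc_windows(seq, window, step)
        if abs(gc - mean_gc) > tolerance
    ]


# Synthetic "genome": GC-rich backbone with an AT-rich insert in the middle.
genome = "GC" * 50 + "AT" * 25 + "GC" * 50
print(gc_fraction(genome))                                              # 0.8
print(atypical_windows(genome, window=50, step=50, tolerance=0.3))      # [(100, 0.0)]
```

An AT-rich island in an otherwise GC-rich chromosome stands out in such a scan, which is one reason base composition is used as a first screen for candidate horizontally transferred loci.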

V. cholerae virulence gene expression Pathogenicity island

HGT is very common Fig. 1. Distribution of horizontally transferred DNA in the E. coli MG1655 chromosome. Within each centisome, each bar denotes a continuous segment of transferred DNA containing one or more ORFs; and the length of each bar represents its size rounded to the nearest 500 bp. Features of transferred regions, such as duration in the chromosome and the identification of repeated and mobile elements, follow the notation presented in the key. The age of each continuous segment of DNA was inferred from the ages of genes successfully analyzed by back-amelioration; segments lacking genes of known age are shown in black, and no segment comprised genes with significantly different ages. Positions of the replication origin (oriC) and terminus (terC), as well as the identity of the specific tRNA loci found to be adjacent to a horizontally transferred region, are noted on the left of the open bar representing the MG1655 chromosome. The nomenclature for phage and IS elements, and for genes of known function contained within a particular transferred segment, are shown within the corresponding bar. The identities of insertion sequences are noted except as follows: adjacent IS911/(fragment)/IS3 are located within minute 5; adjacent IS3/IS600 are located within minute 8; and adjacent IS2/IS30 are located within minute 31.

Figure 2 Distribution of horizontally acquired (foreign) DNA in sequenced bacterial genomes. Lengths of bars denote the amount of protein-coding DNA. For each bar, the native DNA is blue; foreign DNA identifiable as mobile elements, including transposons and bacteriophages, is yellow, and other foreign DNA is red. The percentage of foreign DNA is noted to the right of each bar. 'A' denotes an Archaeal genome.

Bacterial genomes are mosaics of ancestral and HT genes
• Bacterial genomes are continually sampling new genes (the nature of bacterial genetic innovation)
• HT loci are frequently associated with mobile genetic units
• HT loci rarely confer a beneficial trait on the host
• However, long-term survival of HT loci in the host genome depends on the ability to confer a beneficial trait on the host
  * i.e. a gain-of-function mutation (rare)
  * Gain-of-function mutations usually require multiple genes encoding whole systems
  * HT loci comprising operons have the best chance for success (V. cholerae)
• Gain-of-function mutations may permit occupation of a new niche
  * HT loci contribute to speciation

Features of genome deletions
• Non-reversible
• Deletions cannot involve essential loci; they target non-essential loci
  * the essential nature of a locus depends on selective pressures encountered in the environment
• May maintain or disrupt genome synteny (gene order)
  * disruption of synteny involves rearrangement of sequence(s)
• What is the force driving deletions? Are small genomes more fit than larger genomes?

The process of genome shrinkage in the obligate symbiont Buchnera aphidicola.

Graphic depiction of syntenic fragments and lost regions in the genome of the reconstructed ancestor and in Buchnera. Syntenic fragments are color-coded based on position in the ancestor. Lost regions occurring between syntenic fragments are gray. The reconstruction was based on the phylogenetic distribution of gene orthologs among fully sequenced relatives of Escherichia coli and Buchnera.

Results: The reconstructed ancestral genome contained 2,425 open reading frames (ORFs). The Buchnera genome, containing 564 ORFs, consists of 153 fragments of 1-34 genes that are syntenic with reconstructed ancestral regions. On the basis of this reconstruction, 503 genes were eliminated within syntenic fragments, and 1,403 genes were lost from the gaps between syntenic fragments, probably in connection with genome rearrangements.

Part of a syntenic fragment from Buchnera and the ancestor (same as E. coli for this region). Deleted loci are white in the ancestor; orthologous genes are color-coded. Genes shifted up in the figure are oriented forward in the genome; genes shifted down are oriented backwards.

Doubling times of bacteria under laboratory conditions do not correlate with genome size. Data are for 22 species for which doubling times were available in the literature, and include bacteria from ten major taxonomic divisions.
