DNA sequencing: bench to bedside and beyond By: Clyde A. Hutchison III

DNA sequencing: bench to bedside and beyond By: Clyde A. Hutchison III Christy M. Bogard

Introduction
• A look at where we have been, where we are currently, and where we are going

The Sequence Concept in Biology
• The importance of sequencing biological macromolecules was first demonstrated by Sanger’s studies of insulin.
• His work showed that amino acid residues are joined in linear polypeptides to form proteins.
• Sanger recognized that his research did not, and could not, make any claims about the arrangement of the residues.
• Similarly, when Watson and Crick proposed the double helix structure of DNA, they pointed out that their structure placed no constraints on the sequence of a DNA molecule; it only suggested a mechanism for sequence replication.

Delays in DNA Sequencing
• 15 years elapsed between the discovery of the DNA double helix structure and the first experimental determination of a DNA sequence.
• Different DNA molecules were chemically too similar to separate easily.
• DNA chains were much longer than their protein counterparts, making complete sequencing seem unapproachable.
• The 20 amino acids of proteins have widely varying properties, which made peptides easier to separate; having only 4 nucleotides seemed to make separation, and therefore sequencing, harder for DNA.
• No base-specific DNases were known.

Progression with RNA
• RNA molecules, while similar in structure, did not share all of these drawbacks.
• Transfer RNAs (tRNAs) were small, and individual types could be purified.
• RNases with base specificity were known.
• Yeast alanine tRNA was the first nucleic acid molecule to be sequenced.
• Models for tRNA structure could be deduced by assuming base pairing analogous to that found in the DNA double helix.

From RNA to DNA
• The genome of bacteriophage ΦX174 was the first DNA molecule to be purified to homogeneity. This was accomplished by equilibrium buoyant density centrifugation.
• ΦX DNA is a single-stranded circular molecule. Phage lambda DNA, a linear molecule with cohesive ends, was the first DNA molecule to be successfully sequenced.
• Wu and Kaiser measured the incorporation of radiolabeled nucleotides by E. coli DNA polymerase in reactions that extended the 3′ termini to fill in the complementary cohesive end sequences.

Early Days of DNA Sequencing
• Discovery of type II restriction enzymes by Smith and coworkers.
• These enzymes recognize and cleave DNA at specific, short nucleotide sequences (4-6 bp), providing a way to cut large DNA molecules into smaller pieces that can be separated by size using gel electrophoresis.
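
The cutting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `digest` and the toy sequence are made up for this example, and the enzyme shown is EcoRI, which recognizes GAATTC and cuts the top strand between G and A. Only the top strand of a linear molecule is modeled.

```python
import re


def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut a linear DNA sequence at every occurrence of `site`.

    `cut_offset` is how many bases into the recognition site the cut falls
    (1 for EcoRI: G^AATTC). Only the top strand is modeled.
    """
    fragments = []
    start = 0
    # A lookahead pattern also finds overlapping occurrences of the site.
    for match in re.finditer(f"(?={re.escape(site)})", sequence):
        cut = match.start() + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
    fragments.append(sequence[start:])
    return fragments


# Toy sequence (hypothetical) containing two EcoRI sites.
toy_dna = "AAGAATTCTTTTGAATTCGG"
ecori_fragments = digest(toy_dna, "GAATTC", 1)
# Sorting by length mimics the size ordering seen on a gel.
fragment_sizes = sorted(len(f) for f in ecori_fragments)
```

Here `ecori_fragments` holds the three pieces produced by the two cuts, and `fragment_sizes` lists their lengths in the order they would separate by size.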

Gel-Based Sequencing Methods
• 1975: Sanger introduced the ‘plus and minus’ method for DNA sequencing, which used polyacrylamide gels to separate the products of primed synthesis by DNA polymerase in order of increasing chain length.
• Problem: the length of a homopolymer run could not be read directly and had to be estimated.
• Maxam and Gilbert developed a method that used polyacrylamide gels to resolve bands that terminated at each base throughout the target sequence. The cleavage reactions were specific for purines, for pyrimidines, and preferentially for A and for C.

Gel-Based Sequencing Methods
• 1977: Sanger developed the ‘dideoxy method’.
• It uses chain-terminating nucleotide analogs, rather than subsets of the four natural dNTPs, to cause base-specific termination of primed DNA synthesis.
• Unlike the ‘plus and minus’ method, it produced a band at each nucleotide in a homopolymer run.
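
To make the logic of reading a dideoxy gel concrete, here is a minimal simulation. It is a sketch under simplifying assumptions: the function names are hypothetical, the primer length is ignored (fragment length is counted from the first added nucleotide), and every possible termination product is assumed to be present and perfectly resolved.

```python
def dideoxy_lanes(new_strand: str) -> dict[str, list[int]]:
    """Return the fragment lengths expected in each of the four ddNTP reactions.

    A chain terminates wherever the dideoxy analog of that lane's base is
    incorporated, so each lane holds the positions of that base.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence by ordering all bands from shortest to longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical newly synthesized strand with a TTT homopolymer run.
lanes = dideoxy_lanes("GATTTACA")
sequence_read = read_gel(lanes)
```

The T lane contains three separate bands (lengths 3, 4, and 5) for the TTT run, which is why run lengths can be read directly instead of estimated.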

Sequences, Sequences, Sequences
• Unlike amino acid sequences, ΦX DNA sequences could be interpreted in terms of the genetic code.
• Analysis of mutations in genes identified by traditional phage genetics, combined with amino acid sequence data, allowed phage genes to be located on the DNA sequence.
• For the first time, a DNA sequence revealed long open reading frames that could be assigned to genes identified by traditional methods.
• It was clear that significant portions of the genome were translated in more than one reading frame to produce two different proteins.

Sequences, Sequences, Sequences
• With the introduction of gel-based sequencing methods, the rate of DNA sequencing increased.
• Likewise, the useful read length of dideoxy sequencing increased from about 100 to about 400 nucleotides, helped by:
  • The use of very thin sequencing gels
  • 35S labeling of DNA, which gives sharper bands than 32P because it emits lower-energy beta particles

The Birth of Bioinformatics
• Beginning with ΦX, the management and analysis of sequence data became a major undertaking.
• McCallum wrote the first programs to help with the compilation and analysis of DNA sequences. These programs:
  • Compiled and numbered the complete sequence
  • Allowed editing of the previously compiled sequence
  • Searched the sequence for specific short sequences or families of sequences (restriction sites)
  • Translated the sequence in all reading frames
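
Two of the tasks listed above, searching for short sequences and translating in all reading frames, are easy to sketch in modern Python. This is not a reconstruction of the original programs: the function names are hypothetical, the input is assumed to be uppercase A/C/G/T only, and the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_sites(sequence: str, site: str) -> list[int]:
    """Return the 0-based start of every (possibly overlapping) occurrence."""
    return [
        i
        for i in range(len(sequence) - len(site) + 1)
        if sequence.startswith(site, i)
    ]


def translate(sequence: str, frame: int) -> str:
    """Translate one forward frame (0, 1, or 2); '*' marks a stop codon."""
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(sequence: str) -> dict[str, str]:
    """Translate the three frames of each strand."""
    frames = {}
    for label, strand in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            frames[f"{label}{frame + 1}"] = translate(strand, frame)
    return frames


frames = six_frame_translation("ATGGCCTAA")
site_positions = find_sites("AAGAATTCTTTTGAATTCGG", "GAATTC")
```

Incomplete codons at the end of a frame are simply dropped, which keeps the sketch short at the cost of ignoring partial codons.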

The Birth of Bioinformatics
• Dayhoff established:
  • A protein sequence database
  • The first collection of nucleotide sequence information
• NIH creation of GenBank
• Methods for sequence alignment and comparison followed:
  • FASTA and BLAST made it practical to identify genes in a new sequence by comparing it with sequences already in the database.

Automated Sequencing Factories
• 1986: Caltech and ABI published the first report of automated DNA sequencing. Data could be collected directly by a computer, without autoradiography of the sequencing gel.
• A primer with a different fluorescent label was used in each of the four dideoxy sequencing reactions.
• The reactions were combined and electrophoresed in a single polyacrylamide tube gel; as the bands passed a detector, each of the four colors could be distinguished.

Automated Sequencing Factories
• NIH: Venter set up a sequencing facility with 6 automated sequencers and 2 Catalyst robots.
• 1992: Venter established The Institute for Genomic Research (TIGR), expanding the sequencing operation to 30 sequencers and 17 robots.
• It was the first factory with teams dedicated to different steps in the sequencing process.
• Data analysis was added to each phase to quickly detect and correct sequencing problems.

Cellular Genomes
• 1995: Craig Venter’s group (TIGR) reported the complete genome sequences of two bacterial species: Haemophilus influenzae and Mycoplasma genitalium.
• H. influenzae gave the first glimpse of the complete information set for a living organism.
• The M. genitalium sequence showed us an approximation of the minimal set of genes required for cellular life.
• The H. influenzae project introduced the whole genome shotgun (WGS) method for sequencing cellular organisms.
• DNA is randomly fragmented and cloned. Clones are sequenced at random, and the reads are reassembled by computer.
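
The "reassembled by computer" step can be illustrated with a toy greedy overlap assembler. This is a sketch for intuition only and is not the method used by TIGR's software: the function names are hypothetical, reads are assumed to be error-free and all from the same strand, and the toy genome has no repeats, which is what makes naive greedy merging work here.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


# Hypothetical 20-base "genome" and three overlapping reads, given out of order.
toy_genome = "ACGTTGCAAGGCTTACCGAT"
toy_reads = ["GCTTACCGAT", "ACGTTGCAA", "GCAAGGCTT"]
assembled = greedy_assemble(toy_reads)
```

With real data, repeats and sequencing errors break this simple approach, which is one reason dedicated assemblers and strategies such as paired ends matter.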

Cellular Genomes
• Adoption of the ‘paired ends’ strategy is perhaps the most important improvement to shotgun sequencing.
• The automated sequencing procedure used on H. influenzae used denatured (melted) double-stranded DNA as the template, whereas the human cytomegalovirus (HCMV) project had to use single-stranded vectors. With double-stranded templates, each clone could be sequenced from both ends.
• TIGR assembler: designed to handle the thousands of sequence reads involved in even the smallest cellular genome projects.

Cellular Genomes
• These advances led to a steady stream of completed genome sequences:
  • E. coli
  • Bacillus subtilis
  • Saccharomyces cerevisiae
  • Caenorhabditis elegans
  • Drosophila melanogaster
  • Eventually, humans
• 1996: ABI introduced the first commercial DNA sequencer that used capillary electrophoresis rather than a slab gel
• 1998: ABI Prism 3700 with 96 capillaries

Sequencing the Human Genome
• 1985: Robert Sinsheimer formally organized a meeting on human genome sequencing at the University of California, Santa Cruz
• 1985: DeLisi and Smith commissioned the first Santa Fe conference, funded by the DOE, to study the feasibility of a Human Genome Initiative
• 1990: DOE and NIH presented a 5-year US Genome Project plan to Congress; the full project was estimated at 15 years and ~$3 billion
• The publicly funded effort became an international collaboration between sequencing centers in the US, Europe, and Japan, each focusing on a particular region of the genome.

Sequencing the Human Genome
• 1999: The Human Genome Project celebrated passing the billion base-pair mark and the first human chromosome to be completely sequenced, chromosome 22.
• 2000: Clinton and Blair publicly announced draft versions of the human genome sequence
• 2001: Draft human genome sequences were published in Science and Nature.

Next-Generation Sequencing Technology
• Newly emerging methods are, for the first time, challenging the supremacy of the dideoxy method.
• Massively parallel: the number of sequence reads from a single experiment is vastly greater
• Pyrosequencing: shotgun sequencing of whole genomes without cloning in E. coli or any other host cell.
• Solexa technology: uses reversible chain-terminating nucleotides, making chain termination a reversible process.
• Nanopore sequencing

Genomic Medicine
• All disease has a genetic basis, whether in genes inherited by the affected individual, environmentally induced genetic changes, or the genes of a pathogen and their interaction with the infected individual.
• Sequencing of the human genome and of major pathogens is beginning to have an impact on:
  • Diagnosis, treatment, and prevention of disease
  • Potential targets of drug therapy and vaccine candidates
• Predicted era of personalized medicine

Metagenomics
• We lack a comprehensive view of the genetic diversity on Earth because only a very small fraction of the microbes found in nature have been grown in pure culture.
• Metagenomics isolates DNA directly from environmental samples and sequences it, without attempting to culture the organisms from which it comes.
• Metagenomics is currently being applied to study microbial populations in many environments, such as the human gut.

Looking to the Future
• The amount of nucleotide sequence in databases increased exponentially, by nine orders of magnitude, from 1965 to 2005. This corresponds to an average doubling time of about 16 months.
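
The doubling time quoted above follows directly from the two figures on the slide (a 10^9-fold increase over 40 years). A short check, with a hypothetical helper name, assuming steady exponential growth over the whole period:

```python
import math


def doubling_time_months(fold_increase: float, years: float) -> float:
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(fold_increase)  # number of doublings needed
    return years * 12 / doublings


# Nine orders of magnitude between 1965 and 2005.
months = doubling_time_months(1e9, 2005 - 1965)
```

A 10^9-fold increase is just under 30 doublings, and 480 months divided by that number gives roughly 16 months per doubling.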

Interpreting Our Growth
• Inflections in the curve correspond to technical innovations, suggesting we are on the verge of the next generation of massively parallel sequencers.
• It appears possible that methods for collecting sequence data could soon outstrip our capacity to analyze that data adequately, making fundamental advances in computation and bioinformatics essential to our continued progress.

Till Next Week…

A bit of trivia:
• 1953: Double-helix structure:
  • Watson was a 24-year-old postdoctoral fellow.
  • Crick was still a grad student in his mid-30s.
• Brains are not enough; remember your courage!
  • “Once you get your courage up and believe that you can do important things, then you can” ~Hamming~
• Always believe and doubt your hypothesis
  • Believe enough to move forward, doubt enough to find the errors
