DNA sequencing: methods I. Brief history of sequencing II. Sanger dideoxy method for sequencing III. Sequencing large pieces of DNA IV. The “$1,000 genome” On WebCT -- “The $1000 genome” -- review of new sequencing techniques by George Church
Why sequence DNA? • All genes available for an organism to use -- a very important tool for biologists • Not just sequence of genes, but also positioning of genes and sequences of regulatory regions • New recombinant DNA constructs must be sequenced to verify construction or positions of mutations • Etc.
History of DNA sequencing MC chapter 12
Methods of sequencing • Sanger dideoxy (primer extension/chain-termination) method: most popular protocol for sequencing, very adaptable, scalable to large sequencing projects • Maxam-Gilbert chemical cleavage method: DNA is labelled and then chemically cleaved in a sequence-dependent manner. This method is not easily scaled and is rather tedious • Pyrosequencing: measuring chain extension by pyrophosphate monitoring
for dideoxy sequencing you need: Single stranded DNA template A primer for DNA synthesis DNA polymerase Deoxynucleoside triphosphates and dideoxynucleotide triphosphates
Primers for DNA sequencing • Oligonucleotide primers can be synthesized by phosphoramidite chemistry--usually designed manually and then purchased • Sequence of the oligo must be complementary to DNA flanking the sequenced region • Oligos are usually 15-30 nucleotides in length
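As a rough illustration of the complementarity rule above, this Python sketch derives a primer that would anneal to the 3’ end of a single-stranded template. The function names, the example template, and the chosen primer length are assumptions made for illustration; real primer design also weighs melting temperature and secondary structure, which this sketch ignores.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def primer_for_template(template, length=20):
    """Primer (5'->3') that anneals to the 3' end of a single-stranded template.

    The primer is the reverse complement of the template's last `length`
    bases, so it base-pairs with the region flanking the DNA to be sequenced.
    """
    if not 15 <= length <= 30:
        raise ValueError("primers are usually 15-30 nucleotides long")
    if len(template) < length:
        raise ValueError("template is shorter than the requested primer")
    return reverse_complement(template[-length:])

# Hypothetical template, written 5'->3'
template = "ATGCACCACCACCACCACCACCCCATGGGTATGAATAAGC"
primer = primer_for_template(template, length=18)
print(primer)
```

The polymerase extends the primer from its 3’ end, so the new strand is built back across the template toward the template's 5’ end.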
DNA templates for sequencing: • Single stranded DNA isolated from recombinant M13 bacteriophage containing DNA of interest • Double-stranded DNA that has been denatured • Non-denatured double stranded DNA (cycle sequencing)
One way for obtaining single-stranded DNA from a double stranded source--magnets
Reagents for sequencing: DNA polymerases • Should be highly processive, and incorporate ddNTPs efficiently • Should lack exonuclease activity • Thermostability required for “cycle sequencing”
Sanger dideoxy sequencing--basic method Single stranded DNA 5’ 3’ 3’ 5’ a) Anneal the primer
Sanger dideoxy sequencing: basic method 5’ Direction of DNA polymerase travel b) Extend the primer with DNA polymerase in the presence of all four dNTPs, with a limited amount of a dideoxy NTP (ddNTP) 3’
DNA polymerase incorporates ddNTP in a template-dependent manner, but it works best if the DNA pol lacks 3’ to 5’ exonuclease (proofreading) activity
Sanger dideoxy sequencing: basic method 5’ 3’ T T T T 3’ 5’ ddATP in the reaction: anywhere there’s a T in the template strand, occasionally a ddA will be added to the growing strand ddA ddA ddA ddA
How to visualize DNA fragments? • Radioactivity • Radiolabeled primers (5’ end-labeled with 32P using kinase and gamma-32P ATP) • Radiolabeled dNTPs (alpha-35S or alpha-32P, so the label is incorporated into the DNA) • Fluorescence • ddNTPs chemically synthesized to contain fluors • Each ddNTP fluoresces at a different wavelength allowing identification
Analysis of sequencing products: Polyacrylamide gel electrophoresis--good resolution of fragments differing by a single nucleotide • Slab gels: as previously described • Capillary gels: require only a tiny amount of sample to be loaded, run much faster than slab gels, best for high throughput sequencing
DNA sequencing gels: old school Different ddNTP used in separate reactions Analyze sequencing products by gel electrophoresis, autoradiography Radioactively labelled primer or dNTP in sequencing reaction
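The four-lane readout can be sketched in Python as follows. The code assumes an idealized reaction in which every possible termination product is present. The template, the primer length of 20, and the function names are hypothetical.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def new_strand_from_template(template_3to5):
    """Strand synthesized 5'->3' opposite a template that is written 3'->5'."""
    return "".join(PAIR[base] for base in template_3to5.upper())

def chain_termination_lanes(new_strand, primer_len=20):
    """Map each ddNTP reaction (lane) to the fragment lengths it produces.

    A fragment ends wherever the matching ddNTP was incorporated, so the
    ddA lane holds one band for every A in the newly synthesized strand.
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(primer_len + position)
    return lanes

def read_gel(lanes):
    """Read the sequence from the shortest band to the longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

template = "TGCATTGACT"  # hypothetical template region, written 3'->5'
new_strand = new_strand_from_template(template)
lanes = chain_termination_lanes(new_strand)
print(lanes["A"])       # band sizes in the ddATP lane
print(read_gel(lanes))  # sequence of the synthesized strand, 5'->3'
```

Reading the bands from bottom to top gives the synthesized strand, which is the complement of the template.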
cycle sequencing: denaturation occurs during temperature cycles 94°C:DNA denatures 45°C: primer anneals 60-72°C: thermostable DNA pol extends primer Repeat 25-35 times Advantages: don’t need a lot of template DNA Disadvantages: DNA pol may incorporate ddNTPs poorly
Animation of cycle sequencing: see http://www.dnai.org/ Click on: “manipulation” “techniques” “sorting and sequencing”
An automated sequencer The output
Current trends in sequencing: It is rare for labs to do their own sequencing: --costly, perishable reagents --time consuming --success rate varies Instead most labs send out for sequencing: --You prepare the DNA (usually plasmid, M13, or PCR product), supply the primer, company or university sequencing center does the rest --The sequence is recorded by an automated sequencer as an “electropherogram”
BREAK UP THE GENOME, PUT IT BACK TOGETHER Assemble sequences by matching overlaps ~160 kbp BAC sequence ~1 kbp BAC overlaps give genome sequence
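A toy Python sketch of assembling sequences by matching overlaps is given here. It uses a simple greedy merge, which is an assumption for illustration and not the method of any particular assembler. Real assemblers must also cope with sequencing errors, repeats, and reads from both strands. The reads and function names are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]
    return contigs

reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]  # hypothetical overlapping reads
print(greedy_assemble(reads))
```

When no overlap of at least `min_overlap` bases remains, the function returns several contigs rather than one, which is the situation that gap closure then has to resolve.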
Sequencing large pieces of DNA: the “shotgun” method • Break DNA into small pieces (fragments of around 1000 base pairs are preferable) • Clone pieces of DNA into M13 • Sequence enough M13 clones to ensure complete coverage (e.g. sequencing a 3 million base pair genome would require 5x to 10x that amount, 15 to 30 million base pairs of sequence, for a reliable representation of the genome) • Assemble genome through overlap analysis using computer algorithms, also “polish” sequences using mapping information from individual clones, characterized genes, and genetic markers • This process is assisted by robotics
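The coverage arithmetic can be worked through in a few lines of Python. The read length of 500 bases is an assumption, and the estimate of unsequenced bases uses a simple Poisson (Lander-Waterman style) model that the slides do not mention. Treat both as illustrative.

```python
import math

def reads_needed(genome_bp, read_bp, coverage):
    """Number of reads required to reach a given fold coverage."""
    return math.ceil(coverage * genome_bp / read_bp)

def expected_fraction_unsequenced(coverage):
    """Poisson estimate of the chance that a base is covered by no read."""
    return math.exp(-coverage)

genome_bp = 3_000_000  # genome size used in the slide's example
read_bp = 500          # assumed average read length

for coverage in (5, 10):
    n_reads = reads_needed(genome_bp, read_bp, coverage)
    missed = expected_fraction_unsequenced(coverage) * genome_bp
    print(f"{coverage}x: {n_reads} reads, about {missed:.0f} bases expected unsequenced")
```

Under this model, going from 5x to 10x coverage doubles the number of reads but lowers the expected number of missed bases by more than a hundredfold, which is one way to see why redundancy in that range is used.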
Sequencing done by TIGR (Maryland) and The Sanger Institute (Cambridge, UK) “Here we report an analysis of the genome sequence of P. falciparum clone 3D7, including descriptions of chromosome structure, gene content, functional classification of proteins, metabolism and transport, and other features of parasite biology.”
Sequencing strategy A whole chromosome shotgun sequencing strategy was used to determine the genome sequence of P. falciparum clone 3D7. This approach was taken because a whole genome shotgun strategy was not feasible or cost-effective with the technology that was available at the beginning of the project. Also, high-quality large insert libraries of (A + T)-rich P. falciparum DNA have never been constructed in Escherichia coli, which ruled out a clone-by-clone sequencing strategy. The chromosomes were separated on pulsed field gels, and chromosomal DNA was extracted…
The shotgun sequences were assembled into contiguous DNA sequences (contigs), in some cases with low coverage shotgun sequences of yeast artificial chromosome (YAC) clones to assist in the ordering of contigs for closure. Sequence tagged sites (STSs) [10], microsatellite markers [11,12] and HAPPY mapping [7] were also used to place and orient contigs during the gap closure process. The high (A + T) content of the genome made gap closure extremely difficult [7–9]. Chromosomes 1–5, 9 and 12 were closed, whereas chromosomes 6–8, 10, 11, 13 and 14 contained 3–37 gaps (most less than 2.5 kb) per chromosome at the beginning of genome annotation. Efforts to close the remaining gaps are continuing.
Methods: Sequencing, gap closure and annotation The techniques used at each of the three participating centres for sequencing, closure and annotation are described in the accompanying Letters [7–9]. To ensure that each centre’s annotation procedures produced roughly equivalent results, the Wellcome Trust Sanger Institute (‘Sanger’) and the Institute for Genomic Research (‘TIGR’) annotated the same 100-kb segment of chromosome 14. The number of genes predicted in this sequence by the two centres was 22 and 23; the discrepancy being due to the merging of two single genes by one centre. Of the 74 exons predicted by the two centres, 50 (68%) were identical, 9 (12%) overlapped, 6 (8%) overlapped and shared one boundary, and the remainder were predicted by one centre but not the other. Thus 88% of the exons predicted by the two centres in the 100-kb fragment were identical or overlapped.
The $1,000 genome Venter Foundation (2003): The first group to produce a technology capable of a $1000 human genome will win $500,000 … X - Prize Foundation: no, $5 - 20 million … National Institutes of Health (2004): $70 million grant program to reach the $1000 genome
Previous sequencing techniques: one DNA molecule at a time Needed: many DNA molecules at a time -- arrays One of these: “pyrosequencing” Cut a genome to DNA fragments 300 - 500 bases long Immobilize single strands on a very small plastic bead (one piece of DNA per bead) Amplify the DNA on each bead to cover each bead to boost the signal Separate each bead on a plate with up to 1.6 million wells
Sequence by DNA polymerase-dependent chain extension, one base at a time in the presence of a reporter (luciferase) Each nucleotide addition by DNA polymerase releases pyrophosphate (PPi); ATP sulfurylase converts the PPi (plus APS) to ATP, and luciferase uses that ATP to emit light Flashes of light and their intensity are recorded
Extension with individual dNTPs gives a readout A B The readout is recorded by a detector that measures position of light flashes and intensity of light flashes A B
25 million bases in about 4 hours From www.454.com APS = Adenosine phosphosulfate
Height of peak indicates the number of dNTPs added This sequence: TTTGGGGTTGCAGTT
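The relationship between peak height and homopolymer length can be sketched in Python using the sequence from this slide. The repeating flow order T, A, C, G is an assumption, as are the function names. Real signals are analog intensities, so the decoder rounds each peak to the nearest whole number of bases.

```python
from itertools import cycle

def sequence_to_flowgram(seq, flow_order="TACG"):
    """Ideal peak heights when dNTPs are flowed one at a time in a fixed order.

    Each flow extends the strand through the whole run of matching bases,
    so the peak height equals the length of that homopolymer run.
    """
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows, pos = [], 0
    for base in cycle(flow_order):
        if pos >= len(seq):
            break
        run = 0
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
    return flows

def flowgram_to_sequence(flows):
    """Decode (base, peak height) pairs back into a sequence."""
    return "".join(base * round(height) for base, height in flows)

seq = "TTTGGGGTTGCAGTT"  # sequence shown on the slide
flows = sequence_to_flowgram(seq)
print(flows[:4])
print(flowgram_to_sequence(flows))
```

A flow that yields no light means that base is not next in the sequence. Long homopolymer runs are harder to call in practice because the difference between, say, 7 and 8 bases is a small fraction of the peak.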
Introduction to bioinformatics • Making biological sense of DNA sequences • Online databases: a brief survey • Database in depth: NCBI • What is BLAST? • Using BLAST for sequence analysis • “Biology workbench”, etc. www.ncbi.nlm.nih.gov www.tigr.org http://workbench.sdsc.edu
There’s plenty of DNA to make sense of http://www.genomesonline.org/ (2006)
Making sense of genome sequences: • Genes • Protein-coding • Where are the open reading frames? • What are the ORFs most similar to? (What is the function/structure/evolution history?) • RNA • Non-genes • Regulation: promoters and factor-binding sites • Transactions: replication, repair, and segregation, DNA packaging (nucleosomes)
Sequence output Raw data Computer calls GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
atgttgtatttgtctgaagaaaataaatccgtatccactccttgccctcctgataagattatctttgatgcagagaggggggagtacatttgctctgaaactggagaagttttagaagataaaattatagatcaagggccagagtggagggccttcacgccagaggagaaagaaaagagaagcagagttggagggcctttaaacaatactattcacgataggggtttatccactcttatagactggaaagataaggatgctatgggaagaactttagaccctaagagaagacttgaggcattgagatggagaaagtggcaaattagaatgttgtatttgtctgaagaaaataaatccgtatccactccttgccctcctgataagattatctttgatgcagagaggggggagtacatttgctctgaaactggagaagttttagaagataaaattatagatcaagggccagagtggagggccttcacgccagaggagaaagaaaagagaagcagagttggagggcctttaaacaatactattcacgataggggtttatccactcttatagactggaaagataaggatgctatgggaagaactttagaccctaagagaagacttgaggcattgagatggagaaagtggcaaattaga What does this sequence do? Could it encode a protein?
Where are the potential starts (ATG) and stops (TAA, TAG, TGA)? Which reading frame is correct? ORF map = ATG = stop codon Reading frame #1 appears to encode a protein
Cautions in ORF identification • Not all genes initiate with ATG, particularly in certain microbes (archaea) • What is the shortest possible length of a real ORF? 50 amino acids? 25 amino acids? Cut-off is somewhat arbitrary. • In eukaryotes, ORFs can be difficult to identify because of introns • Are there other sequences surrounding the ORF that indicate it might be functional? • promoter sequences for RNA polymerase binding • Shine-Dalgarno sequences for ribosome binding?
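A minimal Python sketch of an ORF scan follows. It looks only for ATG starts and the three standard stop codons on the forward strand, and it treats the minimum length as an adjustable cut-off, reflecting the cautions above. The function name, the cut-off values, and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=50):
    """Return (frame, start, end) for ATG...stop ORFs in the 3 forward frames.

    `frame` is 1, 2 or 3; `start` and `end` are 0-based slice positions, so
    seq[start:end] runs from the ATG through the stop codon. ORFs with fewer
    than `min_codons` codons before the stop are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs

seq = "CCATGAAATTTGGGTAACC"  # hypothetical short sequence
print(find_orfs(seq, min_codons=3))
print(find_orfs(seq, min_codons=50))  # too short for the stricter cut-off
```

To cover all six reading frames, the same function can be run on the reverse complement of the sequence. Alternative start codons and intron-containing genes would need a different approach.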
What is the function of the sequenced gene? Classical methods: -- mutate gene, characterize phenotype for clues to function (genetics) -- purify protein product, characterize in vitro (biochemistry) Comparison to previously characterized genes: -- genes sequences that have high sequence similarity usually have similar functions -- if your gene has been previously characterized (using classical methods) by someone else, you want to know right away! (avoid duplication of labor)
NCBI NCBI home page --Go to www.ncbi.nlm.nih.gov for the following pages Pubmed: search tool for literature--search by author, subject, title words, etc. All databases: “a retrieval system for searching several linked databases” BLAST: Basic Local Alignment Search Tool OMIM: Online Mendelian Inheritance in Man Books: many online textbooks available Tax Browser: A taxonomic organization of organisms and their genomes Structure: Clearinghouse for solved molecular structures
What does BLAST do? Searches chosen sequence database and identifies sequences with similarity to test sequence Ranks similar sequences by statistical significance (E value: the number of equally good matches expected by chance, so lower is better) Illustrates alignment between test sequence and similar sequences
Alignment of sequences: The principle: two homologous sequences derived from the same ancestral sequence will have at least some identical (similar) amino acid residues Fraction of identical amino acids is called “percent identity” Similar amino acids: some amino acids have similar physical/chemical properties, and more likely to substitute for each other--these give specific similarity scores in alignments Gaps in similar/homologous sequences are rare, and are given penalty scores
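Percent identity can be computed from an existing alignment with a few lines of Python. This sketch assumes the two sequences are already aligned, with "-" marking gaps. It divides by the number of alignment columns, which is one of several conventions in use. It does not score similar amino acids, which would require a substitution matrix. The example alignment is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        raise ValueError("alignment has no residues")
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)

def count_gaps(aligned_a, aligned_b, gap="-"):
    """Number of columns in which exactly one sequence has a gap."""
    return sum(1 for a, b in zip(aligned_a, aligned_b) if (a == gap) != (b == gap))

a = "MKT-AILV"  # hypothetical aligned protein fragments
b = "MKSGAI-V"
print(percent_identity(a, b))
print(count_gaps(a, b))
```

Because the denominator convention changes the result, a reported percent identity is best quoted together with the alignment length.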
Homology of proteins Homology: similarity of biological structure, physiology, development, and evolution, based on genetic inheritance Homologous proteins: statistically similar sequence, therefore similar functions (often, but not always…) Alignment of TFB and TFIIB sequences
High sequence similarity correlates with functional similarity (figure panels: enzymes, non-enzymes) 20-40% identity: fold can be predicted by similarity but precise function cannot be predicted (the 40% rule)
Programs available for BLAST searches Protein sequence (this is the best option) blastp--compares an amino acid query sequence against a protein sequence database tblastn--compares a protein query sequence against a nucleotide sequence database translated in all six reading frames DNA sequence blastn--compares a nucleotide query sequence against a nucleotide sequence database blastx--compares a nucleotide query sequence translated in all six reading frames against a protein sequence database tblastx--compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
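The six-frame translation that blastx, tblastn and tblastx rely on can be sketched in Python as follows. This illustrates the idea and is not BLAST's own implementation. It assumes the standard genetic code, and the function names and example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons; '*' marks a stop, 'X' an ambiguous codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translations(seq):
    """Translate three frames of the given strand and three of its reverse complement."""
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

frames = six_frame_translations("ATGCACCACTAA")  # hypothetical short query
for name, protein in sorted(frames.items()):
    print(name, protein)
```

Searching at the protein level in this way can detect similarity that is no longer visible at the nucleotide level, since different codons can encode the same amino acid.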