280 likes | 508 Views
Last lecture summary. recombinant DNA technology DNA polymerase (copy DNA), restriction endonucleases (cut DNA), ligases (join DNA) DNA cloning – vector (plasmid, BAC), PCR genome mapping. relative locations of genes are established by following inheritance patterns.
E N D
recombinant DNA technology • DNA polymerase (copy DNA), restriction endonucleases (cut DNA), ligases (join DNA) • DNA cloning – vector (plasmid, BAC), PCR • genome mapping relative locations of genes are established by following inheritance patterns visual appearance of a chromosome when stained and examined under a microscope the order and spacing of the genes, measured in base pairs sequence map
genetic markers • polymorphic (alternative alleles) • restriction fragment length polymorphisms (RFLPs) • some restriction sites exist as two alleles • simple sequence length polymorphisms (SSLPs) • repeat sequences, minisatellites (repeat unit up to 25 bp), microsatellites (repeat unit of 2-4 bp) • single nucleotide polymorphisms (SNPs, pron.: “snips”) • Positions in a genome where some individuals have one nucleotide and others have a different nucleotide RFLP SSLP
DNA sequencing • Sanger method, chain-termination method, developed 1974, Nobel prize in chemistry 1980 • The key principle: use of dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators. dNTP ddNTP source: http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis
Shotgun sequencing • Current technology can only reliably sequence a short stretch – a ‘read’ is typically ~1000 bp. • However genomes are large. The sequence of a long DNA molecule has to be constructed from a series of shorter sequences. • This is done by breaking(cleaving by restriction endonuclease) the molecule into fragments, determining their sequences, and using a computer to search for overlaps and build up the master sequence • This shotgun sequencing is the standard approach for sequencing small prokaryotic genomes. • But is much more difficult with larger eukaryotic genomes, as it can lead to errors when repeatsare analyzed. • human genome is repeat-rich, >50% repeats (50-500 kpb duplicated regions with >98% identity)
Target Copies Shotgun Sequence each short piece Sequence assembly Consensus Finalizing (directed read) source: slides by Martin Farach-Colton
source: Brown T. A. , Genomes. 2nd ed. http://www.ncbi.nlm.nih.gov/books/NBK21129/
Human genome project (HGP) • Determine the sequence of haploid human genome • Govermentally funded (DOE) • Began in 1990, working draft published in 2001, complete in 2003, last chromosome finished in 2006 • Cost: $3 billion • Whose genome was sequenced? • The “reference genome” is a composite from several people who donated blood samples.
Celera - competition begins • In 1998, a similar, privately funded quest was launched by the American researcher Craig Venter, and his company Celera Genomics. • The $300,000,000 Celera effort was intended to proceed at a faster pace and at a fraction of the cost. • Celera wanted to patent identified genes. • Celera promised to release data annually (while the HGP daily). However, Celera would, unlike HGP, not permit free redistribution or scientific use of the data. • HGP was compelled to release (7.7. 2000) the first draft of the human genome before Celera for this reason.
How did it end? • March 2000 – president Clinton announced that the genome sequence could not be patented, and should be made freely available to all researchers. • The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in two days. • Celera and HGP annouced jointly the draft sequence in 2000. • The drafts covered about 83% of the genome. • Improved drafts were announced in 2003 and 2005, filling in to ≈92% of the sequence currently.
Human genome • 3 billions bps, ~20 000 – 25 000 genes • Only 1.1 – 1.4 % of the genome sequence codes for proteins. • State of completion: • best estimate – 92.3% is complete • problematic unfinished regions: centromeres, telomeres (both contain highly repetitive sequences), some unclosed gaps • It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed • Genome is stored in databases • Primary database – Genebank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide) • Additional data and annotation, tools for visualizing and searching • UCSCS (http://genome.ucsc.edu) • Ensembl (http://www.ensembl.org)
Hierarchical genome shotgun – HGS • Hierarchical genome shotgun, hierarchical shotgun sequencing, clone-by-clone sequencing, map-based shotgun sequencing, clone contig sequencing • Adopted by HGP • Strategy “map first, sequence second” • Create physical map • Divide chromosomes to smaller fragments. • Order (map) them to correspond to their respective locations on the chromosomes. • Determine the base sequence of each of the mapped fragments.
Hierarchical genome shotgun – HGS • Map genome • As genetic markers (landmarks), short tagged sites (STS) were used (200 to 500 base pair DNA sequence that has a single occurrence in the genome) • Copy target DNA • Make BAC library • cleave (partial cleavage by restriction endonuclease) all target DNA copies randomly, insert these sub-clones into BACs • Physically map all BACs • Find a subset of BACs that cover target DNA • minimal tiling path • Shotgun sequence only BACs at minimal tailing path • Divide BACs into fragments (ultrasound or pressure), do plasmid cloning, reconstruct BAC sequence • Fill in gaps between BACS • Merge into consensus sequence The sequenced sub-clones are linked up to produce the DNA contig, which is the de-coded version of the original source DNA. As this method progresses, larger and larger contigs will be produced, until a single ordered contig of the genome is achieved.
http://www.nature.com/scitable/content/idealized-representation-of-the-hierarchical-shotgun-sequencing-48221http://www.nature.com/scitable/content/idealized-representation-of-the-hierarchical-shotgun-sequencing-48221
Minimal tiling path A collection of overlapping bacterial artificial chromosome (BAC) clones. The clones outlined in red, which provide a minimal tiling path across the corresponding genomic region, are selected for sequencing.
Coverage • As it was shown, individual nucleotides are represented more than with one read. • Coverage is the average number of reads representing a given nucleotide in the reconstructed sequence. • Let’s say that for a source strand of length G = 100 Kbp we sequence R = 1 500 reads of average legth L = 500. • Thus, we collect N = RL = 750 Kbps of data. • So we have sequenced on average every bp in the source N/G = 7.5 times. • The coverage is 7.5X • Coverage in HGS adopted by HGP was 8X.
Whole genome shotgun – WGS • Adopted by Celera. • De facto application of shotgun to large genome. Never done before on such a large scale. • Expensive and time consuming mapping is not performed. • Each piece of DNA is cut into smaller fragments. Each fragment is sequenced first, and then overlapping sequences are joined together to create the contig. • To achieve enough of accuracy, higher coverage (20X) had to be used. • Crucial for the assembly was development of new algorithms.
Genome assembly • Aligning and merging short fragments of DNA sequence in order to reconstruct the original (loger) sequence. • reads – typically 500-1000 bp, merge them into contigs, arrange/merge contigs into scaffolds • scaffold – a series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence source: Xiong, Essential Bioinformatics
Genome assembly • Can be very computationally intensive when dealt at the whole genome level. • Major challenges: • sequence errors – can be corrected by drawing consensus sequence from an alignment of multiple overlapped sequences • contamination by bacterial vectors – can be removed using filtering programs prior to assembly • repeats – RepeatMasker (http://www.repeatmasker.org/) can be used to detect and mask repeats
Forward-reverse constraint • Common constraint to avoid errors due to the repeats: forward-reverse constraint • Sequence is generated from both ends of a clone → distance between the two opposing fragments of a clone is fixed to a certain range (defined by a clone length). When the constraint is applied, even when one of the fragments has a perfect match with a repetitive element outside the range, it is not able to be moved to that location to cause missassembly. constraint no constraint red fragment is misassembled source: Xiong, Essential Bioinformatics
Sequence assemblers • base calling – convert raw/processed data from a sequencing instrument into sequences and scores • individual bases have scores, reflect the likelihood the base is correct/incorrect • in capillary sequencing, identify the sequence from chromatogram source: Lee SH, Vigliotti VS, Pappu S., J Clin Pathol. 2010, 63(3) 235-9 PMID: 19858529
PHRED • Base caller • Reads DNA sequence chromatogram files and analyzes the peaks to call bases. • base quality score – PHRED examines peaks around each base call to assign a score q to each base call that is logarithmically linked to the error propability P: , typically taken q = 20 corresponding to 1% error probability (99% correct base calling) • Phred is a two-step process: • Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases • Application: Given new reads and quality statistics, predict the quality for each of the bases.
PHRAP • Sequence assembly • Takes PHRED base-call files with quality scores as input. • Aligns individual fragments in a pairwise fashion. The base quality information is taken into account during the pairwise alignment.
Personal human genomes • Personal genomes had not been sequenced in the Human Genome Project to protect the identity of volunteers who provided DNA samples. • Following personal genomes are available by July 2011: • Japanese male (2010, PMID: 20972442) • Korean male (2009, PMID: 19470904) • Chinese male (2008, PMID: 18987735) • Nigerian male (2008, PMID: 18987734) • J. D. Watson (2008, PMID: 18421352) • J. C. Venter (2007, PMID: 17803354) • HGP sequence is haploid, however, the sequence maps for Venter and Watson for example are diploid, representing both sets of chromosomes.