Making Sense of Genomes
Interpreting genomes • DNA was discovered in the mid- to late 1800s, during biochemical investigations of proteins, as a phosphorus-rich substance (called nuclein because it was isolated from white cell nuclei) • The term chromatin was coined to describe the colored components of cell nuclei after staining • The term chromosome, coined in 1888, describes threads of stainable material found within the nucleus • In the 1930s nuclein became desoxyribose nucleic acid, later DNA
Animal genome size variation • Ranges from 0.04 pg in a placozoan to ~133 pg in lungfish (a ~3300X difference) • Current dataset is skewed towards vertebrates (66%) • Some hagfish undergo chromatin loss: large fragments of the genome present in the germline are eliminated in the somatic line
Invertebrate genomes • At most, 1% have been looked at • Terrestrial snails and slugs have 2X larger genomes than freshwater relatives • Some annelid groups exhibit huge differences (120X range) unrelated to polyploidy • No information on tapeworm or tarantula – potential senior project.
Mechanisms for alterations in animal genome sizes • Insertion-deletion mechanism for genome reduction • Selfish DNA and spread of transposable elements • Accumulation of pseudogenes • Introns (clear lack in Fugu, the pufferfish) • Chromosome-level events: aneuploidy (duplication or loss of individual chromosomes); segments break off and fuse (chromosome 2 is fused together in humans, separate in apes); unequal crossing over in meiosis; unequal sister chromatid exchange during gamete formation • Polyploidy: duplication of the entire chromosome set • Satellite DNA: mini (9-100 bp; mostly 15 bp), short (3-5 bp), micro (1-5 bp; ignore overlap with short), copies of rDNA
Plants are well-known polyploids • Allopolyploidy – combination of genetically distinct chromosome sets • Autopolyploidy – multiplication of one basic set of chromosomes • Wheat is an allohexaploid containing three distinct sets of chromosomes from three different diploid species of goat-grass.
Genome size and phenotypes • As cells become larger, the surface-to-volume ratio changes (affecting the exchange rate with the environment) • Transcription is affected by cell size • Body size is a function of cell number, not cell size • Metabolism: higher metabolic rates trend with smaller genomes and cell sizes, lower rates with larger genomes (e.g. among birds, flightless birds have larger genomes)
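The surface-to-volume point can be made concrete with a little arithmetic. The sketch below assumes an idealized spherical cell, which is a simplification introduced here and not something the slides state; `surface_to_volume` is a hypothetical helper name.

```python
import math

def surface_to_volume(radius):
    """Surface-area-to-volume ratio of a sphere (assumed cell shape)."""
    surface = 4 * math.pi * radius ** 2
    volume = (4 / 3) * math.pi * radius ** 3
    return surface / volume  # simplifies to 3 / radius

# Doubling the radius halves the ratio, so a larger cell has
# proportionally less membrane per unit of volume to exchange across.
for r in (1, 2, 4):
    print(r, round(surface_to_volume(r), 3))
```

Under this assumption the ratio is 3/r, so it falls steadily as the cell grows.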
Reasons for correlation • Obviously a physical constraint: large genomes need more room • Does DNA act as a nucleoskeleton around which the nucleus is assembled? • A proportionate change in cell size is observed in response to polyploidization • Cell size is presumably due to the nature of the DNA as well as its amount
$1000 genome • Race for the prize • Methods • YouTube1 • Whose genome in the databases? • Venter – writing the code
[Figure: sequencing trace with base calls A G C T A G C A T C C G T A T] • Capillary and slab gel electrophoresis use a modified Sanger technology with fluorescent dyes • Typical reads of 500-750 nt on an hour timescale • Variation depending on sequencer
Innovations in DNA sequencing • Sequencing by synthesis • Cot-based analysis • Chip-based analysis, hybridization • Single molecule linear read, RNA polymerase • Nanopore technology • Different nucleotides = Different change in electric signal
Free Solution Electrophoresis • May improve separation time (no matrix) without losing read length • Label DNA molecules with a friction-increasing molecule such as streptavidin • Currently can read 100 bp, a long way to go…
Who needs electrophoresis? • Pyrosequencing • MALDI-TOF Mass Spectrometry • Sequencing by Hybridization • Massively Parallel Signature Sequencing • A testimony to innovative molecular biology • Single molecule methods
Pyrosequencing • Real-time sequencing that measures the release of PPi during DNA synthesis • Has been of particular use for SNP analysis • The first of the four deoxynucleotide triphosphates is added to the reaction; when the correct one is incorporated, PPi is released and measured using ATP sulfurylase-coupled ATP synthesis and luciferase • Wash and repeat with the next nucleotide
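The add-one-nucleotide, measure, wash, repeat cycle can be sketched as a toy simulation. This is only an illustration, not a model of the real chemistry: the function names, the fixed dispensation order `TACG`, and the idealized signal (one unit of light per incorporated base) are all assumptions introduced here.

```python
def simulate_flowgram(template, flow_order="TACG", max_cycles=50):
    """Toy pyrosequencing run (hypothetical helper, idealized signal).

    `template` is the strand being copied, listed in the order the
    polymerase reads it. One nucleotide is dispensed per flow; the
    signal is the number of bases incorporated in that flow, standing
    in for the PPi -> ATP -> luciferase light measurement.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    pos = 0
    flows = []
    for _ in range(max_cycles):
        for nt in flow_order:
            signal = 0
            # A run of identical bases is incorporated in a single flow.
            while pos < len(template) and complement[template[pos]] == nt:
                signal += 1
                pos += 1
            flows.append((nt, signal))  # then "wash" and dispense the next
            if pos == len(template):
                return flows
    return flows

def read_sequence(flows):
    """Rebuild the synthesized strand from (nucleotide, signal) pairs."""
    return "".join(nt * signal for nt, signal in flows)

flows = simulate_flowgram("AACGT")
print(flows[0])              # ('T', 2): two T's pair with the leading AA
print(read_sequence(flows))  # TTGCA
```

Flows with a signal of zero are the dispensations where the wrong nucleotide was offered, which is why the order of addition matters for reading the sequence back.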
Put the sequencing reactions through a mass spectrometer • [Figure: spectra of the C- and G-terminated oligonucleotides] • Current limit ~100 bp • Facilitated by sensitivity and high-throughput loading
Shotgun sequencing – 2 approaches • Hierarchical shotgun approach • Generate an overlapping set of intermediate-sized clones (e.g. bacterial artificial chromosomes with 200 kb inserts) and keep a map of them (mapping E. coli took 2 years) • Subject each of these clones to shotgun sequencing and use the map to get the whole sequence • Whole-genome shotgun (WGS) approach • Generate sequence reads directly from a whole-genome library • Use computational techniques to reassemble in one step • Used for Drosophila melanogaster (fruit fly) and by Celera Genomics (formed 1998) for the human genome
Overview of “Shotgun” Genomic Sequencing • Original DNA • Break DNA into random fragments (8-10X coverage)
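Coverage here means the average number of times each base of the genome is sequenced: number of reads × read length / genome size. A minimal sketch of that arithmetic follows; the function names and the 1,000,000 bp genome are hypothetical values chosen for illustration, while the 500-base read length and the 8-10X target come from the slides.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is covered by a read."""
    return num_reads * read_length / genome_size

def reads_needed(genome_size, target_coverage, read_length):
    """Smallest whole number of reads that reaches the target coverage."""
    total_bases = genome_size * target_coverage
    return (total_bases + read_length - 1) // read_length  # ceiling division

genome = 1_000_000  # hypothetical genome size in bp
print(reads_needed(genome, 8, 500))   # 16000
print(reads_needed(genome, 10, 500))  # 20000
print(coverage(16000, 500, genome))   # 8.0
```

This is an average only; because fragments are random, some regions end up covered more deeply than others and a few may be missed.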
Cloning vectors • 2-5 kb in pUC or M13 • 5-50 kb in phage or cosmid • 30-100 kb in P1 bacteriophage • 60-300 kb in BAC • 60-2000 kb in YAC
Overview of Genomic Sequencing • Original DNA • Break DNA into random fragments (8-10X coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end • Base calling performed by Phred software: http://www.phrap.org/ http://www.genome.org/cgi/reprint/8/3/175.pdf
Phred Software • Calls bases in four phases: • Predicting peaks (ideal locations) • Locating observed peaks • Matching observed to predicted • Finding missing peaks • http://www.genome.org/cgi/reprint/8/3/186.pdf • http://www.genome.org/cgi/reprint/8/3/175.pdf
Errors in Sequencing Reads • Each base call is assigned a quality score: q = -10 × log10(p), where p is the estimated error probability of the call (higher quality scores correspond to lower error probabilities) • Errors are associated with the vicinity of the peak; the following parameters are used to determine error probability on a TRAINING SET: • Peak spacing • Uncalled/called ratio (two window sizes) • Peak resolution • The result is a look-up table built into the Phred software
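The quality score formula can be checked with a short conversion sketch. It implements only the relation q = -10 × log10(p) and its inverse; the function names are hypothetical and this is not Phred's own code or its look-up table.

```python
import math

def error_prob_to_quality(p):
    """Quality score q for an estimated error probability p."""
    return -10 * math.log10(p)

def quality_to_error_prob(q):
    """Inverse relation: error probability implied by quality score q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(q, quality_to_error_prob(q))
```

Each step of 10 in q corresponds to a tenfold drop in error probability: q = 20 means a 1 in 100 chance that the call is wrong, q = 30 means 1 in 1000.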
Common Sources of Sequencing Errors • The first fifty or so peaks of a trace are noisy and unevenly spaced due to anomalous migration of short DNA fragments and to unreacted dye-primer and dye-terminator molecules. • Near the end of the trace, peaks become less evenly spaced due to less accurate trace processing and less well resolved as diffusion effects increase; the number of labeled molecules also decreases. • Compressions – most common in GC-rich regions, when bases near the end of a single-stranded fragment bind to a complementary region, forming a hairpin (which migrates more rapidly than expected) • The dye-terminator sequencing method helps resolve compressions, but has its own problems: “About 85% of high quality dye terminator errors resulted from a missing G peak following an A, or a missing A following a T,…” Ewing and Green, 1998.
Overview of Genomic Sequencing • Original DNA • Break DNA into random fragments (8-10X coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end • Assemble fragments of sequence that have been read: Contig 1, Contig 2
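The assembly step can be illustrated with a toy greedy overlap merge. This is a simplified sketch and not the algorithm used by any real assembler: the function names and the minimum overlap of 3 are assumptions, and it ignores sequencing errors, reverse complements, repeats, and reads contained inside other reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the longest overlap into a contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

# Three overlapping reads collapse into a single contig.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
# ['ATGCGTACGTTAGC']

# A read with no overlap stays separate, giving two contigs.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "TTTTCCCC"]))
# ['TTTTCCCC', 'ATGCGTACGTTA']
```

The second call shows how gaps in coverage leave the assembly split into separate contigs, as in the Contig 1 / Contig 2 picture.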