
Genome Sequencing and genome viewers

Genome Sequencing and genome viewers. 6/10/2014. Lecture outline: Overview of genome sizes and organization. Overview of sequencing a genome. Next generation sequencing. High-throughput experiments by sequencing. Genome browsers.

Presentation Transcript


  1. Genome Sequencing and genome viewers 6/10/2014

  2. Lecture outline • Overview of genome sizes and organization • Overview of sequencing a genome • Next generation sequencing • High-throughput experiments by sequencing • Genome browsers

  3. Genome sizes

  4. Genome sequencing • Approach depends on the size, complexity and available data for a given organism • De novo sequencing • Re-sequencing for annotation • Sequencing to identify variations • Sample prep – can it be cultured?

  5. Types of questions • How many genes? • How many functional genetic elements (miRNAs, ncRNAs)? • What’s different about this genome compared to another one (e.g. virulence differences in pathogenic organisms)? • What is the cause of this particular phenotype? • What taxonomic groups are represented in this population of bacteria, viruses or fungi?

  6. Genetic maps • Chromosomal banding patterns • Stain with Giemsa (G-banding pattern) • Chromosomes are numbered based on size • Giemsa binds to phosphate groups and attaches to AT-rich regions • Dark regions are heterochromatic, late replicating and AT rich • Lighter regions are euchromatic, early replicating and GC rich

  7. Chromosome nomenclature • p (petite) = short arm • q (queue) = long arm • Bands are numbered going away from the centromere • 4q21.1 represents chromosome 4, long arm, 2nd band, 1st sub-band and 1st sub-sub-band
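The band label on this slide can be split into its parts mechanically. The following is a minimal sketch, assuming simple labels of the form chromosome + arm + two digits + optional decimal part (as in 4q21.1); the function name `parse_band` and the dictionary keys are hypothetical, and the level names follow the slide's wording.

```python
import re

# Simple labels only: chromosome (1-22, X or Y), arm (p or q),
# two digits, then an optional decimal part. Example: 4q21.1
BAND_RE = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")

def parse_band(label):
    """Split a band label such as '4q21.1' into its components."""
    match = BAND_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized band label: {label!r}")
    chrom, arm, band, sub_band, sub_sub_band = match.groups()
    return {
        "chromosome": chrom,
        "arm": arm,                    # 'p' = short arm, 'q' = long arm
        "band": int(band),             # numbered going away from the centromere
        "sub_band": int(sub_band),
        "sub_sub_band": sub_sub_band,  # digits after the dot, or None
    }

print(parse_band("4q21.1"))
```

For 4q21.1 this gives chromosome 4, arm q, band 2, sub-band 1 and sub-sub-band 1, matching the reading given on the slide.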

  8. DNA sequencing – Overview • Gel electrophoresis • Predominant in 1980s • Whole genome strategies • Physical mapping (BAC clones) • Walking • Shotgun sequencing • Capillary sequencing machines • Computational fragment assembly • Next generation technologies • Polony based sequencing • Novel assembly techniques • [Figure: cost/base for DNA sequence]

  9. Traditional approach • Shear the very large genome into smaller chunks • Clone in vectors that can support large inserts • Digest and separate on high resolution gel to determine the clone overlap • Pick minimum number of clones • Shotgun sequence each clone • Read the traces and assemble • Make the gene calls • Load it into a genome viewer

  10. BAC library in DNA sequencing

  11. Shotgun sequencing

  12. Paired reads vs single reads • Single reads • M13 clones • robotic template prep • Paired reads • Plasmids, cosmids, BACs • Gap closure!! • Prefer 3-10 mate pairs per gap • Inserts of different, but known sizes

  13. Steps to Assemble a Genome • Some terminology • read: a 500-900 bp long word that comes out of the sequencer • mate pair: a pair of reads from the two ends of the same insert fragment • contig: a contiguous sequence formed by several overlapping reads with no gaps • supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs • consensus sequence: the sequence derived from the multiple alignment of the reads in a contig • Steps: 1. Find overlapping reads 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence (..ACGATTACAATAGGTT..) • Target: 30X coverage, or >30 high-quality reads per base
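Steps 1 and 2 on this slide (find overlapping reads, merge the best pairs into contigs) can be illustrated with a toy greedy overlap merge. This is only a sketch under strong assumptions: error-free reads, a single strand, exact suffix/prefix matches, and no mate-pair scaffolding or consensus calling. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical; real assemblers are far more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

# Three toy reads drawn from the example sequence on the slide.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
print(greedy_assemble(reads))
```

With these three reads the merge ends in a single contig, ACGATTACAATAGGTT. The pairwise comparison is quadratic in the number of reads, which is one reason this approach does not scale to real data sets.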

  14. Assembled into chromosomes • RefSeq nomenclature: • NT: genomic contig or scaffold (NG is the prefix for genomic regions such as complete genes) • NC: chromosome • NM: mRNA sequence • NP: protein sequence • Assembly: completed genome, multiple assemblies

  15. Calling the genes • De novo computer algorithms • Identify coding sequences by GC content • Start and stop sites • Intron/exon boundaries • Comparison with other known genes • EST libraries
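Two of the signals listed on this slide, start/stop sites and GC content, are easy to sketch in code. The example below is a simplified illustration, not a gene caller: it scans only the forward strand, ignores introns (so it is closer to a prokaryotic case), and treats any ATG followed by an in-frame stop codon as a candidate. The names `find_orfs`, `gc_content` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Find ATG...stop open reading frames on the forward strand.

    Returns (start, end, frame) tuples; end is exclusive and includes
    the stop codon. min_codons counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # look for the next ORF
    return orfs

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end], gc_content(dna[start:end]))
```

In practice these signals are combined with intron/exon boundary models and with evidence from known genes and EST libraries, as the slide notes.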

  16. Sanger method (Misha Angrist)

  17. Sanger sequencing reached its technical limits • Only modestly parallel (394 lanes/machine) • Long read lengths (500-900 bp) & >99.9% correct • Need to clone the DNA to obtain enough for sequencing reaction • At SLU: cost for typical Sanger sequencing is $5-6/sample with reliable 500 bp of sequence

  18. NGS sequencing • Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads • Polony-based or cluster-based sequencing is the basis of most second generation sequencers • Typical NGS workflow: 1. Library construction to add adapters to the sequence 2. Template CLONAL amplification (on a bead or chip) 3. Massively PARALLEL sequencing

  19. Illumina HiSeq and MiSeq • 100 – 200 bp read lengths • Available locally with MoGene and Cofactor Genomics • GTAC (Wash U) has a HiSeq 2000, which has 50 bp single-end reads and 100 bp paired-end reads • Why not use this for all sequencing? • Cost is ~$300-400/library and ~$1100/lane of sequencing • Generates Gb of data per run

  20. Ion Torrent – measures pH changes • Done on a semiconductor chip

  21. Ion Torrent workflow

  22. Illumina vs Ion Torrent • Illumina has greater capacity but longer run times • Latest versions of both have read lengths ~200 bp • SLU will soon have an Ion Torrent machine • Costs?? • Can do single- or paired-end reads • Paired-end reads are 2X the cost for library construction, but necessary for de novo genome assembly

  23. Bioinformatics challenges • Each flow cell in the Illumina HiSeq 2000 can generate a billion bases of sequence • Raw read files are Tb in size • Processed read files are 700-800 Mb • Alignment files are 150-300 Mb • Assembly of millions of short (75-100 bp) reads into a vertebrate genome • Need a high-performance computing (HPC) cluster for vertebrate-sized genomes
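The scale of the problem follows from simple arithmetic: average coverage is (number of reads × read length) / genome size. The sketch below applies that relation to the 30X target mentioned earlier in the lecture. The 3 Gb genome size is an assumed, illustrative figure for a vertebrate-sized genome, and the function names are hypothetical.

```python
import math

def coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

def reads_needed(genome_size, read_length, target_coverage=30):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

GENOME_SIZE = 3_000_000_000  # assumed 3 Gb genome, for illustration only

for read_length in (75, 100):
    n = reads_needed(GENOME_SIZE, read_length)
    print(f"{read_length} bp reads: {n:,} reads for 30X")
```

Under these assumptions, 100 bp reads require 900 million reads for 30X average coverage. This is an average only; actual depth varies along the genome.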

  24. Sequencing has become a standard technique • RNA sequencing for expression • ChIP sequencing for TF site identification • DNA sequencing for variants • Identification of populations/genetic changes in highly variable viruses and bacteria • Metagenomics • Identification of unknown/non-culturable communities of bacteria/viruses/fungi

  25. Why RNAseq over microarray? • Technical variation is less • Do not need a sequenced genome • Greater dynamic range of expression • Detect transcript isoforms • Identify novel transcripts • Identify non-coding RNAs

  26. Differential expression
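As a rough illustration of what a differential expression comparison computes, the sketch below normalizes raw read counts to counts per million (CPM) and takes a log2 fold change between two samples. This is a simplified sketch: the gene names and counts are made up, the function names and the pseudocount are assumptions, both samples are assumed to list the same genes, and a real analysis would use replicates and a statistical test rather than a bare ratio.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene, computed on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount)
                        / (cpm_control[gene] + pseudocount))
        for gene in control
    }

# Made-up counts for two hypothetical genes.
control = {"geneA": 100, "geneB": 900}
treated = {"geneA": 400, "geneB": 600}
for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

A positive value means higher expression in the treated sample, a negative value lower expression.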

  27. Potential transcript isoform

  28. Potential novel transcript

  29. Genome viewers • Provide genomic context for genes • View different tracks of information • Variants, alternative transcripts, GC rich regions • Export images of the gene models • Export genomic sequence for specific regions • Both browsers (UCSC Genome and Ensembl) provide interface for complex queries

  30. UCSC Genome viewer

  31. Alternative tracks

  32. Ensembl genome browser

  33. Configure tracks • Ensembl viewer • UCSC viewer

  34. Issues with genome browsers • Different data depending on assembly • Sometimes older assemblies have more data tracks • The rate at which data gets added (and verified) with a new assembly depends on staffing/funding and priorities. • In the case of C. elegans, the Ensembl browser is more up-to-date • For the purpose of our exercises, you can use whatever browser you choose, just be sure to note which one & which assembly you used. • I may recommend one over the other because of certain tool availability

  35. This week in lab • Tutorials on using UCSC and Ensembl genome browsers • Comparing gene structures using genome browsers • Exporting variation data from genomic regions using genome browsers • Designing virtual gene knock-outs
