Intro to Next Generation Sequencing. Nick Loman and James Hadfield. http://omicsmaps.com/. Koboldt et al., 2010 (Figure 3). Bench work to build libraries and sequence. Clean up and QA reads. Alignments to Genome or Transcriptome. Analysis of Alignments. Koboldt et al., 2010.
Intro to Next Generation Sequencing
Nick Loman and James Hadfield http://omicsmaps.com/
• Bench work to build libraries and sequence
• Clean up and QA reads
• Alignments to Genome or Transcriptome
• Analysis of Alignments
Koboldt et al., 2010
• Sample contamination
• Tumor-normal switches
• Sample mix-ups
• Run quality
• Library chimeras
GCTACGGCATTCAGGCATCAGGCATTAGCAG
GGCATTCAGGGATCAGGCATTAGC->
<-CATGGCATTCAGGGATCAGGCATT
<-GCCATGGCATTCAGGGATCAGGC
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATTAGC->
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATT->
<-GGATCAGGCATTAGCAG
<-GATCAGGCATTAGCAG
<-GGATCAGGCATTAGCAG
FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ… For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93. Solexa quality scores have to be converted to PHRED quality scores.
• FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
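The conversions mentioned above can be sketched in a few lines. This is a minimal illustration, not a replacement for an established converter: it assumes the ASCII offsets described by Cock et al. (33 for Sanger, 64 for Solexa and Illumina 1.3+), and the function names are made up for this example.

```python
import math

SANGER_OFFSET = 33    # Sanger FASTQ: PHRED 0-93 stored as ASCII 33-126
ILLUMINA_OFFSET = 64  # Illumina 1.3+ and Solexa FASTQ: scores stored from ASCII 64


def illumina_to_sanger(qual: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (PHRED, offset 64) as Sanger (offset 33)."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in qual)


def solexa_to_phred(q_solexa: int) -> int:
    """Map one Solexa score to the nearest PHRED score.

    Uses Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_to_sanger(qual: str) -> str:
    """Re-encode a Solexa quality string (offset 64) as Sanger (PHRED, offset 33)."""
    return "".join(
        chr(solexa_to_phred(ord(c) - ILLUMINA_OFFSET) + SANGER_OFFSET) for c in qual
    )


if __name__ == "__main__":
    print(illumina_to_sanger("hhhh"))  # four bases at PHRED 40
    print(solexa_to_phred(-5))         # lowest Solexa score
```

The Illumina 1.3+ case is a pure re-offset because those scores are already PHRED scores; only Solexa scores need the logarithmic mapping, and the two scales differ noticeably only at low quality.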
SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from scratch… you can instead use existing alignments in SAM format
• SAM is the output of aligners that map reads to a reference genome
• Tab-delimited, with a header section and an alignment section
• Header lines begin with @ and are optional
• Each line of the alignment section has 11 mandatory fields
• BAM is the binary format of SAM
http://samtools.sourceforge.net/
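As a rough sketch of that layout, the snippet below splits SAM text into header lines and alignment records. The field names follow the current SAM specification (older versions of the specification call the last three positional fields MRNM, MPOS and ISIZE); the `SamRecord` type and `parse_sam` function are hypothetical names for this example. For real work a library such as pysam or the samtools command line is the safer choice.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Sanger encoding, PHRED+33)
    tags: List[str]  # optional fields, left unparsed here


def parse_sam(lines: Iterable[str]) -> Tuple[List[str], List[SamRecord]]:
    """Split SAM text into header lines (starting with '@') and alignment records."""
    headers: List[str] = []
    records: List[SamRecord] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
            continue
        f = line.split("\t")
        if len(f) < 11:
            raise ValueError(f"expected at least 11 fields, got {len(f)}")
        records.append(SamRecord(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
        ))
    return headers, records


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\tSO:coordinate",
        "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    ]
    headers, records = parse_sam(example)
    print(headers)
    print(records[0].qname, records[0].pos, records[0].cigar)
```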
Mandatory Alignment Fields http://samtools.sourceforge.net/SAM1.pdf
Alignment Examples Alignments in SAM format http://samtools.sourceforge.net/SAM1.pdf
Valid BED files

chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +
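Both examples above use the same leading columns: chromosome, start, end, then optional name, score and strand. A minimal reader might look like the sketch below. It assumes the usual BED convention of 0-based, half-open coordinates, so the interval length is simply end minus start; `parse_bed_line` is a made-up name for this example, and the score is read as an integer only because the records shown here use integers.

```python
from typing import Dict, Union

BedRecord = Dict[str, Union[str, int]]


def parse_bed_line(line: str) -> BedRecord:
    """Parse one BED record with 3 to 6 whitespace-separated columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start and end")
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"bad interval: {start}-{end}")
    record: BedRecord = {
        "chrom": fields[0],
        "start": start,         # 0-based, inclusive
        "end": end,             # exclusive
        "length": end - start,
    }
    if len(fields) > 3:
        record["name"] = fields[3]
    if len(fields) > 4:
        record["score"] = int(fields[4])
    if len(fields) > 5:
        if fields[5] not in ("+", "-", "."):
            raise ValueError(f"bad strand: {fields[5]}")
        record["strand"] = fields[5]
    return record


if __name__ == "__main__":
    print(parse_bed_line("chr1 86114265 86116346 nsv433165"))
    print(parse_bed_line("chr1 16759829 16778548 chr1:21667704 270866 -"))
```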
GVF format

##gff-version 3
##gvf-version 1.02
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090
##genome-build NCBI MGSCv36
##assembly-name MGSCv36
##assembly-accession GCF_000001635.15
##file-date 2011-11-18
# Study_accession: Combined studies on MGSCv36
# Display_name: Combined studies on MGSCv36
# Study_description: Combined studies on MGSCv36
chr1 dbVar copy_number_variation 90044442 90114410 . . . ID=nsv433533;Name=nsv433533;Start_range=.,90044442;End_range=90114410,.
chr4 dbVar copy_number_variation 121483931 121646639 . . . ID=nsv433534;Name=nsv433534;Start_range=.,121483931;End_range=121646639,.
chr9 dbVar copy_number_variation 109128634 109146964 . . . ID=nsv433535;Name=nsv433535;Start_range=.,109128634;End_range=109146964,.
chr17 dbVar copy_number_variation 30240627 30614866 . . . ID=nsv433536;Name=nsv433536;Start_range=.,30240627;End_range=30614866,.
chr17 dbVar copy_number_variation 30983722 31036099 . . . ID=nsv433537;Name=nsv433537;Start_range=.,30983722;End_range=31036099,.
chr17 dbVar copy_number_variation 34907088 34962504 . . . ID=nsv433538;Name=nsv433538;Start_range=.,34907088;End_range=34962504,.
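GVF builds on GFF3, so a file like the one above has three kinds of lines: `##` pragmas, `#` comments, and nine-column feature lines whose last column holds `key=value` attributes separated by semicolons. The sketch below reads that structure. It assumes tab-delimited columns, as GFF3 requires, and `parse_gvf` is a hypothetical name for this example.

```python
from typing import Dict, Iterable, List, Tuple

COLUMNS = ["seqid", "source", "type", "start", "end", "score", "strand", "phase"]


def parse_gvf(lines: Iterable[str]) -> Tuple[Dict[str, str], List[dict]]:
    """Return (pragmas, features) from GVF/GFF3 text."""
    pragmas: Dict[str, str] = {}
    features: List[dict] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Pragma: "##key value"
            key, _, value = line[2:].partition(" ")
            pragmas[key] = value
        elif line.startswith("#") or not line.strip():
            continue  # free-text comment or blank line
        else:
            cols = line.split("\t")
            if len(cols) != 9:
                raise ValueError(f"expected 9 columns, got {len(cols)}")
            feature = dict(zip(COLUMNS, cols[:8]))
            feature["start"] = int(feature["start"])  # 1-based, inclusive
            feature["end"] = int(feature["end"])      # 1-based, inclusive
            feature["attributes"] = dict(
                item.split("=", 1) for item in cols[8].split(";") if item
            )
            features.append(feature)
    return pragmas, features


if __name__ == "__main__":
    text = [
        "##gff-version 3",
        "##gvf-version 1.02",
        "# Study_accession: Combined studies on MGSCv36",
        "chr1\tdbVar\tcopy_number_variation\t90044442\t90114410\t.\t.\t.\t"
        "ID=nsv433533;Name=nsv433533;Start_range=.,90044442;End_range=90114410,.",
    ]
    pragmas, features = parse_gvf(text)
    print(pragmas)
    print(features[0]["type"], features[0]["attributes"]["ID"])
```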
Derived data
http://www.ncbi.nlm.nih.gov/dbvar
http://www.ebi.ac.uk/dgva
http://www.ncbi.nlm.nih.gov/snp
Trace Organization
• seq1: FASTA, Quality, Chromatogram, Experimental info, Sample
• seq2: FASTA, Quality, Chromatogram, Experimental info, Sample
SRA Organization
• Experiments
• Samples
• Sequences and Qualities
Era of NGS Explosion
FASTQ Era
Bits/Base Era
As of April 10, 2012, SRA contains fewer bytes than bases
New Cycle
Decision Circle
• Increases the number of data series
• BAM and similar formats containing both raw reads and alignments become primary output of raw sequencing
• Compression By Reference reduces sizes of other data series
• New compression algorithms
• New sets of tradeoffs
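To make the idea of compression by reference concrete, here is a toy sketch: an aligned read is stored as its position, its length, and only the bases that differ from the reference. It is an illustration of the principle, not the scheme used by SRA or by any particular file format, and the function names are invented for this example. The sequences are the reference and one read from the alignment slide earlier in the deck.

```python
from typing import List, Tuple

Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Store a read as (0-based position, length, [(offset, base) for each mismatch])."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "GCTACGGCATTCAGGCATCAGGCATTAGCAG"
    read = "GGCATTCAGGGATCAGGCATTAGC"
    encoded = encode_read(reference, read, pos=5)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

A read that matches the reference closely costs little more than a position and a length, which is where the size reduction comes from; the tradeoff is that decoding needs the exact same reference, and this sketch ignores insertions, deletions and quality scores.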
Analyzing New Compression Method: Data from the 1000 Genomes Project
Science 1 July 2011: Vol. 333 no. 6038 pp. 53-58 DOI: 10.1126/science.1207018
Li et al., 2011 Fig. 2
Kleinman et al., 2012 Fig 1
Kleinman et al., 2012 Table 1
Lin et al., 2012 Fig 1
Lin et al., 2012 Fig 2
Pickrell et al., 2012 Fig 1
Li et al., 2012 Fig 1
Li et al., 2012 Fig 2
Li et al., 2012 Fig 3
Li et al., 2012 Fig 4