
Next-generation sequencing: informatics & software aspects

Gabor T. Marth, Boston College Biology Department. Harvard Nanocourse, October 7, 2009.


Presentation Transcript


  1. Next-generation sequencing: informatics & software aspects. Gabor T. Marth, Boston College Biology Department. Harvard Nanocourse, October 7, 2009

  2. New sequencing technologies…

  3. … offer vast throughput … & many applications [Figure: bases per machine run (1 Mb to 100 Gb) vs. read length (10 bp to 1,000 bp). Platforms shown: ABI capillary sequencer; Roche/454 pyrosequencer (100 Mb-1 Gb in 200-450 bp reads); Illumina/Solexa, AB/SOLiD sequencers (10-50 Gb in 25-100 bp reads)]

  4. Genome resequencing for variation discovery: SNPs, short INDELs, structural variations • the most immediate application area

  5. Genome resequencing for mutational profiling [figure label: organismal reference sequence] • likely to change “classical genetics” and mutational analysis

  6. De novo genome sequencing (Lander et al. Nature 2001) • difficult problem with short reads • promising, especially as reads get longer

  7. Identification of protein-bound DNA • Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007) • Transcription factor binding sites (Robertson et al. Nature Methods 2007) • DNA methylation (Meissner et al. Nature 2008) • natural applications for next-gen. sequencers

  8. Transcriptome sequencing: transcript discovery (Mortazavi et al. Nature Methods 2008; Ruby et al. Cell 2006) • high-throughput, but short reads pose challenges

  9. Transcriptome sequencing: expression profiling (Cloonan et al. Nature Methods 2008; Jones-Rhoades et al. PLoS Genetics 2007) • high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

  10. … & enable personal genome sequencing

  11. The re-sequencing informatics pipeline: (i) base calling (ii) read mapping (iii) SNP and short INDEL calling (iv) SV calling (v) data viewing, hypothesis generation [figure labels: REF, IND]

  12. The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

  13. Base error characteristics vary [figure panels: 454, Illumina]

  14. Read lengths vary [Figure: read length [bp], axis 0 to 400. Ranges shown: 25-60 (variable); 25-50 (fixed); 25-100 (fixed); ~200-450 (variable)]

  15. Sequence traces are machine-specific • Base calling is increasingly left to machine manufacturers

  16. Representational biases

  17. Fragment duplication

  18. 2. Read mapping • Read mapping is like a jigsaw: …you get the pieces… and they give you the picture on the box • Unique pieces are easier to place than others…

  19. Multiply-mapping reads • “Traditional” repeat masking does not capture repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin
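
To make the multiple-mapping point concrete, here is a minimal sketch of exact-match read placement against a toy reference. The reference string, the k-mer index, and the function names are all illustrative assumptions; real mappers tolerate mismatches and gaps and are far more elaborate.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read: str, index: dict) -> list:
    """Return all exact-match start positions of a read whose length equals k."""
    return index.get(read, [])

# Toy reference (hypothetical) that contains the 6-bp repeat "ACGTAC" twice.
reference = "TTACGTACGGCCACGTACTT"
index = build_kmer_index(reference, k=6)

unique_hits = map_read("GGCCAC", index)  # read from unique sequence
repeat_hits = map_read("ACGTAC", index)  # read that lies inside the repeat
```

The read drawn from unique sequence has one candidate location, while the read from the repeat has two equally good ones, so its true origin cannot be decided from the sequence alone.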

  20. Dealing with multiple mapping

  21. Paired-end (PE) reads • fragment length: 100 – 600bp • fragment length: 1 – 10kb • PE reads are now the standard for whole-genome short-read sequencing (Korbel et al. Science 2007)

  22. Gapped alignments (for INDELs)
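
As an illustration of what a gapped alignment is, the sketch below uses textbook global dynamic programming (Needleman-Wunsch) with a linear gap penalty. It is not the algorithm of any particular read mapper, and the scoring values are arbitrary assumptions.

```python
def global_align(read: str, ref: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align read to ref; return (score, aligned_read, aligned_ref)."""
    n, m = len(read), len(ref)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning read[i-1] with ref[j-1].
        return match if read[i - 1] == ref[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_read, aligned_ref = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_read.append(read[i - 1]); aligned_ref.append(ref[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1]); aligned_ref.append("-")  # insertion in read
            i -= 1
        else:
            aligned_read.append("-"); aligned_ref.append(ref[j - 1])   # deletion in read
            j -= 1
    return score[n][m], "".join(reversed(aligned_read)), "".join(reversed(aligned_ref))

# The read carries one extra T relative to the reference (a 1-bp insertion).
aln_score, aln_read, aln_ref = global_align("ACGTTGCA", "ACGTGCA")
```

An ungapped mapper would have to explain the extra base with a run of mismatches; allowing a gap column lets the alignment report it as a short INDEL instead.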

  23. The MOSAIK read mapper Michael Strömberg • gapped mapper • option to report multiple map locations • aligns 454, Illumina, SOLiD, Helicos reads • works with standard file formats (SRF, FASTQ, SAM/BAM)

  24. Alignment post-processing • quality value re-calibration • duplicate fragment removal
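
Duplicate fragment removal can be sketched as grouping alignments by their mapped fragment coordinates and keeping one representative per group. The `Alignment` record and the rule "keep the highest mapping quality" are assumptions made for illustration, not the behaviour of a specific tool.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost mapped position of the fragment
    end: int     # rightmost mapped position of the fragment
    strand: str
    mapq: int    # mapping quality

def remove_duplicates(alignments):
    """Keep one alignment per (chrom, start, end, strand), preferring higher mapq."""
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.end, aln.strand)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a.chrom, a.start))

# Hypothetical input: r1 and r2 share identical fragment coordinates.
alignments = [
    Alignment("r1", "chr1", 100, 350, "+", 30),
    Alignment("r2", "chr1", 100, 350, "+", 60),
    Alignment("r3", "chr1", 100, 360, "+", 20),
    Alignment("r4", "chr2", 5, 200, "-", 40),
]
kept = remove_duplicates(alignments)
```

Here `r1` is dropped as a duplicate of `r2`, while `r3` survives because its fragment end differs.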

  25. Data storage requirements

  26. Alignment visualization • too much data – indexed browsing • too much detail – color coding, show/hide

  27. SNP calling: old problem, new data sequencing error polymorphism

  28. Allele calling in next-gen data [figure labels: SNP, INS] • New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

  29. SNP calling in multi-sample read sets
      Reads per sample: B1 = aacc; Bi = aaaac; Bn = cccc
      “genotype likelihoods”:
        P(B1=aacc|G1=aa)  P(B1=aacc|G1=cc)  P(B1=aacc|G1=ac)
        P(Bi=aaaac|Gi=aa)  P(Bi=aaaac|Gi=cc)  P(Bi=aaaac|Gi=ac)
        P(Bn=cccc|Gn=aa)  P(Bn=cccc|Gn=cc)  P(Bn=cccc|Gn=ac)
      Prior(G1,..,Gi,..,Gn)
      “genotype probabilities”:
        P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)  P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)  P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
        P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)  P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)  P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
        P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)  P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)  P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)
      P(SNP)
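
The flow from genotype likelihoods through a prior to genotype probabilities and P(SNP) can be sketched as follows. This is a simplified illustration, not the method on the slide: it assumes a biallelic site with reference allele `a`, a uniform per-base error rate, and an independent prior for each sample instead of the joint Prior(G1,..,Gn). The prior values are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")

def genotype_likelihood(reads: str, genotype: str, error: float = 0.01) -> float:
    """P(reads | genotype) for a biallelic site with alleles 'a' and 'c'."""
    likelihood = 1.0
    for base in reads:
        # Each read samples one of the two chromosomes with probability 1/2
        # and reports its allele correctly with probability 1 - error.
        likelihood *= sum(0.5 * ((1 - error) if allele == base else error)
                          for allele in genotype)
    return likelihood

def genotype_posteriors(reads: str, prior: dict, error: float = 0.01) -> dict:
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    joint = {g: prior[g] * genotype_likelihood(reads, g, error) for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}

# Hypothetical per-sample prior (assumption; the slide uses a joint prior).
prior = {"aa": 0.985, "ac": 0.010, "cc": 0.005}
samples = {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}

posteriors = {name: genotype_posteriors(reads, prior) for name, reads in samples.items()}

# With independent samples, the site is a SNP unless every sample is hom-ref.
p_snp = 1.0 - prod(post["aa"] for post in posteriors.values())
```

Under these assumptions, sample 1 is most probably heterozygous and sample n homozygous non-reference, whereas the single `c` read in sample i is more plausibly a sequencing error than a second allele.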

  30. Trio sequencing • the child inherits one chromosome from each parent • there is a small probability for a de novo (germ-line or somatic) mutation in the child
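
The inheritance rule for trios translates into a simple consistency check on called genotypes. The sketch below assumes genotypes are written as two-character strings such as `"ac"`; the function name is hypothetical.

```python
from itertools import product

def mendelian_consistent(child: str, mother: str, father: str) -> bool:
    """True if the child's genotype can be built from one maternal and one paternal allele."""
    return any(sorted(m + f) == sorted(child) for m, f in product(mother, father))

consistent = mendelian_consistent("ac", mother="aa", father="cc")
violation = mendelian_consistent("cc", mother="aa", father="ac")
```

A site where the check fails is either a genotyping error or a candidate de novo mutation, which is why trio data is useful for assessing call accuracy.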

  31. Alignment visualization • too much data – indexed browsing • too much detail – color coding, show/hide

  32. Standard data formats: SRF/FASTQ, SAM/BAM, GLF/VCF
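
As a small example of working with one of these formats, the sketch below parses a single four-line FASTQ record and converts its quality string to Phred scores and error probabilities. It assumes the Phred+33 ASCII encoding; some instruments of this period used a +64 offset, hence the `offset` parameter. The record itself is made up.

```python
def parse_fastq_record(lines, offset: int = 33):
    """Parse one 4-line FASTQ record into (name, sequence, phred scores, error probabilities)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    phred = [ord(ch) - offset for ch in quality]
    # Phred definition: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10).
    error_probs = [10 ** (-q / 10) for q in phred]
    return header[1:], sequence, phred, error_probs

record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # hypothetical read
name, seq, phred, error_probs = parse_fastq_record(record)
```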

  33. Human genome polymorphism projects common SNPs

  34. Human genome polymorphism discovery

  35. The 1000 Genomes Project Pilot 1 Pilot 2 Pilot 3

  36. 1000G Pilot 3 – exon sequencing • Targets: 1K genes / 10K targets • Capture: solid / liquid phase • Sequencing: 454 / Illumina; SE / PE • Data producers: Baylor, Broad, Sanger, Wash. U. • Informatics methods: multiple read mapping & SNP calling programs

  37. Coverage varies

  38. On/off target capture [figure labels: target region; SNP (outside target region); ref allele*: 45%; non-ref allele*: 54%]

  39. Fragment duplication – revisited

  40. Reference allele bias • ref allele*: 54% • non-ref allele*: 45% • (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
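
A reference-allele fraction of this kind can be computed by pooling read counts over known heterozygous sites, where an unbiased pipeline would be expected to give about 50%. The sketch below assumes each site is summarized as a `(ref_reads, non_ref_reads)` pair; the counts are invented for illustration and are not the slide's data.

```python
def ref_allele_fraction(site_counts) -> float:
    """Fraction of reads carrying the reference allele, pooled over heterozygous sites."""
    ref_total = sum(ref for ref, _ in site_counts)
    all_total = sum(ref + alt for ref, alt in site_counts)
    if all_total == 0:
        raise ValueError("no reads at the given sites")
    return ref_total / all_total

# Hypothetical (ref, non-ref) read counts at three heterozygous sites.
site_counts = [(12, 8), (9, 11), (15, 10)]
fraction = ref_allele_fraction(site_counts)
```

A pooled fraction noticeably above 0.5 suggests that reads carrying the non-reference allele are being lost, for example during mapping against the reference.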

  41. SNP calling findings • based on a method comparison / testing exercise • 80 samples drawn from the 4 Centers • read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)

  42. Overlap between call sets [Figure: overlap of Broad calls and BC calls; statistics for three call subsets, in the order extracted]
      # SNP calls:   413     452      2,296
      # dbSNPs:      24      172      1,862
      % dbSNPs:      5.81%   38.05%   81.10%
      Ts/Tv ratio:   0.23    1.21     3.40
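
The two quality indicators reported for each call set, the percentage of calls already in dbSNP and the transition/transversion (Ts/Tv) ratio, are straightforward to compute. The sketch below assumes SNPs are given as `(ref, alt)` base pairs; the function names and the example SNP list are illustrative.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps) -> float:
    """Ratio of transitions to transversions among (ref, alt) SNP calls."""
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

def percent_dbsnp(n_calls: int, n_in_dbsnp: int) -> float:
    """Percentage of SNP calls that are already present in dbSNP."""
    return 100.0 * n_in_dbsnp / n_calls

example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]  # hypothetical calls
ratio = ts_tv_ratio(example_snps)

# Reproduces the percentage reported for the largest call subset on the slide.
pct = round(percent_dbsnp(2296, 1862), 2)
```

A call subset with a low dbSNP fraction and a low Ts/Tv ratio is typically enriched for false positives, which is how these statistics are used to compare pipelines.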

  43. The 1000G Structural Variation Discovery Effort

  44. Structural variation detection (Feuk et al. Nature Reviews Genetics, 2006)

  45. SV detection – resolution [Figure: relative numbers of events vs. CNV event length [bp]; expected CNVs compared with the detection ranges of karyotype, micro-array, and sequencing]

  46. Detection Approaches • Read Depth: good for big CNVs • Paired-end: all types of SV • Split-Reads: good break-point resolution • de novo Assembly: ~ the future [figure labels: Reference, Sample, Lmap, read, contig] SV slides courtesy of Chip Stewart, Boston College

  47. Read depth (RD)

  48. Statistical & systematic biases

  49. Single molecule sequencing? [figure panels: GC bias; coverage bias]

  50. RD resolution [Figure: Illumina; observed read counts (per kb) vs. expected read counts (per kb); density (log10); clusters at CN = 1, CN = 2, CN = 3]
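
The relationship between observed and expected read counts and copy number can be sketched as a simple ratio estimate. This is a naive illustration assuming a diploid baseline and an already bias-corrected expected count per window; it ignores the statistical noise and systematic biases that the preceding slides raise.

```python
def estimate_copy_number(observed: float, expected: float, ploidy: int = 2) -> int:
    """Nearest integer copy number for a window, given observed and expected read counts.

    `expected` is the read count anticipated at the normal (diploid) copy number.
    """
    if expected <= 0:
        raise ValueError("expected read count must be positive")
    return round(ploidy * observed / expected)

# Hypothetical per-kb read counts for three windows with an expected count of 100.
copy_numbers = [estimate_copy_number(obs, 100) for obs in (50, 104, 150)]
```

Because adjacent copy-number states differ by only half the expected count, short windows with few reads cannot separate them reliably, which is why read depth works best for large CNVs.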
