
Next-generation sequencing: the informatics angle

Gabor T. Marth, Boston College Biology Department


Presentation Transcript


  1. Next-generation sequencing: the informatics angle. Gabor T. Marth, Boston College Biology Department

  2. Next-generation sequencing [Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)] • Illumina, AB/SOLiD short-read sequencers: 5-15 Gb in 25-70 bp reads • 454 pyrosequencer: 100-400 Mb in 200-450 bp reads • ABI capillary sequencer

  3. Individual human resequencing

  4. Whole-genome mutational profiling

  5. Expression analysis

  6. Technologies

  7. Roche / 454 system • pyrosequencing technology • variable read-length • the only new technology with >100bp reads

  8. Illumina / Solexa Genome Analyzer • fixed-length short-read sequencer • very high throughput • read properties are very close to traditional capillary sequences • low INDEL error rate

  9. AB / SOLiD system • fixed-length short-reads • very high throughput • 2-base encoding system • color-space informatics
     2-base color code (rows: 1st base; columns: 2nd base):
            A  C  G  T
         A  0  1  2  3
         C  1  0  3  2
         G  2  3  0  1
         T  3  2  1  0
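
The 2-base encoding can be sketched in a few lines. This is a minimal illustration, not vendor code: it assumes the standard color code in which two identical bases give color 0, which is equivalent to XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3). The function names are hypothetical.

```python
# Sketch of 2-base (color-space) encoding; names are illustrative only.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    codes = [BASE_TO_CODE[b] for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # (previous base, color) determines the next base
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)
```

Decoding is sequential, so a single miscalled color changes every base downstream of it. That is one reason to keep the analysis in color space rather than translating reads to bases first.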

  10. Helicos / Heliscope system • short-read sequencer • single molecule sequencing • no amplification • variable read-length • error rate reduced with 2-pass template sequencing

  11. Data characteristics

  12. Read length [Figure: read-length ranges per technology; x-axis: read length (bp), 0 to 400] • 20-60 bp (variable) • 25-50 bp (fixed) • 25-70 bp (fixed) • ~200-450 bp (variable)

  13. Paired fragment-end reads • fragment amplification: fragment length 100-600 bp, limited by amplification efficiency • circularization: 500 bp to 10 kb (sweet spot ~3 kb), limited by library complexity (Korbel et al. Science 2007) • paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends) • instrumental for structural variation discovery
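
The "rescue" of a non-uniquely mapping end can be sketched as follows. This is a simplified illustration under stated assumptions: both ends lie on the same chromosome, the uniquely mapped end is the upstream one, and the 100-600 bp range is the amplification-based fragment length from the slide. The function name, the 35 bp read length and all coordinates are hypothetical.

```python
def rescue_mate(anchor_pos, candidates, min_frag=100, max_frag=600, read_len=35):
    """Pick the placement of the non-unique end that implies a plausible fragment.

    anchor_pos: leftmost coordinate of the uniquely mapped (upstream) end.
    candidates: leftmost coordinates of the possible placements of the other end.
    """
    consistent = []
    for pos in candidates:
        frag_len = pos + read_len - anchor_pos  # outer distance spanned by the pair
        if min_frag <= frag_len <= max_frag:
            consistent.append(pos)
    # Rescue only when exactly one placement satisfies the constraint.
    return consistent[0] if len(consistent) == 1 else None
```

For example, `rescue_mate(1000, [1300, 50000, 900000])` keeps position 1300, while two candidates that both fit the fragment length range leave the end unresolved.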

  14. Representational biases • “dispersed” coverage distribution • this affects genome resequencing (deeper starting read coverage is needed) • will have a major impact on counting applications

  15. Amplification errors • an early amplification error gets propagated into every clonal copy • many reads come from clonal copies of a single fragment • early PCR errors in “clonal” read copies lead to false positive allele calls

  16. Read quality

  17. Error rate (Solexa)

  18. Error rate (454)

  19. Per-read errors (Solexa)

  20. Per read errors (454)

  21. Applications

  22. Genome resequencing for variation discovery • SNPs • short INDELs • structural variations • the most immediate application area

  23. Genome resequencing for mutational profiling [Figure: organismal reference sequence] • likely to change “classical genetics” and mutational analysis

  24. De novo genome sequencing (Lander et al. Nature 2001) • difficult problem with short reads • promising, especially as reads get longer

  25. Identification of protein-bound DNA • chromatin structure by ChIP-seq (Mikkelsen et al. Nature 2007) • transcription factor binding sites (Robertson et al. Nature Methods 2007) • DNA methylation (Meissner et al. Nature 2008) • natural applications for next-gen. sequencers

  26. Transcriptome sequencing: transcript discovery (Mortazavi et al. Nature Methods 2008; Ruby et al. Cell 2006) • high-throughput, but short reads pose challenges

  27. Transcriptome sequencing: expression profiling (Cloonan et al. Nature Methods 2008; Jones-Rhoades et al. PLoS Genetics 2007) • high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

  28. Analysis software

  29. Individual resequencing [Figure: reads from an individual (IND) analyzed against the reference (REF)] • (i) base calling • (ii) read mapping • (iii) read assembly • (iv) SNP and short INDEL calling • (v) SV calling • (vi) data validation, hypothesis generation

  30. The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

  31. 1. Base calling • base sequence • base quality value sequence
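
The slide does not state the quality scale. Assuming base quality values are on the widely used Phred scale, Q = -10 log10(P_error), the conversion between a quality value and an error probability can be sketched as below. The function names are illustrative.

```python
import math


def error_prob_to_quality(p_error):
    """Phred-scaled quality for a given base-call error probability."""
    return -10.0 * math.log10(p_error)


def quality_to_error_prob(quality):
    """Error probability implied by a Phred-scaled quality value."""
    return 10.0 ** (-quality / 10.0)
```

Under this assumption a quality value of 20 corresponds to a 1% chance that the base call is wrong, and a value of 30 to 0.1%.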

  32. Base quality value calibration

  33. Recalibrated base quality values (Illumina)
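
One simple way to think about calibration is sketched here. It is not necessarily the method behind these slides: group bases by their reported quality value, measure the observed error rate in each group against a known truth (for example, alignments to a trusted reference), and convert that rate back to a Phred-scaled value. The input format and the add-one smoothing are assumptions of this sketch.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality value to an empirically observed one.

    observations: iterable of (reported_quality, is_error) pairs for bases
    whose true identity is known.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_error)
    table = {}
    for q, n in totals.items():
        # Add-one smoothing keeps the estimate finite when no errors were seen.
        p_error = (errors[q] + 1) / (n + 2)
        table[q] = round(-10 * math.log10(p_error))
    return table
```

In this toy setting, bases reported at Q30 that are in fact wrong about 1% of the time would be recalibrated to roughly Q20.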

  34. 2. Read mapping • Read mapping is like doing a jigsaw puzzle: you get the pieces, and they give you the picture on the box • The problem is that some pieces are easier to place than others

  35. Strategies to deal with non-unique mapping

  36. Mapping probabilities (qualities) [Figure: one read with three candidate map positions, with probabilities 0.8, 0.19 and 0.01]
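
A hedged sketch of how such mapping probabilities could be derived: normalize per-position alignment likelihoods into probabilities, assuming a uniform prior over the candidate positions, and summarize the best placement as a Phred-scaled mapping quality. The likelihood values in the usage note are made up to reproduce the numbers on the slide.

```python
import math


def mapping_probabilities(likelihoods):
    """Normalize alignment likelihoods of candidate positions to sum to 1."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def mapping_quality(probabilities):
    """Phred-scaled probability that the best placement is the wrong one."""
    p_wrong = 1.0 - max(probabilities)
    return float("inf") if p_wrong <= 0 else -10.0 * math.log10(p_wrong)
```

For instance, `mapping_probabilities([8.0, 1.9, 0.1])` gives approximately 0.8, 0.19 and 0.01, and the best placement then has a mapping quality of about 7.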

  37. Paired-end read alignments • PE sequences are now the “norm” for genome sequencing • Paired-end read alignments help unique read placement

  38. Gapped alignments • allow mapping of reads with insertion or deletion errors, and of reads with bona fide INDEL alleles • the ability to map reads with INDEL errors also improves the certainty of unique mapping
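
To illustrate why gaps matter, here is a small dynamic-programming sketch that scores the best placement of a whole read anywhere in a reference, with gaps allowed. It is a textbook-style semi-global alignment with linear gap costs, not the algorithm of any particular read mapper. The scoring values and sequences are arbitrary choices for the example.

```python
def gapped_align_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Best score for aligning the entire read anywhere in ref, gaps allowed."""
    # prev[j]: best score for the read prefix so far, ending at ref position j.
    prev = [0] * (len(ref) + 1)  # the alignment may start anywhere in ref
    for i in range(1, len(read) + 1):
        curr = [prev[0] + gap] + [0] * len(ref)
        for j in range(1, len(ref) + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # prev[j] + gap: extra base in the read; curr[j - 1] + gap: skipped ref base
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return max(prev)  # the alignment may end anywhere in ref
```

With the reference `GGACGTAGCTT`, the read `ACGTTAGC` (one inserted T) scores 5 when gaps are allowed. When gaps are made prohibitively expensive (`gap=-100`) it scores only 0, and that read would likely be discarded or misplaced.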

  39. 3. SNP and short-INDEL discovery • capillary sequences: either clonal or diploid traces

  40. SNP and short-INDEL discovery (II) [Figure: read alignments showing a SNP and an insertion (INS)] • New technologies are well suited to accurate SNP calling, and some also to short-INDEL detection

  41. New demands on SNP calling

  42. Rare alleles in 100s / 1,000s of samples

  43. More samples or deeper coverage / sample?

  44. Determining genotype directly from sequence
      individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
      individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
      individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
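
A minimal sketch of calling a genotype from allele counts, assuming a simple binomial model with a fixed per-base error rate. The model, the 1% error rate and the function name are assumptions of this illustration, not the method used in the talk.

```python
def call_genotype(ref_count, alt_count, ref="A", alt="C", error_rate=0.01):
    """Return the most likely diploid genotype given read counts per allele."""
    likelihoods = {}
    genotypes = ((ref + "/" + ref, 0.0), (ref + "/" + alt, 0.5), (alt + "/" + alt, 1.0))
    for genotype, alt_fraction in genotypes:
        # Probability that a single read shows the alt allele under this genotype.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        # The binomial coefficient is the same for all genotypes, so it is omitted.
        likelihoods[genotype] = (p_alt ** alt_count) * ((1 - p_alt) ** ref_count)
    return max(likelihoods, key=likelihoods.get)
```

With the read counts on this slide the sketch returns A/C for individual 1 (2 A reads, 2 C reads), C/C for individual 2 (4 C reads) and A/A for individual 3 (2 A reads). With only two reads, the heterozygote likelihood for individual 3 is still about a quarter of the homozygote likelihood, which is why coverage per sample matters.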

  45. 4. Structural variation discovery software [Screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region]

  46. 5. Data visualization (assembly viewers) • software development • data validation • hypothesis generation

  47. New analysis tools are needed • Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing) • Analysis pipelines and viewers that focus on the essential results while hiding most details (e.g. the few mutations in a mutant, or a comparison of 1,000 genome sequences) • Work-bench style tools to support downstream analysis

  48. Data storage and data standards

  49. What level of data to store? • images • traces • base-called reads • base quality values

  50. Data standards • different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data) • even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling) • requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)
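
The binary format and random access requirements can be illustrated with fixed-size records, where the n-th record is found by seeking to a computed offset instead of loading everything into RAM. The record layout below (read id, map position, mapping quality) is purely hypothetical and is not an existing file format.

```python
import io
import struct

# Hypothetical layout: 4-byte read id, 4-byte map position, 1-byte mapping quality.
RECORD = struct.Struct("<IIB")


def write_records(stream, records):
    """Append fixed-size binary records to a writable binary stream."""
    for record in records:
        stream.write(RECORD.pack(*record))


def read_record(stream, index):
    """Fetch the record at the given index without reading the others."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))
```

A usage example: after `buf = io.BytesIO()` and `write_records(buf, [(1, 100, 60), (2, 250, 30), (3, 900, 0)])`, the call `read_record(buf, 1)` returns the second record directly. Real formats add an index (for example, by genomic coordinate) on top of this idea.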
