
Informatics methods for next-generation DNA sequencing data analysis

Informatics methods for next-generation DNA sequencing data analysis. Gabor T. Marth Boston College Biology Department Boston College Biology Seminar October 14, 2008. Mission: analysis of genetic variations. Insertion-deletion polymorphisms. Single-base substitutions (SNPs).


Presentation Transcript


  1. Informatics methods for next-generation DNA sequencing data analysis. Gabor T. Marth, Boston College Biology Department. Boston College Biology Seminar, October 14, 2008

  2. Mission: analysis of genetic variations • Insertion-deletion polymorphisms • Single-base substitutions (SNPs) • Epigenetic variations (e.g. changes in methylation / chromatin structure) • Structural variations including large-scale chromosomal rearrangements

  3. Next-generation sequencing [Figure: bases per machine run (1 Mb to 10 Gb) vs. read length (10 bp to 1,000 bp)] • Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads) • 454 pyrosequencer (100-400 Mb in 200-450 bp reads) • ABI capillary sequencer

  4. Technologies

  5. Roche / 454 system • pyrosequencing technology • variable read-length • the only new technology with >100bp reads

  6. Illumina / Solexa Genome Analyzer • fixed-length short-read sequencer • very high throughput • read properties are very close to traditional capillary sequences

  7. AB / SOLiD system • fixed-length short-reads • very high throughput • 2-base encoding system • color-space informatics
     2-base encoding (color assigned to each pair of adjacent bases):
     1st Base   2nd Base: A  C  G  T
        A                 0  1  2  3
        C                 1  0  3  2
        G                 2  3  0  1
        T                 3  2  1  0
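The 2-base encoding on this slide can be sketched in a few lines. This is a minimal illustration, not vendor code: it assumes the standard SOLiD color scheme, in which numbering the bases A=0, C=1, G=2, T=3 makes each color the XOR of the two adjacent base codes. The function names `encode_colors` and `decode_colors` are hypothetical.

```python
# Assumption: standard SOLiD color scheme, where the color of a base pair
# equals the XOR of the two base codes (A=0, C=1, G=2, T=3).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return (first base, list of colors), one color per adjacent base pair."""
    seq = seq.upper()
    codes = [BASE_TO_CODE[b] for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # each color tells how to step from one base to the next
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


if __name__ == "__main__":
    first, colors = encode_colors("ATGGC")
    print(first, colors)
    print(decode_colors(first, colors))
```

Because every base is decoded from the previous one, a single miscalled color changes all downstream bases, which is one reason color-space reads are usually analyzed in color space rather than translated first.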

  8. Helicos / Heliscope system • short-read sequencer • single molecule sequencing • no amplification • variable read-length • error rate reduced with 2-pass template sequencing

  9. Read lengths are short [Figure: read length [bp] by platform, axis 0-400] • 20-60 (variable) • 25-50 (fixed) • 25-70 (fixed) • ~200-450 (variable)

  10. Base error rates are low [Figure panels: Illumina; 454]

  11. Next-gen sequencing enables new applications • organismal resequencing & de novo sequencing • transcriptome sequencing for transcript discovery and expression profiling (Ruby et al. Cell, 2006; Jones-Rhoades et al. PLoS Genetics, 2007) • epigenetic analysis (e.g. DNA methylation) (Meissner et al. Nature 2008)

  12. Tools for genome resequencing

  13. Whole-genome mutational profiling

  14. Large-scale individual human resequencing

  15. The resequencing informatics pipeline [Diagram labels: REF, IND] (i) base calling (ii) read mapping (iii) read assembly (iv) SNP and short INDEL calling (v) SV calling (vi) data validation, hypothesis generation

  16. The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

  17. 1. Base calling • diverse chemistry & sequencing error profiles • output: base sequence, base quality (Q-value) sequence
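A base quality (Q-value) is conventionally a Phred-scaled error probability, Q = -10 log10(P_error). The short sketch below shows the conversion in both directions; the function names are illustrative and not taken from any particular base caller.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale Q20 corresponds to a 1% chance that the base call is wrong and Q30 to 0.1%.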

  18. 454 pyrosequencer error profile • multiple bases in a homo-polymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs
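To make the "single scalar signal" problem concrete, here is a toy sketch of estimating the number of bases in a homopolymer run from one flow signal. It is not the PYROBAYES model: the Gaussian noise model, the way the noise grows with run length, and the geometric prior are all assumptions chosen only for illustration, as are the function and parameter names.

```python
import math


def homopolymer_posterior(signal, max_len=8, sd_floor=0.1, sd_per_base=0.15,
                          prior_decay=0.5):
    """Posterior over run length n = 0..max_len given one scalar flow signal.

    Assumed toy model: signal ~ Normal(n, sd_floor + sd_per_base * n),
    with prior P(n) proportional to prior_decay ** n.
    """
    weights = []
    for n in range(max_len + 1):
        sd = sd_floor + sd_per_base * n
        likelihood = math.exp(-0.5 * ((signal - n) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        weights.append(likelihood * prior_decay ** n)
    total = sum(weights)
    return [w / total for w in weights]


def call_length(signal):
    """Return (most probable run length, Phred-scaled quality of that call)."""
    posterior = homopolymer_posterior(signal)
    best = max(range(len(posterior)), key=posterior.__getitem__)
    p_wrong = max(1.0 - posterior[best], 1e-10)  # cap the quality at Q100
    return best, -10 * math.log10(p_wrong)


if __name__ == "__main__":
    for s in (1.0, 1.45, 3.0):
        print(s, call_length(s))
```

A signal that falls between two integers yields a lower quality value, and a wrong length call shows up as an insertion or deletion rather than a substitution.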

  19. 454 base quality values • the native 454 base caller assigns base quality values that are too low

  20. PYROBAYES: determine base number

  21. PYROBAYES: Performance • better correlation between assigned and measured quality values • higher fraction of high-quality bases

  22. 2. Read mapping • Read mapping is like doing a jigsaw puzzle… • …you get the pieces… • … and they give you the picture on the box • Unique pieces are easier to place than others…

  23. Non-uniqueness of reads confounds mapping • RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin

  24. Strategies to deal with non-unique mapping • mapping to multiple loci requires the assignment of alignment probabilities [Figure: one read with candidate placements weighted 0.8, 0.19, 0.01] • Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
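One simple way to assign alignment probabilities to a read with several candidate loci is to normalize per-locus likelihoods. The sketch below is an assumption-laden illustration, not the MOSAIK implementation: it scores each locus only by the base qualities at mismatching positions, treats all loci as equally likely a priori, and ignores indels. All function names are hypothetical.

```python
import math


def alignment_probabilities(mismatch_quals_per_locus):
    """Posterior probability of each candidate locus for one read.

    Input: one list per locus holding the Phred qualities of the read bases
    that mismatch the reference there (empty list = perfect match).
    Assumption: likelihood is the product of the error probabilities of the
    mismatching bases; the prior over loci is uniform.
    """
    likelihoods = []
    for quals in mismatch_quals_per_locus:
        likelihood = 1.0
        for q in quals:
            likelihood *= 10 ** (-q / 10)
        likelihoods.append(likelihood)
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


def mapping_quality(probabilities):
    """Phred-scaled probability that the best placement is wrong."""
    p_wrong = max(1.0 - max(probabilities), 1e-10)
    return -10 * math.log10(p_wrong)


if __name__ == "__main__":
    probs = alignment_probabilities([[], [20], [20, 20]])
    print(probs, mapping_quality(probs))
```

Reporting only uniquely mapped reads then amounts to keeping reads whose best placement exceeds a chosen probability threshold, while reporting all locations keeps the full list with its probabilities.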

  25. Paired-end reads help unique read placement • PE: fragment amplification, fragment length 100 - 600 bp; fragment length limited by amplification efficiency • MP: circularization, 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity [Korbel et al. Science 2007] • PE reads are now the standard for genome resequencing

  26. MOSAIK

  27. Error types are very different [Figure panels: Illumina; 454]

  28. Gapped alignments

  29. Aligning multiple read types together [Figure labels: ABI/capillary; 454 FLX; 454 GS20; Illumina] • Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources and error characteristics

  30. MOSAIK is one of the fastest read mappers

  31. 3. Polymorphism / mutation detection [Figure labels: sequencing error; polymorphism]

  32. SNP calling with Bayesian mathematics

  33. New challenges for SNP calling • deep alignments of 100s / 1000s of individuals • trio sequences

  34. Allele discovery is a multi-step sampling process: Population → Samples → Reads

  35. Capturing the allele in the sample
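The two sampling steps can be illustrated with elementary probability. The sketch below is a simplified model, not the analysis from the talk: it assumes diploid individuals drawn at random from the population, reads that sample the two chromosomes of a heterozygote with equal probability, and no sequencing error. The function names and the `min_reads` threshold are hypothetical.

```python
from math import comb


def p_allele_in_sample(freq, n_individuals):
    """Probability that an allele of population frequency `freq` occurs at
    least once among the 2 * n chromosomes of n diploid individuals."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


def p_seen_in_reads(coverage, min_reads=2, allele_fraction=0.5):
    """Probability that at least `min_reads` of `coverage` reads carry the
    allele, assuming each read carries it with probability `allele_fraction`
    (0.5 for a heterozygous individual)."""
    p_fewer = sum(
        comb(coverage, k) * allele_fraction ** k * (1 - allele_fraction) ** (coverage - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    print(p_allele_in_sample(0.01, 100))
    print(p_seen_in_reads(4), p_seen_in_reads(20))
```

Under these assumptions, adding individuals raises the first probability and adding coverage raises the second, which frames the trade-off between more samples and deeper coverage per sample.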

  36. Allele calling in deep sequence data
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaCgtacctac
      aatgtagtaCgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
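A Bayesian allele call weighs how well each genotype explains the observed read counts against how likely that genotype is a priori. The sketch below is a deliberately simplified single-sample, diploid, biallelic version and not the caller described in the talk: it assumes one uniform per-base error rate, independent reads, and illustrative prior values. The function name and defaults are hypothetical.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=(0.998, 0.001, 0.001)):
    """Posterior over (hom-ref, het, hom-alt) given reference/alternate counts.

    Assumptions: reads are independent; a read shows the alternate allele with
    probability `error` (hom-ref), 0.5 (het) or 1 - error (hom-alt); `priors`
    are illustrative genotype priors, not estimates from data.
    """
    n = n_ref + n_alt
    alt_fractions = (error, 0.5, 1.0 - error)
    joint = [
        prior * comb(n, n_alt) * f ** n_alt * (1.0 - f) ** n_ref
        for f, prior in zip(alt_fractions, priors)
    ]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    # The pileup on this slide has 12 reads with A and 2 reads with C.
    print(genotype_posteriors(n_ref=12, n_alt=2))
    # Same counts with a flat prior, to show how much the prior matters.
    print(genotype_posteriors(12, 2, priors=(1 / 3, 1 / 3, 1 / 3)))
```

With only 2 of 14 reads supporting the alternate base, the result depends strongly on the assumed error rate and priors, which is why per-base quality values and population-level information matter in deep alignments.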

  37. More samples or deeper coverage / sample? • Shallower read coverage from more individuals … • …or deeper coverage from fewer samples? • simulation analysis by Aaron Quinlan

  38. Analysis indicates a balance

  39. SNP calling in trios • the child inherits one chromosome from each parent • there is a small probability for a mutation in the child
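The two statements on this slide translate into a prior on the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site, genotypes coded as the number of alternate alleles (0, 1, 2), and a symmetric per-transmission mutation probability; the default `mutation_rate` is only an illustrative order of magnitude, and the function name is hypothetical.

```python
def child_genotype_prior(father, mother, mutation_rate=1e-8):
    """P(child genotype | parental genotypes) at a biallelic site.

    Genotypes are counts of the alternate allele (0, 1 or 2). Each parent
    transmits one of its two alleles at random; with probability
    `mutation_rate` (an assumed, illustrative value) the transmitted allele
    mutates to the other allele.
    """
    def p_transmit_alt(parent_alt_count):
        p_alt = parent_alt_count / 2
        return p_alt * (1 - mutation_rate) + (1 - p_alt) * mutation_rate

    pf = p_transmit_alt(father)
    pm = p_transmit_alt(mother)
    return {
        0: (1 - pf) * (1 - pm),
        1: pf * (1 - pm) + (1 - pf) * pm,
        2: pf * pm,
    }


if __name__ == "__main__":
    print(child_genotype_prior(1, 0))  # heterozygous father, hom-ref mother
    print(child_genotype_prior(0, 0))  # a child variant here requires a mutation
```

In a trio caller this prior would be combined with per-individual read likelihoods, so that weak evidence in the child is reinforced when a parent also carries the allele.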

  40. SNP calling in trios [Figure annotations: P=0.86; P=0.79 (child)]
      Reads shown with the parents (father, mother):
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaCgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      Reads shown with the child:
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaAgtacctac
      aatgtagtaCgtacctac

  41. 4. Structural variation discovery

  42. SV events from PE read mapping patterns [Diagram: DNA vs. reference; LF and LM as labeled in the figure]
      • Deletion (Ldel): LM ~ LF+Ldel & depth: low
      • Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
      • Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2 & depth: normal; LM ~ LF-LT1-LT2
      • Inversion (Linv): LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
      • Insertion (Lins): un-paired read clusters & depth: normal
      • Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
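The mapping-distance patterns on this slide suggest a simple per-pair classification rule. The sketch below is a rough illustration and not the Spanner algorithm: it assumes the library's fragment-length mean and standard deviation are known, flags pairs more than `n_sd` standard deviations from the mean, and leaves out the depth-of-coverage check and the clustering of supporting pairs that a real caller needs. All names and thresholds are hypothetical.

```python
def classify_pair(mapped_dist, frag_mean, frag_sd, same_chrom=True,
                  proper_orientation=True, n_sd=3.0):
    """Label one read pair by its mapping pattern.

    mapped_dist: distance between the mates on the reference (LM).
    frag_mean, frag_sd: expected fragment length (LF) and its spread.
    Returns (label, estimated event length or None).
    """
    if not same_chrom:
        return "cross-paired", None           # candidate chromosomal translocation
    if not proper_orientation:
        return "ends flipped", None           # candidate inversion
    if mapped_dist > frag_mean + n_sd * frag_sd:
        # LM ~ LF + Ldel, so Ldel ~ LM - LF
        return "deletion", mapped_dist - frag_mean
    if mapped_dist < frag_mean - n_sd * frag_sd:
        # LM ~ LF - Ldup, so Ldup ~ LF - LM
        return "tandem duplication", frag_mean - mapped_dist
    return "concordant", None


if __name__ == "__main__":
    print(classify_pair(3000, frag_mean=500, frag_sd=50))
    print(classify_pair(520, frag_mean=500, frag_sd=50))
```

A single discordant pair is weak evidence; candidate events are normally reported only when several pairs with the same pattern cluster at one location and the depth signal agrees.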

  43. Deletion: Aberrant positive mapping distance

  44. Copy number estimation from depth of coverage
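A minimal sketch of depth-based copy number estimation follows. It assumes read depth scales linearly with copy number and that the expected depth of a normal two-copy region is known; it ignores GC bias, mappability, and noise modeling, which practical tools must handle. The function names are hypothetical.

```python
def window_means(per_base_depth, window):
    """Average the per-base depth over consecutive non-overlapping windows."""
    chunks = (per_base_depth[i:i + window] for i in range(0, len(per_base_depth), window))
    return [sum(chunk) / len(chunk) for chunk in chunks]


def copy_number_estimates(window_depths, expected_depth, ploidy=2):
    """Estimate copy number per window as ploidy * observed / expected depth.

    `expected_depth` is the assumed depth of a normal region with `ploidy` copies.
    """
    return [ploidy * depth / expected_depth for depth in window_depths]


if __name__ == "__main__":
    depth = [30] * 4 + [15] * 4 + [45] * 4
    means = window_means(depth, window=4)
    print(copy_number_estimates(means, expected_depth=30))
```

Rounding the estimates to integers gives discrete copy number calls, with a heterozygous deletion appearing as roughly half the expected depth.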

  45. Spanner – a hybrid SV/CNV detection tool [Screenshot labels: Navigation bar; Fragment lengths in selected region; Depth of coverage in selected region]

  46. 5. Data visualization • aid software development: integration of trace data viewing, fast navigation, zooming/panning • facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays • promote hypothesis generation: integration of annotation tracks

  47. Data visualization

  48. Our software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

  49. Data mining projects

  50. Whole genome SNP discovery in Illumina data • C. elegans reference genome (Bristol, N2 strain) • Bristol, N2 strain (3 ½ machine runs) • Pasadena, CB4858 (1 ½ machine runs) • goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
