
Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth, Boston College Biology Department. Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008.


Presentation Transcript


  1. Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth, Boston College Biology Department. Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008

  2. Large-scale individual human resequencing

  3. Next-gen sequencers offer vast throughput… [Figure: bases per machine run (1 Mb to 10 Gb) vs. read length (10 bp to 1,000 bp)] • Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads) • 454 pyrosequencer (100-400 Mb in 200-450 bp reads) • ABI capillary sequencer

  4. The resequencing informatics pipeline (figure labels: REF, IND) • (i) base calling • (ii) read mapping • (iii) read assembly • (iv) SNP and short INDEL calling • (v) SV calling • (vi) data validation, hypothesis generation

  5. The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

  6. 1. Base calling (output: base sequence and base quality (Q-value) sequence) • early manufacturer-supplied base callers were imperfect • third party software made substantial improvements • machine manufacturers are now focusing more on base calling
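The slide mentions base quality Q-values without defining them. A minimal sketch of the conversion, assuming the Q-values follow the widely used Phred convention Q = -10 log10(P_error); the function names are illustrative and not taken from the talk:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value Q to a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability to a Phred-scaled quality value."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Higher Q means a lower chance that the called base is wrong.
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Under this convention each 10-point increase in Q corresponds to a tenfold drop in the error probability.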

  7. 2. Read mapping. Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Larger, more unique pieces are easier to place than others…

  8. Next-gen reads are generally short [Figure: read length ranges on a 0-400 bp axis: 20-60 (variable), 25-50 (fixed), 25-70 (fixed), ~200-450 (variable)]

  9. Base error rates are low [Figure panels: Illumina, 454]

  10. Strategies to deal with non-unique mapping

  11. Mapping probabilities (qualities) [Figure: one read with candidate placements at probabilities 0.8, 0.19, and 0.01]
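One way to picture how a read can end up with placement probabilities such as 0.8, 0.19, and 0.01 is to normalize a per-location alignment score across all candidate locations. The sketch below assumes that model, plus a Phred-scaled mapping quality defined from the probability that the best placement is wrong. The input likelihoods, the function names, and the cap of 60 are illustrative assumptions, not details from the talk or from any particular mapper.

```python
import math


def mapping_probabilities(likelihoods):
    """Normalize per-location alignment likelihoods so they sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return [x / total for x in likelihoods]


def mapping_quality(probs, cap=60):
    """Phred-scaled probability that the most likely placement is wrong.

    The cap is a hypothetical upper bound for reads with a single,
    unambiguous placement.
    """
    p_wrong = 1.0 - max(probs)
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


if __name__ == "__main__":
    # Hypothetical relative likelihoods for three candidate locations.
    probs = mapping_probabilities([8.0, 1.9, 0.1])
    print("placement probabilities:", [round(p, 2) for p in probs])
    print("mapping quality of best placement:", mapping_quality(probs))
```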

  12. Error types are very different [Figure panels: Illumina, 454]

  13. Gapped alignments

  14. MOSAIK • fast • accurate • gapped • versatile (short + long reads)

  15. 3. SNP and short-INDEL calling • deep alignments of 100s / 1000s of individuals • trio sequences

  16. Allele discovery is a multi-step sampling process [Figure labels: Population, Samples, Reads]

  17. Capturing the allele in the sample

  18. Allele calling in the reads [Figure labels: number of individuals, allele call in read, base quality]

  19. How many reads needed to call an allele? [Figure: pileup of 14 reads at a candidate site; 12 reads show aatgtagtaAgtacctac and 2 show aatgtagtaCgtacctac]
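The question on this slide can be explored with a simple binomial model. The sketch below assumes a heterozygous site where each read samples either allele with probability 0.5, ignores sequencing error, and treats "at least k reads carrying the alternate allele" as the calling rule. That rule and the function names are assumptions for illustration, not the criterion used in the talk.

```python
from math import comb


def prob_at_least_k_alt_reads(depth, k, alt_fraction=0.5):
    """P(at least k of `depth` reads carry the alternate allele), binomial model."""
    return sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )


def min_depth(k, target, max_depth=200):
    """Smallest depth at which the alternate allele is seen k+ times with P >= target."""
    for depth in range(k, max_depth + 1):
        if prob_at_least_k_alt_reads(depth, k) >= target:
            return depth
    return None  # target not reachable within max_depth


if __name__ == "__main__":
    # Depth needed to see the second allele at least twice, 95% of the time.
    print("minimum depth:", min_depth(k=2, target=0.95))
    # Probability of seeing two or more alternate reads at the depth of the
    # 14-read pileup shown on the slide.
    print("P(>=2 alt reads at depth 14):", prob_at_least_k_alt_reads(14, 2))
```

A real caller would also weigh base qualities, since a couple of discordant reads can come from sequencing error rather than a true allele.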

  20. The need for accurate data…

  21. … and realistic base quality values

  22. Recalibrated base quality values (Illumina)

  23. More samples or deeper coverage / sample? Shallower read coverage from more individuals… …or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)

  24. Analysis indicates a balance
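The trade-off between sample count and per-sample coverage can be illustrated with a toy analytic model. This is not the simulation credited on the previous slide; it is a sketch under stated assumptions: a fixed total coverage budget split evenly across samples, Hardy-Weinberg genotype frequencies, Poisson-distributed read counts, no sequencing error, and an allele counted as discovered when a single individual shows at least `min_alt_reads` reads carrying it. All names and parameter values are hypothetical.

```python
import math


def poisson_sf(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def discovery_probability(n_samples, coverage, allele_freq, min_alt_reads=2):
    """Probability that at least one sampled individual reveals the allele."""
    f = allele_freq
    p_het = 2 * f * (1 - f)  # carrier with one copy: half the reads show the allele
    p_hom = f * f            # carrier with two copies: all reads show the allele
    p_individual = (
        p_het * poisson_sf(min_alt_reads, coverage / 2)
        + p_hom * poisson_sf(min_alt_reads, coverage)
    )
    return 1.0 - (1.0 - p_individual) ** n_samples


if __name__ == "__main__":
    total_coverage = 400  # hypothetical budget: n_samples * coverage per sample
    for n in (10, 50, 100, 200, 400):
        c = total_coverage / n
        p = discovery_probability(n, c, allele_freq=0.01)
        print(f"{n:4d} samples at {c:5.1f}x each: P(discover) = {p:.3f}")
```

In this toy model, very few deeply sequenced samples rarely include a carrier, while very many shallow samples include carriers but seldom yield enough supporting reads, so an intermediate design does best.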

  25. SNP calling in trios • the child inherits one chromosome from each parent • there is a small probability for a mutation in the child
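The two bullets above amount to a transmission model: each parent passes on one of its two alleles, and the transmitted allele has a small chance of mutating. A minimal sketch, assuming a biallelic site with genotypes coded as the number of alternate alleles (0, 1, or 2); the mutation rate is a placeholder value and the function names are illustrative:

```python
def transmit_alt_prob(parent_alt_count, mutation_rate):
    """Probability that a parent passes an alternate allele to the child."""
    p = parent_alt_count / 2  # chance of picking an alternate allele, before mutation
    return p * (1 - mutation_rate) + (1 - p) * mutation_rate


def child_genotype_probs(father, mother, mutation_rate=1e-8):
    """P(child alternate-allele count | parental genotypes).

    The default mutation_rate is a hypothetical placeholder.
    """
    pf = transmit_alt_prob(father, mutation_rate)
    pm = transmit_alt_prob(mother, mutation_rate)
    return {
        0: (1 - pf) * (1 - pm),
        1: pf * (1 - pm) + (1 - pf) * pm,
        2: pf * pm,
    }


if __name__ == "__main__":
    # Heterozygous father, homozygous-reference mother.
    print(child_genotype_probs(father=1, mother=0))
    # Both parents homozygous reference: a variant child requires a mutation.
    print(child_genotype_probs(father=0, mother=0))
```

A trio-aware caller could use these probabilities as a prior that links the three genotypes, so that evidence in one family member informs the calls in the others.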

  26. SNP calling in trios [Figure: read pileups labeled father, mother, and child at one candidate site; 12 parental reads (11 aatgtagtaAgtacctac, 1 aatgtagtaCgtacctac) and 6 child reads (5 aatgtagtaAgtacctac, 1 aatgtagtaCgtacctac); annotated P=0.86 and P=0.79]

  27. 4. Structural variation discovery: read pair mapping pattern (breakpoint detection) [Figure columns: DNA, reference, pattern] • Deletion (Ldel): LM ~ LF+Ldel & depth: low • Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high • Inversion (Linv): LM ~ +Linv or LM ~ -Linv & ends flipped & depth: normal • Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal • Insertion (Lins): un-paired read clusters & depth: normal • Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters

  28. Copy number estimation: depth of read coverage
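A minimal sketch of how read depth can be turned into a copy number estimate, assuming that depth scales linearly with copy number and that the median window depth represents the normal diploid state. Real data would also need corrections (for example for GC content and mappability) that are omitted here; the window depths and function name are made up for illustration.

```python
from statistics import median


def copy_number_estimates(window_depths, ploidy=2):
    """Scale each window's read depth so the median window equals `ploidy` copies."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [ploidy * depth / baseline for depth in window_depths]


if __name__ == "__main__":
    # Hypothetical mean read depths in consecutive genomic windows.
    depths = [30, 31, 29, 15, 14, 30, 61, 30]
    for depth, cn in zip(depths, copy_number_estimates(depths)):
        print(f"depth {depth:3d} -> estimated copy number {cn:.2f}")
```

Windows at about half the baseline depth suggest a heterozygous deletion, and windows at about twice the baseline suggest a duplication.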

  29. Deletion: Aberrant positive mapping distance

  30. Tandem duplication: negative mapping distance
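The read-pair signatures on these slides can be expressed as a simple rule-based classifier. The sketch below is a simplification under stated assumptions: each pair carries a signed mapping distance, the expected fragment length distribution is summarized by a mean and standard deviation, and a fixed cutoff of `n_sd` standard deviations marks an aberrantly long pair. The `ReadPair` fields, the thresholds, and the labels are hypothetical, and a real caller would require clusters of consistent pairs plus depth evidence before reporting an event.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    chrom2: str
    mapping_distance: int  # signed distance between the mapped ends
    ends_flipped: bool     # True if relative orientation differs from the library's


def classify_pair(pair, frag_mean, frag_sd, n_sd=3.0):
    """Assign a read pair to a candidate structural-variant signature."""
    if pair.chrom1 != pair.chrom2:
        return "cross-chromosome (translocation candidate)"
    if pair.ends_flipped:
        return "inversion candidate"
    if pair.mapping_distance < 0:
        return "tandem duplication candidate"
    if pair.mapping_distance > frag_mean + n_sd * frag_sd:
        return "deletion candidate"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        ReadPair("chr1", "chr1", 310, False),
        ReadPair("chr1", "chr1", 2500, False),
        ReadPair("chr1", "chr1", -400, False),
        ReadPair("chr1", "chr1", 320, True),
        ReadPair("chr1", "chr7", 0, False),
    ]
    for p in pairs:
        print(p.mapping_distance, "->", classify_pair(p, frag_mean=300, frag_sd=30))
```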

  31. Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)

  32. 5. Data visualization • software development • data validation • hypothesis generation

  33. Summary • Next-generation sequencing is a boon for large-scale individual human resequencing • Basic data mining tools are being applied and tested in the 1000 Genomes Project • There is still a lot of fine-tuning to do • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes

  34. Credits: Michael Stromberg, Chip Stewart, Aaron Quinlan, Michele Busby, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang. Several postdoc positions are available… mail marth@bc.edu

  35. Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

  36. Positions Several postdoc positions are available… mail marth@bc.edu

  37. Individual genotype directly from sequence • individual 1: AACGTTAGCATA, AACGTTAGCATA, AACGTTCGCATA, AACGTTCGCATA → A/C • individual 2: AACGTTCGCATA, AACGTTCGCATA, AACGTTCGCATA, AACGTTCGCATA → C/C • individual 3: AACGTTAGCATA, AACGTTAGCATA → A/A
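The genotype assignments on this slide can be reproduced with a small likelihood calculation over the A and C read counts at the variant position. The sketch assumes a single uniform per-base error rate (0.01, a placeholder), errors that always convert one allele into the other, and a uniform prior over the three genotypes; none of these choices come from the talk.

```python
def genotype_likelihoods(n_a, n_c, error_rate=0.01):
    """Likelihood of the observed A/C read counts under each diploid genotype."""
    e = error_rate
    return {
        "A/A": (1 - e) ** n_a * e**n_c,   # every C read must be an error
        "A/C": 0.5 ** (n_a + n_c),        # each read samples either allele
        "C/C": e**n_a * (1 - e) ** n_c,   # every A read must be an error
    }


def call_genotype(n_a, n_c, error_rate=0.01):
    """Return the most likely genotype and its posterior under a uniform prior."""
    likelihoods = genotype_likelihoods(n_a, n_c, error_rate)
    total = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods[best] / total


if __name__ == "__main__":
    # Read counts (A, C) at the variant position for the three individuals.
    for name, (n_a, n_c) in {
        "individual 1": (2, 2),
        "individual 2": (0, 4),
        "individual 3": (2, 0),
    }.items():
        genotype, posterior = call_genotype(n_a, n_c)
        print(f"{name}: {genotype} (posterior {posterior:.3f})")
```

With only two reads, as for individual 3, the homozygous call is noticeably less certain than with four reads, because a heterozygote can easily yield two reads of the same allele by chance.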

  38. Genotyping from primary sequence data

  39. Most reads contain no or few errors

  40. Paired-end reads help unique read placement. PE (paired-end): • fragment amplification: fragment length 100 - 600 bp • fragment length limited by amplification efficiency. MP (mate-pair): • circularization: 500 bp - 10 kb (sweet spot ~3 kb) • fragment length limited by library complexity. Korbel et al. Science 2007

  41. How many reads needed to call an allele? [Figure: read pileups at a candidate site, with most reads showing aatgtagtaAgtacctac and a few showing aatgtagtaCgtacctac; annotated P=0.08 and P=0.82]
