Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory Personal Genomes meeting October 9-12, 2008. Large-scale individual human resequencing. Next-gen sequencers offer vast throughput….
Next-gen sequencers offer vast throughput… [Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp). Illumina, AB/SOLiD short-read sequencers: 5-15 Gb in 25-70 bp reads. 454 pyrosequencer: 100-400 Mb in 200-450 bp reads. ABI capillary sequencer.]
The resequencing informatics pipeline (i) base calling (ii) read mapping (iii) read assembly (iv) SNP and short INDEL calling (v) SV calling (vi) data validation, hypothesis generation [Figure labels: REF, IND]
The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers
1. Base calling [Figure: base sequence and base quality (Q-value) sequence] • early manufacturer-supplied base callers were imperfect • third-party software made substantial improvements • machine manufacturers are now focusing more on base calling
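The base quality (Q-value) on this slide is conventionally a Phred-scaled error probability, Q = -10 log10(P_error). A minimal sketch of the conversion in both directions follows; the function names are illustrative and are not taken from any base caller mentioned in the talk.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality Q into an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Convert a base-call error probability into a Phred-scaled quality."""
    return -10 * math.log10(p_error)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

Under this convention Q20 corresponds to one expected error in 100 base calls, and Q30 to one in 1,000.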
2. Read mapping Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Larger, more unique pieces are easier to place than others…
Next-gen reads are generally short [Figure: read length [bp], axis from 0 to 400; per-platform ranges 20-60 (variable), 25-50 (fixed), 25-70 (fixed), ~200-450 (variable)]
Base error rates are low [Figure panels: Illumina, 454]
Mapping probabilities (qualities) [Figure: one read with three candidate placements, with probabilities 0.8, 0.19 and 0.01]
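One way to read the numbers on this slide: each candidate placement of a read gets a probability, and the probability that the chosen placement is wrong can be reported as a Phred-scaled mapping quality. The sketch below assumes that convention. The normalization over candidate placements, the cap of 60 and the function names are illustrative assumptions, not MOSAIK's actual scoring.

```python
import math


def placement_probabilities(likelihoods):
    """Normalize per-placement alignment likelihoods so they sum to 1."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def mapping_quality(p_correct, cap=60):
    """Phred-scale the probability that a placement is wrong (cap is an assumption)."""
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


if __name__ == "__main__":
    # Hypothetical unnormalized likelihoods that reproduce the slide's 0.8 / 0.19 / 0.01
    probs = placement_probabilities([8.0, 1.9, 0.1])
    for p in probs:
        print(round(p, 2), mapping_quality(p))
```

A read whose best placement has probability 0.8 still has a 1-in-5 chance of being misplaced, which is why downstream SNP callers benefit from carrying this value along with the alignment.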
Error types are very different [Figure panels: Illumina, 454]
MOSAIK • fast • accurate • gapped • versatile (short + long reads)
3. SNP and short-INDEL calling • deep alignments of 100s / 1000s of individuals • trio sequences
Allele discovery is a multi-step sampling process [Figure: Population, Samples, Reads]
Allele calling in the reads [Figure labels: number of individuals, allele call in read, base quality]
How many reads needed to call an allele? aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaCgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
More samples or deeper coverage / sample? Shallower read coverage from more individuals … …or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)
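The trade-off on this slide can be explored with a toy model. This is not the simulation by Aaron Quinlan referenced above. It is a back-of-the-envelope sketch under loudly stated assumptions: Hardy-Weinberg carrier frequencies, every carrier treated as heterozygous, exactly `depth` error-free reads per individual, and an allele counted as "called" once at least `min_alt_reads` reads show it. All names and thresholds are hypothetical.

```python
from math import comb


def detect_in_carrier(depth, min_alt_reads=2):
    """P(at least min_alt_reads of `depth` reads show the alternate allele | het carrier)."""
    if depth < min_alt_reads:
        return 0.0
    # Each read samples either chromosome with probability 1/2 (error-free reads assumed)
    missed = sum(comb(depth, i) for i in range(min_alt_reads)) / 2 ** depth
    return 1.0 - missed


def discovery_probability(freq, n_individuals, depth, min_alt_reads=2):
    """P(allele of population frequency `freq` is called in at least one individual)."""
    carrier = 1.0 - (1.0 - freq) ** 2  # at least one copy in a diploid individual
    per_individual = carrier * detect_in_carrier(depth, min_alt_reads)
    return 1.0 - (1.0 - per_individual) ** n_individuals


if __name__ == "__main__":
    # Same total sequencing budget (400x), split two ways
    print(discovery_probability(0.01, n_individuals=100, depth=4))
    print(discovery_probability(0.01, n_individuals=25, depth=16))
```

The two calls spend the same total coverage in different ways, so the function can be used to compare designs across allele frequencies. Real read depth varies from site to site and reads contain errors, so the numbers are only indicative.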
SNP calling in trios • the child inherits one chromosome from each parent • there is a small probability for a mutation in the child
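The two bullets above amount to a prior on the child's genotype given the parents' genotypes. The sketch below writes that prior down for a biallelic site. It is an illustration, not the caller used in the talk; the two-allele simplification, the default mutation probability `mu` and all function names are assumptions.

```python
def transmit(parent_genotype, allele, mu):
    """P(parent passes `allele` to the child), allowing mutation to the other allele."""
    # Each of the parent's two chromosomes is transmitted with probability 1/2
    p_inherit = sum(0.5 for a in parent_genotype if a == allele)
    # Biallelic simplification: a mutation always switches to the other allele
    return p_inherit * (1.0 - mu) + (1.0 - p_inherit) * mu


def child_genotype_prob(child, father, mother, mu=1e-8):
    """P(child genotype | parental genotypes); genotypes are 2-letter strings like 'AC'."""
    c1, c2 = child
    p = transmit(father, c1, mu) * transmit(mother, c2, mu)
    if c1 != c2:
        # A heterozygous child can receive either allele from either parent
        p += transmit(father, c2, mu) * transmit(mother, c1, mu)
    return p


if __name__ == "__main__":
    for child in ("AA", "AC", "CC"):
        print(child, child_genotype_prob(child, father="AC", mother="AA"))
```

Multiplying this prior by per-individual read likelihoods is one way trio information can raise or lower the confidence of a call seen in only a few reads.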
SNP calling in trios P=0.86 aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac father mother aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac P=0.79 child
4. Structural variation discovery
Read pair mapping pattern (breakpoint detection), compared with the DNA reference pattern:
• Deletion (Ldel): LM ~ LF+Ldel & depth: low
• Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
• Inversion (Linv): LM ~ +Linv & ends flipped, LM ~ -Linv, depth: normal
• Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2 & depth: normal, LM ~ LF-LT1-LT2
• Insertion (Lins): un-paired read clusters & depth normal
• Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
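The mapping patterns on this slide can be turned into a simple per-pair rule set. The sketch below assumes that LM is the distance at which the two ends map and LF is the expected fragment length; that reading, the `tolerance` parameter and all names are assumptions. It labels single pairs only as candidates, since the slide's signatures also rely on read clusters and depth of coverage, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom_1: str            # chromosome of the first end
    chrom_2: str            # chromosome of the second end
    mapped_distance: int    # LM: distance between the mapped ends
    orientation_ok: bool    # False when the ends are flipped
    both_mapped: bool = True


def classify_pair(pair, fragment_length, tolerance):
    """Label one read pair with the SV signature it is consistent with."""
    if not pair.both_mapped:
        return "insertion candidate"               # un-paired read
    if pair.chrom_1 != pair.chrom_2:
        return "chromosomal translocation candidate"  # cross-paired read
    if not pair.orientation_ok:
        return "inversion candidate"               # ends flipped
    if pair.mapped_distance > fragment_length + tolerance:
        return "deletion candidate"                # LM ~ LF + Ldel
    if pair.mapped_distance < fragment_length - tolerance:
        return "tandem duplication candidate"      # LM ~ LF - Ldup
    return "concordant"


if __name__ == "__main__":
    pair = ReadPair("chr1", "chr1", mapped_distance=3500, orientation_ok=True)
    print(classify_pair(pair, fragment_length=500, tolerance=100))
```

A too-long or too-short mapped distance is also compatible with the intra-chromosomal translocation pattern on the slide, which is one reason depth of coverage is needed to separate the cases.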
Copy number estimation Depth of read coverage
Heterozygous deletion “revealed” by normalization (Chip Stewart, Saturday poster session)
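A minimal sketch of how normalization can reveal a heterozygous deletion: divide the observed read depth in each window by the depth expected there, then scale by the ploidy. How the expected depth is obtained (for example from other samples or a coverage model) is left open here. The window values, the function name and the ploidy default are illustrative assumptions, not the method from the poster.

```python
def estimate_copy_number(observed_depth, expected_depth, ploidy=2):
    """Per-window copy number estimate from normalized depth of read coverage."""
    return [
        ploidy * observed / expected
        for observed, expected in zip(observed_depth, expected_depth)
    ]


if __name__ == "__main__":
    # Hypothetical windows: normal, half depth, 1.5x depth
    observed = [30, 15, 45]
    expected = [30, 30, 30]
    print(estimate_copy_number(observed, expected))
```

A window at about half the expected depth yields an estimate near 1 copy, which is the signature of a heterozygous deletion.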
5. Data visualization • software development • data validation • hypothesis generation
Summary • Next-generation sequencing is a boon for large-scale individual human resequencing • Basic data mining tools are being applied and tested in the 1000 Genomes Project • There is still a lot of fine-tuning to do • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes
Credits • Michael Stromberg • Chip Stewart • Aaron Quinlan • Michele Busby • Damien Croteau-Chonka • Eric Tsung • Derek Barnett • Weichun Huang Several postdoc positions are available… mail marth@bc.edu
Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release
Individual genotype directly from sequence
individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
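The three genotypes on this slide can be recovered from the reads with a small Bayesian calculation. The sketch below is a generic diploid genotype-likelihood model, not the caller presented in the talk. The flat genotype prior, the uniform base quality of Q30, the biallelic restriction and the function name are all assumptions.

```python
import math


def genotype_posteriors(bases, quals, alleles=("A", "C")):
    """Posterior over the three diploid genotypes at one site (flat prior assumed)."""
    a, b = alleles
    genotypes = [a + a, a + b, b + b]
    log_post = {}
    for genotype in genotypes:
        log_likelihood = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred-scaled base quality to error probability
            # A read samples either chromosome with probability 1/2;
            # a miscalled base is assumed equally likely to be any of the other 3
            p = sum((1 - err) if base == allele else err / 3 for allele in genotype) / 2
            log_likelihood += math.log(p)
        log_post[genotype] = log_likelihood
    # Normalize in a numerically stable way
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


if __name__ == "__main__":
    individual_1 = ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"]
    column = [read[6] for read in individual_1]  # the variable position in the slide's reads
    print(genotype_posteriors(column, [30] * len(column)))
```

With only two reads, as for individual 3, the homozygous call is the most probable one but a heterozygote whose second allele was simply not sampled remains plausible, which is why coverage matters for individual genotyping.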
Paired-end reads help unique read placement • PE (paired-end): fragment amplification, fragment length 100-600 bp; fragment length limited by amplification efficiency • MP (mate-pair): circularization, 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity (Korbel et al. Science 2007)
How many reads needed to call an allele? aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaCgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac P=0.08 P=0.82 aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac