Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth, Boston College Biology Department. Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008

Large-scale individual human resequencing

Next-gen sequencers offer vast throughput… [Figure: bases per machine run (1 Mb to 10 Gb) vs. read length (10 bp to 1,000 bp). Illumina, AB/SOLiD short-read sequencers: 5-15 Gb in 25-70 bp reads; 454 pyrosequencer: 100-400 Mb in 200-450 bp reads; ABI capillary sequencer]

The resequencing informatics pipeline: (i) base calling (ii) read mapping (iii) read assembly (iv) SNP and short INDEL calling (v) SV calling (vi) data validation, hypothesis generation [Figure labels: REF, IND]

The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

1. Base calling [Figure: base sequence and base quality (Q-value) sequence] • early manufacturer-supplied base callers were imperfect • third party software made substantial improvements • machine manufacturers are now focusing more on base calling
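To make the Q-value concrete, here is a minimal sketch that assumes the base qualities are Phred-scaled (Q = -10 log10 of the error probability). The slide does not name the scale, so treat that as an assumption; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value Q to a base error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert a base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)

# Under this convention Q20 means a 1% chance that the base call is wrong.
q20_error = phred_to_error_prob(20)
```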

2. Read mapping. Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Larger, more unique pieces are easier to place than others…

Next-gen reads are generally short [Figure: read length ranges on a 0-400 bp axis: 20-60 bp (variable), 25-50 bp (fixed), 25-70 bp (fixed), ~200-450 bp (variable)]

Base error rates are low [Figure panels: Illumina, 454]

Strategies to deal with non-unique mapping

Mapping probabilities (qualities) [Figure: one read with three candidate placements, with probabilities 0.8, 0.19 and 0.01]
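One way such placement probabilities could arise is by normalizing a per-location alignment score so that the candidates sum to one. The sketch below assumes that approach and a Phred-style mapping quality; neither is stated on the slide, and the function names and the cap of 60 are hypothetical.

```python
import math

def mapping_probabilities(likelihoods):
    """Normalize per-location alignment likelihoods into placement probabilities."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one candidate location needs a positive likelihood")
    return [x / total for x in likelihoods]

def mapping_quality(p_correct, cap=60.0):
    """Phred-style quality for the chosen placement (assumed convention, capped)."""
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))

# Hypothetical relative likelihoods chosen to reproduce the 0.8 / 0.19 / 0.01 split.
probs = mapping_probabilities([80, 19, 1])
best_quality = mapping_quality(probs[0])
```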

Error types are very different [Figure panels: Illumina, 454]

Gapped alignments

MOSAIK • fast • accurate • gapped • versatile (short + long reads)

3. SNP and short-INDEL calling • deep alignments of 100s / 1000s of individuals • trio sequences

Allele discovery is a multi-step sampling process: Population → Samples → Reads

Capturing the allele in the sample
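As a rough illustration of the first sampling step, the sketch below computes the chance that an allele of a given population frequency is present at least once among the sampled chromosomes. It assumes diploid individuals drawn independently from a randomly mating population, which is a simplification not taken from the slides.

```python
def capture_probability(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    n_chromosomes = ploidy * n_individuals
    return 1 - (1 - allele_freq) ** n_chromosomes

# A 1% allele is present in a sample of 50 individuals roughly 63% of the time.
p_capture = capture_probability(0.01, 50)
```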

Allele calling in the reads [Figure labels: number of individuals, allele call in read, base quality]

How many reads needed to call an allele? [Figure: 14 aligned reads; 12 read aatgtagtaAgtacctac and 2 read aatgtagtaCgtacctac]
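A back-of-the-envelope answer to the question can be sketched with a binomial model: at a heterozygous site each read is assumed to show the alternate allele with probability 0.5, and we ask for the smallest depth at which at least k such reads are seen with a chosen confidence. Sequencing errors are ignored here, and the 0.99 target is an arbitrary illustrative choice.

```python
from math import comb

def prob_at_least_k_alt_reads(depth, k, alt_fraction=0.5):
    """P(at least k reads show the alternate allele) under a binomial model."""
    p_fewer = sum(
        comb(depth, i) * alt_fraction ** i * (1 - alt_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer

def min_depth(k, target=0.99, alt_fraction=0.5, max_depth=1000):
    """Smallest read depth that reaches the target probability, or None."""
    for depth in range(k, max_depth + 1):
        if prob_at_least_k_alt_reads(depth, k, alt_fraction) >= target:
            return depth
    return None
```

Under these assumptions, requiring two supporting reads instead of one raises the depth needed at a heterozygous site from 7 to 11 reads.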

The need for accurate data…

… and realistic base quality values

Recalibrated base quality values (Illumina)
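The slide does not say how the recalibration was done. One generic approach, sketched here purely as an assumption, is to bin bases by their reported quality, count mismatches against positions believed to be non-variant, and convert the observed mismatch rate back to a Phred-style value. The pseudocounts are an illustrative choice to avoid taking the logarithm of zero.

```python
import math
from collections import defaultdict

def recalibration_table(observations):
    """Map each reported Q to an empirical Q from observed mismatch rates.

    observations: iterable of (reported_q, is_mismatch) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [total, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (total, mismatches) in counts.items():
        rate = (mismatches + 1) / (total + 2)  # pseudocounts (assumption)
        table[reported_q] = -10 * math.log10(rate)
    return table

# Hypothetical data: bases reported as Q30 that mismatch about 10% of the time.
example = [(30, True)] * 9 + [(30, False)] * 89
empirical = recalibration_table(example)
```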

More samples or deeper coverage / sample? Shallower read coverage from more individuals… …or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)

Analysis indicates a balance
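The trade-off can be illustrated with a toy analytic model. This is not the simulation referred to in the talk; it is a sketch under strong assumptions: a fixed total sequencing budget split evenly across individuals, all carriers treated as heterozygous, Poisson read sampling, and an allele counted as discovered only if a single carrier shows at least two reads with it.

```python
import math

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1 - below

def discovery_power(n_individuals, total_coverage, allele_freq, min_alt_reads=2):
    """Chance that at least one individual both carries and reveals the allele."""
    coverage = total_coverage / n_individuals        # fold coverage per individual
    carrier = 1 - (1 - allele_freq) ** 2             # carries at least one copy
    seen = poisson_tail(coverage / 2, min_alt_reads)  # reads from the alt chromosome
    return 1 - (1 - carrier * seen) ** n_individuals

# Hypothetical budget: 400x total coverage, 1% allele, several study designs.
designs = [10, 50, 100, 200, 400, 1600]
power = {n: discovery_power(n, 400, 0.01) for n in designs}
best = max(power, key=power.get)
```

In this toy setting the best design is an intermediate one (100 individuals at 4x each): very few deeply sequenced samples rarely contain the allele, while very many shallow samples rarely show it twice.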

SNP calling in trios • the child inherits one chromosome from each parent • there is a small probability for a mutation in the child
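The two statements above can be turned into a small transmission model. This is a sketch, not the caller used in the talk: each parent is assumed to pass on one of its two alleles at random, and the transmitted allele is assumed to flip with a small probability. The default rate of 1e-8 is an illustrative placeholder.

```python
def transmit_alt_prob(parent_alt_count, mutation_rate):
    """P(parent transmits the alternate allele), allowing for a mutation."""
    base = parent_alt_count / 2  # 0, 0.5 or 1 for genotypes with 0, 1, 2 alt alleles
    return base * (1 - mutation_rate) + (1 - base) * mutation_rate

def child_genotype_probs(father_alt, mother_alt, mutation_rate=1e-8):
    """P(child has 0, 1 or 2 alternate alleles) given the parental genotypes."""
    pf = transmit_alt_prob(father_alt, mutation_rate)
    pm = transmit_alt_prob(mother_alt, mutation_rate)
    return [(1 - pf) * (1 - pm), pf * (1 - pm) + (1 - pf) * pm, pf * pm]

# Two heterozygous parents give the familiar 1:2:1 ratio when mutation is ignored.
mendelian = child_genotype_probs(1, 1, mutation_rate=0.0)
```

Such a prior makes a variant seen in the child but in neither parent very unlikely, so it takes stronger read evidence to call it.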

SNP calling in trios [Figure: aligned reads for father, mother and child; most read aatgtagtaAgtacctac, while one parental read and one child read show aatgtagtaCgtacctac; annotated with P=0.86 and P=0.79]

4. Structural variation discovery
Read pair mapping pattern (breakpoint detection) [Figure: DNA vs. reference patterns, with mapping distance LM and fragment length LF]
• Deletion (Ldel): LM ~ LF+Ldel & depth: low
• Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
• Inversion (Linv): LM ~ +Linv or LM ~ -Linv & ends flipped & depth: normal
• Insertion (Lins): un-paired read clusters & depth: normal
• Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal
• Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters

Copy number estimation: depth of read coverage
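A minimal sketch of depth-based copy number estimation, assuming read depth has already been summarized in fixed windows and that the median window represents the normal diploid state. Real data would also need corrections (for example for GC content or mappability) that are not modeled here.

```python
from statistics import median

def copy_number_estimates(window_depths, ploidy=2):
    """Scale each window's depth by the median depth to estimate copy number."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [ploidy * depth / baseline for depth in window_depths]

# Hypothetical window depths: two windows at half depth, one at double depth.
depths = [30, 31, 29, 15, 14, 30, 60, 30]
calls = [round(cn) for cn in copy_number_estimates(depths)]
```

Here the half-depth windows come out at copy number 1, which is the signature expected for a heterozygous deletion, and the double-depth window at copy number 4.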

Deletion: Aberrant positive mapping distance

Tandem duplication: negative mapping distance
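The two signatures above can be sketched as a simple rule on the mapped distance of a read pair. The threshold of three standard deviations, the sign convention for the mapped distance, and the function name are assumptions for illustration; a real detector would also cluster supporting pairs and check read depth.

```python
def classify_read_pair(mapped_distance, fragment_mean, fragment_sd, n_sd=3):
    """Label one read pair and estimate the event length it implies.

    mapped_distance: signed distance between the mates on the reference,
    positive when the mates map in the expected order (assumed convention).
    """
    upper = fragment_mean + n_sd * fragment_sd
    lower = fragment_mean - n_sd * fragment_sd
    if mapped_distance > upper:
        # Mates map further apart than the fragment: sequence missing in the sample.
        return ("deletion candidate", mapped_distance - fragment_mean)
    if mapped_distance < 0:
        # Mates map in reversed order: consistent with a tandem duplication.
        return ("tandem duplication candidate", fragment_mean - mapped_distance)
    if mapped_distance < lower:
        return ("short, unclassified", fragment_mean - mapped_distance)
    return ("normal", 0)

# Hypothetical library with 200 bp fragments and a 20 bp standard deviation.
examples = [classify_read_pair(d, 200, 20) for d in (1200, -300, 210)]
```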

Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)

5. Data visualization • software development • data validation • hypothesis generation

Summary • Next-generation sequencing is a boon for large-scale individual human resequencing • Basic data mining tools are being applied and tested in the 1000 Genomes Project • There is still a lot of fine-tuning to do • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes

Credits: Michael Stromberg, Chip Stewart, Aaron Quinlan, Michele Busby, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang. Several postdoc positions are available… mail marth@bc.edu

Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release


Individual genotype directly from sequence
individual 1: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA → A/C
individual 2: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA → C/C
individual 3: AACGTTAGCATA AACGTTAGCATA → A/A
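The genotype calls in this example can be reproduced with a simple Bayesian sketch. It assumes a uniform per-read error rate of 1%, a flat prior over the three genotypes, and independent reads; none of these choices come from the slides, and real callers would use per-base qualities and population priors.

```python
def genotype_posteriors(a_count, c_count, error_rate=0.01):
    """Posterior over genotypes A/A, A/C, C/C from counts of A and C reads."""
    # P(a read shows C | genotype): only by error for A/A, half the time for A/C.
    p_c = {"A/A": error_rate, "A/C": 0.5, "C/C": 1 - error_rate}
    # Flat prior; the binomial coefficient is the same for all genotypes and cancels.
    likelihoods = {g: p ** c_count * (1 - p) ** a_count for g, p in p_c.items()}
    total = sum(likelihoods.values())
    return {g: value / total for g, value in likelihoods.items()}

def call_genotype(a_count, c_count, error_rate=0.01):
    """Return the most probable genotype."""
    posteriors = genotype_posteriors(a_count, c_count, error_rate)
    return max(posteriors, key=posteriors.get)

# Read counts (A reads, C reads) for individuals 1, 2 and 3 in the example.
calls = [call_genotype(2, 2), call_genotype(0, 4), call_genotype(2, 0)]
```

With only two reads, individual 3 is called A/A but the heterozygous genotype keeps a noticeable share of the posterior, which is why depth matters for confident genotyping.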

Genotyping from primary sequence data

Most reads contain no or few errors

Paired-end reads help unique read placement
PE (paired-end) • fragment amplification: fragment length 100 - 600 bp • fragment length limited by amplification efficiency
MP (mate-pair) • circularization: 500 bp - 10 kb (sweet spot ~3 kb) • fragment length limited by library complexity
Korbel et al. Science 2007

How many reads needed to call an allele? [Figure: 32 aligned reads; 28 read aatgtagtaAgtacctac and 4 read aatgtagtaCgtacctac; annotated with P=0.08 and P=0.82]
