Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads Justin Kennedy1 Joint work with Sanjiv Dinakar1, Yozen Hernandez2, Ion Mandoiu1, and Yufeng Wu1 1CSE Department, University of Connecticut 2Department of Computer Science, Hunter College

Outline • Introduction • Single SNP Genotype Calling • Multilocus Genotyping Problem • HMM-Posterior Algorithm • Experimental Results • Conclusion

Next Generation Sequencing (NGS) • New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing • More improvements expected in quest for $1,000 genome Roche / 454 Genome Sequencer FLX 100 Mb/run, 400bp reads Illumina / Solexa Genetic Analyzer 1G 1000 Mb/run, 35bp reads Applied Biosystems SOLiD 3000 Mb/run, 25-35bp reads

NGS Applications Besides reducing costs of de novo genome sequencing, NGS has found many more apps: Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, … NGS is enabling personal genomics James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07] Thousands more individual genomes to be sequenced as part of 1000 Genomes Project

Challenges in Medical Applications of Sequencing Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements) Requires accurate determination of both alleles at variable loci This is limited by coverage depth due to random nature of shotgun sequencing For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only 75-80% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08] [Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

Do Heuristic Inputs Help? We propose methods incorporating two additional sources of information: Quality scores reflecting uncertainty in sequencing data Allele frequency and linkage disequilibrium (LD) info extracted from reference panels such as Hapmap Experiments on a subset of the James Watson 454 reads show that our methods yield improved genotyping accuracy Improvement depends on the coverage depth (higher at lower coverage), e.g., accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6-fold mapped read coverage is achieved by our methods using less than 1/4 of the reads

Basic Notations • Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded) • SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous Mapped reads with allele 0 012100120 Inferred genotypes Mapped reads with allele 1 Sequencing errors

Prior Methods for Calling SNP Genotypes from Read Data • Prior methods are all based on allele coverage • [Levy et al 07] require that each allele be covered by at least 2 reads in order to be called • [Wheeler et al 08] use hypothesis testing based on the binomial distribution • To call a heterozygous genotype must have each allele covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01 • [Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k

Incorporating Base Call Uncertainty Let ridenote the set of mapped reads covering SNP locus i and ci=| ri| For a read r in ri, r(i) denotes the allele observed at locus i If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by The probability of observing read set riconditional on having genotype Gi is then given by:

Single SNP Genotype Calling Applying Bayes’ formula: Where are allele frequencies inferred from a representative panel

… F1 F2 Fn H1 H2 Hn … F'1 F'2 F'n H'1 H'2 H'n G1 G2 Gn R1,1 … R1,c R2,1 … R2,c Rn,1 … Rn,c n 1 2 Probabilistic Model for Multilocus Case HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

Model Training Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

Multilocus Genotyping Problem • GIVEN: • Shotgun read sets r=(r1, r2, … , rn) • Quality scores • Trained HMM models representing LD in populations of origin for mother/father • FIND: • Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r) Remark: maxgP(g | r) is hard to approximate within unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard

Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem HMM-Posterior Algorithm Experimental Results Conclusion

HMM-Posterior Decoding Algorithm For each i = 1..n, compute Return

Forward-Backward Computation of Posterior Probabilities fi … … hi f’i … … h’i gi r1,c ri,1 ri,c r1,1 … … Rn,1 … Rn,c 1 n i

Implementation Details Forward recurrences: Backward recurrences are similar

Runtime • Direct implementation gives O(m+nK4) time, where • m = number of reads • n = number of SNPs • K = number of founder haplotypes in HMMs • Runtime reduced to O(m+nK3) by reusing common terms: where

Read Data • Subset of James Watson’s 454 reads • 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/ • Average read length ~265 bp

Read Mapping Procedure Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04] Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90) Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels) Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates: FP rate: 0.37% FN rate: 21.16%

Read Mapping Results • Average coverage by mapped reads of Hapmap SNPs was 5.64x • Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

Haplotype and Genotype Data CEU genotypes from latest Hapmap release (23a) were dowloaded fromhttp://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/ Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMM models using the Baum-Welch algorithm Duplicate Affymetrix 500k SNP genotypes were downloaded fromftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% in CEU same-strand allele frequency

Accuracy Comparison (Homozygous Genotypes)

Accuracy Comparison (Heterozygous Genotypes)

Accuracy Comparison (All Genotypes)

Accuracy at Varying Coverages (All Genotypes)

Outline • Introduction • Single SNP Genotype Calling • Multilocus Genotype Problem (MGP) • HMM-Posterior Algorithm • Experimental Results • Conclusion

Conclusion • Exploiting “heuristic inputs” such as quality scores and population allele frequency and LD information yields significant improvements in genotyping calling accuracy from low-coverage sequencing data • LD information extracted from a reference panel gives highest benefit • Relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08] • Although our evaluation is on 454 reads, the methods are well-suited for short read technologies • Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

Acknowledgments • This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation

Questions?

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads