Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology

Ion Mandoiu, University of Connecticut

Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

Single Nucleotide Polymorphisms • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • High density in the human genome: about 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs • Example (SNP positions capitalized):
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele; 2 if the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g. 011100110 + 001000010 = 021200210
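The 0/1/2 encoding on this slide can be sketched in a few lines of Python. The function name `genotype_from_haplotypes` and the use of strings for the vectors are illustrative assumptions, not part of the original slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Encoding from the slide: 0 = both chromosomes carry the major allele,
    1 = both carry the minor allele, 2 = the chromosomes differ.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The haplotype pair shown on the slide.
example = genotype_from_haplotypes("011100110", "001000010")
print(example)  # 021200210
```

Note that the mapping is many-to-one: several haplotype pairs can produce the same genotype, which is what makes phasing a non-trivial problem.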

Sources of Haplotype Diversity: Mutation The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

Sources of Haplotype Diversity: Recombination

Haplotype Structure in Human Populations

HMM Model of Haplotype Frequencies • [Figure: chain of hidden founder states F_1, F_2, …, F_n, each F_i emitting the observed allele H_i] • F_i = founder haplotype at locus i, H_i = observed allele at locus i • P(F_1), P(F_i | F_{i-1}) and P(H_i | F_i) estimated from reference genotype or haplotype data • For a given haplotype h, P(H = h | M) can be computed in O(nK^2) time using the forward algorithm, where n is the number of SNPs and K the number of founder haplotypes • Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
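A minimal sketch of the forward computation of P(H = h | M) follows. The array layout (`init`, `trans`, `emit`) and the toy parameter values are assumptions chosen for illustration; in the slides these probabilities are estimated from reference data.

```python
import itertools

import numpy as np


def haplotype_probability(h, init, trans, emit):
    """Return P(H = h | M) with the forward algorithm in O(n K^2) time.

    Assumed layout (hypothetical, for illustration only):
      init[k]        = P(F_1 = k)
      trans[i][j, k] = P(next founder = k | current founder = j) for the
                       step from locus i to locus i + 1 (0-based i)
      emit[i][k, a]  = P(H = a | F = k) at locus i (0-based)
    """
    alpha = init * emit[0][:, h[0]]
    for i in range(1, len(h)):
        # One K x K matrix-vector product per locus gives the O(n K^2) bound.
        alpha = (alpha @ trans[i - 1]) * emit[i][:, h[i]]
    return float(alpha.sum())


# Toy model with K = 2 founders and n = 3 loci (made-up numbers).
n = 3
init = np.array([0.5, 0.5])
trans = [np.array([[0.9, 0.1], [0.1, 0.9]]) for _ in range(n - 1)]
emit = [np.array([[0.95, 0.05], [0.05, 0.95]]) for _ in range(n)]

p_000 = haplotype_probability([0, 0, 0], init, trans, emit)
p_010 = haplotype_probability([0, 1, 0], init, trans, emit)
# Probabilities over all 2^n haplotypes should sum to 1.
total = sum(
    haplotype_probability(list(h), init, trans, emit)
    for h in itertools.product([0, 1], repeat=n)
)
print(p_000, p_010, total)
```

For long chromosomes the forward values underflow; a practical implementation would rescale `alpha` at each locus or work in log space, which this sketch omits.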

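Genotype Phasing • Genotype g: 0010212 • Candidate phasing 1: h1: 0010111, h2: 0010010 • Candidate phasing 2: h3: 0010011, h4: 0010110 • Which pair of haplotypes gave rise to g?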
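The ambiguity on this slide can be made concrete by enumerating every haplotype pair compatible with a genotype. The sketch below is an illustration only; the function name `compatible_pairs` is hypothetical. A genotype with k heterozygous sites has 2^(k-1) unordered phasings when k ≥ 1.

```python
from itertools import product


def compatible_pairs(g):
    """Yield every unordered haplotype pair (h1, h2) with h1 + h2 = g.

    g is a 0/1/2 genotype string; 2 marks a heterozygous site.
    """
    het = [i for i, x in enumerate(g) if x == "2"]
    base = list(g)
    if not het:
        # Fully homozygous genotype: both haplotypes equal the genotype.
        h = "".join(base)
        yield h, h
        return
    # Pin allele 1 on h1 at the first heterozygous site so that mirrored
    # pairs (h2, h1) are not reported twice.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, ("1",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


pairs = list(compatible_pairs("0010212"))
for h1, h2 in pairs:
    print(h1, h2)
```

For the slide's genotype 0010212 this yields exactly the two candidate phasings shown, (0010111, 0010010) and (0010110, 0010011).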

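Maximum Likelihood Genotype Phasing • [Figure: two founder chains F_1 … F_n and F'_1 … F'_n emitting haplotype alleles H_i and H'_i, which together determine the genotype G_i at each locus] • Maximum likelihood genotype phasing: given g, find (h1, h2) = argmax_{h1 + h2 = g} P(h1 | M) P(h2 | M)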

Computational Complexity • [KMP08] Cannot approximate max_{h1 + h2 = g} P(h1 | M) P(h2 | M) within a factor of O(n^{1/2 - ε}) for any ε > 0, unless ZPP = NP • [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to the best existing methods (PHASE)

Genotyping Errors • A real problem despite advances in technology & typing algorithms • 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005] • Systematic errors (e.g., assay failure) typically detected by departure from Hardy-Weinberg equilibrium (HWE) [Hosking et al. 2004] • In pedigrees, some errors detected as Mendelian Inconsistencies (MIs) • Many errors remain undetected • As many as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
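To illustrate why Mendelian inconsistency checks miss many errors, here is a small sketch of a per-SNP trio check using the 0/1/2 genotype encoding from the earlier slides. The function names and the example genotypes with injected errors are illustrative assumptions.

```python
# Alleles a parent can transmit, by genotype code:
# 0 = homozygous major, 1 = homozygous minor, 2 = heterozygous.
TRANSMITTABLE = {0: {0}, 1: {1}, 2: {0, 1}}


def mendelian_consistent(mother, father, child):
    """True if the child's genotype at one SNP can arise from the parents."""
    for a in TRANSMITTABLE[mother]:
        for b in TRANSMITTABLE[father]:
            # Two equal alleles give genotype 0 or 1; unequal alleles give 2.
            if (a if a == b else 2) == child:
                return True
    return False


def mendelian_inconsistencies(mother, father, child):
    """Return the 0-based SNP indices where the trio is inconsistent."""
    return [
        i
        for i, (m, f, c) in enumerate(zip(mother, father, child))
        if not mendelian_consistent(m, f, c)
    ]


mother = [0, 1, 2, 1, 0, 2]
father = [0, 2, 2, 1, 0, 2]
child = [0, 2, 2, 1, 0, 2]

clean = mendelian_inconsistencies(mother, father, child)
# Hypothetical error at SNP index 1: mother is homozygous minor, so the
# child cannot be homozygous major.
detected = mendelian_inconsistencies(mother, father, [0, 0, 2, 1, 0, 2])
# Hypothetical error at SNP index 2: both parents are heterozygous, so any
# child genotype is Mendelian consistent and the error goes unnoticed.
missed = mendelian_inconsistencies(mother, father, [0, 2, 0, 1, 0, 2])
print(clean, detected, missed)
```

The third case shows the limitation the slide describes: when both parents are heterozygous, no child genotype at that SNP can be flagged by this check alone.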

Likelihood Sensitivity Approach to Error Detection in Trios • Original trio T with its best phasing: • Mother genotype 0 1 2 1 0 2: h1 = 0 1 1 1 0 0, h2 = 0 1 0 1 0 1 • Father genotype 0 2 2 1 0 2: h3 = 0 0 0 1 0 1, h4 = 0 1 1 1 0 0 • Child genotype 0 2 2 1 0 2: h1 = 0 1 1 1 0 0, h3 = 0 0 0 1 0 1 • L(T) = likelihood of best phasing for original trio T

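Likelihood Sensitivity Approach to Error Detection in Trios • Modified trio T' with its best phasing: • Mother genotype 0 1 2 1 0 2: h'1 = 0 1 0 1 0 1, h'2 = 0 1 1 1 0 0 • Father genotype 0 2 2 1 0 2: h'3 = 0 0 0 1 0 0, h'4 = 0 1 1 1 0 1 • Child genotype 0 2 2 1 0 2, modified at the tested SNP so that it is phased as h'1 = 0 1 0 1 0 1, h'3 = 0 0 0 1 0 0 • L(T') = likelihood of best phasing for modified trio T', to be compared with L(T), the likelihood of best phasing for original trio T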

Likelihood Sensitivity Approach to Error Detection in Trios • Trio genotypes: Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2 • Large change in likelihood suggests likely error • Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., R = 10^4)
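The flagging rule can be sketched as a comparison of log-likelihoods, which avoids underflow when L(T) and L(T') are tiny. The function name `flag_error`, the default threshold of 10^4 (taken as the reading of the slide's example value), and the numeric likelihoods below are assumptions for illustration; computing L(T) and L(T') themselves requires the HMM and is not shown.

```python
import math


def flag_error(log_l_original, log_l_modified, threshold=1e4):
    """Flag a genotype if L(T') / L(T) > R, given natural-log likelihoods.

    log_l_original: log L(T) for the trio as typed.
    log_l_modified: log L(T') for the trio with the tested genotype changed.
    threshold:      detection threshold R (assumed default 10^4).
    """
    return (log_l_modified - log_l_original) > math.log(threshold)


# Made-up likelihood values, only to exercise the rule.
strong = flag_error(math.log(1e-12), math.log(1e-7))  # ratio 10^5
weak = flag_error(math.log(1e-12), math.log(1e-9))    # ratio 10^3
print(strong, weak)
```

Raising the threshold trades sensitivity for specificity: fewer true errors are flagged, but fewer correct genotypes are flagged by mistake.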

Alternate Likelihood Functions • [KMP08] Cannot approximate L(T) within O(n^{1/4 - ε}) for any ε > 0, unless ZPP = NP • Efficiently Computable Likelihood Functions • Viterbi probability • Probability of Viterbi Haplotypes • Total Trio Probability

Comparison with FAMHAP (Children)

Comparison with FAMHAP (Parents)

Genome-Wide Association Studies • Powerful method for finding genes associated with complex human diseases • Large number of markers (SNPs) typed in cases and controls • Disease-causal SNPs unlikely to be typed directly • Significant statistical power gained by performing imputation of untyped HapMap genotypes [WTCCC’07]

HMM Based Genotype Imputation • Train HMM using the haplotypes from a related HapMap population or a small cohort typed at high density • Probability of missing genotypes given the typed genotype data: [formula not preserved in transcript] • g_i is imputed as: [formula not preserved in transcript]

Experimental Results • Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k

Experimental Results • Accuracy and missing data rate for imputed genotypes at different thresholds

Ultra-High Throughput Sequencing • New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing • Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400 bp reads • Illumina / Solexa Genetic Analyzer 1G: 1000 Mb/run, 35 bp reads • Applied Biosystems SOLiD: 3000 Mb/run, 25-35 bp reads

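Probabilistic Model • [Figure: two founder chains F_1 … F_n and F'_1 … F'_n emitting haplotype alleles H_i and H'_i; each pair (H_i, H'_i) determines the genotype G_i, and each G_i emits the reads R_{i,1} … R_{i,c} covering locus i, for loci 1, 2, …, n]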

Model Training • Initial founder probabilities P(f_1), P(f'_1), transition probabilities P(f_{i+1} | f_i), P(f'_{i+1} | f'_i), and emission probabilities P(h_i | f_i), P(h'_i | f'_i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father • P(g_i | h_i, h'_i) set to 1 if h_i + h'_i = g_i and to 0 otherwise • Conditional probabilities for sets of reads are given by a formula not preserved in this transcript; its error term is the probability that read r has an error at locus i
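Because the read-probability formula is missing from the transcript, the following is only a sketch of one common error model, not a reconstruction of the slide. It assumes that reads are independent given the genotype, that each read samples either chromosome with probability 1/2, that an error flips the observed allele, and that the per-read error probability comes from a Phred base quality score Q via ε = 10^(-Q/10). All function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_set_likelihood(reads, genotype):
    """P(reads at one locus | genotype) under the assumed error model.

    reads:    list of (observed_allele, phred_quality), allele in {0, 1}
    genotype: 0 = homozygous major, 1 = homozygous minor, 2 = heterozygous
    """
    p = 1.0
    for allele, q in reads:
        eps = phred_to_error(q)
        if genotype == 2:
            # Either chromosome is sampled with probability 1/2, so
            # P(allele) = 0.5 * (1 - eps) + 0.5 * eps = 0.5.
            p *= 0.5
        else:
            # Homozygous: the read matches unless it carries an error.
            p *= (1 - eps) if allele == genotype else eps
    return p


# Three hypothetical reads covering one SNP: two show allele 0, one allele 1.
reads = [(0, 20), (0, 30), (1, 20)]
likelihoods = {g: read_set_likelihood(reads, g) for g in (0, 1, 2)}
best = max(likelihoods, key=likelihoods.get)
print(likelihoods, best)
```

With only a handful of reads per locus these single-locus likelihoods are weak evidence, which is why the slides combine them with the haplotype HMM rather than calling each SNP independently.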

Multilocus Genotyping Problem • GIVEN: • Shotgun read sets r = (r_1, r_2, …, r_n) • Base quality scores • HMMs for populations of origin for mother/father • FIND: • Multilocus genotype g* = (g*_1, g*_2, …, g*_n) with maximum posterior probability, i.e., g* = argmax_g P(g | r)

Posterior Decoding Algorithm • For each i = 1..n, compute g*_i = argmax_{g_i} P(g_i | r) = argmax_{g_i} P(g_i, r) • Return g* = (g*_1, …, g*_n) • Joint probabilities P(g_i, r) can be computed using a forward-backward algorithm • Direct implementation gives O(m + nK^4) time, where • m = number of reads • n = number of SNPs • K = number of founder haplotypes in HMMs • Runtime reduced to O(m + nK^3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
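The idea of posterior decoding can be sketched with a generic forward-backward pass. This is a simplification: it uses a single hidden chain with one shared transition matrix, not the pair of founder chains from the slides, so it does not reproduce the O(m + nK^4) or O(m + nK^3) algorithms. The function name and all numbers are illustrative assumptions.

```python
import numpy as np


def posterior_decode(init, trans, lik):
    """Per-locus maximum-posterior states for a single-chain HMM.

    init:  (S,)   initial state probabilities
    trans: (S, S) transition matrix, assumed identical at every locus
    lik:   (n, S) lik[i, s] = P(observations at locus i | state s)
    Returns (argmax state per locus, posterior matrix of shape (n, S)).
    """
    n, S = lik.shape
    fwd = np.zeros((n, S))
    bwd = np.ones((n, S))
    fwd[0] = init * lik[0]
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ trans) * lik[i]
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (lik[i + 1] * bwd[i + 1])
    # joint[i, s] = P(state at locus i = s, all observations)
    joint = fwd * bwd
    post = joint / joint.sum(axis=1, keepdims=True)
    return post.argmax(axis=1), post


# Toy example: 3 states, 3 loci, made-up numbers.
init = np.full(3, 1 / 3)
trans = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
lik = np.array([[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.05, 0.9, 0.05]])
calls, post = posterior_decode(init, trans, lik)
print(calls)
print(post.round(3))
```

Each row of `post` is a posterior distribution over states at one locus, so a confidence threshold on its maximum can be used to leave low-confidence calls uncalled.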

Genotyping Accuracy on Watson Reads

Conclusions • HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology • Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations • Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice • Highly scalable runtime (linear in #SNPs and #individuals/reads) • Software available at http://www.engr.uconn.edu/~ion/SOFT/

Acknowledgements • Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc • NSF funding (awards IIS-0546457 and DBI-0543365)
