Application of Probabilistic ILP II, FP6-508861 www.aprill.org. Constrained Hidden Markov Models for Population-based Haplotyping. Niels Landwehr Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila University of Freiburg / University of Helsinki.
Outline • Population-based haplotype reconstruction • Infer haplotypes from genotypes: reconstruct hidden phase of genetic data • Important problem in biology/medicine: e.g. disease association studies • An approach using constrained HMMs • Sparse Markov chains to represent conserved haplotype fragments • HMM model that can be learned directly from genotype data • Experimental results
Human Genome and SNPs • [Figure: aligned DNA sequences of Individuals 1 to 6, with three SNP (marker) positions highlighted] • Individual 1: ...GATATTCGTACGGATGTTTCCA... • Individual 2: ...GATGTTCGTACTGATGTCTCCA... • Individual 3: ...GATATTCGTACGGATGTTTCCA... • Individual 4: ...GATATTCGTACGGATGTTTCCA... • Individual 5: ...GATGTTCGTACTGATGTCTCCA... • Individual 6: ...GATGTTCGTACTGATGTCTCCA...
Haplotypes • [Figure: the alleles at the three SNPs of each DNA sequence form its haplotype] • Individuals 1 to 6: AGT, GTC, AGT, AGT, GTC, GTC
Haplotypes • [Figure: the same haplotypes with binary-coded alleles] • Individuals 1 to 6: 101, 010, 101, 101, 010, 010
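The step from aligned sequences to haplotypes can be sketched in a few lines of Python. This is only an illustration using the six example sequences from the slides; the function names are made up here, and the binary coding table is simply read off the slide (AGT is shown as 101, GTC as 010) rather than derived from any rule.

```python
from typing import Dict, List

def snp_positions(seqs: List[str]) -> List[int]:
    """Return the 0-based columns where the aligned sequences differ."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    return [i for i in range(length) if len({s[i] for s in seqs}) > 1]

def haplotypes(seqs: List[str], positions: List[int]) -> List[str]:
    """Keep only the alleles at the SNP positions."""
    return ["".join(s[i] for i in positions) for s in seqs]

def binary_code(hap: str, coding: List[Dict[str, str]]) -> str:
    """Recode each allele with a per-SNP lookup table."""
    return "".join(table[allele] for allele, table in zip(hap, coding))

sequences = [
    "GATATTCGTACGGATGTTTCCA",
    "GATGTTCGTACTGATGTCTCCA",
    "GATATTCGTACGGATGTTTCCA",
    "GATATTCGTACGGATGTTTCCA",
    "GATGTTCGTACTGATGTCTCCA",
    "GATGTTCGTACTGATGTCTCCA",
]
# Coding read off the slide (assumption: it is arbitrary per SNP).
CODING = [{"A": "1", "G": "0"}, {"G": "0", "T": "1"}, {"T": "1", "C": "0"}]

positions = snp_positions(sequences)
haps = haplotypes(sequences, positions)
coded = [binary_code(h, CODING) for h in haps]
print(positions, haps, coded)
```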
Why Haplotypes? • Haplotypes • define our genetic individuality • contribute to risk factors of complex diseases (e.g., diabetes) • Disease Association Studies (Gene Mapping): • find genetic differences between a case and a control population • Identifying SNPs responsible for disease might help find a cure • Also useful for • Linkage disequilibrium studies: Summarize genetic variation • Understanding evolution of human populations
The problem: Haplotypes not directly observable • Wet lab: only genotype information (two alleles for each SNP, but chromosome origin is unknown) • [Figure: a paternal and a maternal chromosome with five SNPs] • Paternal: 1 1 0 0 1 • Maternal: 0 0 0 1 1 • Observed genotype: {0,1} {0,1} {0} {0,1} {1}
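To make the loss of phase information concrete, the following sketch computes the genotype of the haplotype pair on this slide and then enumerates every haplotype pair that would produce the same genotype. The function names are illustrative. A genotype with h heterozygous SNPs has 2^(h-1) compatible unordered pairs, so the example with three heterozygous SNPs has four.

```python
from itertools import product

def genotype(h1, h2):
    """Unordered allele pair at each SNP: the phase is lost."""
    return [frozenset((a, b)) for a, b in zip(h1, h2)]

def compatible_pairs(g):
    """Enumerate all unordered haplotype pairs that explain genotype g."""
    het = [i for i, s in enumerate(g) if len(s) == 2]
    pairs = []
    # The phase of the first heterozygous SNP is fixed so that
    # (h1, h2) and (h2, h1) are not both listed.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1 = [min(s) for s in g]   # homozygous SNPs are already determined
        h2 = list(h1)
        for i, phase in zip(het, (0,) + bits):
            h1[i], h2[i] = phase, 1 - phase
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

paternal = (1, 1, 0, 0, 1)
maternal = (0, 0, 0, 1, 1)
g = genotype(paternal, maternal)
pairs = compatible_pairs(g)
for pair in pairs:
    print(pair)
```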
Population-based Haplotype Reconstruction • Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair • Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium) • [Figure: for Individuals 1, 2, 3, …, a haplotype pair shown next to its genotype over eight SNPs, e.g. genotype {0,1} {0,1} {0,1} {1} {0,1} {1} {0} {0,1}]
Haplotype Reconstruction Problem (CS Perspective) Input: A set G of genotypes Output: A set H of corresponding haplotype pairs such that
Population-based Haplotype Reconstruction • Given a model M for the distribution of haplotypes, the most likely resolution of a genotype can be inferred, assuming Hardy-Weinberg equilibrium (the two haplotypes of an individual are drawn independently) • Need to estimate this model from available genotype data
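As a rough illustration of this inference step: under Hardy-Weinberg equilibrium the probability of a pair is the product of the two haplotype probabilities, so the most likely resolution is the compatible pair that maximizes that product. The sketch below assumes the model M is just a frequency table; `FREQ` and all function names are hypothetical, and the numbers are made up.

```python
from itertools import product

def resolutions(g):
    """All unordered haplotype pairs compatible with genotype g (list of sets)."""
    het = [i for i, s in enumerate(g) if len(s) == 2]
    out = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1 = [min(s) for s in g]
        h2 = list(h1)
        for i, phase in zip(het, (0,) + bits):
            h1[i], h2[i] = phase, 1 - phase
        out.append((tuple(h1), tuple(h2)))
    return out

def most_likely_resolution(g, hap_prob):
    """argmax over compatible pairs of P(h1) * P(h2).

    The factor 2 for unordered pairs with h1 != h2 is the same for every
    candidate of a given genotype, so it does not change the argmax.
    """
    return max(resolutions(g), key=lambda pair: hap_prob(pair[0]) * hap_prob(pair[1]))

# Hypothetical haplotype frequencies (toy numbers, not from the talk).
FREQ = {(1, 0, 1): 0.5, (0, 1, 0): 0.3, (1, 1, 1): 0.1, (0, 0, 0): 0.1}

g = [{0, 1}, {0, 1}, {0, 1}]
best = most_likely_resolution(g, lambda h: FREQ.get(h, 0.0))
print(best)
```

Enumerating all resolutions is exponential in the number of heterozygous SNPs, which is one reason a model with efficient inference is attractive.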
Prior Work on Haplotype Reconstruction • Competitive application domain for several years: many systems developed • characterized by the statistical model and learning/reconstruction algorithms employed • Special-purpose statistical models • Approximate Coalescent (PHASE 2001,2003,2005) • Block-based (Gerbil 2004,2005) • Variable-length MC (HaploRec 2004,2006) • Founder-based (HIT 2005) • Local clusters (fastPHASE 2006)
Prior Work on Haplotype Reconstruction • Special-purpose learning/reconstruction algorithms • MCMC variant • Approximate EM + partition ligation • … • Our approach: • Model haplotypes using (sparse) Markov chains • Natural extension to a Hidden Markov Model on genotypes • Directly learnable from genotype data (standard Baum-Welch)
Constrained HMMs for haplotyping • Modeling haplotypes • Standard Markov chain • More general: order k Markov chain • [Figure: path for haplotype 0,1,1,0]
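A minimal sketch of the first-order case, assuming one transition matrix per pair of adjacent markers: the probability of a haplotype is the probability of its path through the chain. The parameter values in `INIT` and `TRANS` are invented for illustration only.

```python
from itertools import product

def haplotype_probability(h, init, trans):
    """P(h) under a first-order Markov chain with marker-specific transitions.

    init[a]        : probability of allele a at the first marker
    trans[t][a][b] : probability of allele b at marker t+1 given a at marker t
    """
    p = init[h[0]]
    for t in range(1, len(h)):
        p *= trans[t - 1][h[t - 1]][h[t]]
    return p

# Hypothetical parameters for four markers (toy numbers).
INIT = [0.6, 0.4]
TRANS = [
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.6, 0.4], [0.75, 0.25]],
]

p_path = haplotype_probability((0, 1, 1, 0), INIT, TRANS)
total = sum(haplotype_probability(h, INIT, TRANS) for h in product((0, 1), repeat=4))
print(p_path, total)
```

An order-k chain would condition each allele on the previous k alleles instead of one, at the cost of up to 2^k states per marker.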
Constrained HMMs for haplotyping • Modeling genotypes • Hidden phase (order of pair): Hidden Markov Model • States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence) • Output symbol: unordered pair • Path in the model: sample two haplotypes, output corresponding genotype • Have to enforce Hardy-Weinberg equilibrium • Parameter tying constraints on transition probabilities • Algorithms • Learning: standard Baum-Welch • Reconstruction of most likely haplotype pair: Viterbi
Constrained HMMs for haplotyping • Example: paths for genotype {0,1},{1},{0,1},{0}
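The reconstruction step can be sketched as a small Viterbi search, here for a first-order chain only. States are ordered allele pairs, the emission is deterministic (a state is allowed only if its unordered pair equals the observed genotype), and the pair transition is the product of two copies of the same haplotype transition, which is one way to express the parameter tying. The parameters are invented and the function name is illustrative.

```python
from itertools import product

PAIRS = list(product((0, 1), repeat=2))

def viterbi_phase(g, init, trans):
    """Most likely ordered haplotype pair for genotype g (list of sets)."""
    def states(s):
        return [q for q in PAIRS if set(q) == set(s)]

    # delta maps a state to (probability of the best path ending there, path)
    delta = {s: (init[s[0]] * init[s[1]], [s]) for s in states(g[0])}
    for t in range(1, len(g)):
        step = trans[t - 1]
        new = {}
        for s in states(g[t]):
            p, path = max(
                ((p * step[q[0]][s[0]] * step[q[1]][s[1]], path)
                 for q, (p, path) in delta.items()),
                key=lambda x: x[0],
            )
            new[s] = (p, path + [s])
        delta = new
    prob, path = max(delta.values(), key=lambda x: x[0])
    return tuple(s[0] for s in path), tuple(s[1] for s in path), prob

# Hypothetical parameters (toy numbers).
INIT = [0.6, 0.4]
TRANS = [
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.6, 0.4], [0.75, 0.25]],
]

h1, h2, prob = viterbi_phase([{0, 1}, {0, 1}, {1}, {0}], INIT, TRANS)
s1, s2, slide_prob = viterbi_phase([{0, 1}, {1}, {0, 1}, {0}], INIT, TRANS)
print(h1, h2, prob)
print(s1, s2, slide_prob)
```

For the slide's genotype {0,1},{1},{0,1},{0} the two possible phasings receive the same probability under any first-order chain, because the homozygous marker between the two heterozygous ones makes the product factorize identically. Longer histories are needed to separate such cases.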
Sparse Markov Modeling (SpaMM) • Higher-order models (long history) needed: exponential size of model • However, out of the possible history blocks, only few occur in data (conserved fragments) • Idea: Sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style) • Algorithm: initialize-first-order-model(); em-training(); repeat: regularize-and-extend(); em-training(); until …
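The level-wise idea can be illustrated on a much simpler problem. The sketch below is not SpaMM's procedure: it assumes fully phased haplotypes are available and replaces EM training and regularization with plain counting and a `min_count` threshold. It only shows why extending frequent fragments by one marker at a time keeps the candidate set small. All names and data are hypothetical.

```python
from collections import Counter

def frequent_fragments(haps, max_order, min_count):
    """Level-wise (Apriori-style) search for frequent haplotype fragments.

    A fragment is (start position, alleles). A fragment of length k+1 is
    counted only if its length-k prefix was frequent at the previous level.
    """
    counts = Counter((i, (h[i],)) for h in haps for i in range(len(h)))
    frequent = {f for f, c in counts.items() if c >= min_count}
    levels = [frequent]
    for k in range(1, max_order):
        counts = Counter()
        for h in haps:
            for i in range(len(h) - k):
                if (i, tuple(h[i:i + k])) in frequent:
                    counts[(i, tuple(h[i:i + k + 1]))] += 1
        frequent = {f for f, c in counts.items() if c >= min_count}
        if not frequent:
            break
        levels.append(frequent)
    return levels

# Toy phased haplotypes: two conserved patterns plus one rare variant.
haps = [(1, 0, 1, 1)] * 3 + [(0, 1, 0, 0)] * 3 + [(1, 1, 0, 1)]
levels = frequent_fragments(haps, max_order=3, min_count=2)
print([len(level) for level in levels])
```

In this toy data only 4 of the 16 possible length-3 fragments survive, which mirrors the observation that few history blocks actually occur.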
SpaMM Model (order 1) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size • Initial model: standard Markov chain of order 1
SpaMM Model (order 2) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size
SpaMM Model (order 3) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size
SpaMM Model (order 4) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size
SpaMM Model (order 5) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size
SpaMM Model (order 6) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size
SpaMM Model (final) • Final model: Model structure encodes conserved fragments • Concise representation of all haplotypes with non-zero probability
Experimental Evaluation • Real-world population data • Correct haplotypes have been inferred from trios • Daly dataset: 103 SNP markers for 174 individuals • Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals • Problem Setting: • Given the set of genotypes, algorithm outputs most likely haplotype pairs • Difference from the true haplotype pairs is measured in switch distance (number of recombinations needed to transform the reconstructed pair into the true pair, normalized)
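One common way to compute a switch distance is sketched below. The slide only says the count is normalized; dividing by the number of heterozygous markers minus one is an assumption made here, and the function name is illustrative.

```python
def switch_distance(true_pair, pred_pair):
    """Count phase switches between a predicted and a true haplotype pair.

    Returns (switches, normalized). Normalizing by (#heterozygous - 1) is an
    assumption of this sketch.
    """
    t1, t2 = true_pair
    p1, p2 = pred_pair
    n = len(t1)
    # Both pairs must explain the same genotype.
    assert all({t1[i], t2[i]} == {p1[i], p2[i]} for i in range(n))
    het = [i for i in range(n) if t1[i] != t2[i]]
    # For each heterozygous marker: does the first predicted haplotype
    # follow the first true haplotype?
    same = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(same, same[1:]))
    normalized = switches / (len(het) - 1) if len(het) > 1 else 0.0
    return switches, normalized

true_pair = ((0, 0, 0, 1, 1), (1, 1, 0, 0, 1))
pred_pair = ((0, 1, 0, 0, 1), (1, 0, 0, 1, 1))
result = switch_distance(true_pair, pred_pair)
print(result)
```

Swapping the order of the two predicted haplotypes leaves the count unchanged, since only changes of phase between consecutive heterozygous markers are counted.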
Results: Haplotype Reconstruction • Many well-engineered systems • Smart priors, averaging over several random restarts of EM, ... • SpaMM: proof-of-concept implementation, not tuned
Results: Haplotype Reconstruction • PHASE most accurate, then fastPHASE, then SpaMM • however, PHASE too slow for long maps • SpaMM beats fastPHASE without averaging • overall, competitive accuracy
Results: Runtime • Runtime in seconds for phasing 100 markers (log. scale) • SpaMM scales linearly in #markers • like fastPHASE, HaploRec, HIT • unlike PHASE, Gerbil
Results: Genotype imputation • Most haplotyping methods can also predict missing genotype values • for SpaMM, can be read off Viterbi path
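Reading a missing genotype off the Viterbi path can be sketched as follows, again for a first-order chain with invented parameters. The only change compared with plain phasing is that a missing genotype (here `None`) allows every state at that marker. This is an illustration under those assumptions, not the SpaMM implementation.

```python
from itertools import product

PAIRS = list(product((0, 1), repeat=2))

def viterbi_impute(g, init, trans):
    """Fill in missing genotypes (None) from the most likely state path."""
    def states(s):
        # A missing observation does not constrain the state.
        return PAIRS if s is None else [q for q in PAIRS if set(q) == set(s)]

    delta = {s: (init[s[0]] * init[s[1]], [s]) for s in states(g[0])}
    for t in range(1, len(g)):
        step = trans[t - 1]
        new = {}
        for s in states(g[t]):
            p, path = max(
                ((p * step[q[0]][s[0]] * step[q[1]][s[1]], path)
                 for q, (p, path) in delta.items()),
                key=lambda x: x[0],
            )
            new[s] = (p, path + [s])
        delta = new
    _, path = max(delta.values(), key=lambda x: x[0])
    return [set(s) for s in path]

# Hypothetical parameters for three markers (toy numbers).
INIT = [0.6, 0.4]
TRANS = [
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
]

completed = viterbi_impute([{1}, None, {0}], INIT, TRANS)
print(completed)
```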
Results: Genotype imputation • fastPHASE best known method • Again, SpaMM beats fastPHASE without averaging
Conclusions • SpaMM: new haplotyping method • sparse Markov chains to encode conserved haplotype fragments • Constrained HMM for modeling genotypes • Apriori-style structure learning algorithm • Simple, accurate, interpretable output • Future work • Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)