Bayesian Multi-Population Haplotype Inference via a Hierarchical Dirichlet Process Mixture
Paper by E. Xing, K. Sohn, M. Jordan and Y. Teh, ICML 2006
Duke University Machine Learning Group
Presented by Kai Ni, August 24, 2006
Outline • Background • Dirichlet Process mixture • Hierarchical Dirichlet Process mixture • Application to haplotype inference
Motivation • Problem – uncovering the haplotypes of single nucleotide polymorphisms (SNPs) within and between populations. • Methods – coalescence, finite and infinite mixtures, and maximal parsimony. • Applications • Biological and medical analysis; • Genetic demography studies.
Background • A SNP haplotype is a list of alleles at contiguous sites in a local region of a single chromosome. A haplotype is inherited as a unit. • For diploid organisms, two haplotypes go together to make up a genotype, which is a list of unordered pairs of alleles in a region. • Haplotype inference from genotype data can be formulated as a mixture model. An HDP mixture is used in this paper.
Dirichlet Processes • A single clustering problem can be analyzed with a Dirichlet process (DP).
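As background, the standard stick-breaking representation of a DP draw (this is the usual definition from the literature, not taken verbatim from the slide):

```latex
G \sim \mathrm{DP}(\alpha_0, G_0)
\;\Longleftrightarrow\;
G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k},
\qquad
\theta_k \overset{iid}{\sim} G_0,
\qquad
\pi_k = \beta_k \prod_{l<k} (1-\beta_l),\;\;
\beta_k \overset{iid}{\sim} \mathrm{Beta}(1, \alpha_0).
```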
DP mixture model • G can be viewed as a mixture model with infinitely many components.
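As a sketch of the generative process behind this statement (standard notation, not copied from the slide), each observation draws a parameter from G and is then generated from that parameter:

```latex
G \sim \mathrm{DP}(\alpha_0, G_0), \qquad
\phi_i \mid G \sim G, \qquad
x_i \mid \phi_i \sim F(\phi_i), \qquad i = 1, \dots, N.
```

Because G is discrete with probability one, the φi take repeated values, which induces a clustering of the data into a potentially unbounded number of components.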
DP-Haplotyper • The genotype of individual i from ethnic group j is a list of T contiguous SNPs. • Each genotype resolves into a corresponding pair of paternal and maternal haplotypes. • Each haplotype H is assumed to be a random perturbation of an ancestral haplotype A, or founder. • DP-Haplotyper is a DP mixture model for a single population group.
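A notation consistent with these bullets (the superscripts and labels here are my choice, not necessarily the authors' exact symbols) would be:

```latex
G_i^{j} = [g_{i,1}^{j}, \dots, g_{i,T}^{j}],
\qquad
H_i^{j,(0)} = [h_{i,1}^{j,(0)}, \dots, h_{i,T}^{j,(0)}],
\quad
H_i^{j,(1)} = [h_{i,1}^{j,(1)}, \dots, h_{i,T}^{j,(1)}],
```

where G_i^j is the genotype of individual i in group j, and H_i^{j,(0)}, H_i^{j,(1)} are the paternal and maternal haplotypes that together determine it.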
Hierarchical Dirichlet Process • Each group is modeled as a DP Gj, and the group-specific DPs are linked via a global DP G0. • G0 defines the set of mixture components used by all the groups. Different groups share the same set of mixture components (underlying clusters), but with different mixture proportions.
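In symbols, this is the standard two-level HDP specification (from Teh et al., on which the paper builds):

```latex
G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H),
\qquad
G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0), \quad j = 1, \dots, J.
```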
HDP mixture model • HDP can be used as the prior distribution over the factors for nested group data. • Consider a two-level DP: G0 links the child DPs Gj and forces them to share components. The Gj are conditionally independent given G0.
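Completing the generative sketch for grouped data (standard HDP mixture notation, not taken from the slide):

```latex
\phi_{ji} \mid G_j \sim G_j,
\qquad
x_{ji} \mid \phi_{ji} \sim F(\phi_{ji}),
```

for observation i in group j. Since G0 is discrete, every Gj reuses the atoms of G0, which is exactly the sharing of mixture components described above.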
HDP – Chinese Restaurant Franchise • First level: within each group, a DP mixture • φj1, …, φj(i-1) are i.i.d. random variables distributed according to Gj; ψj1, …, ψjTj are the distinct values taken on by φj1, …, φj(i-1); njt is the number of φji′ = ψjt, 0 < i′ < i. • Second level: across groups, sharing clusters • The base measure of each group is a draw from a DP: θ1, …, θK are the distinct values taken on by the ψjt; mk is the number of ψjt = θk, over all j, t.
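The corresponding predictive (Chinese restaurant franchise) distributions, written with the counts defined above, are the standard ones:

```latex
\phi_{ji} \mid \phi_{j1}, \dots, \phi_{j,i-1},\, \alpha_0, G_0
\;\sim\;
\sum_{t=1}^{T_j} \frac{n_{jt}}{i - 1 + \alpha_0}\, \delta_{\psi_{jt}}
\;+\;
\frac{\alpha_0}{i - 1 + \alpha_0}\, G_0,
```

```latex
\psi_{jt} \mid \{\psi_{j't'}\},\, \gamma, H
\;\sim\;
\sum_{k=1}^{K} \frac{m_k}{\sum_{k'} m_{k'} + \gamma}\, \delta_{\theta_k}
\;+\;
\frac{\gamma}{\sum_{k'} m_{k'} + \gamma}\, H.
```

The first equation assigns each customer (haplotype instance) to a table within its group's restaurant; the second assigns each table a dish (founder) shared across the whole franchise.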
Parameterization of the model • Underlying mixture component Ak := [Ak,1, …, Ak,T] – the founding haplotype configuration. • Base measure over a founder and its mutation rate, where p(A) is a uniform distribution and p(θ) is a beta distribution. • Inheritance model – each haplotype is a noisy copy of its founder. • Genotyping model – the observed genotype is a noisy combination of the two inherited haplotypes.
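A sketch of an inheritance model consistent with "a random perturbation of an ancestral haplotype" (the single-locus mutation parameterization and symbols below are my reading, not necessarily the slide's exact equations):

```latex
p(h_{i,t} \mid a_{k,t}, \theta_k) =
\begin{cases}
1 - \theta_k, & h_{i,t} = a_{k,t},\\[2pt]
\theta_k / (|B| - 1), & h_{i,t} \neq a_{k,t},
\end{cases}
\qquad \theta_k \sim \mathrm{Beta}(\alpha_h, \beta_h),
```

where B is the set of possible alleles (|B| = 2 for SNPs). The genotyping model can analogously score the observed g_{i,t} against the unordered pair {h_{i,t}^{(0)}, h_{i,t}^{(1)}} with a small error probability.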
Gibbs Sampling • Several Gibbs sampling variants are used for posterior inference. • The sampling scheme is similar to a two-level urn model (see the sketch below).
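A minimal, simplified Python sketch of one step in this two-level urn spirit: resampling which founder a haplotype is assigned to by combining within-group counts, across-group counts, and the mutation likelihood. The function name, the binary-allele assumption, and the exact weighting are illustrative, not the authors' full sampler.

```python
import numpy as np

def sample_founder(h, founders, thetas, n_jk, m_k, alpha0, gamma, rng):
    """One simplified Gibbs step: resample the founder (mixture component)
    for haplotype h, weighting each founder by its within-group popularity
    n_jk, its across-group popularity m_k, and the mutation likelihood."""
    K = len(founders)
    log_w = np.empty(K + 1)
    for k in range(K):
        # mutation model: each site differs from the founder w.p. theta_k
        mismatches = np.sum(h != founders[k])
        matches = len(h) - mismatches
        log_lik = matches * np.log(1 - thetas[k]) + mismatches * np.log(thetas[k])
        # two-level urn weight: group-level count plus mass shared via the global DP
        log_w[k] = np.log(n_jk[k] + alpha0 * m_k[k] / (m_k.sum() + gamma)) + log_lik
    # weight for a brand-new founder (likelihood under a uniform prior on binary alleles)
    log_w[K] = np.log(alpha0 * gamma / (m_k.sum() + gamma)) + len(h) * np.log(0.5)
    w = np.exp(log_w - log_w.max())
    return rng.choice(K + 1, p=w / w.sum())

rng = np.random.default_rng(0)
founders = np.array([[0, 1, 0, 1, 1], [1, 1, 0, 0, 1]])
thetas = np.array([0.05, 0.05])
h = np.array([0, 1, 0, 1, 1])
k = sample_founder(h, founders, thetas,
                   n_jk=np.array([3.0, 1.0]), m_k=np.array([4.0, 2.0]),
                   alpha0=1.0, gamma=1.0, rng=rng)
print("sampled founder index:", k)
```

In the full sampler one would also resample the founder haplotypes Ak, the mutation rates θk, and the phase of each genotype into its two haplotypes.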
Simulated data • 100 individuals from 5 groups (20 each). Each group has 2 founders shared by all groups and 3 group-specific founders, for a total of 17 founders (a sketch of this setup follows).
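An illustrative data-generating sketch matching the described setup; the sequence length and mutation rate below are arbitrary choices of mine, not values from the paper.

```python
import numpy as np

# 5 groups x 20 individuals; each group draws haplotypes from 2 founders
# shared by all groups plus 3 group-specific founders (17 founders in total).
rng = np.random.default_rng(0)
T, n_groups, n_per_group, mut = 10, 5, 20, 0.02

shared = rng.integers(0, 2, size=(2, T))             # founders common to every group
private = rng.integers(0, 2, size=(n_groups, 3, T))  # 3 founders unique to each group

genotypes = []
for j in range(n_groups):
    founders = np.vstack([shared, private[j]])        # the 5 founders available to group j
    for _ in range(n_per_group):
        picks = rng.integers(0, len(founders), size=2)
        # each inherited haplotype is a noisy (mutated) copy of its founder
        flips = rng.random((2, T)) < mut
        haps = np.where(flips, 1 - founders[picks], founders[picks])
        genotypes.append(haps[0] + haps[1])           # per-site unordered allele counts
print(len(genotypes), "genotypes of length", T)       # 100 genotypes of length 10
```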
Real data • International HapMap Project, containing genotypes from four populations.
Conclusion • The authors proposed an HDP mixture model for haplotype inference over multiple populations. • The HDP prior couples multiple heterogeneous populations and facilitates sharing mixture components across multiple infinite mixture models. • In future work, longer SNP sequences will be considered, and the HDP can be generalized to problems in which the group labels are unknown and must be inferred.