Bayesian Generalized Product Partition Model
By David Dunson and Ju-Hyun Park
Presentation by Eric Wang, 2/15/08
Outline
• Introduce Product Partition Models (PPM).
• Relate PPM to DP via the Blackwell-MacQueen Polya urn scheme.
• Introduce predictor dependence into PPM to form the Generalized PPM (GPPM).
• Discussion and results.
• Conclusion.
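Product Partition Model
• A PPM is formally defined as
  $$\pi(S^*) = c_0 \prod_{h=1}^{k} c(S_h), \qquad (1)$$
  where $S^* = (S_1, \dots, S_k)$ is a partition of $\{1, \dots, n\}$, $c(S_h) \ge 0$ is the cohesion of subset $S_h$, and $c_0$ is a normalizing constant.
• Let $y_h$ denote the data for subjects in cluster h, h = 1,…,k.
• The prior probability of a partition is therefore proportional to the product of the cohesions of its subsets.
• The posterior on $S^*$ after seeing the data $y$ is also a PPM, with each cohesion updated by the marginal likelihood $f_h(y_h)$ of its cluster:
  $$\pi(S^* \mid y) \propto \prod_{h=1}^{k} c(S_h)\, f_h(y_h).$$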
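For a small number of subjects, the PPM prior can be checked by brute force. The sketch below enumerates every partition of {1,…,n}, multiplies the cohesions of its blocks, and normalizes to obtain the constant c_0. The function names (`set_partitions`, `ppm_prior`) and the example cohesion are illustrative assumptions, not something taken from the presentation.

```python
from math import factorial, isclose


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for idx in range(len(partition)):
            yield partition[:idx] + [[first] + partition[idx]] + partition[idx + 1:]
        # ... or open a new block for it.
        yield [[first]] + partition


def ppm_prior(n, cohesion):
    """Return {partition: probability} for a PPM on {1..n} with the given cohesion."""
    weights = {}
    for partition in set_partitions(list(range(1, n + 1))):
        key = tuple(sorted(tuple(sorted(block)) for block in partition))
        w = 1.0
        for block in partition:
            w *= cohesion(block)  # product of cohesions over the blocks
        weights[key] = w
    c0 = 1.0 / sum(weights.values())  # normalizing constant
    return {key: c0 * w for key, w in weights.items()}


def example_cohesion(block, alpha=1.0):
    """Illustrative cohesion (an assumption): alpha * (block size - 1)!"""
    return alpha * factorial(len(block) - 1)


prior = ppm_prior(4, example_cohesion)
print(len(prior), "partitions; total probability", sum(prior.values()))
```

Enumeration grows with the Bell numbers, so this is only practical for very small n; it is meant as a sanity check of the definition rather than an inference method.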
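Product Partition Model
• A PPM can also be induced hierarchically:
  $$y_i \mid S_i = h \sim f(\theta^*_h), \qquad S_i \sim \sum_{h=1}^{k} \pi_h \delta_h, \qquad \theta^*_h \sim G_0,$$
  where $S_i = h$ if $i \in S_h$, and $\pi = (\pi_1, \dots, \pi_k)$ are the cluster weights.
• Taking $k \to \infty$ induces a nonparametric PPM.
• A prior on the weights $\pi$ imposes a particular form on the cohesion; a convenient choice corresponds to the Dirichlet process (DP).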
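Relating DP and PPM
• In the DP, $\theta_i \sim G$ and $G \sim DP(\alpha G_0)$, with concentration parameter $\alpha$ and base measure $G_0$.
• G appears explicitly in the stick-breaking representation. Marginalizing it out yields the Blackwell-MacQueen (1973) formulation:
  $$\theta_i \mid \theta_1, \dots, \theta_{i-1} \sim \frac{\alpha}{\alpha + i - 1}\, G_0 + \frac{1}{\alpha + i - 1} \sum_{j=1}^{i-1} \delta_{\theta_j},$$
  where $\theta_i$ is the parameter value taken by the ith data point.
• The joint distribution of a particular set $\theta_1, \dots, \theta_n$ is therefore the product of these sequential conditionals:
  $$\pi(\theta_1, \dots, \theta_n) = \prod_{i=1}^{n} \pi(\theta_i \mid \theta_1, \dots, \theta_{i-1}).$$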
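The urn scheme is easy to simulate when only the cluster labels are needed. The following minimal sketch draws labels sequentially: subject i opens a new cluster with probability α/(α + i − 1) and otherwise joins an existing cluster with probability proportional to its current size. The function name `polya_urn_labels` is a hypothetical choice, and the unique values themselves (draws from G_0) are omitted.

```python
import random


def polya_urn_labels(n, alpha, rng):
    """Draw cluster labels for n subjects from the Blackwell-MacQueen urn."""
    labels = []
    counts = []  # counts[h] = number of subjects currently in cluster h
    for i in range(n):  # i subjects have already been seated
        # Total mass is alpha (new cluster) plus i (existing clusters).
        u = rng.random() * (alpha + i)
        if u < alpha:
            counts.append(1)
            labels.append(len(counts) - 1)
            continue
        u -= alpha
        for h, c in enumerate(counts):
            if u < c:
                counts[h] += 1
                labels.append(h)
                break
            u -= c
        else:
            # Floating-point guard: fall back to the last cluster.
            counts[-1] += 1
            labels.append(len(counts) - 1)
    return labels, counts


labels, counts = polya_urn_labels(20, alpha=1.0, rng=random.Random(0))
print(labels)
print(counts)
```

Larger values of α make new clusters more likely, so the number of occupied clusters tends to grow with α.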
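Relating DP and PPM
• It can be shown directly that the Blackwell-MacQueen formulation leads to
  $$\pi(\theta_1, \dots, \theta_n) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{h=1}^{k} \alpha\,(n_h - 1)!\; G_0(\theta^*_h), \qquad (2)$$
  where $n_h$ is the number of data points taking unique value $\theta^*_h$ and $G_0$ is read as a density.
• $\theta^*_h$ is the unique value shared by the subjects in cluster h, with the clusters ordered by the ids of their subjects.
• Also, $c_0 = \Gamma(\alpha)/\Gamma(\alpha + n)$ is a normalizing constant and the cohesion is $c(S_h) = \alpha\,(n_h - 1)!$. Then:
  $$\pi(S^*) = c_0 \prod_{h=1}^{k} c(S_h). \qquad (3)$$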
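The step from the sequential urn to the product form can be checked numerically. The sketch below computes the log-probability of a label sequence in two ways: by multiplying the urn conditionals one subject at a time, and from the closed form with c_0 = Γ(α)/Γ(α + n) and cohesion α(n_h − 1)!. The function names are hypothetical. The two values should agree, and reordering the subjects should not change them.

```python
from math import lgamma, log, isclose


def urn_log_prob(labels, alpha):
    """Log-probability of a label sequence under sequential urn draws."""
    counts = {}
    total = 0.0
    for i, h in enumerate(labels):
        seen = counts.get(h, 0)
        # New cluster has mass alpha; an existing one has mass equal to its size.
        numer = seen if seen > 0 else alpha
        total += log(numer) - log(alpha + i)
        counts[h] = seen + 1
    return total


def ppm_log_prob(labels, alpha):
    """Closed form: log c0 + sum over clusters of log(alpha * (n_h - 1)!)."""
    counts = {}
    for h in labels:
        counts[h] = counts.get(h, 0) + 1
    n = len(labels)
    log_c0 = lgamma(alpha) - lgamma(alpha + n)
    # lgamma(n_h) equals log((n_h - 1)!)
    return log_c0 + sum(log(alpha) + lgamma(n_h) for n_h in counts.values())


original = [0, 0, 1, 0, 2, 1]
reordered = [2, 1, 0, 0, 1, 0]  # same cluster sizes, subjects in another order
for seq in (original, reordered):
    print(urn_log_prob(seq, 0.7), ppm_log_prob(seq, 0.7))
```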
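Relating DP and PPM
• From slide 3, writing the prior and likelihood together:
  $$\pi(S^* \mid y) \propto \prod_{h=1}^{k} c(S_h)\, f_h(y_h).$$
• Starting from the DP, G can be marginalized out to obtain the same product form as (1).
• Specifically, integrate over all possible unique values that can be taken by $\theta^*_h$ for subset h:
  $$f_h(y_h) = \int \prod_{i \in S_h} f(y_i \mid \theta^*_h)\, dG_0(\theta^*_h). \qquad (4)$$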
Relating DP and PPM
• Therefore, the DP is a special case of the PPM with cohesion $c(S_h) = \alpha\,(n_h - 1)!$ and normalizing constant $c_0 = \Gamma(\alpha)/\Gamma(\alpha + n)$.
• However, (2) follows the DP premise that the data are exchangeable and does not incorporate dependence on predictors.
• Next, PPMs are generalized so that predictor dependence is incorporated.
Generalized PPM
• The goal of the paper is to reformulate (1) so that the cohesion depends on the subjects' predictors:
  $$\pi(S^* \mid X) = c_0 \prod_{h=1}^{k} c(S_h, x_h),$$
  where $x_h$ denotes the predictors of the subjects in cluster h.
• This can be done following a process very similar to the non-predictor case above.
• Once again, the connection between the DP and the PPM is used; the resulting model is henceforth referred to as the GPPM.
• The formulation is interesting because the predictors are treated as random variables rather than as known fixed values (as in the kernel stick-breaking process, KSBP).
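GPPM
• Consider the following hierarchical model:
  $$y_i \mid \theta_i \sim f_1(\theta_i), \qquad x_i \mid \gamma_i \sim f_2(\gamma_i), \qquad (\theta_i, \gamma_i) \sim G, \qquad G \sim DP(\alpha G_0),$$
  where $G_0 = G_{0\theta} \otimes G_{0\gamma}$ constitutes a base measure on $\theta$ and $\gamma$, the parameters of the data and predictor, respectively.
• This model segments the subjects {1,…,n} into k clusters. As before, $S_i = h$ denotes that subject i belongs to cluster h.
• $\theta^* = (\theta^*_1, \dots, \theta^*_k)$ and $\gamma^* = (\gamma^*_1, \dots, \gamma^*_k)$ denote the unique values of the parameters associated with the subjects and their predictors.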
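GPPM
• The joint distribution of $(\theta_i, \gamma_i)$, i = 1,…,n, can be developed in a similar manner to (2):
  $$\pi(\theta, \gamma) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{h=1}^{k} \alpha\,(n_h - 1)!\; G_{0\theta}(\theta^*_h)\, G_{0\gamma}(\gamma^*_h). \qquad (5)$$
• The conditional distribution of $\theta$ given the predictors X, obtained by including the predictor likelihood and integrating out $\gamma^*$, is
  $$\pi(\theta \mid X) \propto \prod_{h=1}^{k} \alpha\,(n_h - 1)!\; G_{0\theta}(\theta^*_h) \int \prod_{i \in S_h} f_2(x_i \mid \gamma)\, dG_{0\gamma}(\gamma). \qquad (6)$$
• For comparison, (2) is
  $$\pi(\theta_1, \dots, \theta_n) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{h=1}^{k} \alpha\,(n_h - 1)!\; G_0(\theta^*_h). \qquad (2)$$
• The cohesion in (6) is
  $$c(S_h, x_h) = \alpha\,(n_h - 1)! \int \prod_{i \in S_h} f_2(x_i \mid \gamma)\, dG_{0\gamma}(\gamma). \qquad (7)$$
• (7) meets the criterion originally set out: the cohesion depends on the predictors.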
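GPPM
• Some thoughts on the GPPM so far:
• As noted earlier, the posterior distribution of a PPM is still in the class of PPMs, but with updated cohesion.
• Similarly, the posterior of a GPPM also takes the form of a GPPM.
• (2) and (6) are quite similar. The extra factor in (6) is the marginal probability of the predictors in each cluster, $\int \prod_{i \in S_h} f_2(x_i \mid \gamma)\, dG_{0\gamma}(\gamma)$.
• If the predictor density does not depend on the cluster-specific parameter, i.e. $f_2(x_i \mid \gamma_i) = f_2(x_i)$, this factor is the same for every partition and the GPPM reverts to the Blackwell-MacQueen formulation, as seen in Theorem 1.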
Generalized Polya Urn Scheme
• Theorem 1 of the paper shows that the GPPM induces a Blackwell-MacQueen Polya urn scheme, generalized for predictor dependence.
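The theorem statement itself did not survive in these slides. As a hedged sketch, the predictor-dependent urn implied by the cohesion $\alpha\,(n_h - 1)! \int \prod_{i \in S_h} f_2(x_i \mid \gamma)\, dG_{0\gamma}(\gamma)$ can be written as below, where the superscript $(i)$ means "with subject i removed". The notation ($w_{i0}$, $w_{ih}$, $k^{(i)}$, $n_h^{(i)}$, $S_h^{(i)}$) is assumed here and the expression is derived from the ratio of cohesions, so it should not be read as a verbatim copy of the paper's theorem.

```latex
\[
(\theta_i \mid \theta^{(i)}, X) \;\sim\;
  \frac{w_{i0}(x_i)}{w_{i0}(x_i) + \sum_{l=1}^{k^{(i)}} w_{il}(x_i)}\, G_{0\theta}
  \;+\; \sum_{h=1}^{k^{(i)}}
  \frac{w_{ih}(x_i)}{w_{i0}(x_i) + \sum_{l=1}^{k^{(i)}} w_{il}(x_i)}\, \delta_{\theta^*_h},
\]
\[
w_{i0}(x_i) = \alpha \int f_2(x_i \mid \gamma)\, dG_{0\gamma}(\gamma),
\qquad
w_{ih}(x_i) = n_h^{(i)}\,
  \frac{\int f_2(x_i \mid \gamma) \prod_{j \in S_h^{(i)}} f_2(x_j \mid \gamma)\, dG_{0\gamma}(\gamma)}
       {\int \prod_{j \in S_h^{(i)}} f_2(x_j \mid \gamma)\, dG_{0\gamma}(\gamma)}.
\]
```

When $f_2$ does not depend on $\gamma$, both integrals reduce to the same constant and the weights collapse to $\alpha$ and $n_h^{(i)}$, which is the ordinary Blackwell-MacQueen urn.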
Generalized Polya Urn Scheme
• By Theorem 1, subject i does one of the following:
• 1) Draws a previously unseen unique value, with probability proportional to the concentration parameter times the marginal likelihood of its predictor under the base measure.
• 2) Draws a previously used unique value, equal to the parameters of cluster h, with probability proportional to the number of subjects that have already chosen that value times the marginal likelihood of its predictor given the predictors already in cluster h.
• Further, since the predictors are treated as random variables, updating the posterior on each cluster's predictor parameters gives the GPPM a flexible, nonparametric way to adapt the distance measure in predictor space.
• In this paper G is always integrated out; however, Dunson alludes to variational techniques that could be developed in a similar fashion, following the fast variational DP proposed by Kurihara et al. (2006).
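The two cases above can be illustrated with a deliberately simplified predictor model. The sketch below assumes a scalar predictor that is normal with known variance and a normal prior on the cluster mean, so both marginal likelihoods are available in closed form. This is a stand-in chosen for brevity, not the Normal-Wishart specification used in the paper, and the function name `urn_weights` and the default hyperparameters are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def urn_weights(x_i, clusters, alpha, m0=0.0, tau0_sq=1.0, sigma_sq=0.05):
    """Return normalized [new-cluster weight, weight for cluster 1, ..., cluster k].

    clusters: list of 1-D arrays with the predictor values already in each
    cluster (subject i excluded). Assumed model: x | gamma ~ N(gamma, sigma_sq),
    gamma ~ N(m0, tau0_sq).
    """
    # Case 1: new value, alpha times the prior marginal of x_i.
    weights = [alpha * normal_pdf(x_i, m0, sigma_sq + tau0_sq)]
    for x_h in clusters:
        n_h = len(x_h)
        # Posterior of the cluster mean given the predictors already in cluster h.
        post_var = 1.0 / (1.0 / tau0_sq + n_h / sigma_sq)
        post_mean = post_var * (m0 / tau0_sq + np.sum(x_h) / sigma_sq)
        # Case 2: cluster size times the posterior predictive density of x_i.
        weights.append(n_h * normal_pdf(x_i, post_mean, sigma_sq + post_var))
    weights = np.asarray(weights)
    return weights / weights.sum()


clusters = [np.array([0.10, 0.15, 0.20]), np.array([0.80, 0.90])]
print(urn_weights(0.12, clusters, alpha=1.0))   # predictor close to the first cluster
print(urn_weights(-2.0, clusters, alpha=1.0))   # predictor far from both clusters
```

A predictor near an existing cluster is pulled strongly toward it, while a predictor far from every cluster mostly opens a new one; this is the local clustering in predictor space that the slide describes.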
Generalized Polya Urn Scheme
• Consider, for example, a Normal-Wishart prior on the predictor parameters: the cluster mean is normal given the precision matrix, and the precision matrix follows a Wishart distribution.
• The prior involves two multiplicative constants on the precision and a Wishart distribution specified by its degrees of freedom and mean.
• Notice that this formulation adds another multiplier to the precision of the predictor distribution. This multiplier is analogous to the kernel width in the KSBP and encourages tight local clustering in predictor space.
• The marginal distributions of the predictors in Theorem 1 take the forms shown on the next slide.
Generalized Polya Urn Scheme
• The marginal distribution of the predictor in the first weight is a non-central multivariate t-distribution, specified by its degrees of freedom, mean, and scale.
• The marginal distribution of the predictor in the second weight has the same functional form but with updated hyperparameters, which depend on the empirical mean of the predictors in cluster h, computed without predictor i.
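Generalized Polya Urn Scheme
• Posterior updating in this model is straightforward using MCMC. The conditional posterior of the parameters is
  $$(\theta_i \mid \theta^{(i)}, y, X) \sim q_{i0}\, G_{i,0} + \sum_{h} q_{ih}\, \delta_{\theta^*_h},$$
  where $G_{i,0}$ is the base prior updated with the data likelihood of $y_i$, $q_{i0}$ is proportional to the new-cluster weight from Theorem 1 times the marginal likelihood of $y_i$ under the base prior, and $q_{ih}$ is proportional to the Theorem 1 weight for cluster h times $f_1(y_i \mid \theta^*_h)$.
• The indicators $S_i$ are updated separately from the cluster parameters $\theta^*_h$. The membership indicators are sampled from their multinomial posterior, with probabilities $q_{i0}, q_{i1}, \dots$
• Next, update the parameters $\theta^*$ conditioned on the indicators $S$ and the number of clusters k:
  $$\pi(\theta^*_h \mid S, k, y) \propto G_{0\theta}(\theta^*_h) \prod_{i \in S_h} f_1(y_i \mid \theta^*_h).$$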
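The two MCMC steps can be sketched as follows. The first function samples one membership indicator from a multinomial whose probabilities combine the urn weights with the data likelihood, working in log space for numerical stability. The second draws a cluster parameter from its conditional posterior under an assumed conjugate model (normal response with known variance and a normal prior on the cluster mean). Both function names, the toy inputs, and the conjugate model are assumptions made for illustration; the paper's response model is not reproduced here.

```python
import numpy as np


def sample_indicator(log_w, log_lik, rng):
    """Sample an indicator with probabilities proportional to exp(log_w + log_lik).

    Index 0 stands for "open a new cluster"; indices 1..k are existing clusters.
    """
    log_p = np.asarray(log_w, dtype=float) + np.asarray(log_lik, dtype=float)
    log_p -= log_p.max()  # stabilize before exponentiating
    p = np.exp(log_p)
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p


def update_cluster_mean(y_h, rng, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Draw a cluster mean from its conjugate normal posterior given responses y_h."""
    n_h = len(y_h)
    post_var = 1.0 / (1.0 / prior_var + n_h / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(y_h) / noise_var)
    return rng.normal(post_mean, np.sqrt(post_var))


rng = np.random.default_rng(0)
# Toy inputs (assumed): urn weights and log-likelihoods for [new, cluster 1, cluster 2].
choice, probs = sample_indicator(np.log([0.2, 0.7, 0.1]), [-3.0, -0.5, -4.0], rng)
theta_h = update_cluster_mean(np.array([1.1, 0.9, 1.3]), rng)
print(choice, probs, theta_h)
```

A full sampler would loop over subjects, recompute the weights with subject i removed, and then refresh every occupied cluster's parameters once per sweep.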
Results
• Dunson and Park demonstrate results on conditional density regression problems.
• [Model equation; annotations: p-dimensional predictor, data likelihood, parameters of cluster h]
• Results are demonstrated on 3 datasets:
• Simulated single Gaussian (p = 2)
• Simulated mixture of two Gaussians (p = 2)
• Epidemiology data (p = 3)
Results
• Simulated single Gaussian data, 500 data points.
• The predictor is generated iid from a uniform distribution over (0,1).
• The algorithm was run for 10,000 iterations with a 1,000-iteration burn-in. Fast mixing and good estimates.
• [Figure: raw data (y against x) and conditional distributions of y for two different values of x. Dotted line: truth; solid line: estimate; dashed lines: 99% credible intervals.]
Results
• Simulated two-Gaussian mixture data, 500 data points.
• The predictor is generated iid from a uniform distribution over (0,1).
• [Figure: left column, PPM (non-generalized); right column, GPPM on the same dataset.]
• Notice the much better fit in the bottom plots, and that the GPPM estimate is not dragged toward 0 when the second peak appears as the predictor approaches 0.
Results
• Epidemiologic application:
• DDE has been shown to increase the rate of pre-term birth. The two predictors correspond to the DDE dose for child i and the mother's age after normalization, respectively.
• The dataset contains 2,313 subjects.
• The MCMC GPPM sampler was run for 30,000 iterations with a 10,000-iteration burn-in.
• The results confirmed earlier findings that the response follows a slightly decreasing trend as the DDE level rises.
• These findings are similar to previous KSBP work on the same dataset, but the implementation was simpler.
Results
• [Figure: raw data and fitted results; dashed lines indicate 99% credible intervals.]
Conclusion
• A GPPM was formulated beginning with the Blackwell-MacQueen Polya urn scheme.
• The GPPM incorporates predictor dependence by treating the predictor as a random variable.
• It is similar in spirit to the KSBP, but it bypasses issues such as kernel width selection and the inability to implement a continuous distribution in predictor space.
• Future research could explore the variational method that Dunson mentions, applied to the formulation proposed in this paper.