100 likes | 113 Views
This paper explores methods for drawing from a Dirichlet distribution, including exact and converging techniques using gamma distributions and stick-breaking, and compares Dirichlet Distribution (DD) with Dirichlet Process (DP) in mixture modeling.
E N D
Dirichlet Prior Sieves in Finite Normal MixturesBy: Hemant Ishwaran and Mahmoud Zarepour John Paisley
Dirichlet Distribution Reparameterize: Written this way, alpha is a scalar and g is a pmf
How do you draw from a Dirichlet distribution? • Two finite exact methods • Using Gamma Distributions • Using “exact” stick-breaking • With \pi_1 set to v_1 and \pi_k getting the rest of the stick
How do you draw from a Dirichlet distribution? (2) • Two converging methods • Polya Urn • alpha balls distributed according to g are placed in an urn. Balls are then drawn with replacement and another ball of the same color as that drawn placed back in the urn. Empirical distribution in the urn represents a draw from DD as the number of ball draws approaches infinity • Sethuraman’s Stick-Breaking
Sethuraman’s Stick-Breaking • To draw from below distribution, do the following • Sethuraman’s discovery allows one to draw from a Dirichlet distribution by drawing the weights and the components independently of one another. The final pi values are not independent.
Dirichlet Process Movie: In DP, we enforce that G_0(B_i) = 1/k --- What does that look like as a function of k? I don’t have the math proof, but I say that these regions converge to points and a uniform draw of a region (Dirichlet component) is the same as drawing a point from G_0.
Letting k go to infinity • Using a uniform prior, the breaks will be infinitely small and since the probability of drawing the same Y is zero with a uniform G_0(B_i), you are left with G_0 in the limit. • For alpha less than infinity, you have the Dirichlet process (obvious when looking at this written in stick-breaking representation) • Also, notice that the “exact” methods are now impractical, but the two “infinite” methods can be used to approximate the draw.
This Paper: How DD approximates DP for mixture modeling • After parameters are drawn, the mathematical forms are exactly the same (with truncated DP) • For DD, \pi comes from a DD • For DP, \pi comes from stick-breaking • \thetas are drawn iid from G_0 in each case • The only question is: How does \pi differ between the two • Answer: We assume for DD prior that draws of Y never reduplicate, which is where we differ from DP. However, if we set k large enough and alpha small enough for the DD, there can be a good probability that most of the stick (defined by truncation error) will be allocated before duplicate copies of Y are drawn. When this happens, we’ve coincidentally drawn from a DP. Increasing k and/or reducing alpha increases the probability of this happening.
Summary • DP is a sparseness promoting prior on the weights with a mathematical relation to the atoms that makes everything logical. • DD, when alpha < k is also a sparseness promoting prior on the weights, but there is no strict mathematical relation to the components, making it more “ad hoc” (but not really) • In practice, however, they both function to obtain the same exact ends.
Aside • For the generation of a DD mixture model, if we were to also draw the Y’s “just for fun,” then on the occasions when the Y’s are all unique, that specific draw is exactly the same as a draw from a DP (as I understand it). • Using Sethuraman’s definition, and given a specific truncation error for the stick, we can get an expected number of necessary breaks (to meet that error) as a function of alpha, which is also the number of Y’s we need to draw, call it N. • We can then calculate the probability of drawing duplicate Y’s because we are drawing from N balls uniformly with replacement and we can find the probability that we draw only unique balls. Therefore, we can determine the probability that a draw from a DD mixture model is also a draw from a DP mixture model. • This paper says you can use a DD to approximate a DP. What this value would do is give you a probability as a function of alpha and N that is a measure of how close you are to a DP. Clearly when N is greater than the number of components, that probability is zero.