10 likes | 138 Views
Learning a Small Mixture of Trees. S T A N F O R D. Problem Formulation. Aim: To efficiently learn a small mixture of trees that approximates an observed distribution. Results. Standard UCI datasets. Choose from all possible trees T = { t j } defined over n random variables.
E N D
Learning a Small Mixture of Trees S T A N F O R D Problem Formulation Aim:To efficiently learn a small mixture of trees that approximates an observed distribution Results Standard UCI datasets Choose from all possible trees T = {tj} defined over n random variables Constraints defined on infinite variables max Matrix A where A(i,j) = Pr(xi|tj) Overview s.t. ai ≥ bi An intuitive objective function for learning a mixture of trees Vector b where b(i) = p(xi) Finding -optimal solution? P Formulate the problem using fractional covering P Vector ≥ 0 such that ∑ j = 1 Identify the drawbacks of fractional covering Fractional Covering Plotkin et al., 1995 MJ00 uses twice as many trees Make suitable modifications to the algorithm Minimize first-order approximation Learning Pictorial Structures Parameter log(m) Define 0 = mini ai0/bi Mixture of Trees Width w= maxmaxi ai/bi 11 characters in an episode of “Buffy” Define = /4w Drawbacks Variables V = {v1, v2, … , vn} Labeling x Label xa Xa for variable va 24,244 faces (first 80% train, last 20% test) Initial solution 0 While < 20, iterate (1) Slow convergence ∏(a,b)tab(xa,xb) 13 facial features (variables) + positions (labels) Define yi = exp(-ai/bi)/bi Pr(x | t) = z Hidden variable (2) Singleton trees (Probability = 0 for unseen test examples) min ∑iexp(-ai/bi) Unary: logistic regression, Pairwise: m ∏ata(xa)da-1 Find ’ = argmaxyTA s.t. P Bag of visual words : 65.68% RS02 Our ta(xa): Unary potentials t1 t2 t3 Update = (1-) + ’ v1 v1 v1 tab(xa,xb): Pairwise potentials Modifying Fractional Covering da: Degree of va Large step-size v2 v3 v2 v3 v2 v3 66.0566.05 66.01 66.65 66.01 66.86 66.08 67.25 66.08 67.48 66.16 67.50 66.20 67.68 Large yi for numerical stability Pr(x | m) = ∑t T Pr(x | t) (1) Start with = 1/w. Increase by a factor of 2 if necessary. (2) Minimize using convex relaxation. Minimizing the KL Divergence Initialize tolerance , parameter , factor f -Pr(xi|t) Pr(x | 1) 1 : observed distribution Solve for distribution Pr(. | t) min ∑i exp KL(1||2) = ∑x Pr(x | 1) log p(xi) Pr(x | 2) 2 : simpler distribution min f - ∑i log(Pr(xi|t)) -∑i log(1-Pr(xi|t)) EM Algorithm (Relies heavily on initialization) Update f = f until m/f ≤ Meila and Jordan, 2000 (MJ00) s.t. Pr(xi|t) ≥ 0, ∑i Pr(xi|t)≤ 1 E-step: Estimate Pr(x | t) for each x and t Rosset and Segal, 2002 (RS02) Log-barrier approach. Use Newton’s method. Dropped t T M-step: Obtain structure and potentials (Chow-Liu) Focuses on dominant mode Hessian with uniform off-diagonal elements Minimizing -Divergence To minimize g(z), update z = z - (2g(z))-1 g(z) Renyi, 1961 Matrix inversion in linear time Hessian Gradient Generalization of KL Divergence Pr(x | 1) 1 Project to tree distribution using Chow-Liu May result in increase in log ∑x D(1||2) = -1 D1(1 || 2) = KL(1 || 2) Pr(x | 2)1- Computationally expensive operation? Discard best explained sample and recompute t Enforce Pr(xi’|t) = 0 Fitting q to p i’ = argmaxi Pr(xi|t)/p(xi) Larger is inclusive Only one log-barrier optimization required Use = Pr(xi|t) = Pr(xi|t) + si Pr(xi’|t) Use previous solution si = p(xi |t)/∑k p(xk |t) Minka, 2005 = 0.5 = 1 = Convergence Properties Given distribution p(.) find a mixture of trees by minimizing -divergence Maximum number of increases for = O(log(log(m))) p(xi) Future Work Pr(xi | m) Mixtures in log-probability space? = arg maxm mini maxi log arg minm m* = Maximum discarded samples = m-1 p(xi) Pr(xi | m) Connections to Discrete AdaBoost? Polynomial time per iteration. Polynomial time convergence of overall algorithm.