This paper proposes an undirected hierarchical Bayes reformulation for efficient point estimation, overcoming challenges in conventional models. The reformulation aims for easy specification and optimization, replacing the non-convex estimation problems that arise with standard conjugate priors by a convex one. Through experiments in text categorization and shape modeling, the approach demonstrates improved performance over traditional methods.
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller, Computer Science Dept., Stanford University. UAI 2008. Presented by Haojun Chen, August 1st, 2008
Outline • Background and motivation • Undirected transfer hierarchies • Experiments • Degree of transfer coefficients • Experiments • Summary
Background (1/2) • Transfer learning: data from "similar" tasks/distributions are used to compensate for the sparsity of training data in the primary class or task • Example: use rhinos to help learn elephants' shape Resources: http://velblod.videolectures.net/2008/pascal2/uai08_helsinki/packer_cpe/uai08_packer_cpe_01.ppt
Background (2/2) • Hierarchical Bayes (HB) framework: a principled approach for transfer learning. Let $C$ be a set of related learning tasks/classes, $D$ the observed data, and $\theta = \{\theta_c\}$ the task/class parameters. The joint distribution over the observed data and all class parameters is $$P(D, \theta) = P(\theta_{root}) \prod_{c \in C} P(\theta_c \mid \theta_{pa(c)})\, P(D_c \mid \theta_c),$$ where $pa(c)$ denotes the parent of class $c$ in the hierarchy. Example of a hierarchical Bayes parameterization (see the sketch below).
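To make this concrete, here is a minimal sketch of the joint log-probability for such a tree, assuming scalar Gaussian class parameters, Gaussian child-given-parent links, and Gaussian likelihoods; these distributional choices, the standard deviations, and all names are illustrative, not prescribed by the paper:

```python
import numpy as np
from scipy.stats import norm

def log_joint(data, theta, parent, root_prior_std=10.0, link_std=1.0):
    """data: {class: 1-D array of observations}, theta: {class: scalar mean},
    parent: {class: parent class name, or None for the root}."""
    lp = 0.0
    for c, th in theta.items():
        pa = parent[c]
        if pa is None:                      # root: broad prior P(theta_root)
            lp += norm.logpdf(th, 0.0, root_prior_std)
        else:                               # P(theta_c | theta_pa(c))
            lp += norm.logpdf(th, theta[pa], link_std)
        if c in data:                       # P(D_c | theta_c)
            lp += norm.logpdf(data[c], th, 1.0).sum()
    return lp

# Usage: a two-level hierarchy (root -> {elephant, rhino})
theta = {"root": 2.0, "elephant": 2.5, "rhino": 1.8}
parent = {"root": None, "elephant": "root", "rhino": "root"}
data = {"elephant": np.array([2.4, 2.6]), "rhino": np.array([1.7, 1.9, 2.0])}
print(log_joint(data, theta, parent))
```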
Motivation • In practice, MAP point estimation is desirable, because full Bayesian computation can be difficult and computationally demanding • Efficient point estimation may not be achievable in many standard hierarchical Bayes models, because many common conjugate priors, such as the Dirichlet or normal-inverse-Wishart, do not yield a convex estimation problem with respect to the parameters • In this paper, an undirected hierarchical Bayes (HB) reformulation is proposed to allow efficient point estimation
Undirected HB Reformulation • $F_{data}(\theta; D)$: data-dependent objective • $Div(\theta_c, \theta_{pa(c)})$: divergence function over child and parent parameters • The point estimate maximizes $F_{data}(\theta; D) - \beta \sum_c Div(\theta_c, \theta_{pa(c)})$, where the divergence weight $\beta \to 0$ encourages parameters to explain the data, and $\beta \to \infty$ encourages parameters to be similar to their parents
Purpose of Reformulation • Easy to specify • $F_{data}$ can be a likelihood, classification, or other objective • Divergence can be L1-norm, L2-norm, ε-insensitive loss, KL divergence, etc. • No conjugacy or proper-prior restrictions • Easy to optimize • Convex over $\theta$ if $F_{data}$ is concave and the divergence is convex (see the sketch after this list)
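As a minimal illustration of this convexity property, the sketch below combines a concave Gaussian log-likelihood $F_{data}$ with a convex L2 divergence for a toy two-child hierarchy, so the negated objective is convex and any local optimum is global; the lam weight, the data, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(theta, d1, d2, lam=1.0):
    """theta = [root, child1, child2]; returns -(F_data - lam * divergence)."""
    root, c1, c2 = theta
    f_data = (-0.5 * np.sum((d1 - c1) ** 2)          # Gaussian log-likelihood
              - 0.5 * np.sum((d2 - c2) ** 2))        # terms, concave in (c1, c2)
    divergence = (c1 - root) ** 2 + (c2 - root) ** 2  # convex L2 penalty
    return -(f_data - lam * divergence)

# Usage: child 1 has few instances, so it is pulled toward the shared root
rng = np.random.default_rng(0)
d1, d2 = rng.normal(2.0, 1.0, 5), rng.normal(2.5, 1.0, 50)
theta_hat = minimize(neg_objective, x0=np.zeros(3), args=(d1, d2)).x
print(theta_hat)   # root sits between the children; child 1 is shrunk toward it
```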
Experiment: Text Categorization (Newsgroup20 dataset) • Bag-of-words model • $F_{data}$: regularized multinomial log-likelihood, with $\theta_i$ the frequency of word $i$ • Divergence: L2 norm (a sketch follows) Resources: http://velblod.videolectures.net/2008/pascal2/uai08_helsinki/packer_cpe/uai08_packer_cpe_01.ppt
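A minimal sketch of this $F_{data}$, assuming a log-space (softmax) parameterization, which keeps the multinomial log-likelihood concave in the parameters; the function names and the lam weight are illustrative, not taken from the paper:

```python
import numpy as np

def f_data(logits, counts):
    """Multinomial log-likelihood of word counts under softmax(logits);
    concave in the logits."""
    log_p = logits - np.logaddexp.reduce(logits)   # log-softmax
    return counts @ log_p                          # sum_i n_i * log p_i

def objective(child_logits, parent_logits, counts, lam=0.1):
    """Concave F_data minus a convex L2 divergence to the parent."""
    l2 = np.sum((child_logits - parent_logits) ** 2)
    return f_data(child_logits, counts) - lam * l2

# Usage: word counts for one class, parent at the uniform distribution
counts = np.array([5, 0, 2, 1])
print(objective(np.zeros(4), np.zeros(4), counts))
```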
Text Categorization Result • Baselines: maximum likelihood at each node (no hierarchy); cross-validated regularization (no hierarchy); shrinkage (McCallum et al. '98, with hierarchy) • [Figure: Newsgroup topic classification rate (0.35–0.7) vs. total number of training instances (75–375), comparing Max Likelihood (no regularization), Shrinkage, Regularized Max Likelihood, and Undirected HB]
Experiment: Shape Modeling (density estimation, measured by test likelihood) • Mammals dataset (Fink, '05) • Instances represented by 60 x-y coordinates of landmarks on the outline • Model: mean landmark location and covariance over landmarks, with regularization • Divergence: L2 norm over mean and variance (a sketch follows)
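A minimal sketch of the shape-model objective, assuming a diagonal covariance for simplicity (the slide's model uses a full covariance over landmarks); all names and the lam weight are illustrative:

```python
import numpy as np

def gaussian_log_lik(X, mu, var):
    """X: (n_instances, 120) stacked x-y landmark coordinates;
    mu, var: per-coordinate mean and variance."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)

def shape_objective(mu_c, var_c, mu_p, var_p, X, lam=1.0):
    # L2 divergence over both the mean and the variance parameters
    div = np.sum((mu_c - mu_p) ** 2) + np.sum((var_c - var_p) ** 2)
    return gaussian_log_lik(X, mu_c, var_c) - lam * div
```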
Undirected HB Shape Modeling Result • [Figure: delta log-loss per instance vs. total number of training instances (6–30) for mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino), Undirected HB relative to Regularized Max Likelihood]
Problem in Transfer • Not all parameters deserve equal sharing
Degrees of Transfer (DOT) • The divergence is split into subcomponents $Div_i$ with weights $\lambda_i$, so that different strengths are allowed for different subcomponents and child-parent pairs • $\lambda_i \to 0$: forces parameters to agree • $\lambda_i \to \infty$: allows parameters to be flexible
Estimation of DOT Parameters • Hyper-prior approach • Bayesian idea: put a prior on $\lambda$ and add it as a parameter to the optimization along with $\theta$ • Concretely: an inverse-Gamma prior on the degree of transfer ($\lambda$ forced to be positive); a sketch follows
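A minimal sketch of the DOT penalty with this hyper-prior, assuming the per-subcomponent form $Div_i / \lambda_i$ implied by the arrows above and an inverse-Gamma log-density; the alpha/beta values and all names are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def inv_gamma_log_pdf(lam, alpha=2.0, beta=1.0):
    """Log-density of an inverse-Gamma prior; it keeps lam positive, since
    the log-density goes to -inf as lam -> 0 or lam -> inf."""
    return (alpha * np.log(beta) - gammaln(alpha)
            - (alpha + 1) * np.log(lam) - beta / lam)

def dot_penalty(divergences, lams):
    """Per-subcomponent penalty Div_i / lambda_i (lambda_i -> 0 forces
    agreement, lambda_i -> inf allows flexibility), minus the hyper-prior
    log-density so the lams can be optimized jointly with theta."""
    return np.sum(divergences / lams) - np.sum(inv_gamma_log_pdf(lams))

# Usage: subtract the penalty from F_data during optimization
print(dot_penalty(np.array([0.5, 2.0]), np.array([1.0, 4.0])))
```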
DOT Shape Modeling Result • [Figure: delta log-loss per instance vs. total number of training instances (6–30) for the same mammal pairs, Hyperprior relative to Regularized Max Likelihood]
Distribution of DOT Coefficients • [Figure: histogram of DOT coefficients $1/\lambda$ under the hyper-prior approach, including the root's coefficients; small $1/\lambda$ corresponds to weaker transfer, large $1/\lambda$ to stronger transfer]
Summary • An undirected reformulation of the hierarchical Bayes framework is proposed for efficient convex point estimation • Different degrees of transfer for different parameters are introduced, so that some parts of the distribution can be transferred to a greater extent than others