Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller (Stanford AI Lab)
Motivation
Problem: With few instances, learned models aren't robust
Task: Shape modeling
[Figure: mean shape and principal components at +1 / -1 std]
Transfer Learning
Shape is stabilized, but doesn't look like an elephant
Can we use rhinos to help elephants?
[Figure: mean shape and principal components at +1 / -1 std]
Hierarchical Bayes
[Figure: tree with θroot at the top and θElephant, θRhino as children; the edges carry P(θroot), P(θElephant | θroot), P(θRhino | θroot), P(DataElephant | θElephant), P(DataRhino | θRhino)]
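In this directed view, the joint distribution over parameters and data factorizes in the standard hierarchical-Bayes way (written here in the slide's notation as a reference, not a new claim):

```latex
P(\theta_{\mathrm{root}}, \theta_{\mathrm{Elephant}}, \theta_{\mathrm{Rhino}}, D)
  = P(\theta_{\mathrm{root}})
    \prod_{c \in \{\mathrm{Elephant},\,\mathrm{Rhino}\}}
      P(\theta_c \mid \theta_{\mathrm{root}})\; P(D_c \mid \theta_c)
```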
Goals • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Automatically learn what to transfer
Hierarchical Bayes
• Compute full posterior P(Θ | D)
• P(θc | θroot) must be conjugate to P(D | θc)
Problem: Often can't perform full Bayesian computations
Approximation: Point estimation
Best parameters are good enough; don't need the full distribution
• Empirical Bayes
• Point estimation
• Other approximations: posterior approximated as a normal, sampling, etc.
More Issues: Multiple Levels
• Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
• Exception: Thibaux and Jordan ('05)
More Issues: Restrictive Priors
• Example: the normal-inverse-Wishart prior on Gaussian parameters (μA, ΣA), (μB, ΣB), with hyperparameters μ0, Λ, κ, ν, where κ and ν act as pseudocounts
• Pseudocount restriction: ν ≥ d
• If d is large and N is small, the signal from the prior overwhelms the data (N = # samples, d = dimension)
• We show experiments with N = 3, d = 20
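For concreteness, a standard normal-inverse-Wishart posterior update (textbook material, not from the slides) shows how the pseudocount restriction bites:

```latex
\nu_N = \nu_0 + N, \qquad
\Lambda_N = \Lambda_0 + \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{\top}
          + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^{\top},
\qquad
\mathbb{E}[\Sigma \mid D] = \frac{\Lambda_N}{\nu_N - d - 1}
```

With d = 20, the restriction ν0 ≥ d means the prior contributes on the order of 20 pseudo-observations to ΛN, so with only N = 3 real instances the prior term dominates the posterior covariance.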
Alternative: Shrinkage
• McCallum et al. ('98)
• 1. Compute maximum likelihood at each node
• 2. "Shrink" each node toward its parent: θ* = αθ + (1-α)θparent, a linear combination of θ and θparent
• Uses cross-validation to set α
• Pros: Simple to compute; handles multiple levels
• Cons: Naive heuristic for transfer; averaging not always appropriate
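A minimal sketch of the shrinkage procedure above, assuming Gaussian-mean leaf models; the helper names and the toy elephant/rhino data are illustrative, and the mixing weight alpha would be chosen by cross-validation in practice:

```python
import numpy as np

def ml_estimate(data):
    """Maximum-likelihood estimate at a node (here: the Gaussian mean of its instances)."""
    return np.mean(data, axis=0)

def shrink(theta, theta_parent, alpha):
    """McCallum-style shrinkage: linear interpolation of a node toward its parent."""
    return alpha * theta + (1.0 - alpha) * theta_parent

# Toy usage: an elephant node with few instances shrunk toward a shared parent.
elephant_data = np.random.randn(3, 2) + np.array([5.0, 1.0])    # few instances
rhino_data = np.random.randn(30, 2) + np.array([4.0, 0.5])      # many instances

theta_elephant = ml_estimate(elephant_data)
theta_parent = ml_estimate(np.vstack([elephant_data, rhino_data]))
theta_star = shrink(theta_elephant, theta_parent, alpha=0.6)     # alpha from cross-validation
```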
Undirected HB Reformulation
Probabilistic Abstraction Hierarchies (Segal et al. '01)
Defines an undirected Markov random field model over Θ, D
Undirected Probabilistic Model
[Figure: θroot connected to θElephant and θRhino by Divergence edges (with weight β high or low); each leaf connected to its data by an Fdata potential]
Divergence: Encourages parameters to be similar to their parents
Fdata: Encourages parameters to explain the data
Purpose of Reformulation
• Easy to specify
• Fdata can be likelihood, classification, or another objective
• Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
• No conjugacy or proper-prior restrictions
• Easy to optimize
• Point estimation is a convex problem over Θ when Fdata is concave (e.g., log-likelihood) and each Divergence is convex (see the sketch below)
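A minimal sketch of the kind of objective this reformulation yields, assuming Gaussian-mean leaf models, an L2 divergence, and SciPy's L-BFGS; the toy two-leaf hierarchy and names like neg_objective are illustrative, not the paper's exact setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy two-leaf hierarchy: theta = [theta_root, theta_elephant, theta_rhino], each 2-D.
data = {"elephant": np.random.randn(3, 2) + 5.0,
        "rhino": np.random.randn(30, 2) + 4.0}
beta = 1.0  # divergence weight

def neg_objective(flat):
    root, eleph, rhino = flat.reshape(3, 2)
    # Fdata: Gaussian log-likelihood of each leaf's instances (up to constants).
    fdata = -0.5 * (np.sum((data["elephant"] - eleph) ** 2)
                    + np.sum((data["rhino"] - rhino) ** 2))
    # Divergence: L2 distance between each child and its parent.
    div = np.sum((eleph - root) ** 2) + np.sum((rhino - root) ** 2)
    return -(fdata - beta * div)   # maximize a concave objective = minimize its negation

theta0 = np.zeros(6)
result = minimize(neg_objective, theta0, method="L-BFGS-B")
theta_root, theta_elephant, theta_rhino = result.x.reshape(3, 2)
```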
Application: Text Categorization
Task: Categorize documents (bag-of-words model)
Fdata: Multinomial log-likelihood (regularized); θi represents the frequency of word i
Divergence: L2 norm
Newsgroup20 dataset
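A minimal sketch of what the multinomial Fdata plus L2 divergence could look like for this task; the ridge term, the normalization of theta into a distribution, and the helper names are simplifying assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def fdata_multinomial(counts, theta, ridge=1e-2):
    """Regularized multinomial log-likelihood of word counts under parameters theta.

    theta holds nonnegative word frequencies; it is normalized to a distribution
    here, and the ridge term stands in for the slide's 'regularized' qualifier.
    """
    probs = theta / theta.sum()
    return counts @ np.log(probs + 1e-12) - ridge * np.sum(theta ** 2)

def divergence_l2(theta_child, theta_parent):
    """L2 divergence pulling a child class toward its parent topic."""
    return np.sum((theta_child - theta_parent) ** 2)

def hierarchy_objective(thetas, parents, counts_per_class, lam=1.0):
    """Sum of leaf data terms minus weighted child-parent divergences."""
    obj = sum(fdata_multinomial(counts_per_class[c], thetas[c]) for c in counts_per_class)
    obj -= lam * sum(divergence_l2(thetas[c], thetas[p]) for c, p in parents.items())
    return obj
```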
Baselines
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. '98, with hierarchy): θ* = αθ + (1-α)θparent
Can It Handle Multiple Levels?
[Figure: Newsgroup topic classification. Classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing max likelihood (no regularization), shrinkage, regularized max likelihood, and undirected HB.]
Application: Shape Modeling
Mammals dataset (Fink, '05)
Task: Learn shape (density estimation, evaluated by test likelihood)
Instances represented by 60 x-y coordinates of landmarks on the outline
Model: mean landmark location and covariance over landmarks, with regularization
Divergence: L2 norm over mean and variance
[Figure: mean shape and principal components]
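A minimal sketch of the Gaussian landmark model and the L2 divergence over mean and variance; the diagonal variance, the small ridge, and the helper names are simplifications for illustration, not the paper's exact parameterization:

```python
import numpy as np

def shape_params(instances):
    """ML Gaussian shape parameters from instances of 60 (x, y) landmarks.

    instances: array of shape (N, 60, 2); returns (mean, var) over the flattened
    120-dimensional landmark vector (diagonal variance for brevity).
    """
    flat = instances.reshape(len(instances), -1)
    return flat.mean(axis=0), flat.var(axis=0) + 1e-3   # small ridge as regularization

def shape_divergence(child, parent):
    """L2 divergence over mean and variance between a child class and its parent."""
    (mu_c, var_c), (mu_p, var_p) = child, parent
    return np.sum((mu_c - mu_p) ** 2) + np.sum((var_c - var_p) ** 2)
```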
Does Hierarchy Help?
[Figure: Mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino). Delta log-loss per instance relative to regularized max likelihood vs. total number of training instances (6 to 30).]
Unregularized max likelihood and shrinkage: much worse, not shown
Transfer Not all parameters deserve equal sharing
Degrees of Transfer
Split θ into subcomponents θi with weights λi
Allows different transfer strengths for different subcomponents and child-parent pairs
How do we estimate all these parameters?
Learning Degrees of Transfer
• Bootstrap approach: if a child subcomponent and its parent counterpart have a consistent relationship, encourage them to be similar
• Hyper-prior approach (prior on the degree of transfer): Bayesian idea: put a prior on λ and add λ to the optimization along with Θ; concretely, an inverse-Gamma prior (λ forced to be positive)
• If the likelihood is concave, the entire joint optimization remains convex!
(A sketch of the hyper-prior objective follows below.)
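A minimal sketch of the hyper-prior idea, assuming Gaussian leaf models, per-coordinate weights lambda_i with an inverse-Gamma prior, and joint optimization over parameters and log lambda (to keep lambda positive) via L-BFGS; the prior hyperparameters and packing of variables are illustrative, and joint convexity in this generic form is not guaranteed, it depends on the particular prior and parameterization used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

data = {"elephant": np.random.randn(3, 2) + 5.0,
        "rhino": np.random.randn(30, 2) + 4.0}
a, b = 2.0, 1.0   # illustrative inverse-Gamma shape/scale for the prior on lambda

def neg_objective(x):
    # x packs [theta_root (2), theta_elephant (2), theta_rhino (2), log_lambda (2)].
    root, eleph, rhino = x[:2], x[2:4], x[4:6]
    lam = np.exp(x[6:8])                       # per-subcomponent degrees of transfer
    fdata = -0.5 * (np.sum((data["elephant"] - eleph) ** 2)
                    + np.sum((data["rhino"] - rhino) ** 2))
    div = lam @ ((eleph - root) ** 2 + (rhino - root) ** 2)
    log_prior = np.sum(-(a + 1) * np.log(lam) - b / lam)   # inverse-Gamma on each lambda_i
    return -(fdata - div + log_prior)

x0 = np.zeros(8)
result = minimize(neg_objective, x0, method="L-BFGS-B")
lam_hat = np.exp(result.x[6:8])   # learned per-coordinate transfer strengths
```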
Do Degrees of Transfer Help?
[Figure: Mammal pairs. Delta log-loss per instance of the hyperprior approach relative to regularized max likelihood vs. total number of training instances (6 to 30), across the ten mammal pairs.]
Degrees of Transfer
[Figure: distribution of DOT coefficients learned with the hyperprior, shown as a histogram over 1/λ (0 to 50), with the axis annotated from stronger to weaker transfer.]
Summary • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Refined transfer of components
Future Work
• Non-tree hierarchies (multiple inheritance): the general undirected model doesn't require a tree structure (e.g., the Gene Ontology (GO) network, the WordNet hierarchy)
• Block degrees of transfer
• Structure learning (e.g., part discovery)