Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller (Stanford AI Lab)
Motivation
Problem: With few instances, learned models aren't robust
Task: Shape modeling
[Figure: mean shape and principal components at +1 / -1 std]
Transfer Learning
Shape is stabilized, but doesn't look like an elephant
Can we use rhinos to help elephants?
[Figure: mean shape and principal components at +1 / -1 std]
Hierarchical Bayes
[Figure: tree with θroot at the top and θElephant, θRhino as children; the edges carry P(θroot), P(θElephant | θroot), P(θRhino | θroot), P(DataElephant | θElephant), P(DataRhino | θRhino)]
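In this directed view, the joint distribution over parameters and data factorizes in the standard hierarchical-Bayes way (written here in the slide's notation as a reference, not a new claim):

```latex
P(\theta_{\mathrm{root}}, \theta_{\mathrm{Elephant}}, \theta_{\mathrm{Rhino}}, D)
  = P(\theta_{\mathrm{root}})
    \prod_{c \in \{\mathrm{Elephant},\,\mathrm{Rhino}\}}
      P(\theta_c \mid \theta_{\mathrm{root}})\; P(D_c \mid \theta_c)
```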
Goals • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Automatically learn what to transfer
Hierarchical Bayes
• Compute full posterior P(Θ | D)
• P(θc | θroot) must be conjugate to P(D | θc)
Problem: Often can't perform full Bayesian computations
Approximation: Point estimation
Best parameters are good enough; don't need the full distribution
• Empirical Bayes
• Point estimation
• Other approximations: posterior approximated as a normal, sampling, etc.
More Issues: Multiple Levels
• Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
• Exception: Thibaux and Jordan ('05)
More Issues: Restrictive Priors
• Example: the normal-inverse-Wishart prior on Gaussian parameters (μA, ΣA), (μB, ΣB), with hyperparameters μ0, Λ, κ, ν, where κ and ν act as pseudocounts
• Pseudocount restriction: ν ≥ d
• If d is large and N is small, the signal from the prior overwhelms the data (N = # samples, d = dimension)
• We show experiments with N = 3, d = 20
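For concreteness, a standard normal-inverse-Wishart posterior update (textbook material, not from the slides) shows how the pseudocount restriction bites:

```latex
\nu_N = \nu_0 + N, \qquad
\Lambda_N = \Lambda_0 + \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{\top}
          + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^{\top},
\qquad
\mathbb{E}[\Sigma \mid D] = \frac{\Lambda_N}{\nu_N - d - 1}
```

With d = 20, the restriction ν0 ≥ d means the prior contributes on the order of 20 pseudo-observations to ΛN, so with only N = 3 real instances the prior term dominates the posterior covariance.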
Alternative: Shrinkage
• McCallum et al. ('98)
• 1. Compute maximum likelihood at each node
• 2. "Shrink" each node toward its parent: θ* = αθ + (1-α)θparent, a linear combination of θ and θparent
• Uses cross-validation to set α
• Pros: Simple to compute; handles multiple levels
• Cons: Naive heuristic for transfer; averaging not always appropriate
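A minimal sketch of the shrinkage procedure above, assuming Gaussian-mean leaf models; the helper names and the toy elephant/rhino data are illustrative, and the mixing weight alpha would be chosen by cross-validation in practice:

```python
import numpy as np

def ml_estimate(data):
    """Maximum-likelihood estimate at a node (here: the Gaussian mean of its instances)."""
    return np.mean(data, axis=0)

def shrink(theta, theta_parent, alpha):
    """McCallum-style shrinkage: linear interpolation of a node toward its parent."""
    return alpha * theta + (1.0 - alpha) * theta_parent

# Toy usage: an elephant node with few instances shrunk toward a shared parent.
elephant_data = np.random.randn(3, 2) + np.array([5.0, 1.0])    # few instances
rhino_data = np.random.randn(30, 2) + np.array([4.0, 0.5])      # many instances

theta_elephant = ml_estimate(elephant_data)
theta_parent = ml_estimate(np.vstack([elephant_data, rhino_data]))
theta_star = shrink(theta_elephant, theta_parent, alpha=0.6)     # alpha from cross-validation
```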
Undirected HB Reformulation
Probabilistic Abstraction Hierarchies (Segal et al. '01)
Defines an undirected Markov random field model over Θ, D
Undirected Probabilistic Model
[Figure: θroot connected to θElephant and θRhino by Divergence edges (with weight β high or low); each leaf connected to its data by an Fdata potential]
Divergence: Encourages parameters to be similar to their parents
Fdata: Encourages parameters to explain the data
Purpose of Reformulation
• Easy to specify
• Fdata can be likelihood, classification, or another objective
• Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
• No conjugacy or proper-prior restrictions
• Easy to optimize
• Point estimation is a convex problem over Θ when Fdata is concave (e.g., log-likelihood) and each Divergence is convex (see the sketch below)
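A minimal sketch of the kind of objective this reformulation yields, assuming Gaussian-mean leaf models, an L2 divergence, and SciPy's L-BFGS; the toy two-leaf hierarchy and names like neg_objective are illustrative, not the paper's exact setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy two-leaf hierarchy: theta = [theta_root, theta_elephant, theta_rhino], each 2-D.
data = {"elephant": np.random.randn(3, 2) + 5.0,
        "rhino": np.random.randn(30, 2) + 4.0}
beta = 1.0  # divergence weight

def neg_objective(flat):
    root, eleph, rhino = flat.reshape(3, 2)
    # Fdata: Gaussian log-likelihood of each leaf's instances (up to constants).
    fdata = -0.5 * (np.sum((data["elephant"] - eleph) ** 2)
                    + np.sum((data["rhino"] - rhino) ** 2))
    # Divergence: L2 distance between each child and its parent.
    div = np.sum((eleph - root) ** 2) + np.sum((rhino - root) ** 2)
    return -(fdata - beta * div)   # maximize a concave objective = minimize its negation

theta0 = np.zeros(6)
result = minimize(neg_objective, theta0, method="L-BFGS-B")
theta_root, theta_elephant, theta_rhino = result.x.reshape(3, 2)
```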
Application: Text Categorization
Task: Categorize documents (bag-of-words model)
Fdata: Multinomial log-likelihood (regularized); θi represents the frequency of word i
Divergence: L2 norm
Newsgroup20 dataset
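A minimal sketch of what the multinomial Fdata plus L2 divergence could look like for this task; the ridge term, the normalization of theta into a distribution, and the helper names are simplifying assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def fdata_multinomial(counts, theta, ridge=1e-2):
    """Regularized multinomial log-likelihood of word counts under parameters theta.

    theta holds nonnegative word frequencies; it is normalized to a distribution
    here, and the ridge term stands in for the slide's 'regularized' qualifier.
    """
    probs = theta / theta.sum()
    return counts @ np.log(probs + 1e-12) - ridge * np.sum(theta ** 2)

def divergence_l2(theta_child, theta_parent):
    """L2 divergence pulling a child class toward its parent topic."""
    return np.sum((theta_child - theta_parent) ** 2)

def hierarchy_objective(thetas, parents, counts_per_class, lam=1.0):
    """Sum of leaf data terms minus weighted child-parent divergences."""
    obj = sum(fdata_multinomial(counts_per_class[c], thetas[c]) for c in counts_per_class)
    obj -= lam * sum(divergence_l2(thetas[c], thetas[p]) for c, p in parents.items())
    return obj
```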
Baselines
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. '98, with hierarchy): θ* = αθ + (1-α)θparent
Can It Handle Multiple Levels?
[Figure: Newsgroup topic classification. Classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing max likelihood (no regularization), shrinkage, regularized max likelihood, and undirected HB.]
Application: Shape Modeling
Mammals dataset (Fink, '05)
Task: Learn shape (density estimation, evaluated by test likelihood)
Instances represented by 60 x-y coordinates of landmarks on the outline
Model: mean landmark location and covariance over landmarks, with regularization
Divergence: L2 norm over mean and variance
[Figure: mean shape and principal components]
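A minimal sketch of the Gaussian landmark model and the L2 divergence over mean and variance; the diagonal variance, the small ridge, and the helper names are simplifications for illustration, not the paper's exact parameterization:

```python
import numpy as np

def shape_params(instances):
    """ML Gaussian shape parameters from instances of 60 (x, y) landmarks.

    instances: array of shape (N, 60, 2); returns (mean, var) over the flattened
    120-dimensional landmark vector (diagonal variance for brevity).
    """
    flat = instances.reshape(len(instances), -1)
    return flat.mean(axis=0), flat.var(axis=0) + 1e-3   # small ridge as regularization

def shape_divergence(child, parent):
    """L2 divergence over mean and variance between a child class and its parent."""
    (mu_c, var_c), (mu_p, var_p) = child, parent
    return np.sum((mu_c - mu_p) ** 2) + np.sum((var_c - var_p) ** 2)
```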
Does Hierarchy Help?
[Figure: Mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino). Delta log-loss per instance relative to regularized max likelihood vs. total number of training instances (6 to 30).]
Unregularized max likelihood and shrinkage: much worse, not shown
Transfer Not all parameters deserve equal sharing
Degrees of Transfer
Split θ into subcomponents θi with weights λi
Allows different transfer strengths for different subcomponents and child-parent pairs
How do we estimate all these parameters?
Learning Degrees of Transfer
• Bootstrap approach: if a child subcomponent and its parent counterpart have a consistent relationship, encourage them to be similar
• Hyper-prior approach (prior on the degree of transfer): Bayesian idea: put a prior on λ and add λ to the optimization along with Θ; concretely, an inverse-Gamma prior (λ forced to be positive)
• If the likelihood is concave, the entire joint optimization remains convex!
(A sketch of the hyper-prior objective follows below.)
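A minimal sketch of the hyper-prior idea, assuming Gaussian leaf models, per-coordinate weights lambda_i with an inverse-Gamma prior, and joint optimization over parameters and log lambda (to keep lambda positive) via L-BFGS; the prior hyperparameters and packing of variables are illustrative, and joint convexity in this generic form is not guaranteed, it depends on the particular prior and parameterization used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

data = {"elephant": np.random.randn(3, 2) + 5.0,
        "rhino": np.random.randn(30, 2) + 4.0}
a, b = 2.0, 1.0   # illustrative inverse-Gamma shape/scale for the prior on lambda

def neg_objective(x):
    # x packs [theta_root (2), theta_elephant (2), theta_rhino (2), log_lambda (2)].
    root, eleph, rhino = x[:2], x[2:4], x[4:6]
    lam = np.exp(x[6:8])                       # per-subcomponent degrees of transfer
    fdata = -0.5 * (np.sum((data["elephant"] - eleph) ** 2)
                    + np.sum((data["rhino"] - rhino) ** 2))
    div = lam @ ((eleph - root) ** 2 + (rhino - root) ** 2)
    log_prior = np.sum(-(a + 1) * np.log(lam) - b / lam)   # inverse-Gamma on each lambda_i
    return -(fdata - div + log_prior)

x0 = np.zeros(8)
result = minimize(neg_objective, x0, method="L-BFGS-B")
lam_hat = np.exp(result.x[6:8])   # learned per-coordinate transfer strengths
```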
Do Degrees of Transfer Help?
[Figure: Mammal pairs. Delta log-loss per instance of the hyperprior approach relative to regularized max likelihood vs. total number of training instances (6 to 30), across the ten mammal pairs.]
Degrees of Transfer
[Figure: distribution of DOT coefficients learned with the hyperprior, shown as a histogram over 1/λ (0 to 50), with the axis annotated from stronger to weaker transfer.]
Summary • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Refined transfer of components
Future Work
• Non-tree hierarchies (multiple inheritance): the general undirected model doesn't require a tree structure (e.g., the Gene Ontology (GO) network, the WordNet hierarchy)
• Block degrees of transfer
• Structure learning (e.g., part discovery)