
Information Bottleneck EM

Gal Elidan and Nir Friedman, School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel.


Presentation Transcript


  1. Information Bottleneck EM Gal Elidan and Nir Friedman School of Engineering & Computer Science The Hebrew University, Jerusalem, Israel

  2. Learning with Hidden Variables
  [Figure: naive Bayes structure with a hidden T as parent of X1, X2, X3, and a plot of likelihood against the parameters showing many local maxima]
  Input: data over X1 … XN in which all values of T are missing
  Output: a model P(X, T)
  Problem: no closed-form solution for ML estimation → use Expectation Maximization (EM)
  Problem: EM gets stuck in inferior local maxima
  Common remedies: random restarts, deterministic annealing, simulated annealing
  This work: EM + information regularization for learning parameters

  3. Learning Parameters
  [Figure: fully observed network over X1, X2, X3]
  Input: data over X1 … XN (fully observed)
  Output: a model P(X)
  Form the empirical distribution Q(X) from the data; the parametrization of P copies it directly:
  P(X1) = Q(X1), P(X2|X1) = Q(X2|X1), P(X3|X1) = Q(X3|X1)
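  As a concrete illustration of this slide (code not from the talk; the toy data and variable ordering are my own), a few lines of Python that copy the empirical conditionals of a fully observed three-variable network directly into the model tables:

    import numpy as np

    # Toy fully observed data over three binary variables X1, X2, X3 (one row per instance).
    data = np.array([[0, 1, 1],
                     [1, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0],
                     [1, 1, 1]])
    N = len(data)

    # P(X1) = Q(X1): the empirical marginal of X1.
    p_x1 = np.bincount(data[:, 0], minlength=2) / N

    # P(X2|X1) = Q(X2|X1) and P(X3|X1) = Q(X3|X1): empirical conditional tables.
    def empirical_conditional(child, parent):
        table = np.zeros((2, 2))
        for row in data:
            table[row[parent], row[child]] += 1
        return table / table.sum(axis=1, keepdims=True)

    p_x2_given_x1 = empirical_conditional(child=1, parent=0)
    p_x3_given_x1 = empirical_conditional(child=2, parent=0)
    print(p_x1, p_x2_given_x1, p_x3_given_x1, sep="\n")

  With hidden variables (next slide) this direct copy is no longer possible, because Q must first be completed.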

  4. Learning with Hidden Variables
  [Figure: instance-ID variable Y (values 1 … M) attached to the data, with the hidden T as parent of X1, X2, X3]
  Input: data over X1 … XN with all values of T missing, plus the desired structure over X and T.
  For each instance ID, guess a (soft) completion of the value of T. This defines the completed empirical distribution Q(X, T, Y) = Q(X, Y) Q(T|Y), whose marginal Q(X, T) in turn gives a parametrization for P. EM iterates this completion and re-parametrization.

  5. The EM Functional
  The EM algorithm:
  • E-step: generate the completed empirical distribution Q
  • M-step: maximize the parameters of P using Q
  EM is equivalent to optimizing a single functional of (Q, P), and each step increases the value of this functional [Neal and Hinton, 1998].
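  The functional itself did not survive the slide extraction; in the usual Neal-and-Hinton form (notation mine) it can be written as

    F_{EM}[Q, P] = \mathbb{E}_Q[\, \log P(\mathbf{X}, T) \,] + H_Q(T \mid Y)

  The E-step maximizes F_EM over Q with P held fixed, the M-step maximizes it over P with Q held fixed, and a local maximum of F_EM corresponds to a local maximum of the likelihood.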

  6. Information Bottleneck EM
  Target: a trade-off between the EM target and the information between the hidden variable and the instance ID.
  In the rest of the talk:
  • Understanding this objective
  • How to use it to learn better models
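  The formula on this slide did not survive extraction. Piecing it together from the later slides (γ = 0 gives total compression, γ = 1 recovers EM, and slide 23 writes the objective as γ·(Variational EM) - (1-γ)·Regularization), the target has, up to the paper's exact notation, the form

    L_{IB-EM} = \gamma \, F_{EM}[Q, P] - (1 - \gamma) \, I_Q(T; Y), \qquad 0 \le \gamma \le 1

  where F_EM is the EM functional of the previous slide and I_Q(T;Y) is the mutual information, under Q, between the hidden variable T and the instance ID Y.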

  7. Information Regularization
  Motivating idea [Tishby et al., 1999]:
  • Fitting the training data: set T to be the instance ID, which "predicts" X perfectly
  • Generalization: "forget" the ID and keep only the essence of X
  Objective: trade a (lower bound of the) likelihood of P against compression of the instance ID; this is a parameter-free regularization of Q.
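  The compression term referred to here is the standard mutual information between the hidden variable T and the instance ID Y under Q (definition reproduced for completeness, notation mine):

    I_Q(T; Y) = \sum_y Q(y) \sum_t Q(t \mid y) \log \frac{Q(t \mid y)}{Q(t)}

  It is zero when T ignores the instance identity entirely, and it is bounded above by log |T|.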

  8. Clustering Example: γ = 0
  [Figure: instances 1-11 all assigned to a single cluster]
  At γ = 0 only the compression measure matters and the EM target is ignored: all instance IDs are mapped to the same value of T, i.e. total compression.

  9. Clustering Example: γ = 1
  [Figure: each of the instances 1-11 placed in its own cluster]
  At γ = 1 only the EM target matters: each instance ID keeps its own value of T (T ≡ ID), i.e. total preservation.

  10. Clustering Example: intermediate γ
  [Figure: instances 1-11 grouped into two meaningful clusters, |T| = 2]
  For an intermediate value of γ, the trade-off between the EM target and the compression measure yields the desired solution: the instances are grouped into |T| = 2 meaningful clusters.

  11. Information Bottleneck EM
  Rewriting the target in terms of the EM functional shows a formal equivalence with the Information Bottleneck; at γ = 1, EM and the Information Bottleneck coincide [generalizing the result of Slonim and Weiss for the univariate case].

  12. Information Bottleneck EM
  For a fixed P, the maximum of the target with respect to Q(T|Y) is obtained at a self-consistent fixed point built from three ingredients: the marginal of T in Q, the prediction of T using P, and a normalization term.
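  The equation did not survive extraction; combining the three ingredients the slide lists (marginal of T in Q, prediction of T using P, normalization) with the usual Bottleneck-style self-consistent equations, the fixed point plausibly has the form

    Q(t \mid y) = \frac{1}{Z(y, \gamma)} \, Q(t)^{1-\gamma} \, P(t, \mathbf{x}[y])^{\gamma}

  where x[y] is the observation of instance y and Z(y, γ) normalizes over t. At γ = 1 this reduces to the ordinary EM posterior; at γ = 0 it collapses onto the marginal Q(t).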

  13. The IB-EM Algorithm for fixed γ
  Iterate until convergence:
  • E-step: maximize L_IB-EM by optimizing Q
  • M-step: maximize L_IB-EM by optimizing P (same as the standard M-step)
  Each step improves L_IB-EM, so the procedure is guaranteed to converge.
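  A minimal runnable sketch of this fixed-γ inner loop, for a naive Bayes model with a K-state hidden variable over binary features; the fixed-point exponents follow the reconstruction above rather than the paper's own derivation, and the smoothing constants are arbitrary choices of mine:

    import numpy as np

    def ib_em_fixed_gamma(X, K, gamma, Q0=None, n_iter=50, seed=0):
        # One IB-EM inner loop at a fixed trade-off gamma, for a naive Bayes
        # model with hidden T (K states) over binary features X of shape (N, D).
        rng = np.random.default_rng(seed)
        N, D = X.shape
        Q = rng.dirichlet(np.ones(K), size=N) if Q0 is None else Q0.copy()
        for _ in range(n_iter):
            # M-step: (smoothed) maximum-likelihood parameters of P from the soft completion Q.
            prior = (Q.sum(axis=0) + 1.0) / (N + K)                   # P(T); also stands in for Q(t)
            theta = (Q.T @ X + 1.0) / (Q.sum(axis=0)[:, None] + 2.0)  # P(X_d = 1 | T)
            # E-step: self-consistent update Q(t|y) ~ Q(t)^(1-gamma) * P(t, x[y])^gamma.
            log_joint = (X @ np.log(theta).T
                         + (1.0 - X) @ np.log(1.0 - theta).T
                         + np.log(prior))                             # log P(t, x[y])
            log_q = (1.0 - gamma) * np.log(prior) + gamma * log_joint
            log_q -= log_q.max(axis=1, keepdims=True)
            Q = np.exp(log_q)
            Q /= Q.sum(axis=1, keepdims=True)                         # normalization Z(y, gamma)
        return Q, prior, theta

  At γ = 1 the update is a standard EM iteration for the naive Bayes mixture; at γ = 0 it sets Q(t|y) = Q(t) for every instance, i.e. total compression as in slide 8.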

  14. Information Bottleneck EM (recap)
  Target: the same trade-off between the EM target and the information between the hidden variable and the instance ID. Having examined the objective, the remainder of the talk shows how to use it to learn better models.

  15. Continuation
  [Figure: the L_IB-EM surface over (Q, γ), easy at γ = 0 and hard at γ = 1]
  Idea: follow a ridge of the surface, starting from the optimum of the easy problem at γ = 0 and tracking it toward the hard problem at γ = 1.

  16. Continuation
  Recall that if Q is a local maximum of L_IB-EM, then the derivative of L_IB-EM with respect to Q(t|y) vanishes for all t and y. We want to follow a path in (Q, γ) space along which this condition keeps holding, i.e. a path of local maxima for all γ.
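  In standard continuation terms (my rendering; the slide's own equations did not survive extraction), the path is defined implicitly by the stationarity condition, and the tangent direction comes from differentiating that condition along the path:

    G(Q, \gamma) := \frac{\partial L_{IB-EM}}{\partial Q(t \mid y)} = 0 \quad \forall t, y,
    \qquad
    \frac{\partial G}{\partial Q}\, dQ + \frac{\partial G}{\partial \gamma}\, d\gamma = 0

  Solving the second equation for (dQ, dγ) gives the direction used in the next slide's continuation step.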

  17. Continuation Step
  [Figure: a step along the path in (Q, γ) space, starting from the γ = 0 end]
  • Start at a point (Q, γ) at which the stationarity condition holds
  • Compute the gradient of this condition with respect to (Q, γ)
  • Take the direction orthogonal to the gradient (along which the condition is preserved)
  • Take a step in the desired direction

  18. Staying on the Ridge
  [Figure: a tangent step in (Q, γ) space drifting off the path]
  Potential problem: the direction is only tangent to the path, so a finite step can miss the optimum.
  Solution: use EM steps to regain the path.

  19. The IB-EM Algorithm
  • Set γ = 0 (start at the easy solution)
  • Iterate until γ = 1 (the EM solution is reached):
    • Iterate (stay on the ridge):
      • E-step: maximize L_IB-EM by optimizing Q
      • M-step: maximize L_IB-EM by optimizing P
    • Step (follow the ridge):
      • Compute the gradient and the orthogonal direction
      • Take the step by changing γ and Q
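  A simplified, runnable rendering of this outer loop, reusing ib_em_fixed_gamma from the slide-13 sketch. The talk follows the ridge with a gradient-based continuation step and a calibrated step size; here that is replaced by a fixed schedule of γ values with warm-started Q, which keeps the morph from the easy γ = 0 problem to the EM problem at γ = 1 but not the adaptive step computation:

    import numpy as np

    def ib_em_continuation(X, K, gammas=None, n_inner=25, seed=0):
        # Anneal gamma from 0 (fully compressed, easy problem) to 1 (the EM
        # objective), warm-starting Q at each step so the solution is dragged
        # along the ridge instead of being re-initialized at every gamma.
        # Assumes ib_em_fixed_gamma from the earlier sketch is in scope.
        if gammas is None:
            gammas = np.linspace(0.0, 1.0, 21)   # naive uniform schedule (cf. slide 21)
        Q = None
        for gamma in gammas:
            Q, prior, theta = ib_em_fixed_gamma(X, K, gamma, Q0=Q,
                                                n_iter=n_inner, seed=seed)
        return Q, prior, theta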

  20. Calibrating the Step Size
  [Figure: in (Q, γ) space, an overshooting step lands on an inferior solution]
  Potential problem:
  • Step size too small → too slow
  • Step size too large → overshoot the target

  21. Calibrating the Step Size
  [Figure: two plots of I(T;Y) against γ; a naive uniform grid of γ values is too sparse in the "interesting" area where I(T;Y) rises]
  Recall that I(T;Y) measures the compression of the instance ID; when I(T;Y) rises, more of the data is captured.
  • Non-parametric: involves only Q
  • Can be bounded: I(T;Y) ≤ log2|T|
  So use the change in I(T;Y), rather than a naive uniform schedule, to calibrate the step size.
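  The calibration quantity is easy to compute from Q alone. A small helper (names mine), taking Q(T|Y) as an (N, K) row-stochastic matrix with a uniform empirical Q(Y), and assuming Q is strictly positive (as it is when produced by the exp-and-normalize update above):

    import numpy as np

    def info_ty(Q):
        # I(T;Y) in bits for Q(t|y) given as an (N, K) row-stochastic matrix,
        # with Q(y) uniform over the N instances; bounded above by log2(K).
        marginal = Q.mean(axis=0)                                # Q(t)
        return (Q * np.log2(Q / marginal)).sum(axis=1).mean()

  Tracking the change of info_ty(Q) between γ steps gives the parameter-free, bounded signal the slide describes for choosing the next step size.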

  22. The Stock Dataset [Boyen et al., 1999]
  Naive Bayes model over the daily changes of 20 NASDAQ stocks; 1213 training and 303 test instances.
  [Figure: I(T;Y) and train likelihood (roughly -23 to -19) as functions of γ, IB-EM versus the best of EM; marks show the evaluated values of γ]
  • IB-EM outperforms the best of the EM solutions
  • I(T;Y) follows the changes of the likelihood
  • The continuation approximately follows the region of change

  23. Multiple Hidden Variables
  We want to learn a model with many hidden variables.
  Naive approach: computing Q(T|Y) exactly is potentially exponential in the number of hiddens.
  Variational approximation: use a factorized (Mean Field) form for Q(T|Y) [Friedman et al., 2002], giving
  L_IB-EM = γ · (Variational EM) - (1 - γ) · Regularization
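  The factorized form referred to here, written out (notation mine, with T_1 … T_K denoting the hidden variables), is the usual mean field approximation

    Q(T_1, \ldots, T_K \mid Y) = \prod_i Q(T_i \mid Y)

  so each hidden variable keeps its own per-instance posterior and the E-step scales linearly, rather than exponentially, in the number of hidden variables.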

  24. The USPS Digits Dataset
  400 samples, 21 hidden variables.
  [Figure: test log-loss per instance (about -342 to -330) against the percentage of random Mean Field EM runs (about 1 min/run), compared with a single IB-EM run and exact EM]
  • Superior to all Mean Field EM runs
  • Running time comparable to a single exact EM run (single IB-EM: 27 min; exact EM: about 25 min/run)
  • Only 3 of 50 exact EM runs match or beat IB-EM, so EM needs roughly x17 the time for similar results
  Offers good value for your time!

  25. Yeast Stress Response
  173 experiments (variables), 6152 genes (samples), 25 hidden variables, each a parent of 5-24 of the experiment variables.
  [Figure: test log-loss per instance (about -151.5 to -147.5) against the percentage of random Mean Field EM runs (about 0.5 hours/run)]
  • Superior to all Mean Field EM runs
  • An order of magnitude faster than exact EM (IB-EM: about 6 hours; exact EM: over 60 hours)
  Effective when the exact solution becomes intractable!

  26. Summary
  A new framework for learning hidden variables:
  • A formal relation between the Information Bottleneck and EM
  • Continuation for bypassing local maxima
  • Flexible: works with structure learning and with variational approximations
  Future work:
  • Learn an optimal γ ≤ 1 for better generalization
  • Explore other approximations of Q(T|Y)
  • Model selection: learning the cardinality of T and enriching the structure

  27. Relation to Weight Annealing [Elidan et al., 2002]
  [Figure: data instances 1 … M over X1 … XN with per-instance weights W]
  Weight annealing: initialize temp = hot and iterate until temp = cold:
  • Perturb the instance weights w in proportion to temp
  • Use the reweighted empirical distribution Q_W and optimize
  • Cool down
  Similarities:
  • Both change the empirical Q
  • Both morph towards the EM solution
  Differences:
  • IB-EM uses information regularization
  • IB-EM uses continuation
  • Weight annealing requires a cooling policy
  • Weight annealing is applicable to a wider range of problems

  28. Relation to Deterministic Annealing
  [Figure: data instances 1 … M over X1 … XN]
  Deterministic annealing: initialize temp = hot and iterate until temp = cold:
  • "Insert" entropy in proportion to temp into the model
  • Optimize the noisy model
  • Cool down
  Similarities:
  • Both use an information measure
  • Both morph towards the EM solution
  Differences:
  • Deterministic annealing is parameterization dependent
  • IB-EM uses continuation
  • Deterministic annealing requires a cooling policy
  • Deterministic annealing is applicable to a wider range of problems
