Loss-based Learning with Weak Supervision M. Pawan Kumar
About the Talk • Methods that use latent structured SVM • A little math-y • Initial stages
Outline • Latent SSVM • Ranking • Brain Activation Delays in M/EEG • Probabilistic Segmentation of MRI Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Weakly Supervised Data • Input x • Output y ∈ {-1,+1} • Hidden h • Example: y = +1
Weakly Supervised Classification • Feature vector Φ(x,h) • Joint feature vector Ψ(x,y,h)
Weakly Supervised Classification • Joint feature vector for the two labels: Ψ(x,+1,h) = [Φ(x,h); 0] and Ψ(x,-1,h) = [0; Φ(x,h)]
Weakly Supervised Classification • Joint feature vector Ψ(x,y,h) • Score f : Ψ(x,y,h) → (-∞, +∞) • Optimize the score over all possible y and h
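For concreteness, a minimal sketch of this construction in Python (the helper name `joint_feature` and the use of NumPy arrays are assumptions, not from the talk):

```python
import numpy as np

def joint_feature(phi, y):
    """Stack Phi(x,h) into the block selected by the label y in {-1,+1}:
    Psi(x,+1,h) = [Phi(x,h); 0],  Psi(x,-1,h) = [0; Phi(x,h)]."""
    zeros = np.zeros_like(phi)
    return np.concatenate([phi, zeros]) if y == +1 else np.concatenate([zeros, phi])
```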
Latent SSVM • Scoring function: w^T Ψ(x,y,h) • Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
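A hedged sketch of the prediction rule, assuming the hidden variable ranges over a small enumerable set and reusing the hypothetical `joint_feature` helper above (`phi` stands in for the feature extractor Φ):

```python
def predict(w, x, hidden_values, phi):
    """Latent SSVM prediction: maximise w^T Psi(x,y,h) jointly over y and h."""
    return max(((y, h) for y in (-1, +1) for h in hidden_values),
               key=lambda yh: w @ joint_feature(phi(x, yh[1]), yh[0]))  # (y(w), h(w))
```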
Learning Latent SSVM • Training data {(x_i, y_i), i = 1, 2, …, n} • w* = argmin_w Σ_i Δ(y_i, y_i(w)) • Minimize the empirical risk specified by the loss function Δ • Highly non-convex in w • Cannot regularize w to prevent overfitting
Learning Latent SSVM • Training data {(x_i, y_i), i = 1, 2, …, n} • Upper bound on the loss:
Δ(y_i, y_i(w)) = w^T Ψ(x_i, y_i(w), h_i(w)) + Δ(y_i, y_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ w^T Ψ(x_i, y_i(w), h_i(w)) + Δ(y_i, y_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} {w^T Ψ(x_i, y, h) + Δ(y_i, y)} - max_{h_i} w^T Ψ(x_i, y_i, h_i)
Learning Latent SSVM • Training data {(x_i, y_i), i = 1, 2, …, n} • min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h • Difference-of-convex program in w • Local minimum or saddle point solution (CCCP)
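For intuition, the slack ξ_i of a single example can be evaluated directly when y and h are enumerable; a sketch under the same assumptions as the helpers above (`loss` is a stand-in for Δ):

```python
def slack(w, x, y_true, hidden_values, phi, loss):
    """xi_i = max_{y,h}[w^T Psi(x,y,h) + Delta(y_i,y)] - max_h w^T Psi(x,y_i,h)."""
    lhs = max(w @ joint_feature(phi(x, h), y) + loss(y_true, y)
              for y in (-1, +1) for h in hidden_values)
    rhs = max(w @ joint_feature(phi(x, h), y_true) for h in hidden_values)
    return lhs - rhs  # non-negative: the max over (y,h) includes (y_true, h)
```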
CCCP • Start with an initial estimate of w • Impute hidden variables (loss independent): h_i* = argmax_h w^T Ψ(x_i, y_i, h) • Update w (loss dependent): min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - w^T Ψ(x_i, y_i, h_i*) ≤ ξ_i for all y, h • Repeat until convergence
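A schematic of the CCCP loop, with the convex inner problem abstracted behind a hypothetical solver `solve_convex_ssvm` (everything else reuses the sketches above):

```python
import numpy as np

def cccp(w0, data, hidden_values, phi, solve_convex_ssvm, max_iters=50, tol=1e-6):
    """Alternate between imputing hidden variables and a convex SSVM update."""
    w = w0
    for _ in range(max_iters):
        # Impute hidden variables (loss independent): h_i* = argmax_h w^T Psi(x_i, y_i, h)
        imputed = [max(hidden_values, key=lambda h: w @ joint_feature(phi(x, h), y))
                   for x, y in data]
        # Update w (loss dependent): solve the convex SSVM with each h_i fixed to h_i*
        w_new = solve_convex_ssvm(data, imputed)
        if np.linalg.norm(w_new - w) < tol:  # repeat until convergence
            break
        w = w_new
    return w
```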
Recap • Scoring function: w^T Ψ(x,y,h) • Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h) • Learning: min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h
Outline • Latent SSVM • Ranking • Brain Activation Delays in M/EEG • Probabilistic Segmentation of MRI Joint work with Aseem Behl and C. V. Jawahar
Ranking Example: six images ranked from Rank 1 to Rank 6, with all positives above all negatives: Average Precision = 1
Ranking Perfect ranking: Average Precision = 1, Accuracy = 1. Two imperfect rankings of the same six images: Average Precision = 0.92 and 0.81, each with Accuracy = 0.67. Accuracy cannot distinguish the two; AP can.
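These numbers follow from the definition of AP; a quick check with three positives among six images, where `pos_ranks` lists the 1-indexed ranks at which the positives appear:

```python
def average_precision(pos_ranks, num_pos):
    """AP = mean over positives of the precision at that positive's rank."""
    return sum((i + 1) / r for i, r in enumerate(sorted(pos_ranks))) / num_pos

print(average_precision([1, 2, 3], 3))  # 1.0    (perfect ranking)
print(average_precision([1, 2, 4], 3))  # 0.9166... ~ 0.92
print(average_precision([1, 3, 4], 3))  # 0.8055... ~ 0.81
```

Thresholding at rank 3, both imperfect rankings misclassify one positive and one negative, so each has accuracy 4/6 ≈ 0.67 even though their AP values differ.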
Ranking • During testing, AP is frequently used to evaluate a ranking • During training, a surrogate loss is used • Contradictory to loss-based learning • Goal: optimize AP directly
Outline • Latent SSVM • Ranking • Supervised Learning • Weakly Supervised Learning • Latent AP-SVM • Experiments • Brain Activation Delays in M/EEG • Probabilistic Segmentation of MRI Yue, Finley, Radlinski and Joachims, 2007
Supervised Learning - Input • Training images X • Bounding boxes H = {H_P, H_N} for the positive set P and the negative set N
Supervised Learning - Output • Ranking matrix Y: Y_ik = +1 if i is ranked higher than k, -1 if k is ranked higher than i, 0 if i and k are ranked equally • Optimal ranking Y*
SSVM Formulation • Joint feature vector: Ψ(X, Y, {H_P, H_N}) = (1/(|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i, h_i) - Φ(x_k, h_k)) • Scoring function: w^T Ψ(X, Y, {H_P, H_N})
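A sketch of this joint feature map, assuming the per-sample features Φ(x_i, h_i) have already been stacked into arrays (names hypothetical):

```python
import numpy as np

def ranking_joint_feature(Y, phi_pos, phi_neg):
    """Psi(X,Y,{H_P,H_N}) = (1/(|P||N|)) sum_{i in P, k in N} Y_ik (Phi_i - Phi_k).

    Y: (|P|, |N|) matrix with entries in {-1, 0, +1}
    phi_pos: (|P|, d) features of positive samples
    phi_neg: (|N|, d) features of negative samples
    """
    P, N = Y.shape
    # sum_ik Y_ik Phi_i = (row sums of Y) @ phi_pos; similarly for the negatives
    pos_term = Y.sum(axis=1) @ phi_pos
    neg_term = Y.sum(axis=0) @ phi_neg
    return (pos_term - neg_term) / (P * N)
```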
Prediction using SSVM • Y(w) = argmax_Y w^T Ψ(X, Y, {H_P, H_N}) • Sort the samples by the value of the sample score w^T Φ(x_i, h_i) • Same as a standard binary SVM
Learning SSVM • min_w Δ(Y*, Y(w)) • Loss = 1 - AP of the prediction
Learning SSVM • Upper bound on the loss:
Δ(Y*, Y(w)) = w^T Ψ(X, Y(w), {H_P, H_N}) + Δ(Y*, Y(w)) - w^T Ψ(X, Y(w), {H_P, H_N})
≤ w^T Ψ(X, Y(w), {H_P, H_N}) + Δ(Y*, Y(w)) - w^T Ψ(X, Y*, {H_P, H_N})
Learning SSVM • min_w ||w||^2 + C ξ s.t. max_Y {w^T Ψ(X, Y, {H_P, H_N}) + Δ(Y*, Y)} - w^T Ψ(X, Y*, {H_P, H_N}) ≤ ξ • The maximization over Y is the loss augmented inference
Loss Augmented Inference • Rank the positives according to their sample scores • Rank the negatives according to their sample scores, below all the positives • Slide the best negative to a higher rank, continuing until the score stops increasing • Slide the next negative to a higher rank in the same way • Terminate after considering the last negative • This procedure gives the optimal loss augmented inference
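A direct, unoptimized transcription of this sliding procedure (Yue et al., 2007 derive a closed form that finds each negative's optimal position without a rank-by-rank search); `s_pos` and `s_neg` are the sample scores w^T Φ, and the objective maximised is w^T Ψ(X, Y, ·) + Δ(Y*, Y) with Δ = 1 - AP:

```python
def objective(order, P, N):
    """order: list of (score, is_positive) from rank 1 downwards.
    Returns w^T Psi + (1 - AP) for the ranking encoded by `order`."""
    ap, pos_seen = 0.0, 0
    for rank, (_, is_pos) in enumerate(order, start=1):
        if is_pos:
            pos_seen += 1
            ap += pos_seen / rank                  # precision at this positive
    ap /= P
    score = 0.0
    for a in range(len(order)):
        sa, pa = order[a]
        for b in range(a + 1, len(order)):
            sb, pb = order[b]
            if pa and not pb:                      # positive above negative: Y_ik = +1
                score += sa - sb
            elif pb and not pa:                    # negative above positive: Y_ik = -1
                score -= sb - sa
    return score / (P * N) + (1.0 - ap)

def loss_augmented_inference(s_pos, s_neg):
    """Greedy slides: start with all negatives below all positives, then slide
    each negative upwards while the objective keeps increasing.
    Assumes distinct scores for simplicity."""
    s_pos, s_neg = sorted(s_pos, reverse=True), sorted(s_neg, reverse=True)
    P, N = len(s_pos), len(s_neg)
    order = [(s, True) for s in s_pos] + [(s, False) for s in s_neg]
    for j in range(N):                             # best-scoring negative first
        idx = order.index((s_neg[j], False))
        while idx > 0 and order[idx - 1][1]:       # only slide past positives
            swapped = order[:idx - 1] + [order[idx], order[idx - 1]] + order[idx + 1:]
            if objective(swapped, P, N) <= objective(order, P, N):
                break                              # score stopped increasing
            order, idx = swapped, idx - 1
    return order
```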
Recap • Scoring function: w^T Ψ(X, Y, {H_P, H_N}) • Prediction: Y(w) = argmax_Y w^T Ψ(X, Y, {H_P, H_N}) • Learning: using optimal loss augmented inference
Outline • Latent SSVM • Ranking • Supervised Learning • Weakly Supervised Learning • Latent AP-SVM • Experiments • Brain Activation Delays in M/EEG • Probabilistic Segmentation of MRI
Weakly Supervised Learning - Input Training images X
Weakly Supervised Learning - Latent • Training images X • Bounding boxes H_P for the positive images are latent • All bounding boxes in negative images are negative
Intuitive Prediction Procedure Select the best bounding box in each image
Intuitive Prediction Procedure Rank the images according to their sample scores
Weakly Supervised Learning - Output • Ranking matrix Y: Y_ik = +1 if i is ranked higher than k, -1 if k is ranked higher than i, 0 if i and k are ranked equally • Optimal ranking Y*
Latent SSVM Formulation • Joint feature vector: Ψ(X, Y, {H_P, H_N}) = (1/(|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i, h_i) - Φ(x_k, h_k)) • Scoring function: w^T Ψ(X, Y, {H_P, H_N})
Prediction using Latent SSVM • max_{Y,H} w^T Ψ(X, Y, {H_P, H_N})
Prediction using Latent SSVM • max_{Y,H} w^T Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i, h_i) - Φ(x_k, h_k)) • Chooses the best bounding box for the positives • Chooses the worst bounding box for the negatives, since the negatives enter with a negative sign • Not what we wanted
Learning Latent SSVM • min_w Δ(Y*, Y(w)) • Loss = 1 - AP of the prediction
Learning Latent SSVM • Upper bound on the loss:
Δ(Y*, Y(w)) = w^T Ψ(X, Y(w), {H_P(w), H_N(w)}) + Δ(Y*, Y(w)) - w^T Ψ(X, Y(w), {H_P(w), H_N(w)})
≤ w^T Ψ(X, Y(w), {H_P(w), H_N(w)}) + Δ(Y*, Y(w)) - max_H w^T Ψ(X, Y*, {H_P, H_N})
Learning Latent SSVM • min_w ||w||^2 + C ξ s.t. max_{Y,H} {w^T Ψ(X, Y, {H_P, H_N}) + Δ(Y*, Y)} - max_H w^T Ψ(X, Y*, {H_P, H_N}) ≤ ξ • This loss augmented inference cannot be solved optimally
Recap • Unintuitive prediction • Unintuitive objective function • Non-optimal loss augmented inference • Can we do better?
Outline • Latent SSVM • Ranking • Supervised Learning • Weakly Supervised Learning • Latent AP-SVM • Experiments • Brain Activation Delays in M/EEG • Probabilistic Segmentation of MRI
Latent AP-SVM Formulation • Joint feature vector: Ψ(X, Y, {H_P, H_N}) = (1/(|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i, h_i) - Φ(x_k, h_k)) • Scoring function: w^T Ψ(X, Y, {H_P, H_N})
Prediction using Latent AP-SVM • Choose the best bounding box for all samples: h_i(w) = argmax_h w^T Φ(x_i, h) • Optimize over the ranking: Y(w) = argmax_Y w^T Ψ(X, Y, {H_P(w), H_N(w)}) • Sort by sample scores
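A sketch of this two-step prediction (names hypothetical; `boxes[i]` holds the candidate bounding boxes of sample i and `phi` the feature extractor Φ):

```python
def latent_ap_svm_predict(w, samples, boxes, phi):
    """Step 1: pick the best box per sample; Step 2: rank the samples by the
    resulting scores (sorting by sample score maximises w^T Psi over Y)."""
    best = []
    for x, candidates in zip(samples, boxes):
        h = max(candidates, key=lambda b: w @ phi(x, b))  # h_i(w) = argmax_h w^T Phi(x_i,h)
        best.append((w @ phi(x, h), x, h))
    best.sort(key=lambda t: t[0], reverse=True)           # ranking Y(w)
    return best
```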