Learning with Limited Supervision by Input and Output Coding Yi Zhang Machine Learning Department Carnegie Mellon University April 30th, 2012
Thesis Committee • Jeff Schneider, Chair • Geoff Gordon • Tom Mitchell • Xiaojin (Jerry) Zhu, University of Wisconsin-Madison
Introduction • Learning a prediction system, usually based on training examples (x1, y1), …, (xn, yn) • Training examples are usually limited • Cost of obtaining high-quality examples • Complexity of the prediction problem • [Diagram: a learning system maps inputs X to outputs Y]
Introduction • Solution: exploit extra information about the input and output space • Improve the prediction performance • Reduce the cost of collecting training examples • [Diagram: extra knowledge about X and Y feeds into learning from (x1, y1), …, (xn, yn)]
Introduction • Solution: exploit extra information about the input and output space • How can such information be represented and discovered? • How can it be incorporated into learning? • [Diagram: question marks over where the extra information about X and Y enters the learning pipeline]
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Regularization • The general formulation • Ridge regression • Lasso
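For reference, the three formulations named on this slide, written as a generic regularized empirical-risk minimization (the exact loss used on the slide is not recoverable, so squared loss is assumed for ridge and lasso):

```latex
% General regularized learning: loss plus a penalty R(w) with weight \lambda.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big) + \lambda\, R(w)
% Ridge regression: squared loss with R(w) = \|w\|_2^2.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top} x_i\big)^2 + \lambda\, \|w\|_2^2
% Lasso: squared loss with R(w) = \|w\|_1.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top} x_i\big)^2 + \lambda\, \|w\|_1
```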
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Learning with unlabeled text • For a text classification task, there is plenty of unlabeled text on the Web, seemingly unrelated to the task • What can we gain from such unlabeled text? • Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008
A motivating example for text learning • Humans learn text classification effectively! • Two training examples: +: [gasoline, truck]; -: [vote, election] • Query: [gallon, vehicle] • Seems very easy! But why? • Because gasoline ~ gallon and truck ~ vehicle
A covariance operator for regularization • Covariance structure of model coefficients • Usually unknown -- learn from unlabeled text?
Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic, e.g., [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words • For a new task, we learn with the estimated covariance as a regularizer (see the sketch below)
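A sketch of the regularized objective the last bullet refers to, assuming Σ is the word covariance estimated from the topic decomposition of the unlabeled text (λ and the loss ℓ are generic placeholders, not fixed by the slides):

```latex
% Covariance-regularized learning (sketch): \Sigma is the word covariance
% estimated from unlabeled text; the penalty corresponds to a N(0, \Sigma) prior on w.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
          + \lambda\, w^{\top} \Sigma^{-1} w
```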
Experiments • Empirical results on 20 newsgroups • 190 1-vs-1 classification tasks, 2% labeled examples • For any given task, the majority of the unlabeled text (18 of 20 newsgroups) is irrelevant • Similar results with logistic regression and least squares • [1] V. Sindhwani and S. Keerthi. Large Scale Semi-supervised Linear SVMs. SIGIR 2006
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Multi-task learning • Different but related prediction tasks • An example • Landmine detection using radar images • Multiple tasks: different landmine fields • Geographic conditions • Landmine types • Goal: information sharing among tasks
Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W, with one row of model coefficients per task • A covariance operator for regularizing a matrix? • Vector w: a Gaussian prior gives the penalty w^T Σ^{-1} w; what is the analogue for a matrix W? • Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010
Matrix-normal distributions • Consider a 2 by 3 matrix W: its full (6 by 6) covariance is modeled as the Kronecker product of a 2 by 2 row covariance and a 3 by 3 column covariance • The matrix-normal density offers a compact form for this structured full covariance (see the sketch below)
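The matrix-normal density mentioned above, in standard notation for an m by p matrix W with mean M, row covariance Ω_r, and column covariance Ω_c (the symbols are mine, not taken from the slides):

```latex
% Matrix-normal distribution (standard form):
W \sim \mathcal{MN}(M,\, \Omega_r,\, \Omega_c)
\;\Longleftrightarrow\;
\operatorname{vec}(W) \sim \mathcal{N}\big(\operatorname{vec}(M),\, \Omega_c \otimes \Omega_r\big)
% Compact density:
p(W) = \frac{\exp\!\big(-\tfrac{1}{2}\operatorname{tr}\!\big[\Omega_r^{-1}(W - M)\,\Omega_c^{-1}(W - M)^{\top}\big]\big)}
            {(2\pi)^{mp/2}\,\lvert\Omega_r\rvert^{p/2}\,\lvert\Omega_c\rvert^{m/2}}
```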
Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior on W • Alternating optimization over W and the row/column covariances (see the sketch below) • Other recent work can be viewed as variants of special cases: • Multi-task feature learning [Argyriou et al., NIPS 06]: learning with the feature covariance • Clustered multi-task learning [Jacob et al., NIPS 08]: learning with the task covariance and spectral constraints • Multi-task relationship learning [Zhang et al., UAI 10]: learning with the task covariance
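A sketch of the penalized objective a zero-mean matrix-normal prior induces, with w_t denoting row t of W and ℓ the loss on task t (my notation; the exact treatment of the log-determinant terms is omitted):

```latex
% Matrix-normal penalty for multi-task learning (sketch):
\min_{W,\,\Omega_r,\,\Omega_c}\;
\sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell\big(y_i^{(t)},\, w_t^{\top} x_i^{(t)}\big)
+ \lambda\, \operatorname{tr}\!\big[\Omega_r^{-1}\, W\, \Omega_c^{-1}\, W^{\top}\big]
+ \text{(log-determinant terms from the prior)}
% Alternating optimization: fix (\Omega_r, \Omega_c) and solve a regularized
% problem for W; then fix W and update the row/column covariances.
```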
Sparse covariance selection • Sparse covariance selection in matrix-normal penalties: sparsity of the inverse row and column covariances encodes conditional independence among the rows (tasks) and columns (feature dimensions) of W • Alternating optimization • Estimating W: same as before • Estimating the row and column covariances: L1-penalized covariance estimation (see the sketch below)
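The covariance-update step with an L1 penalty plausibly takes a graphical-lasso-like form; a sketch under that assumption, shown for the row covariance with S the sample second-moment matrix built from the current W:

```latex
% Sparse inverse-covariance update (sketch, graphical-lasso form),
% assuming S = (1/p)\, W \Omega_c^{-1} W^{\top} for the row-covariance step:
\hat{\Omega}_r^{-1} = \arg\min_{\Theta \succ 0}\;
\operatorname{tr}(S\,\Theta) - \log\det\Theta + \gamma\,\|\Theta\|_{1}
```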
Results on multi-task learning • Landmine detection: multiple landmine fields • Face recognition: multiple 1-vs-1 tasks • [1] Jacob, Bach, and Vert. Clustered Multi-Task Learning: A Convex Formulation. NIPS 2008 • [2] Argyriou, Evgeniou, and Pontil. Multi-Task Feature Learning. NIPS 2006
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Learning compressible models • A compression operator P instead of a covariance operator • Bias: model compressibility (see the sketch below) • Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
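A sketch of the compressibility bias as a penalty: instead of penalizing w directly, penalize the compressed coefficients Pw (my rendering; the slide does not fix the loss or the norm):

```latex
% Compressible-model regularization (sketch): P is a compression operator
% (e.g., a 2D-DCT); sparsity is encouraged in the compressed domain Pw.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
          + \lambda\, \|P w\|_{1}
```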
Energy compaction • Image energy is concentrated at a few frequencies • Models need to operate at the relevant frequencies • [Figure: an image and its JPEG (2D-DCT) reconstruction at 46:1 compression, plus the 2D-DCT of the image]
Digit recognition: sparse vs. compressible • [Figure: model coefficients w displayed as an image, for sparse (L1 on w) vs. compressible (L1 on the compressed coefficients Pw) regularization, together with the compressed coefficients Pw]
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Dimension reduction • Dimension reduction conveys information about the input space • Feature selection: which features are important • Feature clustering: the granularity of features • Feature extraction: more general structures
How to use a dimension reduction? • However, any reduction loses certain information, which may be relevant to a prediction task • Goal of projection penalties: encode useful information from a dimension reduction, while controlling the risk of potential information loss • Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Projection penalties: the basic idea • Observation: reducing the feature space restricts the model search to a model subspace MP • Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP
Projection penalties: linear cases • Learn with a (linear) dimension reduction P
Projection penalties: linear cases • Learn with projection penalties: optimize in the full model space while penalizing the projection distance to the model subspace induced by P (see the sketch below)
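A sketch of the linear case, assuming the reduction maps x to Px and v parameterizes models in the reduced space (my notation):

```latex
% Learning restricted to the reduced space (model subspace M_P):
\min_{v} \sum_{i=1}^{n} \ell\big(y_i,\, v^{\top} P x_i\big) + \lambda\,\|v\|^2
% Projection penalty (sketch): search the full model space M, but penalize
% the distance from w to the subspace \{P^{\top} v\} induced by the reduction.
\min_{w,\,v} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
             + \lambda\, \big\|\, w - P^{\top} v \,\big\|^2
```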
Projection penalties: nonlinear cases • [Diagram: the linear picture (a reduction P from R^d to R^p) extended to nonlinear reductions between feature spaces F and F'; in each case the full model space M, the reduced model subspace MP, and the projection wP of w onto MP] • Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Empirical results • Text classification (20 newsgroups), using logistic regression; dimension reduction: latent Dirichlet allocation • [Bar chart: classification errors at 2%, 5%, and 10% training data, comparing the original features (Orig), the reduction alone (Red), and the projection penalty (Proj)] • Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace • Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Multi-label classification • Existence of certain label dependency • Example: classify an image into scenes (deserts, river, forest, etc.) • The multi-class problem is a special case: only one class is true • [Diagram: learn to predict labels y1, y2, …, yq from x, with dependency among the labels]
Output coding • Encode the labels y = (y1, …, yq) into a code z = (z1, …, zd), learn to predict z from x, then decode the predicted code back into labels • d < q: compression, i.e., source coding • d > q: error-correcting codes, i.e., channel coding; use the redundancy to correct prediction (“transmission”) errors
Error-correcting output codes (ECOCs) • Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer 2001] • Encode the classes into a (redundant) set of binary problems, e.g., {y1, y2} vs. y3 or {y3, y4} vs. y7 • Learn to predict the code z = (z1, …, zt) • Decode the predictions into labels y1, …, yq (see the sketch below) • Our goal: design ECOCs for multi-label classification
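A minimal multi-class ECOC sketch illustrating the encode / learn / decode steps above. This is a generic illustration with a random code matrix and Hamming-distance decoding, not the thesis's multi-label construction; the code length and classifier choice are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, n_classes, code_len=15, seed=0):
    """Encode classes with a random +/-1 code matrix and learn one binary
    classifier per code bit (column)."""
    rng = np.random.RandomState(seed)
    code = rng.choice([-1, 1], size=(n_classes, code_len))  # one codeword per class
    classifiers = []
    for j in range(code_len):
        bits = code[y, j]  # encoding: the j-th code bit of each example's class
        # Note: assumes each column induces both +1 and -1 labels on the data.
        classifiers.append(LogisticRegression().fit(X, bits))
    return code, classifiers

def predict_ecoc(X, code, classifiers):
    """Predict the code bits, then decode each example to the class whose
    codeword is closest in Hamming distance."""
    Z = np.column_stack([clf.predict(X) for clf in classifiers])
    hamming = (Z[:, None, :] != code[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)
```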
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Composite likelihood • The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods • e.g., the pairwise likelihood or the full conditional likelihood (see the sketch below) • Estimation using composite likelihoods • Computational and statistical efficiency • Robustness under model misspecification
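A sketch of the two component choices named above, written for a label vector y = (y1, …, yq) conditioned on an input x (standard definitions; conditioning on x is added to match the multi-label setting):

```latex
% Composite likelihood: a product of simple component likelihoods.
% Pairwise likelihood:
\mathcal{CL}_{\text{pair}}(\theta) = \prod_{i < j} p\big(y_i, y_j \mid x;\, \theta\big)
% Full conditional likelihood:
\mathcal{CL}_{\text{cond}}(\theta) = \prod_{i=1}^{q} p\big(y_i \mid y_{-i},\, x;\, \theta\big)
```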
Multi-label problem decomposition • Problem decomposition methods • Decomposition into subproblems (encoding) • Decision making by combining subproblem predictions (decoding) • Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc. • [Diagram: x is mapped to a set of subproblems whose predictions are combined into y1, y2, …, yq]
1-vs-All (Binary Relevance) • Classify each label independently • The composite likelihood view (see the sketch below) • [Diagram: each label y1, y2, …, yq is predicted from x independently]
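Under the composite likelihood view, 1-vs-all (binary relevance) corresponds to taking the per-label marginals as the components (my rendering of the slide's claim):

```latex
% Binary relevance as a composite likelihood (sketch):
\mathcal{CL}_{\text{BR}}(\theta) = \prod_{i=1}^{q} p\big(y_i \mid x;\, \theta_i\big)
```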