Learning with Limited Supervision by Input and Output Coding Yi Zhang Machine Learning Department Carnegie Mellon University April 30th, 2012
Thesis Committee • Jeff Schneider, Chair • Geoff Gordon • Tom Mitchell • Xiaojin (Jerry) Zhu, University of Wisconsin-Madison
Introduction • Learning a prediction system, usually based on training examples (x1, y1), …, (xn, yn) • Training examples are usually limited • Cost of obtaining high-quality examples • Complexity of the prediction problem • [Diagram: a learning system maps inputs X to outputs Y]
Introduction • Solution: exploit extra information about the input and output space • Improve the prediction performance • Reduce the cost of collecting training examples • [Diagram: extra knowledge about X and Y feeds into learning from (x1, y1), …, (xn, yn)]
Introduction • Solution: exploit extra information about the input and output space • How can such information be represented and discovered? • How can it be incorporated into learning? • [Diagram: question marks over where the extra information about X and Y enters the learning pipeline]
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Regularization • The general formulation • Ridge regression • Lasso
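For reference, the three formulations named on this slide, written as a generic regularized empirical-risk minimization (the exact loss used on the slide is not recoverable, so squared loss is assumed for ridge and lasso):

```latex
% General regularized learning: loss plus a penalty R(w) with weight \lambda.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big) + \lambda\, R(w)
% Ridge regression: squared loss with R(w) = \|w\|_2^2.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top} x_i\big)^2 + \lambda\, \|w\|_2^2
% Lasso: squared loss with R(w) = \|w\|_1.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top} x_i\big)^2 + \lambda\, \|w\|_1
```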
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Learning with unlabeled text • For a text classification task, there is plenty of unlabeled text on the Web, seemingly unrelated to the task • What can we gain from such unlabeled text? • Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008
A motivating example for text learning • Humans learn text classification effectively! • Two training examples: +: [gasoline, truck]; -: [vote, election] • Query: [gallon, vehicle] • Seems very easy! But why? • Because gasoline ~ gallon and truck ~ vehicle
A covariance operator for regularization • Covariance structure of model coefficients • Usually unknown -- learn from unlabeled text?
Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic, e.g., [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words • For a new task, we learn with the estimated covariance as a regularizer (see the sketch below)
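A sketch of the regularized objective the last bullet refers to, assuming Σ is the word covariance estimated from the topic decomposition of the unlabeled text (λ and the loss ℓ are generic placeholders, not fixed by the slides):

```latex
% Covariance-regularized learning (sketch): \Sigma is the word covariance
% estimated from unlabeled text; the penalty corresponds to a N(0, \Sigma) prior on w.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
          + \lambda\, w^{\top} \Sigma^{-1} w
```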
Experiments • Empirical results on 20 newsgroups • 190 1-vs-1 classification tasks, 2% labeled examples • For any given task, the majority of the unlabeled text (18 of 20 newsgroups) is irrelevant • Similar results with logistic regression and least squares • [1] V. Sindhwani and S. Keerthi. Large Scale Semi-supervised Linear SVMs. SIGIR 2006
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Multi-task learning • Different but related prediction tasks • An example • Landmine detection using radar images • Multiple tasks: different landmine fields • Geographic conditions • Landmine types • Goal: information sharing among tasks
Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W, with one row of model coefficients per task • A covariance operator for regularizing a matrix? • Vector w: a Gaussian prior gives the penalty w^T Σ^{-1} w; what is the analogue for a matrix W? • Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010
Matrix-normal distributions • Consider a 2 by 3 matrix W: its full (6 by 6) covariance is modeled as the Kronecker product of a 2 by 2 row covariance and a 3 by 3 column covariance • The matrix-normal density offers a compact form for this structured full covariance (see the sketch below)
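The matrix-normal density mentioned above, in standard notation for an m by p matrix W with mean M, row covariance Ω_r, and column covariance Ω_c (the symbols are mine, not taken from the slides):

```latex
% Matrix-normal distribution (standard form):
W \sim \mathcal{MN}(M,\, \Omega_r,\, \Omega_c)
\;\Longleftrightarrow\;
\operatorname{vec}(W) \sim \mathcal{N}\big(\operatorname{vec}(M),\, \Omega_c \otimes \Omega_r\big)
% Compact density:
p(W) = \frac{\exp\!\big(-\tfrac{1}{2}\operatorname{tr}\!\big[\Omega_r^{-1}(W - M)\,\Omega_c^{-1}(W - M)^{\top}\big]\big)}
            {(2\pi)^{mp/2}\,\lvert\Omega_r\rvert^{p/2}\,\lvert\Omega_c\rvert^{m/2}}
```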
Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior on W • Alternating optimization over W and the row/column covariances (see the sketch below) • Other recent work can be viewed as variants of special cases: • Multi-task feature learning [Argyriou et al., NIPS 06]: learning with the feature covariance • Clustered multi-task learning [Jacob et al., NIPS 08]: learning with the task covariance and spectral constraints • Multi-task relationship learning [Zhang et al., UAI 10]: learning with the task covariance
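A sketch of the penalized objective a zero-mean matrix-normal prior induces, with w_t denoting row t of W and ℓ the loss on task t (my notation; the exact treatment of the log-determinant terms is omitted):

```latex
% Matrix-normal penalty for multi-task learning (sketch):
\min_{W,\,\Omega_r,\,\Omega_c}\;
\sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell\big(y_i^{(t)},\, w_t^{\top} x_i^{(t)}\big)
+ \lambda\, \operatorname{tr}\!\big[\Omega_r^{-1}\, W\, \Omega_c^{-1}\, W^{\top}\big]
+ \text{(log-determinant terms from the prior)}
% Alternating optimization: fix (\Omega_r, \Omega_c) and solve a regularized
% problem for W; then fix W and update the row/column covariances.
```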
Sparse covariance selection • Sparse covariance selection in matrix-normal penalties: sparsity of the inverse row and column covariances encodes conditional independence among the rows (tasks) and columns (feature dimensions) of W • Alternating optimization • Estimating W: same as before • Estimating the row and column covariances: L1-penalized covariance estimation (see the sketch below)
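The covariance-update step with an L1 penalty plausibly takes a graphical-lasso-like form; a sketch under that assumption, shown for the row covariance with S the sample second-moment matrix built from the current W:

```latex
% Sparse inverse-covariance update (sketch, graphical-lasso form),
% assuming S = (1/p)\, W \Omega_c^{-1} W^{\top} for the row-covariance step:
\hat{\Omega}_r^{-1} = \arg\min_{\Theta \succ 0}\;
\operatorname{tr}(S\,\Theta) - \log\det\Theta + \gamma\,\|\Theta\|_{1}
```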
Results on multi-task learning • Landmine detection: multiple landmine fields • Face recognition: multiple 1-vs-1 tasks • [1] Jacob, Bach, and Vert. Clustered Multi-Task Learning: A Convex Formulation. NIPS 2008 • [2] Argyriou, Evgeniou, and Pontil. Multi-Task Feature Learning. NIPS 2006
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Learning compressible models • A compression operator P instead of a covariance operator • Bias: model compressibility (see the sketch below) • Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
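A sketch of the compressibility bias as a penalty: instead of penalizing w directly, penalize the compressed coefficients Pw (my rendering; the slide does not fix the loss or the norm):

```latex
% Compressible-model regularization (sketch): P is a compression operator
% (e.g., a 2D-DCT); sparsity is encouraged in the compressed domain Pw.
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
          + \lambda\, \|P w\|_{1}
```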
Energy compaction • Image energy is concentrated at a few frequencies • Models need to operate at the relevant frequencies • [Figure: an image and its JPEG (2D-DCT) reconstruction at 46:1 compression, plus the 2D-DCT of the image]
Digit recognition: sparse vs. compressible • [Figure: model coefficients w displayed as an image, for sparse (L1 on w) vs. compressible (L1 on the compressed coefficients Pw) regularization, together with the compressed coefficients Pw]
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Dimension reduction • Dimension reduction conveys information about the input space • Feature selection: which features are important • Feature clustering: the granularity of features • Feature extraction: more general structures
How to use a dimension reduction? • However, any reduction loses certain information, which may be relevant to a prediction task • Goal of projection penalties: encode useful information from a dimension reduction, while controlling the risk of potential information loss • Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Projection penalties: the basic idea • Observation: reducing the feature space restricts the model search to a model subspace MP • Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP
Projection penalties: linear cases • Learn with a (linear) dimension reduction P
Projection penalties: linear cases • Learn with projection penalties: optimize in the full model space while penalizing the projection distance to the model subspace induced by P (see the sketch below)
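A sketch of the linear case, assuming the reduction maps x to Px and v parameterizes models in the reduced space (my notation):

```latex
% Learning restricted to the reduced space (model subspace M_P):
\min_{v} \sum_{i=1}^{n} \ell\big(y_i,\, v^{\top} P x_i\big) + \lambda\,\|v\|^2
% Projection penalty (sketch): search the full model space M, but penalize
% the distance from w to the subspace \{P^{\top} v\} induced by the reduction.
\min_{w,\,v} \sum_{i=1}^{n} \ell\big(y_i,\, w^{\top} x_i\big)
             + \lambda\, \big\|\, w - P^{\top} v \,\big\|^2
```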
Projection penalties: nonlinear cases • [Diagram: the linear picture (a reduction P from R^d to R^p) extended to nonlinear reductions between feature spaces F and F'; in each case the full model space M, the reduced model subspace MP, and the projection wP of w onto MP] • Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Empirical results • Text classification (20 newsgroups), using logistic regression; dimension reduction: latent Dirichlet allocation • [Bar chart: classification errors at 2%, 5%, and 10% training data, comparing the original features (Orig), the reduction alone (Red), and the projection penalty (Proj)] • Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace • Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Multi-label classification • Existence of certain label dependency • Example: classify an image into scenes (deserts, river, forest, etc.) • The multi-class problem is a special case: only one class is true • [Diagram: learn to predict labels y1, y2, …, yq from x, with dependency among the labels]
Output coding • Encode the labels y = (y1, …, yq) into a code z = (z1, …, zd), learn to predict z from x, then decode the predicted code back into labels • d < q: compression, i.e., source coding • d > q: error-correcting codes, i.e., channel coding; use the redundancy to correct prediction (“transmission”) errors
Error-correcting output codes (ECOCs) • Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer 2001] • Encode the classes into a (redundant) set of binary problems, e.g., {y1, y2} vs. y3 or {y3, y4} vs. y7 • Learn to predict the code z = (z1, …, zt) • Decode the predictions into labels y1, …, yq (see the sketch below) • Our goal: design ECOCs for multi-label classification
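A minimal multi-class ECOC sketch illustrating the encode / learn / decode steps above. This is a generic illustration with a random code matrix and Hamming-distance decoding, not the thesis's multi-label construction; the code length and classifier choice are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, n_classes, code_len=15, seed=0):
    """Encode classes with a random +/-1 code matrix and learn one binary
    classifier per code bit (column)."""
    rng = np.random.RandomState(seed)
    code = rng.choice([-1, 1], size=(n_classes, code_len))  # one codeword per class
    classifiers = []
    for j in range(code_len):
        bits = code[y, j]  # encoding: the j-th code bit of each example's class
        # Note: assumes each column induces both +1 and -1 labels on the data.
        classifiers.append(LogisticRegression().fit(X, bits))
    return code, classifiers

def predict_ecoc(X, code, classifiers):
    """Predict the code bits, then decode each example to the class whose
    codeword is closest in Hamming distance."""
    Z = np.column_stack([clf.predict(X) for clf in classifiers])
    hamming = (Z[:, None, :] != code[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)
```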
Outline • Part I: Encoding Input Information by Regularization - Learning with word correlation - A matrix-normal penalty for multi-task learning (multi-task generalization) - Learn compressible models (go beyond covariance and correlation structures) - Projection penalties (encode a dimension reduction) • Part II: Encoding Output Information by Output Codes - Composite likelihood for pairwise coding - Multi-label output codes with CCA - Maximum-margin output coding
Composite likelihood • The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods • e.g., the pairwise likelihood or the full conditional likelihood (see the sketch below) • Estimation using composite likelihoods • Computational and statistical efficiency • Robustness under model misspecification
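A sketch of the two component choices named above, written for a label vector y = (y1, …, yq) conditioned on an input x (standard definitions; conditioning on x is added to match the multi-label setting):

```latex
% Composite likelihood: a product of simple component likelihoods.
% Pairwise likelihood:
\mathcal{CL}_{\text{pair}}(\theta) = \prod_{i < j} p\big(y_i, y_j \mid x;\, \theta\big)
% Full conditional likelihood:
\mathcal{CL}_{\text{cond}}(\theta) = \prod_{i=1}^{q} p\big(y_i \mid y_{-i},\, x;\, \theta\big)
```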
Multi-label problem decomposition • Problem decomposition methods • Decomposition into subproblems (encoding) • Decision making by combining subproblem predictions (decoding) • Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc. • [Diagram: x is mapped to a set of subproblems whose predictions are combined into y1, y2, …, yq]
1-vs-All (Binary Relevance) • Classify each label independently • The composite likelihood view (see the sketch below) • [Diagram: each label y1, y2, …, yq is predicted from x independently]
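Under the composite likelihood view, 1-vs-all (binary relevance) corresponds to taking the per-label marginals as the components (my rendering of the slide's claim):

```latex
% Binary relevance as a composite likelihood (sketch):
\mathcal{CL}_{\text{BR}}(\theta) = \prod_{i=1}^{q} p\big(y_i \mid x;\, \theta_i\big)
```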