Some Useful Machine Learning Tools M. Pawan Kumar École Centrale Paris École des Ponts ParisTech INRIA Saclay, Île-de-France
Outline • Part I: Supervised Learning • Part II: Weakly Supervised Learning
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Image Classification Is this an urban or rural area? Input: x Output: y ∈ {-1,+1}
Image Classification Is this scan healthy or unhealthy? Input: x Output: y ∈ {-1,+1}
Image Classification Which city is this? Input: x Output: y ∈ {1,2,…,C}
Image Classification What type of tumor does this scan contain? Input: x Output: y ∈ {1,2,…,C}
Object Detection Where is the object in the image? Input: x Output: y ∈ {Pixels}
Object Detection Where is the rupture in the scan? Input: x Output: y ∈ {Pixels}
Segmentation What is the semantic class of each pixel (e.g. sky, tree, car, road, grass)? Input: x Output: y ∈ {1,2,…,C}^|Pixels|
Segmentation What is the muscle group of each pixel? Input: x Output: y ∈ {1,2,…,C}^|Pixels|
A Simplified View of the Pipeline Input x → Extract Features Φ(x) (e.g. http://deeplearning.net) → Compute Scores f(Φ(x),y) with a learned f → Prediction y(f) = argmax_y f(Φ(x),y)
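A minimal Python sketch of this pipeline, assuming a user-supplied feature extractor and score function (the names extract_features, score, and labels are illustrative, not from the slides):

import numpy as np

def predict_output(x, theta, extract_features, score, labels):
    phi = extract_features(x)                         # features Phi(x)
    scores = [score(theta, phi, y) for y in labels]   # scores f(Phi(x), y) for every candidate y
    return labels[int(np.argmax(scores))]             # prediction y(f) = argmax_y f(Phi(x), y)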
Learning Objective Data distribution P(x,y) (the distribution is unknown) • f* = argmin_f E_{P(x,y)}[ Error(y(f), y) ] Error measures prediction quality; the expectation is over the data distribution; y(f) is the prediction and y the ground truth
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} • f* = argmin_f E_{P(x,y)}[ Error(y(f), y) ] Expectation over the (unknown) data distribution
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples) • f* = argmin_f Σ_i Error(y_i(f), y_i) Expectation over the empirical distribution
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples) • f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f) R(f) is a regularizer and λ its relative weight (a hyperparameter)
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Logistic Regression Input: x Output: y ∈ {-1,+1} Features: Φ(x) f(Φ(x),y) = y θᵀΦ(x) Prediction: sign(θᵀΦ(x)) P(y|x) = l(f(Φ(x),y)), where l(z) = 1/(1+e^{-z}) is the logistic function. Is the distribution normalized?
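A minimal sketch of the binary model, assuming the features Φ(x) are already available as a NumPy vector phi (the function names are illustrative):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))                   # l(z) = 1/(1 + e^{-z})

def prob(phi, y, theta):
    return logistic(y * np.dot(theta, phi))           # P(y|x) = l(y theta^T Phi(x))

def classify(phi, theta):
    return 1 if np.dot(theta, phi) >= 0 else -1       # prediction sign(theta^T Phi(x))

Since l(z) + l(-z) = 1, prob(phi,+1,theta) + prob(phi,-1,theta) = 1, so the distribution is indeed normalized.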
Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ R(θ) (negative log-likelihood plus a regularizer)
Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Convex optimization problem. Proof left as an exercise. Hint: prove that the Hessian H is PSD, i.e. aᵀHa ≥ 0 for all a
Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t. Repeat until the decrease in the objective is below a threshold
Gradient Descent [figures contrasting a small step size μ with a large step size μ]
Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t, with μ a small constant or set by line search. Repeat until the decrease in the objective is below a threshold
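A minimal sketch of gradient descent on the regularized negative log-likelihood, assuming the features are stacked in a matrix Phi (one row per sample) and the labels y take values ±1; mu is a small constant step size, as on the slide:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, Phi, y, lam):
    # L(theta) = sum_i -log P(y_i|x_i) + lambda ||theta||^2
    margins = y * (Phi @ theta)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(theta, theta)

def gradient(theta, Phi, y, lam):
    margins = y * (Phi @ theta)
    return -(Phi.T @ (y * logistic(-margins))) + 2.0 * lam * theta

def gradient_descent(Phi, y, lam=1.0, mu=0.01, tol=1e-6, max_iters=1000):
    theta = np.zeros(Phi.shape[1])                    # initial estimate theta_0
    prev = objective(theta, Phi, y, lam)
    for _ in range(max_iters):
        theta = theta - mu * gradient(theta, Phi, y, lam)   # theta <- theta - mu dL/dtheta
        cur = objective(theta, Phi, y, lam)
        if prev - cur < tol:                          # stop when the decrease is below a threshold
            break
        prev = cur
    return theta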
Newton’s Method Minimize g(z). Solution at iteration t = z_t. Define g_t(Δz) = g(z_t + Δz). Second-order Taylor series: g_t(Δz) ≈ g(z_t) + g'(z_t) Δz + ½ g''(z_t) (Δz)². Setting the derivative w.r.t. Δz to 0 gives g'(z_t) + g''(z_t) Δz = 0. Solving for Δz gives Δz = −g'(z_t)/g''(z_t), i.e. the curvature provides the learning rate
Newton’s Method Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t, with μ^{-1} = d²L(θ)/dθ² evaluated at θ_t. Repeat until the decrease in the objective is below a threshold
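A minimal sketch of the Newton update for the same objective, reusing logistic, objective, and gradient from the gradient-descent sketch above; for a vector θ the second derivative becomes the Hessian matrix, whose inverse plays the role of the learning rate:

import numpy as np

def hessian(theta, Phi, y, lam):
    # d^2 L / d theta^2 = sum_i l(m_i)(1 - l(m_i)) phi_i phi_i^T + 2 lambda I
    margins = y * (Phi @ theta)
    s = logistic(margins) * (1.0 - logistic(margins))
    return (Phi.T * s) @ Phi + 2.0 * lam * np.eye(Phi.shape[1])

def newton(Phi, y, lam=1.0, tol=1e-8, max_iters=50):
    theta = np.zeros(Phi.shape[1])
    prev = objective(theta, Phi, y, lam)
    for _ in range(max_iters):
        step = np.linalg.solve(hessian(theta, Phi, y, lam), gradient(theta, Phi, y, lam))
        theta = theta - step                          # theta <- theta - H^{-1} dL/dtheta
        cur = objective(theta, Phi, y, lam)
        if prev - cur < tol:
            break
        prev = cur
    return theta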
Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Train C 1-vs-all binary logistic regression classifiers. Prediction: the class whose classifier gives the maximum probability of +1. Simple extension, easy to code, but loses the probabilistic interpretation
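A minimal sketch of the 1-vs-all scheme, assuming a routine train_binary(Phi, y_pm, lam) that fits a binary logistic regression and returns θ (for instance the gradient_descent function above); the names are illustrative:

import numpy as np

def train_one_vs_all(Phi, labels, C, train_binary, lam=1.0):
    thetas = []
    for c in range(1, C + 1):
        y_pm = np.where(labels == c, 1, -1)           # class c vs the rest
        thetas.append(train_binary(Phi, y_pm, lam))
    return thetas

def predict_one_vs_all(phi, thetas):
    # The logistic function is monotone in the score, so the class with the
    # highest score also has the highest probability of +1.
    scores = [np.dot(theta, phi) for theta in thetas]
    return 1 + int(np.argmax(scores))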
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x) 0 0 … 0] Ψ(x,2) = [0 Φ(x) 0 … 0] … Ψ(x,C) = [0 0 0 … Φ(x)]
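A minimal sketch of this joint feature map, assuming Φ(x) is a NumPy vector phi; Ψ(x,y) places Φ(x) in the block corresponding to class y and zeros everywhere else:

import numpy as np

def joint_feature(phi, y, C):
    # Psi(x, y): block y holds Phi(x), the other C-1 blocks are zero.
    psi = np.zeros(C * len(phi))
    psi[(y - 1) * len(phi): y * len(phi)] = phi       # classes are numbered 1, ..., C
    return psi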
Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θᵀΨ(x,y) Prediction: max_y θᵀΨ(x,y) P(y|x) = exp(f(Ψ(x,y)))/Z(x), where the partition function is Z(x) = Σ_y exp(f(Ψ(x,y)))
Multiclass Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Convex optimization problem; can be solved with gradient descent, Newton’s method, and many others
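A minimal sketch of the multiclass model and its training objective, reusing joint_feature from the sketch above; Z(x) is computed by summing over the C classes:

import numpy as np

def class_probability(phi, y, theta, C):
    scores = np.array([np.dot(theta, joint_feature(phi, c, C)) for c in range(1, C + 1)])
    scores -= scores.max()                            # subtract the max for numerical stability
    Z = np.sum(np.exp(scores))                        # partition function Z(x)
    return np.exp(scores[y - 1]) / Z                  # P(y|x) = exp(f(Psi(x,y))) / Z(x)

def regularized_nll(phis, labels, theta, C, lam):
    # sum_i -log P(y_i|x_i) + lambda ||theta||^2
    nll = -sum(np.log(class_probability(phi, y, theta, C)) for phi, y in zip(phis, labels))
    return nll + lam * np.dot(theta, theta)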
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)] [Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j]
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)] [Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j] [Ψ(x,y_i), for all i; Ψ(x,y_c), for all subsets of variables c]
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θᵀΨ(x,y) Prediction: max_y θᵀΨ(x,y) P(y|x) = exp(f(Ψ(x,y)))/Z(x), where the partition function is Z(x) = Σ_y exp(f(Ψ(x,y)))
Regularized Maximum Likelihood Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² The partition function is expensive to compute, since the sum runs over C^m labelings; use approximate inference (Nikos Komodakis’ tutorial)
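A minimal sketch showing why the partition function is expensive: brute-force enumeration of all C^m labelings, assuming an illustrative joint feature map psi(phi, y_vec) for a complete labeling; this is only feasible for tiny m, which is why approximate inference is needed:

import itertools
import numpy as np

def partition_function(phi, theta, psi, C, m):
    # Z(x) = sum over all y in {1,...,C}^m of exp(theta^T Psi(x, y)): C^m terms.
    total = 0.0
    for y_vec in itertools.product(range(1, C + 1), repeat=m):
        total += np.exp(np.dot(theta, psi(phi, y_vec)))
    return total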
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine
Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x) 0 0 … 0] Ψ(x,2) = [0 Φ(x) 0 … 0] … Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wᵀΨ(x,y) Prediction: max_y wᵀΨ(x,y) Predicted output: y(w) = argmax_y wᵀΨ(x,y)
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Loss function for the i-th sample: Δ(y_i, y_i(w)) Minimize the regularized sum of the loss over the training data. Highly non-convex in w, and regularization plays no role (the loss depends on w only through the scale-invariant argmax, so overfitting may occur)
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Since wᵀΨ(x_i,y_i(w)) ≥ wᵀΨ(x_i,y_i): Δ(y_i, y_i(w)) ≤ wᵀΨ(x_i,y_i(w)) + Δ(y_i, y_i(w)) − wᵀΨ(x_i,y_i) ≤ max_y { wᵀΨ(x_i,y) + Δ(y_i,y) } − wᵀΨ(x_i,y_i) This upper bound on the loss is convex in w and sensitive to the regularization of w
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y Quadratic program with a polynomial number of constraints. Specialized software packages freely available http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
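A minimal sketch of the quantity bounded by the slack ξ_i: the margin-rescaled hinge loss obtained from the most violating label; it reuses joint_feature from above and assumes, purely for illustration, that Δ is the 0-1 loss:

import numpy as np

def hinge_loss(phi, y_true, w, C_classes):
    # xi_i = max_y [ w^T Psi(x_i, y) + Delta(y_i, y) ] - w^T Psi(x_i, y_i)
    def augmented_score(y):
        delta = 0.0 if y == y_true else 1.0           # Delta(y_i, y): 0-1 loss (assumption)
        return np.dot(w, joint_feature(phi, y, C_classes)) + delta
    best = max(augmented_score(y) for y in range(1, C_classes + 1))
    return best - np.dot(w, joint_feature(phi, y_true, C_classes))

Weighting the sum of this quantity over the training set by C and adding ||w||² recovers the objective of the quadratic program above in its unconstrained form.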
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine
Structured Output SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wᵀΨ(x,y) Prediction: max_y wᵀΨ(x,y)
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y Quadratic program with an exponential number of constraints. Many polynomial-time algorithms
Cutting Plane Algorithm Define working sets W_i = {} REPEAT: (1) Update w by solving min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ W_i (2) Compute the most violated constraint for each sample, ŷ_i = argmax_y wᵀΨ(x_i,y) + Δ(y_i,y) (3) Update the working sets W_i by adding ŷ_i
Cutting Plane Algorithm Termination criterion: the violation of ŷ_i is less than ξ_i + ε, for all i. Number of iterations = max{O(n/ε), O(C/ε²)}. At each iteration the convex dual of the problem increases, and the convex dual can be upper bounded. Ioannis Tsochantaridis et al., JMLR 2005 http://svmlight.joachims.org/svm_struct.html
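A minimal sketch of the cutting-plane loop, with the restricted QP solver and the loss-augmented inference step left as assumed helpers: solve_qp(working_sets, data) should return (w, xi) for the constraints currently in the working sets, loss_augmented_inference(w, x, y_true) should return argmax_y wᵀΨ(x,y) + Δ(y_true,y), and psi and delta implement Ψ and Δ (all names are illustrative, not the SVMstruct API):

import numpy as np

def cutting_plane(data, psi, delta, solve_qp, loss_augmented_inference,
                  epsilon=1e-3, max_iters=100):
    working_sets = [set() for _ in data]              # W_i = {}
    w, xi = solve_qp(working_sets, data)
    for _ in range(max_iters):
        added = False
        for i, (x, y_true) in enumerate(data):
            y_hat = loss_augmented_inference(w, x, y_true)     # most violated constraint
            violation = (np.dot(w, psi(x, y_hat)) + delta(y_true, y_hat)
                         - np.dot(w, psi(x, y_true)))
            if violation > xi[i] + epsilon:           # constraint still violated by more than epsilon
                working_sets[i].add(y_hat)            # y_hat must be hashable, e.g. a tuple
                added = True
        if not added:                                 # termination criterion from the slide
            break
        w, xi = solve_qp(working_sets, data)          # re-solve the restricted QP
    return w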
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ {1,2,…,C}^m Number of constraints = n·C^m
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ Y
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,z_i) + Δ(y_i,z_i) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all z_i ∈ Y