Some Useful Machine Learning Tools M. Pawan Kumar École Centrale Paris École des Ponts ParisTech INRIA Saclay, Île-de-France
Outline • Part I: Supervised Learning • Part II: Weakly Supervised Learning
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Image Classification Is this an urban or rural area? Input: x Output: y ∈ {-1,+1}
Image Classification Is this scan healthy or unhealthy? Input: x Output: y ∈ {-1,+1}
Image Classification Which city is this? Input: x Output: y ∈ {1,2,…,C}
Image Classification What type of tumor does this scan contain? Input: x Output: y ∈ {1,2,…,C}
Object Detection Where is the object in the image? Input: x Output: y ∈ {Pixels}
Object Detection Where is the rupture in the scan? Input: x Output: y ∈ {Pixels}
Segmentation What is the semantic class of each pixel (e.g. sky, tree, car, road, grass)? Input: x Output: y ∈ {1,2,…,C}^|Pixels|
Segmentation What is the muscle group of each pixel? Input: x Output: y ∈ {1,2,…,C}^|Pixels|
A Simplified View of the Pipeline Input x → Extract Features Φ(x) (e.g. http://deeplearning.net) → Compute Scores f(Φ(x),y) with a learned f → Prediction y(f) = argmax_y f(Φ(x),y)
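A minimal Python sketch of this pipeline, assuming a user-supplied feature extractor and score function (the names extract_features, score, and labels are illustrative, not from the slides):

import numpy as np

def predict_output(x, theta, extract_features, score, labels):
    phi = extract_features(x)                         # features Phi(x)
    scores = [score(theta, phi, y) for y in labels]   # scores f(Phi(x), y) for every candidate y
    return labels[int(np.argmax(scores))]             # prediction y(f) = argmax_y f(Phi(x), y)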
Learning Objective Data distribution P(x,y) (the distribution is unknown) • f* = argmin_f E_{P(x,y)}[ Error(y(f), y) ] Error measures prediction quality; the expectation is over the data distribution; y(f) is the prediction and y the ground truth
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} • f* = argmin_f E_{P(x,y)}[ Error(y(f), y) ] Expectation over the (unknown) data distribution
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples) • f* = argmin_f Σ_i Error(y_i(f), y_i) Expectation over the empirical distribution
Learning Objective Training data {(x_i,y_i), i = 1,2,…,n} (finite samples) • f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f) R(f) is a regularizer and λ its relative weight (a hyperparameter)
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Logistic Regression Input: x Output: y ∈ {-1,+1} Features: Φ(x) f(Φ(x),y) = y θᵀΦ(x) Prediction: sign(θᵀΦ(x)) P(y|x) = l(f(Φ(x),y)), where l(z) = 1/(1+e^{-z}) is the logistic function. Is the distribution normalized?
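A minimal sketch of the binary model, assuming the features Φ(x) are already available as a NumPy vector phi (the function names are illustrative):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))                   # l(z) = 1/(1 + e^{-z})

def prob(phi, y, theta):
    return logistic(y * np.dot(theta, phi))           # P(y|x) = l(y theta^T Phi(x))

def classify(phi, theta):
    return 1 if np.dot(theta, phi) >= 0 else -1       # prediction sign(theta^T Phi(x))

Since l(z) + l(-z) = 1, prob(phi,+1,theta) + prob(phi,-1,theta) = 1, so the distribution is indeed normalized.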
Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ R(θ) (negative log-likelihood plus a regularizer)
Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Convex optimization problem. Proof left as an exercise. Hint: prove that the Hessian H is PSD, i.e. aᵀHa ≥ 0 for all a
Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t. Repeat until the decrease in the objective is below a threshold
Gradient Descent [figures contrasting a small step size μ with a large step size μ]
Gradient Descent Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t, with μ a small constant or set by line search. Repeat until the decrease in the objective is below a threshold
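A minimal sketch of gradient descent on the regularized negative log-likelihood, assuming the features are stacked in a matrix Phi (one row per sample) and the labels y take values ±1; mu is a small constant step size, as on the slide:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, Phi, y, lam):
    # L(theta) = sum_i -log P(y_i|x_i) + lambda ||theta||^2
    margins = y * (Phi @ theta)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(theta, theta)

def gradient(theta, Phi, y, lam):
    margins = y * (Phi @ theta)
    return -(Phi.T @ (y * logistic(-margins))) + 2.0 * lam * theta

def gradient_descent(Phi, y, lam=1.0, mu=0.01, tol=1e-6, max_iters=1000):
    theta = np.zeros(Phi.shape[1])                    # initial estimate theta_0
    prev = objective(theta, Phi, y, lam)
    for _ in range(max_iters):
        theta = theta - mu * gradient(theta, Phi, y, lam)   # theta <- theta - mu dL/dtheta
        cur = objective(theta, Phi, y, lam)
        if prev - cur < tol:                          # stop when the decrease is below a threshold
            break
        prev = cur
    return theta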
Newton’s Method Minimize g(z). Solution at iteration t = z_t. Define g_t(Δz) = g(z_t + Δz). Second-order Taylor series: g_t(Δz) ≈ g(z_t) + g'(z_t) Δz + ½ g''(z_t) (Δz)². Setting the derivative w.r.t. Δz to 0 gives g'(z_t) + g''(z_t) Δz = 0. Solving for Δz gives Δz = −g'(z_t)/g''(z_t), i.e. the curvature provides the learning rate
Newton’s Method Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Start with an initial estimate θ_0. θ_{t+1} ← θ_t − μ dL(θ)/dθ evaluated at θ_t, with μ^{-1} = d²L(θ)/dθ² evaluated at θ_t. Repeat until the decrease in the objective is below a threshold
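A minimal sketch of the Newton update for the same objective, reusing logistic, objective, and gradient from the gradient-descent sketch above; for a vector θ the second derivative becomes the Hessian matrix, whose inverse plays the role of the learning rate:

import numpy as np

def hessian(theta, Phi, y, lam):
    # d^2 L / d theta^2 = sum_i l(m_i)(1 - l(m_i)) phi_i phi_i^T + 2 lambda I
    margins = y * (Phi @ theta)
    s = logistic(margins) * (1.0 - logistic(margins))
    return (Phi.T * s) @ Phi + 2.0 * lam * np.eye(Phi.shape[1])

def newton(Phi, y, lam=1.0, tol=1e-8, max_iters=50):
    theta = np.zeros(Phi.shape[1])
    prev = objective(theta, Phi, y, lam)
    for _ in range(max_iters):
        step = np.linalg.solve(hessian(theta, Phi, y, lam), gradient(theta, Phi, y, lam))
        theta = theta - step                          # theta <- theta - H^{-1} dL/dtheta
        cur = objective(theta, Phi, y, lam)
        if prev - cur < tol:
            break
        prev = cur
    return theta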
Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Train C 1-vs-all binary logistic regression classifiers. Prediction: the class whose classifier gives the maximum probability of +1. Simple extension, easy to code, but loses the probabilistic interpretation
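A minimal sketch of the 1-vs-all scheme, assuming a routine train_binary(Phi, y_pm, lam) that fits a binary logistic regression and returns θ (for instance the gradient_descent function above); the names are illustrative:

import numpy as np

def train_one_vs_all(Phi, labels, C, train_binary, lam=1.0):
    thetas = []
    for c in range(1, C + 1):
        y_pm = np.where(labels == c, 1, -1)           # class c vs the rest
        thetas.append(train_binary(Phi, y_pm, lam))
    return thetas

def predict_one_vs_all(phi, thetas):
    # The logistic function is monotone in the score, so the class with the
    # highest score also has the highest probability of +1.
    scores = [np.dot(theta, phi) for theta in thetas]
    return 1 + int(np.argmax(scores))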
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x) 0 0 … 0] Ψ(x,2) = [0 Φ(x) 0 … 0] … Ψ(x,C) = [0 0 0 … Φ(x)]
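A minimal sketch of this joint feature map, assuming Φ(x) is a NumPy vector phi; Ψ(x,y) places Φ(x) in the block corresponding to class y and zeros everywhere else:

import numpy as np

def joint_feature(phi, y, C):
    # Psi(x, y): block y holds Phi(x), the other C-1 blocks are zero.
    psi = np.zeros(C * len(phi))
    psi[(y - 1) * len(phi): y * len(phi)] = phi       # classes are numbered 1, ..., C
    return psi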
Multiclass Logistic Regression Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θᵀΨ(x,y) Prediction: max_y θᵀΨ(x,y) P(y|x) = exp(f(Ψ(x,y)))/Z(x), where the partition function is Z(x) = Σ_y exp(f(Ψ(x,y)))
Multiclass Logistic Regression Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² Convex optimization problem; can be solved with gradient descent, Newton’s method, and many others
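A minimal sketch of the multiclass model and its training objective, reusing joint_feature from the sketch above; Z(x) is computed by summing over the C classes:

import numpy as np

def class_probability(phi, y, theta, C):
    scores = np.array([np.dot(theta, joint_feature(phi, c, C)) for c in range(1, C + 1)])
    scores -= scores.max()                            # subtract the max for numerical stability
    Z = np.sum(np.exp(scores))                        # partition function Z(x)
    return np.exp(scores[y - 1]) / Z                  # P(y|x) = exp(f(Psi(x,y))) / Z(x)

def regularized_nll(phis, labels, theta, C, lam):
    # sum_i -log P(y_i|x_i) + lambda ||theta||^2
    nll = -sum(np.log(class_probability(phi, y, theta, C)) for phi, y in zip(phis, labels))
    return nll + lam * np.dot(theta, theta)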
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine • Structured output support vector machine
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)] [Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j]
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)] [Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j] [Ψ(x,y_i), for all i; Ψ(x,y_c), for all subsets of variables c]
Regularized Maximum Likelihood Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = θᵀΨ(x,y) Prediction: max_y θᵀΨ(x,y) P(y|x) = exp(f(Ψ(x,y)))/Z(x), where the partition function is Z(x) = Σ_y exp(f(Ψ(x,y)))
Regularized Maximum Likelihood Training data {(x_i,y_i), i = 1,2,…,n} • min_θ Σ_i −log P(y_i|x_i) + λ ||θ||² The partition function is expensive to compute, since the sum runs over C^m labelings; use approximate inference (Nikos Komodakis’ tutorial)
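A minimal sketch showing why the partition function is expensive: brute-force enumeration of all C^m labelings, assuming an illustrative joint feature map psi(phi, y_vec) for a complete labeling; this is only feasible for tiny m, which is why approximate inference is needed:

import itertools
import numpy as np

def partition_function(phi, theta, psi, C, m):
    # Z(x) = sum over all y in {1,...,C}^m of exp(theta^T Psi(x, y)): C^m terms.
    total = 0.0
    for y_vec in itertools.product(range(1, C + 1), repeat=m):
        total += np.exp(np.dot(theta, psi(phi, y_vec)))
    return total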
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine
Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) Ψ(x,1) = [Φ(x) 0 0 … 0] Ψ(x,2) = [0 Φ(x) 0 … 0] … Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C} Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wᵀΨ(x,y) Prediction: max_y wᵀΨ(x,y) Predicted output: y(w) = argmax_y wᵀΨ(x,y)
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Loss function for the i-th sample: Δ(y_i, y_i(w)) Minimize the regularized sum of the loss over the training data. Highly non-convex in w, and regularization plays no role (the loss depends on w only through the scale-invariant argmax, so overfitting may occur)
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} Since wᵀΨ(x_i,y_i(w)) ≥ wᵀΨ(x_i,y_i): Δ(y_i, y_i(w)) ≤ wᵀΨ(x_i,y_i(w)) + Δ(y_i, y_i(w)) − wᵀΨ(x_i,y_i) ≤ max_y { wᵀΨ(x_i,y) + Δ(y_i,y) } − wᵀΨ(x_i,y_i) This upper bound on the loss is convex in w and sensitive to the regularization of w
Multiclass SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y Quadratic program with a polynomial number of constraints. Specialized software packages freely available http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
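A minimal sketch of the quantity bounded by the slack ξ_i: the margin-rescaled hinge loss obtained from the most violating label; it reuses joint_feature from above and assumes, purely for illustration, that Δ is the 0-1 loss:

import numpy as np

def hinge_loss(phi, y_true, w, C_classes):
    # xi_i = max_y [ w^T Psi(x_i, y) + Delta(y_i, y) ] - w^T Psi(x_i, y_i)
    def augmented_score(y):
        delta = 0.0 if y == y_true else 1.0           # Delta(y_i, y): 0-1 loss (assumption)
        return np.dot(w, joint_feature(phi, y, C_classes)) + delta
    best = max(augmented_score(y) for y in range(1, C_classes + 1))
    return best - np.dot(w, joint_feature(phi, y_true, C_classes))

Weighting the sum of this quantity over the training set by C and adding ||w||² recovers the objective of the quadratic program above in its unconstrained form.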
Outline – Part I • Introduction to Supervised Learning • Probabilistic Methods • Logistic regression • Multiclass logistic regression • Regularized maximum likelihood • Loss-based Methods • Support vector machine (multiclass) • Structured output support vector machine
Structured Output SVM Input: x Features: Φ(x) Output: y ∈ {1,2,…,C}^m Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wᵀΨ(x,y) Prediction: max_y wᵀΨ(x,y)
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y Quadratic program with an exponential number of constraints. Many polynomial-time algorithms
Cutting Plane Algorithm Define working sets W_i = {} REPEAT: (1) Update w by solving min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ W_i (2) Compute the most violated constraint for each sample, ŷ_i = argmax_y wᵀΨ(x_i,y) + Δ(y_i,y) (3) Update the working sets W_i by adding ŷ_i
Cutting Plane Algorithm Termination criterion: the violation of ŷ_i is less than ξ_i + ε, for all i. Number of iterations = max{O(n/ε), O(C/ε²)}. At each iteration the convex dual of the problem increases, and the convex dual can be upper bounded. Ioannis Tsochantaridis et al., JMLR 2005 http://svmlight.joachims.org/svm_struct.html
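A minimal sketch of the cutting-plane loop, with the restricted QP solver and the loss-augmented inference step left as assumed helpers: solve_qp(working_sets, data) should return (w, xi) for the constraints currently in the working sets, loss_augmented_inference(w, x, y_true) should return argmax_y wᵀΨ(x,y) + Δ(y_true,y), and psi and delta implement Ψ and Δ (all names are illustrative, not the SVMstruct API):

import numpy as np

def cutting_plane(data, psi, delta, solve_qp, loss_augmented_inference,
                  epsilon=1e-3, max_iters=100):
    working_sets = [set() for _ in data]              # W_i = {}
    w, xi = solve_qp(working_sets, data)
    for _ in range(max_iters):
        added = False
        for i, (x, y_true) in enumerate(data):
            y_hat = loss_augmented_inference(w, x, y_true)     # most violated constraint
            violation = (np.dot(w, psi(x, y_hat)) + delta(y_true, y_hat)
                         - np.dot(w, psi(x, y_true)))
            if violation > xi[i] + epsilon:           # constraint still violated by more than epsilon
                working_sets[i].add(y_hat)            # y_hat must be hashable, e.g. a tuple
                added = True
        if not added:                                 # termination criterion from the slide
            break
        w, xi = solve_qp(working_sets, data)          # re-solve the restricted QP
    return w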
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ {1,2,…,C}^m Number of constraints = n·C^m
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,y) + Δ(y_i,y) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all y ∈ Y
Structured Output SVM Training data {(x_i,y_i), i = 1,2,…,n} min_w ||w||² + C Σ_i ξ_i s.t. wᵀΨ(x_i,z_i) + Δ(y_i,z_i) − wᵀΨ(x_i,y_i) ≤ ξ_i, for all z_i ∈ Y