Lecture 17: Supervised Learning Recap Machine Learning April 6, 2010
Last Time • Support Vector Machines • Kernel Methods
Today • Short recap of Kernel Methods • Review of Supervised Learning • Unsupervised Learning • (Soft) K-means clustering • Expectation Maximization • Spectral Clustering • Principal Components Analysis • Latent Semantic Analysis
Kernel Methods • Feature extraction to higher dimensional spaces. • Kernels describe the relationship between vectors (points) rather than the new feature space directly.
When can we use kernels? • Any time training and evaluation are both based on the dot product between two points. • SVMs • Perceptron • k-nearest neighbors • k-means • etc.
Kernels in SVMs • Optimize αi’s and bias w.r.t. kernel • Decision function:
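The formulas on the original slide are not reproduced in this text. As a reference, the standard soft-margin dual objective (optimized over the αi's) and the kernelized decision function are:

$$\max_{\alpha}\;\sum_{i}\alpha_i-\frac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j K(x_i,x_j)\quad\text{s.t.}\quad 0\le\alpha_i\le C,\;\;\sum_i\alpha_i y_i=0$$

$$f(x)=\operatorname{sign}\!\Big(\sum_i\alpha_i y_i\,K(x_i,x)+b\Big)$$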
Kernels in Perceptrons • Training • Decision function
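The slide's equations are not included in this text. As a stand-in, here is a minimal sketch of the standard kernel perceptron, where a mistake counter αi replaces the explicit weight vector and both training and prediction use only kernel evaluations (function names are illustrative):

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Kernel perceptron training: alpha[i] counts mistakes on example i.
    X: array of training points, y: labels in {-1, +1}."""
    y = np.asarray(y)
    n = len(X)
    alpha = np.zeros(n)
    # Precompute the Gram matrix -- training only needs kernel evaluations
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:  # mistake
                alpha[i] += 1
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    """Decision function: sign of a kernel-weighted vote over past mistakes."""
    score = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score > 0 else -1
```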
Good and Valid Kernels • Good: Computing K(xi,xj) is cheaper than explicitly computing ϕ(xi) and ϕ(xj) • Valid: • Symmetric: K(xi,xj) = K(xj,xi) • Decomposable into ϕ(xi)Tϕ(xj) • Positive Semi-Definite Gram Matrix • Popular Kernels • Linear, Polynomial • Radial Basis Function • String (technically infinite dimensions) • Graph
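A quick way to check the validity conditions on a finite sample is to build the Gram matrix and test symmetry and positive semi-definiteness numerically. A small sketch, with an illustrative RBF kernel and tolerance, assuming NumPy:

```python
import numpy as np

def is_valid_gram_matrix(K, tol=1e-8):
    """Check the slide's validity conditions on a finite sample:
    symmetry and a positive semi-definite Gram matrix."""
    if not np.allclose(K, K.T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric matrix
    return bool(np.all(eigenvalues >= -tol))

# Example: an RBF kernel on a handful of random points gives a valid Gram matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)
print(is_valid_gram_matrix(K))   # expected: True
```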
Supervised Learning • Linear Regression • Logistic Regression • Graphical Models • Hidden Markov Models • Neural Networks • Support Vector Machines • Kernel Methods
Major concepts • Gaussian, Multinomial, Bernoulli Distributions • Joint vs. Conditional Distributions • Marginalization • Maximum Likelihood • Risk Minimization • Gradient Descent • Feature Extraction, Kernel Methods
Some favorite distributions • Bernoulli • Multinomial • Gaussian
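For reference, these distributions in their usual parameterizations (notation may differ from the original slides):

$$\text{Bernoulli: } p(x\mid\mu)=\mu^{x}(1-\mu)^{1-x},\quad x\in\{0,1\}$$

$$\text{Multinomial (one draw, one-hot }x\text{): } p(x\mid\mu)=\prod_{k=1}^{K}\mu_k^{x_k},\qquad \sum_k\mu_k=1$$

$$\text{Gaussian: } \mathcal{N}(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$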
Maximum Likelihood • Identify the parameter values that yield the maximum likelihood of generating the observed data. • Take the partial derivative of the likelihood function • Set to zero • Solve • NB: maximum likelihood parameters are the same as maximum log likelihood parameters
Maximum Log Likelihood • Why do we like the log function? • It turns products (difficult to differentiate) into sums (easy to differentiate) • log(xy) = log(x) + log(y) • log(x^c) = c log(x)
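As a worked example of this recipe (not taken from the slides), the maximum likelihood estimate of a Bernoulli parameter from observations x_1, ..., x_N:

$$\log L(\mu)=\sum_{n=1}^{N}\big[x_n\log\mu+(1-x_n)\log(1-\mu)\big]$$

$$\frac{\partial\log L}{\partial\mu}=\frac{\sum_n x_n}{\mu}-\frac{N-\sum_n x_n}{1-\mu}=0\;\Longrightarrow\;\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}x_n$$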
Risk Minimization • Pick a loss function • Squared loss • Linear loss • Perceptron (classification) loss • Identify the parameters that minimize the loss function. • Take the partial derivative of the loss function • Set to zero • Solve
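Applying the same recipe to squared loss for linear regression (a standard derivation, not reproduced from the slides):

$$L(w)=\frac{1}{2}\sum_{n}\big(y_n-w^{T}x_n\big)^2,\qquad \frac{\partial L}{\partial w}=-\sum_{n}\big(y_n-w^{T}x_n\big)\,x_n$$

Setting the gradient to zero gives the normal equations: $w=(X^{T}X)^{-1}X^{T}y$.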
Frequentists v. Bayesians • Point estimates vs. Posteriors • Risk Minimization vs. Maximum Likelihood • L2-Regularization • Frequentists: Add a constraint on the size of the weight vector • Bayesians: Introduce a zero-mean prior on the weight vector • Result is the same!
L2-Regularization • Frequentists: • Introduce a cost on the size of the weights • Bayesians: • Introduce a prior on the weights
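The equivalence the slides allude to can be made explicit (standard result; the notation here is mine): with a Gaussian likelihood and a zero-mean Gaussian prior on w, the negative log posterior is squared error plus an L2 penalty,

$$-\log p(w\mid\text{data})\;\propto\;\frac{1}{2\sigma^2}\sum_n\big(y_n-w^{T}x_n\big)^2+\frac{1}{2\sigma_w^2}\|w\|^2,$$

which matches ridge regression $\sum_n(y_n-w^{T}x_n)^2+\lambda\|w\|^2$ with $\lambda=\sigma^2/\sigma_w^2$.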
Types of Classifiers • Generative Models • Highest resource requirements. • Need to approximate the joint probability • Discriminative Models • Moderate resource requirements. • Typically fewer parameters to approximate than generative models • Discriminant Functions • Can be trained probabilistically, but the output does not include confidence information
Linear Regression • Fit a line to a set of points
Linear Regression • Extension to higher dimensions • Polynomial fitting • Arbitrary function fitting • Wavelets • Radial basis functions • Classifier output
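A minimal sketch of linear regression with polynomial basis functions, fit in closed form (toy data and function names are illustrative; assumes NumPy):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map scalar inputs to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(Phi, t):
    """Closed-form least squares: solve Phi w ~= t in the squared-error sense."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Toy example: fit a cubic to noisy samples of sin(x)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 30)
t = np.sin(x) + 0.1 * rng.normal(size=x.shape)
Phi = polynomial_design_matrix(x, degree=3)
w = fit_least_squares(Phi, t)
print(w)
```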
Logistic Regression • Fit Gaussians to the data for each class • The decision boundary is where the PDFs cross • No “closed form” solution when the gradient is set to zero • Train by Gradient Descent
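Since there is no closed-form solution, logistic regression is typically fit by gradient descent on the negative log likelihood. A minimal sketch (assumes NumPy, labels in {0, 1}, illustrative learning rate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, iters=1000):
    """Batch gradient descent on the negative log likelihood (cross-entropy).
    X: (n, d) design matrix (include a column of ones for the bias), y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)               # predicted class-1 probabilities
        gradient = X.T @ (p - y) / len(y)
        w -= lr * gradient
    return w
```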
Graphical Models • General way to describe the dependence relationships between variables. • Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
Junction Tree Algorithm • Moralization • “Marry the parents” • Make undirected • Triangulation • Add edges so that no chordless cycle of length ≥ 4 remains • Junction Tree Construction • Identify separators such that the running intersection property holds • Introduction of Evidence • Pass messages around the junction tree to generate marginals
Hidden Markov Models • Sequential Modeling • Generative Model • Relationship between observations and state (class) sequences
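In symbols, the generative factorization an HMM assumes over an observation sequence x_{1:T} and a state sequence z_{1:T} is the standard one (not reproduced from the slide):

$$p(x_{1:T},z_{1:T})=p(z_1)\,p(x_1\mid z_1)\prod_{t=2}^{T}p(z_t\mid z_{t-1})\,p(x_t\mid z_t)$$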
Perceptron • Step function used for squashing. • Classifier as Neuron metaphor.
Perceptron Loss • Classification Error vs. Sigmoid Error • Loss is only calculated on mistakes • Perceptrons use strictly classification error
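A minimal sketch of the perceptron learning rule, which makes the "update only on mistakes" point concrete (assumes NumPy, labels in {-1, +1}, bias folded into the inputs):

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Perceptron learning rule: the weights change only on misclassified points.
    X: (n, d) with a bias column of ones appended, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # loss (and an update) only on a mistake
                w = w + yi * xi
                mistakes += 1
        if mistakes == 0:            # converged on linearly separable data
            break
    return w
```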
Neural Networks • Interconnected Layers of Perceptrons or Logistic Regression “neurons”
Neural Networks • There are many possible configurations of neural networks • Vary the number of layers • Size of layers
Support Vector Machines • Maximum Margin Classification • (Figure: comparison of a small-margin and a large-margin separating boundary)
Support Vector Machines • Optimization Function • Decision Function
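The slide's formulas are not reproduced in this text; the standard hard-margin primal and the resulting decision function are:

$$\min_{w,b}\;\frac{1}{2}\|w\|^2\quad\text{s.t.}\quad y_n\big(w^{T}x_n+b\big)\ge 1\;\;\forall n$$

$$f(x)=\operatorname{sign}\big(w^{T}x+b\big)$$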
Questions? • Now would be a good time to ask questions about Supervised Techniques.
Clustering • Identify discrete groups of similar data points • Data points are unlabeled
Recall K-Means • Algorithm • Select K – the desired number of clusters • Initialize K cluster centroids • For each point in the data set, assign it to the cluster with the closest centroid • Update the centroid based on the points assigned to each cluster • If any data point has changed clusters, repeat
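A minimal sketch of this algorithm (assumes NumPy; initializing the centroids by sampling K points from the data is one common choice, not necessarily the one used in lecture):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Hard k-means: assign each point to its nearest centroid, then
    recompute the centroids; stop when no assignment changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                              # no point changed clusters
        assignments = new_assignments
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:               # keep old centroid if cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, assignments
```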
Soft K-means • In k-means, we force every data point to belong to exactly one cluster; this hard assignment minimizes the entropy of the cluster assignment. • This constraint can be relaxed.
Soft k-means • We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points • Convergence is based on a stopping threshold rather than changed assignments
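One common formulation of soft k-means uses a stiffness parameter β to compute soft responsibilities and then takes the weighted mean the slide describes (the notation here is mine, not the lecture's):

$$r_{nk}=\frac{\exp\!\big(-\beta\,\|x_n-\mu_k\|^2\big)}{\sum_{j}\exp\!\big(-\beta\,\|x_n-\mu_j\|^2\big)},\qquad \mu_k=\frac{\sum_n r_{nk}\,x_n}{\sum_n r_{nk}}$$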
Gaussian Mixture Models • Rather than identifying clusters by “nearest” centroids • Fit a Set of k Gaussians to the data.
Gaussian Mixture Models • Formally, a Mixture Model is the weighted sum of a number of pdfs, where the weights are determined by a mixing distribution.
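In symbols, with mixing weights π_k (standard form):

$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}\big(x\mid\mu_k,\Sigma_k\big),\qquad \pi_k\ge 0,\;\;\sum_{k=1}^{K}\pi_k=1$$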
Graphical Models with unobserved variables • What if you have variables in a Graphical model that are never observed? • Latent Variables • Training latent variable models is an unsupervised learning application • (Figure: example network with nodes labeled uncomfortable, amused, sweating, laughing)
Latent Variable HMMs • We can cluster sequences using an HMM with unobserved state variables • We will train the latent variable models using Expectation Maximization
Expectation Maximization • Training both GMMs and graphical models with latent variables is accomplished using Expectation Maximization • Step 1: Expectation (E-step) • Evaluate the “responsibilities” of each cluster with the current parameters • Step 2: Maximization (M-step) • Re-estimate parameters using the existing “responsibilities” • Related to k-means
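For a GMM the two steps take the following standard form (not copied from the slides). E-step, compute responsibilities:

$$\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$

M-step, with $N_k=\sum_n\gamma(z_{nk})$:

$$\mu_k=\frac{1}{N_k}\sum_n\gamma(z_{nk})\,x_n,\qquad \Sigma_k=\frac{1}{N_k}\sum_n\gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{T},\qquad \pi_k=\frac{N_k}{N}$$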
Questions • One more time for questions on supervised learning…
Next Time • Gaussian Mixture Models (GMMs) • Expectation Maximization