Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University of Pittsburgh Computer Science
Features • Loosely stated, a feature is a value describing something about your data points (e.g., for pixels: intensity, local gradient, distance from a landmark, etc.) • Multiple (n) features are put together to form a feature vector, which defines a data point’s location in n-dimensional feature space
Feature Space • The theoretical n-dimensional space occupied by n input raster objects (features) • Each feature represents one dimension, and its values represent positions along one of the orthogonal coordinate axes in feature space • The set of feature values belonging to a data point defines a vector in feature space
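As a concrete illustration (not from the slides), here is a minimal NumPy sketch that stacks per-pixel measurements into feature vectors; the three features (intensity, gradient magnitude, distance to a landmark) and the landmark location are assumptions chosen to match the examples above.

```python
import numpy as np

# Toy 2-D "image": intensity values on a 64x64 grid (assumed example data).
rng = np.random.default_rng(0)
image = rng.random((64, 64))

# Feature 1: raw intensity.
intensity = image.ravel()

# Feature 2: local gradient magnitude via finite differences.
gy, gx = np.gradient(image)
gradient_mag = np.hypot(gx, gy).ravel()

# Feature 3: distance of each pixel from a hypothetical landmark at (32, 32).
rows, cols = np.indices(image.shape)
distance = np.hypot(rows - 32, cols - 32).ravel()

# Stack the n = 3 features into feature vectors: one row per pixel,
# one column per axis of the 3-dimensional feature space.
feature_vectors = np.column_stack([intensity, gradient_mag, distance])
print(feature_vectors.shape)  # (4096, 3)
```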
Statistical Notation • Class probability distribution: p(x,y) = p(x | y) p(y) • x: feature vector – {x1, x2, x3, …, xn} • y: class • p(x | y): probability of x given y • p(x,y): probability of both x and y
Example: Binary Classification • Two class-conditional distributions: p(x | y = 0) p(x | y = 1) • Priors: p(y = 0) + p(y = 1) = 1
Modeling Class Densities • In the text, they choose to concentrate on methods that use Gaussians to model class densities
Generative Approach to Classification • Represent and learn the distribution: p(x,y) • Use it to define probabilistic discriminant functions, e.g. g0(x) = p(y = 0 | x) and g1(x) = p(y = 1 | x)
Generative Approach to Classification Typical model: p(x,y) = p(x | y) p(y) p(x | y) = Class-conditional distributions (densities) p(y) = Priors of classes (probability of class y) We Want: p(y | x) = Posteriors of classes
Class Modeling • We model the class distributions as multivariate Gaussians x ~ N(μ0, Σ0) for y = 0 x ~ N(μ1, Σ1) for y = 1 • Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g. Bernoulli distribution for a coin flip)
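A minimal sketch of this step (assuming NumPy/SciPy, a labeled training matrix X, and binary labels y; the helper name is made up for illustration): fit the two class-conditional Gaussians and estimate the priors from class frequencies in the training data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y):
    """Estimate N(mu_k, Sigma_k) and the prior p(y = k) for k in {0, 1}.

    X: (N, d) array of feature vectors; y: (N,) array of 0/1 labels.
    """
    models, priors = {}, {}
    for k in (0, 1):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        sigma = np.cov(Xk, rowvar=False)           # sample covariance
        models[k] = multivariate_normal(mu, sigma)  # frozen Gaussian for class k
        priors[k] = len(Xk) / len(X)                # prior from class frequency
    return models, priors
```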
Making a class decision • We need to define discriminant functions gi(x) • We have two basic choices: • Likelihood of data – choose the class (Gaussian) that best explains the input data x: choose y = arg maxi p(x | y = i) • Posterior of class – choose the class with the better posterior probability: choose y = arg maxi p(y = i | x)
Calculating Posteriors • Use Bayes’ Rule: p(y | x) = p(x | y) p(y) / p(x) • In this case, p(y = 1 | x) = p(x | y = 1) p(y = 1) / [ p(x | y = 0) p(y = 0) + p(x | y = 1) p(y = 1) ]
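Continuing the hedged sketch above (reusing the assumed fit_gaussian_classes helper), the posterior-based discriminant follows directly from Bayes’ rule:

```python
def posterior(models, priors, x):
    """Return p(y = 1 | x) for a single feature vector x via Bayes' rule."""
    lik0 = models[0].pdf(x) * priors[0]
    lik1 = models[1].pdf(x) * priors[1]
    return lik1 / (lik0 + lik1)

def classify(models, priors, x):
    # Choose the class with the higher posterior probability.
    return int(posterior(models, priors, x) >= 0.5)
```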
Linear Decision Boundary • When the class covariances are equal (Σ0 = Σ1), the decision boundary between the two Gaussians is linear
Quadratic Decision Boundary • When the class covariances differ, the decision boundary is quadratic
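A worked sketch of why this holds (a standard result, written in the notation of the slides): set the log posterior ratio of the two Gaussian classes to zero.

```latex
% Decision boundary: log posterior ratio = 0.
\[
\log\frac{p(y=1\mid x)}{p(y=0\mid x)}
 = -\tfrac{1}{2}(x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1)
   +\tfrac{1}{2}(x-\mu_0)^{\top}\Sigma_0^{-1}(x-\mu_0)
   -\tfrac{1}{2}\log\frac{\lvert\Sigma_1\rvert}{\lvert\Sigma_0\rvert}
   +\log\frac{p(y=1)}{p(y=0)} = 0
\]
% When \Sigma_0 = \Sigma_1, the two quadratic terms x^T \Sigma^{-1} x cancel,
% leaving an expression linear in x; when the covariances differ, the
% x^T \Sigma_k^{-1} x terms remain and the boundary is quadratic.
```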
Clustering • Basic Clustering Problem: • Distribute data into k different groups such that data points similar to each other are in the same group • Similarity between points is defined in terms of some distance metric • Clustering is useful for: • Similarity/Dissimilarity analysis • Analyze which data points in the sample are close to each other • Dimensionality Reduction • High-dimensional data replaced with a group (cluster) label
Distance Metrics • A distance metric measures the distance between points in some space (for our purposes, usually a feature space), e.g. the Euclidean distance • Must fulfill three properties: • d(x, y) ≥ 0, with d(x, y) = 0 only when x = y • Symmetry: d(x, y) = d(y, x) • Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Distance Metrics • Common simple metrics: • Euclidean: d(x, y) = √( Σi (xi − yi)² ) • Manhattan: d(x, y) = Σi |xi − yi| • Both work for an arbitrary k-dimensional space
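A minimal NumPy sketch of both metrics (the helper names are assumptions for illustration):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two k-dimensional feature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (L1) distance between two k-dimensional feature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(np.abs(a - b))
```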
Clustering Algorithms • k-Nearest Neighbor • k-Means • Parzen Windows
k-Nearest Neighbor • In essence, a classifier • Requires input parameter k • In this algorithm, k indicates the number of neighboring points to take into account when classifying a data point • Requires training data
k-Nearest Neighbor Algorithm • For each data point xn, choose its class by finding the most common (majority) class among the k nearest data points in the training set • Use any distance measure (usually a Euclidean distance measure)
k-Nearest Neighbor Algorithm • [Figure: 2-D example with positive (+) and negative (−) training points around a query point q1. With 1 nearest neighbor, q1 takes the class of its single nearest point e1; with 5 nearest neighbors, q1 is classified as negative.]
k-Nearest Neighbor • Advantages: • Simple • General (can work for any distance measure you want) • Disadvantages: • Requires well-classified (labeled) training data • Can be sensitive to the k value chosen • All attributes are used in classification, even ones that may be irrelevant • Inductive bias: we assume that a data point should be classified the same as points near it
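A hedged sketch of the k-NN rule described above (assuming labeled NumPy arrays X_train, y_train and a Euclidean distance; the function name is made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                       # indices of k closest points
    votes = Counter(y_train[nearest])                     # count class labels
    return votes.most_common(1)[0][0]                     # most common class wins
```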
k-Means • Suitable only when data points have continuous values • Groups are defined in terms of cluster centers (means) • Requires input parameter k • In this algorithm, k indicates the number of clusters to be created • Guaranteed to converge to at least a local optimum
k-Means Algorithm • Randomly initialize k mean values • Repeat the next two steps until the means no longer change: • Partition the data according to the current means, using the chosen distance measure • Move each mean to the center of the data in its current partition
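A minimal sketch of this loop in NumPy (random initialization, iteration cap, and the empty-cluster guard are implementation assumptions, not part of the slides):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Return (means, labels) for data X of shape (N, d) using the loop above."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # Partition: assign each point to its nearest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each mean to the center of its current partition.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # stop when the means no longer change
            break
        means = new_means
    return means, labels
```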
k-Means • Advantages: • Simple • General (can work for any distance measure you want) • Requires no training phase • Disadvantages: • Result is very sensitive to initial mean placement • Can perform poorly on overlapping regions • Doesn’t work on features with non-continuous values (can’t compute cluster means) • Inductive bias: we assume that a data point should be classified the same as points near it
Parzen Windows • Similar to k-Nearest Neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel • As with our classification algorithms, we will consider a Gaussian kernel as the window
Parzen Windows • Assume a region defined by a d-dimensional Gaussian of scale σ • We can define a window density function: p(x) = (1/N) Σi Kσ(x − xi), where Kσ is the d-dimensional Gaussian kernel of scale σ and the sum runs over the N training points • Note that we consider all points in the training set, but if a point lies far outside the kernel, its weight is effectively 0, negating its influence
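A hedged sketch of this estimator with an isotropic Gaussian kernel, plus a simple per-class classifier built on it (the helper names and the choice of σ are assumptions for illustration):

```python
import numpy as np

def parzen_density(X_train, x, sigma=1.0):
    """Parzen-window density estimate at x using an isotropic Gaussian kernel."""
    X_train = np.atleast_2d(X_train)
    d = X_train.shape[1]
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)          # Gaussian normalizer
    return np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)) / norm)

def parzen_classify(X_train, y_train, x, sigma=1.0):
    # Pick the class whose windowed density, weighted by its prior, is largest.
    classes = np.unique(y_train)
    scores = [parzen_density(X_train[y_train == c], x, sigma) * np.mean(y_train == c)
              for c in classes]
    return classes[int(np.argmax(scores))]
```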
Parzen Windows • Advantages: • More robust than k-nearest neighbor • Excellent accuracy and consistency • Disadvantages: • How to choose the size of the window? • Alone, kernel density estimation techniques provide little insight into data or problems