Prototype Classification Methods Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819 fchang@iis.sinica.edu.tw
Types of Prototype Methods • Crisp model (K-means, KM) • Prototypes are centers of non-overlapping clusters • Fuzzy model (Fuzzy c-means, FCM) • Prototypes are weighted averages of all samples • Gaussian mixture model (GM) • Prototypes are the components of a mixture of distributions • Linear Discriminant Analysis (LDA) • Prototypes are projected sample means • K-nearest neighbor classifier (K-NN) • Learning vector quantization (LVQ)
Prototypes thru Clustering • Given the number k of prototypes, find k clusters whose centers are the prototypes • Commonality: • Use an iterative algorithm aimed at decreasing an objective function • May converge to local minima • The number k, as well as an initial solution, must be specified
Clustering Objectives • The aim of the iterative algorithm is to decrease the value of an objective function • Notations: • Samples: x_1, x_2, …, x_n • Prototypes: p_1, p_2, …, p_k • L2-distance: d(x, p) = ||x − p||, the Euclidean distance between a sample and a prototype
Objectives (cnt’d) • Crisp objective: J_KM = Σ_{i=1..k} Σ_{x_j ∈ cluster i} ||x_j − p_i||² • Fuzzy objective: J_FCM = Σ_{i=1..c} Σ_{j=1..n} (u_ij)^m ||x_j − p_i||², with fuzzifier m > 1 • Gaussian mixture objective: the log-likelihood Σ_{j=1..n} log Σ_{i=1..k} α_i p_i(x_j | θ_i), which is to be maximized (equivalently, its negative is decreased)
The Algorithm • Initialize k seeds of prototypes p1, p2, …, pk • Grouping: • Assign samples to their nearest prototypes • Form non-overlapping clusters out of these samples • Centering: • Centers of clusters become the new prototypes • Repeat the grouping and centering steps until convergence, as in the sketch below
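A minimal Python/NumPy sketch of this grouping/centering loop; the function name k_means, the random seeding from k samples, and the convergence test are illustrative assumptions rather than details from the slides.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate grouping and centering until the prototypes stop moving.

    X : (n, d) array of samples; k : number of prototypes.
    """
    rng = np.random.default_rng(seed)
    # Initialize k seeds by picking k distinct samples as prototypes.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Grouping: assign every sample to its nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centering: the center of each cluster becomes the new prototype.
        new_prototypes = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else prototypes[i]
            for i in range(k)
        ])
        if np.allclose(new_prototypes, prototypes):
            break  # converged
        prototypes = new_prototypes
    return prototypes, labels
```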
Justification • Grouping: • Assigning samples to their nearest prototypes helps to decrease the objective • Centering: • Also helps to decrease the objective, because for any group of vectors y_1, …, y_r with mean ȳ = (1/r) Σ_i y_i and any vector z, Σ_{i=1..r} ||y_i − ȳ||² ≤ Σ_{i=1..r} ||y_i − z||², and equality holds only if z = ȳ
Exercise: • Prove that for any group of vectors y_i with mean ȳ and for every vector z, the inequality Σ_i ||y_i − ȳ||² ≤ Σ_i ||y_i − z||² always holds • Prove that the equality holds only when z = ȳ • Use this fact to prove that the centering step helps to decrease the objective function
Crisp vs. Fuzzy Membership • Membership matrix: U of size c × n • u_ij is the grade of membership of sample j with respect to prototype i • Crisp membership: u_ij ∈ {0, 1} and Σ_{i=1..c} u_ij = 1, i.e., each sample belongs to exactly one cluster • Fuzzy membership: u_ij ∈ [0, 1] and Σ_{i=1..c} u_ij = 1
Fuzzy c-means (FCM) • The objective function of FCM is J = Σ_{i=1..c} Σ_{j=1..n} (u_ij)^m ||x_j − p_i||², where m > 1 is the fuzzifier, subject to the constraints Σ_{i=1..c} u_ij = 1 for j = 1, 2, …, n
FCM (Cnt’d) • Introducing a Lagrange multiplier λ_j with respect to each constraint Σ_{i=1..c} u_ij = 1, we rewrite the objective function as J = Σ_{i=1..c} Σ_{j=1..n} (u_ij)^m ||x_j − p_i||² + Σ_{j=1..n} λ_j ( Σ_{i=1..c} u_ij − 1 )
FCM (Cnt’d) • Setting the partial derivatives to zero, we obtain ∂J/∂u_ij = m (u_ij)^(m−1) ||x_j − p_i||² + λ_j = 0 (1st equation) and ∂J/∂λ_j = Σ_{i=1..c} u_ij − 1 = 0 (2nd equation)
FCM (Cnt’d) • From the 1st equation, we obtain u_ij = ( −λ_j / (m ||x_j − p_i||²) )^(1/(m−1)) • Substituting this into the 2nd equation, we obtain Σ_{l=1..c} ( −λ_j / (m ||x_j − p_l||²) )^(1/(m−1)) = 1
FCM (Cnt’d) • Therefore, ( −λ_j / m )^(1/(m−1)) = 1 / Σ_{l=1..c} ( 1 / ||x_j − p_l||² )^(1/(m−1))
FCM (Cnt’d) • Together with the expression for u_ij above, we obtain the updating rule for u_ij: u_ij = 1 / Σ_{l=1..c} ( ||x_j − p_i||² / ||x_j − p_l||² )^(1/(m−1))
FCM (Cnt’d) • On the other hand, setting the derivative of J with respect to p_i to zero, we obtain ∂J/∂p_i = −2 Σ_{j=1..n} (u_ij)^m (x_j − p_i) = 0
FCM (Cnt’d) • It follows that Σ_{j=1..n} (u_ij)^m p_i = Σ_{j=1..n} (u_ij)^m x_j • Finally, we obtain the update rule for p_i: p_i = Σ_{j=1..n} (u_ij)^m x_j / Σ_{j=1..n} (u_ij)^m
FCM (Cnt’d) • To summarize, each iteration alternates the two updates: u_ij = 1 / Σ_{l=1..c} ( ||x_j − p_i||² / ||x_j − p_l||² )^(1/(m−1)) and p_i = Σ_{j=1..n} (u_ij)^m x_j / Σ_{j=1..n} (u_ij)^m
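A minimal Python/NumPy sketch of alternating these two updates; the function name fuzzy_c_means, the random initialization of U, the default fuzzifier m = 2, and the small eps guarding against zero distances are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-9, seed=0):
    """Alternate the membership update u_ij and the prototype update p_i."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random fuzzy memberships, normalized so each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Prototype update: weighted average of all samples with weights u_ij^m.
        W = U ** m
        P = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update: u_ij = 1 / sum_l (d_ij / d_lj)^(2/(m-1)).
        D = np.linalg.norm(X[None, :, :] - P[:, None, :], axis=2) + eps
        U = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
    return P, U
```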
K-means vs. Fuzzy c-means • [Figures: a set of sample points, and the clusters obtained on them by K-means and by fuzzy c-means]
What Is Given • Observed data: X = {x1, x2, …, xn}, each drawn independently from a mixture of probability distributions with the density p(x | Θ) = Σ_{i=1..k} α_i p_i(x | θ_i), where Σ_{i=1..k} α_i = 1 and Θ = (α_1, …, α_k, θ_1, …, θ_k)
Incomplete vs. Complete Data • The incomplete-data log-likelihood is given by log L(Θ | X) = Σ_{j=1..n} log Σ_{i=1..k} α_i p_i(x_j | θ_i), which is difficult to optimize • The complete-data log-likelihood log L(Θ | X, H) can be handled much more easily, where H is the set of hidden random variables indicating which component generated each sample • How do we compute the distribution of H?
EM Algorithm • E-Step: first find the expected value Q(Θ, Θ^g) = E[ log L(Θ | X, H) | X, Θ^g ], where Θ^g is the current estimate of Θ • M-Step: update the estimate to Θ^g ← argmax_Θ Q(Θ, Θ^g) • Repeat the process until convergence
Justification • The expected value is a lower bound of the log-likelihood: for any distribution q(h) over the hidden variables, Jensen’s inequality gives log p(X | Θ) = log Σ_h q(h) p(X, h | Θ) / q(h) ≥ Σ_h q(h) log ( p(X, h | Θ) / q(h) ) (1)
Justification (Cnt’d) • The maximum of the lower bound equals the log-likelihood • The right-hand side of (1) can be written as −D( q(h) || p(h | X, Θ) ) + log p(X | Θ); the first term is the negative relative entropy of q(h) with respect to p(h | X, Θ), and the second term does not depend on h • We obtain the maximum of (1) by making the relative entropy zero, i.e., by choosing q(h) = p(h | X, Θ) • With this choice the first term vanishes and (1) achieves its upper bound, which is log p(X | Θ)
Details of EM Algorithm • Let Θ^g = (α_1^g, …, α_k^g, θ_1^g, …, θ_k^g) be the guessed values of Θ • For the given Θ^g, we can compute the posterior probability that sample x_j came from the i-th component: p(i | x_j, Θ^g) = α_i^g p_i(x_j | θ_i^g) / Σ_{l=1..k} α_l^g p_l(x_j | θ_l^g)
Details (Cnt’d) • We then consider the expected value: Q(Θ, Θ^g) = Σ_{j=1..n} Σ_{i=1..k} p(i | x_j, Θ^g) ( log α_i + log p_i(x_j | θ_i) )
Details (Cnt’d) • Lagrangian and partial derivative equation: maximizing Q over the α_i subject to Σ_{i=1..k} α_i = 1, we form L = Q + λ ( Σ_{i=1..k} α_i − 1 ) and set ∂L/∂α_i = Σ_{j=1..n} p(i | x_j, Θ^g) / α_i + λ = 0 (2)
Details (Cnt’d) • From (2), we derive that λ = −n and α_i = (1/n) Σ_{j=1..n} p(i | x_j, Θ^g) • Based on these values, we can derive the optimal θ_i by maximizing Q, of which only the following part involves θ_i: Σ_{j=1..n} Σ_{i=1..k} p(i | x_j, Θ^g) log p_i(x_j | θ_i)
Exercise: • Deduce from (2) that λ = −n and α_i = (1/n) Σ_{j=1..n} p(i | x_j, Θ^g)
Gaussian Mixtures • The Gaussian distribution is given by p_i(x | μ_i, Σ_i) = (2π)^(−d/2) |Σ_i|^(−1/2) exp( −(1/2) (x − μ_i)^T Σ_i^(−1) (x − μ_i) ) • For Gaussian mixtures, the component parameters are θ_i = (μ_i, Σ_i)
Gaussian Mixtures (Cnt’d) • Partial derivative with respect to μ_i: ∂Q/∂μ_i = Σ_{j=1..n} p(i | x_j, Θ^g) Σ_i^(−1) (x_j − μ_i) • Setting this to zero, we obtain μ_i = Σ_{j=1..n} p(i | x_j, Θ^g) x_j / Σ_{j=1..n} p(i | x_j, Θ^g)
Gaussian Mixtures (Cnt’d) • Taking the derivative of Q with respect to Σ_i and setting it to zero, we get (many details are omitted) Σ_i = Σ_{j=1..n} p(i | x_j, Θ^g) (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1..n} p(i | x_j, Θ^g)
Gaussian Mixtures (Cnt’d) • To summarize, each EM iteration updates α_i ← (1/n) Σ_j p(i | x_j, Θ^g), μ_i ← Σ_j p(i | x_j, Θ^g) x_j / Σ_j p(i | x_j, Θ^g), and Σ_i ← Σ_j p(i | x_j, Θ^g) (x_j − μ_i)(x_j − μ_i)^T / Σ_j p(i | x_j, Θ^g)
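A minimal Python sketch of these EM updates for a Gaussian mixture, using NumPy and scipy.stats.multivariate_normal; the initialization scheme and the small ridge added to the covariances for numerical stability are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iter=100, ridge=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initial guess: uniform mixing weights, random means, identity covariances.
    alpha = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posterior p(i | x_j, Theta^g) for every sample and component.
        post = np.array([alpha[i] * multivariate_normal.pdf(X, mu[i], sigma[i])
                         for i in range(k)])
        post /= post.sum(axis=0, keepdims=True)
        # M-step: the update rules summarized above.
        Nk = post.sum(axis=1)                      # effective counts per component
        alpha = Nk / n
        mu = (post @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (post[i, :, None] * diff).T @ diff / Nk[i] + ridge * np.eye(d)
    return alpha, mu, sigma
```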
Definitions • Given: • Samples x1, x2, …, xn • Classes: ni of them are of class i, i = 1, 2, …, c • Definition: • Sample mean for class i: m_i = (1/n_i) Σ_{x of class i} x • Scatter matrix for class i: S_i = Σ_{x of class i} (x − m_i)(x − m_i)^T
Scatter Matrices • Total scatter matrix: S_T = Σ_{j=1..n} (x_j − m)(x_j − m)^T, where m is the mean of all samples • Within-class scatter matrix: S_W = Σ_{i=1..c} S_i • Between-class scatter matrix: S_B = Σ_{i=1..c} n_i (m_i − m)(m_i − m)^T, so that S_T = S_W + S_B
Multiple Discriminant Analysis • We seek vectors wi, i = 1, 2, …, c−1 • And project the samples x to the (c−1)-dimensional space via y = W^T x • The criterion for W = (w1, w2, …, w_{c−1}) is to maximize J(W) = |W^T S_B W| / |W^T S_W W|
Multiple Discriminant Analysis (Cnt’d) • Consider the Lagrangian L = w^T S_B w − λ ( w^T S_W w − 1 ) • Take the partial derivative ∂L/∂w = 2 S_B w − 2 λ S_W w • Setting the derivative to zero, we obtain the generalized eigenvalue problem S_B w = λ S_W w
Multiple Discriminant Analysis (Cnt’d) • Find the roots of the characteristic function det(S_B − λ S_W) = 0 as eigenvalues, and then solve (S_B − λ_i S_W) w_i = 0 for the w_i corresponding to the largest c−1 eigenvalues
LDA Prototypes • The prototype of each class is the mean of the projected samples of that class, where the projection is through the matrix W • In the testing phase: • All test samples are projected through the same optimal W • The nearest prototype is the winner
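A minimal Python/NumPy sketch of building W from the scatter matrices, forming the projected class-mean prototypes, and classifying by nearest prototype; the function names, the use of scipy.linalg.eigh for the generalized eigenproblem, and the small ridge added to S_W are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lda_prototypes(X, y, n_classes):
    n, d = X.shape
    m = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for i in range(n_classes):
        Xi = X[y == i]
        mi = Xi.mean(axis=0)
        S_W += (Xi - mi).T @ (Xi - mi)               # within-class scatter
        S_B += len(Xi) * np.outer(mi - m, mi - m)    # between-class scatter
    # Generalized eigenproblem S_B w = lambda S_W w; keep the c-1 largest eigenvectors.
    eigvals, eigvecs = eigh(S_B, S_W + 1e-6 * np.eye(d))
    W = eigvecs[:, np.argsort(eigvals)[::-1][: n_classes - 1]]
    # Prototype of each class: mean of its projected samples.
    prototypes = np.array([(X[y == i] @ W).mean(axis=0) for i in range(n_classes)])
    return W, prototypes

def lda_classify(X_test, W, prototypes):
    Y = X_test @ W  # project test samples through the same optimal W
    dists = np.linalg.norm(Y[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)  # nearest prototype wins
```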
K-NN Classifier • For each test sample x, find the nearest K training samples and classify x according to the vote among the K neighbors • Asymptotically (as n → ∞), the error rate E of the nearest-neighbor rule (K = 1) satisfies E ≤ E* ( 2 − (c / (c−1)) E* ), where E* is the Bayes error and c is the number of classes • This shows that the error rate is at most twice the Bayes error
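A minimal Python/NumPy sketch of the K-NN vote; the function name and the tie-breaking choice are illustrative assumptions, and integer class labels 0..c−1 are assumed.

```python
import numpy as np

def knn_classify(X_train, y_train, X_test, K=5):
    """Classify each test sample by a majority vote among its K nearest neighbors."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        neighbors = y_train[np.argsort(dists)[:K]]
        # Majority vote; ties are broken in favor of the smallest class label.
        preds.append(np.bincount(neighbors).argmax())
    return np.array(preds)
```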
LVQ Algorithm • Initialize R prototypes for each class: m1(k), m2(k), …, mR(k), where k = 1, 2, …, K • Draw a training sample x and find the nearest prototype mj(k) to x • If x and mj(k) match in class type, move the prototype toward x: mj(k) ← mj(k) + ε ( x − mj(k) ) • Otherwise, move it away from x: mj(k) ← mj(k) − ε ( x − mj(k) ) • Repeat step 2, decreasing ε at each iteration
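A minimal Python/NumPy sketch of the LVQ updates; the per-class initialization from training samples, the epoch structure, and the linear decay of ε are illustrative assumptions.

```python
import numpy as np

def lvq_train(X, y, n_classes, R=3, epochs=20, eps0=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize R prototypes per class from that class's own samples.
    protos, proto_labels = [], []
    for k in range(n_classes):
        Xk = X[y == k]
        protos.extend(Xk[rng.choice(len(Xk), size=R, replace=False)])
        proto_labels.extend([k] * R)
    protos = np.array(protos, dtype=float)
    proto_labels = np.array(proto_labels)
    eps = eps0
    for epoch in range(epochs):
        for idx in rng.permutation(len(X)):
            x, label = X[idx], y[idx]
            # Step 2: find the nearest prototype to x.
            j = np.linalg.norm(protos - x, axis=1).argmin()
            if proto_labels[j] == label:
                protos[j] += eps * (x - protos[j])   # same class: move toward x
            else:
                protos[j] -= eps * (x - protos[j])   # different class: move away from x
        eps = eps0 * (1 - (epoch + 1) / epochs)       # decrease the learning rate
    return protos, proto_labels
```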