Generally Discriminant Analysis Presenter: 張志豪 Date: 2004/10/29
Outline • Introduction • Classification • Feature Space Transformation • Criterion
Introduction • Discriminant Analysis • discriminant (classification, predictor) • When the class distributions are known and a new sample arrives, the chosen discriminant method can be used to decide which class the new sample belongs to. • In practice we cannot know the true class distributions, so we have to estimate them from the training data. • The more training data, the more accurate the estimates. • In the purely statistical setting, discriminant analysis is used only as a predictor and no feature space transformation is involved ?? • The classical formulation discriminates between two classes.
Introduction • Components of Discriminant Analysis • Classification • Linear, Quadratic, Nonlinear (Kernel) • Feature Space Transformation • Linear • Criterion • ML, MMI, MCE
Introduction • Exposition • First, estimate the class distributions from the labeled data. • In the feature space the class distributions overlap, so a feature space transformation is used to change the feature space. • A criterion is used to find the most suitable transformation basis. • Once a new pattern arrives, the estimated (transformed) distributions are used as the predictor. • (Figure: feature space transformation, using LDA as an example)
Classification: Outline • Linear Discriminant Analysis & Quadratic Discriminant Analysis • Linear Discriminant Analysis • Quadratic Discriminant Analysis • Problem • Practice • Flexible Discriminant Analysis • Linear Discriminant Analysis → Multi-Response Linear Regression • Parametric → Non-Parametric • Kernel LDA
Classification: Linear Discriminant Analysis • Classification • A simple application of Bayes theorem gives us the class posterior (reconstructed below). • Assumption: • each class follows a single Gaussian distribution.
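The formula image is missing here; the standard posterior (following Hastie, Tibshirani & Friedman, with class densities f_k and priors π_k) is presumably
\[ P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{J} f_l(x)\,\pi_l}, \qquad f_k(x) = \frac{1}{(2\pi)^{P/2}|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)\right). \]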
Classification: Linear Discriminant Analysis • Classification (cont.) • In comparing two classes k and l, it is sufficient to look at the log-ratio of their posteriors (written out below). • Assumption: common covariance. • Intuition: classify between two classes.
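The slide's equation is an image; under the common-covariance assumption the log-ratio presumably takes the standard form
\[ \log\frac{P(G=k \mid X=x)}{P(G=l \mid X=x)} = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k+\mu_l)^{T}\Sigma^{-1}(\mu_k-\mu_l) + x^{T}\Sigma^{-1}(\mu_k-\mu_l), \]
which is linear in x.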
Classification: Linear Discriminant Analysis • Classification (cont.) • This linear log-odds function implies that the decision boundary between classes k and l is linear in x. This is of course true for any pair of classes, so all the decision boundaries are linear. • The linear discriminant functions are reconstructed below.
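The formula image is missing; the standard linear discriminant function is presumably
\[ \delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k, \]
and x is assigned to the class with the largest δ_k(x).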
Classification: Quadratic Discriminant Analysis • Classification • If the covariances are not assumed to be equal, then the convenient cancellations do not occur; in particular the pieces quadratic in x remain. • The quadratic discriminant functions are reconstructed below.
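Again the formula image is missing; the standard quadratic discriminant function is presumably
\[ \delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k. \]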
ClassificationLDA & QDA • Parametric • LDA : P2 • QDQ : JP2 • Accuracy • LDA is mismatch. 圓圈的大小代表著分佈散設的程度 LDA QDA
Classification: Problem • How do we use a linear discriminant when we have more than two classes? • There are two approaches: • Learn one discriminant function for each class • Learn a discriminant function for all pairs of classes => If c is the number of classes, in the first case we have c functions and in the second c(c-1)/2 functions. => In both cases we are left with ambiguous regions.
Classification: Problem • Ambiguous regions • We can avoid them by using linear machines: • We define c linear discriminant functions and choose the one with the highest value for a given x, as in the sketch after this slide.
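A minimal numpy sketch of such a linear machine, under the LDA assumptions above (single Gaussian per class, shared covariance); the helper names fit_lda and lda_decision are hypothetical, not from the slides:

    import numpy as np

    def fit_lda(X, y):
        """Estimate class means, priors, and the pooled (common) covariance."""
        classes = np.unique(y)
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        priors = np.array([np.mean(y == c) for c in classes])
        # Pooled maximum-likelihood within-class covariance, shared by all classes.
        W = sum(np.cov(X[y == c], rowvar=False, bias=True) * np.sum(y == c)
                for c in classes) / len(y)
        return classes, means, priors, W

    def lda_decision(X, classes, means, priors, W):
        """Linear machine: evaluate c linear discriminant scores and pick the largest."""
        W_inv = np.linalg.inv(W)
        lin = X @ W_inv @ means.T  # x^T W^{-1} mu_k for every class k
        const = -0.5 * np.einsum('kd,de,ke->k', means, W_inv, means) + np.log(priors)
        scores = lin + const       # delta_k(x), one column per class
        return classes[np.argmax(scores, axis=1)]

    # Usage sketch: params = fit_lda(X_train, y_train); y_pred = lda_decision(X_test, *params)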
Classification: Conclusion: LDA & QDA • LDA • Classify x to the kth class (common covariance) when: • the posterior probability is the largest • the linear discriminant function score is the largest • QDA • Classify x to the kth class (class-dependent covariance) when: • the posterior probability is the largest • the quadratic discriminant function score is the largest
Classification: Practice: LDA • Idea • After the feature space transformation, the classes should be easier to discriminate linearly. • Components • Classification • Linear decision boundaries • Feature Space Transformation • Linear • Criterion • ML
Classification: Practice: LDA • Linear transformation • The likelihood is the same, but the scale is larger.
Classification: Practice: LDA • Maximum likelihood criterion => assumptions: • a single Gaussian distribution per class • equal class prior probabilities • diagonal and common (within-class) covariance • the dimensions that lack classification information share an identical distribution (total-class covariance) • (JHU gives a proof, Appendix C)
Classification: Practice: LDA • Intuition • T = B + W, where B is the between-class covariance, W is the within-class covariance, and θ denotes the transformation matrix (objective written out below).
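The objective image is missing; in terms of the quantities above, the LDA projection is presumably the usual determinant-ratio criterion
\[ \hat{\theta} = \arg\max_{\theta} \frac{|\theta^{T} T\, \theta|}{|\theta^{T} W\, \theta|} = \arg\max_{\theta} \frac{|\theta^{T} (B+W)\, \theta|}{|\theta^{T} W\, \theta|}, \]
i.e. maximize the total (between-class) scatter relative to the within-class scatter in the projected space.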
Classification: Practice: HDA • Idea • After the feature space transformation, the classes should be easier to discriminate with quadratic boundaries. • Components • Classification • Quadratic decision boundaries • Feature Space Transformation • Linear • Criterion • ML
Classification: Practice: HDA • Maximum likelihood criterion => assumptions: • a single Gaussian distribution per class • equal class prior probabilities • diagonal covariance • the dimensions that lack classification information share an identical distribution • JHU uses the steepest-descent algorithm; Cambridge uses the semi-tied approach, which is guaranteed to find a locally optimal solution and to be stable.
Classification: Practice: HDA • Intuition • T = B + W, where B is the between-class covariance, W is the within-class covariance, and θ denotes the transformation matrix. • In this illustration HDA performs worse than LDA.
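The HDA objective itself is an image on the slide; for reference, one commonly cited form (Saon et al., 2000), with per-class covariances W_j and class sample counts N_j, is
\[ \hat{\theta} = \arg\max_{\theta} \prod_{j} \left( \frac{|\theta^{T} B\, \theta|}{|\theta^{T} W_j\, \theta|} \right)^{N_j}, \]
which reduces to the LDA criterion when all the W_j are equal; presumably the slide shows this or an equivalent per-class form.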
Classification: Practice • Problem • Why is a linear transformation effective? • Information theory • It is impossible to create new information by transforming data; transformations can only lead to information loss. => One finds the K < P dimensional subspace of R^P in which the group centroids are most separated. • Single multi-dimensional Gaussian • If each class can already be classified well with a single Gaussian, then intuitively, modeling each class with a mixture of Gaussians should classify even better ?? • Is the observation probability the same thing as classification?
Classification: LDA as Linear Regression • Idea • Linear discriminant analysis is equivalent to multi-response linear regression using optimal scorings to represent the groups. • In this way, any multi-response regression technique can be post-processed to improve its classification performance. • We obtain nonparametric versions of discriminant analysis by replacing linear regression with any nonparametric regression method. • Regression analysis: • explores the relationships among variables, finds a suitable mathematical equation to express those relationships, and then uses that equation for prediction. • It predicts one variable from another; regression analysis builds on correlation analysis, since the reliability of any prediction depends on the strength of the relationship between the variables.
Classification: LDA as Linear Regression • Linear Regression • Suppose θ is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on X. • So we have to choose the scores θ and the coefficients β such that the squared residual below is minimized. • This gives a one-dimensional separation between classes. • Least squares estimator.
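The minimization itself is an image; in the standard optimal-scoring formulation (Hastie et al.) it is presumably
\[ \min_{\theta,\,\beta} \; \frac{1}{N}\sum_{i=1}^{N} \bigl(\theta(g_i) - x_i^{T}\beta\bigr)^{2}, \]
with a normalization on θ (for example zero mean and unit variance over the training samples) to rule out the trivial zero solution.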
Classification: LDA as Linear Regression • Multi-Response Linear Regression • Independent scoring labels θ_1, ..., θ_K • Linear maps η_k(x) = x^T β_k • The scores and the maps are chosen to minimize (1), the average squared residual, reconstructed below. • (Annotations: x_i^T β_k is the value of the ith observation projected onto the kth dimension; θ_k(g_i) is the score of the ith observation's label in the kth dimension (mean ??).) • The set of scores is assumed to be mutually orthogonal and normalized.
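Equation (1) is an image on the slide; in the usual notation it is presumably the average squared residual
\[ \mathrm{ASR} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N}\bigl(\theta_k(g_i) - x_i^{T}\beta_k\bigr)^{2}. \qquad (1) \]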
Classification: LDA as Linear Regression • Multi-Response Linear Regression (cont.) • It can be shown [Mardia79, Hastie95] that: • the fitted coefficient vectors are equivalent, up to a constant, to the Fisher discriminant coordinates; • the Mahalanobis distances can be derived from the ASR solutions. • LDA can be performed by a sequence of linear regressions, followed by a classification in the space of fits (Mardia, Kent and Bibby, 1979).
Classification: LDA as Linear Regression • Multi-Response Linear Regression (cont.) • Let Y be the N*J indicator matrix corresponding to the dummy-variable coding for the classes. That is, the ijth element of Y is 1 if the ith observation falls in class j, and 0 otherwise. • Let Θ be the J*K matrix whose columns are the K score vectors for the J classes. • Let Θ* = YΘ be the N*K matrix of transformed values of the classes, with ikth element θ_k(g_i).
Classification: LDA as Linear Regression • Solution 1 • Looking at (1), it is clear that if the scores were fixed we could minimize ASR by regressing the scored labels on x. • If we let P_X denote the projection onto the column space of the predictors, this gives (2), reconstructed below.
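Equation (2) is an image; after minimizing over the regression coefficients it is presumably
\[ \mathrm{ASR}(\Theta) = \frac{1}{N}\,\mathrm{tr}\!\left[\Theta^{T} Y^{T} (I - P_X)\, Y\, \Theta\right]. \qquad (2) \]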
Classification: LDA as Linear Regression • Solution 1 (cont.) • Assume the scores have mean zero, unit variance, and are uncorrelated over the N observations. • Minimizing (2) then amounts to finding the K largest eigenvectors of Y^T P_X Y, with normalization Θ^T D_π Θ = I, where D_π = Y^T Y / N is a diagonal matrix of the sample class proportions. • We could do this by constructing the projection P_X, computing Y^T P_X Y, and then calculating its eigenvectors. But a more convenient approach avoids explicit construction of P_X (an N*N matrix, far too large to build) and takes advantage of the fact that P_X Y computes a linear regression.
Classification: LDA as Linear Regression • Solution 2 • Y: (N*J), the ground-truth indicators, the relation between observations and classes • Ŷ: (N*J), the predicted values, the relation between observations and classes • Y^T Ŷ: (J*J), a covariance-like matrix ??, the relation between classes • B: (P*J), the coefficient matrix of the regression of Y on X, the relation between dimensions and classes • X: (N*P), the training data, the relation between observations and dimensions • It turns out that LDA amounts to the regression followed by an eigen-decomposition of Y^T Ŷ.
Classification: LDA as Linear Regression • Solution 2 (cont.) • The final coefficient matrix B is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix; the diagonal scale involves the kth largest eigenvalue computed in step 3 above. • This B plays the role of the LDA transformation matrix. A sketch of the whole recipe follows.
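A minimal numpy sketch of this regression-plus-eigen-decomposition recipe, assuming X is centered and that the normalization follows the optimal-scoring formulation above; lda_by_optimal_scoring is a hypothetical helper name:

    import numpy as np

    def lda_by_optimal_scoring(X, y, n_components):
        """LDA via multi-response regression followed by an eigen-decomposition."""
        N, P = X.shape
        classes = np.unique(y)
        J = len(classes)

        # 1) N x J indicator (dummy-variable) matrix Y.
        Yind = np.zeros((N, J))
        for j, c in enumerate(classes):
            Yind[y == c, j] = 1.0

        # 2) Multi-response linear regression of Y on X.
        B_hat, *_ = np.linalg.lstsq(X, Yind, rcond=None)    # (P x J) coefficients
        Yhat = X @ B_hat                                    # (N x J) fitted values

        # 3) Eigen-decomposition of Y^T Yhat with normalization Theta^T D_pi Theta = I,
        #    where D_pi = Y^T Y / N holds the sample class proportions.
        d_pi = Yind.sum(axis=0) / N
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d_pi))
        M = D_inv_sqrt @ (Yind.T @ Yhat / N) @ D_inv_sqrt
        eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)  # symmetric, so eigh is fine
        order = np.argsort(eigvals)[::-1][:n_components]
        Theta = D_inv_sqrt @ eigvecs[:, order]              # (J x K) optimal scores

        # 4) Final coefficients: discriminant directions, up to a diagonal scale.
        return B_hat @ Theta                                # (P x K) transformation

Projecting X with the returned matrix gives coordinates that agree with the Fisher discriminant coordinates up to the diagonal scaling mentioned above.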
Classification: Flexible Discriminant Analysis • Nonparametric version • We replace the linear-projection operator P_X by a nonparametric regression procedure, which we denote by the linear operator S. • One simple and effective approach to this end is to expand X into a larger set of basis variables h(X), and then simply use h(X) in place of X. • Wherever an inner-product computation appears, a kernel function can be substituted. • (A basis-expansion sketch follows.)
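A small sketch of the basis-expansion idea, reusing the hypothetical lda_by_optimal_scoring helper from the previous sketch; poly_basis is likewise a hypothetical helper, one of many possible choices of h(X):

    import numpy as np

    def poly_basis(X):
        """Quadratic basis expansion h(X): original features, squares, pairwise products."""
        N, P = X.shape
        squares = X ** 2
        pairs = [X[:, i] * X[:, j] for i in range(P) for j in range(i + 1, P)]
        cross = np.stack(pairs, axis=1) if pairs else np.empty((N, 0))
        return np.hstack([X, squares, cross])

    # Flexible discriminant analysis in this spirit: run the regression-based LDA
    # on the expanded features instead of on X itself.
    # W = lda_by_optimal_scoring(poly_basis(X_train), y_train, n_components=2)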
Classification: Flexible Discriminant Analysis • Non-Parametric Algorithm (cont.)
Classification: Kernel LDA • Linear Discriminant Analysis
Classification: Kernel LDA • Kernel Linear Discriminant Analysis
Classification: Kernel LDA • Kernel Linear Discriminant Analysis (cont.)
Classification: Kernel LDA • Kernel Linear Discriminant Analysis (cont.) • This problem can be solved by finding the leading eigenvector of N⁻¹M. • The projection of a new pattern x onto w is given below.
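The projection formula is an image on the slide; in the standard kernel Fisher discriminant formulation (Mika et al.), where the solution has the form w = Σ_i α_i φ(x_i) in the kernel-induced feature space, the projection of a new pattern is presumably
\[ \langle w, \phi(x)\rangle = \sum_{i=1}^{\ell} \alpha_i\, k(x_i, x), \]
with α the leading eigenvector of N⁻¹M and k the kernel function.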