Generally Discriminant Analysis. Presenter: 張志豪. Date: 2004/10/29
Outline • Introduction • Classification • Feature Space Transformation • Criterion
Introduction • Discriminant Analysis • discriminant (classification, predictor) • When the class distributions are known and a new sample arrives, the chosen discriminant method can be used to decide which class the sample belongs to • In practice we cannot know the true class distributions, so they must be estimated from the training data • The more training data we have, the more accurate the estimated distributions become • In the purely statistical setting, discriminant analysis is used only as a predictor; no feature space transformation is involved ?? • The basic formulation discriminates between two classes
Introduction • Components of Discriminant Analysis • Classification • Linear, Quadratic, Nonlinear (Kernel) • Feature Space Transformation • Linear • Criterion • ML, MMI, MCE
Introduction • Exposition • First, estimate the class distributions from the labeled data • In the feature space the class distributions overlap, so a feature space transformation is applied to change the feature space • Some criterion is used to find the most suitable transformation basis • When a new pattern arrives, the estimated (transformed) distributions are used as the predictor • Feature space transformation, illustrated with LDA
Classification - Outline • Linear Discriminant Analysis & Quadratic Discriminant Analysis • Linear Discriminant Analysis • Quadratic Discriminant Analysis • Problem • Practice • Flexible Discriminant Analysis • Linear Discriminant Analysis → Multi-Response Linear Regression • Parametric → Non-Parametric • Kernel LDA
Classification - Linear Discriminant Analysis • Classification • A simple application of Bayes' theorem gives us the class posterior probabilities (reconstruction below) • Assumption : • Each class follows a single Gaussian distribution.
Classification - Linear Discriminant Analysis • Classification (cont.) • In comparing two classes k and l, it is sufficient to look at the log-ratio (reconstruction below). Assumption : common covariance. Intuition : classifying between two classes.
Classification - Linear Discriminant Analysis • Classification (cont.) • This linear log-odds function implies that the decision boundary between classes k and l is linear in x. This is of course true for any pair of classes, so all the decision boundaries are linear. • The linear discriminant functions are reconstructed below.
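The formulas on the LDA slides above were images and did not survive extraction; the following is a sketch of the standard expressions they refer to, in the notation of Hastie, Tibshirani and Friedman (The Elements of Statistical Learning), rather than the slides' exact notation.

% Bayes' theorem with class densities f_k and priors \pi_k
\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}

% Log-ratio of two classes under Gaussian densities with a common covariance \Sigma
\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)}
  = \log\frac{\pi_k}{\pi_l}
  - \frac{1}{2}(\mu_k + \mu_l)^{\top}\Sigma^{-1}(\mu_k - \mu_l)
  + x^{\top}\Sigma^{-1}(\mu_k - \mu_l)

% Linear discriminant functions: classify x to the class with the largest \delta_k(x)
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log\pi_k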
Classification - Quadratic Discriminant Analysis • Classification • If the covariances are not assumed to be equal, then the convenient cancellations do not occur; in particular the pieces quadratic in x remain. • The quadratic discriminant functions (see the sketch below).
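As a hedged illustration of the two classification rules above, here is a minimal Python sketch; the names lda_score and qda_score are mine, and the class means, covariances and priors are assumed to have been estimated already.

import numpy as np

def lda_score(x, mu_k, Sigma_inv, prior_k):
    # Linear discriminant: delta_k(x) = x^T S^-1 mu_k - 1/2 mu_k^T S^-1 mu_k + log pi_k
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(prior_k)

def qda_score(x, mu_k, Sigma_k, prior_k):
    # Quadratic discriminant: class-specific covariance, so the term quadratic in x remains
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(prior_k)

# Classify x to the class whose score is largest, e.g.:
# pred = max(range(K), key=lambda k: lda_score(x, mus[k], Sigma_inv, priors[k]))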
Classification - LDA & QDA • Parameters • LDA : P^2 (one shared covariance) • QDA : J*P^2 (one covariance per class) • Accuracy • LDA is a mismatch when the class covariances actually differ. (Figure: circle size indicates how widely each class distribution is spread; the panels compare LDA and QDA.)
Classification - Problem • How do we use a linear discriminant when we have more than two classes? • There are two approaches : • Learn one discriminant function for each class • Learn a discriminant function for all pairs of classes => If c is the number of classes, in the first case we have c functions and in the second c(c-1)/2 functions. => In both cases we are left with ambiguous regions.
Classification - Problem • Ambiguous regions • To avoid them, we can use linear machines: we define c linear discriminant functions and choose the one with the highest value for a given x (see the sketch below).
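A minimal sketch of the linear-machine rule, assuming the c weight vectors and biases have already been trained (the names W and b are mine):

import numpy as np

def linear_machine_predict(x, W, b):
    # W: (c, P) weight matrix, b: (c,) bias vector -- one linear discriminant per class.
    # Evaluating all c discriminants and taking the largest removes the ambiguous
    # regions left by the one-per-class and pairwise schemes.
    scores = W @ x + b
    return int(np.argmax(scores))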
Classification - Conclusion : LDA & QDA • LDA • Classify x to the kth class (common covariance) when • its posterior probability is the largest • its linear discriminant function score is the largest • QDA • Classify x to the kth class (class-dependent covariance) when • its posterior probability is the largest • its quadratic discriminant function score is the largest
Classification - Practice : LDA • Idea • After the feature space transformation, the classes should be easier to discriminate linearly • Components • Classification • Linear decision boundaries • Feature Space Transformation • Linear : • Criterion • ML
Classification - Practice : LDA • Linear transformation • The likelihood stays the same up to a scale factor; only its scale changes.
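One way to read "the likelihood is the same, only the scale changes": assuming the transformation is an invertible linear map y = \theta^{\top} x (my reading of the missing figure), the change-of-variables formula gives

p_Y(y \mid k) = \frac{p_X(x \mid k)}{\lvert\det\theta\rvert}
\quad\Longrightarrow\quad
\log p_Y(y \mid k) = \log p_X(x \mid k) - \log\lvert\det\theta\rvert .

The term \log\lvert\det\theta\rvert is the same for every class k, so the ranking of the class likelihoods, and hence the classification decision, is unchanged; only the overall scale of the likelihood values moves.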
Classification - Practice : LDA • Maximum likelihood criterion => assumptions • Each class is a single Gaussian distribution • Class prior probabilities are equal • Diagonal and common (within-class) covariance • Dimensions that carry no classification information share one common (total-class) distribution • A proof is given by JHU in Appendix C
Classification - Practice : LDA • Intuition • T = B + W, where B is the between-class covariance and W is the within-class covariance; the transformation matrix projects the data into the subspace where the classes are best separated (sketch below).
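A minimal sketch of how such an LDA transformation matrix is commonly computed from B and W (as the leading generalized eigenvectors of W^-1 B); the function name and shapes are my own.

import numpy as np

def lda_transform(X, y, n_components):
    # X: (N, P) data, y: (N,) integer class labels.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    P = X.shape[1]
    B = np.zeros((P, P))   # between-class covariance
    W = np.zeros((P, P))   # within-class covariance
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        B += Xc.shape[0] * np.outer(mc - overall_mean, mc - overall_mean)
        W += (Xc - mc).T @ (Xc - mc)
    # Leading eigenvectors of W^-1 B maximize between-class scatter
    # relative to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-eigvals.real)
    theta = eigvecs[:, order[:n_components]].real   # (P, n_components)
    return theta                                    # project with X @ theta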
Classification - Practice : HDA • Idea • After the feature space transformation, the classes should be easier to discriminate quadratically • Components • Classification • Quadratic decision boundaries • Feature Space Transformation • Linear : • Criterion • ML
Classification - Practice : HDA • Maximum likelihood criterion => assumptions • Each class is a single Gaussian distribution • Class prior probabilities are equal • Diagonal covariance • Dimensions that carry no classification information share one common distribution • JHU uses the steepest-descent algorithm; Cambridge, using the semi-tied approach, is guaranteed to find a locally optimal solution and to be stable.
Classification - Practice : HDA • Intuition • T = B + W, where B is the between-class covariance and W is the within-class covariance; the transformation matrix is estimated under the heteroscedastic (class-dependent covariance) model. • Here HDA is worse than LDA.
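For reference, one common maximum-likelihood formulation of this heteroscedastic setting is the HLDA objective of Kumar and Andreou (the JHU work cited above); the following is my reconstruction of its usual form, with \theta_p the p discriminative rows of the full-rank transform \theta, \theta_{n-p} the remaining rows, W_j the covariance of class j, T the total covariance, and N_j the class counts.

\hat{\theta} = \arg\max_{\theta}\;
N\log\lvert\det\theta\rvert
- \frac{N}{2}\log\bigl\lvert\operatorname{diag}\bigl(\theta_{n-p}\,T\,\theta_{n-p}^{\top}\bigr)\bigr\rvert
- \sum_{j}\frac{N_j}{2}\log\bigl\lvert\operatorname{diag}\bigl(\theta_{p}\,W_j\,\theta_{p}^{\top}\bigr)\bigr\rvert

The diagonal constraint matches the diagonal-covariance assumption on the previous slide, and the T term corresponds to the dimensions assumed to carry no class information; the objective has no closed-form maximizer, hence the steepest-descent and semi-tied optimizations mentioned there.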
Classification - Practice • Problem • Why is a linear transformation effective? • Information theory • It is impossible to create new information by transforming data; transformations can only lead to information loss. => One finds the K < P dimensional subspace of R^P in which the group centroids are most separated. • Single multi-dimensional Gaussian • If each class can already be classified well when it is modeled by a single Gaussian, then intuitively it should be classified even better when each class is modeled by a mixture of Gaussians ?? • Is the observation probability the same thing as classification?
Classification - LDA : Linear Regression • Idea • Linear discriminant analysis is equivalent to multi-response linear regression using optimal scorings to represent the groups. • In this way, any multi-response regression technique can be post-processed to improve its classification performance. • We obtain nonparametric versions of discriminant analysis by replacing linear regression with any nonparametric regression method. • What is regression analysis : • It studies the relationships among variables, finds an appropriate mathematical equation to express those relationships, and then uses that equation for prediction • It predicts one variable from another. Regression analysis is built on correlation analysis, since the reliability of any prediction depends on the strength of the relationships among the variables
Classification - LDA : Linear Regression • Linear Regression • Suppose θ is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on X. • So we have to choose θ and β to minimize the squared residuals (reconstruction below). • It gives a one-dimensional separation between classes, with β obtained as the least-squares estimator.
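Reconstructing the missing criterion from the optimal-scoring formulation (following Hastie, Tibshirani and Buja; the notation is mine): with a scoring function \theta and coefficient vector \beta, choose

\min_{\theta,\,\beta}\;\sum_{i=1}^{N}\bigl(\theta(g_i) - x_i^{\top}\beta\bigr)^2 ,

where g_i is the class label of observation i; the fits x_i^{\top}\beta give the one-dimensional separation, with \beta obtained as the least-squares estimator.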
Classification - LDA : Linear Regression • Multi-Response Linear Regression • Independent scoring labels : • Linear maps : • The scores and the maps are chosen to minimize (1), reconstructed below. • x_i^T β_k is the value of the i-th observation projected onto the k-th direction; θ_k(g_i) is the score of the i-th observation's label in the k-th dimension (mean ??). The set of scores is assumed to be mutually orthogonal and normalized.
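Criterion (1), as I reconstruct it from the same optimal-scoring framework: with K independent scorings \theta_k and linear maps \eta_k(x) = x^{\top}\beta_k, the scores and maps are chosen to minimize the average squared residual

\mathrm{ASR} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N}\bigl(\theta_k(g_i) - x_i^{\top}\beta_k\bigr)^2 ,

where x_i^{\top}\beta_k is the projection of the i-th observation onto the k-th direction and \theta_k(g_i) is the k-th score of its label, matching the two annotations above.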
Classification - LDA : Linear Regression • Multi-Response Linear Regression (cont.) • It can be shown [Mardia79, Hastie95] that • the coefficient vectors β_k are equivalent, up to a constant, to the Fisher discriminant coordinates • the Mahalanobis distances can be derived from the ASR solutions • LDA can be performed by a sequence of linear regressions, followed by a classification in the space of fits (Mardia, Kent and Bibby, 1979)
Classification - LDA : Linear Regression • Multi-Response Linear Regression (cont.) • Let Y be the N*J indicator matrix corresponding to the dummy-variable coding for the classes. That is, the ij-th element of Y is 1 if the i-th observation falls in class j, and 0 otherwise. • Let Θ be the J*K matrix of K score vectors for the J classes. • Then YΘ is the N*K matrix of transformed class values, with ik-th element θ_k(g_i).
Classification - LDA : Linear Regression • Solution 1 • Looking at (1), it is clear that if the scores were fixed we could minimize ASR by regressing each score on x. • If we let P_X denote projection onto the column space of the predictors, this gives (2), reconstructed below.
Classification - LDA : Linear Regression • Solution 1 (cont.) • Assume the scores have mean zero, unit variance, and are uncorrelated over the N observations. • Minimizing (2) then amounts to finding the K largest eigenvectors Θ of Y^T P_X Y, with normalization Θ^T D_π Θ = I_K, where D_π is a diagonal matrix of the sample class proportions. • We could do this by constructing P_X, computing Y^T P_X Y, and then calculating its eigenvectors. But a more convenient approach avoids explicit construction of P_X and takes advantage of the fact that P_X computes a linear regression. • P_X is an (N*N) matrix, too large to construct explicitly.
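Equation (2) was also an image; my reconstruction, consistent with the description above, is

\mathrm{ASR}(\Theta) = \frac{1}{N}\,\operatorname{tr}\bigl[(Y\Theta)^{\top}(I - P_X)(Y\Theta)\bigr] ,

so that, under the normalization \Theta^{\top}D_{\pi}\Theta = I_K, minimizing (2) is equivalent to maximizing \operatorname{tr}\bigl[\Theta^{\top}Y^{\top}P_X Y\Theta\bigr], the stated eigenproblem on Y^{\top}P_X Y.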
Classification - LDA : Linear Regression • Solution 2 • Y : (N*J), the correct answers -> the relation between classes and observations • Ŷ : (N*J), the predicted results -> the relation between observations and classes • Y^T Ŷ : (J*J), a covariance-like matrix (??) -> the relation between classes and classes • B : (P*J), the coefficient matrix -> the relation between dimensions and classes • X : (N*P), the training data -> the relation between observations and dimensions • It turns out that LDA amounts to the regression followed by an eigen-decomposition of Y^T Ŷ.
Classification - LDA : Linear Regression • Solution 2 (cont.) • The final coefficient matrix B is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix; the scaling of each column is determined by the kth largest eigenvalue computed in step 3 above. • This yields the LDA transformation matrix.
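A hedged sketch of the "regression followed by eigen-decomposition" route described on the last few slides; lda_via_regression is my own name, X is assumed centered, and the result recovers the discriminant directions only up to the column scaling noted above.

import numpy as np

def lda_via_regression(X, y, n_components):
    # X: (N, P) centered predictors, y: (N,) integer labels in 0..J-1.
    N, P = X.shape
    J = int(y.max()) + 1
    Y = np.zeros((N, J))
    Y[np.arange(N), y] = 1.0                   # N*J indicator (dummy-variable) matrix

    # Step 1: multi-response linear regression of Y on X -> fitted values Y_hat.
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # B: (P, J) coefficient matrix
    Y_hat = X @ B

    # Step 2: eigen-decomposition of Y^T Y_hat gives the optimal scores.
    evals, evecs = np.linalg.eig(Y.T @ Y_hat)
    order = np.argsort(-evals.real)
    Theta = evecs[:, order[:n_components]].real   # (J, n_components) score vectors

    # Step 3: up to column scaling, B @ Theta are the discriminant directions.
    return B @ Theta                              # (P, n_components)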
Classification - Flexible Discriminant Analysis • Nonparametric version • We replace the linear-projection operator by a nonparametric regression procedure, which we denote by the linear operator S. • One simple and effective approach to this end is to expand X into a larger set of basis variables h(X), and then simply use the projection onto h(X) in place of P_X. • Wherever there is an inner-product computation, a kernel function can be substituted.
Classification - Flexible Discriminant Analysis • Non-Parametric Algorithm (cont.)
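The non-parametric algorithm itself was a figure and is not recoverable here; as an illustration of the basis-expansion idea from the previous slide only, one might expand X into h(X) and reuse the hypothetical lda_via_regression helper sketched earlier (polynomial_basis and flexible_da_fit are likewise my own names).

import numpy as np

def polynomial_basis(X, degree=2):
    # h(X): stack powers of each feature up to `degree` -- a simple, illustrative basis.
    return np.hstack([X ** d for d in range(1, degree + 1)])

def flexible_da_fit(X, y, n_components, degree=2):
    H = polynomial_basis(X, degree)
    H = H - H.mean(axis=0)                 # center the expanded features
    # Run the optimal-scoring regression in the expanded space h(X) instead of X.
    directions = lda_via_regression(H, y, n_components)
    return directions, H @ directions      # discriminant directions and projected data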
Classification - Kernel LDA • Linear Discriminant Analysis
Classification - Kernel LDA • Kernel Linear Discriminant Analysis
Classification - Kernel LDA • Kernel Linear Discriminant Analysis (cont.)
Classification - Kernel LDA • Kernel Linear Discriminant Analysis (cont.) • This problem can be solved by finding the leading eigenvector of N^-1 M. • The projection of a new pattern x onto w is given by a kernel expansion over the training patterns (see the sketch below).
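A hedged two-class sketch in the style of Mika et al.'s kernel Fisher discriminant, which these slides appear to follow: build M and N from the kernel matrix, take the leading eigenvector of N^-1 M as the coefficient vector alpha, and project a new pattern with y(x) = sum_i alpha_i k(x_i, x). The RBF kernel, the function names, and the regularization term are my own choices.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_fda_fit(X, y, gamma=1.0, reg=1e-3):
    # Two-class kernel Fisher discriminant: returns the expansion coefficients alpha.
    K = rbf_kernel(X, X, gamma)
    M_vecs, N_mat = [], np.zeros_like(K)
    for c in np.unique(y):
        Kc = K[:, y == c]                      # (N, N_c) kernel columns of class c
        Nc = Kc.shape[1]
        M_vecs.append(Kc.mean(axis=1))         # kernelized class mean
        H = np.eye(Nc) - np.full((Nc, Nc), 1.0 / Nc)
        N_mat += Kc @ H @ Kc.T                 # kernelized within-class scatter
    M = np.outer(M_vecs[0] - M_vecs[1], M_vecs[0] - M_vecs[1])   # between-class scatter
    N_mat += reg * np.eye(len(K))              # regularize for numerical stability
    evals, evecs = np.linalg.eig(np.linalg.solve(N_mat, M))      # leading eigvec of N^-1 M
    return evecs[:, np.argmax(evals.real)].real

def kernel_fda_project(X_train, alpha, x_new, gamma=1.0):
    # Projection of a new pattern: y(x) = sum_i alpha_i * k(x_i, x)
    k = rbf_kernel(X_train, x_new[None, :], gamma).ravel()
    return float(alpha @ k)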