Dimension Reduction for Under-sampled High Dimensional Data
Haesun Park
Division of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, GA
(Joint work with Barry Drake, Peg Howland, Hyunsoo Kim, and Cheonghee Park)
KAIST, Korea, June 2007
Cluster Structure Preserving Dimension Reduction
• Algorithms: LDA and its generalizations, Orthogonal Centroid Method (OCM), extension to kernel-based nonlinear methods, adaptive dimension reduction algorithms
• Matrix Decompositions in Feature Extraction: QRD, SVD, symmetric EVD, generalized EVD, generalized SVD
• Applications: text classification, face recognition, fingerprint classification
• Other Problems and Methods: 2D extensions, Nonnegative Matrix Factorization
2D Representation: Utilize Cluster Structure if Known
[Figure: 2D representation of 150 x 1000 data with 7 clusters, LDA vs. SVD]
Clustered Data: Facial Recognition
AT&T (ORL) Face Database
• 400 frontal images = 40 persons x 10 images each, with variations in pose and facial expression
• Image size: 92 x 112
• Severely undersampled
[Figure: sample images of the 1st and 35th subjects]
Dimension Reduction of Clustered Data
[Diagram: original images/data (classes 1, 2, ..., i, ..., r) → data preprocessing → data matrix (m x n, new item m x 1) → dimension reducing transformation → lower dimensional representation (q x n, new item q x 1, q << m) → classification]
Want: a dimension reducing transformation that can be applied effectively across many application areas.
Measure for Cluster Quality
A = [a1 ... an] : m x n clustered data; Ni = set of items in class i, |Ni| = ni, with r classes in total; ci = centroid (average) of class i; c = global centroid.
(1) Within-class scatter matrix: Sw = ∑_{i=1..r} ∑_{j ∈ Ni} (aj − ci)(aj − ci)^T
(2) Between-class scatter matrix: Sb = ∑_{i=1..r} ∑_{j ∈ Ni} (ci − c)(ci − c)^T = ∑_{i=1..r} ni (ci − c)(ci − c)^T
(3) Total scatter matrix: St = ∑_{i=1..n} (ai − c)(ai − c)^T
NOTE: Sw + Sb = St
Trace of Scatter Matrices
trace(Sw) = ∑_{i=1..r} ∑_{j ∈ Ni} ||aj − ci||²
trace(Sb) = ∑_{i=1..r} ∑_{j ∈ Ni} ||ci − c||²
trace(St) = ∑_{i=1..n} ||ai − c||²
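To make the scatter-matrix definitions and trace identities above concrete, here is a minimal NumPy sketch (not from the slides; the toy data, shapes, and labels are illustrative) that forms Sw, Sb, and St for a small labeled data matrix and checks Sw + Sb = St.

```python
import numpy as np

m, n, r = 5, 12, 3                        # dimension, samples, classes
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))           # columns are data items a_j
labels = np.repeat(np.arange(r), n // r)  # class id of each column

c = A.mean(axis=1, keepdims=True)         # global centroid
Sw = np.zeros((m, m)); Sb = np.zeros((m, m))
for i in range(r):
    Ai = A[:, labels == i]                # items in class i
    ci = Ai.mean(axis=1, keepdims=True)   # class centroid
    Sw += (Ai - ci) @ (Ai - ci).T         # within-class scatter
    Sb += Ai.shape[1] * (ci - c) @ (ci - c).T   # between-class scatter
St = (A - c) @ (A - c).T                  # total scatter

assert np.allclose(Sw + Sb, St)           # Sw + Sb = St
assert np.isclose(np.trace(St), np.sum((A - c) ** 2))   # trace identity
```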
Optimal Dimension Reducing Transformation
G^T : q x m maps y : m x 1 to G^T y : q x 1, with q << m.
High quality clusters have small trace(Sw) and large trace(Sb).
Want: G s.t. min trace(G^T Sw G) and max trace(G^T Sb G)
• max trace((G^T Sw G)^{-1} (G^T Sb G))                    LDA (Fisher '36, Rao '48)
• max trace(G^T Sb G) subject to G^T G = I                 Orthogonal Centroid (Park et al. '03)
• max trace(G^T (Sw + Sb) G) subject to G^T G = I          PCA (Hotelling '33)
• max trace(G^T A A^T G) subject to G^T G = I              LSI (Deerwester et al. '90)
Some Matrix Decompositions (Golub and Van Loan, Matrix Computations, '96)
• EVD for a symmetric-definite pencil: A : m x m symmetric, B : m x m symmetric positive definite:
  A = Y Λ_A Y^T, B = Y Λ_B Y^T, with Y nonsingular and Λ_A, Λ_B diagonal.
• Generalized SVD (Van Loan '76): A : m x n with m >= n, B : p x n:
  A = U Σ_A X^T, B = V Σ_B X^T, with U^T U = I_m, V^T V = I_p, X nonsingular, Σ_A and Σ_B diagonal.
  Note: A^T A = X Σ_A^T Σ_A X^T and B^T B = X Σ_B^T Σ_B X^T.
• Generalized SVD (Paige & Saunders '81): A : m x n, B : p x n; defined for any two matrices with the same number of columns.
Classical LDA (Fisher '36, Rao '48)
max trace((G^T Sw G)^{-1} (G^T Sb G))
• G : leading (r − 1) eigenvectors of Sw^{-1} Sb
• Fails when m > n (undersampled), since Sw is singular.
• Sw = Hw Hw^T, where Hw = [a1 − c1, a2 − c1, ..., an − cr] : m x n
• Sb = Hb Hb^T, where Hb = [√n1 (c1 − c), ..., √nr (cr − c)] : m x r
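As a concrete illustration of the classical criterion, here is a small sketch (mine, not the authors' code) under the classical assumption that Sw is nonsingular, using the generalized symmetric-definite eigenproblem Sb x = λ Sw x; the function name and argument layout are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def classical_lda(Sw, Sb, r):
    """Leading (r-1) eigenvectors of Sw^{-1} Sb via the symmetric-definite
    pencil (Sb, Sw); requires Sw to be positive definite (nonsingular)."""
    evals, evecs = eigh(Sb, Sw)          # generalized EVD, ascending eigenvalues
    G = evecs[:, ::-1][:, :r - 1]        # keep the r-1 largest eigenvalues
    return G                             # columns span the LDA subspace
```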
LDA based on the GSVD (LDA/GSVD) (Howland, Jeon, Park, SIMAX '03; Howland and Park, IEEE TPAMI '04)
• Works regardless of the singularity of the scatter matrices.
• Replace Sw^{-1} Sb x = λ x by the generalized problem Sb x = λ Sw x, i.e., β² Hb Hb^T x = α² Hw Hw^T x.
• Apply the GSVD to the pair (Hb^T, Hw^T):
  U^T Hb^T X = (Σb  0),   V^T Hw^T X = (Σw  0),
  so that X^T Sb X = diag(Σb^T Σb, 0) and X^T Sw X = diag(Σw^T Σw, 0).
• G consists of the leading (r − 1) generalized singular vectors, i.e., the first (r − 1) columns of X.
• Classical LDA is a special case of LDA/GSVD.
Generalized SVD (Paige and Saunders '81)
Sb x = λ Sw x   ⟺   β² Hb Hb^T x = α² Hw Hw^T x,   λ = α²/β²,
where (αi, βi) are the generalized singular value pairs of (Hb^T, Hw^T), Σb = diag(αi), Σw = diag(βi):
X^T Sb X = diag(α1², ..., αt², 0, ..., 0),   X^T Sw X = diag(β1², ..., βt², 0, ..., 0).
Want G s.t. max trace(G^T Sb G) and min trace(G^T Sw G): take the columns of X with the largest ratios αi/βi.
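The construction above can be sketched in NumPy. The following follows the published LDA/GSVD outline (form Hb and Hw, take the SVD of the stacked matrix [Hb^T; Hw^T], then an SVD of its top block to recover the relevant columns of X), but the code itself, the rank tolerance, and the column-sample data layout are my assumptions, not the authors' implementation.

```python
import numpy as np

def lda_gsvd(A, labels, tol=1e-10):
    """Sketch of LDA/GSVD. A: m x n data matrix (columns = samples),
    labels: length-n class ids. Returns G (m x (r-1)), the leading
    generalized singular vectors of the pair (Hb^T, Hw^T)."""
    m, n = A.shape
    classes = np.unique(labels)
    r = len(classes)
    c = A.mean(axis=1, keepdims=True)                 # global centroid
    Hb = np.empty((m, r))
    Hw = np.empty((m, n))
    for k, cls in enumerate(classes):
        idx = np.flatnonzero(labels == cls)
        ck = A[:, idx].mean(axis=1, keepdims=True)    # class centroid
        Hb[:, k] = np.sqrt(idx.size) * (ck - c).ravel()
        Hw[:, idx] = A[:, idx] - ck
    K = np.vstack([Hb.T, Hw.T])                       # (r + n) x m
    P, s, Qt = np.linalg.svd(K, full_matrices=False)
    t = int(np.sum(s > tol * s[0]))                   # numerical rank of K
    U, alpha, Wt = np.linalg.svd(P[:r, :t])           # rotation for the Hb block
    X1 = (Qt[:t].T / s[:t]) @ Wt.T                    # first t columns of X
    return X1[:, :r - 1]                              # leading r-1 columns
```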
Generalizations of LDA for Undersampled Problems
• Regularized LDA (Friedman '89, Zhao et al. '99, ...) — see the sketch after this list
• LDA/GSVD: solution G = [X1 X2] (Howland, Jeon, Park '03)
• Solutions based on null(Sw) and range(Sb) (Chen et al. '00, Yu & Yang '01, Park & Park '03, ...)
• Two-stage methods:
  – Face recognition: PCA + LDA (Swets & Weng '96, Zhao et al. '99)
  – Information retrieval: LSI + LDA (Torkkola '01)
• Mathematical equivalence (Howland and Park '03):
  PCA + LDA/GSVD = LDA/GSVD,  LSI + LDA/GSVD = LDA/GSVD;
  a more efficient equivalent: QRD + LDA/GSVD
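For the first item above, a minimal sketch of regularized LDA (my own illustration; the regularization parameter lam is an assumption): adding λI to Sw makes the classical eigenproblem well posed even when Sw is singular.

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sw, Sb, r, lam=1.0):
    """Leading (r-1) eigenvectors of (Sw + lam*I)^{-1} Sb."""
    m = Sw.shape[0]
    evals, evecs = eigh(Sb, Sw + lam * np.eye(m))   # ascending eigenvalues
    return evecs[:, ::-1][:, :r - 1]
```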
Orthogonal Centroid (OC) Algorithm (Park, Jeon, Rosen, BIT '03)
Algorithm (see the sketch below):
1. Form the centroid matrix C = [c1, ..., cr] : m x r
2. Compute the QR decomposition of C: C = QR, Q : m x r
3. Reduce dimension by Q^T: y : m x 1 → Q^T y : r x 1
• Q solves max trace(G^T Sb G) subject to G^T G = I: trace(Q^T Sb Q) = trace(Sb)
• Needs only the QRD of C (m x r), not the EVD of Sb (m x m) nor the SVD of Hb.
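A minimal NumPy sketch of the Orthogonal Centroid algorithm as described above (the function name and column-sample data layout are my assumptions): form the centroid matrix, take its thin QR, and project with Q^T.

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """A: m x n data matrix (columns = samples); returns Q (m x r)."""
    classes = np.unique(labels)
    # centroid matrix C: one column per class
    C = np.column_stack([A[:, labels == cls].mean(axis=1) for cls in classes])
    Q, _ = np.linalg.qr(C)        # thin QR: C = Q R, Q has orthonormal columns
    return Q

# dimension reduction: project an m x 1 item y (or a whole data matrix)
# y_reduced = Q.T @ y
```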
Text Classification on Medline Data (Kim, Howland, Park, JMLR '03)
[Table: classification accuracy (%) on 5 classes, with L2-norm and cosine similarity measures]
Text Classification on Reuters Data (Kim, Howland, Park, JMLR '03)
[Table: classification accuracy (%) on 90 classes, with L2-norm and cosine similarity measures]
Face Recognition on AT&T Data
• Orthogonal Centroid: 88–96%
• LDA/GSVD: 90–98%
[Figure: query image with top three choices under Orthogonal Centroid and LDA/GSVD]
Classification accuracy using centroid and kNN (k = 1, 3, 5, 7) classifiers with the L2 norm; average of 100 runs with random splits into training and test data.
Face Recognition on Yale Data (C. Park and H. Park, ICDM '04)
Prediction accuracy in %, leave-one-out (and, in parentheses, average over 100 random splits):

Dim. Red. Method                      | Dim  | kNN k=1   | k=5  | k=9
Full Space                            | 8586 | 79.4      | 76.4 | 72.1
LDA/GSVD                              | 14   | 98.8 (90) | 98.8 | 98.8
Regularized LDA (λ=1)                 | 14   | 97.6 (85) | 97.6 | 97.6
Proj. to null(Sw) (Chen et al., '00)  | 14   | 97.6 (84) | 97.6 | 97.6
Transf. to range(Sb) (Yu & Yang, '01) | 14   | 89.7 (82) | 94.6 | 91.5

Yale Face Database: 243 x 320 pixels = full dimension of 77760; 11 images/person x 15 people = 165 images; after preprocessing (3x3 averaging): 8586 x 165.
Nonlinear Dimension Reduction by Kernel Functions
Ex. feature mapping Φ for 2D input x = (x1, x2):
Φ(x) = (x1², √2 x1 x2, x2²),   k(x, y) = <Φ(x), Φ(y)> = <x, y>²   (a polynomial kernel function)
Nonlinear Dimension Reduction by Kernel Functions
If k(x, y) satisfies Mercer's condition, then there is a mapping Φ to an inner product space such that k(x, y) = <Φ(x), Φ(y)>.
Mercer's condition for A = [a1, ..., an]: the kernel matrix K = [k(ai, aj)], 1 ≤ i, j ≤ n, is positive semi-definite.
Ex) RBF kernel function: k(ai, aj) = exp(−σ ||ai − aj||²)
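Two quick checks of the statements above (my own toy code, not from the slides): the explicit quadratic feature map reproduces the polynomial kernel, and an RBF kernel matrix built from data is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# explicit quadratic feature map vs. polynomial kernel <x, y>^2
phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)

# RBF kernel matrix of a data set A (columns = samples) is PSD
A = rng.standard_normal((5, 30))
d2 = ((A[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)   # pairwise ||ai - aj||^2
K = np.exp(-0.5 * d2)                                     # sigma = 0.5 (illustrative)
assert np.min(np.linalg.eigvalsh(K)) > -1e-10             # eigenvalues >= 0
```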
Kernel Orthogonal Centroid (KOC) (C. Park and H. Park, Pattern Recognition '04)
• Apply OC in the feature-mapped space Φ(A).
• Need the QRD of the centroid matrix C in Φ(A), but C is not explicitly available:
  C = [ (1/n1) ∑_{i ∈ N1} Φ(ai), ..., (1/nr) ∑_{i ∈ Nr} Φ(ai) ] = QR
• C^T C = M^T K M = R^T R, where K = [k(ai, aj)], 1 ≤ i, j ≤ n, and M : n x r averages each class (Mik = 1/nk for i ∈ Nk, 0 otherwise).
• For a new item y, the reduced representation is z = Q^T Φ(y) = R^{-T} C^T Φ(y), where
  C^T Φ(y) = [ (1/n1) ∑_{i ∈ N1} k(ai, y), ..., (1/nr) ∑_{i ∈ Nr} k(ai, y) ]^T,
  i.e., solve R^T z = C^T Φ(y) for z.
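A compact NumPy/SciPy sketch of the KOC computation above (my illustration; the RBF kernel, its parameter, and the function names are assumptions): build the class-averaging matrix M, the kernel matrix K, the Cholesky factor R of M^T K M, and map new items by a triangular solve.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(X, Y, sigma=0.5):
    """Kernel matrix k(x_i, y_j) = exp(-sigma * ||x_i - y_j||^2); columns = samples."""
    d2 = (X**2).sum(0)[:, None] + (Y**2).sum(0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-sigma * d2)

def koc(A, labels, Y, sigma=0.5):
    """Reduce new items Y (m x p) to r dimensions with Kernel Orthogonal Centroid."""
    classes = np.unique(labels)
    M = np.zeros((A.shape[1], len(classes)))
    for k, cls in enumerate(classes):
        idx = labels == cls
        M[idx, k] = 1.0 / idx.sum()                  # column k averages class k
    R = cholesky(M.T @ rbf(A, A, sigma) @ M)         # C^T C = M^T K M = R^T R (R upper)
    CtPhiY = M.T @ rbf(A, Y, sigma)                  # C^T Phi(y) for each column of Y
    return solve_triangular(R, CtPhiY, trans='T')    # solve R^T z = C^T Phi(y)
```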
Experimental Results: Musk Data (from UCI)
dim: 167, # of classes: 2, # of data: 6599

kNN k | OC    | KOC   | Kernel PCA (Scholkopf et al., 1999)
1     | 87.2% | 95.7% | 87.8%
15    | 88.5% | 96.0% | 89.2%
29    | 88.5% | 96.1% | 88.5%
Fingerprint Classification
Classes: Left Loop, Right Loop, Whorl, Arch, Tented Arch
Construction of directional images by DFT:
1. Compute directionality in a local neighborhood by FFT
2. Compute the dominant direction
3. Find the core point for unified centering of fingerprints within the same class
Fingerprint Classification Results on NIST Fingerprint Database 4 (C. Park and H. Park, Pattern Recognition, to appear)
KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions
4000 fingerprint images of size 512 x 512; by KDA/GSVD, dimension reduced from 105 x 105 to 4.

Classification accuracy (%) at rejection rate (%):
Method                       | 0    | 1.8  | 8.5
KDA/GSVD                     | 90.7 | 91.3 | 92.8
kNN & NN (Jain et al., '99)  | –    | 90.0 | 91.2
SVM (Yao et al., '03)        | –    | 90.0 | 92.2
Minimum Squared Error (MSE) Formulation and LDA/GSVD (Binary Class) (Duda et al. '01)
MSE: find w, w0 minimizing the squared error against the targets bi = n/n1 if ai ∈ class 1 and bi = −n/n2 if ai ∈ class 2:

  min over [w0; w] of || [1  a1^T; ... ; 1  an^T] [w0; w] − [b1; ...; bn] ||²

The resulting discriminant f(a) = w^T a + w0 = w^T (a − c) = α x^T (a − c), i.e., w is a scalar multiple of the LDA/GSVD solution x of Sb x = λ Sw x.
This relationship can be extended to the multi-class case to design an efficient algorithm for LDA/GSVD (which can also easily be made adaptive). (C. Park and H. Park '04; H. Kim, B. Drake, H. Park '06)
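A small NumPy sketch of the binary MSE formulation above (my illustration; the synthetic data and class sizes are assumptions): solve the least-squares problem with the n/n1 and −n/n2 targets and read off the discriminant direction w.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n1, n2 = 10, 30, 20
A = np.column_stack([rng.normal(0.0, 1.0, (m, n1)),     # class 1 samples
                     rng.normal(1.5, 1.0, (m, n2))])    # class 2 samples
n = n1 + n2
b = np.concatenate([np.full(n1, n / n1), np.full(n2, -n / n2)])   # MSE targets

# design matrix [1  a_i^T]; least-squares solution gives [w0; w]
D = np.column_stack([np.ones(n), A.T])
coef, *_ = np.linalg.lstsq(D, b, rcond=None)
w0, w = coef[0], coef[1:]
print("discriminant f(a) = w^T a + w0 with w =", np.round(w, 3))
```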
2D Extensions: 2D PCA, 2D SVD, 2D LDA (Yang et al., TPAMI '04; Ding & Ye '04; Ye et al., SIGKDD '04)
• A more natural way to capture characteristics of 2D images
• Requires less memory
• Can be computationally more efficient
Ex. 2D PCA: instead of the PCA total scatter St = ∑_{i=1..n} (ai − c)(ai − c)^T over vectorized images, use the image-domain scatter St = ∑_{i=1..n} (Ai − C)^T (Ai − C) over the image matrices Ai with mean image C, and project Yi = Ai G, where G holds leading eigenvectors of this 2D scatter matrix.
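A brief NumPy sketch of 2D PCA as summarized above (my illustration; the image stack layout and the number of kept components q are assumptions):

```python
import numpy as np

def pca_2d(images, q):
    """images: array of shape (n, h, w); returns G (w x q) and projections (n, h, q)."""
    C = images.mean(axis=0)                       # mean image
    diffs = images - C
    # 2D total scatter: sum_i (A_i - C)^T (A_i - C), a w x w matrix
    St = np.einsum('nhw,nhv->wv', diffs, diffs)
    evals, evecs = np.linalg.eigh(St)             # ascending eigenvalues
    G = evecs[:, ::-1][:, :q]                     # leading q eigenvectors
    Y = images @ G                                # Y_i = A_i G, shape (n, h, q)
    return G, Y
```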
Nonnegative Matrix Factorization (NMF) (Lee & Seung '01, Pauca et al. SIAM DM '04, Xu et al. SIGIR '03, ...)
Rank-reducing decomposition: given A : m x n, find H : m x r and W : r x n with r << min(m, n) s.t.
  min || A − H W ||_F
NMF: aij ≥ 0, hij ≥ 0, wij ≥ 0
Applications: image analysis, text mining, document clustering, ...
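One standard way to compute such a factorization is the Lee–Seung multiplicative update rule for the Frobenius-norm objective; the sketch below is my illustration (iteration count, random initialization, and the epsilon guard are assumptions, and this is only one of several NMF algorithms).

```python
import numpy as np

def nmf(A, r, iters=200, eps=1e-9, seed=0):
    """Approximate a nonnegative A (m x n) as H (m x r) @ W (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    H = rng.random((m, r)) + eps
    W = rng.random((r, n)) + eps
    for _ in range(iters):
        # Lee & Seung multiplicative updates for min ||A - H W||_F
        W *= (H.T @ A) / (H.T @ H @ W + eps)
        H *= (A @ W.T) / (H @ W @ W.T + eps)
    return H, W
```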
Summary / Future Research
• Effective algorithms for dimension reduction and matrix decompositions, applicable to a wide range of problems (text classification, facial recognition, fingerprint classification, missing-value estimation in microarray data, protein secondary structure prediction, ...)
• Current and future research:
  – Establish relationships among dimension reduction, feature selection, data reduction, classifier design, ...
  – Find better mathematical models
  – Design more efficient algorithms
  – Binary and multi-class gene selection methods based on dimension reduction and L1-norm minimization
  – Missing value estimation
Thank you!
Relationship between SVM and LDA/GSVD (Kim, Drake, and Park '04)
• Apply LDA/GSVD to the support vectors (D1, D2) only.
• For the support vectors, di^T w + w0 = ±1, i.e., Dw = y − w0·1; hence Hw is rank deficient and rank(Hb) = 1.
• St = D^T D − n c c^T and Sb = n1 (c1 − c)(c1 − c)^T + n2 (c2 − c)(c2 − c)^T
• (Sw + Sb) w = Sb w = 2 n1 n2 (c2 − c1)/n, so Sb w ≠ 0 and Sw w = 0.
• Applying the GSVD to (Hb^T, Hw^T): therefore, w from SVM = w from LDA/GSVD on (D1, D2).