Effective Dimension Reduction with Prior Knowledge
Haesun Park
Division of Computational Science and Eng., College of Computing
Georgia Institute of Technology, Atlanta, GA
Joint work w/ Barry Drake, Peg Howland, Hyunsoo Kim, and Cheonghee Park
DIMACS, May 2007
Dimension Reduction
• Dimension Reduction for Clustered Data: Linear Discriminant Analysis (LDA), Generalized LDA (LDA/GSVD, regularized LDA), Orthogonal Centroid Method (OCM)
• Dimension Reduction for Nonnegative Data: Nonnegative Matrix Factorization (NMF)
• Applications: Text classification, Face recognition, Fingerprint classification, Gene clustering in Microarray Analysis, …
2D Representation: Utilize Cluster Structure if Known
[Figure: 2D representation of 150 x 1000 data with 7 clusters, LDA vs. SVD]
Dimension Reduction for Clustered Data: Measure for Cluster Quality
A = [a1 … an] : m x n, clustered data
Ni = items in class i, |Ni| = ni, total r classes
ci = centroid of class i, c = global centroid
• Within-class scatter: Sw = ∑1≤i≤r ∑j∈Ni (aj − ci)(aj − ci)^T
• Between-class scatter: Sb = ∑1≤i≤r ∑j∈Ni (ci − c)(ci − c)^T
• Total scatter: St = ∑1≤i≤n (ai − c)(ai − c)^T, with Sw + Sb = St
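As a concrete illustration, here is a minimal NumPy sketch of these scatter matrices; the demo matrix A and labels are invented, and column j of A is item aj:

```python
import numpy as np

def scatter_matrices(A, labels):
    """Within-class (Sw) and between-class (Sb) scatter of the columns of A."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)              # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for cls in np.unique(labels):
        Ai = A[:, labels == cls]
        ci = Ai.mean(axis=1, keepdims=True)        # class centroid
        D = Ai - ci
        Sw += D @ D.T                              # sum over j in Ni
        Sb += Ai.shape[1] * (ci - c) @ (ci - c).T  # ni copies of (ci-c)(ci-c)^T
    return Sw, Sb

A = np.random.rand(5, 30)                          # 30 items in 5 dimensions
labels = np.random.randint(0, 3, 30)               # r = 3 classes
Sw, Sb = scatter_matrices(A, labels)
Ac = A - A.mean(axis=1, keepdims=True)
assert np.allclose(Sw + Sb, Ac @ Ac.T)             # checks Sw + Sb = St
```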
Optimal Dimension Reducing Transformation
G^T : q x m maps y : m x 1 to G^T y : q x 1, q << m
High-quality clusters have small trace(Sw) and large trace(Sb)
Want G s.t. min trace(G^T Sw G) and max trace(G^T Sb G)
• max trace((G^T Sw G)^-1 (G^T Sb G)) : LDA (Fisher '36, Rao '48)
• max trace(G^T Sb G), G^T G = I : Orthogonal Centroid (Park et al. '03)
• max trace(G^T (Sw + Sb) G), G^T G = I : PCA (Pearson 1901, Hotelling '33)
• max trace(G^T A A^T G), G^T G = I : LSI (Deerwester et al. '90)
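For the Orthogonal Centroid criterion, one concrete construction (a sketch, not the authors' exact code) takes G as an orthonormal basis of the centroid matrix C = [c1 … cr] via its thin QR factorization, reusing the toy A and labels from the sketch above:

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """G with orthonormal columns spanning the class centroids."""
    C = np.column_stack([A[:, labels == cls].mean(axis=1)
                         for cls in np.unique(labels)])  # C : m x r
    Q, _ = np.linalg.qr(C)                               # thin QR: C = QR
    return Q                                             # G^T G = I, G : m x r

G = orthogonal_centroid(A, labels)
reduced = G.T @ A                                        # r-dimensional data
```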
Classical LDA (Fisher '36, Rao '48)
max trace((G^T Sw G)^-1 (G^T Sb G))
• G : leading (r−1) eigenvectors of Sw^-1 Sb
• Fails when m > n (undersampled): Sw is singular
• Sw = Hw Hw^T, Hw = [a1 − c1, a2 − c1, …, an − cr] : m x n
• Sb = Hb Hb^T, Hb = [√n1 (c1 − c), …, √nr (cr − c)] : m x r
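When Sw is nonsingular (enough samples, n > m), this criterion reduces to a symmetric-definite generalized eigenproblem; a minimal SciPy sketch, with Sw and Sb as computed in the earlier snippet:

```python
from scipy.linalg import eigh

def classical_lda(Sw, Sb, q):
    """Leading q eigenvectors of Sb x = lambda Sw x (Sw must be nonsingular)."""
    evals, evecs = eigh(Sb, Sw)        # ascending generalized eigenvalues
    return evecs[:, ::-1][:, :q]       # take the q largest; q = r - 1 for LDA
```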
LDA based on GSVD (LDA/GSVD) (Howland, Jeon, Park, SIMAX '03; Howland and Park, IEEE TPAMI '04)
Rewrite Sw^-1 Sb x = λ x as the generalized eigenproblem Sb x = λ Sw x, i.e. β² Hb Hb^T x = α² Hw Hw^T x
GSVD of the pair (Hb^T, Hw^T):
U^T Hb^T X = (Σb 0) and V^T Hw^T X = (Σw 0)
so that X^T Sb X = X^T Hb Hb^T X = diag(Σb², 0) and X^T Sw X = X^T Hw Hw^T X = diag(Σw², 0)
No inversion of Sw is required; classical LDA is a special case of LDA/GSVD
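A compact NumPy sketch of one way to realize LDA/GSVD without forming Sw^-1, via the SVD of the stacked matrix K = [Hb^T; Hw^T]; this follows the published construction in outline, with details such as the rank tolerance simplified:

```python
import numpy as np

def lda_gsvd(A, labels):
    """G: leading r-1 generalized singular vectors of the pair (Hb^T, Hw^T)."""
    m, n = A.shape
    classes = np.unique(labels)
    r = len(classes)
    c = A.mean(axis=1, keepdims=True)
    Hb = np.zeros((m, r))
    Hw = np.zeros((m, n))
    for i, cls in enumerate(classes):
        idx = np.where(labels == cls)[0]
        ci = A[:, idx].mean(axis=1, keepdims=True)
        Hb[:, [i]] = np.sqrt(len(idx)) * (ci - c)   # Sb = Hb Hb^T
        Hw[:, idx] = A[:, idx] - ci                 # Sw = Hw Hw^T
    K = np.vstack([Hb.T, Hw.T])                     # (r+n) x m
    P, s, Qt = np.linalg.svd(K)                     # K = P diag(s) Q^T
    t = int(np.sum(s > 1e-10 * s[0]))               # numerical rank of K
    U, sA, Wt = np.linalg.svd(P[:r, :t])            # separates the Sb part
    X = Qt[:t].T @ (np.diag(1.0 / s[:t]) @ Wt.T)    # X^T (Sb + Sw) X = I
    return X[:, : r - 1]                            # G : m x (r-1)
```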
Generalization of LDA for Undersampled Problems
• Regularized LDA (Friedman '89, Zhao et al. '99, …)
• LDA/GSVD: solution G = [X1 X2] (Howland, Jeon, Park '03)
• Solutions based on Null(Sw) and Range(Sb) (Chen et al. '00, Yu & Yang '01, Park & Park '03, …)
• Two-stage methods:
  – Face Recognition: PCA + LDA (Swets & Weng '96, Zhao et al. '99)
  – Information Retrieval: LSI + LDA (Torkkola '01)
• Mathematical equivalence (Howland and Park '03):
  PCA + LDA/GSVD = LDA/GSVD
  LSI + LDA/GSVD = LDA/GSVD
  More efficient: QRD + LDA/GSVD
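Among these, regularized LDA is the simplest to sketch: shift Sw by λI so the generalized eigenproblem is well posed even when Sw is singular. λ is a user-chosen parameter; λ = 1 matches the Yale experiment later in these slides:

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sw, Sb, q, lam=1.0):
    """Leading q eigenvectors of Sb x = lambda (Sw + lam*I) x."""
    evals, evecs = eigh(Sb, Sw + lam * np.eye(Sw.shape[0]))
    return evecs[:, ::-1][:, :q]
```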
QRD Preprocessing in Dim. Reduction (Distance-Preserving Dim. Reduction)
For undersampled data A : m x n, m >> n:
A = [Q1 Q2] [R; 0] = Q1 R, where Q1 : orthonormal basis for span(A)
Dimension reduction of A by Q1^T: Q1^T A = R : n x n
Q1^T preserves distances in the L2 norm:
|| ai ||2 = || Q1^T ai ||2 and || ai − aj ||2 = || Q1^T (ai − aj) ||2
and in the cosine distance: cos(ai, aj) = cos(Q1^T ai, Q1^T aj)
Applicable to PCA, LDA, LDA/GSVD, Isomap, LTSA, LLE, …
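A small sketch of this preprocessing step with made-up data, verifying that the thin QR reduction preserves L2 distances and cosines between columns:

```python
import numpy as np

A = np.random.rand(1000, 20)                 # undersampled: m >> n
Q1, R = np.linalg.qr(A)                      # thin QR: A = Q1 R, Q1 : m x n
B = Q1.T @ A                                 # reduced data (equals R), n x n

i, j = 0, 1
assert np.isclose(np.linalg.norm(A[:, i] - A[:, j]),
                  np.linalg.norm(B[:, i] - B[:, j]))   # L2 distance preserved

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos(A[:, i], A[:, j]), cos(B[:, i], B[:, j]))  # cosine too
```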
Text Classification with Dim. Reduction (Kim, Howland, Park, JMLR '03)
[Figure: classification accuracy (%) under two similarity measures, L2 norm and cosine]
Face Recognition on Yale Data (C. Park and H. Park, ICDM '04)

Dim. Red. Method                         Dim    kNN k=1     k=5    k=9
Full Space                               8586   79.4        76.4   72.1
LDA/GSVD                                 14     98.8 (90)   98.8   98.8
Regularized LDA (λ=1)                    14     97.6 (85)   97.6   97.6
Proj. to Null(Sw) (Chen et al., '00)     14     97.6 (84)   97.6   97.6
Transf. to Range(Sb) (Yu & Yang, '01)    14     89.7 (82)   94.6   91.5

Prediction accuracy in %, leave-one-out (and, in parentheses, average over 100 random splits)
Yale Face Database: 243 x 320 pixels = full dimension of 77760; 11 images/person x 15 people = 165 images
After preprocessing (3x3 averaging): 8586 x 165
Fingerprint Classification Results on NIST Fingerprint Database 4 (C. Park and H. Park, Pattern Recognition, 2005)
KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions

                              Rejection rate (%):  0      1.8    8.5
KDA/GSVD                                           90.7   91.3   92.8
kNN & NN (Jain et al., '99)                        –      90.0   91.2
SVM (Yao et al., '03)                              –      90.0   92.2

4000 fingerprint images of size 512 x 512
By KDA/GSVD, dimension reduced from 105x105 to 4
Nonnegativity-Preserving Dim. Reduction: Nonnegative Matrix Factorization (Paatero & Tapper '94, Lee & Seung, Nature '99, Pauca et al., SIAM DM '04, Hoyer '04, Lin '05, Berry '06, Kim and Park '06, …)
Given A : m x n with A >= 0 and k << min(m, n), find W : m x k and H : k x n with W >= 0 and H >= 0 s.t. A ≈ WH:
min ||A − WH||F subject to W >= 0, H >= 0
• NMF/ANLS: two-block coordinate descent method in bound-constrained optimization
• Iterate the following ANLS (Kim and Park, Bioinformatics, to appear):
  fixing W, solve minH>=0 ||WH − A||F
  fixing H, solve minW>=0 ||H^T W^T − A^T||F
• Any limit point is a stationary point (Grippo and Sciandrone '00)
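A minimal sketch of the NMF/ANLS iteration, solving each subproblem column-by-column with scipy.optimize.nnls; this is a simple stand-in for the faster block solvers used in the cited work:

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, iters=50, seed=0):
    """Alternating nonnegative least squares: A (>= 0, m x n) ~ W H."""
    m, n = A.shape
    W = np.random.default_rng(seed).random((m, k))
    H = np.zeros((k, n))
    for _ in range(iters):
        for j in range(n):                  # fix W: min_{H>=0} ||W H - A||_F
            H[:, j], _ = nnls(W, A[:, j])
        for i in range(m):                  # fix H: min_{W>=0} ||H^T W^T - A^T||_F
            W[i, :], _ = nnls(H.T, A[i, :])
    return W, H

A = np.abs(np.random.rand(40, 25))
W, H = nmf_anls(A, k=3)
print("relative residual:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```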
Why Nonnegativity Constraints?
• Better approximation vs. better representation/interpretation
• Given A : m x n and k < min(m, n):
• SVD gives the best approximation: min ||A − WH||F via A = UΣV^T, A ≈ Uk Σk Vk^T
• NMF gives a better representation/interpretation: min ||A − WH||F with W >= 0, H >= 0
• Nonnegativity constraints are physically meaningful: pixels in digital images, molecule concentrations in bioinformatics, signal intensities, visualization, …
• Interpretation of analysis results: nonsubtractive combinations of nonnegative basis vectors
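The contrast is easy to see numerically: the truncated SVD is the best rank-k approximation in the Frobenius norm (Eckart–Young) but its factors mix signs, while NMF factors stay nonnegative; a small sketch with invented data:

```python
import numpy as np

A = np.abs(np.random.rand(100, 40))           # nonnegative data
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # A ~ Uk Sk Vk^T, best rank-k
print("best rank-k error:", np.linalg.norm(A - Ak))
print("SVD basis has negative entries:", bool((U[:, :k] < 0).any()))
```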
Performance of NMF Algorithms
[Figure: relative residual vs. number of iterations for NMF/ANLS, NMF/MUR, and NMF/ALS on a zero-residual artificial problem, A : 200 x 50]
Recovery of Factors by SVD and NMF
[Figure: recovery of the factors W and H by SVD and NMF/ANLS, where A = W*H with A : 2500 x 28, W : 2500 x 3, H : 3 x 28]
Summary
• Effective algorithms for dimension reduction and matrix decompositions that exploit prior knowledge
• Design of new algorithms, e.g., for undersampled data
• Take advantage of prior knowledge for physically more meaningful modeling
• Storage and efficiency issues for massive-scale data
• Adaptive algorithms
• Applicable to a wide range of problems (text classification, facial recognition, fingerprint classification, gene class discovery in microarray data, protein secondary structure prediction, …)
Thank you!