Dimension Reduction

Dimension Reduction Examples:
1. DNA MICROARRAYS: Khan et al. (2001): 4 types of small round blue cell tumors (SRBCT):
• Neuroblastoma (NB)
• Rhabdomyosarcoma (RMS)
• Ewing family of tumors (EWS)
• Burkitt lymphomas (BL)
Arrays: Training set = 63 arrays (23 EWS, 20 RMS, 12 NB, 8 BL). Testing set = 25 arrays (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other).
Genes: 2308 genes were selected because they met a minimal level of expression.
2. PLASTIC EXPLOSIVES: The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the discrete x-components of the X-ray absorption spectrum. The objective is to detect the suitcases with explosives. 2993 suitcases were used for training and 60 for testing. (See web page for dataset.)

Covariance vs. Correlation Matrix
• Use the covariance or the correlation matrix? If the variables are not in the same units => use correlations.
• Dim(V) = Dim(R) = p x p, and if p is large => dimension reduction.
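The units argument can be checked numerically. The sketch below is illustrative only: the height and weight data are made up, and the names X, X_cm, V and R are assumptions introduced here. It changes the units of one variable and compares the two matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations of 2 variables (height in metres, weight in kg).
height_m = rng.normal(1.7, 0.1, size=50)
weight_kg = 60 + 40 * (height_m - 1.7) + rng.normal(0, 5, size=50)
X = np.column_stack([height_m, weight_kg])

# Same data with height expressed in centimetres.
X_cm = X * np.array([100.0, 1.0])

V = np.cov(X, rowvar=False)           # p x p covariance matrix
V_cm = np.cov(X_cm, rowvar=False)
R = np.corrcoef(X, rowvar=False)      # p x p correlation matrix
R_cm = np.corrcoef(X_cm, rowvar=False)

print(V.shape, R.shape)               # both are p x p
print(np.allclose(V, V_cm))           # does the covariance survive the unit change?
print(np.allclose(R, R_cm))           # does the correlation survive it?
```

The covariance matrix depends on the units chosen, while the correlation matrix does not, which is why correlations are preferred for variables measured on different scales.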

Sample Correlation Matrix Scatterplot Matrix

Principal Components: Geometrical Intuition
• The data cloud is approximated by an ellipsoid.
• The axes of the ellipsoid represent the natural components of the data.
• The length of each semi-axis represents the variability of the corresponding component.
[Figure: data cloud in the (Variable X1, Variable X2) plane with the axes Component 1 and Component 2.]

DIMENSION REDUCTION
• When some of the components show very small variability, they can be omitted.
• The graph shows that Component 2 has low variability, so it can be removed.
• The dimension is reduced from dim = 2 to dim = 1.
[Figure: data cloud in the (Variable X1, Variable X2) plane with the axes Component 1 and Component 2.]
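A minimal sketch of this reduction, assuming a made-up two-variable data cloud that is elongated along one direction. The variable names and the simulated data are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D cloud: large spread along one direction, small along the other.
t = rng.normal(0, 3.0, size=200)
noise = rng.normal(0, 0.3, size=200)
X = np.column_stack([t + noise, t - noise])

Xc = X - X.mean(axis=0)                     # centre the cloud
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]           # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                       # coordinates on Component 1 and Component 2
reduced = scores[:, :1]                     # keep Component 1 only: dim 2 -> dim 1

print("share of variance kept:", eigvals[0] / eigvals.sum())
```

Here almost all of the variability lies along Component 1, so the single column in reduced retains nearly all the information in the original two variables.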

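Linear Algebra
Linear algebra is useful to write computations in a convenient way.
Singular Value Decomposition: $X = U D V'$, where $X$ is $n \times p$, $U$ is $n \times p$, $D$ is $p \times p$ and $V$ is $p \times p$.
X centered => $S = V D^2 V'$ (all factors $p \times p$), up to the constant $1/(n-1)$ of the sample covariance.
Principal Components (PC): columns of $V$.
Eigenvalues (variances of the PCs): diagonal elements of $D^2$, up to the same constant.
Correlation Matrix: subtract from each column of X its mean, divide by its standard deviation, and calculate the covariance.
If p > n, then SVD: $X' = U D V'$ and $S = U D^2 U'$, where $X'$ is $p \times n$, $U$ is $p \times n$, $D$ is $n \times n$ and $V$ is $n \times n$.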
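The relation between the SVD of the centered data and the covariance matrix can be verified with NumPy. This is a sketch on simulated data (the matrix X and its dimensions are assumptions). It uses the sample covariance with divisor n - 1, so the variances of the PCs are the diagonal of D^2 divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # hypothetical data
Xc = X - X.mean(axis=0)                                  # centre each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)        # Xc = U diag(d) V'
V = Vt.T                                                 # columns of V are the PCs

S = np.cov(Xc, rowvar=False)                             # sample covariance, divisor n - 1
variances = d**2 / (n - 1)                               # variances of the PCs

print(np.allclose(S, V @ np.diag(variances) @ V.T))      # S = V D^2 V' / (n - 1)
print(np.allclose((Xc @ V).var(axis=0, ddof=1), variances))
```

The second check confirms that the data projected onto the columns of V have exactly these variances.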

PRINCIPAL COMPONENTS TABLE

Loadings (Comp.1 to Comp.7). Values are listed in the order printed; small loadings are left blank in the printout, so rows with fewer than seven values skip some components:
MURDER: 0.329, 0.588, 0.190, -0.217, 0.521, -0.377, 0.223
RAPE: 0.429, 0.182, -0.221, 0.299, 0.746, -0.285
ROBBERY: 0.392, 0.489, -0.590, -0.467, 0.190
ASSAULT: 0.395, 0.355, 0.606, -0.543, 0.217
BURGLARY: 0.435, -0.219, -0.228, -0.505, -0.673
LARCENY: 0.355, -0.380, -0.572, -0.227, 0.589
AUTO: 0.287, -0.546, 0.543, 0.424, 0.352, 0.145

Importance of components:
                        Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation      2.0436891  1.0763811  0.8621946  0.5664485  0.50353374
Proportion of Variance  0.5966664  0.1655138  0.1061971  0.0458377  0.03622089
Cumulative Proportion   0.5966664  0.7621802  0.8683773  0.9142150  0.95043587

Analysis:
• Dimension reduction: 2 components explain 76.2% of variability.
• First component: represents the sum or average of all crimes because the loadings are very similar. PC1 = violent crimes + non-violent crimes.
• Second component: the violent crimes (MURDER, RAPE, ROBBERY, ASSAULT) all have positive coefficients; the non-violent crimes (BURGLARY, LARCENY, AUTO) all have negative coefficients. PC2 = violent crimes - non-violent crimes.
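The "Importance of components" rows can be reproduced from the standard deviations alone. The sketch below assumes the analysis was run on the correlation matrix of the seven crime variables, so the total variance is p = 7. That assumption is not stated in the table, but it is consistent with the printed proportions.

```python
# Standard deviations of the first five components, copied from the table.
sdev = [2.0436891, 1.0763811, 0.8621946, 0.5664485, 0.50353374]
p = 7  # assumed total variance: seven standardized variables

variances = [s**2 for s in sdev]            # eigenvalues
proportion = [v / p for v in variances]     # share of total variance

cumulative = []
running = 0.0
for share in proportion:
    running += share
    cumulative.append(running)

for k, (prop, cum) in enumerate(zip(proportion, cumulative), start=1):
    print(f"Comp.{k}: proportion={prop:.4f} cumulative={cum:.4f}")
```

The second cumulative value is the 76.2% quoted in the analysis for two components.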

Geometrical Intuition
[Figure: a 45º rotation of the axes. Before rotation: PC1 = Violent + Non-Violent, PC2 = Violent - Non-Violent. After rotation: PC1 = Non-Violent, PC2 = Violent.]
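The 45º rotation can be written out explicitly. This sketch idealizes the two components as exact sum and difference directions in a (non-violent, violent) coordinate system. The equal weights are an assumption made for illustration, not the fitted loadings.

```python
import numpy as np

# Coordinates are (non-violent, violent). Idealised PC directions (assumption):
pc1 = np.array([1.0, 1.0]) / np.sqrt(2)     # violent + non-violent
pc2 = np.array([-1.0, 1.0]) / np.sqrt(2)    # violent - non-violent

theta = np.deg2rad(-45.0)                   # 45 degrees clockwise
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(rotation @ pc1)   # rotated PC1: lies along the non-violent axis
print(rotation @ pc2)   # rotated PC2: lies along the violent axis
```

After the rotation, each component lines up with a single group of crimes instead of a sum or a contrast of the two groups.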

Biplot
• Combination of two graphs into one:
• 1. Graph of the observations in the coordinates of the two principal components.
• 2. Graph of the variables projected onto the plane of the two principal components.
• The variables are represented as arrows, the observations as points or labels.

Variances and Biplot

Analysis after rotation:
• First component: non-violent crimes.
• Second component: violent crimes.

Principal components of 100 genes: PC2 vs. PC1.
(a) Cells are the observations, genes are the variables.
(b) Genes are the observations, cells are the variables.

Dimension reduction:
• Choosing the number of PCs, k:
• k components explain some percentage of the variance: 70%, 80%.
• k eigenvalues are greater than the average (1 when the correlation matrix is used).
• Scree plot: graph the eigenvalues, look for the last sharp decline, and choose k as the number of points above the cutoff.
• Test the null hypothesis that the last m eigenvalues are equal (0).
• The same idea can be applied to factor analysis.
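The first two rules are easy to automate. The following sketch uses hypothetical eigenvalues and helper names (k_by_variance and k_by_average are introduced here for illustration). The scree plot and the hypothesis test are not covered.

```python
def k_by_variance(eigenvalues, target):
    """Smallest k whose leading eigenvalues explain at least `target` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        running += value
        if running / total >= target:
            return k
    return len(eigenvalues)


def k_by_average(eigenvalues):
    """Number of eigenvalues strictly greater than their average."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value > average)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix, sorted in decreasing order.
eigenvalues = [2.5, 1.25, 0.75, 0.25, 0.25]

print(k_by_variance(eigenvalues, 0.70))   # components needed for 70% of the variance
print(k_by_variance(eigenvalues, 0.80))   # components needed for 80% of the variance
print(k_by_average(eigenvalues))          # eigenvalues above the average
```

The rules need not agree with each other, so in practice they are usually read together with the scree plot.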

• The top 5 eigenvalues explain 81% of variability.
• Five eigenvalues are greater than the average (2.5%).
• Scree plot.
• Test statistic is 4 significant for 6 and highly significant for 2.

More general biplots
• Graphical display of X in which two sets of markers are plotted.
• One set of markers, $a_1, \dots, a_G$, represents the rows of X.
• The other set of markers, $b_1, \dots, b_p$, represents the columns of X.
• For example: $X = U D V'$, with rank-2 approximation $X_2 = U_2 D_2 V_2'$.
• $A = U_2 D_2^{a}$ and $B = V_2 D_2^{b}$ with $a + b = 1$, so $X_2 = A B'$.
• The biplot is the graph of A and B together in the same graph.
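The construction of the two marker sets can be sketched with NumPy. The data matrix here is simulated and the choice a = b = 0.5 is only one possible split; any pair with a + b = 1 reproduces the same rank-2 approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))               # hypothetical G x p data matrix
X = X - X.mean(axis=0)                     # centre the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T     # leading two singular triplets
X2 = U2 @ np.diag(d2) @ V2.T               # rank-2 approximation of X

a = 0.5                                    # any split with a + b = 1 works
b = 1.0 - a
A = U2 * d2**a                             # row markers a_1, ..., a_G (one per row of X)
B = V2 * d2**b                             # column markers b_1, ..., b_p (one per column of X)

print(np.allclose(A @ B.T, X2))            # X2 = A B'
```

Plotting the rows of A as points and the rows of B as arrows in the same axes gives the biplot.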

Biplot of the first two principal components.
