This course covers dimension reduction techniques such as principal component analysis and singular value decomposition, as well as feature selection methods such as entropy-based selection and stepwise selection. It explores correlation analysis, how reducing the number of attributes lowers the complexity of learning, and how to select the subset of relevant attributes that yields the most accurate models.
Dimension Reduction and Feature Selection
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science, Marquette University
Overview • Dimension Reduction • Correlation • Principal Component Analysis • Singular Value Decomposition • Feature Selection • Information Content • …
Dimension Reduction • The complexity of learning, clustering, etc. grows exponentially with the number of attributes • The “curse of dimensionality” • We need methods to reduce the number of attributes • Dimension reduction reduces attributes without (directly) considering the relevance of each attribute • It does not really remove attributes, but combines or recasts them
Correlation • A causal, complementary, parallel, or reciprocal relationship • The simultaneous change in value of two numerically valued random variables • So, if one attribute’s value changes in a predictable way whenever another’s does, why keep them both?
Correlation Analysis • Pearson’s correlation coefficient: $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}$ • Positive means both increase simultaneously • Negative means one increases as the other decreases • If $r_{A,B}$ has a large magnitude, A and B are strongly correlated and one of the attributes can be removed
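A quick R sketch of this idea (the toy data frame d and the 0.9 cutoff are assumptions for illustration, not part of the original slides):

# Toy data: b is nearly a linear function of a, c is independent noise
d <- data.frame(a = 1:10,
                b = 2 * (1:10) + rnorm(10, sd = 0.1),
                c = rnorm(10))
r <- cor(d, method = "pearson")                      # pairwise correlation matrix
which(abs(r) > 0.9 & upper.tri(r), arr.ind = TRUE)   # strongly correlated pairs
# One attribute from each reported pair (here a or b) can be dropped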
Correlation Analysis • [Scatter plot: two attributes exhibiting a strong relationship]
Principal Component Analysis • Also called the Karhunen-Loève or K-L method • Combines the “essence” of the attributes to create a (hopefully) smaller set of variables that describe the data • An instance with k attributes is a point in k-dimensional space • Find c k-dimensional orthogonal vectors that best represent the data, with c ≤ k • These vectors are combinations of the original attributes
Principal Component Analysis • Normalize the data • Compute c orthonormal vectors, the principal components • Sort them in order of decreasing “significance,” measured in terms of data variance • Reduce the data dimension by keeping only the most significant principal components, as in the sketch below
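A minimal R sketch of these steps using the built-in prcomp (the iris data and the two-component cutoff are illustrative assumptions):

p <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # normalize the data
summary(p)             # components come sorted by decreasing variance
reduced <- p$x[, 1:2]  # keep only the two most significant components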
Singular Value Decomposition • One method for performing PCA • Let A be an m × n matrix. Then A can be written as the product $A = U \Sigma V^{T}$, where U is an m × n matrix, V is an n × n matrix, and $\Sigma$ is an n × n diagonal matrix with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$. Furthermore, the columns of U are orthonormal and V is an orthogonal matrix
Singular Value Decomposition
> x <- t(array(1:12, dim = c(3, 4)))   # a 4 x 3 matrix
> s <- svd(x)                          # s$u, s$d, s$v
> s$u
[,1] [,2] [,3]
[1,] -0.1408767 -0.82471435 -0.3128363
[2,] -0.3439463 -0.42626394 0.7522216
[3,] -0.5470159 -0.02781353 -0.5659342
[4,] -0.7500855 0.37063688 0.1265489
> s$v
[,1] [,2] [,3]
[1,] -0.5045331 0.76077568 -0.4082483
[2,] -0.5745157 0.05714052 0.8164966
[3,] -0.6444983 -0.64649464 -0.4082483
> (a <- diag(s$d))                     # diagonal matrix of singular values
[,1] [,2] [,3]
[1,] 25.46241 0.000000 0.000000e+00
[2,] 0.00000 1.290662 0.000000e+00
[3,] 0.00000 0.000000 8.920717e-16
Singular Value Decomposition • The fraction of variance captured by singular value $\sigma_i$ is $f_i = \sigma_i^2 / \sum_{j=1}^{n} \sigma_j^2$ • The entropy of the data set is $E = -\frac{1}{\log n} \sum_{i=1}^{n} f_i \log f_i$, which measures how evenly the variance is spread across the singular values
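Continuing the R session above, both quantities can be computed from s$d (a sketch; the variable names f and E are my own):

f <- s$d^2 / sum(s$d^2)                   # fraction of variance per singular value
f <- f[f > 0]                             # guard against log(0) for zero values
E <- -sum(f * log(f)) / log(length(s$d))  # normalized entropy in [0, 1]
# Here f[1] is near 1 and E is near 0: one component captures almost everything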
Feature Selection • Select the most “relevant” subset of attributes • Wrapper approach: features are selected as part of the mining algorithm • Filter approach: features are selected before the mining algorithm runs • The wrapper approach is generally more accurate but also more computationally expensive
Feature Selection • Feature selection is actually a search problem • We want to select the subset of features giving the most accurate model • [Figure: lattice of subsets of {a, b, c}, from {a,b,c} at the top, through the pairs {b,c}, {a,c}, {a,b}, down to the singletons {b}, {c}, {a}]
Feature Selection • Any search heuristic will work • Branch and bound • “Best-first” or A* • Genetic algorithms • etc. • The bigger problem is to estimate the relevance of attributes without building a classifier
Feature Selection • Using entropy • Calculate the information gain of each attribute • Select the l attributes with the highest information gain (sketched in R below) • This removes attributes that have the same value for all data instances, since their gain is zero
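A sketch of this filter in R (the entropy and info.gain helpers are hand-written here, and discretizing iris with cut into 3 bins is an illustrative assumption):

entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                # avoid log2(0)
  -sum(p * log2(p))
}
info.gain <- function(x, y) {  # gain of attribute x with respect to class y
  entropy(y) - sum(sapply(split(y, x),
    function(g) length(g) / length(y) * entropy(g)))
}
gains <- sapply(iris[, 1:4],
                function(a) info.gain(cut(a, 3), iris$Species))
names(sort(gains, decreasing = TRUE))[1:2]  # the l = 2 best attributes
# A constant attribute yields a single group, so its gain is exactly 0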
Feature Selection • Stepwise forward selection • Start with an empty attribute set • Add the “best” of the attributes • Add the “best” of the remaining attributes • Repeat; take the top l (see the sketch below) • Stepwise backward selection • Start with the entire attribute set • Remove the “worst” of the attributes • Repeat until l are left
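A greedy sketch of forward selection in R (score is a stand-in for whatever relevance estimate is used; reusing gains from the previous sketch is just for illustration):

forward.select <- function(attrs, score, l) {
  chosen <- character(0)
  while (length(chosen) < l) {
    rest <- setdiff(attrs, chosen)
    # add the attribute whose addition scores best
    best <- rest[which.max(sapply(rest, function(a) score(c(chosen, a))))]
    chosen <- c(chosen, best)
  }
  chosen
}
score <- function(subset) sum(gains[subset])  # placeholder scoring function
forward.select(names(iris)[1:4], score, l = 2)

Backward selection is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts the score least.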
Feature Selection • Other methods • Sample the data, and build a model on a subset of the data and attributes to estimate accuracy • Select the attributes with the most (or least) variance • Select the attributes most highly correlated with the goal attribute (both sketched below) • What does feature selection provide? • Reduced data size • An analysis of the “most important” pieces of information to collect
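For the variance and correlation filters, a quick R sketch (the toy data frame d and its goal column are assumptions for illustration):

d <- data.frame(a = rnorm(20), b = 10 * rnorm(20), c = rnorm(20))
d$goal <- d$a + rnorm(20, sd = 0.1)    # goal closely tracks attribute a
attrs <- d[, names(d) != "goal"]
names(sort(sapply(attrs, var), decreasing = TRUE))[1]  # most variance: b
r <- sapply(attrs, cor, y = d$goal)                    # correlation with the goal
names(sort(abs(r), decreasing = TRUE))[1]              # most correlated: a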