1. Dimension Reduction and Feature Selection
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
2. Overview
Dimension Reduction
Correlation
Principal Component Analysis
Singular Value Decomposition
Feature Selection
Information Content
…
3. Dimension Reduction
As the number of attributes grows, the complexity of learning, clustering, and similar tasks grows exponentially
“Curse of dimensionality”
We need methods to reduce the number of attributes
Dimension reduction reduces the number of attributes without (directly) considering the relevance of each attribute.
Not really removing attributes, but combining/recasting them.
4. Correlation
A causal, complementary, parallel, or reciprocal relationship
The simultaneous change in value of two numerically valued random variables
So, if one attribute’s value changes in a predictable way whenever another one changes, why keep them both?
5. Correlation Analysis
Pearson's correlation coefficient:
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B}$
Positive means both increase simultaneously
Negative means one increases as other decreases
If $r_{A,B}$ has a large magnitude, A and B are strongly correlated, and one of the two attributes can be removed
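To make this concrete, here is a minimal sketch in R (the language used for the in-class examples later in the deck). The synthetic data frame and the 0.9 cutoff are illustrative assumptions, not from the original slides.

# Illustrative data: B tracks A almost perfectly, C is unrelated
set.seed(1)
A <- rnorm(100)
B <- 2 * A + rnorm(100, sd = 0.1)
C <- rnorm(100)
df <- data.frame(A, B, C)

# Pearson correlation matrix for all attribute pairs
r <- cor(df, method = "pearson")
round(r, 2)

# Drop one attribute from each pair whose |r| exceeds the cutoff
cutoff <- 0.9
drop <- colnames(df)[apply(upper.tri(r) & abs(r) > cutoff, 2, any)]
reduced <- df[, setdiff(colnames(df), drop)]   # keeps A and C, drops B

Here only B is dropped, since (A, B) is the only pair whose correlation exceeds the cutoff.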
7. Principal Component Analysis
Karhunen-Loève or K-L method
Combine the “essence” of the attributes to create a (hopefully) smaller set of variables that describe the data
An instance with k attributes is a point in k-dimensional space
Find c k-dimensional orthogonal vectors that best represent the data, where c ≤ k
These vectors are combinations of attributes.
8. Principal Component Analysis
Normalize the data
Compute c orthonormal vectors, which are the principal components
Sort in order of decreasing “significance”
Measured in terms of data variance
Can reduce the data dimension by choosing only the most significant principal components. An example using R was performed in class; this is done with the princomp function in the mva package.
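A minimal sketch along the lines of that in-class example. In current R, princomp lives in the base stats package (the old mva package was merged into stats); the USArrests data set is an illustrative stand-in, not the data used in class.

# PCA on a built-in data set; cor = TRUE normalizes the attributes first
pc <- princomp(USArrests, cor = TRUE)

# Components are already sorted by decreasing variance ("significance")
summary(pc)

# Keep only the two most significant components as the reduced data
reduced <- pc$scores[, 1:2]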
9. Singular Value Decomposition
One method for computing PCA
Let A be an m × n matrix. Then A can be written as the product of matrices
$A = U \Sigma V^T$
such that U is an m × n matrix, V is an n × n matrix, and $\Sigma$ is an n × n diagonal matrix with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$. Furthermore, the columns of U are orthonormal and V is an orthogonal matrix.
Note that there are slight modifications to the SVD (e.g., the full form, in which U is m × m), but in either case the singular values are the same.
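Here is a minimal sketch of the decomposition itself, using R's built-in svd function; the 2 × 3 matrix is chosen arbitrarily for illustration.

# SVD of a small illustrative matrix
A <- matrix(c(3, 2,  2,
              2, 3, -2), nrow = 2, byrow = TRUE)

s <- svd(A)   # list with u, d (the singular values), and v
s$d           # singular values, in decreasing order

# Reconstruct A from its factors: A = U diag(d) V^T
A_hat <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(A, A_hat)   # TRUE, up to floating-point error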
12. Singular Value Decomposition
The amount of variance captured by singular value $\sigma_k$ is
$f_k = \sigma_k^2 \big/ \sum_{i=1}^{n} \sigma_i^2$
The entropy of the data set is
$E = -\frac{1}{\log n} \sum_{k=1}^{n} f_k \log f_k$
If the entropy is 0, then the data is ordered and redundant (one dominant pattern). If it is 1, then the data is completely disordered (equal representation, no patterns).
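A short sketch computing both quantities from a matrix's singular values (reusing the illustrative matrix above):

# Singular values of the illustrative matrix
d <- svd(matrix(c(3, 2, 2, 2, 3, -2), nrow = 2, byrow = TRUE))$d

f <- d^2 / sum(d^2)            # fraction of variance per singular value
f_pos <- f[f > 0]              # drop zeros (0 * log 0 = 0 by convention)
entropy <- -sum(f_pos * log(f_pos)) / log(length(f))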
13. Feature Selection
Select the most “relevant” subset of attributes
Wrapper approach
Features are selected as part of the mining algorithm
Filter approach
Features selected before mining algorithm
Wrapper approach is generally more accurate but also more computationally expensive
14. Feature Selection
Feature selection is actually a search problem
We want to select the subset of features that gives the most accurate model
15. Feature Selection
Any search heuristic will work
Branch and bound
“Best-first” or A*
Genetic algorithms
etc.
The bigger problem is estimating the relevance of attributes without building a classifier.
16. Feature Selection
Using entropy (see the sketch after this list)
Calculate information gain of each attribute
Select the l attributes with the highest information gain
Removes attributes that are the same for all data instances
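A minimal sketch of information-gain ranking, assuming discrete attributes and a class column; the iris data, the three-bin discretization, and l = 2 are illustrative assumptions.

# Entropy of a discrete class vector
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                        # drop empty levels (0 * log 0 = 0)
  -sum(p * log2(p))
}

# Information gain of attribute x with respect to class y
info_gain <- function(x, y) {
  cond <- sum(sapply(split(y, x),
                     function(g) length(g) / length(y) * entropy(g)))
  entropy(y) - cond
}

# Rank the (discretized) iris attributes against the class
cls <- iris$Species
attrs <- lapply(iris[, 1:4], function(x) cut(x, breaks = 3))
gains <- sapply(attrs, info_gain, y = cls)
names(sort(gains, decreasing = TRUE))[1:2]   # keep the l = 2 best

An attribute with the same value for every instance splits nothing, so its gain is 0 and it is never selected.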
17. Feature Selection
Stepwise forward selection (see the sketch after this list)
Start with empty attribute set
Add “best” of attributes
Add “best” of remaining attributes
Repeat. Take the top l
Stepwise backward selection
Start with entire attribute set
Remove “worst” of attributes
Repeat until l are left.
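A minimal sketch of the forward variant. The score function is a hypothetical stand-in for whatever quality estimate is used (wrapper accuracy, information gain, correlation with the goal attribute, etc.); the mtcars example and l = 3 are illustrative.

# Greedy forward selection: grow the set one best attribute at a time
forward_select <- function(attrs, l, score) {
  selected <- character(0)
  remaining <- attrs
  while (length(selected) < l && length(remaining) > 0) {
    scores <- sapply(remaining, function(a) score(c(selected, a)))
    best <- remaining[which.max(scores)]
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

# Illustrative score: mean |correlation| of the subset with a goal attribute
score <- function(subset) mean(abs(cor(mtcars[, subset], mtcars$mpg)))
forward_select(setdiff(names(mtcars), "mpg"), l = 3, score = score)

Backward selection is symmetric: start from the full set and repeatedly drop the attribute whose removal hurts the score least.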
18. Feature Selection
Other methods
Sample the data, then build a model on a subset of the instances and attributes to estimate accuracy.
Select attributes with the most or least variance (see the sketch after this list)
Select attributes most highly correlated with goal attribute.
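For the variance criterion, a short sketch (the threshold of 1 is an arbitrary illustrative choice):

# Keep only attributes whose variance exceeds a chosen threshold
vars <- sapply(mtcars, var)
keep <- mtcars[, names(vars)[vars > 1]]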
What does feature selection provide you?
Reduced data size
Analysis of “most important” pieces of information to collect.