Dimensionality Reduction
Another unsupervised task • Clustering, etc. -- all forms of data modeling • Trying to identify statistically supportable patterns in data • Another way of looking at it: reduce complexity of data • Clustering: 1000 data points → 3 clusters • Dimensionality reduction: reduce complexity of space in which data lives • Find low-dimensional projection of data
Objective functions • All learning methods depend on optimizing some objective function • Otherwise, can’t tell if you’re making any progress • Measures whether model A is better than B • Supervised learning: loss function • Difference between predicted and actual values • Unsupervised learning: model fit/distortion • How well does model represent data?
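To make the distinction concrete, here is a minimal sketch (not from the slides; the function names are mine) of the two flavors of objective in Python:

import numpy as np

def supervised_loss(y_pred, y_true):
    # loss function: difference between predicted and actual values (squared error)
    return np.mean((y_pred - y_true) ** 2)

def clustering_distortion(X, centers, assignment):
    # model fit/distortion: how well do the cluster centers represent the data?
    return np.mean(np.sum((X - centers[assignment]) ** 2, axis=1))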
The fit of dimensions • Given: Data set X={X1,...,XN} in feature space F • Goal: find a low-dimensional representation of data set • Projection of X into F’⊂F • That is: find g() such that g(X)∈F’ • Constraint: preserve some property of X as much as possible
Capturing classification • Easy “fit” function: keep aspects of data that make it easy to classify • Uses dimensionality reduction in conjunction with classification • Goal: find the g() for which the loss of the model learned on g(X) is minimized: g* = argmin over g of L(f_g), where f_g is the model the base learner fits to the projected data g(X)
Feature subset selection • Early idea: • Let g() select a subset of the features • E.g., if X=[X[1], X[2], ..., X[d]] • Then g(X)=[X[2], X[17], ..., X[k]] for k≪d • Tricky part: picking the indices to keep • Q: How many such index sets are possible? (A: C(d,k) of size k, 2^d overall — far too many to search exhaustively; see the sketch below)
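In numpy terms, g() is just column indexing (the indices here are arbitrary, for illustration):

import numpy as np
from math import comb

X = np.random.randn(1000, 100)   # N=1000 points, d=100 features
keep = [1, 16, 57]               # one candidate index set, k=3
gX = X[:, keep]                  # g(X): same points, only k dimensions
print(comb(100, 3), 2 ** 100)    # size-k index sets; index sets overall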
Wrapper method • Led to wrapper method for FSS • Kohavi et al. (KDD-1995, AIJ 97(1-2), etc.) • Core idea: use target learning algorithm as black-box subroutine • Wrap (your favorite) search for feature subset around black box
An example wrapper FSS • // hill-climbing search-based wrapper FSS • function wrapper_FSS_hill(X,Y,L,baseLearn) • // Inputs: data X, labels Y, loss function L, • // base learner method, baseLearn() • // Outputs: feature subset S, model fHat • S={}; // initialize: empty set • [Xtr,Ytr,Xtst,Ytst]=split_data(X,Y); • l=Inf; • do { • lLast=l; • nextSSet=extend_feature_set(S); // all one-feature extensions of S • foreach sp in nextSSet { • model=baseLearn(Xtr[sp],Ytr); • err=L(model(Xtst[sp]),Ytst); // evaluate on the same feature subset • if (err<l) { • l=err; • fHat=model; • S=sp; // keep best subset found so far • } • } • } while (l<lLast); • return [S,fHat];
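For reference, a runnable sketch of the same hill-climbing wrapper, using scikit-learn as the black-box base learner (the structure is mine; any classifier can be dropped in):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def wrapper_fss_hill(X, y, base_learner=DecisionTreeClassifier):
    # greedy forward selection, scoring each candidate subset on a held-out split
    Xtr, Xtst, ytr, ytst = train_test_split(X, y, random_state=0)
    S, best_err, best_model = [], np.inf, None
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):          # try extending S by one feature
            if j in S:
                continue
            sp = S + [j]
            model = base_learner().fit(Xtr[:, sp], ytr)
            err = np.mean(model.predict(Xtst[:, sp]) != ytst)  # 0/1 loss
            if err < best_err:
                best_err, best_model, best_sp = err, model, sp
                improved = True
        if improved:
            S = best_sp
    return S, best_model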
More general projections • FSS uses orthogonal projection onto an axis-aligned subspace • Essentially: drop some dimensions, keep others • Often useful to work with more general projection functions, g() • Example: linear projection: g(x) = Ax • Pick A to reduce dimension: k×d matrix, k≪d
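A minimal numpy illustration (the random A is a stand-in for a learned or chosen projection):

import numpy as np

d, k, N = 100, 3, 1000
X = np.random.randn(N, d)    # data, one point per row
A = np.random.randn(k, d)    # k×d projection matrix, k ≪ d
Z = X @ A.T                  # g(x) = Ax applied to every row: N×k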
The right linearity • How to pick A? • What property of the data do we want to preserve? • Typical answer: squared-error between each original data point and the reconstruction from its low-dimensional representation: J(A) = Σi ‖xi − AᵀAxi‖² (for centered data, with orthonormal rows of A) • Minimizing this leads to the method of principal component analysis (PCA), a.k.a. the Karhunen-Loève (KL) transform
PCA • Find mean of data: μ = (1/N) Σi xi • Find scatter matrix: S = Σi (xi − μ)(xi − μ)ᵀ • Essentially, denormalized covariance matrix • Find eigenvectors/eigenvalues of S: Sej = λjej • Take top k≪d eigenvectors: e1, ..., ek, sorted by decreasing λj • Form A from those vectors: A = [e1 ⋯ ek]ᵀ, so g(x) = A(x − μ)
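These steps translate directly to numpy; a sketch:

import numpy as np

def pca_project(X, k):
    # project rows of X onto the top-k principal components
    mu = X.mean(axis=0)                     # mean of data
    Xc = X - mu                             # center the data
    S = Xc.T @ Xc                           # scatter matrix (denormalized covariance)
    lam, E = np.linalg.eigh(S)              # eigenvalues ascending, eigenvectors in columns
    A = E[:, np.argsort(lam)[::-1][:k]].T   # rows of A = top-k eigenvectors
    return Xc @ A.T, A, mu                  # projected data g(X), plus A and μ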
Nonlinearity • The coolness of PCA: • Finds directions of “maximal variance” in data • Good for linear data sets • The downfall of PCA: • Lots of stuff in the world is nonlinear
LLE et al. • Leads to a number of methods for nonlinear dimensionality reduction (NLDR) • LLE, Isomap, MVU, etc. • Core idea behind all of them: • Look at a small “patch” on the surface of the data manifold • Make a low-dimensional, locally linear approximation to the patch • “Stitch together” all the local approximations into a global structure
Unfolding the swiss roll • [figure: 3-d swiss-roll data and its 2-d approximation]
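A sketch of reproducing that figure with scikit-learn's LLE implementation (parameter values are illustrative):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, random_state=0)        # 3-d data
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)  # local patches of 12 neighbors
Z = lle.fit_transform(X)   # 2-d approximation: local fits stitched into a global embedding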