Computational Intelligence for Information Selection. Filters and wrappers for feature selection and discretization methods. Włodzisław Duch Google: Duch. Concept of information. Information may be measured by the average amount of surprise of observing X (data, signal, object).
Computational Intelligence for Information Selection Filters and wrappers for feature selection and discretization methods. Włodzisław Duch Google: Duch
Concept of information
Information may be measured by the average amount of surprise of observing X (data, signal, object).
1. If P(X)=1 there is no surprise, so s(X)=0.
2. If P(X)=0 then this is a big surprise, so s(X)=∞.
3. If two observations X, Y are independent then P(X,Y)=P(X)P(Y), but the amount of surprise should be a sum, s(X,Y)=s(X)+s(Y).
The only suitable surprise function that fulfills these requirements is ...
The average amount of surprise is called information or entropy. Entropy is a measure of disorder; information is the change in disorder.
Information
Information derived from observations of a variable X (vector variable, signal or some object) that has n possible values is thus defined as:
$$H(X) = -\sum_{i=1}^{n} P(X^{(i)}) \log P(X^{(i)})$$
If the variable X is continuous with distribution P(x), an integral is taken instead of the sum:
$$H(X) = -\int P(x) \log P(x)\, dx$$
What type of logarithm should be used? Consider a binary event with P(X(1))=P(X(2))=0.5, like tossing a coin: how much do we learn each time? A bit. Exactly one bit. Taking lg2 will give:
$$H(X) = -\left(\tfrac{1}{2} \lg_2 \tfrac{1}{2} + \tfrac{1}{2} \lg_2 \tfrac{1}{2}\right) = 1 \text{ bit}$$
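The definition can be checked numerically. The sketch below assumes that the surprise of an outcome is the negative base-2 logarithm of its probability, so that entropy is the probability-weighted average of surprise; the function name `entropy` is illustrative, not taken from the slides.

```python
import math

def entropy(probs):
    """Average surprise in bits: -sum p * log2(p).

    Outcomes with p == 0 are skipped, following the convention 0 log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = entropy([0.5, 0.5])        # fair coin toss
certain = entropy([1.0, 0.0])     # outcome known in advance
four_way = entropy([0.25] * 4)    # four equally likely values
print(coin, certain, four_way)
```

A fair coin carries one bit per toss, a certain outcome carries none, and n equally likely values carry lg2 n bits.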
Distributions
For a scalar variable, P(X(i)) may be displayed in the form of a histogram, and information (entropy) calculated for each histogram.
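Joint information
Other ways of introducing the concept of information start from the number of bits needed to code a signal. Suppose now that two variables, X and Y, are observed. Joint information is:
$$H(X,Y) = -\sum_{i,j} P(X^{(i)}, Y^{(j)}) \log P(X^{(i)}, Y^{(j)})$$
For two independent features this is equal to:
$$H(X,Y) = H(X) + H(Y)$$
since:
$$P(X^{(i)}, Y^{(j)}) = P(X^{(i)})\,P(Y^{(j)}), \qquad \sum_i P(X^{(i)}) = \sum_j P(Y^{(j)}) = 1$$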
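Conditional information
If the value of the Y variable is fixed and X is not quite independent of it, conditional information (average “conditional surprise”) may be useful:
$$H(X|Y) = -\sum_j P(Y^{(j)}) \sum_i P(X^{(i)}|Y^{(j)}) \log P(X^{(i)}|Y^{(j)})$$
Prove that:
$$H(X|Y) = H(X,Y) - H(Y)$$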
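Mutual information
The information in one variable X that is shared with Y is:
$$MI(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{i,j} P(X^{(i)}, Y^{(j)}) \log \frac{P(X^{(i)}, Y^{(j)})}{P(X^{(i)})\,P(Y^{(j)})}$$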
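Kullback-Leibler divergence
If two distributions P and Q for the variable X are compared, their divergence is expressed by:
$$D_{KL}(P\|Q) = \sum_i P(X^{(i)}) \log \frac{P(X^{(i)})}{Q(X^{(i)})}$$
KL divergence is the expected value (under P) of the logarithm of the ratio of the two distributions. It is non-negative, but not symmetric, so it is not a distance. Mutual information is the KL divergence between the joint and the product (independent) distributions:
$$MI(X;Y) = D_{KL}\big(P(X,Y)\,\|\,P(X)P(Y)\big)$$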
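A minimal sketch of both quantities, assuming the standard definition of KL divergence as the sum of P log2(P/Q) and distributions given as plain lists of probabilities. The names `kl_divergence` and `mutual_information` are illustrative, and the code assumes Q is non-zero wherever P is non-zero.

```python
import math

def kl_divergence(p, q):
    """Sum of p_i * log2(p_i / q_i) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """MI as the KL divergence between a joint table and the product of its marginals.

    joint[i][j] is assumed to hold P(X = i, Y = j).
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [a * b for a in px for b in py]
    return kl_divergence(flat_joint, flat_product)

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))      # the two directions differ
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent variables
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # Y determined by X
```

The first line of output illustrates the lack of symmetry; the last two show the extremes of no shared information and one fully shared bit.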
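Joint mutual information
The information in two variables X1, X2 that is shared with Y is:
$$MI(X_1,X_2;Y) = H(X_1,X_2) + H(Y) - H(X_1,X_2,Y)$$
Efficient method to calculate joint MI:
$$MI(X_1,X_2;Y) = MI(X_1;Y) + MI(X_2;Y|X_1)$$
where the conditional mutual information is:
$$MI(X_2;Y|X_1) = H(X_2|X_1) - H(X_2|X_1,Y)$$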
Graphical relationships
[Figure: diagram relating H(X,Y), H(X), H(Y), MI(X;Y), H(X|Y) and H(Y|X)]
So the total joint information is (prove it!):
$$H(X,Y) = H(X|Y) + H(Y|X) + MI(X;Y)$$
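One possible proof of the decomposition H(X,Y) = H(X|Y) + H(Y|X) + MI(X;Y) is sketched below. It assumes the standard identities H(X|Y) = H(X,Y) - H(Y), H(Y|X) = H(X,Y) - H(X) and MI(X;Y) = H(X) + H(Y) - H(X,Y), and simply adds them up.

```latex
\begin{align*}
H(X|Y) + H(Y|X) + MI(X;Y)
  &= \big[H(X,Y) - H(Y)\big] + \big[H(X,Y) - H(X)\big] \\
  &\quad + \big[H(X) + H(Y) - H(X,Y)\big] \\
  &= H(X,Y)
\end{align*}
```

The marginal terms H(X) and H(Y) cancel, and of the three joint terms exactly one remains.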
Some applications of info theory
Information theory has many applications in different CI areas. Only a few will be mentioned, with visualization, discretization, and feature selection treated here in more detail.
Information gain has already been used in decision trees (ID3, C4.5) to define the gain of information by making a split:
$$G(A, S) = H(S) - \frac{|S_l|}{|S|} H(S_l) - \frac{|S_r|}{|S|} H(S_r)$$
for feature A, used to split node S into left Sl and right Sr sub-nodes, with classes w=(w1 ... wK), with
$$H(S) = -\sum_{i=1}^{K} P(w_i|S) \lg_2 P(w_i|S)$$
being the information contained in the class distribution w for vectors in the node S. Information is zero if all samples in the node are from one class (log 1 = 0, and 0 log 0 = 0), and maximum H(S)=lg2 K for a uniform distribution in K equally probable classes, P(wi|S)=1/K.
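A small sketch of the split gain, assuming the usual ID3/C4.5 form in which the entropies of the left and right sub-nodes are weighted by their share of the samples. The helper names `class_entropy` and `information_gain`, and the boolean `left_mask` marking which samples go left, are illustrative choices.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy in bits of the class distribution among the given labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left_mask):
    """H(S) minus the size-weighted entropies of the left and right sub-nodes."""
    left = [y for y, go_left in zip(labels, left_mask) if go_left]
    right = [y for y, go_left in zip(labels, left_mask) if not go_left]
    gain = class_entropy(labels)
    for part in (left, right):
        if part:  # an empty sub-node contributes nothing
            gain -= len(part) / len(labels) * class_entropy(part)
    return gain

labels = ["a", "a", "b", "b"]
print(information_gain(labels, [True, True, False, False]))  # split separates the classes
print(information_gain(labels, [True, False, True, False]))  # split mixes the classes
```

A split that isolates each class recovers the full node entropy as gain, while a split that leaves both sub-nodes as mixed as the parent gains nothing.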
Model selection
How complex should our model be? For example, what size of a tree, how many functions in a network, and what degree of the kernel? Crossvalidation is a good method but sometimes costly. Another way to optimize model complexity is to measure the amount of information necessary to specify the model and its errors. Simpler models make more errors; complex ones need a longer description.
Minimum Description Length for model + errors (very simplified). General intuition: learning is compression, finding simple models and regularities in the data (ability to compress). Therefore estimate:
L(M) = how many bits of information are needed to transmit the model.
L(D|M) = how many bits are needed to transmit information about the data, given M.
Minimize L(M)+L(D|M). Data correctly handled need not be transmitted. Estimations of L(M) are usually nontrivial.
More on model selection
Many criteria for information-based model selection have been devised in computational learning theory; the two best known are:
AIC, Akaike Information Criterion.
BIC, Bayesian Information Criterion.
The goal is to predict, using training data, which model has the best potential for accurate generalization. Although model selection is a popular topic, applications are relatively rare and selection via crossvalidation is commonly used.
Models may be trained by maximizing the mutual information between outputs and classes. The mapping from inputs to outputs may in general be any non-linear transformation, for example implemented via basis set expansion methods.
Visualization via max MI
The linear transformation that maximizes mutual information between a set of class labels and a set of new input features Yi, i=1..d’ < d is:
$$Y = W X$$
Here W is a d’ x d dimensional matrix of mixing coefficients. Maximization proceeds using gradient-based iterative methods.
[Figure: left, FDA view of 3 clusters; right, linear MI view with better separation]
FDA does not perform well if distributions are multimodal: separating the means may lead to overlapping clusters.
More examples
Torkkola (Motorola Labs) has developed the MI-based visualization; see more examples at: http://members.cox.net/torkkola/mmi.html
Example: the Landsat Image data contain 36 features (spectral intensities of a 3x3 submatrix of pixels in 4 spectral bands) that are used to classify 6 types of land use; 1500 samples were used for visualization.
Left: FDA, right: MI, note violet/blue separation.
Movie 1: Reuters Movie 2: Satimage (local only)
Classification in the reduced space is more accurate.
Feature selection and attention
• Attention: basic cognitive skill; without attention learning would not have been possible. First we focus on some sensation (visual object, sounds, smells, tactile sensations) and only then is the full power of the brain used to analyze this sensation.
• Given a large database, to find relevant information you may:
  • discard features that do not contain information,
  • use weights to express their relative importance,
  • reduce dimensionality by aggregating information, making linear or non-linear combinations of subsets of features (FDA, MI); new features may not be as understandable as the original ones,
  • create new, more informative features, introducing new higher-level concepts; this is usually left to human invention.
Ranking and selection
• Feature ranking: treat each feature as independent, and compare them to determine the order of relevance or importance (rank). Note that:
  • Several features may have identical relevance, especially for nominal features. This is either by chance, or because features are strongly correlated and therefore redundant.
  • Ranking depends on what we look for; for example, rankings of cars from the point of view of comfort, usefulness in the city, or rough terrain performance will be quite different.
• Feature selection: search for the best subsets of features, remove redundant features, create subsets of one, two, ..., k best features.
Filters and wrappers
• Can ranking or selection be universal, independent of the particular decision system used? Some features are quite irrelevant to the task at hand.
• Feature filters are model-independent, universal methods based on some criteria to measure relevance, for information filtering. They are usually computationally inexpensive.
• Wrappers: methods that check the influence of feature selection on the result of a particular classifier at each step of the algorithm. For example, LDA, kNN or NB methods may be used as wrappers for feature selection.
• Forward selection: add one feature, evaluate the result using a wrapper.
• Backward selection: remove one feature, evaluate the results.
NB feature selection
Naive Bayes assumes that all features are independent. Results degrade if redundant features are kept. Naive Bayes predicts the class and its probability. For one feature Xi:
$$P(w_j|X_i) = \frac{P(w_j)\,P(X_i|w_j)}{P(X_i)}$$
P(Xi) does not need to be computed if NB returns a class label only. Comparing predictions with the desired class C(X) (or probability distribution over all classes) for all training data gives an error estimate when a given feature Xi is used for the dataset D.
NB selection algorithm
Forward selection, best first:
• Start with a single feature: find Xi1 that minimizes the NB classifier error rate; this is the most important feature; set Xs = {Xi1}.
• Let Xs be the subspace of s features already selected; check all the remaining features, one after another, calculating probabilities (as products of factors) in the Xs + Xi subspace:
$$P(w_j|X_s + X_i) \propto P(w_j) \prod_{X_k \in X_s + X_i} P(X_k|w_j)$$
• Set Xs <= {Xs + Xs+1}, where Xs+1 is the feature giving the lowest error, and repeat until the error stops decreasing.
For NB this forward selection wrapper approach works quite well.
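The forward selection loop can be sketched in plain Python. This is a minimal illustration, not the implementation behind the slides: it assumes nominal features stored as rows of values, a naive Bayes score built as the class prior times a product of Laplace-smoothed per-feature factors, and the training error as the wrapper criterion. The names `nb_error` and `forward_selection` are hypothetical.

```python
from collections import Counter, defaultdict

def nb_error(X, y, features):
    """Training error rate of naive Bayes restricted to the given feature indices."""
    classes = sorted(set(y))
    n_class = Counter(y)
    counts = defaultdict(int)   # (feature, value, class) -> number of samples
    values = defaultdict(set)   # feature -> set of observed values
    for row, c in zip(X, y):
        for f in features:
            counts[(f, row[f], c)] += 1
            values[f].add(row[f])
    errors = 0
    for row, c in zip(X, y):
        def score(k):
            s = n_class[k] / len(y)  # class prior
            for f in features:
                # Laplace-smoothed estimate of P(X_f = value | class k)
                s *= (counts[(f, row[f], k)] + 1) / (n_class[k] + len(values[f]))
            return s
        if max(classes, key=score) != c:
            errors += 1
    return errors / len(y)

def forward_selection(X, y):
    """Greedily add the feature that lowers the NB error most; stop when none does."""
    selected, remaining = [], list(range(len(X[0])))
    best_err = float("inf")
    while remaining:
        err, f = min((nb_error(X, y, selected + [f]), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err
```

Ties between features are broken here by the lower feature index, and in practice the error would be estimated by crossvalidation rather than on the training data.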
Filters
Complexity of the wrapper approach for d features in the worst case is O(d*(d-1)/2); usually the number of selected features m < d. If the evaluation of the error on the training set is costly, or d is very large (in some problems it can be 10^4-10^5), then filters are necessary. Complexity of filters is always O(d), and evaluations are less expensive.
Simplest filter for nominal data: MAP classifier. K=2 classes, d binary features Xi=0, 1, i=1..d and N samples X(k). Joint probability P(wj,Xi) is a 2x2 matrix carrying full information. Since P(w0,0)+P(w0,1)=P(w0), P(w1,0)+P(w1,1)=P(w1)=1-P(w0), and P(w0)+P(w1)=1, there are only 2 free parameters here.
MAP Bayesian filter
The “informed majority classifier” (i.e. knowing the Xi value) makes optimal decisions in the case of two classes and two values Xi=0,1:
IF P(w0, Xi=0) > P(w1, Xi=0) THEN class w0. Predicted accuracy: a fraction P(w0, Xi=0) correct, P(w1, Xi=0) errors.
IF P(w0, Xi=1) > P(w1, Xi=1) THEN class w0. Predicted accuracy: a fraction P(w0, Xi=1) correct, P(w1, Xi=1) errors.
In general the MAP classifier predicts:
$$MC(X_i = x) = \arg\max_j P(w_j, X_i = x)$$
Accuracy of this classifier using feature Xi requires summing over all values x that the feature may take:
$$A(MC, X_i) = \sum_x \max_j P(w_j, X_i = x)$$
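A short sketch of the MAP filter on a joint probability table, assuming that the classifier picks the class with the largest joint probability for each feature value, so that its accuracy is the sum over values of that largest entry. The layout `joint[j][x] = P(wj, Xi=x)` and the function names are illustrative.

```python
def map_decisions(joint):
    """For each feature value x, the index j of the class with the largest P(w_j, x)."""
    n_values = len(joint[0])
    return [max(range(len(joint)), key=lambda j: joint[j][x]) for x in range(n_values)]

def map_accuracy(joint):
    """Expected accuracy: the largest joint probability in each column, summed over values."""
    return sum(max(column) for column in zip(*joint))

informative = [[0.4, 0.1],   # P(w0, X=0), P(w0, X=1)
               [0.1, 0.4]]   # P(w1, X=0), P(w1, X=1)
useless = [[0.3, 0.3],
           [0.2, 0.2]]
print(map_decisions(informative), map_accuracy(informative))
print(map_decisions(useless), map_accuracy(useless))
```

For the second table the feature carries no information about the class, so the decisions never change and the accuracy falls back to the prior of the majority class.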
MC properties
Since no more information is available, two features with the same accuracy A(MC,Xa) = A(MC,Xb) should be ranked as equal. If for each value x all samples are from a single class then accuracy of MC is 100%, and a single feature is sufficient.
Since optimal decisions are taken at each step, is the majority classifier an optimal solution? For binary features yes, but for others:
• Joint probabilities are difficult to estimate, especially for smaller datasets; smoothed probabilities lead to more reliable choices.
• For continuous features results will strongly depend on discretization.
• Information theory weights contributions from each value of X taking into account not only the most probable class, but also the distribution of probabilities over other classes, so it may have some advantages, especially for continuous features.
Bayesian MAP index Bayesian MAP rule accuracy Accuracy of the majority classifier: Bayesian MAP index:
MI index
The mutual information index is frequently used:
$$MI(w; X) = \sum_i \sum_k P(w_i, x_k) \lg_2 \frac{P(w_i, x_k)}{P(w_i)\,P(x_k)}$$
To avoid numerical problems with 0/0 for values xk that are not present in some dataset (e.g. in a crossvalidation partition), Laplace corrections are used. Their most common form is:
$$P(w_i|X=x) = \frac{N(w_i, X=x) + 1}{N(X=x) + K}$$
where K is the number of classes and N(X=x) is the number of all samples with the feature value x. Some decision trees evaluate probabilities in this way; instead of 1 and K other values may be used as long as probabilities sum to 1.
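A sketch of the Laplace correction, assuming its common add-one form: one pseudo-count is added to each class count and the number of classes to the total. The function name `laplace_posterior` and the parameter `alpha` (a generalized pseudo-count, 1 by default) are illustrative and not taken from the slides.

```python
def laplace_posterior(class_counts, alpha=1.0):
    """Smoothed P(w_i | X = x) from the per-class counts N(w_i, X = x).

    alpha pseudo-counts are added to every class, so the estimates
    stay defined for unseen values and still sum to 1.
    """
    k = len(class_counts)
    total = sum(class_counts)
    return [(n + alpha) / (total + alpha * k) for n in class_counts]

print(laplace_posterior([3, 0]))   # one class never seen with this value
print(laplace_posterior([0, 0]))   # value absent from the partition
```

With no observations at all the estimate falls back to a uniform distribution instead of the undefined ratio 0/0.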
Other entropy-based indices
The JBC and mutual information indices measure concentration of probability around the maximum value; simplest measure is: Joint or conditional? Measures something like entropy for each partition among classes. Other possibilities include Renyi entropy; meaning of q: Never used in decision trees or info selection? Joint or conditional?
Confusion matrices for BC Mapping from joint probability to confusion matrices for Bayesian rule:
An example
Compare three binary features with class distributions:
BC ranking: X3 > X1 = X2
MI ranking: X1 > X3 > X2
Gini ranking: X3 > X2 > X1
Which is the best? Why use anything other than BC?
Correlation coefficient
Perhaps the simplest index is based on Pearson’s correlation coefficient (CC), which calculates expectation values for the product of feature values and class values:
$$CC(X_j, w) = \frac{E(X_j w) - E(X_j)\,E(w)}{\sigma(X_j)\,\sigma(w)}$$
For feature values that are linearly dependent on the class the correlation coefficient is 1 or -1, while for a class distribution completely independent of Xj it is 0.
How significant are small correlations? It depends on the number of samples n. The answer (see “Numerical Recipes”) is given by:
$$P(X_j \sim w) = \mathrm{erf}\left(|CC(X_j, w)|\sqrt{n/2}\right)$$
For n=1000 even a small CC=0.02 gives P ~ 0.5, but for n=10 only 0.05.
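The two quoted numbers can be reproduced with the standard library. The sketch assumes the significance estimate erf(|CC| sqrt(n/2)), which matches the values given for n=1000 and n=10; the function names are illustrative.

```python
import math

def pearson_cc(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

def correlation_confidence(cc, n):
    """Confidence that a correlation cc seen in n samples is not accidental."""
    return math.erf(abs(cc) * math.sqrt(n / 2))

print(pearson_cc([1, 2, 3], [2, 4, 6]))        # perfectly linear relation
print(correlation_confidence(0.02, 1000))      # many samples
print(correlation_confidence(0.02, 10))        # few samples
```

The same weak correlation is far more credible with a thousand samples than with ten, which is why a CC-based ranking should always be read together with n.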
Other relevance indices
Mutual information is based on the Kullback-Leibler divergence; any other distance measure between distributions may also be used, e.g. Jeffreys-Matusita. The Bayesian concentration measure is another simple choice. Many other such measures exist. Which is the best? In practice they are similar, although accuracy of calculations is important; relevance indices should be insensitive to noise and unbiased in their treatment of features with many values.
Discretization
All indices of feature relevance require summation over probability distributions. What to do if the feature is continuous? There are two solutions:
1. Fit some functions to histogram distributions using Parzen windows, e.g. a sum of several Gaussians, and integrate.
2. Discretize the range of the feature values, and calculate sums:
  • Histograms with equal width of bins (intervals);
  • Histograms with equal number of samples per bin;
  • Maxdiff histograms: bins start in the middle of the largest gaps x_{i+1} - x_i, i.e. at (x_i + x_{i+1})/2;
  • V-optimal: sum of variances in bins should be minimal (difficult).
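The two simplest histogram schemes from the list can be sketched as follows. The code returns the inner bin edges and assigns a value to a bin by binary search; the function names and the tie-handling at bin edges are illustrative choices, not prescribed by the slides.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Inner edges that split the range [min, max] into bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Inner edges chosen so that each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def assign_bin(x, edges):
    """Index of the bin containing x; a value equal to an edge goes to the upper bin."""
    return bisect_right(edges, x)

data = [0, 1, 2, 3, 4, 5, 6, 100]
width_edges = equal_width_edges(data, 4)
freq_edges = equal_frequency_edges(data, 4)
print([assign_bin(x, width_edges) for x in data])
print([assign_bin(x, freq_edges) for x in data])
```

With one outlier in the data, equal-width bins put almost all samples into a single bin, while equal-frequency bins keep two samples in each.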
Tree (entropy-based) discretization
V-opt histograms are good, but difficult to create (dynamic programming techniques should be used). Simple approach: use decision trees with a single feature, or a small subset of features, to find good splits; this avoids local discretization.
Ex: C4.5 decision tree discretization maximizing information gain, or SSV tree based on a separability criterion, vs. constant width bins. Hypothyroid screening data, 5 continuous features, MI shown.
EP = equal width partition; SSV = decision tree partition (discretization) into 4, 8 .. 32 bins
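Discretized information
With a partition of Xj feature values x into bins r_k, joint information is calculated as:
$$H(w, X_j) = -\sum_{i=1}^{K} \sum_k P(w_i, x \in r_k) \lg_2 P(w_i, x \in r_k)$$
and mutual information as:
$$MI(w, X_j) = H(w) + H(X_j) - H(w, X_j)$$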
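Once a feature has been binned, mutual information with the class can be estimated directly from sample counts. The sketch below assumes plain frequency estimates of the probabilities (no smoothing) and uses the equivalent form of MI as a sum over joint cells of P(w, r) lg2 [P(w, r) / (P(w) P(r))]; the function name is illustrative.

```python
import math
from collections import Counter

def mutual_information(bins, labels):
    """MI in bits between bin indices and class labels, estimated from counts."""
    n = len(labels)
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    class_counts = Counter(labels)
    # P(b, w) / (P(b) P(w)) simplifies to c * n / (N_b * N_w) in terms of counts
    return sum((c / n) * math.log2(c * n / (bin_counts[b] * class_counts[w]))
               for (b, w), c in joint.items())

labels = [0, 0, 1, 1]
print(mutual_information([0, 0, 1, 1], labels))  # bins follow the classes
print(mutual_information([0, 1, 0, 1], labels))  # bins unrelated to the classes
```

Only non-empty joint cells appear in the counter, so the 0 log 0 terms never have to be evaluated.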
Feature selection
Selection requires evaluation of mutual information or other indices on subsets of features S={Xj1,Xj2,..Xjl}, with discretization of l-dimensional feature values X into bins R_k:
$$MI(w; S) = \sum_{i=1}^{K} \sum_{k=1}^{M(S)} P(w_i, X \in R_k) \lg_2 \frac{P(w_i, X \in R_k)}{P(w_i)\,P(X \in R_k)}$$
The difficulty here is reliable estimation of distributions for a large number M(S) of l-dimensional partitions.
Feedforward pair approximation tries to maximize MI of the new feature, minimizing the sum of MI with features already in S. Ex. select the feature Xi maximizing:
$$MI(w; X_i) - b \sum_{X_s \in S} MI(X_i; X_s)$$
with some b < 1.
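The pair approximation can be sketched as a greedy loop. This is an illustration under stated assumptions: features are already discretized and stored column-wise, probabilities are plain frequency estimates, and each candidate is scored by its MI with the class minus b times the sum of its MI with the features already chosen. The names `mi`, `pairwise_selection` and the parameter `beta` (standing for b) are hypothetical.

```python
import math
from collections import Counter

def mi(a, b):
    """Mutual information in bits between two discrete sequences, from counts."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * math.log2(c * n / (count_a[u] * count_b[v]))
               for (u, v), c in joint.items())

def pairwise_selection(columns, labels, k, beta=0.5):
    """Pick k features greedily: relevance to the class minus beta * redundancy."""
    selected, remaining = [], list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = sum(mi(columns[i], columns[s]) for s in selected)
            return mi(columns[i], labels) - beta * redundancy
        best = max(remaining, key=score)  # ties go to the lower index
        selected.append(best)
        remaining.remove(best)
    return selected

a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [0, 1, 2, 3, 0, 1, 2, 3]     # class encodes both a and b
columns = [a, list(a), b]              # column 1 duplicates column 0
print(pairwise_selection(columns, labels, 2))            # penalizes the duplicate
print(pairwise_selection(columns, labels, 2, beta=0.0))  # pure ranking keeps it
```

With the redundancy penalty switched off the procedure reduces to ranking by MI with the class, which cannot tell a duplicated feature from a genuinely new one.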
Influence on classification
Selecting best 1, 2, ... k-dimensional subsets, check how different methods perform using the reduced number of features.
MI using SSC discretization.
SSV bfs/beam: selection of features that are at the top of the SSV decision tree, with best first search or beam search tree creation method.
kNN ranking: backward wrapper using kNN.
Ba: pairwise approximation.
Hypothyroid data, 21 features, 5 continuous.
GM example, wrapper/manual selection
Look at GM 3 wrapper-based feature selection. Try GM on Australian Credit Data, with 14 features.
• Standardize the data.
• Select “transform and classify”, add feature selection wrapper and choose SVM leaving default features.
• Run it and look at “Feature Ranking”, showing accuracy achieved with each single feature; with the default 0.8 cut-off level only one feature is left (F8) and 85.5% accuracy is achieved.
• Check that lowering the level to 0.7 does not improve results.
• Check that SVM 10xCV with all features gives 85.1% on the test part, and with the single feature F8 the result is 85.5% for both train/test.
• This is a binary feature, so a simple rule should give the same accuracy; run the SSV tree on raw data with this single feature.
GM example, Leukemia data
Two types of leukemia, ALL and AML, 7129 genes, 38 train/34 test. Try SSV: decision trees are usually fast, no preprocessing.
• Run SSV selecting 3 CVs; train accuracy is 100%, test 91% (3 err) and only 1 feature, F4847, is used.
• Removing it manually and running SSV creates a tree with F2020, 1 err train, 9 err test. This is not very useful ...
• SSV Stump Field shows ranking of 1-level trees; SSV may also be used for selection, but trees separate training data with F4847.
• Standardize all the data, run SVM: 100% with all 38 as support vectors, but test is 59% (14 errors).
• Use SSV for selection of 10 features, run SVM with these features: train 100%, test 97% (1 error only).
Small samples ... but 10-30 genes should be sufficient.