Identifying Feature Relevance Using a Random Forest Jeremy Rogers & Steve Gunn
Overview • What is a Random Forest? • Why do Relevance Identification? • Estimating Feature Importance with a Random Forest • Node Complexity Compensation • Employing Feature Relevance • Extension to Feature Selection
Random Forest • Combination of base learners using Bagging • Uses CART-based decision trees
Random Forest (cont...) • Optimises split using Information Gain • Selects feature randomly to perform each split • Implicit Feature Selection of CART is removed
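The split rule on this slide can be sketched as follows. This is a minimal illustration, assuming binary 0/1 labels and numeric features; the function names (`entropy`, `information_gain`, `best_split`, `random_feature_split`) are hypothetical and not taken from the slides.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_split(X, y, feature):
    """Return (threshold, gain) maximising information gain on one feature."""
    values = sorted(set(row[feature] for row in X))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between neighbours
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        gain = information_gain(y, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best

def random_feature_split(X, y, rng):
    """Pick the split feature at random, then optimise only its threshold."""
    feature = rng.randrange(len(X[0]))
    threshold, gain = best_split(X, y, feature)
    return feature, threshold, gain

X = [[0.1, 5], [0.2, 3], [0.8, 4], [0.9, 6]]
y = [0, 0, 1, 1]
feature, threshold, gain = random_feature_split(X, y, random.Random(0))
```

Because the feature is drawn at random rather than chosen by gain, the tree no longer performs the implicit feature selection that CART does; only the threshold is optimised.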
Feature Relevance: Ranking • Analyse Features individually • Measures of Correlation to the target • Feature is relevant if: • Assumes no feature interaction • Fails to identify relevant features in parity problem
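The parity failure can be checked with a small example. The sketch below assumes Pearson correlation as the ranking measure (the slide does not say which measure is used) and a toy dataset of three binary features whose target is the XOR of the first two.

```python
import itertools
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Every combination of three binary features; the third is irrelevant.
X = list(itertools.product([0, 1], repeat=3))
y = [a ^ b for a, b, _ in X]  # target is the parity of the first two features

correlations = [pearson([row[j] for row in X], y) for j in range(3)]
```

All three correlations are zero, so an individual ranking cannot separate the two relevant features from the irrelevant one, even though the first two features together determine the target exactly.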
Feature Relevance: Subset Methods • Use implicit feature selection of decision tree induction • Wrapper methods • Subset search methods • Identifying Markov Blankets • Feature is relevant if:
Relevance Identification using Average Information Gain • Can identify feature interaction • Reliability dependent upon node composition • Irrelevant features give non-zero relevance
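One way to see why averaging information gain over tree nodes can expose feature interaction is the two-feature XOR case. The sketch below is illustrative only: the dataset and the helper `split_gain` are assumptions, not the authors' code.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def split_gain(rows, labels, feature):
    """Information gain of splitting a node on one binary feature."""
    left = [y for r, y in zip(rows, labels) if r[feature] == 0]
    right = [y for r, y in zip(rows, labels) if r[feature] == 1]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]  # XOR target

# At the root, the second feature looks useless on its own.
root_gain = split_gain(rows, labels, 1)

# After a split on the first feature, the second separates each child perfectly.
child_gains = []
for v in (0, 1):
    sub = [(r, y) for r, y in zip(rows, labels) if r[0] == v]
    child_gains.append(split_gain([r for r, _ in sub], [y for _, y in sub], 1))

average_gain = (root_gain + sum(child_gains)) / 3
```

The second feature scores zero at the root but one bit in each child node, so its average over the three nodes is well above zero.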
Node Complexity Compensation • Some nodes are easier to split • Requires each sample to be weighted by some measure of node complexity • Data projected on to one-dimensional space • For Binary Classification:
Unique & Non-Unique Arrangements • Some arrangements are reflections (non-unique) • Some arrangements are symmetrical about their centre (unique)
Node Complexity Compensation (cont…) • A_u: number of unique arrangements
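The formula from this slide was not recovered, so the following is only one plausible reading of the counting argument. It assumes a node with n examples, i of them positive, projected onto one dimension, and counts arrangements of the labels while treating an arrangement and its reflection as the same. The function names and the closed form are assumptions; the brute-force version is included as a cross-check.

```python
import itertools
import math

def unique_arrangements_bruteforce(n, i):
    """Count label orderings of n examples (i positive), up to reflection."""
    seen = set()
    for positions in itertools.combinations(range(n), i):
        arrangement = tuple(1 if k in positions else 0 for k in range(n))
        # An arrangement and its mirror image share one canonical form.
        seen.add(min(arrangement, arrangement[::-1]))
    return len(seen)

def symmetric_arrangements(n, i):
    """Count arrangements that are their own reflection (palindromes)."""
    if n % 2 == 0 and i % 2 == 1:
        return 0  # an even-length palindrome needs an even number of positives
    return math.comb(n // 2, i // 2)

def unique_arrangements(n, i):
    """Closed form: reflected pairs count once, symmetric ones count once."""
    return (math.comb(n, i) + symmetric_arrangements(n, i)) // 2

a_u = unique_arrangements(4, 2)  # 0011, 0101, 0110, 1001 up to reflection
```

For n = 4 and i = 2 there are six arrangements in total; two are symmetric about their centre and the other four form two reflected pairs, which gives four under this reading.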
Information Gain Density Functions • Node Complexity improves measure of average IG • The effect is visible when examining the IG density functions for each feature • These are constructed by building a forest and recording the frequencies of IG values achieved by each feature
Information Gain Density Functions • RF used to construct 500 trees on an artificial dataset • IG density functions recorded for each feature
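Recording information gain frequencies per feature can be sketched as below. This is a simplified stand-in, not the experiment from the slides: bootstrap samples of a toy dataset play the role of tree nodes, a feature is drawn uniformly at random for each one, and the best-split gain is binned into a histogram. The dataset, bin count, and function names are all assumptions.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_gain(values, labels):
    """Largest information gain over all thresholds on one feature."""
    order = sorted(zip(values, labels))
    ys = [label for _, label in order]
    n, parent = len(ys), entropy(ys)
    best = 0.0
    for k in range(1, n):
        if order[k - 1][0] == order[k][0]:
            continue  # no threshold separates equal feature values
        gain = (parent - (k / n) * entropy(ys[:k])
                - ((n - k) / n) * entropy(ys[k:]))
        best = max(best, gain)
    return best

def ig_histograms(X, y, n_nodes, n_bins, rng):
    """Collect best-split gains per feature and bin them on [0, 1]."""
    n_features = len(X[0])
    samples = [[] for _ in range(n_features)]
    for _ in range(n_nodes):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap "node"
        f = rng.randrange(n_features)  # feature chosen at random
        samples[f].append(best_gain([X[i][f] for i in idx], [y[i] for i in idx]))
    hists = []
    for values in samples:
        counts = Counter(min(int(v * n_bins), n_bins - 1) for v in values)
        hists.append([counts.get(b, 0) for b in range(n_bins)])
    return samples, hists

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]  # only feature 0 is relevant
samples, hists = ig_histograms(X, y, n_nodes=500, n_bins=10, rng=rng)
```

In this toy setting the histogram for the relevant feature concentrates near one bit, while the irrelevant feature's histogram sits at low but non-zero gains.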
Employing Feature Relevance • Feature Selection • Feature Weighting • Random Forest uses a Feature Sampling distribution to select each feature. • Distribution can be altered in two ways • Parallel: Update during forest construction • Two-stage: Fixed prior to forest construction
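A two-stage scheme can be sketched as: estimate relevance once, turn it into a sampling distribution, then draw each split feature from that fixed distribution. The relevance values and function names below are hypothetical placeholders, and normalising the scores directly is just one simple way to form the distribution.

```python
import random

def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def sample_feature(distribution, rng):
    """Draw one feature index according to the sampling distribution."""
    return rng.choices(range(len(distribution)), weights=distribution, k=1)[0]

# Stage 1: relevance estimates (hypothetical values), fixed before the forest.
relevance = [1.0, 1.0, 2.0]
distribution = normalise(relevance)

# Stage 2: every split draws its feature from the fixed distribution.
rng = random.Random(0)
draws = [sample_feature(distribution, rng) for _ in range(10000)]
counts = [draws.count(j) for j in range(len(distribution))]
```

A uniform `distribution` recovers the standard random feature choice; the parallel variant would instead recompute `distribution` as trees are added.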
Parallel • Control update rate using confidence intervals. • Assume Information Gain values have a normal distribution. • Statistic has a Student’s t distribution with n-1 degrees of freedom • Maintain most uniform distribution within confidence bounds
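The confidence interval on a feature's mean information gain can be sketched with the standard t-based interval. The sample values are made up for illustration, and the critical value is passed in by the caller (2.262 is the two-sided 95% value for 9 degrees of freedom). How the slides then pick the most uniform distribution inside the bounds is not recovered here.

```python
import math
import statistics

def mean_confidence_interval(samples, t_crit):
    """Interval for the mean using a Student's t critical value (n-1 d.o.f.)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical information gain values recorded for one feature (n = 10).
ig_values = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.07, 0.10, 0.12, 0.08]
lower, upper = mean_confidence_interval(ig_values, t_crit=2.262)
```

The interval narrows as more gain values are recorded, which is what lets the bounds act as a brake on early, noisy updates.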
Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials
Irrelevant Features • Average IG is the mean of a non-negative sample. • Expected IG of an irrelevant feature is non-zero. • Performance is degraded when there is a high proportion of irrelevant features.
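The non-zero expectation can be demonstrated with a small simulation. The sketch assumes that a feature unrelated to the class induces a uniformly random ordering of the labels, and that the split threshold is then chosen to maximise gain; node size and trial count are arbitrary.

```python
import math
import random

def entropy_counts(pos, total):
    """Binary entropy (in bits) from a positive count and a total count."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_gain_random_feature(n_pos, n_neg, rng):
    """Best-split gain when the feature ordering is unrelated to the labels."""
    labels = [1] * n_pos + [0] * n_neg
    rng.shuffle(labels)  # an irrelevant feature orders the labels at random
    n = len(labels)
    parent = entropy_counts(n_pos, n)
    best, left_pos = 0.0, 0
    for k in range(1, n):  # k examples fall in the left descendant
        left_pos += labels[k - 1]
        gain = (parent
                - (k / n) * entropy_counts(left_pos, k)
                - ((n - k) / n) * entropy_counts(n_pos - left_pos, n - k))
        best = max(best, gain)
    return best

rng = random.Random(1)
gains = [best_gain_random_feature(10, 10, rng) for _ in range(200)]
average_gain = sum(gains) / len(gains)
```

Every trial yields a strictly positive gain, so the average cannot reach zero even though the feature carries no information about the class.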
Expected Information Gain • nL: No. examples in left descendant • iL: No. positive examples in left descendant
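The slide's own expression was not recovered, so the sketch below shows one way such an expectation can be computed rather than the authors' formula. It assumes a node with n examples and i positives (names chosen to mirror nL and iL), a fixed left-descendant size nL, and an irrelevant feature that sends examples left uniformly at random, so that iL follows a hypergeometric distribution.

```python
import math

def binary_entropy(pos, total):
    """Binary entropy (in bits) from a positive count and a total count."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(n, i, n_left, i_left):
    """Gain of a split sending n_left examples (i_left positive) to the left."""
    n_right, i_right = n - n_left, i - i_left
    return (binary_entropy(i, n)
            - (n_left / n) * binary_entropy(i_left, n_left)
            - (n_right / n) * binary_entropy(i_right, n_right))

def expected_info_gain(n, i, n_left):
    """Average gain over the hypergeometric distribution of i_left."""
    total = 0.0
    lowest = max(0, n_left - (n - i))   # fewest positives that can go left
    highest = min(i, n_left)            # most positives that can go left
    for i_left in range(lowest, highest + 1):
        prob = (math.comb(i, i_left) * math.comb(n - i, n_left - i_left)
                / math.comb(n, n_left))
        total += prob * info_gain(n, i, n_left, i_left)
    return total

example = expected_info_gain(4, 2, 2)  # 1/3 of a bit
```

With four examples, two of them positive, split evenly, the gain is one bit with probability 1/3 and zero otherwise, giving an expected gain of 1/3 for a feature that carries no information.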
Expected Information Gain (figure; axes: No. positive examples, No. negative examples)
Bounds on Expected Information Gain • Upper bound can be approximated as: • Lower bound is given by:
Irrelevant Features: Bounds • 100 trees built on artificial dataset • Average IG recorded and bounds calculated
Friedman FS: CFS:
Simple FS: CFS:
Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials • 100 trees constructed for feature evaluation in each trial
Summary • Node complexity compensation improves measure of feature relevance by examining node composition • Feature sampling distribution can be updated using confidence intervals to control the update rate • Irrelevant features can be removed by calculating their expected performance