Identifying Feature Relevance Using a Random Forest Jeremy Rogers & Steve Gunn
Overview • What is a Random Forest? • Why do Relevance Identification? • Estimating Feature Importance with a Random Forest • Node Complexity Compensation • Employing Feature Relevance • Extension to Feature Selection
Random Forest • Combination of base learners using Bagging • Uses CART-based decision trees
Random Forest (cont...) • Optimises split using Information Gain • Selects feature randomly to perform each split • Implicit Feature Selection of CART is removed
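A minimal Python sketch of this split rule (an illustration, not the authors' code): one feature is drawn at random and only the threshold is optimised by information gain, so the implicit feature selection of CART is removed.

```python
import numpy as np

def entropy(y):
    """Binary entropy (in bits) of a 0/1 label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def random_feature_split(X, y, rng):
    """Pick one feature at random and choose the threshold that
    maximises information gain, as in the Random Forest variant above."""
    f = rng.integers(X.shape[1])          # random feature: no implicit feature selection
    parent = entropy(y)
    best_gain, best_thr = 0.0, None
    for thr in np.unique(X[:, f])[:-1]:   # candidate thresholds (keep right child non-empty)
        left = y[X[:, f] <= thr]
        right = y[X[:, f] > thr]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return f, best_thr, best_gain
```

Growing CART-style trees with this splitter on bootstrap samples gives the bagged forest described above.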
Feature Relevance: Ranking • Analyse features individually • Measures of correlation to the target • A feature Xi is relevant if P(Y | Xi) ≠ P(Y) • Assumes no feature interaction • Fails to identify relevant features in the parity problem
Feature Relevance: Subset Methods • Use the implicit feature selection of decision tree induction • Wrapper methods • Subset search methods • Identifying Markov Blankets • A feature Xi is relevant if P(Y | Xi, Si) ≠ P(Y | Si), where Si denotes the remaining features
Relevance Identification using Average Information Gain • Can identify feature interaction • Reliability is dependent upon node composition • Irrelevant features give non-zero relevance
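As a sketch of how average IG can serve as a relevance score (the exact bookkeeping in the talk is not shown), one can record a (feature, gain) pair for every split in the forest and average per feature; the node-complexity weight introduced on the next slides would simply scale each recorded gain.

```python
import numpy as np

def average_information_gain(split_records, n_features):
    """Relevance score: mean information gain per feature over all splits.
    `split_records` is a hypothetical list of (feature_index, gain) pairs
    collected while the forest was grown; a node-complexity weight could be
    applied to each gain before it is recorded."""
    totals = np.zeros(n_features)
    counts = np.zeros(n_features)
    for f, gain in split_records:
        totals[f] += gain
        counts[f] += 1
    return totals / np.maximum(counts, 1)   # features never sampled score 0
```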
Node Complexity Compensation • Some nodes are easier to split than others • Requires each sample to be weighted by some measure of node complexity • Data are projected onto a one-dimensional space • For binary classification, the complexity measure is based on the number of distinct arrangements of positive and negative examples in the node (see below)
Unique & Non-Unique Arrangements • Some arrangements are reflections (non-unique) Some arrangements are symmetrical about their centre (unique)
Node Complexity Compensation (cont…) • Au – number of unique arrangements
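The exact compensation weight is not reproduced here, but the counting it rests on can be sketched: arrangements that are mirror images of each other are counted once, while symmetric (palindromic) arrangements are their own reflection, so Au = (total + symmetric) / 2. A small Python illustration, assuming the node holds n_pos positive and n_neg negative examples on the projected line:

```python
from math import comb

def palindromic_arrangements(n_pos, n_neg):
    """Arrangements of n_pos positives and n_neg negatives that read the
    same after reflection (symmetric about the centre)."""
    n = n_pos + n_neg
    if n % 2 == 1:
        # odd length: the middle position is taken by the class with the odd count
        return comb(n // 2, n_pos // 2)
    # even length: both counts must be even to form a palindrome
    if n_pos % 2 or n_neg % 2:
        return 0
    return comb(n // 2, n_pos // 2)

def unique_arrangements(n_pos, n_neg):
    """A_u: arrangements distinct up to reflection.
    Reflection pairs up the non-symmetric arrangements, so
    A_u = (total + symmetric) / 2."""
    total = comb(n_pos + n_neg, n_pos)
    return (total + palindromic_arrangements(n_pos, n_neg)) // 2
```

For example, two positives and two negatives give comb(4, 2) = 6 arrangements but only Au = 4 distinct ones.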
Information Gain Density Functions • Node complexity compensation improves the measure of average IG • The effect is visible when examining the IG density functions for each feature • These are constructed by building a forest and recording the frequencies of the IG values achieved by each feature
Information Gain Density Functions (cont…) • RF used to construct 500 trees on an artificial dataset • IG density functions recorded for each feature
Employing Feature Relevance • Feature Selection • Feature Weighting • Random Forest uses a Feature Sampling distribution to select each feature. • Distribution can be altered in two ways • Parallel: Update during forest construction • Two-stage: Fixed prior to forest construction
Parallel • Control the update rate using confidence intervals • Assume the Information Gain values are normally distributed • The statistic t = (x̄ − μ) / (s / √n) then has a Student's t distribution with n − 1 degrees of freedom • Maintain the most uniform sampling distribution that lies within the confidence bounds
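A hedged sketch of such an update (the talk's exact rule may differ): compute a t-based confidence interval on each feature's mean IG, then choose sampling weights as close to uniform as those intervals allow, here by clipping a common reference level into every interval. The function names and the choice of reference level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def mean_ig_interval(ig_samples, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a feature's mean IG,
    assuming the recorded IG values are approximately normal."""
    x = np.asarray(ig_samples, dtype=float)
    n = len(x)
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    return x.mean() - half, x.mean() + half

def parallel_update(ig_per_feature, alpha=0.05):
    """Hypothetical 'parallel' update: keep the sampling weights as equal as
    possible while staying inside each feature's confidence interval, by
    clipping a common reference level into every interval and renormalising."""
    bounds = [mean_ig_interval(s, alpha) for s in ig_per_feature]
    level = np.mean([np.mean(s) for s in ig_per_feature])   # common reference level
    weights = np.array([min(max(level, lo), hi) for lo, hi in bounds])
    weights = np.clip(weights, 0.0, None)                   # IG is non-negative
    if weights.sum() == 0.0:
        return np.full(len(weights), 1.0 / len(weights))    # fall back to uniform
    return weights / weights.sum()
```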
Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials
Irrelevant Features • Average IG is the mean of a non-negative sample. • Expected IG of an irrelevant feature is non-zero. • Performance is degraded when there is a high proportion of irrelevant features.
Expected Information Gain • nL – number of examples in the left descendant • iL – number of positive examples in the left descendant • The remaining terms are the numbers of positive and negative examples in the node
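The calculation behind this can be sketched as follows (the symbols a and b for the node's positive and negative counts are illustrative, and taking the split size as uniform over the possible values is an assumption): under a random split on an irrelevant feature, the number of positives sent to the left descendant is hypergeometric, and averaging the resulting gain gives a strictly positive expectation.

```python
from math import comb, log2

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def expected_ig_irrelevant(a, b):
    """Expected information gain of a split on an irrelevant feature for a
    node with `a` positive and `b` negative examples.  The left child size
    n_L is taken as uniform over 1..n-1 (an assumption), and the number of
    positives i_L it receives is hypergeometric."""
    n = a + b
    parent = entropy(a / n)
    total = 0.0
    for n_l in range(1, n):
        for i_l in range(max(0, n_l - b), min(a, n_l) + 1):
            p = comb(a, i_l) * comb(b, n_l - i_l) / comb(n, n_l)   # hypergeometric weight
            child = (n_l * entropy(i_l / n_l)
                     + (n - n_l) * entropy((a - i_l) / (n - n_l))) / n
            total += p * (parent - child)
    return total / (n - 1)   # average over the n - 1 possible split sizes

# e.g. expected_ig_irrelevant(5, 5) is small but strictly positive
```

Comparing a feature's observed average IG against this expectation is what allows irrelevant features to be screened out, as summarised at the end of the talk.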
Bounds on Expected Information Gain • The upper bound can be approximated • A corresponding lower bound is also given
Irrelevant Features: Bounds • 100 trees built on an artificial dataset • Average IG recorded and bounds calculated
Features selected on the Friedman and Simple datasets: FS compared with CFS
Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials • 100 trees constructed for feature evaluation in each trial
Summary • Node complexity compensation improves measure of feature relevance by examining node composition • Feature sampling distribution can be updated using confidence intervals to control the update rate • Irrelevant features can be removed by calculating their expected performance