How to solve a classification problem with 45 class levels using Random Forests
Nicholas L. Crookston and Gerald E. Rehfeldt
US Forest Service, Rocky Mountain Research Station, Moscow, ID
Western Mensurationists, Missoula, MT, June 20-22, 2010
Contents
Problem (we have 45 class levels, that's a lot)
Solution (we broke the problem into many subsets and formed an ensemble classifier)
Results (very good, and we have a measure of extrapolation)
Discussion
Problem
We desire to predict the biotic community as a function of climate. There are 45 biotic communities of interest.
Brown, D.E., F. Reichenbacher, and S.E. Franson. 1998. A classification of North American biotic communities. University of Utah Press, Salt Lake City. 141 pp.
Problem
In a 2006 effort on a subset of these communities, we had great results using:
Breiman, L. 2001. Random Forests. Machine Learning 45:5-32.
These results were published in:
Rehfeldt, G.E., N.L. Crookston, M.V. Warwell, and J.S. Evans. 2006. Empirical analyses of plant-climate relationships for the western United States. Int. J. Plant Sci. 167:1123-1150.
Random Forests
A Random Forest (RF) is a set of classification or regression trees (CART). RF builds many trees; each one minimizes the classification error on a bootstrap sample of the training data. Up to 32 class levels are supported, and when there are more than 10, a sampling scheme is used for each tree.
Random Forests -- continued
To classify a new observation, RF puts the observation down each of the trees in the forest. Each tree gives a classification, and that classification counts as a vote. The forest chooses the class having the most votes over all the trees.
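To make the voting concrete, here is a minimal sketch in R using the randomForest package the talk is based on. The iris data merely stands in for the climate training data; this is not the authors' script.

```r
# Minimal sketch, not the authors' script: fit one random forest and
# classify new observations by majority vote across its trees.
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# predict() drops each new case down all 500 trees; the returned class
# is the one with the most votes. type = "vote" exposes the vote shares.
predict(rf, newdata = iris[c(1, 51, 101), ])
predict(rf, newdata = iris[c(1, 51, 101), ], type = "vote")
```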
Problem -- continued
We have 45 class levels, over the 32-level limit of the randomForest package. We desire to make predictions using future climates, but RF might predict nonsense answers for future climatic conditions that are unique with respect to the training data. These are extrapolations we need to detect.
Solution -- Steps
Training data: ~1.6 million observations, 35 climate variables from the Moscow climate model.
We created 100 Random Forests. To create one of the forests (sketched in code below):
Sample 9 of the 45 class levels (without replacement).
Make a copy of the training data.
Recode the biotic community in this copy: keep it as is if its code is one of the 9 in the sample; otherwise change the observed class to "other".
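A hedged sketch of this step in R, assuming a data frame `train` with a 45-level factor column `community` plus the 35 climate predictors; the data frame, column name, and helper function are illustrative, not the authors' code. Note that sampling 9 classes keeps each sub-problem at 10 levels (9 plus "other").

```r
# Sketch of building one member of the ensemble of 100 forests.
# `train` is a hypothetical data frame: a factor `community` (45 levels)
# plus 35 climate predictor columns.
library(randomForest)

make_sub_forest <- function(train, n_keep = 9, ntree = 100) {
  keep <- sample(levels(train$community), n_keep)     # 9 of 45, without replacement
  sub  <- train                                       # copy of the training data
  sub$community <- as.character(sub$community)
  sub$community[!sub$community %in% keep] <- "other"  # recode the rest
  sub$community <- factor(sub$community)              # 10 levels: the 9 + "other"
  list(keep = keep,
       rf   = randomForest(community ~ ., data = sub, ntree = ntree))
}

forests <- replicate(100, make_sub_forest(train), simplify = FALSE)
```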
Steps -- continued
Fit each of the 100 RFs. To make a prediction:
Put the new case down all 100 RFs, giving a vector of 100 predictions for the case.
Count the predictions by biotic community code, including "other". This gives a table of codes and counts with 46 rows (one for each of the 45 community codes plus "other").
Steps -- continued
Divide the count for each code by the number of RFs that contained that code. The ensemble classification is the class value corresponding to the maximum of these quotients. (See the sketch below.)
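A hedged sketch of this ensemble vote in R, continuing from the hypothetical `forests` list above. The division matters because each class appears in only some of the 100 forests: with 9 of 45 classes drawn per forest, a given class is expected to appear in about 100 × 9/45 = 20 of them, while "other" appears in all 100.

```r
# Sketch of the ensemble classification for one new case `x`,
# a one-row data frame with the same predictor columns as `train`.
classify_case <- function(forests, x) {
  votes  <- sapply(forests, function(f) as.character(predict(f$rf, x)))
  counts <- table(votes)                              # votes per code, incl. "other"
  # number of forests whose 9-class sample contained each voted code
  avail  <- sapply(names(counts), function(code) {
    if (code == "other") length(forests)              # every forest has "other"
    else sum(sapply(forests, function(f) code %in% f$keep))
  })
  names(which.max(counts / avail))                    # class with the largest quotient
}
```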
Results
We interpret predictions of "other" as indicating extrapolation. For this work, extrapolation means there is no biotic community in our study area that corresponds to the (new) climate. It is not a perfect indicator of extrapolation.
Results
Application to Brown's biotic communities:
All of North America
Prediction of community as a function of climatic metrics
Mapped at 0.0083333 arc degrees (~1 km²)
[Maps: no-analog (extrapolation) predictions for 2090 under three climate models: Canadian, Princeton, Hadley]
Discussion / Conclusion
The method can be used on larger problems, and perhaps with CART-based methods other than Random Forests. One could add training samples that actually are "other", that is, not any of the communities of interest. Random Forests remains a very important tool in our tool set.