Hands-on Soil Infrared Spectroscopy Training Course: Getting the best out of light
11 – 15 November 2013
R package “randomForest”
Erick Towett
Welcome
Outline
• Introduction
• Usage:
• Total element composition of African soils using total reflection X-ray fluorescence (TXRF).
• Combining MIR and TXRF for the prediction of soil properties.
• MIRS random forests prediction models for soil properties.
• Demo: application of RF to MIRS calibration.
Introduction
• “randomForest” (RF) implements Breiman’s random forest algorithm for classification and regression, based on a forest of trees using random inputs.
• Version: 4.6-7
• Depends: R (>= 2.5.0)
• Description: Classification and regression based on a forest of trees using random inputs.
• URL: http://stat-www.berkeley.edu/users/breiman/RandomForests
• Reference: Breiman, L. (2001). Random Forests. Machine Learning 45(1), 5-32.
Features of Random Forests
• RF is fast and easy to implement, and produces highly accurate predictions.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion and is robust to overfitting.
• It gives estimates of variable importance in the classification.
• RF handles complex data types well.
• It obviates the need to transform predictors to approximate normal distributions.
Features of RF
What are the challenges of RF?
• There are many possible alternative nodes.
• Reseeding (re-running with a different random seed) will give different models.
How does RF work? The out-of-bag (OOB) error estimate
• In RF, each tree is constructed using a different bootstrap sample from the original data.
• About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
• These out-of-bag cases are used to get a running unbiased estimate of the classification error as trees are added to the forest.
• They are also used to get estimates of variable importance.
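The "about one-third" figure can be checked with a small simulation. The sketch below is plain Python, not the R randomForest package; the function name `oob_fraction` and the sample sizes are illustrative assumptions. It draws bootstrap samples and counts the cases that never get picked.

```python
import random

def oob_fraction(n_cases, n_trees, seed=0):
    """Average share of cases left out of a bootstrap sample of size n_cases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Bootstrap: draw n_cases indices with replacement.
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees

# Hypothetical sizes: 1000 cases, 200 trees.
frac = oob_fraction(n_cases=1000, n_trees=200)
print(f"mean out-of-bag fraction: {frac:.3f}")
```

The expected left-out share is (1 - 1/n)^n, which tends to 1/e (roughly 0.37) for large n, so "about one-third" is a slight rounding down.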
How does RF work?
• RF can output a list of the predictor variables that are important in predicting the outcome.
• The randomForest package in R has two measures of importance.
• One is the “total decrease in node impurities from splitting on the variable, averaged over all trees.”
• The other is based on a permutation test.
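The permutation idea can be sketched in a few lines. This is a simplified illustration in plain Python, not the package's implementation: randomForest permutes each variable within the out-of-bag cases of each tree, whereas this sketch shuffles a column once over the whole data set. The data, the stand-in `model`, and all function names are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against observed values."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, seed=0):
    """Increase in MSE when each predictor column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between predictor j and y
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        increases.append(mse(model, X_perm, y) - baseline)
    return increases

# Hypothetical data: y depends on predictor 0 only; predictor 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]  # stand-in for a fitted forest

importance = permutation_importance(model, X, y)
print(importance)
```

A predictor the model relies on shows a clear increase in error when shuffled; a predictor the model ignores shows none.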
Usage
Study 1: Variability and patterns in total element composition of sub-Saharan Africa (SSA) soils using TXRF.
The objectives were to:
• quantify the variability in total element composition of soils from a diverse set of soils across SSA using TXRF, and
• explore the patterns in total element composition of the soils analysed.
Materials and Methods
• Soils from 34 randomly located 100 km² sentinel sites across Africa.
Land degradation surveillance framework (LDSF)
• Sentinel sites
• Randomized sampling schemes
• LDSF = a hierarchical, spatially stratified random sampling scheme with ten 100 m² plots nested within sixteen 1 km² clusters, nested within 100 km² sites.
• Soil spectroscopy
• Consistent field protocol
Materials and Methods
• Soil samples were collected at two depths: 0-20 cm and 20-50 cm.
• A total of 1074 samples (16 samples per cluster x 2 soil depths x 34 sentinel sites) were used for exploring spectral (TXRF) patterns.
• Total element conc. for 17 elements: Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta, & Pb.
Materials and Methods
• PCA on the TXRF data.
• RF regression of the first 5 PCs of the TXRF element conc. against site and soil-forming factors, to confirm whether site or soil-forming factors (e.g., mineralogy, climate, topography & vegetation) are important drivers of total elemental conc. in the soil, and to view the importance of the predictor variables.
• Site factors were extracted for each site from the LDSF database and Worldclim data; mineralogy data came from XRD analysis (raw semi-quantitative mineralogy data and dominant mineralogy grouping).
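As a rough illustration of the PCA step, the sketch below extracts the leading principal component of log-transformed concentrations by power iteration. It is a plain-Python stand-in, not the software used in the study; the concentration values, the element layout, and the function name are invented for the example. In a workflow like the one described, the resulting scores would serve as the response in the RF regression.

```python
import math
import random

def first_principal_component(rows, n_iter=200):
    """Leading PCA axis and scores via power iteration on the covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

# Hypothetical concentrations: elements 0 and 1 co-vary, element 2 is nearly constant.
rng = random.Random(0)
conc = []
for _ in range(50):
    base = rng.uniform(1.0, 100.0)
    conc.append([base, 2.0 * base, 5.0 + rng.uniform(-0.01, 0.01)])
log_conc = [[math.log(v) for v in row] for row in conc]

loadings, scores = first_principal_component(log_conc)
print([round(x, 3) for x in loadings])
```

The two co-varying elements receive near-equal loadings on the first axis, while the nearly constant element contributes almost nothing, which is the kind of pattern a biplot arrow length conveys.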
Results
• Total element conc. values were within the range reported globally for soil Cr, Mn, Zn, Ni, V, Sr, & Y, and in the high range for Al, Cu, Ta, Pb, & Ga.
Results
• Significant variations (P < 0.05) in total element composition within and between the sites for the 17 elements analysed.
• The greatest proportion of total variance and the largest number of significant variance components occurred at the site level (55-88%), followed by the cluster-nested-within-site level (10-40%).
Results
• PCA revealed that patterns in total element conc. between sites appeared to relate to differences in mineralogical ‘functional groups’.
• The pattern of clustering of the individual minerals and sorting of heavy minerals (V, Pb, Ni, Cr, Cu, Ti, and Fe) along the positive Dim 1 axis is apparent.
Figure: Biplots based on the principal components Dim 1 vs Dim 2 and Dim 1 vs Dim 3, on the log-transformed soil total element concentration data from all sites analysed (arrow sizes are proportional to the “initial” variability in the elements present).
Results
• The strong within-site and between-site variations observed in many elements can serve to diagnose soil fertility potential.
• Elements clustered differently in the sample sets from different sentinel sites, indicating a wide variation in associations.
• Some elements are poorly represented (short arrows in the PCA).
Figure: Biplots based on PCA of element concentration for 4 sentinel sites.
Results
• RF model performances were acceptable, with R² > 0.75.
• Most important variables: cluster, topography, land use, precipitation and temperature.
• The importance of cluster is explained by spatial correlation at distances of < 1 km.
Figure: Variable importance plots showing the model accuracies and mean decrease in accuracy (%IncMSE) of the Random Forests regression of TXRF element concs against mineralogy + site/soil-forming factors, (a) including cluster and (b) excluding cluster.
Usage
Study 2:
• Potential of combining MIR & TXRF spectroscopy for the prediction of soil properties.
• Objective: to evaluate whether TXRF can complement MIR for predicting soil test values, especially for tests that are poorly predicted by MIR (e.g. extractable P and K, and some micronutrients).
Materials and Methods
• Georeferenced soil samples associated with the AfSIS Project.
• A total of 700 soil samples from 44 random 100 km² sentinel sites, stratified according to Köppen-Geiger climatic zones and distributed across SSA.
Materials and Methods
• Samples were analysed using an MIR spectrometer.
• Infrared absorbance spectra were recorded at 4 cm⁻¹ intervals in the range of 400 to 4000 cm⁻¹.
• The average of the spectra for 4 replicates was taken.
• TXRF methodology for total elemental concentrations in each soil sample.
Fourier-Transform MIR spectrometer
TXRF spectrometer
Materials and Methods
• RF-OOB calibration models were developed (n = 700) to predict the reference properties from the TXRF total element composition, using the raw total element concentration data as ‘spectra’.
• Raw TXRF spectra were also used in conjunction with 1st derivative MIR spectra to predict the reference soil properties.
• RF was used to calibrate the residuals of the predictions from the MIR spectral data to the raw TXRF total element data, as mixing different data types in the predictor variables might affect the variable importance weights in the fitted models.
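The residual approach can be sketched as a two-stage fit: a first model predicts the property from MIR, and a second model is fitted to what the first one missed, using TXRF. The example below is a minimal plain-Python illustration under loud assumptions: an ordinary least-squares line stands in for the RF learner, each data type is reduced to one invented predictor, and errors are computed in-sample rather than out-of-bag.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares line; a stand-in for the RF learner."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return lambda v: intercept + slope * v

def rmse(pred, obs):
    """Root mean squared error."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5

# Hypothetical data: the property depends on one MIR and one TXRF feature.
rng = random.Random(0)
mir = [rng.random() for _ in range(300)]
txrf = [rng.random() for _ in range(300)]
y = [3.0 * m + 2.0 * t + rng.gauss(0, 0.05) for m, t in zip(mir, txrf)]

# Stage 1: predict the property from MIR alone.
stage1 = fit_line(mir, y)
pred1 = [stage1(m) for m in mir]

# Stage 2: model the stage-1 residuals from TXRF, then add the correction.
residuals = [o - p for o, p in zip(y, pred1)]
stage2 = fit_line(txrf, residuals)
pred2 = [p + stage2(t) for p, t in zip(pred1, txrf)]

rmse_mir = rmse(pred1, y)
rmse_combined = rmse(pred2, y)
print(f"MIR only: {rmse_mir:.3f}, MIR + TXRF residual model: {rmse_combined:.3f}")
```

Keeping the two data types in separate stages means the importance weights of each model refer to a single data type, which is the concern the slide raises about mixing predictors.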
Results
• MIR spectra resulted in very good prediction models using RF out-of-bag validation (R² > 0.80) for organic C and N, total C and N, exchangeable Ca, Mehlich-3 Al and pH.
• Also predicted well (R² > 0.60) were Ca/Mg ratio, exchangeable bases, exchangeable Mg, phosphorus sorption index (PSI), and water- and calgon-dispersed particles analysed by laser diffraction for sand content, clay content, and silt content.
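For reference, R² thresholds such as these are usually read as the share of variance explained by the predictions. The slides do not state the exact formula used, so the sketch below assumes the common definition 1 - SSE/SST, applied to held-out (for example out-of-bag) predictions; the numbers are invented.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical reference values and out-of-bag predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
print(f"R2 = {r2:.2f}")
```

Under this definition a model that always predicts the mean scores 0, and a model worse than the mean scores below 0.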
Results
• Calibration models were not satisfactory (R² < 0.60) for Mehlich-3 extractable K, Mn, Fe, Cu, B, Zn, P, S, and Na, exchangeable acidity, electrical conductivity (ECd), exchangeable sodium percentage (ESP), exchangeable sodium ratio (ESR), and air-dispersed particles for silt, clay and sand contents.
Results
• RF was able to improve prediction accuracies when the raw TXRF data were added to the MIR data, e.g. ECd (63% reduction in RMSE), Mehlich-3 S (54%), exchangeable Na (53%), ESP (50%), ESR (50%), total C (29%), Mehlich-3 B (28%), Mehlich-3 Mn (26%), exchangeable Mg (17%), Mehlich-3 Cu (15%), Mehlich-3 Fe (11%), organic C (10%), Mehlich-3 Zn (6%), and silt content (8-50 microns) of air-dispersed particles by laser diffraction (4%).
• The improvement in the predictions was mostly due to TXRF detecting a few outlier samples that were different from the rest of the samples.
• TXRF data used as a predictor did not add value to MIR beyond identifying outlying samples. Because these samples could not be detected as MIR spectral outliers, TXRF may be used as an outlier detector.
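The percentage reductions listed above are presumably relative changes in RMSE between the MIR-only model and the MIR + TXRF model; the slides do not give the formula, so the sketch below assumes that reading. The RMSE values are invented to reproduce a 63% reduction.

```python
def rmse_reduction_percent(rmse_mir, rmse_combined):
    """Relative drop in RMSE (%) when going from MIR only to MIR + TXRF."""
    return 100.0 * (rmse_mir - rmse_combined) / rmse_mir

# Hypothetical RMSE values, not taken from the study.
reduction = rmse_reduction_percent(rmse_mir=0.80, rmse_combined=0.296)
print(f"{reduction:.0f}% reduction in RMSE")
```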
Usage
Study 3:
• Analysis of MIRS random forests prediction models for soil properties.
• Ongoing study.
• An attempt to offer an in-depth analysis of random forests models for the prediction of a number of soil properties using MIR spectroscopy.
Materials and Methods
• 1907 soil samples were scanned through an MIR spectrometer at a resolution of 4 cm⁻¹.
• The 1st derivative of the spectral range 601.7-4001.6 cm⁻¹ was calculated with a smoothing interval of 21 data points using the soil.spec package in R.
• RF-OOB models were built to predict the reference properties from the MIRS 1st derivative spectra using the entire data set.
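A smoothed first derivative over a 21-point window can be sketched as the slope of a least-squares line fitted to each window (the Savitzky-Golay first-derivative filter for a linear or quadratic fit). The slides do not say which filter soil.spec applies, so this is an assumed stand-in written in plain Python, and the function name, the evenly spaced wavenumber step, and the test spectrum are hypothetical.

```python
def first_derivative(spectrum, window=21, step=1.0):
    """Smoothed first derivative: least-squares slope over a sliding window.

    Assumes evenly spaced wavenumbers `step` apart. The first and last
    window // 2 points are dropped because the window does not fit there.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    offsets = range(-half, half + 1)
    denom = sum(k * k for k in offsets) * step
    out = []
    for i in range(half, len(spectrum) - half):
        out.append(sum(k * spectrum[i + k] for k in offsets) / denom)
    return out

# Hypothetical spectrum: absorbance rising linearly, 0.5 units per cm^-1,
# sampled every 2 cm^-1.
step = 2.0
absorbance = [0.5 * step * i for i in range(100)]
deriv = first_derivative(absorbance, window=21, step=step)
print(len(deriv), deriv[0])
```

A wider window suppresses more noise but also flattens narrow absorption features, so the 21-point setting is a trade-off rather than a fixed rule.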
R package “randomForest”
Thank you for your attention