Imputating snag data to forest inventory for wildlife habitat modeling

Imputating snag data to forest inventory for wildlife habitat modeling Kevin Ceder College of Forest Resources University of Washington GMUG – 11 February 2008

Why impute snag data? • Snags are an important habitat element and needed for habitat assessments. • These data are often not collected in forest inventory • The Large-Landscape Wildlife Assessment models will need these data

Why use Nearest-Neighbor? • Non-parametric requiring no assumptions of underlying functional form • Retains the variance/covariance structure of the input data in the output data

The Questions • Can snag data be imputed using kNN techniques with stand and site data? • How well do the results fit observed data? • Which distance measure performs best? • What is the effect of increasing neighborhood size? • How do the results compare with random sampling?

The Process • The database • FIA integrated database version 2.1 • Data for private forests in western Washington (1510 plots) • Both tree and snag data collected between 1989 - 1991 • Representative of the forest targeted for the LLWA project

The Process • The tool - • The yaImpute package for kNN imputation • Raw, Euclidean, Mahalanobis, MSN, MSN2, ICA, and randomForest distance measures • k = 1, 2, 3, 4, 5, 10 • For k>1 imputed data are distance weighted means of neighbors • 9999 permutations of the data for comparisons with random sampling • k = 1, 2, 3, 4, 5, 10 • For k>1 imputed data are distance weighted means of neighbors using Euclidean distance

Goodness of fit Comparison with random The Statistics

The Input Data – Tree and site data (xData)

The Input Data – Snag data (yData) • 695 of 1510 plots did not have snags present

Results • Can snag data be imputed using kNN techniques with stand and site data? • Yes!

Results • How well do the results fit observed data?

Results • How well do the results fit observed data? • Marginally… • High RMSD and MAD relative to mean snag measures in the data • Observed vs imputed plots show poor patterning

Results • Which distance measure performs best? • What is the effect of increasing neighborhood size?

Results • Which distance measure performs best? • All are generally similar • randomForest imputations provide lower RMSD and MAD but under-predict more than others • What is the effect of increasing neighborhood size? • Increasing k reduces RMSD and MAD • Little effect on bias • Slightly decreased range in imputed values with k = 10

Results • How do the results compare with random sampling?

Results • How do the results compare with random sampling? • p-values of 0.001 suggest that there is some underlying very weak relationship between snags and overstory • Imputation is better than just randomly assigning snags to stands

Why didn’t it work better? • Very weak correlations between overstory and snags • Snags are from prior stand • Many of the snags in the FIA database have advanced decay classes • Often snags are larger than QMD • Management history • Snags were removed at harvest • Thinning captures mortality

Future Direction • Assessing the effects of imputed data on habitat model outputs • If there are big differences then what?

Imputating snag data to forest inventory for wildlife habitat modeling