Deer-Vehicle Crashes Hui Anne Ben
Goal • Create a model that will be useful in predicting the number of deer-vehicle crashes on a given section of roadway
The Response Variable • Y = number of deer-vehicle crashes per half-mile section of roadway over a 1-year period • Location – Ashtabula County
The Predictor Variables • X1 = no. of vertical curves • X2 = no. of horizontal curves • X3 = no. of ditches • X4 = no. of residences • X5 = no. of driveways • X6 = % of adjacent forest land
Preview • 40 observations total • 6 candidate regressors • X’s are clearly known • Lurking variable(s) seem possible, however • Y is a count per unit time
Lurking Variable • Residual plot from the full linear regression reveals two distinct groups • Data is divided in half
Enter a New Variable • Create an unknown variable by splitting the data into the two groups seen in the residual plot • Noticeable difference between Y values for the first 20 observations and the last 20
Unknown Variable • New variable is X7 = unknown • Variable is unknown to us • Was not considered during the collection of data
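The grouping step above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name is hypothetical, and which half is coded 0 and which is coded 1 is an assumption made here.

```python
# Hypothetical sketch: build the 0/1 grouping variable X7 for 40 observations.
# Assumption: the first 20 rows form one residual group and the last 20 the
# other; the choice of 0 for the first half is ours, not stated on the slides.
N_OBS = 40
SPLIT = 20


def make_group_indicator(n_obs=N_OBS, split=SPLIT):
    """Return a list holding 0 for the first `split` rows and 1 for the rest."""
    return [0 if i < split else 1 for i in range(n_obs)]


x7 = make_group_indicator()
```

The resulting list can then be appended to the design matrix as a seventh candidate regressor.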
Variable selection • Best Subsets method in conjunction with Several Extra Sum of Squares Tests • Four variables X1, X5, X6, X7 are chosen
Linear regression analysis • We run a linear regression • Y vs. X1, X5, X6, X7 • Model : • R² = 88.2% • SSE = 51.17 • Decent fit
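For reference, the two summary numbers quoted here are usually computed from the observed and fitted values as shown below. This is a minimal sketch that assumes the usual definitions (SSE as the sum of squared residuals, R² = 1 − SSE/SST); the toy values are made up and are not the crash data.

```python
def sse(y, fitted):
    """Sum of squared residuals between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def r_squared(y, fitted):
    """Coefficient of determination, 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse(y, fitted) / sst


# Made-up toy values, for illustration only (not the deer-crash data).
y_toy = [1.0, 2.0, 3.0, 4.0]
fit_toy = [1.5, 1.5, 3.5, 3.5]
sse_toy = sse(y_toy, fit_toy)
r2_toy = r_squared(y_toy, fit_toy)
```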
Correlation Analysis • Noticeable Correlation between X7 and X6 • Unknown variable is associated with forested land
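A correlation check of this kind can be sketched with a plain Pearson coefficient. The helper name and the toy numbers below are hypothetical and made up for illustration; they are not the study's values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up toy data: a 0/1 group indicator and a forest percentage.
x7_toy = [0, 0, 0, 1, 1, 1]
x6_toy = [80.0, 70.0, 90.0, 20.0, 30.0, 10.0]
r_toy = pearson_r(x7_toy, x6_toy)
```

With a binary variable on one side, this is the point-biserial correlation, so a value far from zero simply says the two groups differ in their average forest percentage.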
A Thought About the Unknown Variable • Unknown variable negatively correlated with % of forested land • possible values of X7=Unknown: 0 and 1 • Might correspond to section of county • 0 -> rural part of county • 1 -> urban part of county
A Transformation • Many transformations were attempted • Best one: Y* = ln(Y + e²) • R² = 87.9% (untransformed) • SSE = 51.00 (untransformed) • Conclusion: not better than original linear model
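To compare a transformed model with the original one, fitted values have to be mapped back to the original scale before computing SSE. A minimal sketch follows, assuming the constant inside the logarithm is e squared; the function names are hypothetical.

```python
import math

# Assumption: the shift inside the logarithm is e squared.
SHIFT = math.e ** 2


def transform(y):
    """Forward transformation Y* = ln(Y + e^2)."""
    return math.log(y + SHIFT)


def back_transform(y_star):
    """Inverse transformation Y = exp(Y*) - e^2."""
    return math.exp(y_star) - SHIFT


def sse_original_scale(y, fitted_star):
    """SSE on the untransformed scale, given fits made on the Y* scale."""
    return sum((yi - back_transform(fs)) ** 2 for yi, fs in zip(y, fitted_star))
```

A section with zero crashes maps to Y* = 2, so the shift keeps the logarithm defined for zero counts.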
Poisson Regression • Recall: Y is a count per unit of time • A Poisson model is now fitted • Proc GENMOD • Log link function: ln(μ), where μ = E[Y]
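The slides use SAS Proc GENMOD for this fit. As a rough illustration of what a log-link Poisson fit does, here is a plain-Python Newton-Raphson sketch. It is not the authors' code; the function names and the toy counts are made up, and it assumes the first column of each row is the intercept.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_poisson(rows, y, n_iter=25):
    """Maximum-likelihood Poisson regression with log link, ln(mu) = x . beta.

    rows: list of predictor lists whose first entry is 1.0 (the intercept).
    """
    p = len(rows[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start at the log of the mean count
    for _ in range(n_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in rows]
        # Score vector: sum over observations of (y - mu) * x_j.
        grad = [sum((yi - mi) * row[j] for yi, mi, row in zip(y, mu, rows))
                for j in range(p)]
        # Fisher information: sum over observations of mu * x_j * x_k.
        info = [[sum(mi * row[j] * row[k] for mi, row in zip(mu, rows))
                 for k in range(p)] for j in range(p)]
        step = solve_linear(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Made-up toy counts: intercept plus a 0/1 indicator (not the crash data).
rows_toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
            [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
y_toy = [4, 6, 5, 1, 0, 2]
beta_toy = fit_poisson(rows_toy, y_toy)
```

With only an intercept and a 0/1 indicator, the fitted means equal the two group averages (5 and 1 in the toy data), which gives a simple way to sanity-check the routine.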
Poisson Regression Analysis • Fits and residuals were collected from the WORK library in SAS • R² = 89.15% • SSE = 45.04 • Not bad
Dominant Variable • Type I and Type III analysis in SAS • Suggests that the unknown variable is the only significant contributor • Decision: do not throw out the other regressors • Unknown variable is just a dominating variable
The Winning Model • The Poisson Model gets our vote
Thank You Abdullah Alhomidan (civil engineering) gave permission for us to use his data. FIN