
TR 555 Statistics “Refresher” Lecture 3: Models


  1. TR 555 Statistics “Refresher” Lecture 3: Models

  2. References • Penn State University, Dept. of Statistics, Statistical Education Resource Kit: a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003 • Tom Maze, stat course prepared for KDOT, 2003 • Statistical and Econometric Methods for Transportation Data Analysis by Washington, Karlaftis and Mannering, Chapman and Hall, 2003 • Online documentation: NCHRP 20-45, Scientific Approaches to Transportation Research, http://gulliver.trb.org/publications/nchrp/cd-22/start.htm, accessed September 23, 2003

  3. Outline • ANOVA • Linear Regression Analysis • Poisson Regression • Probit and Logit Models

  4. One-Way Analysis of Variance … to compare 2 or more population means

  5. Does learning method affect students’ exam scores? • Consider 3 methods: • standard • osmosis • shock therapy • Convince 15 students to take part. Assign 5 students randomly to each method. • Wait eight weeks. Then test the students to get exam scores.

  6. Suppose … Study #1 Is there a reasonable conclusion?

  7. Suppose … Study #2 Is there a reasonable conclusion?

  8. Suppose … Study #3 Is there a reasonable conclusion?

  9. “Analysis of Variance” The variation between the group means and the grand mean is larger than the variation within each of the groups.

  10. “Analysis of Variance” The variation between the group means and the grand mean is smaller than the variation within each of the groups.

  11. Analysis of Variance • A division of the overall variability in data values in order to compare means. • Overall (or “total”) variability is divided into two components: • the variability “between” groups, and • the variability “within” groups • Summarized in an “ANOVA” table.

  12. Assumptions of ANOVA • Distributions are normal (see the normality tests from last lecture: plots, chi-square, K-S, …) • Variances are approximately equal … • For more than 2 factor levels, use Bartlett’s or Hartley’s test • If assumptions are “significantly” violated, use the Kruskal-Wallis test in lieu of ANOVA

  13. General ANOVA Table
      One-way Analysis of Variance
      Source   DF    SS            MS    F         P
      Factor   t-1   SS(Between)   MSB   MSB/MSE   P-value
      Error    N-t   SS(Error)     MSE
      Total    N-1   SS(Total)
      • “Source” means “find the components of variation in this column” • “DF” means “degrees of freedom” • “SS” means “sums of squares” • “MS” means “mean square” • “F” means “F test statistic” • “P” means “P-value”

  14. General ANOVA Table
      One-way Analysis of Variance
      Source   DF    SS            MS    F         P
      Factor   t-1   SS(Between)   MSB   MSB/MSE
      Error    N-t   SS(Error)     MSE
      Total    N-1   SS(Total)
      • “Factor” means “variability between groups” or “variability due to the factor of interest” • “Error” means “variability within groups” or “unexplained random variation” • “Total” means “total variation from the grand mean”

  15. General ANOVA Table
      N = total number of data values. t = number of groups (or “factor levels”)
      One-way Analysis of Variance
      Source   DF    SS            MS    F         P
      Factor   t-1   SS(Between)   MSB   MSB/MSE
      Error    N-t   SS(Error)     MSE
      Total    N-1   SS(Total)
      • MSB = SS(Between)/(t-1) • MSE = SS(Error)/(N-t) • N-1 = (t-1) + (N-t) • SS(Total) = SS(Between) + SS(Error) • P is taken from the F-distribution with t-1 numerator and N-t denominator d.f.

  16. ANOVA Table for Study #1
      One-way Analysis of Variance
      Source   DF   SS       MS       F       P
      Factor    2   2510.5   1255.3   93.44   0.000
      Error    12    161.2     13.4
      Total    14   2671.7
      • 1255.3 ≈ 2510.5/2 • 13.4 ≈ 161.2/12 • 14 = 2 + 12 • 2671.7 = 2510.5 + 161.2 • 93.44 = MSB/MSE, computed from the unrounded mean squares (the rounded values 1255.3/13.4 give a slightly different ratio)
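The arithmetic in the Study #1 table can be checked with a few lines of Python. This is only a sketch: the function name `anova_from_ss` is made up for illustration, and the group count (3 methods) and sample size (15 students) are taken from the study description.

```python
def anova_from_ss(ss_between, ss_error, n_groups, n_total):
    """Fill in DF, MS and F for a one-way ANOVA table from its two sums of squares."""
    df_factor = n_groups - 1          # t - 1
    df_error = n_total - n_groups     # N - t
    msb = ss_between / df_factor      # mean square between groups
    mse = ss_error / df_error         # mean square error (within groups)
    return {
        "df_factor": df_factor,
        "df_error": df_error,
        "df_total": df_factor + df_error,
        "ss_total": ss_between + ss_error,
        "msb": msb,
        "mse": mse,
        "f": msb / mse,
    }


# Sums of squares copied from the Study #1 table.
study1 = anova_from_ss(ss_between=2510.5, ss_error=161.2, n_groups=3, n_total=15)
for name, value in study1.items():
    print(f"{name}: {value:.2f}")
```

Working from the unrounded mean squares reproduces the tabled F of 93.44, which the rounded values 1255.3 and 13.4 do not quite do.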

  17. Recall Study #3

  18. ANOVA Table for Study #3
      One-way Analysis of Variance
      Source   DF   SS       MS     F      P
      Factor    2     80.1   40.1   0.46   0.643
      Error    12   1050.8   87.6
      Total    14   1130.9
      • The P-value (0.643) is large, so we cannot reject the null hypothesis. There is insufficient evidence to conclude that the average exam scores differ for the three learning methods.

  19. Does the distance it takes to stop a car at 60 mph depend on tire brand?
      Brand1   Brand2   Brand3   Brand4   Brand5
      194      189      185      183      195
      184      204      183      193      197
      189      190      186      184      194
      189      190      183      186      202
      188      189      179      194      200
      186      207      191      199      211
      195      203      188      196      203
      186      193      196      188      206
      183      181      189      193      202
      188      206      194      196      195

  20. Comparison of Five Tire Brands: Stopping Distance at 60 mph

  21. Sample Descriptive Statistics
      Brand   N    MEAN     SD
      1       10   188.20   3.88
      2       10   195.20   9.02
      3       10   187.40   5.27
      4       10   191.20   5.55
      5       10   200.50   5.44
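Hartley's test, mentioned under the ANOVA assumptions, is based on the F_max statistic: the largest sample variance divided by the smallest. The sketch below computes it from the standard deviations in the table above. It assumes equal group sizes (true here, N = 10 per brand). The function name `hartley_f_max` is illustrative. The critical value still has to be looked up in an F_max table and is not computed here.

```python
# Sample standard deviations copied from the descriptive statistics table.
brand_sd = {1: 3.88, 2: 9.02, 3: 5.27, 4: 5.55, 5: 5.44}


def hartley_f_max(sds):
    """Ratio of the largest to the smallest sample variance (equal group sizes assumed)."""
    variances = [sd ** 2 for sd in sds]
    return max(variances) / min(variances)


f_max = hartley_f_max(brand_sd.values())
print(f"F_max = {f_max:.2f}")
```

A ratio near 1 supports the equal-variance assumption; a large ratio (relative to the tabled critical value for 5 groups and 9 degrees of freedom per group) suggests it is violated.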

  22. Hypotheses • The null hypothesis is that the group population means are all the same. That is: • H0: μ1 = μ2 = μ3 = μ4 = μ5 • The alternative hypothesis is that at least one group population mean differs from the others. That is: • HA: at least one μi differs from the others

  23. Analysis of Variance
      Analysis of Variance for comparing all 5 brands
      Source   DF   SS       MS      F      P
      Brand     4   1174.8   293.7   7.95   0.000
      Error    45   1661.7    36.9
      Total    49   2836.5
      • The P-value is small (0.000, to three decimal places), so reject the null hypothesis. There is sufficient evidence to conclude that at least one brand is different from the others.
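As a check, the table above can be rebuilt from the raw stopping distances on slide 19. The following is a minimal pure-Python sketch, not the software that produced the slide output. The function name `one_way_anova` is illustrative. The P-value is left out because it needs an F-distribution routine (for example `scipy.stats.f.sf`, if SciPy is available).

```python
from statistics import mean

# Stopping distances from slide 19, one list per brand (column of the data table).
stopping = {
    "Brand1": [194, 184, 189, 189, 188, 186, 195, 186, 183, 188],
    "Brand2": [189, 204, 190, 190, 189, 207, 203, 193, 181, 206],
    "Brand3": [185, 183, 186, 183, 179, 191, 188, 196, 189, 194],
    "Brand4": [183, 193, 184, 186, 194, 199, 196, 188, 193, 196],
    "Brand5": [195, 197, 194, 202, 200, 211, 203, 206, 202, 195],
}


def one_way_anova(groups):
    """Split total variability into between-group and within-group parts."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Between: group means versus the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within (error): each value versus its own group mean.
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_error = len(all_values) - len(groups)
    msb = ss_between / df_between
    mse = ss_error / df_error
    return {"ss_between": ss_between, "ss_error": ss_error,
            "df_between": df_between, "df_error": df_error,
            "msb": msb, "mse": mse, "f": msb / mse}


result = one_way_anova(list(stopping.values()))
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```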

  24. Another Transportation Example

  25. Another Transportation Example (cont)

  26. Another Transportation Example (cont) important

  27. Regression Analysis

  28. Purpose • Model a continuous Y (dependent variable) on a vector of Xs (explanatory or independent variables, aka covariates) … • What causes Y? • What is the future of Y? • Can we control Y? • How does a change in an X affect Y? • Note: causation is important in specifying the model

  29. Examples of regression models • Trips per household per day related to household demographics, land use, access to the network, etc. • Arterial crashes related to accesses per mile, traffic volume, minutes of delays, stops per vehicle, etc. • Crashes related to facility design such as lane width, shoulder width, degree of horizontal curvature, etc. • What other regression relationships are commonly used in Transportation?

  30. Assumptions of linear regression • Dependent variable is continuous (if not, use Poisson, binomial or logistic regression) 1. The dependent variable varies linearly with the independent variable (you can linearize, but it is not always appropriate) 2. The dependent variable is randomly sampled from the population of interest 3. Changes in the dependent variable are caused by changes in the independent variable

  31. Assumptions of linear regression 4. There is uncertainty in the relationships, reflected as error terms 5. The errors must be normally distributed with mean zero and constant variance (homoskedastic), or the distribution must be identified (if you want to use inference) 6. Independent variable is measured without error • Errors are not autocorrelated (over time, same person, etc.) • Errors are independent of X values • Xs are independent (or at least not too co-dependent) • All effect variables are in the model (no exogeneity) • No endogeneity exists (Y influences X, e.g., frequency of ice-related crashes influences presence of “ice on roadway” signs)

  32. If assumptions are violated … • Non-normal errors (5) • Transform, use Poisson or another model, or use bootstrap or Monte Carlo to define the actual distribution • Non-linearity (1) • Transform (careful of other assumptions!) • Non-constant variance (heteroskedastic) (5) • Use weighted, ridge or generalized regression • Correlations across time • Use time series (e.g., ARIMA) • Non-random errors • Instrumental variable techniques, proxy, structural models

  33. Regression theory • First specify a relationship: Yi = b0 + b1X1,i + b2X2,i + … + bp-1Xp-1,i + ei • Yi = the ith observation of the dependent variable • b1, b2, …, bp-1 are the partial effects of the independent variables (the coefficients) • X1, X2, …, Xp-1 are the values of the explanatory variables • ei is the random error term with mean zero; ei and ej are uncorrelated for i ≠ j

  34. http://www.cogs.susx.ac.uk/users/andyf/teaching/pg

  35. First steps Note: estimates are made based on minimizing square error in the Y direction • Propose a model form y=f(x’s) • Can include interaction terms (e.g., X1*X2) • Plot data (Y vs each X) • Identify linear relationships • Identify data issues • Transform X data to linearize if needed • Estimate the model
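The note about minimizing squared error in the Y direction can be made concrete for the one-variable case, where the least-squares slope and intercept have a closed form. The sketch below uses made-up data; the names `fit_simple_ols` and `sse` are illustrative, and real work would normally use a statistics package rather than hand-coded formulas.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared vertical (Y-direction) distances from the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_ols(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse(x, y, b0, b1):.4f}")
# Any other line gives a larger SSE than the fitted one.
print(f"SSE with a shifted intercept = {sse(x, y, b0 + 0.1, b1):.4f}")
```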

  36. Estimate the model

  37. Goodness of fit • R squared = (SST-SSE)/SST = 1-SSE/SST • Never goes down when variables are added • In simple linear regression, equals the square of the Pearson correlation coefficient • Adjusted R squared = 1-(n-1)/(n-p)*SSE/SST • Where n is the sample size and p is the number of estimated parameters (the intercept plus the x variables) • Goes down when adding insignificant variables • Only use R squared to compare models • [Figure: SSE and SST]
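The two goodness-of-fit formulas can be compared side by side. In this sketch `p` counts every estimated parameter including the intercept, which is what the (n-1)/(n-p) factor requires. The SSE and SST values are hypothetical, chosen to show a nearly useless extra variable raising R squared while lowering adjusted R squared.

```python
def r_squared(sse, sst):
    """Share of total variation explained by the model."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, p):
    """R squared penalized for the number of estimated parameters p (intercept included)."""
    return 1 - (n - 1) / (n - p) * sse / sst


# Hypothetical fits on the same n = 23 observations with SST = 100.
n, sst = 23, 100.0
small = {"sse": 20.0, "p": 3}   # intercept + 2 x variables
large = {"sse": 19.9, "p": 4}   # one more x variable that barely reduces SSE

for label, m in (("small", small), ("large", large)):
    r2 = r_squared(m["sse"], sst)
    adj = adjusted_r_squared(m["sse"], sst, n, m["p"])
    print(f"{label}: R2 = {r2:.4f}, adjusted R2 = {adj:.4f}")
```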

  38. Full vs. reduced models • Full model uses all variables • Reduced model drops one or more variables thought not to contribute (or to have problems) • To test the hypothesis that the additional variables in the full model have betas of zero (i.e., are meaningless), use F = [(SSER - SSEF) / (DFR - DFF)] / (SSEF / DFF) • If F < F from table (1-α, DFR-DFF, DFF), then do not reject H0
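A small sketch of the full-versus-reduced F statistic, in its standard form: the drop in SSE per dropped degree of freedom, divided by the full model's SSE per error degree of freedom. The function name `partial_f` and the input numbers are hypothetical. The critical value still has to come from an F table (or an F-distribution routine) with DFR-DFF and DFF degrees of freedom.

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for H0: the extra coefficients in the full model are all zero.

    df_reduced and df_full are the error degrees of freedom of each model.
    """
    extra_explained = (sse_reduced - sse_full) / (df_reduced - df_full)
    full_error = sse_full / df_full
    return extra_explained / full_error


# Hypothetical example: the full model adds 2 variables to the reduced model.
f_stat = partial_f(sse_reduced=120.0, df_reduced=47, sse_full=100.0, df_full=45)
print(f"F = {f_stat:.2f} on (2, 45) degrees of freedom")
```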

  39. Example FARS Data

  40. One possible model • Fatal Crashes = 48.13122 + 0.014259 (VMT) • t statistic = coefficient / standard error • See How to Read the Output From Simple Linear Regression Analyses, http://www.tufts.edu/~gdallal/slrout.htm • The standard errors are the standard errors of the regression coefficients. They can be used for hypothesis testing and constructing confidence intervals. For example, say the standard error of a coefficient is 0.219. A 95% confidence interval for the regression coefficient is constructed as (estimate ± k × 0.219), where k is the appropriate percentile of the t distribution with degrees of freedom equal to the Error DF from the ANOVA table. If, say, the degrees of freedom are 60, the multiplier is 2.00. Thus, for a coefficient estimate of 3.016, the confidence interval is 3.016 ± 2.00 × 0.219. If the sample size were huge, the error degrees of freedom would be larger and the multiplier would become the familiar 1.96.
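The confidence interval example on this slide works out as follows. The sketch takes the t multiplier as an input (2.00 for 60 degrees of freedom, as stated on the slide) rather than computing it; the function name `coefficient_ci` is illustrative.

```python
def coefficient_ci(estimate, std_error, t_multiplier):
    """Confidence interval: estimate plus or minus t_multiplier * standard error."""
    half_width = t_multiplier * std_error
    return estimate - half_width, estimate + half_width


# Numbers from the slide: estimate 3.016, standard error 0.219, multiplier 2.00.
low, high = coefficient_ci(estimate=3.016, std_error=0.219, t_multiplier=2.00)
t_stat = 3.016 / 0.219  # t statistic = coefficient / standard error

print(f"95% CI: ({low:.3f}, {high:.3f})")
print(f"t statistic: {t_stat:.2f}")
```

Because the interval excludes zero (equivalently, the t statistic is far larger than 2), this coefficient would be judged significantly different from zero at the 5% level.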

  41. Is the model good enough? • Although we expect VMT and fatal crashes to be related, we know it's not that simple • Other factors can include • Percentage of VMT on rural highways • Percentage of VMT on highways of different classes • Weather conditions • Response of medical services

  42. Adding another variable • Include percentage of VMT on rural highways Fatal Crashes = -113.2 + 0.0142 (VMT) + 297.4 (% of VMT on rural roads) What does an R Square of 0.93 mean?

  43. Using rural variable alone … • Fatal crashes and percent of VMT in rural areas • What is up with this? Fatal Crashes = 1526 - 1485 (% of VMT on rural roads)

  44. Regression example • Fatal crash rate vs. percent rural VMT • Fatal Crash rate = 0.017 + 0.009 (% of VMT on rural roads) • Why can we only account for 18% of the variance in the dependent variable?

  45. Percentage of Rural VMT versus Crash Rate

  46. Taking % VMT rural into account • Add a dummy variable (0 or 1) for each of the following ranges of % rural VMT: • 0% to 20% • 20% to 40% • 40% to 60% • No dummy is added for 60% to 100%, to avoid perfect dependence between the variables; that range is taken into account by the intercept constant.
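The dummy coding described above can be sketched as a small function. Several details are assumptions, not taken from the slides: the percentage is passed as a number from 0 to 100, the ranges are treated as mutually exclusive, and a value exactly on a boundary (say 20%) is assigned to the higher range. The function name `rural_dummies` is illustrative.

```python
def rural_dummies(pct_rural):
    """Return (d_0_20, d_20_40, d_40_60) for a % rural VMT value in [0, 100].

    The 60-100% range gets no dummy: it is the reference level absorbed by
    the intercept, so all three dummies are 0 there.
    """
    if not 0 <= pct_rural <= 100:
        raise ValueError("pct_rural must be between 0 and 100")
    return (
        1 if pct_rural < 20 else 0,
        1 if 20 <= pct_rural < 40 else 0,
        1 if 40 <= pct_rural < 60 else 0,
    )


for pct in (10, 30, 55, 75):
    print(pct, rural_dummies(pct))
```

Adding a fourth dummy for 60% to 100% would make the four dummies sum to 1 for every observation, which duplicates the intercept column and makes the coefficients impossible to estimate uniquely.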

  47. Result of dummy model Fatal Crashes = 83.44 + 0.015 (total VMT) - 323.9 (1 when rural VMT < 20%) - 47.99 (1 when rural VMT < 40%) - 41.49 (1 when rural VMT < 60%)

  48. Alternative specification Fatal Crashes = 39.71 + 0.015 (total VMT) + 43.92 (1 when rural VMT > 60%) - 279.3 (1 when rural VMT < 20%)

  49. Final specification Fatal Crashes = 59.44 + 0.015 (total VMT) - 287.9 (1 when rural VMT < 20%)
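To see how the final specification is used, the sketch below plugs values into the fitted equation. The coefficients are from the slide. The units of total VMT are not stated in the transcript, so the inputs are purely hypothetical, and the rural share is assumed to be given as a percentage from 0 to 100. The function name `predict_fatal_crashes` is illustrative.

```python
def predict_fatal_crashes(total_vmt, pct_rural):
    """Predicted fatal crashes from the final specification on the slide.

    total_vmt: in whatever units the original data used (not stated in the slides).
    pct_rural: % of VMT on rural roads, assumed to be on a 0-100 scale.
    """
    low_rural = 1 if pct_rural < 20 else 0  # dummy: 1 when rural VMT < 20%
    return 59.44 + 0.015 * total_vmt - 287.9 * low_rural


# Hypothetical inputs: same total VMT, different rural shares.
mostly_urban = predict_fatal_crashes(total_vmt=50_000, pct_rural=10)
more_rural = predict_fatal_crashes(total_vmt=50_000, pct_rural=30)
print(f"rural share 10%: {mostly_urban:.2f}")
print(f"rural share 30%: {more_rural:.2f}")
print(f"difference: {more_rural - mostly_urban:.1f}")
```

At equal VMT, the model predicts 287.9 fewer fatal crashes when less than 20% of VMT is rural; that gap is exactly the dummy's coefficient.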

  50. Assumption Checks • Non-linearity of regression function • Heteroscedasticity of error terms • Lack of independence of error terms • Extreme influence of outlying observations • Non-normality of error terms • Omission of important variables • Multi-collinearity of independent variables • Poorly measured independent variables
