1 / 44

Data Analysis Using SAS

University of California at Berkeley (Extension) December 2006 Instructor: Jianmin Liu, Ph.D SAS-X446 Project Team Dan Brockman, Saranne Warner, Christine Iodice, Satish Prasad. Data Analysis Using SAS. Introduction. Many Topics to Cover A. BUSINESS OBJECTIVE B. DATA SET

damita
Download Presentation

Data Analysis Using SAS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. University of California at Berkeley (Extension)December 2006Instructor: Jianmin Liu, Ph.D SAS-X446 Project TeamDan Brockman, Saranne Warner, Christine Iodice, Satish Prasad Data Analysis Using SAS

  2. Introduction Many Topics to Cover A. BUSINESS OBJECTIVE B. DATA SET C. DATA EXPLORATION D. OUTLIERS E. MODEL SPECIFICATION F. TOP CHOICE MODEL H. MEAN VALUE OF CPM / 95% CI’S I. REGRESSION DIAGNOSTICS (RESIDUALS) J. ALTERNATIVE MODELS K FINAL THOUGHTS

  3. Introduction • Focus of discussion will be on our Best Regression Model When we remove outlier CPM=3.306, RSquare = .79, Adj RSq = .76

  4. Introduction • We will also introduce a Short Range model and Long Range Model. • Both with impressive R-Squared Values!

  5. Business Objective • Develop a Cost Model for airline service • Use regression modeling to estimate the Cost per Passenger Mile (CPM) • Identify the key factors that determine cost (statistically significant variables) • Identify the functional relationship between key factors and Cost (functional form) • Identify the impact that each factor has on cost (Beta coefficients) • Why Cost Models? • Profit optimization (cost versus revenues) • Long range planning • Cost sensitivity analysis • Operational budgeting • Just to name a few….

  6. The Data Set • Data Source • Civil Aeronautics Board report: Aircraft Operating Cost and Performance Report (August 1972) • Variables

  7. Exploratory look at the data • Know thy data! • Learn about your variables • Seek out potential errors • Spot potential outliers • Check for normal distribution of variables • Jump start model specification by looking at correlations between Dependent and Independent variables • Identify potential correlation issues among independent variables • Data Exploration Methods • Proc Contents / Proc Print • Proc Means • Proc Univariate / Histograms • Proc Corr • Scatterplots

  8. Data Exploration • Proc Contents / Proc Print • 33 observations, All numeric variables • No missing values • 2 observations with CPM = 3.306 (potential error?)

  9. Data Exploration • Proc Means • Low / High values • Spread of data

  10. Data Exploration • Proc Means (by Type) • 33 Total Observations14 short range (42%) 19 long range (58%) • 3.11 cents = Avg CPM for all flights3.34 cents = Avg CPM for short range 2.94 cents = Avg CPM for long range • .58 SD for all flights.58 SD for short range.53 SD for long range • Average Utilization is not very different between short/long range planes • Short range planes are flying several legs a day

  11. CPM ALF Data Exploration • Proc Univariate • Focusing on extreme values and distribution of data • Extreme high value in CPM (outliers??) • Extreme low value in ALF

  12. ASL: Length Seats Data Exploration • More Histograms • 75% of observations have between 100 and 150 seats roughly • Not a lot of variation ALS in what is considered short range planes • Much bigger variation in long range planes

  13. Data Exploration • Proc Corr • Inverse relationship b/w independents and dependent • UTL and ALF appear to have strongest correlation with CPM • Multicollinearity potential

  14. Data Exploration • Scatterplots • Data Screening / Seeking Outliers • Nonlinearities / Data transformations

  15. Data Exploration • Clustering of data pointsindicates need for data transformation(? Log Form?)

  16. Data Exploration • Scatterplot Matrix – Very Cool!

  17. Outliers • Suspects • Extreme high values of CPM (4.737, 4.024)) • Extreme low value of ALF (.287) • Same value of CPM (3.306) / coincidence or data error? • Our Approach • Wait and See! • Even though CPM = 4.737 is more than 2 SD’s from the mean, the other values of the observation may make if plausible • Low Seats (SPA) combined with low Load Factor (ALF) make for few passengers transported • Low Flight Length (UTL) combined with Low Hrs of Use (UTL) make for few total miles travelled • ….which lead us to “What do we think causes high/low CPM? What are our underlying theories?” • But first… one more note on outliers…

  18. Outliers: Spotlight on Innovative Code • If we had more observations, we might consider….

  19. Outliers: Spotlight on Innovative Code • Identification of extreme values (<1% or >99%) using previous code would point to the following outliers in our data sample

  20. Model Specification: Hypothesized Relationships • So what are our “hunches”, “theories”,“hypotheses” regarding what drivesCost Per PassengerMile (CPM)? • Airline industry “produces” Passenger Miles (our unit) • Large fixed costs of production • Cost of a running a plane • Cost of plane itself, pilot /staff, related infrastructure (slips/spots at airports, etc) • Economies of Scale • An increase in passengers  decreases cost per passenger (nonlinear) • An increase in miles  decreases cost per mile (nonlinear) • Holding constant gas mileage and gas costs • CPM = Function (Fixed Costs, Passengers, Miles)

  21. Model Specification: Independent Variables • Variables listed below in category (Passenger, Miles, Fixed Costs) • Note calculated variables and use of Binning method

  22. Model Specification: Independent Variables • Performed Data Transformations to explore different functional forms • Created log form of all variables Lx = Log(x) • Created squared variablesSx = x**2; • In sum, we created a slate of potential independent variablesto explore • Variables calculated from others • Binning technique • Log and Squared transformations

  23. Model Specification: Regression Equations • While we had our ideas …we explored many…many…many… models! • So many models, that we needed to write code to compare them all

  24. Model Specification: Selected Models • Explored many functional forms and independent variables

  25. Best Regression Model • CPM = ASL LUTL LFilled SPA • Explains nearly 70% of the variance of CPM (Rsquare = .685, Adj Rsq=.64) • LUTL, LFilled, SPA are highly statistically significant at the .01 level • ASL mildly significant at the .1 (10%) level • CPM = 18.363 + 0.0002596 ASL-1.534 LUTL–3.3222 LFilled + 10.56 SPA Intercept Avg Flight Length Log of UTL Log of Filled Seats Avg # of Seats Intercept Flight Length Miles Traveled Passengers Plane Size/ Fixed Cost

  26. Best Regression Model CPM = 18.363 + 0.0003 ASL-1.534 LUTL–3.322 LFilled + 10.560 SPA • Additional Explanation: • CPM increases as Flight Length (ASL) increase • Staffing costs, food costs • CPM decreases as Utilization (LUTL) increases. • UTL is proxy for Miles Traveled. Relationship is nonlinear. • CPM decreases as filled seats (Lfilled) increases. • Filled seats equates to # of passengers. Relationship is nonlinear • CPM increases as seats per aircraft (SPA) increases. • Seats per aircraft is indicate plane size. Larger planes cost more to purchase and to operate(i.e. larger staffing, more gas)

  27. Best Regression Model • Key Performance Statistics • Rsquare, Adj Rsq, T-tests • Variance of Influence (VIF) measures multicollinearity. Value less than 10 is OK. When we remove outlier CPM=3.306, RSquare = .79, Adj RSq = .76

  28. Best Model : Coefficients Standardized • What does a 3.32 change in LFilled really mean?How strong is that coefficient compared to others? • Standardized values are used to Compare the relative strength of variables • Standardized Values (Beta coefficients) are measured in the standard deviations, instead of the units of the variables, thus they can be used to compared variables • How to interpret • Raw Coefficient: A one unit increase in LFilled would yield a 3.32 unit decrease in CPMBeta Coefficient: A one SD increase in LFilled would yield a 2.08 SD decrease in CPM • LFilled is our strongest predictor • Use STB option in Model statement to get Beta Coefficients

  29. Mean Value of CPM at 95% Confidence Level • CPM = SPA The 95% confidence intervalshown to the right indicates where mean value of CPM is most likely to lie Another less often used confidence interval shows the interval for individual data points. (not shown)

  30. Regression Diagnostics • Plotting Residuals • Check assumptions made in the modeling process by examining the residuals • Residuals are the difference between the observed and fitted values of the response variable. (CPM) • Three Types of Residual Plots • A plot of residuals against predicted values of the response variable • A plot of the residuals against each explanatory variable in the model • A normal probability plot of the residuals

  31. Should appear completely random If the variance of the response appears to increase with the predicted value, a transformation of the response variable may be necessary Looks good! Regression Diagnostics • Plot of residuals against predicted values of CPM • One of the assumptions of linear regression is that the residuals are normally distributed. This assures that the p-values for the t-tests will be valid.

  32. Regression Diagnostics • Plot of residuals against each explanatory variable • The presence of a curvilinear relationship, for example, would suggest that a higher-order term (quadratic) in the explanatory variable is needed in the model LUTL LFilled

  33. Regression Diagnostics • Plot of residuals against each explanatory variable (continued) Horse shoe shape: Potential issue (?)Should look at partial regression plots (?) ASL SPA

  34. Regression Diagnostics • Normal probability plot of the residuals • After all systematic variation has been removed from the data, the residuals should look like a sample from the normal distribution • The NPP is a graphical technique for assessing whether or not a data set is approximately normally distributed. • The data is plotted against a theoretical distribution such that the points should form an approximate straight line. • Departures from a straight line indicate departures from normality

  35. Regression Diagnostics • Index plot of Cooks Distance Statistics • Cook’s D is a distance measure that helps us determine how strongly a particular data point affects the overall regression. • Large absolute values of D (2 or more) indicate possible problems with model or data points that require scrutiny. Only .6

  36. Alternative Models • Short Range vs Long Range Modeling • Is the CPM function different for short range planes and long range planes? • A short range plane might be able to achieve an acceptable CPM running flights w/ fewer passengers on average (ALF * SPA)--- as long as it operates many flights (multiple legs) per day. • On the other hand, long range planes may NOT be able to maintain an acceptable CPM with fewer passengers b/c of the longer distances they fly.

  37. Alternative Models • Regresson Model for Short Range Airplanes • All observations included, Beta Coeffficients displayed below

  38. Alternative Models • Regression for Short Range Planes: Performance

  39. Alternative Models • Regresson Model for Long Range Airplanes • All observations included, Beta Coeffficients displayed below

  40. Alternative Models • Regression for Long Range Planes: Performance

  41. Alternative Models • Model with good Adj-R-Sq, robust regarding outliers • Model 4 (working name: im1) • CPM = -1.093 +0.2003 ISPA +1.265 IALF (33 observations, Adj-R-sq = 0.51)CPM = -0.5231 +0.1574 ISPA +1.112 IALF (31 observations, Adj-R-sq = 0.49)CPM = -0.9557 +0.1470 ISPA +1.350 IALF (27 observations, Adj-R-sq = 0.50) • ISPA = (1/SPA), IALF = (1/ALF) • Good t-scores, good VIF • Rationale • CPM = $ / (passengers * miles) • Perhaps Passengers = SPA * ALF * K (for some constant K) • Then CPM is roughly inversely proportional to SPA and ALF • Model 4 is roughly inversely proportional to SPA and ALF

  42. Alternative Models • Model with high Adj-R-Sq excluding certain outliers • Model 3 (working name: Long3) • CPM = 12.13 –2.639 LFilled –2.595 Lpsize (31 observations) • LFilled is log of number of seats filled • Lpsize is log of the plane’s size bucket • Adj R-Sq = 0.63, good t-scores, good VIF • 2 outliers removed by “star” method • Model with good Adj-R-Sq, robust regarding outliers • Model 5 (working name: Long2) • CPM = 7.079 –2.017 LFilled +1.039 Lempty (33 observations, Adj-R-sq = 0.47) • CPM = 6.168 +1.822 LFilled +1.040 Lempty (31 observations, Adj-R-sq = 0.50) • Good t-scores, good VIF • Serendipity • We found them while searching for specialized Long-Flight and Short-Flight models • Generally applicable to all flights.

  43. Final Thoughts • xxxxx • xxxxx

  44. Thank YouQuestions ?

More Related