Looking Ahead of the Curve: an ARIMA Modeling Approach to Enrollment Forecasting John G. Zhang, Ph.D. Harper College jzhang@harpercollege.edu
Topics • Why forecast • How to forecast • Why ARIMA • What is ARIMA • How to ARIMA • How ARIMA did • Discussion 47th AIR Annual Forum
Why Forecast • Queries and Reports: what was • Dashboard: what is • Forecasts: what will be • Forecast for enrollment: more valuable for resource planning
How to forecast • Naïve forecast: random walk, moving average • Exponential smoothing • Markov chain • Regression • ARIMA • Others • Combining methods
Why ARIMA • Naïve forecast: best guess if no patterns • Exponential Smoothing: usually designed for one-step ahead forecast • Markov chain: see reference • Regression: frequently violates the assumption of uncorrelated errors • ARIMA: worked well, more later • Others: see reference • Combining Methods: non-directional
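The simpler baselines named above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the FTE values are made up for the example and do not come from the presentation.

```python
def random_walk_forecast(series, steps):
    """Naive forecast: repeat the last observed value for every future step."""
    return [series[-1]] * steps


def moving_average_forecast(series, window, steps):
    """Forecast each step as the mean of the last `window` values,
    feeding earlier forecasts back in for multi-step horizons."""
    history = list(series)
    forecasts = []
    for _ in range(steps):
        nxt = sum(history[-window:]) / window
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


def simple_exp_smoothing(series, alpha):
    """Return the smoothed level after the last observation,
    which serves as the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


fte = [100.0, 104.0, 103.0, 108.0]  # hypothetical values, for illustration only
rw = random_walk_forecast(fte, 2)
ma = moving_average_forecast(fte, 2, 2)
ses = simple_exp_smoothing(fte, 0.5)
```

Simple exponential smoothing returns a single level, which is one way to see why it is usually treated as a one-step-ahead method.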
What is ARIMA • AutoRegressive Integrated Moving Average • Generally, the model is given by
where $X_t$ is the time series value at time $t$, • $\theta_0$ is a constant, • $B$ is the backshift (lag) operator, • $i$ is the number of lags or spans, • $\varepsilon_t$ is the error term at time $t$, • $\phi$ and $\theta$ are the AR and MA parameters, and • $p$, $d$, and $q$ are the orders of AR, I, and MA
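One common textbook way to write the general model, consistent with the symbol definitions above, is sketched below. The placement of the constant $\theta_0$ relative to the differencing operator and the symbol $\varepsilon_t$ for the error term are assumptions, not taken from the slide.

```latex
\[
\phi(B)\,\bigl[(1 - B)^d X_t - \theta_0\bigr] = \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i,
\quad
\theta(B) = 1 - \sum_{i=1}^{q} \theta_i B^i,
\quad
B^i X_t = X_{t-i}.
\]
```

With $d = 0$ and $p = q = 1$ this reduces to $(1 - \phi_1 B)(X_t - \theta_0) = (1 - \theta_1 B)\,\varepsilon_t$, the ARMA(1, 1) case.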
If $p = 1$, $d = 0$, $q = 1$, ARMA(1, 1): $(1 - \phi_1 B)(X_t - \theta_0) = (1 - \theta_1 B)\,\varepsilon_t$ • If $p = 1$, $d = 0$, $\theta_1 = 0$, AR(1) model: $(1 - \phi_1 B)(X_t - \theta_0) = \varepsilon_t$ • If $p = 1$, $\phi_1 = 1$, $d = 0$, $\theta_1 = 0$, random walk: $(1 - B)(X_t - \theta_0) = \varepsilon_t$ • If $\phi_1 = 0$, $d = 0$, $\theta_1 = 0$, constant: $(X_t - \theta_0) = \varepsilon_t$
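To see how these special cases behave, the ARMA(1, 1) equation can be expanded into a recursion and run on a fixed list of shocks. This is a sketch under stated assumptions: the function name, the starting value, the zero pre-sample shock, and the shock values are all hypothetical.

```python
def arma11(shocks, phi1, theta1, theta0, x0=None):
    """Generate X_t from (1 - phi1*B)(X_t - theta0) = (1 - theta1*B) e_t.

    Expanded form: X_t = theta0 + phi1*(X_{t-1} - theta0) + e_t - theta1*e_{t-1}.
    Assumptions: the series starts at x0 (default theta0) and the shock
    before the first observation is zero.
    """
    prev_x = theta0 if x0 is None else x0
    prev_e = 0.0
    out = []
    for e in shocks:
        x = theta0 + phi1 * (prev_x - theta0) + e - theta1 * prev_e
        out.append(x)
        prev_x, prev_e = x, e
    return out


shocks = [1.0, -2.0, 0.5]  # made-up error terms
constant = arma11(shocks, phi1=0.0, theta1=0.0, theta0=10.0)     # X_t = theta0 + e_t
random_walk = arma11(shocks, phi1=1.0, theta1=0.0, theta0=10.0)  # X_t = X_{t-1} + e_t
ar1 = arma11(shocks, phi1=0.5, theta1=0.0, theta0=10.0)          # AR(1)
```

In the constant case each value is just the constant plus the current shock. In the random walk the shocks accumulate, and the AR(1) case sits in between, with each shock decaying by a factor of `phi1` per period.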
How to ARIMA • Box and Jenkins (1976) notation: $(p, d, q)(P, D, Q)_s$, where the second triple gives the seasonal orders and $s$ is the length of the season • Four stages: identification, estimation, validation, forecasting
How to ARIMA • SPSS Trends module: version 12 worked well; versions 13 and 14 changed algorithms, so the same data and the same program gave different forecasts • SAS ETS module, ARIMA procedure: more flexible; forecasts consistent; automation possible thanks to macros
Identification • Series Plot • Autocorrelation plot • Dickey-Fuller test of unit root hypothesis • AR models to compare the log likelihood values for a series and its transformed series
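The autocorrelation plot is built from sample autocorrelations. A minimal sketch of that calculation follows, using the usual estimator that divides each lag-k cross-product sum by the lag-0 sum; the function name and the toy series are illustrative assumptions.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


toy = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
acf = sample_acf(toy, 2)
```

A slowly decaying ACF is the usual visual sign that a series needs differencing before AR and MA orders are chosen.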
Identification • Degree of differencing • Order of AR • Order of MA • Seasonality if any
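Differencing, both regular and seasonal, can be sketched as below. The series and the season length of 3 are hypothetical, chosen only so that the effect is easy to see.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; lag=1 is regular differencing,
    lag=s is seasonal differencing with season length s."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up series: a repeating 3-period pattern plus a steady rise of 2 per cycle.
toy = [10, 9, 4, 12, 11, 6, 14, 13, 8]
seasonal = difference(toy, lag=3)     # removes the repeating pattern
both = difference(seasonal, lag=1)    # removes what is left of the trend
```

Here one seasonal difference leaves a constant and one further regular difference leaves zeros, so differencing again would be over-differencing.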
Estimation • Q statistics • Goodness-of-fit criteria: variance estimate, Akaike information criterion, Schwarz Bayesian criterion • Significance of parameters • Residuals analysis • Mean Absolute Percent Error
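The two information criteria follow the standard definitions AIC = -2 ln L + 2k and SBC = -2 ln L + k ln n, where k is the number of estimated parameters and n the number of observations used. A minimal sketch, with made-up inputs:

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def sbc(log_likelihood, n_params, n_obs):
    """Schwarz Bayesian criterion: -2 ln L + k ln n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fit: log likelihood -100 with 3 parameters on 100 observations.
example_aic = aic(-100.0, 3)
example_sbc = sbc(-100.0, 3, 100)
```

Lower values are better for both. SBC penalizes extra parameters more heavily than AIC once n exceeds about 7, so it tends to favor the more parsimonious model.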
Data • Time series data • Date variable: year, quarter, month, week, day, hour, minute, second • Enrollment data: FTE, headcount, seatcount • Data points • Nature of the series determines the forecast
Patterns of Data • Trend: steady increase or decrease in the values of a time series • Cycle: long-term patterns of rising and falling data • Seasonality: regular change in the data values that occurs at the same time in a given period
FTE [figure: FTE series plot]
FTE Pattern • Trend: FTE increased from 1998 to 2006, suggesting the series is non-stationary and needs differencing • Seasonal: higher in the Fall and Spring and lower in the Summer every year, implying a seasonal factor should be part of the model
Autocorrelations and Partial Autocorrelations (ACF and PACF)
Lag       ACF       PACF
  0   1.00000
  1   0.64901    0.64901
  2   0.29267   -0.22210
  3  -0.06855   -0.28449
  4  -0.42111   -0.37073
  5  -0.42944    0.18006
  6  -0.43520   -0.26468
  7  -0.40880   -0.29117
  8  -0.38067   -0.45581
  9  -0.06784    0.72564
 10   0.25681    0.06626
 11   0.55983    0.26005
 12   0.85774    0.18460
 13   0.55625   -0.22575
 14   0.24975    0.14806
 15  -0.06186    0.10247
 16  -0.36715    0.16423
 17  -0.37708   -0.18254
 18  -0.38454    0.15059
 19  -0.36197   -0.04279
 20  -0.33780    0.11045
 21  -0.07144   -0.18268
 22   0.20576    0.08106
 23   0.46222   -0.06703
Q Statistics • Autocorrelation Check of Residuals
To Lag   Chi-Square   DF   Pr > ChiSq   Autocorrelations
     6       385.69    6       <.0001   0.937  0.874  0.808  0.743  0.727  0.711
    12       777.02   12       <.0001   0.709  0.707  0.752  0.799  0.833  0.866
    18      1107.12   18       <.0001   0.811  0.755  0.697  0.640  0.624  0.608
    24      1436.47   24       <.0001   0.605  0.603  0.640  0.679  0.706  0.732
• Q statistics show that the autocorrelations at the lags tested are highly statistically significant • Autocorrelations were very high • Further action needed
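A Q statistic of this kind can be computed from the autocorrelations directly. The sketch below assumes the Ljung-Box form, Q = n(n + 2) Σ r_k² / (n - k), which is the commonly used version. The function name and the inputs are hypothetical, and the slide's own figures are not reproduced because the number of observations is not given.

```python
def ljung_box_q(autocorrs, n_obs):
    """Ljung-Box Q for autocorrelations r_1..r_m from a series of n_obs points.

    Under the null hypothesis of no autocorrelation, Q is approximately
    chi-square distributed, so a large Q means the autocorrelations are
    jointly significant.
    """
    return n_obs * (n_obs + 2) * sum(
        r * r / (n_obs - k) for k, r in enumerate(autocorrs, start=1)
    )


q = ljung_box_q([0.5, 0.0], n_obs=10)  # made-up autocorrelations
```

For residuals from a fitted model, a small Q with a large p-value is the desired outcome, since it indicates that the model has left no systematic autocorrelation behind.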
FTE Forecast [figure: FTE forecast plot]
How ARIMA Did • Accuracy: what matters most • 2 periods ahead: 0.74% (FTE), 0.50% (HC) • 6 periods ahead: 1.43% (FTE), 1.65% (HC) • 10 periods ahead: 1.40% (FTE), 2.52% (HC) • Forecast error tends to grow further into the future • Eleanor S. Fox (2005): 1.2% (4), 4.1% (8) • NCES (2003): 1.9% (2), 3.6% (6)
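Assuming the percentages above are mean absolute percent errors, the accuracy measure listed under Estimation, they could be computed from held-out actuals as sketched here. The function name and the numbers are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percent error, in percent, over paired values."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


actual = [100.0, 200.0]    # hypothetical observed enrollment
forecast = [99.0, 204.0]   # hypothetical forecasts for the same periods
error_pct = mape(actual, forecast)
```

Computing this separately for each forecast horizon gives a comparison like the 2-, 6-, and 10-period figures.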
Discussion • In theory, other factors can be included alongside the time series itself, as in regression • Unemployment rate • Consumer Price Index (CPI) • High school student population • District population • Tuition • Forecasts used for forecasting?
Discussion • Stationarity and homogeneity • Scarcity and spuriousness • Seasonality and outliers • Raw or cooked data • Data mining and stepwise • Fit and accuracy • Additive or multiplicative (subset/factored)
Discussion • Science and art • Objective and Subjective • Quantitative and qualitative • Over-differencing and over-fitting • Parsimony and uncertainty • Simple or complex