Time Series Forecasting With Feed-Forward Neural Networks: Guidelines And Limitations
Eric Plummer
Computer Science Department, University of Wyoming
March 13, 2014
Topics
• Thesis Goals
• Time Series Forecasting
• Neural Networks
• K-Nearest-Neighbor
• Test-Bed Application
• Empirical Evaluation
• Data Preprocessing
• Contributions
• Future Work
• Conclusion
• Demonstration
Thesis Goals
• Compare neural networks and k-nearest-neighbor for time series forecasting
• Analyze the response of various configurations to data series with specific characteristics
• Identify when neural networks and k-nearest-neighbor are inadequate
• Evaluate the effectiveness of data preprocessing
Time Series Forecasting – Description
• What is it?
  • Given an existing data series, observe or model the series to make accurate forecasts
• Example data series:
  • Financial (e.g., stocks, rates)
  • Physically observed (e.g., weather, sunspots)
  • Mathematical (e.g., the Fibonacci sequence)
Time Series Forecasting – Difficulties
• Why is it difficult?
• Limited quantity of data
  • Observed data series are sometimes too short to partition
• Noise
  • Erroneous data points and obscuring components
  • Remedy: moving-average smoothing
• Nonstationarity
  • Fundamentals change over time
  • Nonstationary mean: an “ascending” data series
  • Remedy: first-difference preprocessing
• Forecasting method selection
  • Statistics
  • Artificial intelligence
Time Series Forecasting – Importance
• Why is it important?
  • Preventing undesirable events by forecasting the event, identifying the circumstances preceding it, and taking corrective action so it can be avoided (e.g., an inflationary economic period)
  • Forecasting undesirable, yet unavoidable, events to preemptively lessen their impact (e.g., the solar maximum in the sunspot cycle)
  • Profiting from forecasting (e.g., financial markets)
Neural Networks – Background
• Loosely based on the neuron structure of the human brain
• Timeline:
  • 1940s – McCulloch and Pitts – proposed neuron models in the form of binary threshold devices and stochastic algorithms
  • 1950s & 1960s – Rosenblatt – a class of learning machines called perceptrons
  • Late 1960s – Minsky and Papert – discouraging analysis of perceptrons (limited to linearly separable classes)
  • 1980s – Rumelhart, Hinton, and Williams – generalized delta rule for learning by back-propagation, used to train multilayer perceptrons
  • Present – many new training algorithms and architectures, but nothing “revolutionary”
Neural Networks – Architecture
• A feed-forward neural network can have any number of:
  • Layers
  • Units per layer
  • Network inputs
  • Network outputs
• Hidden layers (A, B)
• Output layer (C)
Neural Networks – Units
• A unit has connections, weights, a bias, and an activation function
• Weights and bias are randomly initialized before training
• A unit’s input is the sum of the products of each connection value and its associated weight, plus the bias
• That input is fed into the unit’s activation function, and the unit’s output is the activation function’s output
• Hidden layers use the sigmoid activation function; the output layer is linear
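The unit computation described on this slide can be sketched as follows. The sigmoid-hidden/linear-output split is from the slide; the specific weight, bias, and input values are illustrative only:

```python
import math

def unit_output(inputs, weights, bias, activation):
    """One unit: sum of products of connection values and weights,
    plus the bias, fed through the activation function."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(net)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    return x

# A hidden-layer unit (sigmoid) feeding an output-layer unit (linear);
# all numeric values here are made up for illustration.
h = unit_output([0.5, -1.0], [0.4, 0.6], 0.1, sigmoid)
y = unit_output([h], [2.0], -0.5, linear)
```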
Neural Networks – Training
• Partition the data series into:
  • Training set
  • Validation set (optional)
  • Test set (optional)
• Typically, the training procedure is:
  • Perform backpropagation training with the training set
  • After every n epochs, compute the total squared error on the training set and the validation set
  • If the validation error consistently rises while the training error falls, stop training
• Overfitting: the training set has been learned too well
• Generalization: given inputs not in the training and validation sets, the network can still forecast accurately
Neural Networks – Training
• Backpropagation training:
  • First, examples in the form of <input, output> pairs are extracted from the data series
  • Then, the network is trained with backpropagation on the examples:
    1. Present an example’s input vector to the network inputs and run the network sequentially forward
    2. Propagate the error sequentially backward from the output layer
    3. For every connection, change the weight modifying that connection in proportion to the error
  • When all three steps have been performed for all examples, one epoch has occurred
  • The goal is to converge to a near-optimal solution based on the total squared error
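The three steps above can be sketched as a minimal trainer. This is an illustration of the procedure, not FORECASTER's implementation: the layer sizes, random seed, and learning rate are assumptions, and the network has one sigmoid hidden layer and one linear output unit as described on the earlier slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """Minimal feed-forward net: one sigmoid hidden layer, one linear
    output unit, trained by per-example backpropagation."""

    def __init__(self, n_in, n_hid, seed=1):
        rnd = random.Random(seed)
        # Weights and biases are randomly initialized before training
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hid)]
        self.b1 = [0.0] * n_hid
        self.w2 = [rnd.uniform(-0.5, 0.5) for _ in range(n_hid)]
        self.b2 = 0.0

    def forward(self, x):
        # Step 1: run the network sequentially forward
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, self.h)) + self.b2

    def train_epoch(self, examples, lr=0.2):
        """One epoch over all <input, output> examples.
        Returns the total squared error for the epoch."""
        total = 0.0
        for x, target in examples:
            y = self.forward(x)
            err = y - target          # Step 2: error at the linear output
            total += err * err
            for j, hj in enumerate(self.h):
                # Step 3: weight changes proportional to the error
                # propagated back through the sigmoid (derivative h(1-h))
                dh = err * self.w2[j] * hj * (1.0 - hj)
                for i, xi in enumerate(x):
                    self.w1[j][i] -= lr * dh * xi
                self.b1[j] -= lr * dh
                self.w2[j] -= lr * err * hj
            self.b2 -= lr * err
        return total
```

On a smooth toy target, repeating `train_epoch` should drive the total squared error down, matching the convergence goal on the slide.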
Neural Networks – Training
[figure: the backpropagation training cycle]
Neural Networks – Forecasting
• The forecasting method depends on the examples, and the examples depend on the step-ahead size
• If the step-ahead size is one: iterative forecasting
• If the step-ahead size is greater than one: direct forecasting
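Example extraction as a function of step-ahead size can be sketched like this (the function name and the sample series are illustrative):

```python
def make_examples(series, window, step_ahead=1):
    """Slide a window over the series to extract <input, output> pairs.
    step_ahead=1 yields one-value outputs for iterative forecasting;
    step_ahead=n yields n-value outputs for direct forecasting."""
    examples = []
    for i in range(len(series) - window - step_ahead + 1):
        x = series[i:i + window]                          # network inputs
        y = series[i + window:i + window + step_ahead]    # target outputs
        examples.append((x, y))
    return examples
```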
Neural Networks – Forecasting
• Iterative forecasting [figure] – can be continued indefinitely
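The iterative scheme feeds each one-step forecast back in as an input, which is why it can continue indefinitely. A sketch (the toy model in the usage line is a stand-in for a trained network):

```python
def iterative_forecast(model, history, n_steps, window):
    """Apply a one-step-ahead model iteratively: each forecast is
    appended to the series and becomes part of the next input window."""
    series = list(history)
    forecasts = []
    for _ in range(n_steps):
        y = model(series[-window:])   # forecast one step ahead
        forecasts.append(y)
        series.append(y)              # feed the forecast back in
    return forecasts

# Illustration with a trivial stand-in "model": predict last value + 1.
print(iterative_forecast(lambda w: w[-1] + 1, [1, 2, 3], 3, 2))  # [4, 5, 6]
```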
Neural Networks – Forecasting
• Directly forecasting n steps [figure] – this is the only forecast
K-Nearest-Neighbor – Forecasting
• No model to train
• Simple linear search
• Compare the reference window to candidate windows
• Select the k candidates with the lowest error
• The forecast is the average of the k candidates’ next values
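The linear search above can be sketched directly; squared error is assumed as the window-comparison measure, since the slide does not name one:

```python
def knn_forecast(series, window, k):
    """Compare the most recent `window` values (the reference) against
    every earlier window (the candidates), keep the k candidates with
    the lowest squared error, and average their next values."""
    ref = series[-window:]
    scored = []
    # A candidate window must be followed by a known next value.
    for start in range(len(series) - window):
        cand = series[start:start + window]
        err = sum((r - c) ** 2 for r, c in zip(ref, cand))
        scored.append((err, series[start + window]))
    scored.sort(key=lambda t: t[0])
    nexts = [nxt for _, nxt in scored[:k]]
    return sum(nexts) / len(nexts)
```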
Test-Bed Application – FORECASTER
• Written in Visual C++ with MFC
• Object-oriented
• Multithreaded
• Wizard-based
• Easily modified
• Implements feed-forward neural networks & k-nearest-neighbor
• Used for time series forecasting
• Eventually will be upgraded for classification problems
Empirical Evaluation – Data Series
[charts: Less Noisy, Original, More Noisy, Ascending, Sunspots]
Empirical Evaluation – Neural Network Architectures
• The number of network inputs is based on the data series: enough inputs are needed to make examples unambiguous
• For the “sawtooth” series:
  • 24 inputs are necessary
  • Test networks with 25 & 35 inputs
  • Test networks with 1 hidden layer of 2, 10, & 20 units
  • One output layer unit
• For sunspots:
  • 30 inputs
  • 1 hidden layer with 30 units
• For real-world data series, selection may be trial-and-error!
Empirical Evaluation – Neural Network Training
• Heuristic method:
  • Start with an aggressive learning rate
  • Gradually lower the learning rate as validation error increases
  • Stop training when the learning rate cannot be lowered any more
• Simple method:
  • Use a conservative learning rate
  • Training stops when the number of training epochs reaches the epoch limit, or the training error falls to or below the error limit
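The heuristic method can be sketched as a learning-rate schedule. The thesis does not give its constants, so the initial rate, decay factor, floor, and check interval below are all assumptions; the trainer and validation-error routine are passed in as callables:

```python
def heuristic_train(train_one_epoch, validation_error,
                    lr=0.5, lr_floor=1e-3, decay=0.5, check_every=10):
    """Start with an aggressive learning rate, lower it whenever the
    validation error rises, and stop once it cannot be lowered below
    the floor.  Returns the number of epochs run."""
    best = float('inf')
    epochs = 0
    while lr >= lr_floor:
        for _ in range(check_every):
            train_one_epoch(lr)      # one epoch of backpropagation
            epochs += 1
        err = validation_error()     # total squared validation error
        if err > best:
            lr *= decay              # validation error rose: back off
        else:
            best = err
    return epochs
```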
Empirical Evaluation – Neural Network Forecasting
• Metric to compare forecasts: coefficient of determination
  • Value lies in (-∞, 1]
  • Want a value between 0 and 1, where 0 is equivalent to forecasting the mean of the data series and 1 is forecasting the actual values exactly
• Must have actual values to compare with forecasted values:
  • For networks trained on the original, less noisy, and more noisy data series, the forecast is compared to the original series
  • For networks trained on the ascending data series, the forecast is compared to a continuation of the ascending series
  • For networks trained on the sunspots data series, the forecast is compared to the test set
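The metric described above is the standard R² computed against the mean of the actual series:

```python
def coefficient_of_determination(actual, forecast):
    """R^2 relative to forecasting the mean of the actual series:
    1 is a perfect forecast, 0 is no better than the mean, and
    negative values are worse than the mean; the range is (-inf, 1]."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```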
Empirical Evaluation – K-Nearest-Neighbor
• Choosing the window size is analogous to choosing the number of neural network inputs
• For the sawtooth data series:
  • k = 2
  • Test window sizes of 20, 24, and 30
• For the sunspots data series:
  • k = 3
  • Window size of 10
• Compare forecasts via the coefficient of determination
Empirical Evaluation – Candidate Selection
• Neural networks:
  • For each training method, data series, and architecture, 3 candidates were trained
  • The average of the 3 candidates’ forecasts was also taken: forecasting by committee
  • The best forecast was selected based on the coefficient of determination
• K-nearest-neighbor:
  • For each data series, k, and window size, only one search was performed (only one is needed)
Empirical Evaluation – Original Data Series [charts: Heuristic NN, Simple NN, Smaller NN, K-N-N]
Empirical Evaluation – Less Noisy Data Series [charts: Heuristic NN, Simple NN, K-N-N]
Empirical Evaluation – More Noisy Data Series [charts: Heuristic NN, Simple NN, K-N-N]
Empirical Evaluation – Ascending Data Series [charts: Heuristic NN, Simple NN]
Empirical Evaluation – Longer Forecast [chart: Heuristic NN]
Empirical Evaluation – Sunspots Data Series [chart: Simple NN & K-N-N]
Empirical Evaluation – Discussion
• Heuristic training method observations:
  • Networks train longer (more epochs) on smoother data series like the original and ascending series
  • The total squared error and unscaled error are higher for noisy data series
  • Neither the number of epochs nor the errors appear to correlate well with the coefficient of determination
  • In most cases, the committee forecast is worse than the best candidate’s forecast
  • When actual values are unavailable, choosing the best candidate is difficult!
Empirical Evaluation – Discussion
• Simple training method observations:
  • The total squared error and unscaled error are higher for noisy data series, with the exception of the 35:10:1 network trained on the more noisy data series
  • The errors do not appear to correlate well with the coefficient of determination
  • In most cases, the committee forecast is worse than the best candidate’s forecast
  • Four networks have a negative coefficient of determination, compared with two for the heuristic training method
Empirical Evaluation – Discussion
• General observations:
  • Neither training method was clearly better
  • Increasingly noisy data series increasingly degraded forecasting performance
  • Nonstationarity in the mean degraded performance
  • Networks with too few hidden units (e.g., 35:2:1) forecasted well on simpler data series but failed on more complex ones
  • An excessive number of hidden units (e.g., 35:20:1) did not hurt performance
  • Twenty-five network inputs were not sufficient
  • K-nearest-neighbor was consistently better than the neural networks
  • Feed-forward neural networks are extremely sensitive to architecture and parameter choices, and making such choices is currently more art than science, more trial-and-error than exact method, more practice than theory!
Data Preprocessing
• First-difference:
  • For the ascending data series, a neural network trained on the first-difference series can forecast nearly perfectly
  • In that case, it is better to train and forecast on the first difference
  • FORECASTER reconstitutes the forecast from its first difference
• Moving average:
  • For noisy data series, a moving average would eliminate much of the noise
  • But it would also smooth out peaks and valleys
  • The series may then be easier to learn and forecast
  • But in some series, the “noise” may be important data (e.g., utility load forecasting)
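Both preprocessing steps, and the reconstitution of a forecast from its first difference, can be sketched as follows (function names are illustrative, not FORECASTER's API):

```python
def first_difference(series):
    """Replace each value with its change from the previous value,
    removing a nonstationary (e.g., linearly ascending) mean."""
    return [b - a for a, b in zip(series, series[1:])]

def reconstitute(last_value, diff_forecast):
    """Rebuild a forecast in the original scale by cumulatively
    adding forecasted differences to the last observed value."""
    out = []
    current = last_value
    for d in diff_forecast:
        current += d
        out.append(current)
    return out

def moving_average(series, n):
    """n-point moving average: suppresses noise, but also smooths
    out genuine peaks and valleys."""
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]
```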
Contributions
• Filled a void in the literature on feed-forward neural network time series forecasting: how networks respond to various data series characteristics in a controlled environment
• Showed that k-nearest-neighbor is a better forecasting method for the data series used in this research
• Reaffirmed that neural networks are very sensitive to architecture, parameter, and learning method changes
• Presented some insight into neural network architecture selection: choosing the number of network inputs based on the data series
• Presented a neural network training heuristic that produced good results
Future Work
• Upgrade FORECASTER to work with classification problems
• Add more complex network types, including wavelet networks for time series forecasting
• Investigate k-nearest-neighbor further
• Add other forecasting methods (e.g., decision trees for classification)
Conclusion
• Presented:
  • Time series forecasting
  • Neural networks
  • K-nearest-neighbor
  • Empirical evaluation
• Learned a lot about the implementation details of the forecasting techniques
• Learned a lot about MFC programming
Thank You
Demonstration
Various files can be found at: http://w3.uwyo.edu/~eplummer
Related Work
• Drossu and Obradovic (1996): a hybrid stochastic and neural network approach to time series forecasting
• Zhang and Thearling (1994): parallel implementations of neural networks and memory-based reasoning
• Geva (1998): a multiscale fast wavelet transform and an array of feed-forward neural networks
• Lawrence, Tsoi, and Giles (1996): encodes the series with a self-organizing map and uses recurrent neural networks
• Kingdon (1997): an automated intelligent system for financial forecasting using neural networks and genetic algorithms