Comparative Study: Data Mining and Machine Learning for Weather Forecasting

Application of Data Mining and Machine Learning for Weather Forecasting: A Comparative Study • Nasimul Hasan (C121046) • Nayan Chandra Nath (C121038) • Department of CSE, International Islamic University Chittagong

Outline • Introduction • Motivation and Goal • Methodology • Experiment Design • Result analysis • Conclusion

Introduction • Weather: • Has great significance for our agriculture. • Is a deterministically chaotic system • Lack of proper data • Continuous change of climate

Problem • The main challenge is to predict the weather as accurately as possible • Much prior work exists • Change of seasons

Previous Work • A. Mellit, A. Massi Pavan & M. Benghanem developed an SVM model that can produce up to 99% accurate predictions for different models. • Hall and Tony proposed a neural network model using input from the Eta model and upper air soundings to forecast the probability of precipitation (PoP) and the quantitative precipitation forecast (QPF) for the Dallas-Fort Worth, Texas area. Over 70% of their model's PoP forecasts were less than 5% or greater than 95%.

Motivation & Goal • Motivation: • SVR and ANN are powerful machine learning techniques for pattern recognition • Using different kinds of windowing functions as a data preprocessing step is a new idea • Combining a windowing function with support vector regression can make a good model for time series prediction. • Goal: • Propose a good machine learning model to predict rainfall and temperature.

Methodology • Support Vector Regression • Support vector machine (SVM) is a novel artificial intelligence-based method developed from statistical learning theory • SVM has two major uses: classification (SVC) & regression (SVR). • In SVM regression, the input is first mapped onto an m-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space. • A margin of tolerance (epsilon) is set in the approximation. • This type of function is often called the epsilon-insensitive loss function. • Slack variables are used to handle noise in the data and nonseparability
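As a minimal sketch of the epsilon-insensitive loss described above: errors inside the tolerance band cost nothing, and errors outside it grow linearly. The function name and sample values are illustrative assumptions, not taken from the slides.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return max(0, |y_true - y_pred| - epsilon) for a single prediction."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Illustrative values only (not from the experiment):
# a residual of 0.05 lies inside the epsilon band, so it is not penalized;
# a residual of 0.5 lies outside the band and is penalized by the excess.
inside = epsilon_insensitive_loss(1.00, 1.05, epsilon=0.1)
outside = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

The slack variables in the SVR formulation measure exactly this excess for each training point.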

Methodology (cont..) The regression problem of SVM can be expressed as the following optimization problem. Minimize: Subject to:
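For reference, the standard primal form of epsilon-insensitive SVR, as commonly given in the literature, is sketched below. The symbols $w$, $b$, $\phi$, $C$, $\xi_i$, and $\xi_i^*$ are conventional notation and are assumed here rather than taken from the slides.

```latex
\begin{aligned}
\text{Minimize:}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{Subject to:}\quad & y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \\
& \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \dots, n
\end{aligned}
```

Here $\phi$ is the fixed nonlinear mapping into the feature space, $\varepsilon$ is the margin of tolerance, and $C$ trades off model flatness against the total slack.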

Methodology (cont..) Artificial Neural Network Neural networks have their origins in attempts to find mathematical representations of information processing in biological systems [31]. The term has since been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims about their biological plausibility. From the viewpoint of pattern recognition applications, however, biological realism would impose entirely unnecessary constraints.

The ANN Network

Methodology (cont..) • Parameters: • Horizon (h) • Window size • Step size • Training window width • Testing window width • Windowing operator: • Transforms the time series data into a generic data set • Converts the last row of a window within the time series into a label or target variable • Feeds the cross-sectional values as inputs to a machine learning technique such as linear regression, neural network, support vector machine, and so on.
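The windowing step can be sketched as follows. This is one common interpretation, assuming the label is the value `horizon` steps after the end of each window; the function name and the sample series are hypothetical and not taken from the experiment.

```python
def window_series(series, window_size, horizon=1, step_size=1):
    """Turn a 1-D time series into (inputs, label) examples.

    Each example uses `window_size` consecutive values as inputs and the
    value `horizon` steps after the end of the window as the label.
    Windows start every `step_size` positions.
    """
    examples = []
    last_start = len(series) - window_size - horizon
    for start in range(0, last_start + 1, step_size):
        inputs = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        examples.append((inputs, label))
    return examples


# Made-up monthly values, for illustration only.
monthly_rainfall = [10, 20, 30, 40, 50, 60]
examples = window_series(monthly_rainfall, window_size=3, horizon=1)
# examples[0] is ([10, 20, 30], 40): three past values predict the next one.
```

Each resulting row is an ordinary cross-sectional example, so it can be passed to any regression learner.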

Methodology (cont..) Moving Average:
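A simple (unweighted) trailing moving average replaces each point with the mean of the most recent values in a fixed-width window. The sketch below assumes that variant and a hypothetical width of 3; the slides do not state which variant or width was used.

```python
def moving_average(values, width):
    """Simple trailing moving average.

    Returns len(values) - width + 1 averages, one per full window.
    """
    if width <= 0 or width > len(values):
        raise ValueError("width must be between 1 and len(values)")
    window_sum = sum(values[:width])
    averages = [window_sum / width]
    for i in range(width, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        window_sum += values[i] - values[i - width]
        averages.append(window_sum / width)
    return averages


# Made-up values, for illustration only.
smoothed = moving_average([2.0, 4.0, 6.0, 8.0, 10.0], width=3)
```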

Experiment Design • Data • The experiment dataset was collected from the Meteorological Department, Bangladesh. • Seven years of historical data (2008-2014) for Chittagong were collected. • Six attributes (date, total, avg, max, min, MA) were used in the experiment.

Experiment Design • Data Preprocessing • Prepared for ML using • Missing value replacement • 80% for training and 20% for testing
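The two preprocessing steps can be sketched as below. The slides do not say how missing values were replaced or how the 80/20 split was drawn, so mean replacement and a chronological (unshuffled) split are assumptions made here for illustration; the function names and values are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def chronological_split(rows, train_fraction=0.8):
    """Split rows in time order: the first part trains, the rest tests."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Made-up temperature readings with two gaps, for illustration only.
temperatures = [25.0, None, 27.0, 26.0, None, 30.0, 28.0, 29.0, 24.0, 27.0]
filled = fill_missing_with_mean(temperatures)
train, test = chronological_split(filled, train_fraction=0.8)
```

Keeping the split in time order avoids training on observations that come after the test period.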

Experiment Design Rectifier

Experiment Flowchart Training Test

Experiment Result Result evaluation technique: Here, $y_t$ = original value of a point for a given time period t, n = the total number of fitted points, $\hat{y}_t$ = the fitted forecast value for the time period t

Correlation between features using Pearson Correlation matrix
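A Pearson correlation matrix between feature columns can be computed as in the following sketch. The column names echo the attribute names listed earlier, but the values are made up for illustration and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up feature columns, for illustration only.
features = {
    "max": [30.0, 32.0, 34.0, 36.0],
    "min": [20.0, 21.0, 22.0, 23.0],
    "total": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(features)
```

Values near +1 or -1 indicate strongly linearly related features; values near 0 indicate little linear relationship.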

Here, $x_i$ = the actual observations time series, $\hat{x}_i$ = the estimated or forecasted time series, SAE = the sum of the absolute errors (or deviations), N = the number of non-missing data points.
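The definitions of SAE and N above are consistent with the mean absolute error, MAE = SAE / N. The sketch below assumes that reading. The root mean squared error is included only as a commonly paired metric and is an assumption, not something the slides confirm; function names and sample values are illustrative.

```python
import math


def _non_missing_pairs(actual, forecast):
    """Keep only positions where both the actual and forecast values exist."""
    return [(a, f) for a, f in zip(actual, forecast)
            if a is not None and f is not None]


def mean_absolute_error(actual, forecast):
    """MAE = SAE / N, with N counting non-missing data points only."""
    pairs = _non_missing_pairs(actual, forecast)
    sae = sum(abs(a - f) for a, f in pairs)
    return sae / len(pairs)


def root_mean_squared_error(actual, forecast):
    """RMSE = sqrt(mean of squared errors) over non-missing data points."""
    pairs = _non_missing_pairs(actual, forecast)
    return math.sqrt(sum((a - f) ** 2 for a, f in pairs) / len(pairs))


# Made-up values with one missing observation, for illustration only.
actual = [10.0, 12.0, None, 14.0]
forecast = [11.0, 10.0, 13.0, 17.0]
mae = mean_absolute_error(actual, forecast)
rmse = root_mean_squared_error(actual, forecast)
```

With these sample values the absolute errors are 1, 2, and 3 over three usable points, so the MAE is 2.0.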

• SVM produced its best result with almost 98.65% accuracy for rainfall and 95% for temperature prediction. • ANN produced its best result with almost 97.45% accuracy for rainfall and 96.7% for temperature prediction.

Results for different models using SVR

ANN Monthly Rainfall Horizon 1

ANN Monthly Temperature Horizon 1

SVM Monthly Rainfall Horizon 1

SVM Monthly Temperature Horizon 1

Conclusion • Discussion: • Different windowing functions can produce different prediction results. • Limitations & Future work: • Used only Moving Average and windowing operators. • Data from only one station was used in the experiments. • Did not compare with other machine learning techniques. • In future, we will apply our model to other rainfall data sets and will also compare our research results with other types of data mining techniques.

Thank You
