The Detection of Outliers Mr. Faiz Alsuhail, Statistics Finland faiz.alsuhail@stat.fi
Outline of the presentation
• The setup of the problem
• Description of the method
• Some results with Finnish turnover data
• Discussion
1.1.2020
Setup
A large number of time series are collected to produce turnover indices. At the moment, data validation requires a large amount of resources, both labour and time. Good data quality is vital for producing good indices, so data validation is an important step in index production. The goal is to make validation less time-consuming while keeping it accurate.
About the method
A long history of observations is available for each company that reports its turnover to Statistics Finland. Hence, we can model each time series in order to forecast its future values. If a company reports a figure that differs significantly from the forecast, we may suspect that the observation is an outlier.
The first step
• Create a time series model for each time series and forecast the next value.
• Compare each observation with its forecast by using a t-test.
• Rank the observations according to the p-values of their t-tests.
• 1 minus the p-value indicates how strongly the observation is suspected of being an outlier.
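A minimal Python sketch of these steps, under assumptions that are not in the presentation: the time series model is a plain linear trend fitted by least squares, the t-test uses the standard one-step prediction error of that fit with n - 2 degrees of freedom, and the company names and figures are invented. The p-value is computed by numerical integration so that the sketch needs only the standard library; in practice a library routine such as `scipy.stats.t.sf` would normally replace it.

```python
import math

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value of Student's t, by Simpson integration of the density."""
    a = abs(t_stat)
    if a == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    pdf = lambda x: math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = 2 * total * h / 3  # P(-a < T < a)
    return min(1.0, max(0.0, 1.0 - central))

def forecast_and_test(history, observed):
    """Fit a linear trend to `history`, forecast one step ahead, test `observed`.

    Assumes at least 3 points and a fit that is not exact (residual variance > 0).
    Returns (forecast, p_value).
    """
    n = len(history)
    xs = range(n)
    x_bar = (n - 1) / 2
    y_bar = sum(history) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, history))
    s2 = sse / (n - 2)                       # residual variance
    forecast = intercept + slope * n         # next period has index n
    se = math.sqrt(s2 * (1 + 1 / n + (n - x_bar) ** 2 / sxx))
    t_stat = (observed - forecast) / se
    return forecast, t_two_sided_p(t_stat, n - 2)

# Hypothetical turnover histories and newly reported figures (not real data).
reports = {
    "A": ([100, 102, 101, 104, 105, 107, 106, 109], 111),
    "B": ([50, 51, 53, 52, 54, 55, 57, 56], 80),
}
results = {name: forecast_and_test(hist, obs)
           for name, (hist, obs) in reports.items()}

# Rank from most to least suspicious: smallest p-value first.
ranking = sorted(results, key=lambda name: results[name][1])
for name in ranking:
    forecast, p = results[name]
    print(f"{name}: forecast={forecast:.1f}, p={p:.4f}, 1-p={1 - p:.4f}")
```

Company B, whose reported figure lies far above its trend, is ranked ahead of company A. A seasonal model would be a more realistic choice for monthly turnover; the linear trend is used here only to keep the example short.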
Then what?
• If an observation is suspicious, one may wish to contact the company that reported the figure.
• However, there is a huge number of companies. It would therefore be desirable to know which observations are (potentially) more harmful to the index.
• Hence the t-test alone is not enough.
The potential harm
• We multiply the forecast error by (1 - p-value of the t-test) and by the company's share in the aggregate index.
• Multiplying the forecast error by (1 - p-value of the t-test) gives the potential error of one observation.
• Multiplying this by the company's share of the total turnover gives the observation's potential harm to Finland's turnover index.
• Now one can rank the observations by their potential harm.
In terms of mathematical formulas
• What we want to compute is: (forecast error) * (1 - p-value of the t-test) * (share in the aggregate turnover), where:
• (forecast error) requires the use of a time series model
• (1 - p-value of the t-test) is obtained from a t-test
• (share in the aggregate turnover) is straightforward to calculate.
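The product can be sketched in a few lines of Python. The figures below are invented, the function name `potential_harm` is hypothetical, and two details are assumptions rather than something the slides state: the forecast error is taken in absolute value so that over- and under-reporting rank alike, and it is expressed in percent.

```python
def potential_harm(forecast_error, p_value, share):
    """(forecast error) * (1 - p-value of the t-test) * (share in aggregate turnover).

    The absolute value of the error is an assumption made for ranking purposes.
    """
    return abs(forecast_error) * (1.0 - p_value) * share

# Hypothetical inputs: (company, forecast error in %, t-test p-value, turnover share).
reports = [
    ("A", 1.3, 0.33, 0.40),
    ("B", 22.0, 0.0001, 0.02),
    ("C", 6.0, 0.01, 0.25),
]

# Most harmful observation first.
ranked = sorted(reports, key=lambda r: potential_harm(*r[1:]), reverse=True)
for company, error, p_value, share in ranked:
    print(company, round(potential_harm(error, p_value, share), 4))
```

With these numbers company C comes first even though company B has the largest error and the smallest p-value, because B's share of the aggregate turnover is small. This is the reason the t-test alone does not decide which company to contact.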
Results: Manufacture of food products and beverages, potential error (%).
Results: Manufacture of fabricated metal products, except machinery and equipment, potential error (%).
Benefits
• The method can rank observations according to their suspiciousness. This tells the statistician which observations to check first when time is limited.
• The method is computationally quite simple and can be integrated into the production system for indices.
• The time series approach can take the seasonal nature of the series into account.
Challenges
• Special expertise is needed to build the time series models and to run their diagnostic tests.
• Time series models perform poorly if the future does not behave like the past.
• The method only tells which observations are potentially harmful; it does not identify the outliers themselves.
• The statistician must still use his or her own judgment to tell whether an observation is suspicious or not.
More challenges
• There should be enough time and resources to model the time series.
• The models must be updated on a regular basis, at least once a year.
• If one wants to go through all the companies, the time series must be modelled automatically, for example with the help of the BIC criterion.
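One way automatic modelling with the BIC could look is sketched below in Python. Everything specific here is an assumption: the candidate models are autoregressions AR(0) to AR(4) with an intercept, fitted by least squares, the BIC is taken in the common form n*ln(SSE/n) + k*ln(n), and the series is simulated. The presentation does not say which model family was used.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ar_bic(series, order, max_order):
    """BIC of an AR(order) model with intercept, fitted by least squares.

    Every order is fitted on the same targets (index max_order onwards),
    so that the BIC values of different orders are comparable.
    """
    rows = [[1.0] + [series[t - k] for k in range(1, order + 1)]
            for t in range(max_order, len(series))]
    ys = series[max_order:]
    k = order + 1  # number of estimated coefficients
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, y in zip(rows, ys))
    n = len(ys)
    return n * math.log(sse / n) + k * math.log(n)

def select_order(series, max_order=4):
    """Return the AR order with the smallest BIC."""
    return min(range(max_order + 1), key=lambda p: ar_bic(series, p, max_order))

# Simulated (not real) series with strong first-order autocorrelation.
random.seed(1)
series = [0.0]
for _ in range(199):
    series.append(0.9 * series[-1] + random.gauss(0, 1))

best = select_order(series)
print("selected AR order:", best)
```

Running `select_order` over every company's series removes the need to choose each model by hand, although diagnostic checks and the regular model updates mentioned above would still be needed. For seasonal turnover data the candidate set would also have to include seasonal lags.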
Questions to discuss
• How do you tackle the problem of outlier detection in your statistical offices?
• Do you believe that modelling a time series with a linear model is a good starting point for data validation?
• If yes, should one model the time series at the two-, three- or four-digit level?
• Are there more challenges or benefits that were not listed in this presentation?
Thank you for your attention