
HY436: Mobile Computing and Wireless Networks Data sanitization

HY436: Mobile Computing and Wireless Networks Data sanitization. Tutorial: November 7, 2005 Elias Raftopoulos Ploumidis Manolis Prof. Maria Papadopouli Assistant Professor Department of Computer Science


Presentation Transcript


  1. HY436: Mobile Computing and Wireless Networks Data sanitization. Tutorial: November 7, 2005 Elias Raftopoulos Ploumidis Manolis Prof. Maria Papadopouli Assistant Professor Department of Computer Science University of North Carolina at Chapel Hill

  2. Data Analysis • Discovery of Missing Values • Data treatment • Outliers Detection • Outliers Removal [Optional] • Data Normalization [Optional] • Statistical Analysis

  3. Why Data Preprocessing? • Data in the real world is dirty • incomplete • noisy • inconsistent • No quality data, no quality statistical processing • Quality decisions must be based on quality data

  4. Data Cleaning Tasks • Handle missing values, due to • Sensor malfunction • Random disturbances • Network Protocol [e.g., UDP] • Identify outliers, smooth out noisy data

  5. Recover Missing Values Linear Interpolation
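The slide's figure did not survive in the transcript, so the following is only a minimal sketch of linear interpolation, not the tutorial's own code. It assumes equally spaced samples with missing readings marked as `None`; the function name `interpolate_missing` is hypothetical.

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation between known neighbours.

    Assumption: samples are equally spaced in time. Leading and trailing gaps
    have a known neighbour on one side only, so they are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of known samples and fill the gap between them.
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            step = (values[right] - values[left]) / gap
            for k in range(1, gap):
                filled[left + k] = values[left] + step * k
    return filled


series = [10.0, None, None, 16.0, 18.0, None]
print(interpolate_missing(series))  # [10.0, 12.0, 14.0, 16.0, 18.0, None]
```

The trailing gap stays unfilled on purpose: extending the last segment would be extrapolation, not interpolation.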

  6. Recover Missing Values Moving Average • A simple moving average is the unweighted mean of the previous n data points in the time series • A weighted moving average is a weighted mean of the previous n data points in the time series • A weighted moving average is more responsive to recent movements than a simple moving average • An exponentially weighted moving average (EWMA or just EMA) is an exponentially weighted mean of previous data points • The parameter of an EWMA can be expressed as a proportional percentage - for example, in a 10% EWMA, each time period is assigned a weight that is 90% of the weight assigned to the next (more recent) time period
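The three averages described on this slide can be sketched as follows. This is an illustrative sketch, not the tutorial's code: the function names are hypothetical, the weighted average assumes linear weights 1..n (the slide does not fix a weighting scheme), and the EWMA is assumed to be seeded with the first observation.

```python
from typing import List, Sequence


def simple_moving_average(x: Sequence[float], n: int) -> float:
    """Unweighted mean of the previous n data points."""
    if n <= 0:
        raise ValueError("n must be positive")
    window = x[-n:]
    return sum(window) / len(window)


def weighted_moving_average(x: Sequence[float], n: int) -> float:
    """Weighted mean of the previous n points.

    Assumption: linear weights 1..n, with the most recent point weighted n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    window = x[-n:]
    weights = range(1, len(window) + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


def ewma(x: Sequence[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average: s_t = alpha*x_t + (1 - alpha)*s_{t-1}.

    Assumption: the recursion is seeded with the first observation.
    With alpha = 0.10, each older point carries 90% of the weight of the next
    more recent one, matching the "10% EWMA" example on the slide.
    """
    s = x[0]
    out = [s]
    for v in x[1:]:
        s = alpha * v + (1.0 - alpha) * s
        out.append(s)
    return out


history = [1.0, 2.0, 3.0, 4.0]
print(simple_moving_average(history, 2))
print(weighted_moving_average(history, 3))
print(ewma(history, 0.5))
```

To recover a missing sample at time t, any of these can be evaluated on the samples before t and the result used as the estimate.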

  7. Recover Missing Values Moving Average (cont’d) • Symmetric Linear Filters • Moving Average
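The formulas for this slide are missing from the transcript. As a sketch under the usual textbook definition (an assumption, not recovered from the slide), a symmetric linear filter computes a weighted sum over a window centred on t, with weights satisfying a_r = a_{-r}, and the centred moving average is the special case where every weight equals 1/(2q+1). The names `symmetric_filter` and `centered_moving_average` are hypothetical.

```python
from typing import List, Optional, Sequence


def symmetric_filter(x: Sequence[float], weights: Sequence[float]) -> List[Optional[float]]:
    """Apply a symmetric linear filter with weights a_{-q}, ..., a_0, ..., a_q.

    The first and last q positions have no full window and are returned as None.
    """
    if len(weights) % 2 == 0:
        raise ValueError("weights must have odd length 2q + 1")
    if any(abs(a - b) > 1e-12 for a, b in zip(weights, reversed(weights))):
        raise ValueError("weights must be symmetric: a_r == a_{-r}")
    q = len(weights) // 2
    out: List[Optional[float]] = [None] * len(x)
    for t in range(q, len(x) - q):
        out[t] = sum(a * x[t + r] for r, a in zip(range(-q, q + 1), weights))
    return out


def centered_moving_average(x: Sequence[float], q: int) -> List[Optional[float]]:
    """Moving average as a symmetric filter with equal weights 1/(2q + 1)."""
    m = 2 * q + 1
    return symmetric_filter(x, [1.0 / m] * m)


print(centered_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 1))
```

Unlike the averages over previous points only, a centred window uses samples on both sides of t, so it suits offline cleaning of a recorded trace rather than online estimation.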

  8. What are outliers in the data? • An outlier is an observation that lies an abnormal distance from other values in a random sample from a population • It is left to the analyst (or a consensus process) to decide what will be considered abnormal • Before abnormal observations can be singled out, it is necessary to characterize normal observations
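The slide leaves the definition of "abnormal" to the analyst. One possible rule, shown here purely as an illustration and not as the tutorial's method, characterizes normal observations by the median and the median absolute deviation (MAD) and flags points whose modified z-score exceeds 3.5. The threshold, the 0.6745 scaling constant, and the name `mad_outliers` are assumptions of this sketch.

```python
from statistics import median
from typing import List, Sequence


def mad_outliers(data: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return the indices of points whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD.
    Assumption: if MAD is zero (more than half the points are identical),
    no score can be computed and nothing is flagged.
    """
    center = median(data)
    mad = median([abs(v - center) for v in data])
    if mad == 0:
        return []
    return [i for i, v in enumerate(data)
            if 0.6745 * abs(v - center) / mad > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(mad_outliers(readings))  # [7]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier does not mask itself.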

  9. Outliers • An outlier is a data point that comes from a distribution different (in location, scale, or distributional form) from the bulk of the data • In the real world, outliers have a range of causes, some as simple as • operator blunders • equipment failures • day-to-day effects • batch-to-batch differences • anomalous input conditions • warm-up effects

  10. Scatter Plot: Outlier • Scatter plot here reveals • A basic linear relationship between X and Y for most of the data • A single outlier (at X = 375)

  11. Symmetric Histogram with Outlier • A symmetric distribution is one in which the 2 "halves" of the histogram appear as mirror-images of one another. • The above example is symmetric with the exception of outlying data near Y = 4.5

  12. Normalization • Normalization is a process of scaling the numbers in a data set to improve the accuracy of the subsequent numeric computations • Most statistical tests and intervals are based on the assumption of normality • This leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make the normality assumption • Most real data sets are in fact not approximately normal • An appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution • This increases the applicability and usefulness of statistical techniques based on the normality assumption.

  13. Box-Cox Transformation • The Box-Cox transformation is a particularly useful family of transformations
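The defining formula is missing from the transcript. The sketch below uses the standard one-parameter form, which is assumed here and not recovered from the slide: T(y) = (y^λ - 1)/λ for λ ≠ 0 and T(y) = ln(y) for λ = 0, defined for strictly positive data. The function name `box_cox` is hypothetical.

```python
import math
from typing import List, Sequence


def box_cox(y: Sequence[float], lam: float) -> List[float]:
    """One-parameter Box-Cox transform of strictly positive data.

    T(y) = (y**lam - 1) / lam   if lam != 0
    T(y) = ln(y)                if lam == 0
    """
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:  # treat lam as zero and use the logarithmic limit
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


print(box_cox([1.0, 4.0], 0.5))  # [0.0, 2.0]
```

Data containing zeros or negative values would have to be shifted to be positive first; that step is left out of this sketch.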

  14. Measuring Normality • Given a particular transformation such as the Box-Cox transformation defined above, it is helpful to define a measure of the normality of the resulting transformation • One measure is to compute the correlation coefficient of a normal probability plot • The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot (the more linear the probability plot, the better a normal distribution fits the data). • The Box-Cox normality plot is a plot of these correlation coefficients for various values of the parameter. The value of λ corresponding to the maximum correlation on the plot is then the optimal choice for λ
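The λ search described above can be sketched as follows. This is a rough illustration under stated assumptions, not the tutorial's procedure: the theoretical normal quantiles use plotting positions (i - 0.5)/n, which is one common choice among several, the Box-Cox formula is the standard one-parameter form, and all function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import List, Sequence


def box_cox(y: Sequence[float], lam: float) -> List[float]:
    """Standard one-parameter Box-Cox transform (positive data only)."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def normal_ppcc(data: Sequence[float]) -> float:
    """Correlation between the sorted data and theoretical normal quantiles.

    Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    """
    n = len(data)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return pearson(theoretical, sorted(data))


def best_lambda(data: Sequence[float], candidates: Sequence[float]) -> float:
    """Pick the candidate lambda whose transformed data looks most normal."""
    return max(candidates, key=lambda lam: normal_ppcc(box_cox(data, lam)))


# Synthetic right-skewed sample: exponentials of evenly spread normal quantiles.
sample = [math.exp(NormalDist().inv_cdf((i - 0.5) / 50)) for i in range(1, 51)]
print(best_lambda(sample, [-1.0, -0.5, 0.0, 0.5, 1.0]))
```

Plotting `normal_ppcc(box_cox(data, lam))` against `lam` over a fine grid gives the Box-Cox normality plot itself; the sketch only returns the location of its maximum.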

  15. Measuring Normality (cont’d) • The histogram in the upper left-hand corner shows a data set that has significant right skewness • And so does not follow a normal distribution • The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at λ = -0.3 • The histogram of the data after applying the Box-Cox transformation with λ = -0.3 shows a data set for which the normality assumption is reasonable • This is verified with a normal probability plot of the transformed data.

  16. Normal Probability Plot • The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed • The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality • The normal probability plot is a special case of the probability plot

  17. Normal Probability Plot (cont’d)

  18. CDF Plot • Plot of empirical cumulative distribution function
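The plot for this slide is not in the transcript. As a minimal sketch, the empirical CDF at x is the fraction of observations less than or equal to x; the function names `ecdf` and `ecdf_points` are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence, Tuple


def ecdf(data: Sequence[float]) -> Callable[[float], float]:
    """Return F, where F(x) is the fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x: float) -> float:
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


def ecdf_points(data: Sequence[float]) -> List[Tuple[float, float]]:
    """(x, F(x)) pairs at each observation, ready to draw as a step plot."""
    ordered = sorted(data)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


F = ecdf([3, 1, 2, 2])
print(F(2))  # 0.75
```

With tied observations `ecdf_points` emits several pairs at the same x; the highest of them is the value of the step function at that x.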
