HY436: Mobile Computing and Wireless Networks
Data Sanitization
Tutorial: November 7, 2005
Elias Raftopoulos, Manolis Ploumidis
Prof. Maria Papadopouli, Assistant Professor
Department of Computer Science, University of North Carolina at Chapel Hill
Data Analysis
• Discovery of Missing Values
• Data Treatment
• Outlier Detection
• Outlier Removal [Optional]
• Data Normalization [Optional]
• Statistical Analysis
Why Data Preprocessing?
• Data in the real world is dirty:
  • incomplete
  • noisy
  • inconsistent
• No quality data, no quality statistical processing
• Quality decisions must be based on quality data
Data Cleaning Tasks
• Handle missing values, due to:
  • sensor malfunction
  • random disturbances
  • lossy network protocols (e.g., UDP)
• Identify outliers, smooth out noisy data
Recover Missing Values: Moving Average
• A simple moving average is the unweighted mean of the previous n data points in the time series
• A weighted moving average is a weighted mean of the previous n data points in the time series
• A weighted moving average is more responsive to recent movements than a simple moving average
• An exponentially weighted moving average (EWMA, or just EMA) is an exponentially weighted mean of previous data points
• The parameter of an EWMA can be expressed as a proportional percentage: in a 10% EWMA, for example, each time period is assigned a weight that is 90% of the weight assigned to the next (more recent) time period
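As a rough sketch of how these estimators can be used to fill gaps (the trace values, window size, and smoothing parameter below are illustrative assumptions, not values from the tutorial), using pandas:

    # Filling missing samples with moving-average estimates (illustrative values).
    import numpy as np
    import pandas as pd

    # Hypothetical trace with missing samples marked as NaN.
    trace = pd.Series([12.0, 13.5, np.nan, 14.2, np.nan, 15.1, 14.8, np.nan, 16.0])

    # Simple moving average: trailing unweighted mean over up to n samples
    # (NaN entries inside the window are skipped).
    n = 3
    sma = trace.rolling(window=n, min_periods=1).mean()

    # Exponentially weighted moving average: with alpha = 0.1 (a "10% EWMA"),
    # each older sample carries 90% of the weight of the next, more recent one.
    ewma = trace.ewm(alpha=0.1, ignore_na=True).mean()

    # Replace only the missing positions with the corresponding estimates.
    filled_sma = trace.fillna(sma)
    filled_ewma = trace.fillna(ewma)
    print(filled_ewma)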
Recover Missing Values: Moving Average (cont’d)
• Symmetric Linear Filters
• Moving Average
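The slide names these smoothers but their formulas did not survive the export; as a reconstruction of the standard definitions (not copied from the original slide), a symmetric linear filter replaces each point by a weighted sum of its neighbours, and the symmetric moving average is the equal-weights special case:

    \[
      \mathrm{Sm}(x_t) \;=\; \sum_{r=-q}^{q} a_r\, x_{t+r},
      \qquad \sum_{r=-q}^{q} a_r = 1
    \]
    \[
      \text{moving average: } a_r = \frac{1}{2q+1}, \quad r = -q, \dots, q
    \]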
What are outliers in the data?
• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population
• It is left to the analyst (or a consensus process) to decide what will be considered abnormal
• Before abnormal observations can be singled out, it is necessary to characterize normal observations
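As a sketch of how "abnormal distance" is often made concrete (the 1.5×IQR rule and the sample values below are common conventions and illustrative data, not taken from the tutorial):

    import numpy as np

    def iqr_outliers(x, k=1.5):
        """Flag values lying more than k*IQR outside the quartiles."""
        x = np.asarray(x, dtype=float)
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return (x < q1 - k * iqr) | (x > q3 + k * iqr)

    # Hypothetical sample: most values near 10, one abnormally distant observation.
    sample = [9.8, 10.1, 10.4, 9.9, 10.2, 23.7, 10.0]
    print(iqr_outliers(sample))   # [False False False False False  True False]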
Outliers
• An outlier is a data point that comes from a distribution different (in location, scale, or distributional form) from the bulk of the data
• In the real world, outliers have a range of causes, including:
  • operator blunders
  • equipment failures
  • day-to-day effects
  • batch-to-batch differences
  • anomalous input conditions
  • warm-up effects
Scatter Plot: Outlier
• The scatter plot here reveals:
  • a basic linear relationship between X and Y for most of the data
  • a single outlier (at X = 375)
Symmetric Histogram with Outlier
• A symmetric distribution is one in which the two "halves" of the histogram appear as mirror images of one another
• The example shown is symmetric with the exception of outlying data near Y = 4.5
Normalization
• Normalization is the process of scaling or transforming the numbers in a data set to improve the accuracy of subsequent numeric computations
• Most statistical tests and intervals are based on the assumption of normality
• The normality assumption leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make it
• Most real data sets are, in fact, not approximately normal
• An appropriate transformation of a data set can often yield a data set that does follow an approximately normal distribution
• This increases the applicability and usefulness of statistical techniques based on the normality assumption
Box-Cox Transformation
• The Box-Cox transformation is a particularly useful family of transformations for bringing data closer to normality
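The transformation itself appears to have been a formula on the original slide that did not survive the export; the standard one-parameter Box-Cox family (a reconstruction, not copied from the slide) is

    \[
      T(Y) =
      \begin{cases}
        \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
        \ln Y, & \lambda = 0,
      \end{cases}
      \qquad Y > 0,
    \]

where λ is the transformation parameter chosen as described on the next slide.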
Measuring Normality
• Given a particular transformation, such as the Box-Cox transformation defined above, it is helpful to define a measure of the normality of the transformed data
• One such measure is the correlation coefficient of a normal probability plot
• The correlation is computed between the vertical and horizontal axis variables of the probability plot; it is a convenient measure of the linearity of the plot (the more linear the probability plot, the better a normal distribution fits the data)
• The Box-Cox normality plot is a plot of these correlation coefficients for various values of the parameter λ; the value of λ corresponding to the maximum correlation on the plot is the optimal choice for λ
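A minimal sketch of this procedure in Python, assuming SciPy is available (the data set and λ grid are illustrative, not from the tutorial):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # right-skewed sample

    # For each candidate lambda, transform the data and measure how linear the
    # normal probability plot is via its correlation coefficient r.
    lambdas = np.linspace(-2, 2, 81)
    corrs = []
    for lam in lambdas:
        y = stats.boxcox(data, lmbda=lam)
        (_osm, _osr), (_slope, _icept, r) = stats.probplot(y, dist="norm")
        corrs.append(r)

    best_lambda = lambdas[np.argmax(corrs)]
    print("lambda maximising the probability-plot correlation:", best_lambda)

SciPy also ships ready-made helpers (stats.boxcox_normplot and stats.boxcox_normmax) that perform essentially this search.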
Measuring Normality (cont’d)
• The histogram in the upper left-hand corner shows a data set that has significant right skewness, and so does not follow a normal distribution
• The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at λ = -0.3
• The histogram of the data after applying the Box-Cox transformation with λ = -0.3 shows a data set for which the normality assumption is reasonable
• This is verified with a normal probability plot of the transformed data
Normal Probability Plot
• The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed
• The data are plotted against a theoretical normal distribution in such a way that the points should form an approximately straight line; departures from this line indicate departures from normality
• The normal probability plot is a special case of the probability plot
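For instance, such a plot can be produced with SciPy and matplotlib along these lines (a sketch with illustrative data, not code from the tutorial):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.normal(loc=5.0, scale=2.0, size=200)   # illustrative sample

    # Ordered data against theoretical normal quantiles; a roughly straight
    # line suggests the normality assumption is reasonable.
    stats.probplot(data, dist="norm", plot=plt)
    plt.title("Normal probability plot")
    plt.show()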
CDF Plot
• Plot of the empirical cumulative distribution function (ECDF): at each observed value x, the ECDF gives the fraction of the sample that is less than or equal to x
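A short sketch of how such a plot is built (sorted data plotted against i/n; illustrative data, not from the tutorial):

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.default_rng(2).normal(size=100)   # illustrative sample

    # Empirical CDF: the i-th smallest observation is plotted against i/n.
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)

    plt.step(x, y, where="post")
    plt.xlabel("value")
    plt.ylabel("empirical CDF")
    plt.show()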