Data Mining By Elizabeth Leon
KDD Issues
• Collect data
• Outliers
• Overfitting
• Large datasets
• High dimensionality
• Data quality
Collect Data
• Data generation under the control of an expert (modeler) -> designed experiment
• Random data generation -> observational approach
• Sampling distribution: the data used for estimating the model and the data used for testing should come from the same sampling distribution
Overfitting
• Occurs when the model fits the training data but fails to fit future (unseen) data
• Causes: wrong assumptions about the data
• Small size of the training dataset
Outliers
• Unusual data values that are not consistent with most observations, caused by:
  • Measurement errors
  • Coding and recording errors
  • Abnormal values
• These values can seriously affect the model produced later
• Two strategies (see the sketch after this list):
  • Remove outliers in preprocessing
  • Develop a model that is robust to outliers
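A minimal Python sketch of the first strategy, filtering by distance from the mean; the function name, the example data, and the threshold of 2 standard deviations are illustrative assumptions, not from the slides:

```python
import statistics

def remove_outliers(values, threshold=2.0):
    # Drop values more than `threshold` standard deviations from the
    # mean (the threshold is an assumed, tunable choice).
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - m) <= threshold * sd]

data = [12, 11, 13, 12, 14, 12, 95]  # 95 is an abnormal value
print(remove_outliers(data))          # [12, 11, 13, 12, 14, 12]
```

Note that a single extreme outlier inflates the standard deviation itself, which is one reason robust models (the second strategy) are sometimes preferred.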
Large Datasets
• Many algorithms were designed for small datasets
• Datasets grow exponentially
• These algorithms are inefficient for large datasets -> scalability problem
• Remedies: sampling, parallelization
High Dimensionality
• Many attributes (features)
• Not all attributes are needed to solve a given data mining problem
• High dimensionality increases the overall complexity and decreases the efficiency of an algorithm
• Remedy: reduce the number of attributes (dimensionality reduction)
[Figure: data matrix with samples as rows and features (attributes) as columns]
High Dimensionality
• This problem is known as the "curse of dimensionality". It arises from the geometry of high-dimensional spaces:
• Counterintuitive (our experience is in two or three dimensions)
• Conceptually, objects in high-dimensional spaces have a larger surface area for a given volume than objects in low-dimensional spaces
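A small numeric illustration of this surface effect (not from the slides): the fraction of a unit hypercube's volume lying within a thin shell of thickness eps of its surface is 1 - (1 - 2*eps)^d, which approaches 1 as the dimension d grows:

```python
# Fraction of a unit hypercube's volume within distance eps of its
# surface. In high dimensions almost all volume is near the surface.
eps = 0.05
for d in (2, 3, 10, 100):
    shell = 1 - (1 - 2 * eps) ** d
    print(d, round(shell, 3))
# 2 0.19 | 3 0.271 | 10 0.651 | 100 1.0 (approx.)
```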
Data Quality
• Accurate (spelled correctly, value is complete, etc.)
• Stored according to data type
• Integrity
• Consistent
• Not redundant
• Complete
Raw Data
• Types of data:
• Numeric (real, integer): age, speed
  Properties: order relation (5 < 7), distance relation (d(2.3, 4.2) = 1.9)
• Categorical (characters): sex, color, country
  Properties: equality relation (blue = blue, blue ≠ red)
Raw Data (cont.)
Categorical variables can be converted to numeric variables:
• Ex. 2 values: a binary variable (0 or 1)
• Ex. Variable color (4 values): black, blue, green, and brown can be coded with four binary digits:
  Black 1000
  Blue  0100
  Green 0010
  Brown 0001
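A minimal Python sketch of this binary (one-hot) coding; the function name and category list are illustrative:

```python
def one_hot(value, categories):
    # One binary digit per category: 1 where the value matches, 0 elsewhere.
    return [1 if value == c else 0 for c in categories]

colors = ["black", "blue", "green", "brown"]
print(one_hot("blue", colors))  # [0, 1, 0, 0]
```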
Raw Data (cont.)
• Values of variables:
• Continuous variables (quantitative or metric values), measured using:
  • an interval scale (temperature), or
  • a ratio scale (height)
• Discrete variables (qualitative values), measured using a nonmetric scale:
  • Nominal scale: symbols represent the values (e.g., a customer type identifier A, B, C, …; no metric characteristics)
  • Ordinal scale: a categorical variable with an order relation (ranks: gold, silver, and bronze)
Raw Data (cont.)
• Behavior with respect to time:
• Static data: do not change with time
• Dynamic or temporal data: change with time
The majority of data mining methods are suited to static data; special considerations and preprocessing are required to mine dynamic data.
Transformation of Raw Data
Small changes to the features can produce significant improvements in data mining performance.
• Normalizations
Give the best results when distance computations are needed. The measured values can be scaled to a specific range, e.g., [-1, 1] or [0, 1].
Transformation of Raw Data
• Decimal scaling: moves the decimal point but still preserves most of the original digit value:
  v'(i) = v(i) / 10^k
The smallest k is found such that max |v'(i)| < 1; the same divisor is then applied to all v(i).
E.g., for the values 455 and -834: max |v(i)| = 834, so k = 3, giving 0.455 and -0.834.
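A minimal Python sketch of decimal scaling (the function name is illustrative):

```python
def decimal_scaling(values):
    # Find the smallest k such that max |v(i)| / 10**k < 1,
    # then divide every value by the same 10**k.
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k

scaled, k = decimal_scaling([455, -834])
print(k, scaled)  # 3 [0.455, -0.834]
```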
Transformation of Raw Data
• Min-max normalization: obtains a better distribution of values over the whole normalized interval, e.g., [0, 1]. Min-max formula:
  v'(i) = (v(i) - min(v(i))) / (max(v(i)) - min(v(i)))
Exercise: during modeling the feature range was [150, 250]. For a new value v(i) = 80, what is the result using decimal scaling? Using min-max?
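A minimal sketch of the min-max formula, applied to the exercise value above (the function name is illustrative):

```python
def min_max(v, v_min, v_max):
    # v'(i) = (v(i) - min) / (max - min), yielding [0, 1] for
    # values inside the original [min, max] range.
    return (v - v_min) / (v_max - v_min)

print(min_max(80, 150, 250))  # -0.7
```

This illustrates a pitfall of min-max scaling: a new value outside the range seen during modeling (80 < 150 here) is mapped outside [0, 1].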
Transformation of Raw Data
• Standard deviation normalization: works well with distance measures, but transforms the data into a form no longer recognizable from the original data. For a feature v:
  • the mean value mean(v) and
  • the standard deviation sd(v) are calculated for the entire dataset
For each v(i):
  v'(i) = (v(i) - mean(v)) / sd(v)
Transformation of Raw Data
• E.g., v = {1, 2, 3}; then
  mean(v) = 2, sd(v) = 1
  v' = {-1, 0, 1} // normalized values
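A minimal Python sketch reproducing this example (the function name is illustrative; the sample standard deviation is used, matching sd(v) = 1 above):

```python
import statistics

def z_score(values):
    # v'(i) = (v(i) - mean(v)) / sd(v)
    m = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - m) / sd for v in values]

print(z_score([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```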
Transformation of Raw Data
• Data smoothing
Minor differences between values are often not significant and may degrade the performance of the method and the final results. Techniques:
• Averaging similar measured values
• Rounding:
  v = {0.93, 1.01, 1.001, 3.02, 2.99, 5.03, 4.98}
  v' = {1, 1, 1, 3, 3, 5, 5}
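A minimal sketch of smoothing by rounding, reproducing the example above (the function name is illustrative):

```python
def smooth_by_rounding(values, ndigits=0):
    # Smooth minor measurement differences by rounding each value.
    return [round(v, ndigits) for v in values]

v = [0.93, 1.01, 1.001, 3.02, 2.99, 5.03, 4.98]
print(smooth_by_rounding(v))  # [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
```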
Transformation of Raw Data
• Differences and ratios
• Using changes (differences) or ratios of values in a feature may improve performance, e.g., using s(t+1)/s(t) as the output instead of s(t+1)
• Changes in time for one feature, or compositions of different input features; e.g., in medical data, height and weight -> body mass index (BMI), a weighted ratio between weight and height
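A minimal sketch of the ratio transformation s(t+1)/s(t) (the function name and data are illustrative):

```python
def ratio_feature(series):
    # Replace raw next-step values with ratios s(t+1)/s(t), so the
    # target expresses relative change rather than absolute level.
    return [series[t + 1] / series[t] for t in range(len(series) - 1)]

print(ratio_feature([100, 110, 99]))  # [1.1, 0.9] (in floating point)
```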
Missing Data
• Many data mining techniques require complete data
• Simplest solution: reduction of the dataset, i.e., elimination of all samples with missing values (feasible if the dataset is large and the samples with missing values are a small percentage of the total)
• Finding values for the missing data:
  • Manually examine each case and enter a reasonable, probable, or expected value (feasible for a small number of missing values in a small dataset, but noise can be introduced)
  • Automatic replacement with some constant:
Missing Data (cont.)
• A single global constant
• The feature mean
• The feature mean for a given class (possible if the samples are classified in advance)
• Artificial samples: for each new sample, the missing value is replaced with each of the possible feature values (leads to a combinatorial explosion of artificial samples)
• Generate a predictive model to predict the missing values: regression, Bayes, decision trees
• Generate multiple data mining solutions (with and without the features that have missing values), then analyze and interpret them
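A minimal sketch of the feature-mean replacement; using None to mark missing values is an assumed convention, and the function name is illustrative:

```python
import statistics

def impute_with_mean(values):
    # Replace missing entries (None) with the feature mean
    # computed from the observed values only.
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

print(impute_with_mean([1.0, None, 3.0, 4.0]))  # [1.0, 2.666..., 3.0, 4.0]
```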
Time Dependent Data
• A feature measured over time (a series of values over fixed time units), e.g., temperature every hour, or sales of a product every day
• The classical univariate time-series problem
• The value of X at a given time is expected to be related to its previous values
• The series values can be expressed as
  X = {t(1), t(2), t(3), …, t(n)}
  where t(n) is the most recent value
Time Dependent Data (cont.)
• Goal: forecast t(n+1)
• Preprocessing of the raw data: specification of a window, or time lag (the number of previous values that influence the prediction)
• Every window represents one sample of data
Time Dependent Data (cont.)
• Example: a time series of eleven measurements
  X = {t(0), t(1), t(2), t(3), t(4), t(5), t(6), t(7), t(8), t(9), t(10)}
  with a window of 5. How many samples are there? Which ones (tabular representation)?
• Goal: predict values several time units in advance, t(n+j)
• Example: if j = 3, which are the samples? (A windowing sketch follows.)
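A minimal Python sketch of this windowing; the function and parameter names are illustrative. Run on the eleven-value series, it answers the questions above: 6 samples for one-step-ahead prediction, and 4 samples when j = 3.

```python
def windowed_samples(series, window, horizon=1):
    # Each sample = `window` consecutive values as inputs, plus the
    # value `horizon` steps past the window as the prediction target.
    samples = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples

x = list(range(11))                                   # t(0) .. t(10)
print(len(windowed_samples(x, window=5)))             # 6 samples, target t(n+1)
print(len(windowed_samples(x, window=5, horizon=3)))  # 4 samples, target t(n+3)
```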
Time Dependent Data (cont.)
• Besides the tabular representation, transformations and summarizations are often needed:
• Predict the difference t(n+1) - t(n)
• Predict the percentage of change t(n+1)/t(n)
• Exercise: page 32
Time Dependent Data (cont.)
• Summarization of features:
• Average them, producing "moving averages" (MA):
  MA(i, m) = (1/m) * Σ t(j), summed over j = i-m+1, …, i
• Summarizes the most recent m feature values for each case and for each increment in time i
• Smooths neighboring time points to reduce random variations and noise; each measured value can be viewed as
  t(i) = mean(i) + error
  with MA(i, m) estimating the mean component
• Exponential moving average (EMA): gives more weight to recent time periods:
Time dependent data (cont) EMA(i,m) = p*t(i) +(1-p)* EMA(i-1, m-1) EMA(I,1) = t(i) Comparative features: • Difference between current value an MA • Difference between two moving averages • Ratio between the current value and MA