Chapter 2: Overview of the Data Mining Process
Introduction • Data mining • Predictive analysis • Tasks of classification & prediction • The core of business intelligence • Database methods • OLAP • SQL • These do not involve statistical modeling
Core Ideas in Data Mining • Analytical Methods Used in Predictive Analytics • Classification • Used with categorical response variables • e.g., will a purchase be made or not? • Prediction • Predict (estimate) the value of a continuous response variable • Also used with categorical response variables • Association Rules • Affinity analysis – "what goes with what" • Seeks correlations (co-occurrences) among items in the data
Core Ideas in Data Mining • Data Reduction • Reduce the number of variables • Group together similar variables • Data Exploration • View the data as evidence • Get "a feel" for the data • Data Visualization • Graphical representation of the data • Locate trends, correlations, etc.
Supervised Learning • "Supervised learning" algorithms are those used in classification and prediction. • Data is available in which the value of the outcome of interest is known. • "Training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable. • This process results in a "model" • Classification model • Predictive model
Supervised Learning • The model is then run with another sample of data, the "validation data," in which the outcome is known, to see how well the model performs. • If many different models are being tried out, a third sample of known outcomes, the "test data," is used with the final, selected model to predict how well it will do. • The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.
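As a minimal sketch of this train / validate / test workflow (assuming scikit-learn and synthetic data; the 60/20/20 split and variable names are illustrative, not prescribed by the slides):

```python
# Sketch of the supervised-learning workflow: learn from training data,
# compare candidate models on validation data, report final accuracy on test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a dataset with a known (binary) outcome
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

# 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # "learns" from the training data
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))  # compare models here
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))          # final, one-time estimate
```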
Supervised Learning • Linear regression analysis is an example of supervised learning • The Y variable is the (known) outcome variable • The X variable is a predictor variable • A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line • The regression line can then be used to predict Y values for new values of X for which we do not know the Y value
Unsupervised Learning • No outcome variable to predict or classify • No "learning" from cases where the outcome is known • Unsupervised learning methods • Association Rules • Data Reduction Methods • Clustering Techniques
The Steps in Data Mining • 1. Develop an understanding of the purpose of the data mining project • Is it a one-shot effort to answer a question or questions, or • An ongoing application (a recurring procedure)? • 2. Obtain the dataset to be used in the analysis • Random sampling from a large database to capture records to be used in the analysis • Pulling together data from different databases • Internal (e.g., past purchases made by customers) • External (e.g., credit ratings) • Usually the analysis to be done requires only thousands or tens of thousands of records
The Steps in Data Mining • 3. Explore, clean, and preprocess the data • Verify that the data are in reasonable condition • How should missing data be handled? • Are the values in a reasonable range, given what you would expect for each variable? • Are there obvious "outliers"? • Review the data graphically • For example, a matrix of scatter plots showing the relationship of each variable with each other variable • Ensure consistency in the definitions of fields, units of measurement, time periods, etc.
The Steps in Data Mining • 4. Reduce the data • If supervised learning is involved, partition the data into training, validation, and test datasets • Eliminate unneeded variables • Transform variables • e.g., turning "money spent" into "spent > $100" vs. "spent ≤ $100" • Create new variables • e.g., a variable that records whether at least one of several products was purchased • Make sure you know what each variable means, and whether it is sensible to include it in the model • 5. Determine the data mining task • Classification, prediction, clustering, etc. • 6. Choose the data mining techniques to be used • Regression, neural nets, hierarchical clustering, etc.
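For illustration, a short pandas sketch of the variable transformations mentioned in step 4 (the column names are hypothetical):

```python
# Turn a continuous "money spent" variable into a binary flag, and create a new
# variable recording whether at least one of several products was purchased.
import pandas as pd

df = pd.DataFrame({"spent": [45, 230, 80], "prod_a": [0, 1, 0], "prod_b": [1, 0, 0]})
df["spent_over_100"] = (df["spent"] > 100).astype(int)                     # "spent > $100" vs. "spent <= $100"
df["bought_any"] = (df[["prod_a", "prod_b"]].max(axis=1) > 0).astype(int)  # at least one product purchased
```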
The Steps in Data Mining • 7. Use algorithms to perform the task • An iterative process – trying multiple variants, and often multiple variants of the same algorithm (choosing different variables or settings within the algorithm) • When appropriate, feedback from the algorithm's performance on validation data is used to refine the settings • 8. Interpret the results of the algorithms • Choose the best algorithm to deploy • Use the final choice on the test data to get an idea of how well it will perform • 9. Deploy the model • Integrate the model into operational systems • Run it on real records to produce decisions or actions • For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > $10"
Preliminary Steps • Organization of datasets • Records in rows • Variables in columns • In supervised learning one of these variables will be the outcome variable, usually placed in the first or last column • Sampling from a database • Use samples to create, validate, and test the model • Oversampling rare events • If the response value of interest is rarely found in the data, increase its share of the sample (oversample) • Adjust the algorithm as necessary
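A sketch of one simple way to oversample a rare response class, using random duplication with pandas (the slides do not prescribe a particular method; data and column names are illustrative):

```python
# Sample the rare class with replacement so the training data contains enough
# of the event of interest for the algorithm to learn from.
import pandas as pd

df = pd.DataFrame({"x": range(100), "response": [1] * 5 + [0] * 95})  # 5% rare events
rare = df[df["response"] == 1]
common = df[df["response"] == 0]
oversampled = pd.concat([common, rare.sample(n=len(common), replace=True, random_state=1)])
```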
Preliminary Steps (Pre-processing and Cleaning the Data) • Types of variables • Continuous – assumes any real numerical value (generally within a specified range) • Categorical – assumes one of a limited number of values • Text (e.g., payments ∈ {current, not current, bankrupt}) • Numerical (e.g., age ∈ {0, …, 120}) • Nominal – unordered categories (e.g., payment status) • Ordinal – ordered categories (e.g., age)
Preliminary Steps (Pre-processing and Cleaning the Data) • Handling categorical variables • If a categorical variable is ordered, it can be used as a continuous variable (e.g., age, level of credit, etc.) • Use "dummy" variables when the range of values is not large • e.g., the variable occupation ∈ {student, unemployed, employed, retired} • Create binary (yes/no) dummy variables • Student – yes/no • Unemployed – yes/no • Employed – yes/no • Retired – yes/no • Variable selection • The more predictor variables, the more records are needed to build the model • Reduce the number of variables whenever appropriate
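A sketch of creating such dummy variables with pandas (pd.get_dummies); the occupation values follow the example above:

```python
# Expand the categorical "occupation" variable into binary (yes/no) dummy columns.
import pandas as pd

df = pd.DataFrame({"occupation": ["student", "employed", "retired", "unemployed"]})
dummies = pd.get_dummies(df["occupation"], prefix="occupation")  # one 0/1 column per category
df = pd.concat([df, dummies], axis=1)
```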
Preliminary Steps (Pre-processing and Cleaning the Data) • Overfitting • Building a model – describing relationships among variables in order to predict future outcome (dependent) values on the basis of future predictor (independent) values • Avoid "explaining" variation in the data that is nothing more than chance variation; avoid mislabeling "noise" in the data as if it were a "signal" • Caution – if the dataset is not much larger than the number of predictor variables, it is very likely that a spurious relationship will creep into the model
Preliminary Steps (Pre-processing and Cleaning the Data) • How many variables & how much data • A good rule of thumb is to have ten records for every predictor variable • For classification procedures, at least 6 × m × p records, where m = number of outcome classes and p = number of variables (a worked example follows below) • Compactness or parsimony is a desirable feature in a model • A matrix of x-y plots can be useful in variable selection • You can see at a glance the x-y plots for all variable combinations • A straight line would indicate that one variable is exactly correlated with another • We would want to include only one of them in our model • Weed out irrelevant and redundant variables from the model • Consult a domain expert whenever possible
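As a worked example of the 6 × m × p rule of thumb (with illustrative numbers), a classification problem with m = 2 outcome classes and p = 10 predictor variables would call for at least

$$6 \times m \times p = 6 \times 2 \times 10 = 120 \text{ records.}$$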
Preliminary Steps (Pre-processing and Cleaning the Data) • Outliers • Values that lie far away from the bulk of the data are called outliers • No statistical rule can tell us whether such an outlier is the result of an error • These are judgments best made by someone with "domain" knowledge • If the number of records with outliers is very small, they might be treated as missing data
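The slides leave outlier identification to domain judgment; as one common screening heuristic (an assumption, not a rule from the slides), values more than about three standard deviations from the mean can be flagged for review:

```python
# Flag values lying far from the bulk of the data as candidates for expert review.
import pandas as pd

s = pd.Series([23, 25, 27, 24, 26] * 4 + [980])  # one suspicious value among typical ones
z = (s - s.mean()) / s.std()
outliers = s[z.abs() > 3]                        # candidates only; domain judgment still decides
```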
Preliminary Steps (Pre-processing and Cleaning the Data) • Missing values • If the number of records with missing values is small, those records might be omitted • The more variables there are, the more records are likely to be dropped • Solution – replace a missing value with the average computed from the records that have valid data for that variable • This reduces the variability in the data set • Human judgment can be used to determine the best way to handle missing data
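A sketch of the average-value substitution described above, using pandas (the column name and values are hypothetical):

```python
# Replace missing values of a variable with the mean computed from valid records.
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, 48_000, np.nan]})
df["income"] = df["income"].fillna(df["income"].mean())  # note: this shrinks the variable's variability
```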
Preliminary Steps (Pre-processing and Cleaning the Data) • Normalizing (standardizing) the data • To normalize the data, subtract the mean from each value and divide by the standard deviation • This expresses each value as the "number of standard deviations away from the mean" – the z-score • Needed if variables are in different units, e.g., hours, thousands of dollars, etc. • Clustering algorithms measure how far values are from each other, so variables need to be on a common scale of distance • Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required
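Written out, the z-score described above is

$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x}$ is the variable's mean and $s$ its standard deviation.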
Preliminary Steps • Use and creation of partitions • Training partition • The largest partition • Contains the data used to build the various models • The same training partition is generally used to develop multiple models • Validation partition • Used to assess the performance of each model • Used to compare models and pick the best one • In classification and regression tree algorithms, the validation partition may be used automatically to tune and improve the model • Test partition • Sometimes called the "holdout" or "evaluation" partition; used to assess the performance of the chosen model with new data
The Three Data Partitions and Their Role in the Data Mining Process
Simple Regression Model • Make a prediction about the starting salary of a current college graduate • Use a data set of starting salaries of recent college graduates and compute the average salary • How certain are we of this prediction? There is variability in the data.
Simple Regression Model • The smaller the amount of total variation, the more accurate (certain) our prediction will be • Use the total variation as an index of uncertainty about our prediction • Compute the total variation
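In standard notation (not shown on the slide), the total variation is the sum of squared deviations of the salaries from their average:

$$\text{Total variation} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$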
Simple Regression Model • How do we "explain" the variability? Perhaps starting salary depends on the student's GPA • [Scatter plot: Salary vs. GPA]
Simple Regression Model • Find a linear relationship between GPA and starting salary • As GPA increases/decreases starting salary increases/decreases
Simple Regression Model • Least Squares Method to find the regression model • Choose a and b in the regression model (equation) so as to minimize the sum of squared deviations between the actual Y values and the predicted Y values (Y-hat)
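Formally, the least squares method chooses a and b to minimize

$$\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 \;=\; \sum_{i=1}^{n}\big(y_i - (a + b\,x_i)\big)^2.$$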
Simple Regression Model • A computer program computed a = 4,779 and b = 5,370 • How good is the model? • u-hat is a "residual" value (actual Y minus predicted Y) • The sum of all u-hats is zero • The sum of all squared u-hats is the total variance not explained by the model • The "unexplained variance" is 7,425,926
Simple Regression Model Total Variation = 23,000,000
Simple Regression Model Total Unexplained Variation = 7,425,726
Simple Regression Model • Relative Goodness of Fit • Summarize the improvement in prediction from using the regression model • Compute R² – the coefficient of determination • The regression model (equation) is a better predictor than guessing the average salary • GPA is a more accurate predictor of starting salary than guessing the average • R² is the "performance measure" for the model • Predicted Starting Salary = 4,779 + 5,370 × GPA
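Using the figures from the preceding slides (total variation 23,000,000 and unexplained variation of roughly 7.43 million), R² works out to approximately

$$R^2 = 1 - \frac{\text{unexplained variation}}{\text{total variation}} \approx 1 - \frac{7{,}425{,}726}{23{,}000{,}000} \approx 0.68,$$

i.e., about 68% of the variation in starting salary is accounted for by GPA.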
Problems • Problem 2.11 Page 33