180 likes | 670 Views
Introduction to Data Mining with XLMiner. Business Intelligence. Where Data Mining fits in. Getting started with Data Mining. Preparing for Data Mining Types of DM methods Descriptive Predictive Prescriptive. Data Mining Methods in XLMiner. Affinity Analysis Association Rules
E N D
Introduction to Data Mining with XLMiner Business Intelligence
Getting started with Data Mining • Preparing for Data Mining • Types of DM methods • Descriptive • Predictive • Prescriptive
Data Mining Methods in XLMiner • Affinity Analysis • Association Rules • Segmentation • Cluster Analysis • Data Preparation and Exploration • Data Preparation • Data Visualization • Dimension Reduction • Prediction • Linear Regression • K-nearest Neighbors • Regression Trees • Neural Nets • Time Series Forecasting • Smoothing Methods • Regression Based • Classification • Naïve Bayes • K-nearest Neigbors • Classification Trees • Logistic Regression • Neural Nets • Discriminant Analysis
First Part of DM: Data • Data Selection • Operational Data • External sources • Preprocessing • Remove duplicates • Remove common errors or noisy data • Domain Consistency • Coding etc. • ETL and Data Mining Tools can help
Enrichment Adding additional data Sometimes only to a subset of the data Need to organize the data to enable adding the additional values Coding Modify the data to make it work more effectively Without losing the meaning Take out some detail while retaining relative value Categorical data to dummy variables Categorize Address to region Birth date to age Scale appropriately Divide by 1000 for $ values where appropriate Binary Attributes Yes-No to 0-1 Time Series Convert date to month numbers starting from a fixed point (1900) Dimension Reduction Details on Data Preprocessing
Dealing with Data: Excel and XLMiner • Transformation • Flat file (CSV, Delimited, etc) • Various Functions in Excel (Text, Date, etc.) • Missing Data identification • Excel functions • Count and Countif • Countblank • Absurd Data • Date of Birth as 01-01-01 • Phone number as 123-456-7890, etc. • Needs review and sometime domain knowledge • Pivot Tables and conditional formatting • Outliers Identification • XLMiner Boxplots • Excel Descriptive Statistics • Histograms • Normalization • Needed when you have mixed data and the actual scale does not matter (you do not have to have the income reported to the exact dollars, it is as good when reported in units of $1000) • Essentially creation of z-score [ (X-mean)/sigma]
Changing the Categorical Variables • Create Dummy Variables • Depending on the categories, you may have to create a lot of dummy variables • Rule: • Total number of dummy variables should be equal to the nominal categories. • Usually the last dummy is not needed as it is determined by the absence of other categorical values • Example: • Color = Red, Blue and Green. • We will need three dummies, XR, XB, and XG • For Red the values will be 1, 0, 0 • For Blue the values will be 0, 1, 0 • And for Green the values will be 0, 0, 1 • As you can see, we could have done without the dummy for the green as the value of 0, 0 for the first two variables will automatically indicate green (assuming only three colors), so XR and XB would have been enough. • For ordinal categorical variables, use one dummy and give it various values based on the order of the categories. • Example: Grades can be F, D, C, B and A • Possible values of XGrade will be 0, 1, 2, 3, and 4.
Examples of Creating Dummy • Experience values: 1, 2, 3 or 4 • Dummies are: • 000 = 1, 100 = 2, 010 = 3, 001 = 4 • Education Values are: Bachelors, Masters and PhD • Dummies are:
XLMiner Data Preparation • Dummy Variables • XLMiner-> Data Utilities->Transform Categorical Data • Try it with Universal Bank Data with Education and Family • Outliers • Histogram (needs Binning) • Box Plot (under charting Data) • Can also be done in Excel • Rule is to find the outright outliers and then cutting off the values which are asymmetric on either side • Box Plot and Histogram tell you the skewness of the distribution
Data Mining Data Groups • Given a collection of records (training set ) • A model is created for a particular prediction or classification • Goal: previously unseen records should be predicted or classified as accurately as possible. • A validation/test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. • Finally, a previously unseen data group is used for final model performance (called Test set in XLMiner)
XLMiner: Opening and Partitioning Datasets • The demo version of the program only allows 600 rows. We will use the educational version to go beyond that. • Partitioning of the data • Training • Used for building the model • Validation • Used for validating the quality of the model • Test • Used for testing the model, specially for the algorithms that may use both training and validation data recursively to build the model • Oversampling • Employed for spare record sets.
XL Miner Outputs • Module dependent • Typically has a summary report • Has the option of full reports • All Excel worksheets, so can be used for further calculations • Includes Lift Charts and Decile Lift Charts