IRMAC: Data Warehouse SIG. November 5, 2002. Data Mining. A Practical Look at Data Preparation. Jason Brown Cognicase Inc. Agenda. Crash Course in Data Mining What Why How The virtuous cycle Data Preparation Case Study Background Going through the cycle Data Preparation Q&A.
IRMAC: Data Warehouse SIG November 5, 2002 Data Mining A Practical Look at Data Preparation Jason Brown Cognicase Inc.
Agenda • Crash Course in Data Mining • What • Why • How • The virtuous cycle • Data Preparation • Case Study • Background • Going through the cycle • Data Preparation • Q&A
The Crash Course • What • Why • How
Definitions Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Knowledge Discovery Data Mining is not Data Warehousing, OLAP etc.
Definitions Modeling: • Not an ER type Data Model • A data mining model is computational, full of algorithms • A model can be descriptive or predictive. • A descriptive model helps in understanding underlying processes or behavior. • A predictive model uses known values (input) to predict an unknown value (output)
Two Types of Data Mining • Directed • Know specifically what we are looking for • Who is likely to respond to our offer? • What are our customers going to be worth to us over their lifetime? • Model is a Black Box [Diagram: Input, model, Output]
Two Types of Data Mining • Undirected • Not exactly sure what we are looking for • How should we define our Customer Segments? • What is interesting about all of our point of sale data? • Model is a Transparent Box [Diagram: Input, model, Output]
Modeling Techniques • Decision Trees • Rule Induction (If …….. Then ……..) • Neural Networks • Clustering • Nearest Neighbour
Modeling Techniques Decision Trees • The tree is built based on the input of a training data set • Training Data Set is based on historical data • Over sample the data that reflects your question • Each record of the Model Set is run through the branches of the tree until the record reaches a leaf
Modeling Techniques Decision Trees [Diagram: example tree with Y/N splits on Age < 23, Income < 12 000 and Male?, ending in leaves of 28% and 37%]
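To make "run each record through the branches until it reaches a leaf" concrete, here is a minimal sketch in Python. The tree is hand-built and hypothetical: the three splits echo the slide, but the layout of the branches and which leaf holds which percentage are assumptions, and two leaf values are invented for illustration. A real tree would be induced from the training data set by a modeling tool.

```python
# Hypothetical hand-built tree. Only the split tests (Age < 23,
# Income < 12 000, Male?) come from the slide; the shape is assumed.
TREE = {
    "test": lambda r: r["age"] < 23,
    "yes": {
        "test": lambda r: r["income"] < 12_000,
        "yes": {"leaf": 0.37},
        "no": {"leaf": 0.28},
    },
    "no": {
        "test": lambda r: r["male"],
        "yes": {"leaf": 0.15},  # assumed value, not from the slide
        "no": {"leaf": 0.10},   # assumed value, not from the slide
    },
}


def score(record, node=TREE):
    """Follow Y/N branches until the record lands on a leaf."""
    while "leaf" not in node:
        node = node["yes"] if node["test"](record) else node["no"]
    return node["leaf"]


print(score({"age": 21, "income": 9_000, "male": True}))
```

Each leaf value is read as the share of training records in that leaf that matched the question being asked.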
Modeling Techniques Neural Networks • Neural networks are a nonlinear model, loosely similar to a ‘brain’. • The network is built based on the input of a training set. • Model sets run through this network return results based on the patterns identified in the training set. • Very Complex
Modeling Techniques Clustering • Clustering finds groups of records that are similar. • For example, customers can be clustered by: • income • age • ytd revenue
Modeling Techniques Clustering [Diagram: two clusters described by Male, Income < 12 000, Age < 23; one labelled Coke Buyers, the other Non Coke Buyers]
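One way to find "groups of records that are similar" is k-means, sketched below. The slides do not name a clustering algorithm, so k-means is an assumption here, as are the sample customers and the choice of starting centroids. In practice the variables would usually be scaled first so that income does not dominate age.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers as (age, income in thousands).
points = [(19, 8), (21, 10), (22, 11), (45, 60), (50, 72), (48, 65)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```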
Modeling Techniques Nearest Neighbour • Model is built based on the input of a training set. • Classifies a record by calculating the distances between the record criteria and the training data set • Then it assigns the record to the class that is most common among its nearest neighbours
Modeling Techniques Nearest Neighbour [Diagram: records plotted by Income, Gender and Age, each labelled Bought Coke or Did Not]
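The two steps described for nearest neighbour (measure distances to the training records, then take the most common class among the closest ones) can be sketched as follows. The training rows, the encoding of gender as 0/1, and the choice of k = 3 with Euclidean distance are all assumptions for illustration.

```python
import math
from collections import Counter


def classify(record, training, k=3):
    """Label a record with the majority class of its k nearest
    training records, using Euclidean distance."""
    by_distance = sorted(training, key=lambda row: math.dist(record, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (age, income in thousands, male as 0/1).
TRAINING = [
    ((20, 10, 1), "bought coke"),
    ((22, 11, 1), "bought coke"),
    ((19, 9, 0), "bought coke"),
    ((45, 60, 0), "did not"),
    ((50, 70, 1), "did not"),
]

print(classify((21, 12, 1), TRAINING))
```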
Modeling Techniques Rule Induction • A technique that infers generalizations from the information in the data IF age < 19 AND purchase is coke THEN 40% purchase chips • Describes the data, allows us to visualize what is going on
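The percentage in a rule such as the one above is the share of records matching the IF part that also satisfy the THEN part. A minimal sketch of that calculation, using invented purchase records chosen so the result matches the 40% on the slide:

```python
def rule_confidence(records, condition, outcome):
    """Share of records meeting the IF condition that also meet
    the THEN outcome; None when no record meets the condition."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if outcome(r)) / len(matching)


# Hypothetical purchase records.
records = [
    {"age": 17, "items": {"coke", "chips"}},
    {"age": 18, "items": {"coke"}},
    {"age": 16, "items": {"coke", "chips"}},
    {"age": 18, "items": {"coke", "gum"}},
    {"age": 17, "items": {"coke"}},
    {"age": 30, "items": {"coke", "chips"}},  # fails the age test
]

confidence = rule_confidence(
    records,
    condition=lambda r: r["age"] < 19 and "coke" in r["items"],
    outcome=lambda r: "chips" in r["items"],
)
print(confidence)
```

Rule induction tools search for many such rules automatically; this only shows how one candidate rule is scored.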
The Crash Course • What • Why • How
The Reasons to Mine Data • Increase Profit [Diagram: Expenses, Revenues]
The Reasons to Mine Data • For Marketing/CRM • Targeting prospects • Predicting future customer behaviour • Costs Revenues • For Research • Identify drugs likely to be successful • Costs • For Process Improvement • Identify causes of production failures • Costs
The Crash Course • What • Why • How
Process • Many different processes for Data Mining • Vendor Driven • SAS - SEMMA • Sample, Explore, Modify, Model, Assess • SPSS - 5 A’s • Assess, Access, Analyze, Act, Automate • Consulting Companies • The Virtuous Cycle • Michael Berry and Gordon Linoff
Process The Virtuous Cycle [Diagram: cycle of Business Problem, Transform Data, Act, Measure]
Business Problem • Define the business problem • Understand the business and the rules • Determine if Data Mining fits the need • Understand the value to the business of solving the problem
Data for Data Mining • Type of Data Values • Categorical • Defined set of values • Ontario, Quebec, PEI … • Ranks • High, Medium, Low • 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000 • Intervals • Date • Time • Temperature • True Numeric • Values that support numeric operations
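As an illustration of turning a true numeric value into a rank, the sketch below assigns an income to one of the three bands listed above. Returning None for values outside the listed bands is an assumption; the slide does not say how such values are handled.

```python
import bisect

# Band upper bounds and labels taken from the slide.
UPPER_BOUNDS = [20_000, 35_000, 50_000]
LABELS = ["0 - 20 000", "20 001 - 35 000", "35 001 - 50 000"]


def income_band(income):
    """Map a numeric income to its ranked band, or None if the
    value falls outside the listed bands (assumed behaviour)."""
    if income < 0 or income > UPPER_BOUNDS[-1]:
        return None
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, income)]


print(income_band(20_001))
```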
Transform Data Steps • 1 Identify Data • 2 Obtain Data • 3 Validate & Clean • 4 Transpose to Right Granularity • 5 Add Derived Variables • 6 Prepare Model Set • 7 Conduct Modeling
Transform Data Step 1: Identify Data • What data is required to meet the modeling need? • What data is available?
Transform Data Step 2: Obtain Data • OLTP • Data Warehouse • Data Marts and OLAP • Self Reported • External
Transform Data Step 3: Validate & Clean • Data Issues: • Missing • Fuzzy • Incorrect • Outliers • Solutions: • Change Source • Filter Out • Ignore • Integrate • Predict • Derive a New Variable
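Two of the listed solutions, Filter Out and Derive a New Variable, can be sketched as below. The field names (`minutes`, `income`) and the outlier threshold are hypothetical; which solution suits which issue is a judgment the slides leave to the project.

```python
def clean(records, max_minutes=50_000):
    """Filter out records with missing, incorrect or outlying usage,
    and derive a flag recording that income was missing."""
    cleaned = []
    for r in records:
        minutes = r.get("minutes")
        # Filter Out: missing, negative (incorrect) or extreme (outlier).
        if minutes is None or minutes < 0 or minutes > max_minutes:
            continue
        row = dict(r)
        # Derive a New Variable: keep the fact that income is missing.
        row["income_missing"] = row.get("income") is None
        cleaned.append(row)
    return cleaned


# Hypothetical subscriber records.
raw = [
    {"minutes": 300, "income": None},
    {"minutes": -5, "income": 40_000},
    {"minutes": 10_000_000, "income": 55_000},
    {"minutes": 120, "income": 30_000},
]
print(clean(raw))
```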
Transform Data Step 4: Transpose to right granularity • Data sets for Data Mining need one view, one record • Grain must be consistent throughout • Aggregates can be problematic • Atomic data is often required to build data set • Training data sets cast from point in time of event looking back
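A minimal sketch of rolling atomic records up to one record per subscriber, looking back from a point in time. The call records, field names and cut-off date are hypothetical.

```python
from collections import defaultdict
from datetime import date


def to_subscriber_grain(calls, as_of):
    """Collapse atomic call records into one row per subscriber,
    using only events that happened before the as_of date."""
    rows = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for c in calls:
        if c["date"] < as_of:  # look back only; ignore later events
            row = rows[c["subscriber_id"]]
            row["calls"] += 1
            row["minutes"] += c["minutes"]
    return dict(rows)


# Hypothetical atomic call records.
calls = [
    {"subscriber_id": "A", "date": date(2002, 9, 3), "minutes": 12},
    {"subscriber_id": "A", "date": date(2002, 10, 7), "minutes": 30},
    {"subscriber_id": "B", "date": date(2002, 10, 9), "minutes": 5},
    {"subscriber_id": "A", "date": date(2002, 11, 2), "minutes": 99},
]
grain = to_subscriber_grain(calls, as_of=date(2002, 11, 1))
print(grain)
```

Building from atomic data keeps the grain consistent, which is harder to guarantee when starting from pre-built aggregates.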
Transform Data Step 5: Add Derived Variables • Combined Columns • Summarizations • Features from Columns • Time Series
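The sketch below shows one example each of a combined column, a summarization and a simple time-series feature. The input fields (`revenue`, `monthly_minutes`) and the derived variable names are hypothetical.

```python
def derive(row):
    """Add derived variables to a subscriber-level row.
    monthly_minutes is ordered oldest month first."""
    usage = row["monthly_minutes"]
    total = sum(usage)
    return {
        **row,
        # Combined columns: revenue relative to usage.
        "revenue_per_minute": row["revenue"] / total if total else 0.0,
        # Summarization: average monthly usage.
        "avg_monthly_minutes": total / len(usage),
        # Time series: change from the first month to the latest.
        "usage_trend": usage[-1] - usage[0],
    }


derived = derive({"revenue": 60.0, "monthly_minutes": [100, 200, 300]})
print(derived)
```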
Transform Data Step 6: Prepare Model Set • The Actual Input to the modeling
Transform Data Step 7: Conduct Modeling • Get our result • Decision Trees • Neural Networks • Clustering • Nearest Neighbour • Rule Induction
Act • The Business has to actually do something with the results, or what was the point? • Marketing or Retention Campaigns • Business Changes
Measure • Answer 2 Questions • Was the Data Mining effort accurate? • Were the Business Actions successful? • Use different sets of data to compare real results • Actioned Customers vs. Non Actioned • Accuracy Types • Absolute • Our prediction was that 80% of Group D would buy Coke, and 78% really did • Relative • Our prediction was that 80% of Group D would buy Coke, but only 57% really did; however, Group C, which we predicted had a 60% propensity to buy Coke, actually bought Coke 42% of the time, so Group D still outperformed Group C as predicted
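The two accuracy types can be checked with a few lines of arithmetic. In this sketch, absolute accuracy is read as the gap between predicted and actual rates, and relative accuracy as whether the groups kept their predicted ordering; that reading, and the function names, are assumptions. The figures are the ones from the slide.

```python
def absolute_error(predicted, actual):
    """Gap between the predicted and the observed rate."""
    return abs(predicted - actual)


def rank_order_preserved(predicted, actual):
    """True if groups ranked by predicted rate are also in
    non-increasing order of actual rate."""
    groups = sorted(predicted, key=predicted.get, reverse=True)
    rates = [actual[g] for g in groups]
    return all(a >= b for a, b in zip(rates, rates[1:]))


# Absolute example: predicted 80%, observed 78%.
print(absolute_error(0.80, 0.78))

# Relative example: both groups fell short, but D still beat C.
predicted = {"D": 0.80, "C": 0.60}
actual = {"D": 0.57, "C": 0.42}
print(absolute_error(predicted["D"], actual["D"]))
print(rank_order_preserved(predicted, actual))
```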
And back around [Diagram: cycle of Business Problem, Transform Data, Act, Measure]
The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure
The Business • One of the top 3 (4?) cellular phone providers in Canada • Recent Acquisitions • Clearnet • Quebectel • Important Business Concepts • Handset • Subscriber • Client • Activity - Activations, Deactivations • Churn • Usage
Strengths at • Commitment to Data Warehousing • Prior Experience in Data Mining • Tools already Established • Strong business support for the outcomes Data Mining would provide
Challenges at • Data Warehouse still in midst of major re-architecture effort • Ongoing billing system integration projects • A data mart for data mining had existed (Clearnet) but it was a victim of both of the above • Successful at Churn Prediction
The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure
Using the Virtuous Cycle [Diagram: cycle of Business Problem, Transform Data, Act, Measure]
Business Problems • Churn Modeling • predict which subscriber is likely to leave • Behavioural Segmentation • clustering subscribers into subgroups based on some commonality • revenue, usage, demographic • Client Value Estimation • the present value of all future profits generated throughout the lifetime of that client
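Client value as defined above is a present-value calculation: each future period's expected profit is discounted back to today and the results are summed. A minimal sketch, assuming yearly periods, a single fixed discount rate and invented profit figures (the slides specify none of these):

```python
def client_value(expected_profits, discount_rate):
    """Present value of a stream of future profits, where
    expected_profits[0] is received one period from now."""
    return sum(
        profit / (1 + discount_rate) ** t
        for t, profit in enumerate(expected_profits, start=1)
    )


# Hypothetical client: profits of 110 and 121 over the next two years,
# discounted at 10% per year.
print(client_value([110, 121], discount_rate=0.10))
```

A fuller estimate would also weight each period by the probability that the client has not yet churned, which is where the churn model feeds in.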
Transform Data Steps • 1 Identify Data • 2 Obtain Data • 3 Validate & Clean • 4 Transpose to Right Granularity • 5 Add Derived Variables • 6 Prepare Model Set • 7 Conduct Modeling
Identify Data (Step 1) • Business Wanted: EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER! • They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired.
Identify Data (Step 1) IT Challenges • 19 Subject Areas identified with up to 75 variables each - What is the Priority? • Avoid the big bang - How much can we actually do? • Where to source the data from? • Resources - Who is going to do it?
Identify Data (Step 1) Prioritizing • First we asked the business to rate each variable as H, M or L priority • Almost everything was given an H • Then we asked the business to rank the subject areas in order of importance • Hard to convince them of the value • Hard to find consensus • Necessary for determining a release strategy
Obtain Data (Step 2) • For each Subject Area and each variable we assessed and documented the following: • Where can it be sourced from (and when)? • What are the known issues? • Q&A back and forth on the variables with business • Identified possible additional variables