1 / 65

A Practical Look at Data Preparation

Dive into the world of data mining with this comprehensive guide covering definitions, modeling techniques, decision trees, neural networks, clustering, and nearest neighbor methodologies. Learn about the virtuous cycle, data preparation, and real-world case studies.

gmajor
Download Presentation

A Practical Look at Data Preparation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. IRMAC: Data Warehouse SIG November 5, 2002 Data Mining A Practical Look at Data Preparation Jason Brown Cognicase Inc.

  2. Agenda • Crash Course in Data Mining • What • Why • How • The virtuous cycle • Data Preparation • Case Study • Background • Going through the cycle • Data Preparation • Q&A

  3. The Crash Course • What • Why • How

  4. Definitions Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Knowledge Discovery Data Mining is not Data Warehousing, OLAP etc.

  5. Definitions Modeling: • Not an ER type Data Model • A data mining model is computational, full of algorithms • A model can be descriptive or predictive. • A descriptive model helps in understanding underlying processes or behavior. • A predictive model uses known values (input) to predict an unknown value (output)

  6. Input Output Two Types of Data Mining • Directed • Know specifically what we are looking for • Who is likely to respond to our offer? • What our customers going to be worth to us over their lifetime? • Model is a Black Box

  7. Input Output Two Types of Data Mining • Undirected • Not exactly sure what we are looking for • How should we define our Customer Segments? • What is interesting about all of our point of sale data? • Model is a Transparent Box

  8. Decision Trees Rule Induction If …….. Then …….. Neural Networks Clustering Nearest Neighbour Modeling Techniques

  9. Modeling Techniques Decision Trees • The tree is built based on the input of a training data set • Training Data Set is based on historical data • Over sample the data that reflects your question • Each record of the Model Set is run through the branches of the tree until the record reaches a leaf

  10. Modeling Techniques Decision Trees Age < 23 Income < 12 000 N Y Male ? N Y N Y 28% 37%

  11. Modeling Techniques Neural Networks • Neural networks are a nonlinear model -similar to a ‘brain’. • The network is built based on the input of a training set. • Model sets run through this network will return accurate results based on the patterns identified in the training set. • Very Complex

  12. Modeling Techniques Clustering • Clustering finds groups of records that are similar. • For example, customers can be clustered by: • income • age • ytd revenue

  13. Modeling Techniques Clustering Male Income < 12 000 Age < 23 Coke Buyers Male Income < 12 000 Age < 23 Non Coke Buyers

  14. Modeling Techniques Nearest Neighbour • Model is built based on the input of a training set. • Classifies a record by calculating the distances between the record criteria and the training data set • Then it assigns the record to the class that is most common among its nearest neighbours

  15. Modeling Techniques Nearest Neighbour Records plotted based on: IncomeGenderAge Did Not Did Not Bought Coke Bought Coke Bought Coke

  16. Modeling Techniques Rule Induction • A technique that infers generalizations from the information in the data IF age < 19 AND purchase is coke THEN 40% purchase chips • Describes the data, allows us to visualize what is going on

  17. The Crash Course • What • Why • How

  18. Expenses Revenues The Reasons to Mine Data Increase Profit

  19. The Reasons to Mine Data • For Marketing/CRM • Targeting prospects • Predicting future customer behaviour • Costs Revenues • For Research • Identify drugs likely to be successful • Costs • For Process Improvement • Identify causes of production failures • Costs

  20. The Crash Course • What • Why • How

  21. Process • Many different processes for Data Mining • Vendor Driven • SAS - SEMMA • Sample, Explore, Modify, Model, Assess • SPSS - 5 A’s • Assess, Access, Analyze, Act, Automate • Consulting Companies • The Virtuous Cycle • Michael Berry and Gordon Linoff

  22. Transform Data Business Problem Act Measure Process The Virtuous Cycle

  23. Business Problem • Define the business problem • Understand the business and the rules • Determine if Data Mining fits the need • Understand the value to the business of solving the problem

  24. Data for Data Mining • Type of Data Values • Categorical • Defined set of values • Ontario, Quebec, PEI … • Ranks • High, Medium, Low • 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000 • Intervals • Date • Time • Temperature • True Numeric • Values that support numeric operations

  25. Conduct Modeling 7 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1 Transform Data Steps

  26. Transform Data Step 1 - Identify Data • What data is required to meet the modeling need? • What data is available? Identify Data 1

  27. Transform Data Step2 - Obtain Data • OLTP • Data Warehouse • Data Marts and OLAP • Self Reported • External Obtain Data 2 Identify Data 1

  28. Transform Data Step 3: Validate & Clean • Solutions: • Change Source • Filter Out • Ignore • Integrate • Predict • Derive a New Variable • Data Issues: • Missing • Fuzzy • Incorrect • Outliers Validate & Clean 3 Obtain Data 2 Identify Data 1

  29. Transform Data Step 4: Transpose to right granularity • Data sets for Data Mining need one view, one record • Grain must be consistent throughout • Aggregates can be problematic • Atomic data is often required to build data set • Training data sets cast from point in time of event looking back 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

  30. Transform Data Step 5: Add Derived Variables • Combined Columns • Summarizations • Features from Columns • Time Series 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

  31. Transform Data Step 6: Prepare Model Set • The Actual Input to the modeling 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

  32. Transform Data Step 7: Conduct Modeling • Get our result • Decision Trees • Neural Networks • Clustering • Nearest Neighbour • Rule Induction 7 Conduct Modeling 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

  33. Act The Business has to actually do something with the results or what was the point? Marketing or Retention Campaigns Business Changes

  34. Measure • Answer 2 Questions • Was the Data Mining effort accurate? • Were the Business Actions successful? • Use different sets of data to compare real results • Actioned Customers vs. Non Actioned • Accuracy Types • Absolute • Our prediction was 80% of Group D would buy Coke and 78% really did • Relative • Our prediction was 80% of Group D would buy Coke but 57 % really did, however Group C which we predicted had a 60% propensity to buy Coke actually bought Coke 42% of the time

  35. And back around Transform Data Business Problem Act Measure

  36. The Case Study

  37. The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure

  38. The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure

  39. The Business • One of the top 3 (4?) cellular phone providers in Canada • Recent Acquisitions • Clearnet • Quebectel • Important Business Concepts • Handset • Subscriber • Client • Activity - Activations, Deactivations • Churn • Usage

  40. DW Environment

  41. Strengths at • Commitment to Data Warehousing • Prior Experience in Data Mining • Tools already Established • Strong business support for the outcomes Data Mining would provide

  42. Challenges at • Data Warehouse still in midst of major re-architecture effort • Ongoing billing system integration projects • A data mart for data mining had existed (Clearnet) but it was a victim of both of the above • Successful at Churn Prediction

  43. The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure

  44. Using the Virtuous Cycle Transform Data Business Problem Act Measure

  45. Business Problems • Churn Modeling • predict which subscriber is likely to leave • Behavioural Segmentation • clustering subscribers into subgroups based on some commonality • revenue, usage, demographic • Client Value Estimation • the present value of all future profits generated throughout the lifetime of that client

  46. Transform Data Steps Conduct Modeling 7 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

  47. Identify Data • Business Wanted: EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER! • They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired. 1

  48. Identify Data IT Challenges • 19 Subject Areas identified with up to 75 variables each - What is the Priority? • Avoid the big bang - How much can we actually do? • Where to source the data from? • Resources - Who is going to do it? 1

  49. Identify Data Prioritizing • First we asked the business to rate each variable as H, M or L priority • Almost everything was given an H • Then we asked the business to rank the subject areas in order of importance • Hard to convince them of the value • Hard to find consensus • Necessary for determining a release strategy 1

  50. Obtain Data • For each Subject Area and each variable we assessed and documented the following: • Where can it be sourced from (and when)? • What are the known issues? • Q&A back and forth on the variables with business • Identified possible additional variables 2

More Related