
Chapter 3 Data Mining Methodology and Best Practices


Presentation Transcript


  1. Chapter 3 Data Mining Methodology and Best Practices

  2. Data Mining’s Virtuous Cycle
     • Identify the business opportunity*
     • Mine the data to transform it into actionable information
     • Act on the information
     • Measure the results
     * The textbook uses “problem” and “opportunity” interchangeably.

  3. It’s time to…
     • Turn our attention to translating business opportunities (problems) into data mining opportunities (problems), including:
       • Transforming data into information via:
         • Hypothesis testing
         • Profiling
         • Predictive modeling
       • Taking action
         • Model deployment
         • Scoring
       • Measurement
         • Assessing a model’s stability & effectiveness before it is used

  4. DM General Guidelines
     • The DM virtuous cycle (4 steps) is iterative
     • No steps should be skipped
     • Common sense determines how rigorously each step is carried out
       • Simplest approach: ad hoc queries to test hypotheses
       • Rigorous approach: the 4 steps of the virtuous cycle expand into an 11-step methodology

  5. Why have a Methodology?
     • A DM methodology that includes DM best practices helps to avoid:
       • Learning things that are not true
       • Learning things that are true, but not useful
     • Learning things that are not true is the more dangerous of the two. Why is that? …

  6. Learning Things that are not True
     • Patterns may not represent any underlying rule
     • The sample may not reflect its parent population, which introduces bias
     • Data may be at the wrong level of detail (granularity; aggregation)
     Examples?

  7. Learning Things that are True, but not Useful
     • Learning things that are already known. Examples?
     • Learning things that cannot be used. Examples?

  8. Hypothesis Testing
     • A hypothesis is a proposed explanation whose validity can be tested by analyzing data
     • The purpose is to validate or invalidate preconceived ideas
     • Included in almost all DM projects
     • Data collection is done via:
       • Observation
       • Experiment (lab, survey)
     • Bias must be avoided; doing so usually requires both analytical and business knowledge
     • Hypothesis testing is useful but often insufficient, which leads us to…
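
One way to test a hypothesis by analyzing data is a two-proportion z-test. The slides do not name a specific statistical test, so this is only a minimal sketch of one common choice. The function name, the scenario (whether customers who received an offer respond at a different rate than those who did not), and the counts are all hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one response rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 120 of 1000 responders with the offer, 90 of 1000 without.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone. It does not rule out bias in how the two groups were collected, which is why the slide stresses business knowledge as well as analysis.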

  9. Models
     • Model: an explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world
     • We use models every day… Examples?
     • The data used to build DM models is called the Model Set
     • The new data to which a model is applied is called the Score Set
     • The Model Set includes:
       • Training Set – used to build a set of DM models
       • Validation Set – used to choose the best DM model
       • Test Set – used to determine how the model performs on unseen data
     • Models – 3 kinds of DM models for 3 kinds of tasks… next slide
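
The partition of the model set into training, validation, and test sets can be sketched as below. This is an illustrative sketch only: the 60/20/20 proportions, the fixed seed, and the function name `split_model_set` are assumptions, not values given in the slides.

```python
import random


def split_model_set(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle the records and partition them into training, validation, and test sets."""
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]                    # used to build candidate models
    validation = shuffled[n_train:n_train + n_valid]  # used to choose among them
    test = shuffled[n_train + n_valid:]              # used once, to estimate performance
    return training, validation, test


model_set = list(range(100))  # stand-in for 100 customer records
training, validation, test = split_model_set(model_set)
print(len(training), len(validation), len(test))
```

The three sets are disjoint, so the test set gives an estimate of performance that the model-building and model-selection steps have not already seen.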

  10. Profiling and Prediction
      • Profiling
        • Describes what is in the data
        • Often uses demographic variables
        • Cannot distinguish cause from effect (e.g., beer drinkers and males)
        • Focus is on explaining the past (timing = past)
      • Prediction
        • Finds patterns in data from prior period(s) that are capable of explaining or anticipating outcomes in a later period (timing = future)
        • Predictive models require separation in time between the model inputs and the model output
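
The separation in time between inputs and output can be enforced when the model rows are built. The sketch below assumes a hypothetical retail scenario: inputs are computed only from purchases before a cutoff date, and the outcome (a cancellation) is taken only from events on or after it. The function name, field layout, and data are invented for illustration.

```python
from datetime import date


def build_model_rows(purchases, cancellations, cutoff):
    """Map each customer to (spend before cutoff, cancelled on or after cutoff)."""
    spend_before = {}
    for customer, day, amount in purchases:
        # Inputs may only use what was known before the cutoff.
        if day < cutoff:
            spend_before[customer] = spend_before.get(customer, 0.0) + amount
    # The outcome is observed strictly in the later period.
    cancelled_after = {customer for customer, day in cancellations if day >= cutoff}
    return {
        customer: (spend, customer in cancelled_after)
        for customer, spend in spend_before.items()
    }


purchases = [
    ("a", date(2023, 1, 10), 50.0),
    ("a", date(2023, 2, 5), 30.0),
    ("b", date(2023, 1, 20), 20.0),
    ("b", date(2023, 4, 2), 99.0),  # after the cutoff, so excluded from the inputs
]
cancellations = [("b", date(2023, 5, 1))]
rows = build_model_rows(purchases, cancellations, cutoff=date(2023, 3, 1))
print(rows)
```

If the April purchase were allowed into the inputs, the model would be trained on information that is not available at the time a real prediction has to be made.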

  11. Data Mining Methodology
      1. Translate the business opportunity (problem) into a DM opportunity (problem)
      2. Select appropriate data
      3. Get to know the data
      4. Create a model set
      5. Fix problems with the data
      6. Transform data to bring information to the surface
      7. Build models
      8. Assess models
      9. Deploy models
      10. Assess results
      11. Begin again

  12. In-Class Exercise
      • 10 teams
      • Each team takes one of methodology steps 1–10 (step 11 is skipped)
      • Discuss it and prepare a summary of 5 minutes or less for your colleagues
      • Each team presents its summary
      Discussion: 15 minutes
      Present: 45 minutes

  13. End of Chapter 3
