Introduction to Data Mining and Knowledge Discovery Process

Data mining and the knowledge discovery process Summer Course 2007 H.H.L.M. Donkers

Content • Opening / acquaintance • What is data mining • Data mining methodology • Course perspective • Course contents

Data - Information - Knowledge - • Data: symbols • Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions • Knowledge: application of data and information; answers "how" questions • Understanding: appreciation of "why" • Wisdom: evaluated understanding. (Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)

Data - Information - Knowledge - http://www.outsights.com/systems/dikw/dikw.htm

What is Data Mining – Traditionally “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.” Witten & Frank (2000). Data Mining.

What is Data Mining – Traditionally “The application of specific algorithms for extracting patterns from data, it is a part of knowledge discovery from databases” Fayyad (1997). From data mining to knowledge discovery in databases.

What is Data Mining – Traditionally “Data mining is a process, not just a series of statistical analyses.” SAS Institute (2003). Finding the solution to data mining.

What is Data Mining – Traditionally • Computer Science: (semi-)automated application of algorithms for pattern discovery; algorithms developed in the field of Artificial Intelligence (machine learning); part of the process of knowledge discovery • Statistics: process of discovering patterns in data; (manual) application of a series of statistical techniques (among which machine learning); incorporates exploration, sampling, modeling, and validation • Data mining = Statistics + Marketing

What is Data Mining – A Fusion “An analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction.” Statsoft (2003). Data Mining Techniques.
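The idea of validating a detected pattern on new subsets of data can be illustrated with a minimal hold-out sketch. Everything in it is an assumption made for illustration: the toy data, the function names, and the deliberately simple "pattern" (a single threshold on one variable). It is not taken from Statsoft or from the course material.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into a training part and a held-out part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def learn_threshold(train):
    """Detect a very simple pattern: a cut-off on x that separates the classes.

    The cut-off is the midpoint between the mean x of each class.
    """
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, rows):
    """Fraction of rows for which 'x > threshold' agrees with the label."""
    hits = sum(1 for x, y in rows if (x > threshold) == (y == 1))
    return hits / len(rows)

# Hypothetical toy data: the label is 1 exactly when x > 50.
data = [(x, int(x > 50)) for x in range(100)]

train, test = holdout_split(data)
threshold = learn_threshold(train)        # pattern found on the training part
print("accuracy on training data:", accuracy(threshold, train))
print("accuracy on held-out data:", accuracy(threshold, test))
```

The accuracy on the held-out part is the more trustworthy figure, because those rows played no role in finding the threshold.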

What is Data Mining – A Fusion “An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.” Rudjer Boskovic Institute (2001). DMS Tutorial.

Data Mining in this Course • We use the book by Witten & Frank • Computer science (machine learning) approach • Emphasis on algorithms for pattern discovery and rule extraction: what the underlying models are; what the properties of the algorithms are; when to use them (for which tasks); how to apply and tune them; how to interpret and assess the results

Data Mining Process • These algorithms are only part of a process that computer scientists call Knowledge Discovery and statisticians call Data Mining • The process starts with the recognition of a problem and ends with the monitoring of a deployed solution • The whole process needs to be supported for a successful application

Methodologies for Data Mining • As data mining comes of age, several methodologies have been developed, each with its own perspective. We will discuss three of them: • Fayyad et al. (computer science), e.g., WEKA • SEMMA (SAS) (statistics), e.g., SAS Enterprise Miner, R • CRISP-DM (SPSS, OHRA, among others) (business), e.g., SPSS Clementine

Fayyad’s KDD Methodology • Data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
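The chain of steps in Fayyad's process can be read as a sequence of functions, each consuming the output of the previous one. The following is a minimal sketch under stated assumptions: the customer records, the function names, and the choice of frequent item pairs as the mining step are all hypothetical and only serve to make the data flow concrete.

```python
from collections import Counter

# Hypothetical raw records: (customer, age, basket). Some are incomplete.
raw = [
    ("ann", 34, ["bread", "milk"]),
    ("bob", None, ["bread", "beer"]),
    ("cat", 51, ["Milk", "bread ", "eggs"]),
    ("dan", 29, []),
    ("eve", 42, ["bread", "milk", "beer"]),
]

def select(records):
    """Selection: keep only the field relevant to the question (the basket)."""
    return [basket for _, _, basket in records]

def preprocess(baskets):
    """Preprocessing & cleaning: drop empty baskets, normalise item names."""
    return [[item.strip().lower() for item in b] for b in baskets if b]

def transform(baskets):
    """Transformation: represent each basket as a set of items."""
    return [frozenset(b) for b in baskets]

def mine(itemsets, min_support=0.5):
    """Data mining: item pairs occurring in at least min_support of the baskets."""
    counts = Counter()
    for s in itemsets:
        items = sorted(s)
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                counts[(a, b)] += 1
    n = len(itemsets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def interpret(patterns):
    """Interpretation / evaluation: turn the patterns into readable statements."""
    return [
        f"{a} and {b} are bought together in {support:.0%} of baskets"
        for (a, b), support in sorted(patterns.items())
    ]

target = select(raw)               # target data
processed = preprocess(target)     # processed data
transformed = transform(processed) # transformed data
patterns = mine(transformed)       # patterns
knowledge = interpret(patterns)    # knowledge
for line in knowledge:
    print(line)
```

In this sketch the mining step is a single short function; most of the code deals with getting the data into a usable shape and reading the result back out.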

SEMMA Methodology • SAMPLE: Input data, Sampling, Data partition • EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection • MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen • MODEL: Regression, Tree, Neural Network, Ensemble • ASSESS: Assessment, Score, Report • Supported by the SAS Enterprise Miner environment

CRISP-DM Methodology • Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission • Tool-independent / industry-independent • Hierarchical process model: 1. Generic phases 2. Generic tasks 3. Specific tasks 4. Task instances • Supported by the SPSS Clementine environment

CRISP-DM Methodology • Phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment • Tasks in Business understanding: Business objective, Assess situation, Data mining goals, Project plan

CRISP-DM Methodology • Tasks in Data understanding: Collect data, Describe data, Explore data, Verify data quality

CRISP-DM Methodology • Tasks in Data preparation: Select data, Clean data, Construct data, Integrate data, Format data

CRISP-DM Methodology • Tasks in Modeling: Select modeling techniques, Design the test, Build model, Assess model

CRISP-DM Methodology • Tasks in Evaluation: Evaluate results, Review process, Determine next steps

CRISP-DM Methodology • Tasks in Deployment: Plan deployment, Plan monitoring and maintenance, Final report, Review project
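The phase and task lists on the CRISP-DM slides can be captured in a small data structure, for example to keep a project checklist. This is only an illustrative sketch: the names CRISP_DM and next_task are hypothetical, and walking through the phases strictly in order is a simplification, since a CRISP-DM project normally moves back and forth between phases.

```python
# Phases and tasks as listed on the slides, in process order.
CRISP_DM = {
    "Business understanding": [
        "Business objective", "Assess situation",
        "Data mining goals", "Project plan",
    ],
    "Data understanding": [
        "Collect data", "Describe data",
        "Explore data", "Verify data quality",
    ],
    "Data preparation": [
        "Select data", "Clean data", "Construct data",
        "Integrate data", "Format data",
    ],
    "Modeling": [
        "Select modeling techniques", "Design the test",
        "Build model", "Assess model",
    ],
    "Evaluation": [
        "Evaluate results", "Review process", "Determine next steps",
    ],
    "Deployment": [
        "Plan deployment", "Plan monitoring and maintenance",
        "Final report", "Review project",
    ],
}

def next_task(done):
    """Return the first (phase, task) pair not in `done`, or None if all are done.

    `done` is a set of (phase, task) pairs. Dictionaries keep insertion order,
    so the phases are visited in the order given above.
    """
    for phase, tasks in CRISP_DM.items():
        for task in tasks:
            if (phase, task) not in done:
                return phase, task
    return None

done = {("Business understanding", t) for t in CRISP_DM["Business understanding"]}
print(next_task(done))
```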

A Comparison • [Figure: the three process models side by side. Fayyad’s KDD steps (Selection, Preprocessing & cleaning, Transformation & feature selection, Data Mining, Interpretation / Evaluation), the SEMMA stages (Sample, Explore, Modify, Model, Assess), and the CRISP-DM phases (Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment).]

A Small Poll (July 2002) Source: http://www.kdnuggets.com/polls/2002/methodology.htm

Poll repeated (2004) Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm

Course perspective and goal • The perspective is that of computer science (machine learning): Fayyad’s approach • The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the Model stage of SEMMA and the Modeling phase of CRISP-DM) • The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice

Course contents • Data preparation (Tuesday): selection, preprocessing, transformation • Techniques, algorithms and models: decision trees (Monday); instance-based and Bayesian learning (Wednesday); neural networks (Wednesday); association rules (Thursday); clustering (Thursday); Support Vector Machines (Friday) • Evaluation of learned models (Tuesday)

Course contents • For each technique you learn: • For which tasks it is suitable (classification, rules, prediction, …) • Restrictions on input data (numerical, symbolic, etc.) • What algorithms are available • What parameters should be tuned • How to interpret the results • How to evaluate the model
