FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT
Ernestina Menasalvas
Facultad de Informática, Universidad Politécnica de Madrid, Spain
emenasalvas@fi.upm.es
November 2004
Background (I)
• 1995: doctoral student
• Visit to the University of Regina (Prof. Ziarko)
• Visit to Warsaw University (Prof. Pawlak)
• 1998: thesis defended. Data Mining process model (Anita Wasilewska & C. Fernandez-Baizan)
• Since then:
  • Databases professor: databases, data mining
  • Coordinator of the Data Mining group at Facultad de Informática, UPM
  • Techniques: Rough Sets, Bayes, …
  • Methodologies for data mining process management
  • Evaluation in Data Mining
  • Experimentation in Web Mining
  • Web Mining: Web Goal Mining
Background (II)
• Projects developed:
  • Pure research:
    • Data mining to be integrated into RDBMSs
    • Web Profiler
    • Methodology for data mining process management
  • Research and application:
    • Data mining applied to different domains:
      • Car dealers
      • Travel agency
      • …
Data Mining Project Development
• Methodologies for data mining project development
• Is data mining really a science?
• Are we developing projects as an art?
• Has research achieved the same results in all areas?
  • Algorithms
  • Data preparation
  • Data enrichment
  • Conceptualization of data mining problems
Data Mining: an art, a science?
• Since data mining appeared, many algorithms have been implemented
• Standards:
  • CRISP-DM
  • SEMMA
  • PMML 3.0
• The process depends on the expertise of the data miner
  • The user speaks about business problems
  • The data miner speaks about algorithms
Data Mining as a project
• Data mining is a data-intensive activity
  • Data understanding
  • Data preparation
  • Database manager:
    • Transactional databases
    • Data warehouses
• The end result of a data mining project is a tool (software project) for a better decision-making process:
  • Software development project
  • The IT department has to be involved
Project Management
• Why?
  • To organize the development process and to produce a project plan
• How?
  • Establish how the process is going to be developed:
    • Sequential
    • Incremental
• What?
  • Establish how the process is split into phases and define the tasks to be carried out in each step:
    • RUP
    • XP
    • CommonKADS
• LIFECYCLE MODELS
  • A way of doing things
  • Independent of the process being developed
• METHODOLOGY
  • Particular tasks
  • Detail of the tasks to be developed
Common pitfalls of data mining implementation
• The common pitfalls of data mining implementation are the following:
  • Not being able to communicate mining results efficiently within an organization
  • Not having the right data to conduct effective analysis
  • Not using existing data correctly
  • Not being able to evaluate results
• Questions that arise:
  • Can the adequacy of a data set for a problem be established when preparing the project plan?
  • How can the data set be used to produce the expected results?
  • How can we evaluate the results?
  • Cost estimation?
Data Mining Approaches
• Vendor independent:
  • CRISP-DM (process model)
• Based on commercial tools (not real methodologies):
  • CATs (based on CRISP-DM)
  • SEMMA
• CRM methodology:
  • CRM Catalyst (global CRM process; does not concentrate on the data mining step)
Data Mining as a project: CATs
• CATs: Clementine Application Templates [CATs]
  • Specific libraries of best practices that provide immediate value right out of the box
  • They follow the CRISP-DM standard: every CAT stream is assigned to a CRISP-DM phase
  • They provide long-term value, as they can always be used with a new data set for new insight in other projects
• Available as an add-on module to Clementine, they include:
  • Telco CAT: improve retention and cross-selling efforts for telecommunications
  • CRM CAT: understand and predict customer migration between segments
  • Microarray CAT: accelerate biological discoveries, find genes
  • Fraud CAT: predict and detect instances of fraud in financial transactions, claims, tax returns, …
  • Web CAT
SEMMA (1)
• SEMMA (Sample, Explore, Modify, Model, Assess) [SEMMA]:
  • Is not a data mining methodology
  • Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
  • Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client
  • Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project
SEMMA (2)
• SEMMA is focused on the model development aspects of data mining [SEMMA]:
  • Sample the data by extracting a portion of a large data set big enough to contain the significant information, yet small enough to manipulate quickly
  • Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas
  • Modify the data by creating, selecting and transforming the variables to focus the model selection problem
  • Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree classifiers, statistical models, etc.
  • Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs
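The five SEMMA steps above can be sketched in a few lines of plain Python. This is only an illustration and not SAS Enterprise Miner code: the record fields (`spend`, `visits`, `churned`), the derived variable and the one-threshold model are all assumptions made for the example.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: draw a random subset that is quick to manipulate."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def explore(rows):
    """Explore: mean spend per class, a first look for trends."""
    groups = {}
    for r in rows:
        groups.setdefault(r["churned"], []).append(r["spend"])
    return {label: statistics.mean(values) for label, values in groups.items()}


def modify(rows):
    """Modify: derive a new variable (spend per visit) paired with the label."""
    return [(r["spend"] / max(r["visits"], 1), r["churned"]) for r in rows]


def assess(threshold, pairs):
    """Assess: accuracy of the rule 'churned if x < threshold'."""
    return sum((x < threshold) == y for x, y in pairs) / len(pairs)


def model(pairs):
    """Model: search the observed values for the best-fitting threshold."""
    return max((x for x, _ in pairs), key=lambda t: assess(t, pairs))


if __name__ == "__main__":
    # Synthetic data (hypothetical): customers spending little per visit churn.
    rng = random.Random(1)
    data = []
    for _ in range(500):
        visits = rng.randint(1, 20)
        per_visit = rng.uniform(1.0, 30.0)
        data.append({"spend": per_visit * visits, "visits": visits,
                     "churned": per_visit < 10.0})

    subset = sample(data, 100)
    print("mean spend per class:", explore(subset))
    threshold = model(modify(subset))
    holdout = modify([r for r in data if r not in subset])
    print("threshold:", threshold, "holdout accuracy:", assess(threshold, holdout))
```

The model is assessed on records that were not in the sample, so the accuracy reflects how the rule behaves on unseen data rather than how well it memorised the sample.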
Methods for Project Management: CRM Catalyst (1)
• Developed jointly by CustomISe, MACS and SalesPathways, which together have formed the Catalyst Foundation: http://www.crmmethodology.com/
• Motivations:
  • CRM projects are difficult to execute successfully because of the wide range of factors influencing their success, so it can take a long time to make CRM work properly for an organisation
• Solution: CRM Catalyst
  • The methodology acts as a catalyst for CRM projects, enabling them to achieve their objectives more reliably and in less time
  • It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs
Methods for Project Management: CRM Catalyst (2)
• Implementation requires a data mining development process
• Progressive lifecycle model: the results are obtained in a progressive way
• Implementation is knowledge intensive: in some steps a knowledge-intensive methodology could be appropriate
Main steps in a Data Mining Project
• Define the goals:
  • Business and data mining experts together have to define the goals
  • Each goal must be defined with measurements for success
• Obtain the models:
  • Apply data mining algorithms
  • Preprocessing is important
• Evaluate results:
  • Ascertain the value of an object according to specified criteria, operationalised in terms of measures
• Deploy:
  • Decide which patterns and models can be deployed
• Evaluate:
  • Once the product is working, its results should be verified
1. Define the goals
• Distinguish between:
  • Data mining goals
  • Business goals
• How do we translate?
  • Business goal: increase the lifetime value of valuable customers
  • Data mining goal: classification? estimation? association?
• This has to be solved in the Business Understanding step of CRISP-DM
Business Understanding in the CRISP-DM Process
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Assess Situation: Inventory of Resources; Requirements, Assumptions & Constraints; Risks & Contingencies; Terminology; Costs & Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools & Techniques
1.1 Determine business objectives and success criteria
• Not only business objectives have to be established, but also measures that make it possible to evaluate the results
• Business objectives:
  • What is the customer's primary objective?
    • Increase the number of loyal customers
    • Sell more of a certain product
    • Run a successful marketing campaign
• Business success criteria:
  • What constitutes a successful outcome of the project?
  • Objective measures, so that success can be established
    • ROI
1.2 Costs & Benefits
• Perform a cost-benefit analysis
• Compute the benefits of the project
  • Which measures do we have?
    • ROI
    • CAPEX
    • OPEX, …
• Compute the costs of the project (equipment, human resources, …)
  • Which methodology do we have?
    • COCOMO for software
• Quantify the risk that the project fails
  • Knowledge not available
  • Data not available
  • Proper tools not available
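As a minimal sketch of the cost-benefit step, ROI can be computed as net benefit divided by cost. The second function, which weights the benefits by an estimated probability of success, is an assumption added here to illustrate "quantifying the risk"; the slides do not prescribe a formula for it.

```python
def roi(benefits: float, costs: float) -> float:
    """Return on investment: net benefit relative to the money spent."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs


def risk_adjusted_roi(benefits: float, costs: float, p_success: float) -> float:
    """ROI when benefits only materialise with probability p_success.

    Assumption for illustration: costs are always incurred, while benefits
    are weighted by the estimated probability that the project succeeds.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    return roi(p_success * benefits, costs)


if __name__ == "__main__":
    # Hypothetical figures: 150,000 expected benefit for 100,000 of cost.
    print(roi(150_000, 100_000))
    print(risk_adjusted_roi(150_000, 100_000, p_success=0.8))
```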
Data Mining Estimation Model
• Establishing a parametric estimation model for data mining (Marban’03)
• DMCOMO (Data Mining COst MOdel)
Data Mining Cost Estimation
• Main factors in a data mining project
  • Data sources (number, kind, nature, …)
  • Data mining problem to be solved (descriptive, predictive, …)
  • Development platform
  • Available tools
  • Expertise of the development team
• Drivers
  • Data drivers
  • Model drivers
  • Platform drivers
  • Tools and techniques drivers
  • Project drivers
  • People drivers
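To show what a parametric estimation model over these driver groups could look like, here is a hedged sketch. The linear form, the intercept and every weight below are placeholders invented for the example; they are not the published DMCOMO equation or its calibrated coefficients.

```python
# The six driver groups named on the slide.
DRIVER_GROUPS = ("data", "model", "platform", "tools", "project", "people")


def estimate_effort(ratings, weights, intercept):
    """Effort = intercept + sum(weight * rating) over the driver groups.

    ratings and weights are dicts keyed by driver group. The linear shape
    is an assumption for illustration only.
    """
    missing = [g for g in DRIVER_GROUPS if g not in ratings or g not in weights]
    if missing:
        raise ValueError(f"missing driver groups: {missing}")
    return intercept + sum(weights[g] * ratings[g] for g in DRIVER_GROUPS)


if __name__ == "__main__":
    # Hypothetical weights (effort units per rating point).
    weights = {"data": 2.0, "model": 1.5, "platform": 1.0,
               "tools": 0.5, "project": 1.0, "people": 3.0}
    # Hypothetical ratings assigned by the project team.
    ratings = {"data": 3, "model": 2, "platform": 1,
               "tools": 1, "project": 2, "people": 1}
    print(estimate_effort(ratings, weights, intercept=10.0))
```

In a real parametric model the weights would be calibrated by regression on data from finished projects, which is the point where an estimate becomes comparable across projects.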
1.3 Data mining goals and success criteria
• Data mining goals:
  • Translate the customer's primary objective into a data mining goal, e.g.
    • A loyalty program translated into a segmentation problem
    • Decreasing the attrition rate translated into a classification problem
• Data mining success criteria:
  • Determine success in technical terms
  • Translate the notion of success into confidence, support, lift and other parameters
  • Determine the cost of errors
• How do we make the translation?
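Support, confidence and lift have standard definitions for an association rule A → C, so they can be computed directly. The sketch below uses a made-up basket data set; the function name and the item names are illustrative only.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    transactions is a list of sets of items.
    support    = P(A and C)
    confidence = P(C | A)
    lift       = confidence / P(C)
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    if n == 0:
        raise ValueError("no transactions")
    n_a = sum(a <= t for t in transactions)         # contain the antecedent
    n_c = sum(c <= t for t in transactions)         # contain the consequent
    n_ac = sum((a | c) <= t for t in transactions)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(rule_metrics(baskets, {"bread"}, {"milk"}))
```

A lift below 1, as in this toy data, means the antecedent makes the consequent less likely than its base rate, which is why lift is a useful complement to confidence when defining success criteria.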
Methodology
• Which methodology should be followed to translate business objectives into data mining objectives?
• Unfortunately, there is no such methodology. First we have to answer:
  • How is a business objective expressed?
  • What is a data mining goal?
  • How are data mining goals achieved?
  • What are the requirements of data mining functions?
• In order to describe everything in a standard way: conceptualize the problem
Conceptualization in other disciplines
• Databases:
  • E/R diagrams
  • Independent of the domain
  • A tool for business understanding and for the database designer
  • Translation from E/R to implementation
• Levels shown in the diagram: external views (1 … n), conceptual schema, internal schema
Proposed 3-level architecture
• Business problems
• Conceptual schema: the requirements of the algorithms will be solved at this level
• Internal schema: tool requirements to be solved (SAS, WEKA, Clementine, …)
3-layer architecture for data mining
• It is the bridge:
  • Between business goals and the final tool
  • Independent of the domain
• It provides independence:
  • Changes in the tool do not affect the solution
• It has to be decided what to model in the conceptualization
• Automatic translation of business goals into data mining goals
• Data mining goals + constraints = feasible data mining goals
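The last bullet, "goals + constraints = feasible goals", can be read as a filter. The sketch below is one possible reading: the class names, fields and the three feasibility checks are assumptions for illustration, not part of the proposed conceptual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMiningGoal:
    name: str
    problem_type: str               # e.g. "classification", "segmentation"
    required_attributes: frozenset  # attributes the method needs as input
    min_rows: int                   # minimum amount of data considered enough


@dataclass(frozen=True)
class Constraints:
    available_attributes: frozenset
    available_rows: int
    supported_problem_types: frozenset  # what the chosen tool can solve


def feasible_goals(goals, constraints):
    """Keep only the goals whose requirements the constraints satisfy."""
    return [
        g for g in goals
        if g.problem_type in constraints.supported_problem_types
        and g.required_attributes <= constraints.available_attributes
        and g.min_rows <= constraints.available_rows
    ]


if __name__ == "__main__":
    goals = [
        DataMiningGoal("segment loyal customers", "segmentation",
                       frozenset({"purchases", "tenure"}), 1_000),
        DataMiningGoal("predict attrition", "classification",
                       frozenset({"tenure", "complaints"}), 5_000),
    ]
    limits = Constraints(frozenset({"purchases", "tenure"}), 20_000,
                         frozenset({"segmentation", "classification"}))
    for goal in feasible_goals(goals, limits):
        print(goal.name)
```

Because the constraints are kept separate from the goals, swapping the tool only changes `supported_problem_types`, which mirrors the independence the architecture is meant to provide.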
Elements to conceptualize
• Elements to be taken into account:
  • Data:
    • Quality from the data mining point of view
    • Adequacy for the problem
    • Classification for data mining purposes
  • Knowledge:
    • Related to the process being analyzed
    • Related to the data used
  • People:
    • Owners of data
    • Experts in the process
  • Data mining problem requirements
  • Data mining method requirements
DMMO
• Data Mining Modelling Objects:
  • Data
  • Knowledge
  • Constraints of data and applications
  • Data mining objects
    • Algorithms
    • Measures
    • Methods
• To bridge the gap between data miners and business users
Are data adequate for analysis?
• The adequacy of the data is analyzed taking into account the goals to be fulfilled
• Data, together with the knowledge extracted from the experts, can be transformed so that, when given as input to a certain data mining algorithm, they produce the required patterns
• Quality of the data, in this context:
  • is not only related to technical quality (proper model, percentage of null values),
  • but also has to do with:
    • the meaning of the attributes,
    • where each piece of data comes from,
    • the relationships among data, and
    • finally, how the data fulfil the requirements of the data mining functions
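The technical side of data quality mentioned above, the percentage of null values, is easy to measure. The sketch below covers only that side: the record layout, the attribute names and the 50% limit are hypothetical, and the semantic aspects (meaning, provenance, relationships) still need expert knowledge.

```python
def null_percentage(rows):
    """Percentage of missing values per attribute.

    rows is a list of dicts; an attribute that is absent from a record
    or set to None counts as null.
    """
    if not rows:
        return {}
    attributes = sorted({key for row in rows for key in row})
    return {
        attr: 100.0 * sum(row.get(attr) is None for row in rows) / len(rows)
        for attr in attributes
    }


def inadequate_attributes(rows, required, max_null_pct):
    """Required attributes that are absent or exceed the null limit."""
    pct = null_percentage(rows)
    # An attribute that never appears is treated as 100% null.
    return sorted(a for a in required if pct.get(a, 100.0) > max_null_pct)


if __name__ == "__main__":
    customers = [
        {"age": 30, "city": None},
        {"age": None, "city": None},
        {"age": 41, "city": "Madrid"},
        {"age": 25},
    ]
    print(null_percentage(customers))
    print(inadequate_attributes(customers, ["age", "city", "income"], 50.0))
```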
2. Data Mining: obtain models
• Apply the data mining process model
• Associated problems solved by the 3-layer architecture:
  • Comparison of approaches
  • Evaluation of costs
  • Pros and cons of approaches
• Only experience or a conceptualization can help
• The conceptual model will help to establish the process to obtain each feasible model
  • Requirements and transformations are implicit in the model
2.1 Determine type of problem
• What are data mining problems?
  • Classification
  • Estimation
  • Association
  • Segmentation
• In the conceptual model, the requirements for each type will be established
2.2 Apply the CRISP-DM process model
• The data mining problem has to be settled before going into the modeling step
  • Requirements will be established in Business Understanding
  • Requirements will be checked in Data Understanding and Data Preparation
  • Preparation will be guided by the conceptual model
  • Feasibility can be evaluated before applying the model
• CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
3. Evaluate results [Spiliopoulou, Berendt]
• Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures
  • Object = the model already obtained
  • Criteria and measures have to do with the goals
• Evaluation requires a well-defined notion of success, which must be in place before
  • the evaluation takes place
  • the data mining phase starts
  • any work with the data starts
  • i.e. already during the business understanding phase
• Here once again conceptualization plays its role
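One way to read "criteria operationalised in terms of measures" is as a list of thresholds fixed before any modelling, against which the obtained model is later checked. The sketch below assumes a binary classifier summarised by its confusion-matrix counts; the class name and the threshold values are illustrative, not taken from the tutorial cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    measure: str      # name of the measure, e.g. "precision"
    threshold: float  # minimum value agreed during business understanding


def measures(tp, fp, fn, tn):
    """Standard measures derived from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def evaluate(criteria, observed):
    """For each criterion, report whether the observed measure meets it."""
    return {c.measure: observed[c.measure] >= c.threshold for c in criteria}


if __name__ == "__main__":
    # Success criteria written down before any work with the data starts.
    agreed = [Criterion("precision", 0.75), Criterion("recall", 0.70)]
    # Counts observed later, once a model has been obtained (made-up numbers).
    observed = measures(tp=40, fp=10, fn=20, tn=30)
    print(observed)
    print(evaluate(agreed, observed))
```

Keeping the criteria as data that exists before the model does makes it harder to redefine success after seeing the results.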
Evaluation in the CRISP-DM Process
• The CRISP-DM process is
  • a never-ending circle of iterations
  • a non-sequential process, where backtracking to previous phases is usually necessary
• In each sequential instantiation, evaluation takes place:
  • But it is a cycle
  • In all iterations, all the steps should be revisited
  • Results have to be evaluated!
• CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
4. Deployment
• All the models that have a positive evaluation can be deployed
• For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project
• The real evaluation has not yet been performed
5. Evaluate after deployment
• After deployment, there is a need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company
• None of the obvious claims about the success of data mining have ever been systematically tested
• Experiments are crucial to establish whether the impact of the deployment is really positive or negative
• Experiments have to be designed at the beginning of the project
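A controlled experiment is one common way to separate the effect of the deployed model from other factors: one group receives the model-driven action, a control group does not, and the outcomes are compared. The sketch below uses a two-proportion z-test as an example of such a comparison; the slides do not name a specific test, and the figures are invented.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Group A: customers treated according to the deployed model.
    Group B: control group, left untreated.
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0, 1.0  # all outcomes identical: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical campaign: 150 of 1000 treated customers respond,
    # against 100 of 1000 in the control group.
    z, p = two_proportion_z_test(150, 1000, 100, 1000)
    print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The group sizes and the assignment of customers to groups have to be decided before deployment, which is why the experiment must be designed at the beginning of the project.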
Conclusions
• Data mining projects are being developed more as an art than as a science
• Many algorithms have been implemented, but no systematic proof that one is better than another in real cases is obtained after deployment
• A conceptual model is required:
  • To map business goals to the model
  • To map data mining algorithms to a conceptual model
• Achievements of the model:
  • It will be used throughout the process to guide the project
  • Evaluation tool
Future work
• Conceptual model
  • Define DMMO objects
• Evaluation techniques related to the model:
  • Evaluate data mining goals
  • Evaluate business goals
• Experimentation methods:
  • obtrusive and
  • non-obtrusive
References
• Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas: Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20 September 2004
• Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a Methodology for Data Mining Project Development: The Importance of Abstraction
• Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer 2004
• Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002: 154-169
• www.crisp-dm.org
• www.spss.com/clementine/cats.htm
• www.sas.com/technologies/analytics/datamining/miner/semma.html
• www.crmmethodology.com
• www.emetrics.org/articles/whitepaper.html