Issues in Data Mining Infrastructure

Authors:
Nemanja Jovanovic, nemko@acm.org
Valentina Milenkovic, tina@eunet.yu
Voislav Galic, vgalic@bitsyu.net
Dusan Zecevic, zdusan@titan.etf.bg.ac.yu
Sonja Tica, ticaz@eunet.yu
Prof. Dr. Dusan Tosic, dtosic@matf.bg.ac.yu
Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu
Data Mining in a Nutshell • Uncovering the hidden knowledge • Huge NP-complete search space • Multidimensional interface NOTICE: All trademarks and service marks mentioned in this document are marks of their respective owners. Furthermore, the CRISP-DM consortium (NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands)) permitted presentation of their process model.
A Problem … You are a marketing manager for a cellular phone company • Problem: churn is too high • Turnover (after contract expires) is 40% • Customers receive a free phone (cost: $125) • You pay a sales commission of $250 per contract • Giving a new telephone to everyone whose contract is expiring is expensive • Bringing back a customer after quitting is both difficult and expensive
… A Solution • Three months before a contract expires, predict which customers will leave • If you want to keep a customer that is predicted to churn, offer them a new phone • The ones that are not predicted to churn need no attention • If you don’t want to keep the customer, do nothing • How can you predict future behavior? • Tarot Cards? • Magic Ball? • Data Mining?
The Definition The automated extraction of predictive information from (large) databases • Automated • Extraction • Predictive • Databases
Repetition in Solar Activity • 1613 – Galileo Galilei • 1843 – Heinrich Schwabe
The Return of Halley's Comet Edmund Halley (1656 – 1742) Recorded returns: 239 BC … 1531, 1607, 1682, 1910, 1986, 2061 (?)
Data Mining is Not • Data warehousing • Ad-hoc query/reporting • Online Analytical Processing (OLAP) • Data visualization
Data Mining is • Automated extraction of predictive information from various data sources • Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines
Data Mining can • Answer questions that were too time-consuming to resolve in the past • Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions
Neural Networks • Each neuron characterizes its processed data with a single numeric value • Efficient modeling of large and complex problems • Based on biological structures – neurons • A network consists of neurons grouped into layers
Neuron Functionality [Figure: a neuron with inputs I1 … In weighted by W1 … Wn, combined by the function f into a single output] Output = f (W1*I1, W2*I2, …, Wn*In)
Neural Networks • Once trained, neural networks can efficiently estimate the value of an output variable for a given input • Neurons and network topology are essential • Usually used for prediction or regression problem types • Difficult to understand • Data pre-processing often required
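A minimal Python sketch of the neuron above (our illustration, not the authors' code; the sigmoid activation, and reading f as acting on the weighted sum, are assumptions):

import math

def neuron(inputs, weights, f=lambda x: 1 / (1 + math.exp(-x))):
    # One neuron: weighted sum of inputs passed through an activation f.
    # The sigmoid default is just one common choice of activation.
    total = sum(w * i for w, i in zip(weights, inputs))
    return f(total)

# Example: three inputs and three weights
print(neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.3]))  # ~0.56

A trained network is just layers of such neurons, where training adjusts the weights W1 … Wn.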
Decision Trees • A way of representing a series of rules that lead to a class or value • Iterative splitting of data into discrete groups, maximizing the distance between them at each split • Classification trees and regression trees • Univariate splits and multivariate splits • Unlimited growth and stopping rules • CHAID, CART, QUEST, C5.0
Decision Trees [Figure: an example tree – the root splits on Balance (<= 10 vs. > 10), one branch then splits on Age (<= 32 vs. > 32), and another on Married (NO vs. YES)]
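A sketch of how such a tree could be grown with scikit-learn (our choice of library, not named in the slides; the feature names and toy data are invented to mirror the example tree):

from sklearn.tree import DecisionTreeClassifier

# Toy cases: [balance, age, married (0/1)] -> class (0 = stays, 1 = churns)
X = [[5, 25, 0], [50, 40, 1], [8, 30, 1], [120, 35, 0], [3, 45, 1], [60, 28, 0]]
y = [1, 0, 1, 0, 1, 0]

# max_depth acts as a stopping rule; without one the tree can grow without limit
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)
print(tree.predict([[15, 33, 1]]))  # predicted class for a new case

Note that scikit-learn trees use univariate splits (one attribute per split) – one of the two split types mentioned above.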
Rule Induction • Method of deriving a set of rules to classify cases • Creates independent rules that are unlikely to form a tree • Rules may not cover all possible situations • Rules may sometimes conflict in a prediction
Rule Induction
If balance > 100,000 then confidence = HIGH & weight = 1.7
If balance > 25,000 and status = married then confidence = HIGH & weight = 2.3
If balance < 40,000 then confidence = LOW & weight = 1.9
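One way to apply such independent rules in code (a sketch; the weight-voting conflict resolution is our assumption, since the slides only note that rules may conflict):

# Each rule: (condition, predicted class, weight)
rules = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]

def classify(case):
    # Fire all matching rules; resolve conflicts by total weight per class.
    votes = {}
    for cond, label, weight in rules:
        if cond(case):
            votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get) if votes else None  # None: no rule covers the case

print(classify({"balance": 30_000, "status": "married"}))  # rules 2 and 3 conflict -> HIGH (2.3 > 1.9)

The None branch illustrates the point above: unlike a tree, an induced rule set may leave some cases uncovered.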
K-nearest Neighbor and Memory-Based Reasoning (MBR) • Uses knowledge of previously solved, similar problems to solve a new problem • Assigns a new case to the class to which most of its k "neighbors" belong • First step – finding a suitable measure for the distance between attributes in the data • + Easy handling of non-standard data types • – Huge models
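A minimal k-nearest-neighbor sketch (ours; Euclidean distance stands in for the "suitable measure" of distance mentioned above):

from collections import Counter
import math

def knn_classify(query, examples, k=3):
    # Classify `query` by majority vote among the k nearest labelled examples.
    # `examples` is a list of (feature_vector, label) pairs.
    by_distance = sorted(examples, key=lambda e: math.dist(query, e[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((2, 1), "A")]
print(knn_classify((2, 2), train))  # "A" – the 3 nearest neighbors are all A

The "huge models" drawback is visible here: the entire training set must be kept in memory at prediction time.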
Data Mining Algorithms • Many other available models and algorithms • Logistic regression • Discriminant analysis • Generalized Additive Models (GAM) • Genetic algorithms • The Apriori algorithm • Etc. • Many application-specific variations of known models • Final implementation usually involves several techniques
The Apriori Algorithm • The task – mining association rules by finding large itemsets and translating them into the corresponding association rules • A ⇒ B, or A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn, where A ∩ B = ∅ • The terminology • Support – the fraction of transactions containing all the items of an itemset • Confidence – for a rule A ⇒ B, support(A ∪ B) / support(A) • k-itemset – a set of k items • Large itemsets – itemsets whose support reaches a minimum threshold; the large itemset {A, B} corresponds to the following rules (implications): A ⇒ B and B ⇒ A
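These two measures are easy to state in code (a sketch; the toy database of transactions-as-sets is our own):

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # For the rule A => B: support(A ∪ B) / support(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(support({"B", "E"}, db))       # 0.75
print(confidence({"B"}, {"E"}, db))  # 1.0 – every transaction with B also contains E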
The Apriori Algorithm • The ⊗ operator definition • n = 1: S2 = S1 ⊗ S1 = {{A}, {B}, {C}} ⊗ {{A}, {B}, {C}} = {{A,B}, {A,C}, {B,C}} • n = k: Sk+1 = Sk ⊗ Sk = {X ∪ Y | X, Y ∈ Sk, |X ∩ Y| = k−1} • X and Y must have the same number of elements, and must have exactly k−1 identical elements • Every k-element subset of any resulting set element (an element is actually a (k+1)-element set) has to belong to the original set of itemsets
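The ⊗ operator translates to a few lines of Python (a sketch, ours; frozensets stand in for itemsets so they can be stored inside sets):

from itertools import combinations

def apriori_join(itemsets, k):
    # S_{k+1} = S_k ⊗ S_k: keep X ∪ Y for pairs sharing exactly k-1 items,
    # pruning any result with a k-element subset missing from S_k.
    itemsets = set(itemsets)
    result = set()
    for x, y in combinations(itemsets, 2):
        if len(x & y) == k - 1:
            u = x | y
            if all(frozenset(s) in itemsets for s in combinations(u, k)):
                result.add(u)
    return result

s1 = {frozenset({"A"}), frozenset({"B"}), frozenset({"C"})}
print(apriori_join(s1, 1))  # {{A,B}, {A,C}, {B,C}}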
The Apriori Algorithm • Example – the classic four-transaction database (minimum support 50%, i.e. two transactions):

TID | Items
100 | A, C, D
200 | B, C, E
300 | A, B, C, E
400 | B, E
The Apriori Algorithm • Step 1 – generate the candidate set of 1-itemsets, C1 • Every possible 1-element set from the database is potentially a large itemset, because we don't know the number of its appearances in the database in advance (a priori) • The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set • C1 = {{A}, {B}, {C}, {D}, {E}} • Now we scan the entire database to count the number of appearances of each of these elements (i.e. one-element sets)
The Apriori Algorithm • The scan gives the following counts: A – 2, B – 3, C – 3, D – 1, E – 3 (out of four transactions)
The Apriori Algorithm • Step 2 – generate the set of large 1-itemsets, L1 • Each element of C1 whose support reaches some adopted minimum support (for example, 50%) becomes a member of L1 • L1 = {{A}, {B}, {C}, {E}}, and we can omit D in further steps (if D doesn't have enough support alone, there is no way it could satisfy the required support in combination with some other element(s))
The Apriori Algorithm • Step 3 – generate the candidate set of large 2-itemsets, C2 • C2 = L1 ⊗ L1 = {{A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}} • Count the corresponding appearances • Step 4 – generate the set of large 2-itemsets, L2 • Eliminate the candidates without minimum support • L2 = {{A,C}, {B,C}, {B,E}, {C,E}}
The Apriori Algorithm • Step 5 (C3) • C3 = L2 ⊗ L2 = {{B,C,E}} • Why not {A,B,C} and {A,C,E}? Because their 2-element subsets {A,B} and {A,E} are not elements of the large 2-itemset set L2 (the calculation is made according to the ⊗ operator definition) • Step 6 (L3) • L3 = {{B,C,E}}, since {B,C,E} satisfies the required support of 50% (two appearances) • There can be no further steps in this particular case, because L3 ⊗ L3 = ∅ • Answer = L1 ∪ L2 ∪ L3
The Apriori Algorithm

L1 = {large 1-itemsets};
for (k = 2; Lk−1 ≠ ∅; k++) begin
  Ck = apriori-gen(Lk−1);
  forall transactions t ∈ D do begin
    Ct = subset(Ck, t);
    forall candidates c ∈ Ct do
      c.count++;
  end;
  Lk = {c ∈ Ck | c.count ≥ minsup};
end;
Answer = ∪k Lk;
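A runnable Python rendering of this pseudocode (a sketch under our own naming; it inlines a simplified apriori-gen without the subset-pruning step, which the counting pass compensates for):

from collections import defaultdict

def apriori(transactions, minsup):
    # Return every itemset whose support count is at least `minsup`
    # (an absolute count here, e.g. 2 of 4 transactions for 50%).
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                      # first scan: count 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    large = {s for s, c in counts.items() if c >= minsup}   # L1
    answer = set(large)
    k = 1
    while large:
        # candidate generation: join L_k with itself (apriori-gen)
        candidates = {x | y for x in large for y in large if len(x & y) == k - 1}
        counts = defaultdict(int)
        for t in transactions:                  # one database scan per pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= minsup}
        answer |= large
        k += 1
    return answer

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
for itemset in sorted(apriori(db, minsup=2), key=len):
    print(set(itemset))

Run on the example database, this prints exactly L1 ∪ L2 ∪ L3 from the worked example above.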
The Apriori Algorithm • Enhancements to the basic algorithm • Scan reduction • The most time-consuming operation in the Apriori algorithm is the database scan; it is originally performed after each candidate-set generation, to determine the frequency of each candidate in the database • Scan-number reduction – counting candidates of multiple sizes in one pass • Rather than counting only candidates of size k in the kth pass, we can also calculate the candidates C′k+1, where C′k+1 is generated from Ck (instead of Lk), using the ⊗ operator
The Apriori Algorithm • Compare: C′k+1 = Ck ⊗ Ck vs. Ck+1 = Lk ⊗ Lk • Note that C′k+1 ⊇ Ck+1 • This variation can pay off in later passes, when the cost of counting and keeping in memory the additional C′k+1 − Ck+1 candidates becomes less than the cost of scanning the database • There has to be enough space in main memory for both Ck and C′k+1 • Following this idea, we can make a further scan reduction: • C′k+1 is calculated from C′k for k > 1 • There must be enough memory space for all the C′k's (k > 1) • Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk's)
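A sketch of the one-pass, two-sizes counting idea (ours; it assumes all candidates fit in main memory, as the slide requires):

def count_two_sizes(transactions, ck, k):
    # One scan counts candidates of size k (Ck) and size k+1 (C'_{k+1}),
    # where C'_{k+1} = Ck ⊗ Ck is generated from Ck instead of Lk.
    ck1 = {x | y for x in ck for y in ck if len(x & y) == k - 1}  # superset of C_{k+1}
    counts = {c: 0 for c in ck | ck1}
    for t in transactions:
        for c in counts:
            if c <= t:
                counts[c] += 1
    return counts

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
c2 = {frozenset(p) for p in ("AB", "AC", "AE", "BC", "BE", "CE")}
counts = count_two_sizes(db, c2, 2)
# Both L2 and L3 can now be read off `counts` without another scan:
print({c for c, n in counts.items() if len(c) == 3 and n >= 2})  # {{B, C, E}}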
The Apriori Algorithm • Abstraction levels • Higher level associations are stronger (more powerful), but also less certain; • A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction)
DM Process Model • 5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate) • SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess) • CRISP-DM – tends to become a standard
CRISP – DM • CRoss-Industry Standard Process for Data Mining • Conceived in 1996 by three companies: Daimler-Benz (later DaimlerChrysler), ISL (later part of SPSS) and NCR
CRISP – DM methodology • Four-level breakdown of the CRISP-DM methodology: • Phases • Generic Tasks • Specialized Tasks • Process Instances
Mapping generic models to specialized models • Analyze the specific context • Remove any details not applicable to the context • Add any details specific to the context • Specialize generic contents according to the concrete characteristics of the context • Possibly rename generic contents to provide more explicit meanings
CRISP – DM model [Figure: the CRISP-DM cycle] • Business understanding • Data understanding • Data preparation • Modeling • Evaluation • Deployment
Business Understanding • Determine business objectives • Assess situation • Determine data mining goals • Produce project plan
Data Understanding • Collect initial data • Describe data • Explore data • Verify data quality
Data Preparation • Select data • Clean data • Construct data • Integrate data • Format data
Modeling • Select modeling technique • Generate test design • Build model • Assess model
Evaluation (results = models + findings) • Evaluate results • Review process • Determine next steps
Deployment • Plan deployment • Plan monitoring and maintenance • Produce final report • Review project