This article provides an overview of data mining: its definitions, its basic functions, and two of its core tasks, association rule mining and classification. It also discusses the difference between data mining and knowledge discovery in databases (KDD).
Data Mining - Introduction
By Dr. M. Nandhini, Assistant Professor, Dept. of CS, Government Arts College, Udumalpet.
Topics to be discussed
What is Data Mining?
● Definitions
● Views on the Process
● Basic Functions
Data Mining - Definitions
– "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data."
– "...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams."
– "...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful..."
– "...finding hidden information in a database."
– "...the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database."
Data Mining Functions
All data mining functions can be thought of as attempting to find a model that fits the data.
– Predictive models predict unknown values based on known data.
– Descriptive models identify patterns in data.
Each type has several sub-categories, each of which has many algorithms.
Data Mining vs KDD
Many texts treat KDD and data mining as the same process, but it is also possible to think of data mining as the discovery step of KDD. Dunham:
• KDD is the process of finding useful information and patterns in data.
• Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.
What Motivated Data Mining?
A huge amount of raw data is available. The motivation for data mining is to
• analyse,
• classify,
• cluster, and
• characterize that data.
Association Rule Mining
• Association rule mining is a popular and well-researched method for discovering interesting relationships between itemsets in large databases.
• Using different measures, strong rules are discovered for finding frequent itemsets.
• A database D consists of transactions.
• Each transaction consists of items drawn from an itemset I = {X, Y, Z, ...}.
• X ⇒ Y is an association rule, where X, Y ⊆ I and X ∩ Y = ∅.
Measures: Support and Confidence
• Support of X ⇒ Y: the percentage of transactions that contain X ∪ Y.
• Confidence of X ⇒ Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
• Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.
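To make these measures concrete, here is a minimal Python sketch; the five transactions are invented for illustration and are not data from the slides:

```python
# Toy transaction database: each transaction is a set of items.
# (Illustrative data only, not taken from the slides.)
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(x | y) / support(x)

print(support({"bread", "milk"}))       # 0.6
print(confidence({"bread"}, {"milk"}))  # 0.75
```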
Literature Survey - Basic Methodologies
Apriori algorithm
- A k-itemset is frequent only if all of its sub-itemsets are frequent (downward closure property).
Frequent Pattern (FP) tree based method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones.
- Avoids candidate generation.
ECLAT (Equivalence Class Transformation)
- Vertical data format.
- Frequent itemsets are generated similarly to Apriori, with depth-first computation.
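To give the vertical data format a concrete shape, here is a small sketch (toy tid-lists, not from the slides) showing that in ECLAT the support count of an itemset is the size of the intersection of its items' transaction-id lists:

```python
# Vertical representation: item -> set of transaction ids (tid-list).
# ECLAT computes an itemset's support count as the size of the
# intersection of its items' tid-lists. (Toy data for illustration.)
vertical = {
    "bread": {1, 2, 4, 5},
    "milk":  {1, 3, 4, 5},
    "beer":  {2, 3, 4},
}

# Support count of {bread, milk} = |tids(bread) & tids(milk)|
support_count = len(vertical["bread"] & vertical["milk"])
print(support_count)  # 3
```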
Apriori Algorithm
Consider a particular departmental store and a sample of its transactions.
Apriori Algorithm The above table can be represented in binary format as below
Apriori Algorithm
The 1-itemsets are then generated from the binary table.
Apriori Algorithm
Taking the support threshold as 60% of the 5 transactions, i.e. a minimum support count of 3, cola and eggs are discarded from the itemsets because they occur fewer than 3 times. Of the 6 items in the 1-itemsets, 2 are discarded and 4 remain, so there are 4 frequent 1-itemsets, from which C(4,2) = 6 candidate 2-itemsets are generated.
Apriori Algorithm
Among the candidate 2-itemsets, (beer, bread) and (beer, milk) are discarded because they occur fewer than 3 times (the threshold). The 3-itemsets are then generated from the frequent 2-itemsets, and the candidate whose support count is at least 3 is retained. No 4-itemset can be formed, because only one frequent 3-itemset was found; a 4-itemset candidate would require two frequent 3-itemsets to combine.
Apriori Algorithm
In conclusion, Apriori generated the 1-, 2- and 3-itemsets as C(6,1) + C(4,2) + C(4,3) = 6 + 6 + 4 = 16 candidates, whereas a brute-force strategy would generate C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41. Hence, Apriori generates only the optimum, required number of itemsets.
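The whole walkthrough can be reproduced with a short Apriori sketch in Python. The store's transaction table is not reproduced in the text, so the transactions below are a reconstruction consistent with the counts quoted above; min_count = 3 mirrors the 60% threshold:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): sum(1 for t in transactions if i in t)
                for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

# Transactions reconstructed to match the counts described above.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, count in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```

Run on this data, the sketch keeps 4 frequent 1-itemsets, 4 frequent 2-itemsets, and the single frequent 3-itemset, matching the candidate counts above.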
CLASSIFICATION - Knowledge for Making Decisions
There are three ways in which humans acquire knowledge:
• learning by being told
• learning by observation and discovery
• learning by induction.
• The knowledge engineer acquires knowledge from the domain expert by the first two means.
• Knowledge acquisition is the activity whereby the machine acquires expert knowledge in order to solve specific problems; it results in a knowledge base.
• Expert systems need large amounts of knowledge to achieve high levels of performance, and the process of obtaining knowledge from experts is tedious and time-consuming.
The Map of Knowledge Acquisition
As the figure shows, there are two routes from human know-how to machine know-how. One is articulation, aided by knowledge engineers. The other is rule induction, performed by efficient inductive-inference algorithms.
A Comparison between Induction and Deduction
• In logic terms, the induction task is: given Q, find a P such that P → Q.
• The deduction task: given P and P → Q, conclude Q.
• Induction is generalizing from the particular, whereas deduction is specializing from the general.
• Induction is determining the formula from a set of values, whereas deduction is finding a value from a formula.
• Induction is the process of extracting a pattern from examples.
Induction (contd.)
The induction task is characterized by:
• a training set consisting of objects, each described by a fixed set of attributes;
• an outcome (class) known for each example object;
• the goal: find a rule that expresses an object's class in terms of the values of its attributes.
A specific example:
• Find a means of predicting which company profiles will lead to an increase or decrease in profits, based on the following data: Age, Competition, and Type.
This problem has three attributes (Age, Competition, and Type) and one outcome/class attribute (Profit).
• A simple algorithm would be to select the first attribute as the root, the next at the next level, and so on, until the leaf node becomes the goal, as sketched below.
• Such a situation leads to the following rule structure:
• If Age = X and Competition = Y and Type = Z then Profit = ?
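As a sketch, that exhaustive rule structure amounts to a lookup table keyed on every attribute combination; the attribute values below are placeholders, since the slide's data table is not reproduced here:

```python
# Brute-force rule table: one rule per (Age, Competition, Type) combination.
# Attribute values are placeholders; the slide's actual table is not shown.
rules = {
    ("old", "no", "software"): "down",
    ("midlife", "yes", "hardware"): "down",
    ("new", "no", "software"): "up",
    # ... one entry for every combination observed in the training data
}

def predict(age, competition, type_):
    """Look up Profit for a full attribute combination; no generalization."""
    return rules.get((age, competition, type_), "unknown")

print(predict("new", "no", "software"))  # -> "up"
```

The weakness of this structure is that it merely memorizes the training data; the induction methods that follow aim for a tree that generalizes by testing the most informative attributes first.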
Classification - Definition
• Given a collection of records (the training set); each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification - A Two-Step Process
• Model construction: describing a set of predetermined classes.
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
– The set of tuples used for model construction is the training set.
– The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
– Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
Classification Process (1): Model Construction
Training data is fed to a classification algorithm, which produces the classifier (model), e.g. IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
Classification Process (2): Use the Model in Prediction
The classifier is run against testing data to estimate its accuracy, then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?
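Step 2 in code: a minimal sketch that applies the rule learned in step 1 to the unseen record from the slide:

```python
def classify(name, rank, years):
    """Model from step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Unseen record from the slide: (Jeff, Professor, 4).
print(classify("Jeff", "professor", 4))  # -> "yes" (rank alone satisfies the rule)
```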
Examples of Classification Task • Predicting tumor cells as malignant or not • Classifying credit card transactions as legitimate or fraudulent • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Methods of Classification
• Decision tree based methods
• Rule-based methods
• Memory-based reasoning
• Neural networks
• Naïve Bayes and Bayesian belief networks
• Support vector machines
Decision Tree Classification Task (figure: inducing a decision tree model)
Example of a Decision Tree
Training data: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).
Model (decision tree), with Refund as the first splitting attribute:
– Refund = Yes → NO
– Refund = No → MarSt:
  – Married → NO
  – Single, Divorced → TaxInc:
    – < 80K → NO
    – > 80K → YES
Another Example of a Decision Tree
The same training data admits a different tree, with MarSt as the first splitting attribute:
– MarSt = Married → NO
– MarSt = Single, Divorced → Refund:
  – Yes → NO
  – No → TaxInc:
    – < 80K → NO
    – > 80K → YES
There could be more than one tree that fits the same data!
Apply Model to Test Data
Start from the root of the tree and, at each node, follow the branch matching the test record's attribute value until a leaf is reached. For the test record with Refund = No and MarSt = Married, the path ends at the Married leaf, so assign Cheat to "No".
Evaluating the Classifier - Confusion Matrix
A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data. The matrix is N×N, where N is the number of target values (classes). Performance of such models is commonly evaluated using the data in the matrix.
Confusion Matrix
Example of a confusion matrix for two classes (TP = true positives, FN = false negatives, FP = false positives, TN = true negatives):
– Actual positive: TP predicted positive, FN predicted negative
– Actual negative: FP predicted positive, TN predicted negative
Evaluating the Classifier
• Accuracy: the proportion of the total number of predictions that were correct. Accuracy = (TP + TN)/All
• Positive Predictive Value or Precision: the proportion of predicted positive cases that were correct. Precision = TP/(TP + FP)
• Negative Predictive Value: the proportion of predicted negative cases that were correct. NPV = TN/(TN + FN)
• Sensitivity or Recall: the proportion of actual positive cases that are correctly identified. Sensitivity = TP/P = TP/(TP + FN)
• Specificity: the proportion of actual negative cases that are correctly identified. Specificity = TN/N = TN/(TN + FP)
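These definitions translate directly into code; a short sketch with invented counts for the four cells of the matrix:

```python
# Invented counts for a two-class confusion matrix.
TP, FN, FP, TN = 40, 10, 5, 45

P = TP + FN  # actual positives
N = FP + TN  # actual negatives

accuracy    = (TP + TN) / (P + N)  # correct predictions over all predictions
precision   = TP / (TP + FP)       # positive predictive value
npv         = TN / (TN + FN)       # negative predictive value
sensitivity = TP / P               # recall
specificity = TN / N

print(accuracy, precision, npv, sensitivity, specificity)
# 0.85 0.888... 0.818... 0.8 0.9
```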
ID3 - Classification using Decision Trees
• ID3 creates the tree using information-theory concepts and tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest information gain, using entropy as the basis for the calculation.
Classification by Decision Trees (ID3)
• The following table consists of training data from a mobile store. Construct a decision tree based on this data, using the basic algorithm for decision tree induction. Classify the records by the "Customer satisfaction" attribute.
Model Construction - Iteration 1
• The root node is to be identified for constructing the decision tree; the attribute with the highest information gain is selected as the root.
• Overall information of the training dataset with binary classes:
– Class P: customer satisfaction = "yes"
– Class N: customer satisfaction = "no"
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
I(9, 6) = -(9/15) log2(9/15) - (6/15) log2(6/15) = 0.971
Step 1: Attribute A1 = "Battery Life"
E(A1) = (5/15) I(5, 0) + (5/15) I(2, 3) + (5/15) I(2, 3)
= (5/15)[-(5/5) log2(5/5) - (0/5) log2(0/5)] + (5/15)[-(2/5) log2(2/5) - (3/5) log2(3/5)] + (5/15)[-(2/5) log2(2/5) - (3/5) log2(3/5)]
= 0.6473 (taking 0 log 0 = 0)
Gain(A1) = I(p, n) - E(A1) = 0.971 - 0.6473 = 0.3237
Step 1: Attribute A2 = "Memory"
E(A2) = (7/15) I(3, 4) + (8/15) I(6, 2) = 0.8925
Gain(A2) = I(p, n) - E(A2) = 0.971 - 0.8925 = 0.0785
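The arithmetic above can be verified with a short Python sketch; the (p, n) partitions are taken from the worked example, and all logarithms are base 2:

```python
from math import log2

def info(p, n):
    """Expected information I(p, n) for a two-class set (0 log 0 taken as 0)."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

def gain(parent, partitions):
    """Information gain of an attribute whose values split the
    parent (p, n) set into the given list of (p, n) partitions."""
    total = sum(p + n for p, n in partitions)
    entropy = sum((p + n) / total * info(p, n) for p, n in partitions)
    return info(*parent) - entropy

print(round(info(9, 6), 4))                              # 0.971
print(round(gain((9, 6), [(5, 0), (2, 3), (2, 3)]), 4))  # 0.3237, Gain(A1)
print(round(gain((9, 6), [(3, 4), (6, 2)]), 4))          # 0.0785, Gain(A2)
```

Since Gain(A1) > Gain(A2), "Battery Life" would be preferred over "Memory" as the root split at this iteration.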