Chapter 3 Data Mining: Classification Ms. Malak Bagais [textbook]: Chapter 4
Objectives • By the end of this lecture, students will be able to: • List information retrieval components • Describe document representation • Apply Porter's Algorithm • Compare and apply different retrieval models • Evaluate retrieval performance
Introduction Data mining is a component of a wider process called knowledge discovery from databases. Data mining techniques include: • Classification • Clustering
What is Classification? Classification is concerned with generating a description or model for each class from the given dataset of records. Classification can be: • Supervised (Decision Trees and Associations) • Unsupervised (more in next chapter)
Supervised Classification Using the training set (pre-classified data), the classifier generates a description/model of the classes, which is then used to classify unknown records. How can we evaluate how good the classifier is at classifying unknown records? By using a test dataset.
Decision Trees A decision tree is a tree with the following properties: • An inner node represents an attribute • An edge represents a test on the attribute of the parent node • A leaf represents one of the classes Construction of a decision tree: • Based on the training data • Top-down strategy
Decision Trees The set of records available for classification is divided into two disjoint subsets: • a training set • a test set Attributes whose domain is numerical are called numerical attributes. Attributes whose domain is not numerical are called categorical attributes.
Decision Tree (splitting attribute at the root; a splitting criterion/condition on each edge) • RULE 1: If it is sunny and the humidity is not above 75%, then play. • RULE 2: If it is sunny and the humidity is above 75%, then do not play. • RULE 3: If it is overcast, then play. • RULE 4: If it is rainy and not windy, then play. • RULE 5: If it is rainy and windy, then don't play.
Example outlook=rain; temp=70; humidity=65; wind=true; Play or No Play?
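As an illustration (not part of the slides), here is a minimal Python sketch that encodes Rules 1-5 and applies them to this record; the classify() helper and its parameter names are made up for the example:

def classify(outlook, humidity, windy):
    # Apply Rules 1-5 of the decision tree to a single record
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "Don't Play"   # Rules 1 and 2
    if outlook == "overcast":
        return "Play"                                        # Rule 3
    if outlook == "rain":
        return "Don't Play" if windy else "Play"             # Rules 5 and 4
    return None

# The record from this slide: outlook=rain, humidity=65, wind=true (temp is not used by the rules)
print(classify("rain", 65, True))   # -> Don't Play (Rule 5: rainy and windy)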
Confidence Confidence in the classifier (i.e., its accuracy) is determined by the percentage of the test data that is correctly classified.
Activity Compute the confidence in Rule-1 Compute the confidence in Rule-2 Compute the confidence in Rule-3
Test dataset For Rule 1, two records of the test dataset satisfy Outlook = sunny and humidity not above 75, and only one of these is correctly classified as play, so the confidence of Rule 1 is 0.5. (The confidences of Rules 1, 2, and 3 are 0.5, 0.5, and 0.66 respectively.)
Decision Tree Algorithms ID3 algorithm Rough Set Theory
Decision Trees • ID3, the Iterative Dichotomizer (Quinlan 1986), represents concepts as decision trees. • A decision tree is a classifier in the form of a tree structure where each node is either: • a leaf node, indicating a class of instances, OR • a decision node, which specifies a test to be carried out on a single attribute value, with one branch and a sub-tree for each possible outcome of the test
Decision Tree development process • Construction phase Initial tree constructed from the training set • Pruning phase Removes some of the nodes and branches to improve performance • Processing phase The pruned tree is further processed to improve understandability
Construction phase Use Hunt’s method: T : Training dataset with class labels { C1, C2,…,Cn} The tree is built by repeatedly partitioning the training data, based on the goodness of the split. The process is continued until all the records in a partition belong to the same class.
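A minimal sketch of this top-down partitioning (not part of the slides), assuming records are dictionaries with a "class" key; choose_split is a stand-in for the goodness-of-split evaluation discussed on the following slides (e.g. information gain):

def build_tree(records, choose_split):
    # Hunt's method: recursively partition the training data until a node is pure.
    # choose_split(records) must return (attribute, {attribute_value: subset_of_records}).
    classes = {r["class"] for r in records}
    if len(classes) == 1:                       # all records in one class -> leaf node
        return {"leaf": classes.pop()}
    attribute, partitions = choose_split(records)
    return {"attribute": attribute,
            "children": {value: build_tree(subset, choose_split)
                         for value, subset in partitions.items()}}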
Best Possible Split • Evaluate the splits for each attribute. • Determine the splitting condition on the selected splitting attribute. • Partition the data using the best split. The best split is the one that does the best job of separating the records into groups where a single class predominates.
Splitter choice To choose the best splitter, we consider each attribute in turn. If an attribute has multiple values, we sort them and measure the goodness of each candidate split. We then compare the effectiveness of the split provided by the best splitter from each attribute. The winner is chosen as the splitter for the root node.
Iterative Dichotomizer (ID3) Uses entropy, an information-theoretic measure of the goodness of a split. The algorithm uses the criterion of information gain to determine the goodness of a split. The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of that attribute.
Information measure The information needed to identify the class of an element of T is given by: Info(T) = Entropy(P) = -(p1·log2(p1) + p2·log2(p2) + … + pn·log2(pn)), where P = (p1, p2, …, pn) is the probability distribution of the partition C1, C2, …, Cn, i.e. P = (|C1|/|T|, |C2|/|T|, …, |Cn|/|T|).
Example-1 T : Training dataset, C1=40, C2=30, C3=30 Compute Entropy of (T) or Info(T)
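As a quick check, here is a minimal Python sketch of this computation (not part of the slides):

from math import log2

def entropy(class_counts):
    # Info(T) = -sum(p_i * log2(p_i)) over the class distribution
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# C1=40, C2=30, C3=30  ->  P = (0.4, 0.3, 0.3)
print(round(entropy([40, 30, 30]), 2))   # 1.57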
Info(X,T) If T is partitioned based on attribute X into sets T1, T2, …, Tn, then the information needed to identify the class of an element of T becomes the size-weighted average: Info(X,T) = (|T1|/|T|)·Info(T1) + (|T2|/|T|)·Info(T2) + … + (|Tn|/|T|)·Info(Tn).
Example-2 Suppose T is divided into two subsets S1 and S2, with n1 and n2 records, according to attribute X. If n1 = 60 and n2 = 40, the split gives: Info(X,T) = (60/100)·Info(S1) + (40/100)·Info(S2). Compute Entropy(X,T), i.e. Info(X,T), after segmentation.
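A minimal sketch of the weighted computation, for illustration only: the slide gives the subset sizes (60 and 40) but not the class counts inside each subset, so the counts used below are made up; the slides' own split evaluates to Info(X,T) = 1.15 on the next slide.

from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_after_split(subsets):
    # Info(X,T) = sum(|Ti|/|T| * Info(Ti)) over the subsets produced by attribute X
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

# Hypothetical class counts: S1 holds 60 records, S2 holds 40 (as on the slide),
# but how they fall into C1..C3 is invented here for illustration.
print(round(info_after_split([[30, 20, 10], [5, 5, 30]]), 2))   # 1.3 for these made-up counts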
Example-3 With Info(X,T) = 1.15 for the split in Example-2: Gain(X,T) = Info(T) - Info(X,T) = 1.57 - 1.15 = 0.42.
Example-4 Assume we have another split, on attribute Y, with Info(Y,T) = 1.5596. Gain(Y,T) = Info(T) - Info(Y,T) = 1.57 - 1.5596 = 0.0104.
Splitting attribute: X or Y? Gain(X,T) = 0.42 and Gain(Y,T) = 0.0104. The splitting attribute is chosen to be the one with the largest gain: X.
Gain-Ratio Gain tends to favour attributes with a large number of values. If attribute X has a distinct value for each record, then Info(X,T) = 0 and Gain(X,T) is maximal. To counterbalance this, we use the gain-ratio instead of the gain (see the sketch below).
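The slides stop short of giving the gain-ratio formula; as a reference, here is a minimal Python sketch of the usual C4.5-style definition, which divides the gain by the "split information" (the entropy of the partition sizes themselves). The numbers reuse the figures from Example-3 and are only illustrative.

from math import log2

def split_info(subset_sizes):
    # SplitInfo(X,T) = -sum(|Ti|/|T| * log2(|Ti|/|T|)): entropy of the partition sizes
    total = sum(subset_sizes)
    return -sum((n / total) * log2(n / total) for n in subset_sizes if n > 0)

def gain_ratio(gain, subset_sizes):
    # GainRatio(X,T) = Gain(X,T) / SplitInfo(X,T); penalises many-valued attributes
    return gain / split_info(subset_sizes)

print(round(gain_ratio(0.42, [60, 40]), 2))   # 0.42 / 0.971 -> 0.43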
Index of Diversity • A high index of diversity: the set contains an even distribution of classes. • A low index of diversity: members of a single class predominate.
Which is the best splitter? The best splitter is the one that decreases the diversity of the record set by the greatest amount. We want to maximize: diversity(before split) - [diversity(left child) + diversity(right child)]
Numerical Example For the play golf example, compute the following: • Entropy of T. • Information Gain for the following attributes: outlook, humidity, temp, and windy. Based on ID3, which will be selected as the splitting attribute?
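The play-golf table itself is not reproduced here; assuming it is the classic 14-record dataset from Quinlan (an assumption to check against the slide's table), the computation can be sketched as follows, with the well-known results noted in the comments.

from math import log2
from collections import Counter

# Classic 14-record play-golf dataset (assumed). Each record: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True,  "no"),
    ("overcast", 83, 78, False, "yes"), ("rain", 70, 96, False, "yes"),
    ("rain", 68, 80, False, "yes"),     ("rain", 65, 70, True,  "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rain", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rain", 71, 80, True,  "no"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(records, attr_index):
    # Gain(X,T) = Info(T) - Info(X,T), splitting on the attribute at attr_index
    info_t = entropy([r[-1] for r in records])
    info_x = 0.0
    for value in {r[attr_index] for r in records}:
        subset = [r for r in records if r[attr_index] == value]
        info_x += len(subset) / len(records) * entropy([r[-1] for r in subset])
    return info_t - info_x

print(round(entropy([r[-1] for r in data]), 3))   # Entropy(T) ~ 0.940
print(round(gain(data, 0), 3))                    # outlook: ~ 0.247
print(round(gain(data, 3), 3))                    # windy:   ~ 0.048
# temp and humidity are numeric, so ID3 would first discretise them (e.g. humidity <= 75);
# with this dataset, outlook has the largest gain and is selected as the splitting attribute.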