Data Mining, Spring 2007: Noisy Data and Data Discretization using Entropy-Based and ChiMerge Methods
Noisy Data • Noise: random error; the data is present but not correct. • Data transmission errors • Data entry problems • Removing noise • Data smoothing (rounding, averaging within a window). • Clustering/merging and detecting outliers. • Data smoothing • First sort the data and partition it into (equi-depth) bins. • Then smooth the values in each bin: smoothing by bin means, by bin medians, by bin boundaries, etc.
Noisy Data (Binning Methods) Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
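To make the binning steps concrete, here is a minimal Python sketch (not part of the original slides; function names are illustrative only) that reproduces the equi-depth bins and the two smoothing variants on the price data above.

```python
# A minimal sketch (illustrative names, not from the slides) of equi-depth
# binning with smoothing by bin means and by bin boundaries.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(values, n_bins):
    """Partition sorted values into bins holding (roughly) the same number of values."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's two boundaries (min or max)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```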
Noisy Data (Clustering) • Outliers may be detected by clustering, where similar values are organized into groups or “clusters”. • Values that fall outside of the set of clusters may be considered outliers.
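As a rough illustration of this idea, the sketch below clusters one-dimensional values by gap size and flags values in tiny, isolated clusters as potential outliers. The data, gap threshold, and minimum cluster size are hypothetical choices, not values from the slides.

```python
# A rough sketch (hypothetical data and thresholds) of outlier detection via
# simple one-dimensional clustering: sorted values are grouped wherever the gap
# to the previous value is small, and values in tiny isolated groups are flagged.

def cluster_1d(values, max_gap):
    """Group sorted values; start a new cluster when the gap to the last value exceeds max_gap."""
    values = sorted(values)
    clusters = [[values[0]]]
    for v in values[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def outliers(values, max_gap=10, min_size=2):
    """Values falling in clusters smaller than min_size are reported as potential outliers."""
    return [v for c in cluster_1d(values, max_gap) if len(c) < min_size for v in c]

ages = [21, 22, 24, 25, 27, 27, 27, 35, 41, 95]  # 95 looks like a data-entry error
print(cluster_1d(ages, 10))  # [[21, 22, 24, 25, 27, 27, 27, 35, 41], [95]]
print(outliers(ages))        # [95]
```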
Data Discretization • The task of attribute (feature) discretization techniques is to discretize the values of continuous features into a small number of intervals, where each interval is mapped to a discrete symbol. • Advantages: • Simplified data description and easy-to-understand data and final data-mining results. • Only a small set of interesting rules is mined. • End-result processing time is decreased. • End-result accuracy is improved.
Effect of Continuous Data on Results Accuracy • Discover only those rules whose support (frequency) is >= 1. Data mining then produces rules such as: • If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’ • If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’ • If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’ • If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’ • Because the continuous age attribute fragments the rules, an age value missing from the training dataset is not covered by any rule, so the accuracy of prediction decreases and becomes 66.7%.
Entropy-Based Discretization • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2), with Ent(S1) = − Σi pi log2(pi). • Here pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (Ent(S2) is defined analogously).
Entropy-Based Discretization • The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. • The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) − E(S, T) falls below a small threshold δ.
Example 1 • Consider nine samples with ages 21, 22, 24, 25, 27, 27, 27, 35, and 41, and let Grade be the class attribute. Use entropy-based discretization to divide the range of ages into discrete intervals. • There are 6 possible boundaries, taken as midpoints between distinct adjacent values, e.g. (21 + 22) / 2 = 21.5 and (22 + 24) / 2 = 23. The boundaries are 21.5, 23, 24.5, 26, 31, and 38. • Let us consider the boundary at T = 21.5: S1 = {21} and S2 = {22, 24, 25, 27, 27, 27, 35, 41}.
Example 1 (cont’) • The numbers of elements in S1 and S2 are |S1| = 1 and |S2| = 8. • The entropy of S1 is Ent(S1) = −1 · log2(1) = 0, since S1 contains a single sample and therefore only one class. • The entropy of S2 is Ent(S2) = − Σi pi log2(pi), computed from the proportions of the Grade classes among the 8 samples in S2.
Example 1 (cont’) • Hence, the entropy after partitioning at T = 21.5 is E(S, 21.5) = (1/9) · Ent(S1) + (8/9) · Ent(S2) = (8/9) · Ent(S2).
Example 1 (cont’) • The entropies after partitioning are computed for all the boundaries: E(S, 21.5), E(S, 23), …, E(S, 38). • Select the boundary with the smallest entropy; suppose the best is T = 23. • Now recursively apply entropy-based discretization to both partitions.
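The sketch below evaluates all candidate boundaries for the age example using the E(S, T) formula above. The ages and candidate boundaries come from the slides; the Grade labels are assumed placeholders, since the original class column is not reproduced here.

```python
# A sketch of one level of entropy-based discretization on the age example.
# The Grade labels are hypothetical; only ages and boundaries come from the slides.
from math import log2

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ['F', 'F', 'P', 'P', 'P', 'P', 'P', 'F', 'F']   # assumed labels

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def split_entropy(values, labels, t):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2) for boundary T."""
    s1 = [l for v, l in zip(values, labels) if v <= t]
    s2 = [l for v, l in zip(values, labels) if v > t]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

boundaries = [21.5, 23, 24.5, 26, 31, 38]   # midpoints between distinct adjacent ages
for t in boundaries:
    print(f"E(S, {t}) = {split_entropy(ages, grades, t):.3f}")

best = min(boundaries, key=lambda t: split_entropy(ages, grades, t))
print("best boundary:", best)   # the process would then recurse on both partitions
```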
ChiMerge (Kerber, 1992) • This discretization method uses a bottom-up merging approach. • ChiMerge works as follows: • First sort the data on the attribute being discretized. • List all possible boundaries or intervals. In the previous example the boundaries were 0, 21.5, 23, 24.5, 26, 31, and 38. • For each pair of adjacent intervals, compute the χ² test of class independence: • {0, 21.5} and {21.5, 23} • {21.5, 23} and {23, 24.5} • … • Pick the pair of adjacent intervals with the lowest χ² value and merge them.
ChiMerge -- The Algorithm • Compute the χ² value for each pair of adjacent intervals. • Merge the pair of adjacent intervals with the lowest χ² value. • Repeat the two steps above until the χ² values of all adjacent pairs exceed a threshold.
Chi-Square Test χ² = Σi Σj (oij − eij)² / eij, where oij = observed frequency of interval i for class j, and eij = expected frequency = (Ri · Cj) / N, with Ri the number of samples in interval i, Cj the number of samples of class j, and N the total number of samples.
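A minimal sketch of this statistic for one pair of adjacent intervals follows. The only assumption beyond the formula above is the convention, also used in the worked example below, of substituting a small constant such as 0.1 for a zero expected frequency in the denominator.

```python
# A sketch of the chi-square statistic for one pair of adjacent intervals with k
# classes, using the o_ij / e_ij definitions above. A zero expected frequency is
# replaced by 0.1 in the denominator, as in the worked example on the next slides.

def chi2(row1, row2, zero_sub=0.1):
    """row1, row2: observed class counts (o_ij) for the two adjacent intervals."""
    observed = [row1, row2]
    row_totals = [sum(r) for r in observed]              # R_i
    col_totals = [sum(col) for col in zip(*observed)]    # C_j
    n = sum(row_totals)                                  # N
    stat = 0.0
    for i in range(2):
        for j in range(len(col_totals)):
            e = row_totals[i] * col_totals[j] / n        # e_ij = R_i * C_j / N
            stat += (observed[i][j] - e) ** 2 / (e if e > 0 else zero_sub)
    return stat

# Check against the last iteration of the example below: intervals [0, 10] and [10, 42].
print(round(chi2([4, 1], [1, 3]), 2))   # 2.72, just above the 2.706 threshold
```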
ChiMerge Example Data Set

Sample   F    K
  1      1    1
  2      3    2
  3      7    1
  4      8    1
  5      9    1
  6     11    2
  7     23    2
  8     37    1
  9     39    2
 10     45    1
 11     46    1
 12     59    1

• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.
ChiMerge Example (cont.) χ² was minimum for the adjacent intervals [7.5, 8.5] and [8.5, 10]:

                       K=1      K=2
Interval [7.5, 8.5]    A11=1    A12=0    R1=1
Interval [8.5, 10]     A21=1    A22=0    R2=1
                       C1=2     C2=0     N=2

Based on the table's values, we can calculate the expected values: E11 = 2/2 = 1, E12 = 0/2 ≈ 0, E21 = 2/2 = 1, and E22 = 0/2 ≈ 0 (a zero expected frequency is replaced by 0.1 in the denominator), and the corresponding χ² test:
χ² = (1 − 1)²/1 + (0 − 0)²/0.1 + (1 − 1)²/1 + (0 − 0)²/0.1 = 0
For the degree of freedom d = 1, χ² = 0 < 2.706 (MERGE!)
ChiMerge Example (cont.) Additional iterations, e.g. for the adjacent intervals [0, 7.5] and [7.5, 10]:

                      K=1      K=2
Interval [0, 7.5]     A11=2    A12=1    R1=3
Interval [7.5, 10]    A21=2    A22=0    R2=2
                      C1=4     C2=1     N=5

E11 = 12/5 = 2.4, E12 = 3/5 = 0.6, E21 = 8/5 = 1.6, and E22 = 2/5 = 0.4
χ² = (2 − 2.4)²/2.4 + (1 − 0.6)²/0.6 + (2 − 1.6)²/1.6 + (0 − 0.4)²/0.4 = 0.833
For the degree of freedom d = 1, χ² = 0.833 < 2.706 (MERGE!)
ChiMerge Example (cont.)

                        K=1      K=2
Interval [0, 10.0]      A11=4    A12=1    R1=5
Interval [10.0, 42.0]   A21=1    A22=3    R2=4
                        C1=5     C2=4     N=9

E11 = 2.78, E12 = 2.22, E21 = 2.22, E22 = 1.78, and χ² = 2.72 > 2.706 (NO MERGE!)
Final discretization: [0, 10], [10, 42], and [42, 60]
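Putting the pieces together, the following sketch runs the full bottom-up ChiMerge loop on the 12-sample (F, K) data of this example with the 2.706 threshold. The tie-breaking order when several adjacent pairs share the lowest χ² value is a choice of this sketch, but the resulting grouping of F values corresponds to the final discretization [0, 10], [10, 42], [42, 60] above.

```python
# An end-to-end ChiMerge sketch on the 12-sample (F, K) data of this example,
# merging adjacent intervals bottom-up until every adjacent pair of intervals
# has a chi-square value above 2.706 (95% level, 1 degree of freedom).

samples = [(1, 1), (3, 2), (7, 1), (8, 1), (9, 1), (11, 2), (23, 2),
           (37, 1), (39, 2), (45, 1), (46, 1), (59, 1)]
classes = sorted({k for _, k in samples})
THRESHOLD = 2.706

def chi2(row1, row2, zero_sub=0.1):
    """Chi-square for two adjacent intervals; zero expected counts use 0.1 in the denominator."""
    observed = [row1, row2]
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / ((row_totals[i] * col_totals[j] / n) or zero_sub)
               for i in range(2) for j in range(len(col_totals)))

def class_counts(interval):
    """Observed class counts for the F values contained in one interval."""
    return [sum(1 for f, k in samples if f in interval and k == c) for c in classes]

# Start with one interval per distinct F value, then merge greedily.
intervals = [[f] for f in sorted({f for f, _ in samples})]
while len(intervals) > 1:
    stats = [chi2(class_counts(intervals[i]), class_counts(intervals[i + 1]))
             for i in range(len(intervals) - 1)]
    i = min(range(len(stats)), key=stats.__getitem__)
    if stats[i] >= THRESHOLD:
        break                                                # all adjacent pairs differ significantly
    intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]   # merge the best pair

print(intervals)
# [[1, 3, 7, 8, 9], [11, 23, 37, 39], [45, 46, 59]] -> cut points near 10 and 42
```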
References • Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2000 (Chapter 3). • Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms”, John Wiley & Sons, 2003 (Chapter 3). • Randy Kerber, “ChiMerge: Discretization of Numeric Attributes”, Proceedings of AAAI-92, 1992.