G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 2: Data Preprocessing Lecture 5: Discretisation methods
Outline of the lecture • Definition and taxonomy • Static discretisation techniques • The ADI representation: a dynamic, local, discretisation method • Resources
Definition • Discretisation: • The process of converting a continuous variable into a discrete one, that is, partitioning its domain into a finite (discrete) set of elements (intervals) • Every data point inside one interval is treated equally (illustration: a domain split into intervals I1, I2, I3)
Definition • How? • A discretisation algorithm proposes a series of cut-points. These, plus the domain bounds, define the intervals • Why? • Many methods in mathematics and computer science cannot deal with continuous variables • A discretisation process is required to be able to use them, despite the loss of information it introduces
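As a small illustration (not part of the original slides), the Python sketch below uses NumPy's digitize with made-up values and cut-points to show how cut-points map continuous values to interval indices:

    import numpy as np

    # Two hypothetical cut-points at 0.3 and 0.7 split the domain [0, 1]
    # into three intervals; np.digitize maps each value to its interval index.
    values = np.array([0.05, 0.32, 0.55, 0.71, 0.95])
    cut_points = [0.3, 0.7]

    interval_ids = np.digitize(values, cut_points)
    print(interval_ids)  # [0 1 1 2 2] -> every value in an interval is treated equally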
Taxonomy • (Liu et al., 2002) • Supervised vs Unsupervised • Supervised methods use the class label of each instance to decide where the cut-points go • Dynamic vs Static • Dynamic discretisation is performed at the same time as the learning process; static discretisation is performed before learning • Rule-based systems that generate arbitrary intervals can be considered a case of dynamic discretisation
Taxonomy • Global vs Local • Global methods apply the same discretisation criteria (cut-points) to all instances • Local methods use different cut-points for different groups of instances • Again, rule-learning methods that generate arbitrary intervals can be considered a case of local discretisation
Taxonomy • Splitting vs Merging methods • Splitting methods: start with a single interval (no cut-points) and divide it • Merging methods: start with every possible interval (n-1 cut-points) and merge some of them
Equal-length discretisation • Unsupervised method • Given a domain with bounds d_l and d_u and a number of bins b • It generates b intervals of equal size (d_u - d_l)/b
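A minimal Python sketch of this rule (the function name and the example values are illustrative assumptions):

    def equal_width_cut_points(d_l, d_u, b):
        # b intervals of width (d_u - d_l) / b, i.e. b - 1 interior cut-points
        width = (d_u - d_l) / b
        return [d_l + i * width for i in range(1, b)]

    print(equal_width_cut_points(0.0, 10.0, 4))  # [2.5, 5.0, 7.5]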
Equal-frequency discretisation • Unsupervised method • Given a domain with bounds d_l and d_u and a number of bins b • It generates b intervals, each of them containing the same number of values
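A minimal sketch using NumPy quantiles, one common way to obtain equal-frequency boundaries (the helper name and the example data are illustrative):

    import numpy as np

    def equal_frequency_cut_points(values, b):
        # interior quantiles at 1/b, 2/b, ..., (b-1)/b give b bins holding
        # (roughly) the same number of values each
        quantiles = np.linspace(0, 1, b + 1)[1:-1]
        return np.quantile(np.asarray(values, dtype=float), quantiles)

    data = [1, 2, 2, 3, 7, 8, 9, 12, 15, 20]
    print(equal_frequency_cut_points(data, 4))  # 3 cut-points for 4 bins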
ID3 discretisation (Quinlan, 86) • Supervised, splitting method • Inspired by the ID3 decision tree induction algorithm • It chooses the cut-points that create intervals with minimal entropy, that is, maximising the information gain
ID3 splitting procedure • Start with a single interval • Identify the cut-point that creates the two intervals with minimal entropy • Split the interval using that best cut-point • Recursively apply the method to S1 and S2 • Stop criterion: all instances in an interval have the same class (S = original interval; S1, S2 = the two candidate sub-intervals)
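The sketch below is a plain NumPy rendition of this recursive procedure, not Quinlan's original code; the function names and the use of midpoints between consecutive values as candidate cut-points are assumptions for illustration:

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a class-label array
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def id3_split(x, y, cut_points=None):
        # Recursively pick the boundary minimising the weighted entropy of the
        # two halves; stop when an interval is pure (or cannot be split)
        if cut_points is None:
            cut_points = []
        x, y = np.asarray(x, dtype=float), np.asarray(y)
        if len(np.unique(y)) <= 1 or len(x) < 2:
            return cut_points
        order = np.argsort(x)
        x, y = x[order], y[order]
        candidates = np.unique((x[:-1] + x[1:]) / 2)
        best_cut, best_ent = None, np.inf
        for c in candidates:
            left, right = y[x <= c], y[x > c]
            if len(left) == 0 or len(right) == 0:
                continue
            ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if ent < best_ent:
                best_cut, best_ent = c, ent
        if best_cut is None:
            return cut_points
        cut_points.append(float(best_cut))
        id3_split(x[x <= best_cut], y[x <= best_cut], cut_points)   # recurse into S1
        id3_split(x[x > best_cut], y[x > best_cut], cut_points)     # recurse into S2
        return sorted(cut_points)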
(Fayyad & Irani, 93) discretisation • Refinement of ID3 to make it more conservative • ID3 generates lots of intervals because the stop criterion is very loose • In this method, in order to split an interval, the difference between Entropy(S) and EntropyPartition(S,S1,S2) needs to be large enough • Stop criteria based on the Minimum Description Length (MDL) principle (Rissanen, 78) • MDL is a modern reformulation of the classic Occam’s Razor principle: “If you have two equally good explanations, always choose the simplest one”
Minimum Description Length • MDL also comes from the information theory field, dealing with information transmission • Scenario: a sender holds the instances together with their classes, while the receiver holds only the instances. How do we send the class of each instance? 1) Send the class labels directly, or 2) generate a theory, then send the theory description plus its exceptions • The theory that minimises its own size plus the size of its exceptions is the best one
MDL for discretisation • Stop partitioning if the gain of discretising does not exceed its cost, i.e. if Gain(S; S1, S2) < log2(N-1)/N + Δ(S; S1, S2)/N, where Δ(S; S1, S2) = log2(3^k - 2) - [k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)] • N = number of instances in the original interval, k = number of classes, k1, k2 = number of classes represented in sub-partitions 1 and 2
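A sketch of this stopping rule in Python, reusing the entropy helper from the ID3 sketch above (the function name is illustrative and the inequality follows the reconstruction given on this slide):

    import numpy as np

    def mdl_accept_split(y, y1, y2):
        # Return True if splitting the interval with labels y into y1 and y2
        # should be accepted, i.e. the information gain exceeds the MDL cost
        n = len(y)
        k, k1, k2 = len(np.unique(y)), len(np.unique(y1)), len(np.unique(y2))
        ent, ent1, ent2 = entropy(y), entropy(y1), entropy(y2)
        gain = ent - (len(y1) * ent1 + len(y2) * ent2) / n
        delta = np.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
        return gain > (np.log2(n - 1) + delta) / n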
Unparametrized Supervised Discretizer (Giraldez et al., 02) • Supervised merging algorithm • It defines the quality of an interval by a measure called goodness • maxC(I) = number of examples belonging to the majority class in I • Errors(I) = number of examples not belonging to the majority class in I
Unparametrized Supervised Discretizer (Giraldez et al., 02) • Discretisation process: • Starts with every possible cut-point in the domain • Identifies candidate pairs of intervals (Ii, Ii+1) to join if both conditions are true: • Ii and Ii+1 have the same majority class, or there is a tie in Ii or Ii+1 • goodness(Ii+Ii+1) > [goodness(Ii) + goodness(Ii+1)]/2 • Merges the candidate pair with the highest goodness • Repeats the identification and merging steps until no more intervals are merged
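A rough sketch of one merging pass is given below. Two assumptions to flag: the goodness formula maxC(I) / (1 + Errors(I)) is one plausible reading of the measure rather than the exact definition from the paper, and the "same majority class or tie" pre-condition is omitted for brevity:

    from collections import Counter

    def goodness(labels):
        # Assumed goodness: majority-class count divided by (1 + error count)
        counts = Counter(labels)
        max_c = max(counts.values())       # examples of the majority class
        errors = len(labels) - max_c       # examples not of the majority class
        return max_c / (1.0 + errors)

    def usd_merge_pass(intervals):
        # intervals: list of lists of class labels, one list per interval.
        # Merge the adjacent pair with the highest merged goodness, provided
        # the merge does not lower the average goodness of the pair.
        best_idx, best_good = None, -1.0
        for i in range(len(intervals) - 1):
            merged_labels = intervals[i] + intervals[i + 1]
            g = goodness(merged_labels)
            avg = (goodness(intervals[i]) + goodness(intervals[i + 1])) / 2
            if g > avg and g > best_good:
                best_idx, best_good = i, g
        if best_idx is None:
            return intervals, False        # nothing merged: discretisation stops
        merged = intervals[best_idx] + intervals[best_idx + 1]
        return intervals[:best_idx] + [merged] + intervals[best_idx + 2:], True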
ChiMerge (Kerber, 92) • Supervised merging discretisation method • Uses the χ² statistical test to decide whether or not to merge intervals • This test checks whether a discrete random variable follows a certain distribution • E.g. testing whether a die is fair (all 6 outcomes are equally likely to occur) • Two adjacent intervals are merged if the null hypothesis (both have the same class distribution) cannot be rejected
ChiMerge • χ² formula: χ² = Σi=1..2 Σj=1..p (Aij - Eij)²/Eij, where Eij = Ri·Cj/N is the expected count • A p-value can be computed from the χ² value (given the degrees of freedom); a predefined confidence level is used to decide whether to reject the null hypothesis • Aij = examples in interval i from class j • p = number of classes • Ri = examples in interval i • Cj = examples from class j • N = total number of examples • It iteratively merges intervals until the statistical test fails for every possible pair of consecutive intervals
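A sketch of the χ² computation for one pair of adjacent intervals (SciPy is used only to obtain the threshold; the 95% confidence level and the two-class example are illustrative assumptions):

    import numpy as np
    from scipy.stats import chi2

    def chi2_value(interval_a, interval_b, classes):
        # A[i][j]: observed count of class j in interval i;
        # E[i][j] = R_i * C_j / N: expected count under independence
        A = np.array([[np.sum(np.asarray(interval) == c) for c in classes]
                      for interval in (interval_a, interval_b)], dtype=float)
        R = A.sum(axis=1, keepdims=True)   # examples in each interval
        C = A.sum(axis=0, keepdims=True)   # examples of each class
        N = A.sum()
        E = R @ C / N
        E[E == 0] = 1e-6                   # avoid division by zero
        return float(np.sum((A - E) ** 2 / E))

    # Merge the pair with the lowest statistic while it stays below the
    # threshold for the chosen confidence level (df = number of classes - 1)
    value = chi2_value([0, 0, 1], [0, 1, 1], classes=[0, 1])
    threshold = chi2.ppf(0.95, df=1)
    print(value < threshold)  # True -> this pair would be merged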
Differences between discretisers (Bacardit, 2004)
Resources • A good survey on discretisation methods with empirical validation using C4.5 • Implementations of the methods described in this lecture are available in the KEEL package • A list of the 27 discretisation algorithms (with references) included in KEEL