Data Mining, Chapter 4, Algorithms: The Basic Methods • Kirk Scott
Did you have any idea that symbols like these could be inserted into PowerPoint presentations? • Ѿ • ҉ • ҈ • ۞ • ۩ • ҂
Basic Methods • A good rule of thumb is to try the simple things first • Quite frequently the simple things will be good enough or will provide useful insights for further exploration • One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set
Certain data sets have a certain structure • Certain algorithms are designed to elicit particular kinds of structures • The right algorithm applied to the right set will give straightforward results • A mismatch between algorithm and data set will give complicated, cloudy results
Chapter 4 is divided into 8 basic algorithm descriptions plus a 9th topic • 4.1 Inferring Rudimentary Rules • 4.2 Statistical Modeling • 4.3 Divide and Conquer: Constructing Decision Trees • 4.4 Covering Algorithms: Constructing Rules
4.5 Mining Association Rules • 4.6 Linear Models • 4.7 Instance-Based Learning • 4.8 Clustering • 4.9 Multi-Instance Learning
4.1 Inferring Rudimentary Rules • The 1R (1-rule) approach • Given a training set, make a one-level decision tree that classifies instances based on a single attribute • For each value of that attribute, let the predicted classification be the majority classification among the training instances with that value
Do this for each attribute and pick the tree/rule set that has the lowest error rate • The error rate is simply the count of the total number of misclassified instances across the training set for the rule set • This simple approach frequently works well • This suggests that for a lot of data sets, one dominant attribute is a strong determinant
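As a concrete illustration, here is a minimal sketch of 1R in Python. It assumes the training set is a list of dictionaries of nominal attribute values plus a class label; the function and variable names are mine, not the book's.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R: build a one-level rule set for each attribute and keep the best.

    instances: list of dicts mapping attribute names to nominal values.
    Returns (best_attribute, rules, error_count), where rules maps each
    value of the best attribute to the majority class for that value.
    """
    best = None
    for attr in attributes:
        # Tally class labels for each value of this attribute
        tallies = defaultdict(Counter)
        for inst in instances:
            tallies[inst[attr]][inst[class_attr]] += 1
        # The rule for each value is the majority class; the errors are the
        # training instances that do not have the majority class for their value
        rules = {}
        errors = 0
        for value, counts in tallies.items():
            majority_class, majority_count = counts.most_common(1)[0]
            rules[value] = majority_class
            errors += sum(counts.values()) - majority_count
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best
```

Applied to the weather data, this would evaluate one rule set per attribute (outlook, temperature, humidity, windy) and keep the one with the fewest training-set misclassifications.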
Missing Values and Numeric Attributes • Missing values are easily handled by 1R • Missing is just one of the branches in the decision tree • 1R is fundamentally nominal • How to decide how to branch on a numeric attribute?
One approach to branching on numerics: • Sort the numeric instances • Create a break point everywhere in the sequence that the classification changes • This partitions the domain
Overfitting • The problem is that you may end up with lots of break points/partitions • In the extreme case, there is a separate partition for every distinct attribute value • This is not good • You’ve essentially determined a 1-1 coding from the attribute in question to the classification
If this happens, instances in the future that do not have these values for the attributes can’t be classified by the system • They will not fall into any known partition • This is the classic case of overfitting
Dealing with Overfitting in 1R • Specify a minimum number, n, of instances/attribute values per partition • The potential problem now is mixed classifications among n neighboring attribute values in sorted order • The solution is to take the majority classification of the n as the rule
Suppose the previous steps result in neighboring partitions with the same classification • Merge those partitions • This potentially will reduce the number of partitions significantly
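A rough sketch of this discretization, assuming the minimum-partition-size rule and the majority-class rule just described; the break points returned are the upper boundaries of the merged partitions, and the names and the default n=3 are illustrative only.

```python
from collections import Counter

def discretize_1r(values, classes, n=3):
    """Discretize a numeric attribute for 1R.

    values, classes: parallel sequences of attribute values and class labels.
    Returns a list of (upper_boundary, majority_class) partitions.
    """
    pairs = sorted(zip(values, classes))
    partitions = []
    i = 0
    while i < len(pairs):
        # Each partition gets at least n instances; the last takes what is left
        j = min(i + n, len(pairs))
        # Never place a break point between identical attribute values
        while j < len(pairs) and pairs[j][0] == pairs[j - 1][0]:
            j += 1
        chunk = pairs[i:j]
        majority = Counter(cls for _, cls in chunk).most_common(1)[0][0]
        partitions.append((chunk[-1][0], majority))
        i = j
    # Merge neighboring partitions that predict the same class
    merged = [partitions[0]]
    for upper, cls in partitions[1:]:
        if cls == merged[-1][1]:
            merged[-1] = (upper, cls)
        else:
            merged.append((upper, cls))
    return merged
```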
Notice how rough and ready this is • You are guaranteed misclassifications • However, on the whole, you hope for a significant proportion of correct classifications
Discussion • 1R is fast and easy • 1R quite often performs only slightly less well than advanced techniques • It makes sense to start simple in order to get a handle on a data set • Go to something more complicated if desired • In the text a slightly more complicated starting point is also mentioned…
4.2 Statistical Modeling • This is basically a discussion of an application of Bayes’ Theorem • Bayes’ Theorem makes a statement about what is known as conditional probability • I will cover the same ideas as the book, but I will do it in a slightly different way • Whichever explanation makes the most sense to you is the “right” one
The book refers to this approach as Naïve Bayes • It is based on the simplifying assumption that the attributes in a data set are independent • Independence isn’t typical • Otherwise there would be no associations to mine • Even so, the technique gives good results
The Weather Example • Table 4.2, shown on the overhead following the next one, summarizes the outcome of play yes/no for each weather attribute for all instances in the training set • Note this in particular on first viewing: • 9 total yeses, 5 total nos
For the Outlook attribute: • Sunny-yes 2/9 • Overcast-yes 4/9 • Rainy-yes 3/9 • Sunny-no 3/5 • Overcast-no 0/5 • Rainy-no 2/5 • Given the outcome, yes/no, these fractions tell you the likelihood that there was a given outlook
Bayes’ Theorem • Bayes’ Theorem involves a hypothesis, H, and some evidence, E, relevant to the hypothesis • The theorem gives a formula for finding the probability that H is true under the condition that you know that E is true • The theorem is based on knowing some other probabilistic quantities
This is a statement of the theorem: • P(H|E) = P(E|H)P(H) / P(E)
Illustrating the Application of the Theorem with the Weather Example • The book does its example with all of the attributes at once • I will do this with one attribute and then generalize • I will use the Outlook attribute • Let H = (play = yes) • Let E = (outlook = sunny)
Then P(H|E) = P(play = yes | outlook = sunny) • By Bayes’ Theorem this equals • P(outlook = sunny | play = yes)P(play = yes) / P(outlook = sunny)
You can pull actual arithmetic values out of the summarized weather table to plug into the expression • P(E|H) = P(outlook = sunny | play = yes) = 2/9 • P(H) = P(play = yes) = 9/14 • P(E) = P(outlook = sunny) = 5/14 (5 of the 14 training instances are sunny)
P(H|E) = P(E|H)P(H) / P(E) = (2/9 * 9/14) / (5/14) = (1/7) / (5/14) = 2/5 = 0.4
Using the same approach, you can find P(H|E) where H = (play = no) • The arithmetic gives this result: (3/5 * 5/14) / (5/14) = 3/5 = 0.6 • Now consider, if the outlook is sunny, play is either yes or no • This is the universe of choices
That means the two probabilities must sum to 1, and indeed 0.4 + 0.6 = 1 • In practice you can skip computing P(E) altogether: compute only the numerators P(E|H)P(H) for each outcome and normalize them so that they sum to 1; the result is the same • The grand conclusion is this: • If the outlook is sunny it is 50% more likely that play = no than that play = yes
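The same one-attribute computation, written out as a short Python sketch working directly from the counts in the weather table and normalizing the numerators rather than computing P(E) explicitly:

```python
# Counts from the weather training set (14 instances: 9 yes, 5 no)
counts = {"yes": 9, "no": 5}
sunny = {"yes": 2, "no": 3}   # instances with outlook = sunny, per class
total = sum(counts.values())

# Unnormalized scores: P(outlook = sunny | class) * P(class)
scores = {c: (sunny[c] / counts[c]) * (counts[c] / total) for c in counts}

# Normalize so the two probabilities sum to 1
norm = sum(scores.values())
probs = {c: s / norm for c, s in scores.items()}
print(probs)   # {'yes': 0.4, 'no': 0.6} (approximately)
```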
What Does This Have to Do with Classification? • The preceding example illustrated Bayes’ Theorem applied to one attribute • A Bayesian expression (conditional probability) can be derived including evidence, Ei, for each of the i attributes • The totality of evidence, including all attributes could be noted as E
For any new instance there will be values for the i attributes • The fractions from the weather table corresponding to the instance’s attribute values can be plugged into the Bayesian expression • The result would be a probability, or prediction that play = yes or play = no for that set of attribute values
Statistical Independence • Recall that Naïve Bayes assumed statistical independence of the attributes • Stated simply, one of the results of statistics is that the probability of two independent events’ both occurring is the product of their individual probabilities • This is reflected when forming the expression for the general case
The Example with Four Attributes • The weather data had four attributes, outlook, temperature, humidity, windy • Let E be the composite of E1, E2, E3, and E4 • In other words, an instance has values for each of these attributes, and fractional values for each of the attributes can be read from the table based on the training set
Bayes’ Theorem extended to this case looks like this: • P(H|E) = P(E1|H)P(E2|H)P(E3|H)P(E4|H)P(H) / P(E)
You can compute the expressions for play = yes and play = no as illustrated in the simple, one-attribute case • You can normalize and compare them • If they are not equal, the larger value predicts the classification for the new instance
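A hedged sketch of the full Naïve Bayes prediction for a new instance, assuming the training counts have already been tabulated per class as in Table 4.2; the data-structure layout and function name are my own, not the book's.

```python
def naive_bayes_predict(instance, class_counts, value_counts):
    """Predict class probabilities for `instance` (a dict of attribute -> value).

    class_counts: {class: number of training instances with that class}
    value_counts: {attribute: {class: {value: count}}}
    Returns normalized probabilities per class.
    """
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                                       # P(H)
        for attr, val in instance.items():
            score *= value_counts[attr][cls].get(val, 0) / n_cls    # P(Ei | H)
        scores[cls] = score
    norm = sum(scores.values()) or 1.0   # guard against all-zero scores
    return {cls: s / norm for cls, s in scores.items()}
```

The class with the larger normalized value is the prediction. Note the .get(val, 0): an attribute value never seen with a class drives that class's whole score to zero, which is exactly the problem taken up on the next slide.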
A Small Problem • If one of the probabilities in the numerator is 0, the whole numerator goes to 0 • This would happen when the training set did not contain instances with a particular set of values, but a new instance did • You can’t compare yes/no probabilities if they have gone to 0
The solution is to add constants to the top and bottom of fractions in the expression • This can be accomplished without changing the relative yes/no outcome • I don’t propose to go into this in detail (now) • To me it seems more appropriate for an advanced discussion later, if needed
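For what it's worth, I believe the adjustment the text is alluding to is the Laplace (additive) correction: add a small constant to every count so that no conditional probability is ever exactly zero. A minimal sketch, with mu as an illustrative smoothing parameter:

```python
def smoothed_probability(value_count, class_count, num_values, mu=1.0):
    """Estimate P(value | class) with an additive correction.

    Adding mu to the numerator and mu * num_values to the denominator
    keeps every estimated probability strictly positive, so a single
    unseen attribute value no longer drives the whole product to zero.
    """
    return (value_count + mu) / (class_count + mu * num_values)

# Example: outlook = overcast never occurs with play = no in the training set
print(smoothed_probability(0, 5, 3))   # 1/8 = 0.125 instead of 0
```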
Missing Values and Numeric Attributes • Missing values are handled easily under this scheme • They don’t become unwanted 0’s in a product • If an attribute value is missing, a fraction for its conditional probability is simply not included in the computation • When normalizing yes/no results, the missing fraction in each balances out
Handling numeric values involves a bit more work, but it’s straightforward • For the purposes of illustration, assume that the distribution of numeric attributes is normal • In the summary of the training set, instead of forming fractions of occurrences as for nominal values, find the mean and standard deviation of the numeric ones
When a new instance arrives with a particular value for a numeric attribute, plug it into the normal probability density function
Plugging the attribute value into the p.d.f. will give a probability value • Now you can plug this into the Bayesian expression just like the fractions for the nominal attributes • In case it wasn’t apparent, I make no pretense to justifying any of this • Hopefully you can follow along, just like I do, based on the statistics classes you’ve taken
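A sketch of the numeric case under the normality assumption: estimate the mean and standard deviation of the attribute separately for each class from the training set, then use the normal density at the new value as that attribute's likelihood factor. The function names and the sample values below are illustrative, not from the text.

```python
import math

def normal_pdf(x, mean, std):
    """Normal probability density function evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def numeric_likelihood(x, class_values):
    """Likelihood factor for a numeric attribute value, assuming normality.

    class_values: the values of this attribute among training instances of
    one class; the sample mean and standard deviation are estimated from them.
    """
    mean = sum(class_values) / len(class_values)
    var = sum((v - mean) ** 2 for v in class_values) / (len(class_values) - 1)
    return normal_pdf(x, mean, math.sqrt(var))

# Illustrative temperature readings for one class, then the density at 66
print(numeric_likelihood(66, [83, 70, 68, 64, 69, 75, 75, 72, 81]))
```

Strictly speaking the value returned is a density, not a probability, but it plays the same role in the product as the fractions for the nominal attributes.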
Naïve Bayes for Document Classification • The details for this appear in a box in the text, which means it’s advanced and not to be covered in detail in this course • The basic idea is that documents can be classified by which words appear in them • The occurrence of a word can be modeled as a Boolean yes/no
The classification can be improved if the frequency of words is also taken into account • This is the barest introduction to the topic • It may come up again later in the book
Discussion • Like 1R, Naïve Bayes can often produce good results • The rule of thumb remains, start simple • It is true that dependent attributes will mess up a Bayesian analysis
The presence of dependent attributes means multiple factors for the same underlying feature in the Bayesian expression • The solution is to try to select a subset of independent attributes to work with as part of preprocessing
If a numeric attribute is not normally distributed, then the normal p.d.f. shouldn’t be used, since normality was only an assumption • However, if the attribute does follow some other known distribution, you can use that distribution’s p.d.f. instead • As long as we’re simplifying, the uniform distribution might be a useful starting point for an analysis
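If one did fall back on a uniform model, the only thing that changes is which density supplies the likelihood factor; a small sketch under that stated simplification (names and range are illustrative):

```python
def uniform_pdf(x, low, high):
    """Uniform density over [low, high]; zero outside that range.

    A value outside the observed range gets zero density, which would
    reintroduce the zero-product problem, so in practice the range would
    be widened or a small floor applied.
    """
    return 1.0 / (high - low) if low <= x <= high else 0.0

# Used in place of the normal p.d.f. when estimating a numeric likelihood factor
print(uniform_pdf(66, low=60, high=90))   # 1/30
```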
4.3 Divide-and-Conquer: Constructing Decision Trees • Note that like everything else in this course, this is a purely pragmatic presentation • Ideas will be given • Nothing will be proven • The book gives things in a certain order • I will try to cover pretty much the same things • I will do it in a different order
When Forming a Tree… • 1. The fundamental question is always which attribute to split on • 2. Suppose you can come up with a function, the information (info) function • This function is a measure of how much information is needed in order to make a decision at each node in a tree
3. You split on the attribute that gives the greatest information gain from level to level • 4. A split is good if it means that little information will be needed at the next level down • You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
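A minimal sketch of the info function (entropy) and of the gain computation just described, for a split on a nominal attribute; the function names are mine, not the book's.

```python
import math
from collections import Counter, defaultdict

def info(class_labels):
    """Entropy, in bits: the information needed to classify an instance
    drawn from a node with this class distribution."""
    total = len(class_labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(class_labels).values())

def gain(instances, attr, class_attr):
    """Information gain from splitting the node on attr: info at the current
    node minus the weighted average info of the child nodes of the split."""
    parent_labels = [inst[class_attr] for inst in instances]
    branches = defaultdict(list)
    for inst in instances:
        branches[inst[attr]].append(inst[class_attr])
    remaining = sum(len(labels) / len(instances) * info(labels)
                    for labels in branches.values())
    return info(parent_labels) - remaining

# The attribute chosen for the split is the one with the largest gain:
# best = max(attributes, key=lambda a: gain(instances, a, class_attr))
```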