Data Mining – Output: Knowledge Representation. Chapter 3.
Representing Structural Patterns • There are many different ways of representing patterns • Two were covered in Chapter 1 – decision trees and classification rules • A learned pattern is a form of “knowledge representation” (even if the knowledge does not seem very impressive)
Decision Trees • Make decisions by following branches down the tree until a leaf is reached • Classification is based on the contents of the leaf • A non-leaf node usually tests a single attribute • Usually one branch per value of a nominal attribute, or per range of a numeric attribute (most commonly a two-way split: above versus below some threshold value) • Less commonly, a node compares two attribute values, or tests some function of multiple attributes • An attribute, once used, is commonly not tested again lower on the same branch
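The branch-following procedure on this slide can be sketched in a few lines of Python. The nested-dictionary layout, the `classify` helper, and the small weather-style tree are illustrative assumptions, not taken from the slides or the book.

```python
# Hypothetical tree: internal nodes test one attribute; leaves are class labels.
tree = {
    "attribute": "outlook",            # nominal test: one branch per value
    "branches": {
        "sunny": {
            "attribute": "humidity",   # numeric test: two-way split
            "threshold": 75,
            "low": "yes",              # humidity <= 75
            "high": "no",              # humidity > 75
        },
        "overcast": "yes",
        "rainy": "no",
    },
}

def classify(node, instance):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    while not isinstance(node, str):
        value = instance[node["attribute"]]
        if "threshold" in node:
            node = node["low"] if value <= node["threshold"] else node["high"]
        else:
            node = node["branches"][value]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": 70}))
print(classify(tree, {"outlook": "rainy", "humidity": 90}))
```

Each pass through the loop tests exactly one attribute, which mirrors the single-attribute tests described above.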
Decision Trees • Missing Values • May be treated as another possible value of a nominal attribute – if the fact that data is missing may mean something • May follow the most popular branch when a value is missing from a test instance • More complicated approach – rather than going all-or-nothing, can ‘split’ the test instance in proportion to the popularity of the branches in the training data – recombination at the end uses a vote based on the weights
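The ‘split the instance’ idea can be illustrated with a one-level tree. This is a minimal sketch that assumes branch popularity is measured by training-instance counts; the counts, the leaf classes, and the `class_weights` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical one-level tree on a nominal attribute. Each branch records how
# many training instances went down it and the class its leaf predicts.
branches = {
    "sunny":    {"train_count": 5, "leaf_class": "no"},
    "overcast": {"train_count": 4, "leaf_class": "yes"},
    "rainy":    {"train_count": 5, "leaf_class": "yes"},
}

def class_weights(value):
    """Return class -> weight; a missing value (None) is split across branches."""
    votes = defaultdict(float)
    if value is not None:
        # Known value: the whole instance follows a single branch.
        votes[branches[value]["leaf_class"]] += 1.0
        return dict(votes)
    total = sum(b["train_count"] for b in branches.values())
    for b in branches.values():
        # Missing value: send a fraction of the instance down every branch.
        votes[b["leaf_class"]] += b["train_count"] / total
    return dict(votes)

weights = class_weights(None)
prediction = max(weights, key=weights.get)   # weighted vote at recombination
print(weights, prediction)
```

In a deeper tree the fractional weights would be carried down through further tests and summed per class at the leaves.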
Classification Rules • Popular alternative to decision trees • LHS / antecedent / precondition – tests that determine whether the rule is applicable • Tests are usually ANDed together • Could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained • RHS / consequent / conclusion – the answer – usually the class (but could be a probability distribution) • Rules with the same conclusion essentially represent an OR • Rules may be an ordered set, or independent • If independent, a policy may be needed for when more than one rule matches (conflict resolution strategy) or when no rule matches
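A minimal sketch of an ordered rule set, assuming the simplest policy: the first matching rule wins and a default class covers the no-match case. The rules, the attribute values, and the names `matches`, `classify_ordered`, and `DEFAULT_CLASS` are hypothetical.

```python
# Each rule: (antecedent tests, ANDed together; consequent class).
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
]
DEFAULT_CLASS = "yes"   # assumed policy for instances that no rule matches

def matches(antecedent, instance):
    """True when every test in the antecedent holds for the instance."""
    return all(instance.get(attr) == val for attr, val in antecedent.items())

def classify_ordered(instance):
    """Ordered interpretation: scan rules top to bottom, first match wins."""
    for antecedent, consequent in rules:
        if matches(antecedent, instance):
            return consequent
    return DEFAULT_CLASS

print(classify_ordered({"outlook": "sunny", "humidity": "high", "windy": False}))
print(classify_ordered({"outlook": "overcast", "humidity": "high", "windy": True}))
```

Under the independent interpretation, the loop would instead collect every matching rule and apply a conflict resolution strategy to the collected conclusions.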
Rules / Trees • Rules can easily be read off a tree – but the result is not necessarily the simplest set of rules • Transforming rules into a tree is not straightforward (see the “replicated subtree” problem – next two slides) • In many cases rules are more compact than trees – particularly if a default rule is possible • Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if rules are an ordered set, they are much more complicated than they appear
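Reading rules off a tree amounts to emitting one rule per leaf, with the tests along the path ANDed together. The sketch below assumes a nested-dictionary tree; the tree itself, the class label "other", and the `tree_to_rules` helper are illustrative.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) rule for every leaf of the tree."""
    if isinstance(node, str):                      # leaf: emit the path so far
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

# Hypothetical tree for "if a and b then x", with a made-up fallback class.
tree = {
    "attribute": "a",
    "branches": {
        "yes": {"attribute": "b", "branches": {"yes": "x", "no": "other"}},
        "no": "other",
    },
}

for conditions, label in tree_to_rules(tree):
    tests = " and ".join(f"{attr}={val}" for attr, val in conditions)
    print(f"If {tests} then {label}")
```

The output has three rules where a single rule plus a default would do, which illustrates why rules read directly off a tree are rarely the most compact set.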
Figure 3.1 Decision tree for a simple disjunction. If a and b then x If c and d then x
Figure 3.3 Decision tree with a replicated subtree. If x=1 and y=1 then class = a If z=1 and w=1 then class = a Otherwise class = b Each gray triangle actually contains the whole gray subtree below
Association Rules • Association rules are not intended to be used together as a set – their value lies in the knowledge each rule conveys, and the rules are probably not applied automatically • Large numbers of possible rules
Association Rule Evaluation • Coverage – the number of instances that the rule predicts correctly – also called support • Accuracy – the number of instances predicted correctly, expressed as a proportion of all instances the rule applies to – also called confidence • Coverage is sometimes expressed as a percentage of the total number of instances • Usually methods or users specify minimum coverage and accuracy for rules to be generated • Some possible rules imply others – present only the strongest supported
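Coverage and accuracy as defined on this slide can be computed directly. The tiny dataset and the `evaluate_rule` helper below are assumptions for illustration; they are not the weather data used in the Apriori example.

```python
def evaluate_rule(instances, antecedent, consequent):
    """Return (coverage, accuracy) for the rule antecedent ==> consequent."""
    applies = [x for x in instances
               if all(x[a] == v for a, v in antecedent.items())]
    correct = [x for x in applies
               if all(x[a] == v for a, v in consequent.items())]
    coverage = len(correct)                               # support
    accuracy = coverage / len(applies) if applies else 0.0  # confidence
    return coverage, accuracy

# Hypothetical instances.
data = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "high"},
    {"temperature": "hot",  "humidity": "high"},
    {"temperature": "mild", "humidity": "normal"},
]

coverage, accuracy = evaluate_rule(
    data, {"temperature": "cool"}, {"humidity": "normal"})
print(coverage, accuracy)
```

Dividing `coverage` by `len(data)` gives coverage as a fraction of all instances, the form used by thresholds such as a minimum support of 0.15.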
Example – My Weather – Apriori Algorithm
Apriori
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Best rules found:
 1. outlook=rainy 5 ==> play=no 5 conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
 3. temperature=hot windy=FALSE 3 ==> play=no 3 conf:(1)
 4. temperature=hot play=no 3 ==> windy=FALSE 3 conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=no 3 conf:(1)
 6. outlook=rainy humidity=normal 3 ==> play=no 3 conf:(1)
 7. outlook=rainy temperature=mild 3 ==> play=no 3 conf:(1)
 8. temperature=mild play=no 3 ==> outlook=rainy 3 conf:(1)
 9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2 conf:(1)
10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2 conf:(1)
Rules with Exceptions • Skip
Rules involving Relations • Relations between attributes, not only the values of individual attributes, may be important • See book example on next slide
Figure 3.6 The shapes problem. Shaded: standing Unshaded: lying
More Complicated – Winston’s Blocks World • House – a 3-sided block and a 4-sided block, AND the 3-sided block is on top of the 4-sided block • Solutions frequently involve learning rules that include variables/parameters • E.g. 3sided(block1) & 4sided(block2) & ontopof(block1,block2) ==> house
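A rule with variables has to be matched against every binding of blocks to those variables. The sketch below tries all ordered pairs; the block names, the side counts, and the `find_houses` helper are hypothetical.

```python
from itertools import permutations

# Hypothetical scene: number of sides per block, plus "on top of" facts.
sides = {"block1": 3, "block2": 4, "block3": 4}
on_top_of = {("block1", "block2")}     # block1 sits on block2

def find_houses(sides, on_top_of):
    """Bindings (top, bottom) satisfying 3sided & 4sided & ontopof."""
    return [
        (top, bottom)
        for top, bottom in permutations(sides, 2)
        if sides[top] == 3 and sides[bottom] == 4 and (top, bottom) in on_top_of
    ]

print(find_houses(sides, on_top_of))
```

The rule is stated once, yet it applies to any pair of blocks, which is what a purely attribute-value rule cannot express.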
Easier and Sometimes Useful • Introduce new attributes during data preparation • New attribute represents relationship • E.g. for the standing / lying task could introduce new boolean attribute: widthgreater? which would be filled in for each instance during data prep • E.g. in numeric weather, could introduce “WindChill” based on calculations from temperature and wind speed (if numeric) or “Heat Index” based on temperature and humidity
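Adding a relational attribute during data preparation can be as simple as the sketch below. The `width` and `height` field names and the sample values are assumptions; only the `widthgreater` idea comes from the slide.

```python
# Hypothetical raw instances for the standing / lying task.
instances = [
    {"width": 2, "height": 6},
    {"width": 7, "height": 3},
]

def add_width_greater(instance):
    """Return a copy of the instance with the derived boolean attribute."""
    return {**instance, "widthgreater": instance["width"] > instance["height"]}

prepared = [add_width_greater(x) for x in instances]
for row in prepared:
    print(row)
```

An ordinary attribute-value learner can then test `widthgreater` directly instead of having to learn a relation between two numeric attributes.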
Numeric Prediction • Standard for comparison for numeric prediction is the statistical technique of regression • E.g. for the CPU performance data the regression equation below was derived PRP = - 56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
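Applying the regression equation is a weighted sum plus an intercept. In the sketch below the coefficients are the ones on the slide, while the machine's attribute values and the `predict_prp` helper are made up for illustration.

```python
# Coefficients copied from the regression equation above.
INTERCEPT = -56.1
COEFFS = {
    "MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
    "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46,
}

def predict_prp(machine):
    """Predicted performance: intercept plus the weighted attribute values."""
    return INTERCEPT + sum(COEFFS[name] * machine[name] for name in COEFFS)

# Hypothetical machine description (values are not from the dataset).
machine = {"MYCT": 100, "MMIN": 1000, "MMAX": 8000,
           "CACH": 16, "CHMIN": 2, "CHMAX": 8}
print(round(predict_prp(machine), 2))
```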
Trees for Numeric Prediction • Tree branches as in a decision tree (may be based on ranges of attributes) • Regression Tree – leaf nodes contain the average of the training set values that the leaf applies to • Model Tree – leaf nodes contain regression equations for the instances that the leaf applies to
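The difference between the two leaf types can be shown on a single leaf. This is a sketch under simplifying assumptions: one input attribute, a plain least-squares line for the model-tree leaf, and made-up training values.

```python
from statistics import mean

# Hypothetical (attribute value, target value) pairs that reached one leaf.
leaf_data = [(1, 20.0), (2, 30.0), (3, 40.0)]

def regression_tree_leaf(data):
    """Regression tree: the leaf predicts the average training target."""
    average = mean(y for _, y in data)
    return lambda x: average

def model_tree_leaf(data):
    """Model tree: the leaf holds a linear model fitted by least squares."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

constant_leaf = regression_tree_leaf(leaf_data)
linear_leaf = model_tree_leaf(leaf_data)
print(constant_leaf(4), linear_leaf(4))
```

The constant leaf ignores the input once the instance has reached it, whereas the linear leaf still uses the attribute value to adjust the prediction.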
Figure 3.7(b) Models for the CPU performance data: regression tree.
Figure 3.7(c) Models for the CPU performance data: model tree.
Instance Based Representation • Concept not really represented (except via examples) • Real World Example – some radio stations don’t define what they play by words, they play promos basically saying “WXXX music is:” <songs> • Training examples are merely stored (kind of like “rote learning”) • Answers are given by finding the most similar training example(s) to test instance at testing time • Has been called “lazy learning” – no work until an answer is needed
Instance Based – Finding Most Similar Example • Nearest Neighbor – each new instance is compared to all stored training instances, with a “distance” calculated for each attribute of each instance • Class of the nearest neighbor instance is used as the prediction <see next slide and come back> • OR the K nearest neighbors vote, possibly with a weighted vote • Per-attribute distances are combined into one distance – city block or Euclidean (as the crow flies)
Nearest Neighbor • [Figure: scatter plot of training instances labeled x, y, and z, with a test instance T to be classified by its nearest neighbor]
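Nearest-neighbor prediction with either distance can be sketched as follows. The training points, the query, and the function names are illustrative assumptions; an unweighted majority vote is used for K > 1.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def city_block(a, b):
    """Sum of the per-attribute absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_predict(train, query, k=1, distance=euclidean):
    """Majority class among the k stored instances closest to the query."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stored training instances: (attribute vector, class).
train = [
    ((1.0, 1.0), "x"), ((1.5, 2.0), "x"),
    ((5.0, 5.0), "y"), ((6.0, 5.5), "y"),
    ((5.5, 1.0), "z"),
]
query = (2.0, 1.5)
print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=3, distance=city_block))
```

All of the work happens inside `knn_predict` at query time, which is the “lazy learning” behavior described earlier.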
Additional Details • Distance/ Similarity function must deal with binaries/nominals – usually by all or nothing match – but mild should be a better match to hot than cool is! • Distance / Similarity function is simpler if data is normalized in advance. E.g. $10 difference in household income is not significant, while 1.0 distance in GPA is big • Distance/Similarity function should weight different attributes differently – key task is determining those weights
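The three points on this slide (all-or-nothing nominal matching, normalization, and attribute weights) can be combined in one small sketch. Min-max scaling is one common normalization choice assumed here; the attribute names, sample values, and weights are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def mixed_distance(a, b, weights):
    """Weighted city-block distance over nominal and normalized numeric attributes."""
    total = 0.0
    for attr, weight in weights.items():
        if isinstance(a[attr], str):
            diff = 0.0 if a[attr] == b[attr] else 1.0   # all-or-nothing match
        else:
            diff = abs(a[attr] - b[attr])
        total += weight * diff
    return total

incomes = min_max_normalize([30000, 30010, 90000])
gpas = min_max_normalize([2.0, 3.0, 4.0])
people = [
    {"income": inc, "gpa": gpa, "outlook": out}
    for inc, gpa, out in zip(incomes, gpas, ["sunny", "sunny", "rainy"])
]
weights = {"income": 1.0, "gpa": 1.0, "outlook": 0.5}   # assumed weights
print(mixed_distance(people[0], people[1], weights))
```

After scaling, the $10 income gap contributes almost nothing to the distance, while the one-point GPA gap contributes 0.5.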
Further Wrinkles • May not need to save all instances • Very typical instances may not all need to be saved • Some approaches actually do some generalization
But … • Not really a structural pattern that can be pointed to • However, many people in many tasks/domains will respect arguments based on “previous cases” (diagnosis and law among them) • Book points out that instances + distance metric combine to form class boundaries • With 2 attributes, these can actually be envisioned <see next slide>
Figure 3.8 Different ways of partitioning the instance space. (a) (b) (c) (d)
Clustering • Clusters may be able to be represented graphically • If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters • Show Weka – do njcrimenominal with EM and then do visualization of results • Some algorithms associate instances with clusters probabilistically – for every instance, list the probability of membership in each of the clusters • Some algorithms produce a hierarchy of clusters, and these can be visualized using a tree diagram • After clustering, clusters may be used as the class for classification
Figure 3.9 Different ways of representing clusters. (a) (b) (c) (d)
Panel (c), probability of membership of each instance in each cluster:
      1     2     3
a    0.4   0.1   0.5
b    0.1   0.8   0.1
c    0.3   0.3   0.4
d    0.1   0.1   0.8
e    0.4   0.2   0.4
f    0.1   0.4   0.5
g    0.7   0.2   0.1
h    0.5   0.4   0.1
…
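A probabilistic membership table like the one in Figure 3.9(c) can be reduced to a hard assignment by taking the most probable cluster for each instance. The probabilities below are the ones listed in the figure; the `most_probable_cluster` helper and the tie-breaking choice (lowest cluster number wins) are assumptions.

```python
# Membership probabilities for clusters 1, 2, 3, as listed in Figure 3.9(c).
membership = {
    "a": [0.4, 0.1, 0.5], "b": [0.1, 0.8, 0.1],
    "c": [0.3, 0.3, 0.4], "d": [0.1, 0.1, 0.8],
    "e": [0.4, 0.2, 0.4], "f": [0.1, 0.4, 0.5],
    "g": [0.7, 0.2, 0.1], "h": [0.5, 0.4, 0.1],
}

def most_probable_cluster(probabilities):
    """1-based index of the largest probability; ties go to the lowest index."""
    return probabilities.index(max(probabilities)) + 1

hard_assignment = {name: most_probable_cluster(p)
                   for name, p in membership.items()}
print(hard_assignment)
```

Instance e is split evenly between clusters 1 and 3, which shows the information a hard assignment discards. The resulting labels could then serve as the class attribute for a classifier, as the Clustering slide suggests.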