
Data Mining – Output: Knowledge Representation



  1. Data Mining – Output: Knowledge Representation Chapter 3

  2. Representing Structural Patterns • There are many different ways of representing patterns • Two were covered in Chapter 1 – decision trees and classification rules • A learned pattern is a form of “knowledge representation” (even if the knowledge does not seem very impressive)

  3. Decision Trees • Make decisions by following branches down the tree until a leaf is reached • Classification is based on the contents of the leaf • Non-leaf nodes usually involve testing a single attribute • Usually a branch for each value of a nominal attribute, or for a range of a numeric attribute (most commonly a two-way split: < some value vs ≥ that value) • Less commonly, compare two attribute values, or some function of multiple attributes • Once an attribute has been tested, it is commonly not tested again lower in the same branch
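
The branch-following procedure above can be sketched in a few lines of Python. The tree structure and the attribute names (outlook, windy) are illustrative, not from the book:

```python
# Minimal sketch of classifying by walking a decision tree.
# A non-leaf node is a (attribute, branches) pair; a leaf is a class label.

def classify(node, instance):
    """Follow branches until a leaf (a plain string) is reached."""
    while not isinstance(node, str):          # non-leaf node
        attribute, branches = node
        node = branches[instance[attribute]]  # take the branch for this value
    return node

# Toy tree: test 'outlook' first, then 'windy' under 'rainy'.
tree = ("outlook", {
    "sunny": "no",
    "overcast": "yes",
    "rainy": ("windy", {"FALSE": "yes", "TRUE": "no"}),
})

print(classify(tree, {"outlook": "rainy", "windy": "FALSE"}))  # yes
```

Note that the 'outlook' attribute, once tested at the root, is never tested again below it.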

  4. Decision Trees • Missing Values • May be treated as another possible value of a nominal attribute – appropriate if a missing value may mean something • May follow the most popular branch when a value is missing from a test instance • More complicated approach – rather than going all-or-nothing, can ‘split’ the test instance in proportion to the popularity of the branches among the training data – recombination at the end uses a vote based on the weights
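
The ‘split the instance’ idea can be sketched as follows; for simplicity every branch leads directly to a leaf, and the branch counts are made-up numbers:

```python
# Sketch of handling a missing test value by sending weighted fractions
# of the instance down every branch, then voting by total weight.

def classify_missing(branches, branch_counts):
    """branches: branch name -> predicted class (leaves, for simplicity).
    branch_counts: branch name -> number of training instances reaching it."""
    total = sum(branch_counts.values())
    votes = {}
    for name, klass in branches.items():
        weight = branch_counts[name] / total   # fraction sent down this branch
        votes[klass] = votes.get(klass, 0.0) + weight
    return max(votes, key=votes.get)

branches = {"sunny": "no", "overcast": "yes", "rainy": "yes"}
counts = {"sunny": 5, "overcast": 4, "rainy": 5}
print(classify_missing(branches, counts))  # 'yes' wins with weight 9/14
```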

  5. Classification Rules • Popular alternative to decision trees • LHS / antecedent / precondition – tests that determine whether the rule is applicable • Tests are usually ANDed together • Could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained • RHS / consequent / conclusion – the answer – usually the class (but could be a probability distribution) • Rules with the same conclusion essentially represent an OR • Rules may form an ordered set, or be independent • If independent, a policy may be needed for when more than one rule matches (a conflict resolution strategy) or when no rule matches

  6. Rules / Trees • Rules can easily be created from a tree – but not the simplest set of rules • Transforming rules into a tree is not straightforward (see the “replicated subtree” problem – next two slides) • In many cases rules are more compact than trees – particularly if a default rule is possible • Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if the rules are an ordered set, they are much more complicated than they appear

  7. Figure 3.1 Decision tree for a simple disjunction. If a and b then x; if c and d then x

  8. Figure 3.3 Decision tree with a replicated subtree. If x=1 and y=1 then class = a; if z=1 and w=1 then class = a; otherwise class = b. Each gray triangle actually contains the whole gray subtree below

  9. Association Rules • Association rules are not intended to be used together as a set – in fact the value is in the knowledge itself – there is probably no automatic use of the rules • Large numbers of possible rules

  10. Association Rule Evaluation • Coverage – the number of instances the rule predicts correctly – also called support • Accuracy – the proportion of instances to which the rule applies that it predicts correctly – also called confidence • Coverage is sometimes expressed as a percentage of the total number of instances • Usually the method or user specifies a minimum coverage and accuracy for rules to be generated • Some possible rules imply others – present only the most strongly supported
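
Support and confidence for a single rule can be computed directly; the toy dataset below is invented to illustrate the rule "if outlook=rainy then play=no":

```python
# Sketch computing coverage (support) and accuracy (confidence) of the
# rule "if outlook=rainy then play=no" over a toy dataset.

data = [
    {"outlook": "rainy", "play": "no"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
]

applies = [d for d in data if d["outlook"] == "rainy"]    # antecedent matches
correct = [d for d in applies if d["play"] == "no"]       # conclusion also holds

support = len(correct)                  # instances predicted correctly
confidence = len(correct) / len(applies)  # proportion correct where rule applies
print(support, round(confidence, 2))    # 2 0.67
```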

  11. Example – My Weather – Apriori Algorithm
  Apriori
  Minimum support: 0.15
  Minimum metric <confidence>: 0.9
  Number of cycles performed: 17
  Best rules found:
  1. outlook=rainy 5 ==> play=no 5 conf:(1)
  2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
  3. temperature=hot windy=FALSE 3 ==> play=no 3 conf:(1)
  4. temperature=hot play=no 3 ==> windy=FALSE 3 conf:(1)
  5. outlook=rainy windy=FALSE 3 ==> play=no 3 conf:(1)
  6. outlook=rainy humidity=normal 3 ==> play=no 3 conf:(1)
  7. outlook=rainy temperature=mild 3 ==> play=no 3 conf:(1)
  8. temperature=mild play=no 3 ==> outlook=rainy 3 conf:(1)
  9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2 conf:(1)
  10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2 conf:(1)

  12. Rules with Exceptions • Skip

  13. Rules Involving Relations • More than the values of individual attributes may be important – relations between attributes can matter • See the book example on the next slide

  14. Figure 3.6 The shapes problem. Shaded: standing Unshaded: lying

  15. More Complicated – Winston’s Blocks World • House – a 3-sided block & a 4-sided block AND the 3-sided block is on top of the 4-sided block • Solutions frequently involve learning rules that include variables/parameters • E.g. 3sided(block1) & 4sided(block2) & ontopof(block1,block2) → house
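
A rule with variables like the one above can be sketched as a predicate over named blocks; the block facts ("roof", "base") are invented for illustration:

```python
# Sketch of a relational rule with variables, as in Winston's blocks world:
# house(b1, b2) <- 3sided(b1) & 4sided(b2) & ontopof(b1, b2)

blocks = {"roof": {"sides": 3}, "base": {"sides": 4}}  # made-up facts
on_top_of = {("roof", "base")}                          # (above, below) pairs

def is_house(b1, b2):
    """True when b1 is 3-sided, b2 is 4-sided, and b1 sits on b2."""
    return (blocks[b1]["sides"] == 3 and
            blocks[b2]["sides"] == 4 and
            (b1, b2) in on_top_of)

print(is_house("roof", "base"))  # True
```

The rule holds for any pair of blocks satisfying the tests, which is exactly what the variables buy over attribute-value rules.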

  16. Easier and Sometimes Useful • Introduce new attributes during data preparation • New attribute represents relationship • E.g. for the standing / lying task could introduce new boolean attribute: widthgreater? which would be filled in for each instance during data prep • E.g. in numeric weather, could introduce “WindChill” based on calculations from temperature and wind speed (if numeric) or “Heat Index” based on temperature and humidity
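
The data-preparation trick described above can be sketched directly; the column names and values are illustrative:

```python
# Sketch of adding a new boolean attribute during data preparation:
# 'widthgreater' captures the width-vs-height relation for each instance,
# so an attribute-value learner can use the relation without variables.

instances = [
    {"width": 4, "height": 2, "class": "lying"},
    {"width": 1, "height": 5, "class": "standing"},
]

for inst in instances:
    inst["widthgreater"] = inst["width"] > inst["height"]

print([inst["widthgreater"] for inst in instances])  # [True, False]
```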

  17. Numeric Prediction • Standard for comparison for numeric prediction is the statistical technique of regression • E.g. for the CPU performance data the regression equation below was derived PRP = - 56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
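
Evaluating a regression equation is just a weighted sum plus an intercept. The sketch below plugs made-up attribute values into the CPU-performance equation from the slide:

```python
# Sketch evaluating the CPU-performance regression equation:
# PRP = -56.1 + 0.049*MYCT + 0.015*MMIN + 0.006*MMAX
#       + 0.630*CACH - 0.270*CHMIN + 1.46*CHMAX

coef = {"MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
        "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46}
intercept = -56.1

def predict_prp(x):
    """Predicted relative performance for one instance."""
    return intercept + sum(coef[a] * x[a] for a in coef)

# Illustrative attribute values (not from the dataset):
x = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
     "CACH": 16, "CHMIN": 4, "CHMAX": 8}
print(round(predict_prp(x), 1))
```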

  18. Trees for Numeric Prediction • Branches as in a decision tree (may be based on ranges of attributes) • Regression Tree – leaf nodes contain the average of the training set values that the leaf applies to • Model Tree – leaf nodes contain regression equations for the instances that the leaf applies to

  19. Figure 3.7(b) Models for the CPU performance data: regression tree.

  20. Figure 3.7(c) Models for the CPU performance data: model tree.

  21. Instance Based Representation • Concept not really represented (except via examples) • Real World Example – some radio stations don’t define what they play by words, they play promos basically saying “WXXX music is:” <songs> • Training examples are merely stored (kind of like “rote learning”) • Answers are given by finding the most similar training example(s) to test instance at testing time • Has been called “lazy learning” – no work until an answer is needed

  22. Instance Based – Finding Most Similar Example • Nearest Neighbor – each new instance is compared to all other instances, with a “distance” calculated for each attribute for each instance • Class of nearest neighbor instance is used as the prediction <see next slide and come back> • OR K-nearest neighbors vote, or weighted vote • Combination of distances – city block or euclidean (crow flies)
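
The two distance combinations and the k-nearest-neighbor vote can be sketched as follows; the training points are illustrative:

```python
# Sketch of k-nearest-neighbor prediction with Euclidean ("crow flies")
# and city-block (Manhattan) distance functions.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn(train, test, k=3, dist=euclidean):
    """train: list of (point, class) pairs. Majority vote of the k nearest."""
    nearest = sorted(train, key=lambda pc: dist(pc[0], test))[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

train = [((1, 1), "x"), ((1, 2), "x"), ((2, 1), "x"),
         ((5, 5), "y"), ((6, 5), "y")]
print(knn(train, (1.5, 1.5)))  # x
```

With k=1 this is plain nearest neighbor; a weighted vote would scale each neighbor's vote by, e.g., the inverse of its distance.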

  23. Nearest Neighbor <slide shows a scatter of training instances labeled x, y, and z surrounding a test instance T; T takes the class of its nearest neighbor(s)>

  24. Additional Details • The distance/similarity function must deal with binary/nominal attributes – usually by an all-or-nothing match – but mild should be a better match to hot than cool is! • The distance/similarity function is simpler if data is normalized in advance – e.g. a $10 difference in household income is not significant, while a 1.0 distance in GPA is big • The distance/similarity function should weight different attributes differently – a key task is determining those weights
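
Min-max normalization, one common way to put attributes on comparable scales before computing distances, can be sketched as:

```python
# Sketch of min-max normalization: rescale each attribute to [0, 1] so
# income (tens of thousands) and GPA (0-4) contribute comparably
# to a distance calculation.

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, 50000, 90000]   # illustrative values
gpas = [2.0, 3.0, 4.0]
print(normalize(gpas))            # [0.0, 0.5, 1.0]
```

After normalization, explicit attribute weights can still be applied on top to reflect relative importance.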

  25. Further Wrinkles • May not need to save all instances • Very normal instances may not all need to be saved • Some approaches actually do some generalization

  26. But … • Not really a structural pattern that can be pointed to • However, many people in many tasks/domains will respect arguments based on “previous cases” (diagnosis and law among them) • The book points out that the instances + distance metric combine to form class boundaries • With 2 attributes, these can actually be visualized <see next slide>

  27. Figure 3.8 Different ways of partitioning the instance space. (a) (b) (c) (d)

  28. Clustering • Clusters may be representable graphically • If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters • Show Weka – do njcrimenominal with EM and then do visualization of results • Some algorithms associate instances with clusters probabilistically – for every instance, list the probability of membership in each of the clusters • Some algorithms produce a hierarchy of clusters, which can be visualized using a tree diagram • After clustering, the clusters may be used as classes for classification

  29. Figure 3.9 Different ways of representing clusters. (a) (b) (c) (d)
  Probabilistic representation – probability of membership in clusters 1–3:
        1     2     3
    a  0.4   0.1   0.5
    b  0.1   0.8   0.1
    c  0.3   0.3   0.4
    d  0.1   0.1   0.8
    e  0.4   0.2   0.4
    f  0.1   0.4   0.5
    g  0.7   0.2   0.1
    h  0.5   0.4   0.1
    …

  30. End Chapter 3
