Rule Learning Intelligent Systems – Lecture 10 Prof. D. Fensel & R. Krummenacher
Agenda • Motivation • Associative Rules • Decision Trees • Entropy • ID3 Algorithm • Refinement of Rule Sets • Generalization and Specialization • RELAX and JoJo Algorithms • Incremental Refinement of Rule Sets • Summary and Conclusions
Motivation • Data warehouses and other large-scale data collections provide vast amounts of data and hide knowledge that is not obvious to discover • Rule learning (association rule learning) is a popular means for discovering interesting relations between data sets. • Rules enable the inference of knowledge and make hidden facts explicit
Motivating Example • Association rule learning is very popular in marketing: • People buying Coke and beer are very likely to also buy potato chips: {Coke, beer} ⇒ {potato chips}. • Retail Advertising and Marketing Association: "For example, convenience stores realized that new fathers came in to buy Pampers and they also bought beer. So where are the Pampers now? Near the beer. It's about getting a feel for what other things you can merchandise." • Other popular application domains include, amongst others, Web usage mining, intrusion detection and bioinformatics.
Definition by Agrawal • Let I = {i1, i2, ..., in} be a set of binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions called the database. • Each transaction (tuple in the database) contains a subset of the items in I. • A rule is then defined to be an implication of the form A → C, where A, C ⊆ I and A ∩ C = ∅. • The left-hand side (A) is called the antecedent, the right-hand side (C) the consequent.
Simple Example • From Wikipedia (supermarket domain): • I = {milk, bread, butter, beer} • Example database • A possible rule would be {milk, bread} → butter • In reality, a rule typically needs the support of hundreds of transactions before it can be considered statistically significant
Significance of Rules • Association rule learning only makes sense in the context of very large data sets. • In very large data sets there are obviously hundreds if not thousands of discoverable implications. • The significance of and interest in a rule is therefore an important selection criterion, and only those rules that represent a bigger share of the whole can be considered relevant. • The support of an itemset A is defined as the proportion of transactions ti ∈ D that contain A.
Confidence • The confidence in a rule depends on the support of the itemset A and the support of the union of A and C. • Confidence: conf(A → C) = supp(A ⋃ C) / supp(A). • The confidence is an estimate of the conditional probability P(C|A); it indicates how likely the consequent is to hold when the antecedent is given.
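As an illustration, support and confidence can be computed directly from these definitions. The following sketch uses a small, purely hypothetical transaction database:

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"beer", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"beer", "milk"},
]

def supp(itemset, transactions):
    """supp(A): proportion of transactions that contain every item of A."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    """conf(A -> C) = supp(A u C) / supp(A)."""
    return (supp(set(antecedent) | set(consequent), transactions)
            / supp(antecedent, transactions))

print(supp({"milk", "bread"}, transactions))              # 0.6
print(conf({"milk", "bread"}, {"butter"}, transactions))  # 0.666...
```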
Finding Rules • Association rules need to satisfy a given level of confidence and a given degree of support at the same time. • A two-step process is generally applied to discover rules that satisfy both requirements: • A minimal support threshold is used to determine the frequent itemsets • A minimum confidence threshold is applied to determine the rules. • The first step is significantly more challenging than the second one!
Finding Frequent Itemsets • The number of possible itemsets is given by the power set of I and is thus equal to 2^n – 1, with n = | I |. • Consequently, the number of potential itemsets grows exponentially in the size of I. • Different algorithms nonetheless allow the frequent itemsets to be computed efficiently: Apriori (BFS), Eclat (DFS), Frequent Pattern-Growth (FP-tree). • These algorithms exploit the downward-closure property (aka anti-monotonicity): every subset of a frequent itemset is frequent, and consequently every superset of an infrequent itemset is infrequent. A sketch of such a level-wise search is shown below.
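A minimal sketch of a level-wise (Apriori-style) search that exploits the downward-closure property; the candidate generation of the real Apriori algorithm is simplified here, and the transaction data is illustrative:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: only extend itemsets whose subsets are all frequent."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def supp(itemset):
        return sum(set(itemset) <= t for t in transactions) / n

    # Level 1: frequent single items.
    frequent = {frozenset([i]) for i in items if supp([i]) >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into k-item candidates ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # ... and prune candidates with an infrequent (k-1)-subset (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if supp(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"beer"}, {"milk", "butter"}]
print(frequent_itemsets(transactions, min_support=0.5))
```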
Alternative Indicators • All-confidence: all rules that can be derived from itemset A have at least a confidence of all-confidence(A) = supp(A) / max(supp(a ∈ A)), with max(supp(a ∈ A)) the support of the item with the highest support in A. • Collective strength is given by cs(A) = (1 - v(A)) / (1 - E[v(A)]) * E[v(A)] / v(A), with v(A) the violation rate and E[] the expected value under independence of the items; the violation rate is defined as the fraction of transactions which contain some of the items in an itemset but not all. Collective strength is 0 for perfectly negatively correlated items, infinity for perfectly positively correlated items, and 1 if the items co-occur as expected under independence.
Alternative Indicators (2) • Coverage (aka antecedent support) measures how often a rule is applicable in a database: coverage(A → C) = supp(A) = P(A) • The conviction of a rule indicates the ratio of expected to observed appearances of the antecedent A without the consequent C: conv(A → C) = (1 - supp(C)) / (1 - conf(A → C)). • Leverage measures the difference between how often A and C appear together in the data set and how often they would be expected to appear together if A and C were statistically independent: leverage(A → C) = P(A and C) - P(A)P(C)
Alternative Indicators (3) • The lift of a rule is the ratio between the observed confidence and the confidence expected by pure chance; it thus measures how many times more often A and C occur together than expected if they were statistically independent: lift(A → C) = conf(A → C) / supp(C) = supp(A ⋃ C) / (supp(A) * supp(C)).
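The measures from the last three slides can be expressed in a few lines; the transaction data below is purely illustrative:

```python
transactions = [
    {"milk", "bread", "butter"},
    {"beer", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"beer", "milk"},
]

def supp(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def conf(A, C):
    return supp(set(A) | set(C)) / supp(A)

A, C = {"milk", "bread"}, {"butter"}

coverage   = supp(A)                                    # how often the rule applies: P(A)
lift       = conf(A, C) / supp(C)                       # > 1 indicates positive correlation
leverage   = supp(set(A) | set(C)) - supp(A) * supp(C)  # P(A and C) - P(A)P(C)
conviction = (1 - supp(C)) / (1 - conf(A, C))           # undefined when conf(A -> C) = 1

print(coverage, lift, leverage, conviction)
```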
Decision Trees • Many inductive knowledge acquisition algorithms generate classifiers in the form of decision trees. • A decision tree is a simple recursive structure for expressing a sequential classification process. • Leaf nodes denote classes • Intermediate nodes represent tests
Decision Trees and Rules • A decision tree can be represented as rules: if item1 then subtree1 elseif item2 then subtree2 elseif... • There are as many rules as there are leaf nodes in the decision tree. • Advantages of rules over decision trees: • Rules are a widely-used and well-understood representation formalism for knowledge in expert systems; • Rules are easier to understand, modify and combine; and • Rules can significantly improve classification performance by eliminating unnecessary tests.
Decision-Driven Rules • The following definitions apply to rules that aim to conclude a fact from a given set of attribute-value assignments. • The decision tree takes the following form: if attribute1 = value1 then subtree1 elseif attribute1 = value2 then subtree2 elseif... • The critical question is then: which attribute should be evaluated first, i.e. which attribute is the most selective determiner and should come first in the decision tree?
Entropy • Entropy is a measure of the 'degree of doubt' and is a well-studied concept in information theory. • Let {c1, c2, ..., cn} be the set of conclusions C of a rule (consequents); let {x1, x2, ..., xm} be the set of possible values of an attribute X. • The probability that ci is true given that X has value xj is given by p(ci|xj). • The entropy is then defined as entropy = - Σ p(ci|xj) log2[p(ci|xj)] for i ∈ 1...|C| • The term - log2[p(ci|xj)] indicates the 'amount of information' that xj has to offer about the conclusion ci.
Most Useful Determiner • The lower the entropy of xj with respect to C, the more information xj has to offer about C. • The entropy of an attribute X with respect to C is then given by - Σj p(xj) Σi p(ci|xj) log2[p(ci|xj)]. • The attribute with the lowest entropy is the most useful determiner, as it leaves the lowest 'degree of doubt'.
ID3 Algorithm by Quinlan • For each attribute, compute its entropy with respect to the conclusion; • Select the attribute with the lowest entropy (say X) • Divide the data into separate sets so that within a set, X has a fixed value (X=x1 in one set, X=x2 in another...) • Build a tree with branches: if X=x1 then ... (subtree1) if X=x2 then ... (subtree2) ... • For each subtree, repeat from step 1. • At each iteration, one attribute is removed from consideration. STOP if no more attributes are left, or if all attribute values lead to the same conclusion.
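A compact sketch of this idea, combining the attribute-entropy formula from the previous slide with the recursive splitting step; the data representation (dictionaries with a "class" field) and the attribute names in the example call are assumptions for illustration, not Quinlan's original formulation:

```python
import math
from collections import Counter, defaultdict

def attribute_entropy(examples, attribute):
    """- sum_j p(x_j) * sum_i p(c_i|x_j) * log2 p(c_i|x_j), with respect to the 'class' field."""
    by_value = defaultdict(list)
    for e in examples:
        by_value[e[attribute]].append(e["class"])
    h = 0.0
    for labels in by_value.values():
        p_value = len(labels) / len(examples)
        h -= p_value * sum((c / len(labels)) * math.log2(c / len(labels))
                           for c in Counter(labels).values())
    return h

def id3(examples, attributes):
    classes = {e["class"] for e in examples}
    if len(classes) == 1 or not attributes:            # stop: pure subset or no attributes left
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    best = min(attributes, key=lambda a: attribute_entropy(examples, a))  # lowest entropy wins
    branches = {}
    for value in {e[best] for e in examples}:           # one subtree per value of the attribute
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

# Tiny illustrative call (attribute values are made up):
data = [{"size": "large", "color": "green", "class": "safe"},
        {"size": "small", "color": "brown", "class": "unsafe"},
        {"size": "large", "color": "brown", "class": "unsafe"}]
print(id3(data, ["size", "color"]))
```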
Robinson Crusoe Example • Identifying what is good to eat, based on sixteen rules: • Next step: determining the entropy of each attribute with respect to the conclusion.
Robinson Crusoe Example (2) • Considering Size as an example yields: • p(safe | large) = 5/7 • p(unsafe | large) = 2/7 • p(large) = 7/16 • p(safe | small) = 5/9 • p(unsafe | small) = 4/9 • p(small) = 9/16 • Entropy of Size = - [7/16 (5/7 log2(5/7) + 2/7 log2(2/7)) + 9/16 (5/9 log2(5/9) + 4/9 log2(4/9))] = 0.935095...
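This value can be checked with a one-line computation (base-2 logarithms, with the leading minus sign of the entropy definition applied):

```python
from math import log2

entropy_size = -(7/16 * (5/7 * log2(5/7) + 2/7 * log2(2/7))
                 + 9/16 * (5/9 * log2(5/9) + 4/9 * log2(4/9)))
print(entropy_size)  # 0.93509...
```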
Robinson Crusoe Example (3) • Calculating all entropies shows that Color has the smallest one • Color is thus the most useful determiner • The resulting six rules cover the same space as the initial, bigger set of sixteen rules: • if color=brown then unsafe • if color=green and size=large then safe • if color=green and size=small then unsafe • if color=red and skin=hairy then safe • if color=red and skin=smooth and size=large then unsafe • if color=red and skin=smooth and size=small then safe
Counter Example • Consider the following data: • X=3, Y=3 ⇒ yes • X=2, Y=1 ⇒ no • X=3, Y=4 ⇒ no • X=1, Y=1 ⇒ yes • X=2, Y=2 ⇒ yes • X=3, Y=2 ⇒ no • The entropy-based ID3 algorithm is incapable of spotting the obvious relationship if X = Y then yes else no, as only one attribute is considered at a time!
Accuracy of ID3 • ID3 forms rules by eliminating conditions from a path in the decision tree, and thus the rules tend to be over-generalized with respect to the training data. • Can rules keep up with the decision trees? • Experimental results by Quinlan in 1987 show that rules are not only simpler in the general case, but that they are sometimes even more accurate! J.R. Quinlan: Generating Production Rules From Decision Trees, IJCAI’87
Refinement of Rule Sets • There is a four-step procedure for the refinement of rule sets: • Rules that become incorrect because of new examples are refined: incorrect rules are replaced by new rules that cover the positive examples, but not the new negative ones. • The rule set is completed to cover new positive examples. • Redundant rules are deleted to correct the rule set. • The rule set is minimized. • Steps 1 and 2 are handled by the algorithm JoJo, which integrates generalization and specialization via a heuristic search procedure.
Specialization • Specialization algorithms start from very general descriptions and specialize them until they are correct. • This is done by adding additional premises to the antecedent of a rule, or by restricting the range of an attribute which is used in an antecedent. • Algorithms relying on specialization generally suffer from overspecialization: previous specialization steps can become unnecessary due to subsequent specialization steps. • This brings along the risk of ending up with results that are not maximally general. • Some examples of (heuristic) specialization algorithms are AQ, C4, CN2, CABRO, FOIL and PRISM; references at the end of the lecture.
Generalization • Generalization starts from very specific descriptions and generalizes them as long as they do not become incorrect, i.e. in every step some unnecessary premises are deleted from the antecedent. • The generalization procedure stops when no more premises can be removed. • Generalization avoids the maximal-generality issue of specialization; in fact it guarantees most-general descriptions. • However, generalization in turn risks deriving final results that are not most-specific. • RELAX is an example of a generalization-based algorithm; references at the end of the lecture.
RELAX • RELAX is a generalization algorithm; it keeps generalizing as long as the resulting rule set does not become incorrect. • Interestingly, the motivation for RELAX came from algorithms of a different domain: the minimization of electronic circuits. • In fact, the minimization of the description of two classes over binary attributes is identical to the discovery of a minimal Boolean expression according to McCluskey '56.
RELAX (2) • Every example is considered to be a specific rule that is generalized. • The algorithm starts from a first rule and relaxes (drops) its first premise. • The resulting rule is tested against the negative examples. • If the new (generalized) rule covers negative examples, the premise is added again, and the next premise is relaxed. • A rule is considered minimal if any further relaxation would destroy the correctness of the rule. • The search for further minimal rules starts from any not yet considered example, i.e. from examples that are not covered by the already discovered minimal rules.
RELAX - Example • Consider the following positive example for a consequent C: (pos, (x=1, y=0, z=1)) • This example is represented as the rule: x ∧ ¬y ∧ z → C • In case of no negative examples, RELAX constructs and tests the following set of rules: 1) x ∧ ¬y ∧ z → C 2) ¬y ∧ z → C 3) x ∧ z → C 4) x ∧ ¬y → C 5) x → C 6) ¬y → C 7) z → C 8) → C • A sketch of the relaxation procedure follows below.
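A small sketch of the relaxation loop from the previous slide; the greedy premise ordering and the dictionary representation of rules are simplifications for illustration (RELAX itself searches for all minimal rules):

```python
def covers(premises, example):
    """A rule covers an example if all of its premises hold in the example."""
    return all(example.get(attr) == val for attr, val in premises.items())

def relax(seed, negatives):
    """Generalize the seed rule premise by premise, keeping only correct relaxations."""
    premises = dict(seed)
    for attr in list(premises):
        value = premises.pop(attr)                  # tentatively drop one premise
        if any(covers(premises, neg) for neg in negatives):
            premises[attr] = value                  # relaxation was incorrect: undo it
    return premises                                 # no further relaxation is correct

# Positive example (x=1, y=0, z=1) as a maximally specific rule, one negative example.
seed = {"x": 1, "y": 0, "z": 1}
negatives = [{"x": 1, "y": 1, "z": 1}]
print(relax(seed, negatives))                       # {'y': 0}, i.e. the rule ¬y → C suffices
```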
Summary: Specialization and Generalization • Specialization and generalization are dual search directions in a given rule set. • Specialization starts at the 'Top' element and excludes covered negative examples. • Generalization starts with the 'Bottom' element and covers additional positive examples.
JoJo – Refinement of Rule Sets • In general, it cannot be determined which search direction is the better one. • Note that ID3 in fact already makes use of both, by first constructing rules via specialization and then generalizing the rule set. • JoJo is an algorithm that combines both search directions in one heuristic search procedure. • JoJo can start at an arbitrary point in the lattice of complexes and generalizes and specializes as long as the quality and correctness can be improved, i.e. until a local optimum is found or no more search resources are available (e.g., time, memory).
JoJo (2) • While specialization moves solely from 'Top' towards 'Bottom' and generalization from 'Bottom' towards 'Top', JoJo is able to move freely in the search space. • Either of the two strategies can be used interchangeably, which makes JoJo more expressive than comparable algorithms that apply the two in a fixed sequential order (e.g. ID3).
JoJo (3) • A starting point in JoJo is described by two parameters: • Vertical position (length of the description) • Horizontal position (chosen premises) • Reminder: JoJo can start at any arbitrary point, while specialization requires a most general starting point and generalization a most specific one. • In general, it is possible to carry out several runs of JoJo with different starting points. Rules that were already produced can be used as subsequent starting points.
JoJo – Choosing a Starting Point • Criteria for choosing a vertical position: • Approximation of the expected length, or experience from earlier runs. • Random production of rules; distribution by means of the average correctness of the rules of the same length (so-called quality criterion). • Start with a small sample or very limited resources to discover a real starting point from an arbitrary one. • Randomly chosen starting point (same average expectation of success as starting with 'Top' or 'Bottom'). • Heuristic: few positive examples and maximally specific descriptions suggest long rules; few negative examples and maximally general descriptions suggest rather short rules.
JoJo – Choosing a Starting Point (2) • Criteria for choosing a horizontal position: • Given the vertical position, select the premises that correlate most strongly with the goal concept (consequent)
JoJo Principal Components • JoJo consists of three components: • Specializer, Generalizer and Scheduler • The former two can be instantiated by any such components, depending on the chosen strategies and preference criteria. • The Scheduler is responsible for selecting the next description out of all possible generalizations and specializations available (by means of a t-preference, total preference). • Simple example scheduler (see the sketch below): • Specialize if the error rate is above the threshold; • Otherwise, choose the best generalization with an allowable error rate; • Otherwise stop.
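The simple example scheduler could look roughly as follows; specialize, generalize and error_rate are placeholders for the Specializer, Generalizer and quality measure, and choosing the candidate with the lowest error rate stands in for the t-preference:

```python
def jojo_schedule(rule, specialize, generalize, error_rate, max_error, max_steps=100):
    """Simple scheduler sketch: specialize while the rule is too incorrect,
    otherwise take the best admissible generalization, otherwise stop."""
    for _ in range(max_steps):
        if error_rate(rule) > max_error:
            candidates = specialize(rule)                  # all one-step specializations
        else:
            candidates = [r for r in generalize(rule)      # one-step generalizations ...
                          if error_rate(r) <= max_error]   # ... with allowable error rate
        if not candidates:
            return rule                                    # local optimum: stop
        rule = min(candidates, key=error_rate)             # pick the best candidate
    return rule
```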
Incremental Refinement of Rules with JoJo • Refinement of rules refers to the modification of a given rule set based on additional examples. • The input to the task is a so-called hypothesis (a set of rules) and a set of old and new positive and negative examples. • The output of the algorithm is a refined set of rules and the total set of examples. • The new set of rules is correct, complete, non-redundant and (if necessary) minimal.
Incremental Refinement of Rules with JoJo (2) • Correctness: • Modify overly general rules that cover too many negative examples. • Replace a rule by a new set of rules that cover the positive examples, but not the negative ones. • Completeness: • Compute new correct rules that cover the not yet considered positive examples (up to a threshold). • Non-redundancy: • Remove rules that are more specific than other rules (i.e. rules that have premises that are a superset of the premises of another rule).
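As an illustration of the non-redundancy step, the following sketch removes rules whose premises are a proper superset of another rule's premises with the same consequent; the rule representation is hypothetical:

```python
def remove_redundant(rules):
    """Drop rules that are more specific than another rule with the same consequent,
    i.e. whose premises are a proper superset of that rule's premises."""
    kept = []
    for premises, consequent in rules:
        redundant = any(consequent == c2 and set(p2) < set(premises)
                        for p2, c2 in rules)
        if not redundant:
            kept.append((premises, consequent))
    return kept

rules = [
    ({"x=1"}, "C"),
    ({"x=1", "y=0"}, "C"),   # redundant: premises are a superset of the first rule's
    ({"z=1"}, "D"),
]
print(remove_redundant(rules))   # keeps the first and third rule
```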
Summary • Association rules help to discover otherwise hidden knowledge. • To discover rules it is important to understand the significance of data sets and rules, and to have relevance measures in place, e.g. coverage, leverage or confidence. • Decision trees are an alternative representation for rule sets. • The ID3 algorithm by Quinlan is a very important example. • In this context, we need to understand the principle of entropy (information theory) to determine the 'amount of information' that a given attribute brings along.
Summary (2) • Rules cover positive examples and should not cover negative examples. • There are two main approaches for determining rules: • Generalization • Specialization • RELAX was presented as an example of a generalization algorithm • JoJo combines the two and can traverse the entire search space by either generalizing or specializing rules interchangeably. • JoJo can also be applied to incrementally refine rule sets.
Specialization and Generalization Algorithms • AQ: Michalski, Mozetic, Hong and Lavrac: The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. AAAI-86, pp. 1041-1045. • C4: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266. • CN2: Clark and Boswell: Rule Induction with CN2: Some Recent Improvements. EWSL-91, pp. 151-163. • CABRO: Huyen and Bao: A method for generating rules from examples and its application. CSNDAL-91, pp. 493-504. • FOIL: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266. • PRISM: Cendrowska: PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies 27, 1987, pp. 349-370. • RELAX: Fensel and Klein: A new approach to rule induction and pruning. ICTAI-91.
Further References • Agrawal, Imielinski and Swami: Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference, 1993, pp. 207-216. • Quinlan: Generating Production Rules From Decision Trees. 10th Int'l Joint Conference on Artificial Intelligence, 1987, pp. 304-307. • Fensel and Wiese: Refinement of Rule Sets with JoJo. European Conference on Machine Learning, 1993, pp. 378-383.