Learning Sets of Rules
Content
• Introduction
• Sequential Covering Algorithms
• Learning Rule Sets: Summary
• Learning First-Order Rules
• Learning Sets of First-Order Rules: FOIL
• Summary
Introduction
• If-Then rules
  • Are very expressive
  • Easy to understand
• Rules with variables: Horn clauses
• A set of Horn clauses builds up a PROLOG program
• Learning of Horn clauses: Inductive Logic Programming (ILP)
• Example first-order rule set for the target concept Ancestor:
  IF Parent(x,y) THEN Ancestor(x,y)
  IF Parent(x,z) ∧ Ancestor(z,y) THEN Ancestor(x,y)
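As a concrete illustration, a minimal Python sketch that evaluates these two Horn clauses over a small set of Parent facts; the facts, names and function below are illustrative assumptions, not part of the slides.

```python
# Sketch: evaluating the two Ancestor Horn clauses over Parent facts.
# The facts below are made up for illustration.
parent_facts = {("Tom", "Bob"), ("Bob", "Sharon")}   # Parent(x, y)

def ancestors(parents):
    """Compute Ancestor(x, y) as the transitive closure of Parent(x, y)."""
    # Rule 1: IF Parent(x, y) THEN Ancestor(x, y)
    closure = set(parents)
    changed = True
    while changed:
        changed = False
        # Rule 2: IF Parent(x, z) AND Ancestor(z, y) THEN Ancestor(x, y)
        for (x, z) in parents:
            for (z2, y) in list(closure):
                if z == z2 and (x, y) not in closure:
                    closure.add((x, y))
                    changed = True
    return closure

print(ancestors(parent_facts))   # adds ('Tom', 'Sharon') to the two Parent facts
```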
Introduction 2
• GOAL: learning a target function as a set of IF-THEN rules
• BEFORE: learning with decision trees
  • Learn the decision tree
  • Translate the tree into a set of IF-THEN rules (one rule per leaf)
• OTHER POSSIBILITY: learning with genetic algorithms
  • Each rule set is coded as a bit vector
  • Several genetic operators are used on the hypothesis space
• TODAY AND HERE:
  • First: learning rules in propositional form
  • Second: learning rules in first-order form (Horn clauses, which include variables)
  • Sequential search for rules, one after the other
Content
• Introduction
• Sequential Covering Algorithms
  • General to Specific Beam Search
  • Variations
• Learning Rule Sets: Summary
• Learning First-Order Rules
• Learning Sets of First-Order Rules: FOIL
• Summary
Sequential Covering Algorithms
• Goal of such an algorithm: learning a disjunctive set of rules that together give a good classification of the training data
• Principle: learn rule sets based on the strategy of learning one rule, removing the examples it covers, then iterating this process
• Requirements for the Learn-One-Rule method:
  • As input it accepts a set of positive and negative training examples
  • As output it delivers a single rule that covers many of the positive examples and few (ideally none) of the negative examples
  • Required: the output rule has high accuracy, but not necessarily high coverage
Sequential Covering Algorithms 2
• Procedure:
  • Learning the set of rules: invoke the Learn-One-Rule method on all of the available training examples
  • Remove every positive example covered by the learned rule
  • Optionally, sort the final set of rules so that more accurate rules are considered first
• Greedy search: it is not guaranteed to find the smallest or best set of rules that covers the training examples
Sequential Covering Algorithms 3
SequentialCovering( target_attribute, attributes, examples, threshold )
  learned_rules ← { }
  rule ← LearnOneRule( target_attribute, attributes, examples )
  while ( Performance( rule, examples ) > threshold ) do
    learned_rules ← learned_rules + rule
    examples ← examples - { examples correctly classified by rule }
    rule ← LearnOneRule( target_attribute, attributes, examples )
  learned_rules ← sort learned_rules according to Performance over examples
  return learned_rules
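A direct Python rendering of this loop might look as follows; this is a sketch that assumes the caller supplies learn_one_rule and performance functions and a rule object with a covers() method and a prediction attribute (none of these names come from the slides).

```python
def sequential_covering(target_attribute, attributes, examples, threshold,
                        learn_one_rule, performance):
    """Sketch of SequentialCovering: learn one rule, remove the examples it
    classifies correctly, and iterate while performance stays above threshold."""
    learned_rules = []
    rule = learn_one_rule(target_attribute, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Remove the examples correctly classified by the rule.
        examples = [ex for ex in examples
                    if not (rule.covers(ex) and ex[target_attribute] == rule.prediction)]
        rule = learn_one_rule(target_attribute, attributes, examples)
    # Consider more accurate rules first.
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules
```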
General to Specific Beam Search
• Specialising search
• Organises the hypothesis space search in generally the same fashion as ID3, but follows only the most promising branch of the tree at each step
• Begin with the most general rule (empty precondition)
• Follow the most promising branch:
  • Greedily add the attribute test that most improves the measured performance of the rule over the training examples
  • Greedy depth-first search with no backtracking
  • Danger of a sub-optimal choice
• Reducing the risk: beam search (CN2 algorithm)
  • The algorithm maintains a list of the k best candidates
  • In each search step, descendants (specialisations) are generated for each of these k best candidates
  • The resulting set is then reduced to the k most promising members
General to Specific Beam Search 2
• Comparison: hypothesis space search for learning with a decision tree (figure)
General to Specific Beam Search 4
The CN2 Algorithm
LearnOneRule( target_attribute, attributes, examples, k )
  Initialise best_hypothesis to the most general hypothesis Ø
  Initialise candidate_hypotheses to the set { best_hypothesis }
  while ( candidate_hypotheses is not empty ) do
    1. Generate the next more-specific candidate_hypotheses
    2. Update best_hypothesis
    3. Update candidate_hypotheses
  return a rule of the form "IF best_hypothesis THEN prediction",
    where prediction is the most frequent value of target_attribute among those examples that match best_hypothesis

Performance( h, examples, target_attribute )
  h_examples ← the subset of examples that match h
  return -Entropy( h_examples ), where Entropy is computed with respect to target_attribute
General to Specific Beam Search 5
• Generate the next more-specific candidate_hypotheses
  all_constraints ← set of all constraints (a = v), where a ∈ attributes and v is a value of a occurring in the current set of examples
  new_candidate_hypotheses ← for each h in candidate_hypotheses and each c in all_constraints:
    create a specialisation of h by adding the constraint c
  Remove from new_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific
• Update best_hypothesis
  for all h in new_candidate_hypotheses do
    if ( Performance( h, examples, target_attribute ) > Performance( best_hypothesis, examples, target_attribute ) )
       and h is statistically significant when tested on examples
    then best_hypothesis ← h
General to Specific Beam Search 6
• Update the candidate_hypotheses
  candidate_hypotheses ← the k best members of new_candidate_hypotheses, according to the Performance function
• The Performance function guides the search in Learn-One-Rule:
  Performance(s) = -Entropy(s) = Σ_{i=1..c} p_i log2 p_i
  • s: the current set of training examples
  • c: the number of possible values of the target attribute
  • p_i: the fraction of the examples in s that are classified with the i-th value
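Putting the three steps together, a simplified Python sketch of this beam search could look like the following. The hypothesis representation (a set of attribute-value pairs), the expectation that attributes excludes the target attribute, and the omission of CN2's statistical-significance test are all simplifying assumptions.

```python
import math
from collections import Counter

def performance(h, examples, target_attribute):
    """-Entropy of the target attribute over the examples matching hypothesis h."""
    matched = [ex for ex in examples if all(ex[a] == v for a, v in h)]
    if not matched:
        return float("-inf")
    counts = Counter(ex[target_attribute] for ex in matched)
    n = len(matched)
    return sum((c / n) * math.log2(c / n) for c in counts.values())

def learn_one_rule(target_attribute, attributes, examples, k):
    """Beam search from the empty (most general) hypothesis, keeping the k best candidates."""
    best, candidates = frozenset(), [frozenset()]
    all_constraints = {(a, ex[a]) for ex in examples for a in attributes}
    while candidates:
        # Specialise each candidate by one constraint on a not-yet-constrained
        # attribute (this also avoids duplicate and inconsistent hypotheses).
        new_candidates = {h | {c} for h in candidates for c in all_constraints
                          if c[0] not in {a for a, _ in h}}
        for h in new_candidates:
            # (The statistical-significance test of CN2 is omitted in this sketch.)
            if performance(h, examples, target_attribute) > performance(best, examples, target_attribute):
                best = h
        candidates = sorted(new_candidates,
                            key=lambda h: performance(h, examples, target_attribute),
                            reverse=True)[:k]
    prediction = Counter(ex[target_attribute] for ex in examples
                         if all(ex[a] == v for a, v in best)).most_common(1)[0][0]
    return dict(best), prediction
```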
Example for the CN2 Algorithm
LearnOneRule( EnjoySport, {Sky, AirTemp, Humidity, Wind, Water, Forecast}, examples, 2 )
best_hypothesis = Ø
candidate_hypotheses = { Ø }
all_constraints = { Sky=Sunny, Sky=Rainy, AirTemp=Warm, AirTemp=Cold, Humidity=Normal, Humidity=High, Wind=Strong, Water=Warm, Water=Cool, Forecast=Same, Forecast=Change }
Performance = nc / n
  n: number of examples covered by the rule
  nc: number of covered examples whose classification is correct
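A minimal sketch of this simpler performance measure, assuming a rule is represented as a dict of attribute-value preconditions plus a predicted target value (a representation chosen only for illustration):

```python
def rule_performance(preconditions, prediction, examples, target_attribute):
    """Performance = n_c / n: the fraction of examples covered by the rule (n)
    that it also classifies correctly (n_c)."""
    covered = [ex for ex in examples
               if all(ex[a] == v for a, v in preconditions.items())]
    if not covered:
        return 0.0
    n_correct = sum(1 for ex in covered if ex[target_attribute] == prediction)
    return n_correct / len(covered)

# e.g. rule_performance({"Sky": "Sunny"}, "Yes", examples, "EnjoySport")
```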
Example for the CN2 Algorithm (2)
Pass 1
The Remove step discards nothing.
candidate_hypotheses = { Sky=Sunny, AirTemp=Warm }
best_hypothesis is Sky=Sunny
Example for the CN2 Algorithm (3)
Pass 2
The Remove step discards hypotheses that are duplicates, inconsistent, or not maximally specific.
candidate_hypotheses = { Sky=Sunny AND AirTemp=Warm, Sky=Sunny AND Humidity=High }
best_hypothesis remains Sky=Sunny
Content
• Introduction
• Sequential Covering Algorithms
• Learning Rule Sets: Summary
• Learning First-Order Rules
• Learning Sets of First-Order Rules: FOIL
• Summary
Learning Rule Sets: Summary
• Key dimension in the design of a rule learning algorithm
  • Here sequential covering: learn one rule, remove the positive examples it covers, iterate on the remaining examples
  • ID3: simultaneous covering
• Which one should be preferred?
• Key difference: the choice made at the most primitive step in the search
  • ID3 chooses among attributes by comparing the partitions of the data they generate
  • CN2 chooses among attribute-value pairs by comparing the subsets of data they cover
• Number of choices: to learn n rules, each containing k attribute-value tests in its preconditions,
  • CN2 performs n·k primitive search steps
  • ID3 makes fewer independent search steps
• If the data is plentiful, it may support the larger number of independent decisions
• If the data is scarce, sharing decisions regarding the preconditions of different rules may be more effective
Learning Rule Sets: Summary 2
• CN2: general-to-specific search (cf. Find-S: specific-to-general)
  • Advantage: there is a single maximally general hypothesis from which to begin the search, whereas there are many maximally specific ones
  • GOLEM: chooses several positive examples at random to initialise and guide the search; the best hypothesis obtained over these multiple random choices is selected
• CN2: generate then test (Find-S, AQ and Cigol are example-driven)
  • Advantage of generate-then-test: each choice in the search is based on the hypothesis performance over many examples
• How should the rules be post-pruned?
Content
• Introduction
• Sequential Covering Algorithms
• Learning Rule Sets: Summary
• Learning First-Order Rules
  • First-Order Horn Clauses
  • Terminology
• Learning Sets of First-Order Rules: FOIL
• Summary
Learning First-Order Rules
• Previous: algorithms for learning sets of propositional rules
• Now: extension of these algorithms to learning first-order rules
• In particular: inductive logic programming (ILP)
  • Learning of first-order rules or theories
  • Uses PROLOG: a general-purpose, Turing-equivalent programming language in which programs are expressed as collections of Horn clauses
First-Order Horn Clauses
• Advantage of the representation: learning Daughter(x,y)
• Result of C4.5 or CN2 (propositional), e.g.: IF (Father1 = Bob) AND (Name2 = Bob) AND (Female1 = True) THEN Daughter1,2 = True
• Result of ILP (first-order), e.g.: IF Father(y,x) AND Female(y) THEN Daughter(x,y)
• Problem: propositional representations offer no general way to describe the essential relations among the values of the attributes
• In first-order Horn clauses, variables that do not occur in the postcondition may occur in the precondition
• Possible: use the same predicates in rule postconditions and preconditions (recursive rules)
Terminology
• Every well-formed expression is composed of constants, variables, predicates and functions
• A term is any constant, any variable, or any function applied to any term
• A literal is any predicate (or its negation) applied to any set of terms
• A ground literal is a literal that does not contain any variables
• A negative literal is a literal containing a negated predicate
• A positive literal is a literal with no negation sign
• A clause is any disjunction of literals M1 ∨ ... ∨ Mn whose variables are universally quantified
• A Horn clause is an expression of the form H ← (L1 ∧ ... ∧ Ln), where H, L1, ..., Ln are positive literals. H is called the head, postcondition or consequent of the Horn clause. The conjunction of literals L1 ∧ ... ∧ Ln is called the body, preconditions or antecedents of the Horn clause
Terminology 2
• For any literals A and B, the expression ( A ← B ) is equivalent to ( A ∨ ¬B ), and the expression ¬( A ∧ B ) is equivalent to ( ¬A ∨ ¬B ) => a Horn clause H ← (L1 ∧ ... ∧ Ln) can equivalently be written as H ∨ ¬L1 ∨ ... ∨ ¬Ln
• A substitution is any function that replaces variables by terms. For example, the substitution {x/3, y/z} replaces the variable x by the term 3 and replaces the variable y by the term z. Given a substitution θ and a literal L, Lθ denotes the result of applying the substitution θ to L
• A unifying substitution for two literals L1 and L2 is any substitution θ such that L1θ = L2θ
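To make the substitution notation concrete, here is a small Python sketch; representing literals as (predicate, argument-tuple) pairs and substitutions as dicts is an assumption made only for this illustration.

```python
# Literals as (predicate, args) tuples; a substitution maps variable names to terms.
def apply_substitution(literal, theta):
    """Return L·theta: replace each variable in the literal's arguments by its term."""
    predicate, args = literal
    return (predicate, tuple(theta.get(arg, arg) for arg in args))

theta = {"x": "3", "y": "z"}                 # the substitution {x/3, y/z}
L = ("P", ("x", "y", "Bob"))
print(apply_substitution(L, theta))          # ('P', ('3', 'z', 'Bob'))

# theta is a unifying substitution for L1 and L2 if L1·theta == L2·theta.
def unifies(L1, L2, theta):
    return apply_substitution(L1, theta) == apply_substitution(L2, theta)
```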
Content
• Introduction
• Sequential Covering Algorithms
• Learning Rule Sets: Summary
• Learning First-Order Rules
• Learning Sets of First-Order Rules: FOIL
  • Generating Candidate Specialisations in FOIL
  • Guiding the Search in FOIL
  • Learning Recursive Rule Sets
  • Summary of FOIL
• Summary
Learning Sets of First-Order Rules: FOIL
FOIL( target_predicate, predicates, examples )
  pos ← those examples for which target_predicate is true
  neg ← those examples for which target_predicate is false
  learned_rules ← { }
  while ( pos ) do
    /* learn a new rule */
    new_rule ← the rule that predicts target_predicate with no preconditions
    new_rule_neg ← neg
    while ( new_rule_neg ) do
      /* add a new literal to specialise new_rule */
      candidate_literals ← generate candidate new literals for new_rule, based on predicates
      best_literal ← argmax_{L ∈ candidate_literals} FoilGain( L, new_rule )
      add best_literal to the preconditions of new_rule
      new_rule_neg ← subset of new_rule_neg that satisfies new_rule's preconditions
    learned_rules ← learned_rules + new_rule
    pos ← pos - { members of pos covered by new_rule }
  return learned_rules
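A compact Python sketch of the two nested loops; the helpers candidate_literals, foil_gain and covers, the pos/neg input split, and the dict-based rule representation are assumptions that would have to be supplied by the surrounding code.

```python
def foil(target_predicate, predicates, pos, neg,
         candidate_literals, foil_gain, covers):
    """Sketch of FOIL: pos/neg are the positive and negative examples of
    target_predicate; the three helper functions are assumed to exist."""
    learned_rules = []
    while pos:                                    # outer loop: specific-to-general over the rule set
        new_rule = {"head": target_predicate, "body": []}
        new_rule_neg = list(neg)
        while new_rule_neg:                       # inner loop: general-to-specific over one rule
            candidates = candidate_literals(new_rule, predicates)
            best = max(candidates, key=lambda L: foil_gain(L, new_rule))
            new_rule["body"].append(best)
            new_rule_neg = [ex for ex in new_rule_neg if covers(new_rule, ex)]
        learned_rules.append(new_rule)
        pos = [ex for ex in pos if not covers(new_rule, ex)]
    return learned_rules
```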
Learning Sets of First-Order Rules: FOIL 2
• Outer loop: "specific to general"
• Generalises the set of rules:
  • Adds a new rule to the existing hypothesis
  • Each new rule "generalises" the current hypothesis by increasing the number of instances classified as positive
• Start with the most specific hypothesis: the empty set of rules, which classifies no instance as positive
Learning Sets of First-Order Rules: FOIL 4
• Inner loop: "general to specific"
• Specialisation of the current rule
• Generation of new literals
• Current rule: P( x1, x2, ..., xk ) ← L1 ∧ ... ∧ Ln
• New candidate literals take one of the following forms:
  • Q( v1, ..., vr ), where Q is a predicate name occurring in predicates (Q ∈ predicates) and the vi are either new variables or variables already present in the rule; at least one of the vi in the created literal must already exist as a variable in the rule
  • Equal( xj, xk ), where xj and xk are variables already present in the rule
  • The negation of either of the above forms of literals
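The literal-generation step can be sketched as follows; the (predicate, arity) input format and the fresh-variable naming scheme are assumptions, and Equal literals, negations, and candidates that differ only by renaming of new variables are not handled in this sketch.

```python
from itertools import product

def candidate_literals(predicates, rule_variables):
    """Generate candidate literals Q(v1, ..., vr): each argument is either a
    variable already in the rule or a new variable, and at least one argument
    must be an existing rule variable."""
    candidates = []
    for name, arity in predicates:                      # e.g. ("Father", 2)
        new_vars = [f"v{i}" for i in range(arity)]      # fresh variable names (assumed scheme)
        for args in product(list(rule_variables) + new_vars, repeat=arity):
            if any(a in rule_variables for a in args):  # at least one existing variable
                candidates.append((name, args))
    return candidates

# Example: specialising GrandDaughter(x, y) with predicates Father/2 and Female/1
print(candidate_literals([("Father", 2), ("Female", 1)], {"x", "y"}))
```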
Learning Sets of First-Order Rules: FOIL 5
• Example: predict GrandDaughter(x,y); other predicates: Father and Female
• FOIL begins with the most general rule: GrandDaughter(x,y) ←
• Specialisation: the following literals are generated as candidates: Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y), plus the negation of each of these literals
• Suppose FOIL greedily selects Father(y,z): GrandDaughter(x,y) ← Father(y,z)
• Generating the candidate literals to further specialise the rule: all literals generated in the previous step, plus Female(z), Equal(z,x), Equal(z,y), Father(z,w), Father(w,z) and their negations
• Suppose FOIL greedily selects Father(z,x) at this point and Female(x) on the next iteration: GrandDaughter(x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(x)
Guiding the Search in FOIL
• To select the most promising literal, FOIL considers the performance of the rule over the training data
• It considers all possible bindings of each variable in the current rule for each candidate literal
• As new literals are added to the rule, the set of bindings changes
• If a literal is added that introduces a new variable, the bindings for the rule grow in length
• If the new variable can be bound to several constants, the number of bindings fitting the extended rule can be greater than the number associated with the original rule
• The evaluation function of FOIL (FoilGain), which estimates the utility of adding a new literal to the rule, is based on the numbers of positive and negative bindings before and after adding the new literal
Guiding the Search in FOIL 2
• FoilGain
• Consider some rule R and a candidate literal L that might be added to the body of R, yielding R'
• The information gain of adding L, taken over all bindings, is
  FoilGain( L, R ) = t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )
• A high value means high information gain
• p0: the number of positive bindings of R
• p1: the number of positive bindings of R'
• n0: the number of negative bindings of R
• n1: the number of negative bindings of R'
• t: the number of positive bindings of rule R that are still covered after adding literal L to R
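A direct Python sketch of this measure, computed from the binding counts defined above; the example counts at the bottom are made up for illustration.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FoilGain(L, R) = t * ( log2(p1/(p1+n1)) - log2(p0/(p0+n0)) ).
    p0, n0: positive / negative bindings of R;  p1, n1: of R' = R + L;
    t: positive bindings of R still covered after adding L."""
    if p0 == 0 or p1 == 0:
        return float("-inf")   # no positive bindings: not a useful specialisation in this sketch
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: adding a literal keeps 3 of 4 positive bindings
# while cutting negative bindings from 6 to 1.
print(foil_gain(p0=4, n0=6, p1=3, n1=1, t=3))   # ≈ 2.72
```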
Summary of FOIL
• FOIL extends CN2 to handle learning of first-order rules similar to Horn clauses
• During the learning process FOIL performs a general-to-specific search, at each step adding a single new literal to the rule preconditions
• At each step FoilGain is used to select one of the candidate literals
• If new literals are allowed to refer to the target predicate, then FOIL can, in principle, learn sets of recursive rules
• To handle noisy data, the search is continued until some trade-off is reached between rule accuracy, coverage and complexity
• FOIL uses a minimum description length approach to limit the growth of rules: new literals are added only as long as their description length is shorter than the description length of the training data they explain
Summary
• Sequential covering learns a disjunctive set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule and iterating the process over the remaining training examples
• There is a variety of such methods; they vary in their search strategy
• CN2: general-to-specific beam search, generating and testing progressively more specific rules until a sufficiently accurate rule is found
• Sets of first-order rules provide a highly expressive representation
• FOIL learns sets of first-order rules with a sequential covering, general-to-specific search guided by FoilGain