CS 9633 Machine Learning: Learning Sets of Rules
Two Major Types of Rules • Propositional: Sunny ∧ Warm → PlayTennis • First-order predicate logic: Parent(x,y) → Ancestor(x,y), Parent(x,y) ∧ Parent(y,z) → Ancestor(x,z)
Sequential Covering Algorithms • Family of algorithms for learning rule sets based on the strategy: • Learn one rule • Remove the data it covers • Repeat until some stopping condition is met • Two critical components of such algorithms: • A method for learning one rule • A method for evaluating rule performance • Learned rules should have high accuracy; high coverage is usually less important
SEQUENTIAL-COVERING(Target_attribute, Attributes, Examples, Threshold) • Learned_Rules ← { } • Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples) • While PERFORMANCE(Rule, Examples) > Threshold do • Learned_Rules ← Learned_Rules + Rule • Examples ← Examples − {examples covered by Rule} • Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples) • Learned_Rules ← sort Learned_Rules by PERFORMANCE over Examples • Return Learned_Rules
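The pseudocode above can be sketched in Python. The rule representation, the `learn_one_rule` and `performance` callables, and the guard against a rule that covers nothing are illustrative assumptions, not part of the slides:

```python
def sequential_covering(learn_one_rule, performance, examples, threshold):
    """Greedy covering: learn one rule, remove the examples it covers, repeat.

    learn_one_rule(examples) -> (rule, covers), where covers(example) -> bool
    performance(rule, examples) -> a score such as rule accuracy
    """
    learned_rules = []
    rule, covers = learn_one_rule(examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        remaining = [e for e in examples if not covers(e)]
        if len(remaining) == len(examples):
            break  # rule covers nothing new; avoid looping forever
        examples = remaining
        if not examples:
            break
        rule, covers = learn_one_rule(examples)
    # Sort best-first so stronger rules are consulted first at prediction time
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules
```

A toy `learn_one_rule` need only return a rule object together with a `covers` predicate for it.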
Learning One Rule • Possible Approaches • Use ID3 algorithm except follow only most promising branch • Greedy algorithm • Can define “best descendant” as one with lowest entropy • Use beam search to maintain k most promising rules at each step.
Learn_One_Rule • See Table 10.2
Remarks • Each hypothesis considered in the main loop of the algorithm is a conjunction of attribute-value constraints. • Each conjunctive hypothesis is evaluated by the entropy of the examples it covers. • Increasingly specific candidate hypotheses are considered until the maximally specific hypothesis containing all attributes is reached. • The rule that is output is the one whose performance is greatest, not necessarily the final hypothesis in the search. • Each hypothesis represents the left-hand side of a rule. The right-hand side chosen is the target-attribute value most common among the examples covered by the rule.
Variations • Learn only rules that cover positive examples; assign a negative classification to examples not covered by any rule. • Instead of entropy, use a measure that evaluates the fraction of positive examples covered by the hypothesis. • AQ approach: • Explicitly seeks rules that cover a particular target value • Learns a disjunctive set of rules for each target value in turn • Uses a single positive example as a seed to focus the general-to-specific beam search • Each time it learns a new rule, it selects another positive example to act as a seed in the search for another rule
Sequential Covering Versus Decision Trees • Sequential covering: chooses among alternative attribute-value pairs at each step; performs more individual primitive search steps; makes n·k independent decisions; may be better when lots of data is available. • Simultaneous covering (decision trees): chooses among alternative attributes at each step; performs fewer individual primitive search steps; makes fewer independent decisions; with less data, making fewer independent decisions may be preferable.
Direction of Search for Learning One Rule • Some proceed from general to specific (CN2) while others go from specific to general (FIND-S) • May not be able to identify one maximally specific hypothesis—some maintain several (Golem)
Generate and Test versus Example Driven • Generate and Test—search through syntactically legal hypotheses. • Example Driven—use individual training examples to constrain hypotheses. • Find-S • Candidate Elimination • AQ • CIGOL
Post Pruning • Rules generated by sequential coverage can be post pruned in the same way as for rules from decision trees. • Post pruning reduces overfitting. • Need to use a “pruning” data set that is different from the training data and the test data.
Performance Evaluation of Rules (1) • Relative frequency (used by AQ) • Let n be the number of examples the rule matches • Let nc be the number of these that it classifies correctly • The relative frequency estimate of the rule's accuracy is nc/n
Performance Evaluation of Rules (2) • The m-estimate of accuracy • Let p be the prior probability that a randomly drawn example has the classification assigned by the rule • Let m be the weight (the equivalent number of examples for weighting this prior) • m-estimate = (nc + m·p) / (n + m) • Biased toward the default accuracy expected for the rule; often preferred when data is scarce • When m is 0, the m-estimate reduces to relative frequency • As m gets larger, more examples are needed to overcome the assumed prior accuracy
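Both estimates can be written down directly; `n_c`, `n`, `p`, and `m` follow the notation above:

```python
def relative_frequency(n_c, n):
    """AQ-style rule accuracy: fraction of matched examples classified correctly."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """m-estimate of rule accuracy: shrinks the relative frequency toward
    the prior p; m acts as an equivalent number of 'virtual' prior examples."""
    return (n_c + m * p) / (n + m)
```

For example, with n_c = 8, n = 10, p = 0.5, and m = 2, the m-estimate is (8 + 1)/12 = 0.75, pulled from the raw 0.8 toward the prior.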
Performance Evaluation of Rules (3) • Entropy • Let S be the set of examples that match the rule preconditions • Let c be the number of distinct target function values for this set, and pi the proportion of examples in S for which the target function takes on the ith value • Entropy(S) = −Σi=1..c pi log2 pi • Entropy measures the uniformity of the target function values over S; lower entropy means a purer covered set
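A sketch of the entropy measure over the covered examples (the list-of-labels representation is an assumption):

```python
import math
from collections import Counter

def rule_entropy(labels):
    """Entropy -sum(p_i * log2(p_i)) of the target values among the
    examples matching a rule's preconditions; 0 means the set is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```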
Learning First Order Rules • First order predicate logic rules are much more expressive than propositions. • Often referred to as inductive logic programming (ILP)
First-Order Horn Clauses • Given a training example for Daughter(x,y) like: (Name1=Sharon, Mother1=Louise, Father1=Bob, Male1=False, Female1=True, Name2=Bob, Mother2=Nora, Father2=Victor, Male2=True, Female2=False, Daughter1,2=True) • A propositional representation allows learning rules like: (Father1=Bob) ∧ (Name2=Bob) ∧ (Female1=True) → (Daughter1,2=True) • A first-order representation allows learning rules like: Father(x,y) ∧ Female(y) → Daughter(x,y) • First-order Horn clauses may also refer to variables in the precondition that do not occur in the postcondition: Father(y,z) ∧ Mother(z,x) → GrandDaughter(x,y)
Terminology • All expressions are composed of: • Constants (Bob, Louise) • Variables (x, y) • Predicate symbols (Loves, Greater_Than) • Function symbols (age) • Term: any constant, any variable, or any function applied to any term (Bob, x, age(Bob)) • Literal: any predicate or its negation applied to any term • Loves(Bob, Louise): positive literal • ¬Greater_Than(age(Sue), 20): negative literal
More Terminology • Ground literal: a literal that does not contain any variables • Clause: any disjunction of literals where all variables are assumed to be universally quantified • Horn Clause: a clause containing at most one positive literal, H ∨ ¬L1 ∨ … ∨ ¬Ln, or equivalently (L1 ∧ … ∧ Ln) → H, where L1 … Ln form the body (antecedents) and H is the head (consequent)
Substitutions • A substitution θ is any function that replaces variables by terms, e.g. {x/3, y/z} or {x/Mary, z/John, y/Joe} • A unifying substitution for two literals L1 and L2 is any substitution θ such that L1θ = L2θ
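A minimal sketch, assuming literals are represented as tuples of a predicate symbol followed by flat, function-free argument terms (an assumption for illustration):

```python
def apply_substitution(literal, theta):
    """Apply substitution theta (a dict mapping variables to terms) to a
    literal represented as (predicate, arg1, arg2, ...). Arguments not in
    theta (constants, unbound variables) are left unchanged."""
    pred, *args = literal
    return (pred, *[theta.get(a, a) for a in args])
```

Applying θ = {x/Bob, y/Louise} to Loves(x, Louise) and to Loves(Bob, y) yields the same ground literal, so θ is a unifying substitution for that pair.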
FOIL Algorithm for Learning Sets of First Order Rules • Developed by Quinlan • Natural extension of Sequential-Covering and Learn-One-Rule • Learns sets of first-order rules • Each rule is a Horn clause, with two exceptions: • Literals are not allowed to contain function symbols • Literals appearing in the body may be negated
FOIL Algorithm • See Table 10.4
Two Differences between Seq. Covering and FOIL • FOIL uses different approach for general to specific candidate specializations • FOIL employs a performance measure, Foil_Gain that is different from entropy
Generating Candidate Specializations in FOIL • FOIL generates new literals, each of which can be added to a rule of the form: L1 ∧ … ∧ Ln → P(x1,x2,…,xk) • New candidate literals Ln+1: • Q(v1,…,vr), where • Q is any allowable predicate • Each vi is either a new variable or a variable already in the rule • At least one vi in the new literal must already be in the rule • Equal(xj,xk), where xj and xk are variables already in the rule • The negation of either of the above forms of literals
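The generation rule above can be sketched as follows. The tuple representation and the fresh-variable names are assumptions, and unlike the worked example on the next slide this naive enumeration also emits reflexive forms such as Father(x,x):

```python
from itertools import product

def candidate_literals(rule_vars, predicates, fresh=('v1', 'v2')):
    """Enumerate FOIL's positive candidate literals Q(v1..vr): each
    argument is an existing rule variable or a fresh one, with at least
    one argument already in the rule, plus Equal(xj, xk) over existing
    variables. FOIL additionally considers the negation of each literal.

    predicates: list of (name, arity) pairs.
    """
    pool = list(rule_vars) + list(fresh)
    out = []
    for pred, arity in predicates:
        for args in product(pool, repeat=arity):
            # at least one argument must be a variable already in the rule
            if any(a in rule_vars for a in args):
                out.append((pred, *args))
    for a in rule_vars:
        for b in rule_vars:
            if a != b:
                out.append(('Equal', a, b))
    return out
```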
An Example • Goal is to learn rules to predict: GrandDaughter(x,y) • Available predicates: • Father • Female • Initial rule (most general, empty body): GrandDaughter(x,y) ←
Generate Candidate Literals • Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y) • and their negations: ¬Equal(x,y), ¬Female(x), ¬Female(y), ¬Father(x,y), ¬Father(y,x), ¬Father(x,z), ¬Father(z,x), ¬Father(y,z), ¬Father(z,y)
Extending the best rule • Suppose Father(y,z) is selected as the best of the candidate literals • The rule that is generated is: Father(y,z) → GrandDaughter(x,y) • New literals for the next round include: • All remaining previous literals • Plus the new literals Female(z), Equal(z,x), Equal(z,y), Father(z,x), Father(z,y), Father(z,w), Father(w,z), and the negations of all of these
Termination of Specialization • When only positive examples are covered by the rule, the search for specializations of the rule terminates. • All positive examples covered by the rule are removed from the training set. • If additional positive examples remain, a new search for an additional rule is started.
Guiding the Search • Must determine the performance of candidate rules over the training set at each step. • Must consider all possible bindings of variables to the rule. • In general, positive bindings are preferred over negative bindings.
Example • Training data assertions: GrandDaughter(Victor, Sharon), Female(Sharon), Father(Sharon, Bob), Father(Tom, Bob), Father(Bob, Victor) • Use the closed world assumption: any ground literal over the specified predicates and constants that is not listed is assumed to be false: ¬GrandDaughter(Tom,Bob), ¬GrandDaughter(Tom,Tom), ¬GrandDaughter(Bob,Victor), ¬Female(Tom), etc.
Possible Variable Bindings • Initial rule: GrandDaughter(x,y) • Possible bindings from the training assertions (how many ways can the 4 constants be bound to the initial rule's two variables? 4² = 16) • Positive binding: {x/Victor, y/Sharon} • Negative bindings: {x/Victor, y/Victor}, {x/Tom, y/Sharon}, etc. • Positive bindings provide positive evidence for, and negative bindings provide negative evidence against, the rule under consideration.
Evaluation Function: Foil_Gain • Estimate of the utility of adding a new literal • Based on the numbers of positive and negative bindings covered before and after adding the new literal • Notation: • R: a rule • L: candidate literal • R′: rule created by adding L to R • p0, n0: numbers of positive and negative bindings of rule R • p1, n1: numbers of positive and negative bindings of rule R′ • t: number of positive bindings of R still covered after adding L to yield R′ • Foil_Gain(L,R) = t · ( log2(p1/(p1+n1)) − log2(p0/(p0+n0)) )
Example • Let R be GrandDaughter(x,y) • Let L be Father(y,z) • What is Foil_Gain?
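Working the example under the closed world assumption over the four constants in the training data (the binding enumeration is a sketch; the gain formula is the one given in the notation above):

```python
import math
from itertools import product

# Training assertions from the slides; under the closed world assumption,
# every unlisted ground literal over these predicates/constants is false.
constants = ['Victor', 'Sharon', 'Bob', 'Tom']
granddaughter = {('Victor', 'Sharon')}
father = {('Sharon', 'Bob'), ('Tom', 'Bob'), ('Bob', 'Victor')}

# R: GrandDaughter(x,y) with an empty body: every (x,y) pair is a binding
p0 = sum(1 for xy in product(constants, repeat=2) if xy in granddaughter)
n0 = len(constants) ** 2 - p0

# R': Father(y,z) -> GrandDaughter(x,y): bindings (x,y,z) whose body holds
covered = [(x, y, z) for x, y, z in product(constants, repeat=3)
           if (y, z) in father]
p1 = sum(1 for x, y, z in covered if (x, y) in granddaughter)
n1 = len(covered) - p1

# t: positive bindings of R still covered by R'; the single positive
# binding {x/Victor, y/Sharon} survives because Sharon has a listed father
t = sum(1 for x, y in granddaughter
        if any((y, z) in father for z in constants))

foil_gain = t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```

Here p0 = 1, n0 = 15, p1 = 1, n1 = 11, and t = 1, so Foil_Gain = log2(1/12) − log2(1/16) = log2(4/3) ≈ 0.415 bits. (In general t can differ from p1, since one positive binding of R may extend to several bindings of R′.)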
Interpretation of Foil Gain • Interpret in terms of information theory • The second log term is the minimum number of bits needed to encode the classification of an arbitrary positive binding among the bindings covered by R • The first log term is the number of bits required if the binding is one of those covered by rule R’. • Because t is the number of bindings of R that remain covered by R’, Foil_Gain(L,R) is the reduction due to L in the total number of bits needed to encode the classification of all positive bindings of R.
Learning Recursive Rule Sets • Allow new literals added to the rule body to refer to the target predicate. • Example: Parent(x,y) → Ancestor(x,y) and Parent(x,z) ∧ Ancestor(z,y) → Ancestor(x,y) • FOIL can accommodate recursive rules, but must take precautions to avoid infinite recursion
Another Approach to Learning First Order Rules • Induction is the inverse of deduction. • Use an inverted resolution procedure to form an inverse entailment operator. • X ⊢ Y (X entails Y) means that Y follows deductively from X. • This approach has generally not been considered computationally feasible, although there have been some recent improvements.