EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006
Overview • Rule-based classification method overview • CBA: classification based on association • Applying rules based method in Microarray analysis
Rule Based Methods Vs. SVM Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05, 2005.
Rule-Based Classifier • Classify records by using a collection of "if…then…" rules • Rule: (Condition) → y • where • Condition is a conjunction of attribute tests • y is the class label • Examples of classification rules: • (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds • (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example) R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier • A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians The rule R1 covers a hawk => Bird The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Coverage of a rule: fraction of records that satisfy the antecedent of the rule • Accuracy of a rule: fraction of the covered records that also satisfy the consequent • Example: (Status=Single) → No covers 4 of the 10 records (Coverage = 40%), and 2 of those 4 have Cheat = No (Accuracy = 50%)
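The coverage/accuracy computation can be sketched in a few lines of Python; the dictionary encoding of the Tid table below is illustrative, not part of the original slides.

```python
# The ten training records from the Tid/Refund/Status/Income/Cheat table.
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Cheat": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Cheat": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Cheat": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Cheat": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Cheat": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Cheat": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Cheat": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Cheat": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Cheat": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Cheat": "Yes"},
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Coverage: fraction of records satisfying the antecedent.
    Accuracy: among covered records, fraction also satisfying the consequent."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = sum(consequent(r) for r in covered) / len(covered)
    return coverage, accuracy

# Rule: (Status = Single) -> Cheat = No
cov, acc = coverage_and_accuracy(
    records,
    antecedent=lambda r: r["Status"] == "Single",
    consequent=lambda r: r["Cheat"] == "No",
)
print(cov, acc)  # 0.4 0.5
```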
How does Rule-based Classifier Work? R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians A lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5 A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier • Mutually exclusive rules • Every record is covered by at most one rule • Exhaustive rules • Classifier has exhaustive coverage if it accounts for every possible combination of attribute values • Each record is covered by at least one rule
From Decision Trees To Rules Rules are mutually exclusive and exhaustive Rule set contains as much information as the tree
Rules Can Be Simplified • Using the same Tid / Refund / Marital Status / Taxable Income / Cheat table as above • Initial Rule: (Refund=No) ∧ (Status=Married) → No • Simplified Rule: (Status=Married) → No
Effect of Rule Simplification • Rules are no longer mutually exclusive • A record may trigger more than one rule • Solution? • Ordered rule set • Unordered rule set – use voting schemes • Rules are no longer exhaustive • A record may not trigger any rules • Solution? • Use a default class
Ordered Rule Set • Rules are rank ordered according to their priority • An ordered rule set is known as a decision list • When a test record is presented to the classifier • It is assigned to the class label of the highest ranked rule it has triggered • If none of the rules fired, it is assigned to the default class R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians
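A decision list is easy to sketch in Python: walk the rules in rank order and return the first match, falling back to the default class. The attribute names and lambdas below are an illustrative encoding of the five animal rules.

```python
def classify(record, decision_list, default_class):
    """Return the class of the first (highest-ranked) rule whose
    antecedent the record satisfies; otherwise the default class."""
    for antecedent, label in decision_list:
        if antecedent(record):
            return label
    return default_class

# The five animal rules, in rank order (R1 has the highest priority).
rules = [
    (lambda a: a["gives_birth"] == "no"  and a["can_fly"] == "yes",        "Birds"),
    (lambda a: a["gives_birth"] == "no"  and a["lives_in_water"] == "yes", "Fishes"),
    (lambda a: a["gives_birth"] == "yes" and a["blood_type"] == "warm",    "Mammals"),
    (lambda a: a["gives_birth"] == "no"  and a["can_fly"] == "no",         "Reptiles"),
    (lambda a: a["lives_in_water"] == "sometimes",                         "Amphibians"),
]

# A turtle triggers both R4 and R5; the ordering resolves the conflict in favor of R4.
turtle = {"gives_birth": "no", "can_fly": "no",
          "lives_in_water": "sometimes", "blood_type": "cold"}
print(classify(turtle, rules, "Unknown"))  # Reptiles
```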
Rule Ordering Schemes • Rule-based ordering • Individual rules are ranked based on their quality • Class-based ordering • Rules that belong to the same class appear together
Building Classification Rules • Direct Method: • Extract rules directly from data • e.g., CBA • Indirect Method: • Extract rules from other classification models (e.g., decision trees, neural networks) • e.g., C4.5rules
Direct Method: Sequential Covering • (1) Start from an empty rule • (2) Grow a rule using the Learn-One-Rule function • (3) Remove training records covered by the rule • (4) Repeat steps (2) and (3) until the stopping criterion is met
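The sequential-covering loop can be sketched as a short skeleton. `learn_one_rule` here is a hypothetical stand-in for the Learn-One-Rule function; returning `None` models the stopping criterion.

```python
def sequential_covering(records, learn_one_rule):
    """Sequential covering: grow one rule, remove the records it covers,
    and repeat until no acceptable rule can be grown."""
    rule_set, remaining = [], list(records)
    while remaining:
        rule = learn_one_rule(remaining)     # grow a rule on the remaining data
        if rule is None:
            break                            # stopping criterion met
        rule_set.append(rule)
        antecedent, _ = rule
        # remove training records covered by the new rule
        remaining = [r for r in remaining if not antecedent(r)]
    return rule_set

# Toy run: one rule covers all "x == 1" records, then learning stops.
records = [{"x": 1, "y": "pos"}, {"x": 1, "y": "pos"}, {"x": 0, "y": "neg"}]
def learn_one_rule(rs):
    if any(r["x"] == 1 for r in rs):
        return (lambda r: r["x"] == 1, "pos")
    return None

print(len(sequential_covering(records, learn_one_rule)))  # 1
```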
Aspects of Sequential Covering • Rule Growing • Rule Evaluation • Stopping Criterion • Rule Pruning
Rule Growing • Two common strategies: general-to-specific (start from an empty rule and greedily add conjuncts) and specific-to-general (start from a rule covering a single positive example and greedily remove conjuncts)
Rule Growing (Examples) • RIPPER Algorithm: • Start from an empty rule: {} => class • Add the conjunct that maximizes FOIL's information gain measure: • R0: {} => class (initial rule) • R1: {A} => class (rule after adding conjunct) • Gain(R0, R1) = t × [ log2(p1/(p1+n1)) − log2(p0/(p0+n0)) ] • where t: number of positive instances covered by both R0 and R1 p0: number of positive instances covered by R0 n0: number of negative instances covered by R0 p1: number of positive instances covered by R1 n1: number of negative instances covered by R1
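The gain formula above translates directly to code. The example counts are made up for illustration; note that when R1 specializes R0, every positive R1 covers is also covered by R0, so t = p1.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 (covering p0 positives,
    n0 negatives) to its specialization R1 (covering p1 positives, n1 negatives)."""
    t = p1  # positives covered by both R0 and R1 (R1 specializes R0)
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A conjunct that keeps 8 of 10 positives but drops 8 of 10 negatives
# raises the rule's precision from 0.5 to 0.8, so the gain is positive:
print(foil_gain(p0=10, n0=10, p1=8, n1=2))
```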
Rule Evaluation • Metrics: • Accuracy = nc / n • Laplace = (nc + 1) / (n + k) • M-estimate = (nc + k·p) / (n + k) • where n: number of instances covered by the rule nc: number of covered instances that belong to the rule's class k: number of classes p: prior probability of the rule's class
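The three metrics differ only in how they smooth small-sample counts; a minimal sketch (the example counts are illustrative):

```python
def accuracy(nc, n):
    """Raw accuracy: fraction of covered instances of the rule's class."""
    return nc / n

def laplace(nc, n, k):
    """Laplace correction: shrinks toward 1/k when the rule covers few instances."""
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    """m-estimate: shrinks toward the class prior p (here m = k, as on the slide)."""
    return (nc + k * p) / (n + k)

# A rule covering 5 instances, 4 of its own class, in a 2-class problem
# with a uniform prior p = 0.5:
print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))
```

With k = 2 and p = 1/2 the Laplace and m-estimate values coincide; they differ once the prior is non-uniform.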
Stopping Criterion and Rule Pruning • Stopping criterion • Compute the gain • If gain is not significant, discard the new rule • Rule Pruning • Similar to post-pruning of decision trees • Reduced Error Pruning: • Remove one of the conjuncts in the rule • Compare error rate on validation set before and after pruning • If error improves, prune the conjunct
Summary of Direct Method • Grow a single rule • Remove the instances covered by the rule • Prune the rule (if necessary) • Add the rule to the current rule set • Repeat
Direct Method: RIPPER • For 2-class problem, choose one of the classes as positive class, and the other as negative class • Learn rules for positive class • Negative class will be default class • For multi-class problem • Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class) • Learn the rule set for smallest class first, treat the rest as negative class • Repeat with next smallest class as positive class
Indirect Method: C4.5rules • Extract rules from an unpruned decision tree • For each rule r: A → y, • consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A • Compare the pessimistic error rate of r against that of each r′ • Prune r if some r′ has a lower pessimistic error rate • Repeat until we can no longer improve the generalization error
Advantages of Rule-Based Classifiers • As highly expressive as decision trees • Easy to interpret • Easy to generate • Can classify new instances rapidly • Performance comparable to decision trees
Overview of CBA • Classification rule mining versus association rule mining • Aim: classification mining finds a small set of rules to form a classifier; association mining finds all rules satisfying minsup and minconf • Syntax: classification rules have the form X → y (y is a class label); association rules have the form X → Y (Y is any itemset)
Association Rules for Classification • Classification: mine a small set of rules existing in the data to form a classifier or predictor • It has a target attribute: the class attribute • Association rules: have no fixed target, but we can fix a target • Class association rules (CARs): have a target class attribute, e.g., Own_house = true → Class = Yes [sup=6/15, conf=6/6] • CARs can obviously be used for classification • B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. KDD, 1998 • http://www.comp.nus.edu.sg/~dm2/
Decision tree vs. CARs • The decision tree below generates the following 3 rules: Own_house = true → Class = Yes [sup=6/15, conf=6/6] Own_house = false ∧ Has_job = true → Class = Yes [sup=5/15, conf=5/5] Own_house = false ∧ Has_job = false → Class = No [sup=4/15, conf=4/4] • But there are many other rules that are not found by the decision tree
There are many more rules • CAR mining finds all of them. • In many cases, rules not in the decision tree (or a rule list) may perform classification better. • Such rules may also be actionable in practice
Decision tree vs. CARs (cont …) • Association mining requires discrete attributes, while decision tree learning can use both discrete and continuous attributes • CAR mining therefore requires continuous attributes to be discretized first; several discretization algorithms exist • Decision tree learning is not constrained by minsup or minconf, and is thus able to find rules with very low support; of course, such rules may be pruned because of possible overfitting
CBA: Three Steps • Discretize continuous attributes, if any • Generate all class association rules (CARs) • Build a classifier based on the generated CARs
RG: The Algorithm • Find the complete set of all possible rules • Usually takes a long time to finish
RG: Basic Concepts • Frequent ruleitems • A ruleitem is frequent if its support is above minsup • Accurate rule • A rule is accurate if its confidence is above minconf • Possible rule • For all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems. • The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.
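The frequent / accurate / possible-rule definitions amount to a filter over candidate ruleitems. The sketch below assumes each ruleitem is a tuple (condset, class, rule-support count, condset-support count); the tuple layout and function name are illustrative, not CBA's actual data structures.

```python
def mine_cars(ruleitems, n_records, minsup, minconf):
    """For each condset keep only its highest-confidence class (the possible
    rule), then require the rule to be frequent (minsup) and accurate (minconf)."""
    best = {}  # condset -> (confidence, support, class label)
    for condset, cls, rulesup, condsup in ruleitems:
        conf, sup = rulesup / condsup, rulesup / n_records
        key = frozenset(condset)
        if key not in best or conf > best[key][0]:
            best[key] = (conf, sup, cls)
    return [(set(k), cls)
            for k, (conf, sup, cls) in best.items()
            if sup >= minsup and conf >= minconf]

# Toy candidates over 15 records (cf. the Own_house example):
cands = [({"Own_house=true"}, "Yes", 6, 6),   # conf 6/6, sup 6/15: kept
         ({"Own_house=true"}, "No", 0, 6),    # not the possible rule for this condset
         ({"Has_job=false"}, "Yes", 2, 5)]    # conf 2/5: fails minconf
cars = mine_cars(cands, n_records=15, minsup=0.1, minconf=0.8)
print(cars)  # [({'Own_house=true'}, 'Yes')]
```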
Further Considerations in CAR mining • Multiple minimum class supports • Deal with imbalanced class distribution, e.g., some class is rare, 98% negative and 2% positive. • We can set the minsup(positive) = 0.2% and minsup(negative) = 2%. • If we are not interested in classification of negative class, we may not want to generate rules for negative class. We can set minsup(negative)=100% or more. • Rule pruning may be performed.
Building Classifiers • There are many ways to build classifiers using CARs. Several existing systems available. • Simplest: After CARs are mined, do nothing. • For each test case, we simply choose the most confident rule that covers the test case to classify it. Microsoft SQL Server has a similar method. • Or, using a combination of rules. • Another method (used in the CBA system) is similar to sequential covering. • Choose a set of rules to cover the training data.
Class Builder: Three Steps • The basic idea is to choose a set of high-precedence rules in R to cover D • (1) Sort the set of generated rules R • (2) Select rules for the classifier from R following the sorted sequence and put them in C • Each selected rule has to correctly classify at least one additional case • Also select the default class and compute the errors • (3) Discard those rules in C that do not improve the accuracy of the classifier • Locate the rule with the lowest error rate and discard the remaining rules in the sequence
Rules are sorted first Definition: Given two rules ri and rj, ri ≻ rj (also called "ri precedes rj" or "ri has a higher precedence than rj") if • the confidence of ri is greater than that of rj, or • their confidences are the same, but the support of ri is greater than that of rj, or • both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj. A CBA classifier L is of the form: L = <r1, r2, …, rk, default-class>
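The three-level precedence order maps directly onto a sort key: higher confidence first, ties broken by higher support, then by earlier generation order. The rule records and their values below are illustrative.

```python
# Each rule carries its confidence, support, and generation order.
rules = [
    {"id": "r1", "conf": 0.90, "sup": 0.10, "order": 0},
    {"id": "r2", "conf": 0.95, "sup": 0.05, "order": 1},
    {"id": "r3", "conf": 0.90, "sup": 0.20, "order": 2},
]

def precedence_key(r):
    """CBA precedence: confidence desc, then support desc, then generation asc."""
    return (-r["conf"], -r["sup"], r["order"])

ranked = sorted(rules, key=precedence_key)
print([r["id"] for r in ranked])  # ['r2', 'r3', 'r1']
```

r2 wins on confidence alone; r1 and r3 tie on confidence, so r3's larger support puts it ahead.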
Classifier building using CARs • Selection: • Each rule does at least one correct prediction • Each case is covered by the rule with highest precedence • This algorithm is correct but is not efficient
Classifier building using CARs • For each case d in D • coverRules_d = all rules covering d • Sort D according to the precedence of the first rule that correctly predicts each case d • RuleSet = empty • Scan D again to find the optimal rule set
Refined Classification Based on Top-k RGS (RCBT) • General idea: • Construct the RCBT classifier from the top-k covering rule groups, so the number of rule groups generated is bounded • Efficiency and accuracy are validated by experimental results • Based on Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05
Dataset • In the microarray dataset: • Each row corresponds to a sample • Each item value corresponds to a discretized gene expression value • Class labels correspond to the category of the sample (e.g., cancer / not cancer) • Useful for diagnostic purposes
Introduction to gene expression data (Microarray) • Format of gene expression data: • Columns correspond to genes (thousands of them) • Rows correspond to samples with class labels (tens to hundreds of patients) • Expression values are discretized before mining
Rule Example Rule r: {a, e, h} → C Support(r) = 3 Confidence(r) = 66%
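The support and confidence of such a rule on a row-per-sample dataset can be sketched as below; the item sets and labels are an illustrative reconstruction consistent with the numbers on the slide (three rows contain {a, e, h}, two of them in class C).

```python
def support_and_confidence(rows, labels, antecedent_items, target_class):
    """rows: one set of discretized gene-expression items per sample;
    labels: the class label of each sample.
    Returns (absolute support count, confidence) of antecedent -> target_class."""
    covering = [i for i, row in enumerate(rows) if antecedent_items <= row]
    support = len(covering)  # absolute count, as on the slide
    confidence = sum(labels[i] == target_class for i in covering) / len(covering)
    return support, confidence

rows = [{"a", "e", "h", "b"},
        {"a", "e", "h"},
        {"a", "e", "h", "c"},
        {"b", "c"}]
labels = ["C", "C", "D", "D"]

sup, conf = support_and_confidence(rows, labels, {"a", "e", "h"}, "C")
print(sup, round(conf, 2))  # 3 0.67
```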
RG: General solution • Step 1: Find all frequently occurring itemsets in dataset D • Step 2: Generate rules of the form itemset → C; prune rules that do not have enough support and confidence
RG: Previous Algorithms • Item enumeration: search all the frequent itemsets by checking all possible combinations of items • The search process can be simulated in an item enumeration tree, e.g., over items {a, b, c}: {} {a} {b} {c} {ab} {ac} {bc} {abc}
Microarray data • Features of microarray data: • A few rows: 100–1000 • A large number of items: ~10,000 • The space of all combinations of items is huge: 2^10000
Motivations • Existing rule mining algorithms are very slow: the item search space is exponential in the number of items • Solution: use the idea of row enumeration to design a new algorithm • The number of association rules is huge, even for a given consequent • Solution: mine the top-k interesting rule groups for each row