330 likes | 678 Views
Fuzzy Interpretation of Discretized Intervals Dr. Xindong Wu. Andrea Porter April 11, 2002. Plan For Presentation. Introduction to Problem, HCV Discretization Techniques/Fuzzy Borders A Hybrid Solution for HCV Experiments and Results Conclusion. Introduction.
E N D
Fuzzy Interpretation of Discretized IntervalsDr. Xindong Wu Andrea Porter April 11, 2002
Plan For Presentation • Introduction to Problem, HCV • Discretization Techniques/Fuzzy Borders • A Hybrid Solution for HCV • Experiments and Results • Conclusion
Introduction • Real-world data contains both numerical and nominal data, must be able to deal with different types of data. • Existing systems discretize numerical domains into intervals and treat intervals as nominal values during induction. • Problems occur if test examples are not covered in training data (no-match, multiple match) • The solution is a hybrid approach using fuzzy intervals for no-match problem.
HCV • Attribute based rule induction algorithm, extension matrix approach • Divide positive examples into intersecting groups • Find a heuristic conjunctive rule in each group that covers all PE and no NE • HCV can find a rule in the form of variable-valued logic • More compact than the decision trees/rules of ID3 and C4.5
Variable Valued Logic and Selectors • Represents decisions where variables can take a range • Selector: [ X # R ] X = attribute # = relational operator ( = , <, >, . . . ) R = Reference, list of 1 or more values e.g [ Windy = true] [Temp > 90]
HCV Software • C++ implementation • Can work with noisy and real-valued domains as well as nominal and noise-free databases • Provides a set of deduction facilities for the user to test the accuracy of the produced rules on test examples
C4.5:The T class X2 = b X1 = 0 & X3 = a X1 = 0 & X3 = b X1 = 0 & X2 = a C4.5 Results vs. HCV • HCV:The T class • X2 = b • X1 = 0 & X2 = a • X1 = 0 & X4 = 0 • C4.5:The F class X1 = 1 & X2 = a • X1 = 1 & X2 = c • X2 = c & X3 = c
Deduction of Induction Results • Induction generates knowledge from existing data • Deduction applies induction results to interpret new data. • With real-world data, induction can not be assumed to be perfect • Three cases: 1) no-match (measure of fit) 2) single-match 3) multiple-match (estimate of probability)
Discretization • Occurs during rule induction • Discretize numerical domains into intervals and treat similar to nominal values. • The challenge is to find the right borders for the intervals • Possible Methods: 1) Simplest Class-Separating Method 2) Information Gain Heuristic (implemented in HCV)
Simplest Class- Separating Method: • Interval Borders are places between each adjacent pair of examples which have different classes. • If attribute is very informative - method is efficient and useful. • If attribute is not informative - method produces too many intervals
Information Gain Heuristic Use IGH to find more informative border. • x = (xi + xi+1)/2 for (i = 1, …, n-1) • x is a possible cut point if xi and xi+1 are of different classes. • Use IGH to find best x • Recursively split on left and right • To stop recursive splitting: 1) stop if IGH is same on all possible cut points. 2) stop if # of examples to split is less than a predefined number 3) limit the number of intervals
Fuzzy Borders • Discretization of continuous domains does not always fit accurate interpretation. • Instead of using sharp borders, use a membership function, measures the degree of membership. • A value can be classified into a few different intervals at the same time (e.g. single to multiple match)
Fuzzy Borders (2) • Fuzzy matching - deduction with fuzzy borders of discretized intervals. • Take the interval with the greatest degree as the value’s discrete value. • 3 functions to fuzzify borders: 1) linear 2) polynomial 3) arctan • Definitions s = spread parameter l = length of original xleft, xright = left/right sharp borders l xleft xright
l sl xleft xright Linear Membership Function a = -kxleft + 1/2 b = kxright + 1/2 linleft(x) = kx + a lin right(x) = -kx + b lin(x) = MAX(0, MIN(1,linleft(x),linright(x))) k = 1/2sl
*Polynomial Membership Function polyside(x) = asidex3 + bsidex2 + csidex + dside aside = 1/(4(ls)3) bside = -3asidexside side {left,right} cside = 3aside(xside2 - (ls)2) dside = -a(xside3 -3xside(ls)2 + 2(ls)3) polyleft(x), if xleft -ls x xleft + ls poly(x) = polyright(x), if xright -ls x xright +ls 1, if xleft +ls x xright -ls 0, otherwise
Match Degree • Selector method - take the max membership degree of the value in all the intervals involved. If 2 adjacent intervals have the same class, values close to the border will have low membership. • Conjunction method - adds with fuzzy plus ab=a + b - ab
No-Match Resolution Largest Class • Assign all no match examples to the largest class, the default class. • Works well, if the number of classes in a training set is small and one class is clearly larger. • Deteriorates if there is a larger number of classes and the examples are evenly distributed
No-Match Resolution Measure of Fit Calculate the measure of fit for each class: 1) calculate MF for each selector (sel) MF(sel, e) = 1, if sel is satisfied by e n/|x|, otherwise 2) calculate MF for each conjunctive rule(conj) MF(conj, e) = MF(sel, e) * n(conj)/N
No-Match Resolution Measure of Fit (2) 3) calculate MF for each class c MF(c, e) = MF(conj1, e) + MF(conj2, e) - MF(conj1,e)MF(conj2,e) * For more than two rules, apply formula recursively. * Find maximum MF - determines which class is closest to the example
Multiple-Match • Caused by over-generalization of the training examples at induction time • Example • (X1 = a, X2 = 1) • All PE cover X1 = a • All NE cover X2 = 1 • Multiple Match
Multiple-Match Resolution First Hit • Use first rule which classifies the example • Produces reasonable results if the rules from induction have been ordered according to a measure of reliability • Advantages - straightforward, efficient • Disadvantages - have to sort rules at induction time
Multiple-Match Resolution Largest Rule • Similar to largest class method from no-match resolution • Choose conjunctive rule that covers the most examples in the training set.
Multiple-Match Resolution Estimation of Probability • Assign EP value to each class based on the size of the satisfied conjunctive rules. 1) Find EP for each conjunctive rule (conj): EP(conj, e)= { n(conj)/N, if conj is satisfied by e 0, otherwise n(conj) = number of examples covered by conj N = number of total examples
Multiple-Match Resolution Estimation of Probability (2) 2) Find EP value for each class: EP(c, e) = EP(conj1, e) + EP(conj2, e) - EP(conj1,e)EP(conj2,e). * For more rules, apply formula recursively * Choose class with highest EP value
Hybrid Interpretation • Used because fuzzy borders only add conflicts because they don’t reduce the number rules that are applicable • HCV - use sharp borders during induction and use fuzzy borders only during deduction • Algorithm: * Single match - use class indicated by rules * Multiple match - use estimation probability (EP) with sharp borders * No match - use fuzzy borders with polynomial membership function to find closest rule
The Data • Used 17 databases from the Machine Learning Database Repository, U. of California, Irvine. • Databases selected because: 1) All include numerical data 2) All lead to situations where no rules clearly apply.
Results (cont.) • The results shown for C4.5 and NewID are the pruned ones • These were usually better than the unpruned ones in this experiment • HCV did not fine tune different parameters because this would be loss of generality and applicability of the conclusions
Accuracy Results • HCV(hybrid) - 9 databases • C4.5 (R 8) - 7 databases • C4.5 (R 5) - 6 databases • HVC (large) - 3 databases • HCV (fuzzy) - 2 databases
HCV Comparison • HCV (fuzzy) generally performs better than the simple largest class method • HCV’s performance improves significantly when the fuzzy borders (for no match) are combined with probability estimation (for multiple match) in HCV (hybrid)
Conclusions • Fuzzy borders are constructed and used at deduction time only when a no match case occurs. • This hybrid method performs more accurately than several other current deduction programs. • Fuzziness is strongly domain dependent, HCV allows the user to specify their own intervals and fuzzy functions.