An Interval Classifier for Database Mining Applications Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992. Presentation by: Vladan Radosavljevic
Outline • Introduction • Motivation • Interval Classifier • Example • Results • Conclusion
Introduction • Given a small set of labeled examples, find a classifier that will efficiently classify a large unlabeled population in a database • Equivalently: retrieve all examples from the database that belong to a desired class • Assumptions: the labeled examples are representative of the entire population, and the number of classes (m) is known in advance
Motivation • Why an Interval Classifier? • Neural networks – not database oriented; tuples must be retrieved into memory one at a time before classification • Decision trees (ID3, CART) – binary splits increase computation time, and pruning the tree after building makes tree generation more expensive
Interval Classifier (IC) • Key features: • Tree classifier • Categorical attributes – a branch for each value • Numerical attributes – the range is decomposed into k intervals, with k determined algorithmically for each node • IC generates SQL queries as the final classification functions!
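To make the last point concrete, here is a minimal sketch of the kind of SQL a fitted IC tree could emit, using class A from the example slides; the table name people and the exact query shape are assumptions, not taken from the paper:

```python
# Hypothetical SQL that an IC tree for class A of the example slides
# might generate; "people" and the column names are assumptions.
CLASS_A_QUERY = """
SELECT *
FROM people
WHERE (age < 40 AND elevel BETWEEN 0 AND 1)
   OR (age >= 40 AND age < 60 AND elevel BETWEEN 0 AND 3)
   OR (age >= 60 AND elevel = 0)
"""
```

Because the classifier is just a query, the database engine can retrieve all members of a class in one set-oriented pass instead of scoring tuples one at a time.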
Interval Classifier - Algorithm • Algorithm: • Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class with the largest frequency in that interval) • For each attribute compute the value of the goodness function – the information gain ratio (or the resubstitution error rate) – and select the winning attribute A • Then, for each partition of attribute A, set the strength of the winning class (weak or strong) based on its frequency and a predefined threshold [Figure: a numerical domain partitioned into intervals, each labeled with its winning class (R or G) and its strength (W = weak, S = strong)]
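A minimal Python sketch of the per-attribute step just described (equal-width partitioning, winning class per interval, and the resubstitution error used as one of the goodness functions); the function and variable names are our own, not from the paper:

```python
import numpy as np

def interval_winners(values, labels, n_intervals, n_classes):
    """Split a numerical attribute into equal-width intervals, build the
    per-interval class histogram, and pick each interval's winning class
    (the class with the largest frequency there)."""
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    # Map each value to its interval index in [0, n_intervals - 1]
    idx = np.clip(np.digitize(values, edges[1:-1]), 0, n_intervals - 1)
    hist = np.zeros((n_intervals, n_classes), dtype=int)
    np.add.at(hist, (idx, labels), 1)
    winners = hist.argmax(axis=1)   # winning class per interval
    win_freq = hist.max(axis=1)     # frequency of that winning class
    # Resubstitution error rate: 1 - sum(win_freq(part)) / total_freq
    error = 1.0 - win_freq.sum() / len(values)
    return edges, winners, win_freq, error
```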
Interval Classifier - Algorithm • … • Merge adjacent intervals that have the same winner with equal strength • Divide the training set of examples using the calculated intervals • Strong intervals become leaves with the winning class assigned • Recursively proceed with the weak intervals; stop when all intervals are strong or the specified maximum tree depth is reached [Figure: the strong (S) intervals become leaves; the weak (W) interval is expanded further]
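A small sketch of the merging step, assuming winners and strengths are per-interval sequences as produced by the previous sketch:

```python
def merge_adjacent(winners, strengths):
    """Merge runs of adjacent intervals whose winning class AND strength
    match; returns (first_interval, last_interval, winner, strength)."""
    merged, start = [], 0
    for i in range(1, len(winners) + 1):
        if (i == len(winners)
                or winners[i] != winners[start]
                or strengths[i] != strengths[start]):
            merged.append((start, i - 1, winners[start], strengths[start]))
            start = i
    return merged

# e.g. merge_adjacent([0, 0, 1, 1, 1], list("SSWWS"))
# -> [(0, 1, 0, 'S'), (2, 3, 1, 'W'), (4, 4, 1, 'S')]
```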
Interval Classifier - Pruning • Pruning • Dynamic – performed while the tree is generated • Compute the accuracy of the node on the training set • Expand the node only if its classification error is below a threshold that depends on the number of leaves and the overall accuracy • The aim is to check whether the expansion will bring an error reduction or not • To avoid pruning too aggressively, each node inherits a certain number of credits from its parent (see the sketch below)
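A very rough sketch of the expansion decision as this slide states it; the actual threshold formula and credit accounting are in the paper, and every name here is an assumption:

```python
def should_expand(node_error, threshold, credits):
    """Sketch of the slide's rule: expand a node whose training error
    passes the threshold test; a node that fails may still be expanded
    while it has inherited credits left, so pruning is not too eager."""
    if node_error < threshold:
        return True
    return credits > 0  # spend an inherited credit on one more expansion
```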
Example • Age: numerical, uniformly distributed over 20-80 • Zip code: categorical, uniformly distributed • Level of education (elevel): categorical, uniformly distributed • Two classes: • A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0) • B: otherwise
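A hedged sketch that generates synthetic tuples matching these definitions; the elevel range 0-4 is an assumption, and the paper's exact generator may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # training set size from the next slide

# Attribute values; the elevel range 0-4 is an assumption
age = rng.uniform(20, 80, size=n)
elevel = rng.integers(0, 5, size=n)

# Class A exactly as defined on the slide, class B otherwise
is_a = ((age < 40) & (elevel <= 1)) \
     | ((age >= 40) & (age < 60) & (elevel <= 3)) \
     | ((age >= 60) & (elevel == 0))
labels = np.where(is_a, 0, 1)  # 0 = class A, 1 = class B
```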
Example • 1000 training tuples • Compute the class histogram for the numerical attribute age by choosing 100 equi-distant intervals, and determine the winning class for each partition • Find the best attribute based on the resubstitution error rate: 1 - sum over partitions of win_freq(part) / total_freq
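Tying the sketches together, a usage example that runs interval_winners() from the algorithm sketch on the synthetic data above to get the resubstitution error for age:

```python
# 100 equi-distant intervals over age, two classes (A and B)
edges, winners, win_freq, err = interval_winners(
    age, labels, n_intervals=100, n_classes=2)
print(f"resubstitution error for age: {err:.3f}")
# The attribute with the smallest error (here, age) wins the split.
```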
Example • Choose age – it has the smallest error rate; partition the domain by merging adjacent intervals that have the same winning class with equal strengths
Example • Proceed with the weak nodes and repeat the same procedure • Finally, the recovered intervals match the classes defined at the beginning: • A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0) • B: otherwise
Results • Examples are generated with smooth boundaries between the groups • Training set: 2500 tuples; test set: 10000 • Fixed precision – threshold 0.9 • Adaptive precision – adaptive threshold • Error pruning – credits • Function 5 – nonlinear
Results • Comparison with ID3: [Figure: IC vs. ID3 comparison]
Conclusion • IC interfaces efficiently with database systems • Careful treatment of numerical attributes • Dynamic pruning • Too many user-defined parameters? • Scalability? • In practice, are k-ary trees less accurate than binary ones?
References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: "An Interval Classifier for Database Mining Applications", in Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992, pp. 560-573.