This presentation introduces the Interval Classifier (IC) for database mining applications, developed by Agrawal, Ghosh, Imielinski, Iyer, and Swami. The IC classifies large databases efficiently by partitioning numerical attributes into intervals and identifying the winning class for each interval. The algorithm computes a goodness function for each attribute and prunes the tree dynamically while it is being generated. Results are reported on examples with smooth class boundaries, and the IC interfaces well with database systems. Open questions about scalability, the number of user-defined parameters, and the accuracy of k-ary trees compared with binary ones are discussed.
An Interval Classifier for Database Mining Applications Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami Proceedings of the 18th VLDB Conference Vancouver, Canada, 1992. Presentation by: Vladan Radosavljevic
Outline • Introduction • Motivation • Interval Classifier • Example • Results • Conclusion
Introduction • Given a small set of labeled examples, find a classifier that will efficiently classify a large unlabeled population database • Or – retrieve all examples from the database that belong to the desired class • Assumptions: the labeled examples are representative of the entire population; the number of classes (m) is known in advance
Motivation • Why an Interval Classifier? • Neural Networks – not database oriented, tuples have to be retrieved one at a time into memory before classification • Decision Trees (ID3, CART) – binary splits increase computation time, pruning the tree after building makes the tree generation more expensive
Interval Classifier (IC) • Key features: • Tree classifier • Categorical attributes – branches for each value • Numerical attributes – decomposing range into k intervals, k determined algorithmically for each node • IC generates SQL queries as final classification functions!
Interval Classifier - Algorithm • Algorithm: • Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class with the largest frequency in that interval) • For each attribute compute the value of the goodness function - information gain ratio (or resubstitution error rate) - and find the winner attribute A • Then, for each partition of attribute A, set the strength of the winning class (weak or strong) based on its frequency and a predefined threshold [Figure: intervals labeled with their winning class and strength (W = weak, S = strong)]
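The first and third steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `interval_winners`, the equal-width split, and the rule "strong if the winner's relative frequency reaches `strong_threshold`" are assumptions made for the example, since the slides only say that strength depends on frequency and a predefined threshold.

```python
from collections import Counter

def interval_winners(values, labels, lo, hi, k, strong_threshold):
    """Split [lo, hi] into k equal-width intervals and return one
    (winning_class, strength) pair per interval."""
    width = (hi - lo) / k
    counts = [Counter() for _ in range(k)]
    for v, c in zip(values, labels):
        # Clamp so that v == hi falls into the last interval.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx][c] += 1

    result = []
    for ctr in counts:
        total = sum(ctr.values())
        if total == 0:
            # Assumption: an interval with no examples is treated as weak.
            result.append((None, "weak"))
            continue
        winner, freq = ctr.most_common(1)[0]
        # Assumption: strength is decided by relative frequency of the winner.
        strength = "strong" if freq / total >= strong_threshold else "weak"
        result.append((winner, strength))
    return result

ages = [20, 25, 30, 45, 50, 55, 70, 75]
classes = ["A", "A", "A", "A", "B", "A", "B", "B"]
winners = interval_winners(ages, classes, lo=20, hi=80, k=3, strong_threshold=0.9)
```

With these toy values the middle interval (40 to 60) has winner A at only 2 of 3 examples, so it comes out weak and would be the one expanded further.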
Interval Classifier - Algorithm • … • Merge adjacent intervals that have the same winner with equal strength • Divide the training set of examples using the calculated intervals • Strong intervals become leaves with the assigned winning class • Recursively proceed with weak intervals. Stop when all intervals are strong, or when the specified maximum tree depth is reached [Figure: a weak (W) interval is expanded further; strong (S) intervals become leaves]
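The merge step can be sketched as a single pass over the intervals. The data layout (a list of `k + 1` boundaries plus `k` `(class, strength)` tags) and the name `merge_intervals` are assumptions chosen for illustration.

```python
def merge_intervals(bounds, tags):
    """Merge adjacent intervals that carry the same (class, strength) tag.

    bounds: k + 1 increasing boundaries; tags: k (class, strength) tuples.
    Returns a list of (low, high, tag) triples.
    """
    merged = []
    for i, tag in enumerate(tags):
        if merged and merged[-1][2] == tag:
            # Same winner and same strength as the previous run: extend it.
            low, _, _ = merged[-1]
            merged[-1] = (low, bounds[i + 1], tag)
        else:
            merged.append((bounds[i], bounds[i + 1], tag))
    return merged

bounds = [20, 30, 40, 50, 60, 70, 80]
tags = [("A", "strong"), ("A", "strong"), ("A", "weak"),
        ("A", "weak"), ("B", "strong"), ("B", "strong")]
merged = merge_intervals(bounds, tags)
```

Note that the two A runs are not merged with each other: they share a winner but differ in strength, so the weak run stays separate and remains a candidate for recursive expansion.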
Interval Classifier - Pruning • Pruning • Dynamic, performed while the tree is generated • Find the accuracy for the node using the training set • Expand the node only if the classification error is below a threshold that depends on the number of leaves and the overall accuracy • The aim is to check whether the expansion will bring an error reduction or not • To avoid pruning too aggressively – each node inherits a certain number of credits from its parent
Example • Age: numerical, uniformly distributed 20-80 • Zip-code: categorical, uniformly • Level of Education, elevel: categorical, unif. • Two classes: • A: (age<40 and elevel 0 to 1) OR (40<=age<60 and elevel 0 to 3) OR (age>=60 and elevel 0) • B: otherwise
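The class definition above can be written as a small labeling function, together with a generator for synthetic tuples. The slides do not state the value ranges of `elevel` and zip code, so the ranges used below (education levels 0 to 4, nine zip codes) are assumptions for illustration, as are the names `label` and `generate`.

```python
import random

def label(age, elevel):
    """Return 'A' or 'B' according to the class definition on the slide."""
    if age < 40:
        return "A" if elevel <= 1 else "B"
    if age < 60:
        return "A" if elevel <= 3 else "B"
    return "A" if elevel == 0 else "B"

def generate(n, seed=0):
    """Generate n tuples (age, zipcode, elevel, class) with uniform attributes."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        zipcode = rng.randint(0, 8)  # assumption: nine distinct zip codes
        elevel = rng.randint(0, 4)   # assumption: education levels 0..4
        rows.append((age, zipcode, elevel, label(age, elevel)))
    return rows

training = generate(1000)
```

Zip code never appears in `label`, so a good attribute-selection step should rank it below age and elevel.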
Example • 1000 training tuples • Calculate class histogram for numerical attribute age by choosing 100 equi-distant intervals and determine winning class for each partition • Find the best attribute based on the resubstitution error rate: 1-sum(win_freq(part)/total_freq)
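The resubstitution error rate from the slide, 1 - sum(win_freq(part)) / total_freq, can be computed as follows. This is a sketch under the assumption that each tuple has already been assigned a partition id (the interval index for a numerical attribute, or the value itself for a categorical one); the function name is hypothetical.

```python
from collections import Counter, defaultdict

def resubstitution_error(partition_ids, labels):
    """Fraction of training tuples misclassified when every partition
    predicts its own winning (most frequent) class."""
    per_partition = defaultdict(Counter)
    for part, cls in zip(partition_ids, labels):
        per_partition[part][cls] += 1
    total = len(labels)
    winners_total = sum(max(ctr.values()) for ctr in per_partition.values())
    return 1 - winners_total / total

parts = [0, 0, 0, 1, 1, 1, 1, 2]
classes = ["A", "A", "B", "B", "B", "B", "A", "A"]
error = resubstitution_error(parts, classes)
```

Here the winning frequencies are 2, 3 and 1 out of 8 tuples, so the error is 1 - 6/8. The attribute with the smallest such error would be selected as the winner attribute.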
Example • Choose age, which has the smallest error rate, and partition its domain by merging adjacent intervals that have the same winning class with equal strength
Example • Proceed with weak nodes and repeat the same procedure • Finally: • Classes defined in the beginning: • A: (age<40 and elevel 0 to 1) OR • (40<=age<60 and elevel 0 to 3) OR • (age>=60 and elevel 0) • B: otherwise
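Since the IC produces SQL queries as its final classification functions, the leaves of the finished tree can be turned into a retrieval query for one class. The sketch below is an illustration only: the slides do not show the generated SQL, and the table name `people`, the leaf representation, and the function `leaves_to_sql` are assumptions.

```python
def leaves_to_sql(table, leaves, target_class):
    """Build a SELECT that retrieves all tuples classified as target_class.

    leaves: list of (class, [condition, ...]) pairs, one per leaf, where the
    conditions are the tests on the path from the root to that leaf.
    """
    clauses = [
        "(" + " AND ".join(conditions) + ")"
        for cls, conditions in leaves
        if cls == target_class
    ]
    # With no matching leaf, return a query that selects nothing.
    where = " OR ".join(clauses) if clauses else "1 = 0"
    return f"SELECT * FROM {table} WHERE {where}"

leaves = [
    ("A", ["age < 40", "elevel <= 1"]),
    ("B", ["age < 40", "elevel > 1"]),
    ("A", ["age >= 40", "age < 60", "elevel <= 3"]),
    ("A", ["age >= 60", "elevel = 0"]),
]
query = leaves_to_sql("people", leaves, "A")
```

For the leaves listed here, the resulting WHERE clause reproduces the definition of class A given at the start of the example.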
Results • Generate examples with smooth boundaries among the groups • Training set 2500 tuples, test 10000 • Fixed precision – threshold 0.9 • Adaptive precision – adaptive threshold • Error pruning – credits • Function 5 – nonlinear
Results • Comparison with ID3:
Conclusion • IC interfaces efficiently with database systems • Treatment of numerical attributes • Dynamic pruning • Too many user-defined parameters? • Scalability? • In practice, are k-ary trees less accurate than binary ones?
References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: “An Interval Classifier for Database Mining Applications”, in Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992, pp. 560-573