An Interval Classifier for Database Mining Applications Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992. Presentation by: Vladan Radosavljevic
Outline • Introduction • Motivation • Interval Classifier • Example • Results • Conclusion
Introduction • Given a small set of labeled examples, find a classifier that will efficiently classify a large unlabeled population in a database • Equivalently: retrieve all examples from the database that belong to a desired class • Assumptions: the labeled examples are representative of the entire population, and the number of classes (m) is known in advance
Motivation • Why an Interval Classifier? • Neural networks – not database oriented; tuples must be retrieved into memory one at a time before classification • Decision trees (ID3, CART) – binary splits increase computation time, and pruning the tree after building makes tree generation more expensive
Interval Classifier (IC) • Key features: • Tree classifier • Categorical attributes – a branch for each value • Numerical attributes – the range is decomposed into k intervals, with k determined algorithmically for each node • IC generates SQL queries as the final classification functions!
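To make the last point concrete, here is a minimal sketch of the kind of SQL a fitted IC tree could emit, using class A from the example slides; the table name people and the exact query shape are assumptions, not taken from the paper:

```python
# Hypothetical SQL that an IC tree for class A of the example slides
# might generate; "people" and the column names are assumptions.
CLASS_A_QUERY = """
SELECT *
FROM people
WHERE (age < 40 AND elevel BETWEEN 0 AND 1)
   OR (age >= 40 AND age < 60 AND elevel BETWEEN 0 AND 3)
   OR (age >= 60 AND elevel = 0)
"""
```

Because the classifier is just a query, the database engine can retrieve all members of a class in one set-oriented pass instead of scoring tuples one at a time.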
Interval Classifier - Algorithm • Algorithm: • Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class with the largest frequency in that interval) • For each attribute compute the value of the goodness function – the information gain ratio (or the resubstitution error rate) – and select the winning attribute A • Then, for each partition of attribute A, set the strength of the winning class (weak or strong) based on its frequency and a predefined threshold [Figure: a numerical domain partitioned into intervals, each labeled with its winning class (R or G) and its strength (W = weak, S = strong)]
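A minimal Python sketch of the per-attribute step just described (equal-width partitioning, winning class per interval, and the resubstitution error used as one of the goodness functions); the function and variable names are our own, not from the paper:

```python
import numpy as np

def interval_winners(values, labels, n_intervals, n_classes):
    """Split a numerical attribute into equal-width intervals, build the
    per-interval class histogram, and pick each interval's winning class
    (the class with the largest frequency there)."""
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    # Map each value to its interval index in [0, n_intervals - 1]
    idx = np.clip(np.digitize(values, edges[1:-1]), 0, n_intervals - 1)
    hist = np.zeros((n_intervals, n_classes), dtype=int)
    np.add.at(hist, (idx, labels), 1)
    winners = hist.argmax(axis=1)   # winning class per interval
    win_freq = hist.max(axis=1)     # frequency of that winning class
    # Resubstitution error rate: 1 - sum(win_freq(part)) / total_freq
    error = 1.0 - win_freq.sum() / len(values)
    return edges, winners, win_freq, error
```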
Interval Classifier - Algorithm • … • Merge adjacent intervals that have the same winner with equal strength • Divide the training set of examples using the calculated intervals • Strong intervals become leaves with the winning class assigned • Recursively proceed with the weak intervals; stop when all intervals are strong or the specified maximum tree depth is reached [Figure: the strong (S) intervals become leaves; the weak (W) interval is expanded further]
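A small sketch of the merging step, assuming winners and strengths are per-interval sequences as produced by the previous sketch:

```python
def merge_adjacent(winners, strengths):
    """Merge runs of adjacent intervals whose winning class AND strength
    match; returns (first_interval, last_interval, winner, strength)."""
    merged, start = [], 0
    for i in range(1, len(winners) + 1):
        if (i == len(winners)
                or winners[i] != winners[start]
                or strengths[i] != strengths[start]):
            merged.append((start, i - 1, winners[start], strengths[start]))
            start = i
    return merged

# e.g. merge_adjacent([0, 0, 1, 1, 1], list("SSWWS"))
# -> [(0, 1, 0, 'S'), (2, 3, 1, 'W'), (4, 4, 1, 'S')]
```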
Interval Classifier - Pruning • Pruning • Dynamic – performed while the tree is generated • Compute the accuracy of the node on the training set • Expand the node only if its classification error is below a threshold that depends on the number of leaves and the overall accuracy • The aim is to check whether the expansion will bring an error reduction or not • To avoid pruning too aggressively, each node inherits a certain number of credits from its parent (see the sketch below)
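A very rough sketch of the expansion decision as this slide states it; the actual threshold formula and credit accounting are in the paper, and every name here is an assumption:

```python
def should_expand(node_error, threshold, credits):
    """Sketch of the slide's rule: expand a node whose training error
    passes the threshold test; a node that fails may still be expanded
    while it has inherited credits left, so pruning is not too eager."""
    if node_error < threshold:
        return True
    return credits > 0  # spend an inherited credit on one more expansion
```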
Example • Age: numerical, uniformly distributed over 20-80 • Zip code: categorical, uniformly distributed • Level of education (elevel): categorical, uniformly distributed • Two classes: • A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0) • B: otherwise
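A hedged sketch that generates synthetic tuples matching these definitions; the elevel range 0-4 is an assumption, and the paper's exact generator may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # training set size from the next slide

# Attribute values; the elevel range 0-4 is an assumption
age = rng.uniform(20, 80, size=n)
elevel = rng.integers(0, 5, size=n)

# Class A exactly as defined on the slide, class B otherwise
is_a = ((age < 40) & (elevel <= 1)) \
     | ((age >= 40) & (age < 60) & (elevel <= 3)) \
     | ((age >= 60) & (elevel == 0))
labels = np.where(is_a, 0, 1)  # 0 = class A, 1 = class B
```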
Example • 1000 training tuples • Compute the class histogram for the numerical attribute age by choosing 100 equi-distant intervals, and determine the winning class for each partition • Find the best attribute based on the resubstitution error rate: 1 - sum over partitions of win_freq(part) / total_freq
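Tying the sketches together, a usage example that runs interval_winners() from the algorithm sketch on the synthetic data above to get the resubstitution error for age:

```python
# 100 equi-distant intervals over age, two classes (A and B)
edges, winners, win_freq, err = interval_winners(
    age, labels, n_intervals=100, n_classes=2)
print(f"resubstitution error for age: {err:.3f}")
# The attribute with the smallest error (here, age) wins the split.
```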
Example • Choose age – it has the smallest error rate; partition the domain by merging adjacent intervals that have the same winning class with equal strengths
Example • Proceed with the weak nodes and repeat the same procedure • Finally, the recovered intervals match the classes defined at the beginning: • A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0) • B: otherwise
Results • Examples are generated with smooth boundaries between the groups • Training set: 2500 tuples; test set: 10000 • Fixed precision – threshold 0.9 • Adaptive precision – adaptive threshold • Error pruning – credits • Function 5 – nonlinear
Results • Comparison with ID3: [Figure: IC vs. ID3 comparison]
Conclusion • IC interfaces efficiently with database systems • Careful treatment of numerical attributes • Dynamic pruning • Too many user-defined parameters? • Scalability? • In practice, are k-ary trees less accurate than binary ones?
References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: "An Interval Classifier for Database Mining Applications", in Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992, pp. 560-573.