SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic

Outline • Introduction • Motivation • SLIQ Algorithm • Building tree • Pruning • Example • Results • Conclusion

Introduction • Most of the classification algorithms are designed for memory resident data – limited suitability for mining large datasets • Solution – build a scalable classifier - SLIQ • SLIQ – Supervised Learning in Quest, Quest was the data mining project at the IBM

Motivation • Recall (ID3, C4.5, CART):

Motivation • NON SCALABLE DECISION TREES: • The complexity lies in determining the best split for each attribute • The cost of evaluating splits for numerical attributes is dominated by the cost of sorting values at each node • The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset • Pruning • crossvalidation inapplicable for large datasets • divide data in two parts - training and test set - sizes, distribution???

Motivation • Improve scalability of tree classifiers • Previous proposals: • Sampling data at each node • Discretization of numerical attributes • Partitioning input data and build tree for each partition • All methods achieve low accuracy! • SLIQ – improve learning time without loss in accuracy!

SLIQ • Key features: • Tree classifier, handling both numerical and categorical attributes • Presort numerical attributes before tree has been built • Breadth first growing strategy • Goodness test – Gini index • Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)

SLIQ - Algorithm • Eliminate the need to sort the data at each node • Create sorted list for each numerical attribute • Create class list

SLIQ - Algorithm • Example:

SLIQ - Algorithm • Split evaluation:

SLIQ - Algorithm • Update class list:

SLIQ - Algorithm • For large-cardinality categorical attributes (determined based on threshold) the best split is computed in greedy way, otherwise all possible splits are evaluated • When node becomes pure stop splitting it, then condense attribute lists by discarding examples that correspond to the pure node • SLIQ is able to scale for large datasets with no loss in accuracy – the splits evaluated with or without pre-sorting are identical

SLIQ - Pruning • Post pruning algorithm based on Minimum Description Length principle • Find a model that minimizes: Cost(M,D) = Cost(D|M) + Cost(M) Cost(M) - cost of the model Cost(D|M) - cost of encoding the data D if model M is given

SLIQ - Pruning • Cost of the data: classification error • Cost of the model: • Encoding the tree: number of bits • Encoding the splits: • numerical attribute - constant (empirically 1) • categorical attribute - depends on cardinality • The MDL pruning evaluate the code length at each node to determine whether to prune one or both child or leave the node intact

SLIQ - pruning • Three pruning strategies: • Full – pruning both children and convert node to the leaf • Partial – prune into the leaf or prune the left child or prune the right child or leave node intact • Hybrid – apply Full method and then partial (prune left, prune right or leave intact)

Results • SLIQ was tested on the datasets:

Results • Pruning strategy comparison:

Results • Accuracy:

Results • Scalability:

Conclusion • SLIQ demonstrates to be a fast, low-cost and scalable classifier that builds accurate trees • Based on empirical test which compared SLIQ to other tree based classifiers, SLIQ achieves a comparable accuracy while producing smaller decision trees • Scalability??? Memory problem when increasing number of attributes or number of classes

References [1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.

THANK YOU!

SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ: A Fast Scalable Classifier for Data Mining

Presentation Transcript

Regression for Data Mining

Data Mining: Concepts and Techniques

Chapter 2 Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining

Data Mining Tools

CS 277: Data Mining Notes on Classification

CS 490 Sample Project Mining the Mushroom Data Set

INTRODUCTION TO DATA MINING

Web Mining : A Bird ’ s Eye View

CS590D: Data Mining Prof. Chris Clifton

XML and Web Data

Ch5 Data Mining Classification: Alternative Techniques

DATA WAREHOUSING AND DATA MINING

Scalable Web Architectures

CS590D: Data Mining Chris Clifton

DATA MINING LECTURE 4

Data Mining : Implementations