250 likes | 580 Views
SLIQ: A Fast Scalable Classifier for Data Mining. Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic. Outline. Introduction Motivation SLIQ Algorithm Building tree Pruning Example Results Conclusion. Introduction.
E N D
SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic
Outline • Introduction • Motivation • SLIQ Algorithm • Building tree • Pruning • Example • Results • Conclusion
Introduction • Most of the classification algorithms are designed for memory resident data – limited suitability for mining large datasets • Solution – build a scalable classifier - SLIQ • SLIQ – Supervised Learning in Quest, Quest was the data mining project at the IBM
Motivation • Recall (ID3, C4.5, CART):
Motivation • NON SCALABLE DECISION TREES: • The complexity lies in determining the best split for each attribute • The cost of evaluating splits for numerical attributes is dominated by the cost of sorting values at each node • The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset • Pruning • crossvalidation inapplicable for large datasets • divide data in two parts - training and test set - sizes, distribution???
Motivation • Improve scalability of tree classifiers • Previous proposals: • Sampling data at each node • Discretization of numerical attributes • Partitioning input data and build tree for each partition • All methods achieve low accuracy! • SLIQ – improve learning time without loss in accuracy!
SLIQ • Key features: • Tree classifier, handling both numerical and categorical attributes • Presort numerical attributes before tree has been built • Breadth first growing strategy • Goodness test – Gini index • Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
SLIQ - Algorithm • Eliminate the need to sort the data at each node • Create sorted list for each numerical attribute • Create class list
SLIQ - Algorithm • Example:
SLIQ - Algorithm • Split evaluation:
SLIQ - Algorithm • Example:
SLIQ - Algorithm • Update class list:
SLIQ - Algorithm • Example:
SLIQ - Algorithm • For large-cardinality categorical attributes (determined based on threshold) the best split is computed in greedy way, otherwise all possible splits are evaluated • When node becomes pure stop splitting it, then condense attribute lists by discarding examples that correspond to the pure node • SLIQ is able to scale for large datasets with no loss in accuracy – the splits evaluated with or without pre-sorting are identical
SLIQ - Pruning • Post pruning algorithm based on Minimum Description Length principle • Find a model that minimizes: Cost(M,D) = Cost(D|M) + Cost(M) Cost(M) - cost of the model Cost(D|M) - cost of encoding the data D if model M is given
SLIQ - Pruning • Cost of the data: classification error • Cost of the model: • Encoding the tree: number of bits • Encoding the splits: • numerical attribute - constant (empirically 1) • categorical attribute - depends on cardinality • The MDL pruning evaluate the code length at each node to determine whether to prune one or both child or leave the node intact
SLIQ - pruning • Three pruning strategies: • Full – pruning both children and convert node to the leaf • Partial – prune into the leaf or prune the left child or prune the right child or leave node intact • Hybrid – apply Full method and then partial (prune left, prune right or leave intact)
Results • SLIQ was tested on the datasets:
Results • Pruning strategy comparison:
Results • Accuracy:
Results • Scalability:
Conclusion • SLIQ demonstrates to be a fast, low-cost and scalable classifier that builds accurate trees • Based on empirical test which compared SLIQ to other tree based classifiers, SLIQ achieves a comparable accuracy while producing smaller decision trees • Scalability??? Memory problem when increasing number of attributes or number of classes
References [1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.