1 / 24

SLIQ: A Fast Scalable Classifier for Data Mining

This presentation provides an introduction to the SLIQ algorithm, a scalable classifier for data mining. The algorithm improves learning time without sacrificing accuracy by eliminating the need to sort data at each node and using a inexpensive tree pruning algorithm based on Minimum Description Length (MDL). Results show that SLIQ builds accurate trees with comparable accuracy to other tree-based classifiers while producing smaller decision trees.

llackey
Download Presentation

SLIQ: A Fast Scalable Classifier for Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic

  2. Outline • Introduction • Motivation • SLIQ Algorithm • Building tree • Pruning • Example • Results • Conclusion

  3. Introduction • Most of the classification algorithms are designed for memory resident data – limited suitability for mining large datasets • Solution – build a scalable classifier - SLIQ • SLIQ – Supervised Learning in Quest, Quest was the data mining project at the IBM

  4. Motivation • Recall (ID3, C4.5, CART):

  5. Motivation • NON SCALABLE DECISION TREES: • The complexity lies in determining the best split for each attribute • The cost of evaluating splits for numerical attributes is dominated by the cost of sorting values at each node • The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset • Pruning • crossvalidation inapplicable for large datasets • divide data in two parts - training and test set - sizes, distribution???

  6. Motivation • Improve scalability of tree classifiers • Previous proposals: • Sampling data at each node • Discretization of numerical attributes • Partitioning input data and build tree for each partition • All methods achieve low accuracy! • SLIQ – improve learning time without loss in accuracy!

  7. SLIQ • Key features: • Tree classifier, handling both numerical and categorical attributes • Presort numerical attributes before tree has been built • Breadth first growing strategy • Goodness test – Gini index • Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)

  8. SLIQ - Algorithm • Eliminate the need to sort the data at each node • Create sorted list for each numerical attribute • Create class list

  9. SLIQ - Algorithm • Example:

  10. SLIQ - Algorithm • Split evaluation:

  11. SLIQ - Algorithm • Example:

  12. SLIQ - Algorithm • Update class list:

  13. SLIQ - Algorithm • Example:

  14. SLIQ - Algorithm • For large-cardinality categorical attributes (determined based on threshold) the best split is computed in greedy way, otherwise all possible splits are evaluated • When node becomes pure stop splitting it, then condense attribute lists by discarding examples that correspond to the pure node • SLIQ is able to scale for large datasets with no loss in accuracy – the splits evaluated with or without pre-sorting are identical

  15. SLIQ - Pruning • Post pruning algorithm based on Minimum Description Length principle • Find a model that minimizes: Cost(M,D) = Cost(D|M) + Cost(M) Cost(M) - cost of the model Cost(D|M) - cost of encoding the data D if model M is given

  16. SLIQ - Pruning • Cost of the data: classification error • Cost of the model: • Encoding the tree: number of bits • Encoding the splits: • numerical attribute - constant (empirically 1) • categorical attribute - depends on cardinality • The MDL pruning evaluate the code length at each node to determine whether to prune one or both child or leave the node intact

  17. SLIQ - pruning • Three pruning strategies: • Full – pruning both children and convert node to the leaf • Partial – prune into the leaf or prune the left child or prune the right child or leave node intact • Hybrid – apply Full method and then partial (prune left, prune right or leave intact)

  18. Results • SLIQ was tested on the datasets:

  19. Results • Pruning strategy comparison:

  20. Results • Accuracy:

  21. Results • Scalability:

  22. Conclusion • SLIQ demonstrates to be a fast, low-cost and scalable classifier that builds accurate trees • Based on empirical test which compared SLIQ to other tree based classifiers, SLIQ achieves a comparable accuracy while producing smaller decision trees • Scalability??? Memory problem when increasing number of attributes or number of classes

  23. References [1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.

  24. THANK YOU!

More Related