Classification using Decision Trees

Classification using Decision Trees • Data Mining and Information • Data Mining and Machine Learning Techniques • Decision Trees and C5 • Applications • Plan for this week

Data Mining and Information • Any result should answer a practical or theoretical question. • In most applications, results must be interpretable to be useful. • Data mining: the process of finding, interpreting, and evaluating patterns in large sets of data.

Data Mining and Machine Learning Techniques • Machine learning programs adapt their behavior with experience. To “learn” is to be trained on data by a set of well-defined instructions: machine learning algorithms. • Data mining tools are supplements to, rather than substitutes for, human knowledge and intuition. • The objective of running a learning algorithm on the data is to find patterns or trends that aid in understanding the data.

Model Classification by Outcome

Classification Problem • Given a dataset D and a class label C, find a classifier d such that the misclassification rate of d is minimized. • Goal: to produce an accurate classifier and to understand the problem structure • Requirements: high accuracy, interpretability, fast construction for very large training data
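As a minimal sketch of the quantity being minimized, the misclassification rate of a classifier d is the fraction of records whose predicted class differs from the true class. The toy records, the labels, and the rule inside `d` below are hypothetical and only serve as an illustration.

```python
def misclassification_rate(d, records, labels):
    """Fraction of records for which classifier d predicts the wrong class."""
    if not records:
        raise ValueError("records must not be empty")
    errors = sum(1 for x, c in zip(records, labels) if d(x) != c)
    return errors / len(records)


# Hypothetical toy dataset D: (outlook, windy) with class label C = play?
records = [("sunny", False), ("sunny", True), ("rainy", False), ("rainy", True)]
labels = ["yes", "yes", "yes", "no"]


def d(record):
    """A deliberately crude classifier that only looks at the outlook."""
    outlook, _windy = record
    return "yes" if outlook == "sunny" else "no"


rate = misclassification_rate(d, records, labels)
print(rate)  # one of the four records is misclassified
```

In practice the rate is usually estimated on data held out from training, since the rate measured on the training data itself tends to be optimistic.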

Decision Trees • A decision tree T encodes d (a classifier) in the form of a tree • Internal node: a binary or k-ary split • Leaf node: labeled with one class label
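One way to picture this encoding is a small node type in which internal nodes hold a splitting attribute with one child per attribute value, and leaves hold a class label. The sketch below is illustrative only; the names `TreeNode` and `classify` and the toy tree are assumptions, not part of C5 or any other package.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TreeNode:
    label: Optional[str] = None        # class label, set on leaf nodes only
    attribute: Optional[str] = None    # splitting attribute, set on internal nodes only
    children: Dict[str, "TreeNode"] = field(default_factory=dict)  # one child per value (k-ary)

    def is_leaf(self) -> bool:
        return not self.children


def classify(node: TreeNode, record: Dict[str, str]) -> Optional[str]:
    """Follow the splits from the root down to a leaf and return its label."""
    while not node.is_leaf():
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical tree: split on outlook, then on windy for rainy days.
tree = TreeNode(attribute="outlook", children={
    "sunny": TreeNode(label="yes"),
    "rainy": TreeNode(attribute="windy", children={
        "true": TreeNode(label="no"),
        "false": TreeNode(label="yes"),
    }),
})

prediction = classify(tree, {"outlook": "rainy", "windy": "true"})
print(prediction)
```

Each root-to-leaf path reads as an if-then rule, which is what makes the model interpretable.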

Decision Tree Construction • Top-down tree construction schema: • Examine the training data and find the best splitting attribute for the root node • Partition the training data according to that attribute • Recur on each child node

Decision Tree Construction (contd.)
BuildTree(Node t, Training data D, Split Selection Method S)
• Apply S to D to find the splitting criterion
• If (t is not a leaf node)
    • create children nodes of t
    • partition D into children partitions
    • recur on each partition
• Endif
Three algorithmic components: • Split selection (C5, CART, QUEST, …) • Pruning • Data access
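The BuildTree schema can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the C5, CART, or QUEST implementation: attributes are categorical, the split selection method S is a plain Gini-based chooser (`select_split`, a hypothetical helper), a node becomes a leaf when no attribute reduces impurity, and pruning and data access are left out. Nodes are returned as nested dictionaries rather than passed in as `t`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def select_split(rows, labels, attributes):
    """Assumed split selection method S: attribute with the lowest weighted Gini, or None."""
    best_attr, best_score = None, gini(labels)
    for attr in attributes:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        score = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        if score < best_score:  # require a strict improvement over the parent
            best_attr, best_score = attr, score
    return best_attr


def build_tree(rows, labels, attributes, split=select_split):
    attr = split(rows, labels, attributes)  # apply S to D
    if attr is None:  # leaf node: label it with the majority class
        return {"label": Counter(labels).most_common(1)[0][0]}
    partitions = {}  # partition D into children partitions
    for row, label in zip(rows, labels):
        part_rows, part_labels = partitions.setdefault(row[attr], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    remaining = [a for a in attributes if a != attr]
    return {  # recur on each partition
        "attribute": attr,
        "children": {value: build_tree(r, l, remaining, split)
                     for value, (r, l) in partitions.items()},
    }


# Hypothetical toy training data.
rows = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "rainy", "windy": "false"},
    {"outlook": "rainy", "windy": "true"},
]
labels = ["yes", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

Passing the split chooser as a parameter mirrors the slide's separation of split selection from the rest of the construction, so a different method could be swapped in without changing the recursion.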

Split Selection Methods • Impurity-based split selection: CART, C5 (most common in today’s data mining tools) • Model-based split selection: QUEST (Quick, Unbiased, Efficient Statistical Tree; Loh and Shih, 1997; freeware available at www.stat.wisc.edu/~loh)
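To make "impurity-based" concrete, the sketch below computes two common impurity measures, the Gini index (associated with CART) and entropy (the basis of the information-gain criteria in Quinlan's tree learners), and scores a candidate split by the size-weighted impurity of its partitions. The function names and toy labels are illustrative assumptions; real tools add refinements such as gain ratio that are not shown here.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the class proportions p."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(partitions, impurity):
    """Impurity of a split: each partition's impurity weighted by its share of the records."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * impurity(p) for p in partitions if p)


# Hypothetical class labels at a node and two candidate splits of it.
parent = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]    # separates the classes perfectly
mixed_split = [["yes", "no"], ["yes", "no"]]   # separates nothing

for name, split in [("pure", pure_split), ("mixed", mixed_split)]:
    reduction = gini(parent) - weighted_impurity(split, gini)
    print(name, "Gini reduction:", reduction)
```

An impurity-based method picks the split with the largest reduction, so here the pure split would be preferred over the mixed one.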

Decision Trees and C5 • Decision trees are among the data mining methods most commonly reported in the literature. • C5 is a software package by J. R. Quinlan based on the decision tree method. • One major advantage of decision trees over other machine learning techniques is that they produce models (rules) that can be interpreted by humans. • To learn more about Rule Induction …

CSUS Access to C5 • Log in to quad • Change directory to /opt/C50Release1 • Read the “ReadMe” file for an example and the format requirements • You are ready to use C5 • An example of a C5 application

Extracting Knowledge from Gene Expression Data: A Case Study of Batten Disease (S. M. Lin) • Duke University Medical Center proposed a prototype KDD system that lets scientists analyze massive microarray data, form hypotheses, and gain direct insight into the underlying mechanisms of diseases. • Data → Microarray database → data mining → patterns → human experts → Genomics knowledge base discoveries

Plan for this week • Monday (Lu, Dunham part II) • DT-based: 1R, ID3, C5, CART • Rule-generating: Prism • Wednesday (Han ch. 7, Dunham part II) • Statistics-based: Regression (D), Naïve Bayes • Distance-based: KNN (D) • ANN
