Spatial and Temporal Data Mining V. Megalooikonomou Introduction to Decision Trees (based on notes by Jiawei Han and Micheline Kamber and on notes by Christos Faloutsos)
More details… Decision Trees
• Decision trees
• Problem
• Approach
• Classification through trees
• Building phase - splitting policies
• Pruning phase (to avoid over-fitting)
Decision Trees
• Problem: Classification - i.e., given a training set (N tuples, with M attributes, plus a label attribute), find rules to predict the label for newcomers.
• Pictorially: [figure omitted: a table of N tuples × M attributes, plus the label column]
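As a concrete illustration, here is a minimal sketch in Python (assuming scikit-learn is available; the tuples and labels below are invented for illustration):

from sklearn.tree import DecisionTreeClassifier

# hypothetical training set: N=6 tuples, M=2 numeric attributes, plus a label
X_train = [[30, 35], [45, 20], [55, 60], [62, 45], [25, 50], [70, 30]]  # [age, chol-level]
y_train = ['+', '-', '-', '-', '+', '-']

clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn split rules from the training set
print(clf.predict([[40, 42]]))                        # predict the label for a newcomer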
Decision trees • Issues: • missing values • noise • ‘rare’ events
Decision trees
• types of attributes
• numerical (= continuous) - e.g.: 'salary'
• ordinal (= integer) - e.g.: '# of children'
• nominal (= categorical) - e.g.: 'car-type'
Decision trees
• Pictorially, we have: [scatter plot: '+' and '-' points plotted over num. attr#1 (e.g., 'age') and num. attr#2 (e.g., chol-level)]
Decision trees
• and we want to label '?': [same scatter plot, now with an unlabeled point '?']
Decision trees
• so we build a decision tree: [same scatter plot, partitioned by split lines at age = 50 and chol-level = 40]
Decision trees
• so we build a decision tree:

age < 50?
├─ Y: chol. < 40?
│      ├─ Y: +
│      └─ N: -
└─ N: ...
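Read as code, the tree is just nested conditionals - a sketch in Python (the attribute names follow the figure; the 'N' subtree is elided in the slide and stays elided here):

def classify(age, chol):
    if age < 50:             # age < 50?
        if chol < 40:        #   chol. < 40?
            return '+'       #     Y: +
        else:
            return '-'       #     N: -
    else:
        ...                  # the N subtree is elided in the slide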
Decision trees • Typically, two steps: • tree building • tree pruning (for over-training/over-fitting)
Tree building
• How? [scatter plot as before: '+' and '-' points over num. attr#1 (e.g., 'age') and num. attr#2 (e.g., chol-level)]
Tree building
• How?
• A: Partition, recursively - pseudocode (a runnable sketch follows below):

Partition(dataset S):
    if all points in S have the same label, then return
    evaluate splits along each attribute A
    pick best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)

• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
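A minimal runnable rendering of the pseudocode (a sketch, assuming numeric attributes, binary '<' splits, and the entropy measure introduced below; points are (attribute-tuple, label) pairs):

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in (labels.count(l) for l in set(labels)))

def partition(S, depth=0):
    labels = [lab for _, lab in S]
    if len(set(labels)) <= 1:                        # all points in S have the same label
        return
    best = None
    for a in range(len(S[0][0])):                    # evaluate splits along each attribute
        for t in sorted({x[a] for x, _ in S})[1:]:   # candidate thresholds
            S1 = [p for p in S if p[0][a] < t]
            S2 = [p for p in S if p[0][a] >= t]
            cost = (len(S1) * entropy([l for _, l in S1]) +
                    len(S2) * entropy([l for _, l in S2]))
            if best is None or cost < best[0]:       # pick best split
                best = (cost, a, t, S1, S2)
    if best is None:                                 # no usable split left
        return
    _, a, t, S1, S2 = best
    print('  ' * depth + f'split: attr#{a} < {t}')
    partition(S1, depth + 1)                         # Partition(S1); Partition(S2)
    partition(S2, depth + 1)

partition([((30, 35), '+'), ((25, 50), '+'), ((45, 20), '-'), ((55, 60), '-')])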
Tree building • Q1: how to introduce splits along attribute Ai • A1: • for numerical attributes: • binary split, or • multiple split • for categorical attributes: • compute all subsets (expensive!), or • use a greedy approach
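For numeric attributes, for instance, a common convention is to try thresholds between consecutive sorted values (a sketch; midpoints are one choice among several):

def numeric_split_candidates(values):
    # candidate thresholds t for a binary split 'attr < t'
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]   # midpoints of adjacent values

print(numeric_split_candidates([25, 45, 30, 45, 62]))   # [27.5, 37.5, 53.5]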
Tree building
• Q2: how to evaluate a split?
• A: by how close to uniform each subset is - i.e., we need a measure of uniformity:
Tree building
• entropy: H(p+, p-) = -p+ log2(p+) - p- log2(p-)
  p+: relative frequency of class + in S
  [plot: H vs. p+, rising from 0 at p+ = 0 to a peak of 1 at p+ = 0.5 and back to 0 at p+ = 1]

Tree building
• Any other measure?
• 'gini' index: 1 - p+² - p-²
  [plot: gini vs. p+, same bell shape as entropy, peaking at 0.5 for p+ = 0.5]
Tree building Intuition: • entropy: #bits to encode the class label • gini: classification error, if we randomly guess ‘+’ with prob. p+
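Both measures, written out for the two-class case (a minimal sketch in Python):

from math import log2

def H(p_pos):
    # entropy of a two-class set; p_pos = relative frequency of class '+'
    return -sum(p * log2(p) for p in (p_pos, 1 - p_pos) if p > 0)

def gini(p_pos):
    # gini index 1 - p+² - p-² (= 2·p+·p-, the expected error when we
    # randomly guess '+' with probability p+)
    return 1 - p_pos**2 - (1 - p_pos)**2

print(H(0.5), gini(0.5))   # both peak at p+ = 0.5: 1.0 and 0.5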
Tree building
• Thus, we choose the split that reduces entropy/classification-error the most. E.g.: [scatter plot as before, with a candidate split line on the 'age' axis]
Tree building
• Before the split: we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits total, to encode all the class labels
• After the split we need: 0 bits for the first half (it is pure - all its labels agree) and (2+6) * H(2/8, 6/8) bits for the second half
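Plugging in the numbers (a quick check, reusing the H from the sketch above):

from math import log2
H = lambda p: -p * log2(p) - (1 - p) * log2(1 - p)

print(13 * H(7 / 13))    # before the split: ~12.94 bits
print(0 + 8 * H(2 / 8))  # after the split:  ~6.49 bits - the split pays off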
Tree pruning
• What for? [scatter plot as before: '+' and '-' points over num. attr#1 (e.g., 'age') and num. attr#2 (e.g., chol-level), with '...' marking further splits]
Tree pruning
• Q: How to do it? [scatter plot as before]
Tree pruning
• Q: How to do it?
• A1: use a 'training' and a 'testing' set - prune a node if doing so improves classification accuracy on the 'testing' set. (Drawbacks?)
• A2: or, rely on MDL (= Minimum Description Length) - in detail:
Tree pruning
• envision the problem as compression
• and try to minimize the # of bits needed to encode (a) the class labels AND (b) the representation of the decision tree
(MDL)
• a brilliant idea - e.g.: the best degree-n polynomial to compress a set of points:
• the one that minimizes (sum of errors + n)
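A sketch of that polynomial example (assuming numpy; the data are invented, and 'sum of squared errors + n' stands in for the slide's informal cost):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(0, 0.1, 20)   # points generated by a degree-1 law, plus noise

def mdl_cost(n):
    coeffs = np.polyfit(x, y, n)                      # best degree-n fit
    errors = np.sum((np.polyval(coeffs, x) - y) ** 2)
    return errors + n                                 # (sum of errors) + (model size)

print(min(range(6), key=mdl_cost))   # picks a low degree despite slightly larger error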
Major Issues in Data Mining
• Issues relating to the diversity of data types
  • Handling relational as well as complex types of data
  • Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  • Application of discovered knowledge
    • Domain-specific data mining tools
    • Intelligent query answering
    • Process control and decision making
  • Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  • Protection of data security, integrity, and privacy
Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems
• Major issues in data mining