Implementation in Tree
Stat 6601, November 24, 2004
Bin Hu, Philip Wong, Yu Ye
Data Background
• From the SPSS Answer Tree program, we use its credit scoring example
• There are 323 data points
• The target variable is credit ranking (good [48%], bad [52%])
• The four predictor variables are:
  – age category (young [58%], middle [24%], old [18%])
  – has AMEX card (yes [48%], no [52%])
  – paid weekly/monthly (weekly pay [51%], monthly salary [49%])
  – social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])
Data Background
• It is useful to see how the target variable is distributed by each of the predictor variables; a quick cross-tabulation (sketched below) shows this
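A minimal R sketch of such cross-tabulations, assuming the data frame and column names used in the tree-growing slide later in this deck (credit_data, CREDIT_R, PAY_METHOD):

> # Sketch only; assumes credit_data has been read as in the
> # tree-growing slide, with the same column names
> prop.table(table(credit_data$CREDIT_R))   # target distribution
> # target distribution within each level of a predictor; rows sum to 1
> prop.table(table(credit_data$PAY_METHOD, credit_data$CREDIT_R), margin = 1)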
Data Background
• Correlation matrix (Pearson correlation coefficients, N = 323; Prob > |r| under H0: Rho = 0):

            CREDIT_R   PAY_WEEK        AGE       AMEX
CREDIT_R     1.00000    0.70885    0.66273    0.02653
                         <.0001     <.0001     0.6348
PAY_WEEK     0.70885    1.00000    0.51930    0.08292
              <.0001                <.0001     0.1370
AGE          0.66273    0.51930    1.00000   -0.00172
              <.0001     <.0001                0.9755
AMEX         0.02653    0.08292   -0.00172    1.00000
              0.6348     0.1370     0.9755
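The matrix above is SAS Proc Corr output. An equivalent check in R, under the assumption that the four variables are numerically coded (e.g., 0/1 indicators) with the column names shown:

> # Sketch only; assumes numeric (0/1) coding matching the SAS names
> vars <- credit_data[, c("CREDIT_R", "PAY_WEEK", "AGE", "AMEX")]
> round(cor(vars), 5)                   # Pearson correlation matrix
> cor.test(vars$CREDIT_R, vars$AMEX)    # test H0: rho = 0 for one pair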
Objective
• To create a predictive model of good credit risks
• To assess the model's performance, we randomly split the data into two parts: a training set (60%) to develop the model and the remaining 40% to validate it (a minimal split sketch follows this list)
• This guards against possible "overfitting," since the validation set was not involved in deriving the model
• Using the same data, we compare the results from R's tree package, Answer Tree's C&RT, and SAS's Proc Logistic
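A sketch of such a 60/40 random split in R; the file name and seed here are illustrative assumptions, not details from the original analysis:

> # Illustrative split; "credit.csv" and the seed are assumptions
> all_data <- read.csv(file = "credit.csv")
> set.seed(6601)
> train_rows <- sample(nrow(all_data), size = round(0.6 * nrow(all_data)))
> training   <- all_data[train_rows, ]    # 60% for model development
> validation <- all_data[-train_rows, ]   # 40% held out for validation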
Logistic Regression
• Let x be a vector of explanatory variables
• Let y be a binary target variable (0 or 1)
• p = Pr(Y = 1 | x) is the target probability
• The linear logistic model has the form logit(p) = log(p / (1 - p)) = α + β′x
• The predicted probability is phat = 1 / (1 + exp(-(α + β′x)))
• Note that the range of p is (0, 1), but logit(p) ranges over the whole real line
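The deck fits this model in SAS Proc Logistic; for comparison, a hedged R sketch of the same fit using glm, with variable names borrowed from the tree-growing slide:

> # Sketch of the SAS model in R; CREDIT_R must be a factor or a
> # 0/1 variable for family = binomial
> fit <- glm(CREDIT_R ~ PAY_METHOD, family = binomial, data = credit_data)
> summary(fit)                             # estimates and Wald tests
> phat <- predict(fit, type = "response")  # predicted probabilities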
Logistic Results
• Using the training set, the maximum likelihood estimation failed to converge when the social class and age variables were included
• Only the paid weekly/monthly and has AMEX card variables could be estimated
• The AMEX variable was highly insignificant and so was dropped
• Apparently, the tree algorithm does a better job of handling all the variables
• SAS output of the model:

Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept     1    1.5856          0.2662          35.4756      <.0001
PAY_WEEK 1    1   -3.6066          0.4169          74.8285      <.0001

• So the odds of a weekly-pay person being a good risk, relative to a monthly-salary person, are exp(-3.6066) ≈ 0.027 to 1, or roughly 37 to 1 against
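A quick back-of-the-envelope check of these estimates reproduces the two predicted probabilities quoted on the next slide:

> b0 <- 1.5856; b1 <- -3.6066   # estimates from the SAS output above
> 1 / (1 + exp(-b0))            # monthly salary: phat ≈ 0.830
> 1 / (1 + exp(-(b0 + b1)))     # weekly pay:     phat ≈ 0.117
> exp(b1)                       # odds ratio ≈ 0.027, about 37 to 1 against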
Validation Results
• With only one variable in the fitted model, there are only two possible predicted probabilities: 0.117 and 0.830
• Taking the higher probability as predicting a "good" account, our results are below:

                       Validation Set             Training Set
                    Actual Bad  Actual Good   Actual Bad  Actual Good
Predicted Bad           60          11            83          11
Predicted Good           8          50            17          83
Percent Agreement         85.3%                     85.6%

• The better measure is the validation-set result. Note that the two results are very similar, so overfitting does not appear to be a problem.
Growing a Tree in R (based on the training data)

> credit_data <- read.csv(file = "training.csv")
> library(tree)
> # CREDIT_R should be a factor so that tree() grows a classification tree
> credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
+                     data = credit_data, split = c("gini"))
> tree.pr <- prune.tree(credit_tree)
> plot(tree.pr)                                       # figure 1: deviance vs. tree size
> plot(credit_tree, type = "u"); text(credit_tree, pretty = 0)   # figure 2: full tree
> tree.1 <- prune.tree(credit_tree, best = 5)         # keep the best 5 terminal nodes
> plot(tree.1, type = "u"); text(tree.1, pretty = 0)  # figures 3-5: pruned tree
> summary(tree.1)
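Scoring the 40% holdout with the pruned tree follows the same pattern; a sketch, assuming the validation sample sits in a file named "validation.csv" with the same columns:

> # Sketch only; "validation.csv" is an assumed file name
> valid_data <- read.csv(file = "validation.csv")
> pred <- predict(tree.1, newdata = valid_data, type = "class")
> table(Predicted = pred, Actual = valid_data$CREDIT_R)  # confusion matrix
> mean(pred == valid_data$CREDIT_R)                      # percent agreement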
Implementing Using SPSS Answer Tree
Training sample – C&RT (minimum impurity change = .01)
Implementing Using SPSS Answer Tree
Training sample – CHAID (Pearson chi-square, p = .05)
Summary of Validation Data
Grouped by the training-data classification
Summary of Results
• Similar trees were generated by R and SPSS Answer Tree
• Similar results were obtained using different tree-generation methods, C&RT and CHAID
• On the training data, the classification tree has a higher percentage of agreement between predicted and actual values than logistic regression
• Applying the grouping criteria derived from the training data to the validation data, logistic regression has a higher percentage of agreement than the classification tree
Conclusion
• A classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability
• Classification trees can be implemented in different software packages and with different tree-growing methodologies
• A classification tree normally performs better than parametric models, with a higher percentage of agreement between predicted and actual values
• Classification trees have special advantages in industries such as credit cards and marketing research by 1) grouping individuals into homogeneous clusters and 2) assigning not only the predicted values but also the probability of prediction error
Conclusion (cont'd)
• As a non-parametric method, no functional form is specified and no parameters are estimated or tested
• As shown in this small study, the lower percentage of agreement on the validation data suggests that overfitting might still be a potential problem for classification trees