Defining Dummy Variables: Getting Ready for Discriminant Analysis
Why dummies?
• Dummies are not necessary for predictive models, but they have some advantages.
• A subset of a variable (a certain range of values) may affect the dependent variable differently, even though the same variable used as a continuous one may not be significant.
• They are easier to interpret for business applications.
• For credit bureau variables, they handle special cases (no record, inquiries only, missing, etc.) a little better, based on the dependent variable's characteristics for those categories.
How to define them
• Compute the ratio of column percentages for each category (Good Column Percent / Bad Column Percent).
• Use the pattern of these ratios to determine how many categories (and hence how many dummies) to create.
• There must be a neutral category.
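The ratio computation above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own procedure: the function name `good_bad_ratios`, the age bands, and the counts are all made up for the example.

```python
from collections import Counter

def good_bad_ratios(categories, outcomes):
    """Return {category: Good Column Percent / Bad Column Percent}.

    categories: category label per account (e.g. an age band)
    outcomes:   "good" or "bad" per account, in the same order
    """
    good = Counter(c for c, o in zip(categories, outcomes) if o == "good")
    bad = Counter(c for c, o in zip(categories, outcomes) if o == "bad")
    total_good = sum(good.values())
    total_bad = sum(bad.values())
    if total_good == 0 or total_bad == 0:
        raise ValueError("need at least one good and one bad account")

    ratios = {}
    for cat in sorted(set(categories)):
        good_pct = good[cat] / total_good  # share of all goods in this category
        bad_pct = bad[cat] / total_bad     # share of all bads in this category
        # A category with no bads has an undefined ratio; flag it with None.
        ratios[cat] = good_pct / bad_pct if bad_pct > 0 else None
    return ratios

# Hypothetical data: 10 accounts per age band.
categories = ["18-25"] * 10 + ["26-40"] * 10 + ["41+"] * 10
outcomes = (["good"] * 4 + ["bad"] * 6
            + ["good"] * 6 + ["bad"] * 4
            + ["good"] * 8 + ["bad"] * 2)
ratios = good_bad_ratios(categories, outcomes)
```

With this made-up data the ratios rise from the youngest band to the oldest, which is the kind of pattern used to decide where category boundaries go.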
Example: Customer Age
[Table not recovered. The Dummies column assigns the age categories, in order, to: 1, 2, Neutral, 3, 4, 5, 6, 7]
Some Guidelines
• Look for a logical pattern.
• E.g., ratios get better with age. Does that make sense? Why or why not?
• If a higher age category has a lower ratio, combine it with the previous (or next) category.
• If the pattern is contrary to business expectation, investigate the data and/or drop the variable.
• If there is no pattern (no variation in ratios) at all, drop the variable: it has no discriminatory power.
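The combining rule above can be sketched as follows. This is one possible reading, under two assumptions that are not in the slides: the expected pattern is that ratios rise with age, and an out-of-pattern band is always merged into the previous band (the slides also allow merging with the next one). The function name and the counts are hypothetical, and every band is assumed to contain at least one bad account.

```python
def merge_until_monotone(bands):
    """bands: ordered list of (label, good_count, bad_count).

    Merges a band into the previous one whenever its Good/Bad ratio is
    lower than the previous band's, and keeps merging until the ratios
    never decrease from one band to the next.
    """
    total_good = sum(g for _, g, _ in bands)
    total_bad = sum(b for _, _, b in bands)

    def ratio(good, bad):
        # Good Column Percent / Bad Column Percent
        return (good / total_good) / (bad / total_bad)

    merged = []
    for label, good, bad in bands:
        merged.append([label, good, bad])
        # Pool backwards while the newest band breaks the rising pattern.
        while len(merged) > 1 and ratio(*merged[-1][1:]) < ratio(*merged[-2][1:]):
            last = merged.pop()
            merged[-1][0] = merged[-1][0] + " + " + last[0]
            merged[-1][1] += last[1]
            merged[-1][2] += last[2]
    return [tuple(m) for m in merged]

# Hypothetical counts: the 36-45 band has a lower ratio than 26-35.
bands = [("18-25", 40, 60), ("26-35", 60, 40), ("36-45", 50, 50), ("46+", 80, 20)]
merged_bands = merge_until_monotone(bands)
```

Here the 36-45 band is folded into 26-35, leaving three bands whose ratios increase with age. Automatic merging is only a starting point; the business-sense check in the guidelines still has to be done by hand.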
Special Cases
• What to do with ‘No Record’, ‘Inquiries Only’, etc. when dealing with credit bureau variables?
• Look at the Good/Bad ratio for those categories.
• Find the category with the closest matching ratio and make that the Neutral category.
• The special cases should also be part of the Neutral category for all variables.
• Assess their impact only once in the model, by defining dummies for the CBTYPE variable.
CBTYPE Variable
Key to CBTYPE variable:
1 = Record with Trades
2 = Record w/Inqs. and Pub Recs Only
3 = Record w/Inqs. Only
4 = Record w/Pub Recs Only
5 = No Record
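Defining dummies for CBTYPE might look like the sketch below. The slides do not say which code is the neutral one, so treating code 1 (Record with Trades) as neutral is purely an assumption for illustration, as are the function name and the `CBTYPE_n` dummy names.

```python
# Key taken from the CBTYPE slide.
CBTYPE_LABELS = {
    1: "Record with Trades",
    2: "Record w/Inqs. and Pub Recs Only",
    3: "Record w/Inqs. Only",
    4: "Record w/Pub Recs Only",
    5: "No Record",
}

def cbtype_dummies(cbtype, neutral=1):
    """Return one 0/1 dummy per non-neutral CBTYPE code.

    neutral=1 is an assumed choice; in practice the neutral code would
    be picked from the Good/Bad ratios.
    """
    if cbtype not in CBTYPE_LABELS:
        raise ValueError(f"unknown CBTYPE code: {cbtype}")
    return {
        f"CBTYPE_{code}": int(cbtype == code)
        for code in CBTYPE_LABELS
        if code != neutral
    }

no_record = cbtype_dummies(5)    # only CBTYPE_5 is set
with_trades = cbtype_dummies(1)  # neutral code: every dummy is 0
```

Because the special cases sit in the Neutral category of every other bureau variable, these CBTYPE dummies are the only place where their effect enters the model.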