Mailing Campaign Model Nan Yang University of Central Florida 04/11/2008
Overview • Data Visualization • Data Preparation • Model Building • Variable Selection • Interaction • Model Assessment • ROC
Data Visualization • 63 Variables • Target is binary with 1 indicating people responded to the mailing campaign • Target is very unbalanced • Target rate is 1.13% for training set
Data Visualization • Categorical Variables • Variables with many levels • x2 (~57 levels) • DATE variables x10 and x11 (over 100 levels) • Missing values • DATE variables are 30%-70% missing • Some variables code missing values as “Unknown” or “Uncoded”, e.g., x20
Data Visualization • Interval Variable • Skewness
Data Preparation • Missing Value Indicator (MVI) • Variables with > 5% missing • Binary • Capture the missing value information
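The missing value indicator step above can be sketched in a few lines of Python. This is only an illustration, not the SAS workflow used in the talk: the row-as-dictionary layout, the `M_` prefix, and the function name `add_missing_indicators` are assumptions, and missing values are assumed to be stored as `None`.

```python
def add_missing_indicators(rows, threshold=0.05):
    """Return copies of rows with a binary M_<name> flag for each column
    whose share of missing (None) values exceeds `threshold`."""
    if not rows:
        return []
    n = len(rows)
    columns = list(rows[0].keys())
    # Columns whose missing rate is above the threshold (5% by default).
    flagged = [
        col for col in columns
        if sum(1 for r in rows if r.get(col) is None) / n > threshold
    ]
    out = []
    for r in rows:
        new_row = dict(r)
        for col in flagged:
            # 1 marks a missing value, 0 an observed one.
            new_row["M_" + col] = 1 if r.get(col) is None else 0
        out.append(new_row)
    return out
```

Creating the flags before imputation keeps the fact that a value was missing available to the model even after the gap is filled.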
Data Preparation • Imputation • Unconditional imputation • Categorical variables: Tree / Tree Surrogate • Interval variables: Cluster
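As a rough illustration of the simplest option on this slide, unconditional imputation fills every gap in a variable with one value computed from its observed entries. The slide does not say which statistic was used, so the median (interval) and most frequent level (categorical) below are assumptions, as is the function name `impute_unconditional`. The tree and cluster methods are not sketched here.

```python
from collections import Counter
from statistics import median


def impute_unconditional(values, kind):
    """Fill None entries with a single value learned from the observed entries:
    the median for kind="interval", the most frequent level for kind="categorical"."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a fill value from
    if kind == "interval":
        fill = median(observed)
    elif kind == "categorical":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError("kind must be 'interval' or 'categorical'")
    return [fill if v is None else v for v in values]
```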
Data Preparation • Transformation • Right skewed: log or square-root transformation • Left skewed: square transformation
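The transformation rule can be sketched as follows. The skewness cutoff of 1.0, the use of log(1 + x) so that zeros are allowed, and the function names are assumptions made for this example; the slide gives no thresholds. Squaring is only a sensible fix for left skew when the values are non-negative.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def transform_by_skew(values, cutoff=1.0):
    """Apply log(1 + x) to right-skewed data, x squared to left-skewed data,
    and leave roughly symmetric data unchanged."""
    skew = skewness(values)
    if skew > cutoff:
        if min(values) <= -1:
            raise ValueError("log(1 + x) needs every value to be greater than -1")
        return [math.log1p(v) for v in values]
    if skew < -cutoff:
        return [v ** 2 for v in values]
    return list(values)
```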
Model Building • Variable selection • Individual predictive power • Logistic backward elimination • Keep the potential interaction terms • Logistic stepwise selection • Tree • Different criteria • 21 variables selected
Model Building • Interactions • SAS EMiner Regression node • 11 interaction terms selected • Model • Ensemble of different logistic models
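One common way to combine several logistic models is to average their predicted response probabilities record by record. The slide does not state how the ensemble was formed, so the simple unweighted average below is an assumption, and `average_ensemble` is a hypothetical helper name.

```python
def average_ensemble(prob_lists):
    """Average per-record response probabilities across models.

    prob_lists holds one list of probabilities per model, all scored on
    the same records in the same order."""
    if not prob_lists:
        raise ValueError("need at least one model's predictions")
    n = len(prob_lists[0])
    if any(len(p) != n for p in prob_lists):
        raise ValueError("all models must score the same number of records")
    # zip(*prob_lists) groups the predictions for each record together.
    return [sum(record) / len(prob_lists) for record in zip(*prob_lists)]
```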
Model Assessment • AUC = 0.66
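For reference, AUC is the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder, so 0.5 is no better than chance and 1.0 is a perfect ranking. A minimal rank-based sketch follows; the function name `auc` and the list inputs are assumptions for this example, not part of the original SAS analysis.

```python
def auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and model scores,
    computed from ranks, with tied scores counted as one half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied scores starting at i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank over the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one responder and one non-responder")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Because AUC depends only on how records are ranked, it is not distorted by the 1.13% target rate in the way that raw accuracy would be.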
Acknowledgement • UCF Statistics Dept • BlueCross BlueShield of FL