Annual Income Prediction Modeling Using SVM Xinjue YU 12/14/2010
Annual Income Prediction • Why this problem? • Useful in industries such as insurance, banking, marketing, etc. • Interested in the income distribution • The goal: • To predict whether a person has an annual income of more than $50,000 • The information we have: • Age, gender, education level, working hours per week, etc.
The Dataset • The Adult dataset: • 32,561 training records, with a further 16,281 records for testing • Extracted from the 1994 Census database • A set of reasonably clean records was extracted using the condition ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)) • http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
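The extraction condition above can be sketched as a simple record filter. This is only an illustration: the dict-per-record layout and the sample values are assumptions made for the example, not rows from the actual census file; only the four variable names and thresholds come from the slide.

```python
# Hypothetical record layout: one dict per census record, keyed by the
# variable names that appear in the extraction condition.
def is_clean_record(rec):
    """Return True if the record passes the extraction condition."""
    return (
        rec["AAGE"] > 16         # older than 16
        and rec["AGI"] > 100     # adjusted gross income above 100
        and rec["AFNLWGT"] > 1   # final sampling weight above 1
        and rec["HRSWK"] > 0     # works at least some hours per week
    )

# Made-up sample records, for illustration only (not real census rows).
records = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 1200, "HRSWK": 10},    # too young
    {"AAGE": 52, "AGI": 60000, "AFNLWGT": 209642, "HRSWK": 0},  # no hours
]

clean = [r for r in records if is_clean_record(r)]
print(len(clean))
```

Only the first sample record satisfies all four clauses, so the filtered list keeps one record.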
Preparation • The raw dataset has 14 features; 4 of them are used • The 4 features used: gender, education level, age and working hours per week • Quantizing the features • Education level: 1 (<= high school), 2 (above high school, below grad school) & 3 (>= grad school) • Gender: 0 (female) & 1 (male) • Age: 1 (<30), 2 (30-50) & 3 (>50) • Working hours per week: 1 (<=40) & 2 (>40)
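The quantization rules above might look like the following minimal sketch. The function names are hypothetical, and the education thresholds are an assumption: they rely on the dataset's numeric education code, taking 9 to mean high-school graduate and 14 and above to mean graduate-level study. The slides do not say how education was mapped in practice.

```python
def quantize_education(education_num):
    # Assumption: numeric education code, 9 = high-school graduate,
    # 14 and above = graduate school.
    if education_num <= 9:
        return 1  # <= high school
    if education_num < 14:
        return 2  # above high school, below grad school
    return 3      # >= grad school

def quantize_gender(sex):
    # 1 for male; any other value is coded 0 (female), matching the
    # two-category coding on the slide.
    return 1 if sex.strip().lower() == "male" else 0

def quantize_age(age):
    if age < 30:
        return 1
    if age <= 50:
        return 2
    return 3

def quantize_hours(hours_per_week):
    return 1 if hours_per_week <= 40 else 2

def quantize_record(education_num, sex, age, hours_per_week):
    """Return (gender, education, age, hours) as small integer codes."""
    return (
        quantize_gender(sex),
        quantize_education(education_num),
        quantize_age(age),
        quantize_hours(hours_per_week),
    )

# Illustrative input: education code 13, male, age 39, 40 hours per week.
print(quantize_record(13, "Male", 39, 40))  # (1, 2, 2, 1)
```

Coding each feature into two or three levels keeps the 2-D feature pairs on a small grid, at the cost of discarding detail within each bin.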
The Approach • Using a Support Vector Machine (SVM), in the context of artificial neural networks • The data are assumed to be non-separable • Using SVM for non-separable pattern classification • Trying different kernels: • Linear • RBF • Polynomial • Sigmoid • Using 2-D feature pairs first • Gender & education level • Age & working hours per week • Using all 4 features in further study (increased complexity)
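For reference, the four kernels listed above can be written in their common textbook forms, as in the sketch below. The parameter names (gamma, coef0, degree) and their default values are assumptions borrowed from common library conventions; the slides do not state which parameter values were used.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # K(x, y) = x . y
    return dot(x, y)

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    # K(x, y) = (gamma * x . y + coef0)^degree
    return (gamma * dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    # K(x, y) = tanh(gamma * x . y + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)

KERNELS = {
    "linear": linear_kernel,
    "rbf": rbf_kernel,
    "polynomial": polynomial_kernel,
    "sigmoid": sigmoid_kernel,
}

# Two illustrative (gender, education level) points on the quantized grid.
a = (1, 2)
b = (0, 3)
for name, kernel in KERNELS.items():
    print(name, kernel(a, b))
```

Each kernel replaces the plain inner product in the SVM decision function, which is what lets the non-linear kernels draw curved boundaries between the quantized feature points.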
The Expected Results • Use the SVM result to predict whether a person's annual income is more than $50K (classification/clustering involved) • Use the testing data to get the error rate of each kernel • Compare the results of the different kernels • The linear kernel is expected to have the highest error rate • Aim to keep the error rate within 20%-30%
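The error-rate comparison could be computed along the lines below. The labels and predictions are made-up placeholders (named kernel_a and kernel_b to avoid implying any actual result), and treating 30% as the upper bound of the target is one reading of the 20%-30% goal.

```python
def error_rate(y_true, y_pred):
    """Fraction of test examples whose predicted label is wrong."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists differ in length")
    if not y_true:
        raise ValueError("cannot compute an error rate on an empty test set")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Placeholder data, for illustration only: 1 = income > $50K, 0 = otherwise.
y_test = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predictions = {
    "kernel_a": [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    "kernel_b": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
}

rates = {name: error_rate(y_test, pred) for name, pred in predictions.items()}
best = min(rates, key=rates.get)

# Assumed target: error rate no higher than 30%.
within_target = {name: rate <= 0.30 for name, rate in rates.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} (within target: {within_target[name]})")
print("lowest error:", best)
```

With a real model, each entry in `predictions` would hold the output of an SVM trained with one kernel and evaluated on the held-out test records.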