Data Mining Project Team 4 Christine Dorothy - HT062300W Djoko Joewono - HT062137Y Kang Tiong Meng - HT072164H Ng Hock Leong Tommy - HT072172H Wang Long - HT072109B Hoo Soo Yean
Project Overview • Mail Order Company in USA • Would like to find out if there is a way • To reduce mailing cost • By analyzing the past data
Business Understanding • Business Objectives: • To find out which customers are good candidates to purchase products • To explore the data to determine the company’s valuable customers
Business Understanding • Assess the Situation: • One CSV file from 3 data sources • Census Group A • Census Group B • Tax Filers • Personnel • Six MTech Students • Minimal experience in data mining • Software • MS Excel • Clementine, Data Scope
Business Understanding • Data Mining Goals • Predict which variables affect customers' buying decisions • Build models and compare their cost against mailing to randomly chosen customers • Suggest a model to achieve >1% mailing response
Data Understanding • First Insights Discovery • Total number of records: 2158 • Distribution by Objective • Distribution by Gender
Data Understanding • Data Quality Problems • Some columns are normalized, others are not • All values are numeric, which makes them harder to visualize • Much of the data is incomplete • Missing recency, number of transactions and dollars of spending data for individual products
Data Understanding • Describe Data • Gross properties of data • The data is extracted from a larger set with a response rate of ~1%: all 1079 responders and 1079 randomly chosen non-responders • Relationship between attributes • firstmonth and tenure have a linear relationship, thus tenure can be omitted.
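The redundancy check behind dropping tenure can be sketched as a Pearson correlation test. This is a minimal illustration, not the team's actual procedure: the sample values and the 0.95 cut-off are made up, and only the column names firstmonth and tenure come from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values: tenure is constructed as an exact linear function of firstmonth.
firstmonth = [1, 5, 12, 20, 36]
tenure = [60 - m for m in firstmonth]

r = pearson(firstmonth, tenure)
# Assumed rule of thumb: |r| above 0.95 means one of the two columns is redundant.
redundant = abs(r) > 0.95
```

With a near-perfect linear relationship, keeping both columns adds no information for the model, so one of them can be dropped.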
Data Preparation • Select Data • Variables chosen • Clean Data • Some normalizations • Construct Data • Chose the variables as input • Data Transformation • Rescaling • Derive new variables
Data Cleaning • Reduce redundancy caused by data integration • Replace lowincome and highincome with IncomeGroup • Replace gender1, gender2 and gender3 with Gender • Discard V171 (total taxfilers with unemployment benefits) • Discard V175, V181, V184, V190, V193, V196: they equal the male data plus the female data
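A rough sketch of this cleaning step on a single record follows. It assumes gender1, gender2 and gender3 are 0/1 indicator flags, which the slides do not state, and it uses the flag name itself as the category label because the actual meaning of each flag is not given. The helper names and the sample record are hypothetical.

```python
GENDER_FLAGS = ("gender1", "gender2", "gender3")
# Columns the slides list as redundant.
DISCARD = {"V171", "V175", "V181", "V184", "V190", "V193", "V196"}

def collapse_gender(row):
    """Return the name of the single flag set to 1, or None if zero or several are set."""
    active = [name for name in GENDER_FLAGS if row.get(name) == 1]
    return active[0] if len(active) == 1 else None

def clean(row):
    """Drop redundant columns and replace the three gender flags with one Gender field."""
    out = {k: v for k, v in row.items()
           if k not in DISCARD and k not in GENDER_FLAGS}
    out["Gender"] = collapse_gender(row)
    return out

# Hypothetical record for illustration only.
row = {"gender1": 0, "gender2": 1, "gender3": 0,
       "V171": 0.02, "V175": 0.4, "totalspend": 120.0}
cleaned = clean(row)
```

Returning None for ambiguous rows keeps them visible as missing values instead of silently assigning a category.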
Data Transformation • Rescaling • Log() of totalspend and totaltrans to reduce the effect of large values • Derive Data • Derive ActAccInMostRecMon (number of active accounts in the most recent month) from product recency data • Derive the ratio of low taxfiler income from V156-V163: Value = V156 / sum(V156:V163) • Convert the value to 5 categories
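The rescaling and the derived ratio can be sketched as below. Two choices here are assumptions and not taken from the slides: log1p is used in place of a plain log so that a value of zero stays defined, and the 5 categories are equal-width bins over [0, 1], since the slides do not say how the bins were cut. The sample record is made up.

```python
import math

def log_rescale(value):
    """Compress large values; log1p(0) is 0, so zero spend stays defined."""
    return math.log1p(value)

def low_income_ratio(row):
    """Value = V156 / sum(V156:V163), or None when the denominator is zero."""
    cols = [f"V{i}" for i in range(156, 164)]
    total = sum(row[c] for c in cols)
    return row["V156"] / total if total else None

def to_category(ratio, n_bins=5):
    """Map a ratio in [0, 1] to a category 1..n_bins using equal-width bins (assumed)."""
    if ratio is None:
        return None
    return min(int(ratio * n_bins), n_bins - 1) + 1

# Hypothetical record: V156 holds 30 of 100 taxfilers across V156..V163.
row = {f"V{i}": 10 for i in range(157, 164)}
row["V156"] = 30
category = to_category(low_income_ratio(row))
```

Quantile-based bins would be an equally plausible reading of "5 categories"; equal-width bins are used here only because they need no reference to the full data set.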
Data Transformation Histogram of new variable with Objective overlaid
Census A Languages (cont’d) • Inverse correlation between English- and French-speaking regions • No region with significant Tagalog, Spanish or other language-speaking populations • Can probably discard amtspanish, amttagalog, amtsingres, amtengnon, amtmultilin • Cluster/segment English/French areas
Census A vs B Languages • Linear relationship for English and French across Census A & B • Can merge amtenglish and bhlenglish • Can merge amtfrench and bhlfrench
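One possible way to merge two linearly related columns is to average them row by row. The slides do not say how the merge was done, so the averaging, the helper name and the sample values below are all assumptions made for illustration.

```python
def merge_columns(a_values, b_values):
    """Merge two near-duplicate columns by taking the row-wise mean (assumed method)."""
    return [(a + b) / 2 for a, b in zip(a_values, b_values)]

# Hypothetical per-region proportions from Census A and Census B.
amtenglish = [0.90, 0.10, 0.55]
bhlenglish = [0.88, 0.12, 0.57]

english = merge_columns(amtenglish, bhlenglish)
```

Simply keeping one of the two columns would be a simpler alternative when the relationship is close to exact.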
Acflonepar vs bfslonepar • Linear relationship • Merge acflonepar & bfslonepar • Filter out noisy data
Anfamrel – living with relatives • Most data below 0.1 • Objective remains constant throughout • Not important to business objective – discard
afem40to44 • Lack of data from other age groups • Very specific targeted marketing to the 40-44 female group • Normalize values from 0 to 0.1 if necessary • Objective improves as proportion increases
afp1child • Objective clearly improves when afp1child is on the lower end of the normal curve
acfwchcom • 7 regions with acfwchcom = 0.19 and objective = 1
Acftotmar vs acfhuswife • Most regions have above 60% married couples, assuming normalized data • Acftotmar and acfhuswife mirror one another • Can discard either field • Filter noisy data • Categorical: lone-parent and husband-wife
Census B Data Understanding • As with the other census and taxfiler data, these data represent the distribution within the region.
Period of Construction • The trend is similar: the number of constructions in the two periods is more or less the same. • The sample population represents only a small share of the construction in the region.
Maintenance & Repair • Those who do regular maintenance have neither major nor minor repairs
Maintenance & Repair • Those who have major repairs tend to have fewer minor repairs.
Ethnic Origin • This sample population represents the majority of those of English or British ethnic origin in the region. • Those who have British ethnic origin also have English ethnic origin. • Fewer people have English ethnic origin than British ethnic origin.
Ethnic Origin • This data represents only a very small number of people of French ethnic origin.
Household vs Family Income • Both have the same trend; some who did not answer for family income answered for household income
Labour Force Variable • Both variables have the same description. Need to check which one is which.
Birth City • The population sample is mostly locals