
Data Mining Project



Presentation Transcript


  1. Data Mining Project • Team 4 • Christine Dorothy - HT062300W • Djoko Joewono - HT062137Y • Kang Tiong Meng - HT072164H • Ng Hock Leong Tommy - HT072172H • Wang Long - HT072109B • Hoo Soo Yean

  2. Project Overview • Mail Order Company in USA • Would like to find out if there is a way • To reduce mailing cost • By analyzing the past data

  3. CRISP Process Overview

  4. Business Understanding • Business Objectives: • To find out which customers are good candidates to purchase products • To explore the data to determine the company's most valuable customers

  5. Business Understanding • Assess the Situation: • One CSV file from 3 data sources • Census Group A • Census Group B • Tax Filers • Personnel • Six MTech students • Minimal experience in data mining • Software • MS Excel • Clementine, Data Scope

  6. Business Understanding • Data Mining Goals • Identify which variables affect the customer's buying decision • Build models and compare their mailing cost against mailing randomly chosen customers • Suggest a model that achieves a mailing response above 1%
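
The comparison against randomly chosen customers can be sketched as a cost-per-response calculation. This is only an illustration: the mailing cost, the campaign size and the model's response rate below are assumed figures, not results from the project, and `cost_per_response` is a hypothetical helper.

```python
def cost_per_response(n_mailed, response_rate, cost_per_mail):
    """Expected mailing cost per responder for one campaign."""
    expected_responders = n_mailed * response_rate
    if expected_responders == 0:
        return float("inf")
    return (n_mailed * cost_per_mail) / expected_responders

# Hypothetical figures for illustration only (not from the project data).
COST_PER_MAIL = 0.50   # assumed cost of one mail piece
BASELINE_RATE = 0.01   # roughly 1% response when mailing at random
MODEL_RATE = 0.03      # assumed response rate among model-selected customers

baseline_cost = cost_per_response(10_000, BASELINE_RATE, COST_PER_MAIL)
model_cost = cost_per_response(10_000, MODEL_RATE, COST_PER_MAIL)
lift = MODEL_RATE / BASELINE_RATE  # how many times better than random

print(f"random: {baseline_cost:.2f} per response")
print(f"model:  {model_cost:.2f} per response (lift {lift:.1f}x)")
```

Under these assumptions, any model whose selected customers respond at more than the baseline rate lowers the cost per response in proportion to its lift.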

  7. Project Plan

  8. Data Understanding • First Insights Discovery • Total records: 2158 • Distribution by Objective • Distribution by Gender

  9. Data Understanding • Data Quality Problems • Some columns are normalized, others are not • All values are numeric, which makes them harder to visualize • Much of the data is incomplete • Recency, number of transactions and dollars spent are missing for individual products

  10. Data Understanding • Describe Data • Gross properties of data • The data is extracted from a larger set with a response rate of ~1%: all 1079 responders and 1079 randomly chosen non-responders • Relationship between attributes • firstmonth and tenure have a linear relationship, so tenure can be omitted.
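
One way to check a linear relationship such as the one between firstmonth and tenure is the Pearson correlation coefficient. The sketch below uses made-up values and an assumed cut-off of 0.95; the project itself appears to have judged the relationship from a plot.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / (sx * sy)

# Made-up values: here tenure is an exact linear function of firstmonth.
firstmonth = [1, 5, 12, 20, 33, 40]
tenure = [60 - m for m in firstmonth]

r = pearson(firstmonth, tenure)
# Assumed threshold: treat |r| above 0.95 as "one column is redundant".
redundant = abs(r) > 0.95
print(f"r = {r:.3f}, drop tenure: {redundant}")
```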

  11. firstmonth Vs tenure

  12. Histogram of product6

  13. Data Preparation • Select Data • Variables chosen • Clean Data • Some normalizations • Construct Data • Chose the variables as input • Data Transformation • Rescaling • Derive new variables

  14. Data Cleaning • Reduce redundancy caused by data integration • Replace lowincome and highincome with IncomeGroup • Replace gender1, gender2 and gender3 with Gender • Discard V171 (total taxfilers with unemployment benefits) • Discard V175, V181, V184, V190, V193 and V196: each equals the male data plus the female data
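
The two cleaning steps above can be sketched as follows. The sketch assumes that gender1 to gender3 are mutually exclusive 0/1 flags, which the slides do not state, and the column names `total`, `male` and `female` are hypothetical stand-ins for a total column such as V175 and its male and female components.

```python
def collapse_flags(row, flag_names, new_name):
    """Replace mutually exclusive 0/1 flag columns with one categorical column."""
    active = [name for name in flag_names if row.get(name) == 1]
    out = {k: v for k, v in row.items() if k not in flag_names}
    # Keep the flag name as the category; None marks missing or ambiguous rows.
    out[new_name] = active[0] if len(active) == 1 else None
    return out

def is_sum_of(rows, total_col, part_cols, tol=1e-9):
    """True if total_col equals the sum of part_cols in every row."""
    return all(
        abs(r[total_col] - sum(r[p] for p in part_cols)) <= tol for r in rows
    )

# Made-up records for illustration.
rows = [
    {"gender1": 1, "gender2": 0, "gender3": 0,
     "total": 30.0, "male": 12.0, "female": 18.0},
    {"gender1": 0, "gender2": 1, "gender3": 0,
     "total": 45.0, "male": 20.0, "female": 25.0},
]

cleaned = [
    collapse_flags(r, ["gender1", "gender2", "gender3"], "Gender") for r in rows
]
if is_sum_of(cleaned, "total", ["male", "female"]):
    # The total carries no extra information, so it can be dropped.
    cleaned = [{k: v for k, v in r.items() if k != "total"} for r in cleaned]
```

Checking the sum before discarding the total guards against dropping a column that only looks redundant.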

  15. Data Transformation • Rescaling • Log() of totalspend and totaltrans to reduce the effect of large values • Derive Data • Derive ActAccInMostRecMon from the product recency data (number of active accounts in the most recent month) • Derive the ratio of low taxfiler income from V156-V163: Value = V156 / sum(V156:V163) • Convert the value to 5 categories
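
A minimal sketch of these transformations follows. Two choices are assumptions rather than the project's documented method: `log1p` is used so that zero values stay defined, and the 5 categories are taken as equal-width bins over [0, 1]. The counts are made up.

```python
import math

def log_rescale(value):
    """log(1 + x), so zero spend or zero transactions stays defined."""
    return math.log1p(value)

def low_income_ratio(counts):
    """Share of the first bracket (V156) in the total over V156..V163."""
    total = sum(counts)
    return counts[0] / total if total else 0.0

def to_category(ratio, n_bins=5):
    """Map a ratio in [0, 1] to an equal-width bin numbered 1..n_bins."""
    return min(int(ratio * n_bins), n_bins - 1) + 1

# Made-up counts standing in for V156..V163 of one region.
v156_to_v163 = [30, 25, 20, 10, 5, 5, 3, 2]

ratio = low_income_ratio(v156_to_v163)
category = to_category(ratio)
totalspendlog = log_rescale(1250.0)  # hypothetical totalspend value

print(ratio, category, round(totalspendlog, 3))
```

Quantile-based bins would be an equally plausible reading of "5 categories"; the slides do not say which was used.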

  16. Data Transformation Histogram of new variable with Objective overlaid

  17. Gender Vs IncomeGroup

  18. Multiplot for p16

  19. totaltranslog vs totalspendlog

  20. totalspendlog over totaltranslog(Sum)

  21. Histogram of ActAccInMostRecMon

  22. Census A Languages

  23. Census A Languages (cont’d) • Inverse correlation between English- and French-speaking regions • No region with significant Tagalog, Spanish or other language-speaking populations • Can probably discard amtspanish, amttagalog, amtsingres, amtengnon, amtmultilin • Cluster/segment English/French areas

  24. Census A vs B Languages

  25. Census A vs B Languages

  26. Census A vs B Languages • Linear relationship for English and French across Census A & B • Can merge amtenglish and bhlenglish • Can merge amtfrench and bhlfrench
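
One possible way to merge two columns that measure the same quantity is to average them and fall back to whichever value is present. This sketch assumes both columns are on the same scale, which the slides do not confirm; the values are made up and `merge_columns` is a hypothetical helper.

```python
def merge_columns(a, b):
    """Average two measurements of the same quantity.

    Falls back to whichever value is present when the other is missing.
    """
    if a is None and b is None:
        return None
    if a is None:
        return b
    if b is None:
        return a
    return (a + b) / 2

# Made-up proportions of English speakers per region; None marks missing data.
amtenglish = [0.80, 0.10, None, 0.55]
bhlenglish = [0.82, None, 0.40, 0.57]

english = [merge_columns(a, b) for a, b in zip(amtenglish, bhlenglish)]
print(english)
```

If the two sources turned out to use different scales, one column would need rescaling before averaging, or one of the pair could simply be dropped.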

  27. Acflonepar vs bfslonepar

  28. Acflonepar vs bfslonepar • Linear relationship • Merge acflonepar & bfslonepar • Filter out noisy data

  29. Anfamrel – living with relatives

  30. Anfamrel – living with relatives • Most data below 0.1 • Objective remains constant throughout • Not important to business objective, so discard

  31. afem40to44

  32. afem40to44 • Lack of data from other age groups • Very specific targeted marketing to the 40-44 female group • Normalize values from 0 to 0.1 if necessary • Objective improves as proportion increases

  33. afp1child

  34. afp1child • Objective clearly improves when afp1child is on the lower end of the normal curve

  35. acfwchcom • 7 regions with acfwchcom = 0.19 and objective = 1

  36. Acftotmar vs acfhuswife

  37. Acftotmar vs acfhuswife • Most regions have above 60% married couples, assuming normalized data • Acftotmar and acfhuswife mirror one another • Can discard either field • Filter noisy data • Categorical: lone-parent and husband-wife

  38. Census B Data Understanding • As with the other census and taxfiler data, these data represent the distribution within the region.

  39. Period of Construction • The trend is similar: the number of constructions is roughly the same in the two periods. • The sample population represents only a small share of the construction in the region.

  40. Maintenance & Repair • Those who do regular maintenance have neither major nor minor repairs.

  41. Maintenance & Repair • Those who have major repairs tend to have fewer minor repairs.

  42. Ethnic Origin • The sample population represents the majority of people of English or British ethnic origin in the region. • Those who have British ethnic origin also have English ethnic origin. • Fewer people have English ethnic origin than British ethnic origin.

  43. Ethnic Origin • This data represents only a very small number of people of French ethnic origin.

  44. Household vs Family Income • Both show the same trend; some who did not answer for family income answered for household income.

  45. Labour Force Variable • Both variables have the same description. Need to check which one is which.

  46. Birth City • The population sample is mostly locals
