Lecture 10. MARK2039 Summer 2006 George Brown College Wednesday 9-12. Assignment 8: Geocoding example. Example: A retailer has the following information: Name and address of its customers Address of its stores Stats Can Information
Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12
Assignment 8: Geocoding example
• Example: A retailer has the following information:
  • Name and address of its customers
  • Address of its stores
  • Stats Can information
• As a marketer, how would you intelligently use this information?
  • Get postal codes of customers and stores
  • Get geocodes (latitude and longitude numbers of each postal code)
  • Calculate the distance between each customer and the nearest store
  • Create a trading area around each store to determine the relevant customers for that store
  • Identify the best stores and calculate demographics of the best stores vs. the remaining stores
  • Use the above learning to promote non-performing stores whose customer demographic makeup is similar to that of the best stores
  • Use the above info to determine where to open up or perhaps close stores
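The distance and trading-area steps can be sketched in a few lines. This is only an illustration, not the method used in class: it assumes the haversine great-circle formula, and the store names, geocodes, and 10 km trading radius are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_store(customer, stores):
    """Return (store_id, distance_km) for the store closest to a customer geocode."""
    return min(
        ((store_id, haversine_km(*customer, *coords)) for store_id, coords in stores.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical geocodes (latitude, longitude) looked up from postal codes
stores = {"store_A": (43.65, -79.38), "store_B": (43.85, -79.02)}
customer = (43.70, -79.40)

store_id, dist = nearest_store(customer, stores)
in_trading_area = dist <= 10.0  # hypothetical 10 km trading-area radius
print(store_id, round(dist, 1), in_trading_area)
```

Running `nearest_store` over every customer gives each store's trading-area customer list, which can then be joined to Stats Can demographics.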
Assignment 8
• Why do we look at correlation analysis as our first statistical exercise in the data mining process?
  • It allows us to use statistics initially as a prescreening tool for eliminating variables from the data mining exercise.
Assignment 8
• Give me an example of a correlation table of 5 variables where two variables are significant and three variables are not significant. Provide correlation values that support your results.
Recapping from last week
• Geocoding
  • What are the key things to think of?
  • Look at the answer from two slides ago. Geocoding gives us numbers to calculate the distance between two postal codes.
• More material on correlation analysis
  • How do EDA reports tie into the correlation analysis?
  • They are trend-like reports which demonstrate why a given variable has a strong relationship with the objective function.
• How should we present the final results of a model?
  • How is the above derived? From the partial R2 of each variable divided by the total R2 of the equation.
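The partial R2 / total R2 calculation can be illustrated as follows. This is a minimal sketch: the variable names and partial R2 values are hypothetical, and it assumes the total R2 of the equation is the sum of the partial R2 values, as in a typical stepwise regression summary.

```python
def variable_contributions(partial_r2):
    """Share of the model's explanatory power attributed to each variable."""
    total_r2 = sum(partial_r2.values())  # assumption: partial R2 values sum to the total R2
    return {name: r2 / total_r2 for name, r2 in partial_r2.items()}

# Hypothetical partial R2 values for a three-variable model
partial_r2 = {"tenure": 0.12, "recency": 0.06, "income": 0.02}

contributions = variable_contributions(partial_r2)
for name, share in contributions.items():
    print(f"{name}: {share:.0%} of total model contribution")
```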
Notion of Lift
• What is lift: the performance of a group relative to the performance of the benchmark, expressed as an index where 100 means no difference
• Examples:

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     2%                    200
Retention Campaign Churn Rate         15%                    25%                   167
Credit Card Loss Rate                 5%                     8%                    160
Product Affinity Rate                 10%                    30%                   300

The targeted group represents those names as determined by a data mining tool such as a predictive model.
Notion of Lift
• Examples of cases where lift is below 100:

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     0.5%                  50
Retention Campaign Churn Rate         15%                    10%                   67
Credit Card Loss Rate                 5%                     2%                    40
Product Affinity Rate                 10%                    6%                    60
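The lift figures in these two tables follow from one ratio. A minimal sketch, using rates taken from the slides (the function name is only illustrative):

```python
def lift_index(targeted_rate, benchmark_rate):
    """Lift as an index: 100 means the targeted group performs like the benchmark."""
    return 100.0 * targeted_rate / benchmark_rate

# Rates from the slides: (activity, benchmark, targeted)
examples = [
    ("Acquisition response rate, lift above 100", 0.01, 0.02),
    ("Product affinity rate, lift above 100", 0.10, 0.30),
    ("Acquisition response rate, lift below 100", 0.01, 0.005),
    ("Product affinity rate, lift below 100", 0.10, 0.06),
]

for activity, benchmark, targeted in examples:
    print(f"{activity}: {lift_index(targeted, benchmark):.0f}")
```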
Validating the Model: Example of a Gains Chart
• Revenue per order is $60.
• Cost of 1 mail piece is $0.855.
• Benefits of modelling are the foregone promotion costs from promoting fewer names to achieve a given # of orders at a higher response rate.
• Listed below are the hard numbers that might comprise a lift curve:

% of List (Ranked    Validation      Cum. Resp.   Cum. % of    Cum.    Interval   Benefits
by Model Score)      Mail Quantity   Rate         all Resp.    Lift    ROI
0 - 10%              20,000          3.50%        23.33%       233     145%       $22,799
10 - 20%             40,000          3.00%        40%          200     75%        $34,200
20 - 30%             60,000          2.75%        55%          183     58%        $42,750
30 - 40%             80,000          2.50%        67%          167     23%        $45,600
40 - 50%             100,000         2.25%        75%          150     -12.2%     $42,750
. . .
90 - 100%            200,000         1.50%        100%         100     -58%       $0

How might this be plotted? In class we saw this as a straight, decreasing line when plotting interval response rate against the deciles. If we plot the cum. % of responders, the shape is a parabola-type curve, with a larger parabola representing a better model. Meanwhile, a steeper slope when plotting interval response rate against the deciles represents a stronger model.
Validating the Model: Calculating the metrics on the gains chart
• Cum. % of responders in top 10%:
  • Total responders: 200,000 × 1.5% = 3,000
  • # of responders in top 10%: 20,000 × 3.5% = 700
  • Cum. % in top 10%: 700/3,000 = 23%
• Cum. lift in top 10%:
  • Average response rate: 1.5%
  • Cum. response rate in top 10%: 3.5%
  • Cum. lift: (3.5%/1.5%) × 100 = 233
Calculating the metrics on the gains chart
• Interval ROI in 10%-20%:
  • # of persons mailed: 20,000
  • # of responders in 10%-20%: (40% - 23.33%) × 3,000 = 500
  • Net revenue: (500 × 60) - (0.855 × 20,000) = 12,900
  • Costs: 0.855 × 20,000 = 17,100
  • ROI: 12,900/17,100 = 75%
• Calculating the Benefits column at 30%:
  • Mailing costs to achieve 1,650 responders without modelling: ((0.0275 × 60,000)/0.015) × 0.855 = 94,050
  • Mailing costs with modelling: 60,000 × 0.855 = 51,300
  • Benefits: 94,050 - 51,300 = $42,750
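The same arithmetic can be run for every row at once. The sketch below is one possible way to organise it, not the tool used in class: the function and field names are illustrative, while the mail quantities, response rates, revenue, and cost come from the gains chart slide.

```python
REVENUE_PER_ORDER = 60.0   # from the slide
COST_PER_PIECE = 0.855     # from the slide
LIST_SIZE = 200_000        # total names in the validation list
OVERALL_RATE = 0.015       # response rate of the whole list

# (cumulative mail quantity, cumulative response rate) for the top five deciles
ROWS = [(20_000, 0.0350), (40_000, 0.0300), (60_000, 0.0275),
        (80_000, 0.0250), (100_000, 0.0225)]

def gains_chart(rows):
    """Derive cum. % of responders, cum. lift, interval ROI, and benefits per row."""
    total_responders = LIST_SIZE * OVERALL_RATE
    chart = []
    prev_qty, prev_resp = 0, 0.0
    for qty, cum_rate in rows:
        cum_resp = qty * cum_rate
        interval_cost = (qty - prev_qty) * COST_PER_PIECE
        interval_net = (cum_resp - prev_resp) * REVENUE_PER_ORDER - interval_cost
        chart.append({
            "mail_qty": qty,
            "cum_pct_responders": cum_resp / total_responders,
            "cum_lift": 100.0 * cum_rate / OVERALL_RATE,
            "interval_roi": interval_net / interval_cost,
            # cost of reaching the same responders at the average rate, minus the modelled cost
            "benefits": (cum_resp / OVERALL_RATE - qty) * COST_PER_PIECE,
        })
        prev_qty, prev_resp = qty, cum_resp
    return chart

chart = gains_chart(ROWS)
for row in chart:
    print(f"{row['mail_qty']:>7,}  {row['cum_pct_responders']:.1%}  "
          f"{row['cum_lift']:.0f}  {row['interval_roi']:.1%}  ${row['benefits']:,.0f}")
```

Small differences from the slide (for example about $22,800 rather than $22,799 in the first row) come from rounding.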
Gains Chart Examples
Assume a mail cost of $1.00 per piece and a revenue per order of $50.00. Please fill in the blanks for the first 4 rows.

Cum. # of Names   Cum. Response   Interval     Interval   Benefits   Interval
Mailed            Rate            Resp. Rate   Lift                  ROI
10,000            2.50%           2.5%         250        $15,000    25%
20,000            2.25%           2.0%         200        $25,000    0
30,000            2.10%           1.8%         180        $33,000    -10%
40,000            1.80%           0.9%         90         $32,000    -55%
. . .
100,000           1%

Interval Resp. Rate: 10,000*0.025 = 250 = 2.5%; 20,000*0.
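The interval columns of this exercise can be derived from the cumulative columns by differencing. A minimal sketch, assuming each row adds 10,000 names and that the list-wide response rate is the 1% shown at 100,000 names (the function name is illustrative):

```python
OVERALL_RATE = 0.01  # cum. response rate at 100,000 names on the slide

# (cumulative names mailed, cumulative response rate) for the first four rows
ROWS = [(10_000, 0.0250), (20_000, 0.0225), (30_000, 0.0210), (40_000, 0.0180)]

def interval_metrics(rows, overall_rate):
    """Return (names, interval response rate, interval lift) for each row."""
    results = []
    prev_names, prev_resp = 0, 0.0
    for names, cum_rate in rows:
        cum_resp = names * cum_rate  # cumulative responders so far
        interval_rate = (cum_resp - prev_resp) / (names - prev_names)
        results.append((names, interval_rate, 100.0 * interval_rate / overall_rate))
        prev_names, prev_resp = names, cum_resp
    return results

metrics = interval_metrics(ROWS, OVERALL_RATE)
for names, rate, lift in metrics:
    print(f"{names:>6,}  interval rate {rate:.1%}  interval lift {lift:.0f}")
```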
Lift Curve with Zero Model Effectiveness
• What does this look like if we plot it on a lift curve?
  • A line rather than a parabola if we plot cum. % of responders
Gains Chart Examples
• What is the best model? Model 1
• What is the worst model? Model 4
• What are the Model 3 results telling you? We have some rank ordering all the way down to 70,000 names, and then the model flattens out; we may need a strategy here for this bottom segment.
Gains Chart Examples
• In each response model case, answer the following questions:
• Where would your cutoff be with a budget of $80,000 and a cost per piece of $2.00?
  • 40,000 names
• Where would your cutoff be if you needed to attain a forecasted order quantity of 350?
  • Between 10,000 and 20,000 names for Models 1 and 2, between 20,000 and 30,000 for Model 3, and between 30,000 and 40,000 for Model 4
• Where would your optimum cutoff be, presuming that neither budget nor forecasted order quantities were constraints?
  • 50,000 for Models 1 and 2, and 60,000 for Model 3; it does not matter for Model 4
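The first two cutoff questions reduce to simple lookups. The sketch below is illustrative only: the function names are hypothetical, and because the four model charts are not reproduced here, it borrows the cumulative figures from the fill-in-the-blanks gains chart earlier in the deck as stand-in data.

```python
def budget_cutoff(budget, cost_per_piece):
    """Largest number of names that can be mailed within the budget."""
    return int(budget // cost_per_piece)

def order_target_bracket(rows, target_orders):
    """Return the (lower, upper) mail quantities between which the order target is reached."""
    prev_names = 0
    for names, cum_rate in rows:
        if names * cum_rate >= target_orders:
            return prev_names, names
        prev_names = names
    return None  # target not reachable within the listed rows

# Stand-in data: (cumulative names mailed, cumulative response rate)
ROWS = [(10_000, 0.0250), (20_000, 0.0225), (30_000, 0.0210), (40_000, 0.0180)]

print(budget_cutoff(80_000, 2.00))      # names affordable on an $80,000 budget
print(order_target_bracket(ROWS, 350))  # bracket containing the 350-order cutoff
```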
Gains Chart Examples
• Calculate the following:
  • Interval names mailed
  • Cum. response rate
• Assuming a cost per name of $1.50 and revenue per responder of $75, calculate the interval ROI and the modelling benefits for each interval.
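With these cost and revenue figures, interval ROI can be computed per name mailed, and the response rate at which ROI is exactly zero falls out directly. A minimal sketch (the function names are illustrative, and the 3.5% interval rate is a hypothetical input):

```python
def breakeven_rate(cost_per_name, revenue_per_responder):
    """Response rate at which revenue exactly covers mailing cost."""
    return cost_per_name / revenue_per_responder

def interval_roi(interval_rate, cost_per_name, revenue_per_responder):
    """ROI of an interval, expressed per name mailed."""
    return (interval_rate * revenue_per_responder - cost_per_name) / cost_per_name

COST, REVENUE = 1.50, 75.0  # from the slide

print(f"break-even response rate: {breakeven_rate(COST, REVENUE):.1%}")
print(f"ROI at a 3.5% interval rate: {interval_roi(0.035, COST, REVENUE):.0%}")
```

An interval whose response rate falls below the break-even rate has a negative ROI, which is one way to locate an optimum cutoff when neither budget nor order quantity is a constraint.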
Tracking of Models
• Two models are used in two campaigns. In campaign A, the overall response rate is 3.5%, which is above the breakeven response rate of 2%. In campaign B, the overall response rate is 1.2%, which is below the breakeven response rate of 2%. Yet the model in campaign B is more effective. Explain why.
  • The model is rank ordering names quite well for campaign B (1.2% overall), while the better campaign overall (3.5%) exhibits no rank ordering of response rate between deciles.
CHAID
• "CHAID" is an acronym for Chi-square Automatic Interaction Detection
• Produces a decision-tree-like report
  • Branches and nodes
• Non-parametric approach
• Output of the routine is a segment or group as opposed to a score
• Uses chi-square statistics to determine statistically significant breaks
• Conceptual interpretation: (Observed - Expected)/Expected; the chi-square statistic itself sums (Observed - Expected)²/Expected over all cells
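To make the observed-versus-expected idea concrete, the sketch below computes the chi-square statistic for one candidate break. It covers only the test statistic, not the full CHAID routine (no category merging or significance adjustment), and the segment counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count if the split had no relationship with response
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical split: rows are segments, columns are (responders, non-responders)
candidate_split = [[60, 940],
                   [20, 980]]

stat = chi_square(candidate_split)
CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, 1 degree of freedom, 5% level
print(f"chi-square = {stat:.2f}, significant break: {stat > CRITICAL_5PCT_1DF}")
```

A split whose two segments respond at the same rate gives a statistic of zero, so it would not qualify as a break.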
CHAID What criteria determine the end nodes?