Explore how Target's analysts constructed a pregnancy prediction model, the data inputs used, additional variables to improve the model, and the actions informed by predicted pregnancy scores.
Target’s Pregnancy Prediction Problem: The Complete Analytical Process “Take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant and that her delivery date is sometime in late August.”
Model Inputs Based on Duhigg’s description of Jenny Ward, what kinds of data did Target analysts use in constructing the pregnancy prediction model?
DATA: Past Purchases (Related items, Target items), Age, Gender | OUTPUT: Pregnancy Scores | ACTION: Product Selection in Brochures
Discuss An analyst wants to improve the model by adding more variables to it. Suggest some additional variables.
Discuss Suggest other actions that can be informed by the predicted pregnancy scores.
Model DATA: Past Purchases (Related items, Target items), Age, Gender | OUTPUT: Pregnancy Scores | ACTION: Product Selection in Brochures
What is a Model? Valuation Model Source: Keith Howe (2009)
What is a Model? Climate Model Source: Mark Chandler, EdGCM
What is a Model? Climate Model Source: IPCC
What is a Model? Digital Marketing Attribution Model
Digital Marketing Attribution For each response, allocate credit to the responsible channel Display Ad SEO SEM Email
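Attribution rules: First Click, Exponential Decay, Last Click. [Timeline figure: touchpoints on the Display Ad, SEO, SEM and Email channels at times t0 through t10, leading to the Response. Examples: user clicked on banner ad; user clicked on Google organic search result.] Influence scales with time order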
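The three attribution rules named on the slide (first click, last click, exponential decay) can be sketched in a few lines. This is a minimal illustration, not the method used by any particular ad platform: the touchpoint log, the response time of 10, and the `half_life` parameter are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple

# One touchpoint is (channel, time step). All data below is hypothetical.
Touch = Tuple[str, float]


def first_click(touches: List[Touch]) -> Dict[str, float]:
    """Give all credit to the earliest touchpoint."""
    channel, _ = min(touches, key=lambda t: t[1])
    return {channel: 1.0}


def last_click(touches: List[Touch]) -> Dict[str, float]:
    """Give all credit to the touchpoint closest to the response."""
    channel, _ = max(touches, key=lambda t: t[1])
    return {channel: 1.0}


def exp_decay(touches: List[Touch], response_time: float,
              half_life: float = 2.0) -> Dict[str, float]:
    """Split credit so a touchpoint's weight halves for every
    `half_life` time steps between it and the response (assumed form)."""
    weights = [(ch, 0.5 ** ((response_time - t) / half_life))
               for ch, t in touches]
    total = sum(w for _, w in weights)
    credit: Dict[str, float] = {}
    for ch, w in weights:
        credit[ch] = credit.get(ch, 0.0) + w / total
    return credit


# Hypothetical path: banner ad at t2, organic search at t6, email at t8.
touches = [("Display Ad", 2), ("SEO", 6), ("Email", 8)]
decay_credit = exp_decay(touches, response_time=10)

print(first_click(touches))
print(last_click(touches))
print({ch: round(c, 3) for ch, c in decay_credit.items()})
```

First click and last click are the two extremes; exponential decay sits between them, and the choice of half-life controls how close it is to last click.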
“All models are wrong but some are useful” George Box
First to Your Door “Right around the birth of a child... parents are exhausted and overwhelmed and their shopping patterns and brand loyalties are up for grabs.” “We knew that if we could identify them in their second trimester, there’s a good chance we could capture them for years.”
Brochure Design “As long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons.” “We’d put an ad for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.” “We started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random.”
Customer acquisition tool • It’s predictive • So accurate it’s creepy
The Marketing Problem Better Prospecting More Relevant Brochures More New Customers More Revenue Per Current Customer Lower Cost More Revenue More Profit
Discuss Identify any problems with the way Target analysts framed the business problem.
The Business Problem Revisited Targeting of At-Risk Customers More Relevant Brochures Retain More Customers More Revenue Per Customer Cost: New Brochures v. Fewer Brochures + Any Offers More Revenue More Profit
Measuring Success According to Duhigg, Target’s pregnancy prediction effort was highly successful. How accurate was Target’s prediction model? The accuracy rate was not disclosed.
20% predicted to be pregnant 3 of 10 predictions are accurate 2 of 5 pregnancies are missed
20% predicted to be pregnant Accurate Prediction Missed Opportunity False Positives Why did Target mix in random products? A) 7 out of 10 receiving brochures will not be pregnant B) 3 out of 10 receiving brochures will feel creeped out A >> B
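The three figures on these slides (20% predicted to be pregnant, 3 of 10 predictions accurate, 2 of 5 pregnancies missed) are enough to fill in a full confusion matrix. The sketch below does the arithmetic for a hypothetical population of 1,000 customers; the population size is an assumption, and the slide figures themselves are illustrative, since Target's actual accuracy was not disclosed.

```python
from fractions import Fraction

population = 1000                 # hypothetical number of customers
flag_rate = Fraction(1, 5)        # 20% predicted to be pregnant
precision = Fraction(3, 10)       # 3 of 10 predictions are accurate
miss_rate = Fraction(2, 5)        # 2 of 5 pregnancies are missed

flagged = population * flag_rate            # customers sent a brochure
true_pos = flagged * precision              # accurate predictions
false_pos = flagged - true_pos              # brochures to non-pregnant customers
recall = 1 - miss_rate                      # share of pregnancies caught
actual_pregnant = true_pos / recall         # implied number of pregnancies
false_neg = actual_pregnant - true_pos      # missed opportunities
true_neg = population - flagged - false_neg

base_rate = actual_pregnant / population
accuracy = (true_pos + true_neg) / population

print(f"TP={int(true_pos)} FP={int(false_pos)} "
      f"FN={int(false_neg)} TN={int(true_neg)}")
print(f"implied base rate={float(base_rate):.0%}, "
      f"overall accuracy={float(accuracy):.0%}")
```

Under these figures the implied base rate is 10%, so a rule that never predicts pregnancy would score 90% overall accuracy, higher than the model's 82%. That is one reason precision and recall describe this kind of model better than a single accuracy number.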
Customer acquisition tool • It’s predictive • So accurate it’s creepy • Inaccurate and detracting • Not for customer acquisition 2. It’s not very predictive - even with Big Data
Target’s Pregnancy Prediction Problem • Defining and framing the business problem • Collecting data for the analytical model • Selecting an analytical method • Developing a useful model that solves the problem • Describing how model outputs can drive action • Projecting the impact of such action • Measuring the model performance
Use Cases Which of the following questions can be answered directly by the Baby Names Voyager (without referring to other materials)? A. Why did the name Barbara peak in the 1940s? B. Is the name Charlotte or Chelsea more popular? C. How popular will David be in the year 2025? D. What name should I choose for my baby girl?
Other Analyses of the Data Source: Social Security Administration
Inverting the Frame Given a time period, which names are popular? Baby Names Voyager Given a name, which time period is most likely? Given a name, guess someone’s age Given a name, guess what languages he/she speaks fivethirtyeight.com
DATA: Address, Religion, First Name, Last Name | OUTPUT: Probabilities of speaking English, Spanish, German, Japanese, etc. | ACTION: Segmentation, Targeting, etc.
Evaluating Model Performance Make prediction using the median Use IQR as a measure of error Accuracy varies with name Source: fivethirtyeight.com
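Predicting with the median and using the IQR as the error measure can be sketched as follows. The age counts are invented for illustration and are not Social Security Administration or fivethirtyeight.com figures; `weighted_quantile` is a hypothetical helper that uses a simple "first age reaching the cumulative share" rule.

```python
from typing import Dict


def weighted_quantile(counts: Dict[int, int], q: float) -> int:
    """Smallest age whose cumulative share of people is at least q."""
    total = sum(counts.values())
    running = 0
    for age in sorted(counts):
        running += counts[age]
        if running >= q * total:
            return age
    raise ValueError("counts must be non-empty")


# Hypothetical counts of living people with a given name, keyed by age.
living_by_age = {20: 100, 30: 200, 40: 300, 50: 250, 60: 150}

predicted_age = weighted_quantile(living_by_age, 0.50)   # the median
q1 = weighted_quantile(living_by_age, 0.25)
q3 = weighted_quantile(living_by_age, 0.75)
iqr = q3 - q1                                            # spread of the middle 50%

print(f"predicted age: {predicted_age}, IQR: {iqr} years ({q1} to {q3})")
```

A name with a narrow IQR pins down age well and a name with a wide IQR does not, which is the sense in which accuracy varies with name.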
Evaluating Model Performance Accuracy varies with gender Accuracy improves with more co-variates Source: fivethirtyeight.com
Discuss What other co-variates might be useful to help predict age more accurately?
Course Project I. Project Proposal (Wk 3) II. Midterm: Data Cleaning & Processing (Wk 7) III. Final: Analysis & Modeling (Wk 12)
Project Proposal • Objectives: • Select a dataset and specify a business/organizational problem you want to solve • Diagnose data issues in your dataset (you will fix these issues in Deliverable #2). We cover diagnosing and fixing data issues in Module 2 • Due Date: [Sep 25th], 11:59 PM • Grading: max 10 points • All assignment files must be uploaded to Canvas. We do not accept emailed files. • Reminder: Late assignments (excused or not) will incur a penalty of 20%. Late without prior notification, or late by more than 7 days, will be scored zero. • Ling or I will provide feedback and approval on Canvas. (Please open your documents before you email us asking where our comments are.)
Choosing your Dataset • Not too small (e.g. > 500 rows) • Not too big (e.g. < 1 million) • Not too aggregated • Not too dirty • Not too clean • Non-anticipatory (if Prediction)
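Two of these criteria are easy to screen for mechanically. The sketch below checks the row-count thresholds given on the slide and flags anticipatory columns, reading "non-anticipatory" as "every predictor is known before the prediction is made" (an assumed interpretation). The column names and time values are hypothetical.

```python
from typing import Dict, List

MIN_ROWS = 500          # "not too small" threshold from the slide
MAX_ROWS = 1_000_000    # "not too big" threshold from the slide


def size_ok(n_rows: int) -> bool:
    """True when the dataset is neither too small nor too big."""
    return MIN_ROWS < n_rows < MAX_ROWS


def anticipatory_columns(known_at: Dict[str, int],
                         prediction_time: int) -> List[str]:
    """Columns whose values only become known after the prediction time."""
    return sorted(col for col, t in known_at.items() if t > prediction_time)


# Hypothetical churn dataset: the week in which each column becomes known.
known_at = {"signup_channel": 0, "visits_week_1": 1, "refund_after_cancel": 9}
leaky = anticipatory_columns(known_at, prediction_time=4)

print(size_ok(12_000), leaky)
```

The remaining criteria (aggregation level, dirtiness) need judgment about the business problem and are not captured by a check like this.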
Example of a Bad Dataset Ebola in West Africa data Too aggregated For any given business problem, many of these rows will be useless Too few variables
Selecting an Analytical Problem
PREDICTION • Probability of a borrower defaulting on a loan • Probability of an email being spam • Probability of a customer deactivating (“churn”) • Amount of revenue • Frequency of visits • There is a response (outcome) variable • If the response is binary (yes/no) or categorical (e.g. which product type), this is also called a “classification” problem • Looking for correlations between the response and co-variates • Predictions can be validated
SEGMENTATION • How many types of customers do we have? • What are the characteristics of different types of shoppers? • What is the probability that a company has a business model of type A (B, C, etc.)? (advanced) • No response (outcome) variable • Adding structure to the data • Looking for correlations between co-variates • Difficult to validate; needs external evidence such as survey results