Join us for an insightful session as Ronny Kohavi from Microsoft discusses 7 common testing pitfalls that can impact your experiments. Discover the best practices for testing and learn how controlled experiments can lead to significant improvements in conversions. Gain valuable insights from real-life examples and understand the importance of data-driven decision-making. Attendees will receive a copy of the presentation slides and an on-demand recording afterwards. Don't miss out on this opportunity to enhance your testing strategies!
Sponsored By:
Top 7 Testing Pitfalls
Presented live November 18, 2009
Featuring Guest Star: Ronny Kohavi, GM, Microsoft Experimentation Platform
Admin Note: Attendees will also get a copy of these slides plus an on-demand mp3 via email on Thursday afternoon, November 19th
WhichTestWon.com
First: Why Bother Testing?
• "Best practices," standard Web design templates, and marketers' "gut" often FAIL tests.
• For previously untested sites, testing gives an average ~40% conversion lift.
• Tests can help you generate better-quality leads or sales – not just more conversions.
Agenda • Intro & controlled experiments in one slide • Examples: you’re the decision maker • Seven pitfalls • Q&A Pitfalls based on KDD 2009 paper: http://exp-platform.com/ExPpitfalls.aspx by Thomas Crook, Brian Frasca, Ronny Kohavi, and Roger Longbotham
Our Experience at Microsoft • The Experimentation Platform started at Microsoft in 2006 • Experiments ran on 20 Microsoft properties, including MSN home pages in several countries, MSN Money, MSN Real Estate, www.microsoft.com, store.microsoft.com, support.microsoft.com, Office Online, www.xbox.com, several marketing sites, and Windows Genuine Advantage • Large experiments run with tens of millions of users • Multiple experiments have projected annual improvements of over $1M each
Controlled Experiments in One Slide • Concept is trivial • Randomly split traffic between two (or more) versions • A (Control) • B (Treatment) • Collect metrics of interest • Analyze • Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s) • Must run statistical tests to confirm differences are not due to chance (a minimal sketch follows below)
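The slides do not prescribe a specific statistical test; as a hedged illustration only (not the Experimentation Platform's actual code), here is a minimal two-proportion z-test in Python that checks whether a difference in conversion rate between Control and Treatment is likely due to chance. The counts are made up.

```python
# Minimal sketch: two-proportion z-test for Control (A) vs. Treatment (B).
# Hypothetical counts; not the ExP platform's implementation.
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided
    return p_b - p_a, z, p_value

lift, z, p = two_proportion_z_test(conv_a=3000, n_a=100_000,
                                   conv_b=3210, n_b=100_000)
print(f"lift={lift:.4%}  z={z:.2f}  p={p:.4f}")        # significant if p < 0.05
```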
Examples • Three experiments that ran at Microsoft • All had enough users for statistical validity • OEC: the Overall Evaluation Criterion • See how many you get right • Three choices are: • A wins (the difference is statistically significant) • A and B are approximately the same (no stat sig diff) • B wins
Office Online
Test: a new design for the Office Online homepage
OEC: clicks on revenue-generating links (highlighted in red in the screenshots of A and B)
Is A better, B better, or are they about the same?
Office Online • B was 64% worse • The Office Online team wrote: "A/B testing is a fundamental and critical Web services… consistent use of A/B testing could save the company millions of dollars"
MSN UK Hotmail experiment
Hotmail module on the MSN UK home page
MSN UK Hotmail experiment
A: when a user clicks on an email, Hotmail opens in the same window
B: Hotmail opens in a separate window
Trigger: only users who click in the module are in the experiment (no difference otherwise)
OEC: clicks on the home page (after the trigger)
Is A better, B better, or are they about the same?
UK Hotmail • For those in the experiment, clicks on MSN Home Page increased +8.9% • <0.001% of users in B wrote negative feedback about the new window
Data Trumps Intuition • We distribute experiment reports widely at Microsoft • Someone who saw the report wrote: "This report came along at a really good time and was VERY useful. I argued this point to my team (open Live services in new window from HP) just some days ago. They all turned me down. Funny, now they have all changed their minds."
MSN Home Page Search Box
OEC: click-through rate for the search box and popular searches
Differences:
• A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
• B has a big search button
Is A better, B better, or are they about the same?
Search Box • No statistically significant difference • Insight: stop debating; it's easier to get the data
Hard to Assess the Value of Ideas: Data Trumps Intuition • At Amazon, half of the experiments failed to show improvement • QualPro tested 150,000 ideas over 22 years: "75 percent of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance…" • Based on experiments with ExP at Microsoft: • 1/3 of ideas were positive and statistically significant • 1/3 of ideas were flat: no statistically significant difference • 1/3 of ideas were negative and statistically significant • Our intuition is poor: two-thirds of ideas do not improve the metric(s) they were designed to improve. Humbling!
The HiPPO • The less data, the stronger the opinions • Our opinions are often wrong – get the data • HiPPO stands for the Highest Paid Person's Opinion • Hippos kill more humans than any other (non-human) mammal (really) • Don't let HiPPOs in your org kill innovative ideas. ExPeriment! • We give out these toy HiPPOs at Microsoft
Is Software Just Hard? NO! • Doctors have been taking the HiPPocratic Oath and promising "no harm," yet many beliefs were wrong for hundreds of years • For centuries, an illness was thought to be caused by a toxin • Opening a vein and letting the sickness run out was thought to be the best solution – bloodletting • One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and some one hundred other diseases • Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes
Bloodletting (2 of 2) (pictured: a lancet) • President George Washington had a sore throat and doctors extracted 82 ounces of blood over 10 hours (35% of his total blood), causing anemia and hypotension. He died that night • Pierre Louis ran an experiment in 1836 that is now recognized as one of the first clinical trials (randomized controlled experiments). He treated people with pneumonia either with • early, aggressive bloodletting, or • less aggressive measures • At the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink
Agenda • Intro & controlled experiments in one slide • Examples: you’re the decision maker • Seven pitfalls • Q&A
Pitfall 1: Wrong Success Metric • Remember this example? • OEC: clicks on revenue-generating links (highlighted in red in the screenshots of A and B)
Pitfall 1: Wrong OEC • B had a 64% drop in the OEC • Were sales correspondingly lower? • No. Clicks are a valid proxy only if the conversion rate from click to purchase is similar across variants • The price was shown only in B, sending more qualified purchasers into the pipeline • Lesson: measure what you really need to measure, even if it's difficult!
Pitfall 2: Incorrect Interval Calculation • Confidence Intervals (CI) are a great way to summarize results that have variability • Example: 95% CI for conversion rate might be 2.8%-3.2% (mean of 3.0% +/- 0.2%), which improved from 1.8%-2.2% • Business users prefer percent effect: 2% to 3% is a 50% improvement in conversion! • How can we provide a confidence interval on the 50%?
Pitfall 2: Incorrect Interval Calculation (cont) • You can't just convert the confidence interval to a percent effect, because the denominator is a random variable (we have a ratio of means) • Use Fieller's formula for a correct confidence interval on the percent effect • It is a more complex formula, but that's why we have computers (and statisticians, who figured this out in 1954) • Note: the confidence interval is not always symmetric around the mean in this case (see the sketch below)
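As a hedged sketch of the idea (Fieller, 1954), the following Python function computes a confidence interval for the percent effect, i.e., Treatment mean divided by Control mean, minus one. It assumes independent groups and approximately normal means; the input means and standard errors are illustrative, not numbers from the talk.

```python
# Minimal sketch of Fieller's interval for the percent effect
# (Treatment mean / Control mean - 1). Illustrative only.
from statistics import NormalDist

def fieller_percent_effect(mean_c, se_c, mean_t, se_t, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a = mean_c**2 - z**2 * se_c**2          # quadratic coefficient
    c = mean_t**2 - z**2 * se_t**2          # constant term
    # Solve a*r^2 - 2*mean_c*mean_t*r + c <= 0 for the ratio r = mean_t / mean_c
    disc = (mean_c * mean_t) ** 2 - a * c
    if a <= 0 or disc < 0:
        raise ValueError("Control mean too noisy; the interval is unbounded")
    lo = (mean_c * mean_t - disc**0.5) / a
    hi = (mean_c * mean_t + disc**0.5) / a
    return (lo - 1) * 100, (hi - 1) * 100   # percent-effect bounds

# Conversion going from 2% to 3% (a 50% relative lift), with assumed SEs:
# note the resulting interval is not symmetric around 50%.
print(fieller_percent_effect(mean_c=0.02, se_c=0.001, mean_t=0.03, se_t=0.001))
```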
Pitfall 3: Using Standard Formulas for Standard Deviation • Many metrics for online experiments cannot use the standard statistical formulas • Example: click-through rate = clicks / page views • A standard approach would treat each page view as an approximately Bernoulli trial • However, the true standard deviation is commonly larger than the Bernoulli formula suggests, because page views from the same user are not independent • Solution: bootstrap or the delta method (see the sketch below)
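The talk names the bootstrap and the delta method as solutions; here is a hedged sketch of the delta-method variance for a ratio metric, computed from user-level clicks and page views. The function names and simulated data are assumptions for illustration, not the platform's code.

```python
# Minimal sketch: delta-method standard error for CTR = sum(clicks)/sum(views)
# when the randomization unit is the user, so page views within a user are correlated.
import numpy as np

def ctr_delta_method_se(clicks_per_user, views_per_user):
    x = np.asarray(clicks_per_user, dtype=float)
    y = np.asarray(views_per_user, dtype=float)
    n = len(x)
    mx, my = x.mean(), y.mean()
    ctr = mx / my                                    # equals sum(x) / sum(y)
    var_mx, var_my = x.var(ddof=1) / n, y.var(ddof=1) / n
    cov = np.cov(x, y)[0, 1] / n
    var_ctr = (var_mx - 2 * ctr * cov + ctr**2 * var_my) / my**2
    return ctr, var_ctr**0.5

# Simulated heterogeneous users: the delta-method SE exceeds the naive Bernoulli SE.
rng = np.random.default_rng(0)
views = rng.poisson(10, size=5_000) + 1
clicks = rng.binomial(views, rng.beta(2, 38, size=5_000))
ctr, se = ctr_delta_method_se(clicks, views)
naive_se = (ctr * (1 - ctr) / views.sum()) ** 0.5
print(f"CTR={ctr:.4f}  delta-method SE={se:.5f}  naive Bernoulli SE={naive_se:.5f}")
```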
Best Practice: Ramp-up • Ramp-up • Start an experiment at 0.1% of traffic • Do simple analyses to make sure no egregious problems can be detected • Ramp up to a larger percentage, and repeat until the desired percent is reached (e.g., 50%) • Big differences are easy to detect because the minimum sample size grows quadratically as the effect we want to detect shrinks • Detecting a 10% difference requires a small sample, so serious problems can be detected during ramp-up • Detecting a 0.1% difference requires a population 100^2 = 10,000 times bigger (see the sketch below)
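To make the "quadratic in the effect" point concrete, here is a hedged sketch using the standard two-sample sizing formula (not Microsoft's internal tool); the 3% baseline conversion rate is an assumption for illustration.

```python
# Minimal sketch: required users per variant grows with 1/effect^2,
# so detecting a 0.1% lift needs 100^2 = 10,000x the users needed for a 10% lift.
from statistics import NormalDist

def users_per_variant(baseline, relative_effect, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    delta = baseline * relative_effect               # absolute difference to detect
    variance = 2 * baseline * (1 - baseline)         # pooled two-sample approximation
    return (z_a + z_b) ** 2 * variance / delta ** 2

for effect in (0.10, 0.01, 0.001):                   # 10%, 1%, 0.1% relative lift
    print(f"{effect:6.1%} lift -> {users_per_variant(0.03, effect):>15,.0f} users per variant")
```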
Pitfall 4: Combining Data when Percent to Treatment Varies • Simplified example: 1,000,000 users per day, with the percent of traffic assigned to Treatment changing from day to day • For each individual day the Treatment is much better • However, the cumulative (pooled) result for Treatment is worse • This is called Simpson's Paradox (see the sketch below)
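A hedged sketch of the paradox with illustrative numbers (not necessarily the talk's exact figures): Treatment wins on each day, yet pooling raw counts across days with different Treatment percentages makes Control look better.

```python
# Minimal sketch of Simpson's paradox when the percent in Treatment varies by day.
days = {
    # day label: (control_conv, control_users, treatment_conv, treatment_users)
    "Day 1 (1% in Treatment)":  (20_000, 990_000,   230,  10_000),
    "Day 2 (50% in Treatment)": ( 6_000, 500_000, 6_250, 500_000),
}

tc = tcu = tt = ttu = 0
for day, (c, cu, t, tu) in days.items():
    print(f"{day}: Control {c/cu:.2%}  Treatment {t/tu:.2%}  -> Treatment wins")
    tc, tcu, tt, ttu = tc + c, tcu + cu, tt + t, ttu + tu

print(f"Pooled: Control {tc/tcu:.2%}  Treatment {tt/ttu:.2%}  -> Control 'wins'")
# Remedy: analyze each period at its own split (or keep the split constant);
# never pool raw counts across periods with different treatment percentages.
```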
Pitfall 5: Not Filtering out Robots • Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.) • Robots can cause misleading results • The biggest concern is high-traffic robots (many clicks or page views) that stay in either Treatment or Control • We've seen one robot with > 600,000 clicks in a month on one page (and it was executing JavaScript)
Pitfall 5: Not Filtering out Robots (cont) • Identifying robots can be difficult • Some robots identify themselves through the UserAgent • Many look like human users and execute JavaScript • Use heuristics to identify and remove robots from the analysis (e.g., more than 100 clicks in an hour; see the sketch below) • This is ongoing research; there is no silver bullet
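As a hedged sketch of the threshold heuristic mentioned above (more than 100 clicks in an hour), here is a small Python filter; the event format and field names are assumptions, and a real pipeline would layer several such rules plus UserAgent checks.

```python
# Minimal sketch: flag users exceeding a clicks-per-hour threshold and drop them.
from collections import defaultdict

CLICKS_PER_HOUR_LIMIT = 100

def flag_robots(click_events):
    """click_events: iterable of (user_id, hour_bucket) pairs, one per click."""
    counts = defaultdict(int)
    for user_id, hour in click_events:
        counts[(user_id, hour)] += 1
    return {user for (user, _), n in counts.items() if n > CLICKS_PER_HOUR_LIMIT}

def drop_robot_events(click_events, robots):
    return [e for e in click_events if e[0] not in robots]

# Usage: remove flagged "users" before computing experiment metrics.
events = [("bot7", "2009-11-18T10")] * 500 + [("u1", "2009-11-18T10")] * 3
robots = flag_robots(events)                 # {'bot7'}
clean = drop_robot_events(events, robots)    # only u1's 3 clicks remain
```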
Effect of Robots on A/A Experiment • Each hour represents clicks from thousands of users • The “spikes” can be traced to single “users” (robots)
Pitfall 6: Invalid or Inadequate Instrumentation • Validate initial instrumentation • Logging audit: compare experimentation observations with the recording system of record • A/A experiment: run a "mock" experiment where users are randomly assigned to two groups but both groups get Control • Expect about 5% of metrics to be statistically significant • P-values should be uniformly distributed on the interval (0,1), and no p-value should be very close to zero (e.g., < 0.001) • Many of our "customers" initially fail one of these tests (see the sketch below)
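As a hedged sketch of the A/A check (simulated data, not real experiment logs), the snippet below generates p-values for many metrics where both groups receive Control, then verifies that roughly 5% are significant at 0.05 and that the p-values look uniform on (0, 1).

```python
# Minimal sketch of an A/A sanity check on simulated metrics.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_values = []
for _ in range(1_000):                        # pretend each iteration is one metric
    a = rng.normal(10, 2, size=5_000)         # both groups drawn from the same distribution
    b = rng.normal(10, 2, size=5_000)
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

p = np.array(p_values)
print(f"significant at 0.05: {(p < 0.05).mean():.3f}  (expect about 0.05)")
print(f"KS test for uniformity, p-value: {stats.kstest(p, 'uniform').pvalue:.3f}")
print(f"p-values below 0.001: {(p < 0.001).sum()}  (expect about 1 in 1,000 by chance)")
```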
Pitfall 7: Insufficient Experimental Control • Must make sure the only difference between Treatment and Control is the change being tested • A plot on the slide shows the hourly click-through rate for Control and Treatment on the MSN home page • Headlines were supposed to be the same in both • One headline was different for one 7-hour period, significantly changing the result
Summary • It is hard to assess the value of ideas • Get the data by experimenting because data trumps intuition • Examples are humbling • Avinash Kaushik wrote: “…the power of: Controlled Experiments. I am convinced this is God’s gift to online humanity.” • Replace the HiPPO with an OEC • Make sure the org agrees what you are optimizing (long term lifetime value) • Experts are often wrong. Doctors did bloodletting for centuries (and they swear by the HiPPOcratic oath) • Watch out for the pitfalls
Resources for a Deeper Dive • Controlled Experiments on the Web: Survey and Practical Guide, Data Mining and Knowledge Discovery journal, 2009: http://exp-platform.com/hippo_long.aspx • KDD 2009 Tutorial: http://exp-platform.com/tutorial.aspx • Contact: ronnyk@ microsoft dot you know what
WhichTestWon.com Live Q&A with Anne, Ronny, Roger
Thanks, plus 2 free offers:
• Online Testing Awards: free entries, everyone eligible, deadline this Friday! http://whichtestwon.com/awards
• Free Landing Page Evaluation Offer – click to schedule: http://whichtestwon.com/widerfunnel/lp.html