This study examines the use of online A/B/n tests and their ethical implications in industry. Using Optimizely as a lens, the researchers analyze audience segmentation, experimental treatments, and case studies of problematic experiments. The study sheds light on the prevalence of A/B/n tests and highlights the need for transparency and ethical considerations in conducting these experiments.
Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild Shan Jiang, John Martin, Christo Wilson Published in Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* 2019). Atlanta, GA, January, 2019.
Who Am I? Christo Wilson • Associate Professor • Khoury College of Computer Sciences • Northeastern University, Boston MA Research focus: online security and privacy • Public Key Infrastructure (PKI) and Transport Layer Security (TLS) • Tracking on the web • Algorithm auditing Impact and ethics of socio-technical systems on users
Online Behavioral Experiments OBEs, otherwise known as: • A/B or A/B/n tests • Split or bucket testing Use experiments to guide the design of websites, apps, services • Divide users into groups or segments • Show different variations per group • Measure the results and iterate or implement All of the major platforms use OBEs to guide development • In 2011, Google claimed to run 7,000 OBEs per year
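To make the group-assignment step concrete, here is a minimal sketch (not any platform's actual code) of how an A/B/n test might deterministically bucket users so a returning visitor keeps seeing the same variation; the hashing scheme, identifiers, and function name are illustrative assumptions.

```typescript
// Minimal sketch of A/B/n bucketing: deterministically assign each user to one
// of n variation groups so a returning visitor always sees the same variation.
// The hashing scheme and IDs are illustrative assumptions, not a real platform's code.
import { createHash } from "crypto";

function assignVariation(userId: string, experimentId: string, numVariations: number): number {
  // Hash (experimentId, userId) so the same user lands in different buckets
  // across different experiments.
  const digest = createHash("sha256").update(`${experimentId}:${userId}`).digest();
  // Interpret the first 4 bytes as an unsigned integer and reduce modulo n.
  return digest.readUInt32BE(0) % numVariations;
}

// Example: a classic two-variation A/B test.
const group = assignVariation("user-42", "homepage-banner", 2); // 0 = "A", 1 = "B"
console.log(group);
```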
OBEs Are Human Subjects Experiments In academia, OBEs are subject to Institutional Review Board approval • Submit the research plan for evaluation • Must be ethical, respect subjects • Obtain informed consent or perform post-hoc debrief None of this applies to industry • No requirements for disclosure or consent This matters, because OBEs may be ethically problematic • Facebook’s “emotion contagion” study • Price discrimination or product steering
Questions Little is known about OBEs in industry • What experiments are being run? • By whom? • On whom? • What are the ethical implications? Challenging to study, because companies are not transparent
Our Study Use Optimizely as a lens to measure OBE testing at scale • 575 websites for three months • Complete data on audience segments and experimental treatments Analyze characteristics of audiences and experiments • Most websites have <= 5 experiments; a few have dozens • Audience segmentation by device, geography, etc. Case studies of ethically problematic experiments • News headlines, price discrimination, ads and tracking
What Is Optimizely? Company that offers tools for conducting OBE tests • Monthly subscription fee for website operators Website operators define audiences and experiments • Audiences (i.e., treatment groups): segmentation based on browser, platform, IP, custom JavaScript, cookies, etc. • Variables may be combined using Boolean expressions • Experiments may be A/B/n tests or static personalization • Each experiment has one or more variations • Variations may correspond to audiences, or users may be randomly segmented • A classic A/B test would be an experiment with two randomly assigned variations Operators have complete control over treatment effects • Composed of arbitrary HTML, CSS, and JavaScript
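As a rough illustration of what operators define, here is a hypothetical, simplified sketch of an Optimizely-style configuration; the field names and types are this summary's assumptions, not Optimizely's actual schema.

```typescript
// Hypothetical, simplified shape of an Optimizely-style configuration.
// All names here (AudienceCondition, Experiment, etc.) are illustrative,
// not Optimizely's real schema.

// An audience is a Boolean combination of targeting conditions.
type AudienceCondition =
  | { type: "browser"; value: string }               // e.g. "chrome"
  | { type: "platform"; value: string }              // e.g. "ios"
  | { type: "ip_range"; cidr: string }               // network / geographic targeting
  | { type: "cookie"; name: string; match: string }  // first- or third-party cookies
  | { type: "custom_js"; code: string }              // arbitrary JavaScript predicate
  | { type: "and" | "or" | "not"; operands: AudienceCondition[] };

interface Audience {
  id: string;
  name: string;
  condition: AudienceCondition;
}

// A variation is the treatment injected into the page.
interface Variation {
  id: string;
  weight: number;   // share of randomly bucketed traffic
  changes: string;  // arbitrary HTML/CSS/JS applied to the page
}

// An experiment ties variations to audiences (personalization / targeted tests)
// or to random buckets (a classic A/B/n test).
interface Experiment {
  id: string;
  audienceIds: string[];    // empty = all visitors
  variations: Variation[];  // a classic A/B test has exactly two
}
```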
How Does Optimizely Work? • Website operators define their audiences and experiments • Resulting data is saved in a JSON-encoded configuration file • Operator adds the Optimizely JavaScript library to their website • When a user visits the website, the Optimizely JS executes • Downloads the JSON-config file • Calculates which audiences & experimental variations the user is part of • Injects the corresponding variations into the webpage All configuration data is available client side! • Thus, it can be crawled ;)
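A minimal, self-contained sketch of the client-side flow just described, under simplified assumptions (a single cookie-based audience condition and HTML-only variations); this illustrates the mechanism, not Optimizely's actual library code.

```typescript
// Simplified sketch of the client-side flow: fetch the config, check audience
// membership, pick a variation, inject it. Shapes and names are assumptions.
interface SimpleVariation { weight: number; html: string }
interface SimpleExperiment { audienceCookie?: string; variations: SimpleVariation[] }

async function runExperiments(configUrl: string): Promise<void> {
  // 1. Download the JSON-encoded configuration file.
  const experiments: SimpleExperiment[] = await (await fetch(configUrl)).json();

  for (const exp of experiments) {
    // 2. Decide whether this visitor is in the experiment's audience
    //    (here a single cookie check; real conditions are far richer).
    if (exp.audienceCookie && !document.cookie.includes(exp.audienceCookie)) continue;

    // 3. Pick a variation by weighted random draw and inject its content.
    const total = exp.variations.reduce((sum, v) => sum + v.weight, 0);
    let draw = Math.random() * total;
    const chosen = exp.variations.find((v) => (draw -= v.weight) <= 0) ?? exp.variations[0];
    document.body.insertAdjacentHTML("beforeend", chosen.html);
  }
}
```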
Data Collection Crawled the Alexa Top 1M in 2016 • Identified all sites that include a resource from Optimizely Crawled these sites once per week between January–March 2018 • Homepage plus 19 internal links from each site • Stored all Optimizely JSON configs embedded in pages Far fewer sites had active experiments in 2018 than included Optimizely in 2016; possible explanations for the discrepancy: • May have stopped using Optimizely • May not use the full Optimizely tool suite • May not have had active experiments in January–March • May use Optimizely on portions of the site we didn't crawl
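For concreteness, here is a rough sketch of how such a crawl could be implemented with Puppeteer; this is not the authors' crawler, and the detection heuristic (looking for "optimizely" in script sources) and the example URL are illustrative assumptions.

```typescript
// Rough crawler sketch: load a page, collect every <script> source/URL, and
// keep the ones that look Optimizely-related. Not the study's actual crawler.
import puppeteer from "puppeteer";

async function crawlPage(url: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });

  // Grab the URL or inline body of every script on the page; the embedded
  // JSON config can be parsed out of the matching scripts in a later step.
  const scripts: string[] = await page.evaluate(() =>
    Array.from(document.querySelectorAll("script")).map(
      (s) => s.src || s.textContent || ""
    )
  );

  await browser.close();
  return scripts.filter((s) => s.toLowerCase().includes("optimizely"));
}

// Example: crawl a homepage (internal links would be visited the same way).
crawlPage("https://example.com").then((hits) => console.log(hits.length));
```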
Intro • Data Collection • Analysis: Overall Optimizely Usage, Audiences, Experiments • Case Studies • Discussion
Overall Usage 80% of websites include Optimizely on ~18 of the 20 pages we crawled Vast majority of Optimizely usage is by Top 100K websites Some websites use Optimizely more selectively
Audiences 77% of websites have <= 5 audiences 38% of websites defined custom audience segments • Note: remaining 62% of sites may still be running experiments! Four websites do aggressive segmentation • Optimizely (114 audiences) • The New York Times (90) • AirAsia (79) • CREDO Mobile (64)
Defining Audiences Custom JavaScript • Ability to execute arbitrary code in the user's browser to determine audience membership Cookies from third parties • Tracking pixels • Data brokers (BlueKai) Sadly, we can't analyze these :( Relatively high level of sophistication
Experiments 69% of websites have <= 5 experiments • 9 websites have > 50 experiments (Optimizely, NYT, etc.) Types of experiments observed: • Basic A/B tests • A/B tests where the A group is the audience and the B group is all other visitors • Static personalization: not an actual experiment, the same "variation" is shown to all visitors • These experiments may still have multiple variations based on randomized audience segmentation
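One way to think about these categories: a small sketch of how crawled experiments might be sorted; the minimal shape and category names here are this summary's, not the paper's code or taxonomy.

```typescript
// Illustrative classification of a crawled experiment; the shape and category
// names are assumptions, not the paper's actual taxonomy or code.
interface ExperimentSummary {
  numVariations: number;  // variations defined for the experiment
  numAudiences: number;   // audiences the experiment targets (0 = all visitors)
}

type Kind = "static personalization" | "targeted A/B/n" | "basic A/B/n";

function classify(exp: ExperimentSummary): Kind {
  if (exp.numVariations <= 1) {
    // A single fixed "variation" shown to everyone who matches is
    // personalization, not an experiment.
    return "static personalization";
  }
  // Targeted: the A group is a defined audience, the B group is everyone else.
  return exp.numAudiences > 0 ? "targeted A/B/n" : "basic A/B/n";
}

console.log(classify({ numVariations: 2, numAudiences: 0 })); // "basic A/B/n"
```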
Intro • Data Collection • Analysis • Case Studies: Price Discrimination, News Headlines • Discussion
Price Discrimination Searched all experiment variations for terms like "$", "sale", "price", etc. • 40 websites with 117 experiments containing these terms PolicyGenius – insurance comparison marketplace • 13 experiments with 13 audiences • Audiences based on Google Analytics' Urchin tracker cookie • "Term Life Insurance As Low As X Per Month" where X = [$9.99, $10, $29] • Text with and without 40% discount offer language Particularly concerning in the insurance market, given its history of discrimination
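The keyword filter itself is straightforward; here is a sketch with an illustrative term list (the study's exact terms and matching logic may differ).

```typescript
// Sketch of the keyword filter: flag an experiment whose variation payloads
// mention pricing-related terms. The term list here is illustrative.
const PRICE_TERMS = ["$", "sale", "price", "discount"];

// variationPayloads: the raw HTML/CSS/JS changes of one experiment's variations.
function looksPriceRelated(variationPayloads: string[]): boolean {
  return variationPayloads.some((payload) =>
    PRICE_TERMS.some((term) => payload.toLowerCase().includes(term))
  );
}

// Example: a variation advertising a specific monthly price is flagged.
console.log(
  looksPriceRelated(["<h1>Term Life Insurance As Low As $9.99 Per Month</h1>"])
); // true
```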
News Headlines The New York Times: 91 experiments with 90 audiences • A/B/n testing of headlines • Experiments typically last 1-2 hours
Issues With Headline Testing Exemplifies competing priorities between business and editorial • Clickier headlines earn more money • But clickiest headline may not be the most informative or nuanced • Worst-case scenario: A/B/n tests privilege clickbait Complicates the idea of mass media as a shared frame • People often don’t read articles, they scan headlines • Framing of the headline shapes people’s perceptions of the story/event • Testing may create divergent perceptions of the news
Limitations Optimizely is the most popular OBE tool, but not the only one • Competing tools from Adobe, Google • Big platforms (Google, Facebook) use bespoke tools No coverage of mobile apps 53% of experiments target custom audiences • No way to analyze these audiences
Transparency and Consent We did not find any evidence of overtly unethical experiments OBEs are human experiments To our knowledge, none of the 575 websites prominently discloses experiments or asks for affirmative consent • Most say nothing in the ToS • NYT has an old blog post Respecting human autonomy requires grappling with these issues
Ethics for Practitioners Optimizely's tools (and others like them) are powerful and accessible Optimizely offers no training on how to use the tools ethically • They offer extensive training on all other aspects of their products Opportunities to teach website operators about experimental ethics • Belmont Report and Menlo Report • Informed consent and debriefing protocols • Beneficence, justice, respect for autonomy Promote accountability by asking for self-certification of compliance • Friction forces practitioners to consider ethics • Creates grounds for recourse against bad actors
DETOUR Act Deceptive Experiences To Online Users Reduction Act • Introduced by Senators Mark Warner (D-VA) and Deb Fischer (R-NE) • Targets internet firms with >100 million active users Bans deceptive interfaces that prevent informed consent from users Bans segmentation of users into groups for experimentation without consent • Experiments must be disclosed to users and the public at least every 90 days • Mandates independent review boards for behavioral and psychological research
Questions? Christo Wilson c.wilson@northeastern.edu @bowlinearl