Data Quality Models for High Volume Transaction Streams: A Case Study
Joseph Bugajski, Visa International
Robert Grossman, Chris Curry, David Locke & Steve Vejcik, Open Data Group
The Problem: Detect Significant Changes in Visa's Payments Network
[Diagram: the four parties in a Visa payment: account holder, issuing bank, merchant, and acquiring bank]
Visa Payment Network
• Over 1.59 billion Visa cards in circulation
• 6,800 transactions per second (peak)
• Over 20,000 member banks
• Millions of merchants
The Challenge: Payments Data Is Highly Heterogeneous
• Variation from cardholder to cardholder
• Variation from merchant to merchant
• Variation from bank to bank
Observe: If Data Were Homogeneous, Could Use Change Detection
[Diagram: a baseline model compared against an observed model]
• Sequence of events x[1], x[2], x[3], …
• Question: is the observed distribution different from the baseline distribution?
• Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
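To make the change-detection step concrete, here is a minimal sketch of a one-sided CUSUM test in Python. The baseline parameters (mu0, sigma0), drift k, and threshold h are illustrative assumptions, not values from the Visa deployment; a GLR test would replace the fixed drift with a maximized likelihood ratio over candidate change points.

```python
import random

def cusum_alert(stream, mu0, sigma0, k=0.5, h=5.0):
    """Return the index at which an upward mean shift is signaled, else None."""
    s = 0.0
    for i, x in enumerate(stream):
        z = (x - mu0) / sigma0      # standardize against the baseline model
        s = max(0.0, s + z - k)     # accumulate evidence of an upward shift
        if s > h:                   # signal once the statistic crosses h
            return i
    return None

# Toy stream: a decline-rate metric near 2% that jumps to 4% after t = 500.
random.seed(0)
stream = [random.gauss(0.02, 0.005) for _ in range(500)] + \
         [random.gauss(0.04, 0.005) for _ in range(500)]
print(cusum_alert(stream, mu0=0.02, sigma0=0.005))  # alerts shortly after 500
```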
Key Idea: Build 10⁴+ Models, One for Each Cell in the Data Cube
• Build a separate model for each bank (1,000+)
• Build a separate model for each geographical region (6 regions)
• Build a separate model for each type of merchant (c. 800 merchant types)
• For each distinct cell, establish separate baselines for each metric of interest (declines, etc.): 20,000+ separate baselines in all
• Detect changes from these baselines
[Diagram: data cube with axes for bank, geospatial region, and type of transaction]
Modeling using Cubes of Models (MCM)
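A minimal sketch of the cube-of-models idea, assuming each transaction record carries bank, region, and merchant_type fields plus a decline_rate metric (hypothetical names): every cell of the cube gets its own independently estimated baseline, and cells are discovered lazily as data arrives.

```python
from collections import defaultdict

class Baseline:
    """Running mean/variance (Welford's algorithm) for one cube cell."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
    @property
    def var(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# One model per cell, keyed by the cube coordinates (bank, region, merchant type).
baselines = defaultdict(Baseline)

def train(transactions):
    for t in transactions:
        cell = (t["bank"], t["region"], t["merchant_type"])
        baselines[cell].update(t["decline_rate"])
```

In the deployed system the per-cell model is a PMML baseline model estimated by Augustus rather than a running mean; the point here is only the segmentation by cube cell.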
Greedy Meaningful/Manageable Balancing (GMMB) Algorithm
One model for each cell in the data cube; breakpoints control how finely the cube is split.
• More alerts (alerts more meaningful): to increase alerts, add a breakpoint to split cells, order candidates by number of new alerts, and select one or more new breakpoints
• Fewer alerts (alerts more manageable): to decrease alerts, remove a breakpoint, order candidates by number of alerts removed, and select one or more breakpoints to remove
A greedy sketch of this loop follows.
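The deck describes GMMB only as an add/remove-breakpoint loop, so the following is a speculative Python sketch, not the authors' pseudocode. `count_alerts` is a hypothetical callback that re-scores the stream under a given breakpoint set, and the target band [lo, hi] stands in for the meaningful/manageable trade-off.

```python
def gmmb(active, candidates, count_alerts, lo, hi):
    """Greedily add/remove breakpoints until alert volume lands in [lo, hi]."""
    alerts = count_alerts(active)
    while not (lo <= alerts <= hi):
        if alerts < lo:
            # Too few alerts to be meaningful: add the breakpoint gaining most.
            pool = [b for b in candidates if b not in active]
            moves = [(count_alerts(active | {b}), active | {b}) for b in pool]
            best = max(moves, default=None, key=lambda m: m[0])
            if best is None or best[0] <= alerts:
                break               # no split increases the alert count
        else:
            # Too many alerts to be manageable: remove the one shedding most.
            moves = [(count_alerts(active - {b}), active - {b}) for b in active]
            best = min(moves, default=None, key=lambda m: m[0])
            if best is None or best[0] >= alerts:
                break               # no merge decreases the alert count
        alerts, active = best
    return active

# Toy run: pretend each breakpoint contributes ~50 alerts when added.
chosen = gmmb(set(), {"bank", "region", "mcc", "channel"},
              count_alerts=lambda bps: 50 * len(bps), lo=120, hi=220)
print(sorted(chosen))  # three breakpoints land in the target band
```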
Augustus
• The open source Augustus data mining platform was used to:
  • Estimate baselines for over 15,000 separate segmented models
  • Score high volume operational data and issue alerts for follow-up investigations
• Augustus is PMML compliant
• Augustus scales with:
  • Volume of data (terabytes)
  • Real-time transaction streams (15,000+ per second)
  • Number of segmented models (10,000+)
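For the scoring side, here is a generic sketch that routes each transaction to its segment's baseline and emits an alert on a large deviation, reusing the Baseline class from the earlier sketch. This illustrates segmented scoring in general; it is not Augustus's actual PMML API, and the z-score threshold and min_history cutoff are assumptions.

```python
import math

def score(transactions, baselines, z_thresh=4.0, min_history=30):
    """Route each transaction to its cell's baseline; yield deviation alerts."""
    for t in transactions:
        cell = (t["bank"], t["region"], t["merchant_type"])
        b = baselines.get(cell)
        if b is None or b.n < min_history:   # too little history to judge
            continue
        sd = math.sqrt(b.var) or 1e-9        # guard against zero variance
        z = (t["decline_rate"] - b.mean) / sd
        if abs(z) > z_thresh:
            yield {"cell": cell, "z": z, "txn": t}  # alert for follow-up
```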
Some Results to Date
• The system has been operational for 2.5 years
• ROI:
  • 5.1x in Year 1 (over 6 months)
  • 7.3x in Year 2 (12 months)
  • 10.0x in Year 3 (12 months)
• Currently estimating over 15,000 individual baseline models
• The system has issued alerts for:
  • Merchants using an incorrect Merchant Category Code (MCC), such as account testing
  • Sales channel variations
  • Incorrect use of the merchant city name field
  • Incorrect coding of recurring payments
Summary
• Used a new methodology, Modeling using Cubes of Models (MCM), for modeling large, highly heterogeneous data sets
• This project contributed in part to the development of the Baseline Model in the open source Augustus system
• Integrated a system that generates alerts from Baseline Models with a manual investigation process
• The project is generating over 10x ROI
• Poster #20 in the Tuesday night Poster Session