Impossibility Mining
Traditional Data Mining • Using multidimensional data to find previously unknown hidden relationships • Not just simple queries/joins • Canonical example: diapers and beer at Walmart • Urban legend: the story actually comes from a 1992 Teradata study of Osco • Correlation != causation • The term currently has negative connotations in the press
Il buono, il brutto, il cattivo (the good, the ugly, the bad) • 3 categories of “data mining” for fraud • Profiling (il brutto, the ugly) • Probability Mining (il cattivo, the bad) • Anomaly Detection (il buono, the good)
Profiling • Looking for a series of characteristics which identify a likely problem • Demographic Profiling: • Looking for a series of personal identifiers to determine likely suspects • Example: Corporate data thieves tend to be males between 30 and 40 years of age • Behavior Profiling: • Looking for a series of behaviors which indicate likely suspects • Example: Corporate data thieves are more likely to work weekends, not take vacations, and be generally highly rated
Profiling - Issues • Demographic profiling, no matter how good, will likely end up with you on CNN • Base Rate Fallacy: when real offenders are rare, the profile's accuracy needs to be extraordinarily close to 100% for a population of any size, or false positives will swamp the true hits.
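To make the base rate fallacy concrete, here is a small worked sketch. All numbers are hypothetical and not from the talk: a company of 10,000 employees with 10 data thieves, and a profile that catches 99% of thieves while wrongly flagging 1% of everyone else.

```python
def flagged_counts(population, offenders, hit_rate, false_alarm_rate):
    """Return (true_positives, false_positives, precision) for a profile.

    hit_rate: share of real offenders the profile flags.
    false_alarm_rate: share of innocent people the profile flags.
    """
    innocents = population - offenders
    true_pos = offenders * hit_rate
    false_pos = innocents * false_alarm_rate
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


# Hypothetical figures, chosen only to illustrate the effect.
tp, fp, precision = flagged_counts(10_000, 10, 0.99, 0.01)
print(f"flagged thieves: {tp:.1f}")
print(f"flagged innocents: {fp:.1f}")
print(f"share of flagged people who are thieves: {precision:.1%}")
```

Even with a profile that sounds 99% accurate, only about 9% of the people it flags in this scenario are actual thieves, because the innocent population is so much larger.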
Probability Mining • Identifying high probability issues to target • Can be applied to profiling or anomaly detection • Good for sliding thresholds with competing business drivers • Example: Stolen credit cards are more likely to be used at electronics stores for high ticket items. Applied to a particular profile, a plasma TV purchase may have a 10% chance of being fraudulent.
Probability Mining - Issues • Business drivers need to be considered • Is it worth bothering 10 legitimate credit card holders to find 1 stolen card? What about 100? 1000? • Generating useful probabilities requires a lot of data and a pre-labeled dataset
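The trade-off in the question above can be sketched as simple arithmetic. If a flagged transaction has probability p of being fraudulent, then on average (1 - p) / p legitimate holders are contacted per stolen card found. The break-even function and its cost figures are assumptions added for illustration, not something from the talk.

```python
def legit_per_fraud(p_fraud):
    """Expected legitimate customers contacted for each fraudulent one found."""
    if not 0 < p_fraud <= 1:
        raise ValueError("p_fraud must be in (0, 1]")
    return (1 - p_fraud) / p_fraud


def break_even_probability(cost_per_false_alarm, loss_per_missed_fraud):
    """Fraud probability above which intervening pays off in expectation.

    Intervene when p * loss > (1 - p) * cost, i.e. p > cost / (cost + loss).
    """
    return cost_per_false_alarm / (cost_per_false_alarm + loss_per_missed_fraud)


for p in (0.10, 0.01, 0.001):
    print(f"p={p}: about {legit_per_fraud(p):.0f} legitimate holders per stolen card")

# Hypothetical costs: $20 of goodwill per false alarm, $2,000 per missed fraud.
threshold = break_even_probability(cost_per_false_alarm=20, loss_per_missed_fraud=2000)
print(f"break-even fraud probability: {threshold:.4f}")
```

With a 10% fraud probability, roughly 9 legitimate holders are bothered for every stolen card found; where the acceptable ratio sits is a business decision, which is the point of the slide.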
Anomaly Detection • Sesame Street analysis • Relies on finding outliers in data • Does not require a priori expert knowledge of the data • Does require après-analysis expert knowledge to interpret outliers
Case Example: Anomaly Detection • Product launch event - $1.5 Million budget • Launch directors had authority for procurements up to $10,000 • Report received that a “person directing the launch event gave a lot of vendor work to his brother-in-law” • There were ~25 recent launch events that this could refer to, 10 of which were male-directed • Looked at the financials for each launch event
Anomaly Detection – How we Found ‘em • Benford’s Law • Take a look at both the last and first digits • Distribution was well off predictions • Nearness-to-threshold • Amounts should not pile up just under the approval threshold and decline logarithmically away from it • Nothing was over threshold… • Common Sense • Plasma TV Rentals - $10K to rent? Why 2?
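A minimal sketch of the first two checks, assuming a flat list of procurement amounts. The sample amounts are made up for illustration and are not the case data; only the first-digit half of the Benford check is shown, and the 5% band under the threshold is an arbitrary choice.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Share of values Benford's Law predicts to start with `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def first_digit_shares(amounts):
    """Map each digit 1-9 to (observed share, Benford share)."""
    # Scientific notation puts the leading non-zero digit first, e.g. "9.8e+03".
    digits = [int(f"{a:e}"[0]) for a in amounts if a > 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: (counts[d] / total, benford_expected(d)) for d in range(1, 10)}


def share_near_threshold(amounts, threshold, band=0.05):
    """Fraction of amounts that fall within `band` below the threshold."""
    lower = threshold * (1 - band)
    near = [a for a in amounts if lower <= a < threshold]
    return len(near) / len(amounts)


# Made-up procurement amounts for one event; approval threshold is $10,000.
amounts = [9800, 9950, 9900, 1200, 350, 9750, 9990, 2400, 9600, 180]

for digit, (observed, expected) in first_digit_shares(amounts).items():
    print(f"digit {digit}: observed {observed:.2f}, Benford {expected:.2f}")
print(f"share just under threshold: {share_near_threshold(amounts, 10_000):.2f}")
```

In this toy data, most amounts start with 9 and sit just under the $10,000 limit, which is the kind of pattern both checks are meant to surface. Interpreting it still takes someone who knows the business.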
Results • Subject hired their brother-in-law to do phantom consulting • Subject rented plasma TVs with a $1 buyout option
Case Example: Geospatial Anomalies • Problem: Identify web activity that is spurious in nature • Application: Successfully applied to internal user data (activity logs) as well as external data (attacks)
Impossibility Mining • Is NOT data mining • IS an application of control testing • Looks for patterns that cannot exist in any model of reasonable likelihood • Can be single or multifactor • Only identifies real outliers
Impossibility Mining Example – Single Factor • Asset Management • IT Asset Management software installed on all machines in a company • Cataloged installed hardware and software at different points in time • Proactive Look • Identify any computers where installed memory at time T is less than at time T-1 • Identified several hundred laptops from remote office users that met the criteria
Impossibility Mining Example – Single Factor, cont’d • Identified commonality in laptops • All laptops were serviced by the same IT support location • Found the drop in memory was consistent with the last “upgrade” • Reviewed eBay activity of the local IT support personnel • Found the thief, who was removing half of the memory from laptops of non-power users and selling it!
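The single-factor check can be sketched as a comparison of consecutive inventory snapshots per machine. The record layout (machine ID, snapshot date, memory in MB) and the sample rows are assumptions for illustration. A strict decrease is used here, since unchanged memory is the normal case.

```python
from collections import defaultdict


def memory_drops(snapshots):
    """Return (machine_id, earlier, later) wherever installed memory decreased.

    `snapshots` is an iterable of (machine_id, snapshot_date, memory_mb)
    tuples; dates must sort chronologically (ISO strings work).
    """
    by_machine = defaultdict(list)
    for machine_id, snapshot_date, memory_mb in snapshots:
        by_machine[machine_id].append((snapshot_date, memory_mb))

    drops = []
    for machine_id, history in by_machine.items():
        history.sort()  # chronological order
        # Compare each snapshot (time T) with the one before it (time T-1).
        for earlier, later in zip(history, history[1:]):
            if later[1] < earlier[1]:
                drops.append((machine_id, earlier, later))
    return drops


# Hypothetical inventory records.
snapshots = [
    ("laptop-001", "2024-01-01", 2048),
    ("laptop-001", "2024-04-01", 1024),  # memory halved: should not happen
    ("laptop-002", "2024-01-01", 2048),
    ("laptop-002", "2024-04-01", 2048),
    ("laptop-003", "2024-01-01", 1024),
    ("laptop-003", "2024-04-01", 2048),  # a genuine upgrade
]

for machine_id, earlier, later in memory_drops(snapshots):
    print(machine_id, earlier, "->", later)
```

The flagged machines are only the starting point; in the case above, the real finding came from grouping them by support location.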
Impossibility Mining – Dual Factor • Electronic Funds Transfer Investigation • Payment Process • Manager takes in payment request and assigns to a clerk • Clerk enters payment information and selects a payee • Manager enters EFT information for the payee and confirms transaction (cannot change amount) • Division Head confirms name on account, amount, and releases funds • Question: Does fraud require collusion?
Impossibility Mining – Dual Factor, cont’d • EFT Audit • Compared actual EFTs for internal consistency • Looked for EFTs where the customer ID was the same, but the bank routing number was different • Identified a manager who was manually changing routing information to funnel payments to her husband’s account • 3rd set of eyes (Division Head) did not help – ineffective control • Two process changes • Only Division Head can add EFT information • Automated check implemented to flag a bank name that does not match the routing number
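The audit query described above can be sketched as a grouping of transfers by customer ID. The field names and the placeholder routing values are hypothetical; a real extract would come from the payment system.

```python
from collections import defaultdict


def inconsistent_routing(transfers):
    """Return {customer_id: routing numbers} for customers paid via more than one.

    `transfers` is an iterable of dicts with "customer_id" and
    "routing_number" keys (assumed field names).
    """
    routes = defaultdict(set)
    for transfer in transfers:
        routes[transfer["customer_id"]].add(transfer["routing_number"])
    return {cid: nums for cid, nums in routes.items() if len(nums) > 1}


# Hypothetical EFT records with placeholder routing values.
transfers = [
    {"customer_id": "C100", "routing_number": "ROUTING-A", "amount": 1200.00},
    {"customer_id": "C100", "routing_number": "ROUTING-A", "amount": 950.00},
    {"customer_id": "C200", "routing_number": "ROUTING-B", "amount": 400.00},
    {"customer_id": "C200", "routing_number": "ROUTING-Z", "amount": 400.00},
]

for customer_id, routing_numbers in inconsistent_routing(transfers).items():
    print(customer_id, sorted(routing_numbers))
```

A customer who legitimately changes banks will also show up here, so each hit needs a quick review rather than an automatic conclusion.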
Impossibility Mining – Data Joining • Unauthorized Computer Access • Created a table of physical sites • Calculated the minimum travel time between sites • Identified anyone logging in to a machine at 2 sites where time between logins < minimum travel time
Impossibility Mining – Data Joining, cont’d • Identified several stolen passwords • Also highlighted password sharing • … as well as user passwords hard-coded in applications
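The data-joining check can be sketched as a join of login events against a table of minimum travel times between sites. Site names, travel times, and login records are all hypothetical, and for brevity this sketch compares only consecutive logins per user.

```python
from datetime import datetime, timedelta


def impossible_logins(logins, min_travel):
    """Return (user, earlier, later) pairs that beat the minimum travel time.

    `logins` is an iterable of (user, site, datetime) tuples.
    `min_travel` maps frozenset({site_a, site_b}) to a timedelta.
    """
    by_user = {}
    for user, site, when in logins:
        by_user.setdefault(user, []).append((when, site))

    hits = []
    for user, events in by_user.items():
        events.sort()  # chronological order
        for (t1, s1), (t2, s2) in zip(events, events[1:]):
            if s1 == s2:
                continue
            needed = min_travel.get(frozenset({s1, s2}))
            if needed is not None and t2 - t1 < needed:
                hits.append((user, (s1, t1), (s2, t2)))
    return hits


# Hypothetical minimum travel time between two sites.
min_travel = {frozenset({"HQ", "Branch"}): timedelta(hours=3)}

logins = [
    ("alice", "HQ", datetime(2024, 1, 8, 9, 0)),
    ("alice", "Branch", datetime(2024, 1, 8, 9, 30)),  # 30 minutes later
    ("bob", "HQ", datetime(2024, 1, 8, 9, 0)),
    ("bob", "Branch", datetime(2024, 1, 8, 14, 0)),  # 5 hours later
]

for user, earlier, later in impossible_logins(logins, min_travel):
    print(user, earlier, "->", later)
```

A hit means the same account was used in two places at once, not necessarily that it was stolen, which is why the results above also surfaced shared and hard-coded passwords.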
Impossibility Mining - Conclusions • The less likely something is to occur, the better a candidate it is for impossibility mining • Controls can always be put in place to prevent the “impossibilities”, but they are not always implemented correctly • Best example in the media: Insurance fraud case - men were claiming hysterectomies, ovarian cyst removal, PAP tests…
Questions • …Other than can we go yet?