
Impossibility Mining


clamb

Presentation Transcript


  1. Impossibility Mining

  2. Traditional Data Mining • Using multidimensional data to find previously unknown hidden relationships • Not just simple queries/joins • Canonical: Diapers and Beer at Walmart • Urban Legend – comes from a 1992 Teradata study of Osco. • Correlation != Causation • Terminology currently has negative connotations in the press

  3. Il buono, il brutto, il cattivo • 3 categories of “data mining” for fraud • Profiling (il brutto) • Probability Mining (il cattivo) • Anomaly Detection (il buono)

  4. Profiling • Looking for a series of characteristics which identify a likely problem • Demographic Profiling: • Looking for a series of personal identifiers to determine likely suspects • Example: Corporate data thieves tend to be males between 30 and 40 years of age • Behavior Profiling: • Looking for a series of behaviors which indicate likely suspects • Example: Corporate data thieves are more likely to work weekends, not take vacations, and be generally highly rated

  5. Profiling - Issues • Demographic profiling, no matter how good, will likely end up with you on CNN • Base Rate Fallacy: When real offenders are rare, the profile's accuracy needs to be extraordinarily close to 100% for a population of any size; otherwise false positives swamp the true hits.
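
The base rate fallacy on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: the population size, offender rate, and profile accuracy are hypothetical numbers chosen for the example, not figures from the presentation.

```python
def flagged_breakdown(population, base_rate, true_positive_rate, false_positive_rate):
    """Return (true_hits, false_alarms, precision) for a profile applied to a population."""
    offenders = population * base_rate
    innocents = population - offenders
    true_hits = offenders * true_positive_rate        # offenders the profile catches
    false_alarms = innocents * false_positive_rate    # innocents the profile flags
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision

# Hypothetical: 100,000 employees, 1 in 10,000 is a data thief,
# and the profile is "99% accurate" in both directions.
hits, alarms, precision = flagged_breakdown(100_000, 0.0001, 0.99, 0.01)
print(f"true hits: {hits:.1f}, false alarms: {alarms:.1f}, precision: {precision:.2%}")
```

Under these assumed numbers, about 10 real thieves are flagged alongside roughly 1,000 innocent employees, so fewer than 1% of the people the profile points at are actual offenders.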

  6. Probability Mining • Identifying high probability issues to target • Can be applied to profiling or anomaly detection • Good for sliding thresholds with competing business drivers • Example: Stolen credit cards are more likely to be used at electronics stores for high ticket items. Applied to a particular profile, a plasma TV purchase may have a 10% chance of being fraudulent.

  7. Probability Mining - Issues • Business drivers need to be considered • Is it worth it to bother 10 legitimate credit card holders to find 1 stolen card? What about 100? 1000? • Probability generation requires a lot of data and a pre-labeled dataset to be useful
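
One way to frame the "is it worth it?" question is as a break-even calculation. This is a minimal sketch under assumptions that are not in the slides: a single average loss per fraudulent transaction (`fraud_loss`) and a single average cost of bothering a legitimate customer (`friction_cost`). Both names and all dollar figures are hypothetical.

```python
def worth_flagging(p_fraud, fraud_loss, friction_cost):
    """True when the expected loss avoided exceeds the expected cost of false alarms."""
    expected_saving = p_fraud * fraud_loss
    expected_friction = (1 - p_fraud) * friction_cost
    return expected_saving > expected_friction

def break_even_probability(fraud_loss, friction_cost):
    """Fraud probability at which flagging and not flagging cost the same.

    Solves p * fraud_loss == (1 - p) * friction_cost for p.
    """
    return friction_cost / (fraud_loss + friction_cost)

# Hypothetical: a $2,000 loss per stolen-card purchase, $50 of goodwill lost per false alarm.
threshold = break_even_probability(2000, 50)
print(f"flag when P(fraud) > {threshold:.1%}")
print(worth_flagging(0.10, 2000, 50))  # the slide's 10% plasma TV case
print(worth_flagging(0.01, 2000, 50))
```

Moving either cost shifts the threshold, which is the "sliding threshold with competing business drivers" idea from the previous slide.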

  8. Anomaly Detection • Sesame Street analysis • Relies on finding outliers in data • Does not require a priori expert knowledge of the data • Does require après-analysis expert knowledge to interpret outliers

  9. Case Example: Anomaly Detection • Product launch event - $1.5 Million budget • Launch directors had authority for procurements up to $10,000 • Report received of a “person directing the launch event gave a lot of vendor work to his brother-in-law” • There were ~25 recent launch events that this could refer to, 10 of which were male-directed • Looked at the financials for each launch event

  10. Data

  11. Benford

  12. Anomaly Detection – How we Found ‘em • Benford’s Law • Take a look at both the last and first digits • Distribution is well off predictions • Nearness-to-threshold • Distribution should not be a logarithmic decline from approval threshold • Nothing was over threshold… • Common Sense • Plasma TV Rentals - $10K to rent? Why 2?
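
The first two checks on this slide can be sketched in a few lines. This is not the presenters' actual analysis: it covers first digits only, uses a plain chi-square statistic as the measure of "off predictions", and treats "near threshold" as an assumed 5% band below the approval limit. The sample amounts are invented for illustration.

```python
import math
from collections import Counter

# Benford's expected share of each leading digit d: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(amount):
    """Leading non-zero digit of an amount at cent precision, or 0 if it rounds to zero."""
    digits = f"{abs(amount):.2f}".lstrip("0.")
    return int(digits[0]) if digits else 0

def benford_chi_square(amounts):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    counts = Counter(d for d in map(first_digit, amounts) if d)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

def near_threshold_share(amounts, threshold, band=0.05):
    """Share of amounts that sit just under the approval threshold (assumed 5% band)."""
    lower = threshold * (1 - band)
    return sum(1 for a in amounts if lower <= a < threshold) / len(amounts)

# Hypothetical invoices for one launch event with a $10,000 approval limit.
amounts = [9900.0, 9850.0, 9700.0, 9950.0, 1200.0, 450.0, 9800.0, 300.0]
print(f"chi-square vs Benford: {benford_chi_square(amounts):.1f}")
print(f"share just under limit: {near_threshold_share(amounts, 10_000):.1%}")
```

For reference, the 5% critical value of the chi-square distribution with 8 degrees of freedom is about 15.51. A sample this small is far too short for a real test and is here only to show the mechanics.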

  13. Results • Subject hired their brother-in-law to do phantom consulting • Subject rented plasma TVs with a $1 buyout option

  14. Case Example: Geospatial Anomalies • Problem: Identify web activity that is spurious in nature • Application: Successfully applied to internal user data (activity logs) as well as external data (attacks)

  15. User Data

  16. User Data – Plotted as Anomalies

  17. Outliers – What Were They?

  18. Impossibility Mining • Is NOT data mining • IS an application of control testing • Looks for patterns that cannot exist in any model of reasonable likelihood • Can be single or multifactor • Only identifies real outliers

  19. Impossibility Mining Example – Single Factor • Asset Management • IT Asset Management software installed on all machines in a company • Cataloged installed hardware and software at different points in time • Proactive Look • Identify any computers where installed memory at time T is less than or equal to installed memory at time T-1 • Identified several hundred laptops from remote office users that met the criteria
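
The single-factor check above amounts to comparing two inventory snapshots. A minimal sketch, assuming each snapshot is a mapping from hostname to installed memory in MB (the data layout and host names are hypothetical). It flags strict decreases only, on the assumption that unchanged memory is the normal case; change `<` to `<=` to follow the slide's wording literally.

```python
def memory_decreases(previous, current):
    """Return (host, before_mb, after_mb) for hosts whose installed memory shrank."""
    flagged = []
    for host, before in previous.items():
        after = current.get(host)
        # Hosts missing from the newer snapshot are skipped, not flagged.
        if after is not None and after < before:
            flagged.append((host, before, after))
    return flagged

# Hypothetical snapshots at time T-1 and time T.
snapshot_t_minus_1 = {"laptop-017": 2048, "laptop-042": 2048, "desktop-003": 4096}
snapshot_t = {"laptop-017": 1024, "laptop-042": 2048, "desktop-003": 8192}

for host, before, after in memory_decreases(snapshot_t_minus_1, snapshot_t):
    print(f"{host}: {before} MB -> {after} MB")
```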

  20. Impossibility Mining Example – Single Factor, cont’d • Identified commonality in laptops • All laptops were serviced by the same IT support location • Found the drop in memory was consistent with the last “upgrade” • Reviewed eBay activity of the local IT support personnel • Found the thief, who was removing half of the memory from laptops of non-power users and selling it!

  21. Impossibility Mining – Dual Factor • Electronic Funds Transfer Investigation • Payment Process • Manager takes in payment request and assigns to a clerk • Clerk enters payment information and selects a payee • Manager enters EFT information for the payee and confirms transaction (cannot change amount) • Division Head confirms name on account, amount, and releases funds • Question: Does fraud require collusion?

  22. Impossibility Mining – Dual Factor, cont’d • EFT Audit • Compared actual EFTs for internal consistency • Looked for EFTs where the customer ID was the same, but the bank routing number was different • Identified a manager who was manually changing routing information to funnel to her husband’s account • 3rd set of eyes (Division Head) did not help – ineffective control • Two process changes • Only Division Head can add EFT information • Automated check implemented to ID bank name != routing number
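
The consistency check in the EFT audit (same customer ID, different routing number) can be sketched as a simple grouping. The record layout, field names, and values below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def customers_with_multiple_routings(transfers):
    """Map each customer ID to its routing numbers when more than one was used."""
    routings = defaultdict(set)
    for transfer in transfers:
        routings[transfer["customer_id"]].add(transfer["routing_number"])
    return {cid: sorted(nums) for cid, nums in routings.items() if len(nums) > 1}

# Hypothetical EFT records.
transfers = [
    {"customer_id": "C100", "routing_number": "111000025", "amount": 1200.00},
    {"customer_id": "C100", "routing_number": "111000025", "amount": 980.50},
    {"customer_id": "C200", "routing_number": "222000111", "amount": 400.00},
    {"customer_id": "C200", "routing_number": "999000333", "amount": 7500.00},
]

print(customers_with_multiple_routings(transfers))
```

A customer legitimately changing banks would also show up here, so each hit is a lead for review rather than proof of fraud.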

  23. Impossibility Mining – Data Joining • Unauthorized Computer Access • Created a table of physical sites • Calculated the minimum travel time between sites • Identified anyone logging in to a machine at 2 sites where time between logins < minimum travel time
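
The three steps above can be sketched as a join between a login log and a site-to-site travel table. This is a simplified illustration: site names, travel times, and the log layout are hypothetical, and only consecutive logins per user are compared.

```python
from datetime import datetime, timedelta

def impossible_logins(logins, min_travel):
    """Flag consecutive logins by one user at two sites closer in time than travel allows.

    logins: iterable of (user, site, timestamp)
    min_travel: dict mapping frozenset({site_a, site_b}) to a timedelta
    """
    flagged = []
    last_seen = {}
    for user, site, ts in sorted(logins, key=lambda row: (row[0], row[2])):
        previous = last_seen.get(user)
        if previous is not None:
            prev_site, prev_ts = previous
            needed = min_travel.get(frozenset({prev_site, site}))
            # Same-site pairs and unknown site pairs have no entry and are skipped.
            if needed is not None and ts - prev_ts < needed:
                flagged.append((user, prev_site, site, ts - prev_ts))
        last_seen[user] = (site, ts)
    return flagged

# Hypothetical travel table and login log.
min_travel = {frozenset({"Chicago", "Dallas"}): timedelta(hours=4)}
logins = [
    ("asmith", "Chicago", datetime(2024, 3, 1, 9, 0)),
    ("asmith", "Dallas", datetime(2024, 3, 1, 9, 30)),
    ("bjones", "Chicago", datetime(2024, 3, 1, 8, 0)),
    ("bjones", "Dallas", datetime(2024, 3, 1, 15, 0)),
]

for user, a, b, gap in impossible_logins(logins, min_travel):
    print(f"{user}: {a} -> {b} in {gap}")
```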

  24. Impossibility Mining – Data Joining, cont’d • Identified several stolen passwords • Also highlighted password sharing • … as well as user passwords hard-coded in applications

  25. Impossibility Mining - Conclusions • The less likely something is to occur, the better a candidate it is for impossibility mining • Controls can always be put in place to prevent the “impossibilities”, but they are not always implemented correctly • Best example in the media: Insurance fraud case - men were claiming hysterectomies, ovarian cyst removal, PAP tests…

  26. Questions • …Other than can we go yet?
