Data Mining at ATO2004, Canberra
Warwick Graco, Analytics Project, Change Program, ATO
Outline • Some Key Themes • ATO at a Glance
Data Mining in Government • White-Collar Crime – Dollar Figures Quoted are in the Stratosphere • Sophisticated Frauds and Internal Fraud • Slowness of Regulators
Roles of Data Miner • Role of Analytics • Cost-effectiveness versus Precision with Detection • Medical versus Engineering Model • Inoculate your system against Security and Integrity Attacks • Use of agents for this purpose • Each agent designed to detect a specific breach
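The agent idea above can be sketched as a set of rule-based checks, one per breach type. This is a minimal illustration, not the ATO's design; the breach rules, field names and thresholds here are all hypothetical:

```python
# minimal sketch: one rule-based agent per breach type (rules are hypothetical)
def duplicate_claim_agent(case, seen_ids):
    return case["claim_id"] in seen_ids

def excessive_amount_agent(case, seen_ids):
    return case["amount"] > 10_000

AGENTS = {
    "duplicate claim": duplicate_claim_agent,
    "excessive amount": excessive_amount_agent,
}

def screen(cases):
    seen_ids, alerts = set(), []
    for case in cases:
        for breach, agent in AGENTS.items():
            if agent(case, seen_ids):       # each agent detects one specific breach
                alerts.append((case["claim_id"], breach))
        seen_ids.add(case["claim_id"])
    return alerts

alerts = screen([{"claim_id": 1, "amount": 500},
                 {"claim_id": 1, "amount": 12_000}])
print(alerts)
```

The point of the design is that each agent is small and auditable: adding a new breach type means adding one function, not retraining a monolithic model.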
Some New Points of View • Fraud is found at the edge or boundary of pockets of activity rather than as outliers • [Diagram: outliers versus boundary cases]
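One way to see the difference between the two views is to score the same cases both ways: with an outlier detector (distance from the bulk of the data) and with a classifier's decision boundary (cases the model is least certain about). A sketch on synthetic data; the 20-case shortlist size is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)  # synthetic cases

# outlier view: cases far from the bulk of the data score high
outlier_scores = -IsolationForest(random_state=0).fit(X).score_samples(X)

# boundary view: cases the classifier is least certain about score high
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
boundary_scores = 1 - 2 * np.abs(proba - 0.5)   # 1.0 = sitting on the boundary

top_outliers = set(np.argsort(outlier_scores)[-20:])
top_boundary = set(np.argsort(boundary_scores)[-20:])
overlap = len(top_outliers & top_boundary)
print("cases on both shortlists:", overlap)
```

If fraud hides at the boundary of legitimate pockets of activity, the two shortlists can diverge sharply, which is the practical argument for running both kinds of detector.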
Some New Points of View • False Negatives • Perturbing Classifications • Determine the effects that different proportions of perturbed classifications have on hit rates and, by inference, on miss rates • This is potentially a method for estimating the incidence of fraud and abuse in society eg • Size of the Black Economy • Amount lost to Health Fraud and Abuse • Amount lost to Social Security Fraud and Abuse
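A minimal sketch of the perturbation idea, assuming a labelled training set: flip a known proportion of classifications, retrain, and watch what happens to the hit rate on held-out aberrant cases. All data here is synthetic and the classifier choice is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))              # synthetic cases
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # 1 = aberrant (synthetic rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

hit_rates = {}
for flip in (0.0, 0.1, 0.2):
    y_pert = y_tr.copy()
    idx = rng.choice(len(y_pert), size=int(flip * len(y_pert)), replace=False)
    y_pert[idx] = 1 - y_pert[idx]           # perturb a known proportion of labels
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_pert)
    hit_rates[flip] = clf.predict(X_te)[y_te == 1].mean()
    print(f"flipped {flip:.0%} of classifications -> hit rate {hit_rates[flip]:.2f}")
```

Because the flipped proportion is known exactly, the observed degradation in hit rate gives a calibration curve from which an unknown rate of mislabelled (undetected) fraud might be inferred.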
Taylor-Russell Table • Rows – baseline separating aberrant from acceptable cases • Columns – cutoff used by the classifier • Aberrant Cases: True Positives (above cutoff) | False Negatives (below cutoff) • Acceptable Cases: False Positives (above cutoff) | True Negatives (below cutoff)
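The four cells of the table can be computed directly from risk scores and a cutoff. A sketch on synthetic scores with a hypothetical cutoff of 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)
scores_aberrant = rng.normal(0.7, 0.15, 200)     # synthetic risk scores
scores_acceptable = rng.normal(0.3, 0.15, 800)
cutoff = 0.5                                     # hypothetical classifier cutoff

tp = int((scores_aberrant >= cutoff).sum())      # aberrant, flagged
fn = int((scores_aberrant < cutoff).sum())       # aberrant, missed
fp = int((scores_acceptable >= cutoff).sum())    # acceptable, flagged
tn = int((scores_acceptable < cutoff).sum())     # acceptable, passed
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```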
What Tools are in Stock • Diverse Applications of Data Mining including • Hot Spots Methodology • Tree Stumps • Control Charts • Detection of Outliers • Hardware and Software Developments with Data Mining
Keynote Speakers • Professor Han • Covered Classical Mining and Modelling Methods • Covered Some New Developments eg web mining, stream mining and bioinformatics mining • One example he used was how data mining can be used in software engineering to debug programs
Keynote Speakers • Usama Fayyad • Covered the Lessons Learned from Applying Data Mining in practice • Emphasised the importance of consulting experts and incorporating domain knowledge in the mining process • Discoveries from mining have to be related to experts’ understanding and interpretation of issues • Have to both mine data and model processes to obtain ideal results eg reselling
Keynote Speakers • Usama Fayyad • Emphasised the importance of presenting results in a way that managers understand and relate to eg lifetime value of employees versus churn rates • Covered many of the technical challenges facing the field eg complexity, scalability, validation and the need for firm theoretical foundation
Outline • Efficacy of Models • Recently Developed Models • Local versus Global Models • Features • Identifying Discriminatory Features • Sharing the Magic Few Features • Estimating Miss Rates and Showing Cost/Benefits
Outline • Analytics – achieving synergies • Regulatory Work – need for proactive approach • Mapping the Detection Process • Static versus Dynamic Aspects • Capturing Expertise • Embedding Knowledge
Efficacy of Models • Symbolic • Random Forests • Tree Stumps • MART • Statistical • MARS • Weighted KNN • Biological • Boosted ANNs • GA/ANN
Efficacy of Models • There is hype about the versatility and effectiveness of many of these models • We need to clarify scientifically how well they perform and in what circumstances • They need to be tested in a variety of domains
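Testing these claims scientifically can start with cross-validated comparisons on a common dataset. A sketch comparing three of the model families listed above, on synthetic class-imbalanced data standing in for one fraud domain (a proper study would repeat this across many real domains):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic, class-imbalanced data standing in for one compliance domain
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "boosted trees (MART-style)": GradientBoostingClassifier(random_state=0),
    "weighted kNN": KNeighborsClassifier(weights="distance"),
}
aucs = {}
for name, model in models.items():
    aucs[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {aucs[name]:.3f}")
```

AUC is used here because raw accuracy is misleading at a 9:1 class imbalance; rankings often reorder when the domain or the imbalance changes, which is exactly the point the slide makes.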
Local versus Global Models • Comparisons between those which are narrow and specific and those which are broad and general in focus • The former are important with transaction fraud and abuse and the latter with client profiling • We need to establish how each contributes to detection and how the two can work in tandem • The need to test a medical approach to model development compared to engineered approaches – does the former afford greater protection than the latter against security and integrity attacks?
Discriminatory Features • Feature Selection versus Classification Trade Off with Identifying features • If you have the luxury of many classified cases that represent the important trends in the data, you can use a supervised approach to identify discriminatory features • Examples include filter and wrapper approaches
Discriminatory Features • If you do not have this luxury, one option is to use an unsupervised approach to identify discriminatory features • Examples include taxonomic and clustering methods and anomaly detection • We need to do comparisons to see which methods work best and in what circumstances
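The filter and wrapper approaches mentioned above can be sketched with scikit-learn: a filter scores each feature independently of any classifier, while a wrapper searches for the subset that best serves a particular classifier. Synthetic data; the choice of four features is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=4, random_state=0)

# filter approach: univariate F-test scores, no classifier in the loop
filt = SelectKBest(f_classif, k=4).fit(X, y)
filter_idx = sorted(filt.get_support(indices=True))
print("filter picks features:", filter_idx)

# wrapper approach: recursive feature elimination around a classifier
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
wrapper_idx = sorted(wrap.get_support(indices=True))
print("wrapper picks features:", wrapper_idx)
```

Filters are cheap and classifier-agnostic; wrappers are costlier but can catch features that only matter in combination, which is one axis along which the comparisons the slide calls for could be run.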
Magic Few Features • A small number of highly Discriminatory Features account for most high-risk cases • An example is SPP across age and gender groups with GPs • The remainder provide small contributions • Discriminating Features tend to be locked in vaults and not shared across regulatory agencies
Miss Rates with Detection • Perturbation versus Tessellation • Sensitivity versus Specificity Trade-Off • A major challenge is to establish to what degree discovery and detection technology increases the strike rate above the industry benchmark of 1:10 with identification of security and integrity breaches
Taylor-Russell Table • Rows – baseline separating aberrant from acceptable cases • Columns – cutoff used by the classifier • Aberrant Cases: True Positives (above cutoff) | False Negatives (below cutoff) • Acceptable Cases: False Positives (above cutoff) | True Negatives (below cutoff)
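The sensitivity/specificity trade-off behind the table can be made concrete by sweeping the cutoff over risk scores: raising the cutoff passes more acceptable cases (higher specificity) at the price of missing more aberrant ones (lower sensitivity). A sketch on synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(2)
scores_aberrant = rng.normal(0.7, 0.15, 300)     # synthetic risk scores
scores_acceptable = rng.normal(0.3, 0.15, 700)

sens, spec = [], []
for cutoff in (0.3, 0.5, 0.7):
    sens.append((scores_aberrant >= cutoff).mean())   # share of aberrant cases caught
    spec.append((scores_acceptable < cutoff).mean())  # share of acceptable cases passed
    print(f"cutoff {cutoff}: sensitivity {sens[-1]:.2f}, specificity {spec[-1]:.2f}")
```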
Cost/Benefits • We also need studies to inform us of the cost benefits of using discovery and detection technology • That is, what cost-benefit ratio can we expect from using this technology compared to conventional compliance measures such as telephone tip-offs, random audits, purpose-based audits etc
Cost-Benefit Approach • [Chart: benefit and cost curves, with the point of optimal return marked]
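A sketch of locating that optimal return: sweep the classifier cutoff and pick the point where recovered revenue net of audit costs peaks. The dollar figures and score distributions here are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
scores_aberrant = rng.normal(0.7, 0.15, 200)     # synthetic risk scores
scores_acceptable = rng.normal(0.3, 0.15, 800)

RECOVERY = 5_000    # hypothetical revenue recovered per true positive
AUDIT_COST = 400    # hypothetical cost of auditing any flagged case

best_net, best_cutoff = None, None
for cutoff in np.linspace(0.0, 1.0, 101):
    tp = int((scores_aberrant >= cutoff).sum())
    flagged = tp + int((scores_acceptable >= cutoff).sum())
    net = RECOVERY * tp - AUDIT_COST * flagged   # benefits minus costs
    if best_net is None or net > best_net:
        best_net, best_cutoff = net, cutoff
print(f"optimal cutoff {best_cutoff:.2f}, net benefit {best_net}")
```

The same sweep, run with the real recovery and audit figures of a given agency, is one way to compare detection technology against conventional compliance measures on a common dollar scale.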
Analytics • Analytics is a new field and embraces a variety of disciplines including intelligence, risk analysis, profiling, data matching, discovery (mining) and detection (modelling) work • Major challenge is achieving integration across these disciplines so that they work together and achieve synergies
Analytics • Intelligence should drive compliance activities in terms of where those who do risk analysis, profiling, matching and discovery and detection focus their attention • A major failing with many organisations is that work done is not based on the results of sound intelligence • The decision making is often ad hoc and arbitrary
Regulatory Work • Telephone and Banking work in real time and have to detect security and integrity breaches as they occur • Payment, Insurance and Revenue Collection agencies work in past time and seek restitution after the event • Question: how do we get onto the front foot and become proactive rather than reactive with discovery, detection, treatment and prevention?
Static versus Dynamic View • KDD historically has focused on retrospective and prospective score models • These give a static view of compliance issues – ie a picture of practice at a particular point in time • Fraud and abuse are usually not static but dynamic. They tend to change and sometimes change quickly
Static versus Dynamic View • To identify and track these issues effectively requires an investigative approach • Our current models and approaches are not well suited to tracking changes and continuously adapting to new developments • One illustration of what is implied here is to work with and model the steps, procedures and routines that experts use to solve cases
Role of Expertise • Need to develop procedures/methods for capturing the knowledge, skills and strategies experts employ to identify non-compliance – the smell factor – with cases, and to incorporate these as routines and models in our discovery and detection systems • Examples include the expertise used for • Crime Identification • Feature Selection • Classification of Cases
Embedded Knowledge • The expertise captured can be included as metaknowledge and be linked to security and integrity breaches, features and cases • Knowledge needs to be embedded in discovery and detection processes