Data Mining at ATO 2004 Canberra

Explore the latest advancements in data mining for fraud detection, including key themes, use of analytics, and tools in stock. Learn from expert keynote speakers and understand the efficacy of different models in various domains.

Presentation Transcript


  1. Data Mining at ATO 2004 Canberra • Warwick Graco, Analytics Project, Change Program, ATO

  2. Outline • Some key Themes • ATO at a glance

  3. Data Mining in Government • White-Collar Crime – Dollar Figures Quoted are in the Stratosphere • Sophisticated Frauds and Internal Fraud • Slowness of Regulators

  4. Roles of Data Miner • Role of Analytics • Cost-effectiveness versus Precision with Detection • Medical versus Engineering Model • Inoculate your system against Security and Integrity Attacks • Use of agents for this purpose • Each agent designed to detect a specific breach
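
A minimal sketch of the agent-per-breach idea described above, assuming a hypothetical registry of small detector functions and made-up case fields (duplicate_count, refund, income, access_hour) that are not from the deck:

```python
# Sketch: one small "agent" per breach type, each testing a case record for
# a single kind of security or integrity breach. Fields and rules are
# illustrative assumptions only.
from typing import Callable, Dict, List

Case = Dict[str, float]
Agent = Callable[[Case], bool]

AGENTS: Dict[str, Agent] = {
    "duplicate_claim": lambda c: c.get("duplicate_count", 0) > 0,
    "excessive_refund": lambda c: c.get("refund", 0) > 10 * c.get("income", 1),
    "after_hours_access": lambda c: c.get("access_hour", 12) not in range(7, 20),
}

def screen(case: Case) -> List[str]:
    """Return the names of every agent whose breach test fires on this case."""
    return [name for name, agent in AGENTS.items() if agent(case)]

print(screen({"refund": 50000, "income": 3000, "access_hour": 2}))
```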

  5. Some New Points of View • Fraud found at the edge or boundary of pockets of activity rather than being outliers • [Diagram contrasting outliers with boundary cases]

  6. Some New Points of View • False Negatives • Perturbing Classifications • Determine the effects that different proportions of perturbed classifications have on hit rates and, by inference, on miss rates • This is potentially a method for estimating the incidence of fraud and abuse in society eg • Size of Black Economy • Amount lost to Health Fraud and Abuse • Amount lost to Social Security Fraud and Abuse
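
A minimal sketch of one reading of the perturbation idea: flip a known fraction of acceptable labels to aberrant, refit, and watch the hit rate move. The scikit-learn classifier and synthetic data are illustrative assumptions:

```python
# Flip a known proportion of "acceptable" training labels to "aberrant" and
# observe the effect on the hit rate for known aberrant test cases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for flip_rate in (0.0, 0.01, 0.02, 0.05):
    y_pert = y_tr.copy()
    negatives = np.where(y_pert == 0)[0]
    n_flip = int(flip_rate * len(negatives))
    flipped = np.random.default_rng(0).choice(negatives, size=n_flip, replace=False)
    y_pert[flipped] = 1                               # inject perturbed classifications
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_pert)
    hit_rate = clf.predict(X_te)[y_te == 1].mean()    # recovery of true aberrant cases
    print(f"perturbed {flip_rate:.0%} of labels -> hit rate {hit_rate:.2f}")
```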

  7. Taylor-Russell Table • Rows split by the baseline separating aberrant from acceptable cases; columns split by the cutoff used by the classifier
     Aberrant cases:   True Positives (flagged)  | False Negatives (not flagged)
     Acceptable cases: False Positives (flagged) | True Negatives (not flagged)
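
A minimal sketch mapping classifier output onto the table's four cells, assuming scikit-learn's confusion_matrix and toy labels:

```python
# Count the Taylor-Russell cells for a binary classifier decision.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = aberrant case, 0 = acceptable case
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # 1 = flagged by the classifier's cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"true positives={tp}  false negatives={fn}")
print(f"false positives={fp} true negatives={tn}")
print(f"hit rate (sensitivity) = {tp / (tp + fn):.2f}")
```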

  8. What tools are in stock • Diverse Applications of Data Mining including • Hot Spots Methodology • Tree Stumps • Control Charts • Detection of Outliers • Hardware and Software Developments with Data Mining
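
A minimal sketch of one tool from this list, a control chart used for outlier detection; the synthetic claim amounts and the 3-sigma limits are illustrative assumptions:

```python
# Shewhart-style control chart: flag observations outside mean +/- 3 sigma.
import numpy as np

rng = np.random.default_rng(42)
claims = rng.normal(loc=100.0, scale=15.0, size=200)
claims[[20, 75]] = [300.0, 260.0]            # inject two aberrant amounts

centre, sigma = claims.mean(), claims.std()
upper, lower = centre + 3 * sigma, centre - 3 * sigma
outliers = np.where((claims > upper) | (claims < lower))[0]
print("control limits:", round(lower, 1), round(upper, 1))
print("flagged cases:", outliers)
```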

  9. Keynote Speakers • Professor Han • Covered Classical Mining and Modelling Methods • Covered Some New Developments eg web mining, stream mining and bioinformatics mining • One example he used was how data mining can be used in software engineering to debug programs

  10. Keynote Speakers • Usama Fayyad • Covered the Lessons Learned from Applying Data Mining in practice • Emphasised the importance of consulting experts and incorporating domain knowledge in the mining process • Discoveries from mining have to be related to experts' understanding and interpretation of issues • Have to both mine data and model processes to obtain ideal results eg reselling

  11. Keynote Speakers • Usama Fayyad • Emphasised the importance of presenting results in a way that managers understand and relate to eg lifetime value of employees versus churn rates • Covered many of the technical challenges facing the field eg complexity, scalability, validation and the need for a firm theoretical foundation

  12. Outline • Efficacy of Models • Recently Developed Models • Local versus Global Models • Features • Identifying Discriminatory Features • Sharing the Magic Few Features • Estimating Miss Rates and Showing Cost/Benefits

  13. Outline • Analytics – achieving synergies • Regulatory Work – need for proactive approach • Mapping the Detection Process • Static versus Dynamic Aspects • Capturing Expertise • Embedding Knowledge

  14. Efficacy of Models • Symbolic: Random Forests, Tree Stumps, MART • Statistical: MARS, Weighted KNN • Biological: Boosted ANNs, GA/ANN

  15. Efficacy of Models • There is hype about the versatility and effectiveness of many of these models • We need to clarify scientifically how well they perform and in what circumstances • They need to be tested in a variety of domains
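
One way such testing could look in practice is a cross-validated benchmark of several of the listed model families on the same data; the scikit-learn analogues (RandomForest, GradientBoosting standing in for MART, distance-weighted KNN) and the synthetic data are assumptions for illustration:

```python
# Compare several model families on one dataset with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
models = {
    "random forest": RandomForestClassifier(random_state=0),
    "MART-style boosting": GradientBoostingClassifier(random_state=0),
    "weighted KNN": KNeighborsClassifier(weights="distance"),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:22s} AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```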

  16. Local versus Global Models • Comparisons between those which are narrow and specific and those which are broad and general in focus • The former are important with transaction fraud and abuse and the latter with client profiling • We need to establish how each contributes to detection and how they can work in tandem • The need to test a medical approach to model development compared to engineered approaches – does the former afford greater protection than the latter against security and integrity attacks

  17. Discriminatory Features • Feature Selection versus Classification Trade Off with Identifying features • If you have the luxury of many classified cases that represent the important trends in the data, you can use a supervised approach to identify discriminatory features • Examples include filter and wrapper approaches
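
A minimal sketch of the two supervised routes named above, a filter (univariate scoring) and a wrapper (recursive feature elimination around a classifier), assuming scikit-learn and synthetic data:

```python
# Filter versus wrapper feature selection on labelled cases.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           random_state=0)

# Filter: rank features by an ANOVA F-score, independent of any model
filt = SelectKBest(f_classif, k=5).fit(X, y)
print("filter picks:", sorted(filt.get_support(indices=True)))

# Wrapper: let a classifier drive the selection by eliminating features
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("wrapper picks:", sorted(wrap.get_support(indices=True)))
```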

  18. Discriminatory Features • If you do not have this luxury, one option is to use an unsupervised approach to identify discriminatory features • Examples include taxonomic and clustering methods and anomaly detection • We need to do comparisons to see which methods work best and in what circumstances
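
A minimal sketch of the unsupervised route: cluster unlabelled cases, flag anomalies, and inspect which features separate the flagged group. KMeans, IsolationForest and the synthetic data are illustrative assumptions:

```python
# Unsupervised clustering and anomaly detection as a feature-discovery aid.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=1000, centers=3, n_features=6, random_state=0)
X[:10, 0] += 15.0     # shift a few cases along feature 0 to mimic aberrant behaviour

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
anomalous = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1

# Features whose means differ most between flagged and unflagged cases are
# candidates for discriminatory features worth a closer look.
gap = np.abs(X[anomalous].mean(axis=0) - X[~anomalous].mean(axis=0))
print("cluster sizes:", np.bincount(labels))
print("candidate discriminatory features:", np.argsort(gap)[::-1][:3])
```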

  19. Magic Few Features • A small number of highly Discriminatory Features account for most high-risk cases • An example is SPP across age and gender groups with GPs • The remainder provide small contributions • Discriminating Features tend to be locked in vaults and not shared across regulatory agencies
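
A minimal sketch of how the "magic few" pattern can be checked: rank features by importance and see what share of the total the top handful carries. Random forest importances and synthetic data are assumptions:

```python
# Measure how concentrated feature importance is in the top few features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=4,
                           random_state=0)
imp = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
ranked = np.sort(imp)[::-1]
cumulative = np.cumsum(ranked) / ranked.sum()
print("share of importance carried by the top 5 features:",
      round(cumulative[4], 2))
```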

  20. Miss Rates with Detection • Perturbation versus Tessellation • Sensitivity versus Specificity Trade Off • A major challenge is to establish to what degree discovery and detection technology increases the strike rate above the industry benchmark of 1:10 with identification of security and integrity breaches
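
A minimal sketch of the cutoff trade-off and the resulting strike rate, for comparison against the 1:10 benchmark; the scores, labels and cutoffs are illustrative assumptions:

```python
# Sweep the classifier cutoff and report sensitivity, specificity and strike rate.
import numpy as np

rng = np.random.default_rng(1)
y = rng.random(10000) < 0.02                      # ~2% of cases truly aberrant
scores = np.where(y, rng.normal(0.7, 0.15, y.size),
                     rng.normal(0.4, 0.15, y.size))

for cutoff in (0.5, 0.6, 0.7, 0.8):
    flagged = scores >= cutoff
    sensitivity = (flagged & y).sum() / y.sum()
    specificity = (~flagged & ~y).sum() / (~y).sum()
    strike = (flagged & y).sum() / max(flagged.sum(), 1)   # hits per case flagged
    print(f"cutoff={cutoff:.1f} sens={sensitivity:.2f} "
          f"spec={specificity:.2f} strike rate={strike:.2f}")
```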

  21. Taylor-Russell Table • Rows split by the baseline separating aberrant from acceptable cases; columns split by the cutoff used by the classifier
     Aberrant cases:   True Positives (flagged)  | False Negatives (not flagged)
     Acceptable cases: False Positives (flagged) | True Negatives (not flagged)

  22. Cost/Benefits • We also need studies to inform us of the cost benefits of using discovery and detection technology • That is, what cost-benefit ratio can we expect from using this technology compared with conventional compliance measures such as telephone tip-offs, random audits, purpose-based audits etc

  23. Cost-Benefit Approach • [Chart of Benefits versus Costs, marking the point of Optimal Return]

  24. Analytics • Analytics is a new field and embraces a variety of disciplines including intelligence, risk analysis, profiling, data matching, discovery (mining) and detection (modelling) work • Major challenge is achieving integration across these disciplines so that they work together and achieve synergies

  25. Analytics • Intelligence should drive compliance activities in terms of where those who do risk analysis, profiling, matching and discovery and detection focus their attention • A major failing with many organisations is that work done is not based on the results of sound intelligence • The decision making is often ad hoc and arbitrary

  26. Regulatory Work • Telephone and Banking work in real time and have to detect security and integrity breaches as they occur • Payment, Insurance and Revenue Collection Agencies work retrospectively and seek restitution after the event • Question: how do we get onto the front foot and become proactive rather than reactive with discovery, detection, treatment and prevention

  27. Static versus Dynamic View • KDD historically has focused on retrospective and prospective score models • These give a static view of compliance issues – ie a picture of practice at a particular point in time • Fraud and abuse are usually not static but dynamic. They tend to change and sometimes change quickly

  28. Static versus Dynamic View • To identify and track these issues effectively requires an investigative approach • Our current models and approaches are not well suited to tracking changes and continuously adapting to new developments • One illustration of what is implied here is to work with, and model, the steps, procedures and routines that experts use to solve cases

  29. Role of Expertise • Need to develop procedures/methods for capturing the knowledge, skills and strategies experts employ to identify non-compliance, or the 'smell factor' with cases, and to incorporate these as routines and models in our discovery and detection systems • Examples include the expertise used for • Crime Identification • Feature Selection • Classification of Cases

  30. Embedded Knowledge • The expertise captured can be included as metaknowledge and be linked to security and integrity breaches, features and cases • Knowledge needs to be embedded in discovery and detection processes
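
A minimal sketch of how such metaknowledge might be represented and linked to breaches, features and cases; the dataclass fields and the example rule are hypothetical, not an ATO schema:

```python
# Represent a captured expert heuristic as a metaknowledge record linked to
# breach types, discriminatory features and example cases.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetaKnowledge:
    expert_rule: str                  # the captured heuristic, in plain words
    breach_types: List[str]           # security/integrity breaches it signals
    features: List[str]               # discriminatory features it relies on
    example_cases: List[str] = field(default_factory=list)

rule = MetaKnowledge(
    expert_rule="High servicing rate concentrated in one age/gender group",
    breach_types=["over-servicing"],
    features=["services_per_patient", "age_band", "gender"],
    example_cases=["case-0412"],
)
print(rule.breach_types, rule.features)
```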
