Analytics: Data Mining for Risk and Compliance

31 January 2005 • 4 June 2009 • Internal ATO – Various • Analytics, Office of the Chief Knowledge Officer • Version 1.0

Overview • Analytics and the Data Mining Process • Exploring Data • Supervised Modelling • Unsupervised Modelling • Data Matching • Analytics Project Achievements

Analytics and the Data Mining Process The Shape and Form of a Data Mining Project

Analytics • Sits under the Office of the Chief Knowledge Officer and is part of the EST sub-plan • Established as a national capability in 2003 • The team has been built up to 19 data mining specialists, the largest data mining team in Australia • Works with up to 60 analysts throughout the organisation to spread the new technology and provide an overarching framework for risk management for the ATO • The national team works closely with Business Lines both to deliver new risk models and to transfer skills and technology • The Analytics Community of Practice meets weekly to share experiences and technology, and to peer review modelling across the ATO

Analytics Functions • Deploy data mining: working with business lines to deliver new risk models; improved strike rates and more efficient use of limited resources • Analytics Community of Practice: weekly meetings and mailing lists to share experiences and to introduce new technologies • AnalyticsNet infrastructure: new 64-bit hardware (32GB memory) so that our large datasets can be analysed in memory; sharing of new tools and technology • Analytics training: beginning a series of courses introducing data mining; a hands-on approach in which participants kick-start with their own data

Analytics and Traditional Modelling • Analytics brings a different, but complementary and advanced, approach to modelling and predicting client behaviour. • Traditional modelling approaches explore client data and couple this with an understanding of financial processes to build mathematical models that simulate these processes, and then identify non-compliance with the models. • Analytics, using data mining technology, supplements traditional modelling by modelling from the data: powerful tools automatically search for interesting, unusual or unexpected patterns that indicate non-compliance. This is a data driven approach.

Data driven approach • Crucial to have the right data: clean, relevant, and available before the event • A data mining project is a joint process between the business experts and the data miners, covering the business problem, the business processes and the data

CRISP-DM: The Data Mining Process 1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modelling 5. Evaluation 6. Deployment Source: http://www.crisp-dm.org/Process/index.htm

Applying results of data mining… Four applications, in increasing degree of sophistication: 1. Apply New Risk Segmentation: instead of using $ value or market segment as a proxy for risk, identify the actual group and its characteristics. Create new language and awareness of risk. 2. Tune Screening Rules: adjust screening rules (thresholds, ratios, exceptions) to reflect a better understanding of risk. Look at adjusting and combining rules. Can be applied straight away. 3. Optimise a Treatment Strategy: find the optimal point to maximise revenue collection while minimising caseload and occurrence of fraud. Apply risk scores to case selection to get the best overall outcomes. 4. Optimise Treatment Portfolio: find the optimal point to maximise revenue collection while minimising caseload and occurrence of fraud, for the whole of the treatment portfolio. Optimise the treatment mix. Optimisation is more than picking the right clients – the right treatment and right work mix also need to be optimised…
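The "optimal point" in a treatment strategy can be pictured with a small sketch. It assumes each case carries a model risk score and an expected revenue if treated, and that every case costs the same to work. The function name, the cost figure and the sample cases are all hypothetical, not taken from any ATO model.

```python
def best_cutoff(cases, cost_per_case):
    """Return (n_cases, net_return) for the most profitable top-n selection.

    cases: list of (risk_score, expected_revenue) tuples (hypothetical inputs).
    Cases are worked in descending score order; we keep the n that
    maximises expected revenue minus treatment cost.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    best_n, best_net, running = 0, 0.0, 0.0
    for n, (_, revenue) in enumerate(ranked, start=1):
        running += revenue - cost_per_case
        if running > best_net:
            best_n, best_net = n, running
    return best_n, best_net


# Hypothetical cases: (risk score, expected revenue in dollars)
cases = [(0.9, 5000), (0.8, 3000), (0.6, 1200), (0.4, 400), (0.2, 100)]
n, net = best_cutoff(cases, cost_per_case=1000)
print(n, net)
```

Working further down the ranked list adds caseload faster than it adds revenue, which is why the selection stops at a cutoff rather than treating every case.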

Client Scoring for treatment selection… So we can personalise our treatment strategies to the client • A decision tree of rules derived from data assigns scores • [Diagram labels on the tree: Letter X Letter Y Treatment – Audit Call Treatment – Review] • In fact scores are likely to be produced by several models ‘voting’ together: ensembles.
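The idea of several models "voting" together can be sketched as a simple majority vote. The three rule-based models, their thresholds and the client fields below are hypothetical stand-ins. A real ensemble would combine fitted models and might weight or average their scores instead.

```python
from collections import Counter


def majority_vote(models, client):
    """Return the treatment chosen by most models (ties go to the first seen)."""
    votes = [model(client) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical rule-based models, each recommending a treatment
def by_debt(c):
    return "audit" if c["debt"] > 50000 else "letter"


def by_late(c):
    return "audit" if c["late_lodgments"] >= 3 else "letter"


def by_ratio(c):
    return "audit" if c["expense_ratio"] > 0.9 else "letter"


client = {"debt": 80000, "late_lodgments": 1, "expense_ratio": 0.95}
treatment = majority_vote([by_debt, by_late, by_ratio], client)
print(treatment)
```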

Moving Forward with Analytics • The low hanging fruit for Data Mining is the large collection of outcomes from audit activity – this has been a primary focus in the first instance. • It is a more difficult data mining task to identify emerging risks, but technology for identifying emergent patterns is becoming available. • Text mining and social network analysis will significantly enhance our Intelligence and Risk Modelling capabilities. • Deployment of Analytics through Operational Analytics • How best to deploy Analytics Models – new territory • Translate models to SQL or leave in native language (R, SAS, Java)? • Computational requirements of SQL over the Data Warehouse
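To make the "translate models to SQL" option concrete, here is a minimal sketch that renders a small decision tree as a SQL CASE expression. The tree encoding, the column names (`late_lodgments`, `debt`) and the table name `clients` are assumptions for illustration only. They do not describe any actual ATO model or warehouse schema.

```python
def tree_to_sql(node):
    """Render a nested decision tree as a SQL CASE expression.

    A leaf is a number (the score). An internal node is a tuple
    (column, threshold, left, right); the left branch applies when
    column <= threshold, otherwise the right branch applies.
    """
    if not isinstance(node, tuple):
        return str(node)
    column, threshold, left, right = node
    return (f"CASE WHEN {column} <= {threshold} "
            f"THEN {tree_to_sql(left)} ELSE {tree_to_sql(right)} END")


# Hypothetical two-level tree producing a risk score
tree = ("late_lodgments", 2, 0.1, ("debt", 50000, 0.4, 0.9))
sql = f"SELECT client_id, {tree_to_sql(tree)} AS risk_score FROM clients"
print(sql)
```

A tree translates cleanly because each path is a chain of comparisons. Ensembles of many trees produce much larger expressions, which is where the computational cost of SQL over the warehouse becomes a real consideration.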

Supervised Modelling Working From What We Know To Build Models To Automate “Case Selection”

Supervised modelling • predict some value or outcome having seen a number of training examples • training data will have a ‘target’ variable • prediction can be a continuous variable, or a class • model ‘learns’ from training data, and is tested on ‘unseen’ cases
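The train-then-test cycle described above can be sketched with the simplest possible classifier, a one-variable threshold (a "decision stump"). The variable, the data values and the function names are hypothetical. Any of the techniques on the "New Technologies" slide would follow the same fit-on-training, score-on-unseen pattern.

```python
def fit_stump(rows):
    """Learn the threshold on x that best separates a 0/1 target.

    rows: list of (x, target). The model predicts 1 when x > threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in rows}):
        correct = sum((x > threshold) == bool(t) for x, t in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, rows):
    """Share of rows where the stump's prediction matches the target."""
    return sum((x > threshold) == bool(t) for x, t in rows) / len(rows)


# Hypothetical data: x = expense-to-income ratio, target = 1 if an audit found an adjustment
train = [(0.2, 0), (0.3, 0), (0.5, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
test = [(0.25, 0), (0.6, 1), (0.85, 1), (0.4, 0)]

threshold = fit_stump(train)             # 'learns' from training data
test_accuracy = accuracy(threshold, test)  # tested on 'unseen' cases
print(threshold, test_accuracy)
```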

Effect of Adding More Data – Data is Fundamental [Figure series: Base Data; Client History]

New Technologies • Regression • Decision Trees • Random Forests • Boosted Trees • Support Vector Machines • Neural Networks

Unsupervised Modelling A Data Driven Approach to Identifying – Exploring – Understanding Client Groups

Unsupervised modelling • A class of problems in which one seeks to determine how the data are organised • Distinct from supervised modelling in that the data have no ‘target’ variable • Seek to summarise and explain key features of the data.

Cluster Analysis • Seeks to identify homogeneous subgroups in a population • Establishes groups and then analyses group membership • Discovers structures in data without explaining why they exist • Mostly used when we have no a priori hypotheses and are still in the exploratory phase of our research • Used to classify large amounts of information into manageable, meaningful piles
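As one illustration of cluster analysis, the sketch below implements k-means in plain Python. The slides do not say which clustering method was used, so k-means is an assumption here. The deterministic farthest-point initialisation is also a simplification chosen to keep the example reproducible, and the sample points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centres, groups)."""
    # Farthest-point initialisation: start from the first point, then
    # repeatedly add the point farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    for _ in range(iterations):
        # Assign every point to its nearest centre
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty)
        centres = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres, groups


# Hypothetical two-variable client data with two obvious subgroups
low = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
high = [(8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centres, groups = kmeans(low + high, k=2)
print(groups)
```

The algorithm finds the two groups without being told what distinguishes them, which matches the point above that clustering discovers structure without explaining why it exists.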

Omitted Income – outlier detection [Figure: plot with four points labelled as outliers]
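The slide does not state how outliers were detected. As one common approach, the sketch below flags values using a modified z-score based on the median and the median absolute deviation (MAD), with the conventional 3.5 cutoff. The function name and the income figures are hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, which is less
    distorted by the outliers themselves than a mean-based z-score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical declared-income figures (in $000) for similar clients
incomes = [50, 52, 49, 51, 50, 48, 53, 120]
flagged = mad_outliers(incomes)
print(flagged)
```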

Self Organising Maps (SOM) • A self-organising map is a special type of artificial neural network which performs unsupervised competitive learning (Kohonen, 1982) • Useful for producing low-dimensional views of high-dimensional data • Plots the similarities in the data by grouping similar data items together
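The competitive learning step can be sketched with a very small, one-dimensional map. This is a simplified illustration under stated assumptions: a row of four units rather than a 2-D grid, linear initialisation across the data range, a Gaussian neighbourhood, and linearly decaying learning rate and radius. The parameter values and data are hypothetical.

```python
import math


def best_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (the 'winner')."""
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))


def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0):
    """Train a 1-D self-organising map and return its unit weights."""
    dims = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dims)]
    hi = [max(x[d] for x in data) for d in range(dims)]
    # Linear initialisation: spread the units evenly across the data range
    weights = [[lo[d] + (hi[d] - lo[d]) * i / (n_units - 1) for d in range(dims)]
               for i in range(n_units)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr = lr0 * frac                     # learning rate decays to zero
        radius = max(radius0 * frac, 0.5)   # neighbourhood shrinks over time
        for x in data:
            winner = best_unit(weights, x)
            for i, w in enumerate(weights):
                # Units near the winner on the map move more than distant ones
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(dims):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical data with two distinct groups
cluster_a = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]
cluster_b = [(0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
weights = train_som(cluster_a + cluster_b)
print([best_unit(weights, x) for x in cluster_a + cluster_b])
```

After training, similar items share a winning unit or land on neighbouring units, so positions on the map can be read as a low-dimensional view of the data.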

Debt Behaviour - Self Organising Maps Aim: understand the logic and structures that drive taxpayers’ compliance behaviour (behavioural archetypes). Construct ‘psychographic groups’ (Wells 1975) by using data mining clusters: each cell in the “map” represents thousands of entities who are similar across many characteristics. Identify hot spots which indicate high levels of “activity” associated with different characteristics. 6.5 million entities in total population.

Text Mining of Complex Documents • Large collections of documents (unstructured data from multiple sources including source systems, client hard drives and scanned material) need to be reviewed • Task: systematically sift the required information from the “noise” • Aim: Reduce the time taken to identify those documents that support compliance treatment

Associated Entities • Identifying and understanding Associated Entities is important in many different taxation contexts. • Debt: linking Associated Entities is important in understanding an entity’s propensity and capacity to pay, and then in modelling their debt risk. Entities are associated through partnerships, directorships, and consolidated groups where we need to identify the ultimate holding company. • Lodgment: analyse lodgment behaviour and risk to revenue by knowing the relationships between Associated Entities. Relationships derived from the linkages could be used for identifying “leverage” points for more effective treatment strategies. • The tool is in the early stages of development. [Figures: One Degree of Separation; Two Degrees of Separation. Legend: colours = Companies, Government, Individuals, Partnerships, Superannuation, Trusts; triangle = non lodged; circle = lodged; size = Ind … Large]

Associated Entities [Figures: One, Two, Three, Four and Five Degrees of Separation. Legend: colours = Companies, Government, Individuals, Partnerships, Superannuation, Trusts; triangle = non lodged; circle = lodged; size = Ind … Large]
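The "degrees of separation" views can be reproduced in principle with a breadth-first search over entity links. The sketch below assumes links are undirected pairs. The entity names and link list are hypothetical, and the slides do not describe how the actual tool computes its views.

```python
from collections import deque


def degrees_of_separation(links, start, max_degree):
    """Map each entity within max_degree links of start to its degree."""
    # Build an undirected adjacency map from the link pairs
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    degree = {start: 0}
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        if degree[entity] == max_degree:
            continue  # do not expand beyond the requested degree
        for other in neighbours.get(entity, ()):
            if other not in degree:
                degree[other] = degree[entity] + 1
                queue.append(other)
    return degree


# Hypothetical associations (directorships, partnerships, trust links)
links = [
    ("Individual A", "Company B"),
    ("Company B", "Trust C"),
    ("Trust C", "Partnership D"),
    ("Individual A", "Partnership E"),
]
within_two = degrees_of_separation(links, "Individual A", max_degree=2)
print(within_two)
```

Each extra degree can pull in many more entities, which is consistent with the networks in the figures growing quickly from one degree to five.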

Data Matching • AUSTRAC • Internal data
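The slide gives no detail on how external and internal records are matched. As a minimal sketch of one basic approach, the code below pairs records that agree on a normalised name and a date of birth. The field names, reference numbers and records are hypothetical, and real data matching would usually need fuzzier comparison than this exact-key join.

```python
import re


def normalise(name):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def match_records(external, internal):
    """Pair external and internal records sharing a normalised name and date of birth."""
    index = {(normalise(r["name"]), r["dob"]): r for r in internal}
    matches = []
    for r in external:
        hit = index.get((normalise(r["name"]), r["dob"]))
        if hit is not None:
            matches.append((r["ref"], hit["client_id"]))
    return matches


# Hypothetical records from an external source and from internal systems
external = [
    {"ref": "T1", "name": "SMITH,  John", "dob": "1970-01-01"},
    {"ref": "T2", "name": "Lee, Ann", "dob": "1985-05-05"},
]
internal = [
    {"client_id": 101, "name": "Smith John", "dob": "1970-01-01"},
    {"client_id": 102, "name": "Ann Lee", "dob": "1985-05-05"},
]
matches = match_records(external, internal)
print(matches)
```

The second record is missed because the name parts appear in a different order, a small example of why exact keys alone tend to under-match.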

Analytics Project Achievements • Application of Data Mining • in the ATO

Data mining at work

Intangible Effect of Data Mining

Other projects • Ceased Business • Failure to Lodge • IT Return Not Necessary • Propensity to Lodge • Risk to Revenue – FBT • Strategy Evaluation and Improvement • In House Prosecutions • Risk to Information • Risk Score Associated Entities • Risk to Reputation

Some New Points of View • Fraud is found at the edge or boundary of pockets of activity, rather than among the outliers [Figure labels: Outliers; Boundary Cases]

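Scattergram (Taylor-Russell Table) • Baseline: separates aberrant from acceptable cases • Cutoff: the threshold used by the classifier • True Positives: aberrant cases the classifier flags • False Negatives: aberrant cases the classifier misses • False Positives: acceptable cases the classifier flags • True Negatives: acceptable cases the classifier does not flag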
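The four regions of the scattergram can be counted directly once a cutoff is chosen. The sketch below assumes each case has a classifier score and a known outcome. It also assumes that "strike rate" means the share of flagged cases that turn out to be aberrant. The scores and outcomes are hypothetical.

```python
def confusion_counts(cases, cutoff):
    """Count TP, FP, FN and TN for cases flagged when score >= cutoff.

    cases: list of (score, is_aberrant) pairs.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_aberrant in cases:
        flagged = score >= cutoff
        if flagged and is_aberrant:
            counts["TP"] += 1
        elif flagged:
            counts["FP"] += 1
        elif is_aberrant:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def strike_rate(counts):
    """Share of flagged cases that are truly aberrant (assumed definition)."""
    flagged = counts["TP"] + counts["FP"]
    return counts["TP"] / flagged if flagged else 0.0


# Hypothetical scored cases: (classifier score, was the case actually aberrant?)
cases = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
         (0.4, False), (0.3, False), (0.2, True), (0.1, False)]
counts = confusion_counts(cases, cutoff=0.5)
print(counts, strike_rate(counts))
```

Raising the cutoff generally trades false positives for false negatives, so the choice of cutoff is a business decision as much as a modelling one.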

Role of Expertise • Need to develop procedures and methods for capturing the knowledge, skills and strategies that experts employ to identify non-compliance (the “smell factor” of a case), and to incorporate these as routines and models in our discovery and detection systems • Examples include the expertise used for • Risk Identification • Feature Selection • Classification of Cases
