270 likes | 279 Views
Learn about Benford's Law and its application in data quality analysis. Discover how it can be used to identify fraud and improve data management practices. This paper provides examples and practical tips using Excel.
E N D
Benford’s LawData Quality Analysis applications Alan F Doyle, Data Management SpecialistGroup Data Management 4 March 2009
Contents Introduction – who are we ? Benford's Law – What is it, when should it apply, and to what types of data ? Benford's Law and Data Quality 101 30% of all balances in your GL should begin with a "1" (and other stories)!!! Fraud identification using Benford's Law Using Excel to test real data sets using Benford's Law References/Bibliography AppendixExcel approach to Benford's Law (suggested procedure and Excel tips) 2
Key Group facts • 39,729 employees (FTE), globally • 10 million retail and business banking customers, globally • 1,714 branches and service centres, globally • $15.4 billion revenue • $8.1 billion underlying profit • 2,939 ATMs, globally (including non-branded ATMs) Source: 2008 Shareholder Review
Data Management and Benford's Law The Data Quality link NAB Group Data Management developed a Group Data Quality (DQ) Policy We have a key focus in the policy on the capabilities to ensure appropriate DQ We undertake Data Quality Profiling – a key tool/capability, but generallybusiness-rules based There was some awareness in our team, of Benford’s Law and its application to fraud detection…..maybe we can apply it to DQ profiling also? We selected various data stores which should conform to Benford’s Law and tested this hypothesis Conclusion: very close correlation found, and a worthwhile addition to existing DQ assessment techniques/rules Today’s paper provides background on:– what Benford’s Law is (and is not)– how it can be applied in practice, and – examples of the results of analysis on some real data sets 5
Benford's Law What is it? ….lay man’s guide Also called the “First-Digit Law” Method of predicting, with surprising accuracy,the initial digits of any non-random series of numbers Wikipedia4 – simple, plain language: > “…in lists of numbers from many real-life sources of data, > the leading digit is distributed in a specific, non-uniform way….. > the first digit is 1 almost one third of the time, and > larger digits occur as the leading digit with lower frequency, > to the point where 9 as a first digit occurs less than one time in twenty.” 6
Benford's Law What is it? ….the technical stuff Basis: values of real-world measurements are often distributed logarithmically, so the log of such sets is generally distributed uniformly Log tables were more “dog-eared” at 1st few pages (1st digits 1,2 etc) than last pages (1st digits 7,8,9) ….(fictional?) Probability of digit D as 1st digit = log10(1+1/D) A generalised formula exists also which allows us to predict the probability of, for example :– the first 3 digits being 314! – the 4th digit being a 6! – etc.. 7
Benford's Law Real world/day-to-day examples of where it should apply Electricity bills Street addresses Stock prices Population numbers Death rates Lengths of rivers Accounts payable invoice and payment values General Ledger balances Customer Loan and Deposit account balances Land Valuations 8
Benford's Law To what “types” of data should it apply? Balances or totals of numbers resulting from aggregation (e.g. General Ledger Balances, supplier accounts payable balances, data warehouse aggregates) The more stages of calculations to obtain each member of a series of numbers, the more likely it is that the end results will conform to the predictions of Benford’s Law Numbers resulting from the mathematical combination of numbers (e.g. price times quantity) Transaction-level data (e.g. payments, sales, purchases) Numbers that describe the ‘count’ or ‘value’ of the elements of a dataset 9
Benford's Law When does it NOT (or most likely not) apply? Assigned numbers (e.g. cheque numbers, invoice numbers) Numbers which conform to other distributions (e.g. normal distribution, uniform distribution like random drawings, lotteries, or the roll of one die) Numbers influenced by human thought (e.g. prices set with psychological thresholds such as $1.99) Balances of accounts set up for a specific purpose (e.g. to record $100 refunds) Items/numbers with built-in minimum or maximum values, e.g. 1st digit of heights (in metres) of a group of humans is most likely to be a 1 or a 2 ‘Price Effect’ e.g. Sales receipts where one product (with a specific price) forms a large part of the population of sales made, or individual staff members’ payroll totals for a pay period (predominance of similar hours times similar rates per hour When selecting small sample sizes Non-naturally occurring numbers (e.g. telephone numbers) 10
Linking Benford's Law and Data Quality Benford’s Law helps to identify ….. Duplicate payments (accounts payable) Fraudulent payments Fraudulent expense claims Tax return fraud Biased estimation in General Ledger balances Arbitrarily invented numbers in forecasting (forecasts should conform to the expected distributions of their related ‘actuals’) Biased estimates in bad debt provisions Systemic error (e.g. through incorrect ETL logic, resulting in accidentally duplicated or repeated values) Processing inefficiencies (e.g. high quantity/low $ transactions) 11
Fraud identification using Benford's Law “It is the failure of assigned numbers to follow Benford’s Law that makes the Law so powerful in detecting fraudulent “made up” numbers among calculated numbers” 1 “The general expectation is that Benford’s Law will apply to any series of calculated numbers, and explanation is required when any series does not conform. The explanation may be in the exceptions listed above [refer previous slides] or the explanation may be in anomalous behaviour.” 1 Results are indicative. Need to evaluate the results to conclude which applies (a valid exception to Benford’s Law, or an anomaly) If anomalous => further investigation should be conducted relating to the anomalies to confirm why they occur (e.g. fraud, estimation biases, unintended or manual or programmed generation of duplicates) 12
Fraud identification using Benford's LawExample: Expense Claims A classic and easily understood example relates to expense account manipulation Assume General Manager approval required for expense claims >= $300 Often find a clustering of expenses below $300 to avoid the need to seek GM approval This is often achieved by arranging multiple purchases just below the threshold, and/or collusion by purchasers with suppliers, to split larger invoices into smaller individual invoices (e.g. 2 invoices for $260 and $240 rather than one for $500) Benford’s Law to the rescue! First digits will show anomalies (e.g. preponderance of 1s and/or 2s, and fewer than expected 3s, 4s and 5s) Refer suggested simple procedure andExcel tips in the appendix 13
Fraud identification using Benford's Law Example: Expense Claims Anomalies in distribution of actual occurrence of first digits compared with Benford’s expected distribution – the tell-tale signs* 1 2 3 4 5 6 7 8 9 Possible Fraudulent Expense Claims * Dummy data used to illustrate this example 14
Using Excel to test a real data set using Benford's Law Sample data from specific balance files (6.5m records) – Data Warehouse 15
Using Excel to test a real data set using Benford's Law Graph of this data (6.5m records) – closely mirrors Benford’s Distribution KEY : Actual > Benford’s Actual < Benford’s 16
Using Excel to test a real data set using Benford's Law Another Sample data set from the Warehouse (3.3m records) 17
Using Excel to test a real data set using Benford's Law KEY : Actual > Benford’s Actual < Benford’s • Graph of this data (3.3m records) – matches Benford’s, slight tendency to understate ? However differences are relatively small. 18
Using Excel to test a real data set using Benford's Law • Australian GL : Balance Sheet balances : Feb 09 – 55,000 records Match is Spot On !!! …good news for us !! 19
Using Excel to test a real data set using Benford's Law • Australian GL transactions (single Balance Sheet account type; all branches) : Feb 09 - 62,000 records • 62k GL records (transactions) v 6.5m Warehouse records (balances) • Less inherent ‘aggregation’ (transaction vs. balance) and smaller number of items => a less exact match is not unexpected • Nevertheless, still a close match • The following example (personal credit card) illustrates increasing ‘lumpiness’ as the number of items decreases and ‘behaviour’ plays a greater role… 20
Using Excel to test a real data set using Benford's Law • My MasterCard ! (12 months/976 transactions) – “raw” • Interesting : rough match (trend at least) even if somewhat ‘lumpy’ • As expected - know it includes many duplicates (direct debits, credit transfers, round sum ATM withdrawals, bank fees, etc) • Distributions are still quite similar • May be worth adjusting the calculations to allow for the ‘known’ behaviours • Either the “raw” or the adjusted actual distributions can be used as a ‘fingerprint” for comparative analysis across/with other time periods 21
Using Excel to test a real data set using Benford's Law • My MasterCard ! (12 months/976 transactions) – Adjusted for duplicates, etc • Backed out duplicates and near duplicates e.g. monthly direct debits of $19.99 then $19.98, then $19.99; bank fees, standing a/c transfers, etc • Backed out known ‘behaviours’ e.g. many ‘ATM withdrawals’ for $20, $40, $60, $80 • Added back total of these, to each digit’s subtotal, after being redistributed to the leading digits per Benford’s Law • Recalculated ‘adjusted’ actuals • Close match given the relatively small sample, and many behavioural factors at play here • Tendency towards lower digits ! Penny pincher ?......contd/ 22
Using Excel to test a real data set using Benford's Law • My MasterCard ! (12 months/976 transactions) – Adjusted (continued) • Graphically the raw distribution is an intuitive outcome based on my personal behaviours …… • I tend to spend larger sums like $600 to $999 less frequently than lower amounts like $10 to $19, and $100 to $199, or $20 to $29, $200 to $299, $30 to $39, $300 to $399 etc !! …a normal behaviour for most of us when it hits our own hip pocket!! • For a business, the aggregates are comprised of the outcomes of the behaviours of many • => reduced impact of the behaviours of one or two individuals - i.e. a closer fit to Benford’s is expected for businesses (larger populations) 23
Conclusions from our analyses Benford’s Law does indeed apply to aggregated datasets => very relevant to Data Quality assessment! Where anomalies are found, further review may indicate the initial hypothesis regarding a data set’s expected ‘distribution’ was incorrect Alternatively, review of outcomes may indicate that although the distribution should conform to Benford’s Law, in reality it doesn’t – further investigation required. Further investigation should then reveal the true reasons for deviationse.g. fraud, inefficient processes, genuine repeated patterns, or systemic data processing/ETL logic errors. Benford’s Law provides a simple yet potentially powerful technique to add to our DQ assessment armory, and can be achieved by applying nothing more than a very simple spreadsheet against your data set We’ll never look at GL and data warehouse balances (or at least their leading digits) the same way again! 24
Questions 25
References/Bibliography Benford’s Law and Fraud DetectionRobert Lowe, NZ Chartered Accountants Journal November 2000 When Benford’s Law is Broken Robert Lowe, NZ Chartered Accountants Journal December 2000 I’ve Got Your NumberMark Nigrini, AICPA Journal of Accountancy May 1999 Wikipedia – Benford’s Lawhttp://en.wikipedia.org/wiki/Benford’s_law 26
Appendix: Excel approach to Benford's Law(suggested procedure and Excel tips) Columnar (left to right) derivation recommended (helps check first digit extraction is correct) Convert to absolute values first – e.g. =ABS(F11) Exclude zero balances (just sort list, then delete rows) Remove any leading zeros (number/balance <0) – select numbers/balances <0, then – multiply by 100, 1000, doesn’t matter (e.g. 2 decimal places => use 100 or greater)– just need to get leading digit on the left of the decimal Extract leading digit(s) – e.g. =LEFT(J11,1) Sort list in ascending order of extracted leading digit cell Create subtotals on change in leading digit value Calculate Grand total of record count (less zero balances) Calculate percentage of Grand total represented by the count of each leading digit Compare with Benford’s Distribution percentages and graph if desired Review outcomes and decide next steps. 27