Discover the intriguing distribution of first digits in numbers, known as Benford's Law. Originally observed by Simon Newcomb in 1881, this law has since been rediscovered by Frank Benford and is applicable in various data sets. Explore how this distribution deviates from equal frequency and its implications in different domains.
Benford’s Very Strange Law John D. Barrow
Simon Newcomb (1835-1909), 1888: "We are probably nearing the limit of all we can know about astronomy". ‘Note on the Frequency of Use of the Different Digits in Natural Numbers’, 1881
Newcomb’s ‘Law’ "That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9." The law of probability of the occurrence of numbers is such that all mantissae [fractional part] of their logarithms are equally probable.
Data on first digits are evenly spread on a logarithmic scale, but they will not be on a linear scale, where they become increasingly sparse. Newcomb said this law was “evident”: $P(d) = \dfrac{\log(d+1) - \log(d)}{\log(10) - \log(1)} = \log_{10}(1 + 1/d)$
Probability of the First Digit Being Equal to d: $P(d) = \log_{10}(1 + 1/d)$, $d = 1, 2, \ldots, 9$. Ignore signs and take the first significant digit, e.g. for -3.1526 it is 3
A Big Surprise You might have thought P(1) = P(2) = P(3) = ….P(9) = 0.11.. But… P(1) = 0.30 P(2) = 0.18 P(3) = 0.12 P(4) = 0.10 P(5) = 0.08 P(6) = 0.07 P(7) = 0.06 P(8) = 0.05 P(9) = 0.05
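The figures on this slide follow directly from P(d) = log10(1 + 1/d). A minimal Python sketch that tabulates them (the names `benford_first_digit` and `benford_table` are illustrative, not from the lecture):

```python
import math

def benford_first_digit(d):
    """Benford probability that the first significant digit equals d (1..9)."""
    if d not in range(1, 10):
        raise ValueError("first digit must be one of 1..9")
    return math.log10(1 + 1 / d)

# Table of all nine probabilities; they telescope to log10(10) = 1.
benford_table = {d: benford_first_digit(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"P({d}) = {p:.3f}")
```

The nine values sum to 1 because the logarithms telescope, which is a quick sanity check on the formula.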
Rediscovered by Frank Benford (1883-1948) at GEC in 1938. The first-digit distribution $P(d) = \log_{10}(1 + 1/d)$ then becomes known as “Benford’s Law”. ‘The Law of Anomalous Numbers’ (1938)
Benford gathered 20,000 pieces of data and studied First-digit frequencies
Picking Raffle Tickets: P(1) = 1/3, P(1) = 1/2, P(1) = 1/5, P(1) = 1/9. P(1) goes up as we go to 19 tickets, then falls
P(1) depends on the number of tickets. Take an average over all possible numbers of tickets: the average is 30.1%. [Plots: P(1) against number of tickets. Credit: S. Mould]
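The rise and fall of P(1) with the number of tickets can be tabulated exactly. The sketch below assumes tickets are numbered 1 to N and one is drawn uniformly at random; the function name `prob_first_digit_one` is illustrative. It only tabulates P(1) and does not attempt the averaging step that leads to the 30.1% quoted above.

```python
from fractions import Fraction

def prob_first_digit_one(n_tickets):
    """Chance that a ticket drawn uniformly from 1..n_tickets starts with a 1."""
    starts_with_one = sum(
        1 for ticket in range(1, n_tickets + 1) if str(ticket)[0] == "1"
    )
    return Fraction(starts_with_one, n_tickets)

# P(1) falls to 1/9 at 9 and 99 tickets and peaks at 19 and 199 tickets.
for n in (2, 3, 5, 9, 19, 99, 199, 999):
    p = prob_first_digit_one(n)
    print(f"{n:4d} tickets: P(1) = {p} = {float(p):.3f}")
```

Using exact fractions keeps the small cases (1/2, 1/3, 1/5, 1/9) recognisable in the output.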
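Universal distribution P(x) for numbers with units. This means it must be scale invariant: $P(kx) = f(k)P(x)$. Since $\int P(x)\,dx = 1$ we must have $\int P(kx)\,dx = 1/k$, so $1/k = \int P(kx)\,dx = f(k)\int P(x)\,dx = f(k)$, which means $f(k) = 1/k$. Now take $d/dk$ of $P(kx) = f(k)P(x)$: $\dfrac{dP(kx)}{d(kx)}\,\dfrac{d(kx)}{dk} = x\,\dfrac{dP(kx)}{d(kx)} = -\dfrac{P(x)}{k^2}$. Put $k = 1$: $x\,\dfrac{dP}{dx} = -P(x)$, which means $P(x) \propto 1/x$. In reality we won’t go to zero or infinity, so don’t worry about $\int_0^\infty \frac{1}{x}\,dx$ being infinite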
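A quick numerical check of the scale-invariance argument. It assumes a deterministic sample whose density is proportional to 1/x over six whole decades; the sample size, the factor 2.54 and the helper names are illustrative choices, not taken from the lecture.

```python
import math
from collections import Counter

def first_digit(x):
    """First significant digit of a nonzero number, ignoring its sign."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no first significant digit")
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def digit_frequencies(values):
    """Observed relative frequency of each first digit 1..9."""
    counts = Counter(first_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

# Evenly spaced exponents give a density proportional to 1/x on [1, 10^6).
N = 60_000
data = [10 ** (6 * (i + 0.5) / N) for i in range(N)]

original = digit_frequencies(data)
rescaled = digit_frequencies([2.54 * x for x in data])  # arbitrary change of units

for d in range(1, 10):
    print(d, round(original[d], 4), round(rescaled[d], 4),
          round(math.log10(1 + 1 / d), 4))
```

Both frequency columns should sit close to log10(1 + 1/d): changing units shifts the data along the logarithmic scale without changing how the first digits are shared out.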
Other Digits: By the same kind of analysis we can determine the probability that the second digit will have a certain value. It's only necessary to consider a single order of magnitude, since the pattern is repeated on each order. For example, in base 10, the probability of the second digit being "3" is equal to the sum of the probabilities of the first two digits being "1.3", "2.3", "3.3", ... or "9.3" for numbers in the range from 1 to 10. This is indicated by the shaded regions in the logarithmic scale. The fraction in 1.3 to 1.4 is $\log_{10}(1.4) - \log_{10}(1.3) = \log_{10}(1 + 1/13) \approx 0.032$. Now just find the fractions in 2.3 to 2.4 etc. and add all the answers together
Probabilities for Successive Significant Digits
$P(\text{first digit is } d) = \log_{10}(1 + 1/d)$, $d = 1, 2, 3, \ldots, 9$.
$P(\text{second digit is } d) = \sum_{k=1}^{9} \log_{10}\left[1 + (10k + d)^{-1}\right]$, $d = 0, 1, 2, \ldots, 9$ (Newcomb).
The joint distribution of all digits can be found, and they are not independent: $P(\text{first} = d_1, \ldots, k\text{th} = d_k) = \log_{10}\left[1 + \left(\sum_{i=1}^{k} d_i\,10^{k-i}\right)^{-1}\right]$
E.g. for 0.314: $P(3,1,4) = \log_{10}\left[1 + (314)^{-1}\right] = 0.0014\ldots$
The unconditional probability that the second digit is 2 is P(second digit = 2) = 0.109, but the conditional probability that it is 2 given that the first is 1 is 0.115.
Dependence falls off fast as the distance between digits increases.
The distribution of the nth digit approaches a uniform distribution on 0, 1, 2, …, 9 very fast as $n \to \infty$, so $P \to 1/10$ for the occurrence of each of 0, 1, 2, …, 9, since $\log(1 + 1/n) \approx 1/n$ for large n
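The digit formulas on this slide are easy to evaluate. The sketch below is illustrative (the function names are ours): it computes the joint probability for the leading digits 3, 1, 4, the unconditional probability that the second digit is 2, and the conditional probability that the second digit is 2 given that the first is 1.

```python
import math

def p_first(d1):
    """P(first significant digit = d1), d1 in 1..9."""
    return math.log10(1 + 1 / d1)

def p_second(d2):
    """Unconditional P(second significant digit = d2), d2 in 0..9."""
    return sum(math.log10(1 + 1 / (10 * k + d2)) for k in range(1, 10))

def p_joint(*digits):
    """P(the leading significant digits are exactly `digits`)."""
    if digits[0] == 0:
        raise ValueError("the first significant digit cannot be 0")
    n = int("".join(str(d) for d in digits))
    return math.log10(1 + 1 / n)

print(round(p_joint(3, 1, 4), 4))             # 0.0014
print(round(p_second(2), 3))                  # 0.109
print(round(p_joint(1, 2) / p_first(1), 3))   # 0.115
```

The gap between 0.109 and 0.115 is the dependence between digits; repeating the comparison for later digits shows how quickly it shrinks.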
Invariances Pick Out Benford
• Scale invariance: no preferred units
• Base invariance with respect to the base of arithmetic b: $P(d) = \log_b(1 + 1/d)$
• But why should there be a distribution like this at all?
Do All First-Digit Distributions Follow Newcomb-Benford?
Random number generator US tax return data
Not Everything Follows Benford
• Continued fraction digits are most often 1’s in general, but they are not Benford-Newcomb-like
a = k + x = integer + fractional part
For almost all real numbers: $P(k) = \ln\left[1 + \dfrac{1}{k(k+2)}\right] / \ln 2$
P(1) = 0.41, P(2) = 0.17, P(3) = 0.09, P(4) = 0.06, P(5) = 0.04
Steeper than Benford: $P(k) \propto 1/k^2$ as $k \to \infty$, since $\ln(1 + x) \approx x$ for small x
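The continued-fraction formula on this slide is the one usually called the Gauss-Kuzmin distribution. A small sketch evaluating it next to the Benford first-digit values (the function name `gauss_kuzmin` is illustrative):

```python
import math

def gauss_kuzmin(k):
    """P(a continued-fraction digit equals k), k >= 1, for almost all reals."""
    return math.log2(1 + 1 / (k * (k + 2)))

# Compare with the Benford first-digit probabilities for the same integers.
for k in range(1, 6):
    print(k, round(gauss_kuzmin(k), 3), round(math.log10(1 + 1 / k), 3))
```

Unlike Benford's first digits, k here is not limited to 1..9; the probabilities sum to 1 only over all k ≥ 1, with a tail that falls off like 1/k².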
First digits are Benford-Newcomb distributed so long as
• Data measure the same phenomena (e.g. all prices or areas)
• There are no built-in max or min values
• The numbers are not assigned (like phone nos)
• The underlying distribution is fairly smooth
• There are more observations of small items than large ones
• Data span several whole numbers on the log scale: the distribution must be broad rather than narrow
[Figure: a broad and a narrow distribution on a log scale. Red area is the relative probability that the first digit is 1; blue area is the relative probability that the first digit is 8]
Broad: ratios of areas proportional to widths, e.g. incomes, populations
Narrow: ratios of areas not proportional to widths, e.g. human heights, IQ scores
Different Types of Data: Benford-like? [Four example data sets, in order: yes, yes, no, yes]
Winning Lotteries
• The Massachusetts Numbers Game (State Lottery)
1. Bet on a 4-digit number
2. A 4-digit number is generated randomly
3. All winners share the jackpot
• A Possible Strategy: to avoid sharing the prize, assume entrants pick numbers from their experience (i.e. not at random) and obey Benford’s law. So pick numbers that are least probable by the Benford-Newcomb law: start with 9’s and 8’s
• Evidence (Hill 1988) that numbers ‘randomly’ chosen by people tend to start with low digits
Generalised Benford’s Laws
• A random process with probability distribution $P(x) \propto 1/x$ gives Benford data for first digits: $P(d) = \log_{10}(1 + 1/d)$
• Random processes with $P(x) \propto 1/x^a$ and $a \neq 1$ give $P(d) = C\int_d^{d+1} x^{-a}\,dx = (10^{1-a} - 1)^{-1}\left[(d+1)^{1-a} - d^{1-a}\right]$
• For a = 2: P(d = 1) = 0.56, P(d = 2) = 0.185, P(d = 3) = 0.09, P(d = 9) = 0.012
• For prime numbers from 1 to N: $a(N) = 1/[\log N - c]$, $c = 1.10 \pm 0.05$, large N (Perone et al)
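The generalised formula can be checked against the a = 2 values quoted on the slide. A minimal sketch (the function name `generalised_benford` is illustrative; the a = 1 branch simply falls back to the ordinary Benford formula):

```python
import math

def generalised_benford(d, a):
    """First-digit probability for a density proportional to 1/x**a."""
    if a == 1:
        return math.log10(1 + 1 / d)  # ordinary Benford case
    normalisation = 10 ** (1 - a) - 1
    return ((d + 1) ** (1 - a) - d ** (1 - a)) / normalisation

for d in (1, 2, 3, 9):
    print(d, round(generalised_benford(d, 2), 3))
```

As a approaches 1 the values approach log10(1 + 1/d), so the ordinary law sits inside this one-parameter family as a limiting case.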
A Well-defined Approach to Uniformity by the Primes Christian Perone a = 1.10
Detecting Fraud: ‘Natural’ distributions and their combinations should follow Benford. Maybe ‘doctored’ or ‘artificial’ constructions do not? Mark Nigrini, Univ. Cincinnati PhD thesis (1992): ‘The detection of income evasion through an analysis of digital distributions’. Data from the lines of 169,662 IRS model files follow Benford's law closely. Fraudulent data taken from a 1995 Kings County, New York, District Attorney's Office study of cash disbursement and payroll in business don’t follow Benford's law. The fraudulent or concocted data appear to have far fewer numbers starting with 1 and many more starting with 5 or 6 than do true data.
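One simple way to compare a data set with Benford's law is a chi-square statistic on the first-digit counts. The sketch below is an illustration of that idea only; it is not Dr. Nigrini's program, and both data sets are made up for the example (powers of 2 as a stand-in for "natural" data, the integers 500 to 699 as a stand-in for concocted data).

```python
import math
from collections import Counter

def first_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

def benford_chi_square(values):
    """Chi-square statistic of observed first-digit counts against Benford."""
    counts = Counter(first_digit(v) for v in values)
    total = len(values)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5_PERCENT = 15.507

natural_like = [2 ** n for n in range(1000)]   # illustrative stand-in data
concocted = list(range(500, 700))              # every value starts with 5 or 6

for name, data in (("natural-like", natural_like), ("concocted", concocted)):
    stat = benford_chi_square(data)
    verdict = "consistent with Benford" if stat < CRITICAL_5_PERCENT else "suspicious"
    print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

A large statistic only says that the first digits are unlike Benford's law; as the earlier slides note, plenty of honest data (heights, assigned numbers) fail the law too, so a flag is a prompt for a closer look, not proof of fraud.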
Forensic Accounting with Newcomb-Benford: Robert Burton, the chief financial investigator for the Brooklyn District Attorney, recalled in an interview that he had read an article by Dr. Nigrini that fascinated him. "He had done his Ph.D. dissertation on the potential use of Benford's Law to detect tax evasion, and I got in touch with him in what turned out to be a mutually beneficial relationship," Mr. Burton said. "Our office had handled seven cases of admitted fraud, and we used them as a test of Dr. Nigrini's computer program. It correctly spotted all seven cases as 'involving probable fraud.'" He feels your pain