How to Fake Data if you must. Rachel Fewster, Department of Statistics
Who wants to fake data? • Electoral finance returns… • Toxic emissions reports… • Business tax returns…
Land areas of world countries: real or fake? Two tallies of first digits (1 2 3 4 5 6 7 8 9):
First tally: IIIII III III I I II I
Second tally: I I III I IIII I II III
The second one seems more even… but the first one is right! The first one has as many 1s as 5-9s put together!
Real land areas of world countries. Tally of first digits (1 2 3 4 5 6 7 8 9): IIIII III III I I II I. 11 of them begin with digits 1 – 4… Only 5 begin with digits 5 – 9…
Friday’s Newspaper. Tally of first digits (1 2 3 4 5 6 7 8 9): IIII IIII IIII III IIII II IIII II III. 10 out of 34 numbers began with a 1… None out of 34 began with a 9!
The Curious Case of the Grimy Log-books • In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…
The Curious Case of the Grimy Log-books The books always seemed grubby on the first pages… but clean on the last pages. The first pages are for numbers beginning with digits 1 and 2… The last pages are for numbers beginning with digits 8 and 9…
The Curious Case of the Grimy Log-books People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9. Why? Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!
Newcomb’s Law: 30% of numbers begin with a 1!! Fewer than 5% of numbers begin with a 9!! (American Journal of Mathematics, 1881)
The First Digits… Over 30% of numbers begin with a 1. Fewer than 5% of numbers begin with a 9.
The First Digits… [Two panels: numbers beginning with a 1; numbers beginning with a 9.] There is the same “opportunity” for numbers to begin with 9 as with 1 … but for some reason they don’t!
Chance of a number starting with digit d: log10((d+1)/d). For d = 1: 0.301 = log10(2/1). For d = 2: 0.176 = log10(3/2). For d = 3: 0.125 = log10(4/3). And so on up to d = 9.
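The first-digit chances above are easy to tabulate. Here is a minimal Python sketch, assuming the pattern on the slide continues as log10((d+1)/d) for every digit d from 1 to 9; the names benford_probability and benford_table are made up for this illustration.

```python
import math

def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1..9) under the logarithmic law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)

# One entry per possible first digit.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"first digit {d}: {p:.3f}")
```

The nine chances add up to 1, because the logs telescope to log10(10/1).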
Reactions to Newcomb’s law Nothing! …for 57 years!
Enter Frank Benford: 1938. Physicist with the General Electric Company. Assembled over 20,000 numbers and counted their first digits! ‘A study as wide as time and energy permitted.’
• Populations • Numbers from newspapers • Drainage rates of rivers • Numbers from Reader’s Digest articles • Street addresses of American Men of Science
About 30% begin with a 1. About 5% begin with a 9.
Anomalous numbers !! Benford gave the ‘law’ its name… …but no explanation.
“…The logarithmic law applies to outlaw numbers that are without known relationship, rather than to those that follow an orderly course; and so the logarithmic relation is essentially a Law of Anomalous Numbers.”
What is the explanation? Explanations for Benford’s Law • In numbers from a wide range of data sources, about 30% of first digits are 1’s, falling to only about 5% for 9’s. • Benford called these ‘outlaw’ or ‘anomalous’ numbers. They include street addresses of American Men of Science, populations, areas, numbers from magazines and newspapers. • Benford’s ‘orderly’ numbers, like atomic weights and physical constants, don’t follow the law.
Popular Explanations • Scale Invariance • Base Invariance • Complicated Measure Theory • Divine choice • Mystery of Nature. The first two (Scale Invariance and Base Invariance) say that IF there is a universal law, it must be Benford’s. They don’t explain why there should be a law to start with!
Complicated Measure Theory In a nutshell … If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law
It doesn’t really explain WHAT will work well, nor why • It doesn’t explain why street addresses of American Men of Science work well!
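The "grab numbers from all over the place" idea can be tried out in a toy simulation. This is only a sketch under loud assumptions, not the measure-theoretic result itself: each number comes from a randomly chosen family (uniform, exponential or lognormal, all hypothetical choices) with a random scale spread evenly over six orders of magnitude, and that spread of scales does much of the work. All function names are made up for this illustration.

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant digit of a positive number, read from its scientific notation."""
    return int(f"{x:.16e}"[0])

def draw_from_random_distribution(rng: random.Random) -> float:
    """Pick a distribution at random, then draw one number from it."""
    scale = 10 ** rng.uniform(0, 6)  # assumption: scales cover six whole decades
    family = rng.choice(["uniform", "exponential", "lognormal"])
    if family == "uniform":
        value = rng.uniform(0, 1)
    elif family == "exponential":
        value = rng.expovariate(1.0)
    else:
        value = rng.lognormvariate(0, 1)
    return scale * value

def digit_frequencies(n: int, seed: int = 1) -> dict:
    """First-digit frequencies of n positive numbers from the random mix."""
    rng = random.Random(seed)
    counts = {d: 0 for d in range(1, 10)}
    drawn = 0
    while drawn < n:
        x = draw_from_random_distribution(rng)
        if x <= 0:  # skip the (very rare) exact zero, which has no first digit
            continue
        counts[first_digit(x)] += 1
        drawn += 1
    return {d: c / n for d, c in counts.items()}

frequencies = digit_frequencies(20_000)
for d, f in frequencies.items():
    print(f"first digit {d}: observed {f:.3f}, log10((d+1)/d) = {math.log10((d + 1) / d):.3f}")
```

With these assumptions the observed frequencies sit close to the logarithmic law, but the sketch says nothing about which real data sources will behave this way.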
The Key Idea… If a hat is covered evenly in red and white stripes… … it will be half red and half white. Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon
A Hat • If the red stripes cover half the base, they’ll cover about half the hat. The red stripes and the white stripes even out over the shape of the hat.
What if the red stripes cover 30% of the base (stripes from 0 to 0.3, 1 to 1.3, 2 to 2.3, and so on, along a base marked from 0 to 6)? Then they’ll cover about 30% of the hat.
What if the red stripes cover precisely fraction 0.301 of the base (stripes from 0 to 0.301, 1 to 1.301, 2 to 2.301, and so on, along a base marked from 0 to 6)? Then they’ll cover fraction ~0.301 of the hat. Note that 0.301 = log10(2/1).
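The hat claim can be checked numerically. This is a sketch only: it assumes a bell-shaped hat (a normal curve with a hypothetical centre of 3 and a chosen spread sd) sitting on the base, with red stripes running from n to n + 0.301 for every whole number n. The names normal_cdf and striped_area are made up for this illustration.

```python
import math

STRIPE_WIDTH = math.log10(2)  # 0.301..., the fraction of the base painted red

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Area under the bell-shaped hat to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def striped_area(mean: float, sd: float, width: float = STRIPE_WIDTH,
                 lo: int = -50, hi: int = 50) -> float:
    """Total area of the hat lying over the stripes [n, n + width) for n = lo..hi-1."""
    total = 0.0
    for n in range(lo, hi):
        total += normal_cdf(n + width, mean, sd) - normal_cdf(n, mean, sd)
    return total

for sd in (0.2, 0.5, 1.0, 2.0):
    print(f"hat spread {sd}: red area = {striped_area(3.0, sd):.4f}")
```

For a wide, smooth hat the red area comes out very close to 0.301. For a narrow hat (sd = 0.2 here) the stripes do not even out, so the "about" matters.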
Think of X as a random number… • We want the probability that X has first digit = 1 • Let the ‘hat’ be a probability density curve for X • Then AREAS on the hat give PROBABILITIES for X. For example, if the area from 1 to 5 is 0.95, then Pr(1 < X < 5) = 0.95. Total area = 1.
In the same way… If the red stripes (from 0 to 0.301, 1 to 1.301, 2 to 2.301, and so on) somehow represent the X values with first digit = 1, and the red stripes have area ~ 0.301, then Pr(X has first digit 1) ~ 0.301.
So X values with first digit=1 somehow lie on a set of evenly spaced stripes? Write X in Scientific Notation: X = r × 10^n, where r is between 1 and 10 (1 ≤ r < 10) and n is an integer.
For the first digit of X, only r matters! With X = r × 10^n (r between 1 and 10, n an integer), X has first digit 1 when 1 ≤ r < 2, and some other first digit when r ≥ 2, whatever the value of n.
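Here is a small Python sketch of the scientific-notation step, writing a positive number as r × 10^n and reading the first digit off r. The sample numbers are hypothetical stand-ins, not the examples from the original slide, and the names scientific_parts and first_digit are made up for this illustration.

```python
from decimal import Decimal

def scientific_parts(x):
    """Split a positive number into (r, n) with x = r * 10**n and 1 <= r < 10."""
    d = Decimal(str(x))
    if d <= 0:
        raise ValueError("x must be positive")
    n = d.adjusted()   # exponent of the leading digit, i.e. the integer n
    r = d.scaleb(-n)   # shift the decimal point so that 1 <= r < 10
    return r, n

def first_digit(x) -> int:
    """The first digit of x is the whole-number part of r; n plays no role."""
    r, _ = scientific_parts(x)
    return int(r)

# Hypothetical sample values, chosen only for illustration.
for x in (250, 1700, 0.0017, 93000000):
    r, n = scientific_parts(x)
    print(f"{x} = {r} x 10^{n}, first digit {first_digit(x)}")
```

Decimal arithmetic is used here so that the split does not depend on floating-point rounding of logarithms.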
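Take logs to base 10… log10(X) = log10(r × 10^n) = n + log10(r), where r is between 1 and 10 and n is an integer. Since 1 ≤ r < 10, log10(r) is between 0 and 1. Or in other words… X has first digit 1 when 1 ≤ r < 2, that is, when log10(r) is between 0 and log10(2) = 0.301.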
X has first digit 1 precisely when log10(X) is between n and n + 0.301 for some integer n. STRIPES!! n = 0 : X from 1 to 2. n = 1 : X from 10 to 20. n = 2 : X from 100 to 200.
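The stripe condition can be checked on a few concrete numbers. This is a rough sketch: on_red_stripe tests whether log10(X) lands between some integer n and n + 0.301, and leading_digit reads the first digit directly for comparison. Both names and the sample values are made up for this illustration, and values sitting exactly on a stripe edge (2, 20, 200, …) are avoided because floating-point logs can wobble there.

```python
import math

LOG10_2 = math.log10(2)  # 0.301..., the width of each red stripe

def on_red_stripe(x: float) -> bool:
    """True when log10(x) lies in [n, n + 0.301) for some integer n."""
    y = math.log10(x)
    return y - math.floor(y) < LOG10_2

def leading_digit(x: float) -> int:
    """First significant digit, read from the number's scientific notation."""
    return int(f"{x:.16e}"[0])

# Hypothetical sample values, away from the stripe edges.
for x in (1.5, 15, 150, 0.0017, 3, 47, 999, 0.25):
    print(f"{x}: log10 = {math.log10(x):.3f}, on a red stripe: {on_red_stripe(x)}, "
          f"first digit: {leading_digit(x)}")
```

In every row, the number is on a red stripe exactly when its first digit is 1.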
The ‘hat’ is the probability density curve for log10(X), drawn over a base marked from 0 to 6 with red stripes from 0 to 0.301, 1 to 1.301, 2 to 2.301, and so on. • X values with first digit = 1 satisfy: n = 0 : X from 1 to 2. n = 1 : X from 10 to 20. n = 2 : X from 100 to 200. And so on!
So X values with first digit=1 DO lie on evenly spaced stripes (from 0 to 0.301, 1 to 1.301, 2 to 2.301, and so on), on the log scale! The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction of the base they cover: 0.301.
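To see the whole argument at work, here is a simulation sketch. It assumes the hat for log10(X) is a wide bell curve (a normal curve with hypothetical centre 3 and spread 1.5), draws many values of X = 10^(a point under the hat), and counts first digits. The function names and parameter values are made up for this illustration.

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant digit of a positive number, read from its scientific notation."""
    return int(f"{x:.16e}"[0])

def simulate_first_digits(n: int = 100_000, mean: float = 3.0,
                          sd: float = 1.5, seed: int = 2024) -> dict:
    """Draw log10(X) from a bell-shaped hat and return first-digit frequencies of X."""
    rng = random.Random(seed)
    counts = {d: 0 for d in range(1, 10)}
    for _ in range(n):
        x = 10 ** rng.gauss(mean, sd)  # the hat lives on the log scale
        counts[first_digit(x)] += 1
    return {d: c / n for d, c in counts.items()}

observed = simulate_first_digits()
for d, f in observed.items():
    print(f"first digit {d}: simulated {f:.3f}, stripe fraction {math.log10((d + 1) / d):.3f}")
```

The same stripe picture works for every digit d, with stripes from n + log10(d) to n + log10(d + 1), so all nine simulated frequencies should land near log10((d+1)/d) when the hat is wide and smooth.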