Statistics & Probability Overview: Normal Approximation, Poisson & Exponential Distributions

Raoul LePage Professor STATISTICS AND PROBABILITY www.stt.msu.edu/~lepage click on STT315_Sp06 Week 4

WEEK PLAN: Normal approximation of the binomial. Poisson and its normal approximation. Exponential. Selected material from Ch. 1.

Normal Approx of Binomial: n = 10, p = 0.4; mean = n p = 4; sd = root(n p q) ~ 1.55, where q = 1 - p = 0.6.
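A minimal Python sketch of this comparison, assuming the usual continuity correction (the slide does not say whether one is used). The function names `binom_cdf` and `normal_cdf` are illustrative, not from the course.

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(Y <= x) for a normal Y with the given mean and sd."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p = 10, 0.4
mean = n * p                       # n p
sd = math.sqrt(n * p * (1 - p))    # root(n p q)

exact = binom_cdf(4, n, p)
approx = normal_cdf(4.5, mean, sd)  # assumed continuity correction: 4.5 for "X <= 4"
print(round(mean, 2), round(sd, 2), round(exact, 4), round(approx, 4))
```

Even with n as small as 10, the two probabilities should land close to each other, which is the point of the approximation.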

Poisson Distribution, Governing Counts of Rare Events: p(x) = e^{-mean} mean^{x} / x! for x = 0, 1, 2, ... ad infinitum.

Poisson e.g. X = number of times the ace of spades turns up in 104 tries. X ~ Poisson with mean 2 (104 tries at 1 chance in 52 each). p(x) = e^{-mean} mean^{x} / x!, e.g. p(3) = e^{-2} 2^{3} / 3! ~ 0.18.

Poisson e.g. X = number of raisins in MY cookie. Batter has 400 raisins and makes 144 cookies. E X = 400/144 ~ 2.78 per cookie. p(x) = e^{-mean} mean^{x} / x!, e.g. p(2) = e^{-2.78} 2.78^{2} / 2! ~ 0.24 (around 24% of cookies have 2 raisins).
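The two Poisson examples can be checked with a few lines of Python. This is only a sketch; `poisson_pmf` is an illustrative name.

```python
import math

def poisson_pmf(x, mean):
    """p(x) = e^(-mean) * mean^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-mean) * mean**x / math.factorial(x)

p_ace = poisson_pmf(3, 2)             # ace of spades exactly 3 times in 104 tries
p_raisin = poisson_pmf(2, 400 / 144)  # exactly 2 raisins in my cookie
print(round(p_ace, 2), round(p_raisin, 2))
```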

Poisson: THE FIRST BEST THING ABOUT THE POISSON IS THAT THE MEAN ALONE TELLS US THE ENTIRE DISTRIBUTION! Note: Poisson sd = root(mean).

400 raisins, 144 COOKIES: E X = 400/144 ~ 2.78 raisins per cookie, sd = root(mean) ~ 1.67 (for Poisson).

Poisson: THE SECOND BEST THING ABOUT THE POISSON IS THAT FOR A MEAN AS SMALL AS 3 THE NORMAL APPROXIMATION WORKS WELL. [Figure: normal curve with mean 2.78 and sd = root(mean) = 1.67, special to Poisson.]

WE AVERAGE 127.8 ACCIDENTS PER MO. E X = 127.8 accidents. If Poisson then sd = root(127.8) = 11.3049, and the approx dist is normal with mean 127.8 accidents and sd = root(mean) ~ 11.3 (special to Poisson). [Figure: normal curve centered at 127.8 accidents.]
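A sketch comparing an exact Poisson probability with its normal approximation for the accident example. The cutoff of 140 accidents and the continuity correction are assumptions chosen for illustration; the slides give neither.

```python
import math

def poisson_cdf(k, mean):
    """Exact P(X <= k) for X ~ Poisson(mean), summed on the log scale."""
    return sum(math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))
               for x in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(Y <= x) for a normal Y with the given mean and sd."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mean = 127.8
sd = math.sqrt(mean)                     # special to Poisson: sd = root(mean)

exact = poisson_cdf(140, mean)           # hypothetical question: P(X <= 140)
approx = normal_cdf(140.5, mean, sd)     # assumed continuity correction
print(round(sd, 4), round(exact, 4), round(approx, 4))
```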

Exponential: The “lifetime” distribution when death comes by rare event.

Exponential density for mean life 57.3 years. Beware, the text uses notation E(57.3) to denote the exponential distribution having mean 57.3. We will NOT do so! [Figure: exponential density, mean 57.3 years.]

Exponential tail areas for mean life 57.3 years: P(X > x) = e^{-x/mean}, so P(X > 100) = e^{-100/57.3} = 0.1746.
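The tail-area formula is a one-liner in Python. A minimal sketch, with `exp_tail` as an illustrative name:

```python
import math

def exp_tail(x, mean):
    """P(X > x) = e^(-x / mean) for an exponential with the given mean."""
    return math.exp(-x / mean)

p_over_100 = exp_tail(100, 57.3)   # chance of living past 100 years
print(round(p_over_100, 4))
```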

Selected material from Chapter 1

THE GREAT TRICK OF STATISTICS: The overwhelming majority of samples of n from a population of N can stand in for the population. Population of N = 5: ATT, Sysco, Pepsico, GM, Dow. Sample of n = 2: ATT, Pepsico.

GREAT TRICK: SOME CAVEATS. Sample size n must be “large.” The trick holds for only a few characteristics at a time, such as profit, sales, dividend. SPECTACULAR FAILURES MAY OCCUR! Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9; sample of n = 2.

GREAT TRICK: SOME CAVEATS. This sample is obviously “not representative.” Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 2: Sysco 21, Pepsi 42.

HOW ARE WE SAMPLING? With replacement vs without replacement. Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9; sample of n = 2.

HOW ARE WE SAMPLING? With replacement. Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 2: Pepsi 42, Pepsi 42 (the same company drawn twice).

DOES IT MAKE A DIFFERENCE? Rule of thumb: sampling with and without replacement are about the same if root[(N-n)/(N-1)] ~ 1, for a population of N and a sample of n.
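A quick way to try the rule of thumb, as a sketch. The name `fpc` is illustrative, and the two (N, n) pairs are the ones used elsewhere in these slides.

```python
import math

def fpc(N, n):
    """The factor root[(N - n) / (N - 1)] from the rule of thumb."""
    return math.sqrt((N - n) / (N - 1))

small = fpc(5, 2)      # 2 from 5: noticeably below 1
large = fpc(500, 30)   # 30 from 500: close to 1
print(round(small, 3), round(large, 3))
```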

UNLIMITED SAMPLING: WITH-replacement samples have no limit to the sample size n. Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9; sample of n = 6.

UNLIMITED SAMPLING: WITH-replacement samples have no limit to the sample size n. Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 6 (repeats allowed): Dow 9, Pepsi 42, Dow 9, Pepsi 42, ATT 12, GM 8.
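Both sampling schemes can be imitated with Python's standard `random` module. This is a sketch only; the fixed seed is an assumption made so the run is repeatable, and the samples it draws will not match the ones on the slides.

```python
import random

population = {"ATT": 12, "Sysco": 21, "Pepsi": 42, "GM": 8, "Dow": 9}
companies = list(population)

rng = random.Random(0)   # fixed seed, only to make the sketch repeatable

with_repl = rng.choices(companies, k=6)     # repeats allowed, n may exceed N
without_repl = rng.sample(companies, k=2)   # distinct companies, needs n <= N

# We sample companies; the numbers come with them.
print([(c, population[c]) for c in with_repl])
print([(c, population[c]) for c in without_repl])
```

Asking `rng.sample` for 6 companies out of 5 raises a `ValueError`, which mirrors the limit on without-replacement sample size.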

TOSS COIN 100 TIMES: population of N = 2, {H, T}.

TOSS COIN 100 TIMES: a with-replacement sample of n = 100 from the population of N = 2, {H, T}: H, T, H, T, T, H, H, H, H, T, T, H, H, H, H, H, H, H, T, T, H, H, H, H, H, H, H, H, H, T, H, T, H, H, H, H, T, T, H, H, T, T, T, T, T, T, T, H, H, T, T, H, T, T, H, H, H, T, H, T, H, T, T, H, H, T, T, T, H, T, T, T, T, T, H, H, T, H, T, T, T, T, H, H, T, T, H, T, T, T, T, H, T, H, H, T, T, T, T, T. Result: 49 H and 51 T.

POPULATION CLONING, MANY COINS SAME AS 2: Tossing one coin 100 times gives a with-replacement sample of n = 100 (the same 100 tosses, 49 H and 51 T). Sampling with replacement from {H, T} is the same as sampling from a cloned population of N coins made of many copies of H and T.

HOW MANY SAMPLES ARE THERE ? 2 from 5

HOW MANY SAMPLES ARE THERE ? 30 from 500

HOW MANY SAMPLES ARE THERE ? there are far more samples with-replacement
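The counts themselves did not survive on these slides, so the following Python sketch rests on standard counting rules that are assumed here rather than taken from the lecture: unordered without-replacement samples are counted by a binomial coefficient, ordered without-replacement samples by N (N-1) ... (N-n+1), and ordered with-replacement samples by N^n.

```python
import math

def count_samples(N, n):
    """Return (unordered without, ordered without, ordered with replacement)."""
    return math.comb(N, n), math.perm(N, n), N ** n

for N, n in [(5, 2), (500, 30)]:
    unordered, ordered, with_repl = count_samples(N, n)
    print(N, n, unordered, ordered, with_repl)
```

Under these conventions the with-replacement count is always the largest of the three.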

CORRECTION TO PAGE 25 OF TEXT: They would have you believe the population is {8, 9, 12, 42} and the sample is {42}. A SET is a collection of distinct entities. WE SAMPLE COMPANIES; NUMBERS COME WITH THEM. Population: ATT 12, IBM 42, AAA 9, Pepsi 42, GM 8, Dow 9. Sample: Pepsi 42, Pepsi 42.

THE ROLE OF RANDOM SAMPLING: IF THE OVERWHELMING MAJORITY OF SAMPLES ARE “GOOD SAMPLES” THEN WE CAN OBTAIN A “GOOD” SAMPLE BY RANDOM SELECTION.

HOW TO SAMPLE RANDOMLY? SELECTING A LETTER AT RANDOM. Digits are made to correspond to letters: a = 00-02, b = 03-05, ..., z = 75-77. Random digits then give random letters. 1559 9068 ... (Table 14, pg. 809); 15 59 90 68 etc. (split into pairs); f t * w etc. (take chosen letters; * marks a pair, here 90, that corresponds to no letter). For samples without replacement just pass over any duplicates.
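The digit-to-letter scheme above can be sketched in Python. The function names are illustrative; the digit string is the one quoted from the table, not freshly generated random digits.

```python
import string

def pair_to_letter(pair):
    """Map a two-digit string '00'..'99' to a letter, or None if unused."""
    index = int(pair) // 3          # 00-02 -> a, 03-05 -> b, ..., 75-77 -> z
    return string.ascii_lowercase[index] if index < 26 else None

def letters_from_digits(digits, without_replacement=False):
    """Split digits into pairs and keep the letters they select."""
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    chosen = []
    for pair in pairs:
        letter = pair_to_letter(pair)
        if letter is None:
            continue                # pairs 78-99 correspond to no letter
        if without_replacement and letter in chosen:
            continue                # pass over duplicates
        chosen.append(letter)
    return chosen

print(letters_from_digits("15599068"))
```

Because every letter owns exactly three of the pairs 00-77 and the rest are discarded, each letter is equally likely.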

The Great Trick is far more powerful than we have seen. A typical sample closely estimates such things as a population mean or the shape of a population density. But it goes beyond this to reveal how much variation there is among sample means and sample densities. A typical sample not only estimates population quantities. It estimates the sample-to-sample variations of its own estimates.

EXAMPLE: ESTIMATING A MEAN. The average account balance is $421.34 for a random with-replacement sample of 50 accounts. We estimate from this sample that the average balance is $421.34 for all accounts. From this sample we also estimate and display a “margin of error”: $421.34 +/- $65.22 = mean +/- 1.96 s / root(n), where s denotes "sample standard deviation."

SAMPLE STANDARD DEVIATION NOTE: Sample standard deviation s may be calculated in several equivalent ways, some sensitive to rounding errors, even for n = 2.
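The formulas on this slide did not survive, so the following Python sketch assumes the two best-known equivalent forms: the deviation form and the algebraic shortcut using the sum of squares. The data in `large` are hypothetical, chosen to expose rounding trouble with n = 2.

```python
import math

def sd_two_pass(xs):
    """s = root( sum (x - mean)^2 / (n - 1) ), computed from deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """Algebraically equal form root( (sum x^2 - n mean^2) / (n - 1) )."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum(x * x for x in xs) - n * mean * mean
    return math.sqrt(max(ss, 0.0) / (n - 1))  # clamp: rounding can make ss negative

small = [12.2, 15.3]
large = [1e9, 1e9 + 1]   # hypothetical data, n = 2, true s = root(0.5)
print(sd_two_pass(small), sd_shortcut(small))
print(sd_two_pass(large), sd_shortcut(large))
```

On the small data the two forms agree. On the large data the shortcut subtracts two nearly equal huge numbers and loses the answer, while the deviation form does not.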

EXAMPLE: MARGIN OF ERROR CALCULATION. The following margin of error calculation for n = 4 is only an illustration. A sample of four would not be regarded as large enough. Profits per sale = {12.2, 15.3, 16.2, 12.8}. Mean = 14.125, s = 1.92765, root(4) = 2. Margin of error = +/- 1.96 (1.92765 / 2) = +/- 1.8891. Report: 14.125 +/- 1.8891. The interval 14.125 +/- 1.8891 is called a “95% confidence interval for the population mean.” A precise interpretation of margin of error will be given later in the course, including the role of 1.96. We used: s = root(11.1475 / 3), where (12.2-14.125)^2 + (15.3-14.125)^2 + (16.2-14.125)^2 + (12.8-14.125)^2 = 11.1475.
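The same arithmetic in Python, as a sketch using the standard `statistics` module (whose `stdev` uses the n - 1 divisor):

```python
import math
import statistics

profits = [12.2, 15.3, 16.2, 12.8]
n = len(profits)
mean = statistics.mean(profits)
s = statistics.stdev(profits)        # sample sd, divisor n - 1
margin = 1.96 * s / math.sqrt(n)     # 1.96 s / root(n)
print(round(mean, 3), round(s, 5), round(margin, 4))
```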

EXAMPLE: ESTIMATING A PERCENTAGE. A random with-replacement sample of 50 stores participated in a test marketing. In 39 of these 50 stores (i.e. 78%) the new package design outsold the old package design. We estimate from this sample that 78% of all stores will sell more of new vs old. We also estimate a “margin of error” of +/- 11.5%. Figured in the Binomial setup: 1.96 root(pHAT qHAT)/root(n) = 1.96 root((.78)(.22))/root(50) = 0.114823.
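A short Python check of the percentage margin of error. The variable names are illustrative.

```python
import math

n, successes = 50, 39
p_hat = successes / n                # 0.78
q_hat = 1 - p_hat                    # 0.22
margin = 1.96 * math.sqrt(p_hat * q_hat) / math.sqrt(n)
print(round(p_hat, 2), round(margin, 6))
```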

TENTING TONIGHT: Plot the average heights of tents placed at {10, 14}. Each tent has integral 1, as does their average. [Figure: the two tents and the density, their "average tent."]

THE MEAN OF A DENSITY IS TYPICALLY THE SAME AS THE MEAN OF THE DATA FROM WHICH IT IS MADE.

SMOOTHING DATA SO YOU CAN SEE IT: Plot the average heights of tents placed at {10, 14}. Each tent has integral 1, as does their average. [Figure: the two tents and the density, their "average tent."]

NARROWER TENTS = MORE DETAIL: Making the tents narrower isolates different parts of the data and reveals more detail.

THE DENSITY BY ITSELF: With narrow tents. [Figure: density.]

DENSITY OR HISTOGRAM? Histograms lump data into categories (the black boxes), which is not as good for continuous data. [Figure: density and histogram.]

DENSITY FOR {12, 21, 42, 8, 9}: Plot of average heights of 5 tents placed at data {12, 21, 42, 8, 9}. [Figure: tents and density.]
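A minimal Python sketch of the tent construction, assuming each tent is a symmetric triangle. The half-width `h = 5.0` is an assumption, since the slides do not state a tent width, and the crude Riemann sums are only there to check that the density has integral 1 and the same mean as the data.

```python
def tent(x, center, half_width):
    """Triangular 'tent' centred at `center`; the area under it is 1."""
    u = abs(x - center) / half_width
    return (1 - u) / half_width if u < 1 else 0.0

def tent_density(x, data, half_width):
    """Average height of the tents placed at each data value."""
    return sum(tent(x, d, half_width) for d in data) / len(data)

data = [12, 21, 42, 8, 9]
h = 5.0                        # assumed half-width; not given on the slides

# Crude Riemann sums over a grid wide enough to cover every tent
step = 0.01
grid = [i * step for i in range(int(60 / step))]
area = sum(tent_density(x, data, h) * step for x in grid)
mean = sum(x * tent_density(x, data, h) * step for x in grid)
print(round(area, 3), round(mean, 2), sum(data) / len(data))
```

Shrinking `h` gives the narrower, kinkier picture; enlarging it gives the smoother one.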

IS DETAIL ILLUSORY? Narrower tents operate at higher resolution but they may bring out features that are illusory. Which do we trust: kinkier or smoother?

BEWARE OVER-FINE RESOLUTION: Population of N = 500 compared with two samples of n = 30 each. POP mean = 32.02.

BEWARE OVER-FINE RESOLUTION: Population of N = 500 compared with two samples of n = 30 each. Sample means are close: SAM1 mean = 33.03, SAM2 mean = 30.60, POP mean = 32.02. Densities are not good at fine resolution.

WE DO BETTER AT COARSE RESOLUTION: The same two samples of n = 30 each from the population of N = 500. How about coarse resolution? SAM1 mean = 33.03, SAM2 mean = 30.60, POP mean = 32.02.
