Biostats I – Review Lecture 4

October 11, 2012

Last Time • Probabilities • Binomial Distribution • Normal Distribution

This Lecture • Sampling a population • Central Limit Theorem • t Distribution • Sampling distribution and CI for proportions

Sampling the Population • μ = true population mean • x̄ (x bar) = mean of a sample drawn from the population

Sampling the Population • To make inferences about the population mean we need a random sample from the population • The larger the sample, the more reliable our estimates of the population parameters • Because we do not know μ (the true mean) and only know x bar (the sample mean), we use confidence intervals to quantify our uncertainty

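Sampling Distribution • If we draw many samples of the same size from the population and compute the mean of each, the distribution of those sample means (the sampling distribution) approaches a normal distribution as the sample size increases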

Central Limit Theorem • Take a random variable in a population with • mean = μ • standard deviation = σ • Then, for the sampling distribution of the sample means (x bars) from repeated samples of size n: • If n is large enough, the sampling distribution will be approximately normal • The mean of the sampling distribution = μ • The standard deviation of the sampling distribution is σ/√n

Note • Population standard deviation • σ is the standard deviation of the original distribution • Standard error • σ/√n is called the standard error, or more precisely the standard error of the mean; it is the standard deviation of the distribution of the sample mean

Central Limit Theorem – uniform random variables
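
The uniform case can be checked with a small simulation. This is only a sketch: the seed, the sample size of 50, and the 2000 replications are illustrative choices, not values from the lecture. Each row of the dataset holds the mean of 50 Uniform(0,1) draws, so the column of means is an approximation to the sampling distribution.

```stata
* Sketch: sampling distribution of the mean of n = 50 Uniform(0,1) draws.
* Seed, n, and number of replications are illustrative assumptions.
clear
set seed 12345
set obs 2000                       // 2000 replications, one sample mean per row
generate double xbar = 0
forvalues i = 1/50 {
    quietly replace xbar = xbar + runiform()   // add one new draw to every row
}
quietly replace xbar = xbar/50     // turn each sum into a mean
summarize xbar                     // mean near 0.5, SD near the value below
display "theoretical SE = " sqrt(1/12)/sqrt(50)
* histogram xbar, normal           // optional: overlay a normal curve
```

A Uniform(0,1) variable has mean 0.5 and variance 1/12, so the simulated standard deviation of the means should be close to √(1/12)/√50, about 0.041.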

Central Limit Theorem – Chi-square distribution

Central Limit Theorem – Bimodal distribution

Using the Central Limit Theorem • HIV population CD4 count • μ = 250 • σ = 200 • Repeated samples of size n = 50 • Mean of the sample means (x bars) = μ = 250 • Standard error = σ/√n = 200/√50 = 28.3

Using the Central Limit Theorem • What proportion of samples will have a mean value (x bar) < 100 cells/mm3? • Convert the sample mean to a standard normal value: • Z = (x bar – μ)/(σ/√n) ** remember to divide by the standard error σ/√n, not by σ • Z = (100 – 250)/(200/√50) • Z = -5.3 • P(Z < -5.3) < 0.0001
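
The same calculation can be done in Stata. This is a minimal sketch using the CD4 values from the slide; the scalar names are illustrative.

```stata
* Standard error of the mean and Z value for a sample mean of 100
scalar mu    = 250
scalar sigma = 200
scalar n     = 50
scalar se    = sigma/sqrt(n)          // about 28.3
scalar z     = (100 - mu)/se          // about -5.30
display "SE = " se
display "Z  = " z
display "P(xbar < 100) = " normal(z)  // lower tail, far below 0.0001
```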

Using the Central Limit Theorem • What CD4 count is the 10th percentile of the sample means, i.e. P(Z ≤ z) = 0.10? • Calculate the Z value for this proportion: di invnormal(.10) -1.2815516 • Transform back to a CD4 count value • -1.28 = (x bar – μ)/(σ/√n) = (x bar – 250)/(200/√50) • di -1.28155*200/sqrt(50) + 250 213.75229

Using the Central Limit Theorem • What CD4 count cuts off the upper 2.5% of the sample means, i.e. P(Z > z) = 0.025? • ** remember that Stata's invnormal works with the lower tail P(Z ≤ z), so use 1 – 0.025 = 0.975 • Calculate the Z value for this proportion: di invnormal(.975) 1.959964 • Transform back to a CD4 count value • 1.96 = (x bar – μ)/(σ/√n) = (x bar – 250)/(200/√50) • di 1.96*200/sqrt(50) + 250 305.43717

Using the Central Limit Theorem • The lower 2.5% cutoff is found the same way with invnormal(.025) = -1.96, giving 250 – 1.96*200/√50 = 194.6 • Now we have the lower and upper 2.5% cutoffs of the distribution of the sample means • The interior area contains 95% of the sample means • 95% of the means from samples of size 50 drawn from the underlying distribution (μ = 250, σ = 200) will lie within the interval (194.6, 305.4)
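
Both cutoffs of this interval can be computed in one step. A short sketch, with scalar names chosen here for illustration:

```stata
* Middle 95% of the sample means for mu = 250, sigma = 200, n = 50
scalar se  = 200/sqrt(50)
scalar lcl = 250 + invnormal(.025)*se   // about 194.6
scalar ucl = 250 + invnormal(.975)*se   // about 305.4
display "middle 95% of sample means: (" %5.1f lcl ", " %5.1f ucl ")"
```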

Using the Central Limit Theorem • The interval for the means depends on the sample size n • As n increases, the width of the interval decreases
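
To see how the width shrinks, the interval width 2 × 1.96 × σ/√n can be tabulated for a few sample sizes. The sample sizes below are arbitrary illustrations; σ = 200 is taken from the CD4 example.

```stata
* Width of the middle-95% interval for several sample sizes (sigma = 200)
foreach n of numlist 25 50 100 400 {
    display "n = " %3.0f `n' "   width = " %6.1f 2*invnormal(.975)*200/sqrt(`n')
}
```

Because the width is proportional to 1/√n, quadrupling the sample size (for example from 25 to 100) halves the width.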

Confidence intervals for means • Interval estimation provides a range of reasonable values that is intended to contain the population parameter (in this case μ) with a certain degree of confidence

Confidence intervals for means • We know from examining the standard normal distribution that P(-1.96 ≤ Z ≤ 1.96) = 0.95 • The central 95% of the area lies between -1.96 and 1.96, with 2.5% in each tail

Calculating CI for means when we know the standard deviation • Thus the lower 95% confidence limit for µ is $\bar{x} - 1.96\,\sigma/\sqrt{n}$ • And the upper 95% confidence limit for µ is $\bar{x} + 1.96\,\sigma/\sqrt{n}$ • We say we are 95% confident that the interval we calculate using the above formulae includes µ
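
As a worked illustration, suppose one sample of 50 patients gave a mean CD4 count of 240. That sample mean is hypothetical and not from the lecture; σ = 200 and n = 50 follow the earlier CD4 example.

```stata
* 95% CI for mu with known sigma; xbar = 240 is a hypothetical sample mean
scalar xbar  = 240
scalar sigma = 200
scalar n     = 50
scalar lcl   = xbar - 1.96*sigma/sqrt(n)   // about 184.6
scalar ucl   = xbar + 1.96*sigma/sqrt(n)   // about 295.4
display "95% CI for mu: (" %5.1f lcl ", " %5.1f ucl ")"
```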

Calculating CI for means when we know the standard deviation • 90% confidence interval • Replace 1.96 in the formula with 1.645 • 99% confidence interval • Replace 1.96 in the formula with 2.58
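
These multipliers come from the standard normal distribution. For a two-sided interval with confidence level C, the multiplier is invnormal(1 – (1 – C)/2), which can be checked directly:

```stata
* Normal multipliers for two-sided confidence intervals
display "90%: " invnormal(.95)     // 1.6448536
display "95%: " invnormal(.975)    // 1.959964
display "99%: " invnormal(.995)    // 2.5758293
```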

Interpreting confidence intervals for means • The 95% describes the procedure: before the sample is drawn, the probability that the interval we will calculate contains the true population mean is 95%. Once calculated, a particular interval either contains µ or it does not • If we were to select 100 random samples from the population and calculate confidence intervals for each, approximately 95 of them would include the true population mean µ (and 5 would not)
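
The repeated-sampling interpretation can be illustrated with a simulation. This sketch assumes, purely for illustration, that CD4 counts are normally distributed with μ = 250 and σ = 200 (real CD4 counts are not normal), and it uses 2000 replications and an arbitrary seed. Each row holds one sample mean from n = 50 and the corresponding known-σ interval.

```stata
* Sketch: share of 95% CIs (known sigma) that cover the true mean mu = 250
clear
set seed 2012
set obs 2000                           // 2000 hypothetical repeated samples
generate double xbar = 0
forvalues i = 1/50 {
    quietly replace xbar = xbar + rnormal(250, 200)
}
quietly replace xbar = xbar/50         // sample mean for each replication
generate double lcl = xbar - 1.96*200/sqrt(50)
generate double ucl = xbar + 1.96*200/sqrt(50)
generate byte covered = (lcl <= 250) & (250 <= ucl)
summarize covered                      // mean should be close to 0.95
```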

Confidence intervals for means • How to get a tighter interval? • Decrease the confidence level • Increase n

What do we do if we don’t know the standard deviation (σ)? • Estimate σ with the sample standard deviation s and use the Student t distribution • If X is normally distributed, and a sample of size n is chosen, then $t = (\bar{x} - \mu)/(s/\sqrt{n})$ follows a Student’s t distribution with n-1 degrees of freedom

The t-distribution (t not z)

1) Calculating probability from t value • Using the table • Table A.4 gives P(T>t) at selected degrees of freedom • Using Stata • Stata is trying to confuse us even more! • Normal distribution: normal(z) gives the lower tail, P(Z<z) • Student's t distribution: ttail(df,t) gives the upper tail, P(T>t)

1) Calculating probability from t value • Stata code for P(T>t) • ttail(df,t) • Where df = n-1 • E.g., P(T>1.95) with n = 20 (df = 19): display ttail(19,1.95) .03304428

2) Calculating t value from probability • For example, for what t is P(T>t)=.05 for a sample of size 20? • Stata code • invttail(df,p) • Example display invttail(19,.05) 1.7291328
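
For a two-sided 95% confidence interval the upper-tail probability is .025, so the t multiplier is invttail(n-1, .025). The sketch below compares it with the normal multiplier for a few illustrative sample sizes (the sample sizes are arbitrary choices):

```stata
* t multiplier vs. normal multiplier for a two-sided 95% interval
foreach n of numlist 5 20 50 200 {
    display "n = " %3.0f `n' "   t = " %6.4f invttail(`n'-1, .025) "   z = " %6.4f invnormal(.975)
}
```

The t multiplier is always larger than 1.96 and approaches it as n grows; for n = 20 it is about 2.093.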

Normal approximation of the binomial distribution • Parameters • n = number of trials • p = probability of success • np = mean • np(1-p) = variance • √(np(1-p)) = standard deviation • As n, the number of “trials”, increases, the binomial distribution more closely resembles the normal distribution

Normal approximation to the binomial distribution • Considered valid when np≥5 and n(1-p)≥5 • Why use it? • It is easier to use the normal distribution than to use Table A.1. For example, if n=50, p=.45, and you wanted to know P(X≥30), using Table A.1, which gives you P(X=x), you would need to find P(X=30) + P(X=31) + ... + P(X=50) • In Stata, however, the binomialtail function gives P(X≥x) directly
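
The example with n = 50 and p = .45 can be used to compare the exact binomial tail with the normal approximation. This is a sketch; the continuity-corrected version (using 29.5 instead of 30) is a common refinement that the lecture does not mention and is included here only for comparison.

```stata
* P(X >= 30) for n = 50, p = .45: exact vs. normal approximation
scalar n  = 50
scalar p  = .45
scalar mu = n*p                                // mean = 22.5
scalar sd = sqrt(n*p*(1-p))                    // about 3.52
scalar exact   = binomialtail(n, 30, p)        // exact P(X >= 30)
scalar approx  = 1 - normal((30 - mu)/sd)      // plain normal approximation
scalar approxc = 1 - normal((29.5 - mu)/sd)    // with continuity correction
display "exact = " exact
display "normal approximation = " approx
display "with continuity correction = " approxc
```

Here np = 22.5 and n(1-p) = 27.5, so the rule of thumb for using the approximation is satisfied.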

Sampling distribution of a proportion • We are often more interested in the proportion of successes than in the number of successes • The true population proportion p is estimated by p̂ = x/n, where x = the number of successes or events and n = the number of trials or people or observations

Sampling distribution of a proportion • If we take: • repeated samples of size n from a variable that follows the Bernoulli distribution (i.e. the outcome is 0 or 1) • calculate p̂ = x/n for each of the samples (x = total count of successes) • if n is large enough, then p̂ will approximately follow a normal distribution (by the central limit theorem) • The mean of this distribution is p • The standard deviation is $\sqrt{p(1-p)/n}$, which is also called the standard error

Sampling distribution of proportions • So if p̂ follows a normal distribution with mean p and standard deviation $\sqrt{p(1-p)/n}$ • Then $Z = (\hat{p} - p)/\sqrt{p(1-p)/n}$ ~ N(0,1) • Considered valid when np≥5 and n(1-p)≥5

Sampling distribution of proportions • What proportion of samples of size n = 50 from a population with p = .10 will have a p̂ of .20 or higher? • Calculate the Z value for P(p̂ ≥ 0.20): • Z = (0.2 – 0.1)/√(0.1(1-0.1)/50) = 2.36 • Now we want P(Z ≥ 2.36) ** remember that normal() gives the lower tail, so subtract from 1 ** . display 1-normal(2.36) .00913747
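
The whole calculation, including the np ≥ 5 check, can be scripted. A minimal sketch with illustrative scalar names; because it keeps the unrounded Z value, the probability differs slightly from the one obtained with Z = 2.36.

```stata
* P(phat >= .20) for samples of size 50 when the true p = .10
scalar n  = 50
scalar p  = .10
scalar se = sqrt(p*(1-p)/n)           // about 0.0424
scalar z  = (.20 - p)/se              // about 2.36
display "np = " n*p "   n(1-p) = " n*(1-p)   // both should be at least 5
display "Z = " z
display "P(phat >= .20) = " 1-normal(z)      // about 0.009
```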

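Confidence intervals for proportions • Lower 95% confidence limit: $\hat{p} - 1.96\sqrt{p(1-p)/n}$ • Upper 95% confidence limit: $\hat{p} + 1.96\sqrt{p(1-p)/n}$ • However we don’t know p (if we did we wouldn’t be calculating these intervals). So we substitute p̂ for p in the formula for the standard error. • Lower 95% confidence limit: $\hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ • Upper 95% confidence limit: $\hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n}$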

Confidence intervals for proportions • HIV prevalence in those testing at Mulago Hospital • Sample size n = 3389 • Number HIV+ x = 1003 • Prevalence p̂ = 1003/3389 = 0.296 • (.296 – 1.96*√[.296*(1-.296)/3389], .296 + 1.96*√[.296*(1-.296)/3389]) • (.281, .311) • Interpretation: we are 95% confident that the interval (0.281, 0.311) includes the true HIV prevalence in the population
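
The Mulago Hospital interval can be reproduced with a few scalars. This is a sketch of the normal-approximation formula used on the slide; the scalar names are illustrative.

```stata
* 95% CI for a proportion: 1003 HIV+ out of 3389 tested
scalar n    = 3389
scalar x    = 1003
scalar phat = x/n                       // about 0.296
scalar se   = sqrt(phat*(1-phat)/n)     // about 0.0078
scalar lcl  = phat - invnormal(.975)*se
scalar ucl  = phat + invnormal(.975)*se
display "phat = " %5.3f phat
display "95% CI: (" %5.3f lcl ", " %5.3f ucl ")"   // (0.281, 0.311)
```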
