Biostats I – Review Lecture 4 October 11, 2012
Last Time • Probabilities • Binomial Distribution • Normal Distribution
This Lecture • Sampling a population • Central Limit Theorem • t Distribution • Sampling distribution and CI for proportions
Sampling the Population • μ = true population mean • x bar = mean of the sample drawn from the population
Sampling the Population • To make inferences about the population mean we need a random sample from the population • The larger the sample, the more reliable our estimates of the population parameters • Because we do not know μ (true mean) and only know x bar (sample mean), we use confidence intervals to quantify our uncertainty
Sampling Distribution • If we draw many repeated samples from the population and compute the mean of each, the distribution of those sample means (the sampling distribution) approaches a normal distribution
Central Limit Theorem • Take a random variable in a population with • mean = μ • standard deviation = σ • Then the sampling distribution of the sample means (x bars) from repeated samples of size n has these properties: • If n is large enough, the sampling distribution will be approximately normal • The mean of the sampling distribution = μ • The standard deviation of the sampling distribution is σ/√n
Note • Population Standard Deviation • σ is the standard deviation of the original distribution • Standard Error • σ/√n is called the standard error, or more precisely, the standard error of the mean. It is the standard deviation of the distribution of the sample mean, not the standard deviation of the observations within one sample.
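The two claims above (mean of the sample means ≈ μ, standard deviation of the sample means ≈ σ/√n) can be checked with a small simulation. This is only a sketch: the exponential population with mean 250, the number of repeated samples, and the seed are assumptions chosen for illustration, not values from the lecture.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential with mean 250,
# for which the standard deviation is also 250.
mu, sigma = 250.0, 250.0
n = 50            # size of each sample
n_samples = 5000  # number of repeated samples (assumed)

# Draw repeated samples and keep the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = sigma / n ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.1f} (mu = {mu})")
print(f"SD of sample means:   {sd_of_means:.1f} (sigma/sqrt(n) = {standard_error:.1f})")
```

Even though the individual observations are strongly skewed, the sample means should cluster around μ with a spread close to σ/√n.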
Using the Central Limit Theorem • CD4 count in an HIV population • μ = 250 • σ = 200 • Repeated samples of size n = 50 • Mean of the sample means = μ = 250 • Standard error = σ/√n = 200/√50 = 28.3
Using the Central Limit Theorem • What proportion of samples will have a mean value < 100 cells/mm3? • Convert the sample mean to the standard normal distribution. Because we are working with sample means, divide by the standard error σ/√n, not by σ: • Z = (x bar - μ) / (σ/√n) • Z = (100 - 250) / (200/√50) • Z = -5.3 • P(Z < -5.3) < 0.0001
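The same calculation can be sketched in Python using the standard library's `statistics.NormalDist`, here standing in for a normal table or Stata's `normal()`. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 250, 200, 50   # values from the CD4 example
se = sigma / sqrt(n)          # standard error of the mean

# Standardise the cutoff of 100 using the standard error, not sigma.
z = (100 - mu) / se
p_below = NormalDist().cdf(z)            # P(Z < z)

# Equivalent: evaluate the sampling distribution N(mu, se) directly.
p_direct = NormalDist(mu, se).cdf(100)

print(f"SE = {se:.1f}, Z = {z:.1f}")
print(f"P(sample mean < 100) = {p_below:.2e}")
```

Using σ instead of σ/√n here would give Z = -0.75, which answers a different question: the proportion of individual patients, not of sample means, below 100.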
Using the Central Limit Theorem • What CD4 count is the 10th percentile of the sample means, i.e. the value with P(Z ≤ z) = 0.10? • Calculate the Z value for this proportion: di invnormal(.10) returns -1.2815516 • Transform back to a CD4 count value • -1.28 = (X - μ) / (σ/√n) = (X - 250) / (200/√50) • di -1.28155*200/sqrt(50) + 250 returns 213.75229
Using the Central Limit Theorem • What CD4 count cuts off the upper 2.5% of the sample means, i.e. the value with P(Z > z) = 0.025? • ** remember that Stata's invnormal takes the cumulative probability P(Z ≤ z), so use 1 - 0.025 = 0.975 • Calculate the Z value for this proportion: di invnormal(.975) returns 1.959964 • Transform back to a CD4 count value • 1.96 = (X - μ) / (σ/√n) = (X - 250) / (200/√50) • di 1.959964*200/sqrt(50) + 250 returns 305.43717
Using the Central Limit Theorem • By symmetry, the cutoff for the lower 2.5% is 250 - 1.96*200/√50 = 194.6, so we now have the cutoffs for the lower and upper 2.5% of the distribution of the sample means. • The interior area contains 95% of the sample means. • 95% of the means from samples of size 50 drawn from the underlying distribution with μ = 250 and σ = 200 will lie within the interval (194.6, 305.4)
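As a cross-check of the Stata results, the percentiles can be read straight off the sampling distribution of the mean. This is a sketch in which `NormalDist.inv_cdf` plays the role of `invnormal`; like `invnormal`, it takes the cumulative (lower-tail) probability.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 250, 200, 50
# Sampling distribution of the mean: normal with SD equal to the standard error.
sampling_dist = NormalDist(mu, sigma / sqrt(n))

p10 = sampling_dist.inv_cdf(0.10)     # 10th percentile of the sample means
lower = sampling_dist.inv_cdf(0.025)  # cuts off the lower 2.5%
upper = sampling_dist.inv_cdf(0.975)  # cuts off the upper 2.5%

print(f"10th percentile: {p10:.2f}")
print(f"central 95% of sample means: ({lower:.1f}, {upper:.1f})")
```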
Using the Central Limit Theorem • The interval for the means depends on the sample size n • As n increases, the width of the interval decreases
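To make the dependence on n concrete, the sketch below computes the width of the central 95% interval for a few sample sizes, using σ = 200 from the CD4 example. The sample sizes other than 50 are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

sigma = 200
z = NormalDist().inv_cdf(0.975)  # about 1.96

def interval_width(n):
    """Width of the central 95% interval for means of samples of size n."""
    return 2 * z * sigma / sqrt(n)

widths = {n: interval_width(n) for n in (25, 50, 100, 400)}
for n, width in widths.items():
    print(f"n = {n:3d}: width = {width:.1f}")
```

Because the width is proportional to 1/√n, quadrupling the sample size only halves the interval.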
Confidence intervals for means • Interval estimation provides a range of reasonable values that is intended to contain the population parameter (in this case μ) with a certain degree of confidence
Confidence intervals for means • We know from examining the standard normal distribution that P(-1.96 ≤ Z ≤ 1.96) = 0.95 • [Figure: standard normal curve with the central 95% of the area between -1.96 and 1.96 and 2.5% in each tail]
Calculating CI for means when we know the standard deviation • Thus the lower 95% confidence limit for µ is x bar - 1.96 × σ/√n • And the upper 95% confidence limit for µ is x bar + 1.96 × σ/√n • We say we are 95% confident that the interval we calculate using the above formulae includes µ
Calculating CI for means when we know the standard deviation • 90% confidence interval • Replace 1.96 in the formula with 1.64 • 99% confidence interval • Replace 1.96 in the formula with 2.58
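The multipliers for the different confidence levels come from the standard normal distribution, which the sketch below makes explicit. The helper names `z_multiplier` and `ci_known_sigma` are illustrative, and the sample mean of 240 is a hypothetical value, not data from the lecture; σ = 200 and n = 50 are reused from the CD4 example.

```python
from math import sqrt
from statistics import NormalDist

def z_multiplier(level):
    """Two-sided standard normal multiplier, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def ci_known_sigma(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    half_width = z_multiplier(level) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

xbar = 240           # hypothetical sample mean, for illustration only
sigma, n = 200, 50   # from the CD4 example

for level in (0.90, 0.95, 0.99):
    lo, hi = ci_known_sigma(xbar, sigma, n, level)
    print(f"{level:.0%}: multiplier {z_multiplier(level):.3f}, CI ({lo:.1f}, {hi:.1f})")
```

To three decimals the multipliers are 1.645, 1.960 and 2.576, which the slides quote to two decimals.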
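Interpreting confidence intervals for means • Before the sample is drawn, the probability that the interval we will calculate contains the true population mean is 95% • Once an interval has been calculated it either contains µ or it does not; the 95% describes the procedure • If we were to select 100 random samples from the population and calculate a confidence interval from each, approximately 95 of them would include the true population mean µ (and 5 would not)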
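The repeated-sampling interpretation can be illustrated by simulation. This sketch assumes a normal population with the CD4 parameters (μ = 250, σ = 200) and known σ; the number of simulated samples and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(2012)  # fixed seed so the sketch is reproducible

mu, sigma, n = 250, 200, 50
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / sqrt(n)  # same for every sample because sigma is known

n_intervals = 2000  # number of simulated samples (assumed)
covered = 0
for _ in range(n_intervals):
    # Draw one sample and build its 95% confidence interval.
    xbar = fmean(random.gauss(mu, sigma) for _ in range(n))
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / n_intervals
print(f"{covered} of {n_intervals} intervals contain mu ({coverage:.1%})")
```

The observed coverage should be close to, but not exactly, 95%, since it is itself subject to sampling variation.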
Confidence intervals for means • How to get a tighter interval? • Decrease the confidence level • Increase n
What do we do if we don’t know the standard deviation (σ)? • Use the Student t Distribution • If X is normally distributed, and a sample of size n is chosen, then t = (x bar - μ) / (s/√n), where s is the sample standard deviation, follows a Student’s t distribution with n-1 degrees of freedom
1) Calculating probability from t value • Option 1: use Table A.4, which gives P(T>t) at selected degrees of freedom • Option 2: use Stata • Stata is trying to confuse us even more! • Normal distribution: Stata gives P(Z<z) • Student’s t distribution: Stata gives P(T>t)
1) Calculating probability from t value • Stata code for P(T>t): ttail(df,t) • where df = n-1 • E.g., P(T>1.95) with n=20: display ttail(19,1.95) returns .03304428
2) Calculating t value from probability • For example, for what t is P(T>t)=.05 for a sample of size 20? • Stata code: invttail(df,p) • Example: display invttail(19,.05) returns 1.7291328
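Python's standard library has no t distribution, so the sketch below approximates the two Stata functions numerically: the upper-tail probability by Simpson's rule on the t density, and its inverse by bisection. The function names mirror Stata's `ttail` and `invttail` for readability only; this is an illustrative approximation, not how Stata computes them.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def ttail(df, t, steps=2000):
    """Upper-tail probability P(T > t), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    upper = 0.5 - area            # P(T > |t|), using symmetry about 0
    return upper if t >= 0 else 1 - upper

def invttail(df, p, lo=-100.0, hi=100.0):
    """Value t with P(T > t) = p, found by bisection (ttail decreases in t)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if ttail(df, mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"P(T > 1.95), df = 19: {ttail(19, 1.95):.5f}")
print(f"t with P(T > t) = .05, df = 19: {invttail(19, 0.05):.4f}")
```

Both results should agree with the Stata output above to several decimals. The approach is adequate for moderate degrees of freedom; `math.gamma` overflows for very large df.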
Normal approximation of the binomial distribution • Parameters • n = number of trials • p = proportion of success • np = mean • np(1-p) = variance • √(np(1-p)) = standard deviation • As n, the number of “trials”, increases, the binomial distribution more closely resembles the normal distribution
Normal approximation to the binomial distribution • Considered valid when np ≥ 5 and n(1-p) ≥ 5 • Why use it? • It is easier to use the normal distribution than to use Table A.1. For example, if n=50, p=.45, and you wanted to know P(X ≥ 30), then using Table A.1, which gives you P(X=x), you would need to find P(X=30) + P(X=31) + .... + P(X=50) • In Stata, however, the binomialtail function gives you P(X ≥ x) directly
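For the n=50, p=.45 example, the exact sum and the normal approximation can be compared directly. This sketch uses the plain approximation described above, a normal distribution with mean np and standard deviation √(np(1-p)), without any continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 50, 0.45, 30

# Rule of thumb for the approximation: np >= 5 and n(1-p) >= 5.
print(f"np = {n * p:.1f}, n(1-p) = {n * (1 - p):.1f}")

# Exact: P(X = 30) + P(X = 31) + ... + P(X = 50).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

# Normal approximation with the binomial mean and standard deviation.
mean = n * p
sd = sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mean, sd).cdf(x)

print(f"exact P(X >= 30)      = {exact:.4f}")
print(f"normal approximation  = {approx:.4f}")
```

The two values are of the same order but not identical: the approximation trades some accuracy for convenience, and the gap shrinks as n grows.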
Sampling distribution of a proportion • We are often more interested in the proportion of successes than in the number of successes • The true population proportion p is estimated by p̂ = x/n, where x = the number of successes or events and n = the number of trials or people or observations
Sampling distribution of a proportion • If we: • take repeated samples of size n from a variable that follows the Bernoulli distribution (i.e. the outcome is 0 or 1) • calculate p̂ = x/n for each of the samples (x = total count of successes) • then, if n is large enough, p̂ will approximately follow a normal distribution (by the central limit theorem) • The mean of this distribution is p • The standard deviation is √(p(1-p)/n), which is also called the standard error
Sampling distribution of proportions • So if p̂ follows a normal distribution with mean p and standard deviation √(p(1-p)/n) • Then Z = (p̂ - p) / √(p(1-p)/n) ~ N(0,1) • Considered valid when np ≥ 5 and n(1-p) ≥ 5
Sampling distribution of proportions • What proportion of samples of size n = 50 from a population with p = .10 will have a p̂ of .20 or higher? • Calculate the Z value for P(p̂ ≥ 0.20): • Z = (0.2 - 0.1) / √(0.1(1-0.1)/50) = 2.36 • Now we want P(Z ≥ 2.36) ** remember that Stata's normal() gives P(Z<z), so subtract from 1 ** . display 1-normal(2.36) returns .00913747
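The same steps can be sketched in Python, with `NormalDist().cdf` playing the role of Stata's `normal()`. Note that the sketch keeps Z unrounded, so its tail probability differs slightly from the value obtained with Z rounded to 2.36.

```python
from math import sqrt
from statistics import NormalDist

n, p, p_hat = 50, 0.10, 0.20

se = sqrt(p * (1 - p) / n)        # standard error of the sample proportion
z = (p_hat - p) / se              # standardised value of p-hat
p_upper = 1 - NormalDist().cdf(z) # cdf gives P(Z < z), so take the complement

print(f"SE = {se:.4f}, Z = {z:.2f}")
print(f"P(p-hat >= 0.20) = {p_upper:.4f}")
```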
Confidence intervals for proportions • Lower 95% confidence limit: p̂ - 1.96 × √(p(1-p)/n) • Upper 95% confidence limit: p̂ + 1.96 × √(p(1-p)/n) • However, we don’t know p (if we did we wouldn’t be calculating these intervals). So we substitute p̂ for p in the formula for the standard error. • Lower 95% confidence limit: p̂ - 1.96 × √(p̂(1-p̂)/n) • Upper 95% confidence limit: p̂ + 1.96 × √(p̂(1-p̂)/n)
Confidence intervals for proportions • HIV prevalence in those testing at Mulago Hospital • Sample size n = 3389 • Number HIV+ x = 1003 • Prevalence p̂ = 1003/3389 = 0.296 • (.296 - 1.96 × √(.296 × (1-.296)/3389), .296 + 1.96 × √(.296 × (1-.296)/3389)) • (.281, .311) • Interpretation: we are 95% confident that the interval (0.281, 0.311) includes the true HIV prevalence in the population
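The Mulago Hospital interval can be reproduced with a short function. This is a sketch of the large-sample (normal approximation) interval described above; the name `proportion_ci` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    se = sqrt(p_hat * (1 - p_hat) / n)             # p-hat substituted for p
    return p_hat, p_hat - z * se, p_hat + z * se

p_hat, lower, upper = proportion_ci(1003, 3389)
print(f"prevalence = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With a sample this large the interval is narrow, about 1.5 percentage points on either side of the estimate.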