MBP1010 - Lecture 2: January 14, 2009 • 1. Density curves and standard normal distribution • 2. Sampling distribution of the mean • 4. Confidence interval for the mean • Hypothesis testing (1 sample t test) • Reading: Introduction to the Practice of Statistics: 1.3, 3.4, 5.2, 6.1-6.4 and 7.1
Standard deviation vs standard error for describing data Table 1. Characteristics of study subjects (n=35)
Importance of Normal Distribution* 1. Distributions of real data are often close to normal. 2. It is mathematically easy to work with, so many statistical tests are designed for normal (or close to normal) distributions. 3. If the mean and SD of a normal distribution are known, you can make quantitative predictions about the population. * also called the Gaussian curve
Red bars = scores ≤ 6. Proportion = 0.303
Red area under the density curve = scores ≤ 6. Proportion = 0.293
The cumulative proportion for a value x is the proportion of all observations that are ≤ x; this is the area under the density curve to the left of x.
“The 68-95-99.7 Rule” Mean = 64.5 inches SD = 2.5 inches
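The rule can be checked numerically. The sketch below is only an illustration: it uses the mean and SD from the slide, assumes the variable is exactly normal, and the helper name `central_proportion` is made up for this example.

```python
from statistics import NormalDist

# Values from the slide: mean = 64.5 inches, SD = 2.5 inches.
# Assumption: the variable follows an exact normal distribution.
heights = NormalDist(mu=64.5, sigma=2.5)

def central_proportion(dist, k):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    lower = heights.mean - k * heights.stdev
    upper = heights.mean + k * heights.stdev
    share = central_proportion(heights, k)
    print(f"within {k} SD: {lower:.1f} to {upper:.1f} inches, proportion {share:.4f}")
```

The three proportions come out close to 0.68, 0.95 and 0.997, which is where the rule gets its name.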
The standard normal distribution is: a normal distribution with a mean of 0 and a SD of 1. Normal distributions can be transformed to standard normal distributions by the formula: $z = \frac{X - \mu}{\sigma}$ where X is a score from the original normal distribution, $\mu$ is the mean of the original normal distribution, and $\sigma$ is the standard deviation of the original normal distribution. The standard normal distribution is sometimes called the z distribution.
Z-score A z score always reflects the number of standard deviations a particular score lies above or below the mean. Ex. If a person scored 70 on a test with a mean of 50 and an SD of 10, then they scored 2 standard deviations above the mean. Converting the test scores to z scores, an X of 70 would be: $z = \frac{70 - 50}{10} = 2$. So, a z score of 2 means the original score was 2 SD above the mean.
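The test-score example can be reproduced in a few lines. This is a minimal sketch; `z_score` is a hypothetical helper name, and the cumulative proportion assumes the scores are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Example from the slide: score 70, mean 50, SD 10.
z = z_score(70, 50, 10)

# Assumption: scores are normal, so the standard normal curve gives
# the proportion of scores at or below this one.
cumulative = NormalDist().cdf(z)

print(f"z = {z}, cumulative proportion = {cumulative:.4f}")
```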
Z Scores • Provide a meaningful way to compare individuals from different normal distributions on the same scale • I.e., how many SDs above or below the mean? • E.g., - bone density measures - growth charts (height of children at different ages) - “normalized” data
Quantile-Quantile (Q-Q) Plot QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution is “normal”, we should observe a straight line.
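As a rough sketch of how the points on a normal Q-Q plot can be obtained, the code below pairs each ordered observation with a standard normal quantile. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the simulated data sets and the helper names. The correlation of the pairs is used only as a simple stand-in for "how straight is the line".

```python
import random
from statistics import NormalDist, mean

def normal_qq_points(data):
    """Return (theoretical, empirical) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def correlation(pairs):
    """Pearson correlation of the pairs; near 1 means nearly a straight line."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
normal_data = [rng.gauss(64.5, 2.5) for _ in range(200)]   # simulated, normal
skewed_data = [rng.expovariate(1.0) for _ in range(200)]   # simulated, right-skewed

r_normal = correlation(normal_qq_points(normal_data))
r_skewed = correlation(normal_qq_points(skewed_data))
print(f"normal data: r = {r_normal:.3f}, skewed data: r = {r_skewed:.3f}")
```

Plotting the pairs with any charting tool gives the Q-Q plot itself; the skewed sample bends away from the line at the upper end.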
Rice Virtual Lab in Statistics http://onlinestatbook.com/rvls/ Hyperstat Online Section 5. Normal Distribution - theory
Populations and Samples Population: entire group of individuals that we want information about Sample: a part of the population that we actually examine in order to gather information Goal: to try to draw conclusions about the population from the sample
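[Diagram] Whole population: mean = $\mu$, SD = $\sigma$. A sample is drawn from the population: mean = $\bar{x}$, SD = $s$. Inference runs from the sample back to the population.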
Parameter: - a number that describes the population - the number is fixed, but in practice we do not know its value (e.g., $\mu$) Statistic: - a number that describes a sample (e.g., $\bar{x}$) - its value is known when we take a sample, but it can change from sample to sample - often used to estimate an unknown parameter
Statistical inference is the process by which we draw conclusions about the population from the results observed in a sample. Two main methods are used in inferential statistics: estimation and hypothesis testing. In estimation, the sample is used to estimate a parameter and a confidence interval about the estimate is constructed.
Random Sampling is Key! - every individual in the population sampled must have a chance of being included in the sample - the choice of one subject does not influence the chance of other subjects being chosen - use a method of sampling in which chance alone operates - toss of a coin, draw from a hat - random number generators - random assignment in clinical trials results in randomly selected groups
Simple Random Sampling (SRS) - each individual in the population has an equal chance of being selected - every possible sample of a given size has an equal chance of being chosen Stratified Sampling - divide the population into strata - choose an SRS in each stratum - combine these SRSs to form the full sample e.g., strata: prognostic factors in cancer patients; male/female; age - consult a statistician for more complex sampling
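The two sampling schemes above can be sketched with a random number generator. Everything concrete here is hypothetical: the 60-patient population, the `sex` field, the sample sizes, and the function names. An equal number per stratum is just one possible allocation.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every subset of size n is equally likely (sampling without replacement)."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the population into strata, take an SRS in each, and combine them."""
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    combined = []
    for label in sorted(strata):
        combined.extend(rng.sample(strata[label], n_per_stratum))
    return combined

rng = random.Random(1)
# Hypothetical population: 60 patients, 20 male and 40 female.
population = [{"id": i, "sex": "M" if i % 3 == 0 else "F"} for i in range(60)]

srs = simple_random_sample(population, 12, rng)
strat = stratified_sample(population, lambda p: p["sex"], 6, rng)

print("SRS males:", sum(p["sex"] == "M" for p in srs))
print("Stratified males:", sum(p["sex"] == "M" for p in strat))
```

The number of males in the SRS varies from run to run, while the stratified sample always contains exactly 6 from each stratum.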
Sample mean ($\bar{x}$) as an estimator of the population mean ($\mu$) What would happen if we repeated the sample several times? Sampling variability: - repeated samples from the same population will not have the same mean - depends partly on how variable the underlying population is and on the size of the sample selected
Sampling Distribution of $\bar{x}$ - the distribution of values taken by the mean ($\bar{x}$) in all possible samples of the same size from the same population
1. Mean of the sampling distribution of $\bar{x}$ = $\mu$ 2. SD of the sampling distribution = $\sigma/\sqrt{n}$ - called the standard error of the mean 3. Shape of the sampling distribution is approximately a normal curve, regardless of the shape of the population distribution, provided n is large enough (Central Limit Theorem)
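These three properties can be explored by simulation. The sketch below is not the course's simulation: it assumes a made-up, right-skewed (exponential) population, samples with replacement, and uses n = 30 and 2000 repeats purely for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2009)

# Hypothetical population: strongly right-skewed, so clearly not normal.
population = [rng.expovariate(1.0) for _ in range(20000)]
mu = mean(population)
sigma = pstdev(population)

n = 30          # size of each sample (assumed)
repeats = 2000  # number of repeated samples (assumed)

# Draw repeated samples (with replacement) and record each sample mean.
sample_means = [mean(rng.choices(population, k=n)) for _ in range(repeats)]

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)
predicted_se = sigma / n ** 0.5

print(f"population mean {mu:.3f}, mean of sample means {mean_of_means:.3f}")
print(f"sigma/sqrt(n) {predicted_se:.3f}, SD of sample means {sd_of_means:.3f}")
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is skewed.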
Simulation of Sampling Distribution Central Limit Theorem Rice Virtual Lab in Statistics http://onlinestatbook.com/rvls/
Population: All MBP1010 students (n=37): $\mu$ = 1.00 cup, $\sigma$ = 1.07 cups
Population (n=37): $\mu$ = 1.00, $\sigma$ = 1.07. One randomly selected sample (n=12): $\bar{x}$ = 0.875, s = 0.78
Population (n=37): $\mu$ = 1.00, $\sigma$ = 1.07. Sampling distribution (1000 repeats of n=12): mean = 1.00, SD = 0.26
Population (n=37): $\mu$ = 1.00, $\sigma$ = 1.07. Sampling distribution (1000 repeats of n=12): mean = 1.00, SD = 0.26. One sample (n=12): $\bar{x}$ = 0.875, s = 0.78, SEM = $s/\sqrt{n}$ = 0.23
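The numbers on this slide can be checked with a few lines of arithmetic. The SEM line follows directly from the slide. The finite population correction is an addition made here, not something stated in the lecture: it assumes the 1000 repeated samples were drawn without replacement from the 37 students, which would explain why the simulated SD (0.26) is smaller than the population SD divided by the square root of n (about 0.31).

```python
from math import sqrt

N = 37        # population size (from the slide)
n = 12        # sample size (from the slide)
sigma = 1.07  # population SD in cups (from the slide)
s = 0.78      # SD of the one selected sample (from the slide)

# Standard error of the mean estimated from the single sample.
sem_from_sample = s / sqrt(n)

# Standard error using the known population SD.
se_from_sigma = sigma / sqrt(n)

# Assumption: sampling without replacement from a small population,
# so the finite population correction applies.
fpc = sqrt((N - n) / (N - 1))
se_without_replacement = se_from_sigma * fpc

print(f"SEM from sample:          {sem_from_sample:.3f}")
print(f"sigma / sqrt(n):          {se_from_sigma:.3f}")
print(f"with finite pop. corr.:   {se_without_replacement:.3f}")
```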
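95% Confidence Interval [Figure: standard normal curve with central area = 0.95 between z = -1.96 (2.5th percentile) and z = 1.96 (97.5th percentile), and area = 0.025 in each tail]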
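95% Confidence Interval for a population mean If population $\sigma$ known (not realistic) Express $\bar{x}$ in standardized form (z statistic): $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ Pr($-1.96 \le z \le 1.96$) = 0.95 Pr($-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96$) = 0.95 Pr($\bar{x} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{x} + 1.96\,\sigma/\sqrt{n}$) = 0.95 $\bar{x} - 1.96(\sigma/\sqrt{n})$ and $\bar{x} + 1.96(\sigma/\sqrt{n})$ are the 95 percent confidence limits on the population mean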
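A minimal sketch of the known-SD interval follows. The function name is hypothetical, and the inputs reuse the class example from the earlier slides (sample mean 0.875, population SD 1.07, n = 12) purely as an illustration; it assumes the sample mean is close enough to normal for the z statistic to apply.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, level=0.95):
    """Confidence limits for the mean when the population SD is known."""
    # Critical value: 1.96 for a 95% interval, 1.645 for a 90% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

z95 = NormalDist().inv_cdf(0.975)
z90 = NormalDist().inv_cdf(0.95)

lo95, hi95 = z_confidence_interval(0.875, 1.07, 12, level=0.95)
lo90, hi90 = z_confidence_interval(0.875, 1.07, 12, level=0.90)

print(f"z critical values: {z95:.3f} (95%), {z90:.3f} (90%)")
print(f"95% CI: {lo95:.2f} to {hi95:.2f} cups")
print(f"90% CI: {lo90:.2f} to {hi90:.2f} cups")
```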
24 out of 25 samples included $\mu$ (96%) In the long run, 95% of all samples will have an interval that includes $\mu$.
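90% Confidence Interval [Figure: standard normal curve with central area = 0.90 between z = -1.645 (5th percentile) and z = 1.645 (95th percentile), and area = 0.05 in each tail]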
Confidence Interval for a population mean population $\sigma$ NOT known (usual) - use the sample standard deviation (s) as an estimate of $\sigma$ - therefore, $\sigma/\sqrt{n}$ is estimated from the sample using $s/\sqrt{n}$ (standard error of the mean; SE) - the SE of the sample is an estimate of the SD that would be obtained from the means of a large number of samples drawn from that population
Problem: Critical ratio = $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ is not normally distributed - need to consider the reliability of both $\bar{x}$ and s as estimators of $\mu$ and $\sigma$ respectively - shape of the distribution depends on the sample size n Therefore $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ follows the t distribution
t - distribution - a family of distributions indexed by the degrees of freedom (n-1) - degrees of freedom refer to number of independent quantities among a series of numerical quantities
Degrees of Freedom For SD: - there are n deviations around the mean - there is one restriction: sum of deviations = 0 - therefore once we have calculated n-1 deviations around the mean, the last one is already determined because the sum must be 0 (i.e., it is not independent) - for n deviations around the mean there are n-1 degrees of freedom (DF)
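The restriction on the deviations is easy to see with a small data set. The five values below are made up for illustration; the sketch shows that the last deviation can be recovered from the other n-1, and that dividing by n-1 matches the usual sample SD.

```python
from math import sqrt
from statistics import mean, stdev

data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical measurements
n = len(data)
xbar = mean(data)

# n deviations around the mean; they always sum to zero.
deviations = [x - xbar for x in data]

# The last deviation is fixed once the first n-1 are known.
last_from_others = -sum(deviations[:-1])

# Sample SD divides the sum of squared deviations by the n-1 degrees of freedom.
s_manual = sqrt(sum(d * d for d in deviations) / (n - 1))

print("sum of deviations:", round(sum(deviations), 10))
print("last deviation:", round(deviations[-1], 10), "recovered:", round(last_from_others, 10))
print("SD with n-1 in the denominator:", round(s_manual, 4))
```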
95% Confidence Interval for a population mean population $\sigma$ NOT known (usual) A sample consists of 25 mice with a mean tumor size of 2.1 cm and SD = 1.9 cm. $\bar{x} - t_{24,0.975} \times s/\sqrt{n}$, $\bar{x} + t_{24,0.975} \times s/\sqrt{n}$ $t_{24,0.975}$ = 2.064 (from tables of the t distribution) $2.1 - (2.064 \times 1.9/\sqrt{25})$, $2.1 + (2.064 \times 1.9/\sqrt{25})$ = 1.32, 2.88 cm
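The mouse example can be reproduced as follows. This is a sketch with a hypothetical function name; the critical value 2.064 is taken from the slide (a t table) rather than computed, since the Python standard library has no t distribution.

```python
from math import sqrt

def t_confidence_interval(xbar, s, n, t_crit):
    """Confidence limits for the mean using a tabulated t critical value."""
    se = s / sqrt(n)  # standard error of the mean
    return xbar - t_crit * se, xbar + t_crit * se

# Values from the slide: n = 25 mice, mean 2.1 cm, SD 1.9 cm,
# t critical value for 24 degrees of freedom at 0.975 is 2.064.
lower, upper = t_confidence_interval(xbar=2.1, s=1.9, n=25, t_crit=2.064)

print(f"95% CI: {lower:.2f} to {upper:.2f} cm")
```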
Confidence interval for a Mean Estimate of mean tumor size = 2.1 cm; n=25. 95% CI = 1.32 , 2.88 cm Interpretation: - 95% of the intervals that could be constructed from repeated random samples of size 25 contain the true population mean - we are 95% confident that the mean tumor size is between 1.32 and 2.88 cm.
Factors affecting the length of the confidence interval $\bar{x} \pm t_{n-1,0.975} \times s/\sqrt{n}$, where $s/\sqrt{n}$ = SE Sample size: as n increases, the length of the CI decreases Variation: as s, which reflects the variability of the distribution of observations, increases, the length of the CI increases Level of confidence: as the confidence desired increases (i.e., 90, 95, 99% CI), the length of the CI increases.
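Each of the three factors can be varied in turn to see its effect on the interval length. The sketch below starts from the mouse example (s = 1.9, n = 25); the alternative sample sizes, SDs and confidence levels are hypothetical, and the t critical values are hard-coded from standard t tables rather than computed.

```python
from math import sqrt

def ci_length(s, n, t_crit):
    """Total length of the interval: 2 * t * s / sqrt(n)."""
    return 2 * t_crit * s / sqrt(n)

# Sample size: t(0.975) critical values for n-1 = 9, 24 and 99 degrees of freedom.
T_975_BY_N = {10: 2.262, 25: 2.064, 100: 1.984}
lengths_by_n = {n: ci_length(1.9, n, t) for n, t in T_975_BY_N.items()}

# Variation: same n and confidence level, different sample SDs.
lengths_by_s = {s: ci_length(s, 25, 2.064) for s in (1.0, 1.9, 3.0)}

# Level of confidence: 24 degrees of freedom, 90%, 95% and 99% intervals.
T_24_BY_LEVEL = {0.90: 1.711, 0.95: 2.064, 0.99: 2.797}
lengths_by_level = {lvl: ci_length(1.9, 25, t) for lvl, t in T_24_BY_LEVEL.items()}

print("by n:    ", {k: round(v, 2) for k, v in lengths_by_n.items()})
print("by s:    ", {k: round(v, 2) for k, v in lengths_by_s.items()})
print("by level:", {k: round(v, 2) for k, v in lengths_by_level.items()})
```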
Standard deviation vs standard error for describing data Table 1. Characteristics of study subjects (n=35)
Standard deviation vs standard error for describing data If the purpose is to describe the data (e.g., to see if subjects are typical): standard deviation - variability of the observations If the purpose is to describe the results (outcome) of the study: standard error or confidence interval - precision of the estimate of a population parameter • Note: • can calculate one from the other (SE = SD/$\sqrt{n}$) • indicate clearly whether reporting SD or SE
What Formal Statistical Inference Cannot Do • tell you what population you should be interested in • ensure that you sampled properly from the population • determine whether measurements made are biased (systematically wrong) • DOES: - give a quantitative indication of how much random variation may have affected your results
What/who are we trying to study? Example 1: Target population: patients with rheumatoid arthritis; Population sampled: patients admitted to a particular hospital; Sample studied: sample of records of above patients. Example 2: Target population: all voters; Population sampled: telephone listings; Sample studied: sample of above listings.