Lecture #4. Descriptive Statistics Other descriptive measures Displaying data in tables and graphs. Measures of Variability. Consider the following two data sets on the ages of all patients suffering from bladder cancer and prostatic cancer. The mean age of the two groups is 40 years.
Lecture #4 Descriptive Statistics Other descriptive measures Displaying data in tables and graphs
Measures of Variability • Consider the following two data sets on the ages of all patients suffering from bladder cancer and prostatic cancer. • The mean age of both groups is 40 years. • If we did not know the ages of the individual patients and were told only that the mean age is the same in the two groups, we might wrongly conclude that the two groups have a similar age distribution. • In fact, the variation in the patients' ages is very different in the two groups. • The ages of the prostatic cancer patients vary much more than the ages of the bladder cancer patients.
Measures of Variability • Measure the “spread” in the data • Some important measures • Range • Mean deviation • Variance • Standard Deviation • Coefficient of variation • Interquartile Range
Variability • The purpose of the majority of medical, behavioural and social science research is to explain or account for variance or differences among individuals or groups. Examples • What factors account for the variance (or difference) in IQ among individuals? • What factors account for the variance in treatment compliance among different groups of patients?
Range • The range tells us the span over which the data are distributed, and is only a very rough measure of variability • Range: The difference between the maximum and minimum scores • Example: The largest amount of tips made in a night is $270 and the smallest is $150. Therefore, the range of tips made that night is $270 – $150 = $120 • Range is the simplest measure of dispersion. • It is not the best measure of dispersion, as it depends entirely on the two extreme scores and tells us nothing about the middle values.
Variation • This is an example of data with NO variability
X    X − mean
5    0.00
5    0.00
5    0.00
5    0.00
5    0.00
ΣX = 25, n = 5, mean = 5
Variation • This is an example of data with low variability
X    X − mean
6    +1.00
4    -1.00
6    +1.00
5    0.00
4    -1.00
ΣX = 25, n = 5, mean = 5
Variation • This is an example of data with higher variability
X    X − mean
8    +3.00
1    -4.00
9    +4.00
5    0.00
2    -3.00
ΣX = 25, n = 5, mean = 5
Mean deviation • The best measures of dispersion should: • take into account all the scores in the distribution • and should describe the average deviation of the scores around the mean. • Normally, to find the average we would want to sum all deviations from the mean and then divide by n, i.e., Σ(X − µ)/n • BUT: We have a problem. Σ(X − µ) will always add up to zero
Deviations from the mean • In any group of scores, the sum of the deviations from the mean equals zero:
X    X − µ
3    3 − 5.50 = -2.50
5    5 − 5.50 = -0.50
9    9 − 5.50 = +3.50
2    2 − 5.50 = -3.50
8    8 − 5.50 = +2.50
6    6 − 5.50 = +0.50
ΣX = 33    Σ(X − µ) = 0.00
n = 6, µ = ΣX/n = 33/6 = 5.50
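The zero-sum property can be checked directly. The following is a minimal Python sketch using the six scores from the slide; the variable names are illustrative only.

```python
# Scores from the "Deviations from the mean" slide
scores = [3, 5, 9, 2, 8, 6]

# Mean = sum of scores / number of scores (33 / 6)
mean = sum(scores) / len(scores)

# Deviation of each score from the mean
deviations = [x - mean for x in scores]

# The deviations cancel out, so their sum is zero
total = sum(deviations)

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", total)
```

With other data the sum may differ from zero by a tiny floating-point rounding error, so a comparison with a small tolerance is safer than an exact test.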
Variance & Standard Deviation • However, if we square each of the deviations from the mean, we obtain a sum that is not equal to zero • This is the basis for the measures of variance and standard deviation, the two most common measures of variability (or dispersion) of data
Variance & Standard Deviation (cont)
X    X − mean    (X − mean)²
8    +3.00       9.00
1    -4.00       16.00
9    +4.00       16.00
5    0.00        0.00
2    -3.00       9.00
ΣX = 25    Σ(X − mean) = 0.00    Σ(X − mean)² = 50.00
Note: Σ(X − mean)² is called the Sum of Squares
Steps to calculate standard deviation • Compute the mean. • Subtract the mean from each observation. • Square each of the deviations. • Sum them. • Divide by one less than the number of observations (almost the mean). • Take the square root.
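The six steps can be sketched in Python as follows. This is an illustrative sketch, not part of the lecture; the function name `sample_standard_deviation` is made up here, and the data are the five scores from the sum-of-squares slide.

```python
import math

def sample_standard_deviation(observations):
    """Follow the six steps for the sample standard deviation."""
    n = len(observations)
    mean = sum(observations) / n                   # Step 1: compute the mean
    deviations = [x - mean for x in observations]  # Step 2: subtract the mean
    squared = [d ** 2 for d in deviations]         # Step 3: square each deviation
    sum_of_squares = sum(squared)                  # Step 4: sum them
    variance = sum_of_squares / (n - 1)            # Step 5: divide by n - 1
    return math.sqrt(variance)                     # Step 6: take the square root

sd = sample_standard_deviation([8, 1, 9, 5, 2])
print(round(sd, 2))
```

For these scores the sum of squares is 50, so the sample variance is 50/4 = 12.5 and the standard deviation is about 3.54.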
Variance of a Population • The sum of squared deviations from the mean divided by the number of scores (sigma squared): σ² = Σ(X − µ)²/N, where µ is the population mean and N is the number of scores in the population
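Sample Variance • The sum of squared deviations from the sample mean divided by the number of degrees of freedom, n − 1 (an estimate of the population variance): s² = Σ(X − X̄)²/(n − 1), where X̄ is the sample mean and n is the number of scores in the sample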
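Standard Deviation Formulas • Population standard deviation: σ = √[Σ(X − µ)²/N] • Sample standard deviation: s = √[Σ(X − X̄)²/(n − 1)] • If the sample sum of squares is divided by n, the result usually underestimates the population variability, because the deviations are measured from the sample mean rather than from the population mean. • Using n − 1 in the denominator corrects for this and gives us a better estimate of the population standard deviation.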
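Python's standard `statistics` module provides both versions, which makes the effect of the denominator easy to see. This sketch reuses the five scores from the sum-of-squares slide; treating them first as a whole population and then as a sample is only for illustration.

```python
import statistics

scores = [8, 1, 9, 5, 2]  # sum of squares = 50

# Population formulas: divide the sum of squares by N
pop_var = statistics.pvariance(scores)  # 50 / 5
pop_sd = statistics.pstdev(scores)

# Sample formulas: divide the sum of squares by n - 1
samp_var = statistics.variance(scores)  # 50 / 4
samp_sd = statistics.stdev(scores)

print("population variance:", pop_var, "sd:", round(pop_sd, 3))
print("sample variance:", samp_var, "sd:", round(samp_sd, 3))
```

The sample values are always the larger of the two, and the difference shrinks as n grows.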
Why use Standard Deviation and not Variance? • Normally, you will only calculate variance in order to calculate standard deviation, as standard deviation is what we typically want. • Why? Because standard deviation expresses variability in the same units as the data. • Example: Standard deviation of ages in a class is 3.7 years (and the variance would be 13.69 years² = (3.7)²).
Coefficient of variation • It is a dimensionless measure of the relative variation. • Constructed by dividing the standard deviation by the mean and multiplying by 100: CV = (s/x̄)(100) • Used to compare the variability in one data set with that in another when a direct comparison of standard deviations is not appropriate, for example when the variables are measured in different units or have very different means.
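As a small illustration, the CV formula can be written as a Python function. The numbers below are hypothetical and chosen only to show the comparison; they are not taken from the lecture.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage: (s / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

# Hypothetical values: weight in kg and height in cm for the same sample
cv_weight = coefficient_of_variation(std_dev=10, mean=80)
cv_height = coefficient_of_variation(std_dev=7, mean=175)

print(round(cv_weight, 1), round(cv_height, 1))
```

In this made-up case the weights (CV of 12.5%) are relatively more variable than the heights (CV of about 4%), even though the two standard deviations are in different units and cannot be compared directly.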
Coefficient of variation • The formula is: • CV = (s/x̄)(100) • Suppose two samples of human males yield the following results:
Interquartile Range • Quartiles divide the ordered distribution into 4 equal parts • Q1 is the 25th percentile: the lowest 25% of the scores fall below it • Q2 is the median (50th percentile): the next 25% of the scores fall between Q1 and Q2 • Q3 is the 75th percentile: the next 25% of the scores fall between Q2 and Q3 • The final 25% of the scores lie above Q3, up to the maximum (100th percentile) • The IQR contains the middle 50% of the scores. It is obtained by Q3 – Q1 (i.e. the 75th percentile – the 25th percentile)
Calculating IQR • Step 1. Divide the ordered scores into 4 equal parts (12/4 = 3 scores per part) • Step 2. Find Q1 and Q3: Q1 lies midway between the 3rd and 4th scores; Q3 lies midway between the 9th and 10th scores • Step 3. Calculate Q3 – Q1
Example • Back to our example: 150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270
• Step 1: Divide the scores into 4 equal parts (Q1, Q2 and Q3 are the boundaries between the parts)
150, 165, 170 | 175, 180, 190 | 210, 210, 235 | 240, 260, 270
• Step 2: Find Q1 and Q3
Q1 = (170 + 175)/2 = 172.5
Q3 = (235 + 240)/2 = 237.5
• Step 3: Calculate Q3 – Q1
Q3 – Q1 = 237.5 – 172.5 = 65
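The three steps can be sketched in Python. This follows the simple equal-parts method used in the example, so it assumes the number of scores is divisible by 4; statistical packages use several other quartile conventions and may return slightly different values. The function name is illustrative.

```python
def iqr_equal_parts(scores):
    """Q1, Q3 and IQR by splitting the ordered scores into 4 equal parts."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this simple method needs a multiple of 4 scores")
    k = n // 4  # Step 1: number of scores in each part
    # Step 2: Q1 lies midway between the last score of part 1 and the first of part 2
    q1 = (ordered[k - 1] + ordered[k]) / 2
    # Q3 lies midway between the last score of part 3 and the first of part 4
    q3 = (ordered[3 * k - 1] + ordered[3 * k]) / 2
    # Step 3: IQR = Q3 - Q1
    return q1, q3, q3 - q1

tips = [150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270]
q1, q3, iqr = iqr_equal_parts(tips)
print(q1, q3, iqr)  # 172.5 237.5 65.0
```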
Weighted Mean • Problem: You have two classes, with 5 and 25 students, respectively. • In the smaller class (n = 5), the average grade is 60% • In the larger class (n = 25), the average grade is 45% • What is the average overall? • Not this: (60 + 45)/2 = 52.5% • Instead, weight each class average by its class size: (5 × 60 + 25 × 45)/(5 + 25) = 1425/30 = 47.5%
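A short Python sketch of the weighted mean for the two-class problem, assuming each class average is weighted by its number of students. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

class_averages = [60, 45]  # average grade (%) in each class
class_sizes = [5, 25]      # number of students in each class

overall = weighted_mean(class_averages, class_sizes)
naive = sum(class_averages) / len(class_averages)

print(overall)  # 47.5
print(naive)    # 52.5, which ignores the class sizes
```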
Measures to use with nominal or ordinal data • When observations are measured on a nominal or ordinal scale, the methods just discussed for describing the middle and the spread do not work. • Characteristics measured on nominal or ordinal scales do not have numerical values; they are summarized by counts or frequencies of occurrence.
Example • Proportions and percentages: • A proportion is the number (a) of observations with a given characteristic (such as dying) divided by the total number of observations, with and without the characteristic (a + b) • Proportion = p = a/(a + b), e.g. 98/945 = 0.104. • A percentage is a proportion multiplied by 100%. • Ratios: • A ratio is the number (a) of observations in a given group with a given characteristic (such as dying) divided by the number (b) of observations without the given characteristic • Ratio = a/b • A ratio is always defined as a part divided by another part. • 98/847 = 0.116 or 152/787 = 0.193.
Rates • Rates are similar to proportions except that a multiplier (e.g., 1000, 10,000, or 100,000) is used and they are computed over a specified period of time. The multiplier is called the base and the formula is: • Rate = a/(a+b)* base • For example, if the timolol study lasted exactly one year, the rate of death per 10,000 patients taking timolol per year is (98/945)* (10,000) = 1037 per 10,000 patients per year.
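The three measures can be compared side by side in Python using the timolol counts quoted above: a = 98 deaths and b = 847 survivors, so a + b = 945. This is an illustrative sketch; the variable names are not from the lecture.

```python
a = 98   # observations with the characteristic (died)
b = 847  # observations without the characteristic (survived)
base = 10_000

proportion = a / (a + b)       # part divided by the whole
percentage = proportion * 100  # proportion expressed in percent
ratio = a / b                  # part divided by the other part
rate = a / (a + b) * base      # per 10,000 patients per year

print(round(proportion, 3))  # 0.104
print(round(ratio, 3))       # 0.116
print(round(rate))           # 1037
```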
Categorical Graphs (Nominal or Ordinal) • Pie Charts • Bar Graphs
Pie Charts and Nominal Data • Pie charts are commonly used to represent the frequency of scores for nominal data • Example: patients distributed according to grade • 20% of the patients have grade I; 70% have grade II; and 10% have grade III.
Bar charts and Nominal Data • Bar charts are sometimes used to represent the frequency of scores for nominal data • Here, frequency is expressed as a percentage of the total number of males and females • (78% and 68%)
Vertical Bar Graphs Index
Numerical Graphs • Histograms • Frequency polygons • Boxplots
Example What is the age of this group of children? 4 7 7 7 8 8 7 8 9 4 7 3 6 9 10 5 7 10 6 8 7 8 7 8 7 4 5 10 10 0 9 8 3 7 9 7 9 5 8 5 0 4 6 6 7 5 3 2 8 5 10 9 10 6 4 8 8 8 4 8 7 3 7 8 8 8 7 9 7 5 6 3 4 8 7 5 7 3 3 6 5 7 5 7 8 8 7 10 5 4 3 7 6 3 9 7 8 5 7 9 9 3 1 8 6 6 4 8 5 10 4 8 10 5 5 4 9 4 7 7 7 6 6 4 4 4 9 7 10 4 7 5 10 7 9 2 7 5 9 10 3 7 2 5 9 8 10 10 6 8 3
Frequency Tables • A frequency table shows how often each value of the variable occurs. • Also called frequency distribution table
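A frequency table can be built with `collections.Counter` from the Python standard library. To keep this sketch short it uses only the first ten ages from the example data, so the counts below describe that subset, not the whole group.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], 100 * counts[v] / total) for v in sorted(counts)]

# First ten ages from the example data (a subset, for illustration only)
ages = [4, 7, 7, 7, 8, 8, 7, 8, 9, 4]

table = frequency_table(ages)
for value, count, percent in table:
    print(f"age {value}: frequency {count} ({percent:.0f}%)")
```

The same function can be applied to the full list of ages to obtain the complete table.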
Histograms • A way of visually representing information contained in a frequency table • Histograms resemble bar charts: bars, rather than connected points, show the frequencies • The bars typically cover “intervals” of values. The first bar here covers scores > 0 and < 1.
Histogram Note that these are analogous to counts and percents with bar charts
Frequency Polygon • Another way of visually representing information contained in a frequency table • Align all possible values along the bottom of the graph (the x-axis) • Against the vertical line (the y-axis), place a point denoting the frequency of scores for each value • Connect the points with lines • (Typically add an extra value above and below the actual range of values)
Boxplots • Boxplots graphically represent the scores in a distribution • Made using the 5 number summary (minimum, Q1, median, Q3, maximum) • Within the box are all scores that fall between the 25th and 75th percentiles • The whiskers capture all scores within 1.5 IQRs of the box boundary • Outliers are between 1.5 and 3 IQRs from the box boundary • Extreme outliers are beyond 3 IQRs from the box boundary
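The outlier rules can be sketched in Python as follows. The quartiles are those of the tips example (Q1 = 172.5, Q3 = 237.5); the scores 340 and 500 are hypothetical values added only to show each category, and counting a score that lies exactly on a boundary as inside it is an assumption of this sketch.

```python
def classify_score(score, q1, q3):
    """Classify a score using the 1.5 IQR and 3 IQR boxplot rules."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr      # extreme limits
    if score < outer_low or score > outer_high:
        return "extreme outlier"
    if score < inner_low or score > inner_high:
        return "outlier"
    return "within whiskers"

q1, q3 = 172.5, 237.5  # from the tips example, IQR = 65

print(classify_score(270, q1, q3))  # largest tip in the example
print(classify_score(340, q1, q3))  # hypothetical score
print(classify_score(500, q1, q3))  # hypothetical score
```

With these quartiles the whisker limits are 75 and 335 and the extreme limits are -22.5 and 432.5, so all of the example tips fall within the whiskers.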
Shapes of Distributions • These representational aids all describe frequency distributions: the way score frequencies are distributed with respect to the values of the variable • Distributions can take on a number of shapes or forms
Unimodal Distributions • The mode of a distribution refers to the most frequently occurring score • In a unimodal distribution, one score occurs much more frequently than others
Multimodal Distributions • In multimodal distributions, more than one mode exists (or approximately so) • In a bimodal distribution, two modes exist
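The modes of a small data set can be found with `statistics.multimode` (available in Python 3.8 and later), which returns every value that ties for the highest frequency. Both data sets below are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical scores with a single most frequent value
unimodal = [4, 7, 7, 7, 8, 8, 9]
# Hypothetical scores in which two values tie for most frequent
bimodal = [1, 2, 2, 2, 3, 6, 7, 7, 7, 8]

modes_one = statistics.multimode(unimodal)
modes_two = statistics.multimode(bimodal)

print(modes_one)  # [7]
print(modes_two)  # [2, 7]
```

Real distributions are often called bimodal even when the two peaks are only approximately equal in height, which an exact tie count like this would not detect.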
Rectangular or Uniform Distributions • In a uniform distribution, all values are observed equally often
Symmetrical and Skewed Distributions • A symmetrical distribution is balanced: if we cut it in half, the two sides would be mirror images of one another • The normal distribution is a particular kind of symmetrical distribution that resembles a bell (bell-shaped distribution)
Skewed Distributions • A skewed distribution is unbalanced; there may be a cluster of scores piling on one end of the scale
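Skewed Distributions • Positively skewed distribution (skewed right): the long tail extends toward the higher values • Negatively skewed distribution (skewed left): the long tail extends toward the lower values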
Mean, median and mode (figure: positions of the mode, median and mean in skewed distributions)
Using different measures of central tendency • Two factors are important in deciding which measure of central tendency should be used: • Scale of measurement (ordinal or numerical) • Shape of the distribution of observations • A distribution can be symmetric, skewed to the right (positively skewed), or skewed to the left (negatively skewed).
Using different measures of central tendency • The following guidelines help the researcher decide which measure is best with a given set of data: • The mean is used for numerical data and for symmetric distributions.
Using different measures of central tendency • The following guidelines help the researcher decide which measure is best with a given set of data: • The median is used for ordinal data or for numerical data whose distribution is skewed.
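The reason behind these guidelines can be seen in a small Python sketch. Both data sets are hypothetical: the first is symmetric, the second has one extreme high value that skews it to the right.

```python
import statistics

symmetric = [2, 3, 4, 5, 6]        # hypothetical, symmetric
skewed = [2, 3, 3, 4, 5, 6, 40]    # hypothetical, skewed to the right

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)
skew_mean = statistics.mean(skewed)
skew_median = statistics.median(skewed)

print("symmetric: mean", sym_mean, "median", sym_median)
print("skewed:    mean", skew_mean, "median", skew_median)
```

In the symmetric set the mean and median agree (both 4). In the skewed set the single extreme value pulls the mean up to 9 while the median stays at 4, which is why the median is preferred for skewed data.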