Analyzing the dispersion and variation in systolic blood pressure data for three groups using dot plots and measures such as range, inter-quartile range, mean deviation, variance, and standard deviation.
Consider the following three data sets giving the systolic BP of three groups of individuals (hypothetical data) • Group 1: 120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120 • Group 2: 120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130
Group 3: 110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110 • The mean BP is the same for each of the three groups • Mean = 123.33 for each group • Let us plot the observations in relation to the mean value as dot plots.
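The claim that the three groups share one mean can be checked directly. The following is a minimal sketch using Python's standard library; the dictionary name `groups` and the labels are illustrative choices, not part of the slides.

```python
from statistics import mean

# Hypothetical systolic BP readings, copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}

# Each group has 12 readings with the same total, so the means coincide.
group_means = {name: mean(values) for name, values in groups.items()}

for name, value in group_means.items():
    print(f"{name}: mean = {value:.2f}")
```

Identical means say nothing about how the readings are scattered around that mean, which is what the rest of the slides address.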
Group 1 (dot plot, one x per observation; the mean 123.33 lies between 122 and 125): • 120: x x x x x • 122: x • 125: x x x x • 128: x • 130: x
Quickly draw the dot plots for the other two groups. • Can you see the variations? • Which group shows more variation? • Comments?
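For readers who prefer to generate the dot plots rather than draw them, here is a rough text-only sketch. The helper `dot_plot` is a hypothetical name introduced for illustration; it prints one row per distinct value and one `x` per observation.

```python
from collections import Counter

# Hypothetical systolic BP readings for groups 2 and 3, from the slides.
group2 = [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130]
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]


def dot_plot(values):
    """Return a text dot plot: one row per distinct value, one x per observation."""
    counts = Counter(values)
    rows = []
    for value in sorted(counts):
        dots = " ".join("x" for _ in range(counts[value]))
        rows.append(f"{value:>3} | {dots}")
    return "\n".join(rows)


for label, values in (("Group 2", group2), ("Group 3", group3)):
    print(f"{label} (mean 123.33)")
    print(dot_plot(values))
    print()
```

Comparing how far the rows extend on either side of 123.33 gives a first visual impression of which group is more spread out.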
What is dispersion? • Dispersion refers to the spread or variability of data. If all values in a data set are the same, there is no dispersion (zero dispersion). • Dispersion is usually measured around a central value. • A measure of dispersion conveys information about the amount of variability present in a data set.
Some measures of dispersion • Range: It is defined as the difference between the largest and the smallest observations in a data set. • $R = x_{\max} - x_{\min}$, where $x_{\max}$ and $x_{\min}$ are the largest and smallest values.
Range depends only on the two extreme values of the data set. The larger the range, the greater the spread/variability in the data set. • It gives an idea of the overall range of variation in the data set. • For small samples (n ≤ 10), it can be used as a measure of variability. • It is a simple measure that does not depend on all observations; it is useful in quality control techniques.
Example: • Consider the BP data sets of the three groups of people. • For group 1, R = 130 - 120 = 10 • For group 2, R = 145 - 110 = 35 • For group 3, R = 155 - 105 = 50 • By the range, group 3 is the most dispersed data set.
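The range calculation above can be reproduced in a few lines. This is a small sketch; the function name `value_range` is an illustrative choice.

```python
# Hypothetical systolic BP readings, copied from the slides.
group1 = [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120]
group2 = [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130]
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


for label, values in (("Group 1", group1), ("Group 2", group2), ("Group 3", group3)):
    print(f"{label}: R = {max(values)} - {min(values)} = {value_range(values)}")
```

Only two of the twelve readings in each group enter the result, which is why the range is quick to compute but sensitive to a single extreme value.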
Inter-quartile range • Inter-quartile range (IQR) is defined as the difference between the third quartile Q3 and the first quartile Q1. • IQR = Q3 - Q1 • IQR gives the range of the central 50% of the observations in a data set, since 25% of the observations lie below Q1 and 25% lie above Q3. Q1 and Q3 are the limits within which the central 50% of the values fall. • IQR/2 is called the semi-interquartile range.
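The slides do not say which rule is used to locate Q1 and Q3, and textbooks differ on this. The sketch below uses `statistics.quantiles` from the Python standard library and shows, for group 2, that the two conventions it offers ("exclusive" and "inclusive") can give different IQR values. Which convention the slides intend is an assumption left open here.

```python
from statistics import quantiles

# Hypothetical systolic BP readings for group 2, from the slides.
group2 = [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130]


def iqr(values, method="exclusive"):
    """IQR = Q3 - Q1, with quartiles from statistics.quantiles."""
    q1, _q2, q3 = quantiles(values, n=4, method=method)
    return q3 - q1


def semi_iqr(values, method="exclusive"):
    """Semi-interquartile range = IQR / 2."""
    return iqr(values, method) / 2


for method in ("exclusive", "inclusive"):
    print(f"{method}: IQR = {iqr(group2, method)}, semi-IQR = {semi_iqr(group2, method)}")
```

When reporting an IQR for a small sample it is worth stating the quartile rule, since the result depends on it.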
MEAN DEVIATION • Mean deviation is defined as the average of absolute deviations measured from a central value. For n observations, we define mean deviation about the median as follows: • M.D.(median) = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{m}|$, where $\bar{m}$ is the median.
Some observations • Absolute deviation is the numerical value of the deviation of an observation from a central value, say, the mean, median or mode. Only the magnitude is considered and the sign is ignored. • Ex: 3 - 8 = -5 is the actual deviation (difference); the absolute deviation is 5. • 16 - 8 = 8; the absolute deviation is 8. • An important note: The sum of the actual deviations of observations from their mean is zero.
Mean deviation about median is the smallest • Why? • Do you have an intuitive answer? • The median is the middlemost value: moving the central value away from the median increases the distance to at least half of the observations and reduces it for at most half, so the sum of absolute deviations cannot get smaller. • M.D. about the median is therefore the one commonly used.
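A quick numerical check of this property, using the group 3 data, is sketched below. The function name `mean_deviation` is illustrative; it averages absolute deviations about whatever centre it is given.

```python
from statistics import mean, median

# Hypothetical systolic BP readings for group 3, from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]


def mean_deviation(values, center):
    """Average absolute deviation of the values about the given centre."""
    return sum(abs(x - center) for x in values) / len(values)


md_about_median = mean_deviation(group3, median(group3))
md_about_mean = mean_deviation(group3, mean(group3))

print(f"median = {median(group3)}, M.D. about median = {md_about_median:.3f}")
print(f"mean   = {mean(group3):.2f}, M.D. about mean   = {md_about_mean:.3f}")

# Scan integer centres from 100 to 160: none should beat the median.
best_scanned = min(mean_deviation(group3, c) for c in range(100, 161))
print(f"smallest M.D. over scanned centres = {best_scanned:.3f}")
```

For an even number of observations every centre between the two middle values gives the same minimum, so the mean deviation about the mean equals the one about the median whenever the mean happens to fall in that interval.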
Variance and standard deviation • Instead of taking absolute deviations, we can take the average of squared differences of the observations from a central value as a measure of dispersion. • Variance is such a measure. • For a given population, variance is defined as the average of squared differences of the observations from their arithmetic mean.
Let $x_1, x_2, x_3, \ldots, x_N$ be the N observations and $\mu$ be the population mean. Then the sum of squared deviations of the observations from $\mu$ is: • $(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2 = \sum_{i=1}^{N}(x_i-\mu)^2$ • Variance = $\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$ • This is usually denoted as $\sigma^2$ • The positive square root of the variance is called the standard deviation.
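For a sample of n observations $x_1, \ldots, x_n$ with sample mean $\bar{x}$, the sample variance is given by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$. Here the denominator is (n-1) because the deviations from the sample mean sum to zero, so only (n-1) of them are free to vary; that is, the degrees of freedom are (n-1). With this denominator, the sample variance is an unbiased estimator of the population variance.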
Since variance is computed from squared deviations, its unit of measurement is the square of the original unit. Therefore, we take the positive square root of the variance (or sample variance) and call it the standard deviation. • The population standard deviation is $\sigma$ and the sample standard deviation is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$.
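The difference between the two denominators can be seen on the group 1 data. The sketch below computes the variance both ways; treating group 1 as a whole population in the first function is purely for illustration, and the function names are not from the slides.

```python
from math import sqrt

# Hypothetical systolic BP readings for group 1, from the slides.
group1 = [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120]


def population_variance(values):
    """Average squared deviation from the mean, denominator N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, denominator n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


sigma2 = population_variance(group1)
s2 = sample_variance(group1)

print(f"denominator N:     variance = {sigma2:.4f}, SD = {sqrt(sigma2):.4f}")
print(f"denominator n - 1: variance = {s2:.4f}, SD = {sqrt(s2):.4f}")
```

The standard library functions `statistics.pvariance` and `statistics.variance` implement the same two definitions and can be used instead of the hand-written versions.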
Illustrative examples • Compute IQR, M.D. about the median and standard deviation for the BP data of all three groups.
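One possible way to work this exercise by machine is sketched below. It assumes the default "exclusive" quartile rule of `statistics.quantiles` and the sample standard deviation (denominator n-1); a different quartile rule may change the IQR values. The helper name `summarize` is illustrative.

```python
from statistics import median, quantiles, stdev

# Hypothetical systolic BP readings, copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}


def summarize(values):
    """Return IQR, mean deviation about the median, and sample SD."""
    q1, _q2, q3 = quantiles(values, n=4)  # default "exclusive" convention
    m = median(values)
    return {
        "iqr": q3 - q1,
        "md_median": sum(abs(x - m) for x in values) / len(values),
        "sd": stdev(values),
    }


summary = {name: summarize(values) for name, values in groups.items()}

for name, stats in summary.items():
    print(
        f"{name}: IQR = {stats['iqr']:.1f}, "
        f"M.D.(median) = {stats['md_median']:.2f}, "
        f"SD = {stats['sd']:.2f}"
    )
```

All three measures rank the groups in the same order, with group 1 the least and group 3 the most dispersed.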
Grouped data: $s^2 = \frac{1}{n-1}\sum_{i} f_i (x_i-\bar{x})^2$ • In this formula, n = total frequency, i.e. $\sum_i f_i$ • $f_i$ = frequency of the ith value/interval • $x_i$ = ith value of the variable/midpoint of the ith class interval • $\bar{x}$ = mean of the sample data
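A minimal sketch of a grouped-data computation follows, assuming the frequency-weighted sample variance with an (n-1) denominator. It builds a frequency table from the group 1 readings, so the result should match the ungrouped sample variance of that group; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical systolic BP readings for group 1, from the slides.
group1 = [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120]

# Frequency table: distinct value x_i -> frequency f_i.
freq = Counter(group1)

n = sum(freq.values())  # total frequency
x_bar = sum(f * x for x, f in freq.items()) / n  # frequency-weighted mean
grouped_s2 = sum(f * (x - x_bar) ** 2 for x, f in freq.items()) / (n - 1)

for x in sorted(freq):
    print(f"x = {x}, f = {freq[x]}")
print(f"n = {n}, mean = {x_bar:.2f}, sample variance = {grouped_s2:.4f}")
```

With class intervals instead of distinct values, the x_i would be the interval midpoints and the result would only approximate the variance of the raw data.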
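Formula for direct computation • Ungrouped data: $s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$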
An example • Consider the BP data for group 3 • Solve this using the direct computation formula
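The group 3 exercise can be worked as follows. This sketch assumes the usual shortcut form of the sample variance, s² = [Σx² - (Σx)²/n] / (n-1), which needs only the sum and the sum of squares of the observations.

```python
# Hypothetical systolic BP readings for group 3, from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]

n = len(group3)
sum_x = sum(group3)                   # sum of observations
sum_x2 = sum(x * x for x in group3)   # sum of squared observations

# Shortcut form: no need to subtract the mean from every observation first.
direct_s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
direct_s = direct_s2 ** 0.5

print(f"n = {n}, sum x = {sum_x}, sum x^2 = {sum_x2}")
print(f"sample variance = {direct_s2:.3f}, sample SD = {direct_s:.2f}")
```

The shortcut is algebraically identical to the definition, but in floating-point arithmetic it can lose precision when the mean is very large relative to the spread.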
Properties of standard deviation • SD is the preferred measure of variability. It measures the typical distance of an observation from the mean of the sample. • If all values are the same, then SD = 0. • SD is heavily influenced by outliers/extreme values. • For comparing variation in two or more samples, the coefficient of variation is a better measure.
Coefficient of variation • Variance (or standard deviation) is an absolute measure of dispersion. In order to compare dispersion in sample data measured in different units of measurement, we define relative measures of dispersion. • Coefficient of variation (C.V.) is one such measure. It is defined as the ratio of the standard deviation to the mean. • For a population, C.V. = $\sigma/\mu$
In case of sample data, the sample coefficient of variation is defined as the ratio of the sample standard deviation to the sample mean. • cv = $s/\bar{x}$
CV measures the extent of variability in relation to the mean. It can also be interpreted as the variation per unit of the mean. • CV is relevant only for data measured on a scale with a true zero. If the mean is close to zero or negative, CV is not meaningful. • In practice, it is often reported as a percentage by multiplying the ratio by 100. • For a positive mean, CV is non-negative and it can exceed one.
What is the meaning of CV = 0? • CV can be used to compare the variability of two or more samples. A small value of CV is interpreted as less variability in the data set. • "Less variability" implies "greater consistency". • Useful when two data sets are given in different units of measurement, for example height measured in inches and in centimetres.
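The point about units can be illustrated with a small sketch. The height values below are made up for the example (they are not from the slides); the same heights are expressed in inches and in centimetres, and the SD changes with the unit while the CV does not.

```python
from math import isclose
from statistics import mean, stdev

# Made-up heights for illustration only; not data from the slides.
heights_in = [60.0, 62.5, 65.0, 67.5, 70.0]
heights_cm = [h * 2.54 for h in heights_in]  # 1 inch = 2.54 cm


def cv(values):
    """Sample coefficient of variation: sample SD divided by sample mean."""
    return stdev(values) / mean(values)


print(f"SD in inches = {stdev(heights_in):.3f}, SD in cm = {stdev(heights_cm):.3f}")
print(f"CV in inches = {cv(heights_in):.4f}, CV in cm = {cv(heights_cm):.4f}")
print("CVs agree:", isclose(cv(heights_in), cv(heights_cm)))
```

Rescaling multiplies both the SD and the mean by the same factor, so the ratio is unchanged. This does not hold for a shift of origin (for example Celsius to Fahrenheit), which is one reason CV needs a true zero.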
Example: • Compute the CV for the BP data of the three groups of individuals. • Group 1: • Sample mean = 123.33 • Sample variance = 12.2424 • CV = 0.028
Group 2: • Sample mean = 123.33 • Sample variance = 115.152 • CV = 0.087 • Group 3: • Sample mean = 123.33 • Sample variance = 287.879 • CV = 0.138
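The CV computation for the three groups can be sketched as follows, using the sample standard deviation (denominator n-1) divided by the sample mean. The names `groups` and `cv_by_group` are illustrative.

```python
from statistics import mean, stdev, variance

# Hypothetical systolic BP readings, copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}

# Sample CV = sample SD / sample mean, computed per group.
cv_by_group = {name: stdev(values) / mean(values) for name, values in groups.items()}

for name, values in groups.items():
    print(
        f"{name}: mean = {mean(values):.2f}, "
        f"variance = {variance(values):.4f}, "
        f"CV = {cv_by_group[name]:.3f} ({100 * cv_by_group[name]:.1f}%)"
    )
```

Because the three means are equal, ranking the groups by CV gives the same order as ranking them by SD; CV becomes informative when the means differ.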
In the analysis of experimental data where measurements are made at different concentration levels, if SD increases in proportion to concentration, then CV is preferred to SD. • Adding CVs, or averaging the CVs of several samples and using the result as a measure of variation, is not correct. • Can you do it with SD?
Some remarks • Consider this example: • Sample 1: mean = 80, SD = 12 • Sample 2: mean = 0.5, SD = 0.2 • CV1 = 0.15 or 15%, CV2 = 0.4 or 40% • If you compare the variability in the two samples using SD, then sample 1 has more spread. But CV tells us that sample 2 has more variability relative to its mean. • Variability is a property observed in relation to a central value in the data set.
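The arithmetic in this example is short enough to verify directly. The sketch below uses only the summary figures given on the slide; the dictionary layout is an illustrative choice.

```python
# Summary figures from the slide: (mean, SD) for each sample.
samples = {
    "Sample 1": (80, 12),
    "Sample 2": (0.5, 0.2),
}

# CV = SD / mean for each sample.
cvs = {name: sd / m for name, (m, sd) in samples.items()}

larger_sd = max(samples, key=lambda name: samples[name][1])
larger_cv = max(cvs, key=cvs.get)

for name, (m, sd) in samples.items():
    print(f"{name}: mean = {m}, SD = {sd}, CV = {cvs[name]:.2f} ({100 * cvs[name]:.0f}%)")
print(f"Larger SD: {larger_sd}; larger CV: {larger_cv}")
```

The two measures disagree because SD ignores the scale of the data, whereas CV expresses the spread as a fraction of the mean.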
Some new measures: • Since CV is not meaningful when the mean is negative, it is suggested to use the absolute value of the mean. The resulting ratio is called the relative standard deviation. (Not a popular one!) • As a nonparametric counterpart, the ratio of the interquartile range to the median is suggested, i.e. (Q3 - Q1)/median. • Again, not a popular measure!
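Both alternative measures are easy to compute; a rough sketch follows. The temperature values are made up to give a negative mean, the quartiles use the default "exclusive" rule of `statistics.quantiles` (an assumption, since the slides do not fix a rule), and the function names are illustrative.

```python
from statistics import mean, median, quantiles, stdev

# Made-up readings with a negative mean, for illustration only.
temps = [-12, -8, -10, -15, -5]

# Hypothetical systolic BP readings for group 3, from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]


def relative_sd(values):
    """Sample SD divided by the absolute value of the sample mean."""
    return stdev(values) / abs(mean(values))


def iqr_to_median(values):
    """(Q3 - Q1) / median, using the default 'exclusive' quartile rule."""
    q1, _q2, q3 = quantiles(values, n=4)
    return (q3 - q1) / median(values)


print(f"mean of temps = {mean(temps)}, relative SD = {relative_sd(temps):.4f}")
print(f"group 3: (Q3 - Q1) / median = {iqr_to_median(group3):.4f}")
```

The quartile-based ratio depends only on the middle half of the data, so it is less affected by extreme readings than a ratio built on the SD and the mean.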