Applied statistics for testing and evaluation – MED4. Descriptive statistics (cont.) - variability. Lecturer: Smilen Dimitrov. Introduction.
Introduction
• We previously discussed measures of central tendency (location) of a data sample (collection) in descriptive statistics: arithmetic mean, median and mode; and also the range as a measure of statistical dispersion (variability)
• Here we continue with other important measures of variability, namely variance and standard deviation
• We will also get acquainted with some parameters leading to their definitions
• We will look at how we perform these operations in R, and a bit more about plotting as well
Variability and deviations
• A measure of variability is perhaps the most important quantity in statistical analysis.
• The greater the variability in the data, the greater will be our uncertainty in the values of the parameters estimated from the data, and the lower will be our ability to distinguish between competing hypotheses about the data.
• Measures of variability: a single number describing the variability of data. Eventually we look for variance and standard deviation.
Variability and deviations
• Deviations: the signed differences between the individual values in the data sample and the mean value (their sizes are the distances from the mean)
• Plotting: using lines in a for loop
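A minimal R sketch of the deviation plot described above. The sample `x` is an assumption for illustration (the five numbers 2, 7, 4, 0, 7 used later in the degrees-of-freedom slides), not the lecture's own data set, and the plotting choices are only one possible way to draw it.

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)

x_bar <- mean(x)          # arithmetic mean of the sample
deviations <- x - x_bar   # signed deviation of each value from the mean

# Plot the values against their index and mark the mean as a dashed line
plot(x, pch = 16, xlab = "Index", ylab = "Value")
abline(h = x_bar, lty = 2)

# Draw one vertical line per data point, from the mean to the value
for (i in seq_along(x)) {
  lines(c(i, i), c(x_bar, x[i]))
}
```

The length of each vertical line is the absolute size of the corresponding deviation.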
Variability and deviations
• The longer the lines, the more variable the data
• Could we use the sum of the deviations as a measure of variability?
• No: because of the definition of the arithmetic mean, it is the line positioned such that the deviations cancel out and their sum is 0.
• Quick proof
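One standard way to write the quick proof, assuming the notation $x_i$ for the individual values, $N$ for the sample size and $\bar{x}$ for the arithmetic mean (the symbols are an assumption, not taken from the slides):

```latex
\sum_{i=1}^{N} (x_i - \bar{x})
  = \sum_{i=1}^{N} x_i - N\bar{x}
  = N\bar{x} - N\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i .
```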
Absolute deviations
• The minus signs of the deviations could be seen as the reason for cancellation of the sum
• We could try using the absolute deviations
• Their sum is different from 0 unless all values are identical
• However, absolute values are awkward to work with mathematically (the absolute value function is not differentiable at 0), so we look for an easier way
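A short R sketch contrasting the two sums, again on an assumed illustrative sample (2, 7, 4, 0, 7) rather than the lecture's data:

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)
deviations <- x - mean(x)

sum_dev <- sum(deviations)       # signed deviations cancel: 0
sum_abs <- sum(abs(deviations))  # absolute deviations do not cancel: 12

sum_dev
sum_abs
```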
Squared deviations and sum of squares
• Squaring the deviations also removes the minus signs, and squares are easier to work with algebraically
• Their sum is, again, different from 0 unless all values are identical
• It is the well known sum of squares: $SS = \sum_{i=1}^{N} (x_i - \bar{x})^2$
• More properly, it is the sum of squared deviations
• An unscaled, or unadjusted, measure of dispersion
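The sum of squared deviations can be computed in one line of R. This is a sketch on an assumed illustrative sample; the variable name `ss` is arbitrary.

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)

# Sum of squared deviations from the mean: 4 + 9 + 0 + 16 + 9
ss <- sum((x - mean(x))^2)
ss   # 38
```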
Scaling the sum of squares – Mean Squared Deviation
• Now, what would happen to the sum of squares if we added an additional data point?
• It would get bigger (or, at best, stay the same if the new point equals the current mean).
• So usually, the sum of squares will grow with the size of the data collection.
• That is a manifestation of the fact that it is unscaled.
• Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows.
• We don't want our measure of variability to depend on sample size in this way, so the obvious solution is to divide by the number of data points, N, to get the mean squared deviation: $MSD = SS / N$
• The MSD can be taken to be the wanted variance parameter, but…
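A small R sketch of both points: the sum of squares grows when a point is added, and dividing by the number of data points gives the mean squared deviation. The sample, the extra point (10) and the helper `sum_sq` are all hypothetical, chosen only for illustration.

```r
# Hypothetical helper: sum of squared deviations from the mean
sum_sq <- function(v) sum((v - mean(v))^2)

x <- c(2, 7, 4, 0, 7)   # illustrative sample (assumption)
x_more <- c(x, 10)      # same sample plus one hypothetical extra point

ss <- sum_sq(x)             # 38
ss_more <- sum_sq(x_more)   # 68: the unscaled sum of squares has grown

# Mean squared deviation: sum of squares divided by the number of points
msd <- ss / length(x)       # 7.6
```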
Degrees of freedom
• Suppose we had a sample of five numbers and their average was 4. What was the sum of the five numbers? It must have been 20, otherwise the mean would not have been 4. So now let us think about each of the five numbers in turn:
• We are going to put a number in each of the five boxes.
• If we allow that the numbers could be positive or negative real numbers, we ask how many values the first number could take.
Degrees of freedom
• If we allow that the numbers could be positive or negative real numbers, we ask how many values the first number could take.
• You will realize it could take any value. Suppose it was a 2.
• How many values could the next number take? It could be anything. Say it was a 7.
• And the third number could be anything. Suppose it was a 4.
• The fourth number could be anything at all. Say it was 0.
• Now, how many values could the last number take?
• Just one: it has to be another 7, because the numbers have to add up to 20 for the mean of the five numbers to be 4.
• Boxes: 2, 7, 4, 0, 7
Degrees of freedom
• Boxes: 2, 7, 4, 0, 7
• We have total freedom in selecting the first number, and the second, third and fourth numbers.
• But we have no choice at all in selecting the fifth number.
• We have four degrees of freedom when we have five numbers (and their mean).
• In general we have (N-1) degrees of freedom if we estimated the mean from a sample of size N.
• More generally still, we can propose a formal definition of degrees of freedom: degrees of freedom is the sample size, N, minus the number of parameters, p, estimated from the data.
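The box argument above can be checked with a few lines of R. This is only a sketch; the variable names are arbitrary and not from the lecture.

```r
n <- 5                        # sample size
target_mean <- 4              # the known mean of the five numbers
free_values <- c(2, 7, 4, 0)  # the four freely chosen numbers

# The last number is forced: the total must equal n * mean = 20
last_value <- n * target_mean - sum(free_values)
last_value   # 7

# One parameter (the mean) was fixed, so n - 1 values were free
deg_free <- n - 1
deg_free     # 4
```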
Scaling the sum of squares – variance
• The mean is a parameter estimated from the data itself, hence we lose one degree of freedom
• Thus we finally arrive at a definition for variance, the sum of squares divided by the degrees of freedom: $s^2 = SS / (N-1)$
• The only difference between MSD and variance is division by N or N-1, respectively
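A brief R sketch comparing the two scalings on an assumed illustrative sample. R's built-in `var()` divides by N-1, so it should agree with the hand-computed variance rather than with the MSD.

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squares: 38

msd <- ss / n                # mean squared deviation: 7.6
variance <- ss / (n - 1)     # variance with N-1 degrees of freedom: 9.5

var(x)                       # built-in sample variance, also 9.5
```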
Standard deviation
• Variance has a unit of measure which is squared (cm²) in relation to the original units (cm)
• Therefore, another measure is used: the standard deviation, the square root of the variance, which is measured in the same units as the data
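In R, the standard deviation can be obtained either as the square root of `var()` or directly with `sd()`. A sketch on an assumed illustrative sample:

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)

s_manual <- sqrt(var(x))   # square root of the sample variance
s_builtin <- sd(x)         # built-in sample standard deviation

s_manual    # about 3.08, in the same units as x
s_builtin
```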
Sample and population parameters
• Usually you are interested in drawing conclusions about the population from which your (random) sample of data is drawn.
• It is very important to keep in mind the difference between the descriptive statistics that characterise your sample, and the corresponding parameters that characterise the population from which your sample is drawn.
• Population (finite, infinite): "true" parameters. Ex. all raisin boxes ever produced by the company/factory
• Sample (finite): estimates of population parameters. Ex. the particular data collection for only 17 particular raisin boxes
• Parameters in both cases: mean, standard deviation, variance
• Needs (probability) distributions
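The difference between sample estimates and a population parameter can be illustrated with a small simulation. This is a sketch under stated assumptions: the "population" is taken to be a standard normal distribution (true variance 1), and the sample size, number of repetitions and seed are arbitrary choices, not values from the lecture.

```r
set.seed(1)     # arbitrary seed, for reproducibility only

n <- 5          # size of each sample (assumption)
reps <- 10000   # number of repeated samples (assumption)

# Each row is one random sample from the assumed population N(0, 1)
samples <- matrix(rnorm(n * reps, mean = 0, sd = 1), nrow = reps)

# Sample variance (divides by n - 1) for every sample
var_estimates <- apply(samples, 1, var)

# Mean squared deviation (divides by n) for every sample
msd_estimates <- var_estimates * (n - 1) / n

mean(var_estimates)   # close to the true variance, 1
mean(msd_estimates)   # systematically lower, close to (n - 1) / n = 0.8
```

Any single sample gives an estimate that differs from the true value; averaged over many samples, the N-1 version centres on the population variance, while the MSD tends to fall below it.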
Geometric interpretations - quantity graph
• Standard deviation: a length, in the same units as the quantity
• Variance: an area
Geometric interpretation - histogram (frequency count)
• More commonly: geometric interpretation on a histogram.
• Makes it easier to see the spread
• If there are no deviations, the standard deviation is 0 and the whole histogram collapses to a single peak
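A possible R sketch of the histogram view, marking the mean and the range of one standard deviation on either side of it. The sample is an assumed illustration and the styling is arbitrary.

```r
# Illustrative sample (assumption: not the lecture's own data)
x <- c(2, 7, 4, 0, 7)
m <- mean(x)
s <- sd(x)

# Frequency-count histogram; hist() also returns the bin counts
h <- hist(x, main = "Histogram with mean and one-sd range", xlab = "Value")

abline(v = m, lwd = 2)                 # arithmetic mean
abline(v = c(m - s, m + s), lty = 2)   # one standard deviation either side

# With no deviations at all, the standard deviation is 0
sd_constant <- sd(c(4, 4, 4, 4, 4))
sd_constant
```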
Review
• Descriptive statistics
• Measures of central tendency (location): arithmetic mean, median, mode
• Measures of statistical variability (dispersion, spread): range, variance, standard deviation
Exercise for mini-module 3 – STAT03
Exercise
Use the following data: The data in the following table come from three market gardens. The data show the ozone concentrations in parts per hundred million (pphm) on ten consecutive summer days.
• 1. Import the data into R, and for each garden, find the central tendency parameters of the ozone concentrations.
• 2. Using R, for each garden, find the dispersion parameters: the sample variance and sample standard deviation.
• 3. Using R, plot the relative frequency histogram for each of the gardens. Mark graphically the arithmetic mean on each graph and the one standard deviation range.
Delivery: Deliver the collected data (in tabular format), the found statistics and the requested graphs for the assigned years in an electronic document. You are welcome to include R code as well.