www.making-statistics-vital.co.uk MSV 33: Measures of Spread
‘And our topic today, my fellow bees, is spread!’ Professor Zzub ‘Mmm...’
‘No, no, no, Millie! I mean, how can we measure how spread out a data set is?’ ‘I’m lost. Example please...’ ‘The data sets 1, 3, 5, 7, 9 and 3, 4, 5, 6, 7 have the same mean, 5, but the first set is clearly more spread out than the second.’
‘So you are asking how we could measure that – how about the top number take away the bottom for each set? If the spread is big, that’ll be big!’ 1, 3, 5, 7, 9 and 3, 4, 5, 6, 7 ‘Nice idea, Ding – and this measure is used! It’s called the RANGE. So the range for our first set is 9 - 1 = 8, while the range for our second set is 7 - 3 = 4.’
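Ding's suggestion can be sketched in a few lines of Python. This is only an illustration; the function name `data_range` is our own label and does not come from the slides.

```python
def data_range(values):
    """Range = top value minus bottom value."""
    return max(values) - min(values)

first = [1, 3, 5, 7, 9]
second = [3, 4, 5, 6, 7]

print(data_range(first))   # 8
print(data_range(second))  # 4
```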
‘Let me guess – there’s more to it than that.’ ‘Sadly, Brenda, the range is badly affected by extreme values or outliers. It can give a rather misleading picture of the data.’ 1, 3, 5, 7, 9, 11, 13 (Range = 13 - 1 = 12) and 3, 4, 5, 6, 7, 8, 20 (Range = 20 - 3 = 17)
‘Okay, then, don’t take all the data; chuck away the lowest quarter, and the highest quarter, and THEN take the range. Just taking the middle 50%, you’ve got rid of all those extreme values.’ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ‘Great idea, Paul – so for example with this small data set, we can add the quartiles, Q1 = 3, Q2 = 6 (the median) and Q3 = 9...’ ‘... And the Interquartile Range is Q3 – Q1 = 9 – 3 = 6, the range of the middle 50% of the data.’
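Paul's idea can be tried with Python's standard `statistics` module. This is a sketch, not the slides' own method: textbooks and software use several slightly different quartile conventions, and the default "exclusive" method of `statistics.quantiles` happens to agree with the values quoted here for this data set.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# With n=4, quantiles() returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The interquartile range is the range of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3)  # Q1 = 3, Q2 = 6 (the median), Q3 = 9
print(iqr)         # 6
```

For other data sets, a different convention (for example `method="inclusive"`) may give slightly different quartiles, so it is worth stating which one is in use.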
‘I’ve got another idea!’ ‘What’s that, Millie?’ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ‘Go back to this data set again. We could find the mean, then find the difference of each of these numbers from the mean, and then add the differences together. If the numbers are spread out, then this will be big!’
‘That is nearly a great idea, Millie, but watch what happens...’ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 (mean 6) gives the differences -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5. ‘So the differences add to 0. Always.’ ‘But that is easily fixed...’
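A quick check of why Millie's first idea fails, as a minimal sketch. The differences always cancel because the sum of (x - mean) equals the sum of the data minus n times the mean, and those two quantities are equal by the definition of the mean.

```python
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

mean = sum(data) / len(data)           # 6.0
deviations = [x - mean for x in data]  # -5.0, -4.0, ..., 4.0, 5.0

# Negative and positive differences cancel exactly.
total = sum(deviations)
print(total)  # 0.0
```

With a mean that is not exactly representable as a floating-point number, the computed total may be a tiny rounding error rather than exactly zero.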
‘Find the POSITIVE difference of each of these numbers from the mean, and then add these differences together. It won’t be 0 now!’ ‘Indeed, Virender, the sum now is 30. But is that a fair measure of spread?’
‘Surely you have to divide by the total number of numbers you have - to take an average!’ ‘Excellent, Ding! And this takes us to what is called ‘the mean deviation from the mean’. If we write it in symbols, we have $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$.’
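The mean deviation from the mean can be sketched as follows. The function name `mean_deviation` is our own choice for illustration.

```python
def mean_deviation(values):
    """Average of the positive (absolute) differences from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# The positive differences sum to 30, and there are 11 numbers.
print(mean_deviation(data))  # 30 / 11, roughly 2.73
```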
‘There is still a problem, however – the modulus function is not always easy to handle mathematically. It is true that |ab| = |a||b|, but it is not generally true that |a + b| = |a| + |b|: for example, |1 + (-1)| = 0 but |1| + |-1| = 2.’ ‘Well, there are other ways to make the differences from the mean all positive. You could square the differences, for example!’ ‘Great idea, Millie. So we can find the square of the difference of each of these numbers from the mean, and then add these together. Then divide by the total number of numbers we have.’
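‘This is called the Mean Square Deviation (MSD), or ‘the population variance’: $\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$. If we multiply out, we get an alternative formulation, $\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$, that is usually easier to calculate, especially if the mean is not a whole number.’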
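As a rough check that the two formulations of the MSD agree, here is a small sketch using the data set 1, 2, ..., 11 from earlier. The function names are ours, chosen for illustration.

```python
import math

def msd_definition(values):
    """Mean of the squared differences from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def msd_shortcut(values):
    """Multiplied-out form: mean of the squares minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

print(msd_definition(data))  # 110 / 11 = 10
print(msd_shortcut(data))    # 506 / 11 - 6**2 = 46 - 36 = 10
print(math.isclose(msd_definition(data), msd_shortcut(data)))  # True
```

The shortcut is convenient by hand, but in floating-point arithmetic it can lose accuracy when the mean is large compared with the spread, so the definition form is often the safer choice in code.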
‘As before.’ ‘So have we got it now? Is this the measure of spread we generally use?’ ‘We are very nearly there, Brenda. There is, sadly, a problem with the MSD. Most of the time we are taking a SAMPLE from a population. We would like the expectation of our variance statistic to be the variance of the population. But in order for that to happen...’
‘We have to take our MSD statistic...’ ‘And divide by n-1 rather than n.’ ‘This statistic is called the ‘sample variance’ or simply the ‘variance’: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. The expected value of this is the population variance. As with the population variance statistic, there is an alternative form, $s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$...’ ‘Which is often easier to use.’
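The claim about expected values can be checked exactly on a tiny example. The sketch below is our own illustration and rests on stated assumptions: the population is taken to be 1, 3, 5, 7, 9, and samples of size 2 are drawn with replacement, so all 25 ordered samples are equally likely. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

def msd(values):
    """Sum of squared differences from the mean, divided by n."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - 1)

population = [1, 3, 5, 7, 9]                  # assumed population
samples = list(product(population, repeat=2)) # all 25 samples of size 2

# Averaging over every equally likely sample gives the expected value.
expected_msd = sum(msd(s) for s in samples) / len(samples)
expected_var = sum(sample_variance(s) for s in samples) / len(samples)

print(msd(population))  # 8, the population variance
print(expected_msd)     # 4, too small on average
print(expected_var)     # 8, matches the population variance
```

Dividing by n gives a statistic whose average here is only half the population variance, while dividing by n - 1 recovers it exactly.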
‘So is that all the measures of spread we need to know?’ ‘I should add, Virender, that we do use the square root of the MSD (called the RMSD) and the square root of the variance (called the Standard Deviation, or SD) as measures of spread too. The advantage of the RMSD and the SD is that they are measured in the same units as the random variable we are interested in.’ ‘So to summarise...’
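For the data set 1, 2, ..., 11 used earlier, the two square-root measures can be computed with Python's standard `statistics` module. This is a sketch: `pvariance` divides by n (the MSD) and `variance` divides by n - 1 (the sample variance).

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Square root of the MSD (population variance, which is 10 here).
rmsd = math.sqrt(statistics.pvariance(data))

# Square root of the sample variance (which is 11 here).
sd = math.sqrt(statistics.variance(data))

print(rmsd)  # sqrt(10), roughly 3.16
print(sd)    # sqrt(11), roughly 3.32
```

The same values are available directly as `statistics.pstdev(data)` and `statistics.stdev(data)`.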
Interquartile range (IQR) = Q3 – Q1, where the quartiles Q1, Q2 and Q3 divide the data set into four groups of equal size. Range = Top value – bottom value.
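Mean Square Deviation (population variance): $\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Root Mean Square Deviation: $\mathrm{RMSD} = \sqrt{\mathrm{MSD}}$. Variance (or sample variance): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Standard Deviation: $s = \sqrt{s^2}$.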
With thanks to pixabay.com www.making-statistics-vital.co.uk is written by Jonny Griffiths hello@jonny-griffiths.net