CRITICAL NUMBERS SAMPLING WITH CONFIDENCE


Last week we looked at: • Displaying data using stem & leaf plots, histograms, bar charts and box & whisker plots • The summary measures: mean, median, mode, range, interquartile range, standard deviation • Elementary properties of the Normal distribution However, displaying data is not everything; you also need to consider what it is that you are comparing.

Last week we were comparing a single observation to a reference group • This week we will be looking at samples and will want to see how they compare to a reference group

Sampling With Confidence At the end of the session you should be able to: • Distinguish between a population and a sample • Define different methods of sampling • Calculate and understand what is meant by the term standard error (se) • Understand the concept of repeated sampling and its applicability to a single sample • Understand what is meant by the term confidence interval and how to interpret one

The scenario “Our doctor has been asked to do an audit of the patients with diabetes at her practice, in order to see how the practice is managing with respect to blood glucose control. She picks a patient at random and looks at their notes, and the result is not good…”

The blood test: HbA1c • Haemoglobin is produced for inclusion in red blood cells as the cells are formed. If there is a lot of glucose in the blood, the haemoglobin is glycosylated • The percentage of the haemoglobin in the blood that is glycosylated is an indicator of how well controlled blood sugars have been over the past few months • It enables us to look at how good the glycaemic control is in a patient known to have diabetes • Ideally we want the percentage to be <7%

About the patients • There are 300 patients with Type II Diabetes at the practice. This disease is usually seen in older people and can be treated with changes in diet, tablets and with injections of insulin • Uncontrolled blood sugars make the complications of diabetes more likely • These include heart attacks and strokes plus damage to blood vessels which can result in kidney failure, blindness and amputations

Population: all individuals in which we are interested • Sample: a group of individuals drawn from our population of interest, which we study in order to learn about the population • The results from our sample are used as the best estimate of what’s true for the population (but the sample needs to be representative of the population)

It is rare that we look at the whole population (as in a census) • More usually we have samples, and from these we calculate certain quantities, such as the mean and standard deviation, which we then use to make inferences about the population • Quantities calculated for samples are known as sample estimates, and they are used to estimate population quantities (parameters), for example: • Mean HbA1c in people with diabetes • Proportion of people with diabetes

Population and Sample

Methods of Sampling (not an exhaustive list) • Convenience sampling • All patients available at a particular point in time, assumed to behave like a random sample • E.g. the last 20 patients to consult about diabetes • Random sample • All members of the population are equally likely to be picked, independently of each other • E.g. patients picked at random from a list of all patients with diabetes on the GP register • Stratified random sample • The population is divided into groups beforehand, then randomly sampled within those groups • E.g. males and females, or age groups
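The three methods listed above can be sketched in a few lines of Python. This is only an illustration: the register, the patient IDs and the sex labels are made up, and the function names are hypothetical rather than taken from any package.

```python
import random

# Hypothetical register: 300 patient IDs in consultation order,
# each with a made-up sex label (not real practice data).
rng = random.Random(1)
register = [{"id": i, "sex": rng.choice(["F", "M"])} for i in range(1, 301)]

def convenience_sample(patients, n):
    """The last n patients to consult (assumes the list is in consultation order)."""
    return patients[-n:]

def simple_random_sample(patients, n, rng):
    """Every patient is equally likely to be picked (sampling without replacement)."""
    return rng.sample(patients, n)

def stratified_random_sample(patients, n_per_group, key, rng):
    """Divide patients into groups by `key`, then sample at random within each group."""
    groups = {}
    for p in patients:
        groups.setdefault(p[key], []).append(p)
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], n_per_group))
    return sample

conv = convenience_sample(register, 20)
rand = simple_random_sample(register, 20, rng)
strat = stratified_random_sample(register, 10, "sex", rng)
print(len(conv), len(rand), len(strat))
```

Note that the convenience sample always returns the same 20 patients, whereas the two random methods depend on the random number generator.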

But what we want to know is: how good is the sample mean as an estimate of the true population mean?

The clinical problem cont… • Sampling can give an estimate of the true values within a group • The larger the sample – the better the estimate • Using confidence intervals, derived from standard errors, we can quantify how good an estimate a sample result is likely to be

Random samples • We looked at the HbA1c results from all 300 patients with diabetes for the purposes of this example • We then took some random samples of the results • One sample (A) was just 20 of the 300 patients • Another (B) was 50 patients • And a third (C) was 100 patients

Random samples • The average HbA1c values (as a percentage of total Hb) for the samples were: • Sample A (20 patients) 7.4 • Sample B (50 patients) 6.48 • Sample C (100 patients) 6.65 • But which of these best estimates the mean value for all 300 patients? • How certain – or uncertain – are we?

So how good is the sample mean as an estimate of the true population mean? To answer this we need to assess the uncertainty of our sample mean • Different samples can give different estimates of the population mean • If we take repeated samples (of the same size) we get a spread of sample means which we can display visually in a dot plot (if the number is small enough), boxplot or histogram • The variability (spread) of these sample means gives us an indication of the uncertainty of our single sample mean

Repeated random samples Let’s redo each sample 50 times and see what results we get
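A rough simulation of this repeated sampling is sketched below. The 300 values are synthetic (drawn from a Normal distribution with an assumed mean of 6.7 and SD of 1.3), not the practice's actual HbA1c results, so the numbers printed will not match the slides; the point is only that the spread of the sample means shrinks as the sample size grows.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in for the 300 HbA1c results (assumed values, not real data).
population = [rng.gauss(6.7, 1.3) for _ in range(300)]

def sample_means(population, n, repeats, rng):
    """Draw `repeats` random samples of size n and return the mean of each."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

spread = {}
for n in (20, 50, 100):
    means = sample_means(population, n, 50, rng)
    # The SD of the sample means is what the slides call the standard error.
    spread[n] = statistics.stdev(means)
    print(f"n={n:3d}: mean of sample means={statistics.mean(means):.2f}, "
          f"SD of sample means={spread[n]:.2f}")
```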

Properties of the distribution of the sample means • The mean of all the sample means will be the same as the population mean • The standard deviation of all the sample means (not the individual data values!) is known as the STANDARD ERROR • The distribution of sample means will be roughly Normal, regardless of the distribution of the variable, given a large enough sample size (the Central Limit Theorem)

What is the Standard Error? • In practice we cannot repeatedly sample from the population • What we want to know is how likely are we (in our single sample) to have captured what is going on in the population • And so, when we have only one sample we calculate the STANDARD ERROR • The standard error (se) is an estimate of the precision of the population parameter estimate that doesn’t require lots of repeated samples. It provides a measure of how far from the true value the sample estimate is likely to be.

What is the Standard Error? • As we saw in the previous figure, as the sample size increases the approximation of the sample to the population improves, i.e. the spread of the sample means gets smaller • Thus, all other things being equal, we would expect estimates to get more precise and the value of the se to decrease as the sample size increases.

SE for samples A, B and C
                 Sample A   Sample B   Sample C
• Size (n)       20         50         100
• Mean           7.4        6.48       6.65
• SD             1.30       1.24       1.32
• SE (= SD/√n)   0.29       0.18       0.13
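The SE values for samples A, B and C are consistent with the usual formula for the standard error of a mean: the sample SD divided by the square root of the sample size. A minimal check in Python, using only the sizes and SDs given for the three samples (the function name is just illustrative):

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# (sample size, SD) for samples A, B and C
samples = {"A": (20, 1.30), "B": (50, 1.24), "C": (100, 1.32)}

se = {name: round(standard_error(sd, n), 2) for name, (n, sd) in samples.items()}
print(se)  # {'A': 0.29, 'B': 0.18, 'C': 0.13}
```

The three SDs are similar, so almost all of the drop in SE from A to C comes from the larger sample size.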

Standard Deviation versus Standard Error? • The standard deviation quantifies the spread of individuals • The standard error quantifies the spread of the mean • The standard error is sometimes called the standard deviation of the mean • Standard deviation is for description and describes the variability of the data • Standard error is for estimation and describes the precision of the mean Standard DEVIATION – DESCRIBING Standard ERROR – ESTIMATING

Confidence Intervals • The sample mean is the best estimate we have of the true population mean • However, we need to assess how good the sample mean is as an estimate of the true population mean. For this we use the Standard Error (a measure of the precision of an estimate of a population parameter) • The distribution of sample means of many samples of the same size will be approximately Normally distributed regardless of the distribution of the variable (the Central Limit Theorem) • From this we can use the sample mean and its standard error to construct a confidence interval: a range of values within which the population mean is likely to lie

Properties of the Normal Distribution • Bell shaped and symmetrical • Any position along the horizontal axis can be expressed as a number of SDs away from the mean • The mean and median will coincide • About 68% of the observations will lie within 1 SD of the mean • More importantly, about 95% of the observations will lie within approximately 2 SDs of the mean (and conversely 5% of observations lie more than 2 SDs away from the mean) • The total area under the curve is 1

5% of observations lie outside the mean ± 1.96SD or p = 0.05

Confidence Intervals • Confidence intervals give limits in which we are confident (in terms of probability) that the true population parameter lies • Describe the variability surrounding the sample point estimate • In general, they depend upon making assumptions about the data

Confidence Intervals • For example a 95% CI means that if you could sample an infinite number of times • 95% of the time the CI would contain the true population parameter • 5% of the time the CI would fail to contain the true population parameter • Alternatively: it gives a range of values that will include the true population value for 95% of all possible samples
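The "repeated sampling" reading of a 95% CI can be checked by simulation. The sketch below is illustrative only: it assumes a Normally distributed population with a made-up true mean of 6.7 and SD of 1.3, draws many samples of 50, and counts how often the interval mean ± 1.96 × SE contains the true mean.

```python
import math
import random
import statistics

rng = random.Random(0)

TRUE_MEAN, TRUE_SD = 6.7, 1.3  # assumed population values, for illustration only
N, REPEATS = 50, 2000

covered = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% CI contain the true population mean?
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of the {REPEATS} intervals contained the true mean")
```

The proportion should come out close to 95%. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many samples.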

95% Confidence Interval A range of values which will include the true population mean with probability 0.95 Formulae: 95% Confidence Interval • For the mean: sample mean ± 1.96 × SE of the mean • For the difference in two means: difference in sample means ± 1.96 × SE of the difference (n.b. 1.96 is often rounded to 2)

CI for samples A, B and C
                 Sample A       Sample B       Sample C
• Size           20             50             100
• Mean           7.4            6.48           6.65
• SD             1.30           1.24           1.32
• SE             0.29           0.18           0.13
• CI             6.82 to 7.98   6.12 to 6.84   6.39 to 6.91
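The intervals for samples A, B and C can be reproduced from the means and SEs as mean ± 2 × SE, i.e. with 1.96 rounded to 2 and the SEs taken as printed (to two decimal places). A short sketch, with an illustrative function name:

```python
def confidence_interval(mean, se, multiplier=2.0):
    """Approximate 95% CI for a mean: mean ± multiplier × SE."""
    half_width = multiplier * se
    return mean - half_width, mean + half_width

# (mean, SE) for samples A, B and C, as given
table = {"A": (7.4, 0.29), "B": (6.48, 0.18), "C": (6.65, 0.13)}

ci = {
    name: tuple(round(limit, 2) for limit in confidence_interval(mean, se))
    for name, (mean, se) in table.items()
}
for name, (low, high) in ci.items():
    print(f"Sample {name}: {low:.2f} to {high:.2f}")
# Sample A: 6.82 to 7.98
# Sample B: 6.12 to 6.84
# Sample C: 6.39 to 6.91
```

Using 1.96 instead of 2 changes the limits only slightly; for sample A it gives 6.83 to 7.97.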

Confidence Interval or Reference Range? • During the last session we looked at creating reference ranges from a ‘population’ of blood results • This was constructed by describing a range between two SDs below the mean and two SDs above the mean (mean ± twice the SD) • A Confidence interval describes the precision of a sample estimate and is constructed for the mean by describing a range between two SEs below the mean and two SEs above the mean (mean ± twice the SE) Standard DEVIATION – DESCRIBING Standard ERROR – ESTIMATING
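To make the contrast concrete, the sketch below computes both ranges from the summary figures for sample C (n = 100, mean 6.65, SD 1.32). Building a reference range from a sample's SD is an assumption made here purely for illustration; last session's range was built from a 'population' of results.

```python
import math

n, mean, sd = 100, 6.65, 1.32  # summary figures for sample C
se = sd / math.sqrt(n)

# Reference range: describes the spread of individual values (mean ± 2 SD)
reference_range = (mean - 2 * sd, mean + 2 * sd)

# Confidence interval: describes the precision of the mean (mean ± 2 SE)
ci_for_mean = (mean - 2 * se, mean + 2 * se)

print(f"Reference range:     {reference_range[0]:.2f} to {reference_range[1]:.2f}")
print(f"Confidence interval: {ci_for_mean[0]:.2f} to {ci_for_mean[1]:.2f}")
```

The reference range (roughly 4.0 to 9.3) is ten times wider than the confidence interval (roughly 6.4 to 6.9), because the SE here is the SD divided by √100 = 10.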

Figure: Mean HbA1c values for the three groups, together with the 95% confidence interval for each mean value. The cut-off for good control is marked.

Figure: Mean HbA1c values for the three groups, together with the 95% confidence interval for each mean value. The cut-off for good control and the true population mean are marked.

Confidence Intervals • For the group of size 20, though the estimate is 7.4, the true mean could plausibly be as low as 6.8 or as high as 8.0 • Whereas, for the group of size 100, the limits are much closer: the best estimate is 6.65 and the range of plausible values is between 6.4 and 6.9, i.e. a much narrower range than for the smaller sample • So, which would we choose? • It’s clear from the graph that the confidence interval for the sample of size 20 is wide • For our purposes 50 would probably have been adequate, as the confidence interval is narrower and does not include the cut-off value of 7 • The size of sample that is chosen depends on many things, including the purpose of the investigation, and how accurately you want to estimate the quantity of interest

The Clinical Problem • Our doctor needed to do an audit of how she was performing with regard to diabetes care • With 300 patients it would have been time consuming to examine the notes for all of them • Thus, rather than look at the entire population, a sample could be taken to get an estimate of the true value in the population • Sampling can give an estimate of the true values within a group • The larger the sample, the better the estimate

The Clinical Problem • Using confidence intervals, derived from standard errors, we can quantify how good an estimate a sample result is likely to be • Examination of 20 patients’ notes suggested that, on average, the patients with diabetes were poorly controlled • But in fact, if she could have tested all 300, she would have realised they were actually doing very well • So does our doctor need to measure all of the patients? Of course not!

Session recap • Use random sampling, so as to minimise bias • Can use samples to make estimates of population quantities • Can estimate the precision of the sample estimates using the standard error • Can use confidence intervals to say how confident we are about our sample estimate

Session recap You should now be able to: • Distinguish between a population and a sample • Define different methods of sampling • Calculate and understand what is meant by the term standard error (se) • Understand the concept of repeated sampling and its applicability to a single sample • Understand what is meant by the term confidence interval and how to interpret one

Next week… • In the next “Every day numbers, what Doctors need to know” session we are going to look at estimation and hypothesis testing.
