Master class: Data, understanding it, interpreting it and using it. Ruth Harrell, Liann Brookes-smith
Agenda • 9.30am – 10.30am • 10.30am break • 10.45 – 11.30am • 11.40 – 12.30pm • 12.30 – 1.30pm lunch • 1.30 – 2.30pm probability • 2.30 – 2.45pm break • 2.45 – 3.30pm sampling and curve • 3.30 – 4.30pm confidence and risk
Introduction • Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~W.A. Wallis • “There are three kinds of lies: lies, damned lies, and statistics.” Disraeli (according to Mark Twain) • 98% of all statistics are made up. ~Author Unknown • Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~Aaron Levenstein • If you can not measure it, it does not exist ~ Author unknown
Question to the Room • What are statistics? • Why are data important? • What do you feel about stats? • What do they tell us? • E.g. 40% of children in XX area have dental caries. What does that tell us? • List the types of data you are aware of or use in your day-to-day work
Practitioner competencies Obtain, verify, analyse and interpret data and/or information to improve the health and wellbeing outcomes of a population / community / group – demonstrating: a. knowledge of the importance of accurate and reliable data / information and the anomalies that might occur b. knowledge of the main terms and concepts used in epidemiology and the routinely used methods for analysing quantitative and qualitative data c. ability to make valid interpretations of the data and/or information and communicate these clearly to a variety of audiences
Aim for the day • The aim of the day is to improve people’s understanding of the data they use, and of how to analyse and interpret it. • This session concentrates on the data rather than on things such as study design, but we are happy to discuss and answer questions on both; you can’t understand what the data is telling you without understanding how it has been collected and the potential for bias.
Topics covered • Types of data • Basic probability and stats • Understanding how data is collected • Measures of odds and ratios - comparing populations and study results. • Population sampling - Good samples and bad samples • Understanding Confidence intervals & p values - is the result reliable • How I apply data to what I am doing
Describing the data • We have a responsibility to present data in a way that can be easily understood, and which does not misrepresent the true meaning of the data. • Key decisions are made based on the data – or more accurately people’s impression of the data – so this has an impact on use of resources and eventually on patient care. • Accurate analysis and presentation of the data saves lives!
Quantitative vs. Qualitative • Quantitative data measures quantity, i.e. it is numerical. • Qualitative data is usually more descriptive and not measured in numbers. • However, data originally obtained as qualitative information about individual items may give rise to quantitative data if it is summarised by means of counts.
Discrete – Continuous • Discrete data can only take certain particular values • Continuous falls on a scale. • For example height is continuous, but the number of siblings is discrete.
Nominal - Ordinal • Nominal comes from the Latin nomen, meaning 'name', and is used to describe categorical data. There is no quantitative relationship between the different categories (though sometimes a number may be assigned for ease of analysis). An example is ethnicity. • Ordinal data again describes categories but there is some order to them, though the relationship between them may not be well defined. An example is the Agenda for Change pay scales: they are ordered and can therefore be put in sequence, but there is no numerical relationship between them.
Transforming the data • Sometimes the data you have isn't in the most effective form for display. E.g. you have data on weight in kilos. A list of continuous weights is not intuitive, so you convert it (together with height) to BMI categories, i.e. underweight, healthy weight, overweight, obese and morbidly obese. Continuous to ordinal.
Transforming the data (2) • With this you can display more meaningful data, BUT: • You lose the detail, such as the number of people at the edge of each category (borderline). • You can't transform it back. • What you transform it to may not be the best use of the data. • You can also transform data using more complex calculations, such as taking the “log” of each number; this will sometimes convert skewed data to normally distributed data (discussed later)
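The two transformations described above can be sketched in Python. This is an illustration only: the cut-offs are an assumption based on the commonly used adult BMI bands, the people and values are made up, and the names `BMI_BANDS`, `bmi` and `bmi_category` are hypothetical.

```python
import math

# Assumed cut-offs (kg/m^2), following the commonly used adult BMI bands.
# Each entry is (upper limit, label); anything at or above 40 falls through.
BMI_BANDS = [
    (18.5, "underweight"),
    (25.0, "healthy weight"),
    (30.0, "overweight"),
    (40.0, "obese"),
]

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilos divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Map a continuous BMI value onto an ordered (ordinal) category."""
    for upper, label in BMI_BANDS:
        if value < upper:
            return label
    return "morbidly obese"

# Made-up people: (weight in kg, height in m).
people = [(50.0, 1.75), (70.0, 1.75), (85.0, 1.75), (100.0, 1.75), (130.0, 1.75)]
categories = [bmi_category(bmi(w, h)) for w, h in people]

# Log transform of made-up, right-skewed values: large gaps become even steps.
skewed = [1, 10, 100, 1000]
logged = [math.log10(v) for v in skewed]

print(categories)
print(logged)
```

Note that the category labels cannot be turned back into the original weights, which is the loss of detail the slide warns about.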
Exercise • Exercise 1 and 2
Displaying the data • What are the options? • Tables – simple descriptive, cross tab… (mention pivot table) • Graphs – bar, line, x-y or scatter, pie chart….
Basic statistics and probability • Having looked at the raw data and carried out any transformations you felt necessary, you now want to describe the features of this data. • Distributions – plotting the data is the first step in this. You need to consider the shape of the graph before you know how to best analyse the data.
Types of graph • Normal
Types of graph • Skewed
Types of graph • Bimodal
Types of graph • Uniform
15 minute Break!
Data measures Definitions: • Range: the difference between the highest and the lowest values in a set • Mean: the sum of all the measured values divided by the number of measures • Median: the middle measure when the values are put in order • Mode: the measure found most often • Interquartile range: a measure of statistical dispersion, equal to the difference between the upper and lower quartiles • Standard deviation: a measure of how spread out the numbers are around the mean.
Mean, median and mode • Mean = (sum of observations) / (number of observations) • Mode = the most common observation • Median = the number where 50% of observations are below and 50% are above
Standard Deviation and IQR • Std Dev = square root of [sum of (squared difference between each observation and the mean) / (number of observations - 1)] • IQR = the difference between the value at the 75th percentile and the value at the 25th percentile
Formulas • Sample mean: $\bar{x} = \frac{\sum x_i}{n}$ • Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ • $x_i$ is each observation • $n$ is the number of observations • $\sum$ means ‘sum’
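As a rough check on the definitions and formulas above, the measures can be computed with Python's standard `statistics` module. The data set is made up for illustration. Note that software packages use slightly different conventions for quartiles, so the IQR can differ a little between tools.

```python
import statistics

# Made-up observations, for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

data_range = max(data) - min(data)   # highest minus lowest
mean = statistics.mean(data)         # sum of observations / number of observations
median = statistics.median(data)     # middle value once sorted
mode = statistics.mode(data)         # most common observation
sd = statistics.stdev(data)          # sample standard deviation (divides by n - 1)

# Quartiles: statistics.quantiles with n=4 returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                        # 75th percentile minus 25th percentile

print(data_range, mean, median, mode, round(sd, 3), iqr)
```

Here the mean (6) is higher than the median (5) because the larger values pull the mean upwards.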
How reliable is my data? • Any data missing? • How old is it? • What is the denominator? • Who collected it • How was it collected? • Ways to avoid making statements about inaccurate data?
Interpret the graph • This graph shows the trend in adult obesity from 1993 to 2007 • Percentage of what? (All adults, presumably? All registered? All resident?) What age is defined as an adult? • Is the increase due to chance or is it an actual increase? • The data is quantitative/continuous
Bias • When looking at data, sometimes the relationship we see is caused by the way in which we are measuring, and is not actually what is there.
Fudging • Rate or Number • You have 50 cases of COPD in area 1, and 150 cases of COPD in area 2. Should you do something in area 2? • Area 1 has a population of 2000 • Area 2 has a population of 5000 • In area 1 the rate in 50-74 year olds is 20/1000 • In area 2 the rate in 50-74 year olds is 42/1000 • Area 1’s data was from 2004 • Area 2’s data was from 2005-2009 • Area 1 is 20/1000, confidence interval (12 – 48 per 1000) • Area 2 is 42/1000, confidence interval (18 – 56 per 1000) • Now what?
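The arithmetic behind this slide can be sketched as follows, using only the figures given above. The helper names are hypothetical. Checking whether two confidence intervals overlap is an informal rule of thumb rather than a formal statistical test.

```python
# Figures taken from the slide; "ci" is the quoted confidence interval
# (per 1000) for the rate in 50-74 year olds.
areas = {
    "Area 1": {"cases": 50, "population": 2000, "ci": (12, 48)},
    "Area 2": {"cases": 150, "population": 5000, "ci": (18, 56)},
}

def crude_rate_per_1000(cases, population):
    """Cases per 1000 people, ignoring age structure."""
    return 1000 * cases / population

def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

rates = {
    name: crude_rate_per_1000(d["cases"], d["population"])
    for name, d in areas.items()
}
overlap = intervals_overlap(areas["Area 1"]["ci"], areas["Area 2"]["ci"])

print(rates)     # crude rates per 1000 for each area
print(overlap)   # do the quoted confidence intervals overlap?
```

Three times as many cases turns into a much smaller gap once population is taken into account (25 versus 30 per 1000), and the overlapping intervals suggest the age-specific difference could be down to chance. The different data years also make the comparison less reliable.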
Exercise • Exercise 5 • What do these data tell you? Key message? • What would you ask of these data? What further information would you want to know?
Basics of probability • Probability is a way of quantifying the judgements that we make all the time – from ‘do I need an umbrella?’ to ‘shall I bet on that horse?’ • Probability is measured on a linear scale of 0 to 1 where 0 is impossible and 1 is absolutely certain.
Probability • Why is probability relevant to public health? • Probability gives us a quantitative measurement of the chances of something happening, and there are 2 key ways in which it is used in Public Health • It is another word for risk (or, if the impact is positive, benefit). For example, the probability that someone who smokes cigarettes will get lung cancer has been shown to be much higher than for someone who doesn’t smoke. • It helps us to answer the question ‘how likely is it that the observed effect is due to our intervention and not just to chance?’, and is used in all types of studies – testing medical treatments, evaluating the impact of public health interventions, assessing the need of one population compared to another.
Probability and risk • Odds – number of events divided by the number of non-events • Risk in exposed – number of events in the exposed group divided by the number exposed • Risk in unexposed – number of events in the unexposed group divided by the number unexposed • Relative risk (or risk ratio) is the ratio of the probability of the event occurring in the exposed group to that in the unexposed group • Absolute risk difference is the difference in risk between the exposed and the unexposed.
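A small worked example may help to separate these measures. The counts below are invented purely for illustration, and odds are taken here as events divided by non-events, which is the usual definition.

```python
# Invented counts for illustration: events among exposed and unexposed people.
exposed_events, exposed_total = 40, 200
unexposed_events, unexposed_total = 10, 200

# Risk = events divided by everyone in the group.
risk_exposed = exposed_events / exposed_total
risk_unexposed = unexposed_events / unexposed_total

# Relative risk (risk ratio) and the absolute difference in risk.
relative_risk = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed

# Odds = events divided by non-events; the odds ratio compares the two groups.
odds_exposed = exposed_events / (exposed_total - exposed_events)
odds_unexposed = unexposed_events / (unexposed_total - unexposed_events)
odds_ratio = odds_exposed / odds_unexposed

print(f"risk exposed    = {risk_exposed:.2f}")
print(f"risk unexposed  = {risk_unexposed:.2f}")
print(f"relative risk   = {relative_risk:.2f}")
print(f"risk difference = {risk_difference:.2f}")
print(f"odds ratio      = {odds_ratio:.2f}")
```

With these made-up numbers the exposed group has four times the risk, but the absolute difference is 15 percentage points. Quoting only one of the two figures can give a misleading impression.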
Probability cont… • What is the probability of a 6 if you throw an unbiased dice? • What is the probability of a total of 6 if you throw two unbiased dice?
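One way to check answers to questions like these is to list every equally likely outcome and count the ones of interest. A minimal sketch, assuming fair six-sided dice:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

# One die: one favourable face out of six.
p_six_one_die = Fraction(sum(1 for f in faces if f == 6), len(faces))

# Two dice: enumerate all 36 ordered pairs and keep those that total 6.
outcomes = list(product(faces, repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 6]
p_total_six = Fraction(len(favourable), len(outcomes))

print(p_six_one_die)   # 1/6
print(favourable)      # (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)
print(p_total_six)     # 5/36
```

The pairs (1, 5) and (5, 1) count separately because the two dice are distinct, which is why the answer is 5/36 rather than 3/36.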
Welcome back!! • I'm not an outlier I just haven't found my distribution yet.
Exercise • Exercise 6 • Worse and early death = 0-3/10 • No change = 4-5 /10 • Cure = 2-6/10
Population sampling (1) • In the real world we don’t usually get data from everybody that we are interested in. Why not? • Cost and resources may be too large • People may choose to opt in or out • May have incomplete data (data entry problems etc)
Population sampling (2) • So what we need to do is measure a sample of people and infer from that sample what the population looks like. We can do this by tweaking the statistical formula used – but there are two things to consider: • If your sample size is too small you are unlikely to get a reasonable result – you can still use the formula but you need to bear this in mind when interpreting it • Think about who you have managed to sample – are they representative of the population? (Imagine walking into a large open plan office with a set of scales and asking people if they would mind being weighed – who is more likely to volunteer?)
Population sampling (3) • If we have a REPRESENTATIVE sample, we can apply a statistical tweak to help us to estimate the figure for the population. • If we don’t (if the sample is biased), though we can carry out the maths, it will always be flawed.
Population sampling (4) Principle – • Measure your sample • Calculate the mean and standard deviation (of the sample) • Calculate the standard error = standard deviation of the sample / √n, where n is the number of observations in the sample • To estimate the population mean, we say our best guess is that the population mean is equal to the sample mean • Then we can use the standard error to estimate how close we think our estimate is. • First we need to talk about confidence intervals
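The first steps of this principle can be sketched in Python. The sample is made up. The standard error is calculated in the usual way, as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

# Made-up sample measurements, for illustration only.
sample = [2, 4, 4, 5, 7, 9, 11]

n = len(sample)
sample_mean = statistics.mean(sample)   # best guess for the population mean
sample_sd = statistics.stdev(sample)    # sample standard deviation (n - 1)

# Standard error of the mean: how much the sample mean would be expected
# to vary from one sample to another.
standard_error = sample_sd / math.sqrt(n)

print(f"n = {n}, mean = {sample_mean}, sd = {sample_sd:.3f}, "
      f"standard error = {standard_error:.3f}")
```

Because the sample size appears under a square root, quadrupling the sample size only halves the standard error.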
Which one is an insult? • Darling, you are two standard deviations below the mean • Of course you're normal (mean 10, mode 7) • You are mean • Your looks are in the 80th percentile • The difference between you and her is a standard deviation
Probability, Population Sampling and the Normal Curve Thinking about our data that fitted the normal curve – • By using the mathematical model we can easily calculate probabilities. The maths tells us that; • The total area under the normal curve is equal to 1. • The probability that any new observation will fall within one standard deviation of the mean is 68% • The probability that any new observation will fall within two standard deviations of the mean is 95% • The probability that any new observation will fall within three standard deviations of the mean is 99.7%
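The three percentages above are rounded values. For a normal curve, the probability of falling within k standard deviations of the mean is erf(k / √2), so they can be checked with the standard library. The function name `prob_within` is just an illustrative choice.

```python
import math

def prob_within(k):
    """Probability that a normally distributed value lies within
    k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The two standard deviation figure is a little over 95%. The multiplier that gives exactly 95% is about 1.96, which is why that number appears in many confidence interval formulas.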
CERN experiments observe particle consistent with long-sought Higgs boson • Geneva, 4 July 2012. “We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV. The outstanding performance of the LHC and ATLAS and the huge efforts of many people have brought us to this exciting stage,” said ATLAS experiment spokesperson Fabiola Gianotti, “but a little more time is needed to prepare these results for publication.” • At five sigma there is only about one chance in nearly two million that a random fluctuation alone would produce a signal this strong.
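The "one chance in nearly two million" figure can be checked against the normal curve. This sketch assumes a two-sided tail, that is, a fluctuation of five or more standard deviations in either direction. Particle physicists often quote the one-sided figure instead, which is half as likely.

```python
import math

sigma = 5

# Probability of landing 5 or more standard deviations from the mean,
# in either direction, under a normal curve.
p_two_sided = math.erfc(sigma / math.sqrt(2))
p_one_sided = p_two_sided / 2

print(f"two-sided: about 1 in {1 / p_two_sided:,.0f}")
print(f"one-sided: about 1 in {1 / p_one_sided:,.0f}")
```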
Confidence intervals (1) • If we measure one individual’s IQ we can be 95% sure that it would fall between 70 and 130. This ‘interval’ is called the 95% confidence interval. We use 95% by convention; sometimes other figures are used, such as 98%. • If we measure the heights of a class of children and we have a mean of 1.2m and a standard deviation of 0.1m, what is your estimate for the height of a child randomly selected from the sample? • 1.2 +/- 0.2m (the mean plus or minus two standard deviations), i.e. about 95% of this sample lies between 1.0 and 1.4m
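Both intervals on this slide follow the same "mean plus or minus two standard deviations" rule. The sketch below reproduces them. The height figures come from the slide, but the IQ mean of 100 and standard deviation of 15 are an assumption (the conventional scaling of IQ scores), since the slide gives only the resulting range. The function name is hypothetical.

```python
def two_sd_interval(mean, sd):
    """Range expected to contain roughly 95% of values from a normal curve."""
    return mean - 2 * sd, mean + 2 * sd

# Heights of the class: mean 1.2 m, standard deviation 0.1 m (from the slide).
height_low, height_high = two_sd_interval(1.2, 0.1)

# IQ: assumed mean 100 and standard deviation 15 (not stated on the slide).
iq_low, iq_high = two_sd_interval(100, 15)

print(f"heights: {height_low:.1f} m to {height_high:.1f} m")
print(f"IQ: {iq_low} to {iq_high}")
```

Using 2 rather than the more exact 1.96 makes the interval very slightly wider than 95%, which is usually close enough for a quick estimate.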