Statistical Methods for Health Intelligence Lecture 2: Perspectives, Data Types & Summaries Iain Buchan University of Manchester buchan@man.ac.uk
Course Material 1: Basic Text • Medical Statistics, 4th Ed, Campbell, Machin & Walters, Wiley 2007 • Statistical knowledge level: Public health practitioner • How are you getting on? • Are you using any other learning materials?
Your Participation • Today: questions about your reading • Take notes on my comments • Prepare to reproduce exercises in R
Course Material 2: R • Statistics: An Introduction Using R, Crawley, Wiley 2005 • cran.r-project.org • Reproduce each example in course text • Prepare to submit R scripts for assessment
Course Material: Optional • Probability and Random Variables: a beginner’s guide, Stirzaker, Cambridge University Press 1999 • Bad Science, Goldacre, Fourth Estate Ltd, 2008
Define • statistics • quantitative information about a topic • Statistics • The measurement of uncertainty
The Statistical Movement • Circa 1900: Galton, Pearson, Edgeworth and Yule establish Statistics as a discipline • Early/mid 1900s: Fisher consolidates statistical methods and experimental philosophy
Think • Whose perspective is Chapter 1? • Medical Statistician • Why must the Informatician look wider? • May not have the luxury of study design • Data- vs. hypothesis-driven research • Maximise information validity & utility
Health Statistics 1600-1860 Reasoning Summarisation Knowledge Observation
Health Statistics 1860-≈2000/now Reasoning Summarisation & Statistical Modelling Knowledge Observation ± Experimentation
Early/mid 1900s: Greenwood, Bradford-Hill & Doll push Statistics into medical research • Evidence Based Medicine • Causality • Clinical Trials • Mid-late 1900s: Cochrane pushes for the routine application of randomised clinical trials and leaves the evidence based medicine movement in his wake • Effectiveness & Efficiency
Define • Epidemiology • the study of the distribution and determinants of disease and health-related states in populations (JM Last, 2000)
Define • Confounding factor • A factor associated with both exposure and outcome but not on the causal pathway about which the inference is being made • What confounded the water cancer vs. water fluoridation example in the book?
Causal Inference Exposure Outcome Causal pathway Association Confounder
Sieving Associations C = caffeine, MI = myocardial infarction (heart attack) Disciplined approach to causal inference, Bradford-Hill: Criteria (temporality, strength, dose-response, consistency, plausibility, consideration of alternatives, open to experiment, specificity, coherence)
Hard to Make a Confident Causal Inference • Plausible pathway to link outcome to exposure • Same results if repeated in a different time, place or person • Exposure precedes outcome • Strong relationship ± dose effect • Causal factor relates only to the outcome in question • Outcome falls if risk factor removed...
Think • What is the most important question a Statistician wants a medic to ask? • How might I be wrong? • In designing my study • In making an inference about an association • In generalising my inference beyond the study population • Statisticians are understandably conservative • Informaticians must be carefully informative
Exhausted Epidemiology Platform • Problem 1: Dwindling hits from tools to detect independent “causes” • Problem 2: Knowledge can’t be managed by reading papers any more • The big public health problems, e.g. Type 2 Diabetes, have “complex webs of causes” • The “data-set” and structure extend beyond the study’s observations
Evidence limits showing • Epidemiology has exhausted the big simple causes of ill health • Many trials have weak external validity • Public health interventions are largely unstudied Many patterns of ill health in society remain unexplained via conventional studies
Need Statistical Informatics Data Necessary Complexity of Models Human Resource
Define • Statistical Data-types & Measurement Scales • Categorical Qualitative measuring • Binary/Dichotomous • Nominal > 2 categories, without order • Ordinal (loose) • Nominal with order • Ordinal (ties = lack of measurement sensitivity) • Numerical Quantitative measuring • Counts • Continuous (any value in a range) • Interval (fixed and defined, meaningful mean difference) • Ratio (zero means something)
Caution • Don’t treat ordered nominal data as interval! • Why? • Give examples? • Relate these to software requirements
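One way to see why, sketched in Python rather than the course's R. The pain ratings and both numeric codings below are invented for illustration; the point is only that two codings which preserve the same order can reverse a comparison of means, because the spacing between ordinal categories is arbitrary.

```python
from statistics import mean

# Hypothetical ordinal responses from two groups (illustrative data only).
group_a = ["none", "none", "severe"]
group_b = ["mild", "moderate", "moderate"]

# Two arbitrary codings; both respect none < mild < moderate < severe.
coding_1 = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
coding_2 = {"none": 0, "mild": 1, "moderate": 2, "severe": 10}


def mean_score(group, coding):
    """Mean of the numeric codes, i.e. treating the ordinal scale as interval."""
    return mean(coding[response] for response in group)


for name, coding in [("coding_1", coding_1), ("coding_2", coding_2)]:
    print(name, "mean A:", mean_score(group_a, coding),
          "mean B:", mean_score(group_b, coding))
```

Under the first coding group A has the lower mean, under the second it has the higher one, although no response changed. Software that stores such a variable as a plain number invites exactly this mistake.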
Programming Note • Which has the greater information utility? Sex = 1|2; Sex = m|f; Gender = m|f; Male = 1|0; Gender_Male = 1|0 • Maximum information, minimum ambiguity: Gender_Male = 1|0
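A minimal sketch of the recoding, in Python rather than R. The raw values and the code book (1 = male, 2 = female) are assumptions made up for the example; a real extract would need its own documented code book.

```python
# Hypothetical raw extract where sex arrives coded 1|2.
raw_sex = [1, 2, 2, 1, 2]

# Assumed code book for this example only: 1 = male, 2 = female.
SEX_CODEBOOK = {1: "m", 2: "f"}


def to_gender_male(code):
    """Return 1 for male and 0 for female; raise KeyError on an unknown code."""
    return 1 if SEX_CODEBOOK[code] == "m" else 0


# The variable name now states both the attribute and what the value 1 means.
records = [{"Gender_Male": to_gender_male(code)} for code in raw_sex]

# A self-describing 1|0 indicator can be averaged directly to give a proportion.
proportion_male = sum(r["Gender_Male"] for r in records) / len(records)
print(records)
print(proportion_male)
```

The indicator form needs no code book to interpret, and unknown codes fail loudly instead of being silently analysed.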
Discuss • Why categorise continuous data? • Meaningful thresholds (e.g. Hypertensive) • Compact summary / easy presentation • Easier analysis (good / bad?) • Avoid regression to the mean (homework)
Think • What is audit? • A quality improvement process that seeks to improve a service through systematic review against explicit criteria and implementing change • How does this differ from research? • Ethics • Constrained design • What is a natural experiment? • Homework...
Summarise Binary Data: r/n • Describe a proportion • r = outcome or feature present (numerator) • n = number of subjects observed (denominator) • p = r/n; RR = p1/p2; (A)RD = |p2-p1| • Relative Risk (RR) abuse • Pill ↑ risk of DVT by (RR =) 2: statistically significant, clinically insignificant, 2 women in 10,000 pill-years
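The summaries above can be sketched as follows, in Python rather than R. The counts are hypothetical, chosen only so that the relative risk is 2 while the absolute difference stays small; they are not data from the course text.

```python
def proportion(r, n):
    """p = r/n: outcomes present over subjects observed."""
    return r / n


# Hypothetical counts (illustrative only).
r1, n1 = 4, 10_000  # group 1, e.g. exposed
r2, n2 = 2, 10_000  # group 2, e.g. unexposed

p1 = proportion(r1, n1)
p2 = proportion(r2, n2)

relative_risk = p1 / p2               # RR = p1/p2
absolute_risk_diff = abs(p2 - p1)     # (A)RD = |p2 - p1|

print("RR  =", relative_risk)
print("ARD =", absolute_risk_diff, "=", absolute_risk_diff * 10_000, "per 10,000")
```

Reporting the relative risk alone hides how rare the outcome is; the absolute difference puts the same result on the scale that matters for a decision.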
Summarise Binary Data: r/n~t • Describe a rate • r = outcome/success/failure (numerator) • n = number of subjects observed (denominator) • t = time over which subjects observed • n*t = person time – why important? • Some may drop out or be lost to follow-up • (incidence) rate IR = r/(n*t), IRR • IRR = IR1/IR2; IRD = |IR2-IR1|
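A sketch of why person time matters, in Python rather than R. The follow-up records are invented; each subject contributes only the time actually observed, so drop-outs shrink the denominator instead of being counted as fully followed.

```python
# Hypothetical follow-up data: (years observed, event occurred) per subject.
cohort_1 = [(2.0, True), (5.0, False), (1.5, True), (5.0, False), (3.5, False)]
cohort_2 = [(5.0, False), (4.0, True), (5.0, False), (5.0, False), (1.0, False)]


def incidence_rate(cohort):
    """Events divided by total person time (here person-years)."""
    events = sum(1 for _, event in cohort if event)
    person_time = sum(years for years, _ in cohort)
    return events / person_time


ir1 = incidence_rate(cohort_1)
ir2 = incidence_rate(cohort_2)

irr = ir1 / ir2        # incidence rate ratio
ird = abs(ir2 - ir1)   # incidence rate difference

print("IR1 =", ir1, "IR2 =", ir2)
print("IRR =", irr, "IRD =", ird)
```

Dividing by the number of subjects alone would treat someone lost after one year the same as someone followed for five.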
[Figure: Percentage excess deaths in North vs. South England, by year, for males and females (vertical axis 0% to 25%). Source: John Hacking & Iain Buchan, pre-publication 2009]
Summarise Binary Data: Crosstabs • Variables C1-Ck – what is a crosstab? • Cross-tabulate categorical variables, say disease registration by gender • 2 by 2, r by c tables • Usually two way or two dimensional • Models may need higher dimensions, say disease registration by gender by speciality • Is a data cube the same? • Data Cube: A relational aggregation operator generalizing group-by, crosstab, and subtotals
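A two-way crosstab can be sketched with nothing more than a counter over pairs of categories. This is Python rather than R, and the records are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (disease registration status, gender) per person.
records = [
    ("registered", "f"), ("registered", "m"), ("not registered", "f"),
    ("registered", "f"), ("not registered", "m"), ("not registered", "m"),
]

# Count each combination of categories once.
counts = Counter(records)

# Arrange the counts as rows (registration) by columns (gender).
rows = sorted({status for status, _ in records})
cols = sorted({gender for _, gender in records})
table = {row: {col: counts[(row, col)] for col in cols} for row in rows}

for row in rows:
    print(row.ljust(16), [table[row][col] for col in cols])
```

Adding a third variable such as speciality would turn each key into a triple, which is the step from a crosstab towards a data cube.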
Contingency Table
Dimension 1 (columns): Exposure/Treatment/Category 1
Dimension 2 (rows): Outcome/Status/Category 2

                   Exposure Present   Exposure Absent
Outcome Present           a                  b
Outcome Absent            c                  d
Summarise Binary Data: Odds • How do odds differ from risk/proportion/probability? • Ratio of occurrence to non occurrence • Odds = p/(1-p) • OR = (a/c)/(b/d) = ad/bc • p = a/(a+c), so if a << c then a/(a+c) ≈ a/c and OR ≈ RR • OR_success = 1/OR_failure, not so for RR • Tractable computation with log odds
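The relationships above can be checked numerically. This is a sketch in Python rather than R, with invented cell counts; a and c are taken as the exposed group with and without the outcome, b and d as the unexposed group, which is the labelling implied by OR = (a/c)/(b/d) and p = a/(a+c).

```python
import math


def odds(p):
    """Odds = p / (1 - p): occurrence relative to non-occurrence."""
    return p / (1 - p)


# Hypothetical 2 by 2 counts with a rare outcome (a << c, b << d).
a, b, c, d = 10, 5, 990, 995

odds_ratio = (a / c) / (b / d)
cross_product = (a * d) / (b * c)                # same quantity, ad/bc
relative_risk = (a / (a + c)) / (b / (b + d))

# Swapping "outcome present" and "outcome absent" inverts the odds ratio...
odds_ratio_failure = (c / a) / (d / b)
# ...but does not invert the relative risk.
relative_risk_failure = (c / (a + c)) / (d / (b + d))

print("OR =", odds_ratio, "RR =", relative_risk)
print("OR_failure =", odds_ratio_failure, "1/OR =", 1 / odds_ratio)
print("RR_failure =", relative_risk_failure, "1/RR =", 1 / relative_risk)
print("log OR =", math.log(odds_ratio))
```

With a rare outcome the odds ratio and relative risk nearly coincide, and the log odds ratio is symmetric about zero when success and failure are swapped, which is part of what makes it convenient for modelling.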
Caution • If the odds ratio is interpreted as a relative risk it will always overstate any effect size: the odds ratio is smaller than the relative risk for odds ratios of less than one, and bigger than the relative risk for odds ratios of greater than one • The extent of overstatement increases as both the initial risk increases and the odds ratio departs from unity • However, serious divergence between the odds ratio and the relative risk occurs only with large effects on groups at high initial risk. Therefore qualitative judgments based on interpreting odds ratios as though they were relative risks are unlikely to be seriously in error • In studies which show reductions in risk (odds ratios of less than one), the odds ratio will never underestimate the relative risk by a greater percentage than the level of initial risk • In studies which show increases in risk (odds ratios of greater than one), the odds ratio will be no more than twice the relative risk so long as the odds ratio times the initial risk is less than 100%
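The size of the overstatement can be explored by converting an odds ratio back to a relative risk for a given initial risk. This is a sketch in Python rather than R; the function name and the grid of values are illustrative choices, and the conversion simply applies the definition of odds to the comparison group.

```python
def rr_from_or(odds_ratio, initial_risk):
    """Relative risk implied by an odds ratio when the comparison group has the given risk."""
    odds0 = initial_risk / (1 - initial_risk)   # odds in the comparison group
    odds1 = odds_ratio * odds0                  # odds in the other group
    risk1 = odds1 / (1 + odds1)                 # back from odds to risk
    return risk1 / initial_risk


# Illustrative grid: low, moderate and high initial risk; protective and harmful effects.
for initial_risk in (0.01, 0.10, 0.50):
    for odds_ratio in (0.5, 2.0, 5.0):
        rr = rr_from_or(odds_ratio, initial_risk)
        print(f"initial risk {initial_risk:.2f}  OR {odds_ratio:.1f}  RR {rr:.3f}")
```

Reading across the grid, the gap between OR and RR is negligible at 1% initial risk and becomes substantial only when a large effect meets a high initial risk.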
Visualise Categorical Data • When is a pie chart useful? • Seldom: arguably only in metaphor • How do you add dimensions to a bar chart? • Cluster • When is a 3D effect useful? • Not in 2D concepts! • Showing additional dimensions e.g. 2nd level cluster
Preparation for 15 Feb • Read chapters 4,5,6 to understand natural distributions and sampling • Return to chapter 3, run the examples in R and generate some alternative examples • Prepare to show ideal visualisations and summaries with your R scripts