290 likes | 300 Views
Explore lectures, lecture notes, and resources for statistics in medical sciences division. Learn about data analysis, visualization, and statistical tests.
E N D
Statistics for Medical Sciences Division Kanishka Bhattacharya
Resources • Lectures • Lecture Notes • Andy Field (2009) Discovering Statistics Using SPSS (3rd Edition) Sage publishers • Ramsey and Schafer (2002) The Statistical Sleuth(2ndEdition) Thomson Learning • Email: kanishka@well.ox.ac.uk
Lesson Structure 4 lectures & practical sessions = ( 2 x 4 ) hrs • Overview of statistics, exploratory data analysis • Z-tests, t-tests, Analysis of Variance (ANOVA), post-hoc analysis • Categorical analysis, odds ratio • Linear regression
Analysing a set of data • Look at the data (initial checks on the data) • Downloading data, formatting, data collection, discrepant data, missing data • Visualize the data (exploratory data analysis) • Descriptive statistics, informative tables, well-constructed figures • Analyse the data (definitive analysis) • Formal statistical analysis • Quantify any interesting results • Report the findings
Computers and Statistics • Excel, SPSS, Minitab, Stata, Mathlab, R, etc… • Advantages • Speed, accuracy, ease of data manipulation • Easy to produce plots, cross-tabulation tables, summary statistics • Disadvantages • Inappropriate analysis / use of wrong tests • Data dredging
Sample Selection • Simple Random Sample • Stratified Sample • Cluster Sampling • Multistage Sampling • Multi-phase Sampling
Types of Variables • Often, test to use depends on the type of variable at hand • Two main classes of variables: • Categorical • Numerical • Categorical variables further divided into two sub-classes: • Nominal categorical (example: gender, ethnic groups) • Ordinal categorical (example: size of a car, quality of teaching)
Numerical Variables • Distinguish between discrete or continuous numerical variables • Discrete • Integer values (number of male subjects, number of episodes of flu outbreaks) • Continuous • Takes a whole range of values (height, weight) • Continuous variables treated as discrete (age)
EDA • Tabular EDA • Univariate tables, cross-tabulation of categorical variables • Numerical EDA • Location, spread, skewness, covariance and correlation • Graphical EDA • Frequency plots, histograms, boxplots, scatterplots • The precise form of EDA depends on the data at hand.
Tabular EDA • Useful for summarising categorical data. For example, the following table shows the classification of 2,555 DNA samples in a case-control study of Malaria onset into the respective phenotypes:
Tabular EDA • For two categorical variables: i.e. the distribution of genotypes for a sub-sample of the individuals affected and unaffected by severe malaria Question: Appears to be more affected individuals with the C allele (or conversely, more unaffected individuals have the AA genotype). Does this mean anything? (see later for test of independence)
Calculating informative numbers which summarise the dataset • What are the numbers useful for describing the age of 1,059 individuals with diabetes? • Location parameters (mean, median, mode) • Skewness Mean age (54.6 years) 20 30 40 50 60 70 80 AGE Numerical EDA • Spread (range, standard deviation, interquartile range)
Numerical EDA • Sample QuartilesQ1: 25th quantile (or value of the 25% ranked data)Q2: 50th quantile (also known as median of data)Q3: 75th quantile (or value of the 75% ranked data) • Interquartile range (IQR) IQR = Q3 – Q1 • Minimum, Maximum of data
Numerical EDA • Numbers can be informative to identify potential problems with the data • Example: Suppose the height for 1,496 individuals randomly sampled from the population produces the following summary IQR = Q3 – Q1 = 188 – 172 = 16 Range = Max – Min = 201 – 0 = 201
Numerical EDA Skewness b1 is unit-free, where 0 indicates symmetry about sample mean. Positive values indicate right-skewness, and negative values indicate left-skewness.
Covariance and Correlation • Two numerical variables: height and weight • Questions • Are there any relationship between these variables? • If there is, how do we quantify this relationship? • Covariance and correlation Measures the degree of association between two numerical variables.
Covariance and Correlation • Covariance is scale-dependent, and correlation is unit-free. • More intuitive to interpret correlation than covariance. • Example: Covariance for height and weight is 2.4 when assessed using metres and kilograms, but 240,000 when assessed using centimetres and grams. Correlation is a constant value at 0.83 for both scenario. • Correlation always bounded between –1 and 1 inclusive.
Graphical EDA • Visual summaries of the data • Flagging outliers, obvious relationships, check for distribution
Boxplots • Univariate boxplot: for 1 numerical variable Ends of box: Q1 and Q3 Length of box: IQR White line: Sample median Whiskers: 1.5 times IQR Lines outside whiskers: Outliers Circles: Extreme outliers
Boxplots • Multivariate boxplots: for 1 numerical variable across different levels of a categorical variable • Graphical comparison
Scatterplots • Graphical representation for 2 numerical variables
Lecture Notes and Datasets: http://www.well.ox.ac.uk/lectures