Lecture 3 – Data Summary Measures and Graphical Display of Results

Univariate Data – Analysis of one variable at a time

Why Think About/Explore Data? • Done to accomplish: • Checking for data entry errors • Describing demographic and study characteristics • Examining distributions of outcomes • Central tendency • Variability • Checking for outliers • Checking assumptions for subsequent analyses • Giving a picture of your sample

To choose appropriate statistics, it is essential to determine the measurement level of the outcome(s) and predictor(s). Dependent variable = outcome. Independent variable = predictor.

Types of Data • Nominal – Qualitative data measured in unordered categories (summarize with %’s) • Ordinal – Qualitative data measured in ordered categories (summarize with %’s) • Continuous – Quantitative data measured on a continuum (summarize with many summary measures)

Types of Data: Examples • Nominal – Qualitative data measured in unordered categories • Race • Blood Type • Dead/Alive • Gender • On Dialysis/Not on Dialysis • Ordinal – Qualitative data measured in ordered categories • Cancer Stages • Socio-economic Status (low, med, hi) • Likert (unlikely, somewhat unlikely, neutral, likely, very likely) • Continuous – Quantitative data measured on a continuum • Serum Creatinine • Height/Weight/BMI • Systolic Blood Pressure • Diastolic Blood Pressure • Others???

Continuous (Numerical) Measures of Location • Mean: arithmetic average = sum of values / number of values; has nice mathematical/statistical properties • Median (a.k.a. 50th percentile): value where half the sample is above and half the sample is below; a better measure for skewed data because it is robust to extreme values • Mode: most frequently occurring value in the sample
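The three measures of location can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are hypothetical and chosen only to show how one extreme value pulls the mean away from the median.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values, not lecture data)
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequently occurring value

print("mean:", mean, "median:", median, "mode:", mode)
```

Here the single extreme value (40) raises the mean well above the median, which is why the median is usually preferred for skewed data.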

Continuous (Numerical) NORMAL DISTRIBUTION

Continuous (Numerical) Measures of Variability • Range = maximum - minimum • Interquartile range = Q3 – Q1 (75th percentile - 25th percentile); always covers the middle half of the sample • Variance = average of the squared deviations of the observations from their mean • Standard deviation = square root of the variance
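A short sketch of these variability measures, again with the standard `statistics` module and hypothetical data. Two assumptions are worth flagging: `statistics.variance` divides by n - 1 (the usual sample variance), while `statistics.pvariance` divides by n, and quartile conventions differ between software packages, so Q1 and Q3 may not match other tools exactly.

```python
import statistics

# Hypothetical sample (illustrative values, not lecture data)
values = [2, 4, 4, 5, 7, 9, 11, 14]

data_range = max(values) - min(values)

# Quartiles; the default "exclusive" method is one of several conventions
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

sample_variance = statistics.variance(values)       # divides by n - 1
population_variance = statistics.pvariance(values)  # divides by n
std_dev = statistics.stdev(values)                  # sqrt of sample variance

print("range:", data_range, "IQR:", iqr)
print("variance:", sample_variance, "std dev:", std_dev)
```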

Continuous (Numerical) NORMAL DISTRIBUTION http://www.stattucino.com/berrie/dsl/index.html

Describing Data using Numerical Summaries • Descriptive statistics: Explore data in order to describe their main features Get an initial picture of data sample

Let’s Talk Data…

Categorical

Continuous

N = 115

BMI Mean: 32.2 Std: 5.4 Median: 31.8

Mean: 136.3 Std: 17.1 Median: 135

Mean: 189.77 Std: 148.9 Median: 154.11

Shape of a distribution • Symmetric • Skewed to the right (positively skewed): mean greater than median • Skewed to the left (negatively skewed): mean less than median

Mean: 136.3 Std: 17.1 Median: 135 Skewness: 0.38

Mean: 189.77 Std: 148.9 Median: 154.11 Skewness: 5.63
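A skewness coefficient like the ones reported above can be sketched as follows. The formula used here is the simple moment-based version (third central moment divided by the 1.5 power of the second); this is an assumption, since statistical packages often apply a small-sample adjustment and the lecture does not say which version produced its values. The data are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (one of several conventions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples (illustrative values, not lecture data)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]

for name, data in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    print(name, "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "skewness:", round(sample_skewness(data), 2))
```

For the right-skewed sample the mean exceeds the median and the coefficient is positive, matching the rule of thumb on the shape slide.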

NORMAL DISTRIBUTION • The Normal distribution has excellent statistical properties • Many statistical techniques require Normally distributed data • If the data do not follow a Normal distribution, consider alternative techniques appropriate for the data

Box (and Whisker) Plots • A graph of the 5 number summary with suspected outliers plotted individually • 5 number summary: • Min, Q1, Median, Q3, Max • A line inside the box marks the Median • IQR = Q3 – Q1 • Cases more than 1.5×IQR below Q1 or above Q3 are plotted individually (possible outliers) • Lines (whiskers) extend from the box to the smallest and largest values that are not more than 1.5×IQR from the box
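The box plot rules above can be sketched numerically. The function name `box_plot_summary` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, so other software may place Q1 and Q3 slightly differently.

```python
import statistics

def box_plot_summary(values):
    """Return the quantities a box plot draws, using the 1.5*IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # smallest value within the fences
        "whisker_high": max(inside),   # largest value within the fences
        "outliers": outliers,          # plotted individually
    }

# Hypothetical sample with one extreme value
summary = box_plot_summary([2, 4, 4, 5, 7, 9, 11, 14, 40])
print(summary)
```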

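[Figure: annotated box plot labeling an outlier, the 1.5 × IQR limit, the 75th percentile, the median, the 25th percentile, and the mean]

[Figure: box plots of three distributions: skewed to the left, skewed to the right, and symmetric]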

Normal Probability Plot • Plot that can help assess normality. • Idea: plot the observed levels of the variable against the expected levels corresponding to a Normal distribution. • If data lie in a reasonably straight diagonal line, then assumption of Normality is reasonable.
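The idea behind the plot can be sketched by computing its coordinates directly. This assumes the plotting positions (i - 0.5)/n, which is one common convention among several; the function name `normal_probability_points` is hypothetical, and `NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def normal_probability_points(values):
    """Pair each sorted observation with its expected standard Normal quantile."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

# Hypothetical sample (illustrative values, not lecture data)
points = normal_probability_points([31.8, 27.5, 36.1, 30.2, 33.4])
for expected, observed in points:
    print(round(expected, 2), observed)
```

Plotting the observed values against the expected quantiles and checking for a roughly straight line gives the visual assessment described above.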

Normal Probability Plots Triglycerides BMI

Error Bar Plots Circle denotes the mean and the bars denote the standard deviation (in this case).

Part II – Measures of Association (plus a little more)

Measures of Association • Continuous Variables • Correlation • Agreement (reliability) • Categorical Variables • Two-way layout (2×2 tables) • “Risk” measures • Agreement • Others

Two Continuous Variables Correlation • General sense: the relationship between two variables (quantitative or qualitative) • Narrow (statistical) sense: measure of interdependence between two continuous random variables • The degree to which increases or decreases in Y occur with increases or decreases in X • Values range between -1 (perfect discordance) and 1 (perfect concordance) • A value of 0 indicates no linear association

Pearson Correlation • Purpose: measures linear association between two continuous variables X and Y • Data: pairs of observations (x, y) measured on the same subjects

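Pearson Correlation The Pearson (product-moment) correlation coefficient can be calculated for 2 continuous variables in a sample (regardless of distribution) using the formula: $r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$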
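The standard product-moment formula can be sketched in a few lines of Python. The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(pearson_r(x, y), 3))
```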

Correlation Figures (scatterplots of Y against X) • ρ = 1: perfect positive relationship • ρ = -1: perfect negative relationship • ρ = 0: no relationship • ρ = -0.8: strong negative relationship • ρ = 0.5: moderate positive relationship

Correlation Inference • Easy “large sample” test for H0: ρ = 0. For n ≥ 25, compute $z = \frac{1}{2}\ln\frac{1+r}{1-r}$ (Fisher's transformation of the sample correlation r), which has an approximately $N(0, \frac{1}{n-3})$ distribution under H0 • This test assumes X,Y ~ NBiv(μX, μY, σX², σY², ρ) • Many times a tenuous assumption! • Beware positive skewness & outliers • Beware data not truly continuous
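A sketch of such a large-sample test, assuming the statistic intended here is Fisher's z-transformation of the sample correlation, which is approximately Normal with mean 0 and variance 1/(n - 3) under H0 when the data are bivariate Normal. The function name `fisher_z_test` and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n):
    """Two-sided large-sample test of H0: rho = 0 via Fisher's z (assumed method)."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher's transformation of r
    standard_error = 1 / math.sqrt(n - 3)  # sqrt of the variance 1/(n - 3)
    statistic = z / standard_error         # approximately N(0, 1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value

# Hypothetical example: r = 0.5 observed in a sample of n = 28
statistic, p_value = fisher_z_test(0.5, 28)
print(round(statistic, 2), round(p_value, 4))
```

The p-value is only as trustworthy as the bivariate Normal assumption, which is the caution raised on the slide.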

Timeout: ASSUMPTIONS • As with any mathematical or physical model, model assumptions are critical to making the correct inference • Dealing with assumptions has led to the development of: • Nonparametric statistics: techniques that reduce or eliminate dependence on the underlying distribution of the data • Robust statistics: techniques that are affected little by departures from assumptions

Correlation (resumed) • A nonparametric version of the correlation coefficient: Spearman’s rank correlation (rs) • Like ρ, rs: • ranges from -1 to 1 • 0 = no correlation, 1 = perfect agreement • Unlike ρ, rs only requires ordinal data
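Spearman's rank correlation can be sketched as the Pearson correlation of the ranks, with tied values given their average rank. The helper names and the data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotone but nonlinear relationship
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 3), round(spearman_rs(x, y), 3))
```

Because only the ordering matters, rs reaches 1 for any strictly increasing relationship, while the Pearson coefficient falls short of 1 unless the relationship is linear.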

Correlation Example: SBP and DBP

Correlation Example: SBP and DBP • All Data: ρ = 0.42; rs = 0.71 • Outlier deleted: ρ = 0.75; rs = 0.82

Correlation Coefficient Questions – • Can we calculate a correlation coefficient between the incomes of a group of people and what city they live in? No, we cannot, since city is a categorical variable. Correlation requires that both variables be quantitative.

Correlation Coefficient Questions – • Does it change the correlation between height and weight if we measure height in inches rather than centimeters and weight in pounds rather than kilograms? No. Because ρ (and r) uses the standardized values of the observations, ρ does not change when we change the units of measurements of x , y, or both. The correlation ρ itself has no unit of measure; it is just a number.
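The unit-invariance claim can be checked numerically. The heights and weights below are hypothetical; the conversion factors (2.54 cm per inch, 0.45359237 kg per pound) are the standard ones.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds)
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 130, 140, 160, 170, 190]

# Same measurements in centimeters and kilograms
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(round(r_imperial, 6), round(r_metric, 6))
```

The two coefficients agree up to floating-point rounding, because multiplying a variable by a positive constant rescales the numerator and denominator by the same factor.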

Correlation Coefficient Question – • Does ρ = 0 mean there is no relationship between X and Y? [Figure: scatterplot of y against x] Correlation only measures the strength of the linear relationship between two variables. Correlation does not describe nonlinear relationships between two variables, no matter how strong they are.

Correlation and Regression (scatterplots of Y against X) • ρ = 0.5: moderate positive relationship • ρ = -0.8: strong negative relationship • Regression line: Y = α + βX
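A small constructed example makes the point concrete: below, y is determined exactly by x (y = x²), yet the correlation is zero because the relationship is not linear. The data are hypothetical.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# y is a perfect (but nonlinear) function of x
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(r)
```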

Correlation and Regression SBP and DBP example (continued) • σSBP = 4.9 mmHg • σDBP = 3.3 mmHg • ρ = 0.75 • SBP = 40.1 + 1.12×DBP • DBP = 16.3 + 0.51×SBP
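The link between the correlation and the two regression slopes can be sketched with the standard identity slope = ρ × (SD of outcome) / (SD of predictor). The function name `regression_slope` is hypothetical; the inputs are the rounded values quoted in this example.

```python
def regression_slope(rho, sd_outcome, sd_predictor):
    """Least-squares slope implied by the correlation and the two SDs."""
    return rho * sd_outcome / sd_predictor

# Rounded values quoted in the SBP/DBP example
sd_sbp, sd_dbp, rho = 4.9, 3.3, 0.75

slope_sbp_on_dbp = regression_slope(rho, sd_sbp, sd_dbp)  # about 1.11
slope_dbp_on_sbp = regression_slope(rho, sd_dbp, sd_sbp)  # about 0.51

print(round(slope_sbp_on_dbp, 2), round(slope_dbp_on_sbp, 2))
print(round(slope_sbp_on_dbp * slope_dbp_on_sbp, 4), rho ** 2)
```

The computed slopes are close to the 1.12 and 0.51 in the fitted equations; the small difference in the first slope is presumably due to rounding of the quoted inputs. The product of the two slopes equals ρ², which is why the two regression lines coincide only when the correlation is perfect.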

Correlation and Covariance • Suppose two random variables, X and Y: E(X) = μX, V(X) = σX²; E(Y) = μY, V(Y) = σY²; and Corr(X,Y) = ρ • Define Cov(X,Y) = E[(X-μX)(Y-μY)]. Note: Cov(X,X) = E[(X-μX)(X-μX)] = E[(X-μX)²] = σX² • Population correlation (ρ) is defined as: ρ = Cov(X,Y)/(σXσY) • Thus Cov(X,Y) = ρσXσY

Correlation and Covariance What’s the big deal about covariance? Use it to find the variance of functions of random variables, e.g.: V(X+Y) = σX² + σY² + 2Cov(X,Y). In general: V(aX+bY) = a²σX² + b²σY² + 2ab·Cov(X,Y) for constants a and b.
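One standard use of covariance is the variance of a linear combination, V(aX+bY) = a²V(X) + b²V(Y) + 2ab·Cov(X,Y). The sketch below checks that identity on hypothetical data using sample moments (dividing by n - 1), for which it holds exactly; the function name `sample_covariance` is an illustrative choice.

```python
import statistics

def sample_covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical paired data and constants
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
a, b = 2.0, -3.0

combination = [a * xi + b * yi for xi, yi in zip(x, y)]

direct = statistics.variance(combination)
via_covariance = (a ** 2 * statistics.variance(x)
                  + b ** 2 * statistics.variance(y)
                  + 2 * a * b * sample_covariance(x, y))

print(round(direct, 6), round(via_covariance, 6))
```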

Correlation as Agreement Suppose two nurses are measuring SBP in the same patients and each nurse measures SBP 3 times in each patient.
