Statistics for clinicians
Biostatistics course by Kevin E. Kip, Ph.D., FAHA
Professor and Executive Director, Research Center
University of South Florida, College of Nursing
Professor, College of Public Health
Department of Epidemiology and Biostatistics
Associate Member, Byrd Alzheimer’s Institute
Morsani College of Medicine
Tampa, FL, USA
SECTION 1.1 Module Overview and Introduction
Introduction to biostatistics, descriptive statistics, SPSS, and PowerPoint.
SECTION 1.4 Introduction to SPSS
Introduction to SPSS
• Database structure
• Data view and variable view
• Variable names, labels, and formats
• Interactive menus
• SPSS syntax generated from interactive analyses
SECTION 1.5 Summarizing Data in Charts
Summarizing Data – Charts
1. One categorical, >1 proportion/percentage
   (i) Bar chart
   (ii) Stacked bar chart
   (iii) Stacked bar chart (100%)
2. One categorical, >1 continuous variable
   (i) Box plot
   (ii) High-low
   (iii) Line
   (iv) Kernel-density plots
3. Two continuous variables
   (i) X-Y scatter
   (ii) Histogram (can be used for 1 variable)
1. One categorical, >1 proportion/percentage
(i) Bar chart
• Rectangular bars with lengths proportional to the values they represent.
• Bars can be plotted vertically or horizontally.
1. One categorical, >1 proportion/percentage
(ii) Stacked bar chart
• Segments can show counts or percentages.
• Unlike the 100% version, the bars do not sum to a specified value.
[Figure: stacked bar chart of % obese by age group]
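As a hedged sketch, the legacy GRAPH command can draw both chart types from syntax. The variables SCR_RACECAT3 and SCR_BP_CLASS4 are borrowed from the crosstabs example in this course; using them here is an assumption made for illustration only.

```spss
* Simple bar chart: percentage of cases in each blood pressure class.
GRAPH
  /BAR(SIMPLE)=PCT BY SCR_BP_CLASS4.

* Stacked bar chart: counts of blood pressure class within each race group.
GRAPH
  /BAR(STACK)=COUNT BY SCR_RACECAT3 BY SCR_BP_CLASS4.
```

In the stacked form, the first BY variable defines the bars and the second defines the segments within each bar.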
1. One categorical, >1 proportion/percentage
(iii) Stacked bar chart (100%)
Bar Charts and Stacked Bar Charts
It is important to choose between row and column percentages.
Example: Race and blood pressure classification
Usually, the row variable is the “predictor” and the column variable is the “outcome”.
SPSS: Analyze > Descriptive Statistics > Crosstabs
Bar Charts and Stacked Bar Charts
Column Percentage: SPSS
  CROSSTABS
    /TABLES=SCR_RACECAT3 BY SCR_BP_CLASS4
    /FORMAT=AVALUE TABLES
    /CELLS=COUNT COLUMN
    /COUNT ROUND CELL
    /BARCHART.
Bar Charts and Stacked Bar Charts
Row Percentage: SPSS
  CROSSTABS
    /TABLES=SCR_RACECAT3 BY SCR_BP_CLASS4
    /FORMAT=AVALUE TABLES
    /CELLS=COUNT ROW
    /COUNT ROUND CELL
    /BARCHART.
Use row percentages in the stacked bar chart (PowerPoint).
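To compare the two kinds of percentage side by side, one option (a sketch, not part of the original slides) is to request both in a single table. This reuses the same variables and drops the bar chart subcommand.

```spss
* One table with counts, row percentages, and column percentages.
CROSSTABS
  /TABLES=SCR_RACECAT3 BY SCR_BP_CLASS4
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.
```

Row percentages sum to 100% across each race category, whereas column percentages sum to 100% down each blood pressure class.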
PowerPoint Chart: Column - 100% Stacked Column
PowerPoint Chart (Practice)
Column - 100% Stacked Column
Display Quality of Life from Poor to Excellent by Gender
• Column Percentages for QOL
• Row Percentages for QOL
2. One categorical, >1 continuous variable
(i) Box plot
• Also known as a box-and-whisker diagram.
• Displays 5 summary statistics: minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum
• No assumptions on underlying statistical distribution (non-parametric)
SPSS: Graphs > Chart Builder > Boxplot
Example: HDL Cholesterol (continuous) distribution by gender (categorical)
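The same kind of plot can also be requested from syntax. The sketch below uses the EXAMINE procedure rather than the Chart Builder; LAB_HDL is a hypothetical name for the HDL cholesterol variable, and SCR_SEX is assumed to be the gender variable used elsewhere in this course.

```spss
* Box plot of HDL cholesterol by gender, without the descriptive tables.
* LAB_HDL is a hypothetical variable name.
EXAMINE VARIABLES=LAB_HDL BY SCR_SEX
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.
```

SPSS box plots mark outlying values as separate points, so the whiskers may stop short of the true minimum and maximum.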
2. One categorical, >1 continuous variable
(i) Box plot
Question: Are HDL cholesterol levels positively or negatively skewed?
Run the SPSS Frequencies procedure.
2. One categorical, >1 continuous variable
(i) Box plot
Question: Are triglycerides positively or negatively skewed?
Run the SPSS Frequencies procedure.
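A minimal sketch of the Frequencies run for these two questions follows. LAB_TRIG_VAP is the triglyceride variable used later in this course; LAB_HDL is a hypothetical name for HDL cholesterol.

```spss
* Skewness statistics and histograms, with frequency tables suppressed.
* LAB_HDL is a hypothetical variable name.
FREQUENCIES VARIABLES=LAB_HDL LAB_TRIG_VAP
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN SKEWNESS SESKEW
  /HISTOGRAM NORMAL.
```

A positive skewness value indicates a longer right tail, which usually pulls the mean above the median; a negative value indicates a longer left tail.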
2. One categorical, >1 continuous variable
(i) Box plot (Practice)
Draw a box plot of the distribution of HDL cholesterol by ethnicity:

Group          Min   Q1   Q2   Q3   Max
Hispanic        30   40   46   56    86
Non-Hispanic    21   46   56   66   131

[Figure: example box plot]
2. One categorical, >1 continuous variable
(ii) High-low
• PowerPoint can be “tricked” into using its open-high-low-close chart (normally used for financial data) to show distributions of continuous variables.
• Upper and lower ends (high-low) can represent any percentiles, such as the 5th and 95th percentiles.
[Figure: Total Cholesterol (mg/dl), Self-Report and Admixture Defined groups]
P=0.003; Ptrend=0.009
Group labels: EU>25%, EU>85%, EU>40%, EU<40%, EU<25%; White, Black, Black, Black, White, Black
N: (753) (464) (753) (68) (201) (195)
The filled rectangles depict the interquartile range (25th to 75th percentiles). The lower and upper limits of the vertical lines depict the 5th and 95th percentiles, respectively.
[Figure: Total Cholesterol (mg/dl)]
U.S. Black vs. Ghana Urban: P=0.0001
U.S. Black vs. Ghana Rural: P<0.0001
Ghana Urban vs. Ghana Rural: P<0.0001
N=594, N=546, N=80, N=111
The filled rectangles depict the interquartile range (25th to 75th percentiles). The lower and upper limits of the vertical lines depict the 5th and 95th percentiles, respectively.
Total Cholesterol (mg/dl): Practice in PowerPoint (first draw by hand)

          5%   25%   75%   95%
Male     137   175   224   271
Female   153   190   245   295

The filled rectangles depict the interquartile range (25th to 75th percentiles). The lower and upper limits of the vertical lines depict the 5th and 95th percentiles, respectively.
Total Cholesterol (mg/dl): Practice in PowerPoint

          5%   25%   75%   95%
Male     137   175   224   271
Female   153   190   245   295

“Trick” PowerPoint by entering the percentiles under the stock-chart headings:
Open = 25%, High = 95%, Low = 5%, Close = 75%

The filled rectangles depict the interquartile range (25th to 75th percentiles). The lower and upper limits of the vertical lines depict the 5th and 95th percentiles, respectively.
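The percentiles entered into PowerPoint have to be computed first. One possible way to do so in SPSS is sketched below; LAB_TC is a hypothetical name for the total cholesterol variable, and SCR_SEX is assumed to hold gender as in the rest of this course.

```spss
* 5th, 25th, 75th, and 95th percentiles of total cholesterol, by sex.
* LAB_TC is a hypothetical variable name.
SORT CASES BY SCR_SEX.
SPLIT FILE LAYERED BY SCR_SEX.
FREQUENCIES VARIABLES=LAB_TC
  /FORMAT=NOTABLE
  /PERCENTILES=5 25 75 95.
SPLIT FILE OFF.
```

The four values for each group can then be typed into the PowerPoint data sheet in the Open, High, Low, Close order.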
2. One categorical, >1 continuous variable
(iii) Line chart
• Typically represents a trend in data over intervals of time (i.e., a time series)
• Often used to show repeated health outcome measurements over time.
[Figure: line chart of prevalence of use (%) of Crohn’s disease medications]
In this example, the “categorical” variable is the individual subject nested within each treatment arm of the trial.
2. One categorical, >1 continuous variable
(iv) Kernel density plots
• Like a histogram, but constructs a “smooth” probability density function
3. Two continuous variables
(i) X-Y scatter
• Shows the relationship between two sets of continuous data
• Also called a scatter chart, scattergram, scatter diagram, or scatter graph.
[Figure: scatter plot of body density and body mass index]
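A scatter plot like this one can be sketched with the legacy GRAPH command. Both variable names below are hypothetical, and which variable belongs on which axis is an assumption.

```spss
* Bivariate scatter plot: first variable on the horizontal axis, second on the vertical axis.
* BMI and BODY_DENSITY are hypothetical variable names.
GRAPH
  /SCATTERPLOT(BIVAR)=BMI WITH BODY_DENSITY.
```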
3. Two continuous variables
(ii) Histogram(s)
• Probability distribution of a continuous variable(s) displayed over discrete intervals (bins)
• The bins contain frequency counts, or can be normalized to display relative frequencies (i.e., the proportion of cases that fall into each bin, with total area = 1.0)
[Figure: histogram; vertical axis: # subjects]
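For a single variable, a frequency histogram can be requested with one command. The sketch below uses SCR_AGE, the age variable that appears in the syntax later in this course; choosing it here is only for illustration.

```spss
* Frequency histogram of age; SPSS chooses the bin widths.
GRAPH
  /HISTOGRAM=SCR_AGE.
```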
SECTION 1.6 SPSS Data Manipulation
SPSS Data Manipulation and Syntax Editor
1. Recode continuous variable into arbitrarily-defined or pre-defined categories
2. Visual binning of continuous variable
3. Transform a skewed variable
4. Using the SPSS Data Editor
SPSS Data Manipulation and Syntax Editor
1. Recode continuous variable into arbitrarily-defined or pre-defined categories
• Example: Define age into 3 categories (arbitrary)
  • 45-54
  • 55-64
  • 65 and older
• SPSS: Transform > Recode into Different Variables
  • Input variable is age
  • Output variable
    • Name: age_cat
    • Label: Age in 3 categories
  • Click on Old and New Values
    • Range – specify explicitly
    • 45-54 = value 1
    • 55-64 = value 2
    • 65 and older = value 3
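The menu steps above correspond roughly to the following syntax. This is a sketch that assumes the age variable is SCR_AGE, as in the descriptives syntax later in this section, and that age is recorded in whole years.

```spss
* Recode age into 3 categories in a new variable.
RECODE SCR_AGE (45 thru 54=1) (55 thru 64=2) (65 thru Highest=3) INTO age_cat.
VARIABLE LABELS age_cat 'Age in 3 categories'.
VALUE LABELS age_cat 1 '45-54' 2 '55-64' 3 '65 and older'.
EXECUTE.
```

Ages that fall outside the listed ranges (for example, below 45, or fractional ages between 54 and 55) are left system-missing in age_cat.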
SPSS Data Manipulation and Syntax Editor
2. Visual binning of continuous variable
• Example: Body mass index
• Put in output name for binned variable
• Make cutpoints: equal percentiles based on scanned cases
• Put in labels for frequency display in bar chart
SPSS Code: Visual Binning.
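Visual Binning typically pastes a RECODE command containing the cutpoints it computed from the data, which cannot be reproduced here without the data. As a rough stand-in, the sketch below uses the RANK command to form four equal-percentile groups; BMI and bmi_q4 are hypothetical variable names, and the group boundaries may differ slightly from those chosen by Visual Binning.

```spss
* Split body mass index into 4 equal-percentile groups (quartiles).
* BMI and bmi_q4 are hypothetical variable names.
RANK VARIABLES=BMI (A)
  /NTILES(4) INTO bmi_q4
  /PRINT=NO
  /TIES=MEAN.

* Bar chart of the percentage of cases in each group.
FREQUENCIES VARIABLES=bmi_q4
  /BARCHART PERCENT.
```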
SPSS Data Manipulation and Syntax Editor
3. Transform a skewed variable
• Descriptive statistics for triglycerides in natural scale: mean, median, SD, min, max, skewness, kurtosis
• Chart = histogram with normal curve superimposed
• Triglycerides are skewed. Use a transformation to create a new variable and reduce the skew in triglycerides.
SPSS: Transform > Compute Variable
  Target Variable: LOG_TRIG
  Numeric Expression: lg10(LAB_TRIG_VAP)
SPSS Syntax:
  COMPUTE log_trig=lg10(LAB_TRIG_VAP).
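To check whether the transformation helped, the descriptive statistics and histograms can be requested for both versions of the variable. The sketch below also restricts the transformation to positive values; that guard is an addition for illustration and is not part of the original slide.

```spss
* Log-transform positive values only, because LG10 is undefined at zero and below.
IF (LAB_TRIG_VAP > 0) log_trig = LG10(LAB_TRIG_VAP).
EXECUTE.

* Compare the original and log-transformed triglycerides.
FREQUENCIES VARIABLES=LAB_TRIG_VAP log_trig
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN STDDEV MINIMUM MAXIMUM SKEWNESS KURTOSIS
  /HISTOGRAM NORMAL.
```

If the transformation worked, the skewness of log_trig should be closer to zero than that of LAB_TRIG_VAP.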
SPSS Data Manipulation and Syntax Editor
4. Using the SPSS Data Editor
• SPSS: File > New > Syntax
• Save the file with a new name
• 1. Select males only (scr_sex=1): Data > Select Cases > If scr_sex=1
    USE ALL.
    COMPUTE filter_$=(SCR_SEX=1).
    VARIABLE LABELS filter_$ 'SCR_SEX=1 (FILTER)'.
    VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
    FORMATS filter_$ (f1.0).
    FILTER BY filter_$.
    EXECUTE.
• Run descriptives for age
• Copy code and repeat for females (scr_sex=2)
SPSS Data Manipulation and Syntax Editor
4. Using the SPSS Data Editor
  USE ALL.
  COMPUTE filter_$=(SCR_SEX=1).
  VARIABLE LABELS filter_$ 'SCR_SEX=1 (FILTER)'.
  VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
  FORMATS filter_$ (f1.0).
  FILTER BY filter_$.
  EXECUTE.
  DESCRIPTIVES VARIABLES=SCR_AGE
    /STATISTICS=MEAN STDDEV MIN MAX.
  USE ALL.
  COMPUTE filter_$=(SCR_SEX=2).
  VARIABLE LABELS filter_$ 'SCR_SEX=2 (FILTER)'.
  VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
  FORMATS filter_$ (f1.0).
  FILTER BY filter_$.
  EXECUTE.
  DESCRIPTIVES VARIABLES=SCR_AGE
    /STATISTICS=MEAN STDDEV MIN MAX.
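A shorter alternative to the pasted filter code, offered as a sketch rather than as the course's method, is to use TEMPORARY with SELECT IF. The selection then applies only to the next procedure, so no filter variable is created.

```spss
* Clear any filter left over from earlier steps.
FILTER OFF.
USE ALL.

* Males only, for the next procedure only.
TEMPORARY.
SELECT IF (SCR_SEX = 1).
DESCRIPTIVES VARIABLES=SCR_AGE
  /STATISTICS=MEAN STDDEV MIN MAX.

* Females only, for the next procedure only.
TEMPORARY.
SELECT IF (SCR_SEX = 2).
DESCRIPTIVES VARIABLES=SCR_AGE
  /STATISTICS=MEAN STDDEV MIN MAX.
```

Without TEMPORARY, SELECT IF would permanently drop the unselected cases from the active dataset, so the TEMPORARY line should not be omitted.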