Intro to Statistics – Part 2. Maureen J. Donlin January 18, 2012. Take home exercises. Exercise 1: Using the child nutrition data set, answer the questions posed in the handout. Exercise 2: Using the breast cancer data set, conduct an exploratory analysis of the data.
Intro to Statistics – Part 2 Maureen J. Donlin January 18, 2012
Take home exercises • Exercise 1: • Using the child nutrition data set, answer the questions posed in the handout. • Exercise 2: • Using the breast cancer data set, conduct an exploratory analysis of the data. • We will use this data set and others during the next session
Child nutrition data set • Dataset: NutritionChildren.sav • Does the amount of juice consumed by the children affect their growth? • Variables: ChildID, Weight_lbs, Height_cm, Juice, Soda, Energy, Age • Ages: 94 are 2 years old and 74 are 5 years old • Gender: unknown in our data set
Recoding variables • Define short as ≥ 1.5 SD below the mean height for the age group • Cutoffs: 82.7 cm for age 2 and 102.5 cm for age 5 • 6 met criteria for age 2 and 3 for age 5 • Define overweight as ≥ 1.5 SD above the mean BMI for the age group (BMI_level) • Cutoffs: 18.8 for age 2 and 18.4 for age 5 • 6 met criteria for age 2 and 3 for age 5
Recoding variables, cont. • Excessive juice consumption (JuiceLevel) • Mean 5.5 oz/day ± 4.6 (SD) • Excessive juice: ≥ 1.5 SD above the mean (about 12 oz/day) • 19 children drank ≥ 12 oz juice/day • Cross-tab of JuiceLevel*Short • p-value = 0.001 • Cross-tab of JuiceLevel*BMI_level • p-value = 0.067
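The juice cutoff can be checked by hand. The sketch below is only an illustration in Python (the course itself uses SPSS): it recomputes the threshold from the mean and SD quoted on the slide and applies the rounded 12 oz/day rule. The function name `juice_level` and the example intakes are hypothetical and are not taken from NutritionChildren.sav.

```python
# Summary values quoted on the slide (oz/day)
juice_mean = 5.5
juice_sd = 4.6

# Mean + 1.5 SD gives about 12.4 oz/day, which the slide rounds to 12
computed_cutoff = juice_mean + 1.5 * juice_sd
slide_cutoff = 12.0

def juice_level(oz_per_day, threshold=slide_cutoff):
    """Hypothetical recode: 1 = excessive juice consumption, 0 = not excessive."""
    return 1 if oz_per_day >= threshold else 0

# Made-up intakes, used only to show how the recode behaves
example_intakes = [0.0, 5.5, 12.0, 16.0]
levels = [juice_level(v) for v in example_intakes]

print(round(computed_cutoff, 1))
print(levels)
```

The same pattern (mean plus or minus 1.5 SD within each age group) gives the height and BMI cutoffs used for the Short and BMI_level variables.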
Breast cancer data • Dataset: BreastCancerData.sav • Explore the data • 338/1207 missing estrogen receptor status • 356/1207 missing progesterone receptor status • 86/1207 missing pathological tumor size • 12/1207 with a tumor size > 5 cm
Exploring breast cancer data, cont. • Distribution of the continuous variables • Age • Pathological tumor size • Number of positive lymph nodes • Use Explore with those 3 variables and no factors
Exploring breast cancer data, cont. • Dependence of pathological tumor size on the categorical variables estrogen and progesterone receptor status • Is there a difference in the size of the tumor when they are positive for estrogen or progesterone receptors? • Use Explore with pathological tumor size as the dependent variable and estrogen and progesterone receptor status as factors
Exploring breast cancer data, cont. • Is there a dependence of the tumor size on the presence of positive lymph nodes?
Recode age into 4 groups: 20-45, 46-55, 56-66, 66 & older • Recode age into 2 groups: ≤ 55, 56 & older • Explore the new category using a histogram
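As an illustration of the age recode outside SPSS, here is a minimal Python sketch. The slide lists 66 in two of the four groups, so the code assumes that age 66 belongs to the oldest group; that boundary is an assumption, as are the function names and the example ages.

```python
from collections import Counter

def age_group4(age):
    """Four-level recode. Assumes age 66 falls in the oldest group."""
    if age <= 45:
        return "20-45"
    if age <= 55:
        return "46-55"
    if age <= 65:
        return "56-65"
    return "66 & older"

def age_group2(age):
    """Two-level recode: 55 and younger versus 56 and older."""
    return "<= 55" if age <= 55 else "56 & older"

# Made-up ages, used only to show the counts a histogram would display
example_ages = [34, 45, 46, 55, 56, 66, 71]
counts4 = Counter(age_group4(a) for a in example_ages)
counts2 = Counter(age_group2(a) for a in example_ages)

print(counts4)
print(counts2)
```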
Breast cancer data, cont. • Is there a dependence of pathological tumor size on age?
Univariate modeling • Analyze -> General Linear Model -> Univariate • Dependent variable: Pathological tumor size • Fixed factors: agecat3 & Lymph nodes? • Model: full factorial • Plots: ln_yesno*agecat3 & agecat3*ln_yesno • Options: Display means for all 3 variables & check: • Descriptive statistics • Estimates of effect size • Observed power
Sum of squares type (set under Model) • Type II: assumes balanced design • Type III: works with balanced and unbalanced designs (default option) • Type IV: intended for designs with missing cells (combinations of factor levels with no data)
Presence of positive lymph nodes is associated with larger tumor size in all age categories, but the effect is larger at younger ages.
Univariate analysis of the effect of age on tumor size • Univariate analysis of the effect of positive lymph nodes on tumor size
The model explains ~27% of the variance. • Age and the presence of positive lymph nodes each account for about half of the explained variance. • The interaction of age and lymph node status is negligible: the two factors do not interact.
Effect size • Magnitude of the observed effect: r = √(t² / (t² + df)) • t = t from a t-test and df = degrees of freedom • r = 0.10 (small effect; ~1% of the total variance) • r = 0.30 (medium effect; ~9% of the total variance) • r = 0.50 (large effect; ~25% of the total variance)
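The conversion usually used to turn a t statistic into r is r = sqrt(t² / (t² + df)). The Python sketch below applies it and labels the result with the benchmarks from the slide. The function names and the example values of t and df are hypothetical; they are not output from the breast cancer data set.

```python
import math

def effect_size_r(t, df):
    """Effect size r from a t statistic and its degrees of freedom."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

def effect_label(r):
    """Label r using the slide's benchmarks (0.10, 0.30, 0.50)."""
    r = abs(r)
    if r < 0.10:
        return "below small"
    if r < 0.30:
        return "small"
    if r < 0.50:
        return "medium"
    return "large"

# Hypothetical t-test result, chosen only to illustrate the arithmetic
r = effect_size_r(t=2.5, df=100)
print(round(r, 3), effect_label(r))

# r squared is the share of variance explained (about 9% for r = 0.30)
print(round(0.30 ** 2, 2))
```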
Effect size: eta squared (η²) • Effect size used in ANOVA & univariate modeling in SPSS • η² varies between 0 and 1 • Interpretation: • 0.01 ~ small • 0.06 ~ medium • 0.14 ~ large • Square root of η² approximates r
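To see how the η² benchmarks line up with r, the Python sketch below takes their square roots. It also shows η² as the ratio of an effect's sum of squares to the total sum of squares, which is the classical definition. The sums of squares are made-up numbers, not SPSS output.

```python
import math

def eta_squared(ss_effect, ss_total):
    """Classical eta squared: share of total variation attributed to the effect."""
    return ss_effect / ss_total

# Benchmarks listed on the slide
benchmarks = {"small": 0.01, "medium": 0.06, "large": 0.14}

# Square root of eta squared, rounded, as an approximate r
approx_r = {label: round(math.sqrt(value), 2) for label, value in benchmarks.items()}
print(approx_r)

# Hypothetical sums of squares, only to show the calculation
example_eta2 = eta_squared(ss_effect=13.5, ss_total=100.0)
print(example_eta2, round(math.sqrt(example_eta2), 2))
```

The square roots come out somewhat below the r benchmarks of 0.30 and 0.50, which is why the relationship is described as approximate.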
Effect size: • The dependence of tumor size on age has approximately the same effect size as the dependence on the presence of positive lymph nodes. • The interaction of age and positive lymph nodes has very little effect.
Calculating effect size from a t-test • t-test of the dependence of tumor size on the presence of positive lymph nodes: r = 0.21, a moderately small effect
Other questions to consider • Is tumor size associated with receptor (estrogen or progesterone) status? • Tumor size was recoded into categories (pathcat) • Do cross-tabs with pathcat*estrogen status (er) or pathcat*progesterone status (pr) • Is positive estrogen or progesterone receptor status associated with larger or smaller tumors?
Survival analysis • What fraction of a population will survive past a certain time? • What is the probability of survival on condition A versus condition B? • Kaplan-Meier estimator • Estimates the survival function from lifetime data • Can deal with some types of censored data (e.g., a patient withdraws from the study before the final outcome)
The steps down represent each point where a subject has died. • The tick marks represent censored data
Censoring • Removing a patient from the survival curve at the end of their follow-up time is “censoring” the patient. • Shown as a tick mark on the survival curve • Once a patient is censored, the curve becomes an estimate of survival because we no longer know the end point for censored patients
Kaplan-Meier estimator • S(t): probability of surviving beyond time t • Rank death times in order: • 0 < t(1) < t(2) < t(3) < t(4) < ... < t(r) • Within each interval, calculate the probability of dying within that interval • Probability of dying in interval 4: • # deaths in interval 4 / number alive and at risk at the start of interval 4 • S(t(4)) = probability of surviving beyond interval 3 * probability of surviving interval 4 • S(t(4)) = S(t(3)) * (1 - probability of dying in interval 4)
7 patients with survivals of: • 1, 2+, 3+, 4, 5+, 10, 12+ • + indicates censored patient
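The product-limit calculation can be traced by hand on these 7 patients. The Python sketch below is a minimal illustration, not the SPSS procedure: at each death time it divides the deaths by the number still at risk and multiplies the running survival estimate by one minus that fraction. The function name `kaplan_meier` and the 1/0 event coding are choices made for this example.

```python
def kaplan_meier(times, events):
    """Return (time, n_at_risk, deaths, S(t)) for each distinct death time.

    times: follow-up time per patient
    events: 1 = death observed, 0 = censored
    """
    survival = 1.0
    table = []
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        # Patients censored exactly at t are still counted as at risk at t
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1 - deaths / at_risk
        table.append((t, at_risk, deaths, survival))
    return table

# The 7 patients from the slide: 1, 2+, 3+, 4, 5+, 10, 12+ (+ = censored)
times = [1, 2, 3, 4, 5, 10, 12]
events = [1, 0, 0, 1, 0, 1, 0]

km_table = kaplan_meier(times, events)
for t, at_risk, deaths, s in km_table:
    print(t, at_risk, deaths, round(s, 3))
```

Censored patients never cause a step down; they only shrink the number at risk for later death times, which is why the step at time 4 is larger than the step at time 1.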
Kaplan-Meier estimator • Dataset: leukemia.sav • Remission times of acute leukemia in weeks • 2 treatment groups, 42 observations • Placebo: • 1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23 • 6-mercaptopurine: • 6 6 6 6* 7 9* 10 10* 11* 13 16 17* 19* 20* 22 23 25* 32* 32* 34* 35* • * indicates a censored time; the first censored time, 6*, means the patient was followed for 6 weeks and was still in remission when follow-up ended • Outcome: 0 = censored; 1 = death
Analyze -> Survival -> Kaplan-Meier • Time: time to remission • Status: outcome (1) • Define event: Single value: 1 • Factor: treatment • Compare factor: Log rank, Pooled over strata • Options: • Statistics: Survival table(s); Mean & Median survival • Plots: Survival
Interpretation of Kaplan-Meier • Survival table • Provides an estimate of survival at each event time • Means & Medians for Survival Time • Data summarized in a table that you can report • Estimated mean survival times: • Placebo: 8.6 weeks • 6-mercaptopurine: 23.2 weeks • Highly significant difference between the 2 groups (log-rank test)
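For readers who want to see what the log-rank comparison does, here is a plain-Python sketch applied to the remission times listed for the leukemia data set. It is an illustration of the standard two-group log-rank test, not a reproduction of the SPSS output, and the function and variable names are hypothetical. Times marked with * on the slide are coded as censored (0); all other times are coded as events (1).

```python
import math

placebo_times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8,
                 11, 11, 12, 12, 15, 17, 22, 23]
placebo_events = [1] * len(placebo_times)  # no censoring in the placebo group

mp_times = [6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16,
            17, 19, 20, 22, 23, 25, 32, 32, 34, 35]
mp_events = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
             0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test: observed and expected events in group A,
    chi-square statistic (1 df) and its p-value."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return observed_a, expected_a, chi2, p_value

observed, expected, chi2, p_value = logrank(
    mp_times, mp_events, placebo_times, placebo_events)
print(observed, round(expected, 2), round(chi2, 2), p_value)
```

The 6-mercaptopurine group has far fewer observed events than the test expects if the two groups shared one survival curve, which is what produces the large chi-square value.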
Linear regression • Models the relationship between a scalar variable y and one or more explanatory variables X • Used for: • Prediction: what is y given X? • Assessing the strength of the relationship between y and Xj
Linear regression • Dataset: LifeExpectancybyTVandPhysicans.sav • Handout describes the dataset • Question: • Is there a relationship between life expectancy in the different countries and the ratio of people/TV or people/physicians?
Linear regression • Analyze -> Regression -> Linear • Dependent: LifeExp • Independents: TV & Physicians (do as separate analyses) • Method: Enter • Statistics: Estimates, Model fit, Descriptives • Plots: Y: *SDRESID; X: *ZPRED; Histogram and Normal probability plot
Modeling effect of TVs The life expectancy is equal to: -0.036*Ratio people/TV + 69.6 For a country with 500 people/TV, the life expectancy is predicted to be 51.6 years.
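The prediction is a direct substitution into the fitted line. A minimal Python sketch using the slope and intercept quoted on the slide (the function name is hypothetical):

```python
def predict_life_expectancy(people_per_tv, slope=-0.036, intercept=69.6):
    """Predicted life expectancy (years) from the people/TV ratio,
    using the coefficients quoted on the slide."""
    return slope * people_per_tv + intercept

prediction = predict_life_expectancy(500)
print(round(prediction, 1))
```

Each additional 100 people per TV lowers the predicted life expectancy by 3.6 years under this model.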
Model summary • R: linear correlation between the observed and model-predicted values. A moderate value indicates a moderately strong relationship • R square: coefficient of determination; about 35% of the variation in LifeExp is explained by the model
Once you’ve lowered the ratio of people/TV or people/physicians, there is no further effect of those ratios on life expectancy. Or is there? Try a log-transformation of both TV and physicians
Redo the linear regression, using either logTV or logPhysicians as the independent variable • What is the predicted life expectancy for a country with 1000 people/physician? • Hint: take the log (base 10) of 1000 first • The life expectancy is equal to: -11.45*log10(1000) + 102.8 • Answer: 68.5 years
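The arithmetic behind the answer can be checked in a few lines of Python. The sketch assumes a slope of -11.45 and a base-10 logarithm, since those are the values that reproduce the stated answer of about 68.5 years; the function name is hypothetical.

```python
import math

def predict_from_log_physicians(people_per_physician,
                                slope=-11.45, intercept=102.8):
    """Predicted life expectancy (years) from log10(people/physician).
    Slope and base-10 log are assumed from the slide's worked answer."""
    return slope * math.log10(people_per_physician) + intercept

prediction = predict_from_log_physicians(1000)
print(round(prediction, 2))
```

With a log-transformed predictor, every tenfold increase in people per physician lowers the predicted life expectancy by the same amount (11.45 years here).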
Take home points • Plot your data • Anscombe’s quartet: • 4 datasets with nearly identical summary statistics that look very different when plotted • AnscombesData.xlsx • Consider effect size • Statistical significance does not mean clinical significance • Does the relationship make sense? • There is an association, but is it causal?
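The point about Anscombe's quartet is easy to verify numerically. The Python sketch below uses the values from the published quartet, typed in directly rather than read from AnscombesData.xlsx (the layout of that file is not assumed here), and computes the mean of x, the mean of y, and the Pearson correlation for each dataset.

```python
import math

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

summary = {name: (mean(x), mean(y), pearson_r(x, y))
           for name, (x, y) in quartet.items()}
for name, (mx, my, r) in summary.items():
    print(name, round(mx, 2), round(my, 2), round(r, 3))
```

The four rows of output are nearly indistinguishable, yet scatter plots of the four datasets show a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.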