Intro to Statistics – Part 2

Intro to Statistics – Part 2. Maureen J. Donlin January 18, 2012. Take home exercises. Exercise 1: Using the child nutrition data set, answer the questions posed in the handout. Exercise 2: Using the breast cancer data set, conduct an exploratory analysis of the data.

Presentation Transcript


  1. Intro to Statistics – Part 2 Maureen J. Donlin January 18, 2012

  2. Take home exercises • Exercise 1: • Using the child nutrition data set, answer the questions posed in the handout. • Exercise 2: • Using the breast cancer data set, conduct an exploratory analysis of the data. • We will use this data set and others during the next session

  3. Child nutrition data set • Dataset: NutritionChildren.sav • Does the amount of juice consumed by the children affect their growth? • Variables: ChildID, Weight_lbs, Height_cm, Juice, Soda, Energy, Age • Ages: 94 are 2 years old and 74 are 5 years old • Gender: unknown in our data set

  4. Recoding variables • Define short as height at least 1.5 SD below the mean for the age group • 82.7 cm for age 2 and 102.5 cm for age 5 • 6 met criteria for age 2 and 3 for age 5 • Define overweight as BMI at least 1.5 SD above the mean for the age group (BMI_level) • 18.8 for age 2 and 18.4 for age 5 • 6 met criteria for age 2 and 3 for age 5

  5. Recoding variables cont. • Excessive juice consumption (JuiceLevel) • Mean 5.5 oz/day ± 4.6 (SD) • Excessive juice: at least 1.5 SD above the mean (about 12 oz/day) • 19 children drank ≥ 12 oz juice/day • Cross-tab of JuiceLevel*Short • p-value = 0.001 • Cross-tab of JuiceLevel*BMI_level • p-value = 0.067
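The 12 oz/day cutoff can be checked directly from the reported mean and SD. This is a minimal sketch; the function and variable names are illustrative and do not come from the SPSS file.

```python
# Cutoff for "excessive" juice: 1.5 SD above the mean (values from the slide).
mean_juice_oz = 5.5   # mean daily juice intake, oz/day
sd_juice_oz = 4.6     # standard deviation, oz/day


def upper_cutoff(mean, sd, n_sd=1.5):
    """Return the value lying n_sd standard deviations above the mean."""
    return mean + n_sd * sd


juice_cutoff = upper_cutoff(mean_juice_oz, sd_juice_oz)
# The unrounded cutoff is about 12.4; the slide rounds it to 12 oz/day.
print(f"Excessive juice cutoff: {juice_cutoff:.1f} oz/day")
```

The same helper, with the SD subtracted instead of added, would give the lower cutoff used for the "short" category.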

  6. Breast cancer data • Dataset: BreastCancerData.sav • Explore the data • 338/1207 missing estrogen receptor status • 356/1207 missing progesterone receptor status • 86/1207 missing pathological tumor size • 12/1207 with a tumor size > 5 cm

  7. Exploring breast cancer data, cont. • Distribution of the continuous variables • Age • Pathological tumor size • Number of positive lymph nodes • Use Explore with those 3 variables and no factors

  8. Exploring breast cancer data, cont. • Dependence of pathological tumor size on the categorical variables estrogen and progesterone receptor status • Is there a difference in the size of the tumor when they are positive for estrogen or progesterone receptors? • Use Explore with pathological tumor size as the dependent variable and estrogen and progesterone receptor status as factors

  9. Exploring breast cancer data, cont. • Is there a dependence of the tumor size on the presence of positive lymph nodes?

  10. Recoding age

  11. Recoding age, cont. • Recode into 4 groups: 20-45, 46-55, 56-66, 66 & older • Recode into 2 groups: ≤ 55, 56 & older • Explore the new category using a histogram
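Outside SPSS, the same recoding could be sketched as two small functions. The group labels are illustrative, the ages are made up, and because the slide lists 66 in two groups, this sketch assumes age 66 belongs to the oldest group.

```python
from collections import Counter


def age_group_2(age):
    """Two-group recode from the slide: 55 or younger vs. 56 and older."""
    return "<=55" if age <= 55 else "56+"


def age_group_4(age):
    """Four-group recode. Assumption: age 66 falls in the oldest group."""
    if age <= 45:
        return "20-45"
    if age <= 55:
        return "46-55"
    if age <= 65:
        return "56-65"
    return "66+"


# Hypothetical ages for illustration only (not from BreastCancerData.sav).
ages = [34, 45, 46, 52, 55, 56, 61, 66, 70, 83]
counts_2 = Counter(age_group_2(a) for a in ages)
counts_4 = Counter(age_group_4(a) for a in ages)
print(counts_2)
print(counts_4)
```

Counting the members of each group is a quick way to see whether the categories are roughly evenly sized before plotting the histogram.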

  12. Looking for evenly sized categories

  13. Breast cancer data, cont. • Is there a dependence of pathological tumor size on age?

  14. Significant difference between groups?

  15. Post-hoc testing (Bonferroni)

  16. Univariate modeling • Analyze -> General Linear Model -> Univariate • Dependent variable: Pathological tumor size • Fixed factors: agecat3 & Lymph nodes? • Model: full factorial • Plots: ln_yesno*agecat3 & agecat3*ln_yesno • Options: Display means for all 3 variables & check: • Descriptive statistics • Estimates of effect size • Observed power

  17. Sum of squares type • Type II: assumes balanced design • Type III: works with balanced and unbalanced designs (default option) • Type IV: can be used when there are missing cells (empty factor combinations)

  18. Presence of positive lymph node is associated with larger tumor size at all age categories, but the effect is larger for the younger ages.

  19. Univariate analysis of effect of age on tumor size • Univariate analysis of effect of positive lymph node on tumor size

  20. The model explains ~ 27% of the variance. • Age and presence of positive lymph nodes each account for about half of the explained variance. • The interaction between age and lymph node status is negligible.

  21. Effect size • Magnitude of the observed effect: r = √(t² / (t² + df)), where t is the statistic from a t-test and df is the degrees of freedom • r = 0.10 (small effect; ~1% of the total variance) • r = 0.30 (medium effect; ~9% of the total variance) • r = 0.50 (large effect; ~25% of the total variance)
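The conversion from a t statistic to r can be sketched in a few lines, assuming the standard formula r = sqrt(t² / (t² + df)). The t and df values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def effect_size_r(t, df):
    """Convert a t statistic and its degrees of freedom to effect size r."""
    return math.sqrt(t**2 / (t**2 + df))


# Hypothetical t-test result, not taken from the breast cancer data.
r = effect_size_r(2.0, 96)
variance_explained = r**2
print(f"r = {r:.2f}, about {variance_explained:.0%} of the total variance")
```

Squaring r gives the share of variance explained, which is how the slide's 1%, 9%, and 25% figures follow from r = 0.10, 0.30, and 0.50.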

  22. Effect size: eta squared (η²) • Effect size used in ANOVA & univariate modeling in SPSS • η² varies between 0 and 1 • Interpretation: • 0.01 ~ small • 0.06 ~ medium • 0.14 ~ large • Square root of η² approximates r
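For a one-way design, η² is the between-group sum of squares divided by the total sum of squares. The sketch below computes it by hand on made-up groups. Note that SPSS's univariate procedure reports partial η², which equals η² only when there is a single factor.

```python
def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Hypothetical measurements for three groups (illustration only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
eta2 = eta_squared(groups)
print(f"eta squared = {eta2:.4f}, sqrt = {eta2 ** 0.5:.3f}")
```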

  23. Effect size: • The dependence of tumor size on age has approximately the same effect size as the dependence on the presence of positive lymph nodes. • The interaction of age and positive lymph nodes has very little effect.

  24. Calculating effect size from t-test • t-test of the dependence of tumor size on the presence of positive lymph nodes: r = 0.21, a moderately small effect

  25. Other questions to consider • Is tumor size associated with receptor (estrogen or progesterone) status? • Tumor size was recoded into categories (pathcat) • Do cross-tabs with pathcat*estrogen status (er) or pathcat*progesterone status (pr) • Is positive estrogen or progesterone receptor status associated with larger or smaller tumors?

  26. Survival analysis • What fraction of the population will survive past a certain time? • What is the probability of survival on condition A versus condition B? • Kaplan-Meier estimator • Estimates the survival function from lifetime data • Can deal with some types of censored data (e.g., a patient withdraws from the study before the final outcome)

  27. The steps down represent each point where a subject has died. • The tick marks represent censored data

  28. Censoring • Removing a patient from the survival curve at the end of their follow-up time is “censoring” the patient. • Shown as a tick mark on the survival curve • Once a patient is censored, the curve becomes an estimate of survival because we no longer know the end point for censored patients

  29. Kaplan-Meier estimator • S(t): probability of surviving beyond time t • Rank death times in order: • 0 < t(1) < t(2) < t(3) < t(4) < ... < t(r) • Within each interval, calculate the probability of dying within that interval • Probability of dying in interval 4: • # deaths in interval 4 / number alive at t(3) • S(t(4)) = probability of surviving beyond interval 3 * probability of surviving interval 4 • S(t(4)) = S(t(3)) * (1 - probability of dying in interval 4)

  30. 7 patients with survivals of: • 1, 2+, 3+, 4, 5+, 10, 12+ • + indicates censored patient
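The product-limit calculation for these 7 patients can be sketched as follows. The function name and the 0/1 event coding are illustrative. Patients censored at a given time are still counted as at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(death_time, S(t))] using the product-limit estimator.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# The 7 patients from the slide: 1, 2+, 3+, 4, 5+, 10, 12+ (+ = censored).
times = [1, 2, 3, 4, 5, 10, 12]
events = [1, 0, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t = {t:>2}: S(t) = {s:.3f}")
```

The curve steps down only at the three death times (1, 4, and 10). The censored patients shrink the number at risk without producing a step.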

  31. KM survival curve

  32. Kaplan-Meier estimator • Dataset: leukemia.sav • Remission times of acute leukemia in weeks • 2 treatment groups, 42 observations • Placebo: • 1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23 • 6-mercaptopurine (* = censored): • 6 6 6 6* 7 9* 10 10* 11* 13 16 17* 19* 20* 22 23 25* 32* 32* 34* 35* • The first censored time is 6: that patient was followed for 6 weeks without the event occurring • Outcome: 0 = censored; 1 = death

  33. Analyze -> Survival -> Kaplan-Meier • Time: time to remission • Status: outcome (1) • Define event: Single value: 1 • Factor: treatment • Compare factor: Log rank, Pooled over strata • Options: • Statistics: Survival table(s); Mean & Median survival • Plots: Survival

  34. Interpretation of Kaplan-Meier • Survival table • Provides estimate of survival for each event • Means & Medians for Survival Time • Data summarized in a table that you can report • Estimated mean survival times: • Placebo: 8.6 weeks • 6-mercaptopurine: 23.2 weeks • Highly significant difference between the 2 groups
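The group comparison can be reproduced outside SPSS with a hand-rolled log-rank test on the leukemia times listed on slide 32. This is a sketch using the standard two-group log-rank statistic, not SPSS's own code. The function name and the (time, event) tuple layout are assumptions made for illustration.

```python
import math

# (time in weeks, event) with event = 1 for an observed event, 0 for censored.
placebo = [(t, 1) for t in
           [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]]
six_mp = [(6, 1), (6, 1), (6, 1), (6, 0), (7, 1), (9, 0), (10, 1), (10, 0),
          (11, 0), (13, 1), (16, 1), (17, 0), (19, 0), (20, 0), (22, 1),
          (23, 1), (25, 0), (32, 0), (32, 0), (34, 0), (35, 0)]


def log_rank(group1, group2):
    """Two-group log-rank test; returns (chi-square with 1 df, p-value)."""
    event_times = sorted({t for t, e in group1 + group2 if e == 1})
    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n1 = sum(1 for u, _ in group1 if u >= t)   # at risk in group 1
        n2 = sum(1 for u, _ in group2 if u >= t)   # at risk in group 2
        d1 = sum(1 for u, e in group1 if u == t and e == 1)
        d2 = sum(1 for u, e in group2 if u == t and e == 1)
        n, d = n1 + n2, d1 + d2
        observed1 += d1
        expected1 += d * n1 / n
        if n > 1:
            variance += n1 * n2 * d * (n - d) / (n**2 * (n - 1))
    chi2 = (observed1 - expected1) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = log_rank(placebo, six_mp)
print(f"log-rank chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

The test compares the number of events observed in one group with the number expected if both groups shared the same survival curve.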

  35. Linear regression • Models the relationship between a scalar variable y and one or more explanatory variables X • Used for: • Prediction: what is y given X? • Strength of the relationship between y and Xj

  36. Linear regression • Dataset: LifeExpectancybyTVandPhysicans.sav • Handout describes the dataset • Question: • Is there a relationship between life expectancy in the different countries and the ratio of people/TV or people/physicians?

  37. Linear regression • Analyze -> Regression -> Linear • Dependent: LifeExp • Independents: TV & Physicians (do as separate analyses) • Method: Enter • Statistics: Estimates, Model fit, Descriptives • Plots: Y: *SDRESID; X: *ZPRED; Histogram and Normal probability plot

  38. Modeling effect of TVs The life expectancy is equal to: -0.036*Ratio people/TV + 69.6 For a country with 500 people/TV, the life expectancy is predicted to be 51.6 years.
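The prediction is a direct substitution into the fitted line. The sketch below uses the slope and intercept from the slide; the function name is illustrative.

```python
def predict_life_expectancy(people_per_tv, slope=-0.036, intercept=69.6):
    """Predicted life expectancy (years) from the people-per-TV ratio."""
    return slope * people_per_tv + intercept


predicted = predict_life_expectancy(500)
print(f"Predicted life expectancy: {predicted:.1f} years")
```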

  39. Model summary • R: linear correlation between the observed and model-predicted values. A moderate value indicates a moderately strong relationship • R square: coefficient of determination; about 35% of the variation in LifeExp is explained by the model

  40. Once you’ve lowered the ratio of people/TV or people/physicians, there is no further effect of those on life expectancy. Or is there? Try a log-transformation of both TV and physicians

  41. Redo the linear regression, using either logTV or logPhysicians as the independent variable • What is the predicted life expectancy for a country with 1000 people/physician? • Hint: Need to take the log (base 10) value of 1000 first • The life expectancy is equal to: -11.45*log(1000) + 102.8 • Answer: 68.5 years
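The worked answer can be checked numerically. This sketch assumes a slope of -11.45 and a base-10 logarithm, which is the combination that reproduces the slide's answer of about 68.5 years. The function name is illustrative.

```python
import math


def predict_from_log_physicians(people_per_physician,
                                slope=-11.45, intercept=102.8):
    """Predicted life expectancy from log10 of the people-per-physician ratio.

    Assumption: the regression used a base-10 log transformation.
    """
    return slope * math.log10(people_per_physician) + intercept


predicted_log = predict_from_log_physicians(1000)
print(f"Predicted life expectancy: {predicted_log:.2f} years")
```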

  42. Normal PP plot

  43. Take home points • Plot your data • Anscombe’s quartet: • 4 datasets with nearly identical summary statistics that look very different when plotted • AnscombesData.xlsx • Consider effect size • Statistical significance does not mean clinical significance • Does the relationship make sense? • An association is not necessarily causal
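The point about Anscombe's quartet can be seen by computing the summary statistics directly. The values below are the published quartet typed in by hand rather than read from AnscombesData.xlsx, and the helper name is illustrative.

```python
from statistics import mean

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


summary = {name: (mean(x), mean(y), pearson_r(x, y))
           for name, (x, y) in quartet.items()}
for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four rows print almost the same numbers, so only a scatter plot reveals how differently the datasets behave.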
