
Study design and sampling (and more on ANOVA) 

Study design and sampling (and more on ANOVA) . Tron Anders Moger 8.10.2006. More on ANOVA. Last time: A bit difficult to understand what the goal of ANOVA really is Today: Back to basics, illustrations from agriculture ANOVA was initially constructed for agricultural sciences.



  1. Study design and sampling (and more on ANOVA)  Tron Anders Moger 8.10.2006

  2. More on ANOVA • Last time: A bit difficult to understand what the goal of ANOVA really is • Today: Back to basics, illustrations from agriculture • ANOVA was initially constructed for agricultural sciences

  3. Recall: Could put data in a table like this: • Each type of test was given three times for each type of subject

  4. Testing different types of wheat in a field • Interested in finding out whether different types of wheat give different crop yields • Outcome: e.g. wheat yield in pounds • Your field resembles an ANOVA data matrix! • One-way ANOVA: Tests whether the mean crop per 1000 sq. feet differs between types of wheat!
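
The one-way ANOVA test on the wheat field can be sketched in a few lines of Python. This is a minimal illustration only: the yields below are made-up numbers (hypothetical pounds per 1000 sq. feet for three wheat types), and the function name `one_way_anova_f` is an assumption of this sketch, not something from the lecture.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: how far each group mean is from the total mean
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the observations around their own group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msg = ssg / (k - 1)
    msw = ssw / (n - k)
    return msg / msw, k - 1, n - k

# Hypothetical yields, four plots per wheat type (illustrative numbers only)
wheat_yields = [
    [20, 22, 21, 23],  # wheat type 1
    [25, 27, 26, 26],  # wheat type 2
    [19, 18, 20, 19],  # wheat type 3
]
f_stat, df1, df2 = one_way_anova_f(wheat_yields)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A large F value relative to the F distribution with (df1, df2) degrees of freedom suggests that the mean yield differs between wheat types.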

  5. More complex designs: • Want to test different fertilizers also • Do different wheat types give different crops? • Do different fertilizers give different crops? Two-way ANOVA! • Does e.g. fertilizer 1 work better for wheat 1 than for wheat 2 and 3? Is there interaction between wheat and fertilizer? Two-way ANOVA with interaction!

  6. Groups and blocks • In the example: Arbitrary if we put wheat type in group or block • Equally interested in wheat and fertilizer effects in the example • Another example: Want to test 3 different treatments for pigs • Only interested in treatment effect (Group) • Design a study for one-way ANOVA, everyone’s happy

  7. Are we happy? • Is a pig a pig no matter what? • Different species of pigs could give different treatment effects (this is serious for pharma companies) • If we do one-way ANOVA, we won’t find out! • Specifically, if we sample pigs at random, we might end up with 5 pigs of the species that responds badly, and 50 pigs of the other species • Results for the 5 pigs will drown in the results for the other pigs, so we won’t even suspect that something is wrong • Blocking variable: Ensure that you sample e.g. 30 pigs from each species

  8. Pigs: Two-way ANOVA • Still only really interested in the treatment effect • But would like to control for the confounding effect of species of pig • Model for one-way ANOVA: $X_{ij} = \mu + G_i + \varepsilon_{ij}$ • $\mu$ is the total mean, $G_i$ is the group effect, $\varepsilon_{ij}$ is $N(0, \sigma^2)$ • $\sigma^2$ includes variation due to all confounders, including species of pig • The only effect we describe in the model is the treatment effect

  9. Pigs: Two-way ANOVA cont’d • Two-way ANOVA model: $X_{ijl} = \mu + G_i + B_j + I_{ij} + \varepsilon_{ijl}$ • Describes the treatment effect ($G_i$), the pig effect ($B_j$) and the interaction ($I_{ij}$) • Removes variation due to pigs from $\sigma^2$ (and from its primary estimator, MSE) • Means that $\sigma^2_{\text{two-way}} < \sigma^2_{\text{one-way}}$ • Recall: The test for a treatment effect (null hypothesis: all $G_i = 0$) compares MSG to MSE (MSG/MSE is F-distributed under the null hypothesis); reject if sufficiently large • Similar tests for the other effects, but based on MSB and MSI

  10. Pigs: Two-way ANOVA cont’d • If there is a treatment effect, MSG will be a biased estimator for $\sigma^2$ • If there is a block effect, the denominator MSE will be smaller here than MSW for one-way ANOVA • Value of test statistic will be larger! • Easier to get significant effects! (More power) • Also get more correct estimates for the group means (because of the sampling) • Similar to regression: The more significant variables you include in your model, the greater $R^2$ becomes, and you get more correct estimates for the regression coefficients • $R^2$ increases because $\sigma^2$ decreases the more variables you include
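
The gain from blocking can be checked numerically. The sketch below assumes a balanced design (the same number of pigs in every treatment/species cell) and uses invented outcome values; the function name `blocked_vs_unblocked` and the data are hypothetical. It computes the two-way error mean square MSE and the one-way error mean square MSW on the same data and compares the resulting F statistics for the treatment effect.

```python
from statistics import mean

def blocked_vs_unblocked(cells):
    """cells[i][j] holds the replicates for treatment i and block j (balanced design)."""
    a, b, r = len(cells), len(cells[0]), len(cells[0][0])
    grand = mean([x for row in cells for cell in row for x in cell])
    group_means = [mean([x for cell in row for x in cell]) for row in cells]
    block_means = [mean([x for row in cells for x in row[j]]) for j in range(b)]
    cell_means = [[mean(cell) for cell in row] for row in cells]

    ssg = b * r * sum((g - grand) ** 2 for g in group_means)   # treatment (group)
    ssb = a * r * sum((m - grand) ** 2 for m in block_means)   # species (block)
    ssi = r * sum(                                              # interaction
        (cell_means[i][j] - group_means[i] - block_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    sse = sum(                                                  # residual
        (x - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )

    msg = ssg / (a - 1)
    mse = sse / (a * b * (r - 1))                # two-way error mean square
    # One-way analysis of the same data: block and interaction variation stay in the error
    msw = (ssb + ssi + sse) / (a * b * r - a)
    return {"mse": mse, "msw": msw, "f_two_way": msg / mse, "f_one_way": msg / msw}

# Hypothetical outcomes: 3 treatments x 2 pig species x 2 pigs per cell
pigs = [
    [[10, 12], [20, 22]],  # treatment 1: species A, species B
    [[13, 15], [23, 25]],  # treatment 2
    [[16, 18], [26, 28]],  # treatment 3
]
result = blocked_vs_unblocked(pigs)
print(result)
```

With these numbers the species effect is large, so MSW is much bigger than MSE and the treatment F statistic is far larger in the two-way analysis, which is the power gain described above.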

  11. ANOVA and linear regression • Regression: Split the distance from each data point to the total mean into: • 1. Distance from mean to regression line • 2. Distance from regression line to data point • Got sums of squares SSR (1.), SSE (2.) and SST • Used for estimation and for measuring how close the data points were to the regression line ($R^2$) • However, they were also used for an F-test of whether all $\beta_i = 0$ (from the slide with detailed explanations of SPSS output) • This is ANOVA in linear regression!
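
A minimal sketch of this decomposition for simple linear regression (one explanatory variable), using invented data points. The function name `regression_anova` is an assumption of this example. It shows that SST = SSR + SSE, that $R^2$ = SSR/SST, and that the F statistic is built from the same sums of squares.

```python
from statistics import mean

def regression_anova(x, y):
    """Least-squares line plus the ANOVA quantities SSR, SSE, SST, R^2 and F."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # mean -> regression line
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # regression line -> data point
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # mean -> data point
    r_squared = ssr / sst
    # One explanatory variable: 1 and n - 2 degrees of freedom
    f_stat = (ssr / 1) / (sse / (n - 2))
    return ssr, sse, sst, r_squared, f_stat

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ssr, sse, sst, r_squared, f_stat = regression_anova(x, y)
print(f"SSR={ssr:.2f} SSE={sse:.2f} SST={sst:.2f} R2={r_squared:.2f} F={f_stat:.2f}")
```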

  12. Design differences: ANOVA and regression • Wheat example, additional confounders: Earth quality or amount of sun could vary across the field • ANOVA: Control for this by using a field where they don’t vary, or, repeat study until all types of wheat have been grown in each part of the field • Regression: Collect information on earth quality and sun amounts, and include in the model

  13. Designing a study: • Ideally, should know in advance: • The basic hypotheses you want to test • What information you need in order to test the hypotheses • Which population do you want the results to apply to? • How to collect that information; sampling, design • If regression: What important confounders do you need information on?

  14. Sampling in practice • Newbold mentions: • Information required? Has the study been done before? Is it possible to get the information? • Relevant population? • Sample selection? Random? Systematic? Stratified? • Obtaining information? Interviews? Questionnaires? • Inferences from sample? Which methods? • Conclusions? How to present your results? • Nonsampling errors: Missing data, dishonest or inaccurate answers, low reliability or validity

  15. Reliability and validity • Validity of a research instrument: The degree to which it measures what you are interested in measuring • Reliability of a research instrument: The extent to which repeated measurements under constant conditions will give the same result • A research instrument may be reliable, but not valid

  16. Types of sampling • Simple random sampling: Select subjects at random • Every subject in the population has the same probability of being sampled • Ex: One-way analysis of pigs • If the sample is large enough, it gives you a representative sample compared to the population • Problem: If the sample is small, it will give too few data on interesting sub-groups • Systematic sampling: As random sampling, but you include e.g. every 5th subject in your sample
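
The two schemes on this slide can be sketched with Python's standard `random` module. The population of numbered subjects, the function names and the fixed seed are assumptions made for the illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subject has the same probability of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every `step`-th subject."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(2006)          # fixed seed so the sketch is reproducible
population = list(range(1, 101))   # hypothetical subjects numbered 1..100

srs = simple_random_sample(population, 20, rng)
every_fifth = systematic_sample(population, 5, rng)
print(sorted(srs))
print(every_fifth)
```

Systematic sampling only behaves like random sampling if the ordering of the list is unrelated to the outcome being studied.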

  17. Types of sampling cont’d: • Stratified sampling: Want to ensure that interesting sub-groups of the population are sampled in sufficient numbers (over-sampled) • Divide the population into K strata, randomly sample ni from each stratum • Ex: Pigs, two-way ANOVA • Problems: How many pigs in each stratum? • Cluster sampling: Similar to stratified sampling, but considers geographical units • Divide the population into M clusters, randomly sample m of them • Include all subjects in the sampled clusters
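
A minimal sketch of stratified and cluster sampling as described above, again using only the standard `random` module. The pig species, herd sizes, farm names and function names are hypothetical and chosen only to mirror the pig example.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly sample the same number of subjects from every stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

def cluster_sample(clusters, m, rng):
    """Randomly pick m clusters and include every subject in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), m)
    return [subject for name in chosen for subject in clusters[name]]

rng = random.Random(17)  # fixed seed so the sketch is reproducible

# Hypothetical population: one common and one rare pig species
strata = {
    "species_A": [f"A{i}" for i in range(500)],
    "species_B": [f"B{i}" for i in range(50)],
}
sample = stratified_sample(strata, 30, rng)   # 30 pigs from each species

# Hypothetical farms acting as geographical clusters of 10 pigs each
farms = {f"farm_{k}": [f"farm_{k}_pig_{i}" for i in range(10)] for k in range(8)}
pigs_from_farms = cluster_sample(farms, 3, rng)
print({name: len(members) for name, members in sample.items()}, len(pigs_from_farms))
```

Sampling 30 from each species over-samples the rare species relative to its share of the population, which is why estimates for the whole population then need adjusted formulas.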

  18. Types of sampling cont’d: • Two-phase sampling: Carry out an initial pilot study, where only a small sample is collected • Then proceed with collecting the main sample • Advantages: Get initial estimates of effects • Initial estimates of the variance in the data give the sample size: how much data do you need to reject H0? • Disadvantages: Costly, time-consuming • NOTE: Most methods I’ve mentioned require adjusted formulas for estimation, described in the book

  19. Some study types • Observational studies • Cross-sectional studies • Cohort studies • Longitudinal studies • Panel data • Case / control studies • Experimental studies • Randomized, controlled experiments (blind, double-blind) • Interventions

  20. Cross-sectional studies • Examines a sample of persons, at a single timepoint • Time effects rely on memory of respondents • Good for estimating prevalence • Difficult for rare diseases • Response rate bias

  21. Cohort studies and longitudinal studies • A sample (cohort) is followed over some time period. • If queried at specific timepoints: Longitudinal study • Gives better information about causal effects, as the report of events is not based on memory • Requires that a substantial group develops the disease, and that substantial groups differ with respect to risk factors • Problem: Long time perspective

  22. Panel data • Data collected for the same sample, at repeated time points • Corresponds to longitudinal epidemiological studies • A combination of cross-sectional data and time series data • Increasingly popular study type

  23. Case – control studies • Starts with a set of sick individuals (cases), and adds a set of controls, for comparison • Retrospective study – Start with finding cases and controls, then dig into their past and find out what made them cases and controls • Cases and controls should be from same populations • Matching controls • Cheap, good method for rare diseases • Problem: Bias from selection, recall bias

  24. Epidemiology • Epidemiology is the study of diseases in a population • prevalence • incidence, mortality • survival • Goals • describe occurrence and distribution • search for causes • determine effects in experiments

  25. Measures of risk in epidemiology • Relative risk (used for prospective studies) • Odds ratio (used for retrospective studies)

  26. Op-nurses cont’d: • Relative risk: Proportion of abortions among op. nurses divided by proportion of abortions among others: $RR = \frac{10/36}{3/34} = 3.1$ • Odds ratio: Odds for abortion among op. nurses: 10/26. Odds for abortion among other nurses: 3/31 • Gives the odds ratio: $OR = \frac{10/26}{3/31} = 4.0$
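
The two measures can be recomputed from the 2×2 counts. The counts below are taken from the odds quoted on the slide (10 versus 26 among op. nurses, 3 versus 31 among other nurses); the group totals 36 and 34 are inferred from those odds, and the function names are assumptions of this sketch.

```python
def relative_risk(events_exposed, non_events_exposed, events_unexposed, non_events_unexposed):
    """Ratio of the event proportions in the two groups."""
    risk_exposed = events_exposed / (events_exposed + non_events_exposed)
    risk_unexposed = events_unexposed / (events_unexposed + non_events_unexposed)
    return risk_exposed / risk_unexposed

def odds_ratio(events_exposed, non_events_exposed, events_unexposed, non_events_unexposed):
    """Ratio of the odds of the event in the two groups."""
    return (events_exposed / non_events_exposed) / (events_unexposed / non_events_unexposed)

# Op. nurses: 10 abortions, 26 without; other nurses: 3 abortions, 31 without
rr = relative_risk(10, 26, 3, 31)
oratio = odds_ratio(10, 26, 3, 31)
print(f"RR = {rr:.1f}, OR = {oratio:.1f}")
```

The odds ratio is further from 1 than the relative risk here because the outcome is not rare among op. nurses.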

  27. Correcting for finite population in estimations • Our estimates of for example population variances, population proportions, etc. assumed an ”infinite” population • When the population size N is comparable to the sample size n, a correction factor is necessary. • Used if n>0.05N • Examples: • Variance of population mean estimate: • Variance of population proportion estimate:
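
A minimal sketch of a finite population correction. The slide's own formulas are not reproduced here; the code assumes one common textbook form, in which the usual variance estimate is multiplied by (N − n)/(N − 1). Some texts use slightly different estimators and factors (for example (N − n)/N), so treat the exact factor, the function names and the example numbers as assumptions.

```python
def needs_correction(n, population_size):
    """Rule of thumb from the slide: correct when n > 0.05 N."""
    return n > 0.05 * population_size

def fpc(n, population_size):
    """Assumed correction factor (N - n) / (N - 1); other texts use variants."""
    return (population_size - n) / (population_size - 1)

def variance_of_mean(sample_variance, n, population_size):
    """Variance of the sample mean with the assumed finite population correction."""
    return sample_variance / n * fpc(n, population_size)

def variance_of_proportion(p_hat, n, population_size):
    """Variance of the sample proportion with the assumed finite population correction."""
    return p_hat * (1 - p_hat) / n * fpc(n, population_size)

# Hypothetical numbers: sample of 50 from a population of 200
print(needs_correction(50, 200))
print(variance_of_mean(100.0, 50, 200))
print(variance_of_proportion(0.4, 50, 200))
```

The correction always shrinks the variance, and it approaches 1 (no correction) as N grows large relative to n.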

  28. Determining sample size • An important part of experimental planning • The answer will generally depend on the parameters you want to estimate in the first place, so only a rough estimate is possible • However, a rough estimate may sometimes be very important to do • A pilot study may be very helpful

  29. Sample size for means (large samples) • We want to estimate the mean $\mu$ • We want a confidence interval to extend a distance $a$ from the estimate • We guess at the population variance $\sigma^2$ • A sample size estimate: $n = z_{\alpha/2}^2 \sigma^2 / a^2 \approx 4\sigma^2 / a^2$ • Small samples: If we have a population of size N, and want a specified , we get at 95% confidence level

  30. Example: Have dental costs increased since 1995? • Want to compare dental costs in 1995 (adjusted to 2006-kroner) and 2006 • Could do a paired sample t-test. How many individuals do we need to ask? • We believe a difference of 1500 kroner is important • From experience, we think the standard deviation $\sigma$ of the difference is 2500 kroner • Need $4 \cdot 2500^2 / 1500^2 \approx 11.1$, i.e. at least 12 individuals, to find a significant difference if our assumptions are correct
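
The calculation in this example can be written as a small helper. The sketch follows the slide in rounding the 95% quantile 1.96 up to 2, so that $z^2 = 4$; the function name and the default argument are assumptions of the illustration.

```python
import math

def sample_size_for_mean(sigma, a, z=2.0):
    """Smallest n for which a CI of half-width a is reached, given a guessed sigma."""
    return math.ceil(z ** 2 * sigma ** 2 / a ** 2)

# Dental cost example: guessed sigma = 2500 kroner, important difference a = 1500 kroner
n_needed = sample_size_for_mean(sigma=2500, a=1500)
print(n_needed)  # 12
```

Because the result depends on the square of the guessed standard deviation, a pilot study that improves the guess for sigma can change the required sample size considerably.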

  31. Sample size for proportions (large samples) • We want to estimate the proportion $P$ • We want a confidence interval to extend a distance $a$ from the estimate • Recall: CI for $P$: $\hat{P} \pm z_{\alpha/2}\sqrt{\hat{P}(1-\hat{P})/n}$ • A sample size estimate: $n = z_{\alpha/2}^2 P(1-P)/a^2$ • The largest possible value of this expression is about $1/a^2$ (at $P = 0.5$, with $z_{\alpha/2} \approx 2$) at 95% confidence level

  32. Example: Poll • Want to estimate the proportion voting Labour with a 95% confidence interval extending ±3% • Need to include at most $1/0.03^2 \approx 1111.1$, i.e. 1112 people, in our study • Would probably stick with 1112 if we don’t have any reason to believe P is smaller than 0.5
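
A minimal sketch of the poll calculation, assuming the same rounding of the 95% quantile to 2 as in the example. The function name is hypothetical, and the value P = 0.3 in the second call is an invented guess used only to show how a smaller P reduces the required sample.

```python
import math

def sample_size_for_proportion(a, p=0.5, z=2.0):
    """Smallest n giving a CI of half-width a; p = 0.5 is the worst case."""
    return math.ceil(z ** 2 * p * (1 - p) / a ** 2)

worst_case = sample_size_for_proportion(a=0.03)          # no prior idea about P
with_guess = sample_size_for_proportion(a=0.03, p=0.3)   # hypothetical prior guess P = 0.3
print(worst_case, with_guess)
```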

  33. Next time: • Some more on time-series analysis from chapter 19 • Presentation of results: How do you do it? • Recap of the different methods we’ve learnt
