Task force summary
Method • Design • Don’t ‘pretend’ it’s something it’s not • Hypothesis generating vs. hypothesis testing • Or exploratory vs. confirmatory • Both can be of great value, and they are not mutually exclusive even within a single study • Populations can be defined in many ways; make sure it is clear which one you are trying to speak to • Sampling • Can be quite a complex undertaking; make sure it is clear how the data were arrived at
Method • Random assignment • Critical in experimental design • Do not assume you can assign randomly yourself • Humans are terrible at it, so let software decide the assignment • In cases of non-experimental design, ‘comparison’ groups may be implemented, but they are not true controls and should not be implied as such • Control can be introduced via design and analysis • Random assignment and control do not by themselves provide causality • Causal claims are subjective judgments made on the basis of evidence, control of confounds, contiguity, common sense, etc.
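Where the slide says to let software decide assignment, a minimal sketch of what that can look like is below; the participant count, condition labels, and seed are all hypothetical.

```python
# Minimal sketch (hypothetical data): letting software, not the experimenter,
# assign participants to conditions.
import random

participant_ids = list(range(1, 41))   # hypothetical 40 participants
conditions = ["treatment", "control"]

rng = random.Random(2024)              # fixed seed so the assignment is reproducible
shuffled = participant_ids[:]
rng.shuffle(shuffled)

# Alternate conditions over the shuffled order: equal group sizes, no human judgment involved.
assignment = {pid: conditions[i % 2] for i, pid in enumerate(shuffled)}
print(assignment)
```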
Measurement • Variables • Precision in naming is a must • Variable names should reflect the operational definitions of constructs • For example: intelligence no, IQ test score yes • Nothing about how that value is derived should be left open to question • Range and calculations must be made extremely clear • Instruments • Reliability standards in psychology are low, and somehow getting worse • The easiest way to ruin a study and waste a lot of time is using a poor measure; it only takes one to muck up everything • You are much better off questioning a previously used instrument than assuming it is fine simply because someone else used it before • Even when using a well-known instrument, you should report the reliability for your own study whenever possible • This not only shows which populations a measure may or may not be reliable for, it is also crucial for meta-analysis • Recall that there is no single ‘reliability’ for an instrument; there are reliability estimates for that instrument for various populations
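Reporting reliability for your own sample is straightforward to do yourself. Below is a rough sketch of a Cronbach's alpha estimate on simulated (hypothetical) item responses, not a prescribed procedure.

```python
# A minimal sketch of estimating internal consistency (Cronbach's alpha) for
# your own sample rather than relying on published values. Data are hypothetical:
# rows are respondents, columns are items of one scale.
import numpy as np

rng = np.random.default_rng(1)
# 6 items that share a common factor plus item-specific noise
items = rng.normal(size=(100, 1)) + rng.normal(scale=0.8, size=(100, 6))

def cronbach_alpha(x):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(f"alpha for this sample: {cronbach_alpha(items):.2f}")
```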
Measurement • Procedure • Methods of collection must be sound, and every aspect must be communicated so others can be sure of a lack of bias • “Missing” data can be accounted for in a variety of ways these days • The worst way to handle it is to ignore incomplete cases entirely, which can introduce extreme bias into a study • Power and sample size • Don’t be lazy, get a big sample • It is very easy to calculate the sample size needed for typical analyses • However, there are many problems with such estimates, both theoretical and practical, as we will discuss later • The main thing is that it should be clear how the present sample size was determined
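As an illustration of how little effort a basic a priori sample-size calculation takes, here is a sketch using statsmodels for a two-group t test; the assumed effect size, alpha, and power are placeholders you would justify for your own study.

```python
# A minimal sketch of an a priori sample-size calculation for a two-group
# comparison; the effect size (d = 0.5) is a hypothetical input.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="two-sided")
print(f"participants needed per group: {n_per_group:.0f}")
```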
Results • Complications • Obviously any problems that arise should be made known • You will be able to do so easily with a thorough initial examination of the data • Search for outliers, miskeys, etc. • Test statistical assumptions • Identify missing data • Inspecting your data is not fishing, snooping, or whatever; it is required for doing minimally adequate research • Visual methods are best and highlight issues easily • From the article: “if you assess hypotheses without examining your data, you risk publishing nonsense.” • “If you assess hypotheses without examining your data, you will publish nonsense.” Fixed.
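A hedged sketch of the kind of initial examination described above, using pandas and matplotlib; the file name and column-by-column screening are assumptions, not a fixed recipe.

```python
# A minimal sketch of initial data screening: ranges, missingness, and quick visual checks.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("study_data.csv")        # hypothetical file name

print(df.describe())                      # ranges reveal miskeys (e.g., age = 222)
print(df.isna().sum())                    # where and how much data are missing

# Visual methods are usually the quickest way to spot problems.
df.hist(figsize=(10, 6))                  # skew, outliers, impossible values
plt.tight_layout()
plt.show()
```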
Results • Analysis • Your analysis is determined before data collection, not after • If you do not know what analysis to run and you’ve already collected the data, you just wasted a lot of time • Theory → research hypotheses → analysis ‘family’ → appropriate measures for those analyses → data collection • The only exception is archival data, but archival data bring a whole host of other problems to deal with • “Do not choose an analytic method to impress your readers or to deflect criticism.” • Unfortunately it seems common in psych for researchers to choose the analysis before the research question, mostly for the former reason (at which point they do it poorly and have the opposite effect on those who do know the analysis) • While “the simpler classical approaches” are fine, I do not agree that they should have special status, if for no other reason than that neither data nor sufficiently considered research questions conform to their use except on rare occasion. Furthermore, we now have the tools to do much better and equally understandable analyses, and calling an analysis ‘complex’ is often more a statement about familiarity than about difficulty.
Results • Statistical computing • Regarding programs specifically • “There are many good computer programs for analyzing data.” • “If a computer program does not provide the analysis you need, use another program rather than let the computer shape your thinking.” • Regarding not letting the program do your thinking for you. • “Do not report statistics found on a printout without understanding how they are computed or what they mean.” • “There is no substitute for common sense.” • Is it just me or are these very clear and easily understood statements? Would you believe I’ve actually had to defend them?
Results • Assumptions • “You should take efforts to assure that the underlying assumptions required for the analysis are reasonable given the data.” • Despite this, it is often difficult to find any mention of assumption checks, or of appropriate, modern ways of dealing with the problem of not meeting them • Hypothesis testing • “Never use the unfortunate expression ‘accept the null hypothesis.’” • Outcomes are fuzzy, and that’s ok.
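As one way of making assumption checks visible rather than omitted, here is a sketch of two standard diagnostic plots for an ordinary regression; the simulated data and variable names are purely illustrative.

```python
# A minimal sketch of checking the assumptions behind an ordinary regression
# (linearity, homoscedasticity, roughly normal residuals) with simulated data.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.4 * x + rng.normal(size=200)        # hypothetical predictor and outcome

model = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
# Residuals vs. fitted: a fan shape suggests heteroscedasticity, a curve suggests nonlinearity.
axes[0].scatter(model.fittedvalues, model.resid, s=10)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="fitted values", ylabel="residuals")
# Q-Q plot: large departures from the line suggest non-normal residuals.
sm.qqplot(model.resid, line="45", fit=True, ax=axes[1])
plt.tight_layout()
plt.show()
```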
Results • Effect sizes • “Always present effect sizes for primary outcomes.” • “Always present effect sizes.” Fixed. • Small effects may still have practical importance, or the finding may matter more to others than it does to you • Confidence intervals • Reporting the uncertainty of an estimate is important. Do it. And do it for the effect sizes. • “Interval estimates should be given for any effect sizes involving principal outcomes”
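One possible way to report an effect size together with an interval estimate is sketched below: Cohen's d for two hypothetical groups with a percentile bootstrap confidence interval. It is an illustration, not the only (or necessarily best) interval for d.

```python
# A minimal sketch of reporting an effect size with an interval estimate.
import numpy as np

rng = np.random.default_rng(3)
group_a = rng.normal(0.4, 1.0, 50)        # hypothetical scores
group_b = rng.normal(0.0, 1.0, 50)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Percentile bootstrap: resample each group with replacement and recompute d.
boot = [cohens_d(rng.choice(group_a, len(group_a)), rng.choice(group_b, len(group_b)))
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"d = {cohens_d(group_a, group_b):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```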
Results • Multiple comparisons/tests • First, pairwise methods… were designed to control a familywise error rate based on the sample size and number of comparisons. Preceding them with an omnibus F test in a stagewise testing procedure defeats this design, making it unnecessarily conservative. • Second, researchers rarely need to compare all possible means to understand their results or assess their theory; by setting their sights large, they sacrifice their power to see small. • Third, the lattice of all possible pairs is a straightjacket; forcing themselves to wear it often restricts researchers to uninteresting hypotheses and induces them to ignore more fruitful ones. • Again, fairly straightforward in the recommendation of not ‘laying waste with t-tests’.
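In the spirit of not ‘laying waste with t-tests’, the sketch below tests only two theory-driven comparisons and applies a Holm correction; the group names, planned contrasts, and simulated data are hypothetical.

```python
# A minimal sketch of testing only the comparisons a theory actually calls for,
# with a Holm correction, rather than every possible pair.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
groups = {name: rng.normal(loc, 1, 30) for name, loc in
          [("control", 0.0), ("dose_low", 0.3), ("dose_high", 0.6)]}

planned = [("control", "dose_low"), ("control", "dose_high")]   # only what the theory predicts
p_values = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in planned]

reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for (a, b), p in zip(planned, p_adj):
    print(f"{a} vs {b}: adjusted p = {p:.3f}")
```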
Results • “There is a variant of this preoccupation with all possible pairs that comes with the widespread practice of printing p values or asterisks next to every correlation in a correlation matrix… One should ask instead why any reader would want this information.” • People do not need an asterisk to tell them whether a correlation is strong or not • The correlation is an effect size and should be treated accordingly • Humans are good pattern recognizers; if there is a trend, readers will likely spot it on their own, or you can make it more apparent in summary statements that highlight such patterns. Putting asterisks all over the place doesn’t convey anything more than that you are propping up poor results with statistical significance, or worse, that some ‘fishing’ went on.
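Rather than decorating a correlation matrix with asterisks, one option is to display the correlations themselves so readers can judge the pattern; a minimal heatmap sketch with hypothetical variables follows.

```python
# A minimal sketch of presenting a correlation matrix as effect sizes to be read
# directly (a heatmap) instead of flagging each entry with asterisks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["anxiety", "sleep", "workload", "errors"])  # hypothetical variables
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="r")
plt.tight_layout()
plt.show()
```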
Results • Causal claims • Establishing causality is tricky business, especially since it can’t technically be done • There is no causality statistic, and neither causal modeling nor experimentation establishes it in and of itself • However, we do assume causal relations based on evidence and careful consideration of the problem itself; just be prepared for a difficult undertaking in attempting to establish them.
Results • Tables and figures • People simply do not take enough time or put enough thought into how their results are displayed • Like anything else, you need to be able to hold your audience’s attention • Readers spend more time going back over tables and figures than they do rereading the text • It is very easy to display a lot of pertinent information in a fairly simple graph, and this is the goal: maximum information, minimum clutter • Furthermore, what can be displayed in a meaningful way graphically is not restricted • A type of graph you’ve never come across may be the best choice • This is where you can really be creative, so allow yourself to be! • Unfortunately, many confine themselves to the limitations of their statistical program, and in trying to spruce up bad graphics they end up making interpretation worse • E.g. the 3-D bar chart • Stats programs are in general behind graphics programs in their offerings (obviously), and some are so archaic as to make customizing even simple graphs a labor-intensive enterprise.
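As a small example of “maximum information, minimum clutter”, the sketch below plots raw observations, group means, and 95% confidence intervals (not standard errors) in a single figure; the data and group labels are hypothetical.

```python
# A minimal sketch of a simple, information-dense figure: raw data points,
# group means, and 95% confidence intervals in one plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(9)
groups = {"control": rng.normal(0.0, 1, 40), "treatment": rng.normal(0.5, 1, 40)}

fig, ax = plt.subplots()
for i, (name, scores) in enumerate(groups.items()):
    jitter = rng.uniform(-0.08, 0.08, len(scores))
    ax.plot(np.full(len(scores), i) + jitter, scores, "o", alpha=0.3)   # raw observations
    mean = scores.mean()
    ci = stats.sem(scores) * stats.t.ppf(0.975, len(scores) - 1)        # 95% CI half-width, not SE
    ax.errorbar(i, mean, yerr=ci, fmt="s", color="black", capsize=4)
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("outcome score")
plt.show()
```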
Discussion • Interpretation • Credibility, generalizability, and robustness • Conclusions • Do not reside in a vacuum but must be placed within the context of prior and ongoing relevant studies • Do not overgeneralize. In the grand scheme of things one study is rarely worth much, and no study has value without replication/validation • Thoughtfully make recommendations on issues to be addressed by future research and how they might be addressed • “Further research must be done…” was already known before you started coming up with theories to test. You might as well say “Future research should be printed in black ink”; it would be about as useful.
The real problem • The initial approach laid out • Fisher, R. A. (1925). Statistical Methods for Research Workers. • Fisher, R. A. (1935). The Design of Experiments. • Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London, Series A. • Immediate criticism • Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association. • Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association. • Later criticism • Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834. • Recent criticism • Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What If There Were No Significance Tests? • Problems with power • Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences. • On the utility of exploration • Tukey, J. W. (1977). Exploratory Data Analysis. • Emphasis on use of relevant graphics • Tufte, E. R. (1983). The Visual Display of Quantitative Information. • Effect sizes • Correlation coefficient • Pearson, K. (1896). Regression, heredity and panmixia. Philosophical Transactions A. • Peirce, C. S. (1884). The numerical measure of the success of predictions. Science. • Standardized mean difference • Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences. • Issues regarding causality • Aristotle, Physics II 3. • Hume, D. (1739). A Treatise of Human Nature. • Related methods: SEM, propensity score matching • Some ‘modern’ methods • Bootstrapping • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1). • Robust methods • Huber, P. J. (1981). Robust Statistics. • Bayesian • Bayes, T. (1764). An Essay Towards Solving a Problem in the Doctrine of Chances. • Robbins, H. (1956). An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics. • Structural equation modeling • Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20.
The real problem • The real issue is that most of these problems have existed since the beginning of statistical science, have been noted since the beginning, and have had solutions on offer for decades, yet much of psych research exists apparently oblivious to this, or… • Are researchers simply ignoring them? • Task Force on Statistical Inference initial meetings and recommendations • 1996 • Official paper 1999 • Follow-up study 2006 • Statistical Reform in Psychology: Is Anything Changing? • Cumming et al. • Change, but Little Reform Yet • “At least in these 10 journals, NHST continues to dominate overwhelmingly. CI reporting is increasing but still low, and CIs are seldom used for interpretation. Figures with error bars are now common, but bars are usually SEs, not the recommended CIs...” • If we can’t expect the ‘top’ journals to change in a reasonable amount of time, what are we to make of our science?