On 2016 Jan 16, Lewis G Halsey commented:
Fay worries that in the real world of data collection, the power of a study is not known in advance. If true, this argument only compounds our own, which is that unless power is very high (>90%), P has surprisingly low repeatability (Halsey et al., 2015), and the power of most studies, calculated after data analysis, is far lower than this (Button et al., 2013; Maxwell, 2004). Researchers would therefore not be able to design their experiments to ensure very high power, nor could they rely on good fortune for their experiments to turn out this way.
In any case, an integral step in testing the null hypothesis is generating an estimate of the variance of the pooled population. Using this estimate, obtained as the data are analysed, the researcher can immediately gauge the study's power. The exact parameters of the population are never known: they are hypothesised and then estimated, with varying certainty, from the sample that we have. With a limited sample, these estimates can vary substantially each time an experiment is repeated. If our samples, and the estimates they generate, suggest that power is poor, then any P value that we obtain, low or not, is untrustworthy. A small P value is of little import: a repetition of the same study would likely give a different result (our study, figure 4). This is like looking at the world through a pinhole. When the theoretical power is 0.48, P values less than 0.05 are no more likely than P values greater than 0.05. Why get excited if P is <0.01, when the next replicate experiment could give a P of 0.6?
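The point can be checked with a simple simulation. The sketch below (not from our paper; the group size, effect size, and seed are illustrative assumptions) repeatedly runs a two-sample t-test with roughly 0.48 theoretical power and records the P value from each replicate, showing how widely P scatters from one repetition of the same experiment to the next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: a true effect of 0.5 SD with n = 30 per group,
# which gives a theoretical power of roughly 0.48 at alpha = 0.05.
n, d, reps = 30, 0.5, 2000
p_values = np.empty(reps)
for i in range(reps):
    a = rng.normal(0.0, 1.0, n)  # control group
    b = rng.normal(d, 1.0, n)    # treatment group, true mean shifted by d
    p_values[i] = stats.ttest_ind(a, b).pvalue

# Roughly half the replicates reach P < 0.05; the rest do not,
# and individual P values range from far below 0.01 to well above 0.5.
print(f"fraction of replicates with P < 0.05: {np.mean(p_values < 0.05):.2f}")
print(f"smallest and largest P: {p_values.min():.4f}, {p_values.max():.2f}")
```

Under these assumed conditions, a single small P value says little about what the next replicate will yield.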
Fay asks whether there is a single better measure than P for testing whether the null hypothesis is untenable. First, this brings us back to the nub of the problem: P is only a good test of the null in the ideal circumstance that study power is very high. Second, there are long-standing, substantial concerns about the value of null hypothesis significance testing as a method for analysing and interpreting data (Cohen, 1994).
Lewis G Halsey and Gordon B Drummond
Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E. & Munafo, M. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376.
Cohen, J. (1994) The Earth is round (p < 0.05). American Psychologist, 49, 997-1003.
Halsey, L., Curran-Everett, D., Vowler, S. & Drummond, G. (2015) The fickle P value generates irreproducible results. Nature Methods, 12, 179-185.
Maxwell, S. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.