39 Matching Annotations
  1. Oct 2020
    1. I've made a number of comments throughout the text, but I'd like to summarize my take on it:

      • The study suffers from a basic lack of control. It's entirely possible that any course in statistics would achieve similar results, but that the relatively modest improvements in knowledge simply don't stick over time.
      • The improvements are relatively small. The authors claim that they are meaningful, but they provide no confidence intervals for the actual effects and never make clear how they determine that an effect is meaningful. In fact, the authors spend no time at all on the practical questions associated with the actual size of the improvements or how to interpret them.
      • The most pervasive errors (the inverse fallacy for p values) show the least improvement. Moreover, almost all of the improvement results from a decrease in "I don't know" responses rather than a decrease in actual errors. A researcher with the opposite ideological take could interpret these same data as saying that the most pervasive errors seem to be impervious to teaching.
      • The number of mistakes made on the Bayes factor items is the smallest at the outset, and the improvement in correct responses for the Bayes factor is the largest. This could be interpreted as indicating that Bayes factors are easier to understand and easier to teach. The authors should at least acknowledge this alternative interpretation.
      • The paper is written as though the main, or only, objection to p values is that they are associated with errors. The authors should acknowledge that this is only one of the issues people have raised with p values and, more generally, with NHST. There are many critics who would continue to object to the use of p values even if they could be properly taught.

      Other, more minor issues are raised in the specific comments annotated below.

    2. and in fact, what we found was that a two-factor model (with subdimensions for frequentist vs. Bayesian concepts; see Supplementary Information) fit the data better than the one-factor model.

      I did not understand from this paragraph why you chose to present your results with the one-factor model instead of the two-factor model.

    3. Another potential methodological confound is the possibility that improvements were caused by a testing effect, whereby mere exposure to items and their correct answers could have prompted individuals to score higher at subsequent post-tests. Similarly, however, if this occurred we should have arguably observed more pronounced increases in all items

      I found your argument against this limitation unconvincing. In future work, some questions should be included that are not covered in the material at all, in order to assess the baseline pre-post test improvement.

    4. What is worth noting of course is that observed increases in performance were certainly far from perfect, with some concepts, such as the group of items measuring the inverse probability and replication fallacies, yielding relatively minimal improvements. Should such trends reflect a higher level of difficulty in elucidating some of these more tricky statistical concepts

      So, a person with a different ideological take could look at these same data and conclude that they show that, as people have suspected, p values (and especially the inverse fallacy) are very difficult to teach correctly and prone to mistakes that even dedicated teaching cannot correct.

    5. Concepts that were, on average, either lesser known (i.e. Bayes factors) or more problematic (i.e. confidence intervals), demonstrated unsurprisingly more pronounced improvements in immediate learning, as compared to p-values.

      As above, another possible interpretation that must be considered is that the greater improvement is because these concepts are easier to learn.

    6. analyses yielding non-significant interaction effects of lag*time (ps > .472)

      I'm also surprised you don't do equivalence tests anywhere in the paper, given that you interpret a lack of significant effect as a lack of effect.
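      For instance, an equivalence test on the lag effect could be run with two one-sided tests (TOST). A minimal sketch, with entirely hypothetical data and equivalence bounds of my own choosing:

```python
import numpy as np
from scipy import stats

def tost_ind(x1, x2, low, upp):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    Equivalence is declared when the mean difference is significantly
    above `low` AND significantly below `upp` (the equivalence bounds).
    """
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    diff = x1.mean() - x2.mean()
    # pooled standard error of the mean difference
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p_low = stats.t.sf((diff - low) / se, df)   # H0: diff <= low
    p_upp = stats.t.cdf((diff - upp) / se, df)  # H0: diff >= upp
    return diff, max(p_low, p_upp)              # TOST p = larger one-sided p

# Hypothetical improvement scores for short-lag vs. long-lag groups.
rng = np.random.default_rng(0)
short_lag = rng.normal(0.0, 1.0, 200)
long_lag = rng.normal(0.0, 1.0, 200)
diff, p = tost_ind(short_lag, long_lag, low=-0.5, upp=0.5)
print(f"difference = {diff:.3f}, TOST p = {p:.4f}")
```

      A small TOST p would support the claim that any lag effect is smaller than the stated bounds, which is a much stronger statement than a non-significant interaction.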

    7. that the amount of time that elapsed between pretest and the first post-test did not have a statistically significant impact on the degree of improvement observed across individuals

      I'd really like to see the numbers for this. The issue comes up twice: you infer a lack of effect from a lack of significance and then don't tell us the effect size. That troubles me. It is especially important because you are claiming that a post-test administered pretty much immediately after the class reflects the more long-term effects of the class. For that claim, the question of whether there might be an effect of lag really matters.

    8. f 3.85 (SD = 1.15, subset 1) and 2.72 (SD = 0.54, subset 2) at post-test 1

      Reporting style is really inconsistent in this particular paragraph. What I'd like to know is the improvement and the SEM of the improvement for each of the subtests. If you want to test whether the improvement was smaller for the p value than for the others, that would be fine too, or at least consistent with practice throughout the paper.

    9. Specifically, main effect of time on quiz scores was most notable for least familiar items (i.e. the Bayes factors items which incurred the highest proportion of “I don’t know” responses) and more problematic items (i.e. the confidence intervals which incurred high proportions of incorrect responses

      This interpretation is reasonable. But it's also possible that Bayes and CI are just easier to teach than p values.

    10. lag (continuous, mean-centered)

      So, back in the methods you told us that you'd tested that lag was not badly skewed, according to some hypothesis test, but you never report the skew or show us the distribution of lags. I would like to see the actual numbers, and some assessment of the effect of the skew on your analysis that is not just backed up by a test with a name attached to it.
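      Concretely, reporting the skew statistic itself (and ideally a histogram) alongside any named test would address this. A minimal sketch with made-up lag data, since I don't have yours:

```python
import numpy as np
from scipy import stats

# Hypothetical lag data (days between pretest and post-test), drawn from a
# gamma distribution to mimic the right skew that delay data typically shows.
rng = np.random.default_rng(1)
lags = rng.gamma(shape=2.0, scale=5.0, size=300)

skew = stats.skew(lags)            # report the statistic itself...
sw_p = stats.shapiro(lags).pvalue  # ...not only a named test's p-value
print(f"sample skewness = {skew:.2f}, Shapiro-Wilk p = {sw_p:.3g}")
```

      Readers can then judge whether the skew is mild enough for the model used, rather than taking a named test's verdict on faith.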

    11. 2.19 times more likely (p < .001

      Here, and many other places in the text, it would be more informative to provide a CI for the estimate than the p value. The p value doesn't really tell us what the range of credible or reasonable values might be for this.
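      For an odds ratio like this one, a Wald confidence interval is straightforward to compute. A sketch with invented 2x2 counts (not the paper's data):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    Table layout:          correct  incorrect
        group 1:              a         b
        group 2:              c         d
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(60, 40, 40, 60)
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

      Reporting "OR = 2.19, 95% CI [x, y]" tells the reader the range of plausible effects; "p < .001" alone does not.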

    12. except for the Bayes factor items: Rates showed that among those who attempted to answer the Bayes factor items (i.e. did not opt for the “I don’t know” option), accuracy was markedly higher

      All this is consistent with the general claim that Bayes factors may be easier to understand than p values.

    13. Baseline accuracy rates were computed in two ways

      I would argue that "I don't know" should be counted with the correct responses. What we care about -- following the introduction -- is whether scientists will make errors. If they know that they don't know, we're in good shape.

    14. I don’t know (%) 27.26

      It seems like most of the improvement in correct responses reflects a drop in "I don't know" rather than a reduction in errors. Since the central issue being addressed is whether use of p values is error prone, this seems like a key issue. It seems to me that the drop in errors is actually quite minor over the course and consistent at the end of the course with values in other studies discussed in the introduction.

    15. marked response attrition was observed across the six quizzes,

      One thing that needs careful consideration is the possibility that the improvement data come from a self-selected sample. It is possible, and needs to be addressed, that the other 1,800 or so students would not have shown the improvement if they had stuck with the course.

    16. finally n6 = 276

      It seems like this should be the n you are reporting in your abstract and generally discussing when discussing the paper. After all, the central claim of the paper is about improvement and so this is the relevant N.

    17. Specifically, the scale was composed of eight p-value items (seven targeting the four aforementioned fallacies, and one item for the correct p-value interpretation), three Bayes factor items, and three confidence interval items (see Table 1)

      I agree with the person on Twitter who said that students who learn a strategy of just choosing against absolute statements will do well on this quiz without understanding much.

    18. retained learning (post-test 2, i.e. Pop Quiz 6 in week 8)

      I worry that week 8 -- the very end of the course -- is too early to assess retained learning. One simple hypothesis is that any gap between the results here and the results in other single measurement studies is explained by the distance from the most recent stats course. Perhaps this is addressed later in the paper.

    19. CFI = .90, RMSEA = .05, SRMR

      Can we spell out abbreviations before they are used? Also, for statistical tests that are not t-tests, ANOVAs, or correlations, can we please provide references so people know what the methods are?

    20. whereas a non-significant p-value necessarily implies a small effect size

      Is this the correct use of the phrase "effect size"? I normally use it to refer to the actual effect in the population, so a non-significant p wouldn't necessarily imply a small effect size; it does imply that the estimated effect size is small. Perhaps, though, what you write here is standard usage, so I'm not sure.

    21. I'm annotating this file at the request of Daniel Lakens. I hope my ideas don't annoy anyone. Please feel free to contact me for further discussion if anything seems misguided or out of place.

    22. Before we abandon one of the most widely used approaches to statistical inferences due to misuse

      Here, I think there is a paper out there making precisely the case that, while the use of p values may perhaps be improved, they should be abandoned until such improvement has been demonstrated, because the current damage is so great. Do you know the one I'm talking about? It should be referenced.

    23. As such, simply replacing p-values with other statistical tools is unlikely to resolve the problem

      This claim is made without evidence, and that should be made clear. In fact, following the logic of the paper, the possibility that one approach is easier to teach or understand would seem like the first thing to check empirically. Only after this baseline level of understanding is established does it make sense to figure out how easy it is to improve understanding.

    24. Results demonstrated statistically significant improvements

      In deference to the part of the audience that is not a fan of NHST, it would be nice to have the actual effects in the abstract.