I've made a number of comments throughout the text, but I'd like to summarize my take on it:
- The study suffers from a basic lack of control. It's entirely possible that any course in statistics would achieve similar results, and that the relatively modest improvements in knowledge simply don't stick over time.
- The improvements are relatively small. The authors claim that they are meaningful, but they provide no confidence intervals for the actual effects and don't make clear how they determine that an effect is meaningful. In fact, the authors spend no time at all on the practical questions raised by the actual size of the improvements or on how to interpret them. (A sketch of the kind of interval I have in mind follows this list.)
- The most pervasive errors (the inverse fallacy for p values) show the least improvement. Moreover, almost all of the improvement results from a decrease in "I don't know" responses rather than a decrease in actual errors. A researcher with the opposite ideological take could interpret these same data as saying that the most pervasive errors seem to be impervious to teaching.
- The number of mistakes made with the Bayes factor is the smallest at the outset, and the improvement in correct responses for the Bayes factor is the largest. This could be interpreted as indicating that Bayes factors are both easier to understand and easier to teach. The authors should at least acknowledge this alternative interpretation.
- The paper is written as though the main, or only, objection to p values is that they are associated with errors. The authors should acknowledge that this is only one of the issues people have raised with p values and, more generally, with NHST. There are many critics who would continue to object to the use of p values even if they could be properly taught.
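To make the confidence-interval point above concrete: even something as simple as the sketch below would let readers judge whether the reported improvements are meaningful. This is only an illustration, not the authors' analysis; the counts are hypothetical placeholders, and a plain Wald interval is used for simplicity.

```python
# Illustrative sketch only: a 95% Wald confidence interval for the
# pre/post difference in the proportion of correct responses.
# All counts below are hypothetical placeholders, not the paper's data.
from math import sqrt

def diff_ci(k1, n1, k2, n2, z=1.96):
    """Wald CI for p2 - p1, two independent proportions."""
    p1, p2 = k1 / n1, k2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p2 - p1
    return d, (d - z * se, d + z * se)

# Hypothetical example: 40/100 correct pre-test, 55/100 correct post-test.
d, (lo, hi) = diff_ci(40, 100, 55, 100)
print(f"improvement = {d:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Reporting intervals of this kind (or better, model-based ones) for each error type would also make the "meaningful effect" claim testable by the reader.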
Other, more minor issues are raised in the specific comments annotated below.