On 2013 Dec 12, Adam Eyre-Walker commented:
I thank Dr. Cherry for another set of insightful comments.
He points out that we may have overestimated the stochasticity associated with the accumulation of citations. He correctly notes that if assessors tend to err in the same direction in their judgments, then the errors associated with their assessments will be correlated. One might imagine, for example, that assessors tend to over-rate papers in high impact factor journals, by particular authors, or from a particular institution. Such correlated errors will mean that the correlation between assessor scores underestimates the error associated with making an assessment, and this will in turn imply that the stochasticity associated with the accumulation of citations is less than we have estimated. However, if the error associated with the accumulation of citations is also correlated with the error associated with the assessment, then the stochasticity associated with the accumulation of citations may have been underestimated. The errors associated with assessments and citations might well be correlated, given that citations depend, to some extent, on post-publication subjective assessment.
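To illustrate the point numerically, here is a minimal simulation sketch (the weights and error variances are purely hypothetical, not an analysis of the Wellcome Trust data): when two assessors share a common source of error, the square root of the between-assessor correlation overstates how well either score tracks merit, which is to say the between-assessor correlation understates the assessment error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_papers = 100_000

merit = rng.normal(size=n_papers)         # latent merit of each paper
shared_error = rng.normal(size=n_papers)  # error common to both assessors (e.g. journal prestige)

# Hypothetical weights: each score = merit + a correlated error + an independent error
score1 = merit + 0.5 * shared_error + rng.normal(size=n_papers)
score2 = merit + 0.5 * shared_error + rng.normal(size=n_papers)

r_between = np.corrcoef(score1, score2)[0, 1]  # correlation between the two assessors
r_merit = np.corrcoef(score1, merit)[0, 1]     # correlation of one score with merit

print(f"between-assessor correlation:         {r_between:.2f}")
print(f"sqrt of between-assessor correlation: {np.sqrt(r_between):.2f}")
print(f"true score-merit correlation:         {r_merit:.2f}")
# With correlated errors, sqrt(r_between) overstates the score-merit correlation,
# i.e. the between-assessor correlation understates the assessment error.
```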
A likely bias, and hence a source of correlated errors, is a tendency for assessors to over-rate papers in high-ranking journals. As we showed, the partial correlations between assessor scores, and between assessor scores and the number of citations, controlling for the impact factor, are very weak (r < 0.20). This suggests that, within journals, subjective estimates of merit and the accumulation of citations are dominated by error. The weakness of these correlations might be because there is little variation in merit within journals, with most of the variance in merit lying between journals. However, it seems unlikely that journals are a perfect arbiter of merit, because their judgments are based on subjective assessment, which we have demonstrated to be poor. Furthermore, as noted above, it is quite likely that the errors associated with assessments are correlated with the errors associated with the accumulation of citations. The system is clearly complex, and it may prove very difficult to estimate accurately the variance associated with the accumulation of citations.
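For readers who wish to reproduce this kind of calculation, a first-order partial correlation can be obtained from the three pairwise correlations using the standard formula. The sketch below plugs in the raw correlations quoted elsewhere in this exchange (0.36 between assessors, 0.48 between assessor score and the impact factor) purely for illustration; it is not a re-run of our analysis, though the result is consistent with the weak partial correlations we report.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Illustrative values only: two assessor scores (x, y), each correlated with the
# impact factor (z); controlling for z leaves a much weaker correlation.
print(partial_corr(r_xy=0.36, r_xz=0.48, r_yz=0.48))  # ~0.17
```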
We argued in our original paper that the impact factor (IF) might be the best of the methods currently available for assessing merit, though we emphasized that it was likely to be very error prone. We argued that it might be a reasonable measure because the IF is a form of pre-publication review: in accepting a paper for a particular journal, the scientific community has decided that the paper is of sufficient merit to be published where it is accepted. This decision is likely to be the consensus of several individuals, with some individuals, such as editors, having a greater say than others.

Dr. Cherry points out that using the IF as a measure of merit might potentially be matched by combining the post-publication assessments of several individuals. He shows that if we ignore any potential biases (i.e. correlated errors), for example assessors being influenced by the IF, then the estimated correlation between assessor score and merit is 0.60, and the correlation between the IF and merit is expected to be 0.80. Dr. Cherry arrives at these estimates in the following manner (he elaborated upon this in a subsequent email). If the errors are uncorrelated, then the correlation between two variables that are each correlated with X is expected to be the product of their correlations with X; e.g. if the correlation between variable 1 and X is r1, and the correlation between variable 2 and X is r2, then the correlation between 1 and 2 is expected to be r1*r2. The correlation between assessor scores in the Wellcome Trust data is 0.36, which implies that the correlation between a single assessor score and merit is SQRT(0.36) = 0.60. The square of this is the proportion of the variance in score explained by merit, which is equivalent to our equation 1 (the square of the correlation between a single assessor and merit is the expected correlation between two assessors). From this equation we can estimate the ratio of the error variance to the merit variance, which is 1.78 for the Wellcome Trust data. If we have n independent assessors we expect this ratio to be reduced by a factor of n; hence the expected correlation between the mean score from n assessors and merit is 0.60 (n=1), 0.73 (n=2) and 0.80 (n=3).

The inferred correlation between the IF and merit is 0.80; this comes from noting that the correlation between assessor score and IF is expected to be the product of the correlation between assessor score and merit and the correlation between IF and merit. Given that the correlation between assessor score and merit has been estimated to be 0.60, and the observed correlation between assessor score and IF is 0.48, we estimate that the correlation between IF and merit is 0.48/0.60 = 0.80. Hence we would need three independent assessors to match the correlation between IF and merit. For comparison, the correlation between the number of citations and merit is inferred to be 0.69 if we use the correlation between the IF and the number of citations, and 0.63 if we use the correlation between assessor score and the number of citations to make the estimate. Hence the IF is the best measure of merit, and would only be rivaled by subjective assessment if we engaged three independent reviewers; the number of citations is estimated to be better than a single reviewer, but worse than two reviewers. However, I would emphasise that these estimates all assume that the errors are uncorrelated.
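Dr. Cherry's arithmetic is easily reproduced; the short sketch below simply restates the calculation above and, like it, assumes that the errors are uncorrelated.

```python
from math import sqrt

r_between_assessors = 0.36  # observed correlation between assessor scores (Wellcome Trust data)
r_score_if = 0.48           # observed correlation between assessor score and impact factor

# Correlation between a single assessor's score and merit
r_score_merit = sqrt(r_between_assessors)                         # = 0.60

# Ratio of error variance to merit variance (from our equation 1)
error_to_merit = (1 - r_between_assessors) / r_between_assessors  # ~ 1.78

# Averaging n independent assessors divides the error variance by n
for n in (1, 2, 3):
    r_mean_merit = sqrt(1 / (1 + error_to_merit / n))
    print(f"n = {n}: correlation between mean score and merit = {r_mean_merit:.2f}")
# n = 1: 0.60, n = 2: 0.73, n = 3: 0.79 (0.80 to the rounding used above)

# If errors are uncorrelated, r(score, IF) = r(score, merit) * r(IF, merit), so:
r_if_merit = r_score_if / r_score_merit
print(f"inferred correlation between IF and merit = {r_if_merit:.2f}")  # 0.80
```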
Finally, Dr. Cherry points out a problem with using the correlation coefficient on a bounded scale. The correlation coefficient is typically scale independent – i.e. if you add, subtract, multiply or divide one of the variables by some value then the correlation coefficient remains unchanged. However, this is only true if the scale is unbounded; if there is a maximum or minimum value then the correlation may be poor, even if the reviewers agree on the ranking of the papers. For example, if one assessor tends to rate harshly and another generously, then the correlation may be poor because most of the harsh reviewer’s scores sit at the lowest mark and most of the generous reviewer’s scores sit at the highest mark. The solution to this problem is to offer an essentially unbounded scale. However, as we pointed out in our original article, the tendency for reviewers to differ in their average mark could potentially have serious consequences in an assessment exercise, particularly if individuals or universities are assessed by a limited number of individuals.
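A minimal simulation sketch (hypothetical 1–5 scale, with offsets and noise chosen purely for illustration) shows the effect: two assessors who agree almost perfectly on an unbounded scale show a much weaker correlation once their scores are clipped to the bounded scale, because one piles up at the bottom mark and the other at the top.

```python
import numpy as np

rng = np.random.default_rng(1)
n_papers = 100_000

# Hypothetical set-up: latent merit, and two assessors who agree closely on the
# ranking but differ in how harshly they mark; scores are forced onto a 1-5 scale.
merit = rng.normal(loc=3.0, scale=2.0, size=n_papers)
harsh_raw = merit - 3.0 + rng.normal(scale=0.5, size=n_papers)     # marks low on average
generous_raw = merit + 3.0 + rng.normal(scale=0.5, size=n_papers)  # marks high on average

r_unbounded = np.corrcoef(harsh_raw, generous_raw)[0, 1]
r_bounded = np.corrcoef(np.clip(harsh_raw, 1, 5), np.clip(generous_raw, 1, 5))[0, 1]

print(f"unbounded scale:   r = {r_unbounded:.2f}")  # high: the constant offsets are irrelevant
print(f"bounded 1-5 scale: r = {r_bounded:.2f}")    # far lower: scores pile up at 1 and at 5
```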
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.