Reviewer #2 (Public review):
Summary:
In this paper, Selberg et al. present an extension of their widely used BUSTED family of codon models for the detection of episodic ("site-branch") positive selection from coding gene sequences. The extension adds an "error component" to ω (dN/dS) to capture misaligned codons. This ω component is fixed at an arbitrarily high value to distinguish it from positive selection, which is characterised by ω > 1 but assumed not to be so extreme.
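To make the construction concrete for readers: as I understand it, the model treats each site's likelihood as a mixture over ω categories, with one extra category whose ω is fixed at a very large value to absorb misaligned codons. The following toy sketch illustrates only this mixture structure; the per-site likelihoods, ω values, and weights are placeholder numbers of my own, not the authors' fitted values (in the real model the per-category likelihoods come from phylogenetic pruning).

```python
import numpy as np

# Placeholder per-site likelihoods under each omega class
# (rows = sites, columns = omega categories).  Illustrative only.
site_lik = np.array([
    [0.9,  0.05, 0.01, 1e-6],   # site dominated by the purifying class
    [0.1,  0.60, 0.25, 1e-6],   # roughly neutral site
    [1e-4, 0.01, 0.05, 0.80],   # extreme site best explained by the error class
])

omegas  = np.array([0.1, 1.0, 3.0, 100.0])    # last class: fixed "error" omega (assumed value)
weights = np.array([0.70, 0.20, 0.09, 0.01])  # mixture proportions, summing to 1

# Mixture log-likelihood: sum over sites of the log of the
# weight-averaged category likelihoods.
loglik = np.log(site_lik @ weights).sum()
```

The hypothesis test then compares this fit against one with the error-class weight constrained to zero, which is why the choice of the fixed "error" ω matters so much.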
The new method is tested on several datasets of comparative genomes, characterised by their size and the fact that the authors scanned for positive selection and/or provided filtering of alignment quality. It is also tested on simple simulations.
Overall, the new method appears to capture relatively little of the ω variability in the alignments, although it is often statistically significant. Given the complexity of codon evolution, adding a new parameter will often come out as significant; the real question is whether it captures the signal it is intended to capture, preferably in an unbiased manner.
Strengths:
This is an important issue, and I am enthusiastic to see it explicitly modeled within the codon modeling framework, rather than externalised to ad hoc filtering methods. The promise of quantifying the divergence signal from alignment error vs selection is exciting.
The BUSTED family of models is widely used and very powerful for capturing many aspects of codon evolution, and it is thus an excellent choice for this extension.
Weaknesses:
(1) The definition of alignment error as a very large ω is not justified anywhere in the paper. There are known cases of bona fide positive selection with many non-synonymous and zero synonymous substitutions along branches. How would they be classified here? E.g., lysozyme evolution and bacterial experimental evolution.
Using the power of the model family that the authors develop, I would suggest characterising a more specific error model: e.g., radical amino-acid "changes" clustered close together in the sequence, proximity to gaps in the alignment, or correlation of apparent ω with genome quality.
Also concerning this high ω: how sensitive is its detection to computational convergence issues?
(2) The authors should clarify the relation between the "primary filter for gross or large-scale errors" and the "secondary filter" (this method). Which sources of error are expected to be captured by the two scales of filters? What is their respective contribution to false positives of positive selection?
Sources of error in the alignment of coding genes include:
a) Errors in gene models, which may differ between species but also propagate among close species (i.e., when one species is used as a reference to annotate others).
b) Inconsistent choice of alternative transcripts/isoforms.
Both of these lead to asking an alignment algorithm to align non-homologous sequences, which violates the assumptions of the algorithms, yet both are common issues in phylogenomics.
c) Sequencing errors, though I doubt they affect the results much here.
d) Low complexity regions of proteins.
e) Approximations by alignment heuristics, which are sometimes non-deterministic or dependent on input order.
f) Failure to capture aspects of protein or gene evolution in the optimality criteria used.
For example, Figure 1 seems to correspond to an incorrect or inconsistent definition of the final exon of the gene in one species, which I would expect to be classified as a "gross or large-scale error".
(3) The benchmarking of the method could be improved both for real and simulated data.
For real data, the authors only analysed sequences from land vertebrates, which have relatively low Ne and thus relatively little true positive selection. I suggest comparing results with, e.g., Drosophila genomes, where it has been reported that 50% of all substitutions are fixed by positive selection, or with viral evolution.
For simulations, the authors should present simulations with or without alignment errors (e.g., introduce non-homologous sequences, or just disturb the alignments) and with or without positive selection, to measure how much the new method correctly captures alignment errors and incorrect positive selection.
I also recommend simulating under more complex models, such as multinucleotide mutations or strong GC bias, and investigating whether these other features are captured by the alignment error component.
Finally, I suggest taking true alignments and perturbing them (e.g., add non-homologous segments or random gaps which shift the alignment locally), to verify how the method catches this. It would be interesting to apply such perturbations to genes which have been reported as strong examples of positive selection, as well as to genes with no such evidence.
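To illustrate the kind of perturbation I have in mind, here is a minimal sketch that overwrites a codon window in one sequence of an alignment with random codons, mimicking a local non-homologous segment while keeping the alignment length fixed. The sequences, species names, and window coordinates are invented for illustration.

```python
import random

BASES = "ACGT"

def inject_nonhomologous(seq, start, n_codons, rng):
    """Replace n_codons codons of seq, beginning at codon index `start`,
    with random codons, simulating a non-homologous segment."""
    s = list(seq)
    for c in range(start, start + n_codons):
        s[3 * c : 3 * c + 3] = rng.choices(BASES, k=3)
    return "".join(s)

rng = random.Random(42)  # seeded for reproducibility
aln = {"speciesA": "ATGGCTAAAGCTGCTTGA",
       "speciesB": "ATGGCTAAAGCTGCTTGA"}
# Perturb codons 1-3 of one sequence only.
aln["speciesB"] = inject_nonhomologous(aln["speciesB"], start=1, n_codons=3, rng=rng)
```

Running the method on alignments before and after such perturbations, for both positively selected and neutral genes, would directly measure how well the error component catches them.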
(4) It would be interesting to compare to results from the widely used filtering tool GUIDANCE, as well as to the Selectome database pipeline (https://doi.org/10.1093/nar/gkt1065). Moreover, the inconsistency between BUSTED-E and both HMMCleaner and BMGE is worrying and should be better explained.
(5) For a new method such as this, I would like to see p-value distributions and Q-Q plots, to verify how unbiased the method is and how well the chi-squared distribution approximates the null distribution of the test statistic.
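Concretely, one would collect p-values from runs on null-simulated alignments (no error, no selection) and plot them against Uniform(0,1) quantiles. A minimal sketch, using placeholder p-values rather than real BUSTED-E output:

```python
import numpy as np

def qq_uniform(pvals):
    """Return (expected, observed) quantiles for a p-value Q-Q plot
    against the Uniform(0,1) null."""
    obs = np.sort(np.asarray(pvals))
    exp = (np.arange(1, len(obs) + 1) - 0.5) / len(obs)
    return exp, obs

# Placeholder: p-values that would come from null simulations.
rng = np.random.default_rng(0)
pvals = rng.uniform(size=1000)  # a well-calibrated test is ~ Uniform(0,1)
exp, obs = qq_uniform(pvals)
# A well-calibrated test tracks the diagonal; systematic deviation
# would indicate that the chi-squared approximation is off.
```

Deviations near small p-values are the most important, since those drive the positive-selection calls.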
(6) I disagree with the motivation expressed at the beginning of the Discussion: "The imprimatur of "positive selection" has lost its luster. Researchers must further refine prolific candidate lists of selected genes to confirm that the findings are robust and meaningful." Our goal should not be to find a few impressive results, but to accurately measure natural selection, whether it is frequent or rare.