Reviewer #3 (Public Review):
The authors compiled a collection of published and newly sequenced genomes to assemble the largest collection of Drosophila genomes to date. Using this dataset they extracted a set of single copy orthologs to use for phylogenomic analyses, with a focus on estimating a time-calibrated phylogeny and introgression.
This new dataset is a valuable resource that will serve the broader community of Drosophila researchers opening many new avenues for future phylogenomics research. The workflow of focusing on BUSCO genes for all comparative analyses is simple in a good way -- it is easy to understand how the data were collected and it should be easily reproducible -- which makes it easy to read past the genomics details and focus on the analyses of these data.
However, I feel this is an important aspect of the paper that should receive more details, perhaps in the supplement. I may have missed it, but I could not find statistics about this ortholog data set. On average, how long is each locus, how many variable sites are there, how many taxa are missing data for any given locus due to paralogy? Do the BUSCO genes include both introns and exons? It is also unclear from the description exactly how the BUSCO genes were extracted from genomes. Are they extracted from the final assembled genomes, or do you perform variant calling after identifying them to call heterozygous site? If heterozygosity is excluded, how might this impact metrics such as the branch length tests, especially among close relatives? It likely impacts node age estimates as well?
The authors use this dataset to infer phylogenetic relationships among taxa using both ML concatenation (IQtree) and a two-step MSC approach (Astral) which yielded quite similar topologies, and they examined the impact of filtering loci with treeshrink, which had minimal impact. This new topology represents a substantial step forward for understanding the relationships among major Drosophila clades.
One of the main results of this study is a new set of node age estimates on the tree. For this they estimated branch lengths in mcmctree from a concatenated matrix of 1000 loci in the presence of fossil calibrations. The fossil calibration scheme selected as the best option includes three fossils, one dating the divergence at the split from mosquitos (uniform 195-230Ma) and two ingroup calibrations (U(43,64) and U(15,43)). To me, the credible intervals on node ages seem incredibly narrow. The authors mention this as an improvement compared to earlier studies, but they also mention later that the total amount of sequence data does not greatly impact node dating. So I'm a bit confused why the node ages are expected to be more accurate here. It seems to me that time calibrations should be most accurate when the greatest number of fossils are available, and when very appropriate Bayesian priors on set on the analysis. The effect of sequence variation is then relatively small. But here there are very few fossils, one of which is hugely distant, and so I would not expect highly precise age estimates. So I guess my question to the authors is, what do you think is going on here? Perhaps further description in the supplement of how the mcmctree method implemented here differs from traditional node dating done in a program like BEAST would help to clarify.
Considering that this paper aims to infer the new best time calibrated tree for the Drosophila community, I think that the current description of fossil calibration schemes, which primarily refers to other publication names in the supplement, is insufficient. Which fossils are used in those studies, are you using those fossils as calibrations here, or are you implementing secondary calibrations based on their phylogenetic results? The reader should not have to read every one of those papers to understand the basis of the calibrations in this paper.
Fig.1 shows nodal age posterior probabilities. Are these 95% confidence intervals? The taxon labels are too small in this figure, both on the large tree and especially in the inset figure. The legend refers to fossil taxon names used for calibrations, but because it is still unclear to me where the fossils are placed on the tree. Are the calibrations indicated somewhere in the figure?
The authors demonstrate evidence of introgression by showing mostly overlapping evidence from two different types of tests. Together, these tests show that most major clades contain significant imbalanced discordance in gene tree counts or branch lengths. The taxon labels in Figure 2 are unfortunately quite unreadable, especially the matrix labels, which makes it difficult to interpret.
I do not see a reason for presenting new names and acronyms for the introgression tests used in this study. The "DCT" is described as being similar to a suite of existing tests which are also based on comparison of rooted-triplet gene tree frequencies. These methods have been presented in many frameworks (BUCKy, D-stat, f4, etc.) and the only difference here seems to be the precise method used to determine significance. Similarly "the BLT is conceptually similar to the D3 test" could be replaced by just saying we implemented the D3 test which we refer to here as a 'branch length test (BLT)' to clarify that you have not in fact created a new test (e.g., you say "The first method we developed was the discordant-count test...")
I am not very satisfied with the estimates of the "upper bounds" of introgression used here. It seems that there could possibly be many ways in which admixture edges could be drawn on the tree to explain the matrix of significant test results, and it is better to let formal network inference methods (e.g., SNAQ, Phylonet) infer these edges rather than guess at their placement. The current approach of "placing introgression events between pairs of branches for which most descendant extant taxa show evidence of introgression" leaves significant room for subjectivity.
The authors did implement phylonet, but not very exhaustively. Why only fit a single edge on the tree instead of multiple? The authors state "networks with more reticulation events would most likely exhibit a better fit to observed patterns of introgression but the biological interpretation of complex networks with multiple reticulations is more challenging". I don't think this type of result is any more complicated to understand than the current approach used by the authors of drawing edges manually. And it is much less subjective. The authors say that it is computationally intractable, and this may be true for clades above ~15 tips, but testing on smaller trees by subsampling 10-12 tips seems feasible. From my experience network inference using pseudo-likelihood methods in SNAQ or phylonet takes a few minutes to fit 1 edge, and a few hours to fit 2-3 edges.
Currently the two major results of the paper seem disjointed. The authors infer a time-calibrated tree, and they infer introgression events, but there is not much connection between the two. I applaud the authors on one hand for being cautious in interpreting their "upper bounds" of introgression to say too much about when they think introgression has occurred in the context of the time-calibrated tree. I think there is insufficient confidence in the introgression timing estimates to do that. But, what about the inverse relationships? Does this extent of introgression across the tree impact your confidence in the estimated timing of divergence events? One expectation would be that it is biasing all of the divergence times to appear younger. See my suggestions for addressing this.
Overall, this study presents an impressive new dataset and important new results that greatly impact our understanding of the evolutionary history of Drosophila. Although the estimates of node ages and introgression events may be imperfect, they are clearly a step forward. It is clear from these results that introgression has occurred throughout the history of Drosophila, and this study paves the way for further investigation of these patterns, as the authors propose in their conclusions.