214 Matching Annotations
  1. Last 7 days
    1. A tentative first step was taken by Lajaaiti et al. (2023), who combined a graph-based representation of phylogenetic trees with Graphical Neural Networks (GNN). This combination, however, resulted in a poor performance due to “over-smoothing” and “hop neighbourhood” problems.

      But see Leroy et al., 2025 (https://doi.org/10.1101/2025.08.14.670341), which addresses this issue using an improved pooling operator. However, as they discuss in their preprint, the performance they achieve (exceeding that of the MLE) still likely does not represent a ceiling, as the architecture is quite simple. Use of more sophisticated graph-based architectures, including graph transformers (which combat oversmoothing and can more readily account for both local and global patterns), will likely increase this performance further.

  2. Aug 2025
    1. We anticipate that layers that account for this depth order, e.g. through convolutions or possibly self-attention (as used in spatio-temporal graphs (e.g. Guo et al. 2019, Su et al. 2020)), will often be complementary to other layers acting on the topology (encoded in the phylogenetic graph), e.g. through graph convolutions.

      Related to the pooling operator, I think large gains may come from the use of 1) edge weights in your GCN layers, so that not all neighbors are treated equally by the message passing mechanism, and 2) alternative MPNN layer types, including the graph attention mechanism (i.e. GAT) or graph transformers, which use attention to learn which neighbors are more "important." I suspect that even with simple mean-pooling, these alternative layer types will be much more performant and generalizable (e.g. from CRBD to BiSSE). In effect, the GCN layers (particularly without edge weights) are more akin to the CRBD in that they assume a uniform, homogeneous contribution by all neighbors to feature updates.
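
      To make the suggestion concrete, here is a minimal PyTorch Geometric sketch (hypothetical layer sizes, not the authors' architecture) in which GATConv replaces the plain graph convolution and branch lengths enter as a one-dimensional edge attribute that the attention mechanism can use to down-weight some neighbors.

      ```python
      import torch
      from torch_geometric.nn import GATConv, global_mean_pool

      class AttentionTreeEncoder(torch.nn.Module):
          """Hypothetical sketch: attention-based message passing over a phylogeny."""
          def __init__(self, in_dim=1, hidden=64, heads=4, out_dim=2):
              super().__init__()
              # edge_dim=1 exposes the branch length to the attention scores
              self.conv1 = GATConv(in_dim, hidden, heads=heads, edge_dim=1)
              self.conv2 = GATConv(hidden * heads, hidden, heads=1, edge_dim=1)
              self.head = torch.nn.Linear(hidden, out_dim)

          def forward(self, x, edge_index, edge_attr, batch):
              x = self.conv1(x, edge_index, edge_attr).relu()
              x = self.conv2(x, edge_index, edge_attr).relu()
              return self.head(global_mean_pool(x, batch))   # simple mean readout
      ```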

    2. the LTT-based statistics are less useful under BiSSE, which explains why the PhyloPool procedure loses part of its edge against global pooling (used in GNN-avg): preserving the phylogenetic order is intuitively less important when estimating under the BiSSE model, where consecutive nodes may be under different states.

      Again, I think this is a case for exploring more general pooling operators, such as EdgePooling, that might capture the relevant signal without imposing rigid inductive biases on the architecture that could prove harmful to more general applications.

    3. We consider two GNN architectures, both starting with graph convolutional layers. The first architecture (GNN-avg) aggregates the outcome of the last convolution through a global average pooling layer (as in Lajaaiti et al. 2023), the second (GNN-PhyloPool) through our PhyloPool procedure (see Table S4 and S5 in the Appendix for details on both GNNs).

      Have you considered pooling layers other than mean-pooling? For instance, existing pooling operators like EdgePooling might confer similar benefits by retaining the temporal signal through iterative edge contraction. https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.pool.EdgePooling.html#torch_geometric.nn.pool.EdgePooling
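
      A minimal sketch of what this could look like (hypothetical layer sizes; only illustrating the EdgePooling API, not the paper's architecture), with the pooling layer coarsening the graph by contracting scored edges before a final global readout:

      ```python
      import torch
      from torch_geometric.nn import GCNConv, global_mean_pool
      from torch_geometric.nn.pool import EdgePooling

      class EdgePoolEncoder(torch.nn.Module):
          def __init__(self, in_dim=1, hidden=64, out_dim=2):
              super().__init__()
              self.conv1 = GCNConv(in_dim, hidden)
              self.pool1 = EdgePooling(hidden)        # learns edge scores, contracts edges
              self.conv2 = GCNConv(hidden, hidden)
              self.head = torch.nn.Linear(hidden, out_dim)

          def forward(self, x, edge_index, batch):
              x = self.conv1(x, edge_index).relu()
              x, edge_index, batch, _ = self.pool1(x, edge_index, batch)
              x = self.conv2(x, edge_index).relu()
              return self.head(global_mean_pool(x, batch))
      ```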

    4. In graph convolutional layers, the update for each node starts from an average of the node itself and its direct neighbors in the graph, normalized by the degree of the nodes (see Fig. 2.a).

      Related to the above (regarding the number of GNN layers used), have you considered using either gated residual connections or something like jumping knowledge (https://arxiv.org/abs/1806.03536) between layers to mitigate the oversmoothing that deeper GNNs tend to be prone to? This could be another simple modification with outsized benefits.
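
      For illustration, a minimal PyTorch Geometric sketch (hypothetical sizes) of the jumping-knowledge idea: the output of every convolution is kept and combined at the end, so the early, more local embeddings are not washed out by deeper, over-smoothed layers.

      ```python
      import torch
      from torch_geometric.nn import GCNConv, JumpingKnowledge, global_mean_pool

      class JKEncoder(torch.nn.Module):
          def __init__(self, in_dim=1, hidden=64, num_layers=4, out_dim=2):
              super().__init__()
              dims = [in_dim] + [hidden] * num_layers
              self.convs = torch.nn.ModuleList(
                  GCNConv(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
              )
              self.jk = JumpingKnowledge(mode="cat")          # or "max" / "lstm"
              self.head = torch.nn.Linear(hidden * num_layers, out_dim)

          def forward(self, x, edge_index, batch):
              xs = []
              for conv in self.convs:
                  x = conv(x, edge_index).relu()
                  xs.append(x)                                 # keep every layer's embedding
              x = self.jk(xs)                                  # combine across depths
              return self.head(global_mean_pool(x, batch))
      ```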

    5. In GNNs, each node has an initial embedding vector that is then iteratively updated using the embeddings of its neighbors through graph convolutional layers, a common update scheme.

      I presume you're using the "base" GCN layer here? If so, I'd be clear about this: unlike the CNN and MLP, there is a massive diversity of MPNN layer types, so it's worth being very explicit about which you're using, since this choice could have massive impacts on the performance of these architectures!

    6. We provide the GNN with the phylogeny’s topology and one attribute per node as its embedding: the distance to the tips, with the tips assigned 0 and the root at a negative coordinate corresponding to the depth of the phylogeny.

      Is there a reason you chose not to include edge weights/attributes corresponding to each respective branch length? I suspect this would be quite useful/informative for providing additional context to the message passing layers and could boost performance further. Otherwise, each neighbor is assumed to contribute equally to feature updates, which is not something we innately expect to be true.
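
      For reference, GCNConv already accepts an optional edge_weight, so branch lengths could be supplied with minimal changes; a small sketch (hypothetical three-node cherry, and note that how a branch length is transformed into a weight is itself a modeling choice):

      ```python
      import torch
      from torch_geometric.data import Data
      from torch_geometric.nn import GCNConv

      # hypothetical cherry: root (node 0) with two tips (nodes 1 and 2)
      edge_index = torch.tensor([[0, 0], [1, 2]])
      branch_lengths = torch.tensor([0.8, 0.3])               # one weight per edge
      x = torch.tensor([[-1.1], [0.0], [0.0]])                # node depth feature

      data = Data(x=x, edge_index=edge_index, edge_weight=branch_lengths)
      conv = GCNConv(in_channels=1, out_channels=8)
      out = conv(data.x, data.edge_index, data.edge_weight)   # weighted message passing
      ```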

    7. These are followed by pooling layers to capture global features of the graph, and finally fully connected layers.

      True, but this is only necessary when the prediction task is graph-level. It is not needed for many other common GNN prediction tasks, including node or link prediction. This is a bit of a nitpick, but I think it is useful to distinguish since this architecture is so new to the field!

    8. We passed the CDV representation through a convolutional neural network (CNN) to predict the parameters of interest (see Table S3 in the Appendix for details on this CNN).

      I know these details are present in the supplement, but here and for the other NNs, it could be useful to specify the number of layers used - particularly for the GNN, where this relates to the effective "diameter" of visibility or neighborhood size (i.e. how many hops away the message passing mechanism aggregates neighbor information).

  3. Jun 2025
    1. This result is consistent with the hypothesis that the nonlinear latent space coordinates capture more information about the phenotypic state of the system under study, allowing for more accurate predictions of out-of-sample data. Furthermore, increasing the dimensionality of the nonlinear latent space from two to three dimensions only marginally improves the predictive power of the model, suggesting that the 2D nonlinear latent space captures most of the information about the phenotypic state of the system (see Supplementary Material for a detailed discussion).

      It seems worth communicating here that in 5 (6?) out of 8 total antibiotics tested, the 2D-VAE outperformed the 2D-RHVAE for out-of-sample prediction. That said, it is noteworthy that, at least for one antibiotic (NQO), the magnitude of improvement over the VAE by the RHVAE is the largest observed.

      I should emphasize too - I don't think this is necessarily a problem for the RHVAE - it clearly has significant benefits outside of the capacity to generalize! If anything, I'd say it's reasonable to hypothesize that perhaps it doesn't generalize as well because it does a better job of learning where it can and can not make accurate predictions, leading to sharper boundaries in prediction accuracy.

    2. a neural network architecture that embeds high-dimensional data into a low-dimensional space while simultaneously learning the geometric transformations that map the learned low-dimensional space back into the high-dimensional space via a metric tensor

      I'm very intrigued by this idea. At the very least, this seems like an interesting and effective way to drive VAEs towards learning more biologically meaningful, and increasingly expressive prior distributions rather than using a simple normal. On the other hand, this suggests the promise of a latent space that may even be interpretable.

      I'm wondering if you've looked into this - whether these meaningful distances in the high-dimensional latent space are, or eventually could be made to be, interpretable such that the contributions of different phenotypes to each latent dimension are quantifiable?

  4. Apr 2025
    1. Such stacking of attention layers is commonly used in large language models, including those used to model proteins. We use three layers because they collectively capture both pairwise and higher-order interactions, and empirical tests showed that adding more layers did not improve performance.

      Did you consider the use of (potentially gated) residual skip connections (Savarese & Figueiredo 2017)? This (or a related approach) would likely improve the expressivity of these attention layers and prevent oversmoothing by allowing a more persistent signal from first- and second-order epistatic interactions, potentially allowing the use of additional layers (if necessary).
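
      As a concrete (hypothetical) illustration of the kind of block I have in mind - not the authors' architecture - a self-attention layer wrapped in a gated residual connection, so lower-order interaction signal can persist through the stack:

      ```python
      import torch
      import torch.nn as nn

      class GatedResidualAttention(nn.Module):
          def __init__(self, dim=128, heads=8):
              super().__init__()
              self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.norm = nn.LayerNorm(dim)
              self.gate = nn.Linear(2 * dim, dim)      # learns how much to mix

          def forward(self, x):
              attn_out, _ = self.attn(x, x, x)
              g = torch.sigmoid(self.gate(torch.cat([x, attn_out], dim=-1)))
              return self.norm(g * attn_out + (1.0 - g) * x)   # gated skip connection
      ```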

    1. significantly different from the background (one-sided Kolmogorov-Smirnov test P = 0.011) (Fig 6). Collectively, these findings classify the convergent genes as exhibiting intermediate levels of pleiotropy, supporting previous predictions that moderate pleiotropy facilitates adaptation 63–65.

      Is there a significant difference in the mean level of pleiotropy? It's fair to ask whether the distributions of the extent of pleiotropy are the same for convergently vs non-convergently evolved genes, but the conclusions you are reaching about a directional effect (i.e. convergent genes exhibit greater pleiotropy) are not aligned with the tests being conducted, which simply address distributional differences, irrespective of direction.
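
      For clarity, a minimal sketch (with simulated, hypothetical pleiotropy scores) of the distinction: a two-sample KS test detects any distributional difference, whereas a one-sided test of central tendency (e.g. a Mann-Whitney U with alternative="greater") is what a directional claim actually requires.

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      convergent = rng.normal(loc=5.2, scale=1.0, size=60)      # hypothetical pleiotropy scores
      background = rng.normal(loc=4.8, scale=1.5, size=400)

      ks = stats.ks_2samp(convergent, background)                # any distributional difference
      mwu = stats.mannwhitneyu(convergent, background,
                               alternative="greater")            # directional: convergent > background
      print(f"KS p = {ks.pvalue:.3g}, one-sided MWU p = {mwu.pvalue:.3g}")
      ```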

    1. given the number of single gene translocations (jumps) since the divergence of the respective organisms from their common ancestor.

      How is the number of single gene translocations since divergence from their common ancestor determined? Does this assume or leverage some rooted phylogeny with an outgroup? Put another way, how can you distinguish between derived vs ancestral & conserved gene order?

  5. Feb 2025
    1. 8 attention heads, a 2048 feed forward hidden size, and 0.1 dropout.

      Earlier in the paper you highlight that the METL-Global model tends to overfit fairly strongly to the in-distribution training data. I can't help but wonder whether the increased usage of multi-headed attention here, paired with the equal but still relatively low dropout, may be contributing to this. Did you explore the impact of increasing dropout on the ability of this model to generalize? Related, did you explore whether applying dropout to the attention heads themselves improved the model's capacity to generalize? I fully appreciate this can be tricky to nail down, but I can't help but wonder if doing so might help somewhat for out-of-distribution predictions, even if it is to the slight detriment of in-distribution prediction.
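
      For concreteness, a small sketch (hypothetical values, not the paper's configuration) of the two knobs I mean: dropout applied to the attention weights themselves via MultiheadAttention's dropout argument, and the usual post-attention/feed-forward dropout raised above 0.1 in the encoder layer.

      ```python
      import torch.nn as nn

      d_model, n_heads = 512, 8

      # dropout on the attention weights (applied inside MultiheadAttention)
      attn = nn.MultiheadAttention(d_model, n_heads, dropout=0.2, batch_first=True)

      # a full encoder layer where the same knob is swept alongside the usual dropout
      layer = nn.TransformerEncoderLayer(
          d_model=d_model, nhead=n_heads,
          dim_feedforward=2048,
          dropout=0.3,            # hypothetical: raised from the paper's 0.1
          batch_first=True,
      )
      ```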

    2. dropping variants with NaN values for any of the biophysical attributes

      For how many variants did this occur? Was this concentrated in specific proteins in the global dataset? I'm just wondering/curious if these NaN's might be associated with some specific, biologically relevant phenomena leading to some potential acquisition bias in the simulated training data, or if these instead can be interpreted as an unbiased occurrence.

    3. The strong performance of linear regression, which relies on additive assumptions, suggests the functional landscape is dominated by additive effects.

      I agree with this interpretation insofar as it suggests that the sampled functional landscape for these proteins is dominated by additive effects. But even in cases where multi-mutants have been sampled, these datasets have typically only sampled multi-mutants at a handful of sites at small Hamming distances from WT.

      For instance, in the case of the avGFP dataset, which includes mutants up to an impressive Hamming distance of 15, the sampling density of multi-mutants steeply drops off at increasing distances. Taking into account the fraction of possible multi-mutants sampled at those increasing distances, it's apparent that we've barely scratched the surface of these proteins' sequence space and presumably functional landscapes.

      This is all to say - these datasets (naturally, given how they are generated) have very strongly biased sampling of the functional landscape, being extremely localized around a single WT protein. It may well be that these WT sequences reside on fitness peaks around which local mutational effects are largely additive.

      I would be interested in seeing the regime extrapolation performance of these methods broken down by mutational distance from WT, or by the count of mutations. I can't help but suspect that the strong performance of linear models decays strongly with mutational distance. When considering that these datasets often have sampling biases towards shorter mutational distances, it seems likely that the correlation performance will be dominated/confounded by the high abundance of "easier," local multi-mutants.

  6. Dec 2024
    1. Performance Analysis

      I would like to see figures depicting the performance (error) of these different methods for each parameter estimate, but plotted as a function of tree size (i.e. a heat map of error as a function of parameter value and tree size) - I can't help but wonder if part of the pattern of increasing error rates as a function of increasing diversification parameter is simply a result of there being greater variability in simulated tree shape/size at these larger parameter values.

      Additionally, I think it could be quite informative to plot a heatmap of error with speciation and extinction rates as the x and y axes - I suspect this would highlight a clear, predictable pattern, particularly with increasing error rates being characteristic of parameter combinations where both speciation and extinction rates are high, leading to high species turnover and thus greater "volatility" of diversification outcomes.
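
      A minimal sketch of the kind of figure I mean (entirely hypothetical column names and simulated values standing in for the per-tree results): bin the test trees by their true speciation and extinction rates and plot the mean absolute error per bin.

      ```python
      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt

      # df is assumed to hold one row per test tree: true rates and absolute error
      df = pd.DataFrame({
          "lambda": np.random.uniform(0.1, 1.0, 5000),
          "mu": np.random.uniform(0.0, 0.9, 5000),
          "abs_error": np.random.gamma(2.0, 0.05, 5000),
      })
      df["lam_bin"] = pd.cut(df["lambda"], bins=10)
      df["mu_bin"] = pd.cut(df["mu"], bins=10)
      grid = df.pivot_table(index="mu_bin", columns="lam_bin",
                            values="abs_error", aggfunc="mean")

      plt.imshow(grid.values, origin="lower", aspect="auto")
      plt.xlabel("speciation rate (binned)")
      plt.ylabel("extinction rate (binned)")
      plt.colorbar(label="mean absolute error")
      plt.show()
      ```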

    2. GNN: Predictions obtained by the graph neural network using the phylogenies.

      Again, I can't help but wonder/suspect that the performance of the GNN here is limited due to the fact that GraphSAGE cannot actually leverage/take advantage of edge weights/branch lengths.

      Related, the performance of the GNN may be improved further if node positional information in the tree is encoded as node features using some of the positional encodings implemented in PyTorch Geometric (e.g. the Laplacian eigenvector positional encodings - https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.transforms.AddLaplacianEigenvectorPE.html#torch_geometric.transforms.AddLaplacianEigenvectorPE).
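
      A minimal sketch of how this transform could be applied (hypothetical toy tree; with attr_name=None the eigenvector coordinates are simply concatenated onto the existing node features):

      ```python
      import torch
      from torch_geometric.data import Data
      from torch_geometric.transforms import AddLaplacianEigenvectorPE
      from torch_geometric.utils import to_undirected

      # hypothetical 5-node tree: root 0 -> (1, 2), node 1 -> (3, 4)
      edge_index = to_undirected(torch.tensor([[0, 0, 1, 1], [1, 2, 3, 4]]))
      depth = torch.tensor([[-2.0], [-1.0], [0.0], [0.0], [0.0]])   # distance-to-tips feature
      data = Data(x=depth, edge_index=edge_index, num_nodes=5)

      transform = AddLaplacianEigenvectorPE(k=3, attr_name=None, is_undirected=True)
      data = transform(data)
      print(data.x.shape)   # (5, 1 + 3): original feature plus 3 positional coordinates
      ```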

    3. We use the AdamW (Adaptive Moment Estimation with decoupled weight decay) optimizer (Loshchilov and Hutter, 2017) to iteratively update the neural networks’ parameters to minimize the loss function. We used default AdamW argument settings.

      It's nice to see that you used AdamW over the original Adam optimizer, as it correctly decouples weight decay from the adaptive gradient update. But it might be worth considering the use of adaptive learning rate schedulers (e.g. on loss plateau) to improve the ability of each of these three models to learn during training, as well as exploring the impact of varying weight_decay (the regularization strength), as the optimal value for this parameter (and realistically for the learning rate) is unlikely to be the same across these three models, which differ in complexity.
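
      As a small sketch of what I mean (hypothetical model and values): keep AdamW, but treat weight_decay and the learning rate as per-model tuning knobs and add a plateau-based scheduler.

      ```python
      import torch

      model = torch.nn.Linear(64, 2)                      # stand-in for any of the three models
      optimizer = torch.optim.AdamW(model.parameters(),
                                    lr=1e-3,              # worth tuning per architecture
                                    weight_decay=1e-2)    # regularization strength, also worth tuning
      scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
          optimizer, mode="min", factor=0.5, patience=10)

      for epoch in range(100):
          # ... training step, then validation producing `val_loss`
          val_loss = torch.rand(1).item()                 # placeholder value
          scheduler.step(val_loss)                        # decay LR when validation loss plateaus
      ```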

    4. We started with a GNN to make initial predictions and explored the effectiveness of both DNN and LSTM for correcting residuals, either individually or in sequence.

      Did you explore the effect of the ordering of the correction models on the outcomes? If so, what sort of variability in outcome did you find?

    5. In the GNN, the full phylogeny was interpreted as a graph and could in that form be used as input data

      I comment on this in Appendix C, but did you explore the use of different GNN architectures? GraphSAGE does quite well for inductive learning tasks and for aggregating node information from multiple hops, but it does not take into account edge weights or edge attributes, which here could be interpreted as branch lengths - I suspect this information would be incredibly useful to a GNN model capable of leveraging it.

    6. For the graph neural network, we used GraphSAGE

      Is there a particular reason why you chose to use GraphSAGE? I understand it does well in aggregating information from nodes that are multiple hops away and thus performs well in large graphs, but it is unable to take advantage of edge weights (i.e. branch lengths), which could be incredibly information rich for diversification parameter estimation. Additionally, although you capped the tree size at 1500 tips (3000 nodes), these graphs really are not terribly large in the context of GNN applications - they are also quite sparse, as they are fully bifurcating.

      Mostly just interested in your thought process here! Did you explore the use of alternative GNN convolutional layers or models?
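
      To illustrate the point (hypothetical dimensions and toy inputs): SAGEConv silently ignores branch lengths, whereas a layer such as GraphConv accepts an edge_weight, so branch lengths can modulate the messages with an almost drop-in change.

      ```python
      import torch
      from torch_geometric.nn import SAGEConv, GraphConv

      x = torch.randn(6, 1)                                  # node features (e.g. depth)
      edge_index = torch.tensor([[0, 0, 1, 1, 2], [1, 2, 3, 4, 5]])
      branch_lengths = torch.rand(edge_index.size(1))

      sage = SAGEConv(1, 16)
      out_sage = sage(x, edge_index)                         # branch lengths unused

      weighted = GraphConv(1, 16, aggr="mean")
      out_weighted = weighted(x, edge_index, branch_lengths) # branch lengths as edge weights
      ```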

    7. R package “eveGNN” (a codebase of phylogeny simulation, data transformation, neural network training and MLE computation for our study)

      Is this meant to be EvoNN? Or is this a separate package/repo? If the latter, please be sure to include a link to the repo!

    8. Our analyses encompass three different diversification scenarios for which likelihood-based inference approaches already exist:

      I think it certainly makes sense to evaluate these different model architectures under these commonly studied diversification scenarios for which comparable PCMs already exist. However, I do think it's going to be incredibly important to evaluate how these architectures do under increasingly complex scenarios, such as those where diversification rates are time- and branch-heterogeneous and either depend on some other continuously (or discretely) varying trait(s), or themselves evolve according to some model of evolution (e.g. Martin et al., 2023 - https://doi.org/10.1093/sysbio/syac068).

      I say this because although these diversification scenarios may require even more information-rich, larger phylogenies to accurately infer diversification parameters/make phylogenetic predictions, these are likely to be the scenarios where more classical MLE-based approaches exhibit a high degree of model inadequacy, leading these new approaches to shine.

    9. Phylogenetic trees can also be viewed as graphs, suggesting that graph neural networks have potential applicability in phylogenetics.

      It's great to see this recognition/appreciation for the potential of GNNs in the context of phylogenetic inference - there are a huge range of applications here!

  7. Nov 2024
    1. Similar results were observed for epistatic models other than pycofitness.

      Given the strength of the claims you are making here, I would suggest reporting these results explicitly - as is, the reader has no way to evaluate this claim.

    2. JackHMMER

      While useful for assembling sequences for your MSA, HMM-based methods for MSA construction may not be, or are unlikely to be, the most accurate. Given that you are interested in the impact of MSA construction on your predictions, it might be worth considering the more accurate alignment methods typically used in phylogenetic inference.

    3. Finally, we used one feature based on the protein 3D structure, and defined a way of combining it with the evolutionary features described above:

      Is there a reason why you only considered the one feature based on protein 3D structure?

    4. We focused on single amino acid substitutions and computed the corresponding changes in folding free energy upon mutations, defined as ΔΔG = ΔGmt − ΔGwt, where wt and mt stand for “wild-type” and for “mutant” respectively.

      Have you considered evaluating how the use of a reference-free approach (as described here) might influence your findings? By evaluating everything with respect to the wild-type, it can be difficult to disentangle local epistatic effects from global nonlinearity in the sequence-function relationship.

      https://doi.org/10.1038/s41467-024-51895-5

    5. the latter being the primary driver of evolution [5, 6, 7]

      While certainly of importance, I would suggest tempering this statement slightly - instead emphasize that protein stability is one of the strongest forces constraining molecular evolution.

    1. The ESM score

      I would suggest clarifying earlier in the results section that, across all models, the feature encodings used were ESM embeddings - as is, this doesn't clearly come across without first going into the methods.

    2. depth of multiple sequence alignments (MSAs)

      I would suggest describing/defining here exactly what you mean by MSA depth - does this involve collection/use of out-of-sample natural sequences to construct the MSA? Is depth here the number of aligned sequences outside of those you are predicting fitness for?

    3. including the best Hamming distance ensemble, averaged across 12 landscapes. Shading indicates standard deviation.

      I presume this is in reference to "Hamming distance EVmutation," plotted as the brown dotted line? This does quite well, but it's unclear from the figure what this actually is or how it incorporates the two features to make its predictions, since it's not described/indicated elsewhere in the figure.

    4. We reasoned that the number of KDE peaks, reflecting the distribution modalities of fitness, could serve as a proxy for the underlying landscape navigability, which impacts the outcome of DE.

      I like this idea - but it seems like it would be quite valuable for you to explicitly assess the extent to which this is true. For instance, does the number of KDE peaks correlate with landscape ruggedness? Or the number of fitness peaks, or fitness sinks?

      I might suggest adding in some of these basic exploratory tests investigating the relationships between the fitness statistics and fitness landscape measures to your work, either in the main text or supplement.
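
      A minimal sketch (with simulated, hypothetical fitness values and ruggedness scores) of the kind of exploratory check I mean: count KDE peaks per landscape and correlate the counts against an independent ruggedness summary.

      ```python
      import numpy as np
      from scipy.stats import gaussian_kde, spearmanr
      from scipy.signal import find_peaks

      def count_kde_peaks(fitness, grid_size=512):
          kde = gaussian_kde(fitness)
          grid = np.linspace(fitness.min(), fitness.max(), grid_size)
          density = kde(grid)
          peaks, _ = find_peaks(density)
          return len(peaks)

      # hypothetical inputs: per-landscape fitness arrays and ruggedness summaries
      landscapes = [np.random.beta(a, 2, 500) for a in (0.5, 1, 2, 5)]
      ruggedness = [0.8, 0.5, 0.3, 0.2]                     # e.g. scaled count of local optima

      peak_counts = [count_kde_peaks(f) for f in landscapes]
      rho, p = spearmanr(peak_counts, ruggedness)
      print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
      ```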

    5. Considering the variability in throughput and expense of experimental screens, we explored a range of total number of variants screened (total sample size), from 120 to 2,016 samples (Figure 2a; Table S2).

      You haven't yet mentioned what ML model or model architecture you are actually using to make fitness predictions - these details may be in the methods, but I would strongly suggest adding a section/paragraph briefly describing this to improve clarity!

    6. ten three-or four-site landscapes of the thermostable β-subunit of tryptophan synthase (TrpB)

      Is it actually appropriate to consider these as being fundamentally distinct from each other? It seems like it would be more apt to describe them as different, but (in some cases) partially overlapping regions of the same activity landscape.

  8. Oct 2024
    1. The calculated Ka/Ks values are much less than 0.5, indicating that the overall sequences are under significant purifying selection.

      This is the first time you mention that you've calculated Ka/Ks - I would suggest first describing these analyses and the motivation behind them.

    2. Considering that tandem duplications are usually highly unstable and without the action of natural selection, amplified gene arrays rapidly disappear from populations

      This is a bit awkwardly phrased - maybe reword to something like "Tandem duplications are usually highly unstable and typically disappear rapidly from their natural populations in the absence of natural selection acting to maintain them"

    3. Through phylogenetic tree analysis

      Given that all methods are in the supplement, I think it would be best if you specify the exact methods you're using here - are these trees produced as part of the base OrthoFinder workflow? How did you infer multiple sequence alignments?

    4. analyze the complete protein sequence files of more than 600 bacterial strains

      Were these per-strain protein datasets pre-processed to remove any potential artificial gene duplicates? That is, were highly redundant sequences that arise for bioinformatic reasons, rather than from actual gene duplication events, identified and removed? Inferences of orthogroups using tools like OrthoFinder, and subsequent downstream analyses, are highly impacted by the persistence of such artifacts.

  9. Aug 2024
    1. Phyloformer starts (bottom left) from a one-hot encoded MSA

      Below I point out that PF+FastME seems to do well when it comes to inferring branch lengths (as indicated by low KF distances), but it struggles with topological accuracy (as indicated by RF distances), particularly for larger trees.

      I wonder if you might be able to recover some topological accuracy by supplementing the one-hot encoded MSA with some form of complementary, or even alternative, feature representation. One possibility that comes to mind is obtaining protein feature vectors derived from a Position-Specific Scoring Matrix (PSSM) constructed from the MSA.

      This would be a 20 AA x 20 AA = 400-long feature vector for each sequence (such as implemented by PSSMCOOL - https://doi.org/10.1093/biomethods/bpac008). Including this either as a substitute for the one-hot encoded MSA or as a complementary layer may provide additional relevant information for PF to learn, resulting in improved topological accuracy while still using FastME.
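
      A rough numpy sketch of one way to derive such features (hypothetical encoding, loosely in the spirit of PSSMCOOL's 400-dimensional PSSM composition descriptors, not a drop-in for Phyloformer's input pipeline):

      ```python
      import numpy as np

      AA = 20

      def pssm_composition(onehot_msa):
          """onehot_msa: array of shape (n_seqs, n_cols, 20), gaps as all-zero rows."""
          # column-wise amino-acid frequencies shared by the whole alignment
          col_freqs = onehot_msa.mean(axis=0)                      # (n_cols, 20)
          features = np.zeros((onehot_msa.shape[0], AA, AA))
          for s, seq in enumerate(onehot_msa):
              for col in range(seq.shape[0]):
                  if seq[col].any():                               # skip gap columns
                      aa = seq[col].argmax()
                      features[s, aa] += col_freqs[col]            # accumulate profile rows per residue type
          return features.reshape(len(onehot_msa), AA * AA)        # (n_seqs, 400)

      msa = np.eye(AA)[np.random.randint(0, AA, size=(8, 50))]     # hypothetical 8 x 50 one-hot MSA
      print(pssm_composition(msa).shape)                           # (8, 400)
      ```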

    2. Performance measures for different tree reconstruction method. a) Kuhner-Felsenstein (KF) distance, which takes into account both topology and branch lengths of the compared trees; b) mean absolute error (MAE) on pairwise distances, which ignores topology; c) normalized Robinson-Foulds (RF) distance, which only takes into account tree topology. The alignments for which trees are inferred, were simulated under the LG+GC sequence model and are all 500 amino acids long. For each measure, we show 95% confidence intervals estimated with 1000 bootstrap samples.

      These results paired with the runtimes are really quite impressive! But the contrast between the results for KF distances as compared to RF distances are interesting, and seems like they may be worth unpacking.

      In particular, it's notable that the RF distances at greater tree sizes for PF+FastME seem to converge with FastME, being greater than seen for IQTree/FastTree, with the difference increasing along with tree size.

      As you say, RF is just the sum of differences in bipartitions between two trees, whereas KF considers both differences in topology and branch length. You find that PF+FastME consistently infers trees with lower or equivalent KF distances to IQTree and FastTree. But, as tree size increases, RF distances increase for PF+FastME at a high rate, exceeding those of FastTree and IQTree starting at relatively small trees (~20 tips).

      Together, these results would suggest that PF+FastME estimates branch lengths well. This is maybe expected but a great thing to see, since PF is effectively trained to infer the evolutionary distances that FastME uses to infer branch lengths! However, despite accurately inferring branch lengths, there seems to be a larger number of topological errors in the larger trees inferred by PF+FastME as compared to the other methods.

      Do you have any intuition as to why this discrepancy arises? Or any thoughts on how you might modify the model/model architecture to better account for and mitigate this effect?

    3. Tree comparison metrics for different tree reconstruction methods on the LG+GC+indels test set (alignment length=500). Legend as in Fig. 2, with Phyloformer finetuned on alignments with gaps named PFIndel+FastME and in cyan.

      Seems we see the same pattern as in Fig 2 here - this time with RF distances for PF+FastME converging with FastME at slightly larger tree sizes (~50-60 tips).

    4. Normalized Robinson-Foulds distance (above) and Kuhner-Felsenstein distance (below) for different tree reconstruction methods on the Cherry (left) and SelReg (right) test sets (alignment length=500).

      We see a similar pattern as before (in fig 4) in the SelReg dataset, where KF distances show FastME and PF+FastME inferring trees with better branch lengths but (particularly for larger trees) greater topological errors. Additionally, most methods seem to show a trajectory wherein RF distance decreases with increasing tree size, whereas for PF+FastME it is still increasing (for both the Cherry and SelReg datasets). I think it would be worth expanding the datasets if possible to see what happens at even larger tree sizes (e.g. 250, 500, 1000 leaves). Do these patterns continue or plateau? If so, where?

    5. Comparison of topology reconstruction accuracy between Phyloformer and other methods on empirical data. In both panels, we show the normalized RobinsonFoulds distance between reconstructed gene trees and the corresponding concatenate tree.

      Given that you report results for both RF and KF distances above in figure 5, I'd suggest doing the same here, as it seems clear that RF and KF are capturing quite different aspects of the differences between trees, particularly for IQTree vs FastME and PF+FastME.

  10. Jul 2024
    1. We used our dataset to test the power of ALG-based macrosynteny as a taxonomic tool by asking whether it can be used to reliably identify characteristics for defining monophyletic groups of annelids.

      I know this is just the results section, but it would be useful if you could clarify whether you're arriving at these conclusions using statistical methods or not. As written, it's a bit unclear.

    2. In particular, chromosome fusion-with-mixing events have high potential as phylogenetically informative rare genomic changes (i.e. molecular synapomorphies) because they are irreversible (Rokas and Holland 2000; Schultz et al. 2023; Steenwyk and King 2024).

      Related to this, it might also be interesting to explore the use of your inferred macrosynteny as characters for phylogenetic inference! For instance, rather than using the concatenated single copy genes to infer your species tree, would use of macrosynteny lead to the same conclusions with respect to the relative prevalence of fusions/splitting events?

    3. (B) Ideogram plots of clitellate genomes colored by earthworm ancestral linkage groups.

      Is this not the same figure as in 3A, just excluding T. lapidaria and colored according to the earthworm ancestral linkage groups?

      I think in some cases these ideogram plots are somewhat redundant across figures. This is ultimately a matter of preference, but it might help to streamline things by limiting the number of figures.

    4. Bilaterian ancestral linkage groups are often fused but rarely split in errantian and sedentarian annelids.

      It's a bit difficult to gain clear intuition here regarding the relative prevalence of fusion vs splitting events. Could you perhaps plot the phylogeny with nodes labeled according to the count of fusions/splits (e.g. something like 2/0, where # fusions = 2, # splits = 0)?

    5. From this, it can be inferred that the annelid ancestral state was 20 ALGs and, therefore, 20 chromosomes.

      Where is this coming from? Are these numbers from explicit statistical methods, e.g. ancestral state reconstruction? I'd be clear about this, because as is, it reads as more of a hypothesis than an inference.

    6. concatenated alignment of 537 single-copy orthologs from 23 annelid genomes.

      Why did you decide to use concatenation rather than a partitioned analysis, or more formal species tree models capable of leveraging multi-copy gene families such as SpeciesRax (https://doi.org/10.1093/molbev/msab365), Asteroid (https://doi.org/10.1093/bioinformatics/btac832), or Astral-Pro 2 (https://doi.org/10.1093/bioinformatics/btac620)?

      I'd definitely suggest exploring the use of one of these alternatives, as the models underlying species tree methods are quite different from those underlying concatenation. Furthermore, in some cases concatenation can lead to highly "resolved" phylogenies with strong bootstrap support even when the underlying gene trees conflict. For a more detailed discussion of the matter, I'd suggest taking a look at Liu et al. 2015 (https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.12747)

    7. Phylogeny of chromosome-level annelid genomes

      The order here seems a bit off, and led to one of my later comments that stemmed from this confusion. I would suggest ordering the sections so that they describe: 1) the collection/curation and gene-model annotation of the 23 genome assemblies, and summaries of them, 2) the inference of orthology, 3) phylogenetic inference, and 4) the remaining analyses of macrosynteny.

      I suggest this because currently you describe phylogenetic analyses that use the orthogroups inferred with OrthoFinder, but the orthology inference itself is not described until after the phylogenetics. I think restructuring in this way might be a bit more intuitive for the reader.

    8. The topology is largely consistent with transcriptome-based phylogenies and supports the widely accepted division of the bulk of annelid diversity into two monophyletic groups, Errantia and Sedentaria, with Oweniidae and Sipuncula as basal lineages.

      Given the earlier comments about uncertainty in transcriptome-based phylogenetics, and the fact that the new phylogenetic hypotheses presented here support those from transcriptome datasets, I might suggest either tempering the earlier statements or describing in greater detail/specificity exactly where the topologies differ.

    9. We built a maximum likelihood phylogeny of annelids using the chromosome-level genomes and newly annotated gene models (fig. 1; supplementary fig. S3).

      As written it's a bit ambiguous what the data used for phylogenomic inference here were - I might suggest rewording slightly to something along the lines of:

      "We built a maximum likelihood phylogeny of annelids using >500 single-copy genes sampled from the chromosome-level genome assemblies and newly annotated gene models"

    10. However, the current understanding of annelid phylogeny is largely based on transcriptomic data (Struck et al. 2011; Weigert et al. 2014; Andrade et al. 2015; Weigert and Bleidorn 2016) and subsequently retains a degree of uncertainty.

      Can you elaborate on/discuss why you believe transcriptome-based phylogenomic inferences are more likely to retain uncertainty than inferences from other genome-wide sources of data?

      I ask because there will always be variable degrees of uncertainty in phylogenetic/phylogenomic inference, irrespective of the source of data used. Furthermore, (measured) uncertainty can be confounded by the methods used, source of data, and more (see Simon 2022 for a discussion: https://academic.oup.com/sysbio/article/71/4/921/5904279).

  11. Jun 2024
    1. Once the model was evaluated, we chose our top-performing model for further analysis, in which we integrated the evolutionary features with composition-based features and the ML score with the BLAST score and named the hybrid methods

      It's a bit confusing to me why you would carry out your model selection procedure without using the intended feature-set. My worry would be that certain models might perform better or worse with different feature types and that you might be missing that here. Could you elaborate on this?

    2. Feature selection techniques

      Did you also consider exploring the use of regularization? I would suggest looking into the use of L1 or L2 regularization to reduce the contribution of some of your features and mitigate the potential for overfitting.

    3. The reliability of a method depends on the quality of the dataset used for training and evaluation.

      Along these lines, it's important to make significant efforts to identify sources of bias in your training dataset and mitigate their potential impact on predictions. This is true for the training set, but it's similarly true for the test (referred to here as the validation) set - if the test set is biased or imbalanced with respect to some relevant biological feature, then the resultant prediction accuracies may not reflect true model performance.

      What I would like to see explored more thoroughly here is whether there are taxonomic biases in this curated set of proteins used in training and testing of your model. If, for instance, some species/taxonomic groups are disproportionately represented in both your training and validation sets, this could lead to inflated prediction accuracies.

    4. two datasets, the main and the validation

      This is a bit confusing - common terminology would be to refer to the former as the training dataset (subsequently subdivided into the K folds for cross-validation), and the latter as the test set, rather than the validation set as you've done here.

    5. The major limitations of the existing methods is their dataset, as these methods have been evaluated on unreviewed data.

      Can you elaborate on this some? As is, it's not clear whether all of the listed studies above have evaluated their methods on unreviewed data, and for those that have, what the sources of the unreviewed data are.

  12. Apr 2024
    1. Parsimony-based ancestral state reconstruction was performed on the carnivorous clade of the tree in Mesquite version 3.70 (Maddison and Maddison 2021).

      It surprises me that you've chosen to use parsimony to infer ancestral character states when the rest of this paper and the motivation of the CAnDI software itself is clearly motivated by explicit, formal models of evolution.

      If you were to infer ancestral character states using a Markov model of discrete states, are there major differences in ancestral state reconstructions as compared to the parsimony reconstructions?

      The trait distribution is quite simple (not a lot of within-clade trait heterogeneity), but at the very least I might suggest conducting formal model comparison among (for instance) equal-rates, symmetric/asymmetric-rates, and all-rates-different Mk models, reporting the inferred character states under these models, and comparing them to your parsimony reconstructions. If they are concordant, that's fantastic! But if not, it would be worth commenting on how this might impact your interpretations.

    2. The sequence with the greatest number of characters was retained for each gene while lower-quality duplicates were discarded using a script written in Python (https://github.com/HollyMaeRobertson/treebuilding-scripts/blob/main/remove_repeats.py)

      Can you provide some more specific explanation as to what this script is doing under the hood?

      For instance, what characteristics lead you to classify gene copies as lower quality duplicates? How could these decisions impact or influence your inferences surrounding conflict, and could removal of legitimate but divergent duplicate gene copies bias your results in some way?

    3. conflicts of interest

      I'm not entirely clear what you mean to say with this - could you perhaps use other language to help clarify?

      Do you mean to communicate that the method allows users to identify specific instances or concerted patterns of conflict that are coincident with or predictive of trait evolution?

  13. Jan 2024
    1. We employed multiple divergence dating analyses

      Given that the dates inferred from your time calibration are of such critical importance to the conclusions made herein, I might strongly suggest using more robust methods that can take advantage of your fantastic UCE dataset. In particular, I would recommend looking into either using BEAST or MCMCTree.

    2. Divergence times for all host anemone species were calculated at less than 25 MYA, with 9 of 10 host species having divergence times between 6.9-11.6 MYA

      What were the times inferred from each method respectively?

    3. We employed multiple divergence dating analyses to convert phylogenetic trees into ultrametric chronograms and estimate divergence times for the clownfish-hosting sea anemones for the first time [8] (see Supplemental Methods).

      I understand that the detailed methods are in the supplement, but you really should at least state which methods you are using, otherwise this statement is not particularly informative.

      One thing I did notice when reading the supplement - it doesn't really make sense to use the IQ-TREE implementation of LSD2 here if the only constraints you have are for the root age. This method is primarily intended for use in phylodynamic contexts wherein you have tip-dates for most or all of your rapidly evolving pathogen samples. At its core, the method dates the tree by regressing the tip-to-root distance against sampling date - if you only have a date for the root, the method is unlikely to be accurate.

      From the LSD2 github page (https://github.com/tothuhien/lsd2): "The input date file should contain the date of most of the tips and possiblly some internal nodes if known."

    4. Diversification of the clownfish-hosting sea anemones was coincident with the clownfish adaptive radiation.

      This plot is a bit misleading, as diversification events plotted here include both geographic subdivision within species (e.g. Entacmaea) as well as diversification among species. Furthermore, without seeing the clownfish phylogeny, it's quite difficult to relate the tempo of their diversification to the tempo of diversification in anemones as seen here.

    5. The clownfish-hosting sea anemones thus diversified coincidently with clownfishes, potentially facilitating the clownfish adaptive radiation, and providing the first strong evidence for co-evolutionary patterns in this iconic partnership.

      This statement really is quite strong; however, I'm not convinced that it's warranted simply given a coincident interval in which extant species of both groups diversified.

      For instance, given the close ecological interactions among the anemones and clownfishes, they certainly share many ecological conditions. From this, it seems clear that although one hypothesis to explain the pattern of co-diversification is coevolution, an alternative hypothesis is that diversification in each group was driven by some shared feature of their life history, independent of the interactions between anemones and clownfishes.

      I might suggest including a description of alternative hypotheses, and the extent to which your data and analyses support/refute them.

  14. Dec 2023
    1. Figure 3.

      Again, I'd strongly recommend reanalyzing these data (particularly 3B) using phylogenetic comparative methods! Although some taxonomic groups seem to exhibit a strong trend (e.g. flowering plants), others do not.

    2. With some exceptions, alternative splicing is unrelated to genome factors (see Figure 2—figure Supplement 1, Figure 2—figure Supplement 2, and Figure 2—figure Supplement 3).

      This is where I would strongly suggest reanalyzing these data using phylogenetic comparative methods - you may be surprised to find that previously non-significant relationships become significant after using the more appropriate methods!

    3. This percentage is much higher than in most multicellular life forms, where coding comprises less than 50% of genes.

      Are you referring to genes here? or the genome? In other words, do you mean to say that less than 50% of genes in most multicellular life forms are coding? Or do you perhaps mean to say that less than 50% of the genome is coding? If the latter (which is what I suspect you mean to say), I would revise throughout for clarity/accuracy.

    4. Figure 1.

      This figure demonstrates clearly the strong effect of shared evolutionary history (i.e. phylogeny) on the relationships recovered herein. Another potentially interesting analysis you could consider here is a phylogenetic analysis of covariance (phylANCOVA - originally published by Fuentes-G et al., 2016 - https://doi.org/10.1086/688917).

      This test is effectively equivalent to a normal ANCOVA in that it formally tests whether different groups (here, taxonomic groups) are best explained by different regression models - in other words, whether groups share slopes, intercepts, or both. The difference is that the test can account for evolutionary non-independence using a phylogeny - this test could be conducted using a phylogeny comprising the full suite of species for which you have these data.

      This could be a nice way to formally assess which groups should actually be modeled separately, and whether differences among groups are statistically significant. Currently, differences among groups are implicitly assumed by modeling each group with its own model.

    5. Genome, genes, and coding sizes increase proportionally to each other at a lineage-specific level (see Figure 1—figure Supplement 1,Figure 1—figure Supplement 2, and Figure 1—figure Supplement 3). Specifically, we observe a significant correlation between each pair of genome variables in all cases, except for birds, where the genome size is weakly correlated to gene content and coding DNA. Furthermore, the slopes given by the linear regressions differ for each taxonomic group, which suggests key differences in their genome evolution. Figure 1A-C shows the relationship between each pair of genome variables, differentiated by each taxonomic group. Because genome variables are related to each other in an embedding structure, we show in Figure 1D the relation between the percentage of coding composing genes and the respective percentage of gene content in genomes.

      To conduct your regressions in a manner that is statistically robust to evolutionary non-independence within each taxonomic group, I suggest you conduct phylogenetic least squares (PGLS). This can be conducted in a very straightforward manner using R (for instance, as described here: http://www.phytools.org/Cordoba2017/ex/4/PGLS.html). All you will need is: 1. The per-species measurements (i.e. genome size, gene content, length of coding sequence, etc), and 2. a time-calibrated phylogeny (branch lengths are in units of time) that contains the species for which you have these data.

      Since you have measurements for so many species (a fantastic dataset!), I might suggest simply curating your list of species and obtaining a phylogeny from timetree.org. These phylogenies will not necessarily be the most accurate, as they are effectively aggregated from the published literature, but they should be sufficient for your purposes. Although these phylogenies may not include all species in your dataset, they may include closely related, representative species that you can use in their stead.

      Even if this approach requires the removal of some species from your data, the use of phylogenetic comparative methods like PGLS is critical for the statistically robust analysis of these data, and thus for the interpretations reached herein. Without these improved analyses, it unfortunately will be impossible to evaluate the accuracy of the results presented here, particularly with respect to parameter estimates and to differences in the role of alternative splicing in driving the evolution of genome complexity.

    6. Thus, a comparative analysis of species spanning the whole tree of life has revealed certain evolutionary trends in alternative splicing, prevalence in specific lineages, and relation to genome compositional structures.

      I really do like the idea of the metric you've come up with here, but my major concern is that in these correlative tests you have not accounted for the confounding effect of evolutionary non-independence between species (Felsenstein 1985: https://www.jstor.org/stable/2461605).

      That is, because species share evolutionary histories, more closely related species will be more similar in their phenotypes and, relevant to this study, in their genome content. This induces, in effect, statistical non-independence of species - phylogenetic pseudoreplication - that must be accounted for formally in your statistical analyses.

      Failure to do so can lead to, for instance, biased or incorrect parameter estimates (both intercept and slope) within models (i.e. within taxonomic groups), and can confound or mislead interpretations among groups (i.e. among the models fit for each taxonomic group). Fortunately, the solution is relatively straightforward - I've detailed some fixes in the sections below.

    7. high taxonomic groups

      I would suggest avoiding language such as this, since it alludes to the inaccurate "Scala Naturae"-type ladderized view of evolution, progressing from "low" complexity organisms towards "higher" complexity multicellular organisms.

    8. Transcript diversification is driven by two main mechanisms: gene duplication (Lynch and Conery, 2000; Holland et al., 2017) and alternative splicing (Bush et al., 2017)

      Maybe specify "of extant genes" to clarify how these two processes are not inclusive of the formation of ALL new transcripts, including those for genes not yet contained within the genome of a given lineage. That is, horizontal gene transfer and de novo gene birth are also two common mechanisms by which new genes (and thus transcript diversity) arises.

  15. Nov 2023
    1. +j

      Okay, yes, so I'd check to see whether the parameter estimates under a model without the jump parameter are consistent with these results, or if the J model has led to very different biological conclusions.

    2. Biogeographic reconstruction using BioGeoBEARS.

      I'll reiterate (and hopefully I'm mistaken!) that if this tree includes both distinct species and multiple specimens from the same species, application of ancestral range (or state) reconstruction is not correct. These methods should only be applied to a single, uniform taxonomic rank/biological unit (e.g. species).

    3. A summary coalescent species tree was constructed from the resulting gene trees using hybrid-ASTRAL implemented in ASTER (v1.15) (Zhang and Mirarab 2022).

      Am I correct in understanding that in this hybrid-ASTRAL tree, you have collapsed specimens into distinct species? If so, how many species in total?

      If not, this poses problems for the ancestral state and ancestral range reconstruction, as each should only be applied to a single taxonomic rank (i.e. species), rather than multiple (i.e. species and individuals sampled from different populations)

    4. b) lineages-through-time plot calculated with the “ltt.plot” function in the R pack-age “ape.”

      This is not reported in the methods section, but should be.

      Additionally, if I'm understanding correctly, is each tip in this tree a specimen, not a distinct species? If so, it does not make sense to use a lineage through time plot here unless you have retained only a single representative per species.

    5. The most likely model was chosen according to AIC and weighted AIC score calculated in BioGeoBEARS.

      Which models did you fit? These details - both the models you fitted, and their relative fits to the data - really must be reported here in the methods. Unless I'm missing something, they don't seem to be reported in the supplement either.

  16. Oct 2023
    1. copy number variation 11–13, gene losses 14–16, or gene gains (e.g., horizontal gene transfer or HGT) 17,18.

      Gene losses and gene gains are just the mechanisms by which (gene) copy number variation arises. Maybe reword this passage as: "but can also arise through distinct molecular paths and by distinct evolutionary mechanisms, such as copy number variation that arises via either gene losses or gene gains (e.g., horizontal gene transfer or HGT)."

    2. 1–3.

      I think it would be worth citing this recent excellent paper that synthesizes the many different perspectives on convergence vs parallel evolution, etc. The framework they outline for distinguishing these different processes would be useful to borrow from throughout, particularly as you begin to delve into the hierarchical nature of convergence (described nicely in Rosenblum et al., 2014).

    3. Convergent evolution, the repeated evolution of similar traits among distantly related taxa

      I'd suggest just clarifying up front that convergence specifically refers to the case where similar traits independently evolve from distinct ancestral states, as is commonly the case in distantly related species. This differs from parallel evolution, in which similar traits evolve, but from the same ancestral state - this can still occur among distantly related species, but more commonly manifests at the population level, or among more closely related species.

      This could be as simple as: "Convergent evolution, the repeated evolution of similar traits from distinct ancestral states, is ubiquitous in nature, commonly occurs among distantly related taxa, and has been..."

  17. Sep 2023
    1. Materials and Methods

      I'm realizing as I read through the methods that it is not described here how πS was actually calculated. This is so crucially important for the paper that these details really must be included. For instance, reading the methods it's unclear to me whether πS was calculated in sliding windows across the genome, using the core genes, or something else. The details may be outlined elsewhere in the manuscript, but they really must be here in the methods. It's quite difficult to evaluate the results otherwise.

    2. We used bootstraps to estimate the variation in each gene set’s adaptation rates.

      I would suggest being more specific about what you mean here - which specific parameters are you referring to as "adaptation rates"?

    3. The distribution of fitness effects (DFE) was determined based on the SFS information of synonymous and non-synonymous sites of concatenated genes using the GRAPES software (3,14).

      Why were these methods applied to the concatenated genes? Would it not be more informative to fit these models on a gene-by-gene basis so as to gain more insight into how the DFE varies within and among species? Alternatively, if the issue is a matter of statistical power, could this instead be done on the concatenations of the three datasets described above (non-secreted, secreted, and effectors)?

    4. we inferred the distribution of fitness effects (DFE) of mutations across species using six models (Figure 2B; Supplementary table S9).

      I would be curious to see whether Ne or the recombination rate vary in some predictable way among the species corresponding to each model.

    5. we estimated πS reflecting the neutral genetic variation in the population and here considered a relative proxy for the effective population size

      It seems to me that accurate estimates of Ne will be critical to your interpretations throughout the manuscript, will they not? If so, and given the incredible dataset you're working with here, I can't help but feel that using πS here rather than actual estimates of Ne is surprising.

      At the very least, I might suggest using a method such as GONE (https://academic.oup.com/mbe/article-abstract/37/12/3642/5869049) to estimate Ne for each species, if only to demonstrate that the two approaches lead to comparable or equivalent solutions. The method performs quite well for estimating contemporary effective population sizes from modest numbers of samples. A comparison of this and other related methods (including those that better infer historical population sizes) can be found here: https://academic.oup.com/mbe/article-abstract/37/12/3642/5869049

  18. Jul 2023
    1. Higher gene family evolution rates could be due to underlying gene duplication, neo-functionalization and/or genome rearrangements, all of which are implicated in polyphagous feeding (Murad et al., 2021; Seppey et al., 2019).

      Horizontal transfer (including introgression following hybridization) is another potential explanation - this would be a good place to discuss this point.

    2. The root of the species tree was time calibrated to 87 Myr to reflect the divergence of the three families used in the analysis (Espeland et al., 2018).

      This needs to be explained further. How exactly was this done? Did you convert the inferred species tree to be ultrametric and then change the root depth to 87Myr? If so, this is not an appropriate method for time-calibration.

      At a minimum, I would suggest either using this root age with r8s as implemented in geiger (https://rdrr.io/cran/geiger/man/r8s.phylo.html), or PATHd8 (https://www2.math.su.se/PATHd8/). Alternatively, you could use another species tree time-calibrated with other, more intensive methods that includes some proportion of these species, and use the congruification method (again implemented in geiger: https://rdrr.io/cran/geiger/man/congruify.phylo.html), which uses PATHd8 or treePL under the hood.

      If you've used one of these methods to time calibrate your tree, be sure to describe that here.

    3. We additionally used Orthofinder for constructing our species and gene phylogenetic tree.

      How sensitive might your results be to the inferred species tree topology? Numerous efficient methods exist that are capable of inferring species trees from multiple copy gene families, including asteroid (https://github.com/BenoitMorel/Asteroid) and ASTRAL-Pro2 (https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral-pro.md).

      Additionally, SpeciesRax (https://github.com/BenoitMorel/GeneRax/wiki/SpeciesRax) is capable of inferring a rooted species tree using multi-copy gene family trees (in this respect similar to the STAG algorithm implemented in OrthoFinder) under a model of gene duplication, transfer, and loss.

      I would suggest applying one of these methods to assess how robust your species tree topology is to choice of methods, as the inferred gene family evolutionary dynamics will naturally be sensitive to this.

    4. CAFE v. 5.0 (Mendes et al., 2020) was used to analyze gene family evolution under a phylogenetic framework.

      Could you briefly explain why this approach was chosen rather than species/gene tree reconciliation methods like ALE (https://github.com/ssolo/ALE) or GeneRax (https://github.com/BenoitMorel/GeneRax)?

      Because CAFE only models gene duplications and losses, it's unable to account for or infer potential horizontal gene transfer among your focal species, a process that could (and likely does) explain some proportion of the variation in gene copy number observed in the analyzed gene families.

      I would suggest briefly addressing these points, first in the methods section, and subsequently in the discussion. HGT is a plausible alternative hypothesis to gene duplication, and one that is sensible in the context of gene family-diversity associated increases in a species' capacity for hostplant range expansion.

    5. We additionally filtered OGs that showed high variance across species as they can lead to biases in gene family evolution estimation as well as preventing convergence among replicates in the analysis.

      This makes good sense, though I might suggest elaborating on what sorts of biases high variances might introduce (e.g. 'cascading' influence on inferred expansions on branches neighboring unusually high copy numbers?).

      Additionally, how exactly did you filter - did you calculate variance among species (or std. deviation for instance) and then filter based on some empirical threshold?

  19. Jun 2023
    1. The high number of protein families present only in symbiotic lineages that split from each other over 40 million years of evolution suggests convergent evolution due to the symbiotic lifestyle.

      How exceptional is this count of shared gene families among the two symbiotic lineages? Does it exceed what is shared between only (for instance) Ev and Po, or S1/2 and Po?

      I feel that convergence of the gene families shared between the two symbiotic clades is a hypothesis that should be tested, rather than assumed. An alternative explanation is that these are genes that arose once within S1, and were subsequently lost within Ev.

    2. These results suggest that in addition to convergent evolution in the symbiotic lineages, these lineages of Symbiodiniaceae have experienced a greater extent of pseudogenisation than has the free-living Ev.

      Again, I would just be cautious in the assumption of convergent evolution here, as this is a hypothesis that has not yet been explicitly tested.

    3. (A) Number of protein families is shown at each node and branch represents those that are shared among or specific to S1, Ev, S2, and/or Po. Number of families that are exclusive to Ev (green), to S1 (light blue), to Ev+Po (dark blue), to S2+S1 (yellow), and to S2+S1+Po (orange) were highlighted.

      This sub-plot would again support an interpretation in which (at least some/many) genes related to a symbiotic life history were gained once in S1, and subsequently lost in Ev.

    4. Incidentally, Po shared more protein families with symbiotic lineages (1,290; S1+S2+Po) than with Ev (221; Ev+Po);

      Okay, this addressed my comment above, but I would say that this supports an interpretation that the large numbers of genes shared between the symbiotic clades arose once in S1, but were subsequently lost in Ev, being retained in S2.

    5. The driving mechanism for this trend may reflect one of two evolutionary scenarios: (a) intron expansion in Ev, or (b) intron contraction in S1/S2.

      I think this is a case in which using explicit phylogenetic comparative methods might be useful. If all species were included in the species tree (rather than collapsed by clade as discussed above), and intron lengths were mapped onto the terminal branches of that phylogeny, ancestral character reconstruction could be performed to gain more explicit intuition as to whether independent expansions or contractions occurred, and how many times.

      This could feasibly be done using continuous measures of intron length, or alternatively by binning introns into "small" and "large" size categories (if intron length follows a bi-modal distribution, for example) and using a discrete model of character transitions to explicitly obtain counts of the number of transitions.
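
      For the binning route, here's a minimal sketch of how a data-driven "small"/"large" cutoff could be chosen rather than an arbitrary one, assuming intron lengths are roughly bimodal on a log scale; the lengths are simulated placeholders, and a two-component Gaussian mixture (via scikit-learn) is just one convenient way to find the break.

      ```python
      import numpy as np
      from sklearn.mixture import GaussianMixture

      # Simulated stand-in for log10 intron lengths pooled across species.
      rng = np.random.default_rng(0)
      log_len = np.concatenate([rng.normal(2.0, 0.2, 5000),   # "small" introns (~100 bp)
                                rng.normal(3.5, 0.3, 2000)])  # "large" introns (~3 kb)

      # Fit a two-component mixture and assign each intron to a size class.
      gmm = GaussianMixture(n_components=2, random_state=0).fit(log_len.reshape(-1, 1))
      labels = gmm.predict(log_len.reshape(-1, 1))
      small = int(np.argmin(gmm.means_.ravel()))
      is_large = labels != small

      # Approximate cutoff = smallest length assigned to the "large" component.
      print("cutoff ~", round(10 ** log_len[is_large].min()), "bp;",
            is_large.sum(), "large /", (~is_large).sum(), "small introns")
      ```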

    6. For the S1, S2, Ev, and Po groups of Suessiales, the mean of isolate-specific protein families, and the mean ± standard deviation of estimated genome sizes are displayed.

      Is there a reason why S1/S2 were collapsed, rather than showing all sampled Suessiales species in the species tree? It seems this is a loss of information that would be nice to see in this early figure.

      In particular, since Symbiodinium is not exclusively symbiotic, it would be informative to see how this life history is distributed within the group phylogenetically. Different phylogenetic histories and distributions of symbiosis could lead to alternative interpretations regarding the evolution of the life history, and consequently of the evolution of parasitism-associated genome size reduction.

  20. May 2023
    1. Alignment of protein sequences was performed with mafft using the “-auto” option and back translated to nucleotide alignments using pal2nal [53]. A phylogenetic tree was estimated using the JTTDCMut+F+I+G4 model in IQ-TREE v1.6.12 on the concatenated data of the same set of 188 orthologues protein clusters. This tree was then used to identify genes with significant evidences of relaxation or intensification of selection using the RELAX hypothesis testing framework in HyPhy package [54].

      Given that amino acid sequences were back translated to nucleotide alignments, it would be worth assessing to what extent the tests of relaxed selection may be sensitive to/impacted by alternative translated alignments. That is, each amino acid sequence has multiple alternative nucleotide alignments.

      I'm not certain how pal2nal works exactly, but from what I can tell it appears to generate a single translated alignment. Depending on how it "chooses" which alignment to generate, this could artificially inflate (or deflate) signal of relaxed or positive selection, as it may erode the very signals these tests use to fit model parameters.

  21. Apr 2023
    1. Instead, the bulk of the distribution fell close to the center of the triangle, revealing extensive ILS due to rapid diversification relative to the effective population size (10, 11). Thus, although well-supported statistically, the genome-wide tree is a very poor predictor of evolutionary relationships at any given genomic region.

      This is not terribly surprising to me, honestly, particularly given that it's known a priori that the two lineages are closely related and that hybridization is not uncommon. It may also be worth mentioning/emphasizing here that the phylogeny was inferred using concatenated SNPs. Whether using full sequences or SNP datasets, concatenation can often lead to inflated topological support.

      Have you considered using an approach like SVDquartets to infer a complementary phylogeny to the one inferred with RAxML? The approach is consistent with the multi-species coalescent, and treats sites as independent, inferring quartets at each position before amalgamating them into a full tree. Individuals could be pooled by sampling locality/species, enabling you to still infer a population-level tree, doing so under an independent phylogenetic model.

    2. Simulated distributions of weights. A greater opportunity for lineage sorting (i - iii) biases the distribution toward the topology that matches the demographic history. Incomplete lineage sorting yields genealogies that are a better fit to one of the discordant trees, but the distribution is always symmetrical between the left and right half triangles. Additional factors, including gene flow, create a bias toward one of the discordant genealogies (panels iv - vi).

      I'm super intrigued by the use/utility of simulation here. I know that these simulations were conducted using msprime, but I can't help but wonder about the inclusion of varying intensities of positive selection on loci in these simulations. This of course would increase the complexity of parameter space significantly with regards to potential simulations, but seems explicitly tied to the hypotheses being tested here.

      More generally, it seems to me that this framework could lend itself nicely to the application of machine or deep learning approaches for assignment of genomic windows to alternative evolutionary histories (e.g. migration, selection), as well as potentially even parameter inference, such as through the use of a CNN. Obviously this is outside of the scope of the current paper, but I'm just curious as to whether this is a thought/space you all have explored?
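
      To make that last thought slightly more concrete, here is a minimal PyTorch sketch (with entirely hypothetical shapes and class labels, and obviously not something I'm proposing for this paper) of the kind of CNN that could be trained on msprime-simulated genotype windows to classify them into alternative histories:

      ```python
      import torch
      import torch.nn as nn

      class WindowCNN(nn.Module):
          """Toy CNN that classifies a genotype window (haplotypes x SNPs) into one of
          three hypothetical evolutionary histories (e.g. the three topologies, or
          no-migration / migration / selection). All shapes are placeholders."""
          def __init__(self, n_classes=3):
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv2d(1, 32, kernel_size=(1, 5)),  # convolve along SNPs within each haplotype
                  nn.ReLU(),
                  nn.Conv2d(32, 32, kernel_size=(1, 5)),
                  nn.ReLU(),
                  nn.AdaptiveAvgPool2d((1, 1)),          # pool over haplotypes and positions
              )
              self.classifier = nn.Linear(32, n_classes)

          def forward(self, x):                          # x: (batch, 1, n_haplotypes, n_snps)
              return self.classifier(self.features(x).flatten(1))

      # e.g. 64 simulated windows of 40 haplotypes x 200 SNPs -> logits of shape (64, 3)
      logits = WindowCNN()(torch.randn(64, 1, 40, 200))
      print(logits.shape)
      ```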

    3. Topology weighting reveals genomic regions associated with reproductive mode.

      First off, I can't not compliment the "Twisst & Tern" pun... Excellent stuff.

      But more specifically, I just want to say I think the use of the ternary plot in combination with both empirical and simulated data to explore and quantify genealogical asymmetries is so clever. I doubt I'll be alone in saying that I think this type of approach has immense potential, not only for hypothesis testing, but even for parameter estimation.

    1. Curiously, a contraction in the Hemocyanin superfamily was only observed in H. aoede, the only Heliconius species not to feed on pollen in our data, marking hexamerins as a potential mechanistic link to the divergent strategies for nitrogen storage in pollen-feeding Heliconius (Fig. 5A).

      Interesting - this seems like a perfect complement for the analysis of Cocoonases later on! Would be very interesting to apply RELAX/aBSREL in the hypothesis-testing framework (i.e. foreground/background - discussed in a later comment) to this gene family to test the hypothesis that this gene-family contraction in H. aoede is paired with an increased intensity of positive selection as compared to other Heliconius butterflies.

      What are the members of the Hemocyanin superfamily that contracted within H. aoede? Are they the same as (or do they include) the two Hemocyanins that were found to have previously undergone expansions?

    2. Ancestral state reconstruction of genome size was assessed using The maximum likelihood method implemented in the R package phytools (68).

      Under what model specifically was ancestral state reconstruction (ASR) conducted? The ML method in phytools can conduct ASR under three models (Brownian Motion, Ornstein–Uhlenbeck, and Early Burst). If you have not already done so, I might suggest fitting these alternative models of trait evolution prior to conducting ASR under the best-fit model.

      Given the taxonomic scale, it might also be worth considering including fitting some multi-regime BM/OU models as implemented in OUwie: https://doi.org/10.1111/j.1558-5646.2012.01619.x.

    3. All sequences from each of the four clades were realigned separately and a number of tests implemented in HyPhy (overall ω; SLAC; aBSREL; RELAX). In particular, the sign of diversifying positive selection (aBSREL) was detected by scanning all internal branches of the whole cocoonases phylogeny, correcting for multiple tests using a final P-value threshold of 0.05.

      If I'm reading this correctly, it seems that for each analysis, all branches were tested for signatures of selection (with the exception of SLAC, which assumes selection is constant across the entire tree)? Although RELAX and aBSREL can be used to test each branch, both are significantly more powerful when used in a hypothesis-testing framework, assigning a subset of branches to the "background" (control) and the remainder to the "foreground" (treatment), and estimating parameters within each test set. The per-branch tests are generally considered to be most useful in exploratory use-cases.

      I think this use-case scenario is ideally suited to the hypothesis testing framework of these two models. Specifically, to test the hypothesis that selection on cocoonases intensified (or was relaxed) in H. aoede as compared to the rest of the Heliconius butterflies, you could treat H. aoede as the foreground, and the rest of Heliconius as the background for both RELAX and aBSREL. It's unclear whether cocoonases in Eueides or other species within Heliconiini should be experiencing the same/similar selection regime as in H. aoede, so I would suggest excluding those species from this specific analysis.
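
      If useful, here's a minimal sketch of how the foreground could be tagged, assuming (as I understand HyPhy's conventions) that RELAX/aBSREL accept branch sets specified as curly-brace labels in the Newick string; the file names and tip-name pattern are placeholders, internal branches of the H. aoede clade may also need labeling, and the HyPhy documentation should be checked for how a labeled set is then selected as the test partition.

      ```python
      import re

      # Append a {Foreground} label to every tip whose name contains "H_aoede"
      # (placeholder pattern -- adjust to the actual sequence names in the tree).
      newick = open("cocoonases.nwk").read()
      tagged = re.sub(r"(H_aoede[^:,()\s]*)", r"\g<1>{Foreground}", newick)
      open("cocoonases_foreground.nwk", "w").write(tagged)
      ```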

    4. optimizing parameters (Romain Derelle, personal communication) for a more reliable list of single-copy orthologous groups (scOGs).

      How might this approach to identify orthologs, which is tailored to the identification of single-copy orthologs, impact downstream inferences about patterns of gene-family evolution? Put another way, if the approach described here is more prone to recovering scOGs (or low-copy number OGs), then it seems likely that fewer multi-copy number (particularly large/high-copy number) OGs will be identified as a result. If so, the rate estimates of gene-family expansion/contraction using CAFE seem likely to be downwardly/upwardly biased respectively, as the majority of gene-families under consideration are particularly conserved with respect to copy-number.

      I think this possibility would be worth exploring/addressing. For instance, if the implementation of broccoli is relaxed so as to be less tailored to the recovery of scOGs, does the shape of the gene-family expansion/contraction distribution fundamentally change?
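
      One quick way to check this, assuming per-orthogroup gene counts from a default and a relaxed Broccoli run have each been tabulated into an orthogroup-by-species count table (the file names and layout here are placeholders; Broccoli's native output would need reshaping first):

      ```python
      import pandas as pd
      from scipy.stats import ks_2samp

      # Placeholder file names: orthogroups as rows, one column of gene counts per species.
      default_run = pd.read_csv("og_counts_default.tsv", sep="\t", index_col=0)
      relaxed_run = pd.read_csv("og_counts_relaxed.tsv", sep="\t", index_col=0)

      # Total family size per orthogroup under each clustering.
      size_default = default_run.sum(axis=1)
      size_relaxed = relaxed_run.sum(axis=1)

      print("mean family size:", size_default.mean(), "vs", size_relaxed.mean())
      print(ks_2samp(size_default, size_relaxed))  # has the size distribution shifted?
      ```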

    5. We illustrate how dense genomic sampling improves our resolution of gene-phenotype links, and our understanding of how genomes evolve.

      Really exciting work, and an immense amount of work - congratulations! This was a fun read!

      It seems that towards the conclusion of the paper, you outline how your data does not appear to be consistent with one of the prevailing hypothesized mechanisms for the digestion of pollen in Heliconius. I can't help but wonder if, using the data you present here, you might be able to point towards individual gene-families/orthogroups that are consistent with the evolution of this key-innovation?

    6. bar plots show total gene counts partitioned according to their orthology profiles, from Nymphalids to lineage-restricted and clade-specific genes.

      What does the distribution of mean per-species gene-count look like across all orthologs? Is there a small handful of very large orthogroups that are observed in all species and comprise the majority of the gene-counts shown here?

    7. Hemocyanins are also among several GFs with evidence of divergent selection regimes (ω) between Eueides and Heliconius, alongside Trypsins, Protein kinases, P450s, Sugar transporters, Ion and ABC transporters (Fig. 5B).

      What was the estimated ω for H. aoede at these gene-families as compared to the distributions estimated for Eueides and Heliconius? If involved in the evolution of pollen-feeding, I think we would expect ω to be more similar between H. aoede and Eueides than between H. aoede and the other Heliconius species?

    8. (A, B) The contribution of TEs and CDS to genome size variation across Heliconiinae, respectively.

      These two relationships should probably be assessed in a phylogenetic context, accounting for shared evolutionary history and non-independence among species (e.g. using phylogenetic generalized least-squares regression - PGLS). There is clear phylogenetic structuring in these two plots. For instance, Dione/Agraulis appear to have lower TE content for their genome sizes (shallower slope and intercept), whereas Erato seems to accumulate TEs more rapidly for a given genome size as compared to other clades. Raw points and the original regression slope certainly provide taxonomic intuition, but to summarize the relationship across the entire clade, a regression slope as inferred by PGLS (for instance) should be reported.
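
      In case it's useful, the estimator itself is simple once a phylogenetic covariance matrix is in hand; here is a minimal numpy sketch with simulated stand-ins (in practice C would come from the time-calibrated tree, e.g. via ape::vcv() in R, and y/x would be TE content and genome size):

      ```python
      import numpy as np

      # Hypothetical inputs: n species, C = phylogenetic (Brownian-motion) covariance
      # matrix, x = predictor, y = response. All values below are simulated stand-ins.
      rng = np.random.default_rng(1)
      n = 20
      C = np.eye(n) + 0.5                      # stand-in for a real phylogenetic VCV
      x = rng.normal(size=n)
      y = 0.8 * x + rng.multivariate_normal(np.zeros(n), C)

      X = np.column_stack([np.ones(n), x])     # intercept + slope design matrix
      Ci = np.linalg.inv(C)
      beta = np.linalg.solve(X.T @ Ci @ X, X.T @ Ci @ y)   # GLS (i.e. PGLS) estimate
      resid = y - X @ beta
      sigma2 = (resid @ Ci @ resid) / (n - X.shape[1])
      se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ Ci @ X)))
      print("PGLS intercept/slope:", beta, "SE:", se)
      ```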

    9. We next explored the relationship between transposable elements (TEs) and genome size, and their effect on gene architecture. We found that larger genomes tend to have a distribution of intron length skewed towards longer introns (fig. S16a), with a positive correlation between median intron length and total TE content (fig. S22; Pearson’s ρ=0.72; R2=0.51). Long introns also accumulate significantly more TEs then expected by their size (fig. S23; Wilcoxon rank-sum test P value = 2.13×10-13), with the effect of changing gene structure more than gene density (fig. S22 and fig. S24).

      It's difficult to contextualize these results without access to the supplementary materials (at the time I'm reading this), but it's unclear to me why the choice was made to subset introns into those that are short and long (as in Fig. 4A). I'm particularly wondering about the choice to divide introns as short/long at the median, which doesn't appear to sort them according to a natural break in the length-distribution. Dividing introns in this way, and then summing for each species and each 'TE size-class' runs the possible risk of exaggerating the observed difference in intron length and TE content.

      Additionally, the relationship between TE content and total intron length (regardless of how these traits are defined) should be analyzed in a phylogenetic context (e.g. using phylogenetic generalized least-squares regression - PGLS).

    10. From each OG in-paralogs were removed (custom python script RemoveInParalogFromTree.py available at https://francicco@bitbucket.org/ebablab/custum-scripts.git). If the procedure generated a single-copy OG (nscOGs) it was analysed by contrasting evolutionary pressures between Eueides and Heliconius species. The signature of selection (aBSREL) and relaxation (RELAX (114)) were performed as implemented in HyPhy.

      Could removal of in-paralogs influence/impact the signatures of selection that you recover using aBSREL and RELAX? It might be worth assessing whether these results are robust to the inclusion of in-paralogs.

      For instance, neofunctionalization of in-paralogs following duplication is likely to manifest signatures of positive selection (Fernández et al., 2020: https://doi.org/10.1093/molbev/msaa110, Wang et al., 2015: https://doi.org/10.1007/s11103-015-0285-2). Alternatively, pseudogenization of in-paralogs is expected to lead to a relaxation of selection on the duplicated gene-copy (e.g. Emerling 2018: https://doi.org/10.1016/j.ympev.2017.09.016; Calderoni et al., 2016: https://doi.org/10.1038/hdy.2016.59).

    1. Fifteen molecular model shifts are identified on total-clades estimated to have originated near the K–Pg boundary [43-46] (Figure 1, Supplementary Figure 1, 7a-d).

      I'm curious - If you were to simulate sequences/trees under a model of mass-extinction and early burst but constant substitution rates, how often (if at all!) would Janus infer substitution rate shifts near the extinction event?

      Could the extinction-driven bottleneck and heightened species turnover in the subsequent radiation amplify apparent differences in nucleotide composition even with molecular rates remaining constant? If so, could it be that alternative diversification dynamics/models impact the inference of substitution rate shifts?

    2. Our analyses reveal well-supported shifts in estimated equilibrium base frequencies across exons, introns, untranslated regions (UTRs), and mitochondrial genomes. Remarkably, model shifts are mostly constrained to previously hypothesized clade originations associated with the K–Pg boundary.

      Is there any chance that this pattern could, at least in part, be attributed to the relative paucity of branches in the tree prior to the K-Pg boundary?

      In other words, could the identification of molecular rate shifts by Janus be impacted by "sample size" differences of branches prior to and following mass extinctions as in this example?

    3. we find evidence that shifts in the mode of genomic evolution were very likely concurrent with shifts in the evolutionary optima θ(t) of important avian life-history traits (Figure 2), as well as shifts in metabolic allometric slope βmass and intercept β.

      Super cool - this is partly where/why the use of phylogenetic path analysis may be particularly informative - could these shifts in evolutionary optima have been associated with a shift in causal relationships among biological/life history features? Maybe it's possible to infer, at least in part, the causal mechanisms behind these shifts in metabolic allometry?

    4. Overall, these patterns support the hypothesis that molecular model shifts are indicative of evolutionary shifts in life-history θ(t).

      I'm not necessarily suggesting you do this, since this paper already represents an impressive effort, but I wonder whether the use of phylogenetic path analysis (https://doi.org/10.7717%2Fpeerj.4718) might be informative here?

      Within each inferred molecular model shift, you could test alternative causal models depicting the relationships between rates of molecular evolution and these natural history traits, as you might anticipate that many of these covary, themselves being causally related to one another. Inference of shifts in causal relationships among natural history traits in each clade that experienced shifts in rates of molecular evolution could then provide some intuition as to why!

    5. we have limited intuition about how the mode of molecular evolution may relate to life-history variation.

      I'm probably sounding like a broken record now (a good thing - this is super interesting to me, and it's exciting to be nearing a place where we can address this!), but I think analysis of alternative causal models here with phylogenetic path analysis would be an awesome way towards not only generating this intuition, but also explicitly testing these hypotheses!

    6. Figure 1.

      How feasible would it be to infer these deviations from equilibrium base frequencies from simulated data? For instance, simulating trees/sequences under constant substitution rates (e.g. using AliSim - https://doi.org/10.1093/molbev/msac092) but under a model of mass (selective) extinction and subsequent early burst of diversification (e.g. TreeSim - https://doi.org/10.1093/sysbio/syr029)?

      I ask because I wonder whether simulating under the hypothesized diversification model and constant rates of molecular evolution could provide an estimate of the false-positive rate with which Janus might infer these shifts and whether false positives tend to be concentrated around the time of the simulated mass extinction?

      I realize this is probably a big ask though - the paper is already super impressive! I think this analysis could just be particularly compelling evidence if in fact this concentration of events is greater than expected.

    7. The observation of a burst of molecular and quantitative trait model shifts within a ∼5 Ma interval of the K–Pg boundary (Figure 1) is likely not a coincidence

      That's one possibility - though it might be worth addressing/discussing the alternative hypothesis - that the pattern of mass (and presumably non-random with respect to body size) extinction and subsequent radiation (i.e. the early burst) could lead to surviving lineages exhibiting greater disparity in rates of molecular evolution and other life history traits.

      As I alluded to in my comment on figure 1, it might be worth (if feasible) explicitly assessing the probability of observing such a concentrated distribution of shifts near the time of a mass extinction event by simulating under a constant rate model of nucleotide substitution, and a mass extinction/early burst of speciation/turnover.

    1. This dataset was used as input in NgsDist (Vieira et al. 2015), where we specified a block size of 20 SNPs, and 100 bootstrap replicates. We then ran the NJ software fastme v2 (Lefort et al. 2015), merged all the 100 replicates into a final tree with bootstrap support using RAXML while specifying an optimization of branch-length, and specifying GTRCAT (GTR + Optimization of substitution rates + Optimization of site-specific) as the substitution model.

      I have a couple thoughts about this - is there a reason why a combination of fastme and RAXML was used as opposed to full likelihood inference using RAXML or IQ-TREE?

      I ask because both RAxML and IQ-TREE include methods (that I would suggest using here) for acquisition bias correction when using SNPs that could improve inference of both topology and branch-lengths.

      Another entirely different strategy to infer a species tree in a computationally efficient way using SNP data under the multispecies coalescent would be to use SVDquartets (Chifman and Kubatko 2014). This is potentially well suited to your dataset as it can leverage a large number of SNPs effectively, and even robustly handles linked sites. The caveat is that it cannot infer branch-lengths, but this might not be a concern for your use case.

    2. Phyluce allows extraction of UCEs, but because of the missing data due to low coverage and fragmented genomes, we were only able to retrieve 29 UCEs that were present in half of the dataset (16 of the UCEs were present in all individuals; and 23 UCEs present in ≥70 individuals).

      I may have missed it, but I don't believe it is mentioned in the methods what software was used for phylogenetic inference from the recovered UCEs? If my intuition is correct, I would guess IQ-TREE, the same as for the mitochondrial gene? It might be worth clarifying here.

      Related, was a single concatenated phylogeny used? Or did you infer gene trees as well? IQ-TREE is capable of easily inferring both (gene trees and a concatenated species tree - see here: http://www.iqtree.org/doc/Concordance-Factor). This would also enable you to infer multiple concordance factors, which could help to summarize genealogical concordance, and potentially even quantify per-site discordance using your concatenated SNP dataset?

    3. Recent genomic work has shown that co-occurring closely related species belonging to the Green ecomorph do not hybridize, and it has been argued that there may be some overlap in their ecological niches in the early stages of diversification, suggesting a possible avenue for the divergence of ecomorphs through character displacement upon secondary contact (Schluter 2000; Cotoras et al. 2018).

      I'm confused though how coexistence without hybridization of closely related species of the same ecomorph suggests character displacement?

      Maybe this is clarified later (e.g. some other feature of morphology/ecology other than coloration differs within ecomorph), but if so, it should be explained/mentioned earlier in the paragraph.

      For instance, do spiders vary morphologically within and among ecomorphs equally, more, or less?

      Given that coloration is the only phenotype explored here, it might be worth providing some additional context to this point - as written it suggests that the paper might later investigate character displacement, though the analyses are primarily genetic.

    4. Both these analyses benefited from a tree-backbone and we specified the tree obtained in ngsDist.

      Could you elaborate on why you chose to use this tree versus the UCE likelihood tree? Are results from Dsuite robust to the choice of species tree used?

    5. These species can be grouped into four ecomorphs (Figure 1 A-D), which are linked to the substrate they inhabit (Gillespie 2004): the Large Brown ecomorph is found on tree bark (Figure 1A), the Green ecomorph on leaves (Figure 1B), the Maroon ecomorph on mosses (Figure 1C), and the Small Brown ecomorph on twigs

      What is the extent/degree of within versus among ecomorph morphological variation? Is ecology and coloration the only characteristic that reliably distinguishes them?

    6. measured as R2, was above 0.11

      This is a very specific number to use as a filter. Was this determined based on patterns of LD-decay inferred from the steps above? I might provide some additional explanation as to why this was used as the LD-filter.

    7. There are mitochondrial-nuclear tree discordances based on topology and bootstrap support

      If I'm understanding the description of the UCE tree correctly, would it be correct to say that there is one additional mitonuclear discordance when considering the UCE tree? (The placement of T. anuenue)

    8. We also display K = 15 because that is the number of species in the dataset (Figure 3).

      Oh, I understand now! It might be worth adding this explanation in the methods? I would also suggest including a plot of LogLik for each value of K - it's difficult to evaluate how strong of a difference there is in model supports without it.

      However, it is somewhat surprising that these data would suggest there are only 2-3 clusters when there are 15 described species - this seems like quite the discrepancy.

    9. In the nuclear tree,

      Which nuclear tree? I see below that there is some discordance in topologies inferred using the two nuclear datasets (SNPs with fastme, UCEs with IQ-Tree), so it would be worth specifying here.

    10. minimum map quality of 30, minimum base quality of 20

      Just a suggestion, but it might simplify things for you if you had a section prior to any use of ANGSD describing the commonly used parameter settings across these analyses?

    11. Based on Patterson’s D statistics (Supplementary Figure 02) and F4-branch statistics, we found excess allele sharing between ecomorphs and within ecomorphs (Figure 4).

      It might be really interesting to see the distribution of D-statistics pooled into both within vs among species and ecomorph comparisons.

      I say this because it could be really informative to compare/rank the evidence for hybridization across levels - within/among species, as compared to within/among ecomorphs.
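
      A rough sketch of the kind of pooling I mean, assuming the Dsuite trio table includes P1/P2/P3 columns and a D-statistic column (please check your output's exact header; these column names are assumptions) plus a small, user-made metadata table mapping each individual to its species and ecomorph:

      ```python
      import pandas as pd

      trios = pd.read_csv("dsuite_trios_BBAA.txt", sep="\t")   # placeholder path
      meta = pd.read_csv("taxon_metadata.csv")                 # hypothetical columns: taxon, species, ecomorph
      eco = dict(zip(meta["taxon"], meta["ecomorph"]))
      spp = dict(zip(meta["taxon"], meta["species"]))

      # D reflects excess allele sharing between P2 and P3, so classify each trio by that pair.
      def level(row):
          if spp[row["P2"]] == spp[row["P3"]]:
              return "within species"
          return "within ecomorph" if eco[row["P2"]] == eco[row["P3"]] else "among ecomorphs"

      trios["level"] = trios.apply(level, axis=1)
      print(trios.groupby("level")["Dstatistic"].describe())
      ```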

    12. he maximum-likelihood UCE tree is topologically concordant with the ngsDist tree at the ecomorph level (Supplementary Figure 01).

      Could you clarify what you mean by this? It's difficult to intuit what this discordance looks like without the supplement being available. Regardless, it's nice to see that the discordance between the two topologies seems modest.

    1. Modelling morphological evolution by using stochastic processes is more intricate than modelling molecular sequence evolution because it cannot be assumed that the same evolutionary process is acting on all characters identically.

      Could this not be said about gene content evolution as well? Patterns and dynamics of gene family gain/loss are certainly expected to be a heterogeneous process - does the reversible binary substitution model allow for site heterogeneity?

    2. The phylogenetic analysis of gene content data utilises genome-derived proteomes and converts the presence or absence of gene families in the genomes of the terminals into a binary data matrix [9,25,37,38].

      This is the same type of gene-content encoding used by Ryan et al. 2013 as their second dataset, which coincidentally supported the hypothesis of Ctenophores as sister.

      I'm curious as to why an alternative recoding of these gene families was not considered - gene copy number. From the analyses conducted here to infer orthogroups using OrthoFinder, this encoding would be fairly straightforward to accomplish, and could perhaps be even more information rich than presence absence. Additionally this would be an entirely distinct way of encoding these data as compared to the approach employed by Ryan et al., (2013).

      Additionally, there exist now several methods that are capable of inferring species trees from multi-copy gene family trees, which are by design more robust to, and even improved by the existence of paralogs (e.g. Asteroid -https://doi.org/10.1093/bioinformatics/btac832, SpeciesRax - https://doi.org/10.1093/molbev/msab365, ASTRAL-Pro 2 - https://doi.org/10.1093/bioinformatics/btac620).
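
      For what it's worth, deriving such a copy-number matrix would be nearly free given the OrthoFinder output; here is a minimal sketch assuming the standard Orthogroups.GeneCount.tsv table, with an arbitrary cap of 10 copies to collapse rare, very large families into a single top state:

      ```python
      import pandas as pd

      # Orthogroups as rows, one column per species, plus a "Total" column.
      counts = (pd.read_csv("Orthogroups.GeneCount.tsv", sep="\t", index_col=0)
                  .drop(columns=["Total"], errors="ignore"))

      presence_absence = (counts > 0).astype(int)   # the binary encoding used here
      copy_number = counts.clip(upper=10)           # ordered states 0..10 as an alternative
      print(copy_number.T.head())                   # species x orthogroup character matrix
      ```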

    3. This is an impressive paper containing a significant amount of work, and I love the amount of effort focused on exploring how sensitive the results are to methodological approaches and parameter specifications. In particular, I'm delighted to see the exploration of how the choice of inflation parameter impacts downstream phylogenetic inference - given that this parameter has profound influence on all downstream analyses, I think all studies using these types of approaches should be similarly cautious/considerate.

      In general, I feel that the choice to encode 'gene content' as gene presence makes distinguishing these results from past efforts, particularly that of Pett et al., 2019, more challenging. That is, gene content is encoded in the same way across the two studies - and both the present study and that of Pett et al. come to the same conclusion that the Porifera-sister hypothesis is most strongly supported. Together, this makes these new phylogenetic analyses of gene content less compelling as corroborating evidence.

      I feel that an alternative encoding of gene content - gene copy number as an ordered, discrete character (e.g. 0, 1, 2, 3, etc.) - would have been valuable to explore. This type of encoding would be more information rich, and would be inherently distinct from past efforts. Alternatively, the authors could have used a growing number of methods that are capable of inferring species trees from multi-copy gene families (see later comment). I would hope to see some discussion of these alternatives, though I suspect implementation of either would be non-trivial.

    4. We assembled a large number of new gene content datasets (see Methods, Fig. 1) to extensively test the effect of different parameter combinations when identifying homogroups and orthogroups, because this crucial step remains a challenge [40,41] and may influence the outcome of the downstream phylogenetic analysis [42]. For example, state-of-the-art methods provide two parameters (the E-value [similarity] and I-value [granulation or inflation]) which have a direct impact on the inferred gene family assignment (E-value) and splitting of gene families into orthogroups (I-value).

      I am delighted to see such a thorough exploration of the sensitivity of phylogenomic inference to these parameters, as defaults are typically used without consideration but can, as you say, have profound influence on downstream outcomes.

    5. Considering that previous amino-acid alignment-based phylogenomic analyses showed model- and data dependency [e.g., 9,18], which therefore did not lead to conclusive results, alternative approaches might help to select between phylogenetic hypotheses.

      It's unclear a priori why this might not always be expected to be the case, even when encoding the data in other ways. Fundamentally, these are still evolutionary models.

    6. The second coded the presence/absence of orthogroups. When this second coding strategy is used, individual orthogroups within each protein family are treated as individual characters. This is the same strategy introduced and justified by Pett et al. [37]

      I think a graphical depiction of this may be useful. As written, it's unclear how you are encoding orthogroups within homogroups.

      Based on your description below, it appears to me that the only difference is in how MCL clustering is applied to the sequence similarities, with OrthoFinder2 used to infer orthogroups (i.e. normalizing similarity scores to account for sequence length) and homomcl used for homogroups. Your statement that orthogroups are nested within protein families doesn't appear to be consistent with this? Am I mistaken in my interpretation?

    1. Benchmarking Universal Single-Copy Orthologs - v 5.2.2 - metazoa odb10 - parameters: - protein

      Given that you are using transcriptome assemblies, and presumably amino acid sequences since you're using the "protein" method, it is important that some degree of protein redundancy reduction takes place prior to analysis with BUSCO (i.e. using CDHIT or retaining only the longest protein per assembled gene). Use of TransDecoder's '--single_best_orf' will still retain alternative isoforms, which will lead the count of "duplicates" reported by BUSCO to be inflated.
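
      For example, here's a minimal sketch of the 'longest protein per gene' route, assuming Trinity/TransDecoder-style identifiers (e.g. TRINITY_DN123_c0_g1_i2.p1, where dropping the isoform/ORF suffix leaves the gene ID); the file names and ID parsing are assumptions to adjust to your assemblies:

      ```python
      from Bio import SeqIO

      def gene_id(record_id):
          # "TRINITY_DN123_c0_g1_i2.p1" -> "TRINITY_DN123_c0_g1" (assumed naming scheme)
          return record_id.split(".")[0].rsplit("_i", 1)[0]

      longest = {}
      for rec in SeqIO.parse("transdecoder_proteins.fasta", "fasta"):
          g = gene_id(rec.id)
          if g not in longest or len(rec.seq) > len(longest[g].seq):
              longest[g] = rec

      SeqIO.write(longest.values(), "proteins.longest_per_gene.fasta", "fasta")
      ```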

    2. The optimal parameters for PhyloPyPruner were chosen by comparing the outcome when adjusting for minimum sequence length, long branch trimming factor, minimum support value, minimum number of taxa, minimum OTU occupancy, tree pruning method, and minimum gene occupancy. The optimisation script, including the tested parameter values, can be found in the supplementary material.

      The supplementary material hosted on bioRxiv does not seem to include these scripts/parameters.

      I'm also curious: on what basis were parameter combinations judged to be more optimal than others? I like that you explored how these parameters impacted phylogenetic inference, but it's unclear what the actual optimality criteria were.

    3. Then the appropriate transition matrix for ASR was determined by fitting MK-models with equal transition rates (ER), with symmetric transition rates (SYM), and with all transition rates different (ARD) and then evaluating the model fit using the corrected Akaike information criterion (AIC).

      Based on the supplement, it seems you defined these models as a set of three ordered, discrete states (i.e. A <-> B <-> C). Did you consider/fit another set of models where there could be a two-state jump (i.e. A <-> C)? These might be worth fitting/considering, as it's not inconceivable that such transitions might occur - likewise, this additional type of transition may be a bit less sensitive under scenarios of incomplete sampling with respect to the focal trait (i.e. in cases where transitions did in fact occur from A -> B -> C, but the intervening species in character state B were not sampled, or their trait data are missing).
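
      To be concrete about the extra model I mean, here are the two rate-index matrices side by side (written in numpy purely for illustration), following my understanding of the convention used by ape::ace / phytools::fitMk, where identical integers share a rate and 0 disallows a transition; the state names are placeholders for your three regeneration categories.

      ```python
      import numpy as np

      states = ["A_none", "B_restricted", "C_robust"]   # placeholder state names

      # Ordered model: only A<->B and B<->C transitions allowed (all rates different).
      ordered_ard = np.array([[0, 1, 0],
                              [2, 0, 3],
                              [0, 4, 0]])

      # Full model: additionally allows direct A<->C "jumps" (e.g. when intermediate
      # taxa in state B are unsampled or have missing trait data).
      full_ard = np.array([[0, 1, 2],
                           [3, 0, 4],
                           [5, 6, 0]])

      print(states, ordered_ard, full_ard, sep="\n\n")
      ```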

    4. Phylogenetic trees were constructed using IQ-TREE77 (version: 2.1.2, parameters: -m MFP -bb 1000 -bnni) or via ASTRAL78 (version 5.7.1), using standard parameter settings (S. Fig. 3a). The phylogeny combining triclads, mammals and nematodes was built following the same approach as for the planarian phylogeny.

      Are these phylogenetic trees the species trees, or the gene family trees?

      If the species tree, were the single-copy ortholog multiple sequence alignments for each orthogroup concatenated for inference with IQ-TREE, or analyzed using multiple partitions? Based on the parameters, this would suggest the former (concatenation), but I would suggest being explicit in the methods.

    5. On the other hand, transitions between “restricted regeneration” and “robust regeneration” appear to be limited to the Continenticola, but likely occur frequently and in both directions (particularly in the Planariidae; Fig. 3e and S. Table 3).

      So transitions between A & C were included in these Mk models? I suggest including these specifics in the methods.

    6. To comprehensively reconstruct the phylogeny of our planarian species collection, we used the pipeline cartooned in Fig. 3a to extract broadly conserved single-copy orthologues from our transcriptomes50–52.

      Here again, it is quite important that protein redundancy has been reduced (i.e. retaining only one protein per assembled gene), as even these methods to prune gene families down to single-copy orthologs will be more prone to inclusion of paralogs when alternative isoforms persist in a transcriptome.

      When running orthofinder (or any clustering-based ortholog identification method) on transcriptome datasets, performance will be contingent upon the filtering strategy used.

    7. he consistently high completeness and low fragmentation of BUSCO gene copies in our transcriptomes indicated a high assembly quality of our data set (Fig. 3b).

      I would suggest reporting all standard BUSCO estimates: Complete single copy, Complete duplicated, fragmented, and missing. Partitioning out into this resolution provides a more complete understanding of transcriptome/proteome completeness/quality.

      As I describe in my comment on the methods, it is important to reduce protein redundancy (using CDHIT, or retaining the longest protein per assembled gene) prior to running BUSCO - if you don't, the number of duplicated BUSCOs will be artificially inflated.

    8. Branch length comparisons with selected mammalian and nematode transcriptomes quantitatively confirmed the extreme divergence of planarian lineages, which exceeds the observed sequence divergence between mammalian species and is on par with that of nematodes (S. Fig. 3b).

      Was a second round of orthogroup inference conducted by including an additional set of mammalian transcriptomes?

      The methods used to perform this analysis are not described in the methods section - these should be included/described. Without inclusion of these methods, it's impossible to meaningfully understand/interpret these results.

    9. Transcriptome assembly was carried out with our established pipeline46.

      I strongly suggest providing specific details of what is involved in this pipeline. What method is used for assembly? Are any steps conducted for reducing the redundancy in these assemblies (e.g. using CD hit to remove transcripts that are highly similar in sequence)? As is, the reader needs to go to the article you cite here, and then follow a hyperlink: http://planmine.mpi-cbg.de/planmine/PlanMine_Help.html#assembly

      This is purported to include a description of the assembly pipeline, but is in fact a dead hyperlink.

      As is, there is no way to interpret the transcriptome assemblies produced here.

  22. Mar 2023
    1. Transcriptome assembly was carried out with our established pipeline46.

      I strongly suggest providing specific details of what is involved in this pipeline. What method is used for assembly? Are any steps conducted for reducing the redundancy in these assemblies (e.g. using CD hit to remove transcripts that are highly similar in sequence)? As is, the reader needs to go to the article you cite here, and then follow a hyperlink: http://planmine.mpi-cbg.de/planmine/PlanMine_Help.html#assembly

      This is purported to include a description of the assembly pipeline, but is in fact a dead hyperlink.

      As is, there is no way to interpret the transcriptome assemblies produced here.

    2. Branch length comparisons with selected mammalian and nematode transcriptomes quantitatively confirmed the extreme divergence of planarian lineages, which exceeds the observed sequence divergence between mammalian species and is on par with that of nematodes (S. Fig. 3b).

      Was a second round of orthogroup inference conducted by including an additional set of mammalian transcriptomes?

      The methods used to perform this analysis are not described in the methods section - these should be included/described. Without inclusion of these methods, it's impossible to meaningfully understand/interpret these results.

    3. On the other hand, transitions between “restricted regeneration” and “robust regeneration” appear to be limited to the Continenticola, but likely occur frequently and in both directions (particularly in the Planariidae; Fig. 3e and S. Table 3).

      So transitions between A & C were included in these Mk models? I suggest including these specifics in the methods.

    4. To comprehensively reconstruct the phylogeny of our planarian species collection, we used the pipeline cartooned in Fig. 3a to extract broadly conserved single-copy orthologues from our transcriptomes50–52.

      Here again, it will have been quite important that protein redundancy (i.e. retaining only one protein per assembled gene) has been reduced, as even these methods to prune gene families down to single copy orthologs will be more prone to inclusion of paralogs when alternative isoforms persist in a transcriptome.

      When running orthofinder (or any clustering-based ortholog identification method) on transcriptome datasets, performance will be contingent upon the filtering strategy used.

    5. he consistently high completeness and low fragmentation of BUSCO gene copies in our transcriptomes indicated a high assembly quality of our data set (Fig. 3b).

      I would suggest reporting all standard BUSCO estimates: Complete single copy, Complete duplicated, fragmented, and missing. Partitioning out into this resolution provides a more complete understanding of transcriptome/proteome completeness/quality.

      As I describe in the methods, it is important to reduce protein redundancy (using CDHIT, retaining the longest protein per assembled gene) prior to running busco - if you don't the number of duplicated buscos will be artificially inflated.

    6. Then the appropriate transition matrix for ASR was determined by fitting MK-models with equal transition rates (ER), with symmetric transition rates (SYM), and with all transition rates different (ARD) and then evaluating the model fit using the corrected Akaike information criterion (AIC).

      Based on the supplement it seems you defined these models as a set of three ordered, discrete states (i.e. A <-> B <-> C). Did you consider/did you fit another set of models where there could be a two-state jump (i.e. A <-> C)? These might be worth fitting/considering, as it's not inconceivable that such transitions might occur - likewise, this additional type of transition may be a bit less sensitive under scenarios of incomplete sampling with respect to the focal trait (i.e. in cases where transitions did in fact occur from A -> B -> C, but the intervening species in character state B were not sampled (or trait data is missing).

    7. Phylogenetic trees were constructed using IQ-TREE77 (version: 2.1.2, parameters: -m MFP -bb 1000 -bnni) or via ASTRAL78 (version 5.7.1), using standard parameter settings (S. Fig. 3a). The phylogeny combining triclads, mammals and nematodes was built following the same approach as for the planarian phylogeny.

      Are these phylogenetic trees the species trees? or all gene family trees?

      If the species tree, were these single-copy ortholog multiple sequence alignments for each orthogroup concatenated for inference with IQ-TREE, or using multiple partitions? Based on parameters, this would suggest the former (concatenation), but I would suggest being explicit in the methods.

    8. The optimal parameters for PhyloPyPruner were chosen by comparing the outcome when adjusting for minimum sequence length, long branch trimming factor, minimum support value, minimum number of taxa, minimum OTU occupancy, tree pruning method, and minimum gene occupancy. The optimisation script, including the tested parameter values, can be found in the supplementary material.

      The supplementary material hosted on biorxiv doesn not seem to include these scripts/parameters.

      I'm also curious: on what basis were were parameter combinations judged to be more optimal than another? I like that you explored how these parameters impacted phylogenetic inference, but it's unclear what the actual optimality criteria were.

    9. Benchmarking Universal Single-Copy Orthologs - v 5.2.2 - metazoa odb10 - parameters: - protein

      Given that you are using transcriptome assemblies and presumaby amino acid sequences since you're using the "protein" method, it is important that some degree of protein redundancy reduction takes place prior to analysis with BUSCO (i.e. using CDHIT or retaining only the longest protein per assembled gene). Use of transdecoders '--single_best_orf' will still retain alternative isoforms which will lead the count of "duplicates" by busco to be inflated.

  23. Feb 2023
    1. The second coded the presence/absence of orthogroups. When this second coding strategy is used, individual orthogroups within each protein family are treated as individual characters. This is the same strategy introduced and justified by Pett et al. [37]

      I think a graphical depiction of this may be useful. As written, it's unclear how you are encoding orthogroups within homogroups.

      Based on your description below, it appears to me that the only difference is in the implementation of MCL clustering to sequence similarity, with orthofinder2 used to infer orthogroups (i.e. normalizing similarity scores to account for sequence length), and homomcl for homogroups). Your statement that orthogroups are nested within protein families doesn't appear to be consistent with this? Am I mistaken in my interpretation?

    2. Modelling morphological evolution by using stochastic processes is more intricate than modelling molecular sequence evolution because it cannot be assumed that the same evolutionary process is acting on all characters identically.

      Could this not be said about gene content evolution as well? Patterns and dynamics of gene family gain/loss is certainly expected to be a heterogeneous process - does the reversible binary substitution model allow for site heterogeny?

    3. We assembled a large number of new gene content datasets (see Methods, Fig. 1) to extensively test the effect of different parameter combinations when identifying homogroups and orthogroups, because this crucial step remains a challenge [40,41] and may influence the outcome of the downstream phylogenetic analysis [42]. For example, state-of-the-art methods provide two parameters (the E-value [similarity] and I-value [granulation or inflation]) which have a direct impact on the inferred gene family assignment (E-value) and splitting of gene families into orthogroups (I-value).

      I am delighted to see such a thorough exploration of the sensitivity of phylogenomic inference to these parameters, as defaults are typically used without consideration but can, as you say, have profound influence on downstream outcomes.

    4. The phylogenetic analysis of gene content data utilises genome-derived proteomes and converts the presence or absence of gene families in the genomes of the terminals into a binary data matrix [9,25,37,38].

      This is the same type of gene-content encoding used by Ryan et al. 2013 as their second dataset, which coincidentally supported the hypothesis of Ctenophores as sister.

      I'm curious as to why an alternative recoding of these gene families was not considered - gene copy number. From the analyses conducted here to infer orthogroups using OrthoFinder, this encoding would be fairly straightforward to accomplish, and could perhaps be even more information rich than presence absence. Additionally this would be an entirely distinct way of encoding these data as compared to the approach employed by Ryan et al., (2013).

      Additionally, there exist now several methods that are capable of inferring species trees from multi-copy gene family trees, which are by design more robust to, and even improved by the existence of paralogs (e.g. Asteroid -https://doi.org/10.1093/bioinformatics/btac832, SpeciesRax - https://doi.org/10.1093/molbev/msab365, ASTRAL-Pro 2 - https://doi.org/10.1093/bioinformatics/btac620).

    5. Considering that previous amino-acid alignment-based phylogenomic analyses showed model- and data dependency [e.g., 9,18], which therefore did not lead to conclusive results, alternative approaches might help to select between phylogenetic hypotheses.

      Unclear a priori why this might not always be expected to be the case, even when encoding data in other ways. Fundamentally, these are still evolutionary models.

    6. This is an impressive paper containing a significant amount of work, and I love the amount of effort focused on exploring how sensitive the results are to methodological approaches and parameter specifications. In particular, I'm delighted to see the exploration of how the choice of inflation parameter impacts downstream phylogenetic inference - given that this parameter has profound influence on all downstream analyses, I think all studies using these types of approaches should be similarly cautious/considerate.

      In general, I feel that the choice to encode 'gene content' as gene presence makes distinguishing these results from past efforts, particularly that of Pett at al., 2019, more challenging. That is, gene content is encoded in the same way across the two studies - and both the present study and that of Pett et al., come to the same conclusion that the Porifera-sister hypothesis is most strongly supported. Together, it makes these new phylogenetic analyses of gene content less compelling as corroborating evidence.

      I feel that an alternative encoding of gene content - as gene copy number as an ordered-discrete character e.g. (0,1,2,3, etc) - would have been valuable to explore. This type of encoding would be more information rich, and would be inherently distinct from past efforts. Alternatively, the authors could have used a growing number of methods that are capable of inferring species trees from multi-copy gene families (see later comment). I would hope to see some discussion of these alternatives, though i suspect implementation of either would be non-trivial.

  24. Dec 2022
    1. he maximum-likelihood UCE tree is topologically concordant with the ngsDist tree at the ecomorph level (Supplementary Figure 01).

      Could you clarify what you mean by this? It's difficult to intuit what this discordance looks like without the supplement being available. Regardless, it's nice to see that the discordance between the two topologies seems modest -

    2. Based on Patterson’s D statistics (Supplementary Figure 02) and F4-branch statistics, we found excess allele sharing between ecomorphs and within ecomorphs (Figure 4).

      It might be really interesting to see the distribution of D-statistics pooled into both within vs among species and ecomorph comparisons.

      I say this because it could be really informative to compare/rank the evidence for hybridization across levels - within/among species, as compared to within/among ecomorphs.

    3. We also display K = 15 because that is the number of species in the dataset (Figure 3).

      Oh, I understand now! It might be worth adding this explanation in the methods? I would also suggest including a plot of LogLik for each value of K - it's difficult to evaluate how strong of a difference there is in model supports without it.

      However, it is somewhat surprising that that these data would suggest there are only 2-3 clusters, when there are 15 described species - this seems like quite the discrepancy.

    4. There are mitochondrial-nuclear tree discordances based on topology and bootstrap support

      If i'm understanding the description of the UCE tree correctly, would it be correct to say that there is one additional mitonuclear discordance when considering the UCE tree? (The placement of T. anuenue)

    5. Both these analyses benefited from a tree-backbone and we specified the tree obtained in ngsDist.

      Could you elaborate on why you chose to use this tree rather than the UCE likelihood tree? Are the results from Dsuite robust to the choice of species tree used? One quick way to check is sketched below.
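
      For example, you could rerun Dsuite with the UCE tree as the backbone and compare the per-trio statistics between the two runs. A hedged sketch (the file names and the Dstatistic/Z-score column names are assumptions, and trios may need re-matching if the backbone changes which taxon is placed as P1 vs P2):

        import pandas as pd

        # Assumed Dtrios outputs from two runs differing only in the backbone tree.
        run_ngsdist = pd.read_csv("dtrios_ngsdist_tree_BBAA.txt", sep="\t")
        run_uce = pd.read_csv("dtrios_uce_tree_BBAA.txt", sep="\t")

        # Match trios across runs and compare the D-statistics directly.
        merged = run_ngsdist.merge(run_uce, on=["P1", "P2", "P3"],
                                   suffixes=("_ngsdist", "_uce"))
        corr = merged["Dstatistic_ngsdist"].corr(merged["Dstatistic_uce"], method="spearman")
        n_disagree = ((merged["Z-score_ngsdist"] > 3) != (merged["Z-score_uce"] > 3)).sum()

        print(f"Spearman correlation of D across backbones: {corr:.3f}")
        print(f"Trios changing significance (Z > 3 cutoff): {n_disagree}")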

    6. measured as R2, was above 0.11

      This is a very specific number to use as a filter. Was it determined from the patterns of LD decay inferred in the steps above? I would provide some additional explanation as to why this threshold was chosen for the LD filter; a sketch of how LD decay could be used to justify it is below.
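
      Something along these lines could be used to show where r2 decays to background and hence motivate the 0.11 cutoff. The input file and the "dist"/"r2" column names are placeholders to be adjusted to the actual pairwise-LD output (e.g. from ngsLD):

        import pandas as pd
        import matplotlib.pyplot as plt

        # Assumed pairwise-LD table with columns "dist" (bp) and "r2".
        ld = pd.read_csv("pairwise_ld.tsv", sep="\t")

        # Bin pairs by distance and plot mean r2 per bin to visualize LD decay.
        ld["dist_bin"] = (ld["dist"] // 1000) * 1000          # 1 kb bins
        decay = ld.groupby("dist_bin")["r2"].mean()

        plt.plot(decay.index / 1000, decay.values)
        plt.axhline(0.11, linestyle="--", label="r2 = 0.11 filter")
        plt.xlabel("distance (kb)")
        plt.ylabel("mean r2")
        plt.legend()
        plt.savefig("ld_decay.png", dpi=300)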

    7. This dataset was used as input in NgsDist (Vieira et al. 2015), where we specified a block size of 20 SNPs, and 100 bootstrap replicates. We then ran the NJ software fastme v2 (Lefort et al. 2015), merged all the 100 replicates into a final tree with bootstrap support using RAXML while specifying an optimization of branch-length, and specifying GTRCAT (GTR + Optimization of substitution rates + Optimization of site-specific) as the substitution model.

      I have a couple of thoughts about this - is there a reason why a combination of fastme and RAxML was used, as opposed to full likelihood inference with RAxML or IQ-TREE?

      I ask because both RAxML and IQ-TREE include methods (which I would suggest using here) for ascertainment (acquisition) bias correction when using SNPs, which could improve inference of both topology and branch lengths (see the sketch at the end of this comment).

      Another, entirely different, strategy to infer a species tree in a computationally efficient way from SNP data under the multispecies coalescent would be to use SVDquartets (Chifman and Kubatko 2014). It is potentially well suited to your dataset, as it can effectively leverage a large number of SNPs and is robust to linked sites. The caveat is that it does not infer branch lengths, but that may not be a concern for your use case.
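
      To be concrete about the ascertainment-bias suggestion, something like the following could replace the fastme/RAxML combination (a sketch wrapping an IQ-TREE 2 call in Python; the input file name is a placeholder, and the +ASC model requires that invariant sites first be removed from the SNP alignment):

        import subprocess

        # SNP alignment with invariant sites removed (IQ-TREE's +ASC correction
        # will refuse alignments that still contain constant sites).
        alignment = "variable_sites_only.phy"   # placeholder file name

        # GTR with Lewis-type ascertainment bias correction, plus ultrafast bootstrap.
        subprocess.run(
            ["iqtree2", "-s", alignment, "-m", "GTR+ASC", "-B", "1000",
             "--prefix", "snp_asc"],
            check=True,
        )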

    8. Phyluce allows extraction of UCEs, but because of the missing data due to low coverage and fragmented genomes, we were only able to retrieve 29 UCEs that were present in half of the dataset (16 of the UCEs were present in all individuals; and 23 UCEs present in ≥70 individuals).

      I may have missed it, but I don't believe the methods mention what software was used for phylogenetic inference from the recovered UCEs. If my intuition is correct, I would guess IQ-TREE, the same as for the mitochondrial gene? It might be worth clarifying here.

      Relatedly, was a single concatenated phylogeny used, or did you infer gene trees as well? IQ-TREE can easily infer both (gene trees and a concatenated species tree - see here: http://www.iqtree.org/doc/Concordance-Factor). This would also enable you to infer concordance factors, which could help to summarize genealogical concordance, and potentially even quantify per-site discordance using your concatenated SNP dataset (a minimal sketch of that workflow is below).
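
      For reference, the concordance-factor workflow in the linked tutorial boils down to three IQ-TREE 2 commands; here they are wrapped in a minimal Python sketch (the loci directory name is a placeholder, and I'm assuming one alignment file per UCE locus):

        import subprocess

        loci_dir = "uce_alignments"   # placeholder: one alignment file per UCE locus

        # 1) Concatenated (edge-linked partition) tree with ultrafast bootstrap.
        subprocess.run(["iqtree2", "-p", loci_dir, "--prefix", "concat", "-B", "1000"],
                       check=True)

        # 2) Individual locus (gene) trees.
        subprocess.run(["iqtree2", "-S", loci_dir, "--prefix", "loci"], check=True)

        # 3) Gene and site concordance factors mapped onto the concatenated tree.
        subprocess.run(
            ["iqtree2", "-t", "concat.treefile", "--gcf", "loci.treefile",
             "-p", loci_dir, "--scf", "100", "--prefix", "concord"],
            check=True,
        )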

  25. Nov 2022
    1. we have limited intuition about how the mode of molecular evolution may relate to life-history variation.

      I'm probably sounding like a broken record now (in a good way - this is super interesting to me, and it's exciting to be nearing a place where we can address this!), but I think analysis of alternative causal models with phylogenetic path analysis would be an excellent way not only to generate this intuition, but also to explicitly test these hypotheses!

    2. The observation of a burst of molecular and quantitative trait model shifts within a ∼5 Ma interval of the K–Pg boundary (Figure 1) is likely not a coincidence

      That's one possibility - though it might be worth addressing/discussing the alternative hypothesis: that mass extinction (presumably non-random with respect to body size) and the subsequent radiation (i.e. the early burst) could leave surviving lineages exhibiting greater disparity in rates of molecular evolution and in other life-history traits.

      As I alluded to in my comment on Figure 1, it might be worth (if feasible) explicitly assessing the probability of observing such a concentrated distribution of shifts near the time of a mass extinction event, by simulating both under a constant-rate model of nucleotide substitution and under a mass extinction/early burst of speciation and turnover; a rough Monte Carlo sketch of the constant-rate null is below.
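
      Purely illustrative sketch of the constant-rate null: the branch table, the number of shifts, the observed count, and the K-Pg window below are all placeholders you would take from the empirical time-calibrated tree and the inferred shift configuration. The idea is simply to randomize shift positions proportional to branch length and ask how often they cluster near 66 Ma by chance:

        import numpy as np

        rng = np.random.default_rng(1)

        # Placeholder branch table: (older_end_Ma, younger_end_Ma) for every branch
        # of the time-calibrated tree; in practice read these from the phylogeny.
        branches = np.array([[100.0, 80.0], [80.0, 40.0], [80.0, 0.0], [40.0, 0.0],
                             [70.0, 62.0], [62.0, 10.0], [62.0, 0.0]])
        lengths = branches[:, 0] - branches[:, 1]

        n_shifts = 9                    # placeholder: number of inferred shifts
        observed_in_window = 6          # placeholder: shifts observed within 66 +/- 5 Ma
        window = (61.0, 71.0)

        def shifts_in_window(n):
            """Place n shifts uniformly over total tree length; count those in the window."""
            branch_idx = rng.choice(len(branches), size=n, p=lengths / lengths.sum())
            times = rng.uniform(branches[branch_idx, 1], branches[branch_idx, 0])
            return np.sum((times >= window[0]) & (times <= window[1]))

        null = np.array([shifts_in_window(n_shifts) for _ in range(10_000)])
        p_value = np.mean(null >= observed_in_window)
        print(f"P(>= {observed_in_window} shifts within 66 +/- 5 Ma under constant rate) = {p_value:.4f}")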

    3. we find evidence that shifts in the mode of genomic evolution were very likely concurrent with shifts in the evolutionary optima θ(t) of important avian life-history traits (Figure 2), as well as shifts in metabolic allometric slope βmass and intercept β.

      Super cool - this is partly where and why phylogenetic path analysis may be particularly informative: could these shifts in evolutionary optima have been associated with a shift in the causal relationships among biological/life-history features? It might be possible to infer, at least in part, the causal mechanisms behind these shifts in metabolic allometry.