8 Matching Annotations
  1. Jun 2025
    1. Very cool study! One suggestion: the pseudo-perplexity values still seem pretty high even after fine-tuning, which may indicate some degree of underfitting. This could be due to the relatively small size of the 35M ESM2 model. Have you considered trying a larger model (150M or 650M)? If fine-tuning a larger ESM2 model is computationally prohibitive, it might still be informative to compare against the zero-shot performance of a larger model to assess whether fine-tuning is necessary, or whether a larger baseline alone achieves comparable predictive results.
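
For concreteness, pseudo-perplexity is usually computed by masking each position in turn and exponentiating the negative mean log-probability of the true residue. A minimal sketch of that calculation, with a stand-in scoring function rather than the actual ESM2 masked-LM head (the `log_prob` callable and the toy uniform model are assumptions for illustration):

```python
import math

def pseudo_perplexity(seq, log_prob):
    """Mask each position in turn, sum the model's log-probability of the
    true residue, and exponentiate the negative mean (lower is better)."""
    total = 0.0
    for i, aa in enumerate(seq):
        masked = seq[:i] + "<mask>" + seq[i + 1:]
        total += log_prob(masked, i, aa)
    return math.exp(-total / len(seq))

# Toy stand-in model: uniform over the 20 standard amino acids.
# A real run would query an ESM2 masked-LM head here instead.
uniform = lambda masked, i, aa: math.log(1 / 20)

print(pseudo_perplexity("MKTAYIAKQR", uniform))  # ≈ 20 for a uniform model
```

Swapping the toy model for per-position log-probabilities from a 150M or 650M ESM2 checkpoint would give the zero-shot comparison suggested above.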

  2. May 2025
    1. Very cool study! It's great to see so many tools stitched together in such a purpose-built way. Have you thought about running your pipeline on other natural KSI homologs? It’d be interesting to see if, like in directed or natural evolution, certain starting points make it easier to explore sequence space or lead to better outcomes. This kind of pipeline seems like a great way to test that idea without requiring tons of experimental screening.

    2. To visually examine the sequence-function relationship of the characterized antibody variants, both a network plot and a phylogenetic tree were generated

      Given that your results clearly show a strong relationship between sequence similarity and binding affinity (in both the phylogenetic tree and the network analysis), did you consider alternative strategies for sequence encoding, in particular ones that might capture some of this evolutionary signal? For example, additional features derived from the phylogenetic tree, network-based distances, or embeddings from protein language models (like ESM)?

      These kinds of features might be especially valuable in a small-sample setting like this one and could further boost the predictive power of your models. Very nice study! Great to see creative and effective ways to leverage the power of small experimental datasets for protein function prediction.
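
The feature-combination idea above can be sketched end to end. The arrays here are random placeholders standing in for one-hot sequence encodings, mean-pooled pLM embeddings, and tree/network distances, and the closed-form ridge fit is just one reasonable choice of regressor for a small-sample setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # characterized variants (a small-sample setting)

# Random placeholders; in practice these blocks would come from one-hot
# sequence encodings, mean-pooled ESM embeddings, and phylogenetic or
# network-based distances to reference sequences.
onehot = rng.integers(0, 2, size=(n, 60)).astype(float)
plm_embed = rng.normal(size=(n, 32))
tree_dist = rng.normal(size=(n, 5))
y = rng.normal(size=n)  # measured binding affinities (placeholder)

# Concatenate the feature blocks and fit ridge regression in closed form:
# w = (X^T X + alpha * I)^{-1} X^T y
X = np.hstack([onehot, plm_embed, tree_dist])
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
preds = X @ w
print(X.shape, preds.shape)  # (40, 97) (40,)
```

Cross-validating with and without each feature block would show whether the evolutionary features add predictive power beyond the sequence encoding alone.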

  3. Mar 2025
    1. To compare the catalytic activity, designed monomers were expressed in BHK21 cells together with a tetracycline inducible green fluorescent protein (GFP) and a synthetic protein consisting of tetracycline-controlled transactivator (tTA) tethered via a linker containing the TEV endogenous catalytic site (ENLYFQ’S) to a transmembrane domain protein. The transmembrane domain protein fused tTA is localised to the plasma membrane, and thus the GFP signal is low in the absence of an active TEV protease, but an active protease cleaves tTA enabling its translocation to the nucleus and induction of GFP expression (Figure 4A).

      Very cool paper! Really great to see a (rare) comparison between all these different methods. I’m very interested in the experimental readout: do you have any thoughts on how the in-cell GFP assay might be influenced by factors like expression level, stability, or translational efficiency? Just curious whether you think those could affect the comparisons at all.

  4. Feb 2025
    1. This is a really cool approach to bringing biophysical information into protein mutation prediction. It does seem worth exploring whether including the evolutionary information gleaned from protein language models like ESM2 improves the performance of METL. Combining these two approaches seems like it has real potential to leverage different types of information. Have you thought about ways to use embeddings from models like ESM2 in the METL pretraining to try to improve generalizability? It would be cool to see whether these embeddings actually improve performance, especially with small experimental training sets. Great work!

    2. Very interesting work! I’m curious about the effects of using training data from multiple expression systems (bacteria, fungi, mammalian cells), particularly since expression requirements can vary slightly between organisms. Have you explored whether expression system-specific models perform better when predicting expression within a given system? Or, is the training data biased toward one particular expression system, potentially leading to worse predictions for others? Or has the model really learned general features of expression across these organisms? Great work!

  5. Dec 2024
    1. The green, blue and red lines highlights the same specific choices of ancestor used in Fig. 1.

      It would be helpful to define these colors in one or more of the figure captions. Currently they are defined only in the main text, not in the Fig. 1 caption, even though this passage points back to Figure 1.

    2. Our analysis shows that the amount of diversity at a given evolutionary time depends strongly on the ancestor, due to highly non-trivial epistatic dynamical correlations. More epistatically constrained ancestors give rise to less diversity, thus allowing for reconstruction over longer evolutionary times (Fig. 8a). Yet, at comparable amount of diversity, more epistatic ancestors are more difficult to reconstruct (Fig. 8b), at least using the FastML algorithm that neglects correlations between sites

      This is a really intriguing finding. It would be interesting to look at the posterior probabilities of the FastML-reconstructed ancestors for each level of epistasis. I am wondering whether the posterior probabilities reflect the uncertainty you are observing here, or whether ASR algorithms are blind to it (because they are blind to epistasis). In other words, do the more epistatic ancestors produce ML ASR sequences with lower posterior probabilities, or are the probabilities misleadingly high? If the latter, this work could have implications for the validity of ASR on sequences with high levels of epistasis, since the posterior probabilities are generally used as a measure of confidence in the reconstructions.
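
      That check could be run directly on FastML-style per-site marginal posteriors; a minimal sketch with a made-up posterior matrix (the numbers are illustrative, not real FastML output):

```python
import numpy as np

# Made-up per-site marginal posteriors over four candidate residues at
# three ancestral sites (each row sums to 1), FastML-style.
posterior = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.35, 0.15, 0.10],  # an uncertain (e.g. epistatically coupled) site
    [0.70, 0.20, 0.05, 0.05],
])

ml_probs = posterior.max(axis=1)   # posterior of the ML residue at each site
mean_confidence = ml_probs.mean()  # a common summary of reconstruction confidence
print(ml_probs, round(mean_confidence, 3))  # [0.9 0.4 0.7] 0.667
```

Comparing this mean ML posterior across ancestors with different levels of epistatic constraint would show directly whether the reported confidence tracks the true reconstruction difficulty.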