  1. Last 7 days
    1. First, the attainable Spearman correlation varies widely across prompts for the same assay: the gap between the best and worst prompt commonly exceeds 0.3. Second, the average variant log-likelihood also spans a broad range, and the optimal likelihood differs by assay.

      Is this a good candidate for distillation? It seems like it could lock in these performance gains without the heavy inference cost, and it might naturally solve the prompt sensitivity issues that warrant ensembling in the first place. Curious to hear your thoughts.

    2. During training, we randomized the order of sequences within each document to encourage invariance with respect to sequence order

      When creating the prompt for a given homolog set {H_i, ...}, the order of concatenation is randomized to promote homolog order invariance. But was invariance ever tested post-training? Specifically, did you guys quantify the variance in model output when the exact same set of homologs is simply re-ordered? Establishing this baseline seems critical to determine whether the performance gains from ensembling truly derive from aggregating diverse evolutionary information, or if they are partially an artifact of smoothing out the model's sensitivity to arbitrary input ordering.
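      To make this concrete, here is a minimal sketch of the kind of check I have in mind; `score_variant` is a hypothetical wrapper around your model that builds the prompt from a given homolog ordering and returns a variant log-likelihood:

      ```python
      import random
      import numpy as np

      def order_sensitivity(score_variant, homologs, variant, n_permutations=20, seed=0):
          """Score the same variant under random re-orderings of the same homolog set."""
          rng = random.Random(seed)
          scores = []
          for _ in range(n_permutations):
              shuffled = homologs[:]      # identical set, different concatenation order
              rng.shuffle(shuffled)
              scores.append(score_variant(shuffled, variant))
          scores = np.array(scores)
          # For a truly order-invariant model, the spread here should be ~0.
          return scores.mean(), scores.std()
      ```

      If the standard deviation from a check like this is non-negligible relative to the differences between prompts, then part of the ensembling benefit may just be averaging out ordering noise.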

    1. RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models

      I have a major concern regarding the dataset construction. My primary question is: why did you choose to generate synthetic sequences (and structures) instead of using natural homologs? Databases like CATH or SCOP are full of naturally occurring protein pairs that share a fold but have very low sequence identity. Using those would have grounded your benchmark in real biological evolution rather than generative noise.

      Regarding your use of the "twilight zone" concept: while your dataset technically hits the 26% identity mark, I feel this misrepresents what the term actually defines. The twilight zone describes evolutionary homology, i.e., sequences that have diverged over millions of years under selection and drift while maintaining structure. Your sequences, by contrast, are hallucinations from an inverse folding model running at high temperature. Generative variance is not the same as evolutionary divergence, and a pLM recognizing ProteinMPNN's output patterns is not the same as understanding structural conservation.

      Furthermore, relying entirely on synthetic validation creates a circular loop. You are testing whether a pLM can recognize sequences made by ProteinMPNN and "validated" by AlphaFold3, without any experimental ground truth that these sequences actually fold. And to be frank, it's straightforward to generate high-pTM AF3 predictions for sequences that don't fold: introduce a tryptophan mutation into your favorite protein, and its pTM will be almost unaffected, but good luck expressing and purifying it. I suspect a substantial fraction of your dataset doesn't fold in real life.

  2. Sep 2025
    1. MSAs can now be constructed in milliseconds [58]. As MSA generation methods continue to improve, models that efficiently leverage the rapidly growing set of available sequences, and thus richer evolutionary context, are well-positioned to advance protein language modeling toward a more sustainable future

      Totally agree, and it's great to see this properly leveraged in the model. At the same time, this got me thinking that not all MSAs are created equal. Scalable methods (e.g., HMM-based or k-mer–based approaches) produce alignments at the scale required for these models, but these are quite different from the phylogenetics-grade MSAs carefully curated for evolutionary inference, which often incorporate clade-specific substitution models, manual curation, etc.

      To me, this raises a question that I think deserves investigation: since the model was trained on cheap-to-make MSAs, would inference on the highest-quality MSAs improve the model's performance? Or, because such an MSA would represent a slight departure from the model's training distribution, should we expect the model to perform worse on this "superior" input?

  3. Aug 2025
    1. For a set of n sequences with predicted pairwise distances D_pred ∈ ℝ^(n×n) and true distances D_true ∈ ℝ^(n×n), we sample a set of quartets 𝒬 = {(i, j, k, ℓ)}. For each quartet, we compute three possible pairwise distance sums:

      How many quartet subsets are sampled per observation (tree) during training? Does this depend on the size of the tree?
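      For reference, my mental model of the quartet step is roughly the sketch below; the sampling scheme, and especially how many quartets are drawn per tree, is exactly what I am asking about:

      ```python
      import random
      import numpy as np

      def sample_quartet_sums(D, n_quartets, seed=0):
          """For sampled quartets (i, j, k, l), compute the three pairwise distance sums
          d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) used in the four-point condition.

          D is an (n, n) distance matrix (predicted or true).
          """
          rng = random.Random(seed)
          n = D.shape[0]
          sums = []
          for _ in range(n_quartets):
              i, j, k, l = rng.sample(range(n), 4)
              sums.append((D[i, j] + D[k, l],
                           D[i, k] + D[j, l],
                           D[i, l] + D[j, k]))
          return np.array(sums)   # shape (n_quartets, 3)
      ```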

    2. The input to Phyla is S with a [CLS] token concatenated in front of each tokenized sequence, s ∈ S: {[CLS]s1 ∥ [CLS]s2 ∥ [CLS]s3 ∥ … ∥ [CLS]sn}.

      So the sequences of a tree are concatenated together, and this concatenated token sequence is what the model operates on. I have two questions about this:

      How are the sequence embeddings calculated from the model output? Mean-pooling over the sequence's token positions, CLS token positions, etc? It would be nice to have this information in the text.

      Are the sequence embeddings invariant with respect to concatenation order? Since the concatenation order has no biological meaning, this seems important to demonstrate.
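      To make the questions concrete, this is roughly the extraction and check I have in mind; whether the per-sequence embedding comes from the [CLS] position or from mean pooling is the open question, so treat the choice below as a placeholder:

      ```python
      import numpy as np

      def sequence_embeddings(token_embeddings, cls_positions, lengths, use_cls=True):
          """Extract one embedding per sequence from the concatenated model output.

          token_embeddings: (total_tokens, d) output for [CLS]s1 || [CLS]s2 || ...
          cls_positions:    index of each sequence's [CLS] token
          lengths:          length of each tokenized sequence (excluding [CLS])
          """
          embs = []
          for cls_idx, L in zip(cls_positions, lengths):
              if use_cls:
                  embs.append(token_embeddings[cls_idx])
              else:  # mean pooling over that sequence's own token positions
                  embs.append(token_embeddings[cls_idx + 1 : cls_idx + 1 + L].mean(axis=0))
          return np.stack(embs)

      # Invariance check: re-run the model on a permuted concatenation, re-extract,
      # undo the permutation, and compare per-sequence cosine similarities.
      ```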

    1. Large Language Models are Locally Linear Mappings

      Really cool! I had a few questions to help my understanding.

      If F is your network, and J is your detached Jacobian, the paper's title "Large Language Models are Locally Linear Mappings" is basically the equation F(x) = J(x) x, right? This has me wondering the following: If J' is your true Jacobian, what does J'(x) x equal?

      If I understand correctly, each layer in the original transformer block reduces to a linear operator in the Jacobian, right? Is there a way to associate specific parameters in the transformer with specific coefficients in the Jacobian?
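      For what it's worth, here is the toy experiment I ran while thinking about this, with a tiny MLP standing in for a transformer block; the "detached" construction below is my own simplification of the idea, not the paper's actual implementation:

      ```python
      import torch

      torch.manual_seed(0)

      # Stand-in network: F(x) = W2 * gelu(W1 x), no biases.
      F = torch.nn.Sequential(
          torch.nn.Linear(8, 16, bias=False),
          torch.nn.GELU(),
          torch.nn.Linear(16, 8, bias=False),
      )
      x = torch.randn(8)

      # True Jacobian J'(x): since GELU is not 1-homogeneous, J'(x) x != F(x) in general.
      J_true = torch.autograd.functional.jacobian(F, x)
      print(torch.allclose(F(x), J_true @ x))                     # typically False

      # "Detached" linearization: freeze the input-dependent GELU gate at x, so the
      # whole network collapses to one matrix J with F(x) = J x exactly at this point.
      with torch.no_grad():
          h = F[0](x)
          gate = torch.nn.functional.gelu(h) / h                  # elementwise gate, fixed at x
          J_detached = F[2].weight @ torch.diag(gate) @ F[0].weight
          print(torch.allclose(F(x), J_detached @ x, atol=1e-5))  # True by construction
      ```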

      A short distance away in the input embedding neighborhood, the detached Jacbian will be extremely different because the manifold is highly curved.

      (There is a small typo here (Jacbian -> Jacobian)). I think our offline conversation has primed me to mention this, but perhaps this chaotic-like behavior could explain why the reconstruction error for fp16 is ~1000x worse than for fp32.

  4. Jun 2025
    1. Bio2Token: All-atom tokenization of any biomolecular structure with Mamba

      This work was really interesting to me, although significantly outside my domain of expertise. For that reason, I hope you'll excuse this idea if it seems nonsensical or unlikely to work.

      As I read your paper, I was thinking about potential use cases for such an accurate structure tokenizer, and MD came to mind. From what I've heard, all-atom simulations, especially with solvent, are extremely expensive to run.

      Given this, my potentially harebrained idea was to train a model that ingests the tokenized structures for frames (i-2, i-1, i) of an MD simulation, with the objective of predicting frame i+1.
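      Something like the rough sketch below, where the codebook size, tokens per frame, and the plain transformer backbone are all placeholders I made up for illustration:

      ```python
      import torch
      import torch.nn as nn

      class NextFrameModel(nn.Module):
          """Toy sketch: predict the structure tokens of frame i+1 from frames i-2, i-1, i."""
          def __init__(self, vocab_size=4096, d_model=256, tokens_per_frame=512):
              super().__init__()
              self.tokens_per_frame = tokens_per_frame
              self.embed = nn.Embedding(vocab_size, d_model)
              layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=4)
              self.head = nn.Linear(d_model, vocab_size)

          def forward(self, frame_tokens):
              # frame_tokens: (batch, 3 * tokens_per_frame), frames i-2, i-1, i concatenated
              h = self.encoder(self.embed(frame_tokens))
              # read the prediction for frame i+1 off the last frame's positions
              return self.head(h[:, -self.tokens_per_frame:, :])  # (batch, tokens_per_frame, vocab)

      # model = NextFrameModel()
      # logits = model(torch.randint(0, 4096, (2, 3 * 512)))      # -> (2, 512, 4096)
      ```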

      Have you guys considered the practicality of this? I would enjoy hearing your thoughts.

      Thanks again for the contribution.

  5. May 2025
    1. where the count matrix C ∈ ℝ^((M+1)×(M+1)) is constructed by repeating the count vector c = (1, c1, c2, …, cM) across all rows.

      Employing the target 'counts' to define an attention bias introduces apparent circularity, since these 'counts' are precisely what the model aims to predict. This poses a challenge for inference: how would the model predict gene expression levels if the attention bias 'C' must be defined using these same, yet-to-be-predicted, expression levels?

  6. Apr 2025
    1. ATTENTION-BASED ARCHITECTURE FOR G-P MAPPING

      The model is a stack of attention layers, but I was surprised to see it omit all the typical components that brought attention into the limelight via transformers: multi-head attention, residual connections, layer norm, and position-wise FFNs. These have become standard and widely adopted, largely for good reason, as they've been shown to be very effective across many distinct domains.

      Was there a particular reason this specific custom architecture was preferred over implementing or at least comparing to a standard transformer encoder?
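      To be concrete, the baseline I have in mind is essentially the off-the-shelf encoder below (dimensions are arbitrary placeholders, and the prediction head would be whatever you already use):

      ```python
      import torch.nn as nn

      d_model, n_heads, n_layers = 128, 8, 4

      # Standard transformer encoder: multi-head attention, residual connections,
      # layer norm, and position-wise FFNs all come for free.
      encoder_layer = nn.TransformerEncoderLayer(
          d_model=d_model,
          nhead=n_heads,
          dim_feedforward=4 * d_model,
          batch_first=True,
      )
      baseline_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
      # Input: (batch, n_loci, d_model) genotype embeddings; output: same shape.
      ```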

    2. Genotype vectors are converted to one-hot embeddings X(g) and transformed into d-dimensional embeddings Z(g)

      Constructing X^(g) is an extremely expensive way to associate an embedding with each locus. You should simply use a lookup table (i.e. nn.Embedding).
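      i.e., something along these lines (dimensions are placeholders; the two routes produce identical embeddings, but the lookup never materializes the one-hot tensor):

      ```python
      import torch
      import torch.nn as nn

      n_loci, n_alleles, d = 1_000, 3, 64
      genotypes = torch.randint(0, n_alleles, (32, n_loci))         # (batch, loci)

      embedding = nn.Embedding(n_alleles, d)
      Z = embedding(genotypes)                                       # (batch, loci, d), direct lookup

      # One-hot route: builds X^(g) explicitly just to select rows of the same weight matrix.
      X = torch.nn.functional.one_hot(genotypes, n_alleles).float()  # (batch, loci, n_alleles)
      Z_onehot = X @ embedding.weight                                # (batch, loci, d)
      assert torch.allclose(Z, Z_onehot)
      ```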

    1. Even when only the first layer (immediately after the embedding layer) is unfrozen, it can still influence the subsequent layers, enabling the model to produce informative embeddings for the regression head at the final layer

      This is fascinating! I wonder how the perplexity of the pre-training task is affected by which layer you choose to unfreeze.

    2. The ESM-Effect Architecture thus comprises the 35M ESM2 model with 10 of 12 layers frozen and the mutation position regression head (cf. Figure 2). The model’s performance is driven by two key inductive biases in the regression head:

      How would you extend this combined head architecture (mutation position embedding + mean pooled) if you were looking at the effect of a multi-mutation variant?

      One strategy I can think of would be to slice out all mutation positions and pool them. I'm wondering if you guys thought about generalizing the architecture to scenarios where the number of mutations in your DMS dataset varies.
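      Roughly what I am picturing (a hypothetical generalization of your head, not the paper's actual architecture):

      ```python
      import torch

      def pool_mutation_positions(token_embeddings, mutation_positions):
          """Mean-pool per-residue embeddings over an arbitrary number of mutated positions.

          token_embeddings:   (batch, seq_len, d) from the partially unfrozen ESM2
          mutation_positions: list of index tensors, one per sequence (variable length)
          """
          pooled = [token_embeddings[b, idx, :].mean(dim=0)
                    for b, idx in enumerate(mutation_positions)]
          return torch.stack(pooled)   # (batch, d), a drop-in for the single-position embedding

      # This vector could then be concatenated with the mean-pooled sequence embedding
      # exactly as in the single-mutation regression head.
      ```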

    3. Figure 1:

      Seeing the pre-training validation perplexity in (1) made me wonder: did you ever assess how fine-tuning affected post-training perplexity? This could be a proxy to gauge how disruptive fine-tuning is for the pre-training task.

  7. Feb 2025
    1. Protein Language Model Fitness Is a Matter of Preference

      I really enjoyed reading your paper and thought it contained many interesting and insightful gems.

      • As someone who has calculated many PLLs, which take time and money, I was very interested in your O(1) method for PLL.
      • The predictive power being predicated on wildtype PLL is a very important result.
      • I found Figure 5 to be a beautiful illustration of how homology in training data influences preference.
      • In Figure 6, it was incredible to see just how much the Spearman can be increased for the low-likelihood DMS datasets. And surprising to see that low-likelihood DMS datasets do worse. Clearly there is more to learn.

      More broadly, I would be curious to hear your thoughts on alternative PLM training objectives. Specifically, I'm interested in approaches that maintain the BERT-style masked language modeling objective while incorporating additional training signals. One key idea would be to include explicit feedback about sequence fitness ('good' vs 'bad' sequences) alongside the traditional masked prediction task.

      This approach could help move away from preference-oriented behavior. When models are trained solely on naturally occurring proteins, they implicitly learn that all training examples represent 'good' or 'valid' proteins. By incorporating direct fitness measurements as an additional training objective, we could potentially guide the model to learn more nuanced distinctions between functional and non-functional sequences, rather than simply modeling the distribution of extant proteins.
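      As a very rough sketch of the kind of joint objective I mean (entirely hypothetical, with an arbitrary weighting term):

      ```python
      import torch
      import torch.nn.functional as F

      def joint_loss(mlm_logits, mlm_targets, fitness_pred, fitness_labels, alpha=0.1):
          """BERT-style masked-token loss plus an auxiliary fitness term.

          mlm_targets uses -100 at unmasked positions (standard ignore_index);
          fitness_labels could be binary good/bad calls or continuous assay values.
          """
          mlm = F.cross_entropy(
              mlm_logits.view(-1, mlm_logits.size(-1)),
              mlm_targets.view(-1),
              ignore_index=-100,
          )
          # Treating fitness as a per-sequence regression target here; a binary
          # good-vs-bad label would swap this for binary cross-entropy.
          fitness = F.mse_loss(fitness_pred.squeeze(-1), fitness_labels)
          return mlm + alpha * fitness
      ```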

      Thanks again for the insightful paper.

    2. Naive usage of sequence databases and scaling will magnify the biases in training data, leading to miscalibrated preferences

      This study does a great job illuminating this. Do you guys foresee a method for creating a more balanced, and less biased training dataset? It seems there is an opportunity to do more with less.

    3. Unlike autoregressive language models, masked language models don't have a natural way to immediately compute the joint likelihood of a sequence. As a result, Wang and Cho (2019) proposed to mask every index of a sequence one-at-a-time and then average to derive a PLL. This formulation suffers from the need to run 𝒪(L) forward passes to compute a perplexity or log likelihood. In response to this, the community only considers autoregressive pLMs when computing fitness values for proteins containing insertions or deletions.

      There is a lot of overlap between this paragraph and the next.

    1. With a smaller training size of ∼1M examples and just a single GPU, training times ranged from 6-26 hours for 100 epochs for most proteins (4 to 16 minutes per epoch). Pretraining METL-Global with 20M parameters took ∼50 hours on 4x A100s and ∼142 hours with 50M parameters.

      Given the performance, I'm impressed with the affordability of the models' pre-training.

      In a world of foundation models that cost millions of $ to train, I think it's definitely worth mentioning the frugality of these models in the discussion (if not already mentioned).

  8. Dec 2024
    1. We have evaluated the performance of ESM2 embeddings across various model sizes (from 8 million to 15 billion parameters) in transfer learning tasks on a wide range of different biological datasets

      I think the diversity of regression tasks lends a lot of robustness to your conclusions. However, I think you're using the term "transfer learning" rather narrowly, specifically referring to prediction tasks where either a value or a vector is predicted for each sequence.

      There are many classes of transfer learning tasks, like sequence labeling, token classification, all sequence-to-sequence tasks, etc. I think being more specific about the type of transfer learning you guys are making claims about would make your conclusions more accurate.

    2. Even though these models were also pretrained with a maximum sequence length

      Technically, ESM2 is trained using sequences longer than 1022, but a length-1022 subsequence is sampled whenever such a sequence is selected for a training batch.

    3. Mean reduction in R2 when embeddings are compressed with methods other than mean pooling. A) Results for DMS data. B) Results for diverse protein sequences (PISCES data). In all cases, the y-axis represents different compression methods and the x-axis shows the resulting difference in R2. Dots represent the fixed effects estimates from mixed-effects modeling, and error bars represent 95% confidence intervals.

      This analysis comparing pooling methods was very informative, but it left me wondering how mean pooling compares to no pooling at all. Is this something y'all considered? It would be interesting to compare the R2 of a more sophisticated transfer learning model that ingests the raw embeddings (like a basic FCN). Though an apples-to-apples comparison might be hard to create, it would be useful to know the "cost" of mean pooling by observing the extent to which raw embeddings outperform mean pooling (if at all?).
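      As one example of what a "no pooling" baseline could look like, here is a light attention-pooling head over the raw per-token embeddings (my own construction, not something evaluated in the paper):

      ```python
      import torch
      import torch.nn as nn

      class AttentionPoolRegressor(nn.Module):
          """Regression head that ingests raw per-token embeddings instead of their mean.

          A learned query attends over the (padded) token embeddings, so positions can be
          weighted unequally, which is exactly what mean pooling cannot do.
          """
          def __init__(self, d_embed=1280):   # 1280 is just a placeholder embedding size
              super().__init__()
              self.query = nn.Parameter(torch.randn(d_embed) / d_embed**0.5)
              self.out = nn.Linear(d_embed, 1)

          def forward(self, token_embeddings, padding_mask):
              # token_embeddings: (batch, seq_len, d); padding_mask: (batch, seq_len), True = pad
              scores = token_embeddings @ self.query                 # (batch, seq_len)
              scores = scores.masked_fill(padding_mask, float("-inf"))
              weights = scores.softmax(dim=-1).unsqueeze(-1)         # (batch, seq_len, 1)
              pooled = (weights * token_embeddings).sum(dim=1)       # (batch, d)
              return self.out(pooled).squeeze(-1)
      ```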

    1. We performed this experiment for the “no-cycle” and “1-step recycle” settings, and found that the number of candidates with the PFAM domain retained after recycling was significantly higher at 68/190 than the candidates generated without doing any recycling (50/190)

      This is cool--and to me--surprising. Is there any theory proposed to explain why this might be (for any of Raygun/ESMFold/AlphaFold), or is this simply an empirical observation?

    2. To adjust for length variations, we updated the pLL score to make it length-invariant. The updated pLL_invar(S), where S is the input sequence, becomes

      Did you guys consider using pseudo-perplexity?

      PPPL = exp(-PLL / length)
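      i.e., for a masked LM, something along the lines of the standard O(L) one-position-at-a-time formulation below (with a generic `model` that returns logits; purely to pin down the definition I mean):

      ```python
      import math
      import torch

      @torch.no_grad()
      def pseudo_perplexity(model, token_ids, mask_token_id):
          """PPPL = exp(-PLL / L), where PLL is the summed masked log-probability."""
          pll = 0.0
          L = token_ids.shape[1]
          for i in range(L):
              masked = token_ids.clone()
              masked[0, i] = mask_token_id
              logits = model(masked)                          # (1, L, vocab_size)
              log_probs = logits[0, i].log_softmax(dim=-1)
              pll += log_probs[token_ids[0, i]].item()
          return math.exp(-pll / L)
      ```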

    3. Let p′′ be another Raygun sequence obtained from p of length n′ < n. Then the total and constituent losses become

      It would be nice if this notation matched the definition of the losses that are itemized above.

    4. Replicate Loss (Lrp): Let a protein p has a length n and suppose we generated a new protein p′ of length n′ using Raygun. Then, this loss is designed to ensure that the fixed-length embedding of p is close to that of p′

      How is n' determined, and why was it only trained to be less than n, rather than greater?

      My thinking is that varying n' throughout the training process could help achieve more robust self-consistency of the embedding space.

    5. This layer performs the length transformation by broadcasting each column vector of the fixed-dimensional representation, ensuring that the resulting combined embeddings have the desired length

      A question to test my understanding: are insertions/deletions distributed equally across the blocks? Here's an example to explain what I'm asking.

      To make things simple, let's say n is a multiple of K=50. If a miniaturized protein of n-50 is decoded, is it guaranteed that each block samples 49 residue positions?

    6. Miniaturizing, Modifying, and Augmenting Nature’s Proteins with Raygun

      The authors develop a novel approach to template-guided protein design called Raygun. Raygun is an encoder-decoder model, where proteins are encoded as a fixed-size pLM representation, regardless of length. From this length-agnostic representation, sequences of user-specified length can be decoded. This sequence generation methodology has many unique advantages for sequence design, which the authors describe well.

      My comments and questions are inlined throughout the text. Thanks for the wonderful study.

    7. These results suggest that Raygun’s fixed-length representations not only retain but potentially refine the structural information present in ESM-2 embeddings

      Very cool. I've suspected mean pooling to be overly reductive, but of course raw embeddings have mismatched dimensions. So it's pretty awesome to see how well the compromise of pooling within blocks works.

  9. Sep 2024
    1. Functional protein mining with conformal guarantees

      I found this study very interesting, and despite my limited knowledge of pLMs and conformal statistics, I have a few comments about the results pertaining to section 3.1. Perhaps my comments may provide a data point on how non-experts may engage with the paper. Please feel free to take or leave any of my suggestions/remarks.

      I really like the approach of establishing conformal guarantees for all the reasons stated in the introduction. I especially liked the generality with which the application of conformal statistics to this problem is presented, and that it was made clear that an explicit "non-goal" of the study was to demo a new machine learning model for enzyme classification.

      While reading, I kept thinking about the fact that members of a Pfam domain do not necessarily share the same biochemical function. This is because less than 0.1% of protein functional annotations are linked to experimental evidence (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9374478/) and the rest--the vast and overwhelming majority--are annotated transitively based on similarity scores of some kind.

      With that in mind, I think the authors could do better to point out that the ground truth upon which their terms FP, TP, and FDR are defined is itself a proxy for shared function. I don't believe this at all detracts from the results of the paper, but pointing out these assumptions would increase the trust of readers who question what you mean by terms like conformal "guarantees" and "true" positives. My apologies if you already explained this somewhere and I missed it.

      Since JCVI Syn3.0 was published in 2016, it would be interesting to see whether the traditional search methods (BLAST & HMMSearch) still yield 20% unknown function, or whether our annotations have since improved.

      It would also be interesting to see if the Protein-Vec hits in the Syn3.0 case study that don't exceed lambda are systematically "worse" than the true positives, for example as measured by TM-score.

      Thanks again for putting out this interesting study.

  10. Jul 2024
    1. Structure-aware protein sequence alignment using contrastive learning

      I found this study very interesting and creative! Fine-tuning the embedding space to account for structural similarity via contrastive learning seems like a wonderful idea and the results are very impressive. Here are some of my thoughts about your paper, presented in no particular order. Please feel free to take or leave any of my suggestions.

      One advantage of CLAlign compared to structural aligners is that you don't need to calculate structures. However, the hardware requirements for CLAlign are probably non-trivial, since pLM embeddings have to be calculated. Hardware requirements are missing from the manuscript so it's hard to know. Relatedly, there is no information provided about the speed of CLAlign. I think the manuscript should be expanded to include detailed runtime statistics and hardware requirements so that CLAlign can be better benchmarked against the other tools.

      While Table 1 gives us an overall picture of the alignment quality, it would be nice to know the tool's strengths and weaknesses. How does it perform when sequences are distant homologs? Or when there are large length mismatches? Since embedding-based alignments are state-of-the-art, this kind of information would be broadly useful for readers.

      Figure 1 looks more like a draft than a complete figure. And without a caption it doesn't make sense.

      The performance is very impressive, and it has me curious how much further the performance could be improved simply by increasing the epochs or training dataset. Visualizing the loss curve could help contextualize the performance and help readers understand the extent to which there is room for improvement.

      Small notes:

      • Throughout the manuscript, consistent reference to pLMs is made without any specificity. But there are many different architectures, e.g. BERT, T5, autoregressive, etc. I found this confusing.

      • There are many grammatical mistakes. Consider passing the manuscript through a grammar checker.

      Final thoughts:

      Great work! I am curious to try CLAlign once it is made available.

  11. Jun 2024
    1. The right panel shows the cumulative TM-score plotted against runtime in seconds

      My apologies if I missed this, but I was expecting to find a section in the Methods that explained what hardware was used for the right panels. In particular, I was curious whether GTalign was run in CPU-only mode, or whether GPUs were used. Maybe some details could be added either as a section in the Methods or as a quick description within the Figure 1 caption.

    2. Notably, the desktop-grade machine, housing a more recent and affordable GeForce RTX 4090 GPU, outpaced the server with three Tesla V100 GPU cards when running GTalign. The detailed runtimes for each GTalign parameterized variant on these diverse machines are presented in Table S5.

      This is very surprising. Is there a dataset size at which the server starts to eke out performance gains?

    3. In the middle panel, the alignments are sorted by their (TM-align-obtained) TM-score. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. The arrow denotes the largest difference in that number between GTalign (732,024) and Foldseek (13,371)

      The middle panel presents the data in a way that I've never seen before and that I had quite a difficult time wrapping my head around. I think my confusion boils down to these two main concerns: (1) Why are the curves in the left panels repeated in the middle panels? and (2) I think it is incorrect to label the x-axis as "# top hits". I would have understood this plot right away if the curves were removed and the x-axis label was replaced with "# hits with TM-score > 0.5".