  1. Last 7 days
    1. First, the attainable Spearman correlation varies widely across prompts for the same assay: the gap between the best and worst prompt commonly exceeds 0.3. Second, the average variant log-likelihood also spans a broad range, and the optimal likelihood differs by assay.

      Is this a good candidate for distillation? It seems like it could lock in these performance gains without the heavy inference cost, and it might naturally solve the prompt sensitivity issues that warrant ensembling in the first place. Curious to hear your thoughts.

    2. During training, we randomized the order of sequences within each document to encourage invariance with respect to sequence order

      When creating the prompt for a given homolog set {H_i, ...}, the order of concatenation is randomized to promote homolog order invariance. But was invariance ever tested post-training? Specifically, did you guys quantify the variance in model output when the exact same set of homologs is simply re-ordered? Establishing this baseline seems critical to determine whether the performance gains from ensembling truly derive from aggregating diverse evolutionary information, or if they are partially an artifact of smoothing out the model's sensitivity to arbitrary input ordering.
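      To make this concrete, here is a minimal sketch of the kind of check I have in mind; `score_variant` is a hypothetical wrapper around your model that builds the prompt from a given homolog ordering and returns a variant log-likelihood:

      ```python
      import random
      import numpy as np

      def order_sensitivity(score_variant, homologs, variant, n_permutations=20, seed=0):
          """Score the same variant under random re-orderings of the same homolog set."""
          rng = random.Random(seed)
          scores = []
          for _ in range(n_permutations):
              shuffled = homologs[:]      # identical set, different concatenation order
              rng.shuffle(shuffled)
              scores.append(score_variant(shuffled, variant))
          scores = np.array(scores)
          # For a truly order-invariant model, the spread here should be ~0.
          return scores.mean(), scores.std()
      ```

      If the standard deviation from a check like this is non-negligible relative to the differences between prompts, then part of the ensembling benefit may just be averaging out ordering noise.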

    1. RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models

      I have a major concern regarding the dataset construction. My primary question is: why did you choose to generate synthetic sequences (and structures) instead of using natural homologs? Databases like CATH or SCOP are full of naturally occurring protein pairs that share a fold but have very low sequence identity. Using those would have grounded your benchmark in real biological evolution rather than generative noise.

      Regarding your use of the "twilight zone" concept: while your dataset technically hits the 26% identity mark, I feel this misrepresents what the term actually defines. The twilight zone describes evolutionary homology, i.e., sequences that have diverged over millions of years under selection and drift while maintaining structure. Your sequences, by contrast, are hallucinations from an inverse folding model running at high temperature. Generative variance is not the same as evolutionary divergence, and a pLM recognizing ProteinMPNN's output patterns is not the same as understanding structural conservation.

      Furthermore, relying entirely on synthetic validation creates a circular loop. You are testing whether a pLM can recognize sequences made by ProteinMPNN and "validated" by AlphaFold3, without any experimental ground truth that these sequences actually fold. And to be frank, it's straightforward to generate high-pTM AF3 predictions for sequences that don't fold: introduce a tryptophan mutation into your favorite protein, and its pTM will be almost unaffected, but good luck expressing and purifying it. I suspect a substantial fraction of your dataset doesn't fold in real life.

  2. Sep 2025
    1. MSAs can now be constructed in milliseconds [58]. As MSA generation methods continue to improve, models that efficiently leverage the rapidly growing set of available sequences, and thus richer evolutionary context, are well-positioned to advance protein language modeling toward a more sustainable future

      Totally agree, and it's great to see this properly leveraged in the model. At the same time, this got me thinking that not all MSAs are created equal. Scalable methods (e.g., HMM-based or k-mer–based approaches) produce alignments at the scale required for these models, but these are quite different from the phylogenetics-grade MSAs carefully curated for evolutionary inference, which often incorporate clade-specific substitution models, manual curation, etc.

      To me, this raises a question that I think deserves investigation: since the model was trained on cheap-to-make MSAs, would inference on the highest-quality MSAs improve the model's performance? Or, because such an MSA would represent a slight departure from the model's training distribution, should we expect the model to perform worse on this "superior" input?

  3. Aug 2025
    1. For a set of n sequences with predicted pairwise distances D_pred ∈ ℝ^(n×n) and true distances D_true ∈ ℝ^(n×n), we sample a set of quartets 𝒬 = {(i, j, k, ℓ)}. For each quartet, we compute three possible pairwise distance sums:

      How many quartet subsets are sampled per observation (tree) during training? Does this depend on the size of the tree?
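      For reference, my mental model of the quartet step is roughly the sketch below; the sampling scheme, and especially how many quartets are drawn per tree, is exactly what I am asking about:

      ```python
      import random
      import numpy as np

      def sample_quartet_sums(D, n_quartets, seed=0):
          """For sampled quartets (i, j, k, l), compute the three pairwise distance sums
          d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) used in the four-point condition.

          D is an (n, n) distance matrix (predicted or true).
          """
          rng = random.Random(seed)
          n = D.shape[0]
          sums = []
          for _ in range(n_quartets):
              i, j, k, l = rng.sample(range(n), 4)
              sums.append((D[i, j] + D[k, l],
                           D[i, k] + D[j, l],
                           D[i, l] + D[j, k]))
          return np.array(sums)   # shape (n_quartets, 3)
      ```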

    2. The input to Phyla is S with a [CLS] token concatenated in front of each tokenized sequence, s ∈ S: {[CLS]s1 ∥ [CLS]s2 ∥ [CLS]s3 ∥ … ∥ [CLS]sn}.

      So the sequences of a tree are concatenated together, and this concatenated token sequence is what the model operates on. I have two questions about this:

      How are the sequence embeddings calculated from the model output? Mean-pooling over the sequence's token positions, CLS token positions, etc? It would be nice to have this information in the text.

      Are the sequence embeddings invariant with respect to concatenation order? Since the concatenation order has no biological meaning, this seems important to demonstrate.
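      To make the questions concrete, this is roughly the extraction and check I have in mind; whether the per-sequence embedding comes from the [CLS] position or from mean pooling is the open question, so treat the choice below as a placeholder:

      ```python
      import numpy as np

      def sequence_embeddings(token_embeddings, cls_positions, lengths, use_cls=True):
          """Extract one embedding per sequence from the concatenated model output.

          token_embeddings: (total_tokens, d) output for [CLS]s1 || [CLS]s2 || ...
          cls_positions:    index of each sequence's [CLS] token
          lengths:          length of each tokenized sequence (excluding [CLS])
          """
          embs = []
          for cls_idx, L in zip(cls_positions, lengths):
              if use_cls:
                  embs.append(token_embeddings[cls_idx])
              else:  # mean pooling over that sequence's own token positions
                  embs.append(token_embeddings[cls_idx + 1 : cls_idx + 1 + L].mean(axis=0))
          return np.stack(embs)

      # Invariance check: re-run the model on a permuted concatenation, re-extract,
      # undo the permutation, and compare per-sequence cosine similarities.
      ```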

    1. Large Language Models are Locally Linear Mappings

      Really cool! I had a few questions to help my understanding.

      If F is your network, and J is your detached Jacobian, the paper's title "Large Language Models are Locally Linear Mappings" is basically the equation F(x) = J(x) x, right? This has me wondering the following: If J' is your true Jacobian, what does J'(x) x equal?

      If I understand correctly, each layer in the original transformer block reduces to a linear operator in the Jacobian, right? Is there a way to associate specific parameters in the transformer with specific coefficients in the Jacobian?
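      For what it's worth, here is the toy experiment I ran while thinking about this, with a tiny MLP standing in for a transformer block; the "detached" construction below is my own simplification of the idea, not the paper's actual implementation:

      ```python
      import torch

      torch.manual_seed(0)

      # Stand-in network: F(x) = W2 * gelu(W1 x), no biases.
      F = torch.nn.Sequential(
          torch.nn.Linear(8, 16, bias=False),
          torch.nn.GELU(),
          torch.nn.Linear(16, 8, bias=False),
      )
      x = torch.randn(8)

      # True Jacobian J'(x): since GELU is not 1-homogeneous, J'(x) x != F(x) in general.
      J_true = torch.autograd.functional.jacobian(F, x)
      print(torch.allclose(F(x), J_true @ x))                     # typically False

      # "Detached" linearization: freeze the input-dependent GELU gate at x, so the
      # whole network collapses to one matrix J with F(x) = J x exactly at this point.
      with torch.no_grad():
          h = F[0](x)
          gate = torch.nn.functional.gelu(h) / h                  # elementwise gate, fixed at x
          J_detached = F[2].weight @ torch.diag(gate) @ F[0].weight
          print(torch.allclose(F(x), J_detached @ x, atol=1e-5))  # True by construction
      ```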

      A short distance away in the input embedding neighborhood, the detached Jacbian will be extremely different because the manifold is highly curved.

      (There is a small typo here (Jacbian -> Jacobian)). I think our offline conversation has primed me to mention this, but perhaps this chaotic-like behavior could explain why the reconstruction error for fp16 is ~1000x worse than for fp32.

  4. Jun 2025
    1. Bio2Token: All-atom tokenization of any biomolecular structure with Mamba

      This work was really interesting to me, although significantly outside my domain of expertise. For that reason, I hope you'll excuse this idea if it seems nonsensical or unlikely to work.

      As I read your paper, I was thinking about potential use cases for such an accurate structure tokenizer, and MD came to mind. From what I've heard, all-atom simulations, especially with solvent, are extremely expensive to run.

      Given this, my potentially harebrained idea was to train a model that ingests the tokenized structures for frames (i-2, i-1, i) of an MD simulation, with the objective of predicting frame i+1.
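      Something like the rough sketch below, where the codebook size, tokens per frame, and the plain transformer backbone are all placeholders I made up for illustration:

      ```python
      import torch
      import torch.nn as nn

      class NextFrameModel(nn.Module):
          """Toy sketch: predict the structure tokens of frame i+1 from frames i-2, i-1, i."""
          def __init__(self, vocab_size=4096, d_model=256, tokens_per_frame=512):
              super().__init__()
              self.tokens_per_frame = tokens_per_frame
              self.embed = nn.Embedding(vocab_size, d_model)
              layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=4)
              self.head = nn.Linear(d_model, vocab_size)

          def forward(self, frame_tokens):
              # frame_tokens: (batch, 3 * tokens_per_frame), frames i-2, i-1, i concatenated
              h = self.encoder(self.embed(frame_tokens))
              # read the prediction for frame i+1 off the last frame's positions
              return self.head(h[:, -self.tokens_per_frame:, :])  # (batch, tokens_per_frame, vocab)

      # model = NextFrameModel()
      # logits = model(torch.randint(0, 4096, (2, 3 * 512)))      # -> (2, 512, 4096)
      ```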

      Have you guys considered the practicality of this? I would enjoy hearing your thoughts.

      Thanks again for the contribution.

  5. May 2025
    1. where the count matrix C ∈ ℝ^((M+1)×(M+1)) is constructed by repeating the count vector c = (1, c1, c2, …, cM) across all rows.

      Employing the target 'counts' to define an attention bias introduces apparent circularity, since these 'counts' are precisely what the model aims to predict. This poses a challenge for inference: how would the model predict gene expression levels if the attention bias 'C' must be defined using these same, yet-to-be-predicted, expression levels?

  6. Apr 2025
    1. ATTENTION-BASED ARCHITECTURE FOR G-P MAPPING

      The model is a stack of attention layers, but I was surprised to see it omit all the typical components that brought attention into the limelight via transformers: multi-head attention, residual connections, layer norm, and position-wise FFNs. These have become standard and widely adopted, largely for good reason, as they've been shown to be very effective across many distinct domains.

      Was there a particular reason this specific custom architecture was preferred over implementing or at least comparing to a standard transformer encoder?
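      To be concrete, the baseline I have in mind is essentially the off-the-shelf encoder below (dimensions are arbitrary placeholders, and the prediction head would be whatever you already use):

      ```python
      import torch.nn as nn

      d_model, n_heads, n_layers = 128, 8, 4

      # Standard transformer encoder: multi-head attention, residual connections,
      # layer norm, and position-wise FFNs all come for free.
      encoder_layer = nn.TransformerEncoderLayer(
          d_model=d_model,
          nhead=n_heads,
          dim_feedforward=4 * d_model,
          batch_first=True,
      )
      baseline_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
      # Input: (batch, n_loci, d_model) genotype embeddings; output: same shape.
      ```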

    2. Genotype vectors are converted to one-hot embeddings X(g) and transformed into d-dimensional embeddings Z(g)

      Constructing X^(g) is an extremely expensive way to associate an embedding with each locus. You should simply use a lookup table (i.e. nn.Embedding).
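      i.e., something along these lines (dimensions are placeholders; the two routes produce identical embeddings, but the lookup never materializes the one-hot tensor):

      ```python
      import torch
      import torch.nn as nn

      n_loci, n_alleles, d = 1_000, 3, 64
      genotypes = torch.randint(0, n_alleles, (32, n_loci))         # (batch, loci)

      embedding = nn.Embedding(n_alleles, d)
      Z = embedding(genotypes)                                       # (batch, loci, d), direct lookup

      # One-hot route: builds X^(g) explicitly just to select rows of the same weight matrix.
      X = torch.nn.functional.one_hot(genotypes, n_alleles).float()  # (batch, loci, n_alleles)
      Z_onehot = X @ embedding.weight                                # (batch, loci, d)
      assert torch.allclose(Z, Z_onehot)
      ```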

    1. Even when only the first layer (immediately after the embedding layer) is unfrozen, it can still influence the subsequent layers, enabling the model to produce informative embeddings for the regression head at the final layer

      This is fascinating! I wonder how the perplexity of the pre-training task is affected by which layer you choose to unfreeze.

    2. The ESM-Effect Architecture thus comprises the 35M ESM2 model with 10 of 12 layers frozen and the mutation position regression head (cf. Figure 2). The model’s performance is driven by two key inductive biases in the regression head:

      How would you extend this combined head architecture (mutation position embedding + mean pooled) if you were looking at the effect of a multi-mutation variant?

      One strategy I can think of would be to slice out all mutation positions and pool them. I'm wondering if you guys thought about generalizing the architecture to scenarios where the number of mutations in your DMS dataset varies.
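      Roughly what I am picturing (a hypothetical generalization of your head, not the paper's actual architecture):

      ```python
      import torch

      def pool_mutation_positions(token_embeddings, mutation_positions):
          """Mean-pool per-residue embeddings over an arbitrary number of mutated positions.

          token_embeddings:   (batch, seq_len, d) from the partially unfrozen ESM2
          mutation_positions: list of index tensors, one per sequence (variable length)
          """
          pooled = [token_embeddings[b, idx, :].mean(dim=0)
                    for b, idx in enumerate(mutation_positions)]
          return torch.stack(pooled)   # (batch, d), a drop-in for the single-position embedding

      # This vector could then be concatenated with the mean-pooled sequence embedding
      # exactly as in the single-mutation regression head.
      ```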

    3. Figure 1:

      Seeing the pre-training validation perplexity in (1) made me wonder: did you ever assess how fine-tuning affected post-training perplexity? This could be a proxy to gauge how disruptive fine-tuning is for the pre-training task.

  7. Feb 2025
    1. Protein Language Model Fitness Is a Matter of Preference

      I really enjoyed reading your paper and thought it contained many interesting and insightful gems.

      • As someone who has calculated many PLLs, which take time and money, I was very interested in your O(1) method for PLL.
      • The predictive power being predicated on wildtype PLL is a very important result.
      • I found Figure 5 to be a beautiful illustration of how homology in training data influences preference.
      • In Figure 6, it was incredible to see just how much the Spearman can be increased for the low-likelihood DMS datasets. And surprising to see that low-likelihood DMS datasets do worse. Clearly there is more to learn.

      More broadly, I would be curious to hear your thoughts on alternative PLM training objectives. Specifically, I'm interested in approaches that maintain the BERT-style masked language modeling objective while incorporating additional training signals. One key idea would be to include explicit feedback about sequence fitness ('good' vs 'bad' sequences) alongside the traditional masked prediction task.

      This approach could help move away from preference-oriented behavior. When models are trained solely on naturally occurring proteins, they implicitly learn that all training examples represent 'good' or 'valid' proteins. By incorporating direct fitness measurements as an additional training objective, we could potentially guide the model to learn more nuanced distinctions between functional and non-functional sequences, rather than simply modeling the distribution of extant proteins.
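      As a very rough sketch of the kind of joint objective I mean (entirely hypothetical, with an arbitrary weighting term):

      ```python
      import torch
      import torch.nn.functional as F

      def joint_loss(mlm_logits, mlm_targets, fitness_pred, fitness_labels, alpha=0.1):
          """BERT-style masked-token loss plus an auxiliary fitness term.

          mlm_targets uses -100 at unmasked positions (standard ignore_index);
          fitness_labels could be binary good/bad calls or continuous assay values.
          """
          mlm = F.cross_entropy(
              mlm_logits.view(-1, mlm_logits.size(-1)),
              mlm_targets.view(-1),
              ignore_index=-100,
          )
          # Treating fitness as a per-sequence regression target here; a binary
          # good-vs-bad label would swap this for binary cross-entropy.
          fitness = F.mse_loss(fitness_pred.squeeze(-1), fitness_labels)
          return mlm + alpha * fitness
      ```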

      Thanks again for the insightful paper.

    2. Naive usage of sequence databases and scaling will magnify the biases in training data, leading to miscalibrated preferences

      This study does a great job illuminating this. Do you guys foresee a method for creating a more balanced, and less biased training dataset? It seems there is an opportunity to do more with less.

    3. Unlike autoregressive language models, masked language models don't have a natural way to immediately compute the joint likelihood of a sequence. As a result, Wang and Cho (2019) proposed to mask every index of a sequence one-at-a-time and then average to derive a PLL. This formulation suffers from the need to run 𝒪(L) forward passes to compute a perplexity or log likelihood. In response to this, the community only considers autoregressive pLMs when computing fitness values for proteins containing insertions or deletions.

      There is a lot of overlap between this paragraph and the next.

    1. With a smaller training size of ∼1M examples and just a single GPU, training times ranged from 6-26 hours for 100 epochs for most proteins (4 to 16 minutes per epoch). Pretraining METL-Global with 20M parameters took ∼50 hours on 4x A100s and ∼142 hours with 50M parameters.

      Given the performance, I'm impressed with the affordability of the models' pre-training.

      In a world of foundation models that cost millions of $ to train, I think it's definitely worth mentioning the frugality of these models in the discussion (if not already mentioned).

  8. Dec 2024
    1. We have evaluated the performance of ESM2 embeddings across various model sizes (from 8 million to 15 billion parameters) in transfer learning tasks on a wide range of different biological datasets

      I think the diversity of regression tasks lends a lot of robustness to your conclusions. However, I think you're using the term "transfer learning" rather narrowly, specifically referring to prediction tasks where either a value or a vector is predicted for each sequence.

      There are many classes of transfer learning tasks, like sequence labeling, token classification, all sequence-to-sequence tasks, etc. I think being more specific about the type of transfer learning you guys are making claims about would make your conclusions more accurate.

    2. Even though these models were also pretrained with a maximum sequence length

      Technically, ESM2 is trained using sequences longer than 1022, but a length-1022 subsequence is sampled whenever such a sequence is selected for a training batch.

    3. Mean reduction in R2 when embeddings are compressed with methods other than mean pooling. A) Results for DMS data. B) Results for diverse protein sequences (PISCES data). In all cases, the y-axis represents different compression methods and the x-axis shows the resulting difference in R2. Dots represent the fixed effects estimates from mixed-effects modeling, and error bars represent 95% confidence intervals.

      This analysis comparing pooling methods was very informative, but it left me wondering how mean pooling compares to no pooling at all. Is this something y'all considered? It would be interesting to compare the R2 of a more sophisticated transfer learning model that ingests the raw embeddings (like a basic FCN). Though an apples-to-apples comparison might be hard to create, it would be useful to know the "cost" of mean pooling by observing the extent to which raw embeddings outperform mean pooling (if at all?).
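      As one example of what a "no pooling" baseline could look like, here is a light attention-pooling head over the raw per-token embeddings (my own construction, not something evaluated in the paper):

      ```python
      import torch
      import torch.nn as nn

      class AttentionPoolRegressor(nn.Module):
          """Regression head that ingests raw per-token embeddings instead of their mean.

          A learned query attends over the (padded) token embeddings, so positions can be
          weighted unequally, which is exactly what mean pooling cannot do.
          """
          def __init__(self, d_embed=1280):   # 1280 is just a placeholder embedding size
              super().__init__()
              self.query = nn.Parameter(torch.randn(d_embed) / d_embed**0.5)
              self.out = nn.Linear(d_embed, 1)

          def forward(self, token_embeddings, padding_mask):
              # token_embeddings: (batch, seq_len, d); padding_mask: (batch, seq_len), True = pad
              scores = token_embeddings @ self.query                 # (batch, seq_len)
              scores = scores.masked_fill(padding_mask, float("-inf"))
              weights = scores.softmax(dim=-1).unsqueeze(-1)         # (batch, seq_len, 1)
              pooled = (weights * token_embeddings).sum(dim=1)       # (batch, d)
              return self.out(pooled).squeeze(-1)
      ```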

    1. We performed this experiment for the “no-cycle” and “1-step recycle” settings, and found that the number of candidates with the PFAM domain retained after recycling was significantly higher at 68/190 than the candidates generated without doing any recycling (50/190)

      This is cool--and to me--surprising. Is there any theory proposed to explain why this might be (for any of Raygun/ESMFold/AlphaFold), or is this simply an empirical observation?

    2. To adjust for length variations, we updated the pLL score to make it length-invariant. The updated pLL_invar(S), where S is the input sequence, becomes

      Did you guys consider using pseudo-perplexity?

      PPPL = exp(-PLL / length)
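      i.e., for a masked LM, something along the lines of the standard O(L) one-position-at-a-time formulation below (with a generic `model` that returns logits; purely to pin down the definition I mean):

      ```python
      import math
      import torch

      @torch.no_grad()
      def pseudo_perplexity(model, token_ids, mask_token_id):
          """PPPL = exp(-PLL / L), where PLL is the summed masked log-probability."""
          pll = 0.0
          L = token_ids.shape[1]
          for i in range(L):
              masked = token_ids.clone()
              masked[0, i] = mask_token_id
              logits = model(masked)                          # (1, L, vocab_size)
              log_probs = logits[0, i].log_softmax(dim=-1)
              pll += log_probs[token_ids[0, i]].item()
          return math.exp(-pll / L)
      ```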

    3. Let p′′ be another Raygun sequence obtained from p of length n′ < n. Then the total and constituent losses become

      It would be nice if this notation matched the definition of the losses that are itemized above.

    4. Replicate Loss (Lrp): Let a protein p has a length n and suppose we generated a new protein p′ of length n′ using Raygun. Then, this loss is designed to ensure that the fixed-length embedding of p is close to that of p′

      How is n' determined, and why was it only trained to be less than n, rather than greater?

      My thinking is that varying n' throughout the training process could help achieve more robust self-consistency of the embedding space.

    5. This layer performs the length transformation by broadcasting each column vector of the fixed-dimensional representation, ensuring that the resulting combined embeddings have the desired length

      A question to test my understanding: are insertions/deletions distributed equally across the blocks? Here's an example to explain what I'm asking.

      To make things simple, let's say n is a multiple of K=50. If a miniaturized protein of n-50 is decoded, is it guaranteed that each block samples 49 residue positions?

    6. Miniaturizing, Modifying, and Augmenting Nature’s Proteins with Raygun

      The authors develop a novel approach to template-guided protein design called Raygun. Raygun is an encoder-decoder model, where proteins are encoded as a fixed-size pLM representation, regardless of length. From this length-agnostic representation, sequences of user-specified length can be decoded. This sequence generation methodology has many unique advantages for sequence design, which the authors describe well.

      My comments and questions are inlined throughout the text. Thanks for the wonderful study.

    7. These results suggest that Raygun’s fixed-length representations not only retain but potentially refine the structural information present in ESM-2 embeddings

      Very cool. I've suspected mean pooling to be overly reductive, but of course raw embeddings have mismatched dimensions. So it's pretty awesome to see how well the compromise of pooling within blocks works.

  9. Sep 2024
    1. Functional protein mining with conformal guarantees

      I found this study very interesting, and despite my limited knowledge of pLMs and conformal statistics, I have a few comments about the results pertaining to section 3.1. Perhaps my comments may provide a data point on how non-experts may engage with the paper. Please feel free to take or leave any of my suggestions/remarks.

      I really like the approach of establishing conformal guarantees for all the reasons stated in the introduction. I especially liked the generality with which the application of conformal statistics to this problem is presented, and that it was made clear that an explicit "non-goal" of the study was to demo a new machine learning model for enzyme classification.

      While reading, I kept thinking about the fact that members of a Pfam domain do not necessarily share the same biochemical function. This is because less than 0.1% of protein functional annotations are linked to experimental evidence (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9374478/) and the rest--the vast and overwhelming majority--are annotated transitively based on similarity scores of some kind.

      With that in mind, I think the authors could do better to point out that the ground truth upon which their terms FP, TP, and FDR are defined is itself a proxy for shared function. I don't believe this at all detracts from the results of the paper, but pointing out these assumptions would increase the trust of readers who question what you mean by terms like conformal "guarantees" and "true" positives. My apologies if you already explained this somewhere and I missed it.

      Since JCVI Syn3.0 was published in 2016, it would be interesting to see whether the traditional search methods (BLAST & HMMSearch) still yield 20% unknown function, or whether our annotations have since improved.

      It would also be interesting to see if the Protein-Vec hits in the Syn3.0 case study that don't exceed lambda are systematically "worse" than the true positives, for example as measured by TM-score.

      Thanks again for putting out this interesting study.

  10. Jul 2024
    1. Structure-aware protein sequence alignment using contrastive learning

      I found this study very interesting and creative! Fine-tuning the embedding space to account for structural similarity via contrastive learning seems like a wonderful idea and the results are very impressive. Here are some of my thoughts about your paper, presented in no particular order. Please feel free to take or leave any of my suggestions.

      One advantage of CLAlign compared to structural aligners is that you don't need to calculate structures. However, the hardware requirements for CLAlign are probably non-trivial, since pLM embeddings have to be calculated. Hardware requirements are missing from the manuscript so it's hard to know. Relatedly, there is no information provided about the speed of CLAlign. I think the manuscript should be expanded to include detailed runtime statistics and hardware requirements so that CLAlign can be better benchmarked against the other tools.

      While Table 1 gives us an overall picture of the alignment quality, it would be nice to know the tool's strengths and weaknesses. How does it perform when sequences are distant homologs? Or when there are large length mismatches? Since embedding-based alignments are state-of-the-art, this kind of information would be broadly useful for readers.

      Figure 1 looks more like a draft than a complete figure. And without a caption it doesn't make sense.

      The performance is very impressive, and it has me curious how much further the performance could be improved simply by increasing the epochs or training dataset. Visualizing the loss curve could help contextualize the performance and help readers understand the extent to which there is room for improvement.

      Small notes:

      • Throughout the manuscript, consistent reference to pLMs is made without any specificity. But there are many different architectures, e.g. BERT, T5, autoregressive, etc. I found this confusing.

      • There are many grammatical mistakes. Consider passing the manuscript through a grammar checker.

      Final thoughts:

      Great work! I am curious to try CLAlign once it is made available.

  11. Jun 2024
    1. The right panel shows the cumulative TM-score plotted against runtime in seconds

      My apologies if I missed this, but I was expecting to find a section in the Methods that explained what hardware was used for the right panels. In particular, I was curious whether GTalign was run in CPU-only mode, or whether GPUs were used. Maybe some details could be added either as a section in the Methods or as a quick description within the Figure 1 caption.

    2. Notably, the desktop-grade machine, housing a more recent and affordable GeForce RTX 4090 GPU, outpaced the server with three Tesla V100 GPU cards when running GTalign. The detailed runtimes for each GTalign parameterized variant on these diverse machines are presented in Table S5.

      This is very surprising. Is there a dataset size at which the server starts to eke out performance gains?

    3. In the middle panel, the alignments are sorted by their (TM-align-obtained) TM-score. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. The arrow denotes the largest difference in that number between GTalign (732,024) and Foldseek (13,371)

      The middle panel presents the data in a way that I've never seen before and that I had quite a difficult time wrapping my head around. I think my confusion boils down to these two main concerns: (1) Why are the curves in the left panels repeated in the middle panels? and (2) I think it is incorrect to label the x-axis as "# top hits". I would have understood this plot right away if the curves were removed and the x-axis label was replaced with "# hits with TM-score > 0.5".