7 Matching Annotations
  1. Dec 2025
    1. The hyperparameter search space is summarized in Table 1, with full results in Table 2. While no single configuration is universally optimal, we highlight a setting with block_size=4, fetch_factor=16, and num_workers=12, which achieves approximately 2593 samples/sec and maintains an entropy of 3.59—comparable to random sampling.

      This is a powerful tool that lets anyone train on large datasets; thank you for sharing it with the community! Do you have a practical sense of the tradeoff between minibatch entropy and model validation performance for a fixed amount of training time? This is essentially impossible to test exhaustively, but I wonder whether even lower minibatch entropy, which allows higher throughput, would be preferable under a fixed training budget. Do you have any anecdotal evidence from training runs about how much shuffling is optimal? Since the experiment cannot really be run, I agree that staying close to random shuffling is probably the best choice. Thank you for this contribution!
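
      For context, here is roughly what I mean by minibatch entropy. This is only an illustrative sketch; the block labeling, entropy base, and batch construction are my own assumptions, not the paper's implementation.

      ```python
      import numpy as np

      def minibatch_entropy(block_ids: np.ndarray) -> float:
          """Shannon entropy (nats) of the source-block labels within one minibatch.

          Higher entropy means the batch mixes samples from many on-disk blocks,
          i.e. it is closer to fully random shuffling.
          """
          _, counts = np.unique(block_ids, return_counts=True)
          p = counts / counts.sum()
          return float(-(p * np.log(p)).sum())

      # Toy comparison for a batch of 64 samples: contiguous reads with
      # block_size=4 vs. fully random sampling from a large dataset.
      rng = np.random.default_rng(0)
      blocky_batch = np.repeat(np.arange(16), 4)            # 16 blocks x 4 samples each
      random_batch = rng.integers(0, 10_000_000, size=64)   # essentially unique blocks
      print(minibatch_entropy(blocky_batch), minibatch_entropy(random_batch))
      ```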

    1. calculations for exploratory analysis and other downstream tasks such as differential expression analysis.

      Thank you for this excellent resource, which makes the dataset usable by anyone. The SCVI approach clearly has many benefits for removing batch effects and noise, but are there any concerns about noise introduced by the generative process when the (corrected, normalized) expected expression of each gene is used for downstream tasks? The reconstruction during training looks excellent, but there should be some tradeoff between the benefits of the SCVI latent representation and the noise of the generative process; where do you think the model lands on that spectrum? Thank you again for such a useful model!
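
      For readers who want to see where the (corrected, normalized) expected expression comes from, a minimal sketch assuming an scvi-tools SCVI model; the file path, batch key, and training settings are placeholders rather than the authors' configuration.

      ```python
      import scanpy as sc
      import scvi

      adata = sc.read_h5ad("dataset.h5ad")                     # placeholder path
      scvi.model.SCVI.setup_anndata(adata, batch_key="batch")  # assumed batch column
      model = scvi.model.SCVI(adata)
      model.train(max_epochs=50)                               # placeholder schedule

      # Latent representation used for embedding-based analyses
      latent = model.get_latent_representation()

      # Batch-corrected, library-size-normalized expected expression per gene,
      # i.e. the quantity that would feed downstream tasks such as
      # differential expression
      norm_expr = model.get_normalized_expression(adata, library_size=1e4)
      ```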

    1. Similar to the essentiality benchmark, we use the learned gene embeddings from each of the models and train a shallow MLP to predict a multi-hot label for each gene indicating its membership in one of the hallmark pathways

      Did you happen to try linear prediction from the embeddings for each approach? It would be interesting to see how much of the improvement comes from the shallow MLP's nonlinearity versus a simple linear probe. Presumably the ranking of approaches would stay the same.
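
      To make the comparison concrete, a sketch of the linear probe I have in mind, with placeholder embeddings and multi-hot pathway labels standing in for the real data:

      ```python
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import average_precision_score
      from sklearn.model_selection import train_test_split
      from sklearn.multiclass import OneVsRestClassifier

      # Placeholder data: one embedding per gene, multi-hot hallmark-pathway labels
      rng = np.random.default_rng(0)
      n_genes, d_embed, n_pathways = 2000, 256, 50
      X = rng.normal(size=(n_genes, d_embed))
      Y = rng.integers(0, 2, size=(n_genes, n_pathways))

      X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

      # Linear probe: one logistic regression per pathway, no hidden layer
      probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
      probe.fit(X_tr, Y_tr)
      scores = probe.predict_proba(X_te)
      print("linear-probe macro AP:", average_precision_score(Y_te, scores, average="macro"))
      ```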

    2. To understand the relationship between model scale, training efficiency, and downstream performance, we trained the Tx1 model series at three scales: 70M, 1B, and 3B parameters. Fig. 7A shows the training cost versus computational budget (measured in FLOPs) for Tx1 compared to other single-cell foundation models including SE-600M, scGPT, and nv-Geneformer variants. Tx1 achieves substantially improved training efficiency, with 3–30× better compute efficiency relative to these prior models.

      Thank you for sharing this dataset and model (as well as the SCVI model). In terms of training cost versus computational budget, how do the smaller training subsets factor into the efficiency numbers for the smaller models? It would be interesting to consider training compute normalized by the fraction of the data each model was trained on. Is it possible that training the 3B model on only a subset of the dataset would not hurt performance and would therefore improve the training-efficiency metrics? I appreciate this deep analysis of the training process.
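
      As a concrete (entirely hypothetical) illustration of that normalization, using the standard ~6·N·D estimate of training FLOPs and made-up token counts:

      ```python
      def train_flops(n_params: float, n_tokens: float) -> float:
          """Standard ~6*N*D estimate of dense-transformer training FLOPs."""
          return 6.0 * n_params * n_tokens

      full_corpus_tokens = 1e12  # made-up corpus size
      runs = {
          "70M model, 100% of data": (70e6, 1.00),
          "3B model,   25% of data": (3e9, 0.25),   # hypothetical subset run
      }
      for name, (n_params, frac) in runs.items():
          flops = train_flops(n_params, frac * full_corpus_tokens)
          print(f"{name}: raw = {flops:.2e} FLOPs, "
                f"normalized by data fraction = {flops / frac:.2e}")
      ```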

  2. Oct 2025
    1. We note that improved reconstruction may come at the cost of increased feature absorption (Karvonen et al., 2024)

      Clearly from the nice agreement in Fig. 5, the SAE reconstructions do an excellent job of reconstructing the residual representation at each layer. I am curious about the magnitude of the reconstruction MSE for the hyperparameters covered in Fig. 8; have you shared any results from the SAE training itself?

      There is a tradeoff between reconstruction error and L0 sparsity, but at what point do the results tell us more about the SAEs themselves than about ESM2?
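
      For reference, the two quantities I am asking about, computed here for a generic ReLU SAE on residual-stream activations; the dimensions and architecture are my assumptions, and the actual SAEs may use a different activation or sparsity mechanism.

      ```python
      import torch

      def sae_metrics(x, W_enc, b_enc, W_dec, b_dec):
          """Reconstruction MSE and mean L0 for a simple ReLU SAE (illustrative)."""
          z = torch.relu(x @ W_enc + b_enc)                  # latent activations
          x_hat = z @ W_dec + b_dec                          # reconstruction
          mse = ((x - x_hat) ** 2).mean().item()
          l0 = (z > 0).float().sum(dim=-1).mean().item()     # avg active latents per token
          return mse, l0

      # Assumed shapes: d_model=1280 residual stream, 16k SAE latents, random weights
      d_model, d_sae, n_tokens = 1280, 16_384, 512
      x = torch.randn(n_tokens, d_model)
      W_enc, b_enc = torch.randn(d_model, d_sae) / d_model**0.5, torch.zeros(d_sae)
      W_dec, b_dec = torch.randn(d_sae, d_model) / d_sae**0.5, torch.zeros(d_model)
      print(sae_metrics(x, W_enc, b_enc, W_dec, b_dec))
      ```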

    2. We developed a latent visualizer, InterProt, to streamline the process of identifying features.

      InterProt is an amazing tool for sorting through all of these findings.

      The Fig. 3C plot also gives a very nice global view of the learned latent features. What do you make of the relatively small fraction of "interesting" features (the "structural", "amino acid", "alpha helix", etc., top features on InterProt) compared to the total number of latents? Do you think this reflects our limited knowledge of protein structure, or are the "uninteresting" latents simply at a lower conceptual level (e.g., single-residue features) than what we find interesting (motifs with structural effects)?

  3. Jul 2025
    1. Model weights of both Ankh3-Large and Ankh3-XL models are available at https://huggingface.co/ElnaggarLab/ankh3-large and https://huggingface.co/ElnaggarLab/ankh3-xl.

      Thank you for open-sourcing this exciting model! It is impressive that the diverse set of training tasks allows the XL model to keep improving over the Large model's performance. It's also great to see that this was trained with JAX on TPUs.

      While running inference with the model, I had some trouble reproducing the S2S completion examples in the README. For the sequence-completion example, was teacher forcing used at inference time? I also observed excellent performance with [NLU] for predicting masked sites.
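
      For anyone else trying this, the rough shape of the masked-site test I ran is below. The use of the standard seq2seq auto classes, the [NLU] prefix placement, and the sentinel token are my reading of the README, so please defer to the model card for the exact format.

      ```python
      import torch
      from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

      model_id = "ElnaggarLab/ankh3-large"
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
      model.eval()

      # Toy protein sequence with one position replaced by a T5-style sentinel token;
      # the [NLU] prefix and sentinel handling here follow my reading of the README.
      sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
      masked = "[NLU]" + sequence[:10] + "<extra_id_0>" + sequence[11:]

      inputs = tokenizer(masked, return_tensors="pt")
      with torch.no_grad():
          out = model.generate(**inputs, max_new_tokens=5)
      print(tokenizer.decode(out[0], skip_special_tokens=False))
      ```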