24 Matching Annotations
  1. Last 7 days
    1. Our free, flow-based, and steered MD simulations not only substantiated previous experimental findings but also revealed previously unrecognized mechanisms of VWF mechanomodulation, including dynamic interactions between the N′AIM and C′AIM regions and the A1 domain.

      Well supported. Good job!

    2. Insights from our flow simulations (Movie S4), which recapitulate the flow-induced unfurling of VWF and the uncoiling of N’AIM and C’AIM to expose the A1 domain (Fig. 3A), revealed that while O-linked glycans enhance steric shielding of A1 from GPIbα, they also modulate the stability of AIM–A1 interactions. Specifically, glycan-induced steric hindrance shortened the lifetimes of both N’AIM–A1 and C’AIM–A1 interactions (Fig. 3B), leading to earlier uncoiling events compared to the unglycosylated system (Fig. 3C). Importantly, the key residues mediating these interactions were conserved regardless of glycosylation status (Fig. 3D), indicating that the observed differences arise primarily from sterics 20.

      With what certainty/confidence? I see the blue/red shadows in 3B but no numerical bound.
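
      A percentile bootstrap over simulation replicas would give a numerical bound; a rough sketch of what I mean (the lifetime values are made up, purely to show the calculation):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical per-replica AIM–A1 interaction lifetimes (ns),
      # one value per independent simulation replica.
      lifetimes_glyc = np.array([120., 95., 140., 110., 105.])     # glycosylated
      lifetimes_unglyc = np.array([210., 185., 230., 200., 190.])  # unglycosylated

      def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
          """Percentile bootstrap CI for the difference in mean lifetimes."""
          diffs = np.empty(n_boot)
          for i in range(n_boot):
              diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
          lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
          return diffs.mean(), (lo, hi)

      mean_diff, (lo, hi) = bootstrap_ci(lifetimes_glyc, lifetimes_unglyc)
      print(f"mean difference: {mean_diff:.1f} ns, 95% CI [{lo:.1f}, {hi:.1f}]")
      ```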

    3. However, at sites of vascular injury, elevated shear stress acts as a mechanical cue that triggers VWF to unfurl into an extended conformation exposing cryptic binding sites for the platelet surface receptor glycoprotein Ibα (GPIbα)5. Remarkably, the spatial organization of VWF is highly context dependent. Within the trans-Golgi network, VWF monomers assemble via head-to-head interactions through the D’D3 domains and tail-to-tail associations via their C-terminal regions, forming higher-order multimers with a characteristic bouquet-like architecture.

      Good background tbh!

    1. This dual-masking formulation drives the model to learn robust representations by predicting masked values from complementary perspectives.

      bro. In Dataset 1, they have 16 panels but the leave-one-panel-out drop is <8%. That's the better evidence for robustness.

      Then you claim pretraining drives "robust representations despite marker inconsistency" based on the KO task, where Dataset 2 has a completely consistent panel.

      Those two claims aren't using the same evidence base and shouldn't be merged into one conclusion.
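
      For concreteness, here's roughly what I picture a dual-masking objective over a cells × markers expression matrix looking like (the paper's actual masking strategy and loss may differ; this is an assumption-laden sketch):

      ```python
      import torch
      import torch.nn.functional as F

      def dual_masking_loss(x, model, p_cell=0.15, p_marker=0.15):
          """Reconstruct masked entries from two complementary views:
          one mask drops whole cells (rows), the other whole markers (columns)."""
          n_cells, n_markers = x.shape
          cell_mask = torch.rand(n_cells, 1) < p_cell        # row-wise mask
          marker_mask = torch.rand(1, n_markers) < p_marker  # column-wise mask

          # Each view must recover the entries its own mask hid.
          loss_cell = F.mse_loss(
              model(x.masked_fill(cell_mask, 0.0))[cell_mask.expand_as(x)],
              x[cell_mask.expand_as(x)])
          loss_marker = F.mse_loss(
              model(x.masked_fill(marker_mask, 0.0))[marker_mask.expand_as(x)],
              x[marker_mask.expand_as(x)])
          return loss_cell + loss_marker
      ```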

    2. Notably, large-scale pretraining yielded considerable performance gains in small-data settings, attributable to robust cellular representations that recover biological signals despite marker inconsistency.

      If you're concluding this from per-class AUC while markers are inconsistent, isn't this claim sketchy at best?

    3. Dataset 1: Longitudinal mouse immunophenotype dataset. As part of a long-running mutagenesis project to investigate novel genetic causes of immune dysfunction [18], flow cytometry phenotypes for over forty thousand C57BL/6 mice were obtained at the Australian Phenomics Facility between 1995 and 2015. This dataset comprises predominantly eight-colour experiments with varying marker/antibody/fluorophore combinations, yet most samples include a backbone of six common markers (IgM, IgD, B220, CD44, CD4, CD3) (Supplementary Tables 10 and 8). In the present analysis, we have chosen a subset of 14,014 flow cytometry samples (6,978 female, 7,036 male) with a consistent gender metadata label and mostly pan-leukocyte marker panels. Sexual dimorphism rarely produces landmark cell populations readily detectable by manual analysis of flow cytometry data. However, this has proven a tractable problem with application of neural networks [19], with discriminative signals usually subtle and dispersed across multiple cell populations.

       Dataset 2: Knockout Mouse Project immunophenotype dataset. The Knockout Mouse Project (KOMP) [20] generated mouse strains harbouring gene knockouts for the majority of genes in the mouse genome, accompanied by phenotype data including flow cytometry information for a subset of mutant mouse lines. For our purposes, we focus on a subset of samples subjected to flow cytometry assay of a T cell immunophenotyping panel [21] (Supplementary Table 3). Despite containing nearly 7000 samples, this dataset poses a classic lack-of-data problem, as each knockout (KO) is represented by only 10 to 20 samples. As most knockouts in this dataset were found to lack discernible cellular phenotypes [21], we selected just 5 knockout lines with clear mutant phenotypes characterised by the original study. This yields 72 samples (Supplementary Table 9) for a 5-class KO classification task.

      So you have 14k samples for Dataset 1, with only a slight male/female imbalance, but just 72 samples for Dataset 2 after selecting only 5 knockout lines? Also, "most knockouts in this dataset were found to lack discernible cellular phenotypes"? Is that not concerning if you want to claim a general capability for flow cytometry that others can build on?

      Pre-training distribution has a significant impact on downstream utility.

    4. We evaluated the impact of cross-dataset pretraining on the model generalisation scenario using two configurations. The first model, the D1 encoder (Experiments A and B), was trained exclusively on Dataset 1. The second, the generic encoder (Experiment C), was pretrained on combined training data from Datasets 1 and 2 before downstream training on Dataset 1 only. Results in Fig. 2b (1) demonstrate that including even a small fraction of Dataset 2 in the pretraining phase significantly improved downstream generalisation to Dataset 2 testing samples.

      What exactly do the pre-training distributions look like? What's the exact mix? Is Dataset 1 sufficiently different from Dataset 2, specifically as it relates to sample quality and number of samples?
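
      If the mix were explicit it would also be easy to ablate; something like the stream below (d1, d2, and the 5% fraction are hypothetical):

      ```python
      import random

      def mixed_pretraining_stream(d1, d2, frac_d2=0.05, seed=0):
          """Yield pretraining samples with an explicit Dataset 2 fraction,
          making the pretraining distribution a controlled, reportable quantity."""
          rng = random.Random(seed)
          while True:
              source = d2 if rng.random() < frac_d2 else d1
              yield rng.choice(source)

      # e.g. sweep frac_d2 over {0.0, 0.01, 0.05, 0.25} and report downstream
      # generalisation to Dataset 2 test samples at each mix.
      ```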

    5. In this regard, GPCT can be interpreted through the attention mechanism used by the decoder: during inference, each attention head in the multi-head attention layer assigns a weight to every cell, representing its relative contribution to the decision-making process. These weights serve as a quantitative measure of per-cell “importance”, and while they are typically averaged across heads per layer for visualisation, each layer may capture distinct patterns that reflect the model’s internal processing steps.

      Interesting concept to make them cell level. Why not clusters of cells?
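
      Cluster-level importance falls straight out of the per-cell weights by aggregation; a sketch (the attention shape and cluster labels are assumed):

      ```python
      import numpy as np

      def cluster_importance(attn, cluster_labels):
          """Aggregate per-cell attention into per-cluster importance.

          attn:           (n_heads, n_cells) attention weights from one decoder layer
          cluster_labels: (n_cells,) integer cluster assignment per cell
          """
          per_cell = attn.mean(axis=0)  # average across heads, as in the paper
          # Sum per cluster so abundant populations are not diluted;
          # use the mean instead to ask which cluster matters *per cell*.
          return {c: per_cell[cluster_labels == c].sum()
                  for c in np.unique(cluster_labels)}
      ```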

    1. Within-method exact agreement on normalized relevance labels was modest (Figure 3; Table 2). The best agreement was between Claude Code runs 2 and 3 (54/73 orthogroups; 0.740), while the lowest was between Claude Code runs 1 and 3 (25/73; 0.342). Mean within-method agreement was in the same range for all three configurations (0.516–0.562), so no configuration was dramatically more reproducible than the others at the tier-label level. These results argue against relying on a single stochastic agent run for final biological claims, even when the input files and prompt are identical.

      Is within-method exact agreement really the best metric? Recommending against relying on a single stochastic agent run is fine, but what is the delta? Running many costs more; for what benefit?
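
      Chance-corrected and ordinal-aware agreement would be cheap to report alongside exact agreement, e.g. with scikit-learn (the tier encodings below are illustrative):

      ```python
      from sklearn.metrics import cohen_kappa_score

      # Tier labels from two runs over the same orthogroups, encoded as
      # background=0, low=1, watchlist=2, high=3 (values hypothetical).
      run_a = [0, 1, 1, 3, 2, 1, 0, 2]
      run_b = [0, 1, 2, 3, 2, 0, 0, 2]

      exact = sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)
      kappa = cohen_kappa_score(run_a, run_b)  # corrects for chance agreement
      # Quadratic weighting penalises distant-tier disagreements more than
      # near misses, which suits ordered tiers.
      kappa_w = cohen_kappa_score(run_a, run_b, weights="quadratic")
      print(f"exact {exact:.3f}  kappa {kappa:.3f}  weighted kappa {kappa_w:.3f}")
      ```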

    2. Although coverage was complete, calibration differed strongly across runs (Figure 2; Table 1). Claude App run 2 was highly conservative, assigning 67 of 73 orthogroups to a low or background tier and only one high call. Claude App runs 1 and 3 were less conservative, with 11 and 8 high calls, respectively. Claude Code with scientific skills produced fewer high calls overall (1, 3, and 2), but shifted substantially between low and watchlist labels across runs. Codex App with scientific skills showed the widest high-call range, from no high calls in run 2 to 12 high calls in run 3.

      How does temperature/nucleus sampling/effort affect these results? Did you control for potential variation in these parameters?

    3. Here we use a controlled, repeated-run comparison to evaluate three agent configurations as they were used on the same orthogroup annotation prompt. The goal is not to rank proprietary foundation models in general. Instead, we ask a practical question relevant to bioinformatics groups: when agents are asked to retrieve, integrate, and interpret a large set of complex protein annotations, where do they help, where do they fail, how consistent are repeated runs, and how should their outputs be merged into a defensible final annotation table?

      Great experiment! I wonder what metrics are reported and how representative/relevant those metrics are given real life tasks.

    4. Combining these evidence streams is routine, but the final biological interpretation is still difficult because many protein families are multidomain, repetitive, lineage-specific, or only indirectly connected to the process of interest.

      Absolutely!! A challenge of great importance.

  2. Mar 2026
    1. To reduce class imbalance, we exclude these longer sequences from the training dataset.

      Downsampling/excluding a minority class is an interesting decision. Why not include some of them, use some flavor of stratified sampling/curriculum learning, or train a specialized subnetwork/set of heads on the longer sequences with a reasonable split? How does the model generalize/perform on longer sequences?
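
      E.g. a weighted sampler would keep the long tail in training at a controlled rate instead of dropping it; a PyTorch sketch (the cutoff and weight are placeholders):

      ```python
      import torch
      from torch.utils.data import WeightedRandomSampler

      def length_balanced_sampler(seq_lens, long_cutoff=512, long_weight=0.25):
          """Down-weight long sequences rather than excluding them, so the
          model still sees the length regime it will face at evaluation."""
          seq_lens = torch.as_tensor(seq_lens, dtype=torch.float)
          weights = torch.where(seq_lens > long_cutoff,
                                torch.full_like(seq_lens, long_weight),
                                torch.ones_like(seq_lens))
          return WeightedRandomSampler(weights, num_samples=len(weights),
                                       replacement=True)

      # usage: DataLoader(dataset, sampler=length_balanced_sampler(lengths))
      ```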

    2. In our earlier work, we introduced DisPredict3.0, the most recent iteration of the DisPredict series, which integrates evolutionary representations derived from protein language models to improve the prediction of intrinsically disordered regions (IDRs) [5]. This approach achieved the top ranking on the Disorder NOX dataset in CAID2. Building on this foundation, we now present ESMDisPred, a structure-aware disordered protein predictor that incorporates embeddings from the Evolutionary Scale Modeling-2 (ESM2) language model [3]. ESM2 is considered the SOTA protein language model and has demonstrated exemplary performance in protein structure prediction (ESMFold).

      This is interesting; evolutionary context can be really informative here.
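
      For reference, pulling per-residue ESM2 embeddings with the public fair-esm package looks roughly like this (650M variant, toy sequence):

      ```python
      import torch
      import esm

      # Load a pretrained ESM2 model (650M-parameter variant).
      model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
      model.eval()
      batch_converter = alphabet.get_batch_converter()

      _, _, tokens = batch_converter([("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
      with torch.no_grad():
          out = model(tokens, repr_layers=[33])

      # Per-residue embeddings, dropping the BOS/EOS tokens.
      residue_reprs = out["representations"][33][0, 1:-1]
      print(residue_reprs.shape)  # (seq_len, 1280)
      ```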

    1. First, generating a textual analysis of a binding site based on a protein sequence. Second, predicting a plausible binding site conformation given a specific ligand. Third, synthesizing a functional description by integrating the protein sequence, the predicted conformation, and the ligand information.

      Nice breakdown tbh!

    2. Output: A natural language answer describing the protein’s function, activity, or binding mode under the specified ligand conditions.

      To what extent is this accurate / aligned with biological reality? Does generating natural language answers introduce a source of error/confounding? What happens as answers become shorter vs longer vs less/more complex?

    3. A central challenge is learning aligned and effective representations across different data types, such as learning effective binary descriptors that can maintain group fairness [27].

      To what extent has this been solved by better molecular representations? Proteins and ligands are still molecules, and wouldn't atom-level representations ensure consistency across these data types? Boltz/BoltzGen do leverage atom-level information...

    4. SE(3)-invariant encoder combined with a temporal-aware VQ-VAE style quantization module. This allows us to convert diverse binding pocket conformations (e.g., apo, holo, or intermediate states) into discrete tokens, effectively capturing their dynamic variations. Furthermore, we integrate standard SMILES string tokenization for small molecules, alongside specialized amino acid tokens and the native Llama3 text tokenizer, expanding the LLM’s vocabulary to encompass these crucial biological entities.

      Why the Llama3 tokenizer among all other choices? Seems odd methodologically? Why not something designed for this kind of purpose? https://arxiv.org/html/2409.15370v1
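
      To be fair, the conformation-to-token step itself is independent of the text tokenizer choice; it reduces to a nearest-codebook lookup, roughly (codebook size and dims are placeholders):

      ```python
      import torch

      def vq_tokenize(pocket_embeddings, codebook):
          """Map continuous pocket-conformation embeddings to discrete tokens
          via nearest-neighbour lookup in a learned codebook.

          pocket_embeddings: (n_frames, d) SE(3)-invariant encoder outputs
          codebook:          (n_codes, d) learned code vectors
          """
          dists = torch.cdist(pocket_embeddings, codebook)  # (n_frames, n_codes)
          return dists.argmin(dim=-1)                       # one token id per frame

      codebook = torch.randn(512, 64)       # e.g. 512 codes, 64-dim embeddings
      frames = torch.randn(10, 64)          # 10 pocket conformations
      print(vq_tokenize(frames, codebook))  # 10 discrete conformation tokens
      ```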

    1. Discrete diffusion objective: we experiment with two different masking techniques. The first is the standard discrete diffusion objective, where the masking fraction is sampled from a uniform distribution over (0, 1); in the second, we sample the masking fraction 80% of the time from a β(3, 9) distribution and 20% of the time from a uniform distribution over (0, 1). This approach, adapted from Hayes et al. (2024), aims to balance representation and generation capabilities. It allows the model to observe masking fractions across (0, 1), with an average masking fraction of 0.3. Both objectives improve the effectiveness of iterative denoising during sequence generation relative to standard MLM.

      Other objectives could have been chosen beyond ease of implementation; why this objective in particular? Why not a hybrid objective? What about retrieved-neighbor training?
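
      The mixture itself is trivial, and its mean masking fraction works out to 0.8 · 3/12 + 0.2 · 0.5 = 0.3; a sketch:

      ```python
      import torch

      def sample_mask_fraction(batch_size):
          """Sample masking fractions: 80% from Beta(3, 9), 20% from U(0, 1)."""
          beta = torch.distributions.Beta(3.0, 9.0).sample((batch_size,))
          unif = torch.rand(batch_size)
          use_beta = torch.rand(batch_size) < 0.8
          return torch.where(use_beta, beta, unif)

      print(sample_mask_fraction(100_000).mean())  # ≈ 0.3
      ```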

    2. Here, we propose a method to make them homology-aware. We introduce RAG-ESM, a retrieval-augmented framework that allows pretrained ESM2 protein language models to be conditioned on homologous sequences, using a minimal number of additional cross-attention parameters and minimal computational cost.

      This is an interesting idea. I wonder what the scaling looks like and what the efficacy of the augmentation is with respect to context window size and the quality of retrieval.
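
      My mental model of the mechanism (not necessarily the authors' implementation): a small cross-attention adapter that lets the query's frozen pLM hidden states attend to a retrieved homolog's embeddings:

      ```python
      import torch
      import torch.nn as nn

      class HomologCrossAttention(nn.Module):
          """Minimal cross-attention adapter: query-sequence hidden states
          attend to embeddings of a retrieved homologous sequence."""
          def __init__(self, d_model=1280, n_heads=8):
              super().__init__()
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.norm = nn.LayerNorm(d_model)

          def forward(self, query_states, homolog_states):
              # query_states:   (B, L_q, d) hidden states from the frozen pLM
              # homolog_states: (B, L_h, d) embeddings of the retrieved homolog
              ctx, _ = self.attn(query_states, homolog_states, homolog_states)
              return self.norm(query_states + ctx)  # residual keeps the pLM intact
      ```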
