    1. eLife Assessment

      This study presents SynaptoGen, a differentiable extension of connectome models that links gene expression, protein-protein interaction probabilities, synaptic multiplicity, and synaptic weights, and demonstrates its use in reinforcement learning agents and a C. elegans-inspired case study. The work is a valuable contribution to computational connectomics and neuro-inspired machine learning, with solid mathematical and computational evidence supporting the proposed optimization framework. However, the broader biological and synthetic-biology claims - particularly genomic control of synaptogenesis and drug-discovery applications - are currently overstated and would benefit from a more tempered framing and clearer articulation of biological limitations.

    2. Reviewer #1 (Public review):

      The authors address a set of important and challenging questions at the interface of (developmental) neuroscience, genetics, and computation. They ask how complex neural circuits could emerge from compact genomic information, and they outline a bold vision in which this process might eventually be harnessed to design synthetic biological intelligence through genetic control of synaptogenesis. These are significant and stimulating ideas that merit rigorous theoretical and experimental exploration.

      However, the present work does not convincingly engage with these questions at a mechanistic level. Most of the circuit formation aspects appear to be adopted from prior models, and it is not clear how the main methodological modifications (introducing synaptic conductance and stochastic formalisms) provide new conceptual insight into genomic specification of neural circuitry. The manuscript does not include significant biological data or validation to support the proposed framework, and the results provided instead use artificial reinforcement learning benchmarks, which do not appear informative with respect to the biological claims.

      Overall, while the manuscript raises intriguing themes and ambitions, the proposed model is conceptually disconnected from the biological problem it purports to address. The strength of evidence does not support the strong interpretative or translational claims, and substantial rethinking of the modeling framework, in particular its validation strategy, would be required for the work to match its claims of an improved understanding of the genomic basis of neural circuit formation and of an ability to engineer it.

    3. Reviewer #2 (Public review):

      In this manuscript, the authors built upon the Connectome Model literature and proposed SynaptoGen, a differentiable model that explicitly takes into account multiplicity and conductance in neural connectivity. The authors evaluated SynaptoGen through simulated reinforcement learning tasks and established its performance as often superior to two considered baselines. This work is a valuable addition to the field, supported by a solid methodology with some details and limitations missing.

      Major points:

      (1) The genetic features in the X and Y matrices in the CM were originally introduced as combinatorial gene expression patterns that correspond to the presence and even absence of a subset of genes. The authors oversimplify this original scope by only considering single-gene expression features. While this was arguably a reasonable first approximation for a case study of gap junctions in C. elegans, it is by no means a plausible assumption for chemical synapses. As the authors appear to motivate their model by chemical synapses that have polarities, they should either consider combinatorial rules in the model or at least present this explicitly as a key limitation of the model. Omitting combinatorial effects also renders the presented "bioplausible" baseline much less bioplausible, likely calling for a different name.

      (2) It is not fully explained how Equation (11) is obtained, even conceptually. It is unclear why \bar{B} and \bar{G} should be element-wise multiplied together, both already being expected values. Moreover, the authors acknowledged in lines 147-149 that the components of \bar{G} actually depend on gene expression X, which is a component in \bar{B}, so the logic here seems circular.

      (3) The authors considered two baselines, namely SNES and a bioplausible control. However, it would be of interest to also investigate: a) Vanilla DQN with the same size trained on the same MLP, to judge whether the biological insights behind SynaptoGen parameterization add value to performance. b) Using Equation (7) instead of Equation (11) to construct the weight matrices, to judge whether incorporating the conductance adds value to performance.
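
      On point (2), one way to make the concern concrete is to write out what an element-wise product of two expectations implicitly assumes. The sketch below assumes that Equation (11) has the form \bar{W} = \bar{B} \odot \bar{G}, as the description above suggests; the symbol W and the entrywise notation are illustrative and are not taken from the manuscript.

```latex
\begin{align}
\mathbb{E}\left[B_{ij}\,G_{ij}\right]
  &= \mathbb{E}\left[B_{ij}\right]\mathbb{E}\left[G_{ij}\right]
     + \operatorname{Cov}\left(B_{ij}, G_{ij}\right) \\
  &= \bar{B}_{ij}\,\bar{G}_{ij}
     + \operatorname{Cov}\left(B_{ij}, G_{ij}\right)
\end{align}
```

      Under this reading, \bar{B} \odot \bar{G} equals the expected weight only when the covariance term vanishes for every pair (i, j). A shared dependence of both factors on the gene expression X is one possible source of a nonzero covariance.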

    4. Reviewer #3 (Public review):

      Summary

      Boccato et al. present an ambitious and thoughtfully developed framework, SynaptoGen, which proposes a differentiable model of synaptogenesis grounded in gene-expression vectors, protein interaction probabilities, and conductance rules. The authors aim to bridge the gap between computational connectomics and synthetic biological intelligence by enabling gradient-based optimization of genetically encoded circuit architectures. They support this goal with mathematical derivations, simulation experiments across several RL benchmarks, and a biologically grounded validation using C. elegans adhesion-molecule co-expression data. The paper is timely and conceptually compelling, offering a unified formulation of synaptic multiplicity and synaptic weight formation that can be integrated directly into learning systems.

      Strengths

      (1) Well-motivated framework with clear conceptual contributions.

      (2) Rigorous mathematical development.

      (3) Compelling empirical validation.

      (4) Excellent framing and discussion of future impact.

      Weaknesses

      (1) Overstated claims in the abstract and discussion.

      (2) Ambiguity in "first of its kind" assertions.

    1. eLife Assessment

      This is an important contribution that largely confirms prior evidence that word recognition - a cornerstone of development - improves across early childhood and is related to vocabulary growth. This study is distinguished by its use of a large, multi-study dataset that is uncommon in prior research on cognitive development. It provides solid evidence that speed, accuracy, and consistency of word learning improve with age, and will therefore prove of interest to those studying language, and more broadly, perception and development.

    2. Reviewer #1 (Public review):

      Summary:

      The study examined the extent to which children's word recognition skill improves across early development, becoming faster, more accurate and less variable, and the extent to which word recognition skill is related to children's concurrent and later vocabulary knowledge.

      Strengths:

      The main strength of the study comes from the dataset, which recycles previously collected data from 24 studies to examine the development of word recognition skill using data from 1963 children. This maximizes the impact of previously collected data while also allowing the study to reliably ask big-picture questions on the development of word recognition skill and its relation to chronological age and vocabulary knowledge. Data analysis is rigorous, thought through and very clearly described. Data and code necessary to reproduce the manuscript are shared on the project's GitHub.

      Weaknesses:

      The limitations of the study are acknowledged to some extent, but this acknowledgment needs to be strengthened and carried throughout the manuscript. Thus, in the discussion, the authors note that the approach is observational and exploratory, and highlight for me a key alternative explanation of the findings, namely that faster children could be faster due to their larger vocabulary, rather than faster children learning more words. Indeed, the latter explanation for the relationship is called into question, given that growth in speed was not related to growth in vocabulary. Here, the authors note that the null result may be related to the fact that they do not have sufficiently precise estimates of growth slopes, rather than taking seriously the alternative explanation that there may not be as causal a link between being a faster word learner and a better word learner (learning more words). This is especially so since, but correct me if I'm wrong here, the current vocabulary size is not taken into consideration in the model examining vocabulary growth. Given the increasing number of studies showing that current vocabulary knowledge predicts vocabulary growth (Laing, Kalinowski et al, Siew & Vitevitch), one simple alternative explanation is that current vocabulary knowledge predicts both current word recognition skill and later vocabulary knowledge. Is there anything in the data speaking against this hypothesis?

      Equally, while the SEM examines vocabulary growth controlling for age, I wonder about the other way around. What would happen to the effect of age on word recognition skill (in the LME model, S8) if one were to add concurrent vocabulary size? So does chronological age explain word recognition skill or vocabulary knowledge? Right now, the manuscript describes this effect as purely related to chronological age, but is it age per se or other cognitive abilities, including a key change across development, namely, vocabulary size? Thus, the presentation of the skill learning hypothesis suggests that age is a proxy for experience, while you actually have here a very nice proxy for experience in terms of children's vocabulary size.

      Critically, while the discussion is more nuanced, the way the abstract is concluded and the way the Introduction is phrased suggest that the study is able to answer a causal question, which, as the authors themselves note, is not possible. The abstract, for instance, states that word recognition becomes faster, more accurate and less variable...consistent with a process of skill learning. And also that this skill plays a role in supporting early language learning, which is very causal language. I don't think you can really claim that you are testing the two hypotheses you suggest here. The work is definitely embedded in the context of these hypotheses, but are you really able to test them? My worry is that, while the discussion is more nuanced, this study will then be cited down the line as showing that children learn more words because they are faster at recognizing words; anything that you can do to temper such interpretations would be good for the literature. For me, this should not just be relegated to the discussion but should be touched upon in the abstract and Introduction.

      Finally, it would help to talk more about the mechanisms at work in any relationship between word recognition and language learning. It seems to me that this would rely on some predictive processing framework, given the description on page 4, and it would be good to make this clear (the faster and more accurately you can recognize a ball, the better you can use this evidence to infer the speaker's intended meaning). Equally, when referring to word recognition, it would be good to clarify what this refers to - how well a child knows what a word refers to (and in the context of LWL, what it does not refer to) or how quickly it directs attention to what is referred to.

      With regards to the data, I wonder if there is a clustering of kids past 24 months that is happening here, looking at Figures 1 and 2, where it seems like there is less change past the 24-month point. Is there any way to look at whether the effect of age or vocabulary on word recognition is not linear but asymptotic?
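
      One simple way to probe the linear-versus-asymptotic question is to fit both functional forms and compare their residuals. The sketch below is illustrative only: the data are synthetic, the saturating form y = a + b*exp(-age/tau) and the candidate tau values are assumptions, and the function names are hypothetical. It is not the authors' mixed-effects analysis.

```python
import math

def fit_line(z, y):
    """Ordinary least squares for y = a + b*z; returns (a, b, rss)."""
    n = len(z)
    mean_z, mean_y = sum(z) / n, sum(y) / n
    s_zz = sum((zi - mean_z) ** 2 for zi in z)
    b = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / s_zz
    a = mean_y - b * mean_z
    rss = sum((yi - (a + b * zi)) ** 2 for zi, yi in zip(z, y))
    return a, b, rss

def fit_saturating(age, y, taus):
    """Grid-search tau in y = a + b*exp(-age/tau); a is the asymptote."""
    best = None
    for tau in taus:
        # For a fixed tau the model is linear in exp(-age/tau).
        z = [math.exp(-t / tau) for t in age]
        a, b, rss = fit_line(z, y)
        if best is None or rss < best[3]:
            best = (a, b, tau, rss)
    return best

# Synthetic accuracy values that level off with age (months); not real data.
ages = [12, 18, 24, 30, 36, 42, 48]
accuracy = [0.9 - 0.5 * math.exp(-t / 8.0) for t in ages]

_, _, rss_linear = fit_line(ages, accuracy)
asymptote, _, tau, rss_saturating = fit_saturating(
    ages, accuracy, taus=[2, 4, 8, 16, 32]
)
print(rss_linear, rss_saturating, asymptote, tau)
```

      Because the saturating model has one more parameter than the line, a real comparison would need a penalized criterion (for example AIC) or held-out data, and would need to respect the nesting of observations within children and studies.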

    3. Reviewer #2 (Public review):

      Summary:

      This paper presents a series of analyses of a large dataset combining many prior studies of early word recognition (Peekbank). The analyses demonstrate that the speed, accuracy and consistency of word learning improve with age. Moreover, the speed of word learning early in development was related to vocabulary growth over time.

      Strengths:

      A key strength of the paper is the use of a large multi-study dataset. This is particularly valuable in the field of early cognitive development, which has (due to practical limitations) often been based on small-scale studies that necessarily provide a shaky foundation for conclusions. The analyses are also well-motivated.

      Weaknesses:

      The weaknesses I saw are primarily in some aspects of the conceptual motivation for the research.

      First, I wasn't entirely clear about what the authors meant by "word recognition ability". For much of the manuscript (including the use of the term "word recognition ability" itself), this comes across as an intrinsic ability or skill that improves with development. Alternatively, the speed and accuracy metrics taken from studies in Peekbank might capture children's increasing knowledge of the common, concrete words typically used in these studies. To me, this is a somewhat different construct from a general skill at recognizing words. It would be helpful if the authors could clarify which construct they intend to capture, or if it is not possible to distinguish between these constructs from the Peekbank data.

      Second, and relatedly, if the source of the age-related improvements is increasing experience with the common concrete words used in the Peekbank studies, then one might expect word recognition and improvements with age to be related to word frequency, given that more frequent words are experienced more often. Word frequency predicts word knowledge when assessed using CDI data. Can effects of frequency be detected in Peekbank word recognition metrics? If not, why? Similarly, is the speed and accuracy of word recognition in Peekbank data related to CDI-derived word age of acquisition, and again, if not, why?

      Finally, there is a bit of a risk of the main findings of this paper coming across as a foregone conclusion. I.e., how could it be otherwise that word recognition improves with development?

    1. eLife Assessment

      This important paper uses a new computational method that integrates bulk sequencing and single-cell sequencing data to provide refined gene expression datasets for 52 neuron classes in C. elegans. The paper's findings are convincing, presenting an approach that alleviates a key shortcoming of single-cell RNA sequencing. While the datasets have some limitations that the authors acknowledge, the new methodology and refined datasets will be important resources for those interested in understanding how gene expression shapes neuronal morphology and physiology.

    2. Reviewer #1 (Public review):

      This is an interesting manuscript aimed at improving the transcriptome characterization of 52 C. elegans neuron classes. Previous single-cell RNA seq studies already uncovered transcriptomes for these, but the data are incomplete, with a bias against genes with lower expression levels. Here, the authors use cell-specific reporter combinations to FACS purify neurons and use bulk RNA sequencing to obtain better sequencing depth. This reveals more rare transcripts, as well as non-coding RNAs, pseudogenes, etc. The authors develop computational approaches to combine the bulk and scRNA transcriptome results to obtain more definitive gene lists for the neurons examined.

      To ultimately understand features of any cell, from morphology to function, an understanding of the full complement of the genes it expresses is a prerequisite. This paper gets us a step closer to this goal, assembling a current "definitive list" of genes for a large proportion of C. elegans neurons. The computational approaches used to generate the list are based on reasonable assumptions, the data appear to have been treated appropriately statistically, and the conclusions are generally warranted. I have a few issues that the authors may choose to address:

      (1) As part of getting rid of cross-contamination in the bulk data, the authors model the scRNA data, extrapolate it to the bulk data and subtract out "contaminant" cell types. One wonders, however, given that low expressed genes are not represented in the scRNA data, whether the assignment of a gene to one or another cell type can really be made definitive. Indeed, it's possible that a gene is expressed at low levels in one cell, and at high levels in another, and would therefore be considered a contaminant. The result would be to throw out genes that actually are expressed in a given cell type. The definitive list would therefore be a conservative estimate, and not necessarily the correct estimate.

      (2) It would be quite useful to have tested some genes with lower expression levels using in vivo gene-fusion reporters to assess whether the expression assignments hold up as predicted. i.e. provide another avenue of experimentation, non-computational, to confirm that the decontamination algorithm works.

      (3) In many cases, each cell class would be composed of at least 2 if not more neurons. Is it possible that differences between members of a single class would be missed by applying the cleanup algorithms? Such transcripts would be represented only in a fraction of the cells isolated by scRNAseq, and might then be considered not real?

      (4) I didn't quite catch whether the precise staging of animals was matched between the bulk and scRNAseq datasets. Importantly, there are many genes whose expression is highly stage specific or age specific so that even slight temporal differences might yield different sets of gene expression.

      (5) To what extent does FACS sorting affect gene expression? Can the authors provide some controls?

      Comments on revisions:

      The authors have made reasonable arguments in response to my questions, and have done some additional experiments. I believe that although they did not do so, they could have generated additional reporters for the lower expressed genes, that would have validated their method of data integration. Nonetheless, I think the paper is rigorous and will be of use to the community.

    3. Reviewer #2 (Public review):

      Summary:

      This study from the CeNGEN consortium addresses several limitations of single-cell RNA (scRNA) and bulk RNA sequencing in C. elegans with a focus on cells in the nervous system. scRNA datasets can give very specific expression profiles, but detecting rare and non-polyA transcripts is difficult. In contrast, bulk RNA sequencing on isolated cells can be sequenced to high depth to identify rare and non-polyA transcripts but frequently suffers from RNA contamination from other cell types. In this study, the authors generate a comprehensive set of bulk RNA datasets from 53 individual neurons isolated by fluorescence activated cell sorting (FACS). The authors combine these datasets with a previously published scRNA dataset (Taylor et al., 2021) to develop a novel method, called LittleBites, to estimate and subtract contamination from the bulk RNA data. The authors validate the method by comparing detected transcripts against gold-standard datasets on neuron-specific and non-neuronal transcripts. The authors generate an "integrated" list of protein-coding expression profiles for the 53 neuron sub-types, with fewer but higher confidence genes compared to expression profiles based only on scRNA. Also, the authors identify putative novel pan-neuronal and cell-type specific non-coding RNAs based on the bulk RNA data. LittleBites should be generally useful for extracting higher confidence data from bulk RNA-seq data in organisms where extensive scRNA datasets are available. The additional confidence in neuron-specific expression and non-coding RNA expands the already great utility of the neuronal expression reference atlas generated by the CeNGEN consortium.

      Strengths:

      The study generates and analyzes a very comprehensive set of bulk RNA datasets from individual fluorescently tagged transgenic strains. These datasets are technically challenging to generate and significantly expand our knowledge of gene expression, particularly in cells that were poorly represented in the initial scRNA-seq datasets. Additionally, all transgenic strains are made available as a resource from the Caenorhabditis Genetics Center (CGC).

      The study uses the authors' extensive experience with neuronal expression to benchmark their method for reducing contamination utilizing a set of gold-standard validated neuronal and non-neuronal genes. These gold-standard genes will be helpful for benchmarking any C. elegans gene expression study.

      Weaknesses:

      The bulk RNA-seq data collected by the authors has high levels of contamination and, in some cases, is based on very few cells. The methodology to remove contamination partly makes up for this shortcoming, but the high background levels of contaminating RNA in the FACS-isolated neurons limit the confidence in cell-specific transcripts.

      The study does not experimentally validate any of the refined gene expression predictions, which was one of the main strengths of the initial CeNGEN publication (Taylor et al, 2021). No validation experiments (e.g., fluorescence reporters or single molecule FISH) were performed for protein-coding or non-coding genes, which makes it difficult for the reader to assess how much gene predictions are improved, other than for the gold standard set, which may have specific characteristics (e.g., bias toward high expression as they were primarily identified in fluorescence reporter experiments).

      The study notes that bulk RNA-seq data, in contrast to scRNA-seq data, can be used to identify which isoforms are expressed in a given cell. Although not included in this manuscript, two bioRxiv papers have used the generous openness of the CeNGEN consortium to study alternative splicing in C. elegans neurons [bioRxiv, 2024.2005.2016.594567 (2024) and bioRxiv, 2024.2005.2016.594572 (2024)], nicely showing the strengths of the data.

      Comments on revisions: I agree that the paper is improved.

    4. Reviewer #3 (Public review):

      Summary

      This study aims to overcome key limitations of single-cell RNA-seq in C. elegans neurons (especially the under-detection of lowly expressed and non-polyadenylated transcripts and residual contamination) by integrating bulk RNA-seq from FACS-isolated neuron types with an existing scRNA-seq atlas. The authors introduce LittleBites, an iterative, reference-guided decontamination algorithm that uses a single-cell reference together with ground-truth reporter datasets to optimize subtraction of contaminating signal from bulk profiles. They then generate an "Integrated" dataset that combines the sensitivity of bulk data with the specificity of scRNA-seq and use it to call neuron-specific expression for protein-coding genes, "rescued" genes not detected in scRNA-seq, and multiple classes of non-coding RNAs across 53 neuron classes. All data, code, and thresholded matrices are made publicly available to enable community reuse.

      Strengths

      (1) Conceptual advance and useful resource. The work demonstrates in a concrete way how bulk and single-cell datasets can be combined to overcome the weaknesses of each approach, and delivers a high-resolution transcriptomic resource for a substantial fraction of C. elegans neuron classes. The integrated matrices, thresholded expression calls, and non-coding RNA catalog will be useful both for basic neurobiology and for method developers.

      (2) Careful benchmarking and transparency. The revised manuscript includes extensive benchmarking of LittleBites and the Integrated dataset against multiple independent "ground-truth" sets: neuron-specific reporter lines, curated non-neuronal markers, and ubiquitous genes. The authors evaluate AUROCs over a wide range of thresholds, explain ROC/AUROC metrics for non-specialists, and quantify how integration affects both sensitivity and specificity relative to scRNA-seq alone.

      (3) Improved methodological clarity. In response to review, the authors now provide a much more intuitive description of the LittleBites algorithm, including a stepwise explanation of (1) contamination estimation via NNLS using single-cell references, (2) weighted subtraction tuned by a learning-rate parameter, and (3) performance optimization based on AUROC against ground-truth genes. This makes the approach accessible to readers who are not computational specialists and will facilitate re-implementation.

      (4) Systematic analysis of reference dependence. The authors explicitly address the concern that LittleBites depends on the completeness and accuracy of the scRNA-seq reference. They examine how performance varies with cluster size and by simulated degradation of the reference (e.g., reducing the number of cells per cluster), and show that AUROCs remain robust, but that gene-level assignments are more variable for clusters represented by fewer cells. This is an important and honest characterization of when the method is reliable and when users should be cautious.

      (5) Additional biological context. The manuscript now more clearly situates the dataset in the context of previous and ongoing work. In particular, the authors highlight that other groups have already used these bulk data to discover and validate cell-type-specific alternative splicing events, strengthening the case that the data are biologically meaningful beyond the immediate analyses presented here. The expanded analysis of non-coding RNAs and GPCR pseudogenes also adds biological interest.

      (6) Improved handling and documentation of "unexpressed" genes. The authors have trimmed the original list of 4,440 genes called "unexpressed" in scRNA-seq to a higher-confidence subset and provide new supplementary tables that include gene identities and tissue annotations. They also use a curated set of non-neuronal markers to estimate residual contamination and show that most such markers are not detected in the integrated data, with only a small number of apparent false positives remaining.
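
      The three steps summarized under strength (3) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' LittleBites implementation: the function names, the projected-gradient NNLS solver, the stopping rule, and the four-gene example are all hypothetical, and only the overall shape (NNLS estimate, learning-rate-weighted subtraction, AUROC check against ground-truth genes) follows the description above.

```python
def nnls(A, b, iters=500):
    """Approximate min ||A x - b||^2 subject to x >= 0 by projected gradient.
    A is a list of rows (genes x cell types); b is one bulk profile."""
    n_rows, n_cols = len(A), len(A[0])
    # 1 / trace(A^T A) is a safe step size (trace bounds the top eigenvalue).
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_cols
    for _ in range(iters):
        resid = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
                 for i in range(n_rows)]
        grad = [sum(A[i][j] * resid[i] for i in range(n_rows))
                for j in range(n_cols)]
        x = [max(xj - step * gj, 0.0) for xj, gj in zip(x, grad)]
    return x

def auroc(scores, labels):
    """Probability that a true-positive gene outranks a true-negative gene."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def decontaminate(bulk, reference, target, labels,
                  learning_rate=0.5, max_rounds=20):
    """Repeat: estimate non-target contributions, subtract a fraction of
    them, and keep the result only while AUROC on ground truth improves."""
    current = list(bulk)
    best_auc = auroc(current, labels)
    for _ in range(max_rounds):
        weights = nnls(reference, current)
        # Step 1: estimated contamination = signal from non-target types.
        contam = [sum(w * row[j] for j, w in enumerate(weights) if j != target)
                  for row in reference]
        # Step 2: weighted subtraction, clipped at zero.
        candidate = [max(c - learning_rate * k, 0.0)
                     for c, k in zip(current, contam)]
        # Step 3: accept only if AUROC against ground truth improves.
        auc = auroc(candidate, labels)
        if auc <= best_auc:
            break
        current, best_auc = candidate, auc
    return current, best_auc

# Toy example: 4 genes, 2 cell types (column 0 = target, 1 = contaminant).
reference = [[10.0, 0.0], [2.0, 0.0], [0.0, 6.0], [0.0, 9.0]]
bulk = [10.0, 2.0, 3.0, 4.5]   # target profile plus 0.5 x contaminant
labels = [1, 1, 0, 0]          # ground truth: genes 0 and 1 are expressed
start_auc = auroc(bulk, labels)
cleaned, best_auc = decontaminate(bulk, reference, target=0, labels=labels)
print(start_auc, best_auc, cleaned)
```

      In this toy case the contaminant-only genes are shrunk over successive rounds until the ground-truth genes rank above them. The sketch does not model the cases raised elsewhere in the reviews, such as genes expressed at low levels in the target and at high levels in a contaminating cell type.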

      Weaknesses

      (1) Novel assignments remain predictive rather than experimentally validated. Although the authors have strengthened their benchmarking and refer to external work that validates some splicing patterns from these data, the large sets of newly assigned lowly expressed genes and non-coding RNAs, particularly those rescued from the "unexpressed" gene pool, are still inferred from computational criteria (thresholding plus correlation-based decontamination) rather than direct orthogonal assays (e.g., smFISH, in situ hybridization, or reporter lines). This is understandable given scale and cost, but it means that many of these calls should be interpreted as well-supported predictions, not definitive expression maps. The revised manuscript acknowledges this, and a dedicated "Limitations of this study" subsection will further clarify this point for readers.

      (2) Reduced stability for neuron types with sparse single-cell representation. The authors' new analyses show that while integration improves overall correlation and AUROC across a wide range of neuron types, gene-level assignments are less stable for neuron classes represented by relatively few cells in the scRNA-seq reference. For such neuron types, both false negatives and false positives are more likely, and users should be cautious when interpreting cell-type-specific expression differences based solely on these calls.

      (3) Residual contamination and misclassification are not completely eliminated. Despite the careful design of LittleBites and the additional correlation-based decontamination of "unexpressed" genes, the authors' benchmarking against curated non-neuronal markers shows that a small fraction of putative non-neuronal genes remains detectable even at stricter thresholds, and some bona fide neuronal genes are removed as likely contaminants. The new supplementary tables documenting "unexpressed" genes and their tissue annotations, together with explicit statements about residual error rates and the predictive nature of these classifications, help users to judge the reliability of specific genes, but they also underscore that the dataset is not a perfect ground truth.

      (4) Scope and coverage remain incomplete. As the authors note, the dataset covers 53 neuron classes and does not fully represent all 302 neurons or all known neuron subtypes. In addition, bulk samples represent pools of neurons, and so the approach cannot resolve within-class heterogeneity or subtype-specific expression within those pools. These are inherent limitations of the current experimental design rather than flaws in the analysis, but they are important for readers to keep in mind when using the resource.

      Overall, the revised manuscript presents solid evidence for the main methodological and resource claims, with clearly articulated limitations. The work is likely to have a valuable impact on the C. elegans community and provides a template for integrating bulk and single-cell data in other systems.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) As part of getting rid of cross-contamination in the bulk data, the authors model the scRNA data, extrapolate it to the bulk data and subtract out "contaminant" cell types. One wonders, however, given that low expressed genes are not represented in the scRNA data, whether the assignment of a gene to one or another cell type can really be made definitive. Indeed, it's possible that a gene is expressed at low levels in one cell, and high levels in another, and would therefore be considered a contaminant. The result would be to throw out genes that actually are expressed in a given cell type. The definitive list would therefore be a conservative estimate, and not necessarily the correct estimate.

      We agree that the various strategies we employ do not result in perfect annotation of gene expression. However, despite their limitations, they are significantly better than either the single cell or the bulk data alone. We represent these strengths and shortcomings throughout the manuscript (for example, in ROC curves).

      (2) It would be quite useful to have tested some genes with lower expression levels using in vivo gene-fusion reporters to assess whether the expression assignments hold up as predicted. i.e. provide another avenue of experimentation, non-computational, to confirm that the decontamination algorithm works.

      We agree that evaluating only highly-expressed genes might introduce bias. We used a large battery of in vivo reporters, made with best-available technology (CRISPR insertion of the fluorophore into the endogenous locus) to evaluate our approaches. These reporters were constructed without bias in terms of gene expression and therefore represent both high and low expression levels. These data are represented throughout the manuscript (for example, in ROC curves). Details about the battery of reporters may be found in Taylor et al 2021. In addition to these reporters, this manuscript also generates and analyzes two other types of gene sets: non-neuronal and ubiquitous genes. Again, these genes are selected without bias toward gene expression, and the techniques presented here are benchmarked against them as well, with positive results.

      (3) In many cases, each cell class would be composed of at least 2 if not more neurons. Is it possible that differences between members of a single class would be missed by applying the cleanup algorithms? Such transcripts would be represented only in a fraction of the cells isolated by scRNAseq, and might then be considered not real.

      For the data set presented in this manuscript, all cells of a single neuron type were labeled and isolated together by FACS, and sequencing libraries were constructed from this pool of cells. Thus, potential subtypes within a particular type (when that type includes more than one cell) cannot be resolved by this method. By contrast, such subtypes were in some cases resolved in the single cell approach. To make the two data sets compatible with each other, for the single cell data we combined any subtypes together. We state in the Methods:

      “For this work, single cell clusters of neuron subtypes were collapsed to the resolution of the bulk replicates (example: VB and VB1 clusters in the single cell data were treated as one VB cluster).”

      (4) I didn't quite catch whether the precise staging of animals was matched between the bulk and scRNAseq datasets. Importantly, there are many genes whose expression is highly stage-specific or age-specific so even slight temporal differences might yield different sets of gene expression.

      We agree that accurate staging is critically important for valid comparisons between data sets and have included an additional supplemental table with staging metadata for each sample. The staging protocol used for the bulk data set was initially employed to generate scRNA seq data and should be comparable. An additional description of our approach is now included in Methods:

      “Populations of synchronized L1s were grown at 23 C until reaching the L4 stage on 150 mM 8P plates inoculated with Na22. The time in culture to reach the L4 stage varied (40.5-49 h) and was determined for each strain. 50-100 animals were inspected with a 40X DIC objective to determine developmental stage as scored by vulval morphology (Mok et al., 2025). Cultures were predominantly composed of L4 larvae but also typically included varying fractions of L3 larvae and adults.”

      We have also updated supplementary table 1 to include additional information about each sort including the observed developmental stages and their proportions when available, the temperature the worms were grown at, the genotype of each experiment, and the number of cells collected in FACS.

      (5) To what extent does FACS sorting affect gene expression? Can the authors provide some controls?

      We appreciate this suggestion. We agree that FACS sorting (and also dissociation of the animals prior to sorting) might affect gene expression, particularly of stress-related transcripts. We note that dissociation and FACS sorting were also used to collect cells for our single cell data set (Taylor et al 2021). Clean controls for this approach can be prohibitively difficult to collect: dissociation and FACS inevitably change the proportion of cell types present in the sample, and for bulk sequencing it is difficult, even with deconvolution approaches, to separate changes in gene expression caused by dissociation and FACS from changes caused by differences in cell type composition. We regrettably omitted a discussion of these issues in the manuscript. We now write in the Results:

      “The dissociation and FACS steps used to isolate neuron types induce cellular stress responsive pathways (van den Brink et al., 2017; Kaletsky et al., 2016, Taylor 2021). Genes associated with this stress response (Taylor 2021) were not removed from downstream analyses, but should be viewed with caution.”

      Reviewer #2 (Public review):

      The bulk RNA-seq data collected by the authors has high levels of contamination and, in some cases, is based on very few cells. The methodology to remove contamination partly makes up for this shortcoming, but the high background levels of contaminating RNA in the FACS-isolated neurons limit the confidence in cell-specific transcripts.

      We agree that these are the limitations of the source data. One of the manuscript’s main goals is to analyze and refine these source data, reducing these limitations and quantifying the results.

      The study does not experimentally validate any of the refined gene expression predictions, which was one of the main strengths of the initial CenGEN publication (Taylor et al, 2021). No validation experiments (e.g., fluorescence reporters or single molecule FISH) were performed for protein-coding or non-coding genes, which makes it difficult for the reader to assess how much gene predictions are improved, other than for the gold standard set, which may have specific characteristics (e.g., bias toward high expression as they were primarily identified in fluorescence reporter experiments).

      We agree that evaluating only highly-expressed genes might introduce bias. We used a large battery of in vivo reporters, made with best-available technology (CRISPR insertion of the fluorophore into the endogenous locus) to evaluate our approaches. These reporters were constructed without bias in terms of gene expression and therefore represent both high and low expression levels. These data are represented throughout the manuscript (for example, in ROC curves). Details about the battery of reporters may be found in Taylor et al 2021. In addition to these reporters, this manuscript also generates and analyzes two other types of gene sets: non-neuronal and ubiquitous genes. Again, these genes are selected without bias toward gene expression, and the techniques presented here are benchmarked against them as well, with positive results.

      The study notes that bulk RNA-seq data, in contrast to scRNA-seq data, can be used to identify which isoforms are expressed in a given cell. However, no analysis or genome browser tracks were supplied in the study to take advantage of this important information. For the community, isoform-specific expression could guide the design of cell-specific expression constructs or for predictive modeling of gene expression based on machine learning.

      We strongly agree that these datasets allow for new discoveries in neuronal splicing patterns and regulators, which are explored further in other publications from our group and other research groups in the field. We did not sufficiently highlight these works in the body of our text, and have added a reference in the discussion: “In addition, the bulk RNA-seq dataset contains transcript information across the gene body, which parallel efforts have used to identify mRNA splicing patterns that are not found in the scRNA-seq dataset.” These works can be found in references 26 and 27.

      (1) The study relies on thresholding to determine whether a gene is expressed or not. While this is a common practice, the choice of threshold is not thoroughly justified. In particular, the choice of two uniform cutoffs across protein-encoding RNAs and of one distinct threshold for non-coding RNAs is somewhat arbitrary and has several limitations. This reviewer recommends the authors attempt to use adaptive threshold-methods that define gene expression thresholds on a per-gene basis. Some of these methods include GiniClust2, Brennecke's variance modeling, HVG in Seurat, BASiCS, and/or MAST Hurdle model for dropout correction.

      We appreciate the reviewer’s suggestion, and would note that the integrated data currently incorporates some gene-specific weighting to identify gene expression patterns, as the single-cell data are weighted by maximum expression for each gene prior to integration with the LittleBites cleaned data. This gene level normalization markedly improved gene detection accuracy, and is discussed in depth in our 2021 paper “Molecular topography of an entire nervous system”. We previously explored several methods for setting gene-specific thresholds for identifying gene expression patterns in the integrated dataset. Unfortunately, we found that none of the tested methods outperformed setting “static” thresholds across all genes in the integrated dataset. They also tended to increase false positive rates for some low abundance genes, where gene-specific thresholding can tend towards calling a gene expressed in at least one cell type when it is actually not expressed in any of the cell types present. These methods are likely to provide better results for expanded datasets that cover all tissue types (where one might reasonably expect that a gene is expressed in at least one sample).

      (2) Most importantly, the study lacks independent experimental validation (e.g., qPCR, smFISH, or in situ hybridization) to confirm the expression of newly detected lowly expressed genes and non-coding RNAs. This is particularly important for validating novel neuronal non-coding RNAs, which are primarily inferred from computational approaches.

      We agree that smFISH and related in situ validation methods would be an asset in this analysis. Unfortunately, because most ncRNAs are very short, they are prohibitively difficult to measure accurately with smFISH. Many ncRNAs we attempted to assay with smFISH methods can bind at most 3 fluorescent probes, a signal that was not reliably distinguishable from background autofluorescence in the worm. Many published methods for smFISH signal amplification have not been optimized for C. elegans, and the tough cuticle is a major barrier for those efforts.

      (3) The novel biology is somewhat limited. One potential area of exploration would be to look at cell-type specific alternative splicing events.

      We appreciate this suggestion. Indeed, as we put our source data online prior to publishing this manuscript, two published papers already use this source data set to analyze alternative splicing. Further, these works include validation of splicing patterns observed in this source data, indicating the biological relevance of these data sets.

      (4) The integration method disproportionately benefits neuron types with limited representation in scRNA-seq, meaning well-sampled neuron types may not show significant improvement. The authors should quantify the impact of this bias on the final dataset.

      We agree that cell-types that are well represented in the single-cell dataset tend to have fewer new genes identified in the Integrated dataset than “rare” cell-types in the single cell data. However, we would note that cell-types that are highly abundant in the single-cell data appear to become increasingly vulnerable to non-neuronal false positives, and that integration’s primary effect in high abundance cell-types appears to be reducing the false positive rate for non-neuronal genes. Thus, we suggest that integration benefits all cell-types across the spectrum of single-cell abundance. The false positives are likely a side-effect of normalization steps in the single-cell dataset, which is moderated by using the LittleBites cleaned bulk samples as an orthogonal measurement. The benefit of integration for cell-types with low abundance in the single-cell dataset is now quantified, and the benefits of integration for low and high abundance cell-types from the single cell data are described in the following section (p.13):

      “To test the stability of LittleBites cleanup across different single-cell reference dataset qualities, we ran the algorithm on a set of bulk samples by first subsetting the corresponding single-cell cluster’s population to 10, 50, 100, or 500 cells. We performed this process 500 times for each subsampling rate for each sample (2000 total runs per sample). We found that testing gene AUROC values are stable across reference cluster sizes (Fig. 2D), suggesting that even if the target cell type is rarely represented in a single cell reference, accurate cleaning is still possible. However, comparing gene level stability across target cluster population levels reveals that low abundance references have higher gene level variance (Fig. 2E), lower purity estimates (Fig. S2F), higher variance in the mean expression across genes (Fig. S2G), and they tend to have lower overall expression (suggesting more aggressive subtraction) (Fig. S2H). This indicates that while binary gene calling is improved even if the reference cluster is small, users should be cautious when using fewer than 100 cells in their single cell reference cluster as the resulting cleanup is less stable.”

      (5) The authors employ a logit transformation to model single-cell proportions into count space, but they need to clarify its assumptions and potential pitfalls (e.g., how it handles rare cell types).

      We agree that the assumptions and pitfalls of the logit model are key for evaluating its usefulness, especially for cell types that are rarely captured in the single-cell dataset. The assumptions and pitfalls are described in the methods section, but we regretfully omitted any mention of those pitfalls in the results, which we have now rectified.

      The description in the methods section is: “We applied this formula to our real single cell dataset and used this equation to transform proportion measures of gene expression into a count space to generate the Prop2Count dataset for downstream analysis and integration with bulk datasets. This procedure allows for proportions data to be used in downstream analyses that work with counts datasets. However, it does limit the range of potential values that each gene can have, with the potential values set as:

      As n approaches 0, the number of potential values decreases, which can be incompatible with some downstream models. Thus, caution should be used when applying this transformation to datasets with few cells.”

      The new mention in the results is: “However, caution should be taken when using this approach in scRNAseq cases where all replicates of a cell type contain few cells. scProp2Count values are limited to the space of possible proportion values, and so replicates with low numbers of cells will have fewer potential expression “levels” which may break some model assumptions in downstream applications (see Methods).”
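
      The limit on possible expression “levels” can be illustrated with a short sketch. It does not reproduce the authors' Prop2Count formula; it only enumerates the proportions k/n available when a gene is detected in k of n cells, which is the space the response says the transformed values are confined to. The function name `possible_proportions` is hypothetical.

```python
from fractions import Fraction


def possible_proportions(n_cells):
    """Every proportion k/n a gene can take when detected in k of n cells."""
    if n_cells < 1:
        raise ValueError("need at least one cell")
    return [Fraction(k, n_cells) for k in range(n_cells + 1)]


# Compare how many distinct levels small and large clusters allow.
for n in (5, 50, 500):
    levels = possible_proportions(n)
    print(n, "cells:", len(levels), "levels, smallest non-zero step", levels[1])
```

      A cluster of n cells allows only n + 1 distinct proportions, so any transformation applied to those proportions can produce at most n + 1 distinct values.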

      (6) The LittleBites approach is highly dependent on the accuracy of existing single-cell references. If the scRNA-seq dataset is incomplete or contains classification biases, this could propagate errors into the bulk RNA-seq data. The authors may want to discuss potential limitations and sensitivity to errors in the single-cell dataset, and it is critical to define minimum quality parameters (e.g. via modeling) for the scRNAseq dataset used as reference.

      We appreciate this suggestion, and agree that the manuscript would benefit from a description of where the LittleBites method can give poor results. To this end, we subset our single cell reference for individual neurons of interest to the level of 10, 50, 100, or 500 cells (500 iterations per sample rate), ran LittleBites, and compared metrics of gene expression stability, sample composition estimates, and AUROC performance on test genes. We found that when fewer than 100 cells for the target cell type are included in the single cell reference, gene expression stability drops (variance between subsampling iterations was much higher when fewer reference cells were used). AUROC values, however, were consistently high regardless of how many reference cells were included, although this stability in AUROC values was paired with lower overall counts after cleanup in samples with <100 reference cells. This indicates that in cases where few reference cells are present, higher AUROC values might be achieved by more aggressive subtraction, which is attenuated when the reference model is more complete. This analysis is shown in figure 2 and figure S2, and described in the results section, reproduced here.

      “To test the stability of Littlebites cleanup across different single-cell reference dataset qualities, we ran the algorithm on a set of bulk samples by first subsetting the corresponding single-cell cluster’s population to 10, 50, 100, or 500 cells. We performed this process 500 times for each subsampling rate for each sample (2000 total runs per sample). We found that testing gene AUROC values are stable across reference cluster sizes (Fig. 2D), suggesting that even if the target cell type is rarely represented in a single cell reference, accurate cleaning is still possible. However, comparing gene level stability across target cluster population levels reveals that low population references have higher gene level variance (Fig. 2E), lower purity estimates (Fig. S2F), higher variance in the mean expression across genes (Fig. S2G), and they tend to have lower overall expression (suggesting more aggressive subtraction) (Fig. S2H). This suggests that while binary gene calling is improved similarly even if the reference cluster is small, users should be cautious when using less than 100 cells in their single cell reference cluster as the resulting cleanup is less stable.”

      (7) Also very important, the LittleBites method could benefit from a more intuitive explanation and schematic to improve accessibility for non-computational readers. A supplementary step-by-step breakdown of the subtraction process would be useful.

      We appreciate this suggestion and implemented a step-by-step breakdown of the subtraction process in the methods section, also copied below. We also updated the graphic representation in figure 2A.

      “LittleBites Subtraction algorithm

      LittleBites is an iterative algorithm for bulk RNA-seq datasets, that improves the accuracy of cell-type specific bulk RNA-seq sample profiles by removing counts from non-target contaminants (e.g. ambient RNA from dead cells, carry-over non-target cells from FACS enrichment due to imperfect gating). This method leverages single cell reference datasets and ground truth expression information to guide iterative and conservative subtraction to enrich for true target cell-type expression. Using this approach, LittleBites balances subtraction by optimizing using both a single-cell reference, and an orthogonal ground truth reference, moderating biases inherent to either reference.

      This algorithm first calculates gene level specificity weights in a single cell reference dataset using SPM (Specificity Preservation Method) (references 22 and 23). SPM assigns high weights (approaching 1) to genes expressed in single cell types while applying conservative weights to genes with broader expression patterns, which helps to reduce inappropriate subtraction.

      The algorithm proceeds in a loop of three steps:

      Step 1: Estimate Contamination. Each bulk sample is modeled as the sum of a linear combination of single-cell profiles (target cell type and likely contaminants) using non-negative least squares (NNLS). The resulting coefficients provide the estimate of how much of the sample’s counts come from the target cell-type, and how much comes from each contaminant cell-type.

      Step 2: Weighted Subtraction. Each bulk sample is cleaned by subtracting the weighted sum of contaminant single-cell profiles. This subtraction is attempted multiple times (separately) across a series of learning rate weights (usually ranging from 0-1) which moderate the size of the subtraction step (Equation 1). This produces a range of possible “cleaned” sample options for evaluation.

      Step 3: Performance Optimization. For each learning rate, the cleaned result is evaluated against a set of ground truth genes by calculating the area under the receiver operating characteristic curve (AUROC). The learning rate that optimizes the AUROC is then selected. When multiple learning rates yield equivalent AUROC values, the lowest learning rate value is chosen to minimize subtraction.

      If the optimal learning rate at Step 3 is 0 (no subtraction option beats the baseline) then the loop is halted. Otherwise, the cleaned bulk profile is returned to Step 1, and the loop continues until the AUROC cannot be improved upon using the single-cell reference modeling.”
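
      The three-step loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Equation 1 is not reproduced in this response, so the subtraction step (learning rate times the NNLS-estimated contaminant signal, clipped at zero) is assumed; the SPM specificity weights are omitted; and the names `littlebites_sketch` and `auroc`, the learning-rate grid, and the `max_iter` guard are hypothetical. Only `scipy.optimize.nnls` is an existing library call.

```python
import numpy as np
from scipy.optimize import nnls


def auroc(scores, truth):
    """Probability that an expressed gene outscores an unexpressed one (ties count half)."""
    pos, neg = scores[truth], scores[~truth]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def littlebites_sketch(bulk, reference, target, truth_idx, truth,
                       rates=None, max_iter=50):
    """bulk: (genes,) counts; reference: (genes, cell_types) single-cell profiles;
    target: column of the target cell type; truth_idx/truth: ground-truth genes
    and a boolean array saying whether each is expressed in the target."""
    # Assumed grid: ascending and starting at 0, so index 0 means "no subtraction".
    rates = np.linspace(0.0, 1.0, 11) if rates is None else np.asarray(rates)
    cleaned = np.asarray(bulk, dtype=float).copy()
    contaminants = [j for j in range(reference.shape[1]) if j != target]
    for _ in range(max_iter):
        # Step 1: estimate how much of the sample each cell type contributes.
        coef, _ = nnls(reference, cleaned)
        contamination = reference[:, contaminants] @ coef[contaminants]
        # Step 2: try a separate subtraction for every learning rate (assumed form).
        candidates = [np.clip(cleaned - r * contamination, 0.0, None) for r in rates]
        # Step 3: score each candidate on the ground-truth genes.
        scores = [auroc(c[truth_idx], truth) for c in candidates]
        best = int(np.argmax(scores))  # first maximum, i.e. lowest rate among ties
        if rates[best] == 0.0:
            break  # no subtraction beats the baseline
        cleaned = candidates[best]
    return cleaned
```

      Because `np.argmax` returns the first maximum and the grid is ascending, ties resolve to the smallest learning rate, which matches the stated preference for minimal subtraction.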

      (8) In the same vein, the ROC curves and AUROC comparisons should have clearer annotations to make results more interpretable for readers unfamiliar with these metrics.

      We agree that the ROC and AUROC metrics need a clearer explanation to make their use and interpretations clearer. We included a description of both metrics, and a suggestion for how to interpret them in the results section, copied below.

      “To evaluate the accuracy of the post-subtraction datasets we used the area under the Receiver-Operator Characteristic (AUROC) score. In brief, we set a wide range of thresholds to call genes expressed or unexpressed, and then compared the calls to expected expression from a set of ground truth genes. This comparison produces a true positive rate (TPR, the percentage of truly expressed genes that are called expressed), a false positive rate (FPR, the percentage of truly not expressed genes that are called expressed), and a false discovery rate (FDR, the percentage of genes called expressed that are truly not expressed). The Receiver-Operator Characteristic (ROC) is the curve traced by the TPR and FPR values across the range of thresholds tested, and the AUROC is the area under that curve. A “random” model of gene expression is expected to have an AUROC value of 0.5, and a “perfect” model is expected to have an AUROC value of 1. Thus, AUROCs below 0.5 are worse than a random guess, and values closer to 1 indicate higher accuracy.”
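
      The threshold sweep described in this passage can be made concrete with a small sketch. The function names `roc_points` and `auroc` and the toy values are illustrative assumptions; the code follows the generic definition of TPR, FPR, and trapezoidal area rather than any specific routine used in the manuscript.

```python
def roc_points(scores, truth):
    """(FPR, TPR) pairs from sweeping an expression threshold over the scores.

    scores: one expression value per ground-truth gene.
    truth:  True where the gene is truly expressed.
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called expressed
    for thr in sorted(set(scores), reverse=True):
        called = [s >= thr for s in scores]
        tp = sum(c and t for c, t in zip(called, truth))
        fp = sum(c and not t for c, t in zip(called, truth))
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(scores, truth):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(scores, truth)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy ground truth: two expressed genes, two unexpressed genes.
truth = [True, True, False, False]
print(auroc([5, 4, 1, 0], truth))  # expressed genes score highest
print(auroc([1, 1, 1, 1], truth))  # scores carry no information
```

      In the first call every expressed gene outscores every unexpressed gene, so the area is 1; in the second all scores are tied and the curve is the diagonal, giving 0.5.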

      (9) Finally, after the correlation-based decontamination of the 4,440 'unexpressed' genes, how many were ultimately discarded as non-neuronal?

      a) Among these non-neuronal genes, how many were actually known neuronal genes or components of neuronal pathways (e.g., genes involved in serotonin synthesis, synaptic function, or axon guidance)?

      b) Conversely, among the "unexpressed" genes classified as neuronal, how many were likely not neuron-specific (e.g., housekeeping genes) or even clearly non-neuronal (e.g., myosin or other muscle-specific markers)?

      Combined with point 10, see below.

      (10) To increase transparency and allow readers to probe false positives and false negatives, I suggest the inclusion of:

      a) The full list of all 4,440 'unexpressed' genes and their classification at each refinement step. In that list flag the subsets of genes potentially misclassified, including:

      - Neuronal genes wrongly discarded as non-neuronal.

      - Non-neuronal genes wrongly retained as neuronal.

      b) Add a certainty or likelihood ranking that quantifies confidence in each classification decision, helping readers validate neuronal vs. non-neuronal RNA assignments.

      This addition would enhance transparency, reproducibility, and community engagement, ensuring that key neuronal genes are not erroneously discarded while minimizing false positives from contaminant-derived transcripts.

      We agree that the genes called “unexpressed” in the single-cell data need more context and clarity. First, we trimmed the list to include only the 2,333 genes of highest confidence. Second, for those genes we identified any with published neuronal expression patterns. Identifying genes that were retained as neuronal but are likely non-neuronal in origin is difficult, as many markers are expressed in a mixture of neuronal and non-neuronal cell-types. However, we used a curated list of putative non-neuronal markers to assess the accuracy of the integrated data (see supplementary table 4), and established that most non-neuronal markers are undetected in the integrated data, with the number of detected genes decreasing as our threshold stringency increases. Of note, a few putative non-neuronal genes remain detected even at high thresholds, indicating that our dataset still retains a small percentage of neuronal false positives. This result has been collected in the new supplementary figure 4F, and addressed in the following text in the results section: “Testing against a curated list of non-neuronal genes from fluorescent reporters and genomic enrichment studies, we found that of 445 non-neuronal markers, each gene was detected in an average of 12.5 cells or a median of 3 cells in the single-cell dataset, and an average of 8.7 cells or a median of 1 cell in the integrated dataset, at a 14% FDR threshold.”

      We also included a list of “unexpressed” gene identities and tissue annotations as new supplementary tables 16 and 17.

      Reviewer #2 (Recommendations for the authors):

      The utility of the bulk RNA-seq data would be significantly increased if the authors were to analyze which isoforms are expressed in individual neurons. Also, it would be very useful to know if there are instances where a gene is expressed in several neurons, but different isoforms are specific to individual neurons.

      We appreciate this suggestion. Indeed, as we put our source data online prior to publishing this manuscript, two published papers already use this source data set to analyze alternative splicing. Further, these works include validation of splicing patterns observed in this source data, indicating the biological relevance of these data sets. This is now noted in our discussion section: “In addition, the bulk RNA-seq dataset contains transcript information across the gene body, which parallel efforts have used to identify mRNA splicing patterns that are not found in the scRNA-seq dataset.” These works can be found in references 26 and 27.

      Reviewer #3 (Recommendations for the authors):

      (1) Describe the number of L4 animals processed to obtain good-quality bulk RNAseq libraries from the different neuronal types. If the number of worms would be different for different neuronal types, then please make a supplementary table listing the minimum number of worms needed for each neuronal type.

      We appreciate the reviewer’s recommendation, and agree that it would be a useful resource to provide suggestions for how many worms are needed per experiment. Unfortunately, we did not track the total number of animals for each sample. We aimed to start with 200-300 ul of packed worms for each strain, generally equating to >500,000 worms, but yields of FACS-isolated cells varied among cell types. Because replicates for specific neuron types were also variable in some instances (see additions to supplemental Table 1), yields likely depend on multiple factors. We have previously noted (Taylor et al., 2021), for example, that some cell types (e.g., pharyngeal neurons) were under-represented in scRNA-seq data relative to their in vivo abundance, presumably due to the difficulty of isolation or sub-viability in the cell dissociation-FACS protocol.

      (2) List the thresholds for the parameters used during the FASTQC quality control and the threshold number of reads that would make a sample not useful.

      We now include parameters for sample exclusion in the methods section: “Samples were excluded after sequencing if they had: fewer than 1 million read pairs, <1% of uniquely mapping reads to the C. elegans genome, > 50% duplicate reads (low UMI diversity), or failed deduplication steps in the nudup package.”
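
      The exclusion criteria quoted above can be written as a simple filter. This is a sketch only: the `SampleQC` record, its field names, and the `exclusion_reasons` function are hypothetical, and the thresholds are copied from the quoted Methods text rather than from the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    read_pairs: int
    pct_uniquely_mapped: float  # percent of reads mapping uniquely to the genome
    pct_duplicates: float       # percent of reads flagged as duplicates
    dedup_ok: bool              # whether the deduplication step completed


def exclusion_reasons(qc):
    """List every quoted criterion the sample fails; an empty list means keep it."""
    reasons = []
    if qc.read_pairs < 1_000_000:
        reasons.append("fewer than 1 million read pairs")
    if qc.pct_uniquely_mapped < 1.0:
        reasons.append("<1% uniquely mapping reads")
    if qc.pct_duplicates > 50.0:
        reasons.append(">50% duplicate reads")
    if not qc.dedup_ok:
        reasons.append("failed deduplication")
    return reasons


print(exclusion_reasons(SampleQC(2_500_000, 62.0, 18.0, True)))
print(exclusion_reasons(SampleQC(400_000, 0.4, 71.0, False)))
```

      Returning the list of failed criteria, rather than a single boolean, keeps a record of why each sample was dropped.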

      (3) In Figure 5C, include an overlapping bar that shows the total number of genes in each cell type. You may need to use a log scale to see both (new and all) numbers of genes in the same graph. Add supplementary tables with the names of all new genes assigned to each neuronal type.

      We agree that this figure panel needed additional context. On further reflection we concluded that figure 5 was not sufficiently distinct from figure 4 to warrant separation, and incorporated some key findings from figure 5 into figure S4.

    1. eLife Assessment

      This important work investigates cooperative behaviors in adolescents using a repeated Prisoner's Dilemma game. The computational modeling approach used in the study is solid and rigorous. The work could be further strengthened with the consideration of modeling higher-order social inferences and non-linear relationships between age and observed behavior. Findings from this study will be of interest to developmental psychologists, economists, and social psychologists.

    2. Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature, and make clear the gap they are testing and trying to explain. The work also includes social contexts which move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      Weaknesses:

      I had some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      There was also little discussion of the structure of the Prisoner's Dilemma and the strategy of the game (that defection is always dominant). As a result, the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e., they may seem less cooperative simply because they want to play the dominant strategy, rather than because of a lower preference for cooperation, all else being equal.

      The authors have now addressed my comments and concerns in their revised version.

      Appraisal & Discussion:

      Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      Comments on revisions:

      Thank you to the authors for addressing my comments and concerns.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rate best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.
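
      To illustrate what an asymmetric learning rate means in this setting, the sketch below applies a Rescorla-Wagner-style update with separate rates for positive and negative prediction errors. It is not the authors' winning model, whose equations are not given in this review; the function name `update_expectation` and the parameters `alpha_pos` and `alpha_neg` are hypothetical.

```python
def update_expectation(p_coop, partner_cooperated, alpha_pos, alpha_neg):
    """One update of the expected probability that the partner cooperates.

    alpha_pos scales updates after better-than-expected outcomes,
    alpha_neg scales updates after worse-than-expected outcomes.
    """
    outcome = 1.0 if partner_cooperated else 0.0
    delta = outcome - p_coop  # prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return p_coop + alpha * delta


# A learner that updates slowly after cooperation and quickly after defection.
p = 0.5
for cooperated in (True, True, False):
    p = update_expectation(p, cooperated, alpha_pos=0.2, alpha_neg=0.6)
    print(round(p, 3))
```

      With `alpha_pos` smaller than `alpha_neg`, a single defection can undo the expectation built up over several cooperative rounds, which is one way a low learning rate for cooperative outcomes can reduce reciprocation.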

      Strengths:

      (1) Rigorous model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      Weaknesses:

      A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models, such as those explored in Hackel et al. (2015, Nature Neuroscience), suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning, such as sensitivity to reciprocity or reward updating, may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      Finally, the two age groups compared, adolescents (high school students) and adults (university students), differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogeneous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      Comments on revisions:

      The authors have adequately addressed my previous comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature and make obvious the gap they are testing and trying to explain. The work also includes social contexts that move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      (1) I also have some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      We thank the reviewer for this suggestion. Following the comment, we added a hierarchical Bayesian estimation. We built a hierarchical model with both group-level (adolescent group and adult group) and individual-level structures for the best-fitting model. Four Markov chains with 4,000 samples each were run, and the model converged well (see Figure supplement 7).

      We then analyzed the posterior parameters for adolescents and adults separately. The results were consistent with those from the MLE analysis. These additional results have been included in the Appendix Analysis section (also see Figure supplement 5 and 7). In addition, we have updated the code and provided the link for reference. We appreciate the reviewer’s suggestion, which improved our analysis.

      (2) There was also little discussion of the structure of the Prisoner's Dilemma and the strategy of the game (that defection is always dominant), meaning that the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e. they may seem less cooperative simply because they want to play the dominant strategy, rather than because of a lower preference for cooperation, all else being equal.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. 

      However, our computational modeling explicitly addressed this possibility. Model 4 (inequality aversion) captures decisions that are driven purely by self-interest or aversion to unequal outcomes, including a parameter reflecting disutility from advantageous inequality, which represents self-oriented motives. If participants’ behavior were solely guided by the payoff-dominant strategy, this model should have provided the best fit. However, our model comparison showed that Model 5 (social reward) performed better in both adolescents and adults, suggesting that cooperative behavior is better explained by valuing social outcomes beyond payoff structures.

      Moreover, if adolescents’ lower cooperation simply reflected a strategic response to the payoff structure, in which defection is the more rewarding option, then adolescents should show reduced cooperation across all rounds. Instead, adolescents and adults behaved similarly when partners defected, but adolescents cooperated less when partners cooperated and showed little increase in cooperation even after consecutive cooperative responses. This pattern suggests that adolescents’ lower cooperation cannot be explained solely by strategic responses to payoff structures but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded our Discussion to acknowledge this important point and to clarify how the behavioral and modeling results address the reviewer’s concern.

      “Overall, these findings indicate that adolescents’ lower cooperation is unlikely to be driven solely by strategic considerations, but may instead reflect differences in the valuation of others’ cooperation or reduced motivation to reciprocate. Although defection is the payoff-dominant strategy in the Prisoner’s Dilemma, the selective pattern of adolescents’ cooperation and the model comparison results indicate that their reduced cooperation cannot be fully explained by strategic incentives, but rather reflects weaker valuation of social reciprocity.”

      Appraisal & Discussion:

      (3) The authors have partially achieved their aims, but I believe the manuscript would benefit from additional methodological clarification, specifically regarding the use of hierarchical model fitting and the inclusion of Bayes Factors, to more robustly support their conclusions. It would also be important to investigate the source of the model confusion observed in two of their models.

      We thank the reviewer for this comment. In the revised manuscript, we have clarified the hierarchical Bayesian modeling procedure for the best-fitting model, including the group- and individual-level structure and convergence diagnostics. The hierarchical approach produced results that fully replicated those obtained from the original maximum-likelihood estimation, confirming the robustness of our findings. Please also see the response to (1).

      Regarding the model confusion between the inequality aversion (Model 4) and social reward (Model 5) models in the model recovery analysis, both models’ simulated behaviors were best captured by the baseline model. This pattern arises because neither model includes learning or updating processes. Given that our task involves dynamic, multi-round interactions, models lacking a learning mechanism cannot adequately capture participants’ trial-by-trial adjustments, resulting in similar behavioral patterns that are better explained by the baseline model during model recovery. We have added a clarification of this point to the Results:

      “The overlap between Models 4 and 5 likely arises because neither model incorporates a learning mechanism, making them less able to account for trial-by-trial adjustments in this dynamic task.”

      (4) I am unconvinced by the claim that failures in mentalising have been empirically ruled out, even though I am theoretically inclined to believe that adolescents can mentalise using the same procedures as adults. While reinforcement learning models are useful for identifying biases in learning weights, they do not directly capture formal representations of others' mental states. Greater clarity on this point is needed in the discussion, or a toning down of this language.

      We sincerely thank the reviewer for this professional comment. We agree that our prior wording regarding adolescents’ capacity to mentalise was somewhat overgeneralized. Accordingly, we have toned down the language in both the Abstract and the Discussion to better align our statements with what the present study directly tests. Specifically, our revisions focus on adolescents’ and adults’ ability to predict others’ cooperation in social learning. This is consistent with the evidence from our analyses examining adolescents’ and adults’ model-based expectations and self-reported scores on partner cooperativeness (see Figure 4). In the revised Discussion, we state:

      “Our results suggest that the lower levels of cooperation observed in adolescents stem from a stronger motive to prioritize self-interest rather than a deficiency in predicting others’ cooperation in social learning”.

      (5) Additionally, a more detailed discussion of the incentives embedded in the Prisoner's Dilemma task would be valuable. In particular, the authors' interpretation of reduced adolescent cooperativeness might be reconsidered in light of the zero-sum nature of the game, which differs from broader conceptualisations of cooperation in contexts where defection is not structurally incentivised.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. However, our behavioral and computational evidence suggests that this pattern cannot be explained solely by strategic responses to payoff structures, but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded the Discussion to acknowledge this point and to clarify how both behavioral and modeling results address the reviewer’s concern (see also our response to 2).

      (6) Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      We thank the reviewer for the professional comments, which have helped us improve our work.

      Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rates best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.

      Strengths:

      (1) Rigorous model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      We thank the reviewer for highlighting the strengths of our work.

      Weaknesses:

      A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models, such as those explored in Hackel et al. (2015, Nature Neuroscience), suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      We thank the reviewer for this thoughtful comment. We agree that social learning from human partners may involve higher-order inferences beyond simple reinforcement learning from non-human sources. To address this, we had previously included such mechanisms in our behavioral modeling. In Model 7 (Social Reward Model with Influence), we tested a higher-order belief-updating process in which participants’ expectations about their partner’s cooperation were shaped not only by the partner’s previous choices but also by the inferred influence of their own past actions on the partner’s subsequent behavior. In other words, participants could adjust their belief about the partner’s cooperation by considering how their partner’s belief about them might change. Model comparison showed that Model 7 did not outperform the best-fitting model, suggesting that incorporating higher-order influence updates added limited explanatory value in this context. As suggested by the reviewer, we have further clarified this point in the revised manuscript.

      Regarding trait-based frameworks, we appreciate the reviewer’s reference to Hackel et al. (2015). That study elegantly demonstrated that learners form relatively stable beliefs about others’ social dispositions, such as generosity, especially when the task structure provides explicit cues for trait inference (e.g., resource allocations and giving proportions). By contrast, our study was not designed to isolate trait learning, but rather to capture how participants update their expectations about a partner’s cooperation over repeated interactions. In this sense, cooperativeness in our framework can be viewed as a trait-like latent belief that evolves as evidence accumulates. Thus, while our model does not include a dedicated trait module that directly modulates learning rates, the belief-updating component of our best-fitting model effectively tracks a dynamic, partner-specific cooperativeness, potentially reflecting a prosocial tendency.

      This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      We thank the reviewer for the suggestion. Following the comment, we implemented an additional model incorporating a dynamic learning rate based on the magnitude of prediction errors. Specifically, we developed Model 9, a social reward model with a Pearce–Hall learning algorithm (dynamic learning rate), in which participants’ beliefs about their partner’s cooperation probability are updated using a Rescorla–Wagner rule with a learning rate dynamically modulated by the Pearce–Hall (PH) error-learning mechanism. In this framework, the learning rate increases following surprising outcomes (larger prediction errors) and decreases as expectations become more stable (see Appendix Analysis section for details).
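
      For illustration only, the mechanism described above can be sketched as a hybrid Rescorla–Wagner/Pearce–Hall update. This is not the authors' implementation: the function name, the scaling parameter kappa, the decay parameter eta, and the starting values are assumptions made for this example, and the exact parameterization of Model 9 is the one given in the Appendix Analysis section.

```python
def pearce_hall_step(belief, associability, partner_cooperated, kappa=0.5, eta=0.3):
    """One trial of a hybrid Rescorla-Wagner / Pearce-Hall belief update.

    belief             -- current estimate of P(partner cooperates), in [0, 1]
    associability      -- dynamic learning-rate component, in [0, 1]
    partner_cooperated -- 1 if the partner cooperated on this trial, else 0
    kappa, eta         -- hypothetical scaling and decay parameters (assumed)
    """
    prediction_error = float(partner_cooperated) - belief
    # Rescorla-Wagner step whose effective learning rate is kappa * associability.
    new_belief = belief + kappa * associability * prediction_error
    # Associability drifts toward the size of the latest surprise.
    new_associability = eta * abs(prediction_error) + (1.0 - eta) * associability
    return new_belief, new_associability


# Hypothetical partner sequence: three cooperations followed by a defection.
belief, associability = 0.5, 0.5
for outcome in [1, 1, 1, 0]:
    belief, associability = pearce_hall_step(belief, associability, outcome)
```

      In this sketch the associability shrinks while the partner behaves as expected and rises again after the surprising defection, which is the qualitative behavior the response describes.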

      The results showed that this dynamic learning rate model did not outperform our best-fitting model in either adolescents or adults (see Figure supplement 6). We greatly appreciate the reviewer’s suggestion, which has strengthened the scope of our analysis. We have now added these analyses to the Appendix Analysis section (see Figure supplement 6) and expanded the Discussion to acknowledge this modeling extension and further discuss its implications.

      Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning, such as sensitivity to reciprocity or reward updating, may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      We thank the reviewer for this professional comment. In addition to the linear analyses, we further conducted exploratory analyses to examine potential non-linear relationships between age and the model parameters. Specifically, we fit LMMs for each of the four parameters as outcomes (α+, α−, β, and ω). The fixed effects included age, a quadratic age term, and gender, and the random effects included subject-specific random intercepts and random slopes for age and gender. Model comparison using BIC did not indicate improvement for the quadratic models over the linear models for α+ (ΔBIC_quadratic-linear = 5.09), α− (ΔBIC_quadratic-linear = 3.04), β (ΔBIC_quadratic-linear = 3.9), or ω (ΔBIC_quadratic-linear = 0). Moreover, the quadratic age term was not significant for α+, α−, or β (all ps > 0.10). For ω, we observed a significant linear age effect (b = 1.41, t = 2.65, p = 0.009) and a significant quadratic age effect (b = −0.03, t = −2.39, p = 0.018; see Author response image 1). This pattern is broadly consistent with the group effect reported in the main text. The shaded area in the figure represents the 95% confidence interval. As shown, the interval widens at older ages (≥ 26 years) due to fewer participants in that range, which limits the robustness of the inferred quadratic effect. In consideration of the limited precision at older ages and the lack of BIC improvement, we did not emphasize the quadratic effect in the revised manuscript and present these results here as exploratory.

      Author response image 1.

      Linear and quadratic model fits showing the relationship between age and the ω parameter, with 95% confidence intervals.
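
      As a reading aid for the ΔBIC values reported in this response, the comparison can be sketched with the standard BIC definition, k·ln(n) − 2·ln(L). The function names, log-likelihoods, parameter counts, and sample size below are placeholders chosen for illustration; they are not values from the study, and the authors' LMM software may count parameters differently.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def delta_bic(ll_quadratic, k_quadratic, ll_linear, k_linear, n_obs):
    """BIC(quadratic) - BIC(linear); a positive value means no gain from the quadratic term."""
    return bic(ll_quadratic, k_quadratic, n_obs) - bic(ll_linear, k_linear, n_obs)


# Placeholder numbers: the quadratic model adds one parameter but fits no better,
# so the difference equals the complexity penalty ln(n_obs).
example = delta_bic(ll_quadratic=-100.0, k_quadratic=4,
                    ll_linear=-100.0, k_linear=3, n_obs=100)
```

      Under this convention a positive difference, as reported for α+, α−, and β, means the extra quadratic term did not improve fit enough to offset its penalty.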

      Finally, the two age groups compared - adolescents (high school students) and adults (university students) - differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogeneous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      We appreciate this comment. Indeed, adolescents (high school students) and adults (university students) differ not only in age but also in sociocultural and socioeconomic backgrounds. In our study, all participants were recruited from Beijing and surrounding regions, which helps minimize large regional and cultural variability. Moreover, we accounted for individual-level random effects and included participants’ social value orientation (SVO) as an individual difference measure. 

      Nonetheless, we acknowledge that other contextual factors, such as differences in financial independence, socioeconomic status, and social experience, may also contribute to group differences in cooperative behavior and reward valuation. Although our results are broadly consistent with developmental theories of reward sensitivity and social decision-making, sociocultural influences cannot be entirely ruled out. Future work with more demographically matched samples or with socioeconomic and regional variables explicitly controlled will help clarify the relative contributions of biological and contextual factors. Accordingly, we have revised the Discussion to include the following statement:

      “Third, although both age groups were recruited from Beijing and nearby regions, minimizing major regional and cultural variation, adolescents and adults may still differ in socioeconomic status, financial independence, and social experience. Such contextual differences could interact with developmental processes in shaping cooperative behavior and reward valuation. Future research with demographically matched samples or explicit measures of socioeconomic background will help disentangle biological from sociocultural influences.”

      Reviewer #3 (Public review):

      Summary:

      Wu and colleagues find that in a repeated Prisoner's Dilemma, adolescents, compared to adults, are less likely to increase their cooperation behavior in response to repeated cooperation from a simulated partner. In contrast, after repeated defection by the partner, both age groups show comparable behavior.

      To uncover the mechanisms underlying these patterns, the authors compare eight different models. They report that a social reward learning model, which includes separate learning rates for positive and negative prediction errors, best fits the behavior of both groups. Key parameters in this winning model vary with age: notably, the intrinsic value of cooperating is lower in adolescents. Adults and adolescents also differ in learning rates for positive and negative prediction errors, as well as in the inverse temperature parameter.

      Strengths: 

      The modeling results are compelling in their ability to distinguish between learned expectations and the intrinsic value of cooperation. The authors skillfully compare relevant models to demonstrate which mechanisms drive cooperation behavior in the two age groups.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      Some of the claims made are not fully supported by the data:

      The central parameter reflecting preference for cooperation is positive in both groups. Thus, framing the results as self-interest versus other-interest may be misleading.

      We thank the reviewer for this insightful comment. In the social reward model, the cooperation preference parameter is positive by definition, as defection in the rPDG always yields a +2 monetary advantage regardless of the partner’s action. This positive value represents the additional subjective reward assigned to mutual cooperation (e.g., reciprocity value) that counterbalances the monetary gain from defection. Although the estimated social reward parameter ω was positive, the effective advantage of cooperation is Δ = p×ω − 2. Given participants’ inferred beliefs p, Δ was negative for most trials (p×ω < 2), indicating that the social reward was insufficient to offset the +2 advantage of defection. Thus, both adolescents and adults valued cooperation positively, but adolescents’ smaller ω and weaker responsiveness to sustained partner cooperation suggest a stronger weighting on immediate monetary payoffs.
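
      A short numerical illustration of the expression Δ = p×ω − 2 may help. The values of p and ω below are hypothetical rather than estimates from the study, and the function names are introduced only for this sketch.

```python
def cooperation_advantage(p, omega, defection_bonus=2.0):
    """Effective advantage of cooperating: expected social reward minus the defection bonus."""
    return p * omega - defection_bonus


def break_even_belief(omega, defection_bonus=2.0):
    """Belief p at which cooperating and defecting are equally attractive (Delta = 0)."""
    return defection_bonus / omega


# Hypothetical values: a positive omega can still leave cooperation at a disadvantage.
delta_low = cooperation_advantage(p=0.8, omega=2.0)   # 0.8 * 2.0 - 2 is negative
delta_high = cooperation_advantage(p=0.8, omega=4.0)  # 0.8 * 4.0 - 2 is positive
threshold = break_even_belief(omega=4.0)              # p must exceed 0.5
```

      Because p cannot exceed 1, any ω below 2 implies a break-even belief above 1, so in this formulation cooperation would never carry a positive effective advantage for such a participant.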

      In this light, our framing of adolescents as more self-interested derives from their behavioral pattern: even when they recognized sustained partner cooperation and held high expectations of partner cooperation, adolescents showed lower cooperative behavior and reciprocity rewards compared with adults. Whereas adults increased cooperation after two or three consecutive partner cooperations, this pattern was absent among adolescents. We therefore interpret their behavior as relatively more self-interested, reflecting reduced sensitivity to the social reward from mutual cooperation rather than a categorical shift from self-interest to other-interest, as elaborated in the Discussion.

      It is unclear why the authors assume adolescents and adults have the same expectations about the partner's cooperation, yet simultaneously demonstrate age-related differences in learning about the partner. To support their claim mechanistically, simulations showing that differences in cooperation preference (i.e., the w parameter), rather than differences in learning, drive behavioral differences would be helpful.

      We thank the reviewer for raising this important point. In our model, both adolescents and adults updated their beliefs about partner cooperation using an asymmetric reinforcement learning (RL) rule. Although adolescents exhibited a higher positive and a lower negative learning rate than adults, the two groups did not differ significantly in their overall updating of partner cooperation probability (Fig. 4a-b). We then examined the social reward parameter ω, which was significantly smaller in adolescents and determined the intrinsic value of mutual cooperation (i.e., p×ω). This variable differed significantly between groups and closely matched the behavioral pattern.

      Following the reviewer’s suggestion, we conducted additional simulations varying one model parameter at a time while holding the others constant. The difference in mean cooperation probability between adults and adolescents served as the index (positive = higher cooperation in adults). As shown in Author response image 2, decreases in ω most effectively reproduced the observed group difference (shaded area), indicating that age-related differences in cooperation are primarily driven by variation in the social reward parameter ω rather than by the other parameters.

      Author response image 2.

      Simulation results showing how variations in each model parameter affect the group difference in mean cooperation probability (Adults – Adolescents). Based on the best-fitting Model 8 and parameters estimated from all participants, each line represents one parameter (i.e., α+, α-, ω, β) systematically varied within the tested range (α±:0.1–0.9; ω, β:1–9) while other parameters were held constant. Positive values indicate higher cooperation in adults. Smaller ω values most strongly reproduced the observed group difference, suggesting that reduced social reward weighting primarily drives adolescents’ lower cooperation.
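
      The one-parameter-at-a-time logic of this simulation can be sketched as follows. This is a simplified stand-in, not the authors' Model 8 code: it assumes an asymmetric Rescorla–Wagner belief update, the decision variable Δ = p×ω − 2 from the response above, and a logistic choice rule with inverse temperature β. The function names, the initial belief of 0.5, the baseline parameter values, and the always-cooperating partner are all assumptions made for the example.

```python
import math


def mean_coop_prob(alpha_pos, alpha_neg, omega, beta, partner_actions,
                   p0=0.5, defection_bonus=2.0):
    """Average model-implied probability of cooperating over a partner sequence."""
    p = p0
    total = 0.0
    for partner_cooperated in partner_actions:
        delta = p * omega - defection_bonus               # advantage of cooperating
        total += 1.0 / (1.0 + math.exp(-beta * delta))    # assumed logistic choice rule
        error = partner_cooperated - p
        # Asymmetric update: separate learning rates for positive and negative errors.
        p += (alpha_pos if error > 0 else alpha_neg) * error
    return total / len(partner_actions)


def sweep(param, values, base, partner_actions):
    """Vary one parameter over `values` while holding the others at `base`."""
    results = []
    for value in values:
        params = dict(base, **{param: value})
        results.append(mean_coop_prob(partner_actions=partner_actions, **params))
    return results


# Hypothetical baseline and a partner who always cooperates for 20 trials.
base = dict(alpha_pos=0.5, alpha_neg=0.5, omega=5.0, beta=1.0)
partner = [1] * 20
omega_curve = sweep("omega", [1.0, 5.0, 9.0], base, partner)
```

      Repeating the sweep for each parameter and subtracting the curve of one group from that of the other gives an index of the kind plotted in the figure; under these assumptions, lowering ω reduces mean cooperation toward a consistently cooperative partner.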

      Two different schedules of 120 trials were used: one with stable partner behavior and one with behavior changing after 20 trials. While results for order effects are reported, the results for the stable vs. changing phases within each schedule are not. Since learning is influenced by reward structure, it is important to test whether key findings hold across both phases.

      We thank the reviewer for this thoughtful and professional comment. In our GLMM and LMM analyses, we focused on trial order rather than explicitly including the stable vs. changing phase factor, due to concerns about multicollinearity. In our design, phases occur in specific temporal segments, which introduces strong collinearity with trial order. In multi-round interactions, order effects also capture variance related to phase transitions. 

      Nonetheless, to directly address this concern, we conducted additional robustness analyses by adding a phase variable (stable vs. changing) to GLMM1, LMM1, and LMM3 alongside the original covariates. Across these specifications, the key findings were replicated (see GLMM_sup2 and LMM_sup4–5; Tables 9-11), and the direction and significance of main effects remained unchanged, indicating that our conclusions are robust to phase differences.

      The division of participants at the legal threshold of 18 years should be more explicitly justified. The age distribution appears continuous rather than clearly split. Providing rationale and including continuous analyses would clarify how groupings were determined.

      We thank the reviewer for this thoughtful comment. We divided participants at the legal threshold of 18 years for both conceptual and practical reasons grounded in prior literature and policy. In many countries and regions, 18 marks the age of legal majority and is widely used as the boundary between adolescence and adulthood in behavioral and clinical research. Empirically, prior studies indicate that psychosocial maturity and executive functions approach adult levels around this age, with key cognitive capacities stabilizing in late adolescence (Icenogle et al., 2019; Tervo-Clemmens et al., 2023). We have clarified this rationale in the Introduction section of the revised manuscript.

      “Based on legal criteria for majority and prior empirical work, we adopt 18 years as the boundary between adolescence and adulthood (Icenogle et al., 2019; Tervo-Clemmens et al., 2023).”

      We fully agree that the underlying age distribution is continuous rather than sharply divided. To address this, we conducted additional analyses treating age as a continuous predictor (see GLMM<sub>sup</sub>1 and LMM<sub>sup</sub>1–3; Tables S1-S4), which generally replicated the patterns observed with the categorical grouping. Nevertheless, given the limited age range of our sample, the generalizability of these findings to fine-grained developmental differences remains constrained. Therefore, our primary analyses continue to focus on the contrast between adolescents and adults, rather than attempting to model a full developmental trajectory.

      Claims of null effects (e.g., in the abstract: "adults increased their intrinsic reward for reciprocating... a pattern absent in adolescents") should be supported with appropriate statistics, such as Bayesian regression.

      We thank the reviewer for highlighting the importance of rigor when interpreting potential null effects. To address this concern, we conducted Bayes factor analyses of the intrinsic reward for reciprocity and reported the corresponding BF10 for all relevant post hoc comparisons. This approach quantifies the relative evidence for the alternative versus the null hypothesis, thereby providing a more direct assessment of null effects. The analysis procedure is now described in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Once claims are more closely aligned with the data, the study will offer a valuable contribution to the field, given its use of relevant models and a well-established paradigm.

      We are grateful for the reviewer’s generous appraisal and insightful comments.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      I commend the authors on a well-structured, clear, and interesting piece of work. I have several questions and recommendations that, if addressed, I believe will strengthen the manuscript.

      We thank the reviewer for commending the organization of our paper.

      Introduction: - Why use a zero-sum (Prisoner's Dilemma; PD) versus a mixed-motive game (e.g. Trust Task) to study cooperation? In a finite set of rounds, the dominant strategy can be to defect in a PD.

      We thank the reviewer for this helpful comment. We agree that both the rationale for using the repeated Prisoner’s Dilemma game (rPDG) and the limitations of this framework should be clarified. We chose the rPDG to isolate the core motivational conflict between self-interest and joint welfare, as its symmetric and simultaneous structure avoids the sequential trust and reputation-accumulation dynamics inherent to asymmetric tasks such as the Trust Game (King-Casas et al., 2005; Rilling et al., 2002).

      Although a finitely repeated rPDG theoretically favors defection, extensive prior research shows that cooperation can still emerge in long repeated interactions when players rely on learning and reciprocity rather than backward induction (Rilling et al., 2002; Fareri et al., 2015). Our design employed 120 consecutive rounds, allowing participants to update expectations about partner behavior and to establish stable reciprocity patterns over time. We have added the following clarification to the Introduction:

      “The rPDG provides a symmetric and simultaneous framework that isolates the motivational conflict between self-interest and joint welfare, avoiding the sequential trust and reputation dynamics characteristic of asymmetric tasks such as the Trust Game (Rilling et al., 2002; King-Casas et al., 2005)”

      Methods:

      Did the participants know how long the PD would go on for?

      Were the participants informed that the partner was real/simulated?

      Were the participants informed that the partner was going to be the same for all rounds?

      We thank the reviewer for the meticulous review, which helped us present the experimental design and reporting details more clearly. We offer the following clarifications: I. Participants were not informed of the total number of rounds in the rPDG. This prevented endgame expectations and avoided distraction from counting rounds, which could introduce additional effects. II. Participants were told that their partner was another human participant in the laboratory. However, the partner’s behavior was predetermined by a computer program. This design enabled tighter experimental control and ensured consistent conditions across age groups, supporting valid comparisons. III. Participants were informed that they would interact with the same partner across all rounds, in keeping with a multi-round interaction paradigm and stabilizing partner-related expectations. For transparency, we have clarified these points in the Methods and Materials section:

      “Participants were told that their partner was another human participant in the laboratory and that they would interact with the same partner across all rounds. However, in reality, the actions of the partner were predetermined by a computer program. This setup allowed for a clear comparison of the behavioral responses between adolescents and adults. Participants were not informed of the total number of rounds in the rPDG.”

      The authors mention that an SVO was also recorded to indicate participant prosociality. Where are the results of this? Did this track game play at all? Could cooperativeness be explained broadly as an SVO preference that penetrated into game-play behaviour?

      We thank the reviewer for pointing this out. We agree that individual differences in prosociality may shape cooperative behavior, so we conducted additional analyses incorporating SVO. Specifically, we extended GLMM1 and LMM3 by adding the measured SVO as a fixed effect with random slopes, yielding GLMM<sub>sup</sub>3 and LMM<sub>sup</sub>6 (Tables 12–13). The results showed that higher SVO was associated with greater cooperation, whereas its effect on the reward for reciprocity was not significant. Importantly, the primary findings remained unchanged after controlling for SVO. These results indicate that cooperativeness in our task cannot be explained solely by a broad SVO preference, although a more prosocial orientation was associated with greater cooperation. We have reported these analyses and results in the Appendix Analysis section.

      Why was AIC chosen rather than BIC to compare model dominance?

      We apologize for the lack of clarity. Both the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are information-theoretic criteria for model comparison, and neither depends on whether the models being compared are nested (Burnham & Anderson, 2002). We have added the following clarification to the Methods:

      “We chose to use the AICc as the metric of goodness-of-fit for model comparison for the following statistical reasons. First, BIC is derived based on the assumption that the “true model” must be one of the models in the limited model set one compares (Burnham et al., 2002; Gelman & Shalizi, 2013), which is unrealistic in our case. In contrast, AIC does not rely on this unrealistic “true model” assumption and instead selects out the model that has the highest predictive power in the model set (Gelman et al., 2014). Second, AIC is also more robust than BIC for finite sample size (Vrieze, 2012).”
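
To make the difference in complexity penalties concrete, the following is a minimal sketch of the standard AICc and BIC formulas. The log-likelihoods, parameter counts, and the use of n = 120 observations per participant are hypothetical values chosen for illustration only; they are not the fitted values from the study.

```python
import math


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term 2k(k+1)/(n-k-1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical fits of two candidate models to n = 120 choices.
n = 120
simple = {"log_lik": -70.0, "k": 2}  # fewer parameters, worse fit
rich = {"log_lik": -66.0, "k": 4}    # more parameters, better fit

for name, m in (("simple", simple), ("rich", rich)):
    print(name,
          round(aicc(m["log_lik"], m["k"], n), 3),
          round(bic(m["log_lik"], m["k"], n), 3))
```

With these made-up numbers the two criteria disagree: AICc is lower for the richer model, whereas BIC, whose per-parameter penalty grows with ln n, is lower for the simpler one.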

      I believe the model fitting procedure might benefit from hierarchical estimation, rather than maximum likelihood methods. Adolescents in particular seem to show multiple outliers in a^+ and w^+ at the lower end of the distributions in Figure S2. There are several packages to allow hierarchical estimation and model comparison in MATLAB (which I believe is the language used for this analysis; see https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007043).

      We thank the reviewer for this helpful comment and for referring us to relevant methodological work (Piray et al., 2019). We have addressed this point by incorporating hierarchical Bayesian estimation, which effectively mitigates outlier effects and improves model identifiability. The results replicated those obtained with MLE fitting and further revealed group-level differences in key parameters. Please see our detailed response to Reviewer #1 Q1 for the full description of this analysis and results.

      Results: Model confusion seems to show that the inequality aversion and social reward models were consistently confused with the baseline model. Is this explained or investigated? I could not find an explanation for this.

      The apparent overlap between the inequality aversion (Model 4) and social reward (Model 5) models in the recovery analysis likely arises because neither model includes a learning mechanism, making them unable to capture trial-by-trial adjustments in this dynamic task. Consequently, both were best fit by the baseline model. Please see Response to Reviewer #1 Q3 for related discussion.

      Figures 3e and 3f show the correlation between asymmetric learning rates and age. It seems that both a^+ and a^- are around 0.35-0.40 for young adolescents, and this becomes more polarised with age. Could it be that with age comes an increasing discernment of positive and negative outcomes on beliefs, and younger ages compress both positive and negative values together? Given the higher stochasticity in younger ages (\beta), it may also be that these values simply represent higher uncertainty over how to act in any given situation within a social context (assuming the differences in groups are true).

      We appreciate this insightful interpretation. Indeed, both α+ and α- cluster around 0.35–0.40 in younger adolescents and become increasingly polarized with age, suggesting that sensitivity to positive versus negative feedback is less differentiated early in development and becomes more distinct over time. This interpretation remains tentative and warrants further validation. Based on this comment, we have revised the Discussion to include this developmental interpretation.

      We also clarify that in our model β denotes the inverse temperature parameter; higher β reflects greater choice precision and value sensitivity, not higher stochasticity. Accordingly, adolescents showed higher β values, indicating more value-based and less exploratory choices, whereas adults displayed relatively greater exploratory cooperation. These group differences were also replicated using hierarchical Bayesian estimation (see Response to Reviewer #1 Q1). In response to this comment, we have added a statement in the Discussion highlighting this developmental interpretation.

      “Together, these findings suggest that the differentiation between positive and negative learning rates changes with age, reflecting more selective feedback sensitivity in development, while higher β values in adolescents indicate greater value sensitivity. This interpretation remains tentative and requires further validation in future research.”
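
The roles of the asymmetric learning rates and of β can be illustrated with a minimal sketch. It assumes a standard prediction-error update of the partner-cooperation expectation and a two-option softmax choice rule; the function names, the coding of outcomes as 1 (partner cooperated) or 0 (partner defected), and the utility values are hypothetical, and the exact model equations are those in the manuscript rather than this response.

```python
import math


def update_expectation(p, partner_cooperated, alpha_pos, alpha_neg):
    """Update the expected probability of partner cooperation.

    A positive prediction error is scaled by alpha_pos and a negative one
    by alpha_neg (assumed form of the asymmetric update).
    """
    outcome = 1.0 if partner_cooperated else 0.0
    delta = outcome - p  # prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return p + alpha * delta


def p_cooperate(u_coop, u_defect, beta):
    """Two-option softmax; beta is the inverse temperature."""
    return 1.0 / (1.0 + math.exp(-beta * (u_coop - u_defect)))


# Same starting expectation, different feedback and learning rates.
print(update_expectation(0.5, True, alpha_pos=0.6, alpha_neg=0.2))
print(update_expectation(0.5, False, alpha_pos=0.6, alpha_neg=0.2))

# Same utility difference, different beta: a higher beta yields a choice
# probability further from 0.5, i.e. less stochastic choices.
print(p_cooperate(1.0, 0.5, beta=1.0), p_cooperate(1.0, 0.5, beta=5.0))
```

In this sketch, equal utilities give a cooperation probability of 0.5 for any β, and increasing β only sharpens the preference for whichever option has the higher utility.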

      A parameter partial correlation matrix (off-diagonal) would be helpful to understand the relationship between parameters in both adolescents and adults separately. This may provide a good overview of how the model properties may change with age (e.g. a^+'s relation to \beta).

      We thank the reviewer for this helpful comment. We fully agree that a parameter partial correlation matrix can further elucidate the relationships among parameters. Accordingly, we conducted a partial correlation analysis and added the visually presented results to the revised manuscript as Figure 2-figure supplement 4.

      It would be helpful to have Bayes Factors reported with each statistical test, given that several p-values fall between 0.01 and 0.10.

      We thank the reviewer for this important recommendation. We have conducted Bayes factor analyses and reported BF10 for all relevant post hoc comparisons. We also clarified our analysis in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Discussion: I believe the language around ruling out failures in mentalising needs to be toned down. RL models do not enable formal representational differences required to assess mentalising, but they can distinguish biases in value learning, which in itself is interesting. If the authors were to show that more complex 'ToM-like' Bayesian models were beaten by RL models across the board, and this did not differ across adults and adolescents, there would be a stronger case to make this claim. I think the authors either need to include Bayesian models in their comparison, or tone down their language on this point, and/or suggest ways in which this point might be more thoroughly investigated (e.g., using structured models on the same task and running comparisons: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0087619).

      We thank the reviewer for the comments. Please see our response to Reviewer 1 (Appraisal & Discussion section) for details.

      Reviewer #2 (Recommendations for the authors):

      The authors may want to show the winning model earlier (perhaps near the beginning of the Results section, when model parameters are first mentioned).

      We thank the reviewer for this suggestion. We agree that highlighting the winning model early improves clarity. Currently, we have mentioned the winning model before the beginning of the Results section. Specifically, in the penultimate paragraph of the Introduction we state:

      “We identified the asymmetric RL learning model as the winning model that best explained the cooperative decisions of both adolescents and adults.”

      Reviewer #3 (Recommendations for the authors):

      In addition to the points mentioned above, I suggest the following:

      (1) Clarify plots by clearly explaining each variable. In particular, the indices 1 vs. 1,2 vs. 1,2,3 were not immediately understandable.

      We thank the reviewer for this suggestion. We agree that the indices were not immediately clear. We have revised the figure captions (Figures 1 and 4) to define these terms explicitly:

      “The x-axis represents the consistency of the partner’s actions in previous trials (t<sub>−1</sub>: last trial; t<sub>−1,2</sub>: last two trials; t<sub>−1,2,3</sub>: last three trials).”

      It's unclear why the index stops at 3. If this isn't the maximum possible number of consecutive cooperation trials, please consider including all relevant data, as adolescents might show a trend similar to adults over more trials.

      We thank the reviewer for raising this point. In our exploratory analyses, we also examined longer streaks of consecutive partner cooperation or defection (up to four or five trials). Two empirical considerations led us to set the cutoff at three in the final analyses. First, the influence of partner behavior diminished sharply with temporal distance. In both GLMMs and LMMs, coefficients for earlier partner choices were small and unstable, and their inclusion substantially increased model complexity and multicollinearity. This recency pattern is consistent with learning and decision models emphasizing stronger weighting of recent evidence (Fudenberg & Levine, 2014; Fudenberg & Peysakhovich, 2016). Second, streaks longer than three were rare, especially among some participants, leading to data sparsity and inflated uncertainty. Including these sparse conditions risked biasing group estimates rather than clarifying them. Balancing informativeness and stability, we therefore restricted the index to three consecutive partner choices in the main analyses, which we believe sufficiently capture individuals’ general tendencies in reciprocal cooperation.
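
As an illustration of the capped index, the sketch below counts how many of the partner's most recent choices before a given trial were identical and truncates that count at three. The "C"/"D" coding, the 0-based trial index, and the function name are hypothetical conventions introduced here; this is not the analysis code used in the study.

```python
def capped_streak(partner_choices, t, cap=3):
    """Length of the run of identical partner choices ending at trial t-1.

    partner_choices: sequence of "C" (cooperate) or "D" (defect).
    t: 0-based index of the current trial; only earlier trials are used.
    cap: longer runs are truncated to this value (three by default).
    """
    if t <= 0:
        return 0  # no history before the first trial
    last = partner_choices[t - 1]
    length = 0
    i = t - 1
    while i >= 0 and partner_choices[i] == last and length < cap:
        length += 1
        i -= 1
    return length


choices = ["C", "C", "D", "D", "D", "D", "C"]
print([capped_streak(choices, t) for t in range(len(choices) + 1)])
```

Capping the run length pools all longer streaks into the last category, which is one simple way to avoid the sparse cells described above.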

      The term "reciprocity" may not be necessary. Since it appears to reflect a general preference for cooperation, it may be clearer to refer to the specific behavior or parameter being measured. This would also avoid confusion, especially since adolescents do show negative reciprocity in response to repeated defection.

      We thank the reviewer for this comment. In our work, we compute the intrinsic reward for reciprocity as p × ω, where p is the expectation of partner cooperation and ω is the cooperation preference. In the rPDG, this value framework manifests as a reciprocity-derived reward: sustained mutual cooperation maximizes joint benefits, and the resulting choice pattern reflects a value for reciprocity, contingent on the expected cooperation of the partner. This quantity enters the trade-off between U<sub>cooperation</sub> and U<sub>defection</sub> and captures the participant’s intrinsic reward for reciprocity versus the additional monetary payoff of defection. Therefore, we consider “reciprocity” an appropriate term for this construct.

      Interpretation of parameters should closely reflect what they specifically measure.

      We thank the reviewer for pointing this out. We have refined the relevant interpretations of parameters in the current Results and Discussion sections.

      Prior research has shown links between Theory of Mind (ToM) and cooperation (e.g., Martínez-Velázquez et al., 2024). It would be valuable to test whether this also holds in your dataset.

      We thank the reviewer for this thoughtful comment. Although we did not directly measure participants’ ToM, our design allowed us to estimate participants’ trial-by-trial inferences (i.e., expectations) about their partner’s cooperation probability. We therefore treat these cooperation expectations as an indirect representation for belief inference, which is related to ToM processes. To test whether this belief-inference component relates to cooperation in our dataset, we further conducted an exploratory analysis (GLMM<sub>sup</sub>4) in which participants’ choices were regressed on their cooperation expectations, group, and the group × cooperation-expectation interaction, controlling for trial number and gender, with random effects. Consistent with the ToM–cooperation link in prior research (Martínez-Velázquez et al., 2024), participants’ expectations about their partner’s cooperation significantly predicted their cooperative behavior (Table 14), suggesting that decisions were shaped by social learning about others’ inferred actions. Moreover, the interaction between group and cooperation expectation was not significant, indicating that this inference-driven social learning process likely operates similarly in adolescents and adults. This aligns with our primary modeling results showing that both age groups update beliefs via an asymmetric learning process. We have reported these analyses in the Appendix Analysis section.

      More informative table captions would help the reader. Please clarify how variables are coded (e.g., is female = 0 or 1? Is adolescent = 0 or 1?), to avoid the need to search across the manuscript for this information.

      We thank the reviewer for raising this point. We have added clear and standardized variable coding in the table notes of all tables to make them more informative and avoid the need to search the paper. We have ensured consistent wording and formatting across all tables.

      I hope these comments are helpful and support the authors in further strengthening their manuscript.

      We thank the three reviewers for their comments, which have been helpful in strengthening this work.

      References

      (1) Fudenberg, D., & Levine, D. K. (2014). Recency, consistent learning, and Nash equilibrium. Proceedings of the National Academy of Sciences of the United States of America, 111(Suppl. 3), 10826–10829. https://doi.org/10.1073/pnas.1400987111

      (2) Fudenberg, D., & Peysakhovich, A. (2016). Recency, records, and recaps: Learning and nonequilibrium behavior in a simple decision problem. ACM Transactions on Economics and Computation, 4(4), Article 23, 1–18. https://doi.org/10.1145/2956581

      (3) Hackel, L., Doll, B., & Amodio, D. (2015). Instrumental learning of traits versus rewards: Dissociable neural correlates and effects on choice. Nature Neuroscience, 18, 1233–1235. https://doi.org/10.1038/nn.4080

      (4) Icenogle, G., Steinberg, L., Duell, N., Chein, J., Chang, L., Chaudhary, N., Di Giunta, L., Dodge, K. A., Fanti, K. A., Lansford, J. E., Oburu, P., Pastorelli, C., Skinner, A. T., Sorbring, E., Tapanya, S., Uribe Tirado, L. M., Alampay, L. P., Al-Hassan, S. M., Takash, H. M. S., & Bacchini, D. (2019). Adolescents’ cognitive capacity reaches adult levels prior to their psychosocial maturity: Evidence for a “maturity gap” in a multinational, cross-sectional sample. Law and Human Behavior, 43(1), 69–85. https://doi.org/10.1037/lhb0000315

      (5) Krekelberg, B. (2024). Matlab Toolbox for Bayes Factor Analysis (v3.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.13744717

      (6) Martínez-Velázquez, E. S., Ponce-Juárez, S. P., Díaz Furlong, A., & Sequeira, H. (2024). Cooperative behavior in adolescents: A contribution of empathy and emotional regulation? Frontiers in Psychology, 15, 1342458. https://doi.org/10.3389/fpsyg.2024.1342458

      (7) Tervo-Clemmens, B., Calabro, F. J., Parr, A. C., et al. (2023). A canonical trajectory of executive function maturation from adolescence to adulthood. Nature Communications, 14, 6922. https://doi.org/10.1038/s41467-023-42540-8

      (8) King-Casas, B., Tomlin, D., Anen, C., Camerer, C. F., Quartz, S. R., & Montague, P. R. (2005). Getting to know you: reputation and trust in a two-person economic exchange. Science, 308(5718), 78-83. https://doi.org/10.1126/science.1108062

      (9) Rilling, J. K., Gutman, D. A., Zeh, T. R., Pagnoni, G., Berns, G. S., & Kilts, C. D. (2002). A neural basis for social cooperation. Neuron, 35(2), 395-405. https://doi.org/10.1016/s0896-6273(02)00755-9

      (10) Fareri, D. S., Chang, L. J., & Delgado, M. R. (2015). Computational substrates of social value in interpersonal collaboration. Journal of Neuroscience, 35(21), 8170-8180. https://doi.org/10.1523/JNEUROSCI.4775-14.2015

      (11) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

      (12) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

      (13) Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636

      (14) Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x

      (15) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018

      (16) Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127

    1. eLife Assessment

      This study offers important insights into how entorhinal and hippocampal activity support human thinking in feature spaces. It replicates hexagonal symmetry in entorhinal cortex, reports a novel three-fold symmetry in both behavior and hippocampal signals, and links these findings with a computational model. The task and analyses are sophisticated, and the results appear solid and of broad interest to neuroscientists.

    2. Reviewer #1 (Public review):

      Summary:

      Zhang and colleagues examine neural representations underlying abstract navigation in entorhinal cortex (EC) and hippocampus (HC) using fMRI. This paper replicates a previously identified hexagonal modulation of abstract navigation vectors in abstract space in EC in a novel task involving navigating in a conceptual Greeble space. In HC, the authors identify a three-fold signal of the navigation angle. They also use a novel analysis technique (spectral analysis) to look at spatial patterns in these two areas and identify phase coupling between HC and EC. Interestingly, the three-fold pattern identified in the hippocampus explains quirks in participants' behavior where navigation performance follows a three-fold periodicity. Finally, the authors propose an EC-HPC PhaseSync Model to understand how the EC and HC construct cognitive maps. The wide array and creativity of the techniques used are impressive, but because of their unique nature, the paper would benefit from more details on how some of these techniques were implemented.

      Comments on revisions:

      Most of my concerns were adequately addressed, and I believe the paper is greatly improved. I have two more points. I noticed that the legend for Figure 4 still refers to some components of the previous figure version; this should be updated to reflect the current version of the figure. I also think the paper would benefit from more details regarding some of the analyses. Specifically, the phase-amplitude coupling analysis should have its own section in the Methods that clarifies how the BOLD signals were reconstructed.

    3. Reviewer #2 (Public review):

      The authors report results from behavioral data, fMRI recordings, and computer simulations during a conceptual navigation task. They report 3-fold symmetry in behavioral and simulated model performance, 3-fold symmetry in hippocampal activity, and 6-fold symmetry in entorhinal activity (all as a function of movement directions in conceptual space). The analyses seem thoroughly done, and the results and simulations are very interesting.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      Zhang and colleagues examine neural representations underlying abstract navigation in the entorhinal cortex (EC) and hippocampus (HC) using fMRI. This paper replicates a previously identified hexagonal modulation of abstract navigation vectors in abstract space in EC in a novel task involving navigating in a conceptual Greeble space. In HC, the authors claim to identify a three-fold signal of the navigation angle. They also use a novel analysis technique (spectral analysis) to look at spatial patterns in these two areas and identify phase coupling between HC and EC. Finally, the authors propose an EC-HPC PhaseSync Model to understand how the EC and HC construct cognitive maps. While the wide array of techniques used is impressive and their creativity in analysis is admirable, overall, I found the paper a bit confusing and unconvincing. I recommend a significant rewrite of their paper to motivate their methods and clarify what they actually did and why. The claim of three-fold modulation in HC, while potentially highly interesting to the community, needs more background to motivate why they did the analysis in the first place, more interpretation as to why this would emerge in biology, and more care taken to consider alternative hypotheses steeped in existing models of HC function. I think this paper does have potential to be interesting and impactful, but I would like to see these issues improved first.

      General comments:

      (1) Some of the terminology used does not match the terminology used in previous relevant literature (e.g., sinusoidal analysis, 1D directional domain).

      We thank the reviewer for this valuable suggestion, which helps to improve the consistency of our terminology with previous literature and to reduce potential ambiguity. Accordingly, we have replaced “sinusoidal analysis” with “sinusoidal modulation” (Doeller et al., 2010; Bao et al., 2019; Raithel et al., 2023) and “1D directional domain” with “angular domain of path directions” throughout the manuscript.

      (2) Throughout the paper, novel methods and ideas are introduced without adequate explanation (e.g., the spectral analysis and three-fold periodicity of HC).

      We thank the reviewer for raising this important point. In the revised manuscript, we have substantially extended the Introduction (paragraphs 2–4) to clarify our hypothesis, explicitly explaining why the three primary axes of the hexagonal grid cell code may manifest as vector fields. We have also revised the first paragraph of the “3-fold periodicity in the HPC” section in the Results to clarify the rationale for using spectral analysis. Please refer to our responses to comments 2 and 3 below for details.

      Reviewer #2 (Public review):

      The authors report results from behavioral data, fMRI recordings, and computer simulations during a conceptual navigation task. They report 3-fold symmetry in behavioral and simulated model performance, 3-fold symmetry in hippocampal activity, and 6-fold symmetry in entorhinal activity (all as a function of movement directions in conceptual space). The analyses are thoroughly done, and the results and simulations are very interesting.

      We sincerely thank the reviewer for the positive and encouraging comments on our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) This paper has quite a few spelling and grammatical mistakes, making it difficult to understand at times.

      We apologize for the spelling and grammatical errors. We have carefully re-read and edited the entire manuscript to correct typographical and grammatical errors and to improve clarity and readability.

      (2) Introduction - It's not clear why the three primary axes of hexagonal grid cell code would manifest as vector fields.

      We thank the reviewer for raising this important point. In the revised Introduction (paragraphs 2, 3, and 4), we now explicitly explain the rationale behind our hypothesis that the three primary axes of the hexagonal grid cell code manifest as vector fields.

      In paragraph 2, we present empirical evidence from rodent, bat, and human studies demonstrating that mental simulation of prospective paths relies on vectorial representations in the hippocampus (Sarel et al., 2017; Ormond and O’Keefe, 2022; Muhle-Karbe et al., 2023).

      In paragraphs 3 and 4, we introduce our central hypothesis: vectorial representations may originate from population-level projections of entorhinal grid cell activity, based on three key considerations:

      (1) The EC serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020).

      (2) Grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022), which makes it plausible that their spatially periodic activity can be detected using fMRI.

      (3) A model-based inference: for example, in the simplest case, when one mentally simulates a straight pathway aligned with the grid orientation, a subpopulation of grid cells would be activated. The resulting population activity would form a near-perfect vectorial representation, with constant activation strength along the path. In contrast, if the simulated path is misaligned with the grid orientation, the population response becomes a distorted vectorial code. Consequently, simulating all possible straight paths spanning 0°–360° results in 3-fold periodicity in the activity patterns—due to the 180° rotational symmetry of the hexagonal grid, orientations separated by 180° are indistinguishable.

      We therefore speculate that vectorial representations embedded in grid cell activity exhibit 3-fold periodicity across spatial orientations and serve as a periodic structure to represent spatial direction. Supporting this view, reorientation paradigms in both rodents and young children have shown that subjects search equally in two opposite directions, reflecting successful orientation encoding but a failure to integrate absolute spatial direction (Hermer and Spelke, 1994; Julian et al., 2015; Gallistel, 2017; Julian et al., 2018).

      (3) It took me a few reads to understand what the spectral analysis was. After understanding, I do think this is quite clever. However, this paper needs more motivation to understand why you are performing this analysis. E.g., why not just take the average regressor at the 10º, 70º, etc. bins and compare it to the average regressor at 40º, 100º bins? What does the Fourier transform buy you?

      We are sorry for the confusion. Below, we outline the rationale for employing Fast Fourier Transform (FFT) analysis to identify neural periodicity. In the revised manuscript, we have added these clarifications to the first paragraph of the “3-fold periodicity in the HPC” subsection of the Results.

      First, FFT serves as an independent approach to cross-validate the sinusoidal modulation results, providing complementary evidence for the 6-fold periodicity in EC and the 3-fold periodicity in HPC.

      Second, FFT enables unbiased detection of multiple candidate periodicities (e.g., 3–7-fold) simultaneously without requiring prior assumptions about spatial phase (orientation). By contrast, directly comparing “aligned” versus “misaligned” angular bins (e.g., 10°/70° vs. 40°/100°) would implicitly assume knowledge of the phase offset, which was not known a priori.

      Finally, FFT uniquely allows periodicity analysis of behavioral performance, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency makes it possible to directly compare periodicities across neural and behavioral domains.
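      The phase argument in the second point can be illustrated with a small sketch. This is not the analysis code used in the study: the tuning curves are synthetic, the helper names (`tuning_curve`, `fold_magnitude`, `aligned_minus_misaligned`) are hypothetical, and the 36 bins of 10° are taken from the response. The discrete Fourier component is written out by hand so the example runs without third-party packages.

```python
import cmath
import math

N_BINS = 36  # 10-degree direction bins (assumed from the response)


def tuning_curve(k, phase_deg, n=N_BINS):
    """Synthetic k-fold tuning curve whose peaks sit at phase_deg (illustrative only)."""
    return [math.cos(math.radians(k * (i * 360.0 / n - phase_deg))) for i in range(n)]


def fold_magnitude(tuning, k):
    """Magnitude of the k-fold Fourier component; independent of the signal's phase."""
    n = len(tuning)
    coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(tuning))
    return abs(coeff) / n


def aligned_minus_misaligned(tuning, k, assumed_phase_deg=0.0):
    """Mean of bins assumed to be peaks minus mean of bins assumed to be troughs."""
    n = len(tuning)
    aligned, misaligned = [], []
    for i, v in enumerate(tuning):
        c = math.cos(math.radians(k * (i * 360.0 / n - assumed_phase_deg)))
        if c > 0.75:        # bin lies on an assumed peak
            aligned.append(v)
        elif c < -0.75:     # bin lies on an assumed trough
            misaligned.append(v)
    return sum(aligned) / len(aligned) - sum(misaligned) / len(misaligned)


peaks_at_0 = tuning_curve(6, 0.0)
peaks_at_15 = tuning_curve(6, 15.0)
print(fold_magnitude(peaks_at_0, 6), fold_magnitude(peaks_at_15, 6))
print(aligned_minus_misaligned(peaks_at_0, 6), aligned_minus_misaligned(peaks_at_15, 6))
```

      In this toy case the Fourier magnitude is the same for both curves, whereas the bin contrast built on an assumed 0° phase vanishes for the curve whose peaks sit at 15°.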

      (4) A more minor point: at one point, you say it’s a spectral analysis of the BOLD signals, but the methods description makes it sound like you estimated regressors at each of the bins before performing FFT. Please clarify. 

      We apologize for the confusion. In our manuscript, we use the term spectral analysis to distinguish this approach from sinusoidal modulation analysis. Conceptually, our spectral analysis involves a three-level procedure:

      (1) First level: We estimated direction-dependent activity maps using a general linear model (GLM), which included 36 regressors corresponding to path directions, down-sampled in 10° increments.

      (2) Second level: We applied a Fast Fourier Transform (FFT) to the direction-dependent activity maps derived from the GLM to examine the spectral magnitude of potential spatial periodicities.

      (3) Third level: We conducted group-level statistical analyses across participants to assess the consistency of the observed periodicities.

      We have revised the “Spectral analysis of MRI BOLD signals” subsection in the Methods to clarify this multi-level procedure.

      (5) Figure 4a:

      Why do the phases go all the way to 2*pi if periodicity is either three-fold or six-fold? 

      When performing correlation between phases, you should perform a circular-circular correlation instead of a Pearson's correlation.

      We thank the reviewer for raising this important point. In the original Figure 4a, both EC and HPC phases spanned 0–2π because their sinusoidal phase estimates were projected into a common angular space by scaling them according to their symmetry factors (i.e., multiplying the 3-fold phase by 3 and the 6-fold phase by 6), followed by taking the modulo 2π. However, this projection forced signals with distinct intrinsic periodicities (120° vs. 60° cycles) into a shared 360° space, thereby distorting their relative angular distances and disrupting the one-to-one correspondence between physical directions and phase values. Consequently, this transformation could bias the estimation of their phase relationship.

      In the revised analysis and Figure 4a, we retained the original phase estimates derived from the sinusoidal modulation within their native periodic ranges (0–120° for 3-fold and 0–60° for 6-fold) by applying modulo operations directly. Following your suggestion, the relationship between EC and HPC phases was then quantified using circular–circular correlation (Jammalamadaka & Sengupta, 2001), as implemented in the CircStat MATLAB toolbox. This updated analysis avoids the rescaling artifact and provides a statistically stronger and conceptually clearer characterization of the phase correspondence between EC and HPC.
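      For readers unfamiliar with the statistic, the circular–circular correlation coefficient can be sketched as follows. This is a minimal re-implementation of the textbook formula, not the CircStat code; the function names are hypothetical, and inputs are assumed to be angles in radians. How the 0–120° and 0–60° phase ranges are mapped to angles is not specified in this response and is left to the caller.

```python
import math


def circ_mean(angles):
    """Circular mean direction of angles given in radians."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))


def circ_corrcc(alpha, beta):
    """Circular-circular correlation between two equally long lists of angles.

    r = sum(sin(a - a_bar) * sin(b - b_bar))
        / sqrt(sum(sin(a - a_bar)^2) * sum(sin(b - b_bar)^2))
    """
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same length")
    a_bar, b_bar = circ_mean(alpha), circ_mean(beta)
    sa = [math.sin(a - a_bar) for a in alpha]
    sb = [math.sin(b - b_bar) for b in beta]
    numerator = sum(x * y for x, y in zip(sa, sb))
    denominator = math.sqrt(sum(x * x for x in sa) * sum(y * y for y in sb))
    return numerator / denominator
```

      Unlike Pearson's r, the coefficient is unchanged when a constant offset is added to either set of angles, which is why it suits phase data.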

      (6) Figure 4d needs additional clarification:

      Phase-locking is typically used to describe data with a high temporal precision. I understand you adopted an EEG analysis technique to this reconstructed fMRI time-series data, but it should be described differently to avoid confusion. This needs additional control analyses (especially given that 3 is a multiple of 6) to confirm that this result is specific to the periodicities found in the paper.

      We thank the reviewer for this insightful comment. We have extensively revised the description of Figure 4 to avoid confusion with EEG-based phase-locking techniques. The revised text now explicitly clarifies that our approach quantifies spatial-domain periodic coupling across path directions, rather than temporal synchronization of neural signals.

      To further address the reviewer’s concern about potential effects of the integer multiple relationship between the 3-fold HPC and 6-fold EC periodicities, we additionally performed two control analyses using the 9-fold and 12-fold EC components, both of which are also integer multiples of the 3-fold HPC periodicity. Neither control analysis showed significant coupling (p > 0.05), confirming that the observed 3-fold–6-fold coupling was specific and not driven by their harmonic relationship.

      The description of the revised Figure 4 has been updated in the “Phase Synchronization Between HPC and EC Activity” subsection of the Results.

      (7) Figure 5a is misleading. In the text, you say you test for propagation to egocentric cortical areas, but I don’t see any analyses done that test this. This feels more like a possible extension/future direction of your work that may be better placed in the discussion.

      We are sorry for the confusion. Figure 5a was intended as a hypothesis-driven illustration to motivate our analysis of behavioral periodicity based on participants’ task performance. However, we agree with the reviewer that, on its own, Figure 5a could be misleading, as it does not directly present supporting analyses.

      To provide empirical support for the interpretation depicted in Figure 5a, we conducted a whole-brain analysis (Figure S8), which revealed significant 3-fold periodic signals in egocentric cortical regions, including the parietal cortex (PC), precuneus (PCU), and motor regions.

      To avoid potential misinterpretation, we have revised the main text to include these results and explicitly referenced Figure S8 in connection with Figure 5a.

      The updated description in the “3-fold periodicity in human behavior” subsection in the Results is as follows:

      “Considering the reciprocal connectivity between the medial temporal lobe (MTL), where the EC and HPC reside, and the parietal cortex implicated in visuospatial perception and action, together with the observed 3-fold periodicity within the DMN (including the PC and PCu; Fig. S8), we hypothesized that the 3-fold periodic representations of path directions extend beyond the MTL to the egocentric cortical areas, such as the PC, thereby influencing participants' visuospatial task performance (Fig. 5a)”.

      Additionally, Figure 5a has been modified to more clearly highlight the hypothesized link between activity periodicity and behavioral periodicity, rather than suggesting a direct anatomical pathway.

      (8) PhaseSync model: I am not an expert in this type of modeling, so please put a lower weight on this comment (especially compared to some of the other reviewers). While the PhaseSync model seems interesting, it’s not clear from the discussion how this compares to current models. E.g., Does it support them by adding the three-fold HC periodicity? Does it demonstrate that some of them can't be correct because they don't include this three-fold periodicity?

      We thank the reviewer for the insightful comment regarding the PhaseSync model. We agree that further clarifying its relationship to existing computational frameworks is important.

      The EC–HPC PhaseSync model is not intended to replace or contradict existing grid–place cell models of navigation (e.g., Bicanski and Burgess, 2019; Whittington et al., 2020; Edvardsen et al., 2020). Instead, it offers a hierarchical extension by proposing that vectorial representations in the hippocampus emerge from the projections of periodic grid codes in the entorhinal cortex. Specifically, the model suggests that grid cell populations encode integrated path information, forming a vectorial gradient toward goal locations.

      To simplify the theoretical account, our model was implemented in an idealized square layout. In more complex real-world environments, hippocampal 3-fold periodicity may interact with additional spatial variables, such as distance, movement speed, and environmental boundaries.

      We have revised the final two paragraphs of the Discussion to clarify this conceptual framework and emphasize the importance of future studies in exploring how periodic activity in the EC–HPC circuit interacts with environmental features to support navigation.

      Reviewer #2 (Recommendations for the authors):

      (1) Please show a histogram of movement direction sampling for each participant.

      We thank the reviewer for this helpful suggestion. We have added a new supplementary figure (Figure S2) showing histograms of path direction sampling for each participant (36 bins of 10°). The figure is also included. Rayleigh tests for circular uniformity revealed no significant deviations from uniformity (all ps > 0.05, Bonferroni-corrected across participants), confirming that path directions were sampled evenly across 0°–360°.
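      As an illustration of the uniformity check, a Rayleigh test with Bonferroni correction can be sketched as below. This is not the authors' code: `rayleigh_test` and `bonferroni` are hypothetical names, and the p-value uses a closed-form approximation that is common in circular-statistics toolboxes (assumed here, not confirmed by the response).

```python
import math


def rayleigh_test(angles_rad):
    """Rayleigh test for circular uniformity; returns (z, approximate p-value)."""
    n = len(angles_rad)
    c = sum(math.cos(a) for a in angles_rad)
    s = sum(math.sin(a) for a in angles_rad)
    resultant = math.hypot(c, s)          # length of the summed unit vectors
    z = resultant ** 2 / n                # Rayleigh statistic
    # Closed-form approximation of the p-value (assumption, see lead-in).
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n * n - resultant ** 2)) - (1 + 2 * n))
    return z, min(p, 1.0)


def bonferroni(p_values):
    """Bonferroni correction across tests (here: across participants)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

      Directions spread evenly over 0°–360° give a resultant length near zero and a p-value near 1, whereas directions concentrated around one heading give a very small p-value.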

      (2) Why didn’t you use participants’ original trajectories (instead of the trajectories inferred from the movement start and end points) for the hexadirectional analyses? 

      In our paradigm, participants used two MRI-compatible 2-button response boxes (one for each hand) to adjust the two features of the greebles. As a result, the raw adjustment path contained only four cardinal directions (up, down, left, right). If we were to use the raw stepwise trajectories, the analysis would be restricted to these four directions, which would severely limit the angular resolution. By instead defining direction as the vector from the start to the end position in feature space, we can expand the effective range of directions to the full 0–360°. This approach follows previous literature on abstract grid-like coding in humans (e.g., Constantinescu et al., 2016), where direction was similarly defined by the relative change between two feature dimensions rather than the literal stepwise path. We have added this clarification to the “Sinusoidal modulation” subsection of the revised Methods.
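      The start-to-end definition of direction can be made concrete with a short sketch. The function names and the example coordinates are hypothetical; the 10° bin width is taken from the response.

```python
import math


def path_direction_deg(start, end):
    """Direction of the start-to-end vector in feature space, in [0, 360) degrees."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def direction_bin(direction_deg, bin_width=10.0):
    """Index of the angular bin containing direction_deg (36 bins for 10-degree bins)."""
    n_bins = int(round(360.0 / bin_width))
    return int(direction_deg // bin_width) % n_bins


# A stepwise path of two steps right and one step up uses only cardinal moves,
# but its start-to-end vector points in an intermediate direction.
print(path_direction_deg((0, 0), (2, 1)), direction_bin(path_direction_deg((0, 0), (2, 1))))
```

      The individual key presses in this example only ever point at 0° or 90°, while the start-to-end vector falls between them, which is what allows directions to cover the full circle.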

      (3) Legend of Figure 2: the statement "localizing grid cell activity" seems too strong because it is still not clear whether hexadirectional signals indeed result from grid-cell activity (e.g., Bin Khalid et al., eLife, 2024). I would suggest rephrasing this statement (here and elsewhere). 

      Thank you for this helpful suggestion. We have removed the statement “localizing grid cell activity” to avoid ambiguity and revised the legend of Figure 2a to more explicitly highlight its main purpose—defining how path directions and the aligned/misaligned conditions were constructed in the 6-fold modulation. We have also modified similar expressions throughout the manuscript to ensure consistency and clarity.

      (4) Legend of Figure 2: “cluster-based SVC correction for multiple comparisons” - what is the small volume you are using for the correction? Bilateral EC?

      For both Figure 2 and Figure 3, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This has been clarified in the revised Statistical Analysis section of the Methods as “… with small-volume correction (SVC) applied within the bilateral MTL”.

      (5) Legend of Figure 2: "ROI-based analysis" - what kind of ROI are you using? "corrected for multiple comparisons" - which comparisons are you referring to? Different symmetries and also the right/left hemisphere?

      In Figure 2b, the ROI was defined as a functional mask derived from the significant activation cluster in the right entorhinal cortex (EC). Since no robust clusters were observed in the left EC, the functional ROI was restricted to the right hemisphere. We indeed included Figure 2c to illustrate this point; however, we recognize that our description in the text was not sufficiently clear.

      Regarding the correction for multiple comparisons, this refers specifically to the comparisons across different rotational symmetries (3-, 4-, 5-, 6-, and 7-fold). Only the 6-fold symmetry survived correction, whereas no significant effects were detected for the other symmetries.

      We have clarified these points in the “6-fold periodicity in the EC” subsection of the Results as “… The ROI was defined as a functional mask of the right EC identified in the voxel-based analysis and further restricted within the anatomical EC. These analyses revealed significant periodic modulation only at 6-fold (Figure 2c; t(32) = 3.56, p = 0.006, two-tailed, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.62) …”.

      We have also revised the “3-fold periodicity in the HPC” subsection of the Results as “… ROI analysis, using a functional mask of the HPC identified in the spectral analysis and further restricted within the anatomical HPC, indicated that HPC activity selectively fluctuated at 3-fold periodicity (Figure 3e; t(32) = 3.94, p = 0.002, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.70) …”.

      (6) Figure 2d: Did you rotationally align 0° across participants? Please state explicitly whether (or not) 0° aligns with the x-axis in Greeble space.

      We thank the reviewer for this helpful question. Yes, before reconstructing the directional tuning curve in Figure 2d, path directions were rotationally aligned for each participant by subtracting the participant-specific grid orientation (ϕ) estimated from the independent dataset (odd sessions). We have now made this description explicit in the revised manuscript in the “6-fold periodicity in the EC” subsection of the Results, stating “… To account for individual difference in spatial phase, path directions were calibrated by subtracting the participant-specific grid orientation estimated from the odd sessions ...”.

      (7) Clustering of grid orientations in 30 participants: What does “Bonferroni corrected” refer to? Also, the Rayleigh test is sensitive to the number of voxels - do you obtain the same results when using pair-wise phase consistency? 

      “Bonferroni corrected” here refers to correction across participants. We have clarified this in the first paragraph of the “6-fold periodicity in the EC” subsection of the Results and in the legend of Supplementary Figure S5 as “Bonferroni-corrected across participants.”

      To examine whether our findings were sensitive to the number of voxels, we followed the reviewer’s guidance and computed pairwise phase consistency (PPC; Vinck et al., 2010) for each participant. The PPC results replicated those obtained with the Rayleigh test. We have added the new results to Supplementary Figure S5. We also updated the “Statistical Analysis” subsection of the Methods to describe PPC as “For the PPC (Vinck et al., 2010), significance was tested using 5,000 permutations of uniformly distributed random phases (0–2π) to generate a null distribution for comparison with the observed PPC”.
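      A minimal sketch of PPC with a random-phase null distribution is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the seed, and the one-sided p-value with a +1 correction are choices made here; only the 5,000 uniformly distributed random-phase permutations come from the response.

```python
import cmath
import math
import random


def ppc(phases):
    """Pairwise phase consistency: mean cosine of all pairwise phase differences.

    Uses the identity |sum(exp(i*theta))|^2 = N + 2 * sum_{i<j} cos(theta_i - theta_j).
    """
    n = len(phases)
    resultant = abs(sum(cmath.exp(1j * p) for p in phases))
    return (resultant ** 2 - n) / (n * (n - 1))


def ppc_permutation_p(phases, n_perm=5000, seed=0):
    """Compare the observed PPC with PPCs of uniformly random phases in [0, 2*pi)."""
    rng = random.Random(seed)
    observed = ppc(phases)
    n = len(phases)
    exceed = 0
    for _ in range(n_perm):
        null_phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        if ppc(null_phases) >= observed:
            exceed += 1
    # +1 in numerator and denominator keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

      Because PPC averages over pairs of observations, its expected value does not grow with the number of voxels, which is the property the reviewer's question targets.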

      (8) 6-fold periodicity in the EC: Do you compute an average grid orientation across all EC voxels, or do you compute voxel-specific grid orientations?

      Following the protocol originally described by Doeller et al. (2010), we estimated voxel-wise grid orientations within the EC and then obtained a participant-specific orientation by averaging across voxels within a hand-drawn bilateral EC mask. The procedure is described in detail in the “Sinusoidal modulation” subsection of the Methods.
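      The orientation estimate can be sketched with the commonly used quadrature-regressor formulation, in which a voxel's signal is regressed on cos(6θ) and sin(6θ) and the orientation is recovered from the two weights. The sketch below is an assumption-laden illustration, not the study's pipeline: the function names are hypothetical, directions are assumed to be evenly sampled over the circle (so the regression weights reduce to simple projections), and averaging across voxels is done as a circular mean in 6θ space.

```python
import math

FOLD = 6  # hexagonal (6-fold) symmetry


def voxel_orientation(directions_deg, signal, fold=FOLD):
    """Estimate one voxel's grid orientation (degrees, in [0, 360/fold))."""
    n = len(signal)
    mean = sum(signal) / n
    # With evenly sampled directions these projections equal the least-squares weights.
    b_cos = 2.0 / n * sum((y - mean) * math.cos(math.radians(fold * d))
                          for d, y in zip(directions_deg, signal))
    b_sin = 2.0 / n * sum((y - mean) * math.sin(math.radians(fold * d))
                          for d, y in zip(directions_deg, signal))
    return (math.degrees(math.atan2(b_sin, b_cos)) / fold) % (360.0 / fold)


def mean_orientation(orientations_deg, fold=FOLD):
    """Average voxel-wise orientations on the circle defined by the fold symmetry."""
    s = sum(math.sin(math.radians(fold * o)) for o in orientations_deg)
    c = sum(math.cos(math.radians(fold * o)) for o in orientations_deg)
    return (math.degrees(math.atan2(s, c)) / fold) % (360.0 / fold)
```

      Averaging in 6θ space matters because orientations of 58° and 2° are only 4° apart under 60° periodicity; an arithmetic mean would place them at 30°, the opposite side of the cycle.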

      (9) Hand-drawn bilateral EC mask: What was your procedure for drawing this mask? What results do you get with a standard mask, for example, from Freesurfer or SPM? Why do you perform this analysis bilaterally, given that the earlier analysis identified 6-fold symmetry only in the right EC? What do you mean by "permutation corrected for multiple comparisons"?

      We thank the reviewer for raising these important methodological points. To our knowledge, no standard volumetric atlas provides an anatomically defined entorhinal cortex (EC) mask. For example, the built-in Harvard–Oxford cortical structural atlas in FSL contains only a parahippocampal region that encompasses, but does not isolate, the EC. The AAL atlas likewise does not contain an EC region. In FreeSurfer, an EC label is available, but only in the fsaverage surface space, which is not directly compatible with MNI-based volumetric group-level analyses.

      Therefore, we constructed a bilateral EC mask by manually delineating the EC according to the detailed anatomical landmarks described by Insausti et al. (1998). Masks were created using ITK-SNAP (Version 3.8, www.itksnap.org). For transparency and reproducibility, the mask has been made publicly available at the Science Data Bank (link: https://www.scidb.cn/s/NBriAn), as indicated in the revised Data and Code availability section.

      We used a bilateral EC mask, despite voxel-wise effects being strongest in the right EC, for two reasons. First, we did not have any a priori hypothesis regarding the laterality of EC involvement before performing the analyses. Second, previous studies estimated grid orientation using a bilateral EC mask in their sinusoidal analyses (Doeller et al., 2010; Constantinescu et al., 2016; Bao et al., 2019; Wagner et al., 2023; Raithel et al., 2023). We therefore followed this established approach to estimate grid orientation.

      By “permutation corrected for multiple comparisons” we refer to the family-wise error correction applied to the reconstructed directional tuning curves (Figure 2d for the EC, Figure 3f for the HPC). Specifically, directional labels were randomly shuffled 5,000 times, and an FFT was applied to each shuffled dataset to compute spectral power at each fold. This procedure generated null distributions of spectral power for each symmetry. For each fold, the 95th percentile of the maximal power across permutations was used as the uncorrected threshold. To correct across folds, the 95th percentile of the maximal suprathreshold power across all symmetries was taken as the family-wise error–corrected threshold. We have clarified this procedure in the revised “Statistical Analysis” subsection of the Methods.
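      The general idea of a permutation-based family-wise threshold can be sketched as follows. This illustrates the standard max-statistic approach, which may differ in detail from the procedure described above; the function names, the percentile rule, and the seed are assumptions made here, while the label shuffling, the 5,000 permutations, and the 3- to 7-fold candidates come from the response.

```python
import cmath
import math
import random

FOLDS = (3, 4, 5, 6, 7)


def fold_power(tuning, k):
    """Spectral power of the k-fold Fourier component of a direction tuning curve."""
    n = len(tuning)
    coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(tuning))
    return (abs(coeff) / n) ** 2


def percentile_threshold(values, alpha=0.05):
    """Value at the (1 - alpha) quantile of a null distribution."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil((1 - alpha) * len(ordered)) - 1))
    return ordered[index]


def permutation_thresholds(tuning, folds=FOLDS, n_perm=5000, alpha=0.05, seed=0):
    """Return (per-fold uncorrected thresholds, family-wise corrected threshold)."""
    rng = random.Random(seed)
    shuffled = list(tuning)
    per_fold = {k: [] for k in folds}
    max_across_folds = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between direction labels and activity
        powers = {k: fold_power(shuffled, k) for k in folds}
        for k in folds:
            per_fold[k].append(powers[k])
        max_across_folds.append(max(powers.values()))  # max statistic controls FWE
    uncorrected = {k: percentile_threshold(v, alpha) for k, v in per_fold.items()}
    return uncorrected, percentile_threshold(max_across_folds, alpha)
```

      A fold is then declared significant after correction only if its observed power exceeds the second return value, which by construction is at least as large as every per-fold threshold.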

      (10) Figures 3b and 3d: Why do different hippocampal voxels show significance for the sinusoidal versus spectral analysis? Shouldn’t the analyses be redundant and, thus, identify the same significant voxels? 

      We thank the reviewer for this insightful question. Although both sinusoidal modulation and spectral analysis aim to detect periodic neural activity, the two approaches are methodologically distinct and are therefore not expected to identify exactly the same significant voxels.

      Sinusoidal modulation relies on a GLM with sine and cosine regressors to test for phase-aligned periodicity (e.g., 3-fold or 6-fold), calibrated according to the estimated grid orientation. This approach is highly specific but critically depends on accurate orientation estimation. In contrast, spectral analysis applies Fourier decomposition to the directional tuning profile, enabling the detection of periodic components without requiring orientation calibration.

      Accordingly, the two analyses are not redundant but complementary. The FFT approach allows for an unbiased exploration of multiple candidate periodicities (e.g., 3–7-fold) without predefined assumptions, thereby providing a critical cross-validation of the sinusoidal GLM results. This strengthens the evidence for 6-fold periodicity in EC and 3-fold periodicity in HPC. Furthermore, FFT uniquely facilitates the analysis of periodicities in behavioral performance data, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency enables direct comparison of periodicities across neural and behavioral domains.

      Additionally, the anatomical distributions of the HPC clusters appear more similar between Figure 3b and Figure 3d after re-plotting Figure 3d using the peak voxel coordinates (x = –24, y = –18), which are closer to those used for Figure 3b (x = –24, y = –20), as shown in the revised Figure 3.

      Taken together, the two analyses serve distinct but complementary purposes.

      (11) 3-fold sinusoidal analysis in hippocampus: What kind of small volume are you using to correct for multiple comparisons?

      We thank the reviewer for this comment. The same small volume correction procedure was applied as described in R4. Specifically, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This procedure has been clarified in the revised Statistical Analysis section of the Methods as follows: “… with small-volume correction (SVC) applied within the bilateral MTL.”

      (12) Figure S5: “right HPC” – isn’t the cluster in the left hippocampus? 

      We are sorry for the confusion. The brain image was presented in radiological orientation (i.e., left and right are flipped). We also checked the figure and confirmed that the cluster shown in the original Figure S5 (i.e., Figure S6 in the revised manuscript) is correctly labeled as the right hippocampus, as indicated by the MNI coordinate (x = 22), where positive x values denote the right hemisphere. To avoid potential confusion, we have explicitly added the statement “Volumetric results are displayed in radiological orientation” to the figure legends of all volume-based results.

      (13) Figure S5: Why are the significant voxels different from the 3-fold symmetry analysis using 10° bins?

      As shown in R10, the apparent differences largely reflect variation in MNI coordinates. After adjusting for display coordinates, the anatomical locations of the significant clusters are in fact highly similar between the 10°-binned (Figure 3d, shown above) and the 20°-binned results (Figure S6).

      Although both analyses rely on sinusoidal modulation, they differ in the resolution of the input angular bins (10° vs. 20°). Combined with the inherent noise in fMRI data, this makes it unlikely that the two approaches would yield exactly the same set of significant voxels. Importantly, both analyses consistently reveal robust 3-fold periodicity in the hippocampus, indicating that the observed effect is not dependent on angular bin size.

      (14) Figure 4a and corresponding text: What is the unit? Phase at which frequency? Are you using a circular-circular correlation to test for the relationship?

      We thank the reviewer for raising this important point. In the revised manuscript, we have clarified that the unit of the phase values is radians, corresponding to the 6-fold periodic component in the EC and the 3-fold periodic component in the HPC. In the original Figure 4a, both EC and HPC phases, estimated from sinusoidal modulation, were analyzed using Pearson correlation. We have since realized that this approach has issues, as also noted in our response R5 to Reviewer #1.

      In the revised analysis and Figure 4a (as shown above), we re-evaluated the relationship between EC and HPC phases using a circular–circular correlation (Jammalamadaka & Sengupta, 2001), implemented in the CircStat MATLAB toolbox. The “Phase synchronization between the HPC and EC activity” subsection of the Results has been updated accordingly, as follows:

      “To examine whether the spatial phase structure in one region could predict that in another, we tested whether the orientations of the 6-fold EC and 3-fold HPC periodic activities, estimated from odd-numbered sessions using sinusoidal modulation with rotationally symmetric parameters (in radians), were correlated across participants. A cross-participant circular–circular correlation was conducted between the spatial phases of the two areas to quantify the spatial correspondence of their activity patterns (EC: purple dots; HPC: green dots) (Jammalamadaka & Sengupta, 2001). The analysis revealed a significant circular correlation (Figure 4a; r = 0.42, p < 0.001) …”.

      In the “Statistical analysis” subsection of the Methods:

      “… The relationship between EC and HPC phases was evaluated using the circular–circular correlation (Jammalamadaka & Sengupta, 2001) implemented in the CircStat MATLAB toolbox …”.

      (15) Paragraph following “We further examined amplitude-phase coupling...” - please clarify what data goes into this analysis.

      We thank the reviewer for this helpful comment. In this analysis, the input data consisted of hippocampal (HPC) phase and entorhinal (EC) amplitude, both extracted using the Hilbert transform from the reconstructed BOLD signals of the EC and HPC derived through sinusoidal modulation. We have substantially revised the description of the amplitude–phase coupling analysis in the third paragraph of the “Phase Synchronization Between HPC and EC Activity” subsection of the Results to clarify this procedure.

      (16) Alignment between EC 6-fold phases and HC 3-fold phases: Why don't you simply test whether the preferred 6-fold orientations in EC are similar to the preferred 3-fold phases in HC? The phase-amplitude coupling analyses seem sophisticated but are complex, so it is somewhat difficult to judge to what extent they are correct. 

      We thank the reviewer for this thoughtful comment. We employed two complementary analyses to examine the relationship between EC and HPC activity. In the revised Figure 4 (as shown in Figure 4 for Reviewer #1), Figure 4a provides a direct and intuitive measure of the phase relationship between the two regions using circular–circular correlation. Figure 4b–c examines whether the activity peaks of the two regions are aligned across path directions using cross-frequency amplitude–phase coupling, given our hypothesis that the spatial phase of the HPC depends on EC projections. These two analyses are complementary: a phase correlation does not necessarily imply peak-to-peak alignment, and conversely, peak alignment does not always yield a statistically significant phase correlation. We therefore combined multiple analytical approaches as a cross-validation across methods, providing convergent evidence for robust EC–HPC coupling.

      (17) Figure 5: Do these results hold when you estimate performance just based on “deviation from the goal to ending locations” (without taking path length into account)? 

      We thank the reviewer for this thoughtful suggestion. Following the reviewer’s advice, we re-estimated behavioral performance using the deviation between the goal and ending locations (i.e., error size) and path length independently. As shown in the new Figure S9, no significant periodicity was observed in error size (p > 0.05), whereas a robust 3-fold periodicity was found for path length (p < 0.05, corrected for multiple comparisons).

      We employed two behavioral metrics, (1) path length and (2) error size, for complementary reasons. In our task, participants navigated using four discrete keys corresponding to the cardinal directions (north, south, east, and west). This design inherently induces a 4-fold bias in path directions, as described in the “Behavioral performance” subsection of the Methods. To minimize this artifact, we computed the objectively optimal path length and used it to calibrate participants’ path lengths. However, error size could not be corrected in the same manner and retained a residual 4-fold tendency (see Figure S9d).

      Given that both path length and error size are behaviorally relevant and capture distinct aspects of task performance, we decided to retain both measures when quantifying behavioral periodicity. This clarification has been incorporated into the “Behavioral performance” subsection of the Methods, and the second paragraph of the “3-fold periodicity in human behavior” subsection of the Results.

      (18) Phase locking between behavioral performance and hippocampal activity: What is your way of creating surrogates here?

      We thank the reviewer for this helpful question. Surrogate datasets were generated by circularly shifting the signal series along the direction axis across all possible offsets (following Canolty et al., 2006). This procedure preserves the internal phase structure within each domain while disrupting consistent phase alignment, thereby removing any systematic coupling between the two signals. Each surrogate dataset underwent identical filtering and coherence computation to generate a null distribution, and the observed coherence strength was compared with this distribution using paired t-tests across participants. The statistical analysis section has been systematically revised to incorporate these methodological details.
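      The surrogate construction can be sketched as below. The circular shifting over all non-zero offsets follows the description above; everything else is an assumption for illustration, including the function names and the use of an absolute Pearson correlation as a stand-in for the filtered coherence measure, which is not reproduced here.

```python
import math


def circular_shift_surrogates(signal):
    """All circular shifts of a direction series, excluding the zero (identity) shift."""
    values = list(signal)
    return [values[s:] + values[:s] for s in range(1, len(values))]


def abs_pearson(x, y):
    """Stand-in coupling statistic (assumption): absolute Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y)


def surrogate_test(x, y, statistic=abs_pearson):
    """Observed statistic, its surrogate null distribution, and a one-sided p-value."""
    observed = statistic(x, y)
    null = [statistic(x, shifted) for shifted in circular_shift_surrogates(y)]
    p_value = (sum(v >= observed for v in null) + 1) / (len(null) + 1)
    return observed, null, p_value
```

      Each surrogate contains exactly the same values in the same cyclic order as the original series, so its internal structure is preserved while its alignment with the other signal is changed. One consequence worth keeping in mind is that a shift by a whole period of a strictly periodic signal reproduces the original alignment.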

      (19) I could not follow why the authors equate 3-fold symmetry with vectorial representations. This includes statements such as “these empirical findings provide a potential explanation for the formation of vectorial representation observed in the HPC.” Please clarify.

      We thank the reviewer for raising this point. Please refer to our response to R2 for Reviewer #1 and the revised Introduction (paragraphs 2–4), where we explicitly explain why the three primary axes of the hexagonal grid cell code can manifest as vector fields.

      (20) It was unclear whether the sentence “The EC provides a foundation for the formation of periodic representations in the HPC” is based on the authors’ observations or on other findings. If based on the authors’ findings, this statement seems too strong, given that no other studies have reported periodic representations in the hippocampus to date (to the best of my knowledge).

      We thank the reviewer for this comment. We agree that the original wording lacked sufficient rigor. We have extensively revised the 3rd paragraph of the Discussion section with more cautious language by reducing overinterpretation and emphasizing the consistency of our findings with prior empirical evidence, as follows: “The EC–HPC PhaseSync model demonstrates how a vectorial representation may emerge in the HPC from the projections of populations of periodic grid codes in the EC. The model was motivated by two observations. First, the EC intrinsically serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020), and grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022). Second, mental planning, characterized by “forward replay” (Dragoi and Tonegawa, 2011; Pfeiffer, 2020), has the capacity to activate populations of grid cells that represent sequential experiences in the absence of actual physical movement (Nyberg et al., 2022). We hypothesize that an integrated path code of sequential experiences may eventually be generated in the HPC, providing a vectorial gradient toward the goal location. The path code exhibits regular, vector-like representations when the path direction aligns with the orientations of grid axes, and becomes irregular when they misalign. This explanation is consistent with the band-like representations observed in the dorsomedial EC (Krupic et al., 2012) and the irregular activity fields of trace cells in the HPC (Poulter et al., 2021).”

    1. eLife Assessment

      TrASPr is an important contribution that leverages transformer models focused on regulatory regions to enhance predictions of tissue-specific splicing events. The revisions strengthen the manuscript by clarifying methodology and expanding analyses across exon and intron sizes, and the evidence supporting TrASPr's predictive performance is compelling. This work will be of interest to researchers in computational genomics and RNA biology, offering an improved model for splicing prediction and a promising approach to RNA sequence design.

    2. Reviewer #1 (Public review):

      Summary

      The authors propose a transformer-based model for prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Furthermore, the architecture of the model is designed to model alternative splicing events, whereas Pangolin and SpliceAI are focused on modeling individual splice junctions, which is an easier problem.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions: Regulatory elements predicted by the model were experimentally verified; novel tissue-specific cassette exons were verified by LSV-seq.

      (4) The authors use their model for sequence design to optimize splicing outcome, which is a novel application.

      Weaknesses:

      None noted.

    3. Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of evidence.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational auto encoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks. However, comparison of BOS against existing methods for sequence design is lacking.

      Strengths:

      - A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      - Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      - Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      - As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models.
        *This point is now addressed in the revision.*
        *Moreover, datasets have been made available by the authors on BitBucket.*

      - Related to the previous point, as discussed in the manuscript, SpliceAI and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also consider fine-tuning Pangolin on cassette exons only (as you do for your model).
        *This point is still not addressed in the revision.*

      - L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases-thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      Added after revision: The authors have added additional analyses of performance based on both the length of the exon under consideration and the total length of the surrounding intronic contexts. The result that TrASPr performs well across various context sizes (i.e., the length of the sequence between the upstream and downstream exons, ranging from <1k to >10k) is highly encouraging and supports the claim that most of the sequence-based splicing logic is located proximal to the splice sites. It is also noteworthy that TrASPr performs well for exons longer than 200, suggesting that most of the "regulatory code" is present at the exon boundaries rather than in its center (which TrASPr is blind to).

      Additionally, Pearson correlation is used as the sole performance metric in many analyses (e.g., Fig 2 - Supp 2). The authors should consider alternative accuracy metrics, such as RMSE, which better convey the magnitude of prediction error and are more easily comparable across datasets. Pearson correlation may also be more sensitive to outliers on the smaller samples that arise when binning sequences.

      - L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.
        *Our initial comment was incorrect, as pointed out by the authors.*

      - L214, ablations of individual features are missing.
        *This was addressed in the revision.*

      - L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.
        *This was addressed in the revision.*

      - L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly for the complete failure of SpliceAI and Pangolin shown in Fig 4d.
        *The authors should consider adding SpliceAI/Pangolin predictions for the alternative 5' and 3' splice site selection tasks (and code for related analyses) to the BitBucket repository.*

      - BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      *Minor comment added after revision: regarding the author response that "A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper.": It's not clear why BOS cannot be evaluated as a separate contribution by using "teacher" models other than TrASPr. Additionally, BOS lacks evaluation against existing methods for sequence optimization.*

      - The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.
        *See comment above.*
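
      The comment above about relying on Pearson correlation alone can be illustrated with a minimal Python sketch. The PSI values and the constant offset are made up for illustration and are not taken from the manuscript; the point is only that a systematically biased prediction can keep a perfect correlation while still carrying a clear error magnitude, which RMSE reports.

      ```python
      import math

      def pearson(x, y):
          """Pearson correlation coefficient of two equal-length sequences."""
          n = len(x)
          mx, my = sum(x) / n, sum(y) / n
          cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sx = math.sqrt(sum((a - mx) ** 2 for a in x))
          sy = math.sqrt(sum((b - my) ** 2 for b in y))
          return cov / (sx * sy)

      def rmse(x, y):
          """Root mean squared error between two equal-length sequences."""
          return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

      # Hypothetical observed PSI values and predictions that are all 0.10 too low.
      observed = [0.10, 0.30, 0.50, 0.70, 0.90]
      predicted = [p - 0.10 for p in observed]

      r = pearson(observed, predicted)   # insensitive to the constant offset
      err = rmse(observed, predicted)    # reports the offset as error
      ```

      Reporting both numbers per bin would make it easier to compare models across datasets with different PSI distributions.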

    4. Author response:

      The following is the authors’ response to the original reviews

      A point-by-point response is included below. Before we turn to that, we want to note one change that we decided to introduce, related to generalization on unseen tissues/cell types (Figure 3a in the original submission and a related question by Reviewer #2 below). This analysis was based on adding a latent “RBP state” representation during learning of condition/tissue-specific splicing. The “RBP state” per condition is captured by a dedicated encoder. Our original plan was to have a paper describing a new RBP-AE model we developed in parallel, which also served as the base to capture this “RBP state”. However, we were delayed in finalizing this second paper (it was led by other lab members, some of whom have already left the lab). This delay affected the TrASPr manuscript, as TrASPr’s code should be available and its analysis reproducible upon publication. After much deliberation, and in order to comply with reproducibility standards while not self-scooping the RBP-AE paper, we decided to take out the RBP-AE and replace it with a vanilla PCA-based embedding for the “RBP state”. The PCA approach is simpler and reproducible, based on a linear transformation of the RBP expression vector into a lower dimension. The qualitative results included in Figure 3a still hold, and we also produced the new results suggested by Reviewer #2 in other GTEx tissues with this PCA-based embedding (below).

      We don’t believe the switch to PCA based embedding should have any bearing on the current manuscript evaluation but wanted to take this opportunity to explain the reasoning behind this additional change.
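
      For readers unfamiliar with this kind of embedding, the following Python sketch shows one generic way to project per-condition expression vectors onto a few principal components. It is an illustration under stated assumptions, not the code used for TrASPr: the function name `pca_embed`, the power-iteration solver, and the toy input are all hypothetical, and the input is assumed to be a small conditions-by-RBPs matrix with at least two rows.

      ```python
      import math
      import random

      def pca_embed(rows, n_components=2, n_iter=500, seed=0):
          """Project rows (conditions x features) onto their top principal components.

          Uses power iteration with deflation on the sample covariance matrix.
          A sketch for small inputs; requires at least two rows.
          """
          rng = random.Random(seed)
          n, d = len(rows), len(rows[0])
          means = [sum(r[j] for r in rows) / n for j in range(d)]
          centered = [[r[j] - means[j] for j in range(d)] for r in rows]
          cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
                 for i in range(d)]
          components = []
          for _ in range(n_components):
              v = [rng.random() + 0.1 for _ in range(d)]  # arbitrary start vector
              for _ in range(n_iter):
                  w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
                  norm = math.sqrt(sum(x * x for x in w))
                  if norm < 1e-12:  # no variance left in the remaining directions
                      v = [0.0] * d
                      break
                  v = [x / norm for x in w]
              eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                           for i in range(d))
              components.append(v)
              # Deflate: remove the component just found from the covariance.
              cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
                     for i in range(d)]
          # Linear transformation of each centered row into the lower dimension.
          return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                  for c in centered]

      # Toy input: three conditions, two features lying on a single line.
      toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
      embedding = pca_embed(toy, n_components=2)
      ```

      In practice a library routine would normally replace the hand-written solver; the sketch only makes the "linear transformation into a lower dimension" step concrete.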

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors propose a transformer-based model for the prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in the design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Another difference compared to Pangolin and SpliceAI which are focused on modeling individual splice junctions is the focus on modeling a complete alternative splicing event.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions.     

      (4) The authors use their model for sequence design to optimize splicing outcomes, which is a novel application.

      We wholeheartedly thank Reviewer #1 for these positive comments regarding the modeling approach we took to this task and the evaluations we performed. We have put a lot of work and thought into this and it is gratifying to see the results of that work acknowledged like this.

      Weaknesses:

      No weaknesses were identified by this reviewer, but I have the following comments:

      (1) I would be curious to see evidence that the model is learning position-specific representations.

      This is an excellent suggestion to further assess what the model is learning. To get a better sense of the position-specific representation we performed the following analyses:

      (1) Switching the transformers’ relative order: All transformers are pretrained on 3’ and 5’ splice site regions before fine-tuning for the PSI and dPSI prediction task. We hypothesized that if relative position is important, switching the order of the transformers would make a large difference in prediction accuracy. Indeed, if we switch the 3’ and 5’ transformers we see, as expected, a severe drop in performance, with Pearson correlation on test data dropping from 0.82 to 0.11. Next, we switched the two 5’ and the two 3’ transformers, observing a drop to 0.65 and 0.78, respectively. When focusing only on changing events, the drop was from 0.66 to 0.54 (for 3’ SS transformers), 0.48 (for 5’ SS transformers), and 0.13 (when the 3’ and 5’ transformers flanking the alternative exon were switched).

      (2) Position-specific effect of RBPs: We wanted to test whether the model is able to learn position-specific effects for RBPs. For this we focused on two RBPs, FOX (a family of three highly related RBPs) and QKI, both of which have a relatively well-defined motif and known condition- and position-specific effects identified via RBP KD experiments combined with CLIP experiments (e.g. PMID: 23525800, PMID: 24637117, PMID: 32728246). For each, we randomly selected 40 highly and 40 lowly included cassette exon sequences. We then ran in-silico mutagenesis experiments where we replaced small windows of sequence with the RBP motifs (80 for RBFOX and 80 for QKI), then compared TrASPr’s predictions to the average predictions for 5 random sequences inserted in the same location. The results of this are now shown in Figure 4 Supp 3, where the y-axis represents the dPSI effect per position (x-axis), and the color represents the percentile of observed effects over inserting motifs in that position across all 80 sequences tested. We see that both RBPs have strong positional preferences for exerting a strong effect on the alternative exon. We also see differences between binding upstream and downstream of the alternative exon. These results, learned by the model from natural tissue-specific variations, recapitulate nicely the results derived from high-throughput experimental assays. However, we also note that effects were highly sequence specific. For example, RBFOX is generally expected to increase inclusion when binding downstream of the alternative exon and decrease inclusion when binding upstream. While we do observe such a trend, we also see cases where the opposite effects are observed. These sequence-specific effects have been reported in the literature but may also represent cases where the model errs in the effect’s direction. We discuss these new results in the revised text.

      (3) Assessing BOS sequence edits to achieve tissue-specific splicing: Here we decided to test whether BOS edits in intronic regions (at least 8b away from the nearest splice site) are important for the tissue-specific effect. The results are now included in Figure 6 Supp 1, clearly demonstrating that most of the neuronal specific changes achieved by BOS were based on changing the introns, with a strong effect observed for both up and downstream intron edits.
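
      The in-silico motif replacement described in analysis (2) above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' code: `toy_predict` is a placeholder standing in for a real splicing predictor, the DNA-alphabet motif `GCATG` is used only as an example of an RBFOX-like element, and all function names are hypothetical.

      ```python
      import random

      def replace_window(seq, pos, insert):
          """Overwrite len(insert) bases of seq starting at pos with insert."""
          return seq[:pos] + insert + seq[pos + len(insert):]

      def motif_effect(seq, pos, motif, predict, n_random=5, rng=None):
          """Predicted PSI with the motif at pos, minus the mean prediction
          over n_random random sequences of the same length at the same pos."""
          rng = rng or random.Random(0)
          with_motif = predict(replace_window(seq, pos, motif))
          background = []
          for _ in range(n_random):
              rand = "".join(rng.choice("ACGT") for _ in motif)
              background.append(predict(replace_window(seq, pos, rand)))
          return with_motif - sum(background) / n_random

      def positional_profile(seq, motif, predict, step=5):
          """Motif effect at every step-th position where the motif fits."""
          last = len(seq) - len(motif)
          return {pos: motif_effect(seq, pos, motif, predict, rng=random.Random(pos))
                  for pos in range(0, last + 1, step)}

      def toy_predict(seq):
          """Placeholder predictor (NOT TrASPr): PSI rises with GCATG count."""
          return min(1.0, 0.2 + 0.3 * seq.count("GCATG"))

      profile = positional_profile("A" * 40, "GCATG", toy_predict)
      ```

      Subtracting the random-sequence average is what separates the effect of the motif itself from the effect of merely disrupting whatever sequence was at that position.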

      (2) The transformer encoders in TrASPr model sequences with a rather limited sequence size of 200 bp; therefore, for long introns, the model will not have good coverage of the intronic sequence. This is not expected to be an issue for exons.

      The reviewer is raising a good question here. On one hand, one may hypothesize that, as the reviewer seems to suggest, TrASPr may not do well on long introns as it lacks the full intronic sequence.

      Conversely, one may also hypothesize that for long introns, where the flanking exons are outside the window of SpliceAI/Pangolin, TrASPr may have an advantage.

      Given this good question and a related one by Reviewer #2, we divided prediction accuracy by intron length and the alternative exon length.

      For short exons (<100bp) we find TrASPr and Pangolin perform similarly, but for longer exons, especially those > 200, TrASPr results are better. When dividing samples by the total length of the upstream and downstream intron, we find TrASPr outperforms all other models for introns of combined length up to 6K, but Pangolin gets better results when the combined intron length is over 10K. This latter result is interesting as it means that, contrary to the second hypothesis laid out above, Pangolin’s performance did not degrade for events where the flanking exons were outside its field of view. We note that all of the above holds whether we assess all events or just cases of tissue-specific changes. It is interesting to think about the mechanistic causes for this. For example, it is possible that cassette exons involving very long introns evoke a different splicing mechanism where the flanking exons are not as critical and/or there is more signal in the introns which is missed by TrASPr. We include these new results now as Figure 2 - Supp 1,2 and discuss these in the main text.
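
      A length-stratified evaluation of this kind can be sketched as below. The Python code is a generic illustration, assuming per-event lengths with observed and predicted PSI values; the bin edges, the toy numbers, and the function names are hypothetical and do not reproduce the analysis in Figure 2 - Supp 1,2.

      ```python
      import math
      from collections import defaultdict

      def pearson(x, y):
          """Pearson correlation, or None when it is undefined."""
          n = len(x)
          if n < 2:
              return None
          mx, my = sum(x) / n, sum(y) / n
          sxx = sum((a - mx) ** 2 for a in x)
          syy = sum((b - my) ** 2 for b in y)
          if sxx == 0 or syy == 0:
              return None
          sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
          return sxy / math.sqrt(sxx * syy)

      def bin_index(length, edges):
          """Index of the first edge exceeding length; len(edges) if none does."""
          for i, edge in enumerate(edges):
              if length < edge:
                  return i
          return len(edges)

      def per_bin_pearson(lengths, observed, predicted, edges):
          """Group events by length bin and compute Pearson correlation per bin."""
          bins = defaultdict(lambda: ([], []))
          for length, obs, pred in zip(lengths, observed, predicted):
              obs_list, pred_list = bins[bin_index(length, edges)]
              obs_list.append(obs)
              pred_list.append(pred)
          return {i: pearson(o, p) for i, (o, p) in sorted(bins.items())}

      # Toy data: exon lengths in bp with hypothetical PSI values.
      lengths = [50, 80, 90, 150, 250, 300, 400]
      observed = [0.1, 0.5, 0.9, 0.4, 0.2, 0.4, 0.6]
      predicted = [0.2, 0.6, 1.0, 0.5, 0.6, 0.4, 0.2]
      by_bin = per_bin_pearson(lengths, observed, predicted, edges=[100, 200])
      ```

      Bins with very few events yield unstable or undefined correlations, which is why the sketch returns `None` rather than a number in that case.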

      (3) In the context of sequence design, creating a desired tissue- or condition-specific effect would likely require disrupting or creating motifs for splicing regulatory proteins. In your experiments for neuronal-specific Daam1 exon 16, have you seen evidence for that? Most of the edits are close to splice junctions, but a few are further away.

      That is another good question. Regarding Daam1 exon 16, in the original paper describing the mutation locations some motif similarities were noted to PTB (CU) and CUG/Mbnl-like elements (Barash et al Nature 2010). In order to explore this question beyond this specific case we assessed the importance of intronic edits by BOS to achieve a tissue specific splicing profile - see above.

      (4) For sequence design of a tissue- or condition-specific effect in neuronal-specific Daam1 exon 16, the upstream exonic splice junction had the most sequence edits. Is that a general observation? How about the relative importance of the four transformer regions in TrASPr prediction performance?

      This is another excellent question. Please see new experiments described above for RBP positional effect and BOS edits in intronic regions which attempt to give at least partial answers to these questions. We believe a much more systematic analysis can be done to explore these questions but such evaluation is beyond the scope of this work.

      (5) The idea of lightweight transformer models is compelling, and is widely applicable. It has been used elsewhere. One paper that came to mind in the protein realm:

      Singh, Rohit, et al. "Learning the language of antibody hypervariability." Proceedings of the National Academy of Sciences 122.1 (2025): e2418918121.

      We definitely do not claim that this approach of using lighter, dedicated models instead of a large ‘foundation’ model has not been taken before. We believe Singh et al., mentioned above, represents a somewhat different approach, where their model (AbMAP) fine-tunes large general protein foundational models (PLM) for antibody-sequence inputs by supervising on antibody structure and binding specificity examples. We added a description of this modeling approach citing the above work and another one which specifically handles RNA splicing (intron retention, PMID: 39792954).

      Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of the evidence. Additionally, state-of-the-art models (SpliceAI and Pangolin) are reported to perform extremely poorly in some tasks, which is surprising in light of previous reports of their overall good prediction accuracy; the reasoning for this lack of performance compared to TrASPr is not explored.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational autoencoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks.

      We thank Reviewer #2 for this detailed summary and positive view of our work. It seems the main issue raised in this summary regards the evaluations: The reviewer finds details of the evaluations missing and the fact that SpliceAI and Pangolin perform poorly on some of the tasks to be surprising. We made a concerted effort to include the required details, including code and data tables. In short, some of the concerns were addressed by adding additional evaluations, some by clarifying missing details, and some by better explaining where Pangolin and SpliceAI may excel vs. settings where these may not do as well. More details are given below.

      Strengths:

      (1) A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      (2) Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      (1) Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      We made an effort to make the tasks specific and detailed, including making their code and data available. We believe this helped improve clarity in the revised version.

      (2) As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models. For instance, Pangolin was apparently trained on a different dataset (Cardoso-Moreira et al. 2019), and using a different processing pipeline (based on SpliSER) than the ones used in this submission. As a result, the inferior performance of Pangolin reported here could potentially be due to subtle distribution shifts. The authors should add a discussion of the differences in the training set, and whether they affect your comparisons (e.g., in Figure 2). They should also consider adding a table summarizing the various datasets used in their previous work for training and testing. Publishing their training and testing datasets in an easy-to-use format would be a fantastic contribution to the community, establishing a common benchmark to be used by others.

      There are several good points to unpack here. Starting from the last one, we very much agree that a standard benchmark will be useful to include. For tissue-specific splicing quantification we used the GTEx dataset, from which we selected six representative human tissues (heart, cerebellum, lung, liver, spleen, and EBV-transformed lymphocytes). In total, we collected 38,394 cassette exon events quantified across 15 samples (here a ‘sample’ is a cassette exon quantified in two tissues) from the GTEx dataset with high-confidence quantification for their PSIs based on MAJIQ. A detailed description of how this data was derived is now included in the Methods section, and the data itself is made available via the bitbucket repository with the code.

      Next, regarding the usage of different data and distribution shifts for Pangolin: The reviewer is right to note there are many differences between how Pangolin and TrASPr were trained. This makes it hard to determine whether the improvements we saw are not just a result of different training data/labels. To address this issue, we first tried to fine-tune the pre-trained Pangolin with MAJIQ’s PSI dataset: we used the subset of the GTEx dataset described above, focusing on the three tissues analyzed in Pangolin’s paper (heart, cerebellum, and liver) for a fair comparison. In total, we obtained 17,218 events, and we followed the same training and test split as reported in the Pangolin paper. We obtained a Pearson correlation of 0.78 and a Spearman correlation of 0.68, values similar to what we got without this extra fine-tuning. Next, we retrained Pangolin from scratch, with the full tissues and training set used for TrASPr, which was derived from MAJIQ’s quantifications. Since our model was trained only on human data with 6 tissues at the same time, we modified Pangolin from its original 4 splice site usage outputs to 6 PSI outputs. We tried taking the sequence centered on either the first or the second splice site of the middle exon. This test resulted in low performance (Pearson correlation of 0.21 for the 3’ SS and 0.26 for the 5’ SS).

      The above tests are obviously not exhaustive but their results suggest that the differences we observe are unlikely to be driven by distribution shifts. Notably, the original Pangolin was trained on much more data (four species, four tissues each, and sliding windows across the entire genome). This training seems to be important for performance while the fact we switched from Pangolin’s splice site usage to MAJIQ’s PSI was not a major contributor. Other potential reasons for the improvements we observed include the architecture, target function, and side information (see below) but a complete delineation of those is beyond the scope of this work. 

      (3) Related to the previous point, as discussed in the manuscript, SpliceAI and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also, consider fine-tuning Pangolin on cassette exons only (as you do for your model).

      Please see the above response. We did not investigate more sophisticated models that adjust Pangolin’s architecture further as such modifications constitute new models which are beyond the scope of this work.

      (4) L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases - thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      This was a good suggestion, related to another comment made by Reviewer #1. Please see above our response to them with a breakdown by exon/intron length.

      (5) L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.

      Previous models were not trained exclusively on constitutive exons and Pangolin specifically was trained with their version of junction usage across tissues. That said, the reviewer’s point is valid (and similar to ones made above) about a need to have a matched training/testing and potential distribution shifts. Please see response and evaluations described above. 

      (6) L214, ablations of individual features are missing.

      These were now added to the table which we moved to the main text (see table also below).

      (7) L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.

      Good question. The task here was to assess predictions in unseen conditions, hence we opted to test on completely different data of human cell lines rather than additional tissue samples. Following the reviewer’s suggestion we also evaluated predictions on two additional GTEx tissues, Cortex and Adrenal Gland. These new results, as well as the previous ones for ENCODE, were updated to use the PCA-based embedding of the “RBP-State” as described above. We also compared the predictions using the PCA-based embedding of the “RBP-State” to training directly on data (not the test data, of course) from these tissues. See updated Figure 3a,b and Figure 3 Supp 1,2.

      (8) L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. The same applies to the complete failure of SpliceAI and Pangolin shown in Figure 4d.

      Line 239 refers to predicting relative inclusion levels between competing 3’ and 5’ splice sites. We admit we too expected this to be better for SpliceAI and Pangolin but we were not able to find bugs in our analysis (which is all made available for readers and reviewers alike). Regarding this expectation to perform better, first we note that we are not aware of a similar assessment being done for either of those algorithms (i.e. relative inclusion for 3’ and 5’ alternative splice site events). Instead, our initial expectation, and likely the reviewer’s as well, was based on their detection of splice site strengthening/weakening due to mutations, including cryptic splice site activation. More generally though, it is worth noting in this context that given how SpliceAI, Pangolin and other algorithms have been presented in papers/media/scientific discussions, we believe there is a potential misperception regarding tasks that SpliceAI and Pangolin excel at vs other tasks where they should not necessarily be expected to excel. Both algorithms focus on cryptic splice site creation/disruption. This has been the focus of those papers and subsequent applications.  While Pangolin added tissue specificity to SpliceAI training, the authors themselves admit “...predicting differential splicing across tissues from sequence alone is possible but remains a considerable challenge and requires further investigation”. The actual performance on this task is not included in Pangolin’s main text, but we refer Reviewer #2 to supplementary figure S4 in the Pangolin manuscript to get a sense of Pangolin’s reported performance on this task. Similar to that, Figure 4d in our manuscript is for predicting ‘tissue specific’ regulators. We do not think it is surprising that SpliceAI (tissue agnostic) and Pangolin (slight improvement compared to SpliceAI in tissue specific predictions) do not perform well on this task. Similarly, we do not find the results in Figure 4C surprising either. 
These are for mutations that slightly alter the inclusion level of an exon, not something SpliceAI was trained on - SpliceAI was trained on genomic splice sites with yes/no labels across the genome. As noted elsewhere in our response, re-training Pangolin on this mutagenesis dataset results in performance much closer to that of TrASPr. That is to be expected as well - Pangolin is constructed to capture changes in PSI (or splice site usage as defined by the authors); those changes are not even tissue specific for the CD19 data, and the model has no problem/lack of capacity to generalize from the training set, just as TrASPr does. In fact, if you only use combinations of known mutations seen during training, a simple regression model gives a correlation of ~92-95% (Cortés-López et al 2022). In summary, we believe that a better understanding of what one can realistically expect from models such as SpliceAI, Pangolin, and TrASPr will go a long way toward having them used effectively. We have tried to make this clearer in the revision.

      (9) BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      We thank the reviewer for the suggestion. We agree those are two distinct contributions/algorithms and we indeed considered having them as two separate papers. However, there is strong coupling between the design algorithm (BOS) and the predictor that enables it (TrASPr). This coupling is both conceptual (TrASPr as a “teacher”) and practical in terms of evaluations. While we use experimental data (experiments done involving Daam1 exon 16, CD19 exon 2) we still rely heavily on evaluations by TrASPr itself. A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper. For those reasons we eventually decided to make it into what we hope is a more compelling combined story about generative models for prediction and design of RNA splicing.

      (10) The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.

      We can definitely see the logic behind trying BOS with different predictors. That said, as we note above, most BOS evaluations are based on the “teacher”. As such, it is unclear what value replacing the teacher would bring. We also note that, given this limitation, we focus mostly on evaluations in comparison to existing approaches (a genetic algorithm or random mutations as a strawman).

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      Additional comments:

      (1) Is your model picking up transcription factor binding sites in addition to RBPs? TFs have been recently shown to have a role in splicing regulation:

      Daoud, Ahmed, and Asa Ben-Hur. "The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models." PLOS Computational Biology 21.1 (2025): e1012755.

      We agree this is an interesting point to explore, especially given the series of works from the Ben-Hur group. We note, though, that these works focus on intron retention (IR), which we have not focused on here, and we only cover short intronic regions flanking the exons. We leave this as a future direction, as we believe the scope of this paper is already quite extensive.

      (2) SpliceNouveau is a recently published algorithm for the splicing design problem:

      Wilkins, Oscar G., et al. "Creation of de novo cryptic splicing for ALS and FTD precision medicine." Science 386.6717 (2024): 61-69.

      Thank you for pointing out the recent publication by Wilkins et al.; we now refer to it as well.

      (3) Please discuss the relationship between your model and this deep learning model. You will also need to change the following sentence: "Since the splicing sequence design task is novel, there are no prior implementations to reference."

      We revised this statement and now refer to several recent publications that propose similar design tasks.  

      (4) I would suggest adding a histogram of PSI values - they appear to be mostly close to 1 or 0.

      PSI values are indeed typically close to either 0 or 1. This is a known phenomenon illustrated in previous studies of splicing (e.g. Shen et al NAR 2012 ). We are not sure what is meant by the comment to add a histogram but we made sure to point this out in the main text: 

      “...Still, those statistics are dominated by extreme values, such that 33.2% are smaller than 0.15 and 56.0% are higher than 0.85. Furthermore, most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e., a cassette exon measured across two tissues, exhibit |ΔΨ| ≥ 0.15).”
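
      As a rough illustration of the kind of summary quoted above, the following Python sketch computes the fraction of PSI values below 0.15 and above 0.85, plus coarse histogram counts. The `psi_values` list is made-up toy data and the function names are hypothetical; none of this is taken from the TrASPr code base.

```python
from collections import Counter

def extreme_fractions(psi_values, low=0.15, high=0.85):
    """Return (fraction of values below `low`, fraction above `high`)."""
    n = len(psi_values)
    if n == 0:
        raise ValueError("psi_values must not be empty")
    below = sum(1 for p in psi_values if p < low) / n
    above = sum(1 for p in psi_values if p > high) / n
    return below, above

def psi_histogram(psi_values, n_bins=10):
    """Count PSI values in `n_bins` equal-width bins over [0, 1]."""
    counts = Counter()
    for p in psi_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"PSI out of range: {p}")
        # A PSI of exactly 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [counts[b] for b in range(n_bins)]

# Toy data only; real PSI values would come from a quantification tool.
psi_values = [0.02, 0.05, 0.12, 0.50, 0.91, 0.95, 0.97, 1.00]
below, above = extreme_fractions(psi_values)
hist = psi_histogram(psi_values)
```

      On real data, a bimodal result (most of the mass in the first and last bins) would match the behaviour described in the response.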

      (5) Part of the improvement of TrASPr over Pangolin could be the result of a more extensive dataset.

      Please see above responses and new analysis.

      (6) In the discussion of the roles of alternative splicing, protein diversity is mentioned, but I suggest you also mention the importance of alternative splicing as a regulatory mechanism:

      Lewis, Benjamin P., Richard E. Green, and Steven E. Brenner. "Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans." Proceedings of the National Academy of Sciences 100.1 (2003): 189-192.

      Thank you for the suggestion. We added that point and citation. 

      (7) Line 96: You use dPSI without defining it (although quite clear that it should be Delta PSI).

      Fixed.

      (8) Pretrained transformers: Have you trained separate transformers on acceptor and donor sites, or a single splice junction transformer?

      Single splice junction pre-training.

      (9) "TrASPr measures the probability that the splice site in the center of Se is included in some tissue" - that's not my understanding of what TrASPr is designed to do.

      We revised the above sentence to make it more precise: “Given a genomic sequence context S<sub>e</sub> = (s<sub>e</sub>,...,s<sub>e</sub>), made of  a cassette exon e and flanking intronic/exonic regions, TrASPr predicts for tissue c the fraction of transcripts where exon e is included or skipped over, ΔΨ-<sub>e,c,c’</sub>.”

      (10) Please include the version of the human genome annotations that you used. 

      We used GENCODE v40 annotations for the human genome (hg38); this is now included in the Data section.

      (11) I did not see a description of the RBP-AE component in the methods section. A bit more detail on the model would be useful as well.

      Please see above details about replacing RBP-AE with a simpler linear PCA “RBP-State” encoding. We added details about how the PCA was performed to the Methods section.

      (12) Typos, grammar:

      -   Fix the following sentence: ATP13A2, a lysosomal transmembrane cation transporter, linked to an early-onset form of Parkinson's Disease (PD) when 306 loss-of-function mutations disrupt its function.

      Sentence was fixed to now read: “The first example is of a brain cerebellum-specific cassette exon skipping event predicted by TrASPr in the ATP13A2 gene (aka PARK9). ATP13A2 is a lysosomal transmembrane cation transporter, for which loss of function mutation has been linked to early-onset of Parkinson’s Disease (PD)”.

      -   Line 501: "was set to 4e−4"(the - is a superscript). 

      Fixed.

      -   A couple of citations are missing in lines 580 and 581.

      Thank you for catching this error. The citations in lines 580 and 581 were fixed.

      (13) Paper title: Generative modeling for RNA splicing predictions and design - it would read better as "Generative modeling for RNA splicing prediction and design", as you are solving the problems of splicing prediction and splicing design.  

      Thank you for the suggestion. We updated the title and removed the plural form.

      Reviewer #2 (Recommendations for the authors):

      (1) Appendices are not very common in biology journals. It is also not clear what purpose the appendix serves exactly - it seems to repeat some of the things said earlier. Consider merging it into the methods or the main text. 

      We merged the appendices into the Methods section and removed redundancy.

      (2) L112, "For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than N edit locations and M total base changes." How are N and M different? Is there a difference between an edit location and a base change? 

      Yes. N is the number of edit locations, where each location can be thought of as the start position of an edit of some length (e.g. a SNP has length 1), and M is the total number of positions edited. The text now reads “For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than $N$ edit locations (i.e., start position of one or more consecutive bases) and $M$ total base changes.”
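
      The distinction between edit locations and total base changes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: edits are substitutions (so both sequences have equal length), an edit location is taken to be a maximal run of consecutive changed positions, and the function names are hypothetical rather than part of BOS.

```python
def edit_summary(original, designed):
    """Return (number of edit locations, total base changes).

    Assumption: an edit location is a maximal run of consecutive
    positions at which the two sequences differ.
    """
    if len(original) != len(designed):
        raise ValueError("sequences must have equal length")
    locations = 0
    changes = 0
    in_run = False
    for a, b in zip(original, designed):
        if a != b:
            changes += 1
            if not in_run:
                locations += 1  # a new run of changed bases starts here
            in_run = True
        else:
            in_run = False
    return locations, changes

def within_budget(original, designed, max_locations, max_changes):
    """Check a design against the N (locations) and M (changes) limits."""
    n, m = edit_summary(original, designed)
    return n <= max_locations and m <= max_changes

original = "ACGTACGTAC"
designed = "ACCAACGTTC"  # positions 2-3 changed together, position 8 alone
n_locations, m_changes = edit_summary(original, designed)
```

      Here the design uses two edit locations and three base changes, so it would satisfy a budget of N=2, M=3 but not N=1, M=3.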

      (3) L122: "DEN was developed for a distinct problem". What prevents one from adapting DEN to your sequence design task? The method should be generic. I do not see what "differs substantially" means here. (Finally, wasn't DEN developed for the task you later refer to as "alternative splice site" (as opposed to "splice site selection")? Use consistent terminology. And in L236 you use "splice site variation" - is that also the same?).

      Indeed, our original description was not clear/precise enough. DEN was designed and trained for two tasks: APA, and 5’ alternative splice site usage. The terms “selection”, “usage”, and “variation” were indeed used interchangeably in different locations and the reviewer was right, noting the lack of precision. We have now revised the text to make sure the term “relative usage” is used. 

      Nonetheless, we hold that DEN was indeed defined for different tasks. See Figures 2A and 6A of Linder et al 2020 (the reference was also incorrect, as we cited the preprint and not the final paper):

      In both cases DEN is trying to optimize a short region for selecting an alternative PA site (left) or a 5’ splice site (right). This work focused on an MPRA dataset of short synthetic sequences inserted in the designated region for train/test. We hold this is indeed a different type of data and task than the one we focus on here. Yes, one can potentially adapt DEN for our task, but this is beyond the scope of this paper. Finally, we note that a more closely related algorithm recently proposed is Ledidi (Schreiber et al 2025), which was posted as a pre-print. Similar to BOS, Ledidi tries to optimize a given sequence and adapt it with a few edits for a given task. Regardless, we updated the main text to make the differences between DEN and the task we defined here for BOS clearer, and we also added a reference to Ledidi and other recent works in the discussion section.

      (4) L203, exons with DeltaPSI very close to 0.15 are going to be nearly impossible to classify (or even impossible, considering that the DeltaPSI measurements are not perfect). Consider removing such exons to make the task more feasible.

      Yes, this is how it was done. As described in more detail below, we defined changing samples as ones where the change was >= 0.15 and non-changing as ones where the change in PSI was < 0.05 to avoid ambiguous cases affecting the classification task.

      (5) L230, RBP-AE is not explained in sufficient detail (and does not appear in the methods, apparently). It is not clear how exactly it is trained on each new cellular condition.

      Please see the response at the opening of this document and Q11 from Reviewer 1.

      (6) L230, "significantly improving": the r value actually got worse; it is therefore not clear you can claim any significant improvement. Please mention that fact in the text.

      This is a fair point. We note that we view the “a” statistic as potentially more interesting/relevant here, as the Pearson “r” is dominated by points being generally close to 0/1. Regardless, revisiting this we realized one can also make the point that the term “significant” is imprecise/misplaced since no statistical test is done here (side note: given the number of points, a simple null of same distribution yes/no would pass significance, but we don’t think this is an interesting/relevant test here). Also, we note that with the transition to PCA instead of RBP-AE we actually get improvements in both a and r values, both for the ENCODE samples shown in Figure 3a and the two new GTEX tissues we tested (see above). We now changed the text to simply state:

      “...As shown in Figure 3a, this latent space representation allows TrASPr to generalize from the six GTEX tissues to unseen conditions, including unseen GTEX tissues (top row), and ENCODE cell lines (bottom row). It improves prediction accuracy compared to TrASPr lacking PCA (e.g., a=88.5% vs a=82.3% for ENCODE cell lines), though naturally training on the additional GTEX and ENCODE conditions can lead to better performance (e.g., a=91.7% for ENCODE, Figure 3a left column).”

      (7) L233, "Notably, previous splicing codes focused solely on cassette exons", Rosenberg et al. focused solely on alternative splice site choice.

      Right - we removed that sentence.

      (8) L236, "trained TrASPr on datasets for 3' and 5' splice site variations". Please provide more details on this task. What is the input to TrASPr and what is the prediction target (splice site usage, PSI of alternative isoforms)? What datasets are used for this task?

      The data for this task was the same processed GTEx tissue data, restricted to alternative 3’ and 5’ splice site events. We revised the description of this task in the main text and added information in the Methods section. The data is also included in the repo.

      (9) L243, "directly from genomic sequences", and conservation?

      Yes, we changed the sentence to read “...directly from genomic sequences combined with related features” 

      (10) L262, what is the threshold for significant splicing changes?

      The threshold is 0.15. We updated the main text to read the following:

      The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left), while the distribution of effects ($|\Delta \Psi|$) observed across those 6106 samples is shown in Figure 4b (right). To this data we applied three testing schemes. The first is a standard 5-fold CV where 20% of combinations of point mutations were hidden in every fold, while the second test involved 'unseen mutation' (UM) where we hide any sample that includes mutations in specific positions for a total of 1480 test samples. As illustrated by the CDF in Figure 4b, most samples (each sample may involve multiple positions mutated) do not involve significant splicing changes. Thus, we also performed a third test using only the 883 samples where mutations cause significant changes ($|\Delta \Psi| \geq 0.15$).
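
      The 'unseen mutation' and 'significant changes' schemes described above can be sketched as follows. This is a minimal illustration under assumptions: each sample is represented by the set of positions it mutates, the sample ids, positions, and ΔΨ values are invented toy data, and the function names are hypothetical.

```python
def unseen_mutation_split(samples, held_out_positions):
    """Send every sample that touches a held-out position to the test set.

    `samples` maps a sample id to the set of mutated positions it carries.
    """
    held_out = set(held_out_positions)
    train, test = [], []
    for sample_id, positions in samples.items():
        if held_out & set(positions):
            test.append(sample_id)
        else:
            train.append(sample_id)
    return train, test

def significant_subset(delta_psi, threshold=0.15):
    """Keep sample ids whose |delta PSI| meets the threshold."""
    return [s for s, d in delta_psi.items() if abs(d) >= threshold]

# Toy data only: four samples, some carrying combinations of mutations.
samples = {"s1": {10, 42}, "s2": {42}, "s3": {7}, "s4": {7, 99}}
delta_psi = {"s1": 0.30, "s2": -0.02, "s3": -0.20, "s4": 0.15}

train_ids, test_ids = unseen_mutation_split(samples, held_out_positions={7})
changing_ids = significant_subset(delta_psi)
```

      Holding out by position rather than by sample means no mutation at a held-out position is seen during training, which is what separates this scheme from the standard 5-fold CV over mutation combinations.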

      (11) L266, Pangolin performance is only provided for one of the settings (and it is not clear which). Please provide details of its performance in all settings.

      The description was indeed not clear. Pangolin’s performance was similar to SpliceAI as mentioned above but retraining it on the CD19 data yielded much closer performance to TrASPr. We include all the matching tests for Pangolin after retraining in Figure 4 Supp Figure 1. 

      (12) Please specify "n=" in all relevant plots. 

      Fixed.

      (13) Figure 3a, "The tissues were first represented as tokens, and new cell line results were predicted based on the average over conditions during training." Please explain this procedure in more detail. What are these tokens and how are they provided to the model? Are the cell line predictions the average of the predictions for the training tissues?

      Yes, as a baseline to assess improvements we compared against simply the average over the predictions for the training tissues for that specific event (see related work pointing to the need for similar baselines in DL for genomics in https://pubmed.ncbi.nlm.nih.gov/33213499/). Regarding the tokens - we encode each tissue type as a possible value and feed the two tissues as two tokens to the transformer.
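
      A minimal sketch of such an averaging baseline is shown below. The data layout (event id mapped to per-tissue predictions), the event and tissue names, and the function name are assumptions made for illustration only.

```python
from statistics import mean

def average_baseline(train_predictions):
    """Map each event id to the mean predicted PSI over training tissues.

    `train_predictions` maps event id -> {tissue name: predicted PSI}.
    """
    return {
        event: mean(by_tissue.values())
        for event, by_tissue in train_predictions.items()
    }

# Toy predictions for two hypothetical events in two training tissues.
train_predictions = {
    "event_a": {"tissue_1": 0.25, "tissue_2": 0.75},
    "event_b": {"tissue_1": 1.0, "tissue_2": 1.0},
}
baseline = average_baseline(train_predictions)
```

      The baseline value for an event is then used as the prediction for any unseen condition, so a model only beats it by capturing condition-specific variation.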

      (14) Figure 4b, the total count in the histogram is much greater than 6106. Please explain the dataset you're using in more detail, and what exactly is shown here.

      We updated the text to read: 

      “...we used 6106 sequence samples where each sample may have multiple positions mutated (i.e., mutation combinations) in exon 2 of CD19 and its flanking introns and exons (Cortés-López et al 2022). The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left).”

      (15) Figure 5a, how are the prediction thresholds (TrASPr passed, TrASPr stringent, and TrASPr very stringent) defined?

      Passed: dpsi>0.1, Stringent: dpsi>0.15, Very stringent: dpsi>0.2. This is now included in the main text.
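
      For concreteness, the three thresholds can be expressed as a small lookup. This sketch takes the cutoffs exactly as stated (strictly greater than, applied to the predicted dPSI as given); whether an absolute value is intended is not stated, so none is applied here, and the function name is hypothetical.

```python
# Cutoffs as stated in the response, ordered from most to least strict.
THRESHOLDS = [("very stringent", 0.2), ("stringent", 0.15), ("passed", 0.1)]

def strictest_level(predicted_dpsi):
    """Return the strictest level a prediction satisfies, or None."""
    for label, cutoff in THRESHOLDS:
        if predicted_dpsi > cutoff:
            return label
    return None

levels = [strictest_level(d) for d in (0.05, 0.12, 0.18, 0.25)]
```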

      (16) L417, please include more detail on the relative size of TrASPr compared to other models (e.g. number of parameters, required compute, etc.).

      SpliceAI is a general-purpose splicing predictor with a 32-layer deep residual neural network to capture long-range dependencies in genomic sequences. Pangolin is a deep learning model specifically designed for predicting tissue-specific splicing, with an architecture similar to that of SpliceAI. The implementation of SpliceAI that can be found at https://huggingface.co/multimolecule/spliceai involves an ensemble of 5 such models for a total of ~3.5M parameters. TrASPr has 4 BERT transformers (each with 6 layers and 12 heads) and an MLP on top of those, for a total of ~189M parameters. Evo 2, a genomic ‘foundation’ model, has 40B parameters, DNABERT has ~86M (a single BERT with 12 layers and 12 heads), and Borzoi has 186M parameters (as stated in https://www.biorxiv.org/content/10.1101/2025.05.26.656171v2). We note that the difference here is not just in model size but also in the amount of data used to train the model. We edited the original L417 to reflect that.

      (17) L546, please provide more detail on the VAE. What is the dimension of the latent representation?

      We added more details in the Methods section like the missing dimension (256) and definitions for P(Z) and P(S). 

      (18) Consider citing (and possibly comparing BOS to) Ghari et al., NeurIPS 2024 ("GFlowNet Assisted Biological Sequence Editing").

      Added.

      (19) Appendix Figure 2, and corresponding main text: it is not clear what is shown here. What is dPSI+ and dPSI-? What pairs of tissues are you comparing? Spearman correlation is reported instead of Pearson, which is the primary metric used throughout the text.

      The dPSI+ and dPSI- sets were indeed not well defined in the original submission. Moreover, we found our own code lacked consistency due to different tests executed at different times/by different people. We apologize for this lack of consistency and clarity, which we worked to remedy in the revised version. To answer the reviewer’s question, given two tissues ($c_1, c_2$), dPSI+ and dPSI- are for correctly classifying the exons that are significantly differentially included or excluded. Specifically, differentially included exons are those for which $\Delta \Psi_{e,c_1,c_2} = \Psi_{e,c_1} - \Psi_{e,c_2} \geq 0.15$, compared to those that are not ($\Delta \Psi_{e,c_1,c_2} < 0.05$). Similarly, dPSI- is for correctly classifying the exons that are significantly differentially excluded in the first tissue or included in the second tissue ($\Delta \Psi_{e,c_1,c_2} = \Psi_{e,c_1} - \Psi_{e,c_2} \leq -0.15$) compared to those that are not ($\Delta \Psi_{e,c_1,c_2} > -0.05$). This means dPSI+ and dPSI- depend on the order of $c_1$, $c_2$. In addition, we also define a direction/order agnostic test for changing vs non-changing events, i.e. $|\Delta \Psi_{e,c_1,c_2}| \geq 0.15$ vs $|\Delta \Psi_{e,c_1,c_2}| < 0.05$. These test definitions are consistent with previous publications (e.g. Barash et al Nature 2010, Jha et al 2017) and also answer different biological questions: for example, “Exons that go up in brain” and “Exons that go up in Liver” can reflect distinct mechanisms, while changing exons capture a model’s ability to identify regulated exons even if the direction of prediction may be wrong. The updated Appendix Figure 2 is now in the main text as Figure 2d and uses Pearson, while AUPRC and AUROC refer to the changing vs non-changing classification task described above, such that we avoid dPSI+ and dPSI- when summarizing in this table over 3 pairs of tissues.
Finally, we note that making sure all tests comply with the above definition also resulted in an update to Figure 2b/c labels and values, where TrASPr’s improvement over Pangolin reaches up to 1.8-fold in AUPRC compared to 2.4-fold in the earlier version. We again apologize for the lack of clarity and consistent evaluations in the original submission.
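
      The three test definitions above can be written down compactly. The following Python sketch labels a single (exon, tissue pair) sample for the dPSI+, dPSI-, and changing tasks using the 0.15 and 0.05 cutoffs from the response; samples falling between the cutoffs are treated as ambiguous and excluded from that task. The function name and return layout are hypothetical.

```python
def dpsi_labels(psi_c1, psi_c2, change=0.15, no_change=0.05):
    """Label one sample for the dPSI+, dPSI- and changing tasks.

    Each label is True (positive class), False (negative class),
    or None (ambiguous; the sample is excluded from that task).
    """
    d = psi_c1 - psi_c2  # delta PSI depends on the order of the tissues

    def label(is_positive, is_negative):
        if is_positive:
            return True
        if is_negative:
            return False
        return None

    return {
        "dPSI+": label(d >= change, d < no_change),
        "dPSI-": label(d <= -change, d > -no_change),
        "changing": label(abs(d) >= change, abs(d) < no_change),
    }

up = dpsi_labels(0.9, 0.5)         # higher inclusion in the first tissue
down = dpsi_labels(0.5, 0.9)       # same pair with the tissue order swapped
ambiguous = dpsi_labels(0.5, 0.4)  # change of about 0.1, between cutoffs
```

      Swapping the tissue order flips the dPSI+ and dPSI- labels but leaves the changing label unchanged, which is why the order-agnostic task is the one summarized across tissue pairs.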

      (20) Minor typographical comments:

      -   Some plots could use more polishing (e.g., thicker stroke, bigger font size, consistent style (compare 4a to the other plots)...).

      Agreed. While not critical for the science itself we worked to improve figure polishing in the revision to make those more readable and pleasant. 

      -   Consider using 2-dimensional histograms instead of the current kernel density plots, which tend to over-smooth the data and hide potentially important details. 

      We were not sure what the exact suggestion is here and opted to leave the plots as is.

      -   L53: dPSI_{e, c, c'} is never formally defined. Is it PSI_{e, c} - PSI_{e, c'} or vice versa?  

      Definition now included (see above).

      -   L91: Define/explain "transformer" and provide reference. 

      We added the explanation and related reference of the transformer in the introduction section and BERT in the method section.  

      -   L94: exons are short. Are you referring here to the flanking introns? Please explain. 

      We apologize for the lack of clarity. We are referring to a cassette exon alternative splicing event as commonly defined by the splice junctions involved, that is, from the 5’ SS of the upstream exon to the 3’ SS of the downstream exon. The text now reads:

      “...In contrast, 24% of the cassette exons analyzed in this study span a region between the flanking exons' upstream 3' and downstream 5' splice sites that are larger than 10 kb.”

      -   L132: It's unclear whether a single, shared transformer or four different transformers (one for each splice site) are being pre-trained. One would at least expect 5' and 3' splice sites to have a different transformer. In Methods, L506, it seems that each transformer is pre-trained separately. 

      We updated the text to read:

      “We then center a dedicated transformer around each of the splice sites of the cassette exon and its upstream and downstream (competing) exons (four separate transformers for four splice sites in total).”

      -   L471: You explain here that it is unclear what tasks 'foundation' models are good for. Also in L128, you explain that you are not using a 'foundation' model. But then in L492, you describe the BERT model you're using as a foundation model! 

      Line 492 was simply a poor choice of wording as “foundation” is meant here simply as the “base component”. We changed it accordingly.

      -   L169, "pre-training ... BERT", explain what exactly this means. Is it using masking? Is it self-supervised learning? How many splice sites do you provide? Also explain more about the BERT architecture and provide references. 

      We added more details about the BERT architecture and training in the Methods section.

      -   L186 and later, the values for a and r provided here and in the below do not correspond to what is shown in Figure 2. 

      Fixed, thank you for noticing this.

      -   L187,188: What exactly do you mean by "events" and "samples"? Are they the same thing? If so, are they (exon, tissue) pairs? Please use consistent terminology. Moreover, when you say "changing between two conditions": do you take all six tissues whenever there is a 0.15 spread in PSI among them? Or do you take just the smallest PSI tissue and the largest PSI tissue when there is a 0.15 spread between them? Or something else altogether?

      Reviewer #2 is yet again correct that the definitions were not precise. A “sample” involves a specific exon skipping “event” measured in two tissues.  The text now reads: 

      “...most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e., a cassette exon measured across two tissues, exhibit |∆Ψ| ≥ 0.15). Thus, when we repeat this analysis only for samples involving exons that exhibited a change in inclusion (|∆Ψ| ≥ 0.15) between at least two tissues, performance degrades for all three models, but the differences between them become more striking (Figure 2a, right column).”

      -   Figure 1a, explain the colors in the figure legend. The 3D effect is not needed and is confusing (ditto in panel C).

      Color explanation is now added: “exons and introns are shown as blue rectangles and black lines. The blue dashed line indicates the inclusive pattern and the red junction indicates an alternative splicing pattern.” 

      These are not 3D effects but stacks to indicate multiple events/cases. We agree these are not needed in Fig1a to illustrate types of AS and removed those. However, in Fig1c and matching caption we use the stacks to  indicate HT data captures many such LSVs over which ML algorithms can be trained. 

      -   Figure 1b, this cartoon seems unnecessary and gives the wrong impression that this paper explores mechanistic aspects of splicing. The only relevant fact (RBPs serving as splicing factors) can be explained in the text (and is anyway not really shown in this figure).

      We removed Figure 1b cartoon.

      -   Figure 1c, what is being shown by the exon label "8"? 

      This was meant to convey exon ID, now removed to simplify the figure. 

      -   Figure 1e, left, write "Intron Len" in one line. What features are included under "..."? Based on the text, I did not expect more features.

      Also, the arrows emanating from the features do not make sense. Is "Embedding" a layer? I don't think so. Do not show it as a thin stripe. Finally, what are dPSI'+ and dPSI'-? are those separate outputs? are those logits of a classification task?

      We agree this description was not good and have updated it in the revised version. 

      -   Figure 1e, the right-hand side should go to a separate figure much later, when you introduce BOS.

      We appreciate the suggestion. However, we feel that Figure 1e serves as a visual representation of the entire framework. Just like we opted to not turn this work into two separate papers (though we fully agree it is a valid option that would also increase our publication count), we also prefer to leave this unified visual representation as is.

      -   Figure 2, does the n=2456 refer to the number of (exons, tissues) pairs? So each exon contributes potentially six times to this plot? Typo "approximately". 

      The “n” refers to the number of samples which is a cassette event measured in two tissues. The same cassette event may appear in multiple samples if it was confidently quantified in more than two tissues. We updated the caption to reflect this and corrected the typo.

      -   Figure 2b, typo "differentially included (dPSI+) or excluded" .

      Fixed.

      -   L221, "the DNABERT" => "DNABERT".

      Fixed.

      -   L232, missing percent sign.

      Fixed.

      -   L246, "see Appendix Section 2 for details" seems to instead refer to the third section of the appendix.

      We no longer have this as an Appendix; the reference has been updated.

      -   Figure 3, bottom panels, PSI should be "splice site usage"? 

      PSI is correct here - we hope the revised text/definitions make it more clear now.

      -   Figure 3b: typo: "when applied to alternative alternative 3'".

      Fixed.

      -   p252, "polypyrimidine" (no capitalization).

      Fixed.

      -   Strange capitalization of tissue names (e.g., "Brain-Cerebellum"). The tissue is called "cerebellum" without capitalization.

      We used EBV (capital) for the abbreviation and lower case for the rest.

      -   Figure 4c: "predicted usage" on the left but "predicted PSI" on the right. 

      Right. We opted to leave it as is since Pangolin and SpliceAI do predict their definition of “usage” and not directly PSI, we just measure correlations to observed PSI as many works have done in the past. 

      -   Figure 4 legend typo: "two three".

      Fixed.

      -   L351, typo: "an (unsupervised)" (and no need to capitalize Transformer).

      Fixed.

      -   L384, "compared to other tissues at least" => "compared to other tissues of at least".

      Fixed.

      -   L549, P(Z) and P(S) are not defined in the text.

      Fixed.

      -   L572, remove "Subsequently". Add missing citations at the end of the paragraph.

      Fixed.

      -   L580-581, citations missing.

      Fixed.

      -   L584-585, typo: "high confidince predictions"

      Fixed.

      -   L659-660, BW-M and B-WM are both used. Typo?

      Fixed.

      -   L895, "calculating the average of these two", not clear; please rewrite.

      Fixed.

      -   L897, "Transformer" and "BERT", do these refer to the same thing? Be consistent.  

      BOS is a transformer and not a BERT but TrASPr uses the BERT architecture. BERT is a type of transformer as the reviewer is surely well aware so the sentence is correct. Still, to follow the reviewer’s recommendation for consistency/clarity we changed it here to state BERT.

      -   Appendix Figure 5: The term dPSI appears to be overloaded to also represent the difference between predicted PSI and measured PSI, which is inconsistent with previous definitions. 

      Indeed! We thank the reviewer again for their sharp eye and attention to details that we missed. We changed Supp Figure 5, now Figure 4 Supplementary Figure 2, to |PSI’-PSI| and defined those as the difference between TrASPr’s predictions (PSI’) and MAJIQ based PSI quantifications.

    1. eLife Assessment

      This important work advances our understanding of the role of kisspeptin neurons in regulating the luteinizing hormone (LH) surge in females. The evidence demonstrating increased neuronal activity in anterior hypothalamic kisspeptin neurons just before the LH surge is compelling, though additional neuroanatomical evidence showing the specificity of the methods would strengthen the study. It also confirms that high circulating levels of estradiol, but also other unidentified factors, are required for the full daily activation. This research will be of interest to reproductive biologists and neuroscientists studying the female ovarian cycle.

    2. Joint Public Review:

      Summary:

      This is an excellent, timely study investigating and characterizing the underlying neural activity that generates the neuroendocrine GnRH and LH surges that are responsible for triggering ovulation. Abundant evidence accumulated over the past 20 years implicated the population of kisspeptin neurons in the hypothalamic RP3V region (also referred to as the POA or AVPV/PeN kisspeptin neurons) as being involved in driving the GnRH surge in response to elevated estradiol (E2), also known as the "estrogen positive feedback". However, while former studies used Cfos coexpression as a marker of RP3V kisspeptin neuron activation at specific times and found this correlates with the timing of the LH surge, detailed examination of the live in vivo activity of these neurons before, during, and after the LH surge remained elusive due to technical challenges.

      Here, Zhou and colleagues use fiber photometry to measure the long-term synchronous activity of RP3V kisspeptin neurons across different stages of the mouse estrous cycle, including on proestrus when the LH surge occurs, as well as in a well-established OVX+E2 mouse model of the LH surge.

      The authors report that RP3V kisspeptin neuron activity is low on estrus and diestrus, but increases on proestrus several hours before the late afternoon LH surge, mirroring prior reports of rising GnRH neuron activity in proestrus female mice. The measured increase in RP3V kisspeptin activation is long, spanning ~13 hours in proestrus females and extending well beyond the end of the LH secretion, and is shown by the authors to be E2 dependent.

      For this work, Kiss-Cre female mice received a Cre-dependent AAV injection, containing GCaMP6, to measure the neuronal activation of RP3V Kiss1 cells. Females exhibited periods of increased neuronal activation on the day of proestrus, beginning several hours prior to the LH surge and lasting for about 12 hours. Though oscillations in the pattern of GCaMP fluorescence were occasionally observed throughout the ovarian cycle, the frequency, duration, and amplitude of these oscillations were significantly higher on the day of proestrus. This increase in RP3V Kiss1 neuronal activation that precedes the increase in LH supports the hypothesis that these neurons are critical in regulating the LH surge. The authors compare this data to new data showing a similar increased activation pattern in GnRH neurons just prior to the LH surge, further supporting the hypothesis that RP3V Kiss1 cell activation causes the release of kisspeptin to stimulate GnRH neurons and produce the LH surge.

      Strengths:

      This study provides compelling data demonstrating that RP3V kisspeptin neuronal activity changes throughout the ovarian cycle, likely in response to changes in estradiol levels, and that neuronal activation increases on the day of the LH surge.

      The observed increase in RP3V kisspeptin neuronal activation precedes the LH surge, which lends support to the hypothesis that these neurons play a role in regulating the estradiol-induced LH surge. Continuing to examine the complexities of the LH surge and the neuronal populations involved, as done in this study, is critical for developing therapeutic treatments for women's reproductive disorders.

      This innovative study uses a within-subject design to examine neuronal activation in vivo across multiple hormone milieus, providing a thorough examination of the changes in activation of these neurons. The variability in neuronal activity surrounding the LH surge across ovarian cycles in the same animals is interesting and could not be achieved without this within-subjects design. The inclusion and comparison of ovary-intact females and OVX+E2 females is valuable to help test mechanisms under these two valuable LH surge conditions, and allows for further future studies to tease apart minor differences in the LH surge pattern between these 2 conditions.

      This study provides an excellent experimental setup able to monitor the daily activity of preoptic kisspeptin neurons in freely moving female mice. It will be a valuable tool to assess the putative role of these kisspeptin neurons in various aspects of altered female fertility (aging, pathologies...). This approach also offers novel and useful insights into the impact of E2 and circadian cues on the electrical activity of RP3V kisspeptin neurons.

      An intriguing cyclical oscillation in kisspeptin neural activity every 90 minutes exists, which may offer critical insight into how the RP3V kisspeptin system operates. Interestingly, there was also variability in the onset and duration of RP3V Kisspeptin neuron activity between and within mice in naturally cycling females. Preoptic kisspeptin neurons show an increased activity around the light/dark transition only on the day of proestrus, and this is associated with an increase in LH secretion. An original finding is the observation that the peak of kisspeptin neuron activation continues a few hours past the peak of LH, and the authors hypothesize that this prolonged activity could drive female sexual behaviors, which usually appear after the LH surge.

      The authors demonstrated that ovariectomy resulted in very little neuronal activity in RP3V kisspeptin neurons. When these ovariectomized females were treated with estradiol benzoate (EB) and an LH surge was induced, there was an increase in RP3V kisspeptin neuronal activation, as was seen during proestrus. However, the magnitude of the change in activity was greater during proestrus than during the EB-induced LH surge. Interestingly, the authors noted a consistent peak in activity about 90 minutes prior to lights out on each day of the ovarian cycle and during EB treatment, but not in ovariectomized females. The functional purpose of this consistent neuronal activity at this time remains to be determined.

      Though not part of this study, the comparison of neuronal activation of GnRH neurons during the LH surge to the current data was convincing, demonstrating a similar pattern of increased activation that precedes the LH surge.

      In summary, the study is well-designed, uses proper controls and analyses, has robust data, and the paper is nicely organized and written. The data from these experiments is compelling, and the authors' claims and conclusions are nicely supported and justified by the data. The data support the hypothesis in the field that these RP3V neurons regulate the LH surge. Overall, these findings are important and novel, and lend valuable insight into the underlying neural mechanisms for neuroendocrine control of ovulation.

      Weaknesses:

      (1) LH levels were not measured in many mice or in robust temporal detail, such as every 30 or 60 min, to allow a more detailed comparison between the fine-scale timing of RP3V neuron activation with onset and timing of LH surge dynamics.

      (2) The authors report that the peak LH value occurred 3.5 hours after the first RP3V kisspeptin neuron oscillation. However, it is likely, and indeed evident from the 2 example LH patterns shown in Figures 3A-B, that LH values start to increase several hours before the peak LH. This earlier rise in LH levels ("onset" of the surge) occurs much closer in time to the first RP3V kisspeptin neuron oscillatory activation, and as such, the ensuing LH secretion may not be as delayed as the authors suggest.

      (3) The authors nicely show that there is some variation (~2 hours) in the peak of the first oscillation in proestrus females. Was this same variability present in OVX+E2 females, or was the variability smaller or absent in OVX+E2 versus proestrus? It is possible that the variability in proestrus mice is due to variability in the timing and magnitude of rising E2 levels, which would, in theory, be more tightly controlled and similar among mice in the OVX+E2 model. If so, the OVX+E2 mice may have less variability between mice for the onset of RP3V kisspeptin activity.

      (4) One concern regarding this study is the lack of data showing the specificity of the AAV and the GCaMP6s signals. There are no data showing that GCaMP6s is limited to the RP3V and is not expressed in other Kiss1 populations in the brain. Given that 2ul of the AAV was injected, which seems like a lot considering it was close to the ventricle, it is important to show that the signal and measured activity are specific to the RP3V region. Though the authors discuss potential reasons for the low co-expression of GCaMP6 and kisspeptin immunoreactivity, it does raise some concern regarding the interpretation of these results. The low co-expression makes it difficult to confirm the Kiss1 cell-specificity of the Cre-dependent AAV injections. In addition, if GFP (GCaMP6s) and kisspeptin protein co-localization is low, it is possible that the activation of these neurons does not coincide with changes in kisspeptin or that these neurons are even expressing Kiss1 or kisspeptin at the time of activation. It is important to remember that the study measures activation of the kisspeptin neuron, and it does not reveal anything specific about the activity of the kisspeptin protein.

      (5) One additional minor concern is that LH levels were not measured in the ovariectomized females during the expected time of the LH surge. The authors suggest that the lower magnitude of activation during the LH surge in these females, in comparison to proestrus females, may be the result of lower LH levels. It's hard to interpret the difference in magnitude of neuronal activation between EB-treated and proestrus females without knowing LH levels. In addition, it's possible that an LH surge did not occur in all EB-treated females, and thus, having LH levels would confirm the success of the EB treatment.

      (6) This kisspeptin neuron peak activity is abolished in ovariectomized mice, and estradiol replacement restored this activity, but only partially. Circulating levels of estradiol were not measured in these different setups, but the authors hypothesize that the lack of full restoration may be due to the absence of other ovarian signals, possibly progesterone.

      (7) Recordings in several mice show inter- and intra-variability in the time of peak onset. It is not shown whether this variability is associated with a similar variability in the timing of the LH surge onset in the recorded mice. The authors hypothesized that this variability indicates a poor involvement of the circadian input. However, no experiments were done to investigate the role of the (vasopressinergic-driven) circadian input on the kisspeptin neuron activation at the light/dark transition. Thus, we suggest that the authors be more tentative about this hypothesis.

    1. eLife Assessment

      This study aims to identify the proteins that make up the electrical synapse, which are much less understood than those of the chemical synapse. These findings represent an important step toward understanding the molecular function of electrical synapses and will have broad utility for the wider neuroscience field. The experimental evidence is convincing.

    2. Reviewer #1 (Public review):

      This study aims to identify the proteins that compose the electrical synapse, which are much less understood than those of the chemical synapse. Identifying these proteins is important to understand how synaptogenesis and conductance are regulated in these synapses.

      Using a proteomics approach, the authors identified more than 50 new proteins and used immunoprecipitation and immunostaining to validate their interaction or localization. One new protein, a scaffolding protein (Sipa1l3), shows particularly strong evidence of being an integral component of the electrical synapse. The function of Sipa1l3 remains to be determined.

      Another strength is the use of two different model organisms (zebrafish and mice) to determine which components are conserved across species. This approach also expands the utility of this work to benefit researchers working with both species.

      The methodology is robust and there is compelling evidence supporting the findings.

      Comments on revisions:

      I thank the authors for responding to the comments. No further recommendations.

    3. Reviewer #3 (Public review):

      Summary:

      This study by Tetenborg S et al. identifies proteins that are physically closely associated with gap junctions in retinal neurons of mice and zebrafish using BioID, a technique that labels and isolates proteins in proximity to a protein of interest. These proteins include scaffold proteins, adhesion molecules, chemical synapse proteins, components of the endocytic machinery, and cytoskeleton-associated proteins. Using a combination of genetic tools and meticulously executed immunostaining, the authors further verified the colocalizations of some of the identified proteins with connexin-positive gap junctions. The findings in this study highlight the complexity of gap junctions. Electrical synapses are abundant in the nervous system, yet their regulatory mechanisms are far less understood than those of chemical synapses. This work will provide valuable information for future studies aiming to elucidate the regulatory mechanisms essential for the function of neural circuits.

      Strengths:

      A key strength of this work is the identification of novel gap junction-associated proteins in AII amacrine cells and photoreceptors using BioID in combination with various genetic tools. The well-studied functions of gap junctions in these neurons will facilitate future research into the functions of the identified proteins in regulating electrical synapses.

      Comments on revisions:

      The authors have addressed my concerns in the revised manuscript.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer 1

      The authors should clarify the statement regarding the expression in horizontal cells (lines 170-172). In line 170, it is stated that GFP was observed in horizontal cells. Since GFP is fused to Cx36, the observation of GFP in horizontal cells would suggest the expression of Cx36-GFP.

      We believe that there appears to be a misunderstanding. GFP is observed in horizontal cells, because the test AAV construct, which consists of the HKamac promoter and a downstream GFP sequence, was used to validate the promoter specificity in wildtype animals. This was just a test to confirm that HKamac is indeed active in AII amacrine cells as previously described by Khabou et al. 2023. This construct was not used for the large scale BioID screen. For these experiments, V5-dGBP-Turbo was expressed under the control of the HKamac promoter as illustrated in Figure 2A.

      Fig 7: the legend is missing the descriptions for panels A-C.

      We apologize for this mistake. We had omitted the label “(A-C)” and have now added it to the legend.

      Supplemental files are not referenced in the manuscript.

      We have added a reference for these files in line 221-226.

      Reviewer 2

      Supplementary Files 1 and 2 are presented as two replicates of the zebrafish proteomic datasets, but they appear to be identical.

      This appears to be a misunderstanding. These two replicates contain slightly different hits, although the most abundant candidates are identical.

      Reviewer 3

      Thank you for the positive comments

    1. eLife Assessment

      This study presents a valuable finding on how the locus coeruleus modulates the involvement of medial prefrontal cortex in set shifting using calcium imaging. The evidence supporting the claims was viewed as incomplete in comparisons of extra- (EDS) and intradimensional shifts (IDS). The work is of broad interest to those studying flexible cognition.

    2. Reviewer #1 (Public review):

      Summary:

      The authors note that there is a large corpus of research establishing the importance of LC-NE projections to medial prefrontal cortex (mPFC) of rats and mice in attentional set or 'rule' shifting behaviours. However, this is a complex behavior and the authors were attempting to gain an understanding of how locus coeruleus modulation of the mPFC contributes to set shifting.

      The authors replicated the ED-shift impairment following NE denervation of mPFC by chemogenetic inhibition of the LC. They further showed that LC inhibition changed the way neurons in mPFC responded to the cues, with a greater proportion of individual neurons responsive to 'switching', but the individual neurons also had broader tuning, responding to other aspects of the task (i.e., response choice and response history). The population dynamics were also changed by LC inhibition, with reduced separation of population vectors between early-post-switch trials, when responding was at chance, and later trials when responding was correct. This was what they set out to demonstrate and so one can conclude they achieved their aims.

      The authors concluded that LC inhibition disrupted mPFC "encoding capacity for switching" and suggest that this "underlie[s] the behavioral deficits."

      Strengths:

      The principal strength is combining inactivation of LC with calcium imaging in mPFC. This enabled detailed consideration of the change in behavior (i.e., defining epochs of learning, with an 'early phase' when responding is at chance being compared to a 'later phase' when the behavioral switch has occurred) and how these are reflected in neuronal activity in the mPFC, with and without LC-NE input.

      Comments on revised version:

      In their response to reviewers, the authors say "We report p values using 2 decimal points and standard language as suggested by this reviewer". However, no changes were made in the manuscript: for example, "P = 4.2e-3" rather than "p = 0.004".

      In their response to the reviewers, they wrote: "Upon closer examination of the behavioral data, we exclude several sessions where more trials were taken in IDS than in EDS." If those sessions in which EDS < IDS [...] EDS > IDS. Most problematic is the fact that the manuscript now reads "Importantly, control mice (pooled from Fig. 1e, 1h, Supp. Fig. 1a, 1b) took more trials to complete EDS than IDS (Trials to criterion: IDS vs. EDS, 10 ± 1 trials vs. 16 ± 1 trials, P < 1e-3, Supp. Fig. 1c), further supporting the validity of attentional switching (as in Fig. 1c)" without mentioning that data has been excluded.

    3. Reviewer #3 (Public review):

      Summary:

      Nigro et al examine how the locus coeruleus (LC) influences the medial prefrontal cortex (mPFC) during attentional shifts required for behavioral flexibility. Specifically, they propose that LC-mPFC inputs enable mice to shift attention effectively from texture to odor cues to optimize behavior. The LC and its noradrenergic projections to the mPFC have previously been implicated in this behavior. The authors further establish this by using chemogenetics to inhibit LC terminals in mPFC and show a selective deficit in extradimensional set shifting behavior. But the study's primary innovation is the simultaneous inhibition of LC while recording multineuron patterns of activity in mPFC. Analysis at the single neuron and population levels revealed broadened tuning properties, less distinct population dynamics, and disrupted predictive encoding when LC is inhibited. These findings add to our understanding of how neuromodulatory inputs shape attentional encoding in mPFC and are an important advance. There are some methodological limitations and/or caveats that should be considered when interpreting the findings, and these are described below.

      Strengths:

      The naturalistic set-shifting task in freely-moving animals is a major strength and the inclusion of localized suppression of LC-mPFC terminals builds confidence in the specificity of their behavioral effect. Combining chemogenetic inhibition of LC while simultaneously recording neural activity in mPFC with miniscopes is state-of-the-art. The authors apply analyses to population dynamics in particular that can advance our understanding of how the LC modifies patterns of mPFC neural activity. The authors show that neural encoding at both the single cell level and the population level is disrupted when LC is inhibited. They also show that activity is less able to predict key aspects of the behavior when the influence of LC is disrupted. This is quite interesting and adds to a growing understanding of how neuromodulatory systems sharpen tuning of mPFC activity.

      Weaknesses:

      Weaknesses are mostly minor, but there are some caveats that should be considered. First, the authors use a DBH-Cre mouse line and provide histological confirmation of overlap between HM4Di expression and TH immunostaining. While this strongly suggests modulation of noradrenergic circuit activity, the results should be interpreted conservatively as there is no independent confirmation that norepinephrine (NE) release is suppressed and these neurons are known to release other neurotransmitters and signaling peptides. In the absence of additional control experiments, it is important to recognize that effects on mPFC activity may or may not be directly due to LC-mPFC NE.

      Another caveat is that the imaging analyses are entirely from the extradimensional shift session. Without analyzing activity data from the intradimensional shift (IDS) session, one cannot be certain that the observed changes are to some feature of activity that is specific to extradimensional shifts. Future experiments should examine animals with LC suppression during the IDS as well, which would show whether the observed effects are specific to an extradimensional shift and might explain behavioral effects.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We thank the reviewers and editors for this peer review. Following the editorial assessment and specific review comments, in this revision we have included new analysis to support the validity of the behavioral task (Reviewer #2). We have improved data presentation by including 1) data points from individual animals (Reviewer #1, #3), 2) updated histology showing the expression of hM4Di in LC neurons as well as LC terminals in the mPFC (Reviewer #3), and 3) more detailed descriptions of methodology and data analysis (Reviewer #1, #2, #3).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Planned t-tests should be performed in both control and experimental animals to determine if the number of trials needed to reach criterion on the ID is lower than on the ED. Based on the data analyses showing no difference among the control group, the data could be pooled to demonstrate that the task is valid. Reporting all p-values using 2 decimal points and standard language e.g., p < 0.001 would greatly improve the readability of the data. 

      Thank you for this suggestion. As pointed out by this reviewer, more trials to reach performance criterion in EDS than IDS is indicative of successful acquisition and switching of the attentional sets. Upon closer examination of the behavioral data, we exclude several sessions where more trials were taken in IDS than in EDS, and our conclusions that DREADD inhibition of the LC or LC input to the mPFC impaired rule switching in EDS remain robust (e.g., new Fig. 1e, 1h). We also pool control and test data (Fig. 1e, 1h, new Supp. Fig. 1a, 1b) to demonstrate the validity of this task (new Supp. Fig. 1c, IDS vs. EDS in the control group, 10 ± 1 trials vs. 16 ± 1 trials, P < 1e-3). The validity of set shifting is also supported by the new Fig. 1c.  

      We report p values using 2 decimal points and standard language as suggested by this reviewer.

      Relevant to the comments from Reviewer #1 in the public review, we now show individual data points on the bar charts (new Fig. 1e, 1h).  

      (2) It may also be helpful to provide the average time between CNO infusion and onset of the ED as well as information about when maximal effects are expected after these treatments.

      Systemic CNO injections were administered immediately after IDS, and we waited approximately one hour before proceeding to EDS. Maximal effects of systemic CNO activation were reported to occur after 30 minutes and last for at least 4-6 hours. Both control and test groups received the CNO injections in the same manner. This is now better described in Methods.  

      Reviewer #3 (Recommendations for the authors):

      (1) Add better histology images showing colocalization of TH and HM4Di. Quantification of colocalization would be optimal.

      We now include better histology images (new Fig. 1d) and have quantified the colocalization of TH and HM4Di in the main text (line 115-116).  

      (2) If possible, images showing HM4Di expression in mPFC axon terminals would be useful. If these are colocalized with TH immunostaining, that would increase confidence in their identity. This would be much more useful than the images provided in Figure 1C.

      We now include a new image to show hM4Di expression (mCherry) in LC terminals in the mPFC (new Fig. 1f). However, due to technical limitations (species of the primary antibody), we did not co-stain with TH.

      (3) Include behavior of mice from the miniscope experiment in Figure 2 to show they are similar to those from Figure 1.

      This is now included in Supp. Fig. 1b.

      (4) More details about the processing and segmentation of miniscope data would be helpful (e.g., how many neurons were identified from each animal?). 

      We use standard preprocessing and segmentation pipelines in Inscopix data processing software (version 1.6), which includes modules for motion correction and signal extraction. Briefly, raw imaging videos underwent preprocessing, including a 4× spatial downsampling to reduce file size and processing time. No temporal downsampling was performed. The images were then cropped to eliminate post-registration borders and areas where cells were not visible. Prior to the calculation of the dF/F0 traces, lateral movement was corrected. For ROI identification, we used a constrained non-negative matrix factorization algorithm optimized for endoscopic data (CNMF-E) to extract fluorescence traces from ROIs. We identified 128 ± 31 neurons after manual selection, depending on recording quality and field of view. The number of neurons acquired from each animal is now included in Methods. This is now further elaborated in Methods (lines 405-415).  
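
      The two numerical steps named above (4× spatial downsampling and dF/F0) can be illustrated with a minimal sketch. It is not the Inscopix implementation: block averaging as the downsampling rule, the session mean as F0, and the function names `spatial_downsample` and `delta_f_over_f` are all assumptions made for illustration.

```python
import numpy as np

def spatial_downsample(frames, factor=4):
    """Block-average a (time, height, width) movie by `factor` in both spatial axes.

    Assumption: simple block averaging; the actual software may use another rule.
    """
    t, h, w = frames.shape
    # Crop so height and width are exact multiples of the factor.
    h_crop, w_crop = h - h % factor, w - w % factor
    cropped = frames[:, :h_crop, :w_crop]
    blocks = cropped.reshape(t, h_crop // factor, factor, w_crop // factor, factor)
    return blocks.mean(axis=(2, 4))

def delta_f_over_f(trace):
    """Compute dF/F0 for a 1-D fluorescence trace.

    Assumption: F0 is the mean of the trace over the whole recording.
    """
    trace = np.asarray(trace, dtype=float)
    f0 = trace.mean()
    return (trace - f0) / f0
```

      No temporal axis is touched in `spatial_downsample`, matching the statement that no temporal downsampling was performed.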

      (5) Add more methodological detail for how cell tuning was analyzed, including how z-scoring was performed (across the entire session?), and how neurons in each category were classified. 

      We have expanded the Methods section to clarify how cell tuning was analyzed (lines 419-430). Calcium traces were z-scored on a per-neuron basis across the entire session. For each neuron, we computed trial-averaged activity aligned to specific task events (e.g., digging in one of the two ramekins available). A neuron was classified as responsive if its activity showed a significant difference (p < 0.05) between two conditions within the defined time window in the ROC analysis.
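
      The classification described above can be sketched as follows. The per-neuron z-scoring follows the response; the significance test is not specified there, so the label-shuffling permutation test, the number of permutations, and the function names (`zscore_per_neuron`, `auroc`, `is_responsive`) are assumptions made for illustration only.

```python
import numpy as np

def zscore_per_neuron(traces):
    """Z-score each row of a (n_neurons, n_timepoints) array across the session."""
    traces = np.asarray(traces, dtype=float)
    mean = traces.mean(axis=1, keepdims=True)
    std = traces.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant traces at zero instead of dividing by 0
    return (traces - mean) / std

def auroc(a, b):
    """Area under the ROC curve: P(value from a > value from b), ties count half."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return float((a > b).mean() + 0.5 * (a == b).mean())

def is_responsive(a, b, n_perm=1000, alpha=0.05, seed=0):
    """Label-shuffling test of |AUROC - 0.5| (assumed test, not from the source).

    a, b: per-trial responses of one neuron in the two conditions.
    Returns (responsive, p_value).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(auroc(a, b) - 0.5)
    pooled = np.concatenate([a, b])
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        if abs(auroc(shuffled[:len(a)], shuffled[len(a):]) - 0.5) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return p_value < alpha, p_value
```

      An AUROC of 0.5 means the two conditions are indistinguishable for that neuron; values near 0 or 1 indicate strong selectivity for one condition.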

      (6) For data from Figure 2F it would be very useful to plot data from individual mice in addition to this aggregated representation.

      We now include data from individual mice in Supp. Table 1.

      (7) I think it would be helpful to move some parts of Figure S1 to the main Figure 1, in particular the table from S1A. 

      Fig. S1 is now part of the new Fig. 1.

      (8) Clarify whether Figure S2 is an independent replication, as implied, or whether the same test data is shown twice in two separate figures (In Figure 1b and Supplementary Figure 2).

      The test group in Fig. S2 (new Fig. S1) is the same as the test group in Fig. 1b (new Fig. 1e), but the control group is a separate cohort. This is now clarified in the figure legends.  

      (9) The authors should add a limitations section to the discussion where they specifically discuss the caveats involved in relating their results specifically to NE. This should include the possible involvement of co-transmitters and off-target expression of Cre in other populations.

      Thank you for this comment. Previous pharmacology and lesion studies showed that LC input or NE content in the mPFC was specifically required for EDS-type switching processes (Lapiz, M.D. et al., 2006; Tait, D.S. et al. 2007; McGaughy, J. et al. 2008), in light of which we interpret our mPFC neurophysiological effects with LC inhibition as at least partially mediated by the direct LC-NE input.  When discussing the limitations of our study, we now explicitly acknowledge the potential involvement of co-transmitters released by LC neurons (line 253-256).  

      (10) The authors should provide details about the TH antibody used for IHC.

      We now include more details in immunohistochemistry (line 384-388).

      (11) Throughout, it would be helpful to include datapoints from individual animals - these are included in some supplementary figures, but are missing in a number of the main plots.

      Reviewer #1 made a similar comment, and we now include individual data points in the figures (e.g., Fig. 1e, 1h).

    1. eLife Assessment

      This study introduces a novel method for estimating spatial spectra from irregularly sampled intracranial EEG data, revealing cortical activity across all spatial frequencies, which supports the global and integrated nature of cortical dynamics. It showcases important technical innovations and rigorous analyses, including tests to rule out potential confounds. However, further direct evaluation of the model, for example by using simulated cortical activity with a known spatial spectrum (e.g., an iEEG volume-conductor model that describes the mapping from cortical current source density to iEEG signals, and that incorporates the reference electrodes and the particular montage used), would even further strengthen the incomplete evidence.

    2. Reviewer #1 (Public review):

      Summary:

      The paper uses rigorous methods to determine phase dynamics from human cortical stereotactic EEGs. It finds that the power of the phase is higher at the lowest spatial frequencies. The application to data illustrates the solidity of the method and its potential for discovery.

      Comments on revised submission:

      The authors have provided responses to the previous recommendations.

    3. Reviewer #3 (Public review):

      Summary:

      The authors propose a method for estimating the spatial power spectrum of cortical activity from irregularly sampled data and apply it to iEEG data from human patients during a delayed free recall task. The main findings are that the spatial spectra of cortical activity peak at low spatial frequencies and decrease with increasing spatial frequency. This is observed over a broad range of temporal frequencies (2-100 Hz).

      Strengths:

      A strength of the study is the type of data that is used. As pointed out by the authors, spatial spectra of cortical activity are difficult to estimate from non-invasive measurements (EEG and MEG) and from commonly used intracranial measurements (i.e. electrocorticography or Utah arrays) due to their limited spatial extent. In contrast, iEEG measurements are easier to interpret than EEG/MEG measurements and typically have larger spatial coverage than Utah arrays. However, iEEG is irregularly sampled within the three-dimensional brain volume and this poses a methodological problem that the proposed method aims to address.

      Weaknesses:

      Although the proposed method is evaluated in several indirect ways, a direct evaluation is lacking. This would entail simulating cortical current source density (CSD) with known spatial spectrum and using a realistic iEEG volume-conductor model to generate iEEG signals.

      Comments on revised version:

      In my original review, I raised the following issue:

      "The proposed method of estimating wavelength from irregularly sampled three-dimensional iEEG data involves several steps (phase-extraction, singular value-decomposition, triangle definition, dimension reduction, etc.) and it is not at all clear that the concatenation of all these steps actually yields accurate estimates. Did the authors use more realistic simulations of cortical activity (i.e. on the convoluted cortical sheet) to verify that the method indeed yields accurate estimates of phase spectra?"

      And the authors' response was:

      "We now included detailed surrogate testing, in which varying combinations of sEEG phase data and veridical surrogate wavelengths are added together. See our reply from the public reviewer comments. We assess that real neurophysiological data (here, sEEG plus surrogate and MEG manipulated in various ways) is a more accurate way to address these issues. In our experience, large scale TWs appear spontaneously in realistic cortical simulations, and we now cite the relevant papers in the manuscript (line 53)."

      The point that I wanted to make is not that traveling waves appear in computational models of cortical activity, as the authors seem to think. My point was that the only direct way to evaluate the proposed method for estimating spatial spectra is to use simulated cortical activity with known spatial spectrum. In particular, with "realistic simulations" I refer to the iEEG volume-conductor model that describes the mapping from cortical current source density (CSD) to iEEG signals, and that incorporates the reference electrodes and the particular montage used.

      Although in the revised manuscript the authors have provided indirect evidence for the soundness of the proposed estimation method, the lack of a direct evaluation using realistic simulations with ground truth, as described above, leaves me sceptical about the soundness of the method.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study introduces a novel method for estimating spatial spectra from irregularly sampled intracranial EEG data, revealing cortical activity across all spatial frequencies, which supports the global and integrated nature of cortical dynamics. The study showcases important technical innovations and rigorous analyses, including tests to rule out potential confounds; however, the lack of comprehensive theoretical justification and assumptions about phase consistency across time points renders the strength of evidence incomplete. The dominance of low spatial frequencies in cortical phase dynamics continues to be of importance, and further elaboration on the interpretation and justification of the results would strengthen the link between evidence and conclusions.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The paper uses rigorous methods to determine phase dynamics from human cortical stereotactic EEGs. It finds that the power of the phase is higher at the lowest spatial phase.

      Strengths:

      Rigorous and advanced analysis methods.

      Weaknesses:

      The novelty and significance of the results are difficult to appreciate from the current version of the paper.

      (1) It is very difficult to understand which experiments were analysed, and from where they were taken, reading the abstract. This is a problem both for clarity with regard to the reader and for attribution of merit to the people who collected the data.

      We now explicitly state the experiments that were used, lines 715-716.

      (2) The finding that the power is higher at the lowest spatial phase seems in tune with a lot of previous studies. The novelty here is unclear and it should be elaborated better.

      It is not generally accepted in neuroscience that power is higher at lowest spatial frequencies, and recent research concludes that traveling waves at this scale may be the result of artefactual measurement (Orczyk et al., 2022; Hindriks et al., 2014; Zhigalov & Jensen, 2023). The question we answer is therefore timely and a source of controversy to researchers analysing TWs in cortex. While, in our view, the previous literature points in the direction of our conclusions (notably the work of Freeman et al. 2003; 2000; Barrie et al. 1996), it is not conclusive at the scale we are interested in, specifically >8cm, and certainly not convincing to the proponents of ‘artefactual measurement’.

      We have added a sentence to make this explicit in the abstract, lines 20-22. Please also note the previous text at the end of the introduction, lines 140-148, and in the first paragraph of the discussion, lines 563-569.

      I could not understand reading the paper the advantage I would have if I used such a technique on my data. I think that this should be clear to every reader.

      We have made the core part of the code available on GitHub (line 1154), which should simplify adoption of the technique. We have argued, in the Discussion (lines 653-663), that habitual measurement of SF spectra is desirable, since the same task measured with EEG, sEEG or ECoG does not encompass the same spatial scales, and researchers may be comparing signals with different functional properties. Until reliable methods for estimating SF that do not depend on the layout of the recording array are available, data cannot be analysed to resolve this question. Publication of our results and methods will help this process along.

      (3) It seems problematic to trust in a strong conclusion that they show low spatial frequency dynamics of up to 15-20 cm given the sparsity of the arrays. The authors seem to agree with this concern in the last paragraph of page 12. 

      The new surrogate testing supports our conclusions. The sEEG arrays would not normally be a first choice to estimate SF spectra, for reasons of their sparsity, which may be why such estimates have not been done before. Yet, this is the research challenge that we sought to solve, and a problem for which there was no ready method to hand. Nevertheless, it is a problem that urgently needed to be solved given the current debate on the origin of large-scale TWs. We have now included detailed surrogate testing of real data plus varying strength model waves (Figure 6A and Supplementary Figure 4). We believe this should convince the reader that we are measuring the spatial frequency spectrum with sufficient accuracy to answer the central research question.

      They also say that it would be informative to repeat the analyses presented here after the selection of more participants from all available datasets. It begs the question of why this was not done. It should be done if possible.

      We have now doubled the number of participants in the main analyses. Since each participant comprises a test of the central hypothesis, the hypothesis test now has 23 replications (Supplementary Figures 2 and 3). There were four failures to reach significance due to under-powered tests, i.e., not enough contacts. This is a sufficient test of the hypothesis and, in our opinion, not the primary obstacle to scientific acceptance of our results. The main obstacle is providing convincing tests that the method is accurate, and this is what we have focussed on. Publication of Python code and the detailed methods described here enable any interested researcher to extend our method to other datasets.

      (4) Some of the analyses seem not to exploit in full the power of the dataset. Usually, a figure starts with an example participant but then the analysis of the entire dataset is not as exhaustive. For example, in Figure 6 we have a first row with the single participants and then an average over participants. One would expect quantifications of results from each participant (i.e. from the top rows of Fig. 6) extracting some relevant features of results from each participant and then showing the distribution of these features across participants. This would complement the subject average analysis.

      The results are now clearly split into sections, where we first deal with all the single participant analyses, then the surrogate testing to confirm the basic results, then the participant aggregate results (Figure 7 and Supplementary Figure 7). The participant aggregate results reiterate the basic findings for the single participants. The key finding is straightforward (SF power decreases with SF) and required only one statistical analysis per subject.

      (5) The function of brain phase dynamics at different frequencies and scales has been examined in previous papers at frequencies and scales relevant to what the authors treat. The authors may want to be more extensive with citing relevant studies and elaborating on the implications for them. Some examples below:

      Womelsdorf T, et al. Science. 2007

      Besserve M, et al. PLoS Biology. 2015

      Nauhaus I, et al. Nat Neurosci. 2009

      We have added two paragraphs to the discussion, in response to the reviewer suggestion (lines 606-623). These paragraphs place our high TF findings in the context of previous research.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors analyze the organization of phases across different spatial scales. The authors analyze intracranial, stereo-electroencephalogram (sEEG) recordings from human clinical patients. The authors estimate the phase at each sEEG electrode at discrete temporal frequencies. They then use higher-order SVD (HOSVD) to estimate the spatial frequency spectrum of the organization of phase in a data-driven manner. Based on this analysis, the authors conclude that most of the variance explained is due to spatially extended organizations of phase, suggesting that the best description of brain activity in space and time is in fact a globally organized process. The authors' analysis is also able to rule out several important potential confounds for the analysis of spatiotemporal dynamics in EEG.

      Strengths:

      There are many strengths in the manuscript, including the authors' use of SVD to address the limitation of irregular sampling and their analyses ruling out potential confounds for these signals in the EEG.

      Weaknesses:

      Some important weaknesses are not properly acknowledged, and some conclusions are overinterpreted given the evidence presented.

      The central weakness is that the analyses estimate phase from all signal time points using wavelets with a narrow frequency band (see Methods - "Numerical methods"). This step makes the assumption that phase at a particular frequency band is meaningful at all times; however, this is not necessarily the case. Take, for example, the analysis in Figure 3, which focuses on a temporal frequency of 9.2 Hz. If we compare the corresponding wavelet to the raw sEEG signal across multiple points in time, this will look like an amplitude-modulated 9.2 Hz sinusoid to which the raw sEEG signal will not correspond at all. While the authors may argue that analyzing the spatial organization of phase across many temporal frequencies will provide insight into the system, there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal. This is a critical point for the analysis because while this analysis of the spatial organization of phase could provide some interesting results, this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time. If this is not true, then the foundation of the analysis may not be precisely clear. This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local". Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.

      “using wavelets with a narrow frequency band … this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time”

      Our method uses very short time-window Morlet wavelets to avoid the assumptions of oscillations, i.e., long-lasting sinusoids in the signal, in the sense of sinusoidal waveforms, or limit cycles extending in time. Cortical TWs can only last one or two cycles (Alexander et al., 2006), requiring methods that are compact in the time domain to avoid underreporting the desired phenomena. Additionally, the short time-window Morlet wavelets have low frequency resolution, so they are robust with respect to shifts in frequency between sites. We now discuss this issue explicitly in the Methods (lines 658-674). This means the phase estimation methods used in the manuscript precisely do not have the problem of assuming narrow-band oscillations in the signal. The methods are also robust to the exact shape of the waveforms; the signal need only be approximately sinusoidal, i.e., rise and fall. This means the Fourier variant we use does not introduce the ringing artefacts that can be introduced by longer time-series methods, such as FFT.
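
      To make the time-frequency trade-off concrete, the following is a minimal sketch of phase estimation with a short complex Morlet wavelet. It is not the authors' code: the function name `morlet_phase`, the envelope parameterisation (temporal s.d. = n_cycles / (2*pi*f)), and the truncation at three standard deviations are assumptions made for illustration. Under this parameterisation the spectral s.d. is f / n_cycles, so a two-cycle wavelet at 9.2 Hz has a spectral s.d. of about 4.6 Hz, which is why modest frequency shifts between sites have little effect on the estimated phase.

```python
import numpy as np

def morlet_phase(signal, fs, freq, n_cycles=2.0):
    """Instantaneous phase at `freq` (Hz) from a short complex Morlet wavelet.

    Hypothetical helper for illustration; parameter choices are assumptions.
    """
    sigma_t = n_cycles / (2.0 * np.pi * freq)      # temporal s.d. of the Gaussian envelope (s)
    half = int(np.ceil(3.0 * sigma_t * fs))        # truncate the wavelet at +/- 3 s.d.
    t = np.arange(-half, half + 1) / fs
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
    wavelet /= np.abs(wavelet).sum()               # normalise the envelope area
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.angle(analytic)

# Usage: recover the phase of a 9.2 Hz sinusoid sampled at 1000 Hz.
fs = 1000.0
t = np.arange(2000) / fs
x = np.cos(2.0 * np.pi * 9.2 * t + 0.7)
phase = morlet_phase(x, fs, 9.2)
```

      Because the wavelet spans only a few hundred milliseconds at this frequency, the estimate at each time point depends on the local waveform only, at the cost of a broad pass-band.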

      “This step makes the assumption that phase at a particular frequency band is meaningful at all times”

      This important consideration is entrenched in our choice of methods. By way of explanatory background, we point out that this step is not the final step. Aggregation methods can be used to distinguish between signal and noise. In the simple case, event-locked time-series of phase can be averaged. This would allow consistent (non-noise) phase relations to be preserved, while the inconsistent (including noise) phase relations would be washed out. This is part of the logic behind all such aggregation procedures, e.g., phase-locking, coherence. SVD has the advantage of capturing consistent relations in this sense, but without loss of information as occurs in averaging (up to the choice of number of singular vectors in the final model). Specifically, maps of the spatial covariances in phase are captured in the order of the variance explained. Noise (in the sense conveyed by the reviewer) in the phase measurements will not contribute to the highest-rank singular vectors. SVD is commonly used to remove noise, and that is one of its purposes here. This point can be seen by considering the very smooth singular vectors derived from MEG (Figure 3F) in this new version of the manuscript. These maps of phase gradients pull out only the non-noisy relations, even as their weighted sums reproduce any individual sample to any desired accuracy.

      To summarize, the next step (of incorporating the phase measure into the SVD) neatly bypasses the issue of non-meaningful phase quantification. This is one of the reasons why we do not undertake the spatial frequency estimates on the raw matrices of estimated phase.
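
      The aggregation argument above can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' pipeline: it uses an ordinary SVD (`numpy.linalg.svd`) on a samples-by-contacts matrix of unit-modulus phase values, with hypothetical one-dimensional contact positions, a single consistent phase gradient, and independent Gaussian phase jitter standing in for "non-meaningful" phase estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_contacts = 400, 30
positions = np.linspace(0.0, 20.0, n_contacts)   # hypothetical 1-D contact positions (cm)

# Consistent relation: one phase cycle per 20 cm, with an arbitrary
# global phase offset at each time sample.
gradient = 2.0 * np.pi * positions / 20.0
offsets = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, 1))

# Inconsistent part: independent phase jitter at every sample and contact.
jitter = rng.normal(scale=0.8, size=(n_samples, n_contacts))
X = np.exp(1j * (gradient[None, :] + offsets + jitter))   # unit-modulus phase data

U, s, Vh = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)                  # fraction of variance per component

# Overlap between the first spatial singular vector and the true gradient map
# (1.0 means identical up to a global phase factor).
template = np.exp(1j * gradient) / np.sqrt(n_contacts)
alignment = np.abs(np.vdot(template, Vh[0]))
```

      In this toy setting the first spatial singular vector closely matches the consistent gradient, while the jitter is spread thinly over the remaining components. Whether the same separation holds for real sEEG phase depends on how consistent the underlying phase relations are, which is the reviewer's point.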

      We now include a new sub-paragraph on this topic in the methods, lines 831-838.

      In addition, we have reworded the first description of the methods with a new paragraph at the end of the introduction, which better balances the description of the steps involved. The two sentences (lines 162-166) highlight the issue of concern to the reviewer.

      “there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal.”

      The correct description of the full sEEG signal is beyond the scope of the present research. Our main goal, as stated, is to show that the hypothesis that ‘extra-cranial measurements of TWs is the result of projection from localized activity’ is not supported by the evidence of spatial patterns of activity in the cortex. Since this activity can be accessed as a single frequency band (especially if localized sources create the large-scale patterns), analysis of SF on a TF-by-TF basis is sufficient.

      “This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local".”

      We agree with the reviewer, even though we expect that the strongest influences on local phase are due to other cortical signals in the same band. The implicit assumption of the focus on bands of the same temporal frequency is now made explicit in the abstract (lines 31-34).

      A sentence addressing this issue has been added to the first paragraph of the discussion (lines 579-582).

      Inclusion of cross-frequency interactions would likely require a highly regular measurement array over the scales of interest here, i.e., the noise levels inherent in the spatial organization of sEEG contacts would not support such analyses.

      “Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.”

      We have removed the phase examples that were previously in Supplementary Figure 5 (and Figure 5 in the previous version of the main text), since further surrogate testing and modelling (Supplementary Figure 11) shows the LSVs from irregular arrays will inevitably capture mixtures of low and high SF signals. The final section of the Methods explains this effect in some detail. Instead, the new version of the manuscript relies on new surrogate testing to validate our methods.

      Another weakness is in the discussion on spatial scale. In the analyses, the authors separate contributions at (approximately) > 15 cm as macroscopic and < 15 cm as mesoscopic. The problem with the "macroscopic" here is that 15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur. For example, if a specific set of cortical regions, spanning over a 10 cm range, were to exhibit a consistent organization of phase at a particular temporal frequency (required by the analysis technique, as noted above), it is not clear why that would not be considered a "macroscopic" organization of phase, since it comprises multiple areas of the brain acting in coordination. Further, while this point could be considered as mostly semantic in nature, there is also an important technical consideration here: would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected? If this is not the case, then could it be possible that the lowest spatial frequencies are detected more often simply because it would be difficult to detect variable organizations in subsets of electrodes?

      The motivation for our study was to show that large-scale TWs measured outside the cortex cannot be the result of more localized activity being ‘projected up’. In this case, the temporal frequency of the artefactual waves would be the same as the localized sources, so the criticism does not apply.

      “while this point could be considered as mostly semantic in nature”

      We have changed the terminology in the paper to better coincide with standard usage. Macroscopic now refers to >1cm, while we refer to >8cm as large-scale.

      “15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur.”

      We can assume that subtle frequency variation (e.g., within an alpha phase binding) is greatest at the largest scales of cortex, or at least not less varying than measurements within regions. This means that not considering frequency-drift effects will not inflate low spatial frequency power over high spatial frequency power. Even so, the power spectrum we estimated is approximately 1/SF, so that unmeasured cross-frequency effects in binding (causal influences on local phase) would have to overcome the strength of this relation for this criticism to apply, which seems unlikely.

      “would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected?”

      See our previous comments about the low temporal frequency resolution of two-cycle Morlet wavelets. The answer is yes, up to the range approximated by the half-power bandwidth, which is large in the case of this method (see lines 760-764).

      Another weakness is disregarding the potential spike waveform artifact in the sEEG signal in the context of these analyses. Specifically, Zanos et al. (J Neurophysiol, 2011) showed that spike waveform artifacts can contaminate electrode recordings down to approximately 60 Hz. This point is important to consider in the context of the manuscript's results on spatial organization at temporal frequencies up to 100 Hz. Because the spike waveform artifact might affect signal phase at frequencies above 60 Hz, caution may be important in interpreting this point as evidence that there is significant phase organization across the cortex at these temporal frequencies.

      We have now added a sentence on this issue to the discussion (lines 600-602).

      However, our reading of the Zanos et al. paper is that the low temporal frequency (60-100Hz) contribution of spikes and spike patterns is negligible compared to genuine post-synaptic membrane fluctuations (see their Figure 3). These considerations come more strongly into play when correlations between LFP and spikes are calculated or spike triggered averaging is undertaken, since then a signal is being partly correlated with itself, or, partly averaged over the supposedly distinct signal with which it was detected.

      A last point is that, even though the present results provide some insight into the organization of phase across the human brain, the analyses do not directly link this to spiking activity. The predictive power that these spatial organizations of phase could provide for spiking activity - even if the analyses were not affected by the distortion due to the narrow-frequency assumption - remains unknown. This is important because relating back to spiking activity is the key factor in assessing whether these specific analyses of phase can provide insight into neural circuit dynamics. This type of analysis may be possible to do with the sEEG recordings, as well, by analyzing high-gamma power (Ray and Maunsell, PLoS Biology, 2011), which can provide an index of multi-unit spiking activity around the electrodes.

      “even if the analyses were not affected by the distortion due to the narrow-frequency assumption”

      See our earlier comment about narrow TFs; this is not the case in the present work.

      The spiking activity analysis would be an interesting avenue for future research. It appears the 1000Hz sampling frequency in the present data is not sufficient for the method described in Ray & Maunsell (2011). On a related topic, we have shown that large-scale traveling waves in the MEG and 8cm waves in ECoG can both be used to predict future localized phase at a single sensor/contact, two cycles into the future (Alexander et al., 2019). This approach could be used to predict spiking activity, by combining it with the reviewer’s suggestion. However, the current manuscript is motivated by the argument that measured large-scale extra-cranial TWs are merely projections of localized cortical activity. Since spikes do not arise in this argument, we feel it is outside the scope of the present research. We have added this suggestion to the discussion as a potential line of future research (lines 686-688).

      Reviewer #3 (Public review):

      Summary:

      The authors propose a method for estimation of the spatial spectra of cortical activity from irregularly sampled data and apply it to publicly available intracranial EEG data from human patients during a delayed free recall task. The authors' main findings are that the spatial spectra of cortical activity peak at low spatial frequencies and decrease with increasing spatial frequency. This is observed over a broad range of temporal frequencies (2-100 Hz).

      Strengths:

      A strength of the study is the type of data that is used. As pointed out by the authors, spatial spectra of cortical activity are difficult to estimate from non-invasive measurements (EEG and MEG) due to signal mixing and from commonly used intracranial measurements (i.e. electrocorticography or Utah arrays) due to their limited spatial extent. In contrast, iEEG measurements are easier to interpret than EEG/MEG measurements and typically have larger spatial coverage than Utah arrays. However, iEEG is irregularly sampled within the three-dimensional brain volume and this poses a methodological problem that the proposed method aims to address.

      Weaknesses:

      The used method for estimating spatial spectra from irregularly sampled data is weak in several respects.

      First, the proposed method is ad hoc, whereas there exist well-developed (Fourier-based) methods for this. The authors don't clarify why no standard methods are used, nor do they carry out a comparative evaluation.

      We disagree that the method is ad hoc, though the specific combination of SVD and multiscale differencing is novel in its application to sEEG. The SVD method has been used to isolate both ~30cm TWs in MEG and EEG (Alexander et al., 2013; 2016), as well as 8cm waves in ECoG (Alexander et al., 2013; 2019). Our opening examples in the results now reiterate these previous related findings, by way of an example analysis of MEG data (Figure 3). This will better inform the reader on the extent of continuity of the method from previous research.

      Standard FFT has been used after interpolating between EEG electrodes to produce a uniform array (Alamia et al., 2023). There exist well-developed Fourier methods for nonuniform grids, such as simple interpolation, the butterfly algorithm, wavefield extrapolation and multi-scale vector field techniques. However, the problems for which these methods are designed require non-sparse sampling or less irregular arrays. The sEEG contacts (reduced in number to grey matter contacts) are well outside the spatial irregularity range of any Fourier-related methods that we are aware of, particularly at the broad range of spatial scales of interest here (2cm up to 24cm). This would make direct comparison of these specialized Fourier methods to our novel methods, in the sEEG, something of a straw-man comparison.

      We now include a summary paragraph in the introduction, which is a brief review of Fourier methods designed to deal with non-uniform sampling (lines 159-162).

      Second, the proposed method lacks a theoretical foundation and hinges on a qualitative resemblance between Fourier analysis and singular value decomposition.

      We have improved our description of the theoretical relation between Fourier analysis and SVD (additional material at lines 839-861 and 910-922). In fact, there are very strong links between the two methods, and now it should be clearer that our method does not rely on a mere qualitative resemblance.
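
      One standard link between the two decompositions can be shown numerically. The sketch below is an illustration under simplifying assumptions, not a reproduction of the manuscript's derivation: sensors are regularly spaced on a ring, the data are a sum of three spatial Fourier modes with independent random phases per sample, and the amplitudes (1, 1/2, 1/3) are hypothetical. In that idealised case the spatial singular vectors of the data matrix coincide with the Fourier modes, and the singular values are proportional to the mode amplitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_sensors = 2000, 32
j = np.arange(n_sensors)                          # regularly spaced sensor index
amplitudes = {1: 1.0, 2: 0.5, 3: 1.0 / 3.0}       # hypothetical amplitude per spatial frequency k

# Each sample is a sum of spatial Fourier modes with independent random phases.
X = np.zeros((n_samples, n_sensors), dtype=complex)
for k, a in amplitudes.items():
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, 1))
    X += a * np.exp(1j * (2.0 * np.pi * k * j[None, :] / n_sensors + theta))

U, s, Vh = np.linalg.svd(X, full_matrices=False)

def fourier_mode(k):
    """Unit-norm spatial Fourier mode with k cycles across the array."""
    return np.exp(2j * np.pi * k * j / n_sensors) / np.sqrt(n_sensors)

# Overlap of the i-th spatial singular vector with the k-th Fourier mode.
overlaps = [np.abs(np.vdot(fourier_mode(k), Vh[k - 1])) for k in (1, 2, 3)]
amplitude_ratios = s[:3] / s[0]                   # compare with 1, 1/2, 1/3
```

      With irregular, sparse sampling the Fourier modes are no longer orthogonal on the sensor positions, so this equivalence holds only approximately; quantifying that departure is what the surrogate tests are meant to address.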

      Third, the proposed method is not thoroughly tested using simulated data. Hence it remains unclear how accurate the estimated power spectra actually are.

      We now include a new surrogate testing procedure, which takes as inputs the empirical data and a model signal (of known spatial frequency) in various proportions. Thus, we test both the impact of a small amount of surrogate signal on the empirical signal, and the impact of ‘noise’ (in the form of a small amount of empirical signal) added to the well-defined surrogate signal.
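
      A mixing procedure of this kind could be sketched as follows. The details are assumptions made for illustration and are not taken from the manuscript: the surrogate is a plane wave of known wavelength over hypothetical three-dimensional contact coordinates, mixing is a weighted sum of unit phasors, and the helper names `plane_wave_phase` and `mix_phases` are invented here. Random values stand in for one sample of measured phase.

```python
import numpy as np

def plane_wave_phase(positions, wavelength, direction, offset=0.0):
    """Phase (radians) of a plane wave with known wavelength at each contact."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    return 2.0 * np.pi * (positions @ direction) / wavelength + offset

def mix_phases(empirical_phase, surrogate_phase, weight):
    """Blend two phase maps as unit phasors; `weight` is the surrogate proportion in [0, 1]."""
    mixed = (1.0 - weight) * np.exp(1j * empirical_phase) + weight * np.exp(1j * surrogate_phase)
    return np.angle(mixed)

rng = np.random.default_rng(1)
positions = rng.uniform(-8.0, 8.0, size=(40, 3))  # hypothetical contact coordinates (cm)
empirical = rng.uniform(-3.0, 3.0, size=40)       # stand-in for one sample of measured phase
surrogate = plane_wave_phase(positions, wavelength=12.0, direction=[1.0, 0.0, 0.0])

# Sweep from purely empirical (0.0) to purely surrogate (1.0).
sweep = {w: mix_phases(empirical, surrogate, w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

      Running the spatial-frequency estimator on each entry of `sweep` and checking that the estimated peak moves toward the known wavelength as the weight increases would give a ground-truth check on the estimator, although it does not model volume conduction or the recording reference.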

      In addition, there are a number of technical issues and limitations that need to be addressed or clarified (see recommendations to the authors).

      My assessment is that the conclusions are not completely supported by the analyses. What would convince me, is if the method is tested on simulated cortical activity in a more realistic set-up. I do believe, however, that if the authors can convincingly show that the estimated spatial spectra are accurate, the study will have an impact on the field. Regarding the methodology, I don't think that it will become a standard method in the field due to its ad hoc nature and well-developed alternatives.

      Simulations of cortical activity do not seem the most direct way to achieve this goal. The first author has published in this area (Liley et al., 1999; Wright et al., 2001), and such simulations, for both bulk and neuronally based simulations, readily display traveling wave activity at low spatial frequencies (indeed, this was the origin of the present scientific journey). The manuscript outlines these results in the introduction, as well as theoretical treatments proposing the same. Several other recent studies have highlighted the appearance of large-scale travelling waves using connectome-based models (https://www.biorxiv.org/content/10.1101/2025.07.05.663278v1; https://www.nature.com/articles/s41467-024-47860-x), which we do not include in the manuscript for reasons of brevity. In short, the emergence of TW phenomena in models is partly a function of the assumptions put into them (i.e., spatial damping, boundary conditions, parameterization of connection fields) and would therefore be inconclusive in our view.

      Instead, we rely on the advantages provided by the way our central research question has been posed: that the spatial frequency distribution of grey matter signal can determine whether extra-cranial TWs are artefactual. The newly introduced surrogate methods reflect this advantage by directly adding ground truth spatial frequency components to individual sample measurements. This is a less expensive option than making cortical simulations to achieve the same goal.

      For the same reasons, we include testing of the methods using real cortical signals with MEG arrays (for which we could test the effects of increasing sparseness of contacts, test the effects of average referencing, and also construct surrogate time-series with alternative spectra).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Major points

      Methods, Page 18: "... using notch filters to remove the 50Hz line signal and its harmonics ...": The sEEG data appear to have been recorded in North America, where the line frequency is 60 Hz. Is this perhaps a typo, or was a 50 Hz notch filter in fact applied here (which would be a mistake)?

      This has now been fixed in the text to read 60Hz. This is the notch filter that was applied.

      Minor points

      (1) While the authors do state that they are analyzing the "spatial frequency spectrum of phase dynamics" in the abstract, this could be more clearly emphasized. Specifically, the difference between signal power at different spatial frequencies (as analyzed by a standard Fourier analysis) and the organization of phase in space (as done here) could be more clearly distinguished.

      We now address this point explicitly on lines 167-172. We now include at the end of the results additional analyses where the TF power is included. This means that the effects of including signal power at different temporal frequencies can be directly compared to our main analysis of the SF spectrum of the phase dynamics.

      (2) Figure 1A-C: It was not immediately clear what the lengths provided in these panels (e.g. "> 40 cm cortex", "< 10 cm", "< 30 cm") were meant to indicate. This could be made clearer.

      Now fixed in the caption.

      (3) Figure 2A: If this is surrogate data to explain the analysis technique, it would be helpful to note explicitly at this point.

      This Figure has been completely reworked, and now the status of the examples (from illustrative toy models to actual MEG data) should be clearer.

      (4) Figure 4A: Why change from "% explained variance" for the example data in Figure 2C to arbitrary units at this point?

      This has now been explicitly stated in the methods (lines 1033-1036).

      (5) Page 15: "This means either the results were biased by a low pass filter, or had a maximum measurable...": If the authors mean that the low-pass filter is due to spatial blurring of neural activity in the EEG signal, it would be helpful to state that more directly at this point.

      Now stated directly, lines 567-568.

      (6) Page 23: "...where |X| is the complex magnitude of X...": The modulus operation is defined on a complex number, yet here is applied to a vector of complex numbers. If the operation is elementwise, it should be defined explicitly.

      ‘Elementwise’ is now stated explicitly (line 1020).
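      A minimal sketch of what an elementwise modulus means for a vector of complex numbers. The function name and the example values are illustrative only and do not come from the manuscript.

```python
def elementwise_modulus(x):
    """Apply the complex modulus to each entry of a vector separately."""
    return [abs(z) for z in x]

# Illustrative complex vector X; |X| is then a vector of the same length.
X = [3 + 4j, 1j, -2 + 0j]
print(elementwise_modulus(X))
```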

      Reviewer #3 (Recommendations for the authors):

      In the submitted manuscript, the authors propose a method to estimate spatial (phase) spectra from irregularly sampled oscillatory cortical activity. They apply the method to intracranial (iEEG) data and argue that cortical activity is organized into global waves up to the size of the entire cortex. If true, this finding is certainly of interest, and I can imagine that it has profound implications for how we think about the functional organization of cortical activity.

      We have added a section to the discussion outlining the most radical of these implications: what does it mean to do source localization when non-local signals dominate? Lines 670-681.

      The manuscript is well-written, with comprehensive introduction and discussion sections, detailed descriptions of the results, and clear figures. However, the proposed method comprised several ad hoc elements and is not well-founded mathematically, its performance is not adequately assessed, and its limitations are not sufficiently discussed. As such, the study failed to convince (me) of the correctness of the main conclusions.

      We now include direct surrogate testing of the method. We have also improved the mathematical explanation to show that the link between Fourier analysis and SVD is not ad hoc, but well understood in both literatures. We have explicitly addressed in the text all of the limitations raised by the reviewers.

      Major comments

      (1) The main methodological contribution of the study is summarized in the introduction section:

      "The irregular sampling of cortical spatial coordinates via stereotactic EEG was partly overcome by the resampling of the phase data into triplets corresponding to the vertices of approximately equilateral triangles within the cortical sheet."

      There exist well-established Fourier methods for handling irregularly sampled data so it is unclear why the authors did not resort to these and instead proposed a rather ad hoc method without theoretical justification (see next comment).

      We have re-reviewed the literature on non-uniform Fourier analysis. We now briefly review the Fourier methods for handling irregularly sampled data (lines 155-162) and conclude that none of the existing methods can deal with the degree of irregularity, and especially sparsity, found for the grey-matter sEEG contacts.

      (2) In the Appendix, the authors write:

      "For appropriate signals, i.e., those with power that decreases monotonically with frequency, each of the first few singular vectors, v_k, is an approximate complex sinusoid with wavenumber equal to k."

      I don't think this is true in general and if it is, there must be a formal argument that proves it. Furthermore, is it also true for irregularly sampled data? And in more than one spatial dimension? Moreover, it is also unclear exactly how the spatial Fourier spectrum is estimated from the SVD.

      In response to these reviewer queries, we now spend considerably more time in the conceptual set-up of the manuscript, giving examples of where SVD can be used to estimate the Fourier spectrum. We have now unpacked the word ‘appropriate’ and we are now more exact in our phrasing. This is laid out in lines 843-850 of the manuscript. In addition, the methods now describe the mathematical links between Fourier analysis and SVD (lines 851-861 and 910-922).
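      The link between SVD and Fourier analysis can be illustrated in the simplest setting: regularly spaced sensors on a one-dimensional ring, with random mixtures of complex sinusoids whose amplitude falls with wavenumber. The NumPy sketch below is a toy example under those assumptions. It is not the authors' pipeline and says nothing about irregular or two-dimensional sampling. All sizes, the seed, and the 1/k amplitude profile are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_samples = 32, 4000
wavenumbers = np.array([1, 2, 3, 4])
amplitudes = 1.0 / wavenumbers            # amplitude decreases with wavenumber

# Spatial basis: one complex sinusoid per wavenumber on a regular ring.
x = np.arange(n_sensors)
basis = np.exp(2j * np.pi * np.outer(wavenumbers, x) / n_sensors)

# Independent random complex coefficients per time sample, scaled by amplitude.
coeffs = (rng.standard_normal((n_samples, 4))
          + 1j * rng.standard_normal((n_samples, 4))) * amplitudes
data = coeffs @ basis                     # shape (n_samples, n_sensors)

# vh = V^H; its rows are the spatial patterns, ordered by singular value.
_, s, vh = np.linalg.svd(data, full_matrices=False)

# Dominant spatial FFT bin of each of the first four spatial patterns.
peak_bins = [int(np.argmax(np.abs(np.fft.fft(vh[k])))) for k in range(4)]
print(peak_bins)
```

      In this idealised case each leading spatial pattern is dominated by a single wavenumber, and the order follows the amplitude profile. Whether the same holds for sparse, irregular arrays is exactly what the surrogate tests in the manuscript need to establish.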

      The authors write:

      "The spatial frequency spectrum can therefore be estimated using SVD by summing over the singular values assigned to each set of singular vectors with unique (or by binning over a limited range of) spatial frequencies. This procedure is illustrated in Figure 1A-C."

      First, the singular vectors are ordered by decreasing values of the corresponding singular values. Hence, if the singular values are used to estimate spectral power, the estimated spectrum will necessarily decrease with increasing spatial frequency (as can be seen in Figure 2C). Then how can traveling waves be detected by looking for local maxima of the estimated power spectra?

      TWs are not detected by looking for local maxima in the spectra. Our work has focussed on the global wave maps derived from the SVD of phase (i.e., k=1-3), which also explain most of the variance in phase. This is now mentioned in the caption to Figure 3 (lines 291-294).

      Second, how are spatial frequencies assigned to the different singular vectors? The proposed method for estimating spatial power spectra from irregularly sampled data seems rather ad hoc and it is not at all clear if, and under what conditions, it works and how accurate it is.

      The new version of the manuscript uses a combination of the method previously presented (the multi-scale differencing) and the method previously outlined in the supplementary materials (doing complex-valued SVD on the spatial vectors of phase). We hope that along with the additional expository material in the methods the new version is clearer and seems less ad hoc to the reviewer. Certainly, there are deep and well-understood links between Fourier analysis and SVD, and we hope we have brought these into focus now.

      (3) The authors define spatial power spectra in three-dimensional Euclidean space, whereas the actual cortical activity occurs on a two-dimensional sheet (the union of two topological 2-spheres). As such, it is not at all clear how the estimated wavelengths in three-dimensional space relate to the actual wavelengths of the cortical activity.

      We define spatial power spectra on the folded cortical sheet, rather than Cartesian coordinates. We use geodesic distances in all cases where a distance measurement is required. We have included two new figures (Figure 5 and Supplementary Figure1) showing the mapping of the triangles onto the cortical sheet, which should bring this point home.

      (4) The authors' analysis of the iEEG data is subject to a caveat that is not mentioned in the manuscript: As a reference for the local field potentials, the average white-matter signal was used and this can lead to artifactual power at low spatial frequencies. This is because fluctuations in the reference signal are visible as standing waves in the recording array. This might also explain the observation that

      "A surprising finding was that the shape of the spatial frequency spectrum did not vary much with temporal frequency."

      because fluctuations in the reference signal are expected to have power at all temporal frequencies (1/f spectrum). When superposed with local activity at the recording electrodes, this leads to spurious power at low spatial frequencies. Can the authors exclude this interpretation of the results?

      The new version of the manuscript deals explicitly with this potential confound (lines 454-467). First, the artefactual global synchrony due to the reference signal (the DC component in our spatial frequency spectra of phase) is at a distinct frequency from the lowest SF of interest here. The lowest spatial frequency is a function of the maximum spatial range of the recording array and does not overlap in our method with the DC component, despite the loss of SF resolution due to the noise of the spatial irregularity of the recording array. This can be seen from consideration of the SF tuning (Figure 4) for the MEG wave maps shown in Figure 3, and the spectra generated for sparse MEG arrays in Supplementary Figure 5. Additionally, this question led us to a series of surrogate tests which are now included in the manuscript. We used MEG to test for the effects of average reference, since in this modality the reference-free case is available. The results show that even after imposing a strong and artefactual global synchrony, the method is highly robust to inflation of the DC component, which either way does not strongly influence the SF estimates in the range of interest (4c/m to 12c/m for the case of MEG).

      (5) Related to the previous comment: Contrary to the authors' claims, local field potentials are susceptible to volume conduction, particularly when average references are used (see e.g. https://www.cell.com/neuron/fulltext/S0896-6273(11)00883-X)

      Methods exist to mitigate these effects (e.g. taking first- or second-order spatial differences of the signals). I think this issue deserves to be discussed.

      We have reviewed this research and do not find it to be a problem. The authors cited by the reviewer were concerned with unacknowledged volume conduction up to 1 cm for LFP. The maximum spatial frequency we report here is 50c/m, equivalent to a wavelength of 2cm. While the intercontact distance on the sEEG electrodes was 0.5cm, in practice the smallest equilateral triangles (i.e., between two electrodes) to be found in the grey matter were around 2cm in linear size. We make no statements about SF in the 1cm range. We do now cite this paper and mention this short-range volume conduction (lines 602-605). The method of taking derivatives has the same problems as source localization methods: both remove artefactual correlations (volume conduction) as well as real correlations (the low SF interactions of interest here). We mention this now at lines 667-669. In addition, our method to remove negative SF components from the LSVs ameliorates the effects of average referencing. There are now more details in the Methods about this step (lines 924-947), as well as a new supplementary figure illustrating its effects on signal with a known SF spectrum (MEG, Supplementary Figure 6).
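      The conversion between spatial frequency in cycles per metre and wavelength used in this reply is simple arithmetic. The helper below is a sketch for checking it, and the function name is hypothetical.

```python
def wavelength_cm(cycles_per_metre):
    """Convert a spatial frequency in cycles per metre to a wavelength in cm."""
    return 100.0 / cycles_per_metre

print(wavelength_cm(50))   # the maximum SF reported for sEEG
print(wavelength_cm(4))    # the lower end of the MEG range of interest
```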

      (6) Could the authors add an analysis that excludes the possibility that the observed local maxima in the spectra are a necessary consequence of the analysis method, rather than reflecting true maxima in the spectra? A (possibly) similar effect can be observed in ordinary Fourier spectra that are estimated from zero-mean signals: Because the signals have zero mean, the power spectrum at frequency zero is close to zero and this leads to an artificial local maximum at low frequencies.

      We acknowledge the reviewer’s mathematical point. We do not agree that it could be an issue, though it is important to rule it out definitively. First, removing the DC component will only produce an artefactual low SF peak if the power at low SF is high. This may occur in the reviewer’s example only because temporal frequency has a ~1/f spectrum. If the true spectrum is flat, or increasing in power with f, no such artificial low SF will be produced (see Supplementary Figure 5G). Additionally,

      (1) The DC component is well separated from the low SF components in our method;

      (2) We now include several surrogate methods which show that our method finds the correct spectral distribution and is not just finding a maximum at low SFs due to the suggested effect (subtraction of the DC component). Analysis of separated wave maps in MEG (Figures 3 & 4) shows the expected peaks in SF, increasing in peak SF for each family of maps when wavenumber increases (roughly three k=1 maps, three k=2 etc.). A specific surrogate test for this query was also undertaken by creating a reverse SF spectrum in MEG phase data, in which the spectrum goes linearly with f over the SF range of interest, rather than the usual 1/f. Our method correctly finds the former spectrum (Supplementary Figure 5). Additionally, we tested for the effects of introducing the average reference and the effects of our method to remove the DC component of the phase SF spectrum (Supplementary Figure 6). We can definitively rule out the reviewer’s concern.
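      The first argument above, that removing the DC component creates a spurious low-frequency peak only when the underlying spectrum falls with frequency, can be checked on idealised spectra. The sketch below uses made-up spectra (exactly 1/f and exactly linear in f) and a hypothetical helper. It is not one of the surrogate analyses in the manuscript.

```python
def interior_local_maxima(spectrum):
    """Indices whose value exceeds both neighbours (endpoints excluded)."""
    return [i for i in range(1, len(spectrum) - 1)
            if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]

n_bins = 8
# Bin 0 is the DC component, set to zero as for a zero-mean signal.
falling = [0.0] + [1.0 / k for k in range(1, n_bins)]   # ~1/f above DC
rising = [0.0] + [float(k) for k in range(1, n_bins)]   # linear in f above DC

print(interior_local_maxima(falling))
print(interior_local_maxima(rising))
```

      With the falling spectrum the first bin after DC becomes a local maximum purely because the DC bin was zeroed. With the rising spectrum no interior maximum appears.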

      A related issue (perhaps) is the observation that the location of the maximum (i.e. the peak spatial frequency of cortical activity) depends on array size: If cortical activity indeed has a characteristic wavelength (in the sense of its spectrum having a local maximum) would one not expect it to be independent of array size?

      This is only true when making estimates for relatively clean sinusoidal signals, and not from broad-band signals. Fourier analysis and our related SVD methods are very much dependent on maximum array size used to measure cortical signals. This is why the first frequency band (after the DC component) in Fourier analysis is always at a frequency equivalent to 1/array_size, even if the signal is known to contain lower frequency components. We now include a further illustration of this in Figure 3, a more detailed exposition of this point in the methods, and in Supplementary Figure 11 we provide a more detailed example of the relation between Fourier analysis and SVD when grids with two distinct scales are used.

      In short, it is not possible, mathematically, to measure wavelengths greater than the array size in broad-band data. This is now stated explicitly in the manuscript (lines 143-144). A common approach in Neuroscience research is to first do narrowband filtering, then use a method that can accurately estimate ‘instantaneous’ phase change, such as the Hilbert transform. This is not possible for highly irregular sEEG arrays.
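      For a regularly sampled one-dimensional array, the dependence of the lowest resolvable frequency on array size can be sketched as follows. The contact count and spacing are hypothetical, and the regular grid is a simplification of the irregular sEEG case discussed here.

```python
def fourier_bin_frequencies(n_samples, spacing_m):
    """Non-negative DFT bin frequencies (cycles per metre) for a regular 1-D array."""
    array_size = n_samples * spacing_m
    return [k / array_size for k in range(n_samples // 2 + 1)]

# Hypothetical array: 20 contacts spaced 1 cm apart, so array_size = 0.2 m.
bins = fourier_bin_frequencies(20, 0.01)
print(bins[0], bins[1])
```

      The first bin after DC sits at 1/array_size. Any broad-band power at longer wavelengths cannot be assigned a bin of its own.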

      (7) The proposed method of estimating wavelength from irregularly sampled three-dimensional iEEG data involves several steps (phase-extraction, singular value decomposition, triangle definition, dimension reduction, etc.) and it is not at all clear that the concatenation of all these steps actually yields accurate estimates.

      Did the authors use more realistic simulations of cortical activity (i.e. on the convoluted cortical sheet) to verify that the method indeed yields accurate estimates of phase spectra?

      We have now included detailed surrogate testing, in which varying combinations of sEEG phase data and veridical surrogate wavelengths are added together.

      See our reply to the public review comments. We assess that real neurophysiological data (here, sEEG plus surrogate and MEG manipulated in various ways) is a more accurate way to address these issues. In our experience, large-scale TWs appear spontaneously in realistic cortical simulations, and we now cite the relevant papers in the manuscript (line 53).

      Minor comments

      (1) Perhaps move the first paragraph of the results section to the Introduction (it does not describe any results).

      So moved.

      (2) The authors write:

      "The stereotactic EEG contacts in the grey matter were re-referenced using the average of low-amplitude white matter contacts"

      Does this mean that the average is taken over a subset of white-matter contacts (namely those with low amplitude)? Or do the authors refer to all white-matter contacts as "low-amplitude"? And did contacts on different needles have different references? Or were the contacts from all needles pooled?

      A subset of white-matter contacts was used for re-referencing, namely the 50% with the lowest-amplitude signals. This subset was used to construct a pooled, single, average reference. We have rephrased the sentences referring to this procedure to improve clarity (lines 202 and 743-745).
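      The re-referencing step described in this reply can be sketched as below. The amplitude measure (root-mean-square per contact), the function name, and the toy arrays are assumptions made for illustration, since the reply does not state how amplitude was quantified.

```python
import numpy as np

def rereference(grey, white):
    """Subtract the mean of the lowest-amplitude half of white-matter contacts.

    grey, white: arrays of shape (n_contacts, n_samples).
    """
    rms = np.sqrt(np.mean(white ** 2, axis=1))   # assumed amplitude measure
    n_keep = max(1, white.shape[0] // 2)         # lowest 50% of contacts
    keep = np.argsort(rms)[:n_keep]
    reference = white[keep].mean(axis=0)         # single pooled reference
    return grey - reference, keep

# Toy data: four white-matter contacts with constant levels 1, 4, 2, 3.
white = np.array([[1.0, 1, 1], [4, 4, 4], [2, 2, 2], [3, 3, 3]])
grey = np.array([[10.0, 10, 10]])
referenced, kept = rereference(grey, white)
print(sorted(kept.tolist()), referenced)
```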

    1. eLife Assessment

      This study offers important insight into the pathogenic basis of intragenic frameshift deletions in the carboxy-terminal domain of MECP2, which account for some Rett syndrome cases, yet similar variants also appear in unaffected individuals. Using base editing and mouse models, the authors present convincing evidence supporting the pathogenicity of select deletion variants, with potential implications for therapeutic development. However, comments regarding the analysis of publicly available genetic databases should be addressed to strengthen the conclusions and provide greater clarity to the field.

    2. Reviewer #1 (Public review):

      Summary:

      The authors scrutinized differences in C-terminal region variant profiles between Rett syndrome patients and healthy individuals and showed that subtle genetic alterations can produce either benign or pathogenic outcomes. This has important implications for Rett syndrome diagnosis and suggests a therapeutic strategy. This work will be beneficial to clinicians and basic scientists who work on Rett syndrome, and carries the potential to be applied to other Mendelian rare diseases.

      Strengths:

      Well-designed genetic and molecular experiments translate genetic differences into functional and clinical changes. This is a unique study resolving subtle changes in sequences that give rise to dramatic phenotypic consequences.

      Weaknesses:

      There are many base-editing and protein-expression changes throughout the manuscript, and they cause confusion. It would be helpful to readers if authors could provide a simple summary diagram at the end of the paper.

    3. Reviewer #2 (Public review):

      Summary:

      This study by Guy and Bird and colleagues is a natural follow-up to their 2018 Human Molecular Genetics paper, further clarifying the molecular basis of C-terminal deletions (CTDs) in MECP2 and how they contribute to Rett syndrome. The authors combine human genetic data with well-designed experiments in embryonic stem cells, differentiated neurons, and knock-in mice to explain why some CTD mutations are disease-causing while others are harmless. They show that pathogenic mutations create a specific amino acid motif at the C-terminus, where +2 frameshifts produce a PPX ending that greatly reduces MeCP2 protein levels (likely due to translational stalling) whereas +1 frameshifts generating SPRTX endings are well tolerated.

      Strengths:

      This is a comprehensive and rigorous study that convincingly pinpoints the molecular mechanism behind CTD pathogenicity, with strong agreement between the cell-based and animal data. The authors also provide a proof of principle that modifying the PPX termination codon can restore MeCP2-CTD protein levels and rescue symptoms in mice. In addition, they demonstrate that adenine base editing can correct this defect in cultured cells and increase MeCP2-CTD protein levels. Overall, this is a well-executed study that provides important mechanistic and translational insight into a clinically important class of MECP2 mutations.

      Weaknesses:

      The adenine base editing to change the termination codon is shown to be feasible in generated cell lines, but has yet to be shown in vivo in animal models.

    4. Reviewer #3 (Public review):

      Summary:

      Guy et al. explored the variation in the pathogenicity of carboxy-terminal frameshift deletions in the X-linked MECP2 gene. Loss-of-function variants in MECP2 are associated with Rett syndrome, a severe neurodevelopmental disorder. Although hundreds of pathogenic MECP2 variants have been found in people with Rett syndrome, 8 recurrent point mutations are found in ~65% of disease cases, and frameshift insertions/deletions (indels) variants resulting in production of carboxy-terminal truncated (CTT) MeCP2 protein account for ~10% of cases. Many of these occur in a "deletion prone region" (DPR) between c.1110-1210, with common recurrent deletions c.1157_1197del (CTD1) and c.1164_1207del (CTD2). While two major protein functional domains have been defined in MeCP2, the methyl-binding domain (MBD) and the NCoR interacting domain (NID), the functional role of the carboxy-terminal domain (CTD, beyond the NID, predicted to have a disordered protein structure) has not been identified, and previous work by this group and others demonstrated that a Mecp2 "minigene" lacking the CTD retains MeCP2 function, suggesting that the CTD is dispensable. This raises an important question: If the CTD is dispensable, what is the pathogenic basis of the various CTT frameshift variants? Prior work from this group demonstrated that genetically engineered mice expressing the CTD1 variant had decreased expression of Mecp2 RNA and MeCP2 protein and decreased survival, but those expressing the CTD2 variant had normal Mecp2 RNA and protein and survival. However, they noted that differences between the mouse and human coding sequences resulted in different terminal sequences between the two common CTD, with CTD1 ending in -PPX in both mouse and human, but CTD2 ending in -PPC in human but -SPX in mouse, and in the previous paper they demonstrated that humanized mouse ES cells (edited to have the same -PPX termination) containing the CTD2 deletion had decreased Mecp2 RNA and protein levels. This previous work provides the underlying hypothesis that they sought to explore, which is that the pathological basis of disease-causing CTD relates to the formation of truncated proteins that end with a specific amino acid sequence (-PPX), which leads to decreased mRNA and protein levels, whereas tolerated, non-pathogenic CTD do not lead to production of truncated proteins ending in this sequence and retain normal mRNA/protein expression.

      In this manuscript, they evaluate missense variants, in-frame deletions, and frameshift deletions within the DPR from the Genome Aggregation Database (gnomAD) and find that the "apparently" normal individuals within gnomAD had numerous tolerated missense variants and in-frame deletions within this region, as well as frameshift deletions (in hemizygous males) in the defined region. All of the gnomAD deletions within this region resulted in terminal amino acid sequences -SPRTX (due to +1 frameshift), whereas nearly all deletion variants in this region from people with Rett syndrome (from the ClinVar copy of the former RettBase database) had a terminal -PPX sequence, due to a +2 frameshift. They hypothesized that terminal proline codons causing ribosomal stalling and "nonsense mediated decay like" degradation of mRNA (with subsequent decreased protein expression) was the basis of the specific pathogenicity of the +2 frameshift variants, and that utilizing adenine base editors (ABE) to convert the termination codon to a tryptophan could correct this issue. They demonstrate this by engineering the change into mouse embryonic stem cell lines and mouse lines containing the CTD1 deletion and show that this change normalized Mecp2 mRNA and protein levels and mouse phenotypes. Finally, they performed an initial proof-of-concept in an inducible HEK cell line and showed the ability of targeted ABE to edit the correct adenine and cause production of the expected larger truncated MeCP2 protein from CTD1 constructs.
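      As a small aid to the base-editing logic summarized above: an adenine base editor changes A to G, and in the standard genetic code TGG is the only tryptophan codon. The sketch below enumerates which stop codons can reach TGG with a single A-to-G change on the coding strand. Which stop codon terminates the frameshifted CTD1 reading frame is not stated in this review, so all three are listed. The function name is hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")
TRP_CODON = "TGG"

def single_a_to_g_edits(codon):
    """All codons reachable by changing exactly one A to G on the coding strand."""
    return {codon[:i] + "G" + codon[i + 1:]
            for i, base in enumerate(codon) if base == "A"}

for stop in STOP_CODONS:
    products = single_a_to_g_edits(stop)
    print(stop, sorted(products), TRP_CODON in products)
```

      TAG and TGA each convert to tryptophan with one edit, whereas TAA yields another stop codon after a single edit and would need two.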

      The findings of this manuscript provide a level of support for their hypothesis about the pathogenicity versus non-pathogenicity of some MECP2 CTT intragenic deletions and provide preliminary evidence for a novel therapeutic approach for Rett syndrome; however, limitations in their analysis do not fully support the broader conclusions presented.

      Strengths:

      (1) Utilization of publicly available databases containing aggregated genetic sequencing data from adult cohorts (gnomAD) and people with Rett syndrome (ClinVar copy of RettBase) to compare differences in the composition of the resulting terminal amino acid sequences resulting from deletions presumed to be pathogenic (n+2) versus presumed to be tolerated (n+1).

      (2) Evaluation of a unique human pedigree containing an n+1 deletion in this region that was reported as pathogenic, with demonstration of inheritance of this from the unaffected father and presence within other unaffected family members.

      (3) Development of a novel engineered mouse model of a previously assumed n+1 pathogenic variant to demonstrate lack of detrimental effect, supporting that this is likely a benign variant and not causative of Rett syndrome.

      (4) Creation and evaluation of novel cell lines and mouse models to test the hypothesis that the pathogenicity of the n+2 deletion variants could be altered by a single base change in the frameshifted stop codon.

      (5) Initial proof-of-concept experiments demonstrating the potential of ABE to correct the pathogenicity of these n+2 deletion variants.
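      The n+1 versus n+2 distinction in the points above can be made concrete with the CTD1 and CTD2 coordinates quoted in the summary. The sketch assumes that these labels refer to the deleted length modulo 3 (3n+1 versus 3n+2 bases), a reading that is not spelled out in the review. The function names are hypothetical.

```python
def deletion_length(start, end):
    """Number of deleted bases for an inclusive coding-sequence range."""
    return end - start + 1

def frame_class(start, end):
    """Deleted length modulo 3: 0 is in-frame, 1 and 2 are the two frameshift classes."""
    return deletion_length(start, end) % 3

# Coordinates as quoted in the review summary for the two recurrent deletions.
deletions = {"CTD1": (1157, 1197), "CTD2": (1164, 1207)}
for name, (start, end) in deletions.items():
    print(name, deletion_length(start, end), frame_class(start, end))
```

      Under that assumption both recurrent human deletions fall in the same class, which is consistent with the summary's statement that both end in -PPX in the humanized sequence.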

      Weaknesses:

      (1) While the use of the large aggregated gnomAD genetic data benefits from the overall size of the data, the presence of genetic variants within this collection does not inherently mean that they are "neutral" or benign. While gnomAD does not include children, it does include aggregated data from a variety of projects targeting neuropsychiatric (and other) conditions, so there is information in gnomAD from people with various medical/neuropsychiatric conditions. The authors do make some acknowledgement of this and argue that the presence of intragenic deletion variants in their region of interest in hemizygous males indicates that it is highly likely that these are tolerated, non-pathogenic variants. Broadly, while it is likely true that gnomAD MECP2 variants found in hemizygous males are unlikely to cause Rett syndrome in heterozygous females, this does not necessarily mean that these variants have no potential to cause other, milder, neuropsychiatric disorders. As a clear example, within gnomAD, there is a hemizygous male with the rs28934908 C>T variant that results in p.A140V (p.A152V in e1 transcript numbering convention). This pathogenic variant has been found in a number of pedigrees with an X-linked intellectual disability pattern, in which males have a clear neurodevelopmental disorder and heterozygous females have mild intellectual disability (see PMIDs 12325019, 24328834 as representative examples of a large number of publications describing this). Thus, while their claim that hemizygous deletion variants in gnomAD are unlikely to cause Rett syndrome is reasonable, they cannot make the definitive statement that these variants are not pathogenic and completely benign, especially when they are only found in a very small number of individuals in gnomAD.

      (2) The authors focus exclusively on deletions within the "DPR", which they define as between c.1110-1210, and say that these deletions account for 10% of Rett syndrome cases. However, the published studies that are the basis for this 10% estimate include all genetic variants (frameshift deletions, insertions, complex insertion/deletions, nonsense variants) resulting in truncations beyond the NID. For example, Bebbington 2010 (PMID: 19914908) includes frameshift indels as early as c.905 and beyond c.1210. Further specific examples from RettBase are described below, but the important point is that their evaluation of only frameshift variants within c.1110-1210 is not truly representative of the totality of genetic variants that collectively are considered CTT and account for 10% of Rett cases.

      (3) The authors say that they evaluated the putative pathogenic variants contained within RettBase (which is no longer available, but the data were transferred to ClinVar) for all cases with Classic Rett syndrome and de novo deletion variants within their defined DPR domain. Looking at the data from the ClinVar copy of RettBase, there are a number (n=143) of C-terminal truncating variants (either frameshift or nonsense) present beyond the NID, but the authors only discuss 14 deletion frameshift variants in this manuscript. A number of these variants have molecular features that do not fall into the pathogenic classification proposed by the authors and are not addressed in the manuscript. They do not support the generalization of the conclusions presented in this manuscript, especially the conclusion that the pathogenicity of all C-terminal truncating variants can be determined according to the proposed n+2 rule, or that all of the 10% of people with Rett syndrome and C-terminal truncating variants could be treated by using a base editor to correct the -PPX termination codon.

      (4) The HEK-based system utilized is convenient for doing the initial experiments testing ABE; however, it represents an artificial system expressing cDNA without splicing. Canonical NMD is dependent on splicing, and while non-canonical "NMD-like" processes are less well understood, a concern is whether the artificial system used can adequately predict efficacy in a native setting that includes introns and splicing.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors scrutinized differences in C-terminal region variant profiles between Rett syndrome patients and healthy individuals and showed that subtle genetic alterations can produce either benign or pathogenic outcomes. This has important implications for Rett syndrome diagnosis and suggests a therapeutic strategy. This work will be beneficial to clinicians and basic scientists who work on Rett syndrome, and carries the potential to be applied to other Mendelian rare diseases.

      Strengths:

      Well-designed genetic and molecular experiments translate genetic differences into functional and clinical changes. This is a unique study resolving subtle changes in sequences that give rise to dramatic phenotypic consequences.

      Weaknesses:

      There are many base-editing and protein-expression changes throughout the manuscript, and they cause confusion. It would be helpful to readers if authors could provide a simple summary diagram at the end of the paper.

      We thank Reviewer #1 for their encouraging comments. As suggested, we will include a summary figure of the genetic changes we have made, and the resulting expression and phenotypic consequences.

      Reviewer #2 (Public review):

      Summary:

      This study by Guy and Bird and colleagues is a natural follow-up to their 2018 Human Molecular Genetics paper, further clarifying the molecular basis of C-terminal deletions (CTDs) in MECP2 and how they contribute to Rett syndrome. The authors combine human genetic data with well-designed experiments in embryonic stem cells, differentiated neurons, and knock-in mice to explain why some CTD mutations are disease-causing while others are harmless. They show that pathogenic mutations create a specific amino acid motif at the C-terminus, where +2 frameshifts produce a PPX ending that greatly reduces MeCP2 protein levels (likely due to translational stalling) whereas +1 frameshifts generating SPRTX endings are well tolerated.

      Strengths:

      This is a comprehensive and rigorous study that convincingly pinpoints the molecular mechanism behind CTD pathogenicity, with strong agreement between the cell-based and animal data. The authors also provide a proof of principle that modifying the PPX termination codon can restore MeCP2-CTD protein levels and rescue symptoms in mice. In addition, they demonstrate that adenine base editing can correct this defect in cultured cells and increase MeCP2-CTD protein levels. Overall, this is a well-executed study that provides important mechanistic and translational insight into a clinically important class of MECP2 mutations.

      Weaknesses:

      The adenine base editing to change the termination codon is shown to be feasible in generated cell lines, but has yet to be shown in vivo in animal models.

      We thank Reviewer #2 for their positive comments. We agree that an in vivo study demonstrating effective DNA base editing in our CTD-1 mouse model is the obvious next step, and this work is in progress. However, given the ever-increasing use of pre- and neonatal screening for genetic diseases, we felt it important to disseminate our findings as soon as possible. The family pedigree in Figure 3C is a clear demonstration of this need.

      Reviewer #3 (Public review):

      Summary:

      Guy et al. explored the variation in the pathogenicity of carboxy-terminal frameshift deletions in the X-linked MECP2 gene. Loss-of-function variants in MECP2 are associated with Rett syndrome, a severe neurodevelopmental disorder. Although hundreds of pathogenic MECP2 variants have been found in people with Rett syndrome, 8 recurrent point mutations are found in ~65% of disease cases, and frameshift insertions/deletions (indels) variants resulting in production of carboxy-terminal truncated (CTT) MeCP2 protein account for ~10% of cases. Many of these occur in a "deletion prone region" (DPR) between c.1110-1210, with common recurrent deletions c.1157_1197del (CTD1) and c.1164_1207del (CTD2). While two major protein functional domains have been defined in MeCP2, the methyl-binding domain (MBD) and the NCoR interacting domain (NID), the functional role of the carboxy-terminal domain (CTD, beyond the NID, predicted to have a disordered protein structure) has not been identified, and previous work by this group and others demonstrated that a Mecp2 "minigene" lacking the CTD retains MeCP2 function, suggesting that the CTD is dispensable. This raises an important question: If the CTD is dispensable, what is the pathogenic basis of the various CTT frameshift variants? Prior work from this group demonstrated that genetically engineered mice expressing the CTD1 variant had decreased expression of Mecp2 RNA and MeCP2 protein and decreased survival, but those expressing the CTD2 variant had normal Mecp2 RNA and protein and survival. However, they noted that differences between the mouse and human coding sequences resulted in different terminal sequences between the two common CTD, with CTD1 ending in -PPX in both mouse and human, but CTD2 ending in -PPC in human but -SPX in mouse, and in the previous paper they demonstrated that humanized mouse ES cells (edited to have the same -PPX termination) containing the CTD2 deletion had decreased Mecp2 RNA and protein levels. This previous work provides the underlying hypothesis that they sought to explore, which is that the pathological basis of disease-causing CTD relates to the formation of truncated proteins that end with a specific amino acid sequence (-PPX), which leads to decreased mRNA and protein levels, whereas tolerated, non-pathogenic CTD do not lead to production of truncated proteins ending in this sequence and retain normal mRNA/protein expression.

      In this manuscript, they evaluate missense variants, in-frame deletions, and frameshift deletions within the DPR from the Genome Aggregation Database (gnomAD) and find that the "apparently" normal individuals within gnomAD had numerous tolerated missense variants and in-frame deletions within this region, as well as frameshift deletions (in hemizygous males) in the defined region. All of the gnomAD deletions within this region resulted in the terminal amino acid sequence -SPRTX (due to a +1 frameshift), whereas nearly all deletion variants in this region from people with Rett syndrome (from the Clinvar copy of the former RettBase database) had a terminal -PPX sequence, due to a +2 frameshift. They hypothesized that terminal proline codons causing ribosomal stalling and "nonsense mediated decay like" degradation of mRNA (with subsequent decreased protein expression) were the basis of the specific pathogenicity of the +2 frameshift variants, and that utilizing adenine base editors (ABE) to convert the termination codon to a tryptophan codon could correct this issue. They demonstrate this by engineering the change into mouse embryonic stem cell lines and mouse lines containing the CTD1 deletion and show that this change normalized Mecp2 mRNA and MeCP2 protein levels and mouse phenotypes. Finally, they performed an initial proof-of-concept in an inducible HEK cell line and showed the ability of targeted ABE to edit the correct adenine and cause production of the expected larger truncated MeCP2 protein from CTD1 constructs.
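
      As a rough illustration of the frameshift classes discussed above, the following sketch computes the deleted length and its remainder modulo 3 for the two recurrent deletions, using the coordinates quoted in this review. It assumes the coordinates are inclusive cDNA positions and that the "n+1"/"n+2" labels refer to that remainder; the function names are hypothetical and are not taken from the manuscript.

```python
def deletion_length(start: int, end: int) -> int:
    """Number of deleted nucleotides for an inclusive cDNA interval."""
    return end - start + 1


def frame_class(start: int, end: int) -> int:
    """Remainder of the deleted length modulo 3 (0 would be in-frame)."""
    return deletion_length(start, end) % 3


# Coordinates as quoted in the review for the two recurrent deletions.
deletions = {"CTD1": (1157, 1197), "CTD2": (1164, 1207)}

for name, (start, end) in deletions.items():
    # Prints the name, the deleted length, and the remainder modulo 3.
    print(name, deletion_length(start, end), frame_class(start, end))
```

      Under these assumptions both recurrent deletions fall into the same remainder class, which is consistent with the review's description of them as +2 frameshift variants.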

      The findings of this manuscript provide a level of support for their hypothesis about the pathogenicity versus non-pathogenicity of some MECP2 CTT intragenic deletions and provide preliminary evidence for a novel therapeutic approach for Rett syndrome; however, limitations in their analysis do not fully support the broader conclusions presented.

      Strengths:

      (1) Utilization of publicly available databases containing aggregated genetic sequencing data from adult cohorts (gnomAD) and people with Rett syndrome (Clinvar copy of RettBase) to compare differences in the composition of the resulting terminal amino acid sequences resulting from deletions presumed to be pathogenic (n+2) versus presumed to be tolerated (n+1).

      (2) Evaluation of a unique human pedigree containing an n+1 deletion in this region that was reported as pathogenic, with demonstration of inheritance of this from the unaffected father and presence within other unaffected family members.

      (3) Development of a novel engineered mouse model of a previously assumed n+1 pathogenic variant to demonstrate lack of detrimental effect, supporting that this is likely a benign variant and not causative of Rett syndrome.

      (4) Creation and evaluation of novel cell lines and mouse models to test the hypothesis that the pathogenicity of the n+2 deletion variants could be altered by a single base change in the frameshifted stop codon.

      (5) Initial proof-of-concept experiments demonstrating the potential of ABE to correct the pathogenicity of these n+2 deletion variants.

      Weaknesses:

      (1) While the use of the large aggregated gnomAD genetic data benefits from the overall size of the data, the presence of genetic variants within this collection does not inherently mean that they are "neutral" or benign. While gnomAD does not include children, it does include aggregated data from a variety of projects targeting neuropsychiatric (and other) conditions, so there is information in gnomAD from people with various medical/neuropsychiatric conditions. The authors do make some acknowledgement of this and argue that the presence of intragenic deletion variants in their region of interest in hemizygous males indicates that it is highly likely that these are tolerated, non-pathogenic variants. While it is likely true that gnomAD MECP2 variants found in hemizygous males are unlikely to cause Rett syndrome in heterozygous females, this does not necessarily mean that these variants have no potential to cause other, milder, neuropsychiatric disorders. As a clear example, within gnomAD, there is a hemizygous male with the rs28934908 C>T variant that results in p.A140V (p.A152V in e1 transcript numbering convention). This pathogenic variant has been found in a number of pedigrees with an X-linked intellectual disability pattern, in which males have a clear neurodevelopmental disorder and heterozygous females have mild intellectual disability (see PMIDs 12325019, 24328834 as representative examples of a large number of publications describing this). Thus, while their claim that hemizygous deletion variants in gnomAD are unlikely to cause Rett syndrome is reasonable, they cannot make the definitive statement that these variants are not pathogenic and completely benign, especially when they are only found in a very small number of individuals in gnomAD.

      (2) The authors focus exclusively on deletions within the "DPR", which they define as between c.1110-1210, and say that these deletions account for 10% of Rett syndrome cases. However, the published studies that are the basis for this 10% estimate include all genetic variants (frameshift deletions, insertions, complex insertion/deletions, nonsense variants) resulting in truncations beyond the NID. For example, Bebbington 2010 (PMID: 19914908) includes frameshift indels as early as c.905 and beyond c.1210. Further specific examples from RettBase are described below, but the important point is that their evaluation of only frameshift variants within c.1110-1210 is not truly representative of the totality of genetic variants that collectively are considered CTT and account for 10% of Rett cases.

      (3) The authors say that they evaluated the putative pathogenic variants contained within RettBase (which is no longer available, but the data were transferred to Clinvar) for all cases with Classic Rett syndrome and de novo deletion variants within their defined DPR domain. Looking at the data from the Clinvar copy of RettBase, there are a number (n=143) of C-terminal truncating variants (either frameshift or nonsense) present beyond the NID, but the authors only discuss 14 deletion frameshift variants in this manuscript. A number of these variants have molecular features that do not fall into the pathogenic classification proposed by the authors and are not addressed in the manuscript. They therefore do not support the generalization of the conclusions presented in this manuscript, especially the conclusion that the pathogenicity of all C-terminal truncating variants can be determined according to their proposed n+2 rule, or that all of the 10% of people with Rett syndrome and C-terminal truncating variants could be treated by using a base editor to correct the -PPX termination codon.

      (4) The HEK-based system utilized is convenient for doing the initial experiments testing ABE; however, it represents an artificial system expressing cDNA without splicing. Canonical NMD is dependent on splicing, and while non-canonical "NMD-like" processes are less well understood, a concern is whether the artificial system used can adequately predict efficacy in a native setting that includes introns and splicing.

      We thank reviewer #3 for their constructive comments. A number of these relate to our analysis of databases of pathogenic (RettBASE) and non-pathogenic (gnomAD) variants. We disagree with their assertion that we are looking at only a small subset of RTT CTD mutations. We detail 14 different RTT CTDs in the manuscript, but these include the 3 most frequently occurring, which alone account for 121 RTT cases in RettBASE.

      We used the original RettBASE database for our analysis, which contained significantly more information than was transferred to Clinvar. We may not have made this sufficiently clear and will remedy this during revision of the manuscript.

      We stress that RettBASE contained many non-RTT causing mutations. For this reason, we employed stringent selection criteria to define a high-confidence set of RTT CTD alleles. Importantly, this set was chosen before any investigation of reading frame or C-terminal amino acid sequence. Our stringent set was selected based on three criteria: location within the C-terminal deletion prone region (CT-DPR), a diagnosis of Classical RTT and at least one case where that mutation had been shown to be absent from both parents (i.e. that it was a de novo mutation). This excluded a large number of CTD alleles which fitted the +2 frameshift/PPX ending category as well as some in other categories. There are good reasons to believe that the vast majority of genuinely pathogenic RTT CTD mutations do fall into this class.

      Concerning gnomAD CTDs, we chose to restrict our detailed analysis to those which are present in the hemizygous state, to exclude individuals in whom a pathogenic mutation is masked by skewed X-inactivation. Data from all zygosities are shown in Fig. 3, SF1.

      We will revise the manuscript to include tables of all extracted data relevant to this region, from both gnomAD and RettBASE, along with analysis of a less stringent set of RettBASE CTDs for reading frame and C-terminal amino acid sequence. We hope this will make clear the strength of the evidence for our conclusions.

      We agree with Reviewer #3 that inclusion of variants in gnomAD does not exclude the possibility that they may cause medical/psychiatric conditions other than RTT. This point is already mentioned in the Discussion, but we plan to emphasise it further. The pedigree included in the paper, as well as others that we have learned of, argue that loss of the C-terminus of MeCP2 has few if any phenotypic consequences, but more cases are needed to robustly assess this conclusion.

      We disagree that our HEK cell-based system is not suitable for testing efficacy of reagents for use on endogenous alleles in vivo. The editing process is not dependent on splicing, and we have shown in this manuscript that making this single base change has the same effect on an endogenous knock-in allele (CTD1 X>W) or a cDNA-based transgene (Flp-In T-REx CTD1 + base editing), namely, to increase the amount of truncated MeCP2 produced.

    1. eLife Assessment

      This study provides an important assessment of how body size influences the occurrence of macro-organisms in urban areas across the globe. Size in most plants, but only some animal families, was positively associated with urban tolerance. The data set is impressive, but the evidence for broad-scale conclusions is incomplete due to methodological issues that need to be resolved.

    2. Reviewer #1 (Public review):

      Summary:

      The authors integrate multiple large databases to test whether body sizes were positively associated with which species tolerate urban areas. In general, many plant families showed a positive association between body size and urban tolerance, whereas a smaller, though still non-trivial, percentage of animal families showed the same pattern. Notably, the authors are careful in the interpretation of their findings and provide helpful context for the ways that this analysis can be generative in shaping new hypotheses and theory around how urbanization influences biodiversity at large. They are careful to discuss how body size is an important trait, but the absence of a relationship between body size and urban tolerance in many families suggests a variety of other traits undergird urban success.

      Strengths:

      The authors aggregated a large dataset, but they also applied robust filters to ensure they had an adequate and representative number of detections for a given species, family, geography, etc. The authors also applied their analysis at multiple taxonomic scales (family and order), which allowed for a better interpretation of the patterns in the data and at what taxonomic scale body size might be important.

      Weaknesses:

      My main concern is that it is not fully clear how the measure of body size might influence the result. The authors were unable to obtain consistent measures of body size (mean, median, maximum, or sex variation). This, of course, could be very consequential as means and medians can differ quite a bit, and they certainly will differ substantially from a maximum. And of course, sex differences can be marked in multiple directions or absent altogether. The authors do note that they selected the measure that was most common in a family, but it was not clear whether species in that family that did not have that measure were removed or not. This could potentially shape the variability in the dataset and obscure true patterns. This may require additional clarity from the authors and is also a real constraint in compiling large data from disparate sources.

    3. Reviewer #2 (Public review):

      I have completed a thorough review of this paper, which seeks to use the large datasets of species occurrences available through GBIF to estimate variation in how large numbers of plant and animal species are associated with urbanization throughout the world, describing what they call the "species urbanness distribution" or SUD. They explore how these SUDs differ between regions and different taxonomic levels. They then calculate a measure of urban tolerance and seek to explore whether organism size predicts variation in tolerance among species and across regions.

      The study is impressive in many respects. Over the course of several papers, Callaghan and coauthors have been leaders in using "big [biodiversity] data" to create metrics of how species' occurrence data are associated with urban environments, and in describing variation in urban tolerance among taxa and regions. This work has been creative, novel, and it has pushed the boundaries of understanding how urbanization affects a wide diversity of taxa. The current paper takes this to a new level by performing analyses on over 94000 observations from >30,000 species of plants and animals, across more than 370 plant and animal taxonomic families. All of these analyses were focused on answering two main questions:

      (1) What is the shape of species' urban tolerance distributions within regional communities?

      (2) Does body size consistently correlate with species' urban tolerance across taxonomic groups and biogeographic contexts?

      Overall, I think the questions are interesting and important, the size and scope of the data and analyses are impressive, and this paper has a potentially large contribution to make in pushing forward urban macroecology specifically and urban ecology and evolution more generally.

      Despite my enthusiasm for this paper and its potential impact, there are aspects that could be improved, and I believe the paper requires major revision.

      Some of these revisions ideally involve being clearer about the methodology or arguments being made. In other cases, I think their metrics of urban tolerance are flawed and need to be rethought and recalculated, and some of the conclusions are inaccurate. I hope the authors will address these comments carefully and thoroughly. I recognize that there is no obligation for authors to make revisions. However, revising the paper along the lines of the comments made below would increase the impact of the paper and its clarity to a broad readership.

      Major Comments:

      (1) Subrealms

      Where does the concept of "subrealms" come from? No citation is given, and it could be said that this sounds like an idea straight out of Middle Earth. How do subrealms relate to known bioclimatic designations like the Köppen climate classification, which would arguably be more appropriate? Or are subrealms more socio-ecologically oriented? From what I can tell, each subrealm lumps together climatically diverse areas. It might be better and more tractable to break things down by continent, as the rationale for subrealms is unclear, and it makes the analyses and results more confusing. The authors rationalized the use of subrealms to account for potential intraspecific differences in species' response to urbanization, but that is never a core part of the questions or interpretation in the paper, and averaging across subrealms also accounts for intraspecific variation. Another issue with using the subrealm approach is that the authors only included a species if it had 100 observations in a given subrealm, leading to a focus on only the most common species, which may be biased in their SUD distribution. How many more species would be included if they did their analysis at the continental or global scale, and would this change the shape of SUDs?

      (2) Methods - urban score

      The authors describe their "urban score" as being calculated as "the mean of the distribution of VIIRS values as a relative species specific measure of a response to urban land cover."

      I don't understand how this is a "relative species-specific measure". What is it relative to? Figures S4 and S5 show the mean distribution of VIIRS for various taxa, and this mean looks to be an absolute measure. Mean VIIRS for a given species would be fine and appropriate as an "urban score", but the authors then state in the next sentence: "this urban score represents the relative ranking of that species to other species in response to urban land cover".

      That doesn't follow from the description of how this is calculated. Something is missing here. Please clarify and add an explicit equation for how the urban score is calculated because the text is unclear and confusing.
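
      To make the distinction in this comment concrete, here is a minimal sketch of the two readings of "urban score": an absolute per-species mean of VIIRS values, and a relative ranking of species derived from those means. The species names and VIIRS values are hypothetical, and neither function is claimed to be the authors' actual calculation.

```python
from statistics import mean

# Hypothetical VIIRS radiance values at the occurrence records of three species.
viirs_by_species = {
    "species_a": [0.2, 0.5, 0.3],
    "species_b": [12.0, 30.0, 18.0],
    "species_c": [2.0, 4.0, 3.0],
}


def urban_score(values):
    """Absolute reading: the mean VIIRS value across a species' records."""
    return mean(values)


def relative_rank(scores):
    """Relative reading: rank of each species among all species (1 = least urban)."""
    ordered = sorted(scores, key=scores.get)
    return {species: rank for rank, species in enumerate(ordered, start=1)}


scores = {sp: urban_score(vals) for sp, vals in viirs_by_species.items()}
ranks = relative_rank(scores)

for sp in viirs_by_species:
    print(sp, round(scores[sp], 3), ranks[sp])
```

      The mean is in VIIRS units and does not depend on which other species are in the dataset, whereas the rank changes whenever the species pool changes. An explicit equation in the manuscript would settle which of these the "urban score" is.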

      (3) Methods - urban tolerance

      How the authors are defining and calculating tolerance is unclear, confusing, and flawed in my opinion.

      Tolerance is a common concept in ecology, evolution, and physiology, typically defined as the ability for an organism to maintain some measure of performance (e.g., fitness, growth, physiological homeostasis) in the presence versus absence of some stressor. As one example, in the herbivory literature, tolerance is often measured as the absolute or relative difference in fitness of plants that are damaged versus undamaged (e.g., https://academic.oup.com/evolut/article/62/9/2429/6853425?login=true).

      On line 309, after describing the calculation of urban scores across subrealms, they write: "Therefore, a species could be represented across multiple subrealms with differing measures of urban tolerance (Fig. S4). Importantly, this continuous metric of urban tolerance is a relative measure of a species' preference, or affinity, to urban areas: it should be interpreted only within each subrealm".

      This is problematic on several fronts. First, the authors never define what they mean by the term "tolerance". Second, they refer to urban tolerance throughout the paper, but don't describe the calculation until lines 315-319, where they write (text in [ ] is from the reviewer):

      "Within each subrealm, we further accounted for the potential of different levels of urbanization by scaling each species' urban score by subtracting the mean VIIRS of all observations in the subrealm (this value is hereafter referred to as urban tolerance). This 'urban tolerance' (Fig. S5) value can be negative - when species under-occupy urban areas [relative to the average across all species] suggesting they actively avoid them - or positive - when species over-occupy urban areas [relative to the average across all species] suggesting they prefer them (i.e., ranging from urban avoiders to urban exploiters, respectively)."

      They are taking a relativized urban score and then subtracting the mean VIIRS of all observations across species in a subrealm. How exactly one interprets the magnitude isn't clear and they admit this metric is "not interpretative across subrealms".

      This is not a true measure of tolerance, at least not in the conventional sense of how tolerance is typically defined. The problem is that a species distribution isn't being compared to some metric of urbanness, but instead it is relative to other species' urban scores, where species may, on average, be highly urban or highly nonurban in their distribution, and this may vary from subrealm to subrealm. A measure of urban tolerance should be independent of how other species are responding, and should be interpretable across subrealms, continents, and the globe.

      I propose the authors use one of two metrics of urban tolerance:

      (i) Absolute Urban Tolerance = Mean VIIRS of species_i - Mean VIIRS of city centers

      Here, the mean VIIRS of city centers could be taken from the center of multiple cities throughout a subrealm, across a continent, or across the world. The units are the original VIIRS units, where 0 would correspond to species being centered on the most extreme urban habitats, and the most extreme negative values would correspond to species that occupy the most non-urban habitats (i.e., no artificial light at night). In essence, this measure of tolerance would quantify how far a species' distribution is shifted relative to the most highly urbanized habitat available.

      (ii) % Urban Tolerance = (Mean VIIRS of species_i - Mean VIIRS of city centers) / Mean VIIRS of city centers * 100%

      This metric provides a % change in species mean VIIRS distribution relative to the most urban habitats. This value could theoretically be negative or positive, but will typically be negative, with -100% being completely non-urban, and 0% being completely urban tolerant.

      Both of these metrics can be compared across the world, as they provide either absolute (metric i) or relative (metric ii) measures of urban tolerance that are comparable and easily interpretable in any region.
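
      The contrast between the manuscript's subrealm-centered value and the two metrics proposed here can be sketched as follows. All numbers are hypothetical (a species mean VIIRS of 5.0, subrealm means of 2.0 and 8.0, and a city-center mean of 50.0), and the function names are illustrative only.

```python
def subrealm_centered_tolerance(species_mean, subrealm_mean):
    """Value as described in the manuscript: species mean minus subrealm mean."""
    return species_mean - subrealm_mean


def absolute_tolerance(species_mean, city_center_mean):
    """Metric (i): shift of the species mean relative to city centers, in VIIRS units."""
    return species_mean - city_center_mean


def percent_tolerance(species_mean, city_center_mean):
    """Metric (ii): the same shift as a percentage of the city-center mean."""
    return (species_mean - city_center_mean) / city_center_mean * 100.0


species_mean = 5.0       # hypothetical mean VIIRS of one species
city_center_mean = 50.0  # hypothetical mean VIIRS of city centers

# The same species changes sign depending on the subrealm it is compared within.
print(subrealm_centered_tolerance(species_mean, 2.0))
print(subrealm_centered_tolerance(species_mean, 8.0))

# The proposed metrics do not depend on the other species in the region.
print(absolute_tolerance(species_mean, city_center_mean))
print(percent_tolerance(species_mean, city_center_mean))
```

      With these made-up inputs the subrealm-centered value flips from positive to negative for an identical species mean, which is the interpretability problem raised in this comment, while metrics (i) and (ii) stay fixed.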

      In summary, the definition of tolerance should be clear, the metric should be a true measure of tolerance that is comparable across regions, and an equation should be given.

      (4) Figure 1: The figure does not stand alone. For example, what is the hypothesis for thermophily or the temperature-size rule? The authors should expand the legend slightly to make the hypotheses being illustrated clearer.

      (5) SUDs: I don't agree with the conclusion given on line 83 ("pattern was consistent across subrealms and several taxonomic levels") or in the legend of Figure 2 ("there were consistent patterns for kingdoms, classes, and orders, as shown by generally similar density histograms shapes for each of these").

      The shapes of the curves are quite different, especially for the two Kingdoms and the different classes. I agree they are relatively consistent for the different taxonomic Orders of insects.

    4. Reviewer #3 (Public review):

      Summary:

      This paper reports on an association between body size and the occurrence of species in cities, which is quantified using an 'urban score' that can be visualized as a 'Species Urbanness Distribution' for particular taxa. The authors use species records from the Global Biodiversity Information Facility (GBIF) and link the occurrence data to nighttime lighting quantified using satellite data (Visible Infrared Imaging Radiometer Suite-VIIRS). They link the urban score to body size data to find 'heterogeneous relationship between body size and urban tolerance across the tree'. The results are then discussed with reference to potential mechanisms that could possibly produce the observed effects (cf. Figure 1).

      Strengths:

      The novelty of this study lies in the huge number of species analyzed and the comparison of results among animal taxa, rather than in a thorough analysis of what traits allow species to persist under urban conditions. Such analyses have been done by other studies using a much more thorough approach that employs presence-absence data as well as a suite of traits (for example, Hahs et al. 2023, Neate-Clegg et al. 2023). The dataset that the authors produced would also be very valuable if these raw data were published, both the cleaned species records as well as the body sizes.

      The paper could strongly add to our understanding of what species occur in cities when the open questions are addressed.

      Weaknesses:

      I value the approach of the authors, but I think the paper needs to be revised.

      In my view, the authors could more carefully validate their approach. Currently, any weakness or biases in the approach are quickly explained away rather than carefully explored. This concerns particularly the use of presence-only data, but also the calculation of the urban score.

      The vast majority of data in GBIF is presence-only data. This produces a strong bias in the analysis presented in the paper. For some taxa, it is likely that occurrences within the city are overrepresented, and for other taxa, the opposite is true (cf. Sweet et al. 2022). I think the authors should try to address this problem.

      The authors should compare their results to studies focusing on particular taxa where extensive trait-based analyses have already been performed, i.e., plants and birds. In fact, I strongly suggest that the authors should compare their results to previous studies on the relationship between traits, including body size and occurrences along a gradient of urbanisation, to draw conclusions about the validity of the approach used in the current study, which has a number of weaknesses.

      They should be more careful in coming up with post-hoc explanations of why the pattern found in this study makes sense or suggests a particular mechanism. This reviewer considers that there is no way in which the current study can disentangle the different possible mechanisms without further analyses and data, so I would suggest pointing out carefully how the mechanisms could be studied.

      More details should be given about the methodology. The readers should be able to understand the methods without having to read a number of other papers.

      References:

      Hahs, A. K., B. Fournier, M. F. Aronson, C. H. Nilon, A. Herrera-Montes, A. B. Salisbury, C. G. Threlfall, C. C. Rega-Brodsky, C. A. Lepczyk, and F. A. La Sorte. 2023. Urbanisation generates multiple trait syndromes for terrestrial animal taxa worldwide. Nature Communications 14:4751.

      Neate-Clegg, M. H. C., B. A. Tonelli, C. Youngflesh, J. X. Wu, G. A. Montgomery, Ç. H. Şekercioğlu, and M. W. Tingley. 2023. Traits shaping urban tolerance in birds differ around the world. Current Biology 33:1677-1688.

      Sweet, F. S. T., B. Apfelbeck, M. Hanusch, C. Garland Monteagudo, and W. W. Weisser. 2022. Data from public and governmental databases show that a large proportion of the regional animal species pool occur in cities in Germany. Journal of Urban Ecology 8:juac002.

    1. eLife Assessment

      The goal of this useful study is to examine learning-related changes in neural representations of global and local spatial reference frames in a spatial navigation task. Although the study addresses an interesting question, the evidence for neural representations in the hippocampus and retrosplenial cortex remains incomplete because of confounds in the experimental design and partial data analysis. There are further concerns about the framing of the study in the context of the relevant literature as well as the discussion.

    2. Reviewer #1 (Public review):

      Summary:

      In this paper, Qiu et al. developed a novel spatial navigation task to investigate the formation of multi-scale representations in the human brain. Over multiple sessions and diverse tasks, participants learned the location of 32 objects distributed across 4 different rooms. The key task was a "judgement of relative direction" task delivered in the scanner, which was designed to assess whether object representations reflect local (within-room) or global (across-room) similarity structures. In between the two scanning sessions, participants received extensive further training. The goal of this manipulation was to test how spatial representations change with learning.

      Strengths:

      The authors designed a very comprehensive set of tasks in virtual reality to teach participants a novel spatial map. The spatial layout is well-designed to address the question of interest in principle. Participants were trained in a multi-day procedure, and representations were assessed twice, allowing the authors to investigate changes in the representation over multiple days.

      Weaknesses:

      Unfortunately, I see multiple problems with the experimental design that make it difficult to draw conclusions from the results.

      (1) In the JRD task (the key task in this paper), participants were instructed to imagine standing in front of the reference object and judge whether the second object was to their left or right. The authors assume that participants solve this task by retrieving the corresponding object locations from memory, rotating their imagined viewpoint and computing the target object's relative orientation. This is a challenging task, so it is not surprising that participants do not perform particularly well after the initial training (performance between 60-70% accuracy). Notably, the authors report that after extensive training, they reached more than 90% accuracy.

      However, I wonder whether participants indeed perform the task as intended by the authors, especially after the second training session. A much simpler behavioural strategy is memorising the mapping between a reference object and an associated button press, irrespective of the specific target object. This basic strategy should lead to quite high success rates, since the same direction is always correct for four of the eight objects (the two objects located at the door and the two opposite the door). For the four remaining objects, the correct button press is still the same for four of the six target objects that are not located opposite to the reference object. Simply memorising the button press associated with each reference object should therefore lead to a high overall task accuracy without the necessity to mentally simulate the spatial geometry of the object relations at all.
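
      The expected accuracy of this simple strategy can be worked out from the counts given above. The sketch below assumes that all reference objects are probed equally often, that targets are sampled uniformly, and that trials with the target opposite the reference object are excluded; these assumptions are mine and may not match the actual trial list.

```python
from fractions import Fraction

# Reference objects for which one button press is always correct.
always_same = 4
# Remaining reference objects, where the memorised press is correct
# for four of the six targets that are not opposite the reference.
mixed = 4
mixed_hit_rate = Fraction(4, 6)

expected_accuracy = (always_same * 1 + mixed * mixed_hit_rate) / (always_same + mixed)

print(expected_accuracy)         # exact fraction
print(float(expected_accuracy))  # decimal approximation
```

      Under these assumptions, memorising one button press per reference object gives an expected accuracy of 5/6 (about 83%) without any spatial computation, well above the 60-70% reported after the initial training.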

      I also wonder whether the random effect coefficients might reflect interindividual differences in such a strategy shift - someone who learnt this relationship between objects and buttons might show larger increases in RTs compared to someone who did not.

      (2) On a related note, the neural effect that appears to reflect the emergence of a global representation might be more parsimoniously explained by the formation of pairwise associations between reference and target objects. Since both objects always came from the same room, an RDM reflecting how many times an object pair acted as a reference-target pair will correlate with the categorical RDM reflecting the rooms corresponding to each object. Since the categorical RDM is highly correlated with the global RDM, this means that what the authors measure here might not reflect the formation of a global spatial map, but simply the formation of pairwise associations between objects presented jointly.
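
      The correlation between a categorical room RDM and a global distance RDM can be illustrated with a toy layout. The coordinates below (four rooms on a 2 x 2 grid, eight objects on a circle in each room) are hypothetical and do not reproduce the actual environment; the sketch only shows that the two model RDMs are hard to separate when rooms are spatially distinct.

```python
import math
from itertools import combinations

# Hypothetical layout: 4 room centers, 8 objects per room on a circle.
ROOM_CENTERS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
OBJECTS_PER_ROOM = 8
RADIUS = 2.0

objects = []  # each entry is (room index, x, y)
for room, (cx, cy) in enumerate(ROOM_CENTERS):
    for k in range(OBJECTS_PER_ROOM):
        angle = 2 * math.pi * k / OBJECTS_PER_ROOM
        objects.append((room, cx + RADIUS * math.cos(angle), cy + RADIUS * math.sin(angle)))

pairs = list(combinations(range(len(objects)), 2))

# Categorical RDM: 0 for same-room pairs, 1 for different-room pairs.
room_rdm = [0.0 if objects[i][0] == objects[j][0] else 1.0 for i, j in pairs]
# Global RDM: Euclidean distance between objects in the shared coordinate frame.
global_rdm = [math.dist(objects[i][1:], objects[j][1:]) for i, j in pairs]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson(room_rdm, global_rdm)
print(len(pairs), round(r, 3))
```

      Because reference and target objects always came from the same room, an RDM built from how often two objects co-occurred on a trial would be non-zero only for same-room pairs, so it would share this structure with the room RDM and, through it, with the global RDM.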

      (3) In general, the authors attribute changes in neural effects to new learning. But of course, many things can change between sessions (expectancy, fatigue, change in strategy, but also physiological factors...). Baseline physiological effects are less likely to influence patterns of activity, so the RSA analyses should be less sensitive to this problem, but especially the basic differences in activation for the contrast of post-learning > pre-learning stages in the judgment of relative direction (JRD) task could in theory just reflect baseline differences in blood oxygenation, possibly due to differences in time of day, caffeine intake, sleep, etc. To really infer that any change in activity or representation is due to learning, an active control would have been great.

      (4) RSA typically compares voxel patterns associated with specific stimuli. However, the authors always presented two objects on the screen simultaneously. From what I understand, this is not considered in the analysis ("The β-maps for each reference object were averaged across trials to create an overall β-map for that object."). Furthermore, participants were asked to perform a complex mental operation on each trial ("imagine standing at A, looking at B, then perform the corresponding motor response"). Assuming that participants did this (although see points 1 and 2 above), this means that the resulting neural representation likely reflects a mixture of the two object representations, the mental transformation and the corresponding motor command, and possibly additionally the semantic and perceptual similarity between the two presented words. This means that the βs taken to reflect the reference object representation must be very noisy.

      This problem is aggravated by two additional points. Firstly, not all object pairs occurred equally often, because only a fraction of all potential pairs were sampled. If the selection of the object pairs is not carefully balanced, this could easily lead to sampling biases, which RSA is highly sensitive to.

      Secondly, the events in the scanner are not jittered. Instead, they are phase-locked to the TR (1.2 sec TR, 1.2 sec fixation, 4.8 sec stimulus presentation). This means that every object onset starts at the same phase of the image acquisition, making HRF sampling inefficient and hurting trial-wise estimation of betas used for the RSA. This likely significantly weakens the strength of the neural inferences that are possible using this dataset.
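      The phase-locking issue can be made concrete with a short sketch. The timing values are those quoted above; the trial count and the jitter scheme (0 to 1100 ms in 100 ms steps added to the fixation period) are hypothetical and only serve as a contrast.

```python
import random

TR_MS, FIX_MS, STIM_MS = 1200, 1200, 4800  # timing values quoted in the review
N_TRIALS = 40                              # hypothetical trial count

def onset_phases(jitter_ms=0, seed=0):
    """Return the set of stimulus-onset phases (ms into the TR) across trials."""
    rng = random.Random(seed)
    t, phases = 0, set()
    for _ in range(N_TRIALS):
        # Fixation, optionally lengthened by a random jitter in 100 ms steps.
        t += FIX_MS + (rng.randrange(0, jitter_ms + 1, 100) if jitter_ms else 0)
        phases.add(t % TR_MS)  # where in the acquisition cycle the stimulus starts
        t += STIM_MS
    return phases

locked = onset_phases()
jittered = onset_phases(jitter_ms=1100)
print("distinct onset phases without jitter:", len(locked))
print("distinct onset phases with jitter:   ", len(jittered))
```

      Because fixation plus stimulus (6.0 s) is an integer multiple of the TR, every onset falls on the same acquisition phase in the unjittered design, so the HRF is always sampled at the same time points.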

      (5) It is not clear why the authors focus their report of the results in the main manuscript on the preselected ROIs instead of showing whole-brain results. This can be misleading, as it provides the false impression that the neural effects are highly specific to those regions.

      (6) I am missing behavioural support for the authors' claims.

      Overall, I am not convinced that the main conclusion that global spatial representations emerge during learning is supported by the data. Unfortunately, I think there are some fundamental problems in the experimental design that might make it difficult to address the concerns.

      However, if the authors can provide convincing evidence for their claims, I think the paper will have an impact on the field. The question of how multi-scale representations are represented in the human brain is a timely and important one.

    3. Reviewer #2 (Public review):

      Summary:

      Qui and colleagues studied human participants who learned about the locations of 32 different objects located across 4 different rooms in a common spatial environment. Participants were extensively trained on the object locations, and fMRI scans were done during a relative direction judgement task in a pre- and post-session. Using RSA analysis, the authors report that the hippocampus increased global relative to local representations with learning; the RSC showed a similar pattern, but also increased effects of both global and local information with time.

      Strengths:

      (1) The manuscript asks a generally interesting question concerning the learning of global versus local spatial information.

      (2) The virtual environment task provides a rich and naturalistic spatial setting for participants, and the setup with 32 objects across 4 rooms is interesting.

      (3) The within-subject design and use of verbal cues for spatial retrieval are elegant.

      Weaknesses:

      (1) My main concern is that the global Euclidean distances and room identity are confounded. I fear this means that all neural effects in the RSA could be alternatively explained by associations to the visual features of the rooms that build up over time.

      (2) The direction judgement task is not very informative about cognitive changes, as only objects in a room are compared. The setup also discourages global learning, and leaves unclear whether participants focussed on learning the left/right relationships required by the task.

      (3) With N = 23, the power is low, and the effects are weak.

      (4) It appears no real multiple comparisons correction is done for the ROI based approach, and significance across ROIs is not tested directly.
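      As an illustration of the kind of correction referred to in point (4), the following is a minimal sketch of the Holm-Bonferroni step-down procedure. The p-values are invented for illustration and are not taken from the manuscript; the function name is likewise hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return {label: True/False} marking which tests survive Holm's step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])  # smallest p first
    m = len(ordered)
    rejected = {label: False for label in p_values}
    for rank, (label, p) in enumerate(ordered):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p > alpha / (m - rank):
            break  # stop at the first failure; all larger p-values are retained
        rejected[label] = True
    return rejected

# Hypothetical uncorrected p-values for four ROIs (illustrative only).
roi_p = {"HPC": 0.004, "RSC": 0.020, "OFC": 0.030, "ACC": 0.200}
print(holm_bonferroni(roi_p))
```

      In this made-up example, three ROIs fall below an uncorrected threshold of 0.05, but only one survives correction across the four ROIs.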

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Qui et al. explores the issue of spatial learning in both local (rooms) and global (connected rooms) environments. The authors perform a pointing task, which involves either pressing the right or left button in the scanner to indicate where an object is located relative to another object. Participants are repeatedly exposed to rooms over sessions of learning, with one "pre" and one "post" learning session. The authors report that the hippocampus shifted from lower to higher RSA for the global but not the local environment after learning. RSC and OFC showed higher RSA for global object pointing. Other brain regions also showed effects, including ACC, which seemed to show a similar pattern as the hippocampus, as well as other regions shown in Figure S5. The authors attempt to tie their results in with local vs. global spatial representations.

      Strengths:

      Extensive testing of subjects before and after learning a spatial environment, with data suggesting that there may be fMRI codes sensitive to both global and local codes. Behavioral data suggest that subjects are performing well at the task and learning both global and local object locations, although see further comments.

      Weaknesses:

      (1) The authors frame the entire introduction around confirming the presence of the cognitive map either locally or globally. There are some significant issues with this framing. For one, the introduction appears to be confirmatory and not testing specific hypotheses that can be falsified. What exactly are the hypotheses being tested? I believe that this relates to the testing whether neural representations are global and/or local. However, this is not clear. Given that a previous paper (Marchette et al. 2014 Nature Neuro, which bears many similarities) showed only local coding in RSC, this paper needs to be discussed in far more depth in terms of its similarities and differences. This paper looked at both position and direction, while the current paper looks at direction. Even here, direction in the current study is somewhat impoverished: it involves either pointing right or left to an object, and much of this could be categorical or even lucky guesses. From what I could tell, all behavioral inferences are based on reaction time and not accuracy, and therefore, it is difficult to determine if the subject's behavior actually reflects knowledge gained or simply faster reaction time, either due to motor learning or a speed-accuracy trade-off. The pointing task is largely egocentric: it can be solved by remembering a facing direction and an object relative to that. It is not the JRD task as has been used in other studies (e.g., Huffman et al. 2019 Neuron), which is continuous and has an allocentric component. This "version" of the task would be largely egocentric. In this way, the pointing task used does not test the core tenets of the cognitive map during navigation, which is defined as allocentric and Euclidean (please see O'Keefe and Nadel 1978, The Hippocampus as a Cognitive Map). Since neither of these assumptions appears valid, the paper should be reframed to reflect spatial representations more broadly or even egocentric spatial representations.

      (2) The fMRI data workup is insufficient. What do the authors mean by "deactivations" in Figure 3b? Does this mean the object task showed more activation than the spatial task in HSC? Given that HSC is one of these regions, this would seem to suggest that the hippocampus is more involved in object than spatial processing, although it is difficult to tell from how things are written. The RSA is more helpful, but now a concern is that the analysis focuses on small clusters that are based on analyses determined previously. This appears to be the case for the correlations shown in Figure 3e as well. The issues here are several-fold. For one, it has been shown in previous work that basing secondary analyses on related first analyses can inflate the risk of false positives (i.e., Kriegeskorte et al. 2009 Nature Neuro). The authors should perform secondary analyses in ways that are unbiased by the first analyses, preferably, selecting cluster centers (if they choose to go this route) from previous papers rather than their own analyses. Another option would be to perform analyses at the level of the entire ROI, meaning that the results would generalize more readily. The authors should also perform permutation tests to ensure that the RSA results are reliable, as these can run the risk of false positives (e.g., Nolan et al. 2018 eNeuro). If these results hold, the authors should perform post-hoc (corrected) t-tests for global vs. local before and after learning to ensure these differences are robust and not simply rely on the interaction effect. The figures were difficult to follow in this regard, and an interaction effect does not necessarily mean the differences that are critical (global higher than local after) are necessarily significant. The end part of the results was hard to follow. If ACC showed a similar effect to HC and RSC, why is it not being considered? Many other areas that seemed to show local vs. global effects were dismissed, but these should instead be discussed in terms of whether they are consistent or inconsistent with the hypotheses.

      (3) Concerns about the discussion: there are areas involving reverse inference about brain areas rather than connecting the findings with hypotheses (see Poldrack et al. 2006 Trends in Cognitive Science). The authors also argue for "transfer" of information (for example, from ACC to OFC), but did not perform any connectivity analyses, so these conclusions are not based on any results. Instead, the authors should carefully compare what can be concluded from the reaction time findings and the fMRI data. What is consistent vs. inconsistent with the hypotheses? The authors should also provide a much more detailed comparison with past work. The Marchette et al. paper comes to different conclusions regarding RSC and involves more detailed analyses than those done here, including position. What is different in the current paper that might explain the differences in results? Another previous paper came to a different conclusion (hippocampus local, retrosplenial global) and should be carefully considered and compared, as it also involved learning of environments and comparisons at different phases (e.g., Wolbers & Buchel 2005 J Neuro). Other papers that have used the JRD task have demonstrated similar, although not identical, networks (e.g., Huffman et al. 2019 Neuron) and the results here should be more carefully compared, as the current task is largely egocentric while the Huffman et al. paper involves a continuous and allocentric version of the JRD task.

      (4) The authors cite rodent papers involving single neuron recordings. These are quite different experiments, however: they involve rodents, the rodents are freely moving, and single neurons are recorded. Here, the study involves humans who are supine and an indirect vascular measure of neural activity. Citations should be to studies of spatial memory and navigation in humans using fMRI: over-reliance on rodent studies should be avoided for the reasons mentioned above.
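      The permutation test recommended in point (2) could be sketched as follows. This is a generic label-permutation (Mantel-style) test on toy data, not the authors' pipeline: the function name, the eight-item toy RDMs, the noise level, and the use of Pearson correlation are all assumptions made for illustration.

```python
import numpy as np

def rdm_permutation_test(neural_rdm, model_rdm, n_perm=999, seed=0):
    """Correlate two RDMs and build a null by jointly shuffling rows/columns of one."""
    rng = np.random.default_rng(seed)
    n = neural_rdm.shape[0]
    iu = np.triu_indices(n, k=1)  # use each object pair once
    observed = np.corrcoef(neural_rdm[iu], model_rdm[iu])[0, 1]
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)                 # relabel the objects
        shuffled = neural_rdm[np.ix_(perm, perm)]  # same permutation on rows and columns
        if np.corrcoef(shuffled[iu], model_rdm[iu])[0, 1] >= observed:
            exceed += 1
    # One-sided p-value, counting the observed statistic as part of the null.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data: a model RDM from random 2D positions and a noisy "neural" copy of it.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 1, size=(8, 2))
model = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)
noise = rng.normal(0, 0.05, size=model.shape)
neural = model + (noise + noise.T) / 2  # keep the RDM symmetric
np.fill_diagonal(neural, 0.0)

r, p = rdm_permutation_test(neural, model)
print(f"r = {r:.2f}, permutation p = {p:.3f}")
```

      Shuffling object labels rather than individual RDM cells preserves the dependency structure among the entries of an RDM, which is why this form of null is commonly preferred for RSA.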

    1. eLife Assessment

      This study presents a valuable approach for revealing large-scale brain attractor dynamics during resting states, task processing, and disease conditions using insights from Hopfield neural networks. The evidence supporting the findings is convincing across the many datasets analysed. The work will be of broad interest to neuroscientists using neuroimaging data with interest in computational modelling of brain activity.

    2. Reviewer #1 (Public review):

      Summary:

      Englert et al. proposed a functional connectivity-based Attractor Neural Network (fcANN) to reveal attractor states and activity flows across various conditions, including resting state, task-evoked, and pathological conditions. The large-scale brain attractors reconstructed by fcANNs exhibit an orthogonal organization, which is in line with the free-energy theoretical framework. Additionally, the fcANN demonstrates differences in attractor states between individuals with autism and typically developing individuals.

      The study used seven datasets, which ensures robust replication and validation of generalization across various conditions. The study is a representative example that combines experimental evidence based on fcANN and the theoretical framework. The fcANN projection offers an interesting way of visualization, allowing researchers to observe attractor states and activity flow patterns directly. Overall, the study may offer valuable insights into brain dynamics and computational neuroscience.

      Comments on revision:

      The authors have addressed my previous concerns and substantially improved the manuscript. Fig. 4 and Fig. 5 still show fcHNN rather than the updated fcANN.

    3. Reviewer #2 (Public review):

      Summary:

      Englert et al. use a novel modelling approach called functional connectome-based Hopfield Neural Networks (fcHNN) to describe spontaneous and task-evoked brain activity, and the alterations in brain disorders. Given its novelty, the authors first validate the model parameters (the temperature and noise) with empirical resting-state function data and against null models. Through the optimisation of the temperature parameter, they first show that the optimal number of attractor states is four before fixing the optimal noise that best reflects the empirical data, through stochastic relaxation. Then, they demonstrate how these fcHNN generated dynamics predict task-based functional activity relating to pain and self-regulation. To do so, they characterise the different brain states (here as different conditions of the experimental pain paradigm) in terms of the distribution of the data on the fcHNN projections and flow-analysis. Lastly, a similar analysis was performed on a population with autism condition. Through Hopfield modeling, this work proposes a comprehensive framework that links various types of functional activity under a unified interpretation with high predictive validity.

      Strengths:

      The phenomenological nature of the Hopfield model and its validation across multiple datasets presents a comprehensive and intuitive framework for the analysis of functional activity. The results presented in this work further motivate the study of phenomenological models as an adequate mechanistic characterisation of large-scale brain activity.

      Following up from Cole et al. 2016, the authors put forward a hypothesis that many of the changes to the brain activity, here, in terms of task-evoked and clinical data, can be inferred from the resting-state brain data alone. This brings together neatly the idea of different facets of brain activity emerging from a common space of functional (ghost) attractors.

      The use of the null models motivates the benefit for non-linear dynamics in the context of phenomenological models when assessing the similarity to the real empirical data.

      Comments on revision:

      I am happy with how the authors addressed the comments and am happy to move ahead without further comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Englert et al. proposed a functional connectome-based Hopfield artificial neural network (fcHNN) architecture to reveal attractor states and activity flows across various conditions, including resting state, task-evoked, and pathological conditions. The fcHNN can reconstruct characteristics of resting-state and task-evoked brain activities. Additionally, the fcHNN demonstrates differences in attractor states between individuals with autism and typically developing individuals.

      Strengths:

      (1) The study used seven datasets, which somewhat ensures robust replication and validation of generalization across various conditions.

      (2) The proposed fcHNN improves upon existing activity flow models by mimicking artificial neural networks, thereby enhancing the representational ability of the model. This advancement enables the model to more accurately reconstruct the dynamic characteristics of brain activity.

      (3) The fcHNN projection offers an interesting visualization, allowing researchers to observe attractor states and activity flow patterns directly.

      We are grateful to the reviewer for highlighting the robustness of our findings across multiple datasets and for appreciating the novelty and representational advantages of our fcHNN model (which has been renamed to fcANN in the revised manuscript).

      Weaknesses:

      (1) The fcHNN projection can offer low-dimensional dynamic visualizations, but its interpretability is limited, making it difficult to make strong claims based on these projections. The interpretability should be enhanced in the results and discussion.

      We thank the reviewer for these important points. We agree that the interpretability of the low-dimensional projection is limited. In the revised manuscript, we have reframed the fcANN projection primarily as a visualization tool (see e.g. line 359) and moved the corresponding part of Figure 2 to the Supplementary Material (Supplementary Figure 2). We have also implemented a substantial revision of the manuscript, which now directly links our analysis to the novel theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025), opening several new avenues in terms of interpretation and shedding light on the computational principles underlying attractor dynamics in the brain (see the revised introduction and the new section “Theoretical background”, starting at line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence for the brain’s functional organization approximating a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts.

      (2) The presentation of results is not clear enough, including figures, wording, and statistical analysis, which contributes to the overall difficulty in understanding the manuscript. This lack of clarity in presenting key findings can obscure the insights that the study aims to convey, making it challenging for readers to fully grasp the implications and significance of the research.

      We have thoroughly revised the manuscript for clarity in wording and figures (see e.g. lines 257, 482, 529 in the Results and lines 1128, 1266, 1300, 1367 in the Methods). We carefully improved statistical reporting and ensured that we always report test statistics and effect sizes and clearly refer to the null modelling approach used (e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4). As absolute effect sizes, in many analyses, do not have a straightforward interpretation, we provided Glass’ Δ as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation. To further improve clarity, we now clearly define our research questions and the corresponding analyses and null models in the revised manuscript, both in the main text and in two new tables (Tables 1 and 2). We denoted research questions and null models with Q1-7 and NM1-5, respectively, and refer to them at multiple instances when detailing the analyses and the results.

      Reviewer #2 (Public Review):

      Summary:

      Englert et al. use a novel modelling approach called functional connectome-based Hopfield Neural Networks (fcHNN) to describe spontaneous and task-evoked brain activity and the alterations in brain disorders. Given its novelty, the authors first validate the model parameters (the temperature and noise) with empirical resting-state function data and against null models. Through the optimisation of the temperature parameter, they first show that the optimal number of attractor states is four before fixing the optimal noise that best reflects the empirical data, through stochastic relaxation. Then, they demonstrate how these fcHNN-generated dynamics predict task-based functional activity relating to pain and self-regulation. To do so, they characterise the different brain states (here as different conditions of the experimental pain paradigm) in terms of the distribution of the data on the fcHNN projections and flow analysis. Lastly, a similar analysis was performed on a population with autism condition. Through Hopfield modeling, this work proposes a comprehensive framework that links various types of functional activity under a unified interpretation with high predictive validity.

      Strengths:

      The phenomenological nature of the Hopfield model and its validation across multiple datasets presents a comprehensive and intuitive framework for the analysis of functional activity. The results presented in this work further motivate the study of phenomenological models as an adequate mechanistic characterisation of large-scale brain activity.

      Following up on Cole et al. 2016, the authors put forward a hypothesis that many of the changes to the brain activity, here, in terms of task-evoked and clinical data, can be inferred from the resting-state brain data alone. This brings together neatly the idea of different facets of brain activity emerging from a common space of functional (ghost) attractors.

      The use of the null models motivates the benefit of non-linear dynamics in the context of phenomenological models when assessing the similarity to the real empirical data.

      We thank the reviewer for recognizing the comprehensive and intuitive nature of our framework and for acknowledging the strength of our hypothesis that diverse brain activity facets emerge from a common resting state attractor landscape.

      Weaknesses:

      While the use of the Hopfield model is neat and very well presented, it still begs the question of why to use the functional connectome (as derived by activity flow analysis from Cole et al. 2016). Deriving the functional connectome on the resting-state data that are then used for the analysis reads as circular.

      We agree that starting from functional couplings to study dynamics is in stark contrast with the common practice of estimating the interregional couplings based on structural connectome data. We now explicitly discuss how this affects the scope of the questions we can address with the approach, with explicit notes on the inability of this approach to study the structure-function coupling and its limitations in deriving mechanistic insights at the level of biophysical implementation.

      Line 894:

      “The proposed approach is not without limitations. First, the proposed approach does not incorporate information about anatomical connectivity and does not explicitly model biophysical details. Thus, in its present form, the model is not suitable to study the structure-function coupling and cannot yield mechanistic explanations underlying (altered) polysynaptic connections, at the level of biophysical details.”

      We are confident, however, that our approach is not circular. At a high level, our approach can be considered a function-to-function generative model with twofold aims.

      First, we link large-scale brain dynamics to theoretical artificial neural network models and show that the functional connectome displays characteristics that render it an exceptionally “well-behaving” attractor network (e.g. superior convergence properties, as contrasted against appropriate respective null models). In the revised manuscript, we have significantly improved upon this aspect by explicitly linking the fcANN model to the theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025) (see the revised introduction and the new section “Theoretical background”, starting at line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence for the brain’s functional organization approximating a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts. This result is not circular, as the empirical model does not use the key mechanism (the Hebbian/anti-Hebbian learning rule) that induces self-orthogonalization in the theoretical framework. We clarify this in the revised manuscript, e.g. in line 736.

      Second, we benchmark the ability of the proposed function-to-function generative model to predict unseen data (new datasets) or data characteristics that are not directly encompassed in the connectivity matrix (e.g. non-Gaussian conditional dependencies, temporal autocorrelation, dynamical responses to perturbations on the system). These benchmarks are constructed against well-defined null models, which provide reasonable references. We have now significantly improved the discussion of these null models in the revised manuscript (Tables 1 and 2, line 257). We not only show that our model - when reconstructing resting state dynamics - can generalize to unseen data over and beyond what is possible with the baseline descriptive measures (e.g. covariance measures and PCA), but also demonstrate the ability of the framework to reconstruct the effects of perturbations on these dynamics (such as task-evoked changes), based solely on the resting state data from another sample.

      If the fcHNN derives the basins of four attractors that reflect the first two principal components of functional connectivity, it perhaps suffices to use the empirically derived components alone and project the task and clinical data on it without the need for the fcHNN framework.

      We are thankful to the reviewer for highlighting this important point, which encouraged us to develop a detailed understanding of the origins of the close alignment between attractors and principal components (eigenvectors of the coupling matrix) and the corresponding (approximate) orthogonality. Here, we would like to emphasize that the attractor-eigenvector correspondence is by no means a general feature of any arbitrary attractor network. In fact, such networks are a very special class of attractor neural networks (the so-called Kanter-Sompolinsky projector neural network (Kanter & Sompolinsky, 1987)), with a high degree of computational efficiency, maximal memory capacity and perfect memory recall. It has been rigorously shown that in such networks, the eigenvectors of the coupling matrix (i.e. PCA on the timeseries data) and the attractors become equivalent (Kanter & Sompolinsky, 1987). This in turn made us ask the question: what are the learning and plasticity rules that drive attractor networks towards developing approximately orthogonal attractors? We found that this is a general tendency of networks obeying the free energy principle (Figure 2A-C, line 346, see also our answer to your next comment). The formal derivation of this framework is now presented in an accompanying theoretical piece (Spisak & Friston, 2025). In the revised manuscript, we provide a short, high-level overview of these results (in the Introduction from line 55 and in the new section “Theoretical background”, line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). According to this new theoretical model, attractor states can be understood as a set of priors (in the Bayesian sense) that together constitute an optimal orthogonal basis, equipping the update process (which is akin to a Markov-chain Monte Carlo sampling) to find posteriors that generalize effectively within the spanned subspace. Thus, in sum, understanding brain function in terms of attractor dynamics - instead of PCA-like descriptive projections - provides important links towards a Bayesian interpretation of brain activity. At the same time, the eigenvector-attractor correspondence also explains why descriptive decomposition approaches, like PCA or ICA, are so effective at capturing the dynamics of the system in the first place.
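      The attractor-eigenvector correspondence mentioned above can be checked numerically in a toy setting. The sketch below assumes that the projector network corresponds to the standard projection (pseudo-inverse) learning rule; the network size, the number of patterns, and the random patterns are hypothetical, and this is not the fcANN code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_patterns = 20, 3
# Random +/-1 patterns, stored as columns (hypothetical "memories").
patterns = rng.choice([-1.0, 1.0], size=(n_units, n_patterns))

# Projection (pseudo-inverse) rule: W projects onto the span of the stored patterns.
W = patterns @ np.linalg.pinv(patterns)

# W is a projector: applying it twice changes nothing.
print("W is idempotent:", np.allclose(W @ W, W))
# Each stored pattern is an eigenvector of W with eigenvalue 1.
print("patterns are eigenvectors:", np.allclose(W @ patterns, patterns))
# The spectrum consists of ones (stored subspace) and zeros (everything else).
print("eigenvalues:", np.round(np.linalg.eigvalsh(W), 6)[-4:])
```

      In such a network the stored patterns span the leading eigenspace of the coupling matrix, which gives an intuition for why an eigendecomposition and the attractors can carry closely related information.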

      As presented here, the Hopfield model is excellent in its simplicity and power, and it seems suited to tackle the structure-function relationship with the power of going further to explain task-evoked and clinical data. The work could be strengthened if that was taken into consideration. As such the model would not suffer from circularity problems and it would be possible to claim its mechanistic properties. Furthermore, as mentioned above, in the current setup, the connectivity matrix is based on statistical properties of functional activity amongst regions, and as such it is difficult to talk about a certain mechanism. This contention has for example been addressed in the Cole et al. 2016 paper with the use of a biophysical model linking structure and function, thus strengthening the mechanistic claim of the work.

      We agree that investigating how the structural connectome constrains macro-scale dynamics is a crucial next step. Linking our results with the theoretical framework of self-orthogonalizing attractor networks provides a principled approach to this, as the “self-orthogonalizing” learning rule in the accompanying theoretical work provides the opportunity to fit attractor networks with structural constraints to functional data, shedding light on the plastic processes which maintain the observed approximate orthogonality even in the presence of these structural constraints. We have revised the manuscript to clarify that our phenomenological approach is inherently limited in its ability to answer mechanistic questions at the level of biophysical details (line 894) and discuss this promising direction as follows:

      Lines 803:

      “A promising application of this is to consider structural brain connectivity (as measured by diffusion MRI) as a sparsity constraint for the coupling weights and then train the fcANN model to match the observed resting-state brain dynamics. If the resulting structural-functional ANN model is able to closely match the observed functional brain substate dynamics, it can be used as a novel approach to quantify and understand the structural functional coupling in the brain”.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The statistical analyses are poorly described throughout the manuscript. The authors should provide more details on the statistical methods used for each comparison, as well as the corresponding statistics and degrees of freedom, rather than solely reporting p-values.

      We thank the reviewer for pointing this out. We have revised the manuscript to include the specific test statistics, precise p-values and raw effect sizes for all reported analyses to ensure full transparency and replicability, see e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4. Additionally, as absolute effect sizes - in many analyses - do not have a straightforward interpretation, we provided Glass’ Δ, as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation.
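      The standardized effect size described above (distance of the observed statistic from the null distribution, in units of the null standard deviation) can be written down in a few lines. This is a minimal sketch: the function name, the simulated null distribution, and the observed value are hypothetical, and the sample standard deviation is an assumed choice.

```python
import random
import statistics

def glass_delta(observed, null_samples):
    """Distance of the observation from the null mean, in null standard deviations."""
    return (observed - statistics.mean(null_samples)) / statistics.stdev(null_samples)

# Hypothetical null distribution of a test statistic (e.g. from a null model).
rng = random.Random(0)
null = [rng.gauss(0.10, 0.02) for _ in range(1000)]
observed = 0.18  # hypothetical value obtained on the real data

print(round(glass_delta(observed, null), 2))
```

      A value of, say, 4 would mean that the observed statistic lies four null standard deviations above the null mean, which is easier to compare across analyses than a raw difference.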

      We have also improved the description of the statistical methods used in the manuscript (lines 1270, 1306, 1339, 1367, 1404) and added two overview tables (Tables 1 and 2) that summarize the methodological approaches and the corresponding null models.

      Furthermore, we have fully revised the analysis corresponding to noise optimization. We retained only null model 2 (covariance-matched Gaussian) in the main text and in Figure 3, and moved null model 1 (spatial phase randomization) into the Supplementary Material (Supplementary Figure 6), as it is less appropriate for this analysis (trivially significant in all cases). Furthermore, as the test statistic, we now use a Wasserstein distance between the 122-dimensional empirical and simulated data (instead of focusing on the 2-dimensional projection). This analysis now directly quantifies the capacity of the fcANN model to capture non-Gaussian conditionals in the data.
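
      To illustrate what a Wasserstein distance between two multivariate samples measures, here is a toy sketch. It assumes two equally sized, uniformly weighted point clouds, for which the optimal transport plan is a one-to-one matching, and it finds that matching by brute force. This is not the estimator used in the manuscript; an analysis of 122-dimensional data with many samples would need a dedicated optimal transport solver. The function name and the data are hypothetical.

```python
from itertools import permutations
from math import dist


def wasserstein1_exact(xs, ys):
    """Exact 1-Wasserstein distance between two equally sized point clouds.

    With uniform weights and equal sample sizes, an optimal transport plan
    is a one-to-one matching, so all matchings are searched exhaustively.
    Brute force: only feasible for a handful of points.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    n = len(xs)
    return min(
        sum(dist(x, ys[j]) for x, j in zip(xs, perm)) / n
        for perm in permutations(range(n))
    )


# Hypothetical 2-dimensional toy data; the "simulated" cloud is a shifted copy.
empirical = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
simulated = [(x + 3.0, y + 4.0) for x, y in empirical]
w_same = wasserstein1_exact(empirical, empirical)
w_shift = wasserstein1_exact(empirical, simulated)
```

      For a pure translation the distance equals the length of the shift vector, which makes the shifted copy a convenient sanity check.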

      (2) The convergence procedure is not clearly explained in the manuscript. Is this an optimization procedure to minimize energy? If so, the authors should provide more details about the optimizer used.

      We apologize for the lack of clarity. The convergence is not an optimization procedure per se, in the sense that it does not involve any external optimizer. It is simply the repeated (deterministic) application of the same update rule, also known from Hopfield networks or Boltzmann machines. However, as detailed in the accompanying theoretical paper, this update rule (or inference rule) inherently solves an optimization problem: it performs gradient descent on the free energy landscape of the network. As such, it is guaranteed to converge to a local free energy minimum in the deterministic case. We have clarified this process in the Results and Methods sections as follows:

      Line 161:

      “Inference arises from minimizing free energy with respect to the states σ. For a single unit, this yields a local update rule homologous to the relaxation dynamics in Hopfield networks”.

      Line 181:

      “In the basis framework (Spisak & Friston, 2025), inference is a gradient descent on the variational free energy landscape with respect to the states σ and can be interpreted as a form of approximate Bayesian inference, where the expected value of the state σ_i is interpreted as the posterior mean given the attractor states currently encoded in the network (serving as a macro-scale prior) and the previous state, including external inputs (serving as likelihood in the Bayesian sense)”.

      Line 1252:

      “As the inference rule was derived as a gradient descent on free energy, iterations monotonically decrease the free energy function and therefore converge to a local free‑energy minimum without any external optimizer. Thus, convergence does not require any optimization procedure with an external optimizer. Instead, it arises as the fixed point of repeated local inference updates, which implement gradient descent on free energy in the deterministic symmetric case.”
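
      The convergence behaviour described in these passages can be sketched with a small deterministic network. This is a minimal illustration, assuming the standard continuous Hopfield relaxation rule σ_i ← tanh(β Σ_j w_ij σ_j) with symmetric weights and a zero diagonal, applied one unit at a time; the exact rule in the manuscript may differ. The energy function below is the textbook energy for this rule and serves only as a stand-in for the free energy. The stored pattern, the starting state, and β = 4 are arbitrary choices.

```python
import math


def make_weights(pattern):
    """Hebbian weights for one stored pattern: symmetric, zero diagonal."""
    n = len(pattern)
    return [[0.0 if i == j else pattern[i] * pattern[j] / n for j in range(n)]
            for i in range(n)]


def energy(w, state, beta):
    """Energy whose coordinate-wise minimizer is the tanh update used below."""
    n = len(state)
    coupling = -0.5 * sum(w[i][j] * state[i] * state[j]
                          for i in range(n) for j in range(n))
    entropy = sum(s * math.atanh(s) + 0.5 * math.log(1.0 - s * s) for s in state)
    return coupling + entropy / beta


def relax(w, state, beta, tol=1e-10, max_sweeps=1000):
    """Apply the local update unit by unit until the state stops changing."""
    state = list(state)
    energies = [energy(w, state, beta)]
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            new_value = math.tanh(beta * field)
            largest_change = max(largest_change, abs(new_value - state[i]))
            state[i] = new_value
        energies.append(energy(w, state, beta))
        if largest_change < tol:
            break
    return state, energies


# Arbitrary toy setup: one stored pattern and a noisy starting state.
pattern = [1, -1, 1, -1]
weights = make_weights(pattern)
final_state, energies = relax(weights, [0.9, -0.2, 0.3, -0.8], beta=4.0)
```

      Each single-unit update exactly minimizes the energy with respect to that unit, so the recorded energies never increase, and no external optimizer appears anywhere in the loop.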

      (3) In Figure 2G, the beta values range from 0.035 to 0.06, but they are reported as 0.4 in the main text and the Supplementary Figure. Please clarify this discrepancy.

      We are grateful to the reviewer for spotting this typo. The correct value for β is 0.04, as reported in the Methods section. We have corrected this inconsistency in the revised manuscript as well as in Supplementary Figure 5.

      (4) Line 174: What type of null model was used to evaluate the impact of the beta values? The authors did not provide details on this anywhere in the manuscript.

      We apologize for this omission. The null model is based on permuting the connectome weights while retaining the matrix symmetry, which destroys the specific topological structure but preserves the overall weight distribution. We have now clarified this in multiple places in the revised manuscript (line 432, Tables 1-2, Figure 2), and added new overview tables (Tables 1 and 2) to summarize the methodological approaches and the corresponding null models.
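
      A symmetry-preserving permutation of this kind can be sketched as follows. This is a minimal illustration, assuming that the off-diagonal weights of the upper triangle are shuffled and then mirrored to the lower triangle, with the diagonal left untouched; the manuscript's implementation may differ in detail. The function name and the toy matrix are hypothetical.

```python
import random


def permute_symmetric(matrix, rng):
    """Shuffle off-diagonal weights while keeping the matrix symmetric."""
    n = len(matrix)
    upper = [(i, j) for i in range(n) for j in range(i + 1, n)]
    values = [matrix[i][j] for i, j in upper]
    rng.shuffle(values)
    shuffled = [row[:] for row in matrix]  # the diagonal is left untouched
    for (i, j), value in zip(upper, values):
        shuffled[i][j] = value
        shuffled[j][i] = value
    return shuffled


# Hypothetical 4-region connectome, for illustration only.
connectome = [
    [0.0, 0.8, 0.1, 0.3],
    [0.8, 0.0, 0.5, 0.2],
    [0.1, 0.5, 0.0, 0.7],
    [0.3, 0.2, 0.7, 0.0],
]
surrogate = permute_symmetric(connectome, random.Random(0))
```

      The surrogate keeps exactly the same set of weights as the original, so any difference between the two can be attributed to where the weights sit rather than to their distribution.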

      (5) Figure 3B: It appears that the authors only demonstrate the reproducibility of the “internal” attractor across different datasets. What about other states?

      Thank you for noticing this. We now visualize all attractor states in Figure 3B (note that these essentially consist of two symmetric pairs).

      (6) Figure 3: What does “empirical” represent in Figure 3? Is it PCA? If the “empirical” method, which is a much simpler method, can achieve results similar to those of the fcHNN in terms of state occupancy, distribution, and activity flow, what are the benefits of the proposed method? Furthermore, the authors claim that the explanatory power of the fcHNN is higher than that of the empirical model and shows significant differences. However, from my perspective, this difference is not substantial (37.0% vs. 39.9%). What does this signify, particularly in comparison to PCA?

      This is a crucial point that is now a central theme of our revised manuscript. The reviewer is correct that the “empirical” method is PCA. PCA - by identifying variance-heavy orthogonal directions - aims to explain the highest amount of variance possible in the data (with the assumption of Gaussian conditionals). While empirical attractors are closely aligned to the PCs (i.e. eigenvectors of the inverse covariance matrix, as shown in the new analysis Q1), the alignment is only approximate. We basically take advantage of this small “gap” to quantify whether attractor states are a better fit to the unseen data than the PCs. Obviously, due to the otherwise strong PC-attractor correspondence, this is expected to be only a small improvement. However, it is an important piece of evidence for the validity of our framework, as it shows that attractors are not just a complementary, perhaps “noisier” variety of the PCs, but a “substrate” that generalizes better to unseen data than the PCs themselves. We have revised the manuscript to clarify this point (line 528).

      Reviewer #2 (Recommendations For The Authors):

      For clarity, it might be useful to define certain key terms and use them consistently. “Connectome” often refers to structural (anatomical) connectivity unless defined specifically; this should be considered, for example, in the title of Figure 1B. “Brain state” often refers to different conditions, i.e. autism, neurotypical, sleep, etc. (for a review, see Kringelbach et al. 2020, Cell Reports). When referring to attractors of brain activity, they might be called substates.

      We thank the reviewer for these helpful suggestions. We have carefully revised the manuscript to ensure our terminology is precise and consistent. We now explicitly refer to the “functional connectome” (including the title) and avoid using the too general term “brain state” and use “substates” instead.

      In Figure 2 some terms are not defined. Noise is sigma in the text but epsilon in the figure. The link becomes clear only in the Methods. Perhaps define epsilon in the caption for clarity. The same applies to μ in the Methods. It is only described above in the Methods; I suggest repeating the epsilon definition for clarity.

      We appreciate this feedback and apologize for the inconsistency. We have revised all figures and the Methods section to ensure that all mathematical symbols (including ε, σ, and μ) are clearly and consistently defined upon their first appearance and in all figure captions. For instance, noise level is now consistently referred to as ϵ. We improved the consistency and clarity for other terms, too, including:

      functional connectome-based Hopfield network (fcHNN) => functional connectivity-based attractor network (fcANN);

      temperature => inverse temperature;

      We also improved grammar and language throughout the manuscript.

      References

      Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review A, 35(1), 380–392. 10.1103/physreva.35.380

      Spisak, T., & Friston, K. (2025). Self-orthogonalizing attractor neural networks emerging from the free energy principle. arXiv preprint arXiv:2505.22749.

    1. eLife Assessment

      O'Brien and co-authors provide important data demonstrating that tissue-resident macrophages can exert physiological functions and influence endocrine systems. Their model in which AMs negatively regulate aldosterone production via effects exerted in the lung is solid. The work will be of broad interest to cell biologists and immunologists.

    2. Reviewer #2 (Public review):

      Summary:

      Tissue-resident macrophages are more and more thought to exert key homeostatic functions and contribute to physiological responses. In the report of O'Brien and Colleagues, the idea that the macrophage-expressed scavenger receptor MARCO could regulate adrenal corticosteroid output at steady-state was explored. The authors found that male MARCO-deficient mice exhibited higher plasma aldosterone levels and higher lung ACE expression as compared to wild-type mice, while the availability of cholesterol and the machinery required to produce aldosterone in the adrenal gland were not affected by MARCO deficiency. The authors take these data to conclude that MARCO in alveolar macrophages can negatively regulate ACE expression and aldosterone production at steady-state and that MARCO-deficient mice suffer from a secondary hyperaldosteronism.

      Strengths:

      If properly demonstrated and validated, the fact that tissue-resident macrophages can exert physiological functions and influence endocrine systems would be highly significant and could be amenable to novel therapies.

      Major weakness:

      The comparison between C57BL/6J wild-type mice and knock-out mice for which precise information about the genetic background and the history of breedings and crossings is lacking can lead to misinterpretations of the results obtained. Hence, MARCO-deficient mice should be compared with true littermate controls.

    3. Author response:

      The following is the authors’ response to the original reviews

      We again thank the reviewers for their comments and recommendations. In response to the reviewers’ suggestions, we have performed several additional experiments, added additional discussion, and updated our conclusions to reflect the additional work. Specifically, we have performed additional analyses in female WT and Marco-deficient animals, demonstrating that the Marco-associated phenotypes observed in male mice (reduced adrenal weight, increased lung Ace mRNA and protein expression, unchanged expression of adrenal corticosteroid biosynthetic enzymes) are not present in female mice. We also report new data on the physiological consequences of increased aldosterone levels observed in male mice, namely plasma sodium and potassium titres, and blood pressure alterations in WT vs Marco-deficient male mice. In an attempt to address the reviewers’ comments relating to our proposed mechanism on the regulation of lung Ace expression, we additionally performed a co-culture experiment using an alveolar macrophage cell line and an endothelial cell line. In light of the additional evidence presented herein, we have updated our conclusions from this study and changed the title of our work to acknowledge that the mechanism underlying the reported phenotype remains incompletely understood. Specific responses to reviewers can be seen below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The investigators sought to determine whether Marco regulates the levels of aldosterone by limiting uptake of its parent molecule cholesterol in the adrenal gland. Instead, they identify an unexpected role for Marco on alveolar macrophages in lowering the levels of angiotensin-converting enzyme in the lung. This suggests an unexpected role of alveolar macrophages and lung ACE in the production of aldosterone.

      Strengths:

      The investigators suggest an unexpected role for ACE in the lung in the regulation of systemic aldosterone levels.

      The investigators suggest important sex-related differences in the regulation of aldosterone by alveolar macrophages and ACE in the lung.

      Studies to exclude a role for Marco in the adrenal gland are strong, suggesting an extra-adrenal source for the excess aldosterone observed in male Marco knockout mice.

      Weaknesses:

      While the investigators have identified important sex differences in the regulation of extrapulmonary ACE in the regulation of aldosterone levels, the mechanisms underlying these differences are not explored.

      The physiologic impact of the increased aldosterone levels observed in Marco -/- male mice on blood pressure or response to injury is not clear.

      The intracellular signaling mechanism linking lung macrophage levels with the expression of ACE in the lung is not supported by direct evidence.

      Reviewer #2 (Public Review):

      Summary:

      Tissue-resident macrophages are more and more thought to exert key homeostatic functions and contribute to physiological responses. In the report of O'Brien and Colleagues, the idea that the macrophage-expressed scavenger receptor MARCO could regulate adrenal corticosteroid output at steady-state was explored. The authors found that male MARCO-deficient mice exhibited higher plasma aldosterone levels and higher lung ACE expression as compared to wild-type mice, while the availability of cholesterol and the machinery required to produce aldosterone in the adrenal gland were not affected by MARCO deficiency. The authors take these data to conclude that MARCO in alveolar macrophages can negatively regulate ACE expression and aldosterone production at steady-state and that MARCO-deficient mice suffer from secondary hyperaldosteronism.

      Strengths:

      If properly demonstrated and validated, the fact that tissue-resident macrophages can exert physiological functions and influence endocrine systems would be highly significant and could be amenable to novel therapies.

      Weaknesses:

      The data provided by the authors currently do not support the major claim of the authors that alveolar macrophages, via MARCO, are involved in the regulation of a hormonal output in vivo at steady-state. At this point, there are two interesting but descriptive observations in male, but not female, MARCO-deficient animals, and overall, the study lacks key controls and validation experiments, as detailed below.

      Major weaknesses:

      (1) According to the reviewer's own experience, the comparison between C57BL/6J wild-type mice and knock-out mice for which precise information about the genetic background and the history of breedings and crossings is lacking, can lead to misinterpretations of the results obtained. Hence, MARCO-deficient mice should be compared with true littermate controls.

      (2) The use of mice globally deficient for MARCO combined with the fact that alveolar macrophages produce high levels of MARCO is not sufficient to prove that the phenotype observed is linked to alveolar macrophage-expressed MARCO (see below for suggestions of experiments).

      (3) If the hypothesis of the authors is correct, then additional read-outs could be performed to reinforce their claims: levels of Angiotensin I would be lower in MARCO-deficient mice, levels of Angiotensin II would be higher in MARCO-deficient mice, arterial blood pressure would be higher in MARCO-deficient mice, natremia would be higher in MARCO-deficient mice, while kaliemia would be lower in MARCO-deficient mice. In addition, co-culture experiments between MARCO-sufficient or deficient alveolar macrophages and lung endothelial cells, combined with the assessment of ACE expression, would allow the authors to evaluate whether the AM-expressed MARCO can directly regulate ACE expression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Corticosterone levels in male Marco -/- mice are not significantly different, but there is (by eye) substantially more variability in the knockout compared to the wild type. A power analysis should be performed to determine the number of mice needed to detect a similar % difference in corticosterone to the difference observed in aldosterone between male Marco knockout and wild-type mice. If necessary the experiments should be repeated with an adequately powered cohort.

      Using a power calculator (www.gigacalculator.com) it was determined that our sample size of 13 was one less than sufficient to detect a similar % difference in corticosterone as was detected in aldosterone. We regret that we were unable to perform additional measurements as the reviewer suggested in the available timeframe.
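
      The kind of calculation behind such a power analysis can be sketched as follows. This is a minimal illustration, assuming a two-sided, two-sample comparison of means with the normal approximation, a significance level of 0.05, and 80% power. It is not the calculator used by the authors, and the standardized effect sizes are hypothetical because the response does not report the underlying means and variances.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    effect_size is the standardized mean difference (difference / SD).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical standardized effect sizes, for illustration only.
n_large = n_per_group(1.0)
n_medium = n_per_group(0.5)
print(n_large, n_medium)
```

      Halving the effect size roughly quadruples the required group size, which is why greater variability in the knockout group can leave a comparison underpowered.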

      (2) All of the data throughout the MS (particularly data in the lung) should be presented in male and female mice. For example, the induction of ACE in the lungs of Marco-/- female mice should be absent. Similar concerns relate to the dexamethasone suppression studies. Also would be useful if the single cell data could be examined by sex--should be possible even post hoc using Xist etc.

      Given the limitations outlined in our previous response to reviewers it was not possible to repeat every experiment from the original manuscript. We were able to measure the expression of lung Ace mRNA, ACE protein, adrenal weights, adrenal expression of steroid biosynthetic enzymes, presence of myeloid cells, and levels of serum electrolytes in female animals. These are presented in figures 1G, 3B, 4A, 4E, 4F, 4I, and 4J. We have elected to not present single cell seq data according to sex as it did not indicate substantial differences between males and females in Marco or Ace expression and so does not substantively change our approach.

      (3) IF is notoriously unreliable in the lung, which has high levels of autofluorescence. This is the only method used to show ACE levels are increased in the absence of Marco. Orthogonal methods (e.g. immunoblots of flow-sorted cells, or ideally CITE-seq that includes both male and female mice) should be used.

      We used negative controls to guide our settings during acquisition of immunofluorescent images. Additionally, we also used qPCR to show an increase in Ace mRNA expression in the lung in addition to the protein level. This data was presented in the original manuscript and is further bolstered by our additional presentation of expression data for Ace mRNA and protein in female animals in this revised manuscript.

      (4) Given the central importance of ACE staining to the conclusions, validation of the antibody should be included in the supplement.

      We don’t have ACE-deficient mice so cannot do KO validation of the antibody. We did perform secondary stain controls which confirmed the signal observed is primary antibody-derived. Moreover, we specifically chose an anti-ACE antibody (Invitrogen catalogue # MA5-32741) that has undergone advanced verification with the manufacturer. We additionally tested the antibody in the brain and liver and observed no significant levels of staining.

      Author response image 1.

      (5) The link between alveolar macrophage Marco and ACE is poorly explored.

      We carried out a co-culture experiment of alveolar macrophages and endothelial cells and measured ACE/Ace expression as a consequence. This is presented in figure 5D and the discussion.

      (6) Mechanisms explaining the substantial sex difference in the primary outcome are not explored.

      This is outside the scope of this project, though we would consider exploring such experiments in future studies.

      (7) Are there physiologic consequences either in homeostasis or under stress to the increased aldosterone (or lung ACE levels) observed in Marco-/- male mice?

      We measured blood electrolytes and blood pressure in Marco-deficient and Marco-sufficient mice. The results from these experiments are presented in figures 4G-4M.

      Reviewer #2 (Recommendations For The Authors):

      Below is a suggestion of important control or validation experiments to be performed in order to support the authors' claims.

      (1) It is imperative to validate that the phenotype observed in MARCO-deficient mice is indeed caused by the deficiency in MARCO. To this end, littermate mice issued from the crossing between heterozygous MARCO +/- mice should be compared to each other. C57BL/6J mice can first be crossed with MARCO-deficient mice in F0, and F1 heterozygous MARCO +/- mice should be crossed together to produce F2 MARCO +/+, MARCO +/- and MARCO -/- littermate mice that can be used for experiments.

      We thank the reviewer for their comments. We recognise the concern of the reviewer but due to limited experimenter availability we are unable to undertake such a breeding programme to address this particular concern.

      (2) The use of mice in which AM, but not other cells, lack MARCO expression would demonstrate that the effect is indeed linked to AM. To this end, AM-deficient Csf2rb-deficient mice could be adoptively transferred with MARCO-deficient AM. In addition, the phenotype of MARCO-deficient mice should be restored by the adoptive transfer of wild-type, MARCO-expressing AM. Alternatively, bone marrow chimeras in which only the hematopoietic compartment is deficient in MARCO would be another option, albeit less specific for AM.

      We recognise the concern of the reviewer. We carried out a co-culture experiment of alveolar macrophages and endothelial cells and measured ACE/Ace expression as a consequence. This is presented in figure 5D and the implications explored in the discussion.

      (3) If the hypothesis of the authors is correct, then additional read-outs could be performed to reinforce their claims: levels of Angiotensin I would be lower in MARCO-deficient mice, levels of Angiotensin II would be higher in MARCO-deficient mice, arterial blood pressure would be higher in MARCO-deficient mice, natremia would be higher in MARCO-deficient mice, while kaliemia would be lower in MARCO-deficient mice. Similar read-outs could also be performed in the models proposed in point 2).

      We measured blood electrolytes and blood pressure in Marco-deficient and Marco-sufficient mice. The results from these experiments are presented in figures 4G-4M.

      (4) Co-culture experiments between MARCO-sufficient or deficient alveolar macrophages and lung endothelial cells, combined with the assessment of ACE expression, would allow the authors to evaluate whether the AM-expressed MARCO can directly regulate ACE expression.

      To address this concern we carried out a co-culture experiment as described above.

    1. eLife Assessment

      This useful study presents Altair-LSFM, a solid and well-documented implementation of a light-sheet fluorescence microscope (LSFM) designed for accessibility and cost reduction. While the approach offers strengths such as the use of custom-machined baseplates and detailed assembly instructions, its overall impact is limited by the lack of live-cell imaging capabilities and the absence of a clear, quantitative comparison to existing LSFM platforms. As such, although technically competent, the broader utility and uptake of this system by the community may be limited.

    2. Reviewer #1 (Public review):

      Summary:

      The article presents the details of the high-resolution light-sheet microscopy system developed by the group. In addition to presenting the technical details of the system, its resolution has been characterized and its functionality demonstrated by visualizing subcellular structures in a biological sample.

      Strengths:

      (1) The article includes extensive supplementary material that complements the information in the main article.

      (2) However, in some sections, the information provided is somewhat superficial.

      Weaknesses:

      (1) Although a comparison is made with other light-sheet microscopy systems, the presented system does not represent a significant advance over existing systems. It uses high numerical aperture objectives and Gaussian beams, achieving resolution close to theoretical after deconvolution. The main advantage of the presented system is its ease of construction, thanks to the design of a perforated base plate.

      (2) Using similar objectives (Nikon 25x and Thorlabs 20x), the results obtained are similar to those of the LLSM system (using a Gaussian beam without laser modulation). However, the article does not mention the difficulties of mounting the sample in the implemented configuration.

      (3) The authors present a low-cost, open-source system. Although they provide open source code for the software (navigate), the use of proprietary electronics (ASI, NI, etc.) makes the system relatively expensive. Its low cost is not justified.

      (4) The fibroblast images provided are of exceptional quality. However, these are fixed samples. The system lacks the necessary elements for monitoring cells in vivo, such as temperature or pH control.

    3. Reviewer #2 (Public review):

      Summary:

      The authors present Altair-LSFM (Light Sheet Fluorescence Microscope), a high-resolution, open-source microscope, that is relatively easy to align and construct and achieves sub-cellular resolution. The authors developed this microscope to fill a perceived need that current open-source systems are primarily designed for large specimens and lack sub-cellular resolution or are difficult to construct and align, and are not stable. While commercial alternatives exist that offer sub-cellular resolution, they are expensive. The authors' manuscript centers around comparisons to the highly successful lattice light-sheet microscope, including the choice of detection and excitation objectives. The authors thus claim that there remains a critical need for high-resolution, economical, and easy-to-implement LSFM systems.

      Strengths:

      The authors succeed in their goals of implementing a relatively low-cost (~ USD 150K) open-source microscope that is easy to align. The ease of alignment rests on using custom-designed baseplates with dowel pins for precise positioning of optics based on computer analysis of opto-mechanical tolerances, as well as the optical path design. They simplify the excitation optics over Lattice light-sheet microscopes by using a Gaussian beam for illumination while maintaining lateral and axial resolutions of 235 and 350 nm across a 260-um field of view after deconvolution. In doing so they rest on foundational principles of optical microscopy that what matters for lateral resolution is the numerical aperture of the detection objective and proper sampling of the image field onto the detector, and the axial resolution depends on the thickness of the light-sheet when it is thinner than the depth of field of the detection objective. This concept has unfortunately not been completely clear to users of high-resolution light-sheet microscopes and is thus a valuable demonstration. The microscope is controlled by open-source software, Navigate, developed by the authors, and it is thus foreseeable that different versions of this system could be implemented depending on experimental needs while maintaining easy alignment and low cost. They demonstrate system performance successfully by characterizing their sheet, point-spread function, and visualization of sub-cellular structures in mammalian cells, including microtubules, actin filaments, nuclei, and the Golgi apparatus.
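
      The dependence of lateral resolution on detection numerical aperture and sampling can be made concrete with a short calculation. This is a minimal sketch, assuming the Rayleigh criterion (0.61 λ / NA) for lateral resolution and the Nyquist requirement of at least two pixels per resolvable distance. The wavelength and numerical aperture below are hypothetical round numbers, not the specifications of the instrument under review.

```python
def rayleigh_lateral_resolution(wavelength_um, detection_na):
    """Rayleigh-criterion lateral resolution in micrometres."""
    return 0.61 * wavelength_um / detection_na


def max_nyquist_pixel_size(resolution_um):
    """Largest sample-plane pixel size that still satisfies Nyquist sampling."""
    return resolution_um / 2.0


# Hypothetical values: 0.5 um emission wavelength, detection NA of 1.0.
resolution = rayleigh_lateral_resolution(0.5, 1.0)
pixel = max_nyquist_pixel_size(resolution)
print(f"lateral resolution: {resolution:.3f} um, pixel size <= {pixel:.4f} um")
```

      Nothing about the illumination beam enters this calculation, which is the point made above: the excitation side sets the axial performance, while the detection objective and camera sampling set the lateral performance.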

      Weaknesses:

      There is a fixation on comparison to the first-generation lattice light-sheet microscope, which has evolved significantly since then:

      (1) The authors claim that commercial lattice light-sheet microscopes (LLSM) are "complex, expensive, and alignment intensive". I believe this sentence applies to the open-source version of LLSM, which was made available for wide dissemination. Since then, a commercial solution has been provided by 3i, which is now being used in multiple cores and labs but does require routine alignments. However, Zeiss has also released a commercial turn-key system, which, while expensive, is stable, and the complexity does not interfere with the experience of the user. In general, though, statements on ease of use and stability might be considered anecdotal and may not belong in a scientific article if unreferenced or unsupported by data.

      (2) One of the major limitations of the first generation LLSM was the use of a 5 mm coverslip, which was a hindrance for many users. However, the Zeiss system elegantly solves this problem, and so does Oblique Plane Microscopy (OPM), while the Altair-LSFM retains this feature, which may dissuade widespread adoption. This limitation and how it may be overcome in future iterations is not discussed.

      (3) Further, on the point of sample flexibility, all generations of the LLSM, and by the nature of its design, the OPM, can accommodate live-cell imaging with temperature, gas, and humidity control. It is unclear how this would be implemented with the current sample chamber. This limitation would severely limit use cases for cell biologists, for which this microscope is designed. There is no discussion on this limitation or how it may be overcome in future iterations.

      (4) The authors' comparison to LLSM is constrained to the "square" lattice, which, as they point out, is the most used optical lattice (though this also might be considered anecdotal). The LLSM original design, however, goes far beyond the square lattice, including hexagonal lattices, the ability to do structured illumination, and greater flexibility in general in terms of light-sheet tuning for different experimental needs, as well as not being limited to just sample scanning. Thus, the Altair-LSFM cannot compare to the original LLSM in terms of versatility, even if comparisons to the resolution provided by the square lattice are fair.

      (5) There is no demonstration of the system's live-imaging capabilities or temporal resolution, which is the main advantage of existing light-sheet systems.

      While the microscope is well designed and completely open source, it will require experience with optics, electronics, and microscopy to implement and align properly. Experience with custom machining or soliciting a machine shop is also necessary. Thus, in my opinion, it is unlikely to be implemented by a lab that has zero prior experience with custom optics or cannot hire someone who does. Altair-LSFM may not be as easily adaptable or implementable as the authors describe or perceive in any lab that is interested, even if they can afford it. The authors indicate they will offer "workshops," but this does not necessarily remove the barrier to entry, or lower it as significantly as the authors describe.

      There is a claim that this design is easily adaptable. However, the requirement of custom-machined baseplates and in silico optimization of the optical path basically means that each new instrument is a new design, even if the Navigate software can be used. It is unclear how Altair-LSFM demonstrates a modular design that reduces the time from conception to optimization compared to previous implementations.

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript introduces a high-resolution, open-source light-sheet fluorescence microscope optimized for sub-cellular imaging.

      The system is designed for ease of assembly and use, incorporating a custom-machined baseplate and in silico optimized optical paths to ensure robust alignment and performance. The authors demonstrate lateral and axial resolutions of ~235 nm and ~350 nm after deconvolution, enabling imaging of sub-diffraction structures in mammalian cells.

      The important feature of the microscope is the clever and elegant adaptation of simple Gaussian beams, smart beam shaping, galvo pivoting, and high-NA objectives to ensure a uniform thin light-sheet of around 400 nm in thickness over a 266-micron-wide field of view, pushing the axial resolution of the system beyond the regular diffraction-limited tradeoffs of light-sheet fluorescence microscopy.

      Compelling validation using fluorescent beads and multicolor cellular imaging highlights the system's performance and accessibility. Moreover, a very extensive and comprehensive manual of operation is provided in the form of supplementary materials. This provides a DIY blueprint for researchers who want to implement such a system.

      Strengths:

      (1) Strong and accessible technical innovation:

      With an elegant combination of beam shaping and optical modelling, the authors provide a high-resolution light-sheet system that overcomes the classical light-sheet tradeoff limit of a thin light-sheet and a small field of view. In addition, the integration of in silico modelling with a custom-machined baseplate is very practical and allows for ease of alignment procedures. Combining these features with the solid and super-extensive guide provided in the supplementary information, this provides a protocol for replicating the microscope in any other lab.

      (2) Impeccable optical performance and ease of mounting of samples:

      The system takes advantage of the same sample-holding method seen already in other implementations, but reduces the optical complexity. At the same time, the authors claim to achieve similar lateral and axial resolution to lattice light-sheet microscopy (although without a direct comparison; see the "Weaknesses" section below). The optical characterization of the system is comprehensive and well-detailed. Additionally, the authors validate the system by imaging sub-cellular structures in mammalian cells.

      (3) Transparency and comprehensiveness of documentation and resources:

      A very detailed protocol documents the setup, the optical modeling, and the total cost.

      Weaknesses:

      (1) Limited quantitative comparisons:

      Although some qualitative comparison with previously published systems (diSPIM, lattice light-sheet) is provided throughout the manuscript, some side-by-side comparison would be of great benefit for the manuscript, even in the form of a theoretical simulation. While having a direct imaging comparison would be ideal, it's understandable that this goes beyond the interest of the paper; however, a table referencing image quality parameters (taken from the literature), such as signal-to-noise ratio, light-sheet thickness, and resolutions, would really enhance the features of the setup presented. Moreover, based also on the necessity for optical simplification, an additional comment on the importance/difference of dual objective/single objective light-sheet systems could really benefit the discussion.

      (2) Limitation to a fixed sample:

      In the manuscript, there is no mention of incubation temperature, CO₂ regulation, humidity control, or possible integration of commercial environmental control systems. This is a major limitation for an imaging technique that owes its popularity to fast, volumetric, live-cell imaging of biological samples.

      (3) System cost and data storage cost:

      While the system presented has the advantage of being open-source, it remains relatively expensive (considering the 150k without laser source and optical table, for example). The manuscript could benefit from a more direct comparison of the performance/cost ratio of existing systems, considering academic settings with budgets that most of the time would not allow for expensive architectures. Moreover, it would also be beneficial to discuss the adaptability of the system in case a 30k objective is not feasible. Will this system work with different optics (with the obvious limitations that come with a lower-NA objective)? The adaptability of the system to lower budgets or more cost-effective choices, depending on the needs, could be an interesting point of discussion.

      Last, not much is said about the need for data storage. Light-sheet microscopy's bottleneck is the creation of increasingly large datasets, and it could be beneficial to discuss more about the storage needs and the quantity of data generated.
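
      To make the storage concern concrete, the raw data volume of a volumetric time-lapse can be estimated from the frame size, bit depth, number of planes, channels, and time points. The sketch below is only a back-of-the-envelope illustration: the 2048 × 2048 sensor, 16-bit depth, and acquisition settings are hypothetical values chosen for the example, not figures taken from the manuscript.

```python
def raw_dataset_bytes(width_px, height_px, bytes_per_px, planes, channels, timepoints):
    """Uncompressed size, in bytes, of a multi-channel volumetric time-lapse."""
    return width_px * height_px * bytes_per_px * planes * channels * timepoints


# Hypothetical acquisition (illustrative values only, not from the manuscript):
# 2048 x 2048 pixels, 16-bit (2 bytes), 200 planes, 2 channels, 100 time points.
size_bytes = raw_dataset_bytes(2048, 2048, 2, planes=200, channels=2, timepoints=100)
print(f"Raw dataset size: {size_bytes / 1e9:.1f} GB")
```

      Because the size scales linearly with every factor, doubling the number of time points or channels doubles the storage requirement, which is why long multicolor time-lapses dominate storage planning.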

      Conclusion:

      Altair-LSFM represents a well-engineered and accessible light-sheet system that addresses a longstanding need for high-resolution, reproducible, and affordable sub-cellular light-sheet imaging. While some aspects (comparative benchmarking and validation, limitation to fixed samples) would benefit from further development, the manuscript makes a compelling case for Altair-LSFM as a valuable contribution to the open microscopy scientific community.

    5. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents Altair-LSFM, a solid and well-documented implementation of a light-sheet fluorescence microscope (LSFM) designed for accessibility and cost reduction. While the approach offers strengths such as the use of custom-machined baseplates and detailed assembly instructions, its overall impact is limited by the lack of live-cell imaging capabilities and the absence of a clear, quantitative comparison to existing LSFM platforms. As such, although technically competent, the broader utility and uptake of this system by the community may be limited.

      We thank the editors and reviewers for their thoughtful evaluation of our work and for recognizing the technical strengths of the Altair-LSFM platform, including the custom-machined baseplates and detailed documentation provided to promote accessibility and reproducibility. Below, we provide point-by-point responses to each referee comment. In the process, we have significantly revised the manuscript to include live-cell imaging data and a quantitative evaluation of imaging speed. We now more explicitly describe the different variants of lattice light-sheet microscopy, highlighting differences in their illumination flexibility and image acquisition modes, and clarify how Altair-LSFM compares to each. We further discuss challenges associated with the 5 mm coverslip and propose practical strategies to overcome them. Additionally, we outline cost-reduction opportunities, explain the rationale behind key equipment selections, and provide guidance for implementing environmental control. Altogether, we believe these additions have strengthened the manuscript and clarified both the capabilities and limitations of Altair-LSFM.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      The article presents the details of the high-resolution light-sheet microscopy system developed by the group. In addition to presenting the technical details of the system, its resolution has been characterized and its functionality demonstrated by visualizing subcellular structures in a biological sample.

      Strengths: 

      (1) The article includes extensive supplementary material that complements the information in the main article.

      (2) However, in some sections, the information provided is somewhat superficial.

      We thank the reviewer for their thoughtful assessment and for recognizing the strengths of our manuscript, including the extensive supplementary material. Our goal was to make the supplemental content as comprehensive and useful as possible. In addition to the materials provided with the manuscript, our intention is for the online documentation (available at thedeanlab.github.io/altair) to serve as a living resource that evolves in response to user feedback. We would therefore greatly appreciate the reviewer’s guidance on which sections were perceived as superficial so that we can expand them to better support readers and builders of the system.

      Weaknesses:

      (1) Although a comparison is made with other light-sheet microscopy systems, the presented system does not represent a significant advance over existing systems. It uses high numerical aperture objectives and Gaussian beams, achieving resolution close to theoretical after deconvolution. The main advantage of the presented system is its ease of construction, thanks to the design of a perforated base plate.

      We appreciate the reviewer’s assessment and the opportunity to clarify our intent. Our primary goal was not to introduce new optical functionality beyond that of existing high-performance light-sheet systems, but rather to substantially reduce the barrier to entry for non-specialist laboratories. Many open-source implementations, such as OpenSPIM, OpenSPIN, and Benchtop mesoSPIM, similarly focused on accessibility and reproducibility rather than introducing new optical modalities, yet have had a measurable impact on the field by enabling broader community participation. Altair-LSFM follows this tradition, providing sub-cellular resolution performance comparable to advanced systems like LLSM, while emphasizing reproducibility, ease of construction through a precision-machined baseplate, and comprehensive documentation to facilitate dissemination and adoption.

      (2) Using similar objectives (Nikon 25x and Thorlabs 20x), the results obtained are similar to those of the LLSM system (using a Gaussian beam without laser modulation). However, the article does not mention the difficulties of mounting the sample in the implemented configuration.

      We appreciate the reviewer’s comment and agree that there are practical challenges associated with handling 5 mm diameter coverslips in this configuration. In the revised manuscript, we now explicitly describe these challenges and provide practical solutions. Specifically, we highlight the use of a custom-machined coverslip holder designed to simplify mounting and handling, and we direct readers to an alternative configuration using the Zeiss W Plan-Apochromat 20×/1.0 objective, which eliminates the need for small coverslips altogether.

      (3) The authors present a low-cost, open-source system. Although they provide open source code for the software (navigate), the use of proprietary electronics (ASI, NI, etc.) makes the system relatively expensive. Its low cost is not justified.

      We appreciate the reviewer’s perspective and understand the concern regarding the use of proprietary control hardware such as the ASI Tiger Controller and NI data acquisition cards. Our decision to use these components was intentional: relying on a unified, professionally supported and maintained platform minimizes complexity associated with sourcing, configuring, and integrating hardware from multiple vendors, thereby reducing non-financial barriers to entry for non-specialist users.

      Importantly, these components are not the primary cost driver of Altair-LSFM (they represent roughly 18% of the total system cost). Nonetheless, for individuals where the price is prohibitive, we also outline several viable cost-reduction options in the revised manuscript (e.g., substituting manual stages, omitting the filter wheel, or using industrial CMOS cameras), while discussing the trade-offs these substitutions introduce in performance and usability. These considerations are now summarized in Supplementary Note 1, which provides a transparent rationale for our design and cost decisions.
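
      As a rough illustration of what this share implies in absolute terms, the sketch below applies the stated fraction to the approximate total quoted by the reviewers. It assumes that the "roughly 18%" is taken relative to the ~USD 150K system cost; the response does not state the base explicitly, so the result is only an order-of-magnitude estimate.

```python
TOTAL_SYSTEM_COST_USD = 150_000  # approximate total quoted in the reviews (assumed base)
CONTROL_HARDWARE_SHARE = 0.18    # "roughly 18%" as stated in the response


def component_cost(total_usd, share):
    """Absolute cost of a component group given its fractional share of the total."""
    return total_usd * share


control_cost = component_cost(TOTAL_SYSTEM_COST_USD, CONTROL_HARDWARE_SHARE)
remaining_cost = TOTAL_SYSTEM_COST_USD - control_cost
print(f"Control hardware: ~USD {control_cost:,.0f}")
print(f"Everything else:  ~USD {remaining_cost:,.0f}")
```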

      Finally, we note that even with these professional-grade components, Altair-LSFM remains substantially less expensive than commercial systems offering comparable optical performance, such as LLSM implementations from Zeiss or 3i.

      (4) The fibroblast images provided are of exceptional quality. However, these are fixed samples. The system lacks the necessary elements for monitoring cells in vivo, such as temperature or pH control.

      We thank the reviewer for their positive comment regarding the quality of our data. As noted, the current manuscript focuses on validating the optical performance and resolution of the system using fixed specimens to ensure reproducibility and stability.

      We fully agree on the importance of environmental control for live-cell imaging. In the revised manuscript, we now describe in detail how temperature regulation can be achieved using a custom-designed heated sample chamber, accompanied by detailed assembly instructions on our GitHub repository and summarized in Supplementary Note 2. For pH stabilization in systems lacking a 5% CO₂ atmosphere, we recommend supplementing the imaging medium with 10–25 mM HEPES buffer. Additionally, we include new live-cell imaging data demonstrating that Altair-LSFM supports in vitro time-lapse imaging of dynamic cellular processes under controlled temperature conditions.

      Reviewer #2 (Public review): 

      Summary: 

      The authors present Altair-LSFM (Light Sheet Fluorescence Microscope), a high-resolution, open-source microscope, that is relatively easy to align and construct and achieves sub-cellular resolution. The authors developed this microscope to fill a perceived need that current open-source systems are primarily designed for large specimens and lack sub-cellular resolution or are difficult to construct and align, and are not stable. While commercial alternatives exist that offer sub-cellular resolution, they are expensive. The authors' manuscript centers around comparisons to the highly successful lattice light-sheet microscope, including the choice of detection and excitation objectives. The authors thus claim that there remains a critical need for high-resolution, economical, and easy-to-implement LSFM systems. 

      We thank the reviewer for their thoughtful summary. We agree that existing open-source systems primarily emphasize imaging of large specimens, whereas commercial systems that achieve sub-cellular resolution remain costly and complex. Our aim with Altair-LSFM was to bridge this gap—providing LLSM-level performance in a substantially more accessible and reproducible format. By combining high-NA optics with a precision-machined baseplate and open-source documentation, Altair offers a practical, high-resolution solution that can be readily adopted by non-specialist laboratories.

      Strengths: 

      The authors succeed in their goals of implementing a relatively low-cost (~USD 150K) open-source microscope that is easy to align. The ease of alignment rests on using custom-designed baseplates with dowel pins for precise positioning of optics based on computer analysis of opto-mechanical tolerances, as well as the optical path design. They simplify the excitation optics over lattice light-sheet microscopes by using a Gaussian beam for illumination while maintaining lateral and axial resolutions of 235 and 350 nm across a 260-µm field of view after deconvolution. In doing so, they rest on foundational principles of optical microscopy: lateral resolution is set by the numerical aperture of the detection objective and proper sampling of the image field onto the detector, and axial resolution depends on the thickness of the light-sheet when it is thinner than the depth of field of the detection objective. This concept has unfortunately not been completely clear to users of high-resolution light-sheet microscopes and is thus a valuable demonstration. The microscope is controlled by open-source software, Navigate, developed by the authors, and it is thus foreseeable that different versions of this system could be implemented depending on experimental needs while maintaining easy alignment and low cost. They demonstrate system performance successfully by characterizing their sheet, point-spread function, and visualization of sub-cellular structures in mammalian cells, including microtubules, actin filaments, nuclei, and the Golgi apparatus.

      We thank the reviewer for their thoughtful and generous assessment of our work. We are pleased that the manuscript’s emphasis on fundamental optical principles, design rationale, and practical implementation was clearly conveyed. We agree that Altair’s modular and accessible architecture provides a strong foundation for future variants tailored to specific experimental needs. To facilitate this, we have made all Zemax simulations, CAD files, and build documentation openly available on our GitHub repository, enabling users to adapt and extend the system for diverse imaging applications.
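
      The optical principles the reviewer summarizes can be illustrated with two textbook relations: the Rayleigh criterion for lateral resolution (0.61 λ / NA) and the paraxial Gaussian-beam link between sheet thickness and the distance over which the sheet stays thin. The sketch below is a minimal illustration, not the authors' Zemax model. The detection NA, wavelengths, and refractive index are assumed values chosen for the example, and the paraxial formula is only approximate for sheets this thin.

```python
import math


def lateral_resolution_nm(emission_nm, detection_na):
    """Rayleigh criterion: smallest resolvable lateral separation, 0.61 * lambda / NA."""
    return 0.61 * emission_nm / detection_na


def gaussian_sheet(fwhm_nm, wavelength_nm, refractive_index):
    """Paraxial Gaussian beam focused to a given intensity FWHM.

    Returns (waist_nm, confocal_nm): the 1/e^2 waist radius and the confocal
    parameter (twice the Rayleigh range), both in nm. wavelength_nm is the
    vacuum wavelength.
    """
    waist = fwhm_nm / math.sqrt(2 * math.log(2))  # FWHM = w0 * sqrt(2 ln 2)
    rayleigh = math.pi * waist**2 * refractive_index / wavelength_nm
    return waist, 2 * rayleigh


# Assumed example values (not from the manuscript): NA 1.1 detection,
# 520 nm emission, 488 nm excitation, water immersion (n = 1.33).
print(f"Lateral limit: {lateral_resolution_nm(520, 1.1):.0f} nm")
waist, confocal = gaussian_sheet(fwhm_nm=400, wavelength_nm=488, refractive_index=1.33)
print(f"Waist: {waist:.0f} nm, confocal parameter: {confocal / 1000:.2f} um")
```

      Because the Rayleigh range grows with the square of the waist, halving the sheet thickness shortens the usable propagation length fourfold, which is the classical tradeoff referred to throughout the reviews.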

      Weaknesses:

      There is a fixation on comparison to the first-generation lattice light-sheet microscope, which has evolved significantly since then:

      (1) The authors claim that commercial lattice light-sheet microscopes (LLSM) are "complex, expensive, and alignment intensive". I believe this sentence applies to the open-source version of LLSM, which was made available for wide dissemination. Since then, a commercial solution has been provided by 3i, which is now being used in multiple cores and labs but does require routine alignments. However, Zeiss has also released a commercial turn-key system, which, while expensive, is stable, and the complexity does not interfere with the experience of the user. In general, though, statements on ease of use and stability might be considered anecdotal and may not belong in a scientific article if unreferenced or without data.

      We thank the reviewer for this thoughtful and constructive comment. We have revised the manuscript to more clearly distinguish between the original open-source implementation of LLSM and subsequent commercial versions by 3i and ZEISS. The revised Introduction and Discussion now explicitly note that while open-source and early implementations of LLSM can require expert alignment and maintenance, commercial systems—particularly the ZEISS Lattice Lightsheet 7—are designed for automated operation and stable, turn-key use, albeit at higher cost and with limited modifiability. We have also moderated earlier language regarding usability and stability to avoid anecdotal phrasing.

      We also now provide a more objective proxy for system complexity: the number of optical elements that require precise alignment during assembly and maintenance thereafter. The original open-source LLSM setup includes approximately 29 optical components that must each be carefully positioned laterally, angularly, and coaxially along the optical path. In contrast, the first-generation Altair-LSFM system contains only nine such elements. By this metric, Altair-LSFM is considerably simpler to assemble and align, supporting our overarching goal of making high-resolution light-sheet imaging more accessible to non-specialist laboratories.

      (2) One of the major limitations of the first generation LLSM was the use of a 5 mm coverslip, which was a hindrance for many users. However, the Zeiss system elegantly solves this problem, and so does Oblique Plane Microscopy (OPM), while the Altair-LSFM retains this feature, which may dissuade widespread adoption. This limitation and how it may be overcome in future iterations is not discussed.

      We thank the reviewer for this helpful comment. We agree that the use of 5 mm diameter coverslips, while enabling high-NA imaging in the current Altair-LSFM configuration, may pose a practical limitation for some users. We now discuss this more explicitly in the revised manuscript. Specifically, we note that replacing the detection objective provides a straightforward solution to this constraint. For example, as demonstrated by Moore et al. (Lab Chip, 2021), pairing the Zeiss W Plan-Apochromat 20×/1.0 detection objective with the Thorlabs TL20X-MPL illumination objective allows imaging beyond the physical surfaces of both objectives, eliminating the need for small-format coverslips. In the revised text, we propose this modification as an accessible path toward greater compatibility with conventional sample mounting formats. We also note in the Discussion that Oblique Plane Microscopy (OPM) inherently avoids such nonstandard mounting requirements and, owing to its single-objective architecture, is fully compatible with standard environmental chambers.

      (3) Further, on the point of sample flexibility, all generations of the LLSM, and by the nature of its design, the OPM, can accommodate live-cell imaging with temperature, gas, and humidity control. It is unclear how this would be implemented with the current sample chamber. This limitation would severely limit use cases for cell biologists, for which this microscope is designed. There is no discussion on this limitation or how it may be overcome in future iterations.

      We thank the reviewer for this important observation and agree that environmental control is critical for live-cell imaging applications. It is worth noting that the original open-source LLSM design, as well as the commercial version developed by 3i, provided temperature regulation but did not include integrated control of CO₂ or humidity. Despite this limitation, these systems have been widely adopted and have generated significant biological insights. We also acknowledge that both OPM and the ZEISS implementation of LLSM offer clear advantages in this respect, providing compatibility with standard commercial environmental chambers that support full regulation of temperature, CO₂, and humidity.

      In the revised manuscript, we expand our discussion of environmental control in Supplementary Note 2, where we describe the Altair-LSFM chamber design in more detail and discuss its current implementation of temperature regulation and HEPES-based pH stabilization. Additionally, the Discussion now explicitly notes that OPM avoids the challenges associated with non-standard sample mounting and is inherently compatible with conventional environmental enclosures.

      (4) The authors' comparison to LLSM is constrained to the "square" lattice, which, as they point out, is the most used optical lattice (though this also might be considered anecdotal). The LLSM original design, however, goes far beyond the square lattice, including hexagonal lattices, the ability to do structured illumination, and greater flexibility in general in terms of light-sheet tuning for different experimental needs, as well as not being limited to just sample scanning. Thus, the Altair-LSFM cannot compare to the original LLSM in terms of versatility, even if comparisons to the resolution provided by the square lattice are fair.

      We agree that the original LLSM design offers substantially greater flexibility than what is reflected in our initial comparison, including the ability to generate multiple lattice geometries (e.g., square and hexagonal), operate in structured illumination mode, and acquire volumes using both sample- and light-sheet-scanning strategies. To address this, we now include Supplementary Note 3 that provides a detailed overview of the illumination modes and imaging flexibility afforded by the original LLSM implementation, and how these capabilities compare to both the commercial ZEISS Lattice Lightsheet 7 and our Altair-LSFM system. In addition, we have revised the discussion to explicitly acknowledge that the original LLSM could operate in alternative scan strategies beyond sample scanning, providing greater context for readers and ensuring a more balanced comparison.

      (5) There is no demonstration of the system's live-imaging capabilities or temporal resolution, which is the main advantage of existing light-sheet systems.

      In the revised manuscript, we now include a demonstration of live-cell imaging to directly validate Altair-LSFM’s suitability for dynamic biological applications. We also explicitly discuss the temporal resolution of the system in the main text (see Optoelectronic Design of Altair-LSFM), where we detail both software- and hardware-related limitations. Specifically, we evaluate the maximum imaging speed achievable with Altair-LSFM in conjunction with our open-source control software, navigate.

      For simplicity and reduced optoelectronic complexity, the current implementation powers the piezo through the ASI Tiger Controller, which modestly reduces its bandwidth. Nonetheless, for a 100 µm stroke typical of light-sheet imaging, we achieved sufficient performance to support volumetric imaging at most biologically relevant timescales. These results, along with additional discussion of the design trade-offs and performance considerations, are now included in the revised manuscript and expanded upon in the supplementary material.
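
      How scan range, plane spacing, and camera speed combine into a volumetric rate can be sketched with simple arithmetic. The example below is a hedged illustration, not the authors' measurement: only the 100 µm stroke comes from the response, while the 0.5 µm step, 100 frames-per-second camera rate, and flyback time are hypothetical values, and real rates also depend on piezo bandwidth and software overhead.

```python
import math


def planes_per_volume(stroke_um, step_um):
    """Number of image planes needed to cover a scan range at a fixed step size."""
    return math.floor(stroke_um / step_um) + 1


def volumes_per_second(camera_fps, planes, flyback_s=0.0):
    """Upper bound on volume rate: one frame per plane plus an optional flyback delay."""
    return 1.0 / (planes / camera_fps + flyback_s)


# 100 um stroke is from the response; step size, frame rate, and flyback are hypothetical.
planes = planes_per_volume(stroke_um=100, step_um=0.5)
rate = volumes_per_second(camera_fps=100, planes=planes, flyback_s=0.05)
print(f"{planes} planes per volume, at most {rate:.2f} volumes per second")
```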

      While the microscope is well designed and completely open source, it will require experience with optics, electronics, and microscopy to implement and align properly. Experience with custom machining or soliciting a machine shop is also necessary. Thus, in my opinion, it is unlikely to be implemented by a lab that has zero prior experience with custom optics and cannot hire someone who does. Altair-LSFM may not be as easily adaptable or implementable in any interested lab as the authors describe or perceive, even if the lab can afford it. The authors indicate they will offer "workshops," but this does not necessarily remove the barrier to entry, or lower it as significantly as the authors describe.

      We appreciate the reviewer’s perspective and agree that building any high-performance custom microscope—Altair-LSFM included—requires a basic understanding of (or willingness to learn) optics, electronics, and instrumentation. Such a barrier exists for all open-source microscopes, and our goal is not to eliminate this requirement entirely but to substantially reduce the technical and logistical challenges that typically accompany the construction of custom light-sheet systems.

      Importantly, no machining experience or in-house fabrication capabilities are required. Users can simply submit the provided CAD design files and specifications directly to commercial vendors for fabrication. We have made this process as straightforward as possible by supplying detailed build instructions, recommended materials, and vendor-ready files through our GitHub repository. Our dissemination strategy draws inspiration from other successful open-source projects such as mesoSPIM, which has seen widespread adoption—over 30 implementations worldwide—through a similar model of exhaustive documentation, open-source software, and community support via user meetings and workshops.

      We also recognize that documentation alone cannot fully replace hands-on experience. To further lower barriers to adoption, we are actively working with commercial vendors to streamline procurement and assembly, and Altair-LSFM is supported by a Biomedical Technology Development and Dissemination (BTDD) grant that provides resources for hosting workshops, offering real-time community support, and developing supplementary training materials.

      In the revised manuscript, we now expand the Discussion to explicitly acknowledge these implementation considerations and to outline our ongoing efforts to support a broad and diverse user base, ensuring that laboratories with varying levels of technical expertise can successfully adopt and maintain the Altair-LSFM platform.

      There is a claim that this design is easily adaptable. However, the requirement of custom-machined baseplates and in silico optimization of the optical path basically means that each new instrument is a new design, even if the Navigate software can be used. It is unclear how Altair-LSFM demonstrates a modular design that reduces the time from conception to optimization compared to previous implementations.

      We thank the reviewer for this insightful comment and agree that our original language regarding adaptability may have overstated the degree to which Altair-LSFM can be modified without prior experience. It was not our intention to imply that the system can be easily redesigned by users with limited technical background. Meaningful adaptations of the optical or mechanical design do require expertise in optical layout, optomechanical design, and alignment.

      That said, for laboratories with such expertise, we aim to facilitate modifications by providing comprehensive resources—including detailed Zemax simulations, complete CAD models, and alignment documentation. These materials are intended to reduce the development burden for expert users seeking to tailor the system to specific experimental requirements, without necessitating a complete re-optimization of the optical path from first principles.

      In the revised manuscript, we clarify this point and temper our language regarding adaptability to better reflect the realistic scope of customization. Specifically, we now state in the Discussion: “For expert users who wish to tailor the instrument, we also provide all Zemax illumination-path simulations and CAD files, along with step-by-step optimization protocols, enabling modification and re-optimization of the optical system as needed.” This revision ensures that readers clearly understand that Altair-LSFM is designed for reproducibility and straightforward assembly in its default configuration, while still offering the flexibility for modification by experienced users.

      Reviewer #3 (Public review):

      Summary: 

      This manuscript introduces a high-resolution, open-source light-sheet fluorescence microscope optimized for sub-cellular imaging. The system is designed for ease of assembly and use, incorporating a custom-machined baseplate and in silico optimized optical paths to ensure robust alignment and performance. The authors demonstrate lateral and axial resolutions of ~235 nm and ~350 nm after deconvolution, enabling imaging of sub-diffraction structures in mammalian cells. The important feature of the microscope is the clever and elegant adaptation of simple Gaussian beams, smart beam shaping, galvo pivoting, and high-NA objectives to ensure a uniform thin light-sheet of around 400 nm in thickness over a 266-micron-wide field of view, pushing the axial resolution of the system beyond the regular diffraction-limited tradeoffs of light-sheet fluorescence microscopy. Compelling validation using fluorescent beads and multicolor cellular imaging highlights the system's performance and accessibility. Moreover, a very extensive and comprehensive manual of operation is provided in the form of supplementary materials. This provides a DIY blueprint for researchers who want to implement such a system.

      We thank the reviewer for their thoughtful and positive assessment of our work. We appreciate their recognition of Altair-LSFM’s design and performance, including its ability to achieve high-resolution imaging throughout a 266-micron field of view. While Altair-LSFM approaches the practical limits of diffraction-limited performance, it does not exceed the fundamental diffraction limit; rather, it achieves near-theoretical resolution through careful optical optimization, beam shaping, and alignment. We are grateful for the reviewer’s acknowledgment of the accessibility and comprehensive documentation that make this system broadly implementable.

      Strengths:

      (1) Strong and accessible technical innovation: With an elegant combination of beam shaping and optical modelling, the authors provide a high-resolution light-sheet system that overcomes the classical light-sheet tradeoff limit of a thin light-sheet and a small field of view. In addition, the integration of in silico modelling with a custom-machined baseplate is very practical and allows for ease of alignment procedures. Combining these features with the solid and super-extensive guide provided in the supplementary information, this provides a protocol for replicating the microscope in any other lab.

      (2) Impeccable optical performance and ease of mounting of samples: The system takes advantage of the same sample-holding method seen already in other implementations, but reduces the optical complexity.

      At the same time, the authors claim to achieve similar lateral and axial resolution to lattice light-sheet microscopy (although without a direct comparison; see the "Weaknesses" section below). The optical characterization of the system is comprehensive and well-detailed. Additionally, the authors validate the system by imaging sub-cellular structures in mammalian cells.

      (3) Transparency and comprehensiveness of documentation and resources: A very detailed protocol documents the setup, the optical modeling, and the total cost.

      We thank the reviewer for their thoughtful and encouraging comments. We are pleased that the technical innovation, optical performance, and accessibility of Altair-LSFM were recognized. Our goal from the outset was to develop a diffraction-limited, high-resolution light-sheet system that balances optical performance with reproducibility and ease of implementation. We are also pleased that the use of precision-machined baseplates was recognized as a practical and effective strategy for achieving performance while maintaining ease of assembly.

      Weaknesses: 

      (1) Limited quantitative comparisons: Although some qualitative comparison with previously published systems (diSPIM, lattice light-sheet) is provided throughout the manuscript, some side-by-side comparison would be of great benefit for the manuscript, even in the form of a theoretical simulation. While having a direct imaging comparison would be ideal, it's understandable that this goes beyond the interest of the paper; however, a table referencing image quality parameters (taken from the literature), such as signal-to-noise ratio, light-sheet thickness, and resolutions, would really enhance the features of the setup presented. Moreover, based also on the necessity for optical simplification, an additional comment on the importance/difference of dual objective/single objective light-sheet systems could really benefit the discussion.

      In the revised manuscript, we have significantly expanded our discussion of different light-sheet systems to provide clearer quantitative and conceptual context for Altair-LSFM. These comparisons are based on values reported in the literature, as we do not have access to many of these instruments (e.g., DaXi, diSPIM, or commercial and open-source variants of LLSM), and a direct experimental comparison is beyond the scope of this work.

      We note that while quantitative parameters such as signal-to-noise ratio are important, they are highly sample-dependent and strongly influenced by imaging conditions, including fluorophore brightness, camera characteristics, and filter bandpass selection. For this reason, we limited our comparison to more general image-quality metrics—such as light-sheet thickness, resolution, and field of view—that can be reliably compared across systems.

      Finally, per the reviewer’s recommendation, we have added additional discussion clarifying the differences between dual-objective and single-objective light-sheet architectures, outlining their respective strengths, limitations, and suitability for different experimental contexts.

      (2) Limitation to a fixed sample: In the manuscript, there is no mention of incubation temperature, CO₂ regulation, humidity control, or possible integration of commercial environmental control systems. This is a major limitation for an imaging technique that owes its popularity to fast, volumetric, live-cell imaging of biological samples.

      We fully agree that environmental control is critical for live-cell imaging applications. In the revised manuscript, we now describe the design and implementation of a temperature-regulated sample chamber in Supplementary Note 2, which maintains stable imaging conditions through the use of integrated heating elements and thermocouples. This approach enables precise temperature control while minimizing thermal gradients and optical drift. For pH stabilization, we recommend the use of 10–25 mM HEPES in place of CO₂ regulation, consistent with established practice for most light-sheet systems, including the initial variant of LLSM. Although full humidity and CO₂ control are not readily implemented in dual-objective configurations, we note that single-objective designs such as OPM are inherently compatible with commercial environmental chambers and avoid these constraints. Together, these additions clarify how environmental control can be achieved within Altair-LSFM and situate its capabilities within the broader LSFM design space.

      (3) System cost and data storage cost: While the system presented has the advantage of being open-source, it remains relatively expensive (considering the 150k without laser source and optical table, for example). The manuscript could benefit from a more direct comparison of the performance/cost ratio of existing systems, considering academic settings with budgets that most of the time would not allow for expensive architectures. Moreover, it would also be beneficial to discuss the adaptability of the system, in case a 30k objective could not be feasible. Will this system work with different optics (with the obvious limitations coming with the lower NA objective)? This could be an interesting point of discussion. Adaptability of the system in case of lower budgets or more cost-effective choices, depending on the needs.

      We agree that cost considerations are critical for adoption in academic environments. We would also like to clarify that the quoted $150k includes the optical table and laser source. In the revised manuscript, Supplementary Note 1 now includes an expanded discussion of cost–performance trade-offs and potential paths for cost reduction.

      Last, not much is said about the need for data storage. Light-sheet microscopy's bottleneck is the creation of increasingly large datasets, and it could be beneficial to discuss more about the storage needs and the quantity of data generated.

      In the revised manuscript, we now include Supplementary Note 4, which provides a high-level discussion of data storage needs, approximate costs, and practical strategies for managing large datasets generated by light-sheet microscopy. This section offers general guidance, including file-format recommendations and cost considerations, but we note that actual costs will vary by institution and contractual agreements.
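
      To make the scale of the storage problem concrete, the following is a minimal back-of-the-envelope sketch of uncompressed dataset size. The frame size, plane count, channel count, and number of time points are hypothetical illustration values, not figures taken from the manuscript or from Supplementary Note 4.

```python
def dataset_size_bytes(nx, ny, nz, channels, timepoints, bytes_per_voxel=2):
    """Uncompressed size in bytes of a 5D acquisition (x, y, z, channel, time)."""
    return nx * ny * nz * channels * timepoints * bytes_per_voxel


# Hypothetical acquisition (assumed values): 2048 x 2048 pixel frames,
# 200 planes per volume, 2 colour channels, 100 time points, 16-bit pixels.
size = dataset_size_bytes(2048, 2048, 200, 2, 100)
print(f"{size / 1e9:.1f} GB uncompressed")
```

      Under these assumed settings a single time-lapse already reaches hundreds of gigabytes before compression, which is why file format and chunking choices tend to matter in practice.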

      Conclusion:

      Altair-LSFM represents a well-engineered and accessible light-sheet system that addresses a longstanding need for high-resolution, reproducible, and affordable sub-cellular light-sheet imaging. While some aspects (comparative benchmarking and validation, limitation to fixed samples) would benefit from further development, the manuscript makes a compelling case for Altair-LSFM as a valuable contribution to the open microscopy scientific community.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) A picture, or full CAD design of the complete instrument, should be included as a main figure.

      A complete CAD rendering of the microscope is now provided in Supplementary Figure 4.

      (2) There is no quantitative comparison of the effects of the tilting resonant galvo; only a cartoon, a figure should be included.

      The cartoon was intended purely as an educational illustration to conceptually explain the role of the tilting resonant galvo in shaping and homogenizing the light sheet. To clarify this intent, we have revised both the figure legend and corresponding text in the main manuscript. For readers seeking quantitative comparisons, we now reference the original study that provides a detailed analysis of this optical approach, as well as a review on the subject.

      (3) Description of L4 is missing in the Figure 1 caption.

      Thank you for catching this omission. We have corrected it.

      (4) The beam profiles in Figures 1c and 3a, please crop and make the image bigger so the profile can be appreciated. The PSFs in Figure 3c-e should similarly be enlarged and presented using a dynamic range/LUT such that any aberrations can be appreciated.

      In Figure 1c, our goal was to qualitatively illustrate the uniformity of the light-sheet across the full field of view, while Figure 1d provided the corresponding quantitative cross-section. To improve clarity, we have added an additional figure panel offering a higher-magnification, localized view of the light-sheet profile. For Figure 3c–e, we have enlarged the PSF images and adjusted the display range to better convey the underlying signal and allow subtle aberrations to be appreciated.

      (5) It is unclear why LLSM is being used as the gold standard, since in its current commercial form, available from Zeiss, it is a turn-key system designed for core facilities. The original LLSM is also a versatile instrument that provides much more than the square lattice for illumination, including structured illumination, hexagonal lattices, live-cell imaging, wide-field illumination, different scan modes, etc. These additional features are not even mentioned when compared to the Altair-LSFM. If a comparison is to be provided, it should be fair and balanced. Furthermore, as outlined in the public review, anecdotal statements on "most used", "difficult to align", or "unstable" should not be provided without data.

      In the revised manuscript, we have carefully removed anecdotal statements and, where appropriate, replaced them with quantitative or verifiable information. For instance, we now explicitly report that the square lattice was used in 16 of the 20 figure subpanels in the original LLSM publication, and we include a proxy for optical complexity based on the number of optical elements requiring alignment in each system.

      We also now clearly distinguish between the original LLSM design—which supports multiple illumination and scanning modes—and its subsequent commercial variants, including the ZEISS Lattice Lightsheet 7, which prioritizes stability and ease of use over configurational flexibility (see Supplementary Note 3).

      (6) The authors should recognize that implementing custom optics, no matter how well designed, is a big barrier to cross for most cell biology labs.

      We fully understand and now acknowledge in the main text that implementing custom optics can present a significant barrier, particularly for laboratories without prior experience in optical system assembly. However, similar challenges were encountered during the adoption of other open-source microscopy platforms, such as mesoSPIM and OpenSPIM, both of which have nonetheless achieved widespread implementation. Their success has largely been driven by exhaustive documentation, strong community support, and standardized design principles—approaches we have also prioritized in Altair-LSFM. We have therefore made all CAD files, alignment guides, and detailed build documentation publicly available and continue to develop instructional materials and community resources to further reduce the barrier to adoption.

      (7) Statements on "hands on workshops", though laudable, may not be appropriate to include in a scientific publication without some documentation on the influence they have had on implementing the microscope.

      We understand the concern. Our intention in mentioning hands-on workshops was to convey that the dissemination effort is supported by an NIH Biomedical Technology Development and Dissemination grant, which includes dedicated channels for outreach and community engagement. Nonetheless, we agree that such statements are not appropriate without formal documentation of their impact, and we have therefore removed this text from the revised manuscript.

      (8) It is claimed that the microscope is "reliable" in the discussion, but with no proof, long-term stability should be assessed and included.

      Although our experience with Altair-LSFM has been that it remains well-aligned over time, especially in comparison to other light-sheet systems we have worked on over the last 11 years, we acknowledge that this assessment is anecdotal. As such, we have omitted this claim from the revised manuscript.

      (9) Due to the reliance on anecdotal statements and comparisons without proof to other systems, this paper at times reads like a brochure rather than a scientific publication. The authors should consider editing their manuscript accordingly to focus on the technical and quantifiable aspects of their work.

      We agree with the reviewer’s assessment and have revised the manuscript to remove anecdotal comparisons and subjective language. Where possible, we now provide quantitative metrics or verifiable data to support our statements.

      Reviewer #3 (Recommendations for the authors):

      Other minor points that could improve the manuscript (although some of these points are explained in the huge supplementary manual): 

      (1) The authors explain thoroughly their design, and they chose a sample-scanning method. I think that a brief discussion of the advantages and disadvantages of such a method over, for example, a laser-scanning system (with fixed sample) in the main text will be highly beneficial for the users.

      In the revised manuscript, we now include a brief discussion in the main text outlining the advantages and limitations of a sample-scanning approach relative to a light-sheet–scanning system. Specifically, we note that for thin, adherent specimens, sample scanning minimizes the optical path length through the sample, allowing the use of more tightly focused illumination beams that improve axial resolution. We also include a new supplementary figure illustrating how this configuration reduces the propagation length of the illumination light sheet, thereby enhancing axial resolution.

      (2) The authors justify selecting a 0.6 NA illumination objective over alternatives (e.g., Special Optics), but the manuscript would benefit from a more quantitative trade-off analysis (beam waist, working distance, sample compatibility) with other possibilities. Within the objective context, a comparison of the performances of this system with the new and upcoming single-objective light-sheet methods (and the ones based also on optical refocusing, e.g., DaXi) would be very interesting and would strengthen the manuscript.

      In the revised manuscript, we now provide a quantitative trade-off analysis of the illumination objectives in Supplementary Note 1, including comparisons of beam waist, working distance, and sample compatibility. This section also presents calculated point spread functions for both the 0.6 NA and 0.67 NA objectives, outlining the performance trade-offs that informed our design choice. In addition, Supplementary Note 3 now includes a broader comparison of Altair-LSFM with other light-sheet modalities, including diSPIM, ASLM, and OPM, to further contextualize the system’s capabilities within the evolving light-sheet microscopy landscape.

      (3) The modularity of the system is implied in the context of the manuscript, but not fully explained. The authors should specify more clearly, for example, if cameras could be easily changed, objectives could be easily swapped, light-sheet thickness could be tuned by changing the cylindrical lens, how users might adapt the system for different samples (e.g., embryos, cleared tissue, live imaging), etc., and discuss any constraints or compatibility issues with these implementations.

      Altair-LSFM was explicitly designed and optimized for imaging live adherent cells, where sample scanning and short light-sheet propagation lengths provide optimal axial resolution (Supplementary Note 3). While the same platform could be used for superficial imaging in embryos, systems implementing multiview illumination and detection schemes are better suited for such specimens. Similarly, cleared tissue imaging typically requires specialized solvent-compatible objectives and approaches such as ASLM that maximize the field of view. We have now added text to the Design Principles section that explicitly states this.

      Altair-LSFM offers varying levels of modularity depending on the user’s level of expertise. For entry-level users, the illumination numerical aperture—and therefore the light-sheet thickness and propagation length—can be readily adjusted by tuning the rectangular aperture conjugate to the back pupil of the illumination objective, as described in the Design Principles section. For mid-level users, alternative configurations of Altair-LSFM, including different detection objectives, stages, filter wheels, or cameras, can be readily implemented (Supplementary Note 1). Importantly, navigate natively supports a broad range of hardware devices, and new components can be easily integrated through its modular interface. For expert users, all Zemax simulations, CAD models, and step-by-step optimization protocols are openly provided, enabling complete re-optimization of the optical design to meet specific experimental requirements.

      (4) Resolution measurements before and after deconvolution are central to the performance claim, but the deconvolution method (PetaKit5D) is only briefly mentioned in the main text, it's not referenced, and has to be clarified in more detail, coherently with the precision of the supplementary information. More specifically, PetaKit5D should be referenced in the main text, the details of the deconvolution parameters discussed in the Methods section, and the computational requirements should also be mentioned. 

      In the revised manuscript, we now provide a dedicated description of the deconvolution process in the Methods section, including the specific parameters and algorithms used. We have also explicitly referenced PetaKit5D in the main text to ensure proper attribution and clarity. Additionally, we note the computational requirements associated with this analysis in the same section for completeness.

      (5)  Image post-processing is not fully explained in the main text. Since the system is sample-scanning based, no word in the main text is spent on deskewing, which is an integral part of the post-processing to obtain a "straight" 3D stack. Since other systems implement such a post-processing algorithm (for example, single-objective architectures), it would be beneficial to have some discussion about this, and also a brief comparison to other systems in the main text in the methods section. 

      In the revised manuscript, we now explicitly describe both deskewing (shearing) and deconvolution procedures in the Alignment and Characterization section of the main text and direct readers to the Methods section. We also briefly explain why the data must be sheared to correct for the angled sample-scanning geometry for LLSM and Altair-LSFM, as well as for both sample-scanning and laser-scanning variants of OPM.
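
      For readers unfamiliar with the operation, deskewing can be sketched as a per-plane lateral shift that undoes the offset introduced by scanning the sample at an angle to the detection axis. The snippet below is a minimal illustration only: the function name, the (z, y, x) axis order, the cosine geometry, and the nearest-pixel rounding are all assumptions made for this example, and it is not the procedure implemented in PetaKit5D or navigate, which typically interpolate rather than round.

```python
import numpy as np


def deskew_stack(stack, scan_step_um, pixel_size_um, angle_deg):
    """Shear a (z, y, x) stack so that successive planes line up in space.

    Each plane is shifted along x by the lateral distance the sample moved
    between exposures, rounded to the nearest whole pixel. Assumes an angle
    between 0 and 90 degrees so that all shifts are non-negative.
    """
    # Lateral displacement per plane in pixels (assumed geometry).
    shift_per_plane = scan_step_um * np.cos(np.deg2rad(angle_deg)) / pixel_size_um
    nz, ny, nx = stack.shape
    shifts = np.rint(np.arange(nz) * shift_per_plane).astype(int)
    # Pad the x axis so the most-shifted plane still fits.
    out = np.zeros((nz, ny, nx + int(shifts.max())), dtype=stack.dtype)
    for z, s in enumerate(shifts):
        out[z, :, s:s + nx] = stack[z]
    return out


# Hypothetical example: 3 planes, 1.0 um scan step, 0.5 um pixels, 60 degrees.
raw = np.ones((3, 2, 4), dtype=np.uint16)
deskewed = deskew_stack(raw, scan_step_um=1.0, pixel_size_um=0.5, angle_deg=60.0)
print(deskewed.shape)
```

      The padded regions are left at zero here; a production pipeline would normally resample with sub-pixel interpolation and may also rotate the volume into the coverslip frame.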

      (6) A brief discussion on comparative costs with other systems (LLSM, diSPIM, etc.) could be helpful for non-imaging expert researchers who could try to implement such an optical architecture in their lab.

      Unfortunately, the exact costs of commercial systems such as LLSM or diSPIM are typically not publicly available, as they depend on institutional agreements and vendor-specific quotations. Nonetheless, we now provide approximate cost estimates in Supplementary Note 1 to help readers and prospective users gauge the expected scale of investment relative to other advanced light-sheet microscopy systems.

      (7) The "navigate" control software is provided, but a brief discussion on its advantages compared to an already open-access system, such as Micro-Manager, could be useful for the users.

      In the revised manuscript, we now include Supplementary Note 5 that discusses the advantages and disadvantages of different open-source microscope control platforms, including navigate and Micro-Manager. In brief, navigate was designed to provide turnkey support for multiple light-sheet architectures, with pre-configured acquisition routines optimized for Altair-LSFM, integrated data management with support for multiple file formats (TIFF, HDF5, N5, and Zarr), and full interoperability with OME-compliant workflows. By contrast, while Micro-Manager offers a broader library of hardware drivers, it typically requires manual configuration and custom scripting for advanced light-sheet imaging workflows.

      (8) The cost and parts are well documented, but the time and expertise required are not crystal clear. Adding a simple time estimate (perhaps in the Supplement Section) of assembly/alignment/installation/validation and first imaging will be very beneficial for users. Also, what level of expertise is assumed (prior optics experience, for example) to be needed to install a system like this? This can help non-optics-expert users to better understand what kind of adventure they are putting themselves through.

      We thank the reviewer for this helpful suggestion. To address this, we have added Supplementary Table S5, which provides approximate time estimates for assembly, alignment, validation, and first imaging based on the user’s prior experience with optical systems. The table distinguishes between novice (no prior experience), moderate (some experience using but not assembling optical systems), and expert (experienced in building and aligning optical systems) users. This addition is intended to give prospective builders a realistic sense of the time commitment and level of expertise required to assemble and validate Altair-LSFM.

      Minor things in the main text:

      (1) Line 109: The cost is considered "excluding the laser source". But then in the table of costs, you mention L4cc as a "multicolor laser source", for 25 K. Can you explain this better? Are the costs correct with or without the laser source? 

      We acknowledge that the statement in line 109 was incorrect—the quoted ~$150k system cost does include the laser source (L4cc, listed at $25k in the cost table). We have corrected this in the revised manuscript.

      (2) Line 113: You say "lateral resolution", but then you state a 3D resolution (230 nm x 230 nm x 370 nm). This needs to be fixed.

      Thank you, we have corrected this.

      (3) Line 138: Is the light-sheet uniformity proven also with a fluorescent dye? This could be beneficial for the main text, showing the performance of the instrument in a fluorescent environment.

      The light-sheet profiles shown in the manuscript were acquired using fluorescein to visualize the beam. We have revised the main text and figure legends to clearly state this.

      (4) Line 149: This is one of the most important features of the system, defying the usual tradeoff between light-sheet thickness and field of view, with a regular Gaussian beam. I would clarify more specifically how you achieve this because this really is the most powerful takeaway of the paper.

      We thank the reviewer for this key observation. The ability of Altair-LSFM to maintain a thin light sheet across a large field of view arises from diffraction effects inherent to high NA illumination. Specifically, diffraction elongates the PSF along the beam’s propagation direction, effectively extending the region over which the light sheet remains sufficiently thin for high-resolution imaging. This phenomenon, which has been the subject of active discussion within the light-sheet microscopy community, allows Altair-LSFM to partially overcome the conventional trade-off between light-sheet thickness and propagation length. We now clarify this point in the main text and provide a more detailed discussion in Supplementary Note 3, which is explicitly referenced in the discussion of the revised manuscript.
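
      For context, the conventional trade-off referred to here can be illustrated with the textbook paraxial Gaussian-beam relations, in which the Rayleigh length scales with the square of the beam waist. The sketch below is an assumption-laden illustration: the 488 nm wavelength, the refractive index of 1.33, and the listed waists are example values rather than Altair-LSFM parameters, and the paraxial formula is not expected to hold quantitatively at high illumination NA, which is precisely the regime discussed in this response.

```python
import math


def rayleigh_length_um(waist_um, wavelength_um, refractive_index=1.0):
    """Rayleigh length z_R = pi * w0^2 * n / lambda of a paraxial Gaussian beam."""
    return math.pi * waist_um ** 2 * refractive_index / wavelength_um


def fwhm_from_waist_um(waist_um):
    """Intensity FWHM at focus for a 1/e^2 radius w0: FWHM = w0 * sqrt(2 ln 2)."""
    return waist_um * math.sqrt(2.0 * math.log(2.0))


# Assumed example values: 488 nm excitation in water (n = 1.33).
for w0 in (0.35, 0.7, 1.4):
    z_r = rayleigh_length_um(w0, wavelength_um=0.488, refractive_index=1.33)
    print(f"w0 = {w0:.2f} um  FWHM = {fwhm_from_waist_um(w0):.2f} um  "
          f"2*z_R = {2 * z_r:.1f} um")
```

      Halving the waist quarters the paraxial propagation length, which is the scaling a thin, long light sheet would have to work around.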

      (5) Line 171: You talk about repeatable assembly...have you tried many different baseplates? Otherwise, this is a complicated statement, since this is a proof-of-concept paper. 

      We thank the reviewer for this comment. We have not yet validated the design across multiple independently assembled baseplates and therefore agree that our previous statement regarding repeatable assembly was premature. To avoid overstating the current level of validation, we have removed this statement from the revised manuscript.

      (6) Line 187: same as above. You mention "long-term stability". For how long did you try this? This should be specified in numbers (days, weeks, months, years?) Otherwise, it is a complicated statement to make, since this is a proof-of-concept paper.

      We also agree that referencing long-term stability without quantitative backing is inappropriate, and have removed this statement from the revised manuscript.

      (7) Line 198: "rapid z-stack acquisition". How rapid? Also, what is the limitation of the galvo-scanning in terms of the imaging speed of the system? This should be noted in the methods section.

      In the revised manuscript, we now clarify these points in the Optoelectronic Design section. Specifically, we explicitly note that the resonant galvo used for shadow reduction operates at 4 kHz, ensuring that it is not rate-limiting for any imaging mode. In the same section, we also evaluate the maximum acquisition speeds achievable using navigate and report the theoretical bandwidth of the sample-scanning piezo, which together define the practical limits of volumetric acquisition speed for Altair-LSFM.

      (8) Line 234: PetaKit5D is discussed in the additional documentation, but should be referenced here, as well.

      We now reference and cite PetaKit5D.

      (9) Line 256: "values are on par with LLSM", but no values are provided. Some details should also be provided in the main text.

      In the revised manuscript, we now provide the lateral and axial resolution values originally reported for LLSM in the main text to facilitate direct comparison with Altair-LSFM. Additionally, Supplementary Note 3 now includes an expanded discussion on the nuances of resolution measurement and reporting in light-sheet microscopy.

      Figures:

      (1) Figure 1 could be implemented with Figure 3. They're both discussing the validation of the system (theoretically and with simulations), and they could be together in different panels of the same figure. The experimental light-sheet seems to be shown in a transmission mode. Showing a pattern in a fluorescent dye could also be beneficial for the paper.

      In Figure 1, our goal was to guide readers through the design process—illustrating how the detection objective’s NA sets the system’s resolution, which defines the required pixel size for Nyquist sampling and, in turn, the field of view. We then use Figure 1b–c to show how the illumination beam was designed and simulated to achieve that field of view. In contrast, Figure 3 presents the experimental validation of the illumination system. To avoid confusion, we now clarify in the text that the light sheet shown in Figure 3 was visualized in a fluorescein solution and imaged in transmission mode. While we agree that Figures 1 and 3 both serve to validate the system, we prefer to keep them as separate figures to maintain focus within each panel. We believe this organization better supports the narrative structure and allows readers to digest the theoretical and experimental validations independently.
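
      The design chain described for Figure 1 (detection NA sets resolution, resolution sets the Nyquist pixel size, and pixel size times sensor width sets the field of view) can be written out as a short calculation. This is a sketch under stated assumptions: it uses the Abbe criterion, and the 520 nm emission wavelength, NA of 1.1, and 2048-pixel sensor width are hypothetical inputs chosen for illustration, not values quoted from the manuscript.

```python
def abbe_resolution_nm(wavelength_nm, numerical_aperture):
    """Abbe lateral resolution limit, lambda / (2 * NA)."""
    return wavelength_nm / (2.0 * numerical_aperture)


def nyquist_pixel_size_nm(resolution_nm):
    """Largest sample-space pixel that still samples the resolution twice."""
    return resolution_nm / 2.0


def field_of_view_um(pixel_size_nm, n_pixels):
    """Field of view along one camera axis, in micrometres."""
    return pixel_size_nm * n_pixels / 1000.0


# Assumed inputs: 520 nm emission, NA 1.1 detection, 2048-pixel-wide sensor.
resolution = abbe_resolution_nm(520.0, 1.1)
pixel = nyquist_pixel_size_nm(resolution)
fov = field_of_view_um(pixel, 2048)
print(f"resolution ~{resolution:.0f} nm, pixel ~{pixel:.0f} nm, FOV ~{fov:.0f} um")
```

      Other resolution criteria (for example Rayleigh, 0.61 λ/NA) shift the numbers somewhat but leave the chain of dependencies unchanged.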

      (2) Figure 3: Panels d and e show the same thing. Why would you expect that xz and yz profiles should be different? Is this due to the orientation of the objectives towards the sample?

      In Figure 3, we present the PSF from all three orthogonal views, as this provides the most transparent assessment of PSF quality—certain aberration modes can be obscured when only select perspectives are shown. In principle, the XZ and YZ projections should be equivalent in a well-aligned system. However, as seen in the XZ projection, a small degree of coma is present that is not evident in the YZ view. We now explicitly note this observation in the revised figure caption to clarify the difference between these panels.

      (3) Figure 4's single boxes lack a scale bar, and some of the Supplementary Figures (e.g. Figure 5) lack detailed axis labels or scale bars. Also, in the detailed documentation, some figures are referred to as Figure 5. Figure 7 or, for example, figure 6. Figure 8, and this makes the cross-references very complicated to follow.

      In the revised manuscript, we have corrected these issues. All figures and supplementary figures now include appropriate scale bars, axis labels, and consistent formatting. We have also carefully reviewed and standardized all cross-references throughout the main text and supplementary documentation to ensure that figure numbering is accurate and easy to follow.

    1. eLife Assessment

      This is an important account of replay as recency-weighted context-guided memory reactivation that explains a number of empirical findings across human and rodent memory literatures. The evidence is compelling and the work is likely to inspire further adaptations to incorporate additional biological and cognitive features.

    2. Reviewer #1 (Public review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhances its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors will see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently. Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

    3. Reviewer #3 (Public review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency and contiguity. Unlike its predecessors, CMR-replay has built-in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's item-context associations.

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backwards replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory building in the field.

      The authors addressed my concerns with respect to adding methodological detail. I am satisfied with the changes.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhanced its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors would see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      We appreciate the encouragement to discuss this connection. Our framework can accommodate semantic associations as determinants of sleep-dependent consolidation, which can in principle outweigh temporal associations. Indeed, prior models in this lineage have extensively simulated how semantic associations support encoding and retrieval alongside temporal associations. It would therefore be straightforward to extend our model to simulate how semantic associations guide sleep benefits, and to compare their contribution against that conferred by temporal associations across different experimental paradigms. In the revised manuscript, we have added a discussion of how our framework may simulate the role of semantic associations in sleep-dependent consolidation.

      “Several recent studies have argued for dominance of semantic associations over temporal associations in the process of human sleep-dependent consolidation (Schechtman et al., 2023; Liu and Ranganath 2021; Sherman et al., 2025), with one study observing no role at all for temporal associations (Schechtman et al., 2023). At first glance, these findings appear in tension with our model, where temporal associations drive offline consolidation. Indeed, prior models have accounted for these findings by suppressing temporal context during sleep (Liu and Ranganath 2024; Sherman et al., 2025). However, earlier models in the CMR lineage have successfully captured the joint contributions of semantic and temporal associations to encoding and retrieval (Polyn et al., 2009), and these processes could extend naturally to offline replay. In a paradigm where semantic associations are especially salient during awake learning, the model could weight these associations more and account for greater co-reactivation and sleep-dependent memory benefits for semantically related than temporally related items. Consistent with this idea, Schechtman et al. (2023) speculated that their null temporal effects likely reflected the task’s emphasis on semantic associations. When temporal associations are more salient and task-relevant, sleep-related benefits for temporally contiguous items are more likely to emerge (e.g., Drosopoulos et al., 2007; King et al., 2017).”

      The reviewer’s comment points to fruitful directions for future work that could employ our framework to dissect the relative contributions of semantic and temporal associations to memory consolidation.

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently. Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      We appreciate the opportunity to clarify this aspect of the model. We first note that this mechanism has long been a fundamental component of this class of models (Howard & Kahana 2002). Many classic memory models (Brown et al., 2000; Burgess & Hitch, 1991; Lewandowsky & Murdock 1989) incorporate response suppression, in which activated items are temporarily inhibited. The simplest implementation, which we use here, removes activated items from the pool of candidate items. Alternative implementations achieve this through transient inhibition, often conceptualized as neuronal fatigue (Burgess & Hitch, 1991; Grossberg 1978). Our model adopts a similar perspective, interpreting this mechanism as mimicking a brief refractory period that renders reactivated neurons unlikely to fire again within a short physiological event such as a sharp-wave ripple. Importantly, this approach does not generate spurious sequences. Instead, the model’s ability to preserve the structure of wake experience during replay depends entirely on the learned associations between items (without these associations, item order would be random). Similar assumptions are also common in models of replay. For example, reinforcement learning models of replay incorporate mechanisms such as inhibition to prevent repeated reactivations (e.g., Diekmann & Cheng, 2023) or prioritize reactivation based on ranking to limit items to a single replay (e.g., Mattar & Daw, 2018). We now discuss these points in the section titled “A context model of memory replay”:

      “This mechanism of sampling without replacement, akin to response suppression in established context memory models (Howard & Kahana 2002), could be implemented by neuronal fatigue or refractory dynamics (Burgess & Hitch, 1991; Grossberg 1978). Non-repetition during reactivation is also a common assumption in replay models that regulate reactivation through inhibition or prioritization (Diekmann & Cheng 2023; Mattar & Daw 2018; Singh et al., 2022).”
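
      The claim that sampling without replacement does not by itself create ordered replay can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes transitions are driven directly by a hypothetical item-to-item association matrix (`assoc`) rather than by a drifting context, it omits any stopping rule, and the function names, the association value of 3.0, and the temperature are all illustrative choices.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert raw scores into choice probabilities."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def replay_event(assoc, start, rng, temperature=1.0):
    """Sample one replay sequence; each item can be reactivated only once.

    assoc[i][j] is a hypothetical learned association from item i to item j.
    """
    sequence = [start]
    remaining = [j for j in range(len(assoc)) if j != start]
    current = start
    while remaining:
        probs = softmax([assoc[current][j] for j in remaining], temperature)
        nxt = rng.choices(remaining, weights=probs, k=1)[0]
        sequence.append(nxt)
        remaining.remove(nxt)  # response suppression: drop the item from the pool
        current = nxt
    return sequence

def forward_fraction(assoc, n_events=2000, seed=0):
    """Fraction of events starting at item 0 that replay 0, 1, 2, ... in order."""
    rng = random.Random(seed)
    target = list(range(len(assoc)))
    hits = sum(replay_event(assoc, 0, rng) == target for _ in range(n_events))
    return hits / n_events

n_items = 4
# Chain-like associations, as if items had been experienced in the order 0 -> 1 -> 2 -> 3.
learned = [[3.0 if j == i + 1 else 0.0 for j in range(n_items)] for i in range(n_items)]
# No learned structure: every transition is equally likely.
flat = [[0.0] * n_items for _ in range(n_items)]

print("ordered replay with learned associations:", forward_fraction(learned))
print("ordered replay without associations:", forward_fraction(flat))
```

      Under these assumptions, the non-repetition rule is identical in both conditions, so any difference in how often the wake order is reproduced comes from the association matrix alone.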

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      We agree that these mechanisms and their relationships would benefit from clarification. As noted, novelty influences learning through two distinct mechanisms. First, the suppression mechanism is essential for capturing the inverse relationship between the amount of wake experience and the frequency of replay, as observed in several studies. This mechanism ensures that items with high wake activity are less likely to dominate replay. Second, the decrease in learning rates with repetition is crucial for preserving the stochasticity of replay. Without this mechanism, the model would increase weights linearly, leading to an exponential increase in the probability of successive wake items being reactivated back-to-back due to the use of a softmax choice rule. This would result in deterministic replay patterns, which are inconsistent with experimental observations.

      We have revised the Methods section to explicitly distinguish these two mechanisms:

      “This experience-dependent suppression mechanism is distinct from the reduction of learning rates through repetition; it does not modulate the update of memory associations but exclusively governs which items are most likely to initiate replay.”

      We have also clarified our rationale for including a learning rate reduction mechanism:

      “The reduction in learning rates with repetition is important for maintaining a degree of stochasticity in the model’s replay during task repetition, since linearly increasing weights would, through the softmax choice rule, exponentially amplify differences in item reactivation probabilities, sharply reducing variability in replay.”
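
      The arithmetic behind this rationale can be sketched as follows. The sketch assumes one favoured item competing with three equally weighted rivals under a softmax rule, and a geometric decay of the learning rate across sessions; the decay schedule, the parameter values, and the function names are illustrative assumptions, not the schedule used in the manuscript.

```python
import math

def top_choice_probability(weight_gap, n_competitors=3):
    """Softmax probability of the favoured item when its weight exceeds
    each of n_competitors rivals by weight_gap."""
    return math.exp(weight_gap) / (math.exp(weight_gap) + n_competitors)

def weight_after_sessions(n_sessions, base_rate=1.0, decay=1.0):
    """Cumulative weight after repeated sessions.

    decay=1.0 keeps the learning rate constant (linear weight growth);
    decay<1.0 shrinks the rate each session (assumed geometric schedule).
    """
    return sum(base_rate * decay ** s for s in range(n_sessions))

for sessions in (1, 3, 6):
    linear = weight_after_sessions(sessions)
    reduced = weight_after_sessions(sessions, decay=0.5)
    print(
        sessions,
        round(top_choice_probability(linear), 3),
        round(top_choice_probability(reduced), 3),
    )
```

      Because the softmax odds of the favoured item scale as the exponential of the weight gap, a linearly growing gap pushes the choice toward determinism within a few sessions, whereas a shrinking learning rate bounds the gap and leaves room for variability.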

      Finally, we now specify exactly where the learning-rate reduction is applied, namely in simulations where sequences are repeated across multiple sessions:

      “In this simulation, the learning rates progressively decrease across sessions, as described above.”

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      We appreciate the reviewer raising this important point. Unlike the mechanism proposed by the synaptic homeostasis hypothesis, the suppression mechanism in our model does not suppress items based on synapse strength, nor does it modify synaptic weights. Instead, it determines the level of suppression for each item based on activity during awake experience. The brain could implement such a mechanism by tagging each item according to its activity level during wakefulness. During subsequent consolidation, the initial reactivation of an item during replay would reflect this tag, influencing how easily it can be reactivated.

      A related hypothesis has been proposed in recent work, suggesting that replay avoids recently active trajectories due to spike frequency adaptation in neurons (Mallory et al., 2024). Similarly, the suppression mechanism in our model is critical for explaining the observed negative relationship between the amount of recent wake experience and the degree of replay.

      We discuss the biological plausibility of this mechanism and its relationship with existing models in the Introduction. In the section titled “The influence of experience”, we have added the following:

      “Our model implements an activity‑dependent suppression mechanism that, at the onset of each offline replay event, assigns each item a selection probability inversely proportional to its activation during preceding wakefulness. The brain could implement this by tagging each memory trace in proportion to its recent activation; during consolidation, that tag would then regulate starting replay probability, making highly active items less likely to be reactivated. A recent paper found that replay avoids recently traversed trajectories through awake spike‑frequency adaptation (Mallory et al., 2025), which could implement this kind of mechanism. In our simulations, this suppression is essential for capturing the inverse relationship between replay frequency and prior experience. Note that, unlike the synaptic homeostasis hypothesis (Tononi & Cirelli 2006), which proposes that the brain globally downscales synaptic weights during sleep, this mechanism leaves synaptic weights unchanged and instead biases the selection process during replay.”
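
      One literal reading of "selection probability inversely proportional to its activation" can be written as a short sketch. The normalisation, the function name, and the example activation values are assumptions made for illustration; the manuscript's actual suppression function may take a different form.

```python
def initial_replay_probabilities(wake_activation):
    """Starting probabilities for a replay event, inversely proportional
    to each item's activation during preceding wakefulness.

    wake_activation: positive numbers, one per item (hypothetical values).
    Association weights are not touched; only the starting choice is biased.
    """
    if any(a <= 0 for a in wake_activation):
        raise ValueError("activations must be positive")
    inverse = [1.0 / a for a in wake_activation]
    total = sum(inverse)
    return [v / total for v in inverse]

# Item 0 was least active during wake, item 2 most active.
probs = initial_replay_probabilities([1.0, 2.0, 4.0])
print(probs)
```

      In this reading, an item that was four times as active during wake is one quarter as likely to start a replay event, while the learned weights that govern the rest of the sequence stay as they were.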

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? 

      We appreciate the encouragement to comment on the roles of different sleep stages in the manuscript, especially since, as noted, the lab is very interested in this and has explored it in other work. We chose to focus on NREM in this work because the vast majority of electrophysiological studies of sleep replay have identified these events during NREM. In addition, our lab’s theory of the role of REM (Singh et al., 2022, PNAS) is that it is a time for the neocortex to replay remote memories, in complement to the more recent memories replayed during NREM. The experiments we simulate all involve recent memories. Indeed, our view is that part of the reason that there is so little data on REM replay may be that experimenters are almost always looking for traces of recent memories (for good practical and technical reasons).

      Regarding the simplicity of the distinction between simulated wake and sleep replay, we view it as an asset of the model that it can account for many of the different characteristics of awake and NREM replay with very simple assumptions about differences in the initial conditions. There are of course many other differences between the states that could be relevant to the impact of replay, but the current target empirical data did not necessitate us taking those into account. This allows us to argue that differences in initial conditions should play a substantial role in an account of the differences between wake and sleep replay.

      We have added discussion of these ideas and how they might be incorporated into future versions of the model in the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      We appreciate the encouragement to discuss this, as we do think the model could explain findings showing a preference for reactivation of weaker memories, as in Schapiro et al. (2018). In our framework, memory strength is reflected in the magnitude of each memory’s associated synaptic weights, so that stronger memories yield higher retrieved‑context activity during wake encoding than weaker ones. Because the model’s suppression mechanism reduces an item’s replay probability in proportion to its retrieved‑context activity, items with larger weights (strong memories) are more heavily suppressed at the onset of replay, while those with smaller weights (weaker memories) receive less suppression. When items have matched reward exposure, this dynamic would bias offline replay toward weaker memories.

      In the section titled “The influence of experience”, we updated a sentence to discuss this idea more explicitly: 

      “Such a suppression mechanism may be adaptive, allowing replay to benefit not only the most recently or strongly encoded items but also to provide opportunities for the consolidation of weaker or older memories, consistent with empirical evidence (e.g., Schapiro et al. 2018; Yu et al., 2024).”

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      In studies where animals run back and forth on a linear track, replay events are decoded separately for left and right runs, identifying both forward and reverse sequences for each direction, for example using direction-specific place cell sequence templates. Accordingly, in our simulation of, e.g., Ambrose et al. (2016), we use two independent sequences, one for left runs and one for right runs (an approach that has been taken in prior replay modeling work). Crucially, our model assumes a context reset between running episodes, preventing the final item of one traversal from acquiring contextual associations with the first item of the next. As a result, learning in the two sequences remains independent, and when an external cue is presented at the track’s end, replay predominantly unfolds in the backward direction, only occasionally producing forward segments when the cue briefly reactivates an earlier sequence item before proceeding forward.

      We added a note to the section titled “The context-dependency of memory replay” to clarify this:

      “In our model, these patterns are identical to those in our simulation of Ambrose et al. (2016), which uses two independent sequences to mimic the two run directions. This is because the drifting context resets before each run sequence is encoded, with the pause between runs acting as an event boundary that prevents the final item of one traversal from associating with the first item of the next, thereby keeping learning in each direction independent.”

      To our knowledge, no study has observed a similar asymmetry when animals are fully removed from the track, although both types of replay can be observed when animals are away from the track. For example, Gupta et al. (2010) demonstrated that when animals replay trajectories far from their current location, the ratio of forward vs. backward replay appears more balanced. We now highlight this result in the manuscript and explain how it aligns with the predictions of our model:

      “For example, in tasks where the goal is positioned in the middle of an arm rather than at its end, CMR-replay predicts a more balanced ratio of forward and reverse replay, whereas the EVB model still predicts a dominance of reverse replay due to backward gain propagation from the reward. This contrast aligns with empirical findings showing that when the goal is located in the middle of an arm, replay events are more evenly split between forward and reverse directions (Gupta et al., 2010), whereas placing the goal at the end of a track produces a stronger bias toward reverse replay (Diba & Buzsaki 2007).” 

      Although no studies, to our knowledge, have observed a context-dependent asymmetry between forward and backward replay when the animal is away from the track, our model does posit conditions under which such an asymmetry could arise. Specifically, it predicts that deliberation on a specific memory, such as during planning, could generate an internal context input that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track.

      We now discuss this prediction in the section titled “The context-dependency of memory replay”:

      “Our model also predicts that deliberation on a specific memory, such as during planning, could serve to elicit an internal context cue that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track. While not explored here, this mechanism presents a potential avenue for future modeling and empirical work.”

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      Although our model predicts that replay is triggered immediately by the sound cue, it also predicts a sustained bias toward the cued sequence. Replay in our model unfolds across the rest phase as multiple successive events, so the bias observed in our sleep simulations indeed reflects a prolonged preference for the cued sequence.

      We now discuss this issue, acknowledging the discrepancy:

      “Bendor and Wilson (2012) found that sound cues during sleep did not trigger immediate replay, but instead biased reactivation toward the cued sequence over an extended period of time. While the model does exhibit some replay triggered immediately by the cue, it also captures the sustained bias toward the cued sequence over an extended period.”

      Regarding the suggestion that the R/L context, rather than the first R/L item, is what is cued: within this framework, context is modeled as a weighted average of the features associated with items. As a result, cueing the model with the first R/L item produces qualitatively similar outcomes to cueing it with a more extended R/L cue that incorporates features of additional items. This is because both approaches ultimately use context features unique to the two sides.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      We thank the reviewer for this suggestion. Regarding differences in the contribution of wake and sleep to memory, our current simulations predict that compared to rest in the task environment, sleep is less biased toward initiating replay at specific items, leading to a more uniform benefit across all memories. Regarding the contributions of forward and backward replay, our model predicts that both strengthen bidirectional associations between items and contexts, benefiting memory in qualitatively similar ways. Furthermore, we suggest that the offline learning captured by our teacher-student simulations reflects consolidation processes that are specific to sleep.

      We have expanded the section titled “The influence of experience” to discuss these predictions of the model:

      “The results outlined above arise from the model's assumption that replay strengthens bidirectional associations between items and contexts to benefit memory. This assumption leads to several predictions about differences across replay types. First, the model predicts that sleep yields different memory benefits compared to rest in the task environment: Sleep is less biased toward initiating replay at specific items, resulting in a more uniform benefit across all memories. Second, the model predicts that forward and backward replay contribute to memory in qualitatively similar ways but tend to benefit different memories. This divergence arises because forward and backward replay exhibit distinct item preferences, with backward replay being more likely to include rewarded items, thereby preferentially benefiting those memories.”

      We also updated the “The function of replay” section to include our teacher-student speculation:

      “We speculate that the offline learning observed in these simulations corresponds to consolidation processes that operate specifically during sleep, when hippocampal-neocortical dynamics are especially tightly coupled (Klinzing et al., 2019).”

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

      We appreciate these insightful comments. Traditionally, replay studies have focused on spatial tasks with autocorrelated item representations (e.g., place fields). However, an increasing number of human studies have demonstrated sequential replay using stimuli with distinct, unrelated representations. Our model is designed to accommodate both scenarios. In our current simulations, we employ orthogonal item representations while leveraging a shared, temporally autocorrelated context to link successive items. We anticipate that incorporating autocorrelated item representations would further enhance sequence memory by increasing the similarity between successive contexts. Overall, we believe that the model generalizes across a broad range of experimental settings, regardless of the degree of autocorrelation between items. Moreover, the underlying framework has been successfully applied to explain sequential memory in both spatial domains, explaining place cell firing properties (e.g., Howard et al., 2004), and in non-spatial domains, such as free recall experiments where items are arbitrarily related. 

      In the section titled “A context model of memory replay”, we added this comment to address this point:

      “Its contiguity bias stems from its use of shared, temporally autocorrelated context to link successive items, despite the orthogonal nature of individual item representations. This bias would be even stronger if items had overlapping representations, as observed in place fields.”

      Since CMR-replay learns distributed context representations where overlap across context vectors captures associative structure, and replay helps strengthen that overlap, this could indeed be viewed as consonant with complementary learning systems integration processes. 

      Reviewer #2 (Public Review):

      This manuscript proposes a model of replay that focuses on the relation between an item and its context, without considering the value of the item. The model simulates awake learning, awake replay, and sleep replay, and demonstrates parallels between memory phenomena driven by encoding strength, replay of sequence learning, and activation of nearest neighbor to infer causality. There is some discussion of the importance of suppression/inhibition to reduce activation of only dominant memories to be replayed, potentially boosting memories that are weakly encoded. Very nice replications of several key replay findings, including the effect of reward and remote replay, demonstrating the equally salient cue of context for offline memory consolidation.

      I have no suggestions for the main body of the study, including methods and simulations, as the work is comprehensive, transparent, and well-described. However, I would like to understand how the CMR-replay model fits with the current understanding of the importance of excitation vs inhibition, remembering vs forgetting, activation vs deactivation, strengthening vs elimination of synapses, and even NREM vs REM as Schapiro has modeled. There seems to be a strong association with the efforts of the model to instantiate a memory as well as how that reinstantiation changes across time. But that is not all there is to consolidation. The specific roles of different brain states and how they might change replay are also an important consideration.

      We are gratified that the reviewer appreciated the work, and we agree that the paper would benefit from comment on the connections to these other features of consolidation.

      Excitation vs. inhibition: CMR-replay does not model variations in the excitation-inhibition balance across brain states (as in other models, e.g., Chenkov et al., 2017), since it does not include inhibitory connections. However, we posit that the experience-dependent suppression mechanism in the model might, in the brain, involve inhibitory processes. Supporting this idea, studies have observed increased inhibition with task repetition (Berners-Lee et al., 2022). We hypothesize that such mechanisms may underlie the observed inverse relationship between task experience and replay frequency in many studies. We discuss this in the section titled “A context model of memory replay”:

      “The proposal that a suppression mechanism plays a role in replay aligns with models that regulate place cell reactivation via inhibition (Malerba et al., 2016) and with empirical observations of increased hippocampal inhibitory interneuron activity with experience (Berners-Lee et al., 2022). Our model assumes the presence of such inhibitory mechanisms but does not explicitly model them.”

      Remembering/forgetting, activation/deactivation, and strengthening/elimination of synapses: The model does not simulate synaptic weight reduction or pruning, so it does not forget memories through the weakening of associated weights. However, forgetting can occur when a memory is replayed less frequently than others, leading to reduced activation of that memory compared to its competitors during context-driven retrieval. In the Discussion section, we acknowledge that a biologically implausible aspect of our model is that it implements only synaptic strengthening: 

      “Aspects of the model, such as its lack of regulation of the cumulative positive weight changes that can accrue through repeated replay, are biologically implausible (as biological learning results in both increases and decreases in synaptic weights) and limit the ability to engage with certain forms of low level neural data (e.g., changes in spine density over sleep periods; de Vivo et al., 2017; Maret et al., 2011). It will be useful for future work to explore model variants with more elements of biological plausibility.”

      Different brain states and NREM vs REM: Reviewer 1 also raised this important issue (see above). We have added the following thoughts on differences between these states and the relationship to our prior work to the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      We hope these points clarify the model’s scope and its potential for future extensions.

      Do the authors suggest that these replay systems are more universal to offline processes beyond episodic memory? What about procedural memories and working memory?

      We thank the reviewer for raising this important question. We have clarified in the manuscript:

      “We focus on the model as a formulation of hippocampal replay, capturing how the hippocampus may replay past experiences through simple and interpretable mechanisms.”

      With respect to other forms of memory, we now note that:

      “This motor memory simulation using a model of hippocampal replay is consistent with evidence that hippocampal replay can contribute to consolidating memories that are not hippocampally dependent at encoding (Schapiro et al., 2019; Sawangjit et al., 2018). It is possible that replay in other, more domain-specific areas could also contribute (Eichenlaub et al., 2020).”

      Though this is not a biophysical model per se, can the authors speak to the neuromodulatory milieus that give rise to the different types of replay?

      Our work aligns with the perspective proposed by Hasselmo (1999), which suggests that waking and sleep states differ in the degree to which hippocampal activity is driven by external inputs. Specifically, high acetylcholine levels during waking bias activity to flow into the hippocampus, while low acetylcholine levels during sleep allow hippocampal activity to influence other brain regions. Consistent with this view, our model posits that wake replay is more biased toward items associated with the current resting location due to the presence of external input during waking states. In the Discussion section, we have added a comment on this point:

      “Our view aligns with the theory proposed by Hasselmo (1999), which suggests that the degree of hippocampal activity driven by external inputs differs between waking and sleep states: High acetylcholine levels during wakefulness bias activity into the hippocampus, while low acetylcholine levels during slow-wave sleep allow hippocampal activity to influence other brain regions.”

      Reviewer #3 (Public Review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency, and contiguity. Unlike its predecessors, CMR-replay has built-in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's item-context associations.

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backward replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory-building in the field.

      With respect to weaknesses, additional details for some of the methods and results would help the readers better evaluate the data presented here (e.g., explicitly defining how the various 'proportion of replay' DVs were calculated).

      For example, for many of the simulations, the y-axis scale differs from the empirical data despite using comparable units, like the proportion of replay events (e.g., Figures 1B and C). Presumably, this was done to emphasize the similarity between the empirical and model data. But, as a reader, I often found myself doing the mental manipulation myself anyway to better evaluate how the model compared to the empirical data. Please consider using comparable y-axis ranges across empirical and simulated data wherever possible.

      We appreciate this point. As in many replay modeling studies, our primary goal is to provide a qualitative fit that demonstrates the general direction of differences between our model and empirical data, without engaging in detailed parameter fitting for a precise quantitative fit. Still, we agree that where possible, it is useful to better match the axes. We have updated Figures 2B and 2C so that the y-axis scales are more directly comparable between the empirical and simulated data.

      In a similar vein to the above point, while the DVs in the simulations/empirical data made intuitive sense, I wasn't always sure precisely how they were calculated. Consider the "proportion of replay" in Figure 1A. In the Methods (perhaps under Task Simulations), it should specify exactly how this proportion was calculated (e.g., proportions of all replay events, both forwards and backwards, combining across all simulations from Pre- and Post-run rest periods). In many of the examples, the proportions seem to possibly sum to 1 (e.g., Figure 1A), but in other cases, this doesn't seem to be true (e.g., Figure 3A). More clarity here is critical to help readers evaluate these data. Furthermore, sometimes the labels themselves are not the most informative. For example, in Figure 1A, the y-axis is "Proportion of replay" and in 1C it is the "Proportion of events". I presumed those were the same thing - the proportion of replay events - but it would be best if the axis labels were consistent across figures in this manuscript when they reflect the same DV.

      We appreciate these useful suggestions. We have revised the Methods section to explain in detail how DVs are calculated for each simulation. The revisions clarify the differences between related measures, such as those shown in Figures 1A and 1C, so that readers can more easily see how the DVs are defined and interpreted in each case. 

      Reviewer #4/Reviewing Editor (Public Review):

      Summary:

      With their 'CMR-replay' model, Zhou et al. demonstrate that the use of spontaneous neural cascades in a context-maintenance and retrieval (CMR) model significantly expands the range of captured memory phenomena.

      Strengths:

      The proposed model compellingly outperforms its CMR predecessor and, thus, makes important strides towards understanding the empirical memory literature, as well as highlighting a cognitive function of replay.

      Weaknesses:

      Competing accounts of replay are acknowledged but there are no formal comparisons and only CMR-replay predictions are visualized. Indeed, other than the CMR model, only one alternative account is given serious consideration: A variant of the 'Dyna-replay' architecture, originally developed in the machine learning literature (Sutton, 1990; Moore & Atkeson, 1993) and modified by Mattar et al (2018) such that previously experienced event-sequences get replayed based on their relevance to future gain. Mattar et al acknowledged that a realistic Dyna-replay mechanism would require a learned representation of transitions between perceptual and motor events, i.e., a 'cognitive map'. While Zhou et al. note that the CMR-replay model might provide such a complementary mechanism, they emphasize that their account captures replay characteristics that Dyna-replay does not (though it is unclear to what extent the reverse is also true).

      We thank the reviewer for these thoughtful comments and appreciate the opportunity to clarify our approach. Our goal in this work is to contrast two dominant perspectives in replay research: replay as a mechanism for learning reward predictions and replay as a process for memory consolidation. These models were chosen as representatives of their classes of models because they use simple and interpretable mechanisms that can simulate a wide range of replay phenomena, making them ideal for contrasting these two perspectives.

      Although we implemented CMR-replay as a straightforward example of the memory-focused view, we believe the proposed mechanisms could be extended to other architectures, such as recurrent neural networks, to produce similar results. We now discuss this possibility in the revised manuscript (see below). However, given our primary goal of providing a broad and qualitative contrast of these two broad perspectives, we decided not to undertake simulations with additional individual models for this paper.

      Regarding the Mattar & Daw model, it is true that a mechanistic implementation would require a mechanism that avoids precomputing priorities before replay. However, the "need" component of their model already incorporates learned expectations of transitions between actions and events. Thus, the model's limitations are not due to the absence of a cognitive map.

      In contrast, while CMR-replay also accumulates memory associations that reflect experienced transitions among events, it generates several qualitatively distinct predictions compared to the Mattar & Daw model. As we note in the manuscript, these distinctions make CMR-replay a contrasting rather than complementary perspective.

      Another important consideration, however, is how CMR-replay compares to alternative mechanistic accounts of cognitive maps. For example, Recurrent Neural Networks are adept at detecting spatial and temporal dependencies in sequential input; these networks are being increasingly used to capture psychological and neuroscientific data (e.g., Zhang et al, 2020; Spoerer et al, 2020), including hippocampal replay specifically (Haga & Fukai, 2018). Another relevant framework is provided by Associative Learning Theory, in which bidirectional associations between static and transient stimulus elements are commonly used to explain contextual and cue-based phenomena, including associative retrieval of absent events (McLaren et al, 1989; Harris, 2006; Kokkola et al, 2019). Without proper integration with these modeling approaches, it is difficult to gauge the innovation and significance of CMR-replay, particularly since the model is applied post hoc to the relatively narrow domain of rodent maze navigation.

      First, we would like to clarify that our principal aim in this work is to characterize the nature of replay, rather than to model cognitive maps per se. Accordingly, CMR-replay is not designed to simulate head-direction signals, perform path integration, or explain the spatial firing properties of neurons during navigation. Instead, it focuses squarely on sequential replay phenomena, simulating classic rodent maze reactivation studies and human sequence-learning tasks. These simulations span a broad array of replay experimental paradigms to ensure extensive coverage of the replay findings reported across the literature. As such, the contribution of this work is in explaining the mechanisms and functional roles of replay, and demonstrating that a model that employs simple and interpretable memory mechanisms not only explains replay phenomena traditionally interpreted through a value-based lens but also accounts for findings not addressed by other memory-focused models.

      As the reviewer notes, CMR-replay shares features with other memory-focused models. However, to our knowledge, none of these related approaches have yet captured the full suite of empirical replay phenomena, suggesting the combination of mechanisms employed in CMR-replay is essential for explaining these phenomena. In the Discussion section, we now discuss the similarities between CMR-replay and related memory models and the possibility of integrating these approaches:

      “Our theory builds on a lineage of memory-focused models, demonstrating the power of this perspective in explaining phenomena that have often been attributed to the optimization of value-based predictions. In this work, we focus on CMR-replay, which exemplifies the memory-centric approach through a set of simple and interpretable mechanisms that we believe are broadly applicable across memory domains. Elements of CMR-replay share similarities with other models that adopt a memory-focused perspective. The model learns distributed context representations whose overlap encodes associations among items, echoing associative learning theories in which overlapping patterns capture stimulus similarity and learned associations (McLaren & Mackintosh 2002). Context evolves through bidirectional interactions between items and their contextual representations, mirroring the dynamics found in recurrent neural networks (Haga & Fukai 2018; Levenstein et al., 2024). However, these related approaches have not been shown to account for the present set of replay findings and lack mechanisms, such as reward-modulated encoding and experience-dependent suppression, that our simulations suggest are essential for capturing these phenomena. While not explored here, we believe these mechanisms could be integrated into architectures like recurrent neural networks (Levenstein et al., 2024) to support a broader range of replay dynamics.”

      Recommendations For The Authors

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 94-96: These lines may be better positioned earlier in the paragraph.

      We now introduce these lines earlier in the paragraph.

      (2) Line 103 - It's unclear to me what is meant by the statement that "the current context contains contexts associated with previous items". I understand why a slowly drifting context will coincide and therefore link with multiple items that progress rapidly in time, so multiple items will be linked to the same context and each item will be linked to multiple contexts. Is that the idea conveyed here or am I missing something? I'm similarly confused by line 129, which mentions that a context is updated by incorporating other items' contexts. How could a context contain other contexts?

      In the model, each item has an associated context that can be retrieved via Mfc. This is true even before learning, since Mfc is initialized as an identity matrix. During learning and replay, we have a drifting context c that is updated each time an item is presented. At each timestep, the model first retrieves the current item’s associated context cf by Mfc, and incorporates it into c. Equation #2 in the Methods section illustrates this procedure in detail. Because of this procedure, the drifting context c is a weighted sum of past items’ associated contexts. 

      We recognize that these descriptions can be confusing. We have updated the Results section to better distinguish the drifting context from items’ associated context. For example, we note that:

      “We represent the drifting context during learning and replay with c and an item's associated context with cf.”

      We have also updated our description of the context drift procedure to distinguish these two quantities: 

      “During awake encoding of a sequence of items, for each item f, the model retrieves its associated context cf via Mfc. The drifting context c incorporates the item's associated context cf and downweights its representation of previous items' associated contexts (Figure 1c). Thus, the context layer maintains a recency weighted sum of past and present items' associated contexts.”
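
      As a minimal sketch of the drift procedure described above (not the simulation code itself), the snippet below presents four items in order and updates the drifting context c with each item's associated context cf. It assumes the update form commonly used in retrieved-context models, c = rho * c + beta * cf, with rho chosen so that c keeps unit length; the value of beta, the extra "pre-task" context unit, and the helper names are illustrative assumptions.

```python
import math

def drift(c, c_f, beta):
    """Blend an item's associated context c_f into the drifting context c.

    Assumed form: c_new = rho * c + beta * c_f, with rho chosen so that
    c_new has unit length whenever c and c_f are unit vectors.
    """
    dot = sum(a * b for a, b in zip(c, c_f))
    rho = math.sqrt(1.0 + beta ** 2 * (dot ** 2 - 1.0)) - beta * dot
    return [rho * a + beta * b for a, b in zip(c, c_f)]

def one_hot(i, n):
    return [1.0 if k == i else 0.0 for k in range(n)]

N_ITEMS = 4
DIM = N_ITEMS + 1  # one extra unit stands in for a pre-task context (assumption)

# M_fc starts as the identity, so before learning each item retrieves a
# one-hot associated context (for the identity, row i equals column i).
M_fc = [one_hot(i, DIM) for i in range(DIM)]

c = one_hot(N_ITEMS, DIM)  # start in the pre-task context
for item in range(N_ITEMS):
    c_f = M_fc[item]              # associated context retrieved via M_fc
    c = drift(c, c_f, beta=0.6)   # beta = 0.6 is an arbitrary illustrative value

# c is now a recency-weighted sum of the items' associated contexts:
# the most recent item carries the largest weight.
print([round(x, 3) for x in c])
```

      With orthogonal associated contexts, the weight of each earlier item shrinks by a factor of rho at every later step, which is the sense in which c is a recency-weighted sum.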

      (3) Figure 1b and 1d - please clarify which axis in the association matrices represents the item and the context.

      We have added labels to show what the axes represent in Figure 1.

      (4) The terms "experience" and "item" are used interchangeably and it may be best to stick to one term.

      We now use the term “item” wherever we describe the model results. 

      (5) The manuscript describes Figure 6 ahead of earlier figures - the authors may want to reorder their figures to improve readability.

      We appreciate this suggestion. We decided to keep the current figure organization since it allows us to group results into different themes and avoid redundancy. 

      (6) Lines 662-664 are repeated with a different ending, this is likely an error.

      We have fixed this error.

      Reviewer #3 (Recommendations For The Authors):

      Below, I have outlined some additional points that came to mind in reviewing the manuscript - in no particular order.

      (1) Figure 1: I found the ordering of panels a bit confusing in this figure, as the reading direction changes a couple of times in going from A to F. Would perhaps putting panel C in the bottom left corner and then D at the top right, with E and F below (also on the right) work?

      We agree that this improves the figure. We have restructured the ordering of panels in this figure. 

      (2) Simulation 1: When reading the intro/results for the first simulation (Figure 2a; Diba & Buzsáki, 2007; "When animals traverse a linear track...", page 6, line 186), it wasn't clear to me why pre-run rest would have any forward replay, particularly if pre-run implied that the animal had no experience with the track yet. But in the Methods this becomes clearer, as the model encodes the track eight times prior to the rest periods. Making this explicit in the text would make it easier to follow. Also, was there any reason why eight sessions of awake learning, in particular, were used?

      We now make more explicit that the animals have experience with the track before pre-run rest recording:

      “Animals first acquire experience with a linear track by traversing it to collect a reward. Then, during the pre-run rest recording, forward replay predominates.”

      We included eight sessions of awake learning to match the number of sessions in Shin et al. (2019), since this simulation attempts to explain data from that study. After each repetition, the model engages in rest. We have revised the Methods section to indicate the motivation for this choice:

      “In the simulation that examines context-dependent forward and backward replay through experience (Figs. 2a and 5a), CMR-replay encodes an input sequence shown in Fig. 7a, which simulates a linear track run with no ambiguity in the direction of inputs, over eight awake learning sessions (as in Shin et al. 2019)”

      (3) Frequency of remote replay events: In the simulation based on Gupta et al, how frequently overall does remote replay occur? In the main text, the authors mention the mean frequency with which shortcut replay occurs (i.e., the mean proportion of replay events that contain a shortcut sequence = 0.0046), which was helpful. But, it also made me wonder about the likelihood of remote replay events. I would imagine that remote replay events are infrequent as well - given that it is considerably more likely to replay sequences from the local track, given the recency-weighted mental context. Reporting the above mean proportion for remote and local replay events would be helpful context for the reader.

      In Figure 4c, we report the proportion of remote replay in the two experimental conditions of Gupta et al. that we simulate. 

      (4) Point of clarification re: backwards replay: Is backwards replay less likely to occur than forward replay overall because of the forward asymmetry associated with these models? For example, for a backwards replay event to occur, the context would need to drift backwards at least five times in a row, in spite of a higher probability of moving one step forward at each of those steps. Am I getting that right?

      The reviewer’s interpretation is correct: CMR-replay is more likely to produce forward than backward replay in sleep because of its forward asymmetry. We note that this forward asymmetry leads to high likelihood of forward replay in the section titled “The context-dependency of memory replay”: 

      “As with prior retrieved context models (Howard & Kahana 2002; Polyn et al., 2009), CMR-replay encodes stronger forward than backward associations. This asymmetry exists because, during the first encoding of a sequence, an item's associated context contributes only to its ensuing items' encoding contexts. Therefore, after encoding, bringing back an item's associated context is more likely to reactivate its ensuing than preceding items, leading to forward asymmetric replay (Fig. 6d left).”
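
      The origin of this asymmetry can be illustrated with a toy first-pass encoding, sketched below under simplifying assumptions: orthogonal (one-hot) associated contexts, a single presentation of the sequence, a Hebbian context-to-item matrix (called M_cf here) that binds each item to the drifting context present right after that item is incorporated, and an arbitrary beta. None of these names or values are taken from the simulation code.

```python
import math

def drift(c, c_f, beta):
    """Assumed unit-length-preserving context update: c_new = rho*c + beta*c_f."""
    dot = sum(a * b for a, b in zip(c, c_f))
    rho = math.sqrt(1.0 + beta ** 2 * (dot ** 2 - 1.0)) - beta * dot
    return [rho * a + beta * b for a, b in zip(c, c_f)]

def encode_sequence(n_items, beta=0.6):
    """Present items 0..n_items-1 once; return the context-to-item weights."""
    dim = n_items + 1                      # extra unit: pre-task context
    c = [0.0] * dim
    c[n_items] = 1.0
    M_cf = [[0.0] * dim for _ in range(n_items)]  # rows: items, cols: context units
    for item in range(n_items):
        c_f = [1.0 if k == item else 0.0 for k in range(dim)]
        c = drift(c, c_f, beta)
        for k in range(dim):               # Hebbian binding of item to context
            M_cf[item][k] += c[k]
    return M_cf

def cue_with_item_context(M_cf, item):
    """Support each item receives when `item`'s associated context is reinstated."""
    return [row[item] for row in M_cf]

M_cf = encode_sequence(n_items=5)
support = cue_with_item_context(M_cf, item=2)
print([round(s, 3) for s in support])
```

      In this toy setting, reinstating the context of item 2 gives no support to items 0 and 1, because their encoding contexts were formed before item 2 was presented, and gives graded support to items 3 and 4. The full model encodes stronger forward than backward associations rather than strictly zero backward support; the sketch isolates only the first-encoding effect described in the quoted passage.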

      (5) On terminating a replay period: "At any t, the replay period ends with a probability of 0.1 or if a task-irrelevant item is reactivated." (Figure 1 caption; see also pg 18, line 635). How was the 0.1 decided upon? Also, could you please add some detail as to what a 'task-irrelevant item' would be? From what I understood, the model only learns sequences that represent the points in a track - wouldn't all the points in the track be task-relevant?

      This value was arbitrarily chosen as a small value that allows probabilistic stopping. It was not motivated by prior modeling or a systematic search. We have added: “At each timestep, the replay period ends either with a stop probability of 0.1 or if a task-irrelevant item becomes reactivated. (The choice of the value 0.1 was arbitrary; future work could explore the implications of varying this parameter).” 

      In addition, we now explain in the paper that task-irrelevant items “do not appear as inputs during awake encoding, but compete with task-relevant items for reactivation during replay, simulating the idea that other experiences likely compete with current experiences during periods of retrieval and reactivation.”
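
      The stopping rule can be sketched as follows. This is an illustration only: the order in which the two stopping conditions are checked, the uniform item sampler (the model instead reactivates items according to context-driven activations), and the item labels are all assumptions made for the example.

```python
import random

STOP_PROB = 0.1  # value stated above; acknowledged as an arbitrary choice

def run_replay_period(sample_item, task_items, rng,
                      stop_prob=STOP_PROB, max_steps=10_000):
    """Return the task-relevant items reactivated in one replay period.

    At each timestep the period ends with probability `stop_prob`, or
    when the reactivated item is not one of the task-relevant items.
    """
    replayed = []
    for _ in range(max_steps):
        if rng.random() < stop_prob:       # probabilistic stop
            break
        item = sample_item(rng)
        if item not in task_items:         # task-irrelevant item reactivated
            break
        replayed.append(item)
    return replayed

# Hypothetical item pool: A-D were encoded during the task; X and Y were not.
TASK_ITEMS = {"A", "B", "C", "D"}
POOL = sorted(TASK_ITEMS) + ["X", "Y"]

def uniform_sampler(rng):
    """Placeholder for context-driven reactivation."""
    return rng.choice(POOL)

rng = random.Random(0)
print(run_replay_period(uniform_sampler, TASK_ITEMS, rng))
```

      If no task-irrelevant item is ever reactivated, the number of replayed items under this rule follows a geometric distribution with mean (1 - 0.1) / 0.1 = 9, so the stop probability mainly sets an upper scale on replay length, while competition from task-irrelevant items shortens it further.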

      (6) Minor typos:

      Turn all instances of "nonlocal" into "non-local", or vice versa

      "For rest at the end of a run, cexternal is the context associated with the final item in the sequence. For rest at the end of a run, cexternal is the context associated with the start item." (pg 20, line 663) - I believe this is a typo and that the second sentence should begin with "For rest at the START of a run".

      We have updated the manuscript to correct these typos. 

      (7) Code availability: I may have missed it, but it doesn't seem like the code is currently available for these simulations. Including the commented code in a public repository (Github, OSF) would be very useful in this case.

      We now include a Github link to our simulation code: https://github.com/schapirolab/CMR-replay.

    1. eLife Assessment

      This study combines genetic, cell biological, and interaction data to propose a model of meiotic double-strand break regulation in C. elegans. Solid evidence supports the main conclusions, although, as is inherent to a screening-type study, further work will be needed to substantiate the more speculative points. Nevertheless, the comprehensive cataloging of the physical and genetic interactions among factors required for meiotic double-strand break formation is useful information for the field.

    2. Joint Public Review:

      Meiotic recombination begins with DNA double-strand breaks (DSBs) generated by the conserved enzyme Spo11, which relies on several accessory factors that vary widely across eukaryotes. In C. elegans, multiple proteins have been implicated in promoting DSB formation, but their functional relationships and how they collectively recruit the DSB machinery to chromosome axes have remained unclear.

      In this study, Raices et al. investigate the biochemical and genetic interactions among known DSB-promoting factors in C. elegans meiosis. Using yeast two-hybrid assays and co-immunoprecipitation, they map pairwise protein interactions and identify a connection between the chromatin-associated protein HIM-17 and the transcription factor XND-1. They also confirm the established interaction between DSB-1 and SPO-11 and show that DSB-1 associates with the nematode-specific factor HIM-5, which is required for X-chromosome DSB formation.

      The authors extend these findings with genetic analyses, placing these factors into four epistasis groups based on single- and double-mutant phenotypes. Together, these biochemical and genetic data support a model describing how these proteins engage chromatin loops and localize to chromosome axes. The work provides a clearer view of how C. elegans assembles its DSB-forming machinery and how this process compares to mechanisms in other organisms.

      Comment from the Reviewing Editor on the revised version:

      The authors have adequately addressed the prior review comments. At this point, after going through multiple rounds of reviews and revisions, the community will be better served by having this paper out in public. This version was assessed by the editors without further input from the reviewers.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      The manuscript by Raices et al. provides some novel insights into the roles of and interactions between SPO-11 accessory proteins in C. elegans. The authors propose a model of meiotic DSB regulation, critical to our understanding of DSB formation and ultimately crossover regulation and accurate chromosome segregation. The work also emphasizes the commonalities and species-specific aspects of DSB regulation. 

      Strengths: 

      This study capitalizes on the strengths of the C. elegans system to uncover genetic interactions between SPO-11 accessory proteins. In combination with physical interactions, the authors synthesize their findings into a model, which will serve as the basis for future work, to determine mechanisms of DSB regulation. 

      Weaknesses: 

      The methodology, although standard, still lacks some rigor, especially with the IPs. 

      Reviewer #2 (Public review): 

      Summary: 

      Meiotic recombination initiates with DNA double-strand break (DSB) formation, catalyzed by the conserved topoisomerase-like enzyme Spo11. Spo11 requires accessory factors that are poorly conserved across eukaryotes. Previous genetic studies have identified several proteins required for DSB formation in C. elegans to varying degrees; however, how these proteins interact with each other to recruit the DSB-forming machinery to chromosome axes remains unclear. 

      In this study, Raices et al. characterized the biochemical and genetic interactions among proteins that are known to promote DSB formation during C. elegans meiosis. The authors examined pairwise interactions using yeast two-hybrid (Y2H) and co-immunoprecipitation and revealed an interaction between a chromatin-associated protein HIM-17 and a transcription factor XND-1. They further confirmed the previously known interaction between DSB-1 and SPO-11 and showed that DSB-1 also interacts with a nematode-specific HIM-5, which is essential for DSB formation on the X chromosome. They also assessed genetic interactions among these proteins, categorizing them into four epistasis groups by comparing phenotypes in double vs. single mutants. Combining these results, the authors proposed a model of how these proteins interact with chromatin loops and are recruited to chromosome axes, offering insights into the process in C. elegans compared to other organisms. 

      Weaknesses: 

      This work relies heavily on Y2H, which is notorious for having high rates of false positives and false negatives. Although the interactions between HIM-17 and XND-1 and between DSB-1 and HIM-5 were validated by co-IP, the significance of these interactions was not tested in vivo. Cataloging Y2H and genetic interactions does not yield much more insight. The model proposed in Figure 4 is also highly speculative. 

      Reviewer #3 (Public review): 

      The goal of this work is to understand the regulation of double-strand break formation during meiosis in C. elegans. The authors have analyzed physical and genetic interactions among a subset of factors that have been previously implicated in DSB formation or the number or timing of DSBs: CEP-1, DSB-1, DSB-2, DSB-3, HIM-5, HIM-17, MRE-11, REC-1, PARG-1, and XND-1. 

      The 10 proteins that are analyzed here include a diverse set of factors with different functions, based on prior analyses in many published studies. The term "Spo11 accessory factors" has been used in the meiosis literature to describe proteins that directly promote Spo11 cleavage activity, rather than factors that are important for the expression of meiotic proteins or that influence the genome-wide distribution or timing of DSBs. Based on this definition, the known SPO-11 accessory factors in C. elegans include DSB-1, DSB-2, DSB-3, and the MRN complex (at least MRE-11 and RAD-50). These are all homologs of proteins that have been studied biochemically and structurally in other organisms. DSB-1 & DSB-2 are homologs of Rec114, while DSB-3 is a homolog of Mei4. Biochemical and structural studies have shown that Rec114 and Mei4 directly modulate Spo11 activity by recruiting Spo11 to chromatin and promoting its dimerization, which is essential for cleavage. The other factors analyzed in this study affect the timing, distribution, or number of RAD-51 foci, but they likely do so indirectly. As elaborated below, XND-1 and HIM-17 are transcription factors that modulate the expression of other meiotic genes, and their role in DSB formation is parsimoniously explained by this regulatory activity. The roles of HIM-5 and REC-1 remain unclear; the reported localization of HIM-5 to autosomes is consistent with a role in transcription (the autosomes are transcriptionally active in the germline, while the X chromosome is largely silent), but its loss-of-function phenotypes are much more limited than those of HIM-17 and XND-1, so it may play a more direct role in DSB formation. The roles of CEP-1 (a Rad53 homolog) and PARG-1 are also ambiguous, but their homologs in other organisms contribute to DNA repair rather than DSB formation. 

      We appreciate the reviewer’s clarification. However, the definition of Spo11 accessory factors varies across the literature. Only Keeney and colleagues define these as proteins that physically associate with and activate Spo11 to catalyze DSB formation (Keeney, Lange & Mohibullah, 2014; Lam & Keeney, 2015). In contrast, other authors have used the term more broadly to refer to proteins that promote or regulate Spo11-dependent DSB formation, without necessarily implying a direct interaction with Spo11 (e.g., Panizza et al., 2011; Robert et al., 2016; Stanzione et al., 2016; Li et al., 2021; Lange et al., 2016). Thus, our usage of the term follows this broader functional definition.

      An additional significant limitation of the study, as stated in my initial review, is that much of the analysis here relies on cytological visualization of RAD-51 foci as a proxy for DSBs. RAD-51 associates transiently with DSB sites as they undergo repair and is thus limited in its ability to reveal details about the timing or abundance of DSBs since its loading and removal involve additional steps that may be influenced by the factors being analyzed. 

      We agree with the reviewer that counting RAD-51 foci provides only an indirect measure of SPO-11–dependent DSBs, as RAD-51 marks sites of repair rather than the breaks themselves. However, we would like to clarify that our current study does not rely on RAD-51 foci quantification for any of the analyses or conclusions presented. None of the figures or datasets in this manuscript are based on RAD-51 cytology. Instead, our conclusions are drawn from genetic interactions, biochemical assays, and protein–protein interaction analyses.

      The paper focuses extensively on HIM-5, which was previously shown through genetic and cytological analysis to be important for breaks on the X chromosome. The revised manuscript still claims that "HIM-5 mediates interactions with the different accessory factors sub-groups, providing insights into how components on the DNA loops may interact with the chromosome axis." The weak interactions between HIM-5 and DSB-1/2 detected in the Y2H assay do not convincingly support such a role. The idea that HIM-5 directly promotes break formation is also inconsistent with genetic data showing that him-5 mutants lack breaks on the X chromosomes, while HIM-5 has been shown to be enriched on autosomes. Additionally, as noted in my comment to the authors, the localization data for HIM-5 shown in this paper are discordant with prior studies; this discrepancy should be addressed experimentally. 

      We appreciate the reviewer’s concerns regarding the interpretation of HIM-5 function. The weak Y2H interactions between HIM-5 and DSB-1 are not interpreted as direct biochemical evidence of a strong physical interaction, but rather as a potential point of regulatory connection between these pathways. Importantly, these Y2H data are further supported by co-immunoprecipitation experiments, genetic interactions, and the observed mislocalization of HIM-5 in the absence of DSB-1. Together, these complementary results strengthen our conclusion that HIM-5 functionally associates with DSB-promoting complexes.

      Regarding HIM-5 localization, the pattern we observe using both anti-GFP staining of the eaIs4 transgene (Phim-5::him-5::GFP) and anti-HA staining of the HIM-5::HA strain is consistent with that reported by McClendon et al. (2016), who validated the same eaIs4 transgene. The pattern differs slightly from that of Meneely et al. (2012), who used a HIM-5 antibody that is no longer functional and has been discontinued by the commercial source. In this prior study, a weak signal was detected in the mitotic region and late pachytene, but a stronger signal was seen in early to mid-pachytene. Our imaging, optimized for low background and stable signal, similarly shows robust HIM-5 localization in early and mid-pachytene, supporting the reliability of our GFP and HA-tagged analyses.

      The recent analysis of DSB formation in C. elegans males (Engebrecht et al.; PLoS Genetics; PMID: 41124211) shows that in the absence of him-5 there is a significant reduction of CO designation (measured as COSA-1 foci) on autosomes. This study strongly supports a direct and general role for HIM-5 in crossover formation, on both autosomes and on the hermaphrodite X.

      This paper describes REC-1 and HIM-5 as paralogs, based on prior analysis in a paper that included some of the same authors (Chung et al., 2015; DOI 10.1101/gad.266056.115). In my initial review I mentioned that this earlier conclusion was likely incorrect and should not be propagated uncritically here. Since the authors have rebutted this comment rather than amending it, I feel it is important to explain my concerns about the conclusions of the previous study. Chung et al. found a small region of potential homology between the C. elegans rec-1 and him-5 genes and also reported that him-5; rec-1 double mutants have more severe defects than either single mutant, indicative of a stronger reduction in DSBs. Based on these observations and an additional argument based on microsynteny, they concluded that these two genes arose through recent duplication and divergence. However, as they noted, genes resembling rec-1 are absent from all other Caenorhabditis species, even those most closely related to C. elegans. The hypothesis that two genes are paralogs that arose through duplication and divergence is thus based on their presence in a single species, in the absence of extensive homology or evidence for conserved molecular function. Further, the hypothesis that gene duplication and divergence has given rise to two paralogs that share no evident structural similarity or common interaction partners in the few million years since C. elegans diverged from its closest known relatives is implausible. In contrast, DSB-1 and DSB-2 are both homologs of Rec114 that clearly arose through duplication and divergence within the Caenorhabditis lineage, but much earlier than the proposed split between REC-1 and HIM-5. Two genes that can be unambiguously identified as dsb-1 and dsb-2 are present in genomes throughout the Elegans supergroup and absent in the Angaria supergroup, placing the duplication event at around 18-30 MYA, yet DSB-1 and DSB-2 share much greater similarity in their amino acid sequence, predicted structure, and function than HIM-5 and REC-1. Further, Raices et al. place HIM-5 and REC-1 in different functional complexes (Figure 3B). 

      We respectfully disagree with the reviewer’s characterization of the relationship between HIM-5 and REC-1. Our use of the term “paralog” follows the conclusions of Chung et al. (2015), a peer-reviewed study that provided both sequence and microsynteny evidence supporting this relationship. While we acknowledge that the degree of sequence conservation is limited, the evolutionary scenario proposed by Chung et al. remains the only published framework addressing this question. Further, the degree of homology between either HIM-5 or REC-1 and the ancestral locus is similar to that observed for DSB-1 and DSB-2 with REC-114 (Hinman et al., 2021). We therefore retain the use of the term “paralog” in reference to these genes. Importantly, our conclusions regarding their distinct molecular and functional roles are independent of this classification.

      The authors acknowledge that HIM-17 is a transcription factor that regulates many meiotic genes. Like HIM-17, XND-1 is cytologically enriched along the autosomes in germline nuclei, suggestive of a role in transcription. The Reinke lab performed ChIP-seq in a strain expressing an XND-1::GFP fusion protein and showed that it binds to promoter regions, many of which overlap with the HIM-17-regulated promoters characterized by the Ahringer lab (doi: 10.1126/sciadv.abo4082). Work from the Yanowitz lab has shown that XND-1 influences the transcription of many other genes involved in meiosis (doi: 10.1534/g3.116.035725) and work from the Colaiacovo lab has shown that XND-1 regulates the expression of CRA-1 (doi: 10.1371/journal.pgen.1005029). Additionally, loss of HIM-17 or XND-1 causes pleiotropic phenotypes, consistent with a broad role in gene regulation. Collectively, these data indicate that XND-1 and HIM-17 are transcription factors that are important for the proper expression of many germline-expressed genes. Thus, as stated above, the roles of HIM-17 and XND-1 in DSB formation, as well as their effects on histone modification, are parsimoniously explained by their regulation of the expression of factors that contribute more directly to DSB formation and chromatin modification. I feel strongly that transcription factors should not be described as "SPO-11 accessory factors." 

      The ChIP analysis of XND-1 binding sites (using the XND-1::GFP transgene we provided to the Reinke lab) was performed, and Table S3 in the Ahringer paper suggests it is found at germline promoters, although the analysis is not actually provided. We completely agree that at least a subset of XND-1 functions is explained by its regulation of transcriptional targets (as we previously showed for HIM-5). However, like the MES proteins, a subset of which are also autosomal and impact X chromosome gene expression, XND-1 could also be directly regulating chromatin architecture, which could have profound effects on DSB formation. As stated in our prior comments, precedent for the involvement of a chromatin factor in DSB formation is provided by yeast Spp1. 

      Recommendations for the authors: 

      Editor comments: 

      As you can see, the reviewers have additional comments, and the authors can include revisions to address those points prior to publicizing 'a version of record' (e.g. hatching rate assay mentioned by reviewer #1). This type of study, trying to catalog interactions of many factors, inevitably has loose ends, but in my opinion, it does not reduce the value of the study, as long as statements are not misleading. I suggest that the authors address issues by making changes to the main text. After the next round of adjustments by authors, I feel that it will be ready for a version of record, based on the spirit of the current eLife publication model. 

      Reviewer #1 (Recommendations for the authors): 

      I still have concerns about the HIM-17 IP and immunoblot probing with XND-1 antibodies. While the newly provided whole extract immunoblot clearly shows an XND-1-specific band that goes away in the mutant extracts, there are additional bands that are recognized - the pattern looks different than in the input in Figure 1B. Additionally, there is still a band of the corresponding size in the IPs from extracts not containing the tagged allele of HIM-17, calling into question whether XND-1 is specifically pulled down. 

      The authors did not include the hatching rate as pointed out in the original reviews. In the rebuttal: 

      "Great question. I guess we need to do this while back out for review. If anyone has suggestions of what to say here. Clearly we overlooked this point but do have the strain." 

      We thank the reviewer for this suggestion. We had intended to include a hatching analysis; however, during the course of this work we discovered that our him-17 stock had acquired an additional linked mutation(s) that altered its phenotype and led to inconsistent results. This strain was used to rederive the him-17; eaIs4 double mutant after our original did not survive freeze/thaw. Given the abnormal behavior observed in this line, we concluded that proceeding with the hatching assays could yield unreliable data. We are currently reestablishing a verified him-17 strain, but in the interest of accuracy and reproducibility, we have restricted our analysis in this manuscript to validated datasets derived from confirmed strains.

      Reviewer #2 (Recommendations for the authors): 

      The authors have addressed most of the previous concerns and substantially improved the manuscript. The new data demonstrate that HIM-5 localization depends on DSB-1, and together with the Y2H and co-IP results, strengthen the link between HIM-5 and the DSB-forming machinery in C. elegans. The remaining points are outlined below: 

      Specific comments: 

      The font size of the text and labels in the figures is very small and hardly legible. Please enlarge them and make them clearly visible (Fig 1A, 1B, 2A, 2B, 2C, 2D, 2E, 3A, 3B, 3C, 3D, 3F)

      Done

      Although the authors have addressed the specificity of the XND-1 antibody, it remains unclear whether the boxed band is specific to the him-17::3xHA IP, since the same band appears in the control IP, albeit with lower intensity (Fig 1B). Is the ~100 kDa band in the him-17::3xHA IP a modified form of XND-1? While antibody specificity was previously demonstrated by IF using xnd-1 mutants, it would be ideal to confirm this on a western blot as well. 

      A Western blot performed using whole cell extracts and probed with the anti-XND-1 antibody has been provided in the revised version of the manuscript (Fig. S1A). This confirms that the antibody specifically recognizes XND-1 protein. We believe that the ~100 kDa band mentioned by the reviewer is likely to be a non-specific cross-reacting band detected by the antibody, since a band of the same MW was also detected in xnd-1 null mutants (Fig. S1A).

      Regarding the IP negative controls, we are firmly convinced that the boxed band is specific, and the fact that a (very) low-intensity band is also found in the negative control should not undermine the validity of the specific HIM-17-XND-1 interaction. There are many similar examples across the literature, as it is widely acknowledged amongst biochemists that some proteins may “stick” to the beads due to their intrinsic biochemical properties despite the use of highly stringent IP buffers. However, the high level of enrichment detected in the IP (as also underlined by the reviewer) corroborates that XND-1 specifically immunoprecipitates with HIM-17, even though low, non-specific binding to the HA beads is present. If the interaction between XND-1 and HIM-17 were non-specific, we would have expected the band in the IP and the band in the negative control to be of very similar intensity, which is clearly not the case. 

      Although co-IP is generally not considered a strictly quantitative assay, we want to emphasize that a comparable amount of nuclear extract was employed in both samples, as also evidenced by the inputs, which show that, if anything, slightly less nuclear extract was employed in the him-17::3xHA; him-5::GFP::3xFLAG sample than in the him-5::GFP::3xFLAG negative control, corroborating the above-mentioned points.

      Lastly, it is crucial to mention that mass spectrometry analyses performed on HIM-17::3xHA pulldowns show XND-1 as a highly enriched interacting protein (Blazickova et al.; 2025 Nature Comms.), which strongly supports our co-IP results.

      The subheading "HIM-5 is the essential factor for meiotic breaks in the X chromosome" does not accurately represent the work described in the Results or in Figure 1. I disagree with the authors' response to the earlier criticism. The issue is not merely semantic. The data do not demonstrate that HIM-5 is required for DSB formation on the X chromosome - this conclusion can only be inferred. What Figure 1 shows is that XND-1 and HIM-17 interact, and that pie-1p-driven HIM-5 expression can partially rescue meiotic defects of him-17 mutants. This supports the conclusion that him-5 is a target of HIM-17/XND-1 in promoting CO formation on the X chromosome. However, the data provide no direct evidence for the claim stated in the subheading. I strongly encourage authors to revise the subheading to more accurately represent the findings presented in the paper. 

      After considering the reviewer’s comments, we have revised the subheading to more accurately describe our findings.

      In Fig1C, please fix the typo in the last row - "pie1p::him5-::GFP" to "pie-1p::him-5::GFP".

      Done

      In Fig 2C, "p" is missing from the label on the right for Phim-5::him-5::GFP.

      Done

      In Fig 3I, bring the labels (DSB-1/2/3) at the lower right to the front.

      Done

      In Concluding Remarks, please fix the typo "frequently".

      Done

      Reviewer #3 (Recommendations for the authors): 

      The experiments that analyze HIM-5 in dsb-1 mutants should be repeated using antibodies against the endogenous HIM-5 protein, and localization of the HIM-5::HA and HIM-5::GFP proteins should be compared directly to antibody staining. This work uses an epitope-tagged protein and a GFP-tagged protein to analyze the localization of HIM-5, while prior work (Meneely et al., 2012) used an antibody against the endogenous protein. In Figures 2 and S4 of this paper, neither HIM-5::HA nor HIM-5::GFP appears to localize strongly to chromatin, and autosomal enrichment of HIM-5, as previously reported for the endogenous protein based on antibody staining, is not evident. Moreover, HIM-5::GFP and HIM-5::HA look different from each other, and neither resembles the low-resolution images shown in Figure 6 in Meneely et al 2012, which showed nuclear staining throughout the germline, including in the mitotic zone, and also in somatic sheath cells. Given the differences in localization between the tagged transgenes and the endogenous protein, it is important to analyze the behavior of the endogenous, untagged protein. A minor issue: a wild-type control should also be shown for HIM-5::HA in Figure S4. 

      Wild type control added to figure S4

      Evidence that XND-1 and HIM-17 form a complex is weak; it is supported by the Y2H and co-IP data but opposed by functional analysis or localization. The diversity of proteins found in the Co-IP of HIM-17::GFP (Table S2) indicates that these interactions are unlikely to be specific. The independent localization of these proteins to chromatin is clear evidence that they do not form an obligate complex; additionally, they have been found to regulate distinct (although overlapping) sets of genes. The predicted structure generated by AlphaFold3 has very low confidence and should not be taken as evidence for an interaction. The newly added argument about the apparent lack of overlap between HIM-17 and XND-1 due to the distance between the HA tag on HIM-17 and XND-1 is flawed and should be removed - the extended C-terminus of HIM-17 in the AlphaFold3 prediction has been interpreted as if it were a structured domain. Moreover, the predicted distance of 180 Å (18 nm) is comparable to the distance between a fluorophore on a secondary antibody and the epitope recognized by the primary antibody (~20-25 nm) and is far below the resolution limit of light microscopy. 

      We appreciate the reviewer’s thoughtful comment. The evidence supporting a physical interaction between XND-1 and HIM-17 is not only shown by our co-IP experiments, but has also been recently shown in an independent study where MS analyses were conducted on HIM-17::3xHA pulldowns to identify novel HIM-17 interactors (Blazickova et al.; 2025 Nature Comms). As shown in the data provided in that study, XND-1 was also identified as a highly enriched putative HIM-17 interactor under these experimental settings. We do acknowledge that their chromatin localization patterns are distinct and that they regulate overlapping but not identical sets of genes. However, it is worth noting that protein–protein interactions in meiosis are often transient or context-dependent, and may not necessarily result in co-localization detectable by microscopy. In line with this, the same work cited above reported a similar situation for BRA-2 and HIM-17, which were shown to interact biochemically despite the absence of overlapping staining patterns. 

      Minor issues: 

      The images shown in Panel D in Figure 1 seem to have very different resolutions; the HTP-3/HIM-17 colocalization image is particularly blurry/low-resolution and should be replaced. The contrast between blue and green cannot be seen clearly; colors with stronger contrast should be used, and grayscale images should also be shown for individual channels. High-resolution images should probably be included for all of the factors analyzed here to facilitate comparisons.

    1. eLife Assessment

      This study reports important advances in our understanding of how enteropathogenic E. coli (EPEC) interacts at the intestinal interface. Solid data describe a novel model of spatially coordinated calcium signaling to modulate NF-κB activation; additional data and clarification of methods would improve the strength of these conclusions. These findings, which integrate imaging, genetics, and computational modeling, provide a new way to consider host-pathogen interactions in EPEC infections that may lead to improved therapies.

    2. Reviewer #1 (Public review):

      Summary:

      In their article, Guo and coworkers investigate the Ca²⁺ signaling responses induced by Enteropathogenic Escherichia coli (EPEC) in epithelial cells and how these responses regulate NF-κB activation. The authors show that EPEC induces rapid, spatially coordinated Ca²⁺ transients mediated by extracellular ATP released through the type III secretion system (T3SS). Using high-speed Ca²⁺ imaging and stochastic modeling, they propose that low ATP levels trigger "Coordinated Ca²⁺ Responses from IP₃R Clusters" (CCRICs) via fast Ca²⁺ diffusion and Ca²⁺-induced Ca²⁺ release. These responses may dampen TNF-α-induced NF-κB activation through Ca²⁺-dependent modulation of O-GlcNAcylation of p65. The interdisciplinary work suggests a new perspective on calcium-mediated immune response by combining quantitative imaging, bacterial genetics, and computational modeling.

      Strengths:

      The study provides a new concept for host responses to bacterial infections and introduces the concept of Coordinated Ca²⁺ Responses from IP₃R Clusters (CCRICs) as synchronized, whole-cell-scale Ca²⁺ transients with the fast kinetics typical of local events. This is elegantly done by an interdisciplinary approach using quantitative measurements and mechanistic modelling.

      Weaknesses:

      (1) The effect of coordination by fast diffusion for small eATP concentrations is explained by the resulting low Ca²⁺ concentration that is not as strongly affected by calcium buffers compared to higher concentrations. While I agree with this statement on the relative level, CICR is based on the resulting absolute concentration at neighboring IP₃Rs (to activate them). Thus, I do not fully agree with the explanation, or at least would expect to use the modelling approach to demonstrate this effect. Simulations for different activation and buffer concentrations could strengthen this point and exclude potential inhibition of channels at higher stimulation levels.

      In this respect, I would also include the details of the modelling, such as implementation environment, parameters, and benchmarking. The description in the Supplementary Methods is very similar to the description in the main text. In terms of reproducibility, it would be important to at least provide simulation parameters, and providing the code would align with the emerging standards for reproducible science.

      (2) Quantitative characterization of CCRICs:

      The paper would benefit from a clearer definition of the term CCRICs and quantitative descriptors like duration, amplitude distribution, frequency, and spatial extent (also in relation to the comment on the EGTA measurements below). Furthermore, it remains unclear to me whether CCRICs represent a population of rapidly propagating micro-waves or truly simultaneous events. Maybe kymographs or wave-front propagation analyses (at least from simulations if experimental resolution is too bad) would strengthen this point.

      (3) Specificity of pharmacological tools:

      Suramin and U73122 are known to have off-target effects. Control experiments using alternative P2 receptor antagonists like PPADS or inactive U73343 analogs would strengthen the causal link.

    3. Reviewer #2 (Public review):

      Summary:

      The authors of this study are trying to resolve how cellular infection by enteropathogenic E. coli (EPEC) subverts cellular signaling pathways to promote infection and dampen immune responses. Specifically, alteration in calcium dynamics has been evidenced in the prior literature as a potential initiator of these adaptations, and this study provides ideas and mechanistic detail as to how cellular calcium dynamics may be subverted by pathogens.

      Strengths:

      The clear strengths of this paper relate to the new ideas inherent in the proposed hypothesis and their support from the experimental approaches used. Overall, the proposed work provides new ideas in this area, which will benefit from further investigation. Certainly, this is an interesting and challenging paradigm to pick apart mechanistically, and is important for improving treatments for intestinal infections.

      Weaknesses:

      Additional insight is needed in three specific areas to convincingly support the conclusions drawn by the authors. These three areas are: first, a better description of the infection-associated calcium signals. Second, a mechanistic definition of the relevant purinoceptors versus other pathways to increase cellular calcium. Third, an effort to show that the proposed pathways have relevance in a polarized epithelial cell.

  2. Dec 2025
    1. Author response:

      Reviewer #1:

      We thank the reviewer for this important point. Beyond long reaction times, we did not originally exclude participants based on low EMA variability. We agree this is a relevant concern, particularly given the need to add small random noise to some EMA series for model convergence. In the revised manuscript, we will assess additional indicators of careless responding, including within-person EMA variability (e.g., standard deviation or proportion of modal responses) following Jaso et al., 2022 criteria. We will conduct sensitivity analyses excluding low-variability responses or participants and report whether these checks affect the robustness of the results. We will also clarify in the Discussion that minimal EMA variance may reflect either true affective stability or reduced engagement, and discuss how this ambiguity may affect interpretation.
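      The two variability indicators named above (within-person standard deviation and proportion of modal responses) can be illustrated with a minimal sketch. The function names and the flagging thresholds `sd_floor` and `modal_ceiling` are hypothetical and chosen only for illustration; they are not taken from the manuscript or from Jaso et al., 2022.

```python
from collections import Counter
from statistics import pstdev


def variability_indicators(responses):
    """Return (standard deviation, proportion of modal responses)
    for one participant's EMA series."""
    if not responses:
        raise ValueError("EMA series is empty")
    # Population SD is used so that a single-response series yields 0.0
    sd = pstdev(responses)
    # Count of the most frequent response value, as a share of all responses
    modal_count = Counter(responses).most_common(1)[0][1]
    return sd, modal_count / len(responses)


def flag_low_variability(responses, sd_floor=0.5, modal_ceiling=0.8):
    """Flag a series as low-variability. Both thresholds are
    illustrative assumptions, not published criteria."""
    sd, modal_share = variability_indicators(responses)
    return sd < sd_floor or modal_share > modal_ceiling
```

      A flagged series would then be a candidate for exclusion in a sensitivity analysis, not evidence of careless responding by itself, since low variance may also reflect true affective stability.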

      Reviewer #2:

      We thank the reviewer for raising this fundamental conceptual concern. We agree that more research is needed to fully understand the processes captured by DQRT. In the revised manuscript, we will more clearly reference and summarize prior validation work from our lab providing strong support for a cognitive characterization of DQRT as a measure of cognitive processing speed, while also explicitly acknowledging potential confounds and limitations (Teckentrup et al., 2025). We will clarify that our DQRT computation followed those validated procedures, including exclusion of extreme values above the sample-specific median + 2 SD. In addition, consistent with Reviewer #1’s comment, we will expand the Discussion of how potential careless responding and non-cognitive factors may influence DQRT. We will further tone down language implying causal inference.
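      The exclusion rule mentioned above (values above the sample-specific median + 2 SD) could be sketched as follows. This is an assumption-laden illustration: the function name `exclude_extreme` is hypothetical, the sample standard deviation is computed over the same list of response times, and the validated procedure in Teckentrup et al., 2025 may differ in detail.

```python
from statistics import median, stdev


def exclude_extreme(response_times, n_sd=2.0):
    """Drop response times above median + n_sd * SD of the sample.

    Returns the retained values and the cutoff that was applied.
    """
    if len(response_times) < 2:
        raise ValueError("need at least two values to compute an SD")
    cutoff = median(response_times) + n_sd * stdev(response_times)
    kept = [rt for rt in response_times if rt <= cutoff]
    return kept, cutoff
```

      For example, with response times of 1.0, 1.2, 0.9, 1.1, 1.0 and 10.0 seconds, the single 10.0 s value falls above the cutoff and is removed, while the other five are retained.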

      References

      Jaso, B. A., Kraus, N. I., & Heller, A. S. (2022). Identification of careless responding in ecological momentary assessment research: From posthoc analyses to real-time data monitoring. Psychological Methods, 27(6), 958.

      Teckentrup, V., Rosická, A. M., Donegan, K. R., Gallagher, E., Hanlon, A. K., & Gillan, C. M. (2025). Digital questionnaire response time (DQRT): A ubiquitous and low-cost digital assay of cognitive processing speed. Behavior Research Methods, 57(7), 200.

    1. eLife Assessment

      This useful manuscript reports findings indicating that cell cycle progression and cytokinesis both contribute to the transition from early to late neural stem cell fates. Although orthogonal approaches would help confirm the findings, which are based on loss-of-function, the experimental evidence is convincing. Lastly, an investigation of the underlying mechanisms linking the cell cycle to temporal factor expression is still needed.

    2. Reviewer #1 (Public review):

      Summary:

      Drosophila larval type II neuroblasts generate diverse types of neurons by sequentially expressing different temporal identity genes during development. Previous studies have shown that transition from early temporal identity genes (such as Chinmo and Imp) to late temporal identity genes (such as Syp and Broad) depends on the activation of the expression of EcR by Seven-up (Svp) and progression through the G1/S transition of the cell cycle. In this study, Chaya and Syed examined if the expression of Syp and EcR is regulated by cell cycle and cytokinesis by knocking down CDK1 or Pav, respectively, throughout development or at specific developmental stages. They find that knocking down CDK1 or Pav either in all type II neuroblasts throughout the development or in single type neuroblast clones after larval hatching consistently leads to failure to activate late temporal identity genes Syp and EcR. To determine whether the failure of the activation of Syp and EcR is due to impaired Svp expression, they also examined Svp expression using a Svp-lacZ reporter line. They find that Svp is expressed normally in CDK1 RNAi neuroblasts. Further, knocking down CDK1 or Pav after Svp activation still leads to loss of Syp and EcR expression. Finally, they also extended their analysis to type I neuroblasts. They find that knocking down CDK1 or Pav, either at 0 hours or at 42 hours after larval hatching, also results in loss of Syp and EcR expression in type I neuroblasts. Based on these findings, the authors conclude that cell cycle and cytokinesis are required for the transition from early to late temporal identity genes in both types of neuroblasts. These findings add mechanistic details to our understanding of the temporal patterning of Drosophila larval neuroblasts.

      Strengths:

      The data presented in the paper are solid and largely support their conclusion. Images are of high quality. The manuscript is well-written and clear.

      Weaknesses:

      The authors have addressed all the weaknesses in this revision.

    3. Reviewer #2 (Public review):

      Summary:

      Neural stem cells produce a wide variety of neurons during development. The regulatory mechanisms of neural diversity are based on the spatial and temporal patterning of neural stem cells. Although the molecular basis of spatial patterning is well-understood, the temporal patterning mechanism remains unclear. In this manuscript, the authors focused on the roles of cell cycle progression and cytokinesis in temporal patterning and found that both are involved in this process.

      Strengths:

      They conducted RNAi-mediated disruption on cell cycle progression and cytokinesis. As they expected, both disruptions affected temporal patterning in NSCs.

      Weaknesses:

      Although the authors showed clear results, they needed to provide additional data to support their conclusion sufficiently.

      For example, they can examine the effects of cell cycle acceleration on the temporal patterning.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Chaya and Syed focuses on understanding the link between cell cycle and temporal patterning in central brain type II neural stem cells (NSCs). To investigate this, the authors perturb the progression of the cell cycle by delaying the entry into M phase and preventing cytokinesis. Their results convincingly show that temporal factor expression requires progression of the cell cycle in both Type 1 and Type 2 NSCs in the Drosophila central brain. Overall, this study establishes an important link between the two timing mechanisms of neurogenesis.

      Strengths:

      The authors provide solid experimental evidence for the coupling of cell cycle and temporal factor progression in Type 2 NSCs. The quantified phenotype shows an all-or-none effect of cell cycle block on the emergence of subsequent temporal factors in the NSCs, strongly suggesting that both nuclear division and cytokinesis are required for temporal progression. The authors also extend this phenotype to Type 1 NSCs in the central brain, providing a generalizable characterization of the relationship between cell cycle and temporal patterning.

      Weaknesses:

      One major weakness of the study is that the authors do not explore the mechanistic relationship between cell cycle and temporal factor expression. Although their results are quite convincing, they do not provide an explanation as to why Cdk1 depletion affects Syp and EcR expression but not the onset of svp. This result suggests that at least a part of the temporal cascade in NSCs is cell-cycle independent which isn't addressed or sufficiently discussed.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Drosophila larval type II neuroblasts generate diverse types of neurons by sequentially expressing different temporal identity genes during development. Previous studies have shown that the transition from early temporal identity genes (such as Chinmo and Imp) to late temporal identity genes (such as Syp and Broad) depends on the activation of the expression of EcR by Seven-up (Svp) and progression through the G1/S transition of the cell cycle. In this study, Chaya and Syed examined whether the expression of Syp and EcR is regulated by cell cycle and cytokinesis by knocking down CDK1 or Pav, respectively, throughout development or at specific developmental stages. They find that knocking down CDK1 or Pav either in all type II neuroblasts throughout development or in single-type neuroblast clones after larval hatching consistently leads to failure to activate late temporal identity genes Syp and EcR. To determine whether the failure of the activation of Syp and EcR is due to impaired Svp expression, they also examined Svp expression using a Svp-lacZ reporter line. They find that Svp is expressed normally in CDK1 RNAi neuroblasts. Further, knocking down CDK1 or Pav after Svp activation still leads to loss of Syp and EcR expression. Finally, they also extended their analysis to type I neuroblasts. They find that knocking down CDK1 or Pav, either at 0 hours or at 42 hours after larval hatching, also results in loss of Syp and EcR expression in type I neuroblasts. Based on these findings, the authors conclude that cell cycle and cytokinesis are required for the transition from early to late temporal identity genes in both types of neuroblasts. These findings add mechanistic details to our understanding of the temporal patterning of Drosophila larval neuroblasts.

      Strengths:

      The data presented in the paper are solid and largely support their conclusion. Images are of high quality. The manuscript is well-written and clear.

      We appreciate the reviewer’s detailed summary and recognition of the study’s strengths.

      Weaknesses:

      The quantifications of the expression of temporal identity genes and the interpretation of some of the data could be more rigorous.

      (1) Expression of temporal identity genes may not be just positive or negative. Therefore, it would be more rigorous to quantify the expression of Imp, Syp, and EcR based on the staining intensity rather than simply counting the number of neuroblasts that are positive for these genes, which can be very subjective. Or the authors should define clearly what qualifies as "positive" (e.g., a staining intensity at least 2x background).

      We thank the reviewer for this helpful suggestion. In the new version, we have now clarified how positive expression was defined and added more details of our quantification strategy to the Methods section (page 11, lines 380-388; lines 426-434 in tracked changes file). Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered positive for a given factor when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.

      (2) The finding that inhibiting cytokinesis without affecting nuclear divisions by knocking down Pav leads to the loss of expression of Syp and EcR does not support their conclusion that nuclear division is also essential for the early-late gene expression switch in type II NSCs (at the bottom of the left column on page 5). No experiments were done in this study to specifically block nuclear division. This conclusion should be revised.

      We blocked both cell cycle progression and cytokinesis, and both these manipulations affected temporal gene transitions, suggesting that both cell cycle and cytokinesis are essential. To our knowledge, no mechanism/tool exists that selectively blocks nuclear division while leaving cell cycle progression intact. We have added more clarification on page 4, line 123 onwards (lines 126 onwards in tracked changes file).

      (3) Knocking down CDK1 in single random neuroblast clones does not make the CDK1 knockdown neuroblast develop in the same environment (except still in the same brain) as wild-type neuroblast lineages. It does not help address the concern whether "type 2 NSCS with cell cycle arrest failed to undergo normal temporal progression is indirectly due to a lack of feedback signaling from their progeny", as discussed (from the bottom of the right column on page 9 to the top of the left column on page 10). The CDK1 knockdown neuroblasts do not divide to produce progeny and thus do not receive a feedback signal from their progeny as wild-type neuroblasts do. Therefore, it cannot be ruled out that the loss of Syp and EcR expression in CDK1 knockdown neuroblasts is due to the lack of the feedback signal from their progeny. This part of the discussion needs clarification.

      Thanks to the reviewer for raising this critical point. We agree and have added more clarification of our interpretations and the limitations of our study in the revised text on page 8, lines 278-282 (lines 296-300 in tracked changes file).

      (4) In Figure 2I, there is a clear EcR staining signal in the clone, which contradicts the quantification data in Figure 2J that EcR is absent in Pav RNAi neuroblasts. The authors should verify that the image and quantification data are consistent and correct.

      When cytokinesis is blocked using pav-RNAi, the neuroblasts become extremely large and multinucleated. In some large pav RNAi clones, we observed a weak EcR signal near the cell membrane. However, more importantly, none of the nuclear compartments showed detectable EcR staining, where EcR is typically localized. We selected a representative nuclear image for the figure panel. To clarify this observation, we have now added an explanatory note to the discussion section on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      Reviewer #2 (Public review):

      Summary:

      Neural stem cells produce a wide variety of neurons during development. The regulatory mechanisms of neural diversity are based on the spatial and temporal patterning of neural stem cells. Although the molecular basis of spatial patterning is well-understood, the temporal patterning mechanism remains unclear. In this manuscript, the authors focused on the roles of cell cycle progression and cytokinesis in temporal patterning and found that both are involved in this process.

      Strengths:

      They conducted RNAi-mediated disruption on cell cycle progression and cytokinesis. As they expected, both disruptions affected temporal patterning in NSCs.

      We appreciate the reviewer’s positive assessment of our experimental results.

      Weaknesses:

      Although the authors showed clear results, they needed to provide additional data to support their conclusion sufficiently.

      For example, they need to identify type II NSCs using molecular markers (Ase/Dpn). The authors are encouraged to provide a more detailed explanation of each experiment. The current version of the manuscript is difficult for non-expert readers to understand.

      Thanks for your feedback. We have now included a detailed description of how we identify type II NSCs in both wild-type and mutant clones. We have also added a representative Asense staining to clearly distinguish type 1 (Ase+) from type 2 (Ase-) NSCs (see Figure S1). We have also added a resources table explaining the genotypes associated with each figure, which was omitted due to an error in the previous version of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chaya and Syed focuses on understanding the link between cell cycle and temporal patterning in central brain type II neural stem cells (NSCs). To investigate this, the authors perturb the progression of the cell cycle by delaying the entry into M phase and preventing cytokinesis. Their results convincingly show that temporal factor expression requires progression of the cell cycle in both Type 1 and Type 2 NSCs in the Drosophila central brain. Overall, this study establishes an important link between the two timing mechanisms of neurogenesis.

      Strengths:

      The authors provide solid experimental evidence for the coupling of cell cycle and temporal factor progression in Type 2 NSCs. The quantified phenotype shows an all-or-none effect of cell cycle block on the emergence of subsequent temporal factors in the NSCs, strongly suggesting that both nuclear division and cytokinesis are required for temporal progression. The authors also extend this phenotype to Type 1 NSCs in the central brain, providing a generalizable characterization of the relationship between cell cycle and temporal patterning.

      We thank the reviewer for recognizing the robustness of our data linking the cell cycle to temporal progression.

      Weaknesses:

      One major weakness of the study is that the authors do not explore the mechanistic relationship between the cell cycle and temporal factor expression. Although their results are quite convincing, they do not provide an explanation as to why Cdk1 depletion affects Syp and EcR expression but not the onset of svp. This result suggests that at least a part of the temporal cascade in NSCs is cell-cycle independent, which isn't addressed or sufficiently discussed.

      Thank you for bringing up this important point. We are equally interested in uncovering the mechanism by which the cell cycle regulates temporal gene transitions; however, such mechanistic exploration is beyond the scope of the present study. Interestingly, while the temporal switching factor Svp is expressed independently of the cell cycle, the subsequent temporal transitions are not. We have expanded our discussion on this intriguing finding (page 9, lines 307-315; lines 345-355 in tracked changes file). Specifically, we propose that svp activation marks a cell-cycle–independent phase, whereas EcR/Syp induction likely depends on cell-cycle–coupled mechanisms, such as mitosis-dependent chromatin remodeling or daughter-cell feedback. Although further dissection of this mechanism lies beyond the current study, our findings establish a foundation for future work aimed at identifying how developmental timekeeping is molecularly coupled to cell-cycle progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Figure 1 C and D, it would be better to put a question mark to indicate that these are hypotheses to be tested. 

      We appreciate this suggestion and have added question marks in Figure 1C and 1D to clearly indicate that these panels represent hypotheses under investigation.

      (2) Figure 2A-I, Figure 4A-I, Figure 5A-I and K-S, in addition to enlarged views of single type II neuroblasts, it would be more convincing to include zoomed-out images of the entire larval brain or at least a portion of the brain to include neighboring wild-type type II neuroblasts as internal controls. Also, it would be ideal to show EcR staining from the same neuroblasts as IMP and Syp staining. 

      We thank the reviewer for this valuable input. In our imaging setup, the number of available antibody channels was limited to four (anti-Ase, anti-GFP, anti-Syp, and anti-Imp). Adding EcR in the same sample was therefore not technically possible, so we performed EcR staining separately.

      (3) The authors cited "Syed et al., 2024" (in the middle of the right column on page 5), but this reference is missing in the "References" section and should be added. 

      The missing citation has been added to the reference section.  

      (4) It would be better to include Ase staining in the relevant figure to indicate neuroblast identity as type I or type II. 

      We agree and now include representative Ase staining for both type 1 and type 2 NSC clones in Figure S1, along with corresponding text updates that describe these markers.

      Reviewer #2 (Recommendations for the authors): 

      Major comments 

      (1) The present conclusion relies on the results using Cdk1 RNAi and pav RNAi. It is still possible that Cdk1 and Pav are involved in the regulation of temporal patterning independent of the regulation of cell cycle or cytokinesis, respectively. To avoid this possibility, the authors need to inhibit cell cycle progression or cytokinesis in another alternative manner. 

      We thank the reviewer for raising this important point. While we cannot completely exclude gene-specific, cell-cycle-independent roles for Cdk1 or Pav, we observe consistent phenotypes across several independent manipulations that slow or block the cell cycle. Also, earlier studies using orthogonal approaches that delay G1/S (Dacapo/Rbf) or impair mitochondrial OxPhos (which lengthens G1/S; van den Ameele & Brand, 2019) produce similar temporal delays. These concordant phenotypes strongly support the interpretation that altered cell-cycle progression—rather than specific roles of a single gene—is the primary cause of the defect. While we cannot exclude additional, gene-specific effects of Cdk1 or Pav, the concordant phenotypes across independent perturbations make the cell-cycle disruption model the most parsimonious interpretation. We have clarified this reasoning in the discussion section on pages 8-9, lines 293-305 (lines 311-343 in tracked changes file).

      (2) To reach the present conclusion, the authors need to address the effects of acceleration of cell cycle progression or cytokinesis on temporal patterning. 

      We thank the reviewer for this insightful suggestion. To our knowledge, there are currently no established genetic tools that can specifically accelerate cell-cycle progression in Drosophila neuroblasts. However, our results demonstrate that blocking the cell cycle impairs the transition from early to late temporal gene expression. These findings suggest that proper cell-cycle progression is essential for the transition from early to late temporal identity in neuroblasts.

      Minor comments 

      (3) P3L2 (right), ... we blocked the NSC cell cycle...

      How did they do it? 

      Which fly lines were used?

      Why did they use the line? 

      These details are now included in the Materials and Methods and the Resource Table (pages 11-13). We used Wor-Gal4, Ase-Gal80 to drive UAS-Cdk1RNAi and UAS-pavRNAi in type 2 NSCs.

      (4) P5L1(left), ... we used the flip-out approach...

      Why did they conduct it? 

      Probably, the authors have reasons other than "to further ensure." 

      We have clarified in the text on page 4, lines 137-139, that the flip-out approach was used to generate random single-cell clones, enabling quantitative analysis of type 2 NSCs within an otherwise wild-type brain. 

      (5) P5L8(left), ... type 2 hits were confirmed by lack of the type 1 Asense...  The authors must examine Deadpan (Dpn) expression as well, because there are a lot of Asense (Ase)-negative cells in the brain (neurons, glial cells, and neuroepithelial cells).

      Type II NSCs can be identified as Dpn+/Ase- cells.

      We agree that Dpn is a helpful marker. However, we reliably distinguished type II NSCs by their lack of Ase and larger cell size relative to surrounding neurons and glia, which are smaller in size and located deeper within the clone. These differences, together with established lineage patterns, allow unambiguous identification of type 2 NSCs across all genotypes. We have now added representative type I and type 2 NSC clones to the supplemental figure S1 (E-G’) with Asense stains to demonstrate how we differentiate type I from type II NSCs. 

      (6) P5L32(left), To do this, we induced... 

      This sentence should be made more concise.

      Please rephrase it. 

      The sentence has been rewritten for clarity and concision.

      (7)  P5L42(left), ...lack of EcR/Syp expression (Figure 2).  However, EcR expression is still present (Figure 2I). 

      In some large pavRNAi clones, a weak EcR signal can be observed near the cell membrane; however, none of the nuclear compartments—where EcR is typically localized—show detectable staining. We selected a representative nuclear image for the figure and addressed this observation on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      (8) P7L29(left), ......had persistent Imp expression...

      Imp expression is faint compared to that in Figure 2G.

      The differences between Figures 2G and 3G should be discussed. 

      We thank the reviewer for this comment. We have added a note in the Methods section clarifying that brightness and contrast were adjusted per panel for optimal visualization; thus, apparent differences in signal intensity do not reflect biological variation. Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered Imp-positive when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.

      (9) P8 (Figure 5)

      The Imp expression is faint compared to that in Figure 5Q.

      The difference between Figure 5G and 5Q should be discussed further. 

      As mentioned above, we have clarified our image processing approach in the Methods section to explain any differences in signal appearance between these figures.

      (10) P10 Materials and Methods

      The authors did not mention the fly lines used. This is very important for the readers. 

      We thank the reviewer for bringing this oversight to our attention. The Resource Table was inadvertently omitted from the initial submission. The complete list of fly lines and reagents used in this study is now provided in the updated Resource Table.

      Reviewer #3 (Recommendations for the authors): 

      Major points 

      (1) The authors mention that the heat-shock induction at 42ALH is well after svp temporal window and therefore the cell cycle block independently affects Syp and EcR expression. However, Figure 3 shows svp-LacZ expression at 48ALH. If svp expression is indeed transient in Type 2 NSCs, then this must be validated using an immunostaining of the svp-LacZ line with svp antibody. This is crucial as the authors claim that cell cycle block doesn't affect svp expression and is required independently.

      We thank the reviewer for bringing this important issue to our attention. As noted, Svp protein is expressed transiently and stochastically in type 2 NSCs (Syed et al., 2017), making direct antibody quantification challenging upon cell cycle block. Consistent with previous work (Syed et al., 2017), we used the svp-LacZ reporter line to visualize stabilized Svp expression, which reliably captures Svp expression in type 2 NSCs (Syed et al., 2017 https://doi.org/10.7554/eLife.26287, and Dhilon et al., 2024 https://doi.org/10.1242/dev.202504).

      (2) The authors have successfully slowed down the cell cycle and showed that it affects temporal progression. However, a converse experiment where the cell cycle is sped up in NSCs would be an important test for the direct coupling of temporal factor expression and cell cycle, wherein the expectation would be the precocious expression of late temporal factors in faster cycle NSCs. 

      We agree that such an experiment would be ideal. However, as noted above (Reviewer #2 comment 2), to our knowledge, no suitable tools currently exist to accelerate neuroblast cell-cycle progression without pleiotropic effects.

      Minor point 

      The authors must include Ray and Li (https://doi.org/10.7554/eLife.75879) in the references when describing that "...cell cycle has been shown to influence temporal patterning in some systems,...".  

      We thank the reviewer for this helpful suggestion. The cited reference (Ray and Li, eLife, 2022) has now been included and appropriately referenced in the revised manuscript.

    1. eLife Assessment

      The authors investigate arrestin2-mediated CCR5 endocytosis in the context of clathrin and AP2 contributions. Using an extensive set of NMR experiments, and supported by microscopy and other biophysical assays, the authors provide compelling data on the roles of AP2 and clathrin in CCR5 endocytosis. This important work will appeal to an audience beyond those studying chemokine receptors, including those studying GPCR regulation and trafficking. The distinct role of AP2 and not clathrin will be of particular interest to those studying GPCR internalization mechanisms.

    2. Reviewer #1 (Public review):

      Petrovic et al. investigate CCR5 endocytosis via arrestin2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data clearly demonstrate chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization.

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays.

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure.

    3. Reviewer #2 (Public review):

      Summary:

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation.

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis.

      Strengths:

      The 15N,1H and 13C,methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin with both kD and description of the interfaces.

    4. Reviewer #3 (Public review):

      Summary:

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field.

      Strengths:

      Strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL 376LIELD380 as the clathrin binding site, the use of the NMR is of high interest (Fig. 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2.

      SEC and NMR data suggest that binding of full-length arr2 (1-418) to the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Fig. 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated or non-phosphorylated peptides. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis.

      To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes onto endosomes (Fig. 4). The data suggest that complex internalization is dependent on AP2 binding not clathrin (Fig. 5).

      The addition of the antagonist experiment/data adds rigor to the study.

      Overall, this is a solid study that will be of interest to the field.

    5. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Petrovic et al. investigate CCR5 endocytosis via arrestin 2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data clearly demonstrate chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. 

      To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization. 

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays. 

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure.

      We thank the referee for this positive assessment of our manuscript. Indeed, by stepping away from the common receptor models for understanding internalization (b2AR and V2R), we revealed the phosphorylation level of the receptor as a key factor in driving the sequestration of the receptor from the plasma membrane. We hope that the proposed mechanistic model will aid further studies to obtain an even more detailed understanding of forces driving receptor internalization.

      Weaknesses: 

      Figure 1 shows regions of the AlphaFold model that are intrinsically disordered without making it clear that these are not expected to occupy a stable position. The authors' NMR titration data are n=1. Many figure panels require that readers pinch and zoom to see the data.

      In the “Recommendations for the Authors” section, we addressed the reviewer’s stated weaknesses. In short, for the AlphaFold representation in Figure 1A, we added explicit labeling and revised the legend and main text to clearly state that the depicted loops are intrinsically disordered, absent from crystal structures due to flexibility, and shown only for visualization of their location. We also clarified that the NMR titration experiments inherently have n = 1 due to technical limitations, and that this is standard practice in the field, while ensuring individual data points remain visible. The supplementary NMR figures now have more vibrant coloring, allowing easier data assessment. However, we have not changed the zooming of the microscopy and NMR spectra. We believe that the presentation of microscopy data, which already show zoomed-in regions of interest, follow standard practices in the field. Furthermore, we strongly believe that we should display full NMR spectra in the supplementary figures to allow the reader to assess the overall quality and behavior. As indicated previously, the reader can zoom in to very high resolution, since the spectra are provided by vector graphics. Zoomed regions of the relevant details are provided in the main figures.

      Reviewer #2 (Public review): 

      Summary: 

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation. 

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis. 

      Strengths: 

      The 15N,1H and 13C,methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin with both kD and description of the interfaces.

      We sincerely thank the referee for this encouraging evaluation of our work and appreciate the recognition of the NMR efforts and insights into the arrestin–clathrin–AP2 interactions.

      Weaknesses: 

      This reviewer has identified only minor weaknesses with the study. 

      (1) I don't observe two overlapping spectra of Arrestin2 (1-393) +/- CLTC NTD in Supp Figure 1.

      We believe the referee is referring to Figure 1 – figure supplement 2. We have now made the colors of the spectra more vibrant and used different contouring to make the differences between the two spectra clearer. The spectra are provided as vector graphics, which allows zooming in to the very fine details.

      (2) Arrestin-2 1-418 resonances all but disappear with CCR5pp6 addition. Are they recovered with Ap2Beta2 addition and is this what is shown in Supp Fig 2D

      We believe the reviewer is referring to Figure 3 - figure supplement 1. In this figure, the panels E and F show resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline) disappear upon the addition of CCR5pp6 (arrestin2<sup>1-418</sup>•CCR5pp6 complex spectrum in red). The panels C and D show resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline), which remain unchanged upon addition of AP2b2 <sup>701-937</sup> (orange), indicating no complex formation. We also recorded a spectrum of the arrestin2<sup>1-418</sup>•CCR5pp6 complex under addition of AP2b2 <sup>701-937</sup> (not shown), but the arrestin2 resonances in the arrestin2<sup>1-418</sup> •CCR5pp6 complex were already too broad for further analysis. This had been already explained in the text.

      “In agreement with the AP2b2 NMR observations, no interaction was observed in the arrestin2 methyl and backbone NMR spectra upon addition of AP2b2 in the absence of phosphopeptide (Figure 3-figure supplement 1C, D). However, the significant line broadening of the arrestin2 resonances upon phosphopeptide addition (Figure 3-figure supplement 1E, F) precluded a meaningful assessment of the effect of the AP2b2 addition on arrestin2 in the presence of phosphopeptide”.

      (3) I don't understand how methyl TROSY spectra of arrestin2 with phosphopeptide could look so broadened unless there are sample stability problems?

      We thank the referee for this comment. We would like to clarify that in general a broadened spectrum beyond what is expected from the rotational correlation time does not necessarily correlate with sample stability problems. It is rather evidence of conformational intermediate exchange on the micro- to millisecond time scale.

      The displayed <sup>1</sup>H-<sup>15</sup>N spectra of apo arrestin2 already suffer from line broadening due to such intrinsic mobility of the protein. These spectra were recorded with acquisition times of 50 ms (<sup>15</sup>N) and 55 ms (<sup>1</sup>H) and resolution-enhanced by a 60˚-shifted sine-bell filter for <sup>15</sup>N and a 60˚-shifted squared sine-bell filter for <sup>1</sup>H, respectively, which leads to the observed resolution with still reasonable sensitivity. The <sup>1</sup>H-<sup>15</sup>N resonances in Fig. 1b (arrestin2<sup>1-393</sup>) look particularly narrow. However, this region contains a large number of flexible residues. The full spectrum, e.g. Figure 1-figure supplement 2, shows the entire situation with a clear variation of linewidths and intensities. The linewidth variation becomes stronger when omitting the resolution enhancement filters.

      The addition of the CCR5pp6 phosphopeptide does not change protein stability, which we assessed by measuring the melting temperature of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 complex (Tm = 57°C in both cases). We believe that the explanation for the increased broadening of the arrestin2 resonances is that addition of the CCR5pp6, possibly due to the release of the arrestin2 strand b20, amplifies the mentioned intermediate timescale protein dynamics. This results in the disappearance of arrestin2 resonances.

      We have now included the assessment of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 stability in the manuscript:

      “The observed line broadening of arrestin2 in the presence of phosphopeptide must be a result of increased protein motions and is not caused by a decrease in protein stability, since the melting temperature of arrestin2 in the absence and presence of phosphopeptide are identical (56.9 ± 0.1 °C)”.

      (4) At one point the authors added excess fully phosphorylated CCR5 phosphopeptide (CCR5pp6). Does the phosphopeptide rescue resolution of arrestin2 (NH or methyl) to the point where interaction dynamics with clathrin (CLTC NTD) are now more evident on the arrestin2 surface?

      Unfortunately, when we titrate arrestin2 with CCR5pp6 (please see Isaikina & Petrovic et al., Mol. Cell, 2023 for more details), the arrestin2 resonances undergo fast-to-intermediate exchange upon binding. In the presence of phosphopeptide excess, very few resonances remain, the majority of which are in the disordered region, including resonances from the clathrin-binding loop. Due to the peak overlap, we could not unambiguously assign arrestin2 resonances in the bound state, which precluded our assessment of the arrestin2-clathrin interaction in the presence of phosphopeptide. We have now made this clearer in the paragraph ‘The arrestin2-clathrin interaction is independent of arrestin2 activation’:

      “Due to significant line broadening and peak overlap of the arrestin2 resonances upon phosphopeptide addition, the influence of arrestin activation on the clathrin interaction could not be detected on either backbone or methyl resonances”.

      (5) Once phosphopeptide activates arrestin-2 and AP2 binds can phosphopeptide be exchanged off? In this case, would it be possible for the activated arrestin-2 AP2 complex to re-engage a new (phosphorylated) receptor?

      This would be an interesting mechanism. In principle, this should be possible as long as the other (phosphorylated) receptor outcompetes the initial phosphopeptide with higher affinity towards the binding site. However, we do not have experiments to assess this process directly. Therefore, we rather wish not to further speculate.

      (6) I'd be tempted to move the discussion of class A and class B GPCRs and their presumed differences to the intro and then motivate the paper with specific questions. 

      We appreciate the referee’s suggestion and had a similar idea previously. However, as we do not have data on other class-A or class-B receptors, we rather don’t want to motivate the entire manuscript by this question.

      (7) Did the authors ever try SEC measurements of arrestin-2 + AP2beta2 + CCR5pp6 with and without PIP2, and with and without clathrin (CLTC NTD)? The question becomes what the active complex is and how PIP2 modulates this cascade of complexation events in class B receptors.

      We thank the referee for this question. Indeed, we tested whether PIP2 can stabilize the arrestin2•CCR5pp6•AP2 complex by SEC experiments. Unfortunately, the addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. The resolution of SEC experiments was not sufficient to distinguish arrestin2 in oligomeric form or in arrestin2•CCR5pp6•AP2 complex. We now mention this in the text:

      “We also attempted to stabilize the arrestin2-AP2b2-phosphopeptide complex through the addition of PIP2, which can stabilize arrestin complexes with the receptor (Janetzko et al., 2022). The addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. Unfortunately, the resolution of the SEC experiments was not sufficient to separate the arrestin2 oligomers from complexes with AP2b2”.

      Reviewer #3 (Public review): 

      Summary: 

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field. 

      Strengths: 

      Strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL 376LIELD380 as the clathrin binding site, the use of the NMR is of high interest (Fig. 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2.

      SEC and NMR data suggest that binding of full-length arr2 (1-418) to the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Fig. 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated or non-phosphorylated peptides. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis.

      To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes onto endosomes (Fig. 4). The data suggest that complex internalization is dependent on AP2 binding not clathrin (Fig. 5). 

      The addition of the antagonist experiment/data adds rigor to the study. 

      Overall, this is a solid study that will be of interest to the field.

      We thank the referee for the careful and encouraging evaluation of our work. We appreciate the recognition of the solidity of our data and the support for our conclusions regarding the distinct roles of AP2 and clathrin in arrestin-mediated receptor internalization.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors): 

      I believe that the authors have made efforts to improve the accessibility to a broader audience. In a few cases, I believe that the authors' response either did not truly address the concern or made the problem worse. I am grouping these as 'very strong opinions' and 'sticking point'.

      Very strong opinion 1: 

      While data presentation is somewhat at the authors' discretion, there were several figures where the presentation did not make the work approachable, including microscopy insets and NMR spectra. A suggestion to 'pinch and zoom' does not really address this. For the overlapping NMR spectra in supporting Figure 1, I actually -can- see this on zooming, but I did not recognize this on first pass because the colors are almost identical for the two spectra. This is an easy fix: coloring the two spectra distinctly would resolve it. The Supplemental figure to Fig. 2 looks strange with pinch and zoom. But at the end of the day, data presentation where the reader is to infer that they must zoom in is not very approachable and may prevent readers from being able to independently assess the data. In this case, there doesn't seem to be a strong rationale to not make these panels easier to see at 100% size.

      We appreciate the reviewer’s thoughtful comments regarding figure accessibility and agree that data presentation should be clear and interpretable without requiring readers to zoom in extensively. However, we must note that the presentation of the microscopy data follows standard practices in the field and that the panels already include zoomed-in regions, which enable easier access to key results and observations.

      Regarding the NMR data, we have revised Figure 1—figure supplement 2 and Figure 2—figure supplement 1 to match the presentation style of Figure 3—figure supplement 1, which the reviewer apparently found more accessible. We also made the colors of the spectra more vibrant, as the referee suggested. We would like to emphasize that it is absolutely necessary to display the full NMR spectra in order to allow independent assessment of signal assignment, data quality, and overall protein behavior. Zoomed regions of the relevant details are provided in the main figures.

      Very strong opinion 2: 

      The authors' response to the lack of individual data points and error bars is that this is an n=1 experiment. I do not believe this meets the minimum standard for best practices in the field.

      We respectfully disagree with the reviewer’s assessment. The Figure already displays individual data points, as shown already in the initial submission. Performing NMR titrations with isotopically labeled protein samples is inherently resource-intensive, and single-sample (n = 1) experiments are widely accepted and routinely reported in the field. Numerous studies have used the same approach, including Rosenzweig et al., Science (2013); Nikolaev et al., Nat. Methods (2019); and Hobbs et al., J. Biomol. NMR (2022), as well as our own recent work (Isaikina & Petrovic et al., Mol. Cell, 2023). These studies demonstrate that such NMR-based affinity measurements, even when performed on a single sample, are highly reproducible, precise, and consistent with orthogonal evidence and across different sample conditions.

      Sticking point:

      Figure 1A - the alphaFold model of arrestin2L depicts the disordered loops as ordered. The depiction is misleading at best, and inaccurate in truth. To use an analogy, what the authors depict is equivalent to publishing an LLM hallucination in the text. Unlike LLMs, alphaFold will actually flag its hallucination with the confidence (pLDDT) in the output. Both for LLMs and for alphaFold, we are spending much time teaching our students in class how to use computation appropriately - both to improve efficiency but also to ensure accuracy by removing hallucinations.

      The original review indicated that confidences needed to be shown and that this needed to be depicted in a way that clarifies that this is NOT a structural state of the loops. The newly added description ("The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures...") worsens the concern because it even more strongly implies that a 0 confidence computational output is a likely structural state. It also indicates that these regions were 'not detected' in crystal structures. These regions of arrestin are intrinsically disordered. AlphaFold (by its nature) must put out something in terms of coordinates, even if the pLDDT suggests that the region cannot be predicted or is not in a stable position, which is the case here. In crystal structures, these regions are not associated with interpretable electron density, meaning that coordinates are omitted in these regions because adding them would imply that under the conditions used, the protein adopts a low energy structural state in this region. This region is instead intrinsically disordered.

      A good description of why showing disordered loops in a defined position is incorrect and how to instead depict disorder correctly is in Brotzakis et al. Nat communications 16, 1632 (2025) "AlphaFold prediction of structural ensembles of disordered proteins", where figures 3A, 4A, and 5A show one AlphaFold prediction colored by confidence and 3B, 4B and 5B are more accurate depictions of the structural ensemble. 

      Coming back to the original comment "The AlphaFold model could benefit from a more transparent discussion of prediction confidence and caveats. The younger crowd (part of the presumed intended readership) tends to be more certain that computational output is 'true'...." Right now, the authors are still showing in Fig 1A a depiction of arrestin with models for the loops that are untrue. They now added text indicating that these loops are visualized in an AlphaFold prediction and 'true' but 'not detected in crystal structures'. There is no indication in the text that these are intrinsically disordered. The lack of showing the pLDDT confidence and the lack of any indication that these are disordered regions is simply incorrect. 

      We appreciate the concern of the reviewer towards AlphaFold models. As NMR spectroscopists we are highly aware of intrinsic biomolecular motions. However, our AlphaFold2 model is used as a graphical representation to display the interaction sites of loops; it is not intended to depict the loops as fixed structural states. The flexibility of the loops had been clearly described in the main text before:

      “Arrestin2 consists of two consecutive (N- and C-terminal) β-sandwich domains (Figure 1A), followed by the disordered clathrin-binding loop (CBL, residues 353–386), strand b20 (residues 386–390), and a disordered C-terminal tail after residue 393”.

      and

      “Figure 1B depicts part of a <sup>1</sup>H-<sup>15</sup>N TROSY spectrum (full spectrum in Figure 1-figure supplement 2A) of the truncated <sup>15</sup>N-labeled arrestin2 construct arrestin2<sup>1-393</sup> (residues 1-393), which encompasses the C-terminal strand β20, but lacks the disordered C-terminal tail. Due to intrinsic microsecond dynamics, the assignment of the arrestin2<sup>1-393</sup> <sup>1</sup>H-<sup>15</sup>N resonances by triple resonance methods is largely incomplete, but 16 residues (residues 367-381, 385-386) within the mobile CBL could be assigned. This region of arrestin is typically not visible in either crystal or cryo-EM structures due to its high flexibility”.

      as well as in the legend to Figure 1:

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)]. In the other structured regions, the model is virtually identical to the crystal structures”.

      We have now further added a label ‘AlphaFold2 model’ to Figure 1A and amended the respective Figure legend to

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)] due to flexibility. In the other structured regions, the model is virtually identical to the crystal structures”.

      Reviewer #2 (Recommendations for the authors): 

      I appreciated the response by the authors to all of my questions. I have no further comments.

      We thank the referee for the raised questions, which we believe have improved the quality of the manuscript.

    1. eLife Assessment

      This fundamental work by Yamamoto and colleagues advances our understanding of how positional information is coordinated between axes during limb outgrowth and patterning. They provide convincing evidence that the dorsal-ventral axis feeds into anterior-posterior signaling, and identify the responsible molecules by combining transplantations with molecular manipulations. This work will be of broad interest to regeneration, tissue engineering, and evolutionary biologists.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript by Yamamoto et al. presents a model by which the four main axes of the limb are required for limb regeneration to occur in the axolotl. A longstanding question in regeneration biology is how existing positional information is used to regenerate the correct missing elements. The limb provides an accessible experimental system by which to study the involvement of the anteroposterior, dorsoventral, and proximodistal axes in the regenerating limb. Extensive experimentation has been performed in this area using grafting experiments. Yamamoto et al. use the accessory limb model and some molecular tools to address this question. There are some interesting observations in the study. In particular, the potent induction of accessory limbs in the dorsal axis with BMP2+Fgf2+Fgf8 is very interesting. However, the study makes bold claims about determining the molecular basis of DV positional cues, but the experimental evidence is not definitive and does not take into account the previous work on DV patterning in the amniote limb. Also, testing the hypothesis on blastemas after limb amputation would be needed to support the strong claims in the study.

      Strengths:

      The manuscript presents some novel phenotypes generated in axolotl limbs due to Wnt signaling. This is generally the first example in which Wnt signaling has provided a gain of function in the axolotl limb model. They also present a potent way of inducing limb patterning in the dorsal axis by the addition of just beads loaded with Bmp2+Fgf8+Fgf2.

      Comments on revised version:

      Re-evaluation: The authors have significantly improved the manuscript and their conclusions reflect the current state of knowledge in DV patterning of tetrapod limbs. My only point of consideration is their claim of mesenchymal and epithelial expression of Wnt10b and the finding that Fgf2 and Wnt10b are lowly expressed. It is based upon the failed ISH, but this doesn't mean they aren't expressed. In interpreting the Li et al. scRNAseq dataset, conclusions depend heavily on how one analyzes and interprets it. The 7 dpa sample shows a very low representation of epithelial cells compared to other time points, but this is likely a technical issue. Even the epithelial marker, Krt17, and the CT/fibroblast marker show some expression elsewhere. If other time points are included in the analysis, Wnt10b would be interpreted as relatively highly expressed almost exclusively in the epithelium. By selecting the 7 dpa timepoint, which may or may not represent the MB stage as it wasn't shown in the paper, the conclusions may be based upon incomplete data. I don't expect the authors to do more work, but it is worth mentioning this possibility. The authors have considered and made efforts to resolve previous concerns.

    3. Reviewer #2 (Public review):

      Summary:

      This study explores how signals from all sides of a developing limb, front/back and top/bottom, work together to guide the regrowth of a fully patterned limb in axolotls, a type of salamander known for its impressive ability to regenerate limbs. Using a model called the Accessory Limb Model (ALM), the researchers created early-stage limb regenerates (called blastemas) with cells from different sides of the limb. They discovered that successful limb regrowth only happens when the blastema contains cells from both the top (dorsal) and bottom (ventral) of the limb. They also found that a key gene involved in front/back limb patterning, called Shh (Sonic hedgehog), is only turned on when cells from both the dorsal and ventral sides come into contact. The study identified two important molecules, Wnt10B and FGF2, that help activate Shh when dorsal and ventral cells interact. Finally, the authors propose a new model that explains how cells from all four sides of a limb, dorsal, ventral, anterior (front), and posterior (back), contribute at both the cellular and molecular level to rebuilding a properly structured limb during regeneration.

      Strengths:

      The techniques used in this study, like delicate surgeries, tissue grafting, and implanting tiny beads soaked with growth factors, are extremely difficult, and only a few research groups in the world can do them successfully. These methods are essential for answering important questions about how animals like axolotls regenerate limbs with the correct structure and orientation. To understand how cells from different sides of the limb communicate during regeneration, the researchers used a technique called in situ hybridization, which lets them see where specific genes are active in the developing limb. They clearly showed that the gene Shh, which helps pattern the front and back of the limb, only turns on when cells from both the top (dorsal) and bottom (ventral) sides are present and interacting. The team also took a broad, unbiased approach to figure out which signaling molecules are unique to dorsal and ventral limb cells. They tested these molecules individually and discovered which could substitute for actual dorsal and ventral cells, providing the same necessary signals for proper limb development. Overall, this study makes a major contribution to our understanding of how complex signals guide limb regeneration, showing how different regions of the limb work together at both the cellular and molecular levels to rebuild a fully patterned structure.

      Weaknesses:

      In the original manuscript, the expression analyses were performed on thin sections of regenerating tissue and therefore provided only a limited view of the gene expression patterns, leaving open the possibility that expression in other regions of the blastema was missed. Additionally, the quantification of the expression phenotypes in most of the experiments did not appear to be based on a rigorous methodology. In the revised manuscript, the authors' inclusion of an alternative expression analysis, qRT-PCR on the entire blastema, helped validate that they are not missing expression elsewhere.

      Overall, the number of replicates per sample group in the original manuscript was quite low (sometimes as low as 3), which was especially risky with challenging techniques like the ones the authors employ. The authors have improved the rigor of the experiments in the revised manuscript by increasing the number of replicates. The authors have not performed a power analysis to calculate the number of animals per experiment needed to detect possible statistical differences between groups. However, the authors have indicated that there was not sufficient preliminary data to make these calculations appropriately.
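
      For context, the kind of a priori power analysis referred to here can be sketched as below. This is a generic illustration under stated assumptions, not a procedure taken from the manuscript or the review: it uses the standard normal approximation for a two-sided, two-group comparison, and the function name `animals_per_group` and the example effect sizes are hypothetical. The standardized effect size (difference in means divided by the pooled standard deviation) is exactly the quantity that requires preliminary data.

```python
import math
from statistics import NormalDist


def animals_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals needed per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size (hypothetical input; it would
    normally be estimated from pilot data).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Illustrative effect sizes only; none of these values come from the study.
    for d in (0.5, 1.0, 2.0):
        print(f"effect size {d}: {animals_per_group(d)} animals per group")
```

      Because the normal approximation slightly underestimates the sample size required by a t-test when groups are small, the result is best read as a lower bound.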

      Likewise, in the original manuscript, the authors used an AI-generated algorithm to quantify symmetry on the dorsal/ventral axis, and my concern was that this approach doesn't appear to account for possible biases due to tissue sectioning angles. They also seem to arbitrarily pick locations in each sample group to compare symmetry measurements. There are other methods, which include using specific muscle groups and nerve bundles as dorsal/ventral landmarks, that would more clearly show differences in symmetry. The authors have now sufficiently addressed this concern by including transverse sections of the limbs and have explained the limitations of using a landmark-based approach in their quantification strategy.

    4. Reviewer #3 (Public review):

      Summary:

      After salamander limb amputation, the cross-section of the stump has two major axes: anterior-posterior and dorsal-ventral. Cells from all axial positions (anterior, posterior, dorsal, ventral) are necessary for regeneration, yet the molecular basis for this requirement has remained unknown. To address this gap, Yamamoto et al. took advantage of the ALM assay, in which defined positional identities can be combined on demand and their effects assessed through the outgrowth of an ectopic limb. They propose a compelling model in which dorsal and ventral cells communicate by secreting Wnt10b and Fgf2 ligands respectively, with this interaction inducing Shh expression in posterior cells. Shh was previously shown to induce limb outgrowth in collaboration with anterior Fgf8 (PMID: 27120163). Thus, this study completes a concept in which four secreted signals from four axial positions interact for limb patterning. Notably, this work firmly places dorsal-ventral interactions upstream of anterior-posterior, which is striking for a field that has been focussed on anterior-posterior communication. The ligands identified (Wnt10b, Fgf2) are different to those implicated in dorsal-ventral patterning in the non-regenerative mouse and chick models. The strength of this study is in the context of ALM/ectopic limb engineering. Although the authors attempt to assay the expression of Wnt10b and Fgf2 during limb regeneration after amputation, they were unable to pinpoint the precise expression domains of these genes beyond 'dorsal' and 'ventral' blastema. Given that experimental perturbations were not performed in regenerating limbs - almost exclusively under ALM conditions - this author finds the title "Dorsoventral-mediated Shh induction is required for axolotl limb regeneration" a little misleading.

      Strengths:

      (1) The ALM and use of GFP grafts for lineage tracing (Figures 1-3) take full advantage of the salamander model's unique ability to outgrow patterned limbs under defined conditions. As far as I am aware, the ALM has not been combined with precise grafts that assay 2 axial positions at once, as performed in Figure 3. The number of ALMs performed in this study deserves special mention, considering the challenging surgery involved.

      (2) The authors identify that posterior Shh is not expressed unless both dorsal and ventral cells are present. This echoes previous work in mouse limb development models (AER/ectoderm-mesoderm interaction) but this link between axes was not known in salamanders. The authors elegantly reconstitute dorsal-ventral communication by grafting, finding that this is sufficient to trigger Shh expression (Figure 3 - although see also section on Weaknesses).

      (3) Impressively, the authors discovered two molecules sufficient to substitute for dorsal or ventral cells through electroporation into dorsal- or ventral-depleted ALMs (Figure 5). These molecules did not change the positional identity of target cells. The same group previously identified the ventral factor (Fgf2) to be a nerve-derived factor essential for regeneration. In Figure 6, the authors demonstrate that nerve-derived factors, including Fgf2, are alone sufficient to grow out ectopic limbs from a dorsal wound. Limb induction with a 3-factor cocktail without supplementing with other cells is conceptually important for regenerative engineering.

      (4) The writing style and presentation of results are very clear.

      Overall appraisal:

      This is a logical and well-executed study that creatively uses the axolotl model to advance an important framework for understanding limb patterning. The relevance of the mechanisms to normal limb regeneration is not yet substantiated, in the opinion of this reviewer. Additionally, Wnt10b and Fgf2 should be considered molecules sufficient to substitute for dorsal and ventral identity (solely in terms of inducing Shh expression). It is not yet clear whether these molecules are truly necessary (loss of function would address this).

      Comments on revisions:

      Congratulations - I still find this an elegant and easy-to-read study with significant implications for the field! Linking your mechanisms to normal limb regeneration (i.e. regenerating blastema, not ALM), as well as characterising the cell populations involved, will be interesting directions for the future.

    5. Author response:

      The following is the authors’ response to the current reviews.

      We sincerely thank all three reviewers for their constructive comments. We deeply appreciate the reviewers’ efforts in summarizing our study, highlighting its strengths, and providing constructive suggestions. To enhance the quality and clarity of our work, we plan to address the concerns raised by the reviewers.

      First, as Reviewer #1 suggested, we will note that clearer expression patterns of Wnt10b and Fgf2 may be detectable in scRNA-seq analyses at other stages, and we will also clarify that low-level signals of epithelial and CT/fibroblast markers outside their expected clusters may reflect technical bias. In addition, we agree with the reviewer’s point that our unsuccessful ISH experiments and the low abundance detected by RT-qPCR do not demonstrate absence of expression, and that conclusions from reanalyzing the Li et al. scRNA-seq dataset can depend strongly on analytical choices; therefore, while we focused on the 7 dpa sample because our RT-qPCR data suggested that Wnt10b and Fgf2 may be most enriched around the MB stage (the original study refers to 7 dpa as MB), we will explicitly acknowledge that analyzing a single time point—especially one with a low representation of epithelial cells—may yield incomplete or stage-biased interpretations, and that inclusion of additional time points could reveal clearer and potentially different expression patterns. We will also temper our wording regarding the inferred cellular sources to avoid over-interpretation based on the current data.

      Second, to mitigate the concerns raised by Reviewer #3 regarding the generalization of our conclusions to amputation-induced (normal) limb regeneration, we will cite previous studies suggesting that the ALM can be used as an alternative experimental system for studying limb regeneration (Nacu et al., 2016, Nature, PMID: 27120163; Satoh et al., 2007, Developmental Biology, PMID: 17959163). We are confident that our ALM-based data provide a reasonable basis for understanding limb regeneration. We agree that there are important remaining questions (such as which cell populations express Wnt10b and Fgf2 and how endogenous WNT10B and FGF2 signals induce Shh expression in normal regeneration), which should be investigated in future studies to deepen our understanding of limb regeneration.

      We also appreciate Reviewer #2’s careful evaluation of the technical rigor and quantification. We have benefited from the reviewer’s earlier feedback, which guided revisions that improved the manuscript’s rigor and presentation.

      We are grateful for the reviewers’ insights and are confident that these revisions will significantly strengthen our manuscript.


      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The authors should be commended for addressing this gap - how cues from the DV axis interact with the AP axis during limb regeneration. Overall, the concept presented in this manuscript is extremely interesting and could be of high value to the field. However, the manuscript in its current form lacks some important data and the resolution needed to fully support its conclusions, and the following needs to be addressed before publication:

      (1) ISH data on Wnt10b and FGF2 from various regeneration time points are essential to derive the conclusion. Preferably multiplex ISH of Wnt10b/Fgf2/Shh or at least canonical ISH on serial sections to demonstrate their expression in dermis/epidermis and order of gene expression i.e. Shh is only expressed after expression of Wnt10b/FGF2. It would certainly help if this can also be shown in regular blastema.

      We are grateful for the constructive suggestion on assessing Wnt10b and Fgf2 expression during regular regeneration, and we agree that clarifying their expression patterns in regular blastemas is important for strengthening the conclusions of our study. Because we cannot currently ensure sufficient sensitivity with multiplex FISH in our laboratory (partly due to high background), we conducted conventional ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. We further quantified expression levels of Wnt10b, Fgf2, and Shh across stages (intact, EB, MB, LB, and ED) and found that Wnt10b and Fgf2 peaked at the MB stage, whereas Shh peaked at the LB stage, consistent with the editor’s request regarding the order of gene expression (Fig. S5C). This temporal offset in upregulation supports our model. These results are now included in the revised manuscript (Line 294‒306).

      To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). As a result, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not exclusive (Fig. S6L). These results are now included in the revised manuscript (Line 307‒321). Together with the RT-qPCR data (Fig. S5B), these results suggest that Wnt10b and Fgf2 are not exclusively confined to purely dorsal or ventral cells at the single-cell level, even though they show dorsoventral bias when assessed in bulk tissue. These results suggest that Wnt10b/Fgf2 expression is not restricted to dorsal/ventral cells but mediated by dorsal/ventral cells, and co-existence of both signals should provide a permissive environment for Shh induction. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work.  

      (2) Validation of the absence of gene expression via qRT PCR in the given sample will increase the rigor, as suggested by reviewers.

      We thank the editor for this important suggestion and agree that validation by qRT-PCR increases the rigor of our study. Accordingly, we performed RT-qPCR on AntBL, PostBL, DorBL, and VentBL to corroborate the ISH results. The results are now included in Fig. 2. We also verified Shh expression following electroporation by RT-qPCR, and the quantitative results are now provided in Fig. 5.

      (3) Please increase n for experiments where necessary and mention n values in the figures.

      We thank the editor for this helpful comment and agree on the importance of providing sufficient sample sizes. Accordingly, we increased the n for the relevant experiments and have indicated the n values in the corresponding figure legends.

      (4) Most comments by all three reviewers are constructive and largely focus on improving the tone and language of the manuscript, and I expect that the authors should take care of them.

      We thank the reviewers for their constructive feedback on the tone and language of the manuscript. We have carefully revised the text according to each comment, and we hope these modifications have improved both clarity and readability.

      In addition, in revising the manuscript we also refined the conceptual framework. Our new analysis of Wnt10b and Fgf2 expression during normal regeneration suggests that these genes are not expressed in a strictly dorsal- or ventral-specific manner at the single-cell level. When these observations are considered together with (i) the RNA-seq comparison of dorsally and ventrally induced ALM blastemas, (ii) RT-qPCR of microdissected dorsal and ventral halves of regenerating blastemas, and (iii) the functional electroporation experiments, our interpretation is that Wnt10b and Fgf2 act as dorsal- and ventral-mediated signals, respectively: their production is regulated by dorsal or ventral cells, and the presence of both signals is required to induce Shh expression. Given these results, we now think our conclusion can be explained without using the confusing term “positional cue.” Because the distinction between “positional cue” and “positional information” could be confusing, as noted by the reviewers, we rewrote our manuscript without using “positional cue.”

      Reviewer #1 (Recommendations for the authors):

      (1) Line 61: More explanation for what a double-half limb means is needed.

      We thank the reviewer for this suggestion. We have revised the manuscript (Line 73‒76). Specifically, we now explain that a double-dorsal limb, for example, is a chimeric limb generated by excising the ventral half and replacing it with a dorsal half from the contralateral limb while preserving the anteroposterior orientation.

      (2) Line 63-65: "Such blastemas form hypomorphic, spike-like structures or fail to regenerate entirely." This statement does not represent the breadth of work on the APDV axis in limb regeneration. The cited Bryant 1976 reference tested only double-posterior and double-anterior newt limbs, demonstrating the importance of disposition along the AP axis, not DV. Others have shown that the regeneration of double-half limbs depends upon the age of the animal and the length of time between the grafting of double-half limbs and amputation. Also, some double-dorsal or double-ventral limbs will regenerate complete AP axes with symmetrical DV duplications (Burton, Holder, and Jesani, 1986). Also, sometimes half dorsal stylopods regenerate half dorsal and half ventral, or regenerate only half ventral, suggesting there are no inductive cues across the DV axis as there are along the AP axis. Considering this is the basis of the study under question, more is needed to convince that the DV axis is necessary for the generation of the AP axis.

      We thank the reviewer for this detailed and constructive comment. We acknowledge that previous studies have reported a range of outcomes for double-half limbs. For example, Burton et al. (1986) described regeneration defects in double-dorsal (DD) and double-ventral (VV) limbs, although limb patterning did occur in some cases (Burton et al., 1986, Table 1). As the reviewer notes, regenerative outcomes depend on variables such as animal age and the interval between construction of the double-half limb and amputation, sometimes called the effect of healing time (Tank and Holder, 1978). Moreover, variability has been reported not only in DD/VV limbs but also in double-anterior (AA) and double-posterior (PP) limbs (e.g., Bryant, 1976; Bryant and Baca, 1978; Burton et al., 1986). In the revised manuscript, we have therefore modified the statement to avoid over-generalization and to emphasize that regeneration can be incomplete under these conditions (Line 76‒82). Importantly, in order to provide the additional evidence requested and to directly re-evaluate whether dorsal and ventral cells are required for limb patterning, we performed the ALM experiments shown in Fig. 1. The ALM system allows us to assess this question in a binary manner (regeneration vs. non-regeneration), thereby strengthening the rationale for our conclusions regarding the necessity of the APDV orientations. We also revised a sentence at the beginning of the Results section to emphasize this point (Line 139‒140).

      (3) Line 71: "These findings suggest that specific signals from all four positional domains must be integrated for successful limb patterning, such that the absence of any one of them leads to failure." I was under the impression that half posterior limbs can grow all elements, but half anterior can only grow anterior elements.

      We thank the reviewer for this helpful clarification. As summarized by Stocum, half-limb experiments show that while some digit formation can occur, limb patterning remains incomplete in both anterior-half and posterior-half limbs in some cases (Stocum, 2017). We see this point as closely related to the broader question of whether proper limb patterning requires the integration of signals from all four positional domains. As noted in our response above, our ALM experiments in Fig. 1 were designed to test this point directly, and our data support the interpretation that cells from all four orientations are necessary for correct limb patterning.

      (4) Line 79-81: This is stated later in lines 98-105. I suggest expanding here or removing it here.

      We thank the reviewer for this suggestion. In the original version, lines 79–81 introduced our use of the terms “positional cue” and “positional information,” and this content partially overlapped with what later appeared in lines 98–105. In the revised manuscript, we have substantially rewritten this section (Line 82‒84), including the sentences corresponding to lines 79–81 in the original version, to remove the term “positional cue,” as explained in our response to the Editor’s comment (4); our revision reflects new analyses indicating that Wnt10b and Fgf2 appear not to be strictly restricted to dorsal or ventral cell populations, and we now describe these factors as dorsal- or ventral-mediated signals that act across dorsoventral domains to induce Shh expression. Accordingly, we no longer maintain the original use of “positional cue” and “positional information.”

      (5) Line 92 - 93: "Similarly, an ALM blastema can be induced in a position-specific manner along the limb axes. In this case, the induced ALM blastema will lack cells from the opposite side." This sentence is difficult to follow. Isn't it the same thing stated in lines 88-90?

      We thank the reviewer for this comment. We revised the sentence to improve readability and to avoid redundancy with original Lines 88–90 (Line 104‒106).

      (6) Line 107: I think the appropriate reference is McCusker et al., 2014 (Position-specific induction of ectopic limbs in non-regenerating blastemas on axolotl forelimbs), although Vieira et al., 2019 can be included here. In addition, Ludolph et al 1990 should be cited.

      We thank the reviewer for this suggestion. We have added McCusker et al. (2014) and Ludolph et al. (1990) as references in the revised manuscript (Line 120‒121).

      (7) Line 107-109: A missing point is how the ventral information is established in the amniote limb. From what I remember, it is the expression of Engrailed 1, which inhibits the ventral expression of Wnt7a, and hence Lmx1b. This would suggest that there is no secreted ventral cue. This is a relatively large omission in the manuscript.

      We thank the reviewer for this comment. We agree that ventral fate in amniotes is specified by En1 in the ventral ectoderm, which represses Wnt7a and thereby prevents induction of Lmx1b; accordingly, a secreted ventral morphogen analogous to dorsal Wnt7a has not been established. We added this point to the revised Introduction (Line 61‒64).

      By contrast, in axolotl limb regeneration, our previous work on Lmx1b expression suggests that DV identities reflect the original positional identity rather than being re-specified during regeneration (Yamamoto et al., 2022). Within this framework, our original use of the term “ventral positional cue” does not imply a ventral patterning morphogen in the amniote sense; rather, it denotes downstream signals induced by cells bearing ventral identity that are required for the blastema to form a patterned limb. This interpretation is consistent with classic studies on double-half chimeras and ectopic contacts between opposite regions (Iten & Bryant, 1975; Bryant & Iten, 1976; Maden, 1980; Stocum, 1982) as well as with our ALM data (Fig. 1). For this reason, we intentionally used the term “positional cues” to refer to signals provided by cells bearing ventral identity, which can be considered separable from the DV patterning mechanism itself, in the original text. As explained in our response to the Editor’s comment (4), we describe these signals as “signals mediated by dorsal/ventral cells,” rather than “positional cues” in the revised manuscript.

      The necessity of dorsal- and ventral-mediated signals is supported by classic studies on the double-half experiment. In the non-regenerating cases, structural patterns along the anteroposterior axis appear to be lost even though both anterior and posterior cells should, in principle, be present in a blastema induced from a double-dorsal or double-ventral limb. In limb development of amniotes, Wnt7a/Lmx1b or En-1 mutants show that limbs can exhibit anteroposterior patterning even when tissues are dorsalized or ventralized, that is, in the relative absence of ventral or dorsal cells, respectively (Riddle et al., 1995; Chen et al., 1998; Loomis et al., 1996). Taken together, axolotl limb regeneration, in which the presence of both dorsal and ventral cells plays a role in anteroposterior patterning, should differ from limb development in other model organisms. It is therefore reasonable to predict the existence of dorsal- and ventral-mediated signals in axolotl limb regeneration. We included this point in the revised manuscript (Line 82‒89). However, there is no evidence that these signals are secreted molecules. For this reason, we have carefully used the term “dorsal-/ventral-mediated signals” in the Introduction without implying secretion.

      (8) Introduction - In general, the argument is a bit misleading. It is written as if it is known that a ventral cue is necessary, but the evidence from other animal models is lacking, from what I know. I may be wrong, but further argument would strengthen the reasoning for the study.

      We thank the reviewer for this thoughtful comment. We agree that it should not read as if it is known that a ventral cue is necessary. In the revised Introduction, we have addressed this in several ways. First, as described in our response to comment (7), we now explicitly note that in amniote limb development ventral identity is specified by En1-mediated repression of Wnt7a, and that a secreted ventral morphogen equivalent to dorsal Wnt7a has not been established. Second, we removed the term “positional cue” and no longer present “ventral positional cue” as a defined entity. Instead, we use mechanistic phrasing such as “signals mediated by ventral cells” and “signals mediated by dorsal cells,” which does not assume that such signals are secreted morphogens or universally conserved. Third, we have reframed the role of dorsal- and ventral-mediated signals as a working hypothesis specific to axolotl limb regeneration, rather than as a general conclusion across model systems.

      (9) Line 129: Remove "As mentioned before".

      We thank the reviewer for this suggestion. We have removed the phrase “As mentioned before” in the revised manuscript (Line 143).

      (10) Figure 1: Are Lmx1, Fgf8, and Shh mutually exclusive? Multiplexed FISH would provide this information, and is a relatively important question considering the strong claims in the study.

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we cannot currently ensure sufficiently high detection sensitivity with multiplex FISH in our laboratory. However, based on previous reports (Nacu et al., 2016), Fgf8 and Shh should be mutually exclusive. In contrast, our analysis suggests that Lmx1b expression is not mutually exclusive with either Fgf8 or Shh, at least in terms of their expression domains. To confirm this, we analyzed the published scRNA-seq data, and the results were added to Supplemental Figure 6. Fgf8 and Shh were expressed in both Lmx1b-positive and Lmx1b-negative cells (Fig. S6H, I), but Fgf8 and Shh themselves were mutually exclusive (Fig. S6M). This point is now included in the revised manuscript (Line 314‒317).

      (11) Results section and Figure 2: More evidence is needed for the lack of Shh expression ISH in tissue sections. Demonstrating the absence of something needs some qPCR or other validation to make such a claim.

      We thank the reviewer for this suggestion. We performed qRT-PCR on ALM blastemas to complement the ISH data (Fig. 2).

      (12) Line 179: I think they are likely leucistic d/d animals and not wild-type animals based upon the images.

      We thank the reviewer for this observation. In the revised manuscript, we have corrected the description to “leucistic animals” (Line 194).

      (13) Line 183-186: I'm a bit confused about this interpretation. If Shh turns on in just a posterior blastema, wouldn't it turn on in a grafted posterior tissue into a dorsal or ventral region? Isn't this independent of environment, meaning Shh turns on if the cells are posterior, regardless of environment?

      Our interpretation is that only posterior-derived cells possess the competency to express Shh. In other words, whether a cell is capable of expressing Shh depends on its original positional identity (Iwata et al., 2020), but whether it actually expresses Shh depends on the environment in which the cell is placed. The results of Fig. 3E and G indicate that Shh activation is dependent on environment and that the posterior identity is not sufficient to activate Shh expression. We have revised the manuscript to emphasize this distinction more clearly (Line 198‒203).

      (14) Figure 4: Do the limbs have an elbow, or is it just a hand?

      We thank the reviewer for this thoughtful question. From the appearance, an elbow-like structure can occasionally be seen; however, we did not examine the skeletal pattern in detail because all regenerated limbs used for this analysis were sectioned for the purpose of symmetry evaluation, and we therefore cannot state this conclusively. While this is indeed an important point, analyzing proximodistal patterning would require a very large number of additional experiments, which falls outside the main focus of the present study. For this reason, and also to minimize animal use in accordance with ethical considerations, we did not pursue further experiments here. In response to this point, we have added a description of the skeletal morphology of ectopic limbs induced by BMP2+FGF2+FGF8 bead implantation (Fig. 6). In these experiments, multiple ectopic limbs were induced along the same host limb. In most cases, these ectopic limbs did not show fusion with the proximal host skeleton, similar to standard ALM-induced limbs, although in one case we observed fusion at the stylopod level. We now note this observation in the revised manuscript (Line 347‒354).

      We regard the relationship between APDV positional information and proximodistal patterning as an important subject for future investigation.

      (15) Line 203 - 237: I appreciate the symmetry score to estimate the DV axis. Are there landmarks that would better suggest a double-dorsal or double-ventral phenotype, like was done in the original double-half limb papers?

      We thank the reviewer for this thoughtful comment. In most cases, the limbs induced by the ALM exhibit abnormal and highly variable morphologies compared to normal limbs, making it difficult to apply consistent morphological landmarks as used in the original double-half limb studies. For this reason, we focused our analysis on “morphological symmetry” as a quantitative measure of DV axis patterning, and we have added this explanation to the manuscript (Line 232‒235). Additionally, we provided transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      (16) Line 245-247: The experiment was done using bulk sequencing, so both the epithelium and mesenchyme were included in the sample. The posterior (Shh) and anterior (Fgf8) patterning cues are mesenchymally expressed. In amniotes, the dorsal cue has been thought to be Wnt7a from the epithelium. Can ISH, FISH, or previous scRNAseq data be used to identify genes expressed in the mesenchyme versus epithelium? This is very important if the authors want to make the claim for defining "The molecular basis of the dorsal and ventral positional cues" as was stated by the authors.

      We thank the reviewer for highlighting this important point. As the reviewer notes, our bulk RNA-seq data do not distinguish between epithelial and mesenchymal expression domains. As noted in our response to the editor’s comment, we performed ISH and qPCR on regular blastemas. However, these approaches did not provide definitive information regarding the specific cell types expressing Wnt10b and Fgf2. To complement this, we re-analyzed publicly available single-cell RNA-seq data (from Li et al., 2021). As a result, Fgf2 was expressed mainly by mesenchymal cells, and Wnt10b expression was observed in both mesenchymal and epithelial cells. These results are now included in the revised manuscript (Line 294‒321) and in supplemental figures (Fig. S6, S7).

      (17) Was engrailed 1, lmx1b, or Wnt7a differentially expressed along the DV axis, suggesting similar signaling between? Are these expressed in mesenchyme? Previous work suggests Wnt7a is expressed throughout the mesenchyme, but publicly available scRNAseq suggests that it is expressed in the epithelium.

      We thank the reviewer for this important comment. As noted, the reported expression patterns of DV-related genes are not consistent across studies, which likely reflects the technical difficulty of detecting these genes with high sensitivity. In our own experiments, expression of DV markers other than Lmx1b has been very weak or unclear by ISH. Whether these genes are expressed in the epithelium or mesenchyme also appears to vary depending on the detection method used. In our RNA-seq dataset, Wnt7a expression was detected at very low levels and showed no significant difference along the DV axis, while En1 expression was nearly absent. We have clarified these results in the revised manuscript (Line 437‒441). Our reanalysis of the published scRNA-seq likewise detected Wnt7a in only a very small fraction of cells. Accordingly, we consider it premature to reach a definitive conclusion—such as whether Wnt7a is broadly mesenchymal or restricted to epithelium—as suggested in prior reports. We also note that whether Wnt7a is epithelial or mesenchymal does not affect the conclusions or arguments of the present study. Although the roles of Wnt7a and En1 in axolotl DV patterning are certainly important, we feel that drawing a definitive conclusion on this issue lies beyond the scope of the present study, and we have therefore limited our description to a straightforward presentation of the data.

      (18) Line 247-249: The sentence suggests that all the ligands were tried. This should be included in the supplemental data.

      We thank the reviewer for this clarification. In fact, we tested only Wnt4, Wnt10b, Fgf2, Fgf7, and Tgfb2, and all of these results are presented in the figures. To avoid misunderstanding, we have revised the text to explicitly state that our analysis focused on these five genes (Line 272‒274).

      (19) Line 249: An n =3 seems low and qPCR would be a more sensitive means of measuring gene induction compared to ISH. The ISH would confirm the qPCR results. Figure 5C is also not the most convincing image of Shh induction without support from a secondary method.

      We have increased the sample size for these experiments (Line 277‒280). To complement the ISH results, we also confirmed Shh induction by qPCR following electroporation of Wnt10b and Fgf2 (Fig. 5D, E). Furthermore, because the Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which the Shh signal is more clearly visible. These data are now included in the revised manuscript (Line 280‒282).

      (20) Line 253: It is confusing why Wnt10b, but not Wnt4 would work? As far as I know, both are canonical Wnt ligands. Was Wnt7a identified as expressed in the RNAseq, but not dorsally localized? Would electroporation of Wnt7a do the same thing as Wnt10b and hence have the same dorsalizing patterning mechanisms as amniotes?

      We thank the reviewer for raising this challenging but important question. Wnt10b was identified directly from our bulk RNA-seq analysis, as was Wnt4. The difference in the ability of Wnt10b and Wnt4 to induce Shh expression in VentBL may reflect differences in how these ligands activate downstream WNT signaling programs. WNT10B is a potent activator of the canonical WNT/β-catenin pathway (Bennett et al., 2005), although WNT10B has also been reported to trigger a β-catenin–independent pathway (Lin et al., 2021). By contrast, WNT4 can signal through both canonical and non-canonical (β-catenin–independent) pathways, and the balance between these outputs is known to depend on cellular context (Li et al., 2013; Li et al., 2019). Consistent with a requirement for canonical WNT signaling, we found that pharmacological activation of canonical WNT signaling with BIO (a GSK3 inhibitor) was also sufficient to induce Shh expression in VentBL. Nevertheless, it remains unclear why Wnt10b, but not Wnt4, was able to induce Shh under our experimental conditions. One possible explanation is that different WNT ligands can engage the same receptors (e.g., Frizzled/LRP6) yet drive distinct downstream transcriptional programs, resulting in ligand-specific outputs; this may depend on the state of the responding cells, as Voss et al. predicted (Voss et al., 2025). This point is now included in the revised discussion section (Line 402‒412). At present, we cannot distinguish between these possibilities experimentally, and we therefore refrain from making a stronger mechanistic claim.

      With respect to Wnt7a, we detected Wnt7a expression at very low levels, and without a clear dorsoventral bias, in our RNA-seq analysis of ALM blastemas (we describe this point in Line 437‒440). This is consistent with previous work suggesting that axolotl Wnt7a is not restricted to the dorsal region in regeneration. Because of this low and unbiased expression, and because our data already implicated Wnt10b as a dorsal-mediated signal that can act across dorsoventral domains to permit Shh induction, we did not prioritize Wnt7a electroporation in the present study. We therefore cannot conclude whether Wnt7a would behave similarly to Wnt10b in this context.

      Importantly, these uncertainties about ligand-specific mechanisms do not alter our main conclusion. Our data support the idea that a dorsal-mediated WNT signal (represented here by WNT10B and canonical WNT activation) and a ventral-mediated FGF signal (FGF2) must act together to permit Shh induction, and that the coexistence of these dorsal- and ventral-mediated signals is required for patterned limb formation in axolotl limb regeneration.

      (21) Is canonical Wnt signaling induced after electroporation of Wnt10b or Wnt4? qPCR of Lef1 and axin is the most common way of showing this.

      We thank the reviewer for this helpful suggestion. In addition to examining Shh expression, we also assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation. These data are now included in Fig. 5.

      (22) Line 255-256: qPCR was presented for Figure 5D, but ISH was used for everything else. Is there a technical reason that just qPCR was used for the bead experiments?

      We thank the reviewer for this helpful comment. In the original submission, our goal was to test whether treatment with commercial FGF2 protein or BIO could reproduce the results obtained by electroporation. In the revised manuscript, to avoid confusion between distinct experimental aims, we removed the FGF2–bead data from this section and instead used RT-qPCR to quantitatively corroborate Shh induction after electroporation (Fig. 5D–E). RT-qPCR provided a sensitive, whole-blastema readout and allowed a paired design (left limb: factor; right limb: GFP control) that increased statistical power while minimizing animal use. To address the reviewer’s point more directly, we additionally performed ISH for the BIO treatment and now include those results in Supplementary Figure 3 (Line 287‒288).
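
      The gain from the paired design described above can be illustrated with a minimal sketch. It assumes per-animal log2 relative expression values for the factor-electroporated and GFP-control limbs; the numbers below are hypothetical and are not data from the study, and the function names are illustrative only. Because animal-to-animal variation cancels in the within-animal differences, the paired statistic is larger than the unpaired (Welch) statistic computed from the same values.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(treated, control):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(treated) != len(control):
        raise ValueError("paired samples must have equal length")
    diffs = [t - c for t, c in zip(treated, control)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def welch_t(a, b):
    """Unpaired (Welch) t statistic, for comparison with the paired version."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log2 relative Shh expression, one entry per animal
# (left limb: factor; right limb: GFP control). Not data from the study.
factor_limb = [2.1, 3.4, 1.2, 4.0, 2.8]
gfp_limb = [1.0, 2.5, 0.1, 3.1, 1.6]

t_paired, df = paired_t(factor_limb, gfp_limb)
t_unpaired = welch_t(factor_limb, gfp_limb)
print(f"paired t = {t_paired:.2f} (df = {df}); unpaired t = {t_unpaired:.2f}")
```

      In practice a library routine for the paired t-test would also return a p-value; the hand-rolled version here only makes the source of the added power explicit.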

      (23) Line 261-263: The authors did not show where Wnt10B or Fgf2 is expressed in the limb as claimed. The RNAseq was bulk, so ISH of these genes is needed to make this claim. Where are Wnt10b and Fgf2 expressed in the amputated limb? Do they show a dorsal (Wnt10b) and ventral (Fgf2) expression pattern?

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we performed ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 along the dorsoventral axis were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). As a result, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not mutually exclusive (Fig. S6L). Together with the RT-qPCR data (Fig. S5B), these results indicate that Wnt10b and Fgf2 are not confined exclusively to dorsal or ventral cells at the single-cell level, even though they show dorsoventral bias when assessed in bulk tissue. In other words, Wnt10b/Fgf2 expression is not dorsal-/ventral-specific but is mediated by dorsal/ventral cells. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work. These points are now included in the revised manuscript (Line 485‒501).

      (24) Line 266-288: The formation of multiple limbs is impressive. Do these new limbs correspond to the PD location they are generated?

      We thank the reviewer for this interesting question. From our observations, there does appear to be a tendency for the induced limbs to vary in length depending on their PD location. The skeletal patterns of the induced multiple limbs are now included in Fig. 6. However, as noted earlier, the supernumerary limbs exhibit highly variable morphologies, and a rigorous analysis of PD correlation would require a large number of induced limbs. Since this lies outside the main focus of the present study, we have not pursued this point further in the manuscript.

      (25) Line 288: The minimal requirement for claiming the molecular basis for DV signaling was identified is to ISH or multiplexed FISH for Wnt10b and Fgf2 in amputated limb blastemas to show they are expressed in the mesenchyme or epithelium and are dorsally and ventrally expressed, respectively. In addition, the current understanding of DV patterning through Wnt7a, Lmx1b, and En1 shown not to be important in this model.

      We thank the reviewer for this comment and fully agree with the point raised. We would like to clarify that we are not claiming to have identified the molecular basis of DV patterning. As the reviewer notes, molecules such as Lmx1b, Wnt7a, and En1 are well established in other animal models as key regulators of DV positional identity. There is no doubt that these molecules play central roles in DV patterning. However, in axolotl limb regeneration, clear DV-specific expression has not been demonstrated for these genes except for Lmx1b. Therefore, further studies will be required to elucidate the molecular basis of DV patterning in axolotls.

      Our focus here is more limited: we aim to identify the molecular basis of the mechanisms by which positional domain-mediated signals (FGF8, SHH, WNT10B, and FGF2) regulate the limb patterning process, rather than the molecular basis of DV patterning. In fact, our results on Wnt10b and Fgf2 suggest that these genes do not affect dorsoventral identities.

      We recognize that this distinction was not sufficiently clear in the original text, and we have revised the manuscript to describe DV patterning mechanisms in other animals and clarify that the dorsal- and ventral-mediated signals are distinct from DV patterning (Line 444‒450). Accordingly, we now avoid claiming that the molecular basis for DV signaling was identified.

      (26) Line 335: References are needed for this statement. From what I found, Wnt4 can be canonical or non-canonical.

      We thank the reviewer for this helpful comment. We have revised the manuscript (Line 404‒407): we added the relevant citations at that location and adjusted nearby wording to avoid implying pathway exclusivity, in alignment with our response to comment (20).

      (27) Line 337-338: The authors cannot claim "that canonical, but not non-canonical, WNT signaling contributes to Shh induction" as this was not thoroughly tested and is based upon the negative result that Wnt4 electroporation did not induce Shh expression.

      We thank the reviewer for this important clarification. We agree that our data do not allow us to conclude that non-canonical WNT signaling in general does not contribute to Shh induction. Accordingly, we have removed the phrase “but not non-canonical” and revised the text to emphasize that, within the scope of our experiments, Shh induction was not observed following Wnt4 electroporation, whereas it was observed with Wnt10b.

      (28) Line 345: In order to claim "WNT10B via the canonical WNT pathway...appears to regulate Shh expression" needs at least qPCR to show WNT10B induces canonical signaling.

      We thank the reviewer for this comment. As noted in our response to comment (21), we also assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation (Line 282‒285).

      (29) Lines 361-372: A few studies have been performed on DV patterning of the mouse digit regeneration in regards to Lmx1b and En1. It may be good to discuss how the current study aligns with these findings.

      We appreciate the reviewer’s suggestion. As the reviewer notes, several studies have been performed on dorsoventral (DV) patterning in mouse digit tip regeneration in relation to Lmx1b and En1 (e.g., Johnson et al., 2022; Castilla-Ibeas et al., 2024). However, the main conclusion of the present study differs in scope from those studies. We show that, in the axolotl, pre-existing dorsal and ventral identities (as reflected by dorsally derived and ventrally derived cells in the ALM blastema) are required together to induce Shh expression, and that this Shh induction in turn supports anteroposterior interaction at the limb level. This mechanism, in which dorsal-mediated and ventral-mediated signals act in combination to permit Shh expression, does not have a clear direct counterpart in the mouse digit tip literature. Moreover, even with respect to Lmx1b, the two systems behave differently. In mouse digit tip regeneration, loss of Lmx1b during regeneration does not grossly affect DV morphology of the regenerate (Johnson et al., 2022). By contrast, in our axolotl ALM system, the presence or absence of Lmx1b-positive dorsal tissue correlates with the final dorsoventral organization of the induced limb-like structures (e.g., production of double-dorsal or double-ventral symmetric structures in the absence of appropriate dorsoventral contact). Thus, the role of dorsoventral identity in our model is directly tied to patterned limb outgrowth at the whole-limb scale, whereas in the mouse digit tip it has been reported primarily in the context of digit tip regrowth and bone regeneration competence, not robust DV repatterning (Johnson et al., 2022).

      For these reasons, we believe that an extended discussion of mouse digit tip regeneration would risk implying a mechanistic equivalence between axolotl limb regeneration and mouse digit tip regeneration that is not supported by current data. Because the regenerative contexts differ, and because Lmx1b does not appear to re-establish DV patterning in the mouse regenerates (Johnson et al., 2022), we have chosen not to include an explicit discussion of mouse digit tip regeneration in the main text.

      (30) Line 408-433: Although I appreciate generating a model, this section takes some liberties to tell a narrative that is not entirely supported by previous literature or this study. For example, lines 415-416 state "Wnt10b and Fgf2 are expressed at higher levels in dorsal and the ventral blastemal cells, respectively" which were not shown in the study or other studies.

      We thank the reviewer for this important comment. We agree that the original model based on RNA-seq data overstated the evidence. To address this point experimentally, we examined Wnt10b and Fgf2 expression in regular blastemas (Supplemental Figures 5 and 6). Accordingly, our model is now framed as an inductive mechanism for Shh expression, supported by results in ALM (WNT10B in VentBL; FGF2 in DorBL) and by DV-biased expression. Concretely, the sentence previously paraphrased as “Wnt10b and Fgf2 are expressed at higher levels in dorsal and ventral blastemal cells, respectively” has been replaced with wording that (i) avoids single-cell DV specificity and (ii) emphasizes dorsal-/ventral-mediated regulation and the requirement for both signals to allow Shh induction (Line 510‒511).

      Reviewer #2 (Recommendations for the authors):

      (1) Introduction:

      The authors' definitions of positional cues vs positional information are a little hard to follow, and do not appear to be completely accurate. From my understanding of what the authors explain, "positional information" is defined as a signal that generates positional identities in the regenerating tissue. This is a somewhat different definition than what I previously understood, which is the intrinsic (likely epigenetic) cellular identity associated with specific positional coordinates. On the other hand, the authors define "positional cues" as signals that help organize the cells according to the different axes, but don't actually generate positional identities in the regenerating cells. The authors provide two examples: Wnt7a as an example of positional information, and FGF8 as a positional cue. I think that according to the authors' definitions, FGF8 (and probably Shh) are bona fide positional cues, since both signals work together to organize the regenerating limb cells - yet do not generate positional identities, because ectopic limbs formed from blastemas where these pathways have been activated do not regenerate (Nacu et al 2016). However, I am not sure Wnt7a constitutes an example of a "positional information" signal, since as far as I know, it has not been shown to generate stable dorsal limb identities (that remain after the signal has stopped) - at least yet. If it has, the authors should cite the paper that showed this. I think that some sort of diagram to help define these visually will be really helpful, especially to people who do not study regenerative patterning.

      We thank the reviewer for this thoughtful comment. We now agree with the reviewer that our use of “positional cue” and “positional information” may have been confusing. In the revision—and as noted in our response to the Editor’s comment (4)—we have removed the term “positional cue” and no longer attempt to contrast it with “positional information.” Instead, we adopt phrasing that reflects our data and hypothesis: during limb patterning, dorsal-mediated signals act on ventral cells and ventral-mediated signals act on dorsal cells to induce Shh expression. This wording avoids implying that these signals specify dorsoventral identity.

      Regarding WNT7A, we agree it has not been shown to generate a stable dorsal identity after signal withdrawal. In the revised Introduction we therefore describe WNT7A in amniote limb development as an extracellular regulator that induces Lmx1b in dorsal mesenchyme (with En1 repressing Wnt7a ventrally), rather than labeling it as “positional information” in a strict, identity-imprinting sense. We highlight this contrast because, in our axolotl experiments, WNT10B and FGF2 did not alter Lmx1b expression or dorsal–ventral limb characteristics when overexpressed, consistent with the idea that they act downstream of DV identity to enable Shh induction, not to establish DV identity.

      (2) Results:

      It would be helpful if the number of replicates per sample group were reported in the figure legends.

      We thank the reviewer for this suggestion. In accordance with the comment, we have added the number of replicates (n) for each sample group in the figure legends.

      Figure 2 shows ISH for A/P and D/V transcripts in different-positioned blastemas without tissue grafts. The images show interesting patterns, including the lack of Shh expression in all blastemas except in posterior-located blastemas, and localization of the dorsal transcript (Lmx1b) to the dorsal half of A or P located blastemas. My only concern about this data is that the expression patterns are described in only a small part of the ectopic blastema (how representative is it?) and the diagrams infer that these expression patterns are reflective of the entire blastema, which can't be determined by the limited field of view. It is okay if the expression patterns are not present in the entire blastema - in fact, that might be an important observation in terms of who is generating (and might be receiving) these signals.

      We thank the reviewer for this insightful comment. Because Fgf8 and Shh expression was detectable only in a limited subset of cells, the original submission included only high-magnification images. In response to the reviewer’s valid concern about representativeness, we have now added low-magnification overviews of the entire blastema as a supplemental figure (Fig. S1) and clarified in the figure legend that these expression patterns can be focal rather than pan-blastemal (Line 795‒796).

      In Figure 3, they look at all of these expression patterns in the grafted blastemas, showing that Shh expression is only visible when both D and V cells are present in the blastema. My only concern about this data is that the number of replicates is very low (some groups having only an N=3), and it is unclear how many sections the authors visualized for each replicate. This is especially important for the sample groups where they report no Shh expression - I agree that it is not observable in the single example sections they provide, but it is uncertain what is happening in other regions of the blastema.

      We thank the reviewer for this important comment. To increase the reliability of the results, we have increased the number of biological replicates in groups where n was previously low. For all samples, we collected serial sections spanning the entire blastema. For blastemas in which Shh expression was observed, we present representative sections showing the signal. For blastemas without detectable Shh expression, we selected a section from the central region that contains GFP-positive cells for the Figure. To make these points explicit, we have added the following clarification to the Fig. 3 legend (Line 811‒815).

      Figure 4: Shh overexpression in A/P/D/V blastemas - expression induces ectopic limbs in A/D/V locations. They analyzed the symmetry of these regenerates (assuming that D- and V-located blastemas will exhibit D/V symmetry because they only contain cells from one side of that axis). I am a little concerned about how the symmetry assay is performed, since oblique sections through the digits could look asymmetric, while they are actually symmetric. It is also unclear how the angle of the boxes that the symmetry scores were based on was decided - I imagine that the score would change depending on the angle. It also appears that the authors picked different digits to perform this analysis on the different sample groups. I also admit that the logic of the classification scheme in which the authors used AI to perform their symmetry scoring analysis (both in Figures 4 and 5) is elusive to me. I think it would have been more informative if the authors had leveraged structural landmarks, like the localization of specific muscle groups (if this experiment were performed in WT animals, the authors could have used pigment cell localization), or generated more proximal sections to look at landmarks in the zeugopod.

      We thank the reviewer for these detailed comments regarding the symmetry analysis. Because reliance on a computed symmetry score alone could raise the concerns noted by the reviewer, we now provide transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). These include levels corresponding to the distal end of the zeugopod and the proximal end of the autopod. In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      As also noted in our response to Reviewer #1 (comment 15), ALM-induced limbs frequently exhibit abnormal and highly variable morphologies, which makes it difficult to use consistent anatomical landmarks such as particular digits or muscle groups. For this reason, we focused our analysis on morphological symmetry rather than landmark-based metrics, and we emphasize this rationale in the revised text (Line 232‒235).

      Regarding the use of bounding boxes, this procedure was chosen to minimize the effects of curvature or fixation-induced distortion. For each section, the box angle was adjusted so that the outer contour (epidermal surface) was aligned symmetrically; this procedure was applied uniformly across all conditions to avoid bias. We analyzed multiple biological replicates in each group, which helps mitigate potential artifacts due to oblique sectioning. To further reduce bias, we increased the number of fields included in the analysis to n = 24 per group in the revised version.
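
      To make concrete why box alignment matters, the following is a minimal sketch of one possible mirror-symmetry score. It is a hypothetical stand-in, not the score used in the manuscript, whose exact definition is not given in this response. It assumes a binary tissue mask that has already been cropped to the bounding box, with dorsal at the top and ventral at the bottom, and computes the overlap (intersection over union) between the mask and its reflection about the box midline.

```python
def mirror_symmetry_score(mask):
    """Intersection over union between a binary mask and its mirror image
    about the horizontal midline of the bounding box.

    mask: list of rows (lists of 0/1), cropped to the bounding box,
    dorsal at the top and ventral at the bottom (an assumed convention).
    Returns 1.0 for a perfectly mirror-symmetric mask.
    """
    flipped = mask[::-1]  # reflect the dorsal half onto the ventral half
    inter = union = 0
    for row, mirrored_row in zip(mask, flipped):
        for a, b in zip(row, mirrored_row):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0


# Toy masks for illustration only.
symmetric = [[0, 1, 0],
             [1, 1, 1],
             [0, 1, 0]]
dorsal_heavy = [[1, 1, 1],
                [0, 1, 0],
                [0, 0, 0]]

print(mirror_symmetry_score(symmetric))
print(mirror_symmetry_score(dorsal_heavy))
```

      Under this kind of definition, rotating the box changes which pixels are compared across the midline, which is why a single alignment rule applied uniformly to all sections is needed for scores to be comparable between groups.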

      In addition, staining intensity varied among samples, such that a region identified as “muscle” in one sample could be assigned differently in another if classification were based solely on color. To avoid this problem, we used a machine-learning classifier trained separately for each sample, allowing us to group the same tissues consistently within that sample irrespective of intensity differences. In the context of ALM-induced limbs, where stable anatomical landmarks are not available, we consider this strategy the most appropriate. We have added this rationale to the revised manuscript for clarity (Line 239‒247).

      Figure 5: The number of replicates in sample groups is relatively low and is quite variable between groups (ranging between 3 and 7 replicates). The zoomed-in window used to visualize Shh expression is small relative to the blastema, and it is difficult to discern why the authors positioned the window where they did, and how they maintained consistency among their different sample groups. In the examples of positive Shh expression - the signal is low and hard to see. Validating these expression patterns using some sort of quantitative transcriptional assay (like qRTPCR) would increase the rigor of this experiment ... especially given that they will be able to analyze gene expression in the entire blastema as opposed to sections that might not capture localized expression.

      We thank the reviewer for this important comment. To increase the rigor of these experiments, we have increased the number of biological replicates in groups where n was previously low. In addition, because Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which Shh signal is more clearly visible. We also validated the Shh expression for Wnt10b–electroporated VentBL and Fgf2–electroporated DorBL by RT-qPCR, which assesses gene expression across the entire blastema. These results are now included in Fig. 5 and Line 280‒282. Finally, we clarified in the figure legend how the “window” for imaging was chosen: for samples with detectable Shh expression, the window was placed in the region where the signal was observed; for conditions without detectable Shh expression, the window was positioned in a comparable region containing GFP-positive cells (Line 836‒839). These revisions are included in the revised manuscript.

      Figure 6: They treat dorsal and ventral wounds with gelatin beads soaked in a combination of BMP2+FGF8 (nerve factors) and FGF2 (proposed ventral factor). Remarkably, they observe ectopic limb formation in only dorsal wounds, further supporting the idea that FGF2 provides the "ventral" signal. They show examples of this impressive phenotype on limbs with multiple ectopic structures that formed along the Pr/Di axis. Including images of tubulin staining (as they have in Figures 1 and 2) would help to confirm whether the blastemas (or final regenerates) are devoid of nerves. The authors' whole-mount skeletal staining shows fusion of the ectopic humerus with the host humerus, a phenotype associated with deep wounding, which could provide an opportunity for more cellular contribution from different limb axes.

      We thank the reviewer for these constructive comments. As noted in the prior study, when beads are used to induce blastemas without surgical nerve orientation, fine nerve ingrowth can still occur (Makanae et al., 2014), and the induced blastemas are not completely devoid of nerves. While it is still uncertain whether these recruited nerves are functional after blastema induction, it is an important point, and we added sentences about this in the revised manuscript (Line 341‒345).

      Regarding the skeletal phenotype, despite careful implantation to avoid injuring deep tissues, bead-induced ectopic limbs on the dorsal side occasionally displayed fusion of the stylopod with the host humerus—a phenotype associated with deep wounding, as the reviewer notes. This observation suggests that contributions from a broader cellular population cannot be excluded. However, because fusion was observed in only 1 of 16 induced limbs analyzed, and because ectopic limbs induced at the forearm (zeugopod) level did not exhibit such fusion (n=1/6 for stylopod-level inductions; n=0/10 for zeugopod-level inductions), we believe that our main conclusion remains valid. Because fusion is not a typical outcome, we now present representative non-fusion cases—including zeugopod-origin examples—in the figure (Fig. 6L1, L2), and we report the fusion incidence explicitly in the text (Line 350‒354). We also note in the revised manuscript that stylopod fusion can occur in a minority of cases (Line 347‒349).

      Figure 7 nicely summarizes their findings and model for patterning.

      We thank the reviewer for this positive comment.

      The table is cut off in the PDF, so it cannot be evaluated at this time.

      In our copy of the PDF, the table appears in full, so this may have been a formatting issue. We have carefully checked the file and ensured that the table is completely included in the revised submission.

      There is a supplemental figure that doesn't seem to be referenced in the text.

      The supplemental figure (Fig. S1 of the original manuscript) is referenced in the text, but it may have been overlooked. To improve clarity, we have expanded the description in the manuscript so that the supplemental figure is more clearly referenced (Line 285‒291).

      (3) Materials and Methods:

      No power analysis was performed to calculate sample group sizes. The authors have used these experimental techniques in the past and could have easily used past data to inform these calculations.

      We thank the reviewer for this important comment. We did not include a power analysis in the manuscript because this was the first time we compared Shh and other gene expression levels among ALM blastemas of different positional origins using RT-qPCR in our experimental system. As we did not have prior knowledge of the expected variability under these specific conditions, it was difficult to predetermine appropriate sample sizes.
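
      For reference, the kind of a priori calculation the reviewer describes can be sketched as follows, using the standard normal approximation for a two-sided, two-sample comparison, n per group ≈ 2((z_{1-α/2} + z_{power}) / d)^2. The standardized effect size d would have to come from pilot or prior data; the values used below are placeholders chosen for illustration, not estimates from this study, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison, using the normal approximation
    n = 2 * ((z_alpha + z_beta) / d) ** 2.

    effect_size: standardized difference d (difference in means divided
    by the pooled standard deviation), estimated from prior data.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Placeholder effect sizes, for illustration only.
for d in (0.5, 1.0, 2.0):
    print(d, n_per_group(d))
```

      The normal approximation slightly underestimates the required n for small groups relative to a t-based calculation, so results of this kind are best read as a lower bound.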

      Reviewer #3 (Recommendations for the authors):

      General:

      Congratulations - I found this an elegant and easy-to-read study with significant implications for the field! If possible, I would urge you to consider adding some more characterisation of Wnt10b and Fgf2: which cell types are they expressed in? If you can link your mechanisms to normal limb regeneration too (i.e., regenerating blastema, not ALM), this would significantly elevate the interest in your study.

      We sincerely thank the reviewer for these encouraging comments. As also noted in our response to the editor’s comment, we have analyzed the expression patterns of Wnt10b and Fgf2 in regular blastemas (Line 294‒306). Although clear, specific expression patterns along the dorsoventral axis were not detected by ISH, likely due to technical limitations of sensitivity, RT-qPCR revealed significantly higher expression levels of Wnt10b in the dorsal half and Fgf2 in the ventral half of a regular blastema (Fig. S5). In addition, we analyzed published single-cell RNA-seq data (7 dpa blastema, Li et al., 2021) (Line 307‒321). Fgf2 expression was observed in the mesenchymal clusters, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. Therefore, defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will be an important goal for future work.

      Data availability:

      I assume that the RNA-sequencing data will be deposited at a public repository.

      RNA-seq FASTQ files have been deposited in the DNA Data Bank of Japan (DDBJ; https://www.ddbj.nig.ac.jp/) under BioProject accession PRJDB38065. We have added a Data availability section to the revised manuscript.

      References

      Castilla-Ibeas, A., Zdral, S., Oberg, K. C., & Ros, M. A. (2024). The limb dorsoventral axis: Lmx1b’s role in development, pathology, evolution, and regeneration. Developmental Dynamics, 253(9), 798–814. https://doi.org/10.1002/dvdy.695

      Johnson, G. L., Glasser, M. B., Charles, J. F., Duryea, J., & Lehoczky, J. A. (2022). En1 and Lmx1b do not recapitulate embryonic dorsal-ventral limb patterning functions during mouse digit tip regeneration. Cell Reports, 41(8), 111701. https://doi.org/10.1016/j.celrep.2022.111701

      Stocum, D. (2017). Mechanisms of urodele limb regeneration. Regeneration, 4. https://doi.org/10.1002/reg2.92

      Tank, P. W., & Holder, N. (1978). The effect of healing time on the proximodistal organization of double-half forelimb regenerates in the axolotl, Ambystoma mexicanum. Developmental Biology, 66(1), 72–85. https://doi.org/10.1016/0012-1606(78)90274-9

    1. eLife Assessment

      This study examines an important question regarding the developmental trajectory of neural mechanisms supporting facial expression processing. Leveraging a rare intracranial EEG (iEEG) dataset including both children and adults, the authors reported that facial expression recognition mainly engaged the posterior superior temporal cortex (pSTC) among children, while both pSTC and the prefrontal cortex were engaged among adults. In terms of strength of evidence, the solid methods, data and analyses broadly support the claims with minor weaknesses.

    2. Reviewer #1 (Public review):

      Summary:

      This study investigates how the brain processes facial expressions across development by analyzing intracranial EEG (iEEG) data from children (ages 5-10) and post-childhood individuals (ages 13-55). The researchers used a short film containing emotional facial expressions and applied AI-based models to decode brain responses to facial emotions. They found that in children, facial emotion information is represented primarily in the posterior superior temporal cortex (pSTC) - a sensory processing area - but not in the dorsolateral prefrontal cortex (DLPFC), which is involved in higher-level social cognition. In contrast, post-childhood individuals showed emotion encoding in both regions. Importantly, the complexity of emotions encoded in the pSTC increased with age, particularly for socially nuanced emotions like embarrassment, guilt, and pride. The authors claim that these findings suggest that emotion recognition matures through increasing involvement of the prefrontal cortex, supporting a developmental trajectory where top-down modulation enhances understanding of complex emotions as children grow older.

      Strengths:

      (1) The inclusion of pediatric iEEG makes this study uniquely positioned to offer high-resolution temporal and spatial insights into neural development compared to non-invasive approaches, e.g., fMRI, scalp EEG, etc.

      (2) Using a naturalistic film paradigm enhances ecological validity compared to static image tasks often used in emotion studies.

      (3) The idea of using state-of-the-art AI models to extract facial emotion features allows for high-dimensional and dynamic emotion labeling in real time.

      Weaknesses:

      (1) The study has notable limitations that constrain the generalizability and depth of its conclusions. The sample size was very small, with only nine children included and just two having sufficient electrode coverage in the posterior superior temporal cortex (pSTC), which weakens the reliability and statistical power of the findings, especially for analyses involving age. Authors pointed out that a similar sample size has been used in previous iEEG studies, but the cited works focus on adults and do not look at the developmental perspectives. Similar work looking at developmental changes in iEEG signals usually includes many more subjects (e.g., n = 101 children from Cross ZR et al., Nature Human Behavior, 2025) to account for inter-subject variabilities.

      (2) Electrode coverage was also uneven across brain regions, with not all participants having electrodes in both the dorsolateral prefrontal cortex (DLPFC) and pSTC, making the conclusion regarding the different developmental changes between DLPFC and pSTC hard to interpret (related to point 3 below). It is understood that it is rare to have such iEEG data collected in this age group, and the electrode location is only determined by clinical needs. However, the scientific rigor should not be compromised by the limited data access. It's the authors' decision whether such an approach is valid and appropriate to address the scientific questions, here the developmental changes in the brain, given all the advantages and constraints of the data modality.

      (3) The developmental differences observed were based on cross-sectional comparisons rather than longitudinal data, reducing the ability to draw causal conclusions about developmental trajectories. Also, see comments in point 2.

      (4) Moreover, the analysis focused narrowly on DLPFC, neglecting other relevant prefrontal areas such as the orbitofrontal cortex (OFC) and anterior cingulate cortex (ACC), which play key roles in emotion and social processing. Agree that this might be beyond the scope of this paper, but a discussion section might be insightful.

      (5) Although the use of a naturalistic film stimulus enhances ecological validity, it comes at the cost of experimental control, with no behavioral confirmation of the emotions perceived by participants and uncertain model validity for complex emotional expressions in children. A non-facial music block that could have served as a control was available but not analyzed. The validation of AI model's emotional output needs to be tested. It is understood that we cannot collect these behavioral data retrospectively within the recorded subjects. Maybe potential post-hoc experiments and analyses could be done, e.g., collect behavioral, emotional perception data from age-matched healthy subjects.

      (6) Generalizability is further limited by the fact that all participants were neurosurgical patients, potentially with neurological conditions such as epilepsy that may influence brain responses. At least some behavioral measures between the patient population and the healthy groups should be done to ensure the perception of emotions is similar.

      (7) Additionally, the high temporal resolution of intracranial EEG was not fully utilized, as data were downsampled and averaged in 500-ms windows. It seems like the authors are trying to compromise the iEEG data analyses to match up with the AI's output resolution, which is 2Hz. It is not clear then why not directly use fMRI, which is non-invasive and seems to meet the needs here already. The advantages of using iEEG in this study are missing here.

      (8) Finally, the absence of behavioral measures or eye-tracking data makes it difficult to directly link neural activity to emotional understanding or determine which facial features participants attended to. Related to point 5 as well.

      Comments on revisions:

      A behavioral measurement would help address many of these questions. If data collection continues, additional subjects with both iEEG recordings and behavioral measurements would be valuable.

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, Fan et al. aim to characterize how neural representations of facial emotions evolve from childhood to adulthood. Using intracranial EEG recordings from participants aged 5 to 55, the authors assess the encoding of emotional content in high-level cortical regions. They report that while both the posterior superior temporal cortex (pSTC) and dorsolateral prefrontal cortex (DLPFC) are involved in representing facial emotions in older individuals, only the pSTC shows significant encoding in children. Moreover, the encoding of complex emotions in the pSTC appears to strengthen with age. These findings lead the authors to suggest that young children rely more on low-level sensory areas and propose a developmental shift from reliance on lower-level sensory areas in early childhood to increased top-down modulation by the prefrontal cortex as individuals mature.

      Strengths:

      (1) Rare and valuable dataset: The use of intracranial EEG recordings in a developmental sample is highly unusual and provides a unique opportunity to investigate neural dynamics with both high spatial and temporal resolution.

      (2) Developmentally relevant design: The broad age range and cross-sectional design are well-suited to explore age-related changes in neural representations.

      (3) Ecological validity: The use of naturalistic stimuli (movie clips) increases the ecological relevance of the findings.

      (4) Feature-based analysis: The authors employ AI-based tools to extract emotion-related features from naturalistic stimuli, which enables a data-driven approach to decoding neural representations of emotional content. This method allows for a more fine-grained analysis of emotion processing beyond traditional categorical labels.

      Weaknesses:

      (1) While the authors leverage Hume AI, a tool pre-trained on a large dataset, its specific performance on the stimuli used in this study remains unverified. To strengthen the foundation of the analysis, it would be important to confirm that Hume AI's emotional classifications align with human perception for these particular videos. A straightforward way to address this would be to recruit human raters to evaluate the emotional content of the stimuli and compare their ratings to the model's outputs.

      (2) Although the study includes data from four children with pSTC coverage-an increase from the initial submission-the sample size remains modest compared to recent iEEG studies in the field.

      (3) The "post-childhood" group (ages 13-55) conflates several distinct neurodevelopmental periods, including adolescence, young adulthood, and middle adulthood. As a finer age stratification is likely not feasible with the current sample size, I would suggest authors temper their developmental conclusions.

      (4) The analysis of DLPFC-pSTC directional connectivity would be significantly strengthened by modeling it as a continuous function of age across all participants, rather than relying on an unbalanced comparison between a single child and a (N=7) post-childhood group. This continuous approach would provide a more powerful and nuanced view of the developmental trajectory. I would also suggest including the result in the main text.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study examines a valuable question regarding the developmental trajectory of neural mechanisms supporting facial expression processing. Leveraging a rare intracranial EEG (iEEG) dataset including both children and adults, the authors reported that facial expression recognition mainly engaged the posterior superior temporal cortex (pSTC) among children, while both pSTC and the prefrontal cortex were engaged among adults. However, the sample size is relatively small, with analyses appearing incomplete to fully support the primary claims. 

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how the brain processes facial expressions across development by analyzing intracranial EEG (iEEG) data from children (ages 5-10) and post-childhood individuals (ages 13-55). The researchers used a short film containing emotional facial expressions and applied AI-based models to decode brain responses to facial emotions. They found that in children, facial emotion information is represented primarily in the posterior superior temporal cortex (pSTC) - a sensory processing area - but not in the dorsolateral prefrontal cortex (DLPFC), which is involved in higher-level social cognition. In contrast, post-childhood individuals showed emotion encoding in both regions. Importantly, the complexity of emotions encoded in the pSTC increased with age, particularly for socially nuanced emotions like embarrassment, guilt, and pride. The authors claim that these findings suggest that emotion recognition matures through increasing involvement of the prefrontal cortex, supporting a developmental trajectory where top-down modulation enhances understanding of complex emotions as children grow older.

      Strengths:

      (1) The inclusion of pediatric iEEG makes this study uniquely positioned to offer high-resolution temporal and spatial insights into neural development compared to non-invasive approaches, e.g., fMRI, scalp EEG, etc.

      (2) Using a naturalistic film paradigm enhances ecological validity compared to static image tasks often used in emotion studies.

      (3) The idea of using state-of-the-art AI models to extract facial emotion features allows for high-dimensional and dynamic emotion labeling in real time.

      Weaknesses:

      (1) The study has notable limitations that constrain the generalizability and depth of its conclusions. The sample size was very small, with only nine children included and just two having sufficient electrode coverage in the posterior superior temporal cortex (pSTC), which weakens the reliability and statistical power of the findings, especially for analyses involving age.

      We appreciated the reviewer’s point regarding the constrained sample size.

      As an invasive method, iEEG recordings can only be obtained from patients undergoing electrode implantation for clinical purposes. Thus, iEEG data from young children are extremely rare, and rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our main conclusions. Specifically, 8 children (53 recording contacts in total) and 13 control participants (99 recording contacts in total) with electrode coverage in the DLPFC are included in our DLPFC analysis. This sample size is comparable to other iEEG studies with similar experimental designs [1-3].

      For pSTC, we returned to the dataset and found another two children who had pSTC coverage. After including these children’s data, the group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Notably, the responses of the two new children (S33 and S49) were highly consistent with our previous observations. Moreover, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).

      (1) Zheng, J. et al. Multiplexing of Theta and Alpha Rhythms in the Amygdala-Hippocampal Circuit Supports Pattern Separation of Emotional Information. Neuron 102, 887-898.e5 (2019).

      (2) Diamond, J. M. et al. Focal seizures induce spatiotemporally organized spiking activity in the human cortex. Nat. Commun. 15, 7075 (2024).

      (3) Schrouff, J. et al. Fast temporal dynamics and causal relevance of face processing in the human temporal cortex. Nat. Commun. 11, 656 (2020).

      (2) Electrode coverage was also uneven across brain regions, with not all participants having electrodes in both the dorsolateral prefrontal cortex (DLPFC) and pSTC, and most coverage limited to the left hemisphere-hindering within-subject comparisons and limiting insights into lateralization.

      The electrode coverage in each patient is determined entirely by clinical needs. Only a few patients have electrodes in both DLPFC and pSTC because these two regions are far apart, so it’s rare for a single patient’s suspected seizure network to span such a large territory. However, this does not affect our results, as most iEEG studies combine data from multiple patients to achieve sufficient electrode coverage in each target brain area. As our data are mainly from the left hemisphere (due to clinical needs), this study was not designed to examine whether there is a difference between hemispheres in emotion encoding. Nevertheless, lateralization remains an interesting question that should be addressed in future research, and we have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion).

      (3) The developmental differences observed were based on cross-sectional comparisons rather than longitudinal data, reducing the ability to draw causal conclusions about developmental trajectories.  

      In the context of pediatric intracranial EEG, longitudinal data collection is not feasible due to the invasive nature of electrode implantation. We have added this point to the Discussion to acknowledge that while our results reveal robust age-related differences in the cortical encoding of facial emotions, longitudinal studies using non-invasive methods will be essential to directly track developmental trajectories (Page 8, in the last paragraph of the Discussion). In addition, we revised our manuscript to avoid emphasizing causal conclusions about developmental trajectories in the current study (for example, we use “imply” instead of “suggest” in the fifth paragraph of the Discussion).

      (4) Moreover, the analysis focused narrowly on DLPFC, neglecting other relevant prefrontal areas such as the orbitofrontal cortex (OFC) and anterior cingulate cortex (ACC), which play key roles in emotion and social processing.

      We agree that both OFC and ACC are critically involved in emotion and social processing. However, we have no recordings from these areas because ECoG rarely covers the ACC or OFC due to technical constraints. We have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion). Future follow-up studies using sEEG or non-invasive imaging methods could examine developmental patterns in these regions.

      (5) Although the use of a naturalistic film stimulus enhances ecological validity, it comes at the cost of experimental control, with no behavioral confirmation of the emotions perceived by participants and uncertain model validity for complex emotional expressions in children. A non-facial music block that could have served as a control was available but not analyzed.

      The facial emotion features used in our encoding models were extracted by Hume AI models, which were trained on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey, that is, the presented facial emotion. Our goal in the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that children’s interpretation of complex emotions may differ from that of adults, resulting in a different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added a paragraph in the Discussion (see Page 8) to explicitly note that our study focused on the encoding of presented emotion.

      We appreciated the reviewer’s point regarding the value of non-facial music blocks. However, although there are segments in the music condition in which no faces are presented, these cannot be used as a control condition to test whether the encoding model’s prediction accuracy in pSTC or DLPFC drops to chance when no facial emotion is present. This is because, in the absence of faces, no extracted emotion features are available for constructing the encoding model (see Author response image 1 below). Thus, we chose a different control analysis for the present work. For children’s pSTC, we shuffled the facial emotion features in time to generate a null distribution, which was then used to test the statistical significance of the encoding models (see Methods/Encoding model fitting for details).
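
      The time-shuffling control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ridge penalty `alpha`, the 80/20 train/test split, the number of permutations, and all function names are hypothetical, and the manuscript's Methods remain the reference for the actual procedure.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression weights (alpha is an assumed penalty)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def prediction_accuracy(X_train, y_train, X_test, y_test, alpha=1.0):
    """Pearson r between predicted and measured responses on held-out data."""
    w = ridge_fit(X_train, y_train, alpha)
    return np.corrcoef(X_test @ w, y_test)[0, 1]

def shuffle_null(X, y, n_perm=1000, train_frac=0.8, alpha=1.0, seed=0):
    """Compare observed accuracy with a null built from time-shuffled features.

    X: (n_timepoints, n_features) emotion features; y: (n_timepoints,) neural signal.
    Shuffling the rows of X breaks the temporal alignment between features
    and neural activity while keeping the feature values themselves.
    """
    rng = np.random.default_rng(seed)
    split = int(train_frac * len(y))
    observed = prediction_accuracy(X[:split], y[:split], X[split:], y[split:], alpha)
    null = np.empty(n_perm)
    for i in range(n_perm):
        Xs = X[rng.permutation(len(y))]  # hypothetical shuffling scheme
        null[i] = prediction_accuracy(Xs[:split], y[:split], Xs[split:], y[split:], alpha)
    # One-sided p-value with the usual +1 correction
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value
```

      A full random permutation destroys the autocorrelation of the feature time series; circular shifts are a common alternative when that structure should be preserved, and which variant was used here is not stated in the response.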

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      Author response image 1.

      Time courses of Hume AI extracted facial expression features for the first block of the music condition. Only the top 5 facial expressions are shown here due to space limitations.

      (6) Generalizability is further limited by the fact that all participants were neurosurgical patients, potentially with neurological conditions such as epilepsy that may influence brain responses. 

      We appreciated the reviewer’s point. However, iEEG data can only be obtained from clinical populations (usually epilepsy patients) who have implanted electrodes. Given current knowledge about focal epilepsy and its potential effects on brain activity, researchers believe that epilepsy-affected brains can serve as a reasonable proxy for normal human brains when confounding influences are minimized through rigorous procedures [1]. In our study, we took several steps to ensure data quality: (1) all data segments containing epileptiform discharges were identified and removed at the very beginning of preprocessing, and (2) patients were asked to participate in the experiment several hours outside the window of seizures. Please see the Methods for a description of the data quality checks (Page 9/ Experimental procedures and iEEG data processing).

      (1) Parvizi J, Kastner S. 2018. Promises and limitations of human intracranial electroencephalography. Nat Neurosci 21:474–483. doi:10.1038/s41593-018-0108-2

      (7) Additionally, the high temporal resolution of intracranial EEG was not fully utilized, as data were down-sampled and averaged in 500-ms windows.  

      We agree that one of the major advantages of iEEG is its millisecond-level temporal resolution. In our case, the main reason for down-sampling was that the time series of facial emotion features extracted from the videos, which were used for modelling neural responses, had a temporal resolution of 2 Hz. In naturalistic contexts, facial emotion features do not change on a millisecond timescale, so a 500 ms window is sufficient to capture the relevant dynamics. Another advantage of iEEG is its tolerance to motion, which is excessive in young children (e.g., 5-year-olds). This makes our dataset uniquely valuable, suggesting robust representation in the pSTC but not in the DLPFC in young children. Moreover, since our methodological framework (Figure 1) does not rely on high temporal resolution, it can be transferred to non-invasive modalities such as fMRI, enabling future studies to test these developmental patterns in larger populations.

      (8) Finally, the absence of behavioral measures or eye-tracking data makes it difficult to directly link neural activity to emotional understanding or determine which facial features participants attended to.
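
      The alignment between the neural signal and the 2 Hz feature time series can be illustrated with a short sketch. It assumes non-overlapping 500 ms windows and a hypothetical function name `bin_average`; whether the original pipeline used exactly this windowing is not stated in the response.

```python
import numpy as np

def bin_average(signal, fs, window_s=0.5):
    """Average a 1-D signal in consecutive non-overlapping windows.

    fs is the sampling rate in Hz; window_s=0.5 yields one value per
    500 ms, i.e. a 2 Hz series that matches the emotion-feature rate.
    Trailing samples that do not fill a complete window are dropped.
    """
    samples_per_bin = int(round(fs * window_s))
    n_bins = len(signal) // samples_per_bin
    trimmed = np.asarray(signal[: n_bins * samples_per_bin], dtype=float)
    return trimmed.reshape(n_bins, samples_per_bin).mean(axis=1)
```

      For example, two seconds of data sampled at 1000 Hz reduce to four values, one per 500 ms window.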

      We appreciated this point. Part of our rationale is presented in our response to (5) for the absence of behavioral measures. Following the same rationale, identifying which facial features participants attended to is not necessary for testing our main hypotheses because our analyses examined responses to the overall emotional content of the faces. However, we agree and recommend future studies use eye-tracking and corresponding behavioral measures in studies of subjective emotional understanding. 

      Reviewer #2 (Public review):

      Summary:

      In this paper, Fan et al. aim to characterize how neural representations of facial emotions evolve from childhood to adulthood. Using intracranial EEG recordings from participants aged 5 to 55, the authors assess the encoding of emotional content in high-level cortical regions. They report that while both the posterior superior temporal cortex (pSTC) and dorsolateral prefrontal cortex (DLPFC) are involved in representing facial emotions in older individuals, only the pSTC shows significant encoding in children. Moreover, the encoding of complex emotions in the pSTC appears to strengthen with age. These findings lead the authors to suggest that young children rely more on low-level sensory areas and propose a developmental shift from reliance on lower-level sensory areas in early childhood to increased top-down modulation by the prefrontal cortex as individuals mature.

      Strengths: 

      (1) Rare and valuable dataset: The use of intracranial EEG recordings in a developmental sample is highly unusual and provides a unique opportunity to investigate neural dynamics with both high spatial and temporal resolution. 

      (2) Developmentally relevant design: The broad age range and cross-sectional design are well-suited to explore age-related changes in neural representations. 

      (3) Ecological validity: The use of naturalistic stimuli (movie clips) increases the ecological relevance of the findings. 

      (4) Feature-based analysis: The authors employ AI-based tools to extract emotion-related features from naturalistic stimuli, which enables a data-driven approach to decoding neural representations of emotional content. This method allows for a more fine-grained analysis of emotion processing beyond traditional categorical labels.

      Weaknesses: 

      (1) The emotional stimuli included facial expressions embedded in speech or music, making it difficult to isolate neural responses to facial emotion per se from those related to speech content or music-induced emotion. 

      We thank the reviewer for raising this important point. We agree that in naturalistic settings, faces often co-occur with speech, and that these sources of emotion can overlap. However, emotions induced by background music have distinct temporal dynamics that are separable from facial emotion (see Author response image 2 (A) and (B) below). In addition, faces can convey a wide range of emotions (48 categories in the Hume AI model), whereas music conveys far fewer (13 categories reported by a recent study [1]). Thus, when using facial emotion feature time series as regressors (with 48 emotion categories and rapid temporal dynamics), the model performance will reflect neural encoding of facial emotion in the music condition, rather than the slower and lower-dimensional emotion from music.

      For the speech condition, we acknowledge that it is difficult to fully isolate neural responses to facial emotion from those to speech when the emotional content from faces and speech highly overlaps. However, in our study, (1) the time courses of emotion features from face and voice are still different (Author response image 2 (C) and (D)), and (2) our main finding that DLPFC encodes facial expression information in post-childhood individuals but not in young children was found in both the speech and music conditions (Figure 2B and 2C). In the music condition, neural responses to facial emotion are not affected by speech. Thus, we have included the DLPFC results from the music condition in the revised manuscript (Figure 2C), and we acknowledge that this issue should be carefully considered in future studies using videos with speech, as we have indicated in the future directions in the last paragraph of the Discussion.

      (1) Cowen, A. S., Fang, X., Sauter, D. & Keltner, D. What music makes us feel: At least 13 dimensions organize subjective experiences associated with music across different cultures. Proc Natl Acad Sci USA 117, 1924–1934 (2020).

      Author response image 2.

      Time courses of amusement. (A) and (B) Amusement conveyed by face or music in a 30-s music block. Facial emotion features were extracted by Hume AI. For emotion from music, we approximated the amusement time course using a weighted combination of low-level acoustic features (RMS energy, spectral centroid, MFCCs), which capture intensity, brightness, and timbre cues linked to amusement. Note that the music continues when no faces are presented. (C) and (D) Amusement conveyed by face or voice in a 30-s speech block. From 0 to 5 seconds, a girl is introducing her friend to a stranger. The camera focuses on the friend, who appears nervous, while the girl’s voice sounds cheerful. This mismatch explains why the shapes of the two time series differ at the beginning. Such situations occur frequently in naturalistic movies.

      (2) While the authors leveraged Hume AI to extract facial expression features from the video stimuli, they did not provide any validation of the tool's accuracy or reliability in the context of their dataset. It remains unclear how well the AI-derived emotion ratings align with human perception, particularly given the complexity and variability of naturalistic stimuli. Without such validation, it is difficult to assess the interpretability and robustness of the decoding results based on these features.  

      Hume AI models were trained and validated on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. The training process used both manual annotations from human raters and deep neural networks. Over 3000 human raters categorized facial expressions into emotion categories and rated them on a 1-100 intensity scale. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey (based on how people actually interpret them), that is, the presented facial emotion. Our goal in the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that the interpretation of facial emotions may differ across individual participants, resulting in a different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added text in the Discussion to explicitly note that our study focused on the encoding of presented emotion (second paragraph in Page 8).

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      (3) Only two children had relevant pSTC coverage, severely limiting the reliability and generalizability of results.  

      We appreciate this point and agree with both reviewers who raised it as a significant concern. As described in response to reviewer 1 (comment 1), we have added data from another two children who have pSTC coverage. Group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Because iEEG data from young children are extremely rare, rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our conclusion that children’s pSTC can encode facial emotion. First, the two new children’s responses (S33 and S49) from pSTC were highly consistent with our previous observations (see individual data in Figure 3B). Second, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).
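
      The response does not describe how the permutation null was built. As a minimal sketch of one common choice for encoding models, the example below circularly shifts the observed neural time course against the model prediction and recomputes the correlation. The function names, the number of permutations, and the circular-shift scheme are all assumptions for illustration, not the authors' procedure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def circular_shift_pvalue(predicted, observed, n_perm=1000, seed=0):
    """One-sided permutation p-value for the prediction accuracy r.

    The null distribution is built by circularly shifting `observed`
    (hypothetical scheme), which breaks its alignment with `predicted`
    while preserving its autocorrelation.
    """
    rng = random.Random(seed)
    n = len(observed)
    r_obs = pearson(predicted, observed)
    exceed = 0
    for _ in range(n_perm):
        k = rng.randrange(1, n)  # non-zero shift
        shifted = observed[k:] + observed[:k]
        if pearson(predicted, shifted) >= r_obs:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return r_obs, (exceed + 1) / (n_perm + 1)
```

      With this add-one correction the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the significance level that can be reported.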

      (4) The rationale for focusing exclusively on high-frequency activity for decoding emotion representations is not provided, nor are results from other frequency bands explored.   

      We focused on high-frequency broadband (HFB) activity because it is widely considered to reflect the responses of local neuronal populations near the recording electrode, whereas low-frequency oscillations in the theta, alpha, and beta ranges are thought to serve as carrier frequencies for long-range communication across distributed networks[1-2]. Since our study aimed to examine the representation of facial emotion in localized cortical regions (DLPFC and pSTC), HFB activity provides the most direct measure of the relevant neural responses. We have added this rationale to the manuscript (Page 3).
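
      For readers unfamiliar with HFB extraction, the following is a minimal sketch of obtaining a high-frequency amplitude envelope from a single channel. The 70-150 Hz band, the function name, and the FFT-based band-limiting are assumptions for illustration; the response does not state the band or filter the authors used, and practical pipelines typically use dedicated filters.

```python
import numpy as np


def hfb_envelope(x, fs, band=(70.0, 150.0)):
    """Amplitude envelope of `x` restricted to `band` (Hz, assumed range).

    Keeping only positive in-band frequencies and doubling them yields the
    analytic signal of the band-limited trace; its magnitude is the envelope.
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x), d=1.0 / fs)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    analytic = np.fft.ifft(np.where(keep, 2.0 * spectrum, 0.0))
    return np.abs(analytic)
```

      Applied to a 100 Hz carrier whose amplitude is slowly modulated, the function returns the modulation rather than the carrier, which is the property that makes the envelope a proxy for local population activity.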

      (1) Parvizi, J. & Kastner, S. Promises and limitations of human intracranial electroencephalography. Nat. Neurosci. 21, 474–483 (2018).

      (2) Buzsaki, G. Rhythms of the Brain. (Oxford University Press, Oxford, 2006).

      (5) The hypothesis of developmental emergence of top-down prefrontal modulation is not directly tested. No connectivity or co-activation analyses are reported, and the number of participants with simultaneous coverage of pSTC and DLPFC is not specified.  

      Directional connectivity analysis results were not shown because only one child has simultaneous coverage of pSTC and DLPFC. However, the Granger causality results from the post-childhood group (N=7) clearly showed that the influence in the alpha/beta band from DLPFC to pSTC (top-down) gradually increased after the onset of face presentation (Author response image 3, below left, plotted in red). By comparison, the influence in the alpha/beta band from pSTC to DLPFC (bottom-up) gradually decreased after the onset of face presentation (Author response image 3, below left, blue curve). The influence in the alpha/beta band from DLPFC to pSTC was significantly increased at 750 and 1250 ms after face presentation (face vs nonface, paired t-test, Bonferroni-corrected P=0.005, 0.006), suggesting enhanced top-down modulation in the post-childhood group while watching emotional faces. Interestingly, this top-down influence appears very different in the 8-year-old child at 1250 ms after face presentation (Author response image 3, below left, black curve).

      As we cannot draw direct conclusions from the single-subject sample presented here, the top-down hypothesis is introduced only as a possible explanation for our current results. We have removed potentially misleading statements, and we plan to test this hypothesis directly using MEG in the future.

      Author response image 3.

      Difference of Granger causality indices (face – nonface) in the alpha/beta and gamma bands for both directions. We identified a series of face onsets in the movie that participants watched. Each trial was defined as -0.1 to 1.5 s relative to the onset. For the non-face control trials, we used houses, animals and scenes. Granger causality was calculated for the 0-0.5 s, 0.5-1 s and 1-1.5 s time windows. For the post-childhood group, GC indices were averaged across participants. Error bars indicate SEM.
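
      As a rough illustration of what a Granger causality index measures, the sketch below computes a time-domain index between two signals from autoregressive fits. It is not the band-specific (alpha/beta, gamma) analysis reported above; the model order, function names, and least-squares fitting are assumptions chosen to keep the example short.

```python
import numpy as np


def lagged(sig, order):
    """Matrix whose columns are sig[t-1], ..., sig[t-order] for t >= order."""
    n = len(sig)
    return np.column_stack([sig[order - k: n - k] for k in range(1, order + 1)])


def granger_index(source, target, order=5):
    """ln(restricted residual variance / full residual variance).

    The restricted model predicts `target` from its own past; the full model
    also uses the past of `source`. Larger values mean the source's past
    improves the prediction more.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    y = target[order:]
    own = lagged(target, order)
    both = np.hstack([own, lagged(source, order)])

    def resid_var(design):
        design = np.hstack([np.ones((len(y), 1)), design])  # intercept
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.var(y - design @ beta)

    return float(np.log(resid_var(own) / resid_var(both)))
```

      Under these assumptions, a face minus nonface difference like the one plotted would be obtained by computing the index separately for the two trial types within each time window and subtracting.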

      (6) The "post-childhood" group spans ages 13-55, conflating adolescence, young adulthood, and middle age. Developmental conclusions would benefit from finer age stratification.  

      We appreciate this insightful comment. Our current sample size does not allow such stratification. But we plan to address this important issue in future MEG studies with larger cohorts.

      (7) The so-called "complex emotions" (e.g., embarrassment, pride, guilt, interest) used in the study often require contextual information, such as speech or narrative cues, for accurate interpretation, and are not typically discernible from facial expressions alone. As such, the observed age-related increase in neural encoding of these emotions may reflect not solely the maturation of facial emotion perception, but rather the development of integrative processing that combines facial, linguistic, and contextual cues. This raises the possibility that the reported effects are driven in part by language comprehension or broader social-cognitive integration, rather than by changes in facial expression processing per se.  

      We agree with this interpretation. Indeed, our results already show that speech influences the encoding of facial emotion in the DLPFC differently in the childhood and post-childhood groups (Figure 2D), suggesting that children’s ability to integrate multiple cues is still developing. Future studies are needed to systematically examine how linguistic cues and prior experiences contribute to the understanding of complex emotions from faces, which we have added to our future directions section (last paragraph in Discussion, Pages 8-9).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      In the introduction: "These neuroimaging data imply that social and emotional experiences shape the prefrontal cortex's involvement in processing the emotional meaning of faces throughout development, probably through top-down modulation of early sensory areas." Aren't these supposed to be iEEG data instead of neuroimaging? 

      Corrected.

      Reviewer #2 (Recommendations for the authors):

      This manuscript would benefit from several improvements to strengthen the validity and interpretability of the findings:

      (1) Increase the sample size, especially for children with pSTC coverage. 

      We added data from another two children who have pSTC coverage. Please see our response to reviewer 2’s comment 3 and reviewer 1’s comment 1.

      (2) Include directional connectivity analyses to test the proposed top-down modulation from DLPFC to pSTC. 

      Thanks for the suggestion. Please see our response to reviewer 2’s comment 5.

      (3) Use controlled stimuli in an additional experiment to separate the effects of facial expression, speech, and music. 

      This is an excellent point. However, iEEG data collection from children is an exceptionally rare opportunity and typically requires many years, so we are unable to add a controlled-stimulus experiment to the current study. We plan to use controlled stimuli to study the processing of complex emotions with non-invasive methods in the future. In addition, please see our response to reviewer 2’s comment 1 for a description of how neural responses to facial expression and music are separated in our study.

    1. eLife Assessment

      This revised paper provides a valuable and novel neural network-based framework for parameterizing individual differences and predicting individual decision-making across task conditions. The methods and analyses are solid yet could benefit from further validation of the superiority of the proposed framework against other baseline models. With these concerns addressed, this study would offer a proof-of-concept neural network approach to scientists working on the generalization of cognitive skills across contexts.

    2. Reviewer #1 (Public review):

      Summary

      The manuscript presents EIDT, a framework that extracts an "individuality index" from a source task to predict a participant's behaviour in a related target task under different conditions. However, the evidence that it truly enables cross-task individuality transfer is not convincing.

      Strengths

      The EIDT framework is clearly explained, and the experimental design and results are generally well-described. The performance of the proposed method is tested on two distinct paradigms: a Markov Decision Process (MDP) task (comparing 2-step and 3-step versions) and a handwritten digit recognition (MNIST) task under various conditions of difficulty and speed pressure. The results indicate that the EIDT framework generally achieved lower prediction error compared to baseline models and that it was better at predicting a specific individual's behaviour when using their own individuality index compared to using indices from others.

      Furthermore, the individuality index appeared to form distinct clusters for different individuals, and the framework was better at predicting a specific individual's behaviour when using their own derived index compared to using indices from other individuals.

      Comments on revisions:

      I thank the author for the additional analyses. They have fully addressed all of my previous concerns, and I have no further recommendations.

    3. Reviewer #2 (Public review):

      This paper introduces a framework for modeling individual differences in decision-making by learning a low-dimensional representation (the "individuality index") from one task and using it to predict behaviour in a different task. The approach is evaluated on two types of tasks: a sequential value-based decision-making task and a perceptual decision task (MNIST). The model shows improved prediction accuracy when incorporating this learned representation compared to baseline models.

      The motivation is solid, and the modelling approach is interesting, especially the use of individual embeddings to enable cross-task generalization. That said, several aspects of the evaluation and analysis could be strengthened.

      (1) The MNIST SX baseline appears weak. RTNet isn't directly comparable in structure or training. A stronger baseline would involve training the GRU directly on the task without using the individuality index-e.g., by fixing the decoder head. This would provide a clearer picture of what the index contributes.

      (2) Although the focus is on prediction, the framework could offer more insight into how behaviour in one task generalizes to another. For example, simulating predicted behaviours while varying the individuality index might help reveal what behavioural traits it encodes.

      (3) It's not clear whether the model can reproduce human behaviour when acting on-policy. Simulating behaviour using the trained task solver and comparing it with actual participant data would help assess how well the model captures individual decision tendencies.

      (4) Figures 3 and S1 aim to show that individuality indices from the same participant are closer together than those from different participants. However, this isn't fully convincing from the visualizations alone. Including a quantitative presentation would help support the claim.

      (5) The transfer scenarios are often between very similar task conditions (e.g., different versions of MNIST or two-step vs three-step MDP). This limits the strength of the generalization claims. In particular, the effects in the MNIST experiment appear relatively modest, and the transfer is between experimental conditions within the same perceptual task. To better support the idea of generalizing behavioural traits across tasks, it would be valuable to include transfers across more structurally distinct tasks.

      (6) For both experiments, it would help to show basic summaries of participants' behavioural performance. For example, in the MDP task, first-stage choice proportions based on transition types are commonly reported. These kinds of benchmarks provide useful context.

      (7) For the MDP task, consider reporting the number or proportion of correct choices in addition to negative log-likelihood. This would make the results more interpretable.

      (8) In Figure 5, what is the difference between the "% correct" and "% match to behaviour"? If so, it would help to clarify the distinction in the text or figure captions.

      (9) For the cognitive model, it would be useful to report the fitted parameters (e.g., learning rate, inverse temperature) per individual. This can offer insight into what kinds of behavioural variability the individuality index might be capturing.

      (10) A few of the terms and labels in the paper could be made more intuitive. For example, the name "individuality index" might give the impression of a scalar value rather than a latent vector, and the labels "SX" and "SY" are somewhat arbitrary. You might consider whether clearer or more descriptive alternatives would help readers follow the paper more easily.

      (11) Please consider including training and validation curves for your models. These would help readers assess convergence, overfitting, and general training stability, especially given the complexity of the encoder-decoder architecture.

      Comments on revisions:

      Thank you to the authors for the updated manuscript. The authors have addressed the majority of my concerns, and the paper is now in a much better form.

      Regarding my previous Comment 6, I still believe it would be helpful to include a graph similar to what is typically reported for these tasks-specifically, a breakdown of choices based on rare versus common transitions (see Model-Based Influences on Humans' Choices and Striatal Prediction Errors, Figure 2). Presenting this for both the actual behaviour and the simulated data would strengthen the paper and allow for clearer comparison.

    4. Reviewer #3 (Public review):

      Summary:

      This work presents a novel neural network-based framework for parameterizing individual differences in human behavior. Using two distinct decision-making experiments, the author demonstrates the approach's potential and claims it can predict individual behavior (1) within the same task, (2) across different tasks, and (3) across individuals. While the goal of capturing individual variability is compelling and the potential applications are promising, the claims are weakly supported, and I find that the underlying problem is conceptually ill-defined.

      Strengths:

      The idea of using neural networks for parameterizing individual differences in human behavior is novel, and the potential applications can be impactful.

      Weaknesses:

      (1) To demonstrate the effectiveness of the approach, the authors compare a Q-learning cognitive model (for the MDP task) and RTNet (for the MNIST task) against the proposed framework. However, as I understand it, neither the cognitive model nor RTNet is designed to fit or account for individual variability. If that is the case, it is unclear why these models serve as appropriate baselines. Isn't it expected that a model explicitly fitted to individual data would outperform models that do not? If so, does the observed superiority of the proposed framework simply reflect the unsurprising benefit of fitting individual variability? I think the authors should either clarify why these models constitute fair control or validate the proposed approach against stronger and more appropriate baselines.

      (2) It's not very clear in the results section what it means by having a shorter within-individual distance than between-individual distances. Related to the comment above, is there any control analysis performed for this? Also, this analysis appears to have nothing to do with predicting individual behavior. Is this evidence toward successfully parameterizing individual differences? Could this be task-dependent, especially since the transfer is evaluated on exceedingly similar tasks in both experiments? I think a bit more discussion of the motivation and implications of these results will help the reader in making sense of this analysis.

      (3) The authors have to better define what exactly they mean by transferring across different "tasks" and testing the framework in "more distinctive tasks". All presented evidence, taken at face value, demonstrated transferring across different "conditions" of the same task within the same experiment. It is unclear to me how generalizable the framework will be when applied to different tasks.

      (4) Conceptually, it is also unclear to me how plausible it is that the framework could generalize across tasks spanning multiple cognitive domains (if that's what is meant by more distinctive). For instance, how can an individual's task performance on a Posner task predict task performance on the Cambridge face memory test? Which part of the framework could have enabled such a cross-domain prediction of task performance? I think these have to be at least discussed to some extent, since without it the future direction is meaningless.

      (5) How is the negative log-likelihood, which seems to be the main metric for comparison, computed? Is this based on trial-by-trial response prediction or probability of responses, as is usually done in cognitive modelling?

      (6) None of the presented evidence is cross-validated. The authors should consider performing K-fold cross-validation on the train, test, and evaluation split of subjects to ensure robustness of the findings.

      (7) The authors excluded 25 subjects (20% of the data) for different reasons. This is a substantial proportion, especially by the standards of what is typically observed in behavioral experiments. The authors should provide a clear justification for these exclusion criteria and, if possible, cite relevant studies that support the use of such stringent thresholds.

      (8) The authors should do a better job of creating the figures and writing the figure captions. It is unclear which specific claim the authors are addressing with each figure. For example, what is the key message of Figure 2C regarding transfer within and across participants? Why is the presentation of statistics different between the Cognitive model and the EIDT framework plots? In Figure 3, it's unclear what these dots and clusters represent and how they support the authors' claim that the same individual forms clusters. Also, doesn't this experiment have 98 subjects after exclusion? The plot has far fewer than 98 dots as far as I can tell. Furthermore, I find Figure 5 particularly confusing, as the underlying claim it is meant to illustrate is unclear. Clearer figures and more informative captions are needed to guide the reader effectively.

      (9) I also find the writing somewhat difficult to follow. The subheadings are confusing, and it's often unclear which specific claim the authors are addressing. The presentation of results feels disorganized, making it hard to trace the evidence supporting each claim. Also, the excessive use of acronyms (e.g., SX, SY, CG, EA, ES, DA, DS) makes the text harder to parse. I recommend restructuring the results section to be clearer and significantly reducing the use of unnecessary acronyms.

      Comments on revisions:

      The authors have addressed my previous comments with great care and detail. I appreciate the additional analyses and edits. I have no further comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Because the "source" and "target" tasks are merely parameter variations of the same paradigm, it is unclear whether EIDT achieves true cross-task transfer. The manuscript provides no measure of how consistent each participant's behaviour is across these variants (e.g., two- vs three-step MDP; easy vs difficult MNIST). Without this measure, the transfer results are hard to interpret. In fact, Figure 5 shows a notable drop in accuracy when transferring between the easy and difficult MNIST conditions, compared to transfers between accuracy-focused and speed-focused conditions. Does this discrepancy simply reflect larger within-participant behavioural differences between the easy and difficult settings? A direct analysis of intra-individual similarity for each task pair and how that similarity is related to EIDT's transfer performance is needed.

      Thank you for your insightful comment. We agree that the tasks used in our study are variations of the same paradigm. Accordingly, we have revised the manuscript to consistently frame our findings as demonstrating individuality transfer "across task conditions" rather than "across distinct tasks."

      In response to your suggestion, we have conducted a new analysis to directly investigate the relationship between individual behavioural patterns and transfer performance. As shown in the new Figures 4, 11, S8, and S9, we found a clear relationship between the distance in the space of individual latent representation (called individuality index in the previous manuscript) and prediction performance. Specifically, prediction accuracy for a given individual's behaviour degrades as the latent representation of the model's source individual becomes more distant. This result directly demonstrates that our framework captures meaningful individual differences that are predictive of transfer performance across conditions.

      We have also expanded the Discussion (Lines 332--343) to address the potential for applying this framework to more structurally distinct tasks, hypothesizing that this would rely on shared underlying cognitive functions.

      Related to the previous comment, the individuality index is central to the framework, yet remains hard to interpret. It shows much greater within-participant variability in the MNIST experiment (Figure S1) than in the MDP experiment (Figure 3). Is such a difference meaningful? It is hard to know whether it reflects noisier data, greater behavioural flexibility, or limitations of the model.

      Thank you for raising this important point about interpretability. To enhance the interpretability of the individual latent representation, we have added a new analysis for the MDP task (see Figures 6 and S4). By applying our trained encoder to data from simulated Q-learning agents with known parameters, we demonstrate that the dimensions of the latent space systematically map onto the agents' underlying cognitive parameters (learning rate and inverse temperature). This analysis provides a clearer interpretation by linking our model's data-driven representation to established theoretical constructs.
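
      To make the two cognitive parameters concrete, here is a minimal sketch of a simulated Q-learning agent with a learning rate and an inverse temperature. It uses a one-step, two-action bandit rather than the multi-step MDP in the study; the reward probabilities, trial count, and function name are hypothetical.

```python
import math
import random


def simulate_q_agent(alpha, beta, reward_probs=(0.8, 0.2), n_trials=200, seed=0):
    """Simulate softmax Q-learning on a two-armed bandit (simplified task).

    alpha: learning rate; beta: inverse temperature (higher = more greedy).
    Returns the lists of chosen actions and received rewards.
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]
    choices, rewards = [], []
    for _ in range(n_trials):
        # Softmax over two actions reduces to a logistic of the value difference.
        p_action1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        action = 1 if rng.random() < p_action1 else 0
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        q[action] += alpha * (reward - q[action])  # prediction-error update
        choices.append(action)
        rewards.append(reward)
    return choices, rewards
```

      Sweeping alpha and beta over a grid and passing each simulated choice sequence through a trained encoder is the kind of procedure the response describes for relating latent dimensions to known parameters.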

      Regarding the greater within-participant variability observed in the MNIST task (visualized now in Figure S7), this could be attributed to several factors, such as greater behavioural flexibility in the perceptual task. However, disentangling these potential factors is complex and falls outside the primary scope of the current study, which prioritizes demonstrating robust prediction accuracy across different task conditions.

      The authors suggest that the model's ability to generalize to new participants "likely relies on the fact that individuality indices form clusters and individuals similar to new participants exist in the training participant pool". It would be helpful to directly test this hypothesis by quantifying the similarity (or distance) of each test participant's individuality index to the individuals or identified clusters within the training set, and assessing whether greater similarity (or closer proximity) to the clusters in the training set is associated with higher prediction accuracy for those individuals in the test set.

      Thank you for this excellent suggestion. We have performed the analysis you proposed to directly test this hypothesis. Our new results, presented in Figures 4, 11, S5, S8, and S9, quantify the distance between the latent representation of a test participant and that of the source participant used to generate the prediction model.

      The results show a significant negative correlation: prediction accuracy consistently decreases as the distance in the latent space increases. This confirms that generalization performance is directly tied to the similarity of behavioural patterns as captured by our latent representation, strongly supporting our hypothesis.

      Reviewer #2 (Public review):

      The MNIST SX baseline appears weak. RTNet isn't directly comparable in structure or training. A stronger baseline would involve training the GRU directly on the task without using the individuality index-e.g., by fixing the decoder head. This would provide a clearer picture of what the index contributes.

      We agree that a more direct baseline is crucial for evaluating the contribution of our transfer mechanism. For the Within-Condition Prediction scenario, the comparison with RTNet was intended only to validate that our task solver architecture could achieve average human-level task performance (Figure 7).

      For the critical Cross-Condition Transfer scenario, we have now implemented a stronger and more appropriate baseline, which we call "task solver (source)." This model has the same architecture as our EIDT task solver but is trained directly on the source task data of the specific test participant. As shown in revised Figure 9, our EIDT framework significantly outperforms this direct-training approach, clearly demonstrating the benefit of the individuality transfer mechanism.

      Although the focus is on prediction, the framework could offer more insight into how behaviour in one task generalizes to another. For example, simulating predicted behaviours while varying the individuality index might help reveal what behavioural traits it encodes.

      Thank you for this valuable suggestion. To provide more insight into the encoded behavioural traits, we have conducted a new analysis linking the individual latent representation to a theoretical cognitive model. As detailed in the revised manuscript (Figures 6 and S4), we applied our encoder to simulated data from Q-learning agents with varying parameters. The results show a systematic relationship between the latent space coordinates and the agents' learning rates and inverse temperatures, providing a clearer interpretation of what the representation captures.

      It's not clear whether the model can reproduce human behaviour when acting on-policy. Simulating behaviour using the trained task solver and comparing it with actual participant data would help assess how well the model captures individual decision tendencies.

      We have added the suggested on-policy evaluation (Lines 195--207). In the revised manuscript (Figure 5), we present results from simulations where the trained task solvers performed the MDP task. We compared their performance (total reward and rate of the highly-rewarding action selected) against their corresponding human participants. The strong correlations observed demonstrate that our model successfully captures and reproduces individual-specific behavioural tendencies in an on-policy setting.

      Figures 3 and S1 aim to show that individuality indices from the same participant are closer together than those from different participants. However, this isn't fully convincing from the visualizations alone. Including a quantitative presentation would help support the claim.

      We agree that the original visualizations of inter- and intra-participant distances were not sufficiently convincing. We have therefore removed that analysis. In its place, we have introduced a more direct and quantitative analysis that explicitly links the individual latent representation to prediction performance (see Figures 4, 11, S5, S8, and S9). This new analysis demonstrates that prediction error for an individual is a function of distance in the latent space, providing stronger evidence that the representation captures meaningful, individual-specific information.

      The transfer scenarios are often between very similar task conditions (e.g., different versions of MNIST or two-step vs three-step MDP). This limits the strength of the generalization claims. In particular, the effects in the MNIST experiment appear relatively modest, and the transfer is between experimental conditions within the same perceptual task. To better support the idea of generalizing behavioural traits across tasks, it would be valuable to include transfers across more structurally distinct tasks.

      We agree with this limitation and have revised the manuscript to be more precise. We now frame our contribution as "individuality transfer across task conditions" rather than "across tasks" to accurately reflect the scope of our experiments. We have also expanded the Discussion section (Lines 332--343) to address the potential and challenges of applying this framework to more structurally distinct tasks, noting that it would likely depend on shared underlying cognitive functions.

      For both experiments, it would help to show basic summaries of participants' behavioural performance. For example, in the MDP task, first-stage choice proportions based on transition types are commonly reported. These kinds of benchmarks provide useful context.

      We have added behavioral performance summaries as requested. For the MDP task, Figure 5 now compares the total reward and rate of highly-rewarding action selected between humans and our model. For the MNIST task, Figure 7 shows the rate of correct responses for humans, RTNet, and our task solver across all conditions. These additions provide better context for the model's performance.

      For the MDP task, consider reporting the number or proportion of correct choices in addition to negative log-likelihood. This would make the results more interpretable.

      Thank you for the suggestion. To make the results more interpretable, we have added a new prediction performance metric: the rate for behaviour matched. This metric measures the proportion of trials where the model's predicted action matches the human's actual choice. This is now included alongside the negative log-likelihood in Figures 2, 3, 4, 8, 9, and 11.
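
      The two prediction metrics can be sketched as below. The response defines the rate for behaviour matched only as the proportion of trials where the predicted action equals the human choice; taking the predicted action to be the most probable one, and averaging the negative log-likelihood per trial, are assumptions made here, as is the function name.

```python
import math


def prediction_metrics(action_probs, chosen):
    """Mean negative log-likelihood and rate for behaviour matched.

    action_probs: per-trial predicted probabilities over actions.
    chosen: the action index the participant actually selected on each trial.
    """
    nll = 0.0
    matches = 0
    for probs, action in zip(action_probs, chosen):
        # Clip to avoid log(0) when the model assigns zero probability.
        nll -= math.log(max(probs[action], 1e-12))
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        matches += predicted == action
    n = len(chosen)
    return nll / n, matches / n
```

      The two metrics can disagree: a model that is confidently wrong on a few trials may keep a high match rate while its negative log-likelihood grows sharply, which is one reason to report both.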

      In Figure 5, what is the difference between the "% correct" and "% match to behaviour"? If so, it would help to clarify the distinction in the text or figure captions.

      We have clarified these terms in the revised manuscript. As defined in the Results section (Lines 116--122, 231), "%correct" (now "rate of correct responses") is a measure of task performance, whereas "%match to behaviour" (now "rate for behaviour matched") is a measure of prediction accuracy.

      For the cognitive model, it would be useful to report the fitted parameters (e.g., learning rate, inverse temperature) per individual. This can offer insight into what kinds of behavioural variability the individual latent representation might be capturing.

      We have added histograms of the fitted Q-learning parameters for the human participants in Supplementary Materials (Figure S1). This analysis revealed which parameters varied most across the population and directly informed the design of our subsequent simulation study with Q-learning agents (see response to Comment 2-2), where we linked these parameters to the individual latent representation (Lines 208--223).

      A few of the terms and labels in the paper could be made more intuitive. For example, the name "individuality index" might give the impression of a scalar value rather than a latent vector, and the labels "SX" and "SY" are somewhat arbitrary. You might consider whether clearer or more descriptive alternatives would help readers follow the paper more easily.

      We have adopted the suggested changes for clarity.

      "Individuality index" has been changed to "individual latent representation".

      "Situation SX" and "Situation SY" have been renamed to the more descriptive "Within-Condition Prediction" and "Cross-Condition Transfer", respectively.

      We have also added a table in Figure 7 to clarify the MNIST condition acronyms (EA/ES/DA/DS).

      Please consider including training and validation curves for your models. These would help readers assess convergence, overfitting, and general training stability, especially given the complexity of the encoder-decoder architecture.

      Training and validation curves for both the MDP and MNIST tasks have been added to Supplementary Materials (Figures S2 and S6) to show model convergence and stability.

      Reviewer #3 (Public review):

      To demonstrate the effectiveness of the approach, the authors compare a Q-learning cognitive model (for the MDP task) and RTNet (for the MNIST task) against the proposed framework. However, as I understand it, neither the cognitive model nor RTNet is designed to fit or account for individual variability. If that is the case, it is unclear why these models serve as appropriate baselines. Isn't it expected that a model explicitly fitted to individual data would outperform models that do not? If so, does the observed superiority of the proposed framework simply reflect the unsurprising benefit of fitting individual variability? I think the authors should either clarify why these models constitute fair control or validate the proposed approach against stronger and more appropriate baselines.

      Thank you for raising this critical point. We wish to clarify the nature of our baselines:

      For the MDP task, the cognitive model baseline was indeed designed to account for individual variability. We estimated its parameters (e.g., learning rate) from each individual's source task behaviour and then used those specific parameters to predict their behaviour in the target task. This makes it a direct, parameter-based transfer model and thus a fair and appropriate baseline for individuality transfer.

      For the MNIST task, we agree that the RTNet baseline was insufficient for evaluating individual-level transfer in the "Cross-Condition Transfer" scenario. We have now introduced a much stronger baseline, the "task solver (source)," which is trained specifically on the source task data of each test participant. Our results (Figure 9) show that the EIDT framework significantly outperforms this more appropriate, individualized baseline, highlighting the value of our transfer method over direct, within-condition fitting.

      It's not very clear in the results section what it means by having a shorter within-individual distance than between-individual distances. Related to the comment above, is there any control analysis performed for this? Also, this analysis appears to have nothing to do with predicting individual behavior. Is this evidence toward successfully parameterizing individual differences? Could this be task-dependent, especially since the transfer is evaluated on exceedingly similar tasks in both experiments? I think a bit more discussion of the motivation and implications of these results will help the reader in making sense of this analysis.

      We agree that the previous analysis on inter- and intra-participant distances was not sufficiently clear or directly linked to the model's predictive power. We have removed this analysis from the manuscript. In its place, we have introduced a new, more direct analysis (Figures 4, 11, S5, S8, and S9) that demonstrates a quantitative relationship between the distance in the latent space and prediction accuracy. This new analysis shows that prediction error for an individual increases as a function of this distance, providing much stronger and clearer evidence that our framework successfully parameterizes meaningful individual differences.

      The authors have to better define what exactly they mean by transferring across different "tasks" and testing the framework in "more distinctive tasks". All presented evidence, taken at face value, demonstrated transferring across different "conditions" of the same task within the same experiment. It is unclear to me how generalizable the framework will be when applied to different tasks.

      Conceptually, it is also unclear to me how plausible it is that the framework could generalize across tasks spanning multiple cognitive domains (if that's what is meant by more distinctive). For instance, how can an individual's task performance on a Posner task predict task performance on the Cambridge face memory test? Which part of the framework could have enabled such a cross-domain prediction of task performance? I think these have to be at least discussed to some extent, since without it the future direction is meaningless.

      We agree with your assessment and have corrected our terminology throughout the manuscript. We now consistently refer to the transfer as being "across task conditions" to accurately describe the scope of our findings.

      We have also expanded our Discussion (Lines 332--343) to address the important conceptual point about cross-domain transfer. We hypothesize that such transfer would be possible if the tasks, even if structurally different, rely on partially shared underlying cognitive functions (e.g., working memory). In such a scenario, the individual latent representation would capture an individual's specific characteristics related to that shared function, enabling transfer. Conversely, we state that transfer between tasks with no shared cognitive basis would not be expected to succeed with our current framework.

      How is the negative log-likelihood, which seems to be the main metric for comparison, computed? Is this based on trial-by-trial response prediction or probability of responses, as is usually done in cognitive modelling?

      The negative log-likelihood is computed on a trial-by-trial basis. It is based on the probability the model assigned to the specific action that the human participant actually took on that trial. This calculation is applied consistently across all models (cognitive models, RTNet, and EIDT). We have added sentences to the Results section to clarify this point (Lines 116--122).
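
      The calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the manuscript: the function name, the clipping constant `eps`, and the choice to average over trials (rather than sum) are all hypothetical.

```python
import math
from typing import Sequence


def trial_nll(
    action_probs: Sequence[Sequence[float]],
    chosen: Sequence[int],
    eps: float = 1e-12,
) -> float:
    """Mean negative log-likelihood of the actions actually taken.

    action_probs[t][a] is the probability the model assigns to action a on trial t;
    chosen[t] is the action the participant took on trial t.
    Averaging over trials and clipping at `eps` are assumptions of this sketch.
    """
    if len(action_probs) != len(chosen) or len(chosen) == 0:
        raise ValueError("need one probability vector per trial and at least one trial")
    total = 0.0
    for probs, action in zip(action_probs, chosen):
        # Only the probability of the chosen action enters the likelihood.
        total -= math.log(max(probs[action], eps))
    return total / len(chosen)


if __name__ == "__main__":
    # Illustrative probabilities only, not data from the study.
    print(trial_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1]))
```

      Because the same function can be applied to any model that outputs per-trial action probabilities, it allows the cognitive models, RTNet, and EIDT to be compared on a common scale.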

      None of the presented evidence is cross-validated. The authors should consider performing K-fold cross-validation on the train, test, and evaluation split of subjects to ensure robustness of the findings.

      All prediction performance results reported in the revised manuscript are now based on a rigorous leave-one-participant-out cross-validation procedure to ensure the robustness of our findings. We have updated the Methods section to reflect this (Lines 127--129 and 229).
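
      For readers unfamiliar with the procedure, a leave-one-participant-out split can be sketched as below. This is a generic illustration, not the manuscript's implementation; the function name and the representation of the data as a flat list of per-trial participant IDs are assumptions.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_participant_out(
    participant_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_participant, train_indices, test_indices) for each participant.

    participant_ids[i] is the participant who produced sample i (hypothetical layout).
    All samples from the held-out participant go to the test set, so no participant
    contributes to both training and testing within a fold.
    """
    # dict.fromkeys keeps the first-seen order of participants while removing repeats.
    for held_out in dict.fromkeys(participant_ids):
        train_idx = [i for i, p in enumerate(participant_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # Illustrative IDs only.
    for held_out, train_idx, test_idx in leave_one_participant_out(["a", "a", "b", "c"]):
        print(held_out, train_idx, test_idx)
```

      Splitting by participant rather than by trial is what makes the reported performance an estimate of generalization to unseen individuals.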

      For some purely illustrative visualizations (e.g., plotting the entire latent space in Figures S3 and S7), we used a model trained on all participants' data to provide a single, representative example and avoid clutter. We have explicitly noted this in the relevant figure captions.

      The authors excluded 25 subjects (20% of the data) for different reasons. This is a substantial proportion, especially by the standards of what is typically observed in behavioral experiments. The authors should provide a clear justification for these exclusion criteria and, if possible, cite relevant studies that support the use of such stringent thresholds.

      We acknowledge the concern regarding the exclusion rate. The previous criteria were indeed empirical. We have now implemented a more systematic exclusion procedure based on the interquartile range of performance metrics, which is detailed in Section 4.2.2 (Lines 489--498). This revised, objective criterion resulted in the exclusion of 42 participants (34% of the initial sample). While this rate is high, we attribute it to the online nature of the data collection, where participant engagement can be more variable. We believe applying these strict criteria was necessary to ensure the quality and reliability of the behavioural data used for modeling.
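
      To illustrate what an interquartile-range criterion of this kind looks like in general, the sketch below flags values outside Tukey-style fences. The multiplier `k = 1.5`, the quantile method, and the function names are assumptions made for illustration; the thresholds actually used are those given in Section 4.2.2 of the manuscript.

```python
import statistics
from typing import List, Sequence, Tuple


def iqr_bounds(values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is the conventional Tukey choice and is an assumption of this sketch.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def keep_mask(values: Sequence[float], k: float = 1.5) -> List[bool]:
    """True for participants whose metric lies within the fences, False otherwise."""
    lower, upper = iqr_bounds(values, k)
    return [lower <= v <= upper for v in values]


if __name__ == "__main__":
    # Illustrative performance scores only, not data from the study.
    scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print(iqr_bounds(scores))
    print(keep_mask(scores))
```

      A rule of this form depends only on the distribution of the metric itself, which is what makes it more objective than hand-picked cut-offs.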

      The authors should do a better job of creating the figures and writing the figure captions. It is unclear which specific claim the authors are addressing with each figure. For example, what is the key message of Figure 2C regarding transfer within and across participants? Why is the presentation of statistics different between the Cognitive model and the EIDT framework plots? In Figure 3, it's unclear what these dots and clusters represent and how they support the authors' claim that the same individual forms clusters. Also, doesn't this experiment have 98 subjects after exclusion? This plot has far fewer than 98 dots as far as I can tell. Furthermore, I find Figure 5 particularly confusing, as the underlying claim it is meant to illustrate is unclear. Clearer figures and more informative captions are needed to guide the reader effectively.

      We agree that several figures and analyses in the original manuscript were unclear, and we have thoroughly revised our figures and their captions to improve clarity.

      The confusing analyses in the old Figures 2C and 5 (Original/Others comparison) have been completely removed. The unclear visualization of the latent space for the test pool (old Figure 3 showing representations only from test participants) has also been removed to avoid confusion. For visualization of the overall latent space, we now use models trained on all data (Figures S3 and S7) and have clarified this in the captions. In place of these removed analyses, we have introduced a new, more intuitive "cross-individual" analysis (presented in Figures 4, 11, S5, S8, and S9). As explained in the new, more detailed captions, this analysis directly plots prediction performance as a function of the distance in latent space, providing a much clearer demonstration of how the latent representation relates to predictive accuracy.

      I also find the writing somewhat difficult to follow. The subheadings are confusing, and it's often unclear which specific claim the authors are addressing. The presentation of results feels disorganized, making it hard to trace the evidence supporting each claim. Also, the excessive use of acronyms (e.g., SX, SY, CG, EA, ES, DA, DS) makes the text harder to parse. I recommend restructuring the results section to be clearer and significantly reducing the use of unnecessary acronyms.

      Thank you for this feedback. We have made significant revisions to improve the clarity and organization of the manuscript. We have renamed confusing acronyms: "Situation SX" is now "Within-Condition Prediction," and "Situation SY" is now "Cross-Condition Transfer." We also added a table to clarify the MNIST condition acronyms (EA/ES/DA/DS) in Figure 7.

      The Results section has been substantially restructured with clearer subheadings.

    1. eLife Assessment

      This study presents compelling new data that combine two FTD-tau mutations, P301L/S320F (PL-SF), that reliably induce spontaneous full-length tau aggregation across multiple cellular systems. The findings are important for the field of neurodegenerative disease. The strength of evidence is solid; however, several conclusions would benefit from more validation.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents compelling new data that combine two FTD-tau mutations, P301L/S320F (PL-SF), that reliably induce spontaneous full-length tau aggregation across multiple cellular systems. However, several conclusions would benefit from more validation. Key findings rely on quantification of overexposed immunoblots, and in some experiments, the tau bands show shifts in molecular weight that are not explained (and in some instances vary between experiments). The effect seems to be driven by the S320F mutation, with the P301L mutation enhancing the effect observed with S320F alone. Although the observation that Hsp70, but not the related Hsc70, enhances aggregation is intriguing, the mechanistic basis for these differences remains unclear despite both Hsp70 and Hsc70 binding to tau. Additional experiments clarifying which PL-SF tau species Hsp70 engages, how this interaction alters tau conformational landscapes, and whether other chaperones or cofactors contribute to this effect would help solidify the conclusions and build a mechanistic picture. Overexpression of Hsp70 in the context of PL tau did not increase tau aggregation, which raises questions about whether the observed effects are specific to the SF mutation. Hsp70 functions in the context of a larger network of chaperones and has been proposed to cooperate with other proteins/machinery to disassemble tau amyloids, perhaps to produce more seeds. This would be consistent with the presented observations. For example, co-IP experiments using Hsp70 as bait combined with proteomics could really help build a more complete picture of what tau species Hsp70 binds and what other factors cooperate to yield the observed increases in aggregation. As it stands, the Hsp70 component of the paper is not fully developed, and additional experiments to address these questions would strengthen this manuscript beyond simply presenting a new tool to study spontaneous tau aggregation.

      Strengths:

      (1) The PL-SF FL tau mutant aggregates spontaneously in different cellular systems and shows hallmarks of tau pathology linked to disease.

      (2) PL-SF 4delta mutant reverses the spontaneous aggregation phenotype, consistent with these residues being critical for tau aggregation.

      (3) PL-SF 4delta also loses the ability to recruit Hsp70/Hsc70, consistent with these residues also being critical for chaperone recruitment.

      (4) The PL-SF tau mutant establishes a new system to study spontaneous tau assembly and to begin to compare it to seeded tau aggregation processes.

      Weaknesses:

      (1) Mechanistic insight into how Hsp70 but not Hsc70 increases PL-SF FL tau aggregation/pathology is missing. This is despite both chaperones binding to PL-SF FL tau. What species of tau does Hsp70 bind, and what cofactors are important in this process?

      (2) The study relies heavily on densitometry of bands to draw conclusions; in several instances, the blots are too overexposed to accurately quantify the signal.

    3. Reviewer #2 (Public review):

      Summary:

      This study developed a novel tauopathy model combining two mutations, P301L and S320F, termed the PL-SF model. This model shows rapid tau protein aggregation.

      Strengths:

      The authors demonstrated pathogenicity through solid in vivo and in vitro experiments. Simultaneously, they used this model to investigate the role of the heat shock protein Hsp70 in tau protein aggregation, finding that Hsp70 promotes rather than inhibits tau pathology, which differs from previous findings.

      Weaknesses:

      (1) Although the PL-SF model can accelerate tau aggregation, it is crucial to determine whether this aligns with the temporal progression and spatial distribution of tau pathology in the brains of patients with tauopathies.

      (2) The authors did not elucidate the specific molecular mechanism by which Hsp70 promotes tau aggregation.

      (3) Some figures in this study show large error bars in the quantitative data (some statistical analysis figures, MEA recordings, etc.), indicating significant inter-sample variability. It is recommended to label individual data points in all quantitative figures and clearly indicate them in figure legends.

    4. Author response:

      Reviewer #1

      (1) Mechanistic insight into how Hsp70 but not Hsc70 increases PL-SF FL tau aggregation/pathology is missing. This is despite both chaperones binding to PL-SF FL tau. What species of tau does Hsp70 bind, and what cofactors are important in this process?

      We agree that explaining why Hsp70, but not Hsc70, promotes tau aggregation would strengthen the study. Although both chaperones bind tau, they diverge slightly in 1) protein sequence, 2) biochemical activity, and 3) co-chaperone engagement.

      Sequence: Hsp70 has an extra cysteine residue (Cys306) that is highly reactive to oxidation and a glycine residue that is critical for cysteine oxidation (Gly557). Both residues are specific to Hsp70 (not present in Hsc70) and may alter Hsp70 conformation or client handling (Hong et al., 2022).

      Biochemical activity: Prior studies indicate that Hsp70’s ATPase domain (NBD) is critical for tau interactions (Jinwal et al., 2009; Fontaine et al., 2015; Young et al., 2016) and can be disrupted with point mutations including K71E and E175S for ATPase and A406G/V438G for substrate binding (Fontaine et al., 2015).

      Co-chaperone engagement: Hsp70 recruits the co-chaperone and E3 ubiquitin ligase CHIP/Stub1 more strongly than Hsc70, suggesting co-chaperone engagement could lead to differences in tau processing (Jinwal et al., 2013).

      To directly test how the two closely related chaperones could differentially impact tau, we plan to perform the following experiments:

      (a) We will mutate residues responsible for cysteine reactivity in Hsp70 including the cysteine itself (Cys306) and the critical glycine that facilitates cysteine reactivity (Gly557). These residues will be deleted from Hsp70 or alternatively inserted into Hsc70 to determine whether cysteine reactivity is the reason for Hsp70’s ability to drive tau aggregation.

      (b) We will generate Hsp70 mutants lacking ATPase or substrate-binding activity to determine which Hsp70 domains are responsible for driving tau aggregation.

      (c) We will perform seeding assays in stable tau-expressing cell lines to determine whether Hsp70/Hsc70 overexpression or depletion alters seeded tau aggregation.

      (d) We will perform confocal microscopy to determine the extent of co-localization of Hsp70 or Hsc70 with phospho-tau, oligomeric tau, or Thioflavin-S (ThioS) to identify which tau species are engaged by Hsp70/Hsc70.

      (e) We will perform immunoprecipitation pull-downs followed by mass spectrometry to globally identify any relevant Hsp70/Hsc70 interacting factors that might account for the differences in tau aggregation.

      (2) The study relies heavily on densitometry of bands to draw conclusions; in several instances, the blots are too overexposed to accurately quantify the signal.

      All immunoblots were acquired as 16-bit TIFFs with exposure settings chosen to prevent pixel saturation, and quantification was performed on raw, unsaturated images. Brightness and contrast adjustments were applied only for visualization and did not alter pixel values used for analysis. All quantified bands fell within the linear range of the detector, with one exception in Figure 7B, which we removed from quantification. We will add both low- and high-exposure versions of immunoblots to the revised figures to demonstrate signal linearity and dynamic range.

      Reviewer #2

      (1) Although the PL-SF model can accelerate tau aggregation, it is crucial to determine whether this aligns with the temporal progression and spatial distribution of tau pathology in the brains of patients with tauopathies.

      No single tauopathy model fully recapitulates the temporal and spatial progression of human tauopathies. The PL-SF system is not intended to model the disease course. Rather, it is an excellent model for mechanistic studies of mature tau aggregation, which is otherwise challenging to study. We note that prior studies showed that PL-SF tau expression in transgenic mice (Xia et al., 2022 and Smith et al., 2025) and rhesus monkeys (Beckman et al., 2021) led to prion-like tau seeding and aggregation in hippocampal and cortical regions. Indeed, the spatial and temporal tau aggregation patterns aligned with features of human tauopathies. So far, these findings all support PL-SF as a valid accelerated model of tauopathy that can be used to interrogate pathogenic mechanisms that impact tau processing, degradation, and/or aggregation.

      (2) The authors did not elucidate the specific molecular mechanism by which Hsp70 promotes tau aggregation.

      We agree that a deeper understanding of the molecular mechanism is needed. The revision experiments outlined above (Reviewer #1, point #1) will define how Hsp70 promotes tau aggregation by testing sequence contributions, dissecting ATPase and substrate-binding domain requirements, and mapping Hsp70/Hsc70 interactors to directly address this mechanistic question.

      (3) Some figures in this study show large error bars in the quantitative data (some statistical analysis figures, MEA recordings, etc.), indicating significant inter-sample variability. It is recommended to label individual data points in all quantitative figures and clearly indicate them in figure legends.

      We acknowledge the inter-sample variability in some of the quantitative datasets. This level of variability can occur in primary neuronal cultures (e.g., MEA recordings) that are sensitive to growth and surface adhesion conditions, leading to many technical considerations. To improve transparency and interpretation, we will revise all quantitative figures to display individual data points overlaid on summary statistics and will update figure legends to clearly indicate sample sizes and statistical tests used.

      References

      Hong Z, Gong W, Yang J, Li S, Liu Z, Perrett S, Zhang H. Exploration of the cysteine reactivity of human inducible Hsp70 and cognate Hsc70. J Biol Chem. 2023 Jan;299(1):102723. doi: 10.1016/j.jbc.2022.102723. Epub 2022 Nov 19. PMID: 36410435; PMCID: PMC9800336.

      Jinwal UK, Miyata Y, Koren J 3rd, Jones JR, Trotter JH, Chang L, O'Leary J, Morgan D, Lee DC, Shults CL, Rousaki A, Weeber EJ, Zuiderweg ER, Gestwicki JE, Dickey CA. Chemical manipulation of hsp70 ATPase activity regulates tau stability. J Neurosci. 2009 Sep 30;29(39):12079-88. doi: 10.1523/JNEUROSCI.3345-09.2009. PMID: 19793966; PMCID: PMC2775811.

      Fontaine SN, Rauch JN, Nordhues BA, Assimon VA, Stothert AR, Jinwal UK, Sabbagh JJ, Chang L, Stevens SM Jr, Zuiderweg ER, Gestwicki JE, Dickey CA. Isoform-selective Genetic Inhibition of Constitutive Cytosolic Hsp70 Activity Promotes Client Tau Degradation Using an Altered Co-chaperone Complement. J Biol Chem. 2015 May 22;290(21):13115-27. doi: 10.1074/jbc.M115.637595. Epub 2015 Apr 11. PMID: 25864199; PMCID: PMC4505567.

      Young ZT, Rauch JN, Assimon VA, Jinwal UK, Ahn M, Li X, Dunyak BM, Ahmad A, Carlson G, Srinivasan SR, Zuiderweg ERP, Dickey CA, Gestwicki JE. Stabilizing the Hsp70‑Tau Complex Promotes Turnover in Models of Tauopathy. Cell Chem Biol. 2016 Aug 4;23(8):992–1001. doi:10.1016/j.chembiol.2016.04.014.

      Jinwal UK, Akoury E, Abisambra JF, O'Leary JC 3rd, Thompson AD, Blair LJ, Jin Y, Bacon J, Nordhues BA, Cockman M, Zhang J, Li P, Zhang B, Borysov S, Uversky VN, Biernat J, Mandelkow E, Gestwicki JE, Zweckstetter M, Dickey CA. Imbalance of Hsp70 family variants fosters tau accumulation. FASEB J. 2013 Apr;27(4):1450-9. doi: 10.1096/fj.12-220889. Epub 2012 Dec 27. PMID: 23271055; PMCID: PMC3606536.

      Xia, Y., Prokop, S., Bell, B.M. et al. Pathogenic tau recruits wild-type tau into brain inclusions and induces gut degeneration in transgenic SPAM mice. Commun Biol 5, 446 (2022). https://doi.org/10.1038/s42003-022-03373-1.

      Smith ED, Paterno G, Bell BM, Gorion KM, Prokop S, Giasson BI. Tau from SPAM Transgenic Mice Exhibit Potent Strain-Specific Prion-Like Seeding Properties Characteristic of Human Neurodegenerative Diseases. Neuromolecular Med. 2025 May 30;27(1):44. doi: 10.1007/s12017-025-08850-4. PMID: 40447946; PMCID: PMC12125038.

      Beckman D, Chakrabarty P, Ott S, Dao A, Zhou E, Janssen WG, Donis-Cox K, Muller S, Kordower JH, Morrison JH. A novel tau-based rhesus monkey model of Alzheimer's pathogenesis. Alzheimers Dement. 2021 Jun;17(6):933-945. doi: 10.1002/alz.12318. Epub 2021 Mar 18. PMID: 33734581; PMCID: PMC8252011.

    1. eLife Assessment

      This study investigates the role of developmental oligodendrocytes in synchronising spontaneous activity in neuronal circuits and influencing cerebellar-dependent behaviour. The authors use advanced viral targeting techniques to deplete oligodendrocytes in a cell-specific manner, paired with in vivo calcium imaging of Purkinje cells, to establish a relationship between oligodendrocyte-mediated neuronal synchrony and complex brain function. The authors present compelling evidence of oligodendrocyte-regulated neuronal synchrony. Overall, this manuscript holds promise as an important contribution to neurodevelopment research.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents convincing findings that oligodendrocytes play a regulatory role in spontaneous neural activity synchronization during early postnatal development, with implications for adult brain function. Utilizing targeted genetic approaches, the authors demonstrate how oligodendrocyte depletion impacts Purkinje cell activity and behaviors dependent on cerebellar function. Delayed myelination during critical developmental windows is linked to persistent alterations in neural circuit function, underscoring the lasting impact of oligodendrocyte activity.

      Strengths:

      (1) The research leverages the anatomically distinct olivocerebellar circuit, a well-characterized system with known developmental timelines and inputs, strengthening the link between oligodendrocyte function and neural synchronization.

      (2) Functional assessments, supported by behavioral tests, validate the findings of in vivo calcium imaging, enhancing the study's credibility.

      (3) Extending the study to assess long-term effects of early life myelination disruptions adds depth to the implications for both circuit function and behavior.

      Weaknesses:

      (1) The study would benefit from a closer analysis of myelination during the periods when synchrony is recorded. Direct correlations between myelination and synchronized activity would substantiate the mechanistic link and clarify if observed behavioral deficits stem from altered myelination timing.

      (2) Although the study focuses on Purkinje cells in the cerebellum, neural synchrony typically involves cross-regional interactions. Expanding the discussion on how localized Purkinje synchrony affects broader behaviors - such as anxiety, motor function, and sociality - would enhance the findings' functional significance.

      (3) The authors discuss the possibility of oligodendrocyte-mediated synapse elimination as a possible mechanism behind their findings, drawing from relevant recent literature on oligodendrocyte precursor cells. However, there are no data presented supporting these assumptions. The authors should explain why they think the mechanism behind their observation extends beyond the contribution of myelination or remove this point from the discussion entirely.

      Comment for resubmission: Although the argument on synaptic elimination has been removed, it has been replaced with similarly unclear speculation about roles for oligodendrocytes outside of conventional myelination or metabolic support, again without clear evidence. The authors measured MBP area but have not performed detailed analysis of oligodendrocyte biology to support the claims made in the discussion. Please consider removing this section or rephrasing it to align with the data presented.

      (4) It would be valuable to investigate secondary effects of oligodendrocyte depletion on other glial cells, particularly astrocytes or microglia, which could influence long-term behavioral outcomes. Identifying whether the lasting effects stem from developmental oligodendrocyte function alone or also involve myelination could deepen the study's insights.

      (5) The authors should explore the use of different methods to disturb myelin production for a longer time, in order to further determine if the observed effects are transient or if they could have longer-lasting effects.

      (6) Throughout the paper, there are concerns about statistical analyses, particularly on the use of the Mann-Whitney test or using fields of view as biological replicates.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This study presents convincing findings that oligodendrocytes play a regulatory role in spontaneous neural activity synchronisation during early postnatal development, with implications for adult brain function. Utilising targeted genetic approaches, the authors demonstrate how oligodendrocyte depletion impacts Purkinje cell activity and behaviours dependent on cerebellar function. Delayed myelination during critical developmental windows is linked to persistent alterations in neural circuit function, underscoring the lasting impact of oligodendrocyte activity. 

      Strengths: 

      (1) The research leverages the anatomically distinct olivocerebellar circuit, a well-characterized system with known developmental timelines and inputs, strengthening the link between oligodendrocyte function and neural synchronization. 

      (2) Functional assessments, supported by behavioral tests, validate the findings of in vivo calcium imaging, enhancing the study's credibility. 

      (3) Extending the study to assess the long-term effects of early-life myelination disruptions adds depth to the implications for both circuit function and behavior.

      We appreciate this positive evaluation.

      Weaknesses: 

      (1) The study would benefit from a closer analysis of myelination during the periods when synchrony is recorded. Direct correlations between myelination and synchronized activity would substantiate the mechanistic link and clarify if observed behavioral deficits stem from altered myelination timing. 

      We appreciate the reviewer’s thoughtful suggestion and have expanded the manuscript to clarify how oligodendrocyte maturation relates to the development of Purkinje-cell synchrony. The developmental trajectory of Purkinje-cell synchrony has already been comprehensively characterized by Good et al. (2017, Cell Reports 21: 2066–2073): synchrony drops from a high level at P3–P5 to adult-like values by P8. We found that myelination in the cerebellum starts to appear from P5-P7 (Figure S1A, B), indicating that the timing of Purkinje cell desynchronization coincides with the initial appearance of oligodendrocytes and myelin in the cerebellum. To determine whether myelin growth could nevertheless modulate this process, we quantified ASPA-positive oligodendrocyte density and MBP-positive bundle thickness and area at P10, P14, P21 and adulthood (Fig. 1J, K, Fig. S1E). Both metrics increase monotonically and clearly lag behind the rapid drop in synchrony, indicating that myelination may not be the primary trigger for the desynchronization. When oligodendrocytes were ablated during the second postnatal week, the synchrony was reduced (new Fig. 2). Thus, once myelination is underway, oligodendrocytes become critical for maintaining the synchrony, acting not as the initiators but as the stabilizers and refiners of the mature network state.

      We have now added a new subsection to the Discussion (lines 451–467) in which we propose a two-phase model. Phase I (P3–P8): High early synchrony is generated by non-myelin mechanisms (e.g. transient gap junctions, shared climbing-fiber input). Phase II (P8 onward): As oligodendrocytes proliferate and ensheath axons, they fine-tune conduction velocity and stabilize the mature, low-synchrony network state.

      We believe these additions fully address the reviewer’s concerns.

      (2) Although the study focuses on Purkinje cells in the cerebellum, neural synchrony typically involves cross-regional interactions. Expanding the discussion on how localized Purkinje synchrony affects broader behaviors - such as anxiety, motor function, and sociality - would enhance the findings' functional significance.

      We appreciate the reviewer’s helpful suggestion and have expanded the Discussion (lines 543–564) to clarify how localized Purkinje-cell synchrony can influence broader behavioral domains. In the revised text we note that changes in PC synchrony propagate to thalamic, prefrontal, limbic, and parietal targets, thereby impacting distributed networks involved in motor coordination, affect, and social interaction. Our optogenetic rescue experiments further support this framework, as transient resynchronization of PCs normalized sociability and motor coordination while leaving anxiety-like behavior impaired. This dissociation highlights that different behavioral domains rely to varying degrees on precise cerebellar synchrony and underscores how even localized perturbations in Purkinje timing can acquire system-level significance.

      (3) The authors discuss the possibility of oligodendrocyte-mediated synapse elimination as a possible mechanism behind their findings, drawing from relevant recent literature on oligodendrocyte precursor cells. However, there are no data presented supporting this assumption. The authors should explain why they think the mechanism behind their observation extends beyond the contribution of myelination or remove this point from the discussion entirely.

      We thank the reviewer for pointing out that our original discussion of oligodendrocyte-mediated synapse elimination was not directly supported by data in the present manuscript. Because we are actively analyzing this question in a separate, follow-up study, we have deleted the speculative passage to keep the current paper focused on the demonstrated, myelination-dependent effects. We believe this change sharpens the mechanistic narrative and fully addresses the reviewer’s concern.

      (4) It would be valuable to investigate the secondary effects of oligodendrocyte depletion on other glial cells, particularly astrocytes or microglia, which could influence long-term behavioral outcomes. Identifying whether the lasting effects stem from developmental oligodendrocyte function alone or also involve myelination could deepen the study's insights. 

      We thank the reviewer for raising this point and have performed the requested analyses. Using IBA1 immunostaining for microglia and S100β immunostaining for Bergmann glia, we quantified cell density and marker signal intensity at P14 and P21. Neither microglial nor Bergmann-glial measures differed between control and oligodendrocyte-ablated mice at either time point (new Figure S2). These results indicate that the behavioral phenotypes we report are unlikely to arise from secondary activation or loss of other glial populations.

      We have now added these results (lines 275–286) and also discuss myelination and other oligodendrocyte functions (lines 443–450). It remains difficult to disentangle conduction-related effects from myelination-independent trophic roles of oligodendrocytes. We therefore note explicitly that future work employing stage-specific genetic tools or acute metabolic manipulations will be required to parse these contributions more definitively.

      (5) The authors should explore the use of different methods to disturb myelin production for a longer time, in order to further determine if the observed effects are transient or if they could have longer-lasting effects.

      We agree that distinguishing transient from enduring effects is critical. Importantly, our original submission already included data demonstrating a persistent deficit of PC population synchrony (Fig. 4, previous Fig. 3): (i) at P14—the early age after oligodendrocyte ablation—population synchrony is reduced, and (ii) the same deficit is still present in adults (P60–P70) despite full recovery of ASPA-positive cell density and MBP-area and -thickness (Fig. 2H-K, Fig. S1E, and Fig. 4). We also performed the ablation of oligodendrocytes after the third postnatal week. Despite a similar acute drop in ASPA-positive cells, neither population synchrony nor anxiety-, motor-, or social behaviors differed from littermate controls. Thus, extending myelin disruption beyond the developmental window does not exacerbate or prolong the phenotype, whereas a short perturbation within that window leaves a permanent timing defect. These findings strengthen our conclusion that it is the developmental oligodendrocyte/myelination program itself—rather than ongoing adult myelin production—that is essential for establishing stable network synchrony. We now highlight this point explicitly in the revised Discussion (lines 507–522).

      (6) Throughout the paper, there are concerns about statistical analyses, particularly on the use of the Mann-Whitney test or using fields of view as biological replicates.

      We appreciate the reviewer’s guidance on appropriate statistical treatment. To address these concerns we have re-analyzed all datasets that contained multiple measurements per animal (e.g., fields of view, lobules, or trials) using nested statistics with animal as the higher-order unit. Specifically, we applied a two-level nested ANOVA when more than two groups were compared and a nested t-test when two conditions were present. The re-analysis confirmed all original conclusions. Because the nested models yielded comparable effect sizes to the Mann–Whitney tests, we have retained the mean ± SEM for ease of comparison with prior literature but now also report all values for each mouse in Table 1. In cases where a single measurement per mouse was compared between two groups, we used the Mann–Whitney test and present the results in the graphs as median values.

      Major

      (1) The authors present compelling evidence that early loss of myelination disrupts synchronous firing prematurely. However, synchronous neuronal firing does not equate to circuit synchronization. It is improbable that myelination directly generates synchronous firing in Purkinje cells (PCs). For instance, Foran et al. (1992) identified that cerebellar myelination begins around postnatal day 6 (P6), while Good et al. (2017) recorded a developmental decline in PC activity correlation from P5-P11. To clarify myelin's role, we recommend detailed myelin imaging through light microscopy (MBP staining at higher magnification) to assess the extent of myelin removal accurately. Myelin sheaths, as shown by Snaidero et al. (2020), can persist after oligodendrocyte (OL) death, particularly following DTA induction (Pohl et al. 2011). Quantification of MBP+ area, rather than mean MBP intensity, is necessary to accurately measure myelin coverage.

      We appreciate the reviewer’s concern that residual sheaths might remain after oligodendrocyte ablation and have therefore re-examined myelin at higher spatial resolution. Two independent metrics were extracted: MBP⁺ area fraction in the white matter and MBP⁺ bundle thickness (new Figure 1J, K, and Fig. S1E). We confirm a robust, transient loss of myelin at P10 and P14, as shown by the reduction of MBP⁺ area and MBP⁺ bundle thickness. Both parameters recovered to control values by P21 and adulthood, indicating effective remyelination. These data demonstrate that, in our paradigm, oligodendrocyte ablation is accompanied by substantial sheath loss rather than the persistent myelin reported after acute toxin exposure. We have added these data to the Results (lines 266–271).

      The results reinforce the view that myelin removal and/or loss of trophic support during a narrow developmental window drive the long-term hyposynchrony and behavioral phenotypes we report. We have now added a new subsection to the Discussion (lines 443–450), in which we propose a two-phase model. Phase I (P3–P8): High early synchrony is generated by non-myelin mechanisms (e.g. transient gap junctions, shared climbing-fiber input). Phase II (P8 onward): As oligodendrocytes proliferate and ensheath axons, they fine-tune conduction velocity and stabilize the mature, low-synchrony network state. We believe these additions fully address the reviewer’s concerns.

      (2) Surprisingly, the authors speculate about oligodendrocyte-mediated synaptic pruning without supportive data, shifting the focus away from the potential impact of myelination. Even if OLs perform synaptic pruning, OL depletion would likely maintain synchrony, yet the opposite was observed. Further characterisation of the model and the potential source of the effect is needed. 

      We thank the reviewer for pointing out that our original discussion of oligodendrocyte-mediated synapse elimination was not directly supported by data in the present manuscript. Because we are actively analyzing this question in a separate, follow-up study, we have deleted the speculative passage to keep the current paper focused on the demonstrated, myelination-dependent effects. We believe this change sharpens the mechanistic narrative and fully addresses the reviewer’s concern.

      (3) Improved characterization of the DTA model would add clarity. Although almost all infected cells are reported as OLs, quantification of infected OL-lineage cells (e.g., via Olig2 staining) would verify this. It remains possible that observed activity changes are driven by OL-independent demyelination effects. We suggest cross-staining with Iba1 and GFAP to rule out inflammation or gliosis. 

      We thank the reviewer for this important suggestion and have expanded our histological characterization accordingly. First, to verify that DTA expression is confined to mature oligodendrocytes, we co-stained cerebellar sections collected 7 days after AAV-hMAG-mCherry injection with Olig2 (pan-OL lineage) and ASPA (mature OL marker), as shown in Figure S1C-D. Quantitative analysis revealed that 100% of mCherry⁺ cells were Olig2⁺/ASPA⁺, whereas mCherry signal was virtually absent in Olig2⁺/ASPA⁻ immature oligodendrocytes. These data confirm that our DTA manipulation targets mature myelinating OLs rather than earlier lineage stages. We have added these data to the Results (lines 260–262).

      Second, to examine indirect effects mediated by other glia, we performed cross-staining with IBA1 (microglia) and S100β (Bergmann glia). Cell density and fluorescence intensity for each marker were indistinguishable between control and DTA groups at P14 and P21 (Figure S2A-H). Thus, neither inflammation nor astro-/microgliosis accompanies OL ablation. We have added these data to the Results (lines 275–286).

      Collectively, these results demonstrate that the observed desynchronization and behavioral phenotypes arise from specific loss of mature oligodendrocytes and their myelin, rather than from off-target viral expression or secondary glial responses.

      (4) An independent model of myelin loss, such as the inducible Myrf knockout mouse with a MAG promoter, would help assess whether oligodendrocyte loss causes temporary or sustained impacts; an extended knockout model, such as Myrf cKO with MAG-Cre viral methods, would be advantageous.

      We agree that distinguishing transient from enduring effects is critical. Importantly, our original submission already included data demonstrating a persistent deficit of PC population synchrony (Fig. 4, previous Fig. 3): (i) at P13-15—the early age after oligodendrocyte ablation—population synchrony is reduced, and (ii) the same deficit is still present in adults (P60–P70) despite full recovery of ASPA-positive cell density and MBP-area and -thickness (Fig. 2H-K, Fig. S1E, and Fig. 4). We also performed the ablation of oligodendrocytes after the third postnatal week. Despite a similar acute drop in ASPA-positive cells, neither population synchrony nor anxiety-, motor-, or social behaviors differed from littermate controls. Thus, extending myelin disruption beyond the developmental window does not exacerbate or prolong the phenotype, whereas a short perturbation within that window leaves a permanent timing defect. These findings strengthen our conclusion that it is the developmental oligodendrocyte/myelination program itself—rather than ongoing adult myelin production—that is essential for establishing stable network synchrony. We now highlight this point explicitly in the revised Discussion (lines 507–522).

      (5) For statistical robustness, the use of non-parametric tests (Mann-Whitney) necessitates reporting the median instead of the mean as the authors do. Furthermore, as repeated measurements within the same animal are not independent, the authors should ideally use nested ANOVA (or nested t-test comparing two conditions) to validate their findings (Aarts et al., Nat. Neuroscience 2014). Alternatively use one-way ANOVA with each animal as a biological replicate, although this means that the distribution in the data sets per animal is lost.

      We appreciate the reviewer’s guidance on appropriate statistical treatment. To address these concerns we have re-analyzed all datasets that contained multiple measurements per animal (e.g., fields of view, lobules, or trials) using nested statistics with animal as the higher-order unit. Specifically, we applied a two-level nested ANOVA when more than two groups were compared and a nested t-test when two conditions were present. The re-analysis confirmed all original conclusions. Because the nested models yielded comparable effect sizes to the Mann–Whitney tests, we have retained the mean ± SEM for ease of comparison with prior literature but now also report all values for each mouse in Table 1. In cases where a single measurement per mouse was compared between two groups, we used the Mann–Whitney test and present the results in the graphs as median values.
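
      To make the replicate issue concrete, here is a minimal sketch of the simpler alternative mentioned in the reviewer's comment: collapsing repeated measurements to one value per animal before comparing groups. It is not the nested ANOVA or nested t-test used in the revision, and the function names, group labels, and synchrony values are all hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def per_animal_means(records):
    """Collapse repeated measurements to one mean per animal.

    records: iterable of (group, animal_id, value) tuples.
    Returns {group: [animal mean, ...]} with animals in sorted order.
    """
    by_animal = defaultdict(list)
    for group, animal, value in records:
        by_animal[(group, animal)].append(value)

    by_group = defaultdict(list)
    for (group, _animal), values in sorted(by_animal.items()):
        by_group[group].append(mean(values))
    return dict(by_group)


def welch_t(a, b):
    """Welch's t statistic computed on two lists of per-animal means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical synchrony values: 3 animals per group, 2 fields of view each.
records = [
    ("control", "c1", 0.4), ("control", "c1", 0.6),
    ("control", "c2", 0.5), ("control", "c2", 0.7),
    ("control", "c3", 0.6), ("control", "c3", 0.8),
    ("dta", "d1", 0.2), ("dta", "d1", 0.4),
    ("dta", "d2", 0.3), ("dta", "d2", 0.5),
    ("dta", "d3", 0.4), ("dta", "d3", 0.6),
]

animal_means = per_animal_means(records)
# The effective sample size is 3 animals per group, not 6 fields of view.
print(animal_means)
print(welch_t(animal_means["control"], animal_means["dta"]))
```

      As the reviewer notes, averaging within each animal discards the within-animal distribution; a nested model keeps that information while still treating the animal as the unit of replication.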

      Minor Points 

      (1) In all figures, please specify the ages at which each procedure was conducted, as demonstrated in Figure 2A.

      All main and supplementary figures now specify the exact postnatal age.

      (2) Clarify the selection criteria for regions of interest (ROI) in calcium imaging, and provide representative ROIs.

      We appreciate the reviewer’s guidance. We have clarified that our ROI detection followed the protocol reported in our previous paper (Tanigawa et al., 2024, Communications Biology) (lines 177-178), and representative Purkinje cell ROIs are now shown in Fig. 2B.

      (3) Include data on the proportion of climbing fiber or inferior olive neurons expressing Kir and the total number of neurons transfected, which would help contextualize the observed effects on PC synchronization and its broader implications for cerebellar circuit function.

      We appreciate the reviewer’s guidance. New Fig. 7C summarizes the efficiency of AAV-GFP and AAV-Kir2.1-GFP injections into the inferior olive. Across 4 mice, GFP-labeled CFs were detected on 19.3 ± 11.9% (mean ± S.D.) of PCs for control and 26.2 ± 11.8% (mean ± S.D.) of PCs for Kir2.1. These numbers are reported in the Results (lines 373–375).

      (4) Higher magnification images in Figures 1 and S3 would improve visual clarity. 

      We have addressed the request for higher-magnification images in two ways. First, all panels in Figure S3 were placed on a larger canvas. Second, in Figure 1 we adjusted panel sizes to emphasize fine structure: panel 1C already represents an enlargement of the RFP-positive cells shown in 1B, and panels 1H and 1J now occupy a wider span so that every ASPA-positive cell body can be distinguished. Should the reviewer still require an even closer view, we have additional images ready for upload.

      (5) Consider language editing to enhance overall clarity and readability.

      The entire manuscript was edited to improve flow, consistency, and readability.

      (6) Refine the discussion to align with the presented data.

      We have refined the discussion.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

      Reviewer #2 (Public review):

      We appreciate Reviewer #2’s positive evaluation of our work and thank him/her for the constructive suggestions and comments. We followed these suggestions and comments and have conducted additional experiments. We have rewritten the manuscript and revised the figures according to the points Reviewer #2 mentioned. Our point-by-point responses to the comments are as follows.

      Summary:

      In this manuscript, the authors use genetic tools to ablate oligodendrocytes in the cerebellum during postnatal development. They show that the oligodendrocyte numbers return to normal post-weaning. Yet, the loss of oligodendrocytes during development seems to result in decreased synchrony of calcium transients in Purkinje neurons across the cerebellum. Further, there were deficits in social behaviors and motor coordination. Finally, they suppress activity in a subset of climbing fibers to show that it results in similar phenotypes in the calcium signaling and behavioral assays. They conclude that the behavioral deficits in the oligodendrocyte ablation experiments must result from loss of synchrony. 

      Strengths:

      Use of genetic tools to induce perturbations in a spatiotemporally specific manner.

      We appreciate this positive evaluation.

      Weaknesses: 

      The main weakness in this manuscript is the lack of a cohesive causal connection between the experimental manipulation performed and the phenotypes observed. Though they have taken great care to induce oligodendrocyte loss specifically in the cerebellum and at specific time windows, the subsequent experiments do not address specific questions regarding the effect of this manipulation.

      Calcium transients in Purkinje neurons are caused to a large extent by climbing fibers, but there is evidence for simple spikes to also underlie the dF/F signatures (Ramirez and Stell, Cell Reports, 2016).

      We thank the reviewer for drawing attention to the work of Ramirez & Stell (2016), which showed that simple-spike bursts can elicit Ca²⁺ rises, but only in the soma and proximal dendrites of adult Purkinje cells. In our study, Regions of Interest were restricted to the dendritic arbor, where SS-evoked signals are essentially undetectable (Ramirez & Stell, 2016), whereas climbing-fiber complex spikes generate large, all-or-none transients (Good et al., 2017). Accordingly, even if a rare SS-driven event reached threshold it would likely fall outside our ROIs.

      In addition, we directly imaged CF population activity by expressing GCaMP7f in inferior-olive neurons. Correlation analysis of CF boutons revealed that DTA ablation lowers CF–CF synchrony at P14 (new Fig. 3A–D). Because CF synchrony is a principal driver of Purkinje-cell co-activation, this reduction provides a mechanistic link between oligodendrocyte loss and the hyposynchrony we observe among Purkinje cells. Consistent with this interpretation, electrophysiological recordings showed that parallel-fiber EPSCs and inhibitory synaptic inputs onto Purkinje cells were unchanged by DTA treatment (Fig. 3E-H), which makes it less likely that the reduced synchrony simply reflects changes in other excitatory or inhibitory synaptic drive.
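
      For readers unfamiliar with correlation-based synchrony measures, the following is a minimal sketch of one common way to summarize population synchrony: the mean pairwise Pearson correlation across activity traces. The response does not specify the exact metric used for the CF bouton analysis, so this is an illustrative assumption, and the function names and toy traces are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length traces (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def mean_pairwise_correlation(traces):
    """Average Pearson correlation over all unique pairs of traces."""
    pairs = list(combinations(traces, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy traces: events occur in the same frames for every ROI.
synchronous = [
    [0, 1, 0, 0, 1, 0],
    [0, 2, 0, 0, 2, 0],
    [0, 1, 0, 0, 1, 0],
]

# Toy traces: each ROI fires in a different frame.
desynchronized = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]

print(mean_pairwise_correlation(synchronous))
print(mean_pairwise_correlation(desynchronized))
```

      In this toy case the co-active traces give a much higher index than the traces with non-overlapping events, which is the direction of change reported as reduced synchrony after ablation.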

      That said, SS-dependent somatic Ca²⁺ signals could still influence downstream plasticity and long-term cerebellar function. In future work we therefore plan to combine somatic imaging with stage-specific oligodendrocyte manipulations to test whether SS-evoked Ca²⁺ dynamics are themselves modulated by oligodendrocyte support. We have added these descriptions in the Results (lines 288–294) and Discussion (lines 423–434).

      Also, it is erroneous to categorize these calcium signals as signatures of "spontaneous activity" of Purkinje neurons as they can have dual origins.

      Thank you for pointing out the potential ambiguity. In the revised manuscript we have clarified how we use the term “spontaneous activity” in the context of our measurements (lines 289-290). Our calcium imaging was restricted to the dendritic arbor of Purkinje cells, where calcium transients are dominated by climbing-fiber (CF) inputs (Ramirez & Stell, 2016; Good et al., 2017). Thus, the synchrony values reported here primarily reflect CF-driven complex spikes rather than mixed signals of dual origin. We have revised the Results section accordingly (lines 289–293) to make this measurement-specific limitation explicit.

      Further, the effect of developmental oligodendrocyte ablation on the cerebellum has been previously reported by Mathis et al., Development, 2003. They report very severe effects such as the loss of molecular layer interneurons, stunted Purkinje neuron dendritic arbors, abnormal foliations, etc. In this context, it is hardly surprising that one would observe a reduction of synchrony in Purkinje neurons (perhaps due to loss of synaptic contacts, not only from CFs but also from granule cells).

      We appreciate the reviewer’s comparison to Mathis et al. (2003). Mathis et al. used MBP–HSV-TK transgenic mice in which systemic FIAU treatment eliminates oligodendrocytes. When ablation began at P1, they observed severe dysmorphology—loss of molecular-layer interneurons, Purkinje-cell (PC) dendritic stunting, and abnormal foliation. Crucially, however, the same study reports that starting the ablation later (FIAU from P6-P20) left cerebellar cyto-architecture entirely normal.

      Our AAV MAG-DTA paradigm resembles this later window. Our temporally restricted DTA protocol produces the same ‘late-onset’ profile: robust yet reversible hypomyelination with no loss of Purkinje cells, interneurons, dendritic length, or synaptic input (new Fig. S1–S2, Fig. 3E-H). The enduring hyposynchrony we report therefore cannot be attributed to the dramatic anatomical defects seen after ablation from P1, but instead reveals a specific requirement for early-postnatal myelin in stabilizing PC synchrony, especially CF-CF synchrony.

      This clarification shows that we have carefully considered the Mathis model and that our findings not only replicate but also extend the earlier work. We have added these descriptions to the Results (lines 273–286).

      The last experiment with the expression of Kir2.1 in the inferior olive is hardly convincing.

      We appreciate the reviewer’s concern and have reinforced the causal link between Purkinje-cell synchrony and behavior. To test whether restoring PC synchrony is sufficient to rescue behavior, we introduced a red-shifted opsin (AAV-L7-rsChrimine) into PCs of DTA mice raised to adulthood. During testing we delivered 590-nm light pulses (10 ms, 1 Hz) to the vermis, driving brief, population-wide spiking (new Fig. 8). This periodic re-synchronization left anxiety measures unchanged (open-field center time remained low) but rescued both motor coordination (rotarod latency normalized to control levels) and sociability (time spent with a novel mouse restored). The dissociation implies that distinct behavioral domains differ in their sensitivity to PC timing precision and confirms that reduced synchrony, rather than cell loss or gross circuit damage (Fig. S1F, S2), is the primary driver of the motor and social deficits. Together, the optogenetic rescue establishes a bidirectional, mechanistic link between PC synchrony and behavior, addressing the reviewer’s reservations about the original experiment. We have added these descriptions to the Results (lines 394–415).

      In summary, while the authors used a specific tool to probe the role of developmental oligodendrocytes in cerebellar physiology and function, they failed to answer specific questions regarding this role, which they could have done with more fine-grained experimental analysis.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Show that ODC loss is specific to the cerebellum.

      We thank the reviewer for requesting additional evidence. To verify that oligodendrocyte ablation was confined to the cerebellum, we injected an AAV carrying mCherry under the human MAG promoter (AAV-hMAG-mCherry) into the cerebellum and screened the whole brain one week later. As shown in the new Figure 1E–G, mCherry-positive cells were present throughout the injected cerebellar cortex (Fig. 1E), but no fluorescent cells were detected in extracerebellar regions, including cerebral cortex, medulla, pons, and midbrain. These data demonstrate that our viral approach is specific to the cerebellum, ruling out off-target demyelination elsewhere in the CNS as a contributor to the behavioral and synchrony phenotypes. We have added these descriptions to the Results (lines 262–264).

      (2) Characterize the gross morphology of the cerebellum at different developmental stages. Are major cell types all present? Major pathways preserved? 

      We thank the reviewer for requesting additional evidence. To ensure that the developmental loss of oligodendrocytes did not globally disturb cerebellar architecture, we performed a comprehensive histological and electrophysiological survey during development. New data are presented (new Fig. S1–S2, Fig. 3E-H).

      (1) Overall morphology. Low-magnification parvalbumin counterstaining revealed similar cerebellar area in DTA versus control mice at every age (Fig. S1F, G).

      (2) Major neuronal classes. Quantification of parvalbumin-positive Purkinje cells and interneurons showed no differences in density between control and DTA (Fig. S2E, H, M, N, P). Purkinje dendritic arbors were not different between control and DTA (Fig. S2G, O).

      (3) Excitatory and inhibitory synaptic inputs. Miniature IPSCs and parallel-fiber EPSCs onto Purkinje cells were quantified; neither differed between groups (Fig. 3E-G).

      (4) Glial populations. IBA1-positive microglia and S100β-positive astrocytes exhibited normal density and marker intensity (Fig. S2).

      Taken together, these analyses show that all major cell types are present at normal density, synaptic inputs from excitatory and inhibitory neurons are preserved, and gross cerebellar morphology is intact after DTA-mediated oligodendrocyte ablation.

      (3) Recording of PNs to see whether the lack of synchrony is due to CFs or simple spikes.

      We thank the reviewer for drawing attention to the work of Ramirez & Stell (2016), which showed that simple-spike bursts can elicit Ca²⁺ rises, but only in the soma and proximal dendrites of adult Purkinje cells. In our study, Regions of Interest were restricted to the dendritic arbor, where SS-evoked signals are essentially undetectable (Ramirez & Stell, 2016), whereas climbing-fiber complex spikes generate large, all-or-none transients (Good et al., 2017). Accordingly, even if a rare SS-driven event reached threshold it would likely fall outside our ROIs.

      In addition, we directly imaged CF population activity by expressing GCaMP7f in inferior-olive neurons. Correlation analysis of CF boutons revealed that DTA ablation lowers CF–CF synchrony at P14 (new Fig. 3A–D). Because CF synchrony is a principal driver of Purkinje-cell co-activation, this reduction provides a mechanistic link between oligodendrocyte loss and the hyposynchrony we observe among Purkinje cells. Consistent with this interpretation, electrophysiological recordings showed that parallel-fiber EPSCs and inhibitory synaptic inputs onto Purkinje cells were unchanged by DTA treatment (Fig. 3E-H), which makes it less likely that the reduced synchrony simply reflects changes in other excitatory or inhibitory synaptic drive.

      That said, SS-dependent somatic Ca²⁺ signals could still influence downstream plasticity and long-term cerebellar function. In future work we therefore plan to combine somatic imaging with stage-specific oligodendrocyte manipulations to test whether SS-evoked Ca²⁺ dynamics are themselves modulated by oligodendrocyte support. We have added these descriptions in the Results (lines 301–312) and Discussion (lines 423–434).

      (4) Is CF synapse elimination altered? Test using evoked EPSCs or staining methods.

      We agree that directly testing whether oligodendrocyte loss disturbs climbing-fiber synapse elimination would provide a full mechanistic picture. We are already quantifying climbing fiber terminal number with vGluT2 immunostaining and recording evoked CF-EPSCs in the same DTA model; these data, together with an analysis of how population synchrony is involved in synapse elimination, will form the basis of a separate manuscript now in preparation. To keep the present paper focused on the phenomena we have rigorously documented—transient oligodendrocyte loss and the resulting long-lasting hyposynchrony and abnormal behaviors—we have removed the speculative sentence on oligodendrocyte-mediated synapse elimination. We believe this revision meets the reviewer’s request without over-extending the current dataset.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

    1. eLife Assessment

      By investigating spine nanostructure and dynamics across multiple genetic mouse models for neurodevelopmental disorders, this important study has the potential to uncover convergent or divergent synaptic phenotypes that may be specifically associated with autism versus schizophrenia risk. While the imaging and breadth are impressive, there are potential methodological concerns, especially around statistical analyses, which render the evidence incomplete and should be addressed. The purely in vitro nature of the study also slightly limits the generalisability of the findings.

    2. Reviewer #1 (Public review):

      Summary:

      Kashiwagi et al. undertook a population analysis of dendritic spine nanostructure applied to the objective grouping of 8 mouse models of neuropsychiatric disorders. They report that spine morphology in cultured hippocampal neurons shows a higher similarity among schizophrenia mouse models (compared with autism spectrum disorder (ASD) mouse models), and identify an effect of Ecrg4 (encoding small secretory peptides) on spine dynamics and shape in these models.

      Strengths:

      The study developed a method for objectively comparing spine properties in primary hippocampal neuron cultures from 8 mouse models of psychiatric disorders at the population level using high-resolution structured illumination microscopy (SIM) imaging. This novel technique identified two distinct groups of mouse models according to the population-level spine properties: those with ASD-related gene mutations and those with schizophrenia-related gene mutations. Functional studies, including gene knockdown and overexpression experiments, identified an effect of Ecrg4 on the spine phenotype of the schizophrenia model mice.

      Weaknesses:

      The main weakness is that the study is wholly in vitro, using cultured hippocampal neurons. The authors present this as an advantage, however, arguing that spine morphology as measured in a reduced culture system can demonstrate direct effects of gene mutations on neuronal phenotypes in the absence of indirect influences from nonneuronal cells or specific environments.

      Another weakness is that CaMKIIαK42R/K42R mutant mice are presented as a schizophrenia model, the authors justifying this by saying that "CaMKII-related signaling pathway disruption has been implicated in the working memory deficits found in schizophrenia patients". Since mutations in CAMK2A cause autosomal dominant intellectual developmental disorder-53 (OMIM 617798) and autosomal recessive intellectual developmental disorder-63 (OMIM 618095), and mice carrying the CAMK2A E183V mutation exhibit ASD-related synaptic and behavioral phenotypes (PMID: 28130356), I think it's stretching credibility to refer to the CaMKIIαK42R/K42R mice as a schizophrenia model.

      Although the manuscript is largely well written, there are some instances of ambiguous/unspecific language. This extends to the title (Decoding Spine Nanostructure in Mental Disorders Reveals a Schizophrenia-Linked Role for Ecrg4), which gives no indication that the work was done in vitro on cultured neurons derived from mouse models.

    3. Reviewer #2 (Public review):

      Okabe and colleagues build on a super-resolution-based technique that they have previously developed in cultured hippocampal neurons, improving the pipeline and using it to analyze spine nanostructure differences across 8 different mouse lines with mutations in autism or schizophrenia (Sz) risk genes/pathways. It is a worthy goal to try to use multiple models to examine potential convergent (or not) phenotypes, and the authors have made a good selection of models. They identify some key differences between the autism and the Sz risk gene models, primarily that dendritic spines are smaller in Sz models and (mostly) larger in autism risk gene models. They then focus on three models (2 Sz - 22q11.2 deletion, Setd1a; 1 ASD - Nlgn3) for time-lapse imaging of spine dynamics, and together with computational modelling provide a mechanistic rationale for the smaller spines in Sz risk models. Bulk RNA sequencing of all 8 model cultures identifies several differentially expressed genes, which they go on to test in cultures, finding that Ecrg4 is upregulated in several Sz models and its misexpression recapitulates spine dynamics changes seen in the Sz mutants, while knockdown rescues spine dynamics changes in the Sz mutants. Overall, these have the potential to be very interesting findings and useful for the field. However, I do have a number of major concerns.

      (1) The main finding of spine nanostructure changes is done by carrying out a PCA on various structural parameters, creating spine density plots across PC1 and PC2, and then subtracting the WT density plot from the mutant. Then, spines in the areas with obvious differences only are analyzed, from which they derive the finding that, for example, spine sizes are smaller. However, this seems a circular approach. It is like first identifying where there might be a difference in the data, then only analyzing that part of the data. I welcome input from a statistician, but to me, this is at best unconventional and potentially misleading. I assume the overall means are not different (although this should be included), but could they look at the distribution of sizes and see if these are shifted?

      (2) Despite extracting 64 parameters describing spine structure, only 5 of these seemed to be used for the PCA. It should be possible to use all parameters and show the same results. More information on PC1 and PC2 would be helpful, given that the rest of the paper is based on these - what features are they related to? These specific features could then be analyzed in the full dataset, without doing the cherry picking above. It would also be helpful to demonstrate whether PC1 and 2 differ across groups - for example, the authors could break their WT data into 2 subsets and repeat the analysis.

      (3) Throughout the paper, the 'n' used for statistical analysis is often spine, which is not appropriate. At a minimum, cell should be used, but ideally a nested mixed model, which would take into account factors like cell, culture, and animal, would be preferable. Also, all of these factors should be listed, with sufficient independent cultures.

      (4) The authors should confirm that all mutants are also on the C57BL/6J background, and clarify whether control cultures are from littermates (this would be important). Also, are control versus mutant cultures done simultaneously? There can be significant batch effects with cultures.

      (5) The spine analysis uses cultures from 18-22 DIV - this is quite a large range. It would be worth checking whether age is a confounder or correlated with any parameters / principal components.

      (6) The computational modelling is interesting, but again, I am concerned about some circularity. Parameter optimization was used to identify the best fit model that replicated the spine turnover rates, so it is somewhat circular to say that this matched the observations when one of these is the turnover rate. It is more convincing for spine density and size, but why not go back and test whether parameter differences are actually seen - for example, it would be possible to extract the probability of nascent spine loss, etc. More compelling would be to repeat the experiments and see if the model still fits the data. In the interpretation (lines 314-318) it is stated that '... reduced spine maturation rate can account for the three key properties of schizophrenia-related spines...', which is interesting if true, but it has just been stated that the probability of spine destabilization is also higher in mutants (line 303) - the authors should test whether all the findings are replicated when the latter is set to be the same as in controls.

      (7) No validation for overexpression or knockdown is shown, although it is mentioned in the methods - please include. Also, for the knockdown, a scrambled shRNA control would be preferable.

      (8) The finding regarding Ecrg4 is interesting, but showing that some Ecrg4 is expressed at boutons and spines and some in DCVs is not enough evidence to suggest that it is actively involved in the regulation of synapse formation and maturation (line 356).

      (9) The same caveats that apply to the analysis also apply to the Ecrg4 rescue. In addition, while for 22q the control shRNA mutant vs WT looks vaguely like Figure 2, Setd1a looks completely different. And if rescued, surely shRNA in the mutant should now resemble control in WT, so there shouldn't be big differences, but in fact, there are just as many differences as when comparing mutant vs wildtype. Plus, for spine features, they only compare mutant rescue with mutant control, but this is not ideal - something more like a 2-way ANOVA is really needed. Maybe input from a statistician might be useful here?

      (10) Although this is a study entirely focused on spine changes in mouse models for Sz, there is no discussion (or citation) of the various studies that have examined this in the literature. For example, for Setd1a, smaller spines or reduced spine densities have been described in various papers (Mukai et al, Neuron 2019; Chen et al, Sci Adv 2022; Nagahama et al, Cell Rep 2020).

      (11) There is a conceptual problem with the models if being used to differentiate autism risk from Sz risk genes. It is difficult to find good mouse models for Sz, so the choice of 22q11.2del and Setd1a haploinsufficiency is completely reasonable. However, these are both syndromic. 22qdel syndrome involves multiple issues, including hearing loss, delayed development, and learning disabilities, and is associated with autism (20% have autism, as compared to 25% with Sz). Similarly, Setd1a is also strongly associated with autism as well as Sz (and also involves global developmental delay and intellectual disability). While I think this is still the best we can do, and it is reasonable to say that these models show biased risk for these developmental disorders, it definitely can't be used as an explanation for the higher variability seen in the autism risk models.

      (12) I am not convinced that using dissociated cultures is 'more likely to reflect the direct impact of schizophrenia-related gene mutations on synaptic properties' - first, cultures do have non-neuronal cells, although here glial proliferation was arrested at 2 days, glia will be present with the protocol used (or if not, this needs demonstrating). Second, activity levels will affect spine size, and activity patterns are very abnormal in dissociated cultures, so it is very possible that spine changes may not translate into in vivo scenarios. Overall, it is a weakness that the dissociated culture system has been used, which is not to say that it is not useful, and from a technical and practical perspective, there are good justifications.

      (13) As a minor comment, the spine time-lapse imaging is a strength of the paper. I wonder about the interpretation of Figure 5. For example, the results in Figure 5G and J look as if they may be more that the spines grow to a smaller size and start from a smaller size, rather than necessarily the rate of growth.
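
      Point (3) above concerns the unit of analysis. As a minimal sketch (not the authors' pipeline), the Python snippet below collapses hypothetical per-spine measurements to one mean per cell, so that n counts cells rather than spines. The record layout (a cell identifier paired with a value), the function name, and the numbers are invented for illustration only. A nested mixed model with culture and animal as further grouping factors would be the fuller treatment the reviewer asks for.

```python
from collections import defaultdict
from statistics import mean

def per_cell_means(spines):
    """Collapse per-spine records to one mean value per cell.

    `spines` is an iterable of (cell_id, value) pairs; the layout is
    hypothetical and chosen only for illustration.
    """
    by_cell = defaultdict(list)
    for cell_id, value in spines:
        by_cell[cell_id].append(value)
    # One summary value per cell: the cell, not the spine, is the replicate.
    return {cell_id: mean(values) for cell_id, values in by_cell.items()}

# Made-up spine volumes (arbitrary units) from three cells.
spines = [
    ("cell_a", 0.10), ("cell_a", 0.14),
    ("cell_b", 0.20), ("cell_b", 0.22), ("cell_b", 0.24),
    ("cell_c", 0.30),
]
cell_means = per_cell_means(spines)
n_spines = len(spines)     # observations if spines are treated as independent
n_cells = len(cell_means)  # replicates after aggregation
```

      Any downstream test would then be run on the per-cell values, which avoids inflating n through pseudoreplication at the cost of discarding within-cell variance.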

    4. Author response:

      Reviewer #1

      (1) The main weakness is that the study is wholly in vitro, using cultured hippocampal neurons.

      We appreciate this reviewer's concern about the limitation of cultured hippocampal neurons in extracting disease-related spine phenotypes. While we fully recognize this limitation, we consider that this in vitro system has several advantages that contribute to translational research on mental disorders.

      First, our culture system has been shown to support the development of spine morphology similar to that of the hippocampal CA1 excitatory synapse in vivo. High-resolution imaging techniques confirmed that the in vitro spine structure was highly preserved compared with in vivo preparations (Kashiwagi et al., Nature Communications, 2019). The present study used the same culture system and SIM imaging. Therefore, the difference we detected in samples derived from disease models is likely to reflect impairment of molecular mechanisms underlying native structural development in vivo.

      Second, super-resolution imaging of thousands of spines in tissue preparations under precisely controlled conditions cannot be practically applied using currently available techniques. The advantage of our imaging and analytical pipeline is its reproducibility, which enabled us to compare the spine population data from eight different mouse models without normalization.

      Third, a reduced culture system can demonstrate the direct effects of gene mutations on synapse phenotypes, independent of environmental influences. This property is highly advantageous for screening chemical compounds that rescue spine phenotypes. Neuronal firing patterns and receptor functions can also be easily controlled in a culture system. The difference in spine structure between ASD and schizophrenia mouse models is valuable information to establish a drug screening system.

      Fourth, establishing an in vitro system for evaluating synapse phenotypes could reduce the need for animal experiments. Researchers should be aware of the 3Rs principles. In the future, combined with differentiation techniques for human iPS cells, our in vitro approach will enable the evaluation of disease-related spine phenotypes without the need for animal experiments. The effort to establish a reliable culture system should not be abandoned.

      (2) Another weakness is that CaMKIIαK42R/K42R mutant mice are presented as a schizophrenia model.

      We agree with this reviewer that CAMK2A mutations in humans are linked to multiple mental disorders, including developmental disorders, ASD, and schizophrenia. Association of gene mutations with the categories of mental disorders is not straightforward, as the symptoms of these disorders also overlap with each other. For the CaMKIIα K42R/K42R mutant, we considered the following points in its characterization as a model of mental disorder. Analysis of CaMKIIα +/- mice in Dr. Tsuyoshi Miyakawa's lab has provided evidence implicating reduced CaMKIIα in schizophrenia-related phenotypes (Yamasaki et al., Mol Brain 2008; Frankland et al., Mol Brain Editorial 2008). It is also known that the CaMKIIα R8H mutation in the kinase domain is linked to schizophrenia (Brown et al., 2021). Both CaMKIIα R8H and CaMKIIα K42R mutations are located in the N-terminal domain and eliminate kinase activity. On the other hand, the representative CaMKIIα E183V mutation identified in ASD patients exhibits unique characteristics, including reduced kinase activity, decreased protein stability and expression levels, and disrupted interactions with ASD-associated proteins such as Shank3 (Stephenson et al., 2017). Importantly, the reduced dendritic spine density in neurons expressing CaMKIIα E183V is opposite to the phenotype of the CaMKIIα K42R/K42R mutant, which showed increased spine density (Koeberle et al. 2017).

      Different CAMK2A mutations likely cause distinct phenotypes observed in the broad spectrum of mental disorders. In the revised manuscript, we will include a discussion of the relevant literature to categorize this mouse model appropriately.

      References related to this discussion.

      (1) Yamasaki et al., Mol Brain. 2008 DOI: 10.1186/1756-6606-1-6

      (2) Frankland et al. Mol Brain. 2008 DOI: 10.1186/1756-6606-1-5

      (3) Stephenson et al., J Neurosci. 2017 DOI: 10.1523/JNEUROSCI.2068-16.2017

      (4) Koeberle et al. Sci Rep. 2017 DOI: 10.1038/s41598-017-13728-y

      (5) Brown et al., iScience. 2021 DOI: 10.1016/j.isci.2021.103184

      Reviewer #2

      We recognize the reviewer's comments as important for improving our manuscript. We outline our general approach to addressing major concerns. Detailed responses to each point, along with additional data, will be provided in a formal revised manuscript.

      (1) Demonstrating the robustness of statistical analyses

      We appreciate this reviewer's concern about our strategies for the quantitative analysis of the large spine population. For the PCA analysis (Point 2), our preliminary results indicated that including all parameters or the selected five parameters did not make a significant difference in the relative placement of spines with specific morphologies in the feature space defined by the principal components. This point will be discussed in the revised manuscript. The potential problem of selecting a particular region within a feature space for spine shape analysis (Point 1) can be addressed by using alternative simulation-based approaches, such as bootstrap or permutation tests. These analyses will be included in the revised manuscript. The use of sample numbers in statistical analyses should align with the analysis's purpose (Point 3). When analyzing the distribution of samples in the feature space, it is necessary to use spine numbers for statistical assessment. We will recheck the statistical methods and apply the appropriate method for each analysis. The spine population data in Figures 2 and 8 cannot be directly compared, as the spine visualization methods differ (Figure 2 with membrane DiI labeling; Figure 8 with cytoplasmic GFP labeling) (Point 9). Spine populations of the same size are inevitably plotted in different feature spaces. This point will be discussed more clearly in the revised manuscript.
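
      The permutation approach mentioned above can be sketched as follows. This is a generic label-shuffling test on the difference in group means, assuming hypothetical per-cell summary values. It is not the authors' analysis, and the function name, the data, and the number of permutations are illustrative assumptions.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up per-cell mean spine volumes (arbitrary units).
wild_type = [0.21, 0.24, 0.22, 0.25, 0.23, 0.26]
mutant = [0.17, 0.19, 0.18, 0.20, 0.16, 0.19]
p = permutation_p_value(wild_type, mutant, n_perm=2000)
```

      Because the null distribution is built from the full dataset rather than from a region chosen after inspecting the density difference, a test of this kind sidesteps the circularity raised in Point 1. The choice of test statistic (here the mean) is itself an assumption and could be replaced by a distributional distance.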

      (2) Clarification of experimental conditions and data reliability

      Per this reviewer's suggestion, we will provide more information on the genetic background of mice and the differences in spine structure from DIV 18-22 (Points 4 and 5). We will also provide additional validation data for the functional analyses using knockdown and overexpression methods, for which we already have preliminary data (Point 7). Concerns about the interpretation of data obtained from in vitro culture (Point 12), raised by this reviewer, are also noted by reviewer #1. As explained in the response to reviewer #1, we intentionally selected an in vitro culture system to analyze multiple samples derived from mouse models of mental disorders for several reasons. Nevertheless, we will revise the discussion and incorporate the points this reviewer raised regarding the disadvantages of in vitro systems.

      (3) Validation of biological mechanisms and interpretation

      In the computational modeling (Point 6), we started from the data of spine turnover (excluding the data of spine volume increase/decrease), fitted the model with the data, and found that the best-fit model showed three features: fast spine turnover, lower spine density, and smaller size of transient spines in schizophrenia mouse models. As the reviewer noted, information about spine turnover is already present in the input data. However, the other two properties are generated independently of the input data, indicating the value of this model. We plan to add additional confirmatory analyses to this model in the revised manuscript.

      In response to Point 8, we will provide supporting data on the functional role of Ecrg4 in synapse regulation. We will also refine our discussion on the ASD and schizophrenia phenotypes based on the suggested literature (Points 10 and 11). Quantification of the initial growth of spines is technically demanding, as it requires higher imaging frequency and longer time-lapse recordings to capture rare events. Based on our available data, it is difficult to conclude which of the two possibilities, slow spine growth or initial size differences, is correct. This point will be discussed in the revised manuscript (Point 13).

    1. eLife Assessment

      This useful study provides a systematic and solid comparison of sex-biased enteroendocrine peptide expression, including AstC and Tk, to show that these peptides contribute to female-biased fat storage. The major research question of this study is based on the authors' previous papers, and therefore, the presented results are incremental. This study serves as a foundation for future investigation of regulatory mechanisms for the sex-biased fat content by AstC and Tk.

    2. Reviewer #1 (Public review):

      Summary of goals:

      The authors' stated goal (line 226) was to compare gene expression levels for gut hormones between males and females. As female flies contain more fat than males, they also sought to identify hormones that control this sex difference. Finally, they attempted to place their findings in the broader context of what is already known about established underlying mechanisms.

      Strengths:

      (1) The core research question of this work is interesting. The authors provide a reasonable hypothesis (neuro/entero-peptides may be involved) and well-designed experiments to address it.

      (2) Some of the data are compelling, especially positive results that clearly implicate enteropeptides in sex-biased fat contents (Figures 1 and 3).

      Weaknesses:

      (1) The greatest weakness of this work is that it falls short of providing a clear mechanism for the regulation of sex-biased fat content by AstC and Tk. By and large, feminization of neurons or enteroendocrine cells with UAS-traF did not increase fat in males (Figure 2). The authors mention that ecdysone, juvenile hormone or Sex-lethal may instead play a role (lines 258-270), but this is speculative, making this study incomplete.

      (2) Related to the above point, the cellular mechanisms by which AstC and Tk regulate fat content in males and females are only partially characterized. For example, knockdown of TkR99D in insulin-producing neurons (Figure 4E) but not pan-neuronally (Figure 4B) increases fat in males, but Tk itself only shows a tendency (Figure 3B). In females, the situation is even less clear: again, Tk only shows a tendency (Figure 3B), and pan-neuronal, but not IPC-specific knockdown of TkR99D decreases fat.

      (3) The text sometimes misrepresents or contradicts the Results shown in the figures. UAS-traF expression in neurons or enteroendocrine cells did sometimes alter fat contents (Figure 2H, S), but the authors report that sex differences were unaffected (lines 164-166). On the other hand, although knockdown of Tk in enteroendocrine cells caused no significant effect (Figure 3B), the authors report this as a trend towards reduction (lines 182-183). This biased representation raises concerns about the interpretation of the data and the authors' conclusions.

      (4) The authors find that in males, neuropeptide expression in the head is higher (Figure 1F-J). This may also play an important role in maintaining lower levels of fat in males, but this finding is not explored in the manuscript.

      Appraisal of goal achievement & conclusions:

      The authors were successful in identifying hormones that show sex bias in their expression and also control the male vs. female difference in fat content. However, elucidation of the relevant cellular pathways is incomplete. Additionally, some of their conclusions are not supported by the data (see Weaknesses, point 3).

      Impact:

      It is difficult to evaluate the impact of this study. This is in great part because the authors do not attempt to systematically place their findings about AstC/Tk in the broader context of their previous studies, which investigated the same phenomenon (Wat et al., 2021, eLife and Biswas et al., 2025, Cell Reports). As the underlying mechanisms are complex and likely redundant, it is necessary to generate a visual model to explain the pathways which regulate fat content in males and females.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript by Biswas and Rideout investigates sex differences in the expression and function of hormones derived from Drosophila enteroendocrine (EE) cells. The authors report that while whole-body and head expression of several EE hormones (AstA, AstC, Tk, NPF, Dh31) is male-biased, gut-specific expression of AstC, Tk, and NPF is female-biased. Intriguingly, this sex-specific effect is not dependent on Tra - a surprising and important result. The authors then used an RNAi-based approach to demonstrate that gut-derived AstC and Tk promote fat storage specifically in females. Similar effects are observed when their receptors are knocked down in neurons. In addition, the authors were able to demonstrate that Tk promotes female body fat via the insulin-producing cells. Together, these findings suggest that EE cell-derived hormones contribute to sex-specific fat storage regulation.

      Strengths:

      Overall, I find the paper quite interesting. While the findings are brief, they reveal novel aspects of the sex-specific lipid storage program that I believe are important. As noted by the authors in the discussion, there are many open questions, including how these neuronal effects translate into systemic sex-specific regulation of lipid storage. Regardless, I find the results to be convincing - this paper will serve as the launching point of many future studies.

      Weaknesses:

      My main criticisms are focused on two points:

      (1) If the sex-specific differences are not eliminated by tra overexpression, what else might be responsible? As the authors note, the differences in 20E titers might be responsible. I would encourage the authors to simply feed adult flies with food containing 20E and determine if this alters sex-specific 20E expression.

      (2) I'm quite intrigued by the discovery that Tra does not eliminate the sex-specific differences. There are quite a few recent studies demonstrating that fruitless influences sex-specific neuronal function - here too, I would encourage the authors to examine whether this aspect of the sex-determination pathway is involved in the lipid accumulation phenotype.

    1. eLife Assessment

      This important study introduces an advance in multi-animal tracking by reframing identity assignment as a self-supervised contrastive representation learning problem. It eliminates the need for segments of video where all animals are simultaneously visible and individually identifiable, and significantly improves tracking speed, accuracy, and robustness with respect to occlusion. This innovation has implications beyond animal tracking, potentially connecting with advances in behavioral analysis and computer vision. The strength of support for these advances is compelling overall, although there were some remaining minor methodological concerns.

    2. Reviewer #1 (Public review):

      Summary:

      This is a strong paper that presents a clear advance in multi-animal tracking. The authors introduce an updated version of idtracker.ai that reframes identity assignment as a contrastive representation learning problem rather than a classification task requiring global fragments. This change leads to substantial gains in speed and accuracy and removes a known bottleneck in the original system. The benchmarking across species is comprehensive, the results are convincing, and the work is significant.

      Strengths:

      The main strengths are the conceptual shift from classification to representation learning, the clear performance gains, and the improved robustness of the new version. Removing the need for global fragments makes the software much more flexible in practice, and the accuracy and speed improvements are well demonstrated across a diverse set of datasets. The authors' response also provides further support for the method's robustness.

      The comparison to other methods is now better documented. The authors clarify which features are used, how failures are defined, how parameters are sampled, and how accuracy is assessed against human-validated data. This helps ensure that the evaluation is fair and that readers can understand the assumptions behind the benchmarks.

      The software appears thoughtfully implemented, with GUI updates, integration with pose estimators, and tools such as idmatcher.ai for linking identities across videos. The overall presentation has been improved so that the limitations of the original idtracker.ai, the engineering optimizations, and the new contrastive formulation are more clearly separated. This makes the central ideas and contributions easier to follow.

      Weaknesses:

      I do not have major remaining criticisms. The authors have addressed my earlier concerns about the clarity and fairness of the comparison with prior methods, the benchmark design, and the memory usage analysis by adding methodological detail and clearly explaining their choices. At this point I view these aspects as transparent features of the experimental design that readers can take into account, rather than weaknesses of the work.

      Overall, this is a high-quality paper. The improvements to idtracker.ai are well justified and practically significant, and the authors' response addresses the main concerns about clarity and evaluation. The conceptual contribution, thorough empirical validation, and thoughtful software implementation make this a valuable and impactful contribution to multi-animal tracking.

    3. Reviewer #3 (Public review):

      Summary:

      The authors propose a new version of idTracker.ai for animal tracking. Specifically, they apply contrastive learning to embed cropped images of animals into a feature space where clusters correspond to individual animal identities. By doing this, they address the requirement for so-called global fragments - segments of the video, in which all entities are visible/detected at the same time. In general, the new method reduces the long tracking times from the previous versions, while also increasing the average accuracy of assigning the identity labels.

      Strengths and weaknesses:

      The authors have reorganized and rewritten a substantial portion of their manuscript, which has improved the overall clarity and structure to some extent. In particular, omitting the different protocols enhanced readability. However, all technical details are now in the appendix, which is referred to even more frequently than in the initial submission, where this was already the case. These frequent references to the appendix - and even to appendices from previous versions - make it difficult to read and fully understand the method and the evaluations in detail. A more self-contained description of the method within the main text would be highly appreciated.

      Furthermore, the authors state that they changed their evaluation metric from accuracy to IDF1. However, throughout the manuscript they continue to refer to "accuracy" when evaluating and comparing results. It is unclear which accuracy metric was used or whether the authors are confusing the two metrics. This point needs clarification, as IDF1 is not an "accuracy" measure but rather an F1-score over identity assignments.
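
      To make the distinction concrete, here is a minimal sketch of how IDF1 is computed from identity-level counts, following the standard definition IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts are invented. In practice they come from an optimal matching between predicted and ground-truth trajectories, which is not shown here, and the function names are illustrative rather than taken from any tracking library.

```python
def idf1(idtp, idfp, idfn):
    """IDF1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        return 0.0  # convention chosen here for the empty case (an assumption)
    return 2 * idtp / denom

def id_precision(idtp, idfp):
    """Fraction of predicted detections that carry the correct identity."""
    return idtp / (idtp + idfp) if (idtp + idfp) else 0.0

def id_recall(idtp, idfn):
    """Fraction of ground-truth detections that are correctly identified."""
    return idtp / (idtp + idfn) if (idtp + idfn) else 0.0

# Invented counts: 90 identity true positives, 10 identity false
# positives, 30 identity false negatives.
score = idf1(90, 10, 30)
precision = id_precision(90, 10)
recall = id_recall(90, 30)
```

      Unlike a plain per-frame accuracy, this score penalizes both missed and spurious identity assignments, which is why reporting it under the label "accuracy" can be confusing.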

      The authors compare the speedups of the new version with those of the previous ones by taking the average. However, it appears that there are striking outliers in the tracking performance data (see Supplementary Table 1-4). Therefore, using the average may not be the most appropriate way to compare. The authors should consider using the median or providing more detailed statistics (e.g., boxplots) to better illustrate the distributions.
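
      The concern about averaging can be illustrated with a small sketch using Python's standard `statistics` module. The speedup factors below are invented and are not taken from the supplementary tables. They only show how a single extreme video can dominate the mean while leaving the median and quartiles almost unchanged.

```python
from statistics import mean, median, quantiles

# Invented speedup factors (old tracking time / new tracking time);
# one video dominates the average.
speedups = [2.1, 3.4, 2.8, 4.0, 3.1, 250.0, 2.5]

avg = mean(speedups)
med = median(speedups)
# Quartiles, i.e. the summary a boxplot would display.
q1, q2, q3 = quantiles(speedups, n=4)
```

      Reporting the median together with the quartiles, or showing the full distribution, would let readers judge the typical speedup separately from the outliers.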

      The authors did not provide any conclusion or discussion section. Including a concise conclusion that summarizes the main findings and their implications would help to convey the message of the manuscript.

      The authors report an improvement in the mean accuracy across all benchmarks from 99.49% to 99.82% (with crossings). While this represents a slight improvement, the datasets used for benchmarking seem relatively simple and already largely "solved". Therefore, the impact of this work on the field may be limited. It would be more informative to evaluate the method on more challenging datasets that include frequent occlusions, crossings, or animals with similar appearances. The accuracy reported in the main text is "without crossings" - this seems like an incomplete evaluation, especially given that tracking objects that do not cross seems a straightforward task. Information is missing on why crossings are a problem and why they are dealt with separately. There are several videos with a much lower tracking accuracy; explaining what the challenges of these videos are and why the method fails in such cases would help to understand the method's usability and weak points.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      This is a strong paper that presents a clear advance in multi-animal tracking. The authors introduce an updated version of idtracker.ai that reframes identity assignment as a contrastive learning problem rather than a classification task requiring global fragments. This change leads to gains in speed and accuracy. The method eliminates a known bottleneck in the original system, and the benchmarking across species is comprehensive and well executed. I think the results are convincing and the work is significant.

      Strengths

      The main strengths are the conceptual shift from classification to representation learning, the clear performance gains, and the fact that the new version is more robust. Removing the need for global fragments makes the software more flexible in practice, and the accuracy and speed improvements are well demonstrated. The software appears thoughtfully implemented, with GUI updates and integration with pose estimators.

      Weaknesses

      I don't have any major criticisms, but I have identified a few points that should be addressed to improve the clarity and accuracy of the claims made in the paper.

      (1) The title begins with "New idtracker.ai," which may not age well and sounds more promotional than scientific. The strength of the work is the conceptual shift to contrastive representation learning, and it might be more helpful to emphasize that in the title rather than branding it as "new."

      We considered using “Contrastive idtracker.ai”. However, we thought that readers could then think that we believe they could use either the old idtracker.ai or this contrastive version. But we want to say that the new version is the one to use as it is better in both accuracy and tracking times. We think “New idtracker.ai” communicates better that this version is the version we recommend.

      (2) Several technical points regarding the comparison between TRex (a system evaluated in the paper) and idtracker.ai should be addressed to ensure the evaluation is fair and readers are fully informed.

      (2.1) Lines 158-160: The description of TRex as based on "Protocol 2 of idtracker.ai" overlooks several key additions in TRex, such as posture image normalization, tracklet subsampling, and the use of uniqueness feedback during training. These features are not acknowledged, and it's unclear whether TRex was properly configured - particularly regarding posture estimation, which appears to have been omitted but isn't discussed. Without knowing the actual parameters used to make comparisons, it's difficult to assess how the method was evaluated.

      We added the information about the key additions of TRex in the section “The new idtracker.ai uses representation learning”, lines 153-157. Posture estimation in TRex was neither explicitly used nor disabled during the benchmark; we clarified this in the last paragraph of “Benchmark of accuracy and tracking time”, lines 492-495.

      (2.2) Lines 162-163: The paper implies that TRex gains speed by avoiding Protocol 3, but in practice, idtracker.ai also typically avoids using Protocol 3 due to its extremely long runtime. This part of the framing feels more like a rhetorical contrast than an informative one.

      We removed this, see new lines 153-157.

      (2.3) Lines 277-280: The contrastive loss function is written using the label l, but since it refers to a pair of images, it would be clearer and more precise to write it as l_{I,J}. This would help readers unfamiliar with contrastive learning understand the formulation more easily.

      We added this change in lines 613-620.

      (2.4) Lines 333-334: The manuscript states that TRex can fail to track certain videos, but this may be inaccurate depending on how the authors classify failures. TRex may return low uniqueness scores if training does not converge well, but this isn't equivalent to tracking failure. Moreover, the metric reported by TRex is uniqueness, not accuracy. Equating the two could mislead readers. If the authors did compare outputs to human-validated data, that should be stated more explicitly.

      We observed TRex crashing without outputting any trajectories on some occasions (Appendix 1—figure 1), and this is what we labeled as “failure”. These failures happened in the most difficult videos of our benchmark, which is why we treated them the same way as idtracker.ai going to P3. We clarified this in new lines 464-469.

      The accuracy measured in our benchmark is not estimated but human-validated (see section “Computation of tracking accuracy” in Appendix 1). Both software packages report quality estimators at the end of a tracking run (“estimated accuracy” for idtracker.ai and “uniqueness” for TRex), but these were not used in the benchmark.

      (2.5) Lines 339-341: The evaluation approach defines a "successful run" and then sums the runtime across all attempts up to that point. If success is defined as simply producing any output, this may not reflect how experienced users actually interact with the software, where parameters are iteratively refined to improve quality.

      Yes, our benchmark was designed to be agnostic to the user's level of experience. It also models users who do not inspect the trajectories and then choose parameters again, so as not to leave room for potential subjectivity.

      (2.6) Lines 344-346: The simulation process involves sampling tracking parameters 10,000 times and selecting the first "successful" run. If parameter tuning is randomized rather than informed by expert knowledge, this could skew the results in favor of tools that require fewer or simpler adjustments. TRex relies on more tunable behavior, such as longer fragments improving training time, which this approach may not capture.

      In fact, we used the TRex parameter track_max_speed to elongate fragments for optimal tracking. Rather than tuning parameters at random, we defined the “valid range” for this parameter so that every value in it would produce a decent fragment structure. We used this procedure to avoid penalizing methods that use more parameters.

      (2.7) Line 354 onward: TRex was evaluated using two varying parameters (threshold and track_max_speed), while idtracker.ai used only one (intensity_threshold). With a fixed number of samples, this asymmetry could bias results against TRex. In addition, users typically set these parameters based on domain knowledge rather than random exploration.

      idtracker.ai and TRex have several parameters. Some of them have a single correct value (e.g. number of animals) or the default value that the system computes is already good (e.g. minimum blob size). For a second type of parameters, the system finds a value that is in general not as good, so users need to modify them. In general, users find that for this second type of parameter there is a valid interval of possible values, from which they need to choose a single value to run the system. idtracker.ai has intensity_threshold as the only parameter of this second type and TRex has two: threshold and track_max_speed. For these parameters, choosing one value or another within the valid interval can give different tracking results. Therefore, when we model a user that wants to run the system once except if it goes to P3 (idtracker.ai) or except if it crashes (TRex), it is these parameters we sample from within the valid interval to get a different value for each run of the system. We clarify this in lines 452-469 of the section “Benchmark of accuracy and tracking time”.

      Note that if we chose to simply run old idtracker.ai (v4 or v5) or TRex a single time, this would benefit the new idtracker.ai (v6). This is because old idtracker.ai can enter the very slow Protocol 3 and TRex can fail to track. We therefore run old idtracker.ai or TRex up to 5 times, until old idtracker.ai does not use Protocol 3 and TRex does not fail, to make them as good as they can be with respect to the new idtracker.ai.
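The rerun procedure described in this response can be sketched in a few lines. This is a minimal illustration, not the authors' benchmark code: `simulate_user`, `run_once`, and the toy tracker are hypothetical names, and uniform sampling within the valid interval is an assumption.

```python
import random


def simulate_user(run_once, valid_range, max_attempts=5, rng=None):
    """Model a user who reruns a tracker until one run succeeds.

    run_once(value) is a hypothetical callable returning (succeeded, seconds).
    A failed run stands for old idtracker.ai entering Protocol 3 or TRex
    crashing. Returns (total_seconds, attempts_used, succeeded).
    """
    rng = rng or random.Random()
    low, high = valid_range
    total_seconds = 0.0
    for attempt in range(1, max_attempts + 1):
        # Assumption: each run draws the tunable parameter uniformly
        # from the valid interval.
        value = rng.uniform(low, high)
        succeeded, seconds = run_once(value)
        total_seconds += seconds  # failed attempts still cost time
        if succeeded:
            return total_seconds, attempt, True
    return total_seconds, max_attempts, False


def toy_tracker(threshold):
    """Hypothetical stand-in: slow failure below 0.5, fast success otherwise."""
    if threshold < 0.5:
        return False, 100.0
    return True, 10.0
```

Repeating `simulate_user` over many simulated users and averaging the totals gives an expected tracking time that includes the cost of failed attempts.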

      (2.8) Figure 2-figure supplement 3: The memory usage comparison lacks detail. It's unclear whether RAM or VRAM was measured, whether shared or compressed memory was included, or how memory was sampled. Since both tools dynamically adjust to system resources, the relevance of this comparison is questionable without more technical detail.

      We modified the text in the caption (new Figure 1-figure supplement 2) adding the kind of memory we measured (RAM) and how we measured it. We already have a disclaimer for this plot saying that memory management depends on the machine's available resources. We agree that this is a simple analysis of the usage of computer resources.

      (3) While the authors cite several key papers on contrastive learning, they do not use the introduction or discussion to effectively situate their approach within related fields where similar strategies have been widely adopted. For example, contrastive embedding methods form the backbone of modern facial recognition and other image similarity systems, where the goal is to map images into a latent space that separates identities or classes through clustering. This connection would help emphasize the conceptual strength of the approach and align the work with well-established applications. Similarly, there is a growing literature on animal re-identification (ReID), which often involves learning identity-preserving representations across time or appearance changes. Referencing these bodies of work would help readers connect the proposed method with adjacent areas using similar ideas, and show that the authors are aware of and building on this wider context.

      We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently.

      (4) Some sections of the Results text (e.g., lines 48-74) read more like extended figure captions than part of the main narrative. They include detailed explanations of figure elements, sorting procedures, and video naming conventions that may be better placed in the actual figure captions or moved to supplementary notes. Streamlining this section in the main text would improve readability and help the central ideas stand out more clearly.

      Thank you for pointing this out. We have rewritten the Results, for example streamlining the old lines 48-74 (new lines 42-48) by moving the comments about names, files and order of videos to the caption of Figure 1.

      Overall, though, this is a high-quality paper. The improvements to idtracker.ai are well justified and practically significant. Addressing the above comments will strengthen the work, particularly by clarifying the evaluation and comparisons.

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the ms.

      Reviewer #2 (Public review):

      Summary:

      This work introduces a new version of the state-of-the-art idtracker.ai software for tracking multiple unmarked animals. The authors aimed to solve a critical limitation of their previous software, which relied on the existence of "global fragments" (video segments where all animals are simultaneously visible) to train an identification classifier network, in addition to addressing concerns with runtime speed. To do this, the authors have both re-implemented the backend of their software in PyTorch (in addition to numerous other performance optimizations) as well as moving from a supervised classification framework to a self-supervised, contrastive representation learning approach that no longer requires global fragments to function. By defining positive training pairs as different images from the same fragment and negative pairs as images from any two co-existing fragments, the system cleverly takes advantage of partial (but high-confidence) tracklets to learn a powerful representation of animal identity without direct human supervision. Their formulation of contrastive learning is carefully thought out and comprises a series of empirically validated design choices that are both creative and technically sound. This methodological advance is significant and directly leads to the software's major strengths, including exceptional performance improvements in speed and accuracy and a newfound robustness to occlusion (even in severe cases where no global fragments can be detected). Benchmark comparisons show the new software is, on average, 44 times faster (up to 440 times faster on difficult videos) while also achieving higher accuracy across a range of species and group sizes. 
This new version of idtracker.ai is shown to consistently outperform the closely related TRex software (Walter & Couzin, 2021), which, together with the engineering innovations and usability enhancements (e.g., outputs convenient for downstream pose estimation), positions this tool as an advancement on the state-of-the-art for multi-animal tracking, especially for collective behavior studies.
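The pair definition summarized above (positive pairs from within one fragment, negative pairs from two co-existing fragments) can be illustrated as follows. This is a sketch under assumptions: the fragment layout (a dict with `start`, `end`, and `images`) and all function names are hypothetical, and "co-existing" is taken to mean that the two frame intervals share at least one frame.

```python
import random
from itertools import combinations


def coexist(frag_a, frag_b):
    """True if the two fragments share at least one video frame.

    Two fragments visible in the same frame cannot belong to the same
    animal, which is what makes them a valid source of negative pairs.
    """
    return frag_a["start"] <= frag_b["end"] and frag_b["start"] <= frag_a["end"]


def coexisting_pairs(fragments):
    """Indices (i, j), i < j, of all pairs of co-existing fragments."""
    return [
        (i, j)
        for i, j in combinations(range(len(fragments)), 2)
        if coexist(fragments[i], fragments[j])
    ]


def sample_positive(fragment, rng):
    """Two different images of the same fragment (same animal)."""
    return tuple(rng.sample(fragment["images"], 2))


def sample_negative(fragments, pair, rng):
    """One image from each of two co-existing fragments (different animals)."""
    i, j = pair
    return rng.choice(fragments[i]["images"]), rng.choice(fragments[j]["images"])
```

Nothing here requires a frame in which all animals are visible at once, which is the sense in which global fragments become unnecessary.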

      Despite these advances, we note a number of weaknesses and limitations that are not well addressed in the present version of this paper:

      Weaknesses

      (1) The contrastive representation learning formulation. Contrastive representation learning using deep neural networks has long been used for problems in the multi-object tracking domain, popularized through ReID approaches like DML (Yi et al., 2014) and DeepReID (Li et al., 2014). More recently, contrastive learning has become more popular as an approach for scalable self-supervised representation learning for open-ended vision tasks, as exemplified by approaches like SimCLR (Chen et al., 2020), SimSiam (Chen et al., 2020), and MAE (He et al., 2021) and instantiated in foundation models for image embedding like DINOv2 (Oquab et al., 2023). Given their prevalence, it is useful to contrast the formulation of contrastive learning described here relative to these widely adopted approaches (and why this reviewer feels it is appropriate):

      (1.1) No rotations or other image augmentations are performed to generate positive examples. These are not necessary with this approach since the pairs are sampled from heuristically tracked fragments (which produces sufficient training data, though see weaknesses discussed below) and the crops are pre-aligned egocentrically (mitigating the need for rotational invariance).

      (1.2) There is no projection head in the architecture, like in SimCLR. Since classification/clustering is the only task that the system is intended to solve, the more general "nuisance" image features that this architectural detail normally affords are not necessary here.

      (1.3) There is no stop gradient operator like in BYOL (Grill et al., 2020) or SimSiam. Since the heuristic tracking implicitly produces plenty of negative pairs from the fragments, there is no need to prevent representational collapse due to class asymmetry. Some care is still needed, but the authors address this well through a pair sampling strategy (discussed below).

      (1.4) Euclidean distance is used as the distance metric in the loss rather than cosine similarity as in most contrastive learning works. While cosine similarity coupled with L2-normalized unit hypersphere embeddings has proven to be a successful recipe to deal with the curse of dimensionality (with the added benefit of bounded distance limits), the authors address this through a cleverly constructed loss function that essentially allows direct control over the intra- and inter-cluster distance (D_pos and D_neg). This is a clever formulation that aligns well with the use of K-means for the downstream assignment step.

      No concerns here, just clarifications for readers who dig into the review. Referencing the above literature would enhance the presentation of the paper to align with the broader computer vision literature.

      Thank you for this detailed comparison. We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently, including the points raised by the reviewer.
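To make the role of the pair label l_{I,J} and the two distance thresholds concrete, here is a minimal sketch of a double-margin contrastive loss on Euclidean distance. The exact expression used in the manuscript is not reproduced in this thread, so the squared-hinge form below is an assumption, as are the function names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_loss(emb_i, emb_j, label_ij, d_pos, d_neg):
    """Loss for one image pair (I, J) with label l_{I,J}.

    label_ij == 1: positive pair, penalized only while farther than d_pos.
    label_ij == 0: negative pair, penalized only while closer than d_neg.
    Assumed form (squared hinge), not necessarily the manuscript's.
    """
    d = euclidean(emb_i, emb_j)
    if label_ij == 1:
        return max(0.0, d - d_pos) ** 2
    return max(0.0, d_neg - d) ** 2
```

With this form, a pair stops contributing gradient once its distance reaches the corresponding threshold, which is one way to read the "direct control" over intra- and inter-cluster distance noted in point (1.4).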

      (2) Network architecture for image feature extraction backbone. As most of the computations that drive up processing time happen in the network backbone, the authors explored a variety of architectures to assess speed, accuracy, and memory requirements. They land on ResNet18 due to its empirically determined performance. While the experiments that support this choice are solid, the rationale behind the architecture selection is somewhat weak. The authors state that: "We tested 23 networks from 8 different families of state-of-the-art convolutional neural network architectures, selected for their compatibility with consumer-grade GPUs and ability to handle small input images (20 × 20 to 100 × 100 pixels) typical in collective animal behavior videos."

      (2.1) Most modern architectures have variants that are compatible with consumer-grade GPUs. This is true of, for example, HRNet (Wang et al., 2019), ViT (Dosovitskiy et al., 2020), SwinT (Liu et al., 2021), or ConvNeXt (Liu et al., 2022), all of which report single GPU training and fast runtime speeds through lightweight configuration or subsequent variants, e.g., MobileViT (Mehta et al., 2021). The authors may consider revising that statement or providing additional support for that claim (e.g., empirical experiments) given that these have been reported to outperform ResNet18 across tasks.

      Following the recommendation of the reviewer, we tested the architectures SwinT, ConvNeXt and ViT. We found out that none of them outperformed ResNet18 since they all showed a slower learning curve. This would result in higher tracking times. These tests are now included in the section “Network architecture” (lines 550-611).

      (2.2) The compatibility of different architectures with small image sizes is configurable. Most convolutional architectures can be readily adapted to work with smaller image sizes, including 20x20 crops. With their default configuration, they lose feature map resolution through repeated pooling and downsampling steps, but this can be readily mitigated by swapping out standard convolutions with dilated convolutions and/or by setting the stride of pooling layers to 1, preserving feature map resolution across blocks. While these are fairly straightforward modifications (and are even compatible with using pretrained weights), an even more trivial approach is to pad and/or resize the crops to the default image size, which is likely to improve accuracy at a possibly minimal memory and runtime cost. These techniques may even improve the performance with the architectures that the authors did test out.

      The only two tested architectures that require a minimum image size are AlexNet and DenseNet. DenseNet proved to underperform ResNet18 in the videos where the images are sufficiently large. We have tested AlexNet with padded images to see that it also performs worse than ResNet18 (see Appendix 3—figure 1).

      We also tested the initialization of ResNet18 with pre-trained weights from ImageNet (in Appendix 3—figure 2) and it proved to bring no benefit to the training speed (added in lines 591-592).

      (2.3) The authors do not report whether the architecture experiments were done with pretrained or randomly initialized weights.

      We adapted the text to make it clear that the networks are always randomly initialized (lines 591-592, lines 608-609 and the captions of Appendix 3—figure 1 and 2).

      (2.4) The authors do not report some details about their ResNet18 design, specifically whether a global pooling layer is used and whether the output fully connected layer has any activation function. Additionally, they do not report the version of ResNet18 employed here, namely, whether the BatchNorm and ReLU are applied after (v1) or before (v2) the conv layers in the residual path.

      We use ResNet18 v1 with neither an activation function nor a bias in its last layer (this has been clarified in lines 606-608). Also, by design, ResNet has a global average pool right before the last fully connected layer, which we did not remove. In response to the reviewer, ResNet18 v2 was tested and its performance is the same as that of v1 (see Appendix 3—figure 1 and lines 590-591).

      (3) Pair sampling strategy. The authors devised a clever approach for sampling positive and negative pairs that is tailored to the nature of the formulation. First, since the positive and negative labels are derived from the co-existence of pretracked fragments, selection has to be done at the level of fragments rather than individual images. This would not be the case if one of the newer approaches for contrastive learning were employed, but it serves as a strength here (assuming that fragment generation/first pass heuristic tracking is achievable and reliable in the dataset). Second, a clever weighted sampling scheme assigns sampling weights to the fragments that are designed to balance "exploration and exploitation". They weigh samples both by fragment length and by the loss associated with that fragment to bias towards different and more difficult examples.

      (3.1) The formulation described here resembles and uses elements of online hard example mining (Shrivastava et al., 2016), hard negative sampling (Robinson et al., 2020), and curriculum learning more broadly. The authors may consider referencing this literature (particularly Robinson et al., 2020) for inspiration and to inform the interpretation of the current empirical results on positive/negative balancing.

      Following this recommendation, we added references to hard negative mining in the new section “Differences with previous work in contrastive/metric learning”, lines 792-841. Regarding curriculum learning, even though in spirit it might have parallels with our sampling method in the sense that there is a guided training of the network, we believe the approach is more similar to an exploration-exploitation paradigm.

      (4) Speed and accuracy improvements. The authors report considerable improvements in speed and accuracy of the new idTracker (v6) over the original idTracker (v4?) and TRex. It's a bit unclear, however, which of these are attributable to the engineering optimizations (v5?) versus the representation learning formulation.

      (4.1) Why is there an improvement in accuracy in idTracker v5 (L77-81)? This is described as a port to PyTorch and improvements largely related to the memory and data loading efficiency. This is particularly notable given that the progression went from 97.52% (v4; original) to 99.58% (v5; engineering enhancements) to 99.92% (v6; representation learning), i.e., most of the new improvement in accuracy owes to the "optimizations" which are not the central emphasis of the systematic evaluations reported in this paper.

      V5 was a two-year effort designed to improve the time efficiency of v4. It was also a surprise to us that accuracy was higher, but that likely comes from the fact that the code replaced from v4 contained one or more small bugs. The improvements in v5 are retained in v6 (contrastive learning), and v6 has higher accuracy and shorter tracking times. The source of this extra accuracy and shorter tracking time in v6 is contrastive learning.

      (4.2) What about the speed improvements? Relative to the original (v4), the authors report average speed-ups of 13.6x in v5 and 44x in v6. Presumably, the drastic speed-up in v6 comes from a lower Protocol 2 failure rate, but v6 is not evaluated in Figure 2 - figure supplement 2.

      Idtracker.ai v5 runs an optimized Protocol 2 and, sometimes, Protocol 3. But v6 doesn’t run either of them. While P2 is still present in v6 as a fallback protocol when contrastive learning fails, in our v6 benchmark P2 was never needed. So the v6 speedup comes from replacing both P2 and P3 with the contrastive algorithm.

      (5) Robustness to occlusion. A major innovation enabled by the contrastive representation learning approach is the ability to tolerate the absence of a global fragment (contiguous frames where all animals are visible) by requiring only co-existing pairs of fragments owing to the paired sampling formulation. While this removes a major limitation of the previous versions of idtracker.ai, its evaluation could be strengthened. The authors describe an ablation experiment where an arc of the arena is masked out to assess the accuracy under artificially difficult conditions. They find that the v6 works robustly up to significant proportions of occlusions, even when doing so eliminates global fragments.

      (5.1) The experiment setup needs to be more carefully described.

      (5.1.1) What does the masking procedure entail? Are the pixels masked out in the original video or are detections removed after segmentation and first pass tracking is done?

      The mask is defined as a region of interest in the software. This means that it is applied at the segmentation step where the video frame is converted to a foreground-background binary image. The region of interest is applied here, converting to background all pixels not inside of it. We clarified this in the newly added section Occlusion tests, lines 240-244.

      (5.1.2) What happens at the boundary of the mask? (Partial segmentation masks would throw off the centroids, and doing it after original segmentation does not realistically model the conditions of entering an occlusion area.)

      Animals at the boundaries of the mask are partially detected. This can change the location of their detected centroid. That’s why, when computing the ground-truth accuracy for these videos, only the ground-truth centroids that were at least 15 pixels away from the mask were considered. We clarified this in the newly added section Occlusion tests, lines 248-251.

      (5.1.3) Are fragments still linked for animals that enter and then exit the mask area?

      No artificial fragment linking was added in these videos. Detected fragments are linked the usual way. If an animal moves under the mask, it disappears, so the fragment breaks. We clarified this in the newly added section Occlusion tests, lines 245-247.

      (5.1.4) How is the evaluation done? Is it computed with or without the masked region detections?

      The ground truth used to validate these videos contains the positions of all animals at all times. But only the positions outside the mask at each frame were considered to compute the tracking accuracy. We clarified this in the newly added section Occlusion tests, lines 248-251.
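The evaluation rule described in the responses to (5.1.2) and (5.1.4) amounts to a geometric filter on ground-truth centroids. The sketch below assumes the mask is an angular sector with its apex at the arena center and extending across the whole arena; that geometry, the function names, and the parameter layout are assumptions made for illustration, with only the 15-pixel clearance taken from the response.

```python
import math


def distance_to_sector(point, center, start_angle, width):
    """Distance from point to a sector (wedge) with apex at center.

    The sector spans angles [start_angle, start_angle + width] in radians
    and is assumed to extend outward without limit. Returns 0.0 inside it.
    """
    x, y = point[0] - center[0], point[1] - center[1]
    r = math.hypot(x, y)
    phi = math.atan2(y, x)
    if (phi - start_angle) % (2 * math.pi) <= width:
        return 0.0

    def distance_to_ray(alpha):
        # Smallest absolute angle between the point and the boundary ray.
        delta = abs((phi - alpha + math.pi) % (2 * math.pi) - math.pi)
        # Beyond 90 degrees the closest point of the ray is the apex itself.
        return r * math.sin(delta) if delta < math.pi / 2 else r

    return min(distance_to_ray(start_angle), distance_to_ray(start_angle + width))


def centroids_for_evaluation(centroids, center, start_angle, width, clearance=15.0):
    """Keep only ground-truth centroids at least `clearance` pixels from the mask."""
    return [
        c for c in centroids
        if distance_to_sector(c, center, start_angle, width) >= clearance
    ]
```

Centroids inside the mask have distance zero and are dropped automatically, so the same filter covers both "outside the mask" and "not too close to its boundary".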

      (5.2) The circular masking is perhaps not the most appropriate for the mouse data, which is collected in a rectangular arena.

      We wanted to show the same proof of concept in different videos. For that reason, we used the same kind of mask, parametrized by an angle, to cover the arena in all of them. In the rectangular arena, the circular mask is defined on a circle external to the rectangle, so it still covers a portion of the rectangle set by that angle.

      (5.3) The number of co-existing fragments, which seems to be the main determinant of performance that the authors derive from this experiment, should be reported for these experiments. In particular, a "number of co-existing fragments" vs accuracy plot would support the use of the 0.25(N-1) heuristic and would be especially informative for users seeking to optimize experimental and cage design. Additionally, the number of co-existing fragments can be artificially reduced in other ways other than a fixed occlusion, including random dropout, which would disambiguate it from potential allocentric positional confounds (particularly relevant in arenas where egocentric pose is correlated with allocentric position).

      We included the requested analysis about the fragment connectivity in Figure 3-figure supplement 1. We agree that there can be additional ways of reducing co-existing fragments, but we think the occlusion tests have the additional value that there are many real experiments similar to this test.

      (6) Robustness to imaging conditions. The authors state that "the new idtracker.ai can work well with lower resolutions, blur and video compression, and with inhomogeneous light (Figure 2 - figure supplement 4)." (L156). Despite this claim, there are no speed or accuracy results reported for the artificially corrupted data, only examples of these image manipulations in the supplementary figure.

      We added this information in the same image, new Figure 1 - figure supplement 3.

      (7) Robustness across longitudinal or multi-session experiments. The authors reference idmatcher.ai as a compatible tool for this use case (matching identities across sessions or long-term monitoring across chunked videos), however, no performance data is presented to support its usage. This is relevant as the innovations described here may interact with this setting. While deep metric learning and contrastive learning for ReID were originally motivated by these types of problems (especially individuals leaving and entering the FOV), it is not clear that the current formulation is ideally suited for this use case. Namely, the design decisions described in point 1 of this review are at times at odds with the idea of learning generalizable representations owing to the feature extractor backbone (less scalable), low-dimensional embedding size (less representational capacity), and Euclidean distance metric without hypersphere embedding (possible sensitivity to drift). It's possible that data to support point 6 can mitigate these concerns through empirical results on variations in illumination, but a stronger experiment would be to artificially split up a longer video into shorter segments and evaluate how generalizable and stable the representations learned in one segment are across contiguous ("longitudinal") or discontiguous ("multi-session") segments.

      We have now added a test to prove the reliability of idmatcher.ai in v6. In this test, 14 videos are taken from the benchmark and each is split into two non-overlapping parts (with a 200-frame gap in between). idmatcher.ai is run between the two parts, achieving 100% identity-matching accuracy across all of them (see section “Validity of idmatcher.ai in the new idtracker.ai”, lines 969-1008).

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the ms.

      Reviewer #3 (Public review):

      Summary

      The authors propose a new version of idTracker.ai for animal tracking. Specifically, they apply contrastive learning to embed cropped images of animals into a feature space where clusters correspond to individual animal identities.

      Strengths

      By doing this, the new software alleviates the requirement for so-called global fragments - segments of the video, in which all entities are visible/detected at the same time - which was necessary in the previous version of the method. In general, the new method reduces the tracking time compared to the previous versions, while also increasing the average accuracy of assigning the identity labels.

      Weaknesses

      The general impression of the paper is that, in its current form, it is difficult to disentangle the old from the new method and understand the method in detail. The manuscript would benefit from a major reorganization and rewriting of its parts. There are also certain concerns about the accuracy metric and reducing the computational time.

      We have made the following modifications in the presentation:

      (1) We have added section titles to the main text so it is clearer what tracking system we are referring to. For example, we now have sections “Limitation of the original idtracker.ai”, “Optimizing idtracker.ai without changes in the learning method” and “The new idtracker.ai uses representation learning”.

      (2) We have completely rewritten all the text of the ms until we start with contrastive learning. Old L20-89 is now L20-L66, much shorter and easier to read.

      (3) We have rewritten the first 3 paragraphs in the section “The new idtracker.ai uses representation learning” (lines 68-92).

      (4) We have now expanded Appendix 3 to discuss the details of our approach (lines 539-897). It discusses in detail the steps of the algorithm, the network architecture, the loss function, the sampling strategy, the clustering and identity assignment, and the stopping criteria in training.

      (5) To cite previous work in detail and explain what we do differently, we have now added in Appendix 3 the new section “Differences with previous work in contrastive/metric learning” (lines 792-841).

      Regarding accuracy metrics, we have replaced our accuracy metric with the standard metric IDF1. IDF1 is the standard metric that is applied to systems in which the goal is to maintain consistent identities across time. See also the section in Appendix 1 "Computation of tracking accuracy” (lines 414-436) explaining IDF1 and why this is an appropriate metric for our goal.

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy over all our benchmark for our previous accuracy score and the new one for the full trajectories:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      trex: 97.89% -> 97.89%
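For readers unfamiliar with IDF1, the standard definition is IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), computed under the one-to-one assignment of ground-truth to predicted identities that maximizes identity true positives. The sketch below is a simplified illustration, not the benchmark code: it assumes one predicted label per ground-truth detection and finds the best assignment by brute force, which is only practical for a handful of identities (optimal bipartite matching is the usual choice at scale).

```python
from itertools import permutations


def idf1(gt_ids, pred_ids):
    """IDF1 for two equal-length label sequences, one entry per detection.

    Assumes labels are mutually comparable (e.g. all ints) and that
    every ground-truth detection has exactly one predicted label.
    """
    if not gt_ids or len(gt_ids) != len(pred_ids):
        raise ValueError("need two non-empty sequences of equal length")
    gt_labels = sorted(set(gt_ids))
    pred_labels = sorted(set(pred_ids))
    # Pad with None so each ground-truth identity can stay unmatched.
    pred_labels += [None] * max(0, len(gt_labels) - len(pred_labels))

    best_idtp = 0
    for perm in permutations(pred_labels, len(gt_labels)):
        mapping = dict(zip(gt_labels, perm))
        idtp = sum(1 for g, p in zip(gt_ids, pred_ids) if mapping[g] == p)
        best_idtp = max(best_idtp, idtp)

    idfp = len(pred_ids) - best_idtp  # predictions not matching their identity
    idfn = len(gt_ids) - best_idtp    # ground truth not recovered
    return 2 * best_idtp / (2 * best_idtp + idfp + idfn)
```

Because the score depends on a single global identity assignment, an identity swap partway through a video lowers it for every subsequent frame, which is why it suits systems whose goal is consistent identities over time.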

      We thank the reviewer for the suggestions about presentation and about the use of more standard metrics.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1a: A graphical legend inset would make it more readable since there are multiple colors, line styles, and connecting lines to parse out.

      Following this recommendation, we added a graphical legend in the old Figure 1 (new Figure 2).

      (2) L46: "have images" → "has images".

      We applied this correction. Line 35.

      (3) L52: "videos start with a letter for the species (z,**f**,m)", but "d" is used for fly videos.

      We applied this correction in the caption of Figure 1.

      (4) L62: "with Protocol 3 a two-step process" → "with Protocol 3 being a two-step process".

      We rewrote this paragraph without mentioning Protocol 3, lines 37-41.

      (5) L82-89: This is the main statement of the problems that are being addressed here (speed and relaxing the need for global fragments). This could be moved up, emphasized, and made clearer without the long preamble and results on the engineering optimizations in v5. This lack of linearity in the narrative is also evident in the fact that after Figure 1a is cited, inline citations skip to Figure 2 before returning to Figure 1 once the contrastive learning is introduced.

      We have rewritten all the text that precedes the contrastive learning section (old lines 20-89 are now lines 20-66). The text is shorter, more linear and easier to read.

      (6) L114: "pairs until the distance D_{pos}" → "pairs until the distance approximates D_{pos}".

      We rewrote this as “pairs until the distance 𝐷pos (or 𝐷neg) is reached” in line 107.

      (7) L570: Missing a right parenthesis in the equation.

      We no longer have this equation in the ms.

      (8) L705: "In order to identify fragments we, not only need" → "In order to identify fragments, we not only need".

      We applied this correction, Line 775.

      (9) L819: "probably distribution" → "probability distribution".

      We applied this correction, Line 776.

      (10) L833: "produced the best decrease the time required" → "produced the best decrease of the time required".

      We applied this correction, Line 746.
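
      The sentence quoted in the response to point (6) above describes pairs being moved until a distance 𝐷pos (or 𝐷neg) is reached. One common way to express such a rule is a double-margin contrastive loss, sketched below. This is an illustration only: the squared-hinge form, the default values of `d_pos` and `d_neg`, and the function name `pair_loss` are assumptions, not taken from the manuscript.

```python
import math


def pair_loss(u, v, same_identity, d_pos=1.0, d_neg=10.0):
    """Double-margin contrastive loss for one pair of embeddings.

    u, v: embedding vectors (sequences of floats of equal length).
    same_identity: True for a positive pair, False for a negative pair.
    d_pos, d_neg: hypothetical margins standing in for D_pos and D_neg.
    """
    d = math.dist(u, v)  # Euclidean distance between the two embeddings
    if same_identity:
        # Positive pairs are penalized only while farther apart than d_pos.
        return max(0.0, d - d_pos) ** 2
    # Negative pairs are penalized only while closer than d_neg.
    return max(0.0, d_neg - d) ** 2


# A positive pair already within d_pos contributes no loss.
print(pair_loss((0.0, 0.0), (0.5, 0.0), same_identity=True))
# A negative pair at distance 5 is still pushed apart.
print(pair_loss((0.0, 0.0), (3.0, 4.0), same_identity=False))
```

      With this form, pairs that already satisfy their margin produce zero gradient, so training effort concentrates on the pairs that still violate it.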

      Reviewer #3 (Recommendations for the authors):

      (1) We recommend rewriting and restructuring the manuscript. The paper includes a detailed explanation of the previous approaches (idTracker and idTracker.ai) and their limitations. In contrast, the description of the proposed method is short and unstructured, which makes it difficult to distinguish between the old and new methods as well as to understand the proposed method in general. Here are a few examples illustrating the problem. 

      (1.1) Only in line 90 do the authors start to describe the work done in this manuscript. The previous 3 pages list limitations of the original method.

      We have now divided the main text into sections, so it is clearer what is the previous method (“Limitation of the original idtracker.ai”, lines 28-51), the new optimization we did of this method (“Optimizing idtracker.ai without changes in the learning method”, lines 52-66) and the new contrastive approach that also includes the optimizations (“The new idtracker.ai uses representation learning”, lines 66-164). Also, the new text has now been streamlined until the contrastive section, following your suggestion. In the new writing the three sections are 25, 15 and 99 lines long. The most detailed section is the one on the new system; the other two are needed as reference, to describe which problem we are solving and the extra new optimizations.

      (1.2) The new method does not have a distinct name, and it is hard to follow which idtracker.ai is a specific part of the text referring to. Not naming the new method makes it difficult to understand.

      We use the name new idtracker.ai (v6) so it becomes the current default version. v5 is now obsolete, as well as v4. And from the point of view of the end user, no new name is needed since v6 is just an evolution of the same software they have been using. Also, we added sections in the main text to clarify the ideas in there and indicate the version of idtracker.ai we are referring to.

      (1.3) There are "Protocol 2" and "Protocol 3" mixed with various versions of the software scattered throughout the text, which makes it hard to follow. There should be some systematic naming of approaches and a listing of results introduced.

      Following this recommendation, we no longer talk about the specific protocols of the old version of idtracker.ai in the main text. We have rewritten the explanation of these versions in a clearer and more straightforward way, lines 29-36.

      (2) To this end, the authors leave some important concepts either underexplained or only referenced indirectly via prior work. For example, the explanation of how the fragments are created (line 15) is only explained by the "video structure" and the algorithm that is responsible for resolving the identities during crossings is not detailed (see lines 46-47, 149-150). Including summaries of these elements would improve the paper's clarity and accessibility.

      We listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (3) Accuracy metrics are not clear. In line 319, the authors define it as based on "proportion of errors in the trajectory". This proportion is not explained. How is the error calculated if a trajectory is lost or there are identity swaps? Multi-object tracking has a range of accuracy metrics that account for such events but none of those are used by the authors. Estimating metrics that are common for MOT literature, for example, IDF1, MOTA, and MOTP, would allow for better method performance understanding and comparison.

      In the new ms, we replaced our accuracy metric with IDF1, the standard metric for systems whose goal is to maintain consistent identities across time. See also the section in Appendix 1 “Computation of tracking accuracy”, which explains why IDF1, and not MOTA or MOTP, is the adequate metric for a system that aims to track correctly by identification over time. See lines 416-436.

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy between our previous accuracy score and the new one, for the full trajectories:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      trex: 97.89% -> 97.89%

      (4) Additionally, the authors distinguish between tracking with and without crossings, but do not provide statistics on the frequency of crossings per video. It is also unclear how the crossings are considered for the final output. Including information such as the frame rate of the videos would help to better understand the temporal resolution and the differences between consecutive frames of the videos.

      We added this information in Appendix 1 “Benchmark of accuracy and tracking time”, lines 445-451. The frame rate in our benchmark videos ranges from 25 to 60 fps (average of 37 fps). On average, 2.6% of the blobs are crossings (1.1% for zebrafish, 0.7% for drosophila, 9.4% for mice).

      (5) In the description of the dataset used for evaluation (lines 349-365), the authors describe the random sampling of parameter values for each tracking run. However, it is unclear whether the same values were used across methods. Without this clarification, comparisons between the proposed method, older versions, and TRex might be biased due to lucky parameter combinations. In addition, the ranges from which the values were randomly sampled were also not described.

      Only one parameter is shared between idtracker.ai and TRex: intensity_threshold (in idtracker.ai) and threshold (in TRex). Both are conceptually equivalent but differ in their numerical values since they affect different algorithms. V4, v5, and TRex each required the same process of independent expert visual inspection of the segmentation to select the valid value range. Since versions 5 and 6 use exactly the same segmentation algorithm, they share the same parameter ranges.

      All the ranges of valid values used in our benchmark are public here https://drive.google.com/drive/folders/1tFxdtFUudl02ICS99vYKrZLeF28TiYpZ as stated in the section “Data availability”, lines 227-228.

      (6) Lines 122-123, Figure 1c. "batches" - is an imprecise metric of training time as there is no information about the batch size.

      We clarified the Figure caption, new Figure 2c.

      (7) Line 145 - "we run some steps... For example..." leaves the method description somewhat unclear. It would help if you could provide more details about how the assignments are carried out and which metrics are being used.

      Following this recommendation, we listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (8) Figure 3. How is tracking accuracy assessed with occlusions? Are the individuals correctly recognized when they reappear from the occluded area?

      The groundtruth for this video contains the positions of all animals at all times. Only the groundtruth points inside the region of interest are taken into account when computing the accuracy. When the tracking reaches high accuracy, it means that animals are successfully relabeled every time they enter the non-masked region. Note that this software works all the time by identification of animals, so crossings and occlusion are treated the same way. What is new here is that the occlusions are so large that there are no global fragments. We clarified this in the new section “Occlusion tests” in Methods, lines 239-251.

      (9) Lines 185-187 this part of the sentence is not clear.

      We rewrote this part in a clearer way, lines 180-182.

      (10) The authors also highlight the improved runtime performance. However, they do not provide a detailed breakdown of the time spent on each component of the tracking/training pipeline. A timing breakdown would help to compare the training duration with the other components. For example, the calculation of the Silhouette Score alone can be time-consuming and could be a bottleneck in the training process. Including this information would provide a clearer picture of the overall efficiency of the method.

      We measured that training the ResNet takes, on average over our benchmark, 47% of the tracking time (we added this information in line 551, section “Network Architecture”). In this training stage the bottleneck is the forward and backward pass of the network, limited by GPU performance. All other processes that run during training have been heavily optimized, and parallelized where needed, so their contribution to the training time is minimal. Apart from training, we also measured that 24.4% of the total tracking time is spent reading and segmenting the video files and 11.1% processing the identification images and detecting crossings.

      (11) An important part of the computational cost is related to model training. It would be interesting to test whether a model trained on one video of a specific animal type (e.g., zebrafish_5) generalizes to another video of the same type (e.g., zebrafish_7). This would assess the model's generalizability across different videos of the same species and spare a lot of compute. Alternatively, instead of training a model from scratch for each video, the authors could also consider training a base model on a superset of images from different videos and then fine-tuning it with a lower learning rate for each specific video. This could potentially save time and resources while still achieving good performance.

      Already before v6, the user could start training the identification network by copying the final weights from another tracking session. This knowledge transfer feature is still present in v6 and it still decreases the training times significantly. This information has been added in Appendix 4, lines 906-909.

      We have already begun working on the interesting idea of a general base model but it brings some complex challenges. It could be a very useful new feature for future idtracker.ai releases.

      We thank the reviewer for the many suggestions. We have implemented all of them.

    1. eLife Assessment

      This important study provides a detailed analysis of the transcriptional landscape of the mouse hippocampus in the context of various physiological states. The main conclusions have solid support: that most transcriptional targets are generally stable, with notable exceptions in the dentate gyrus and with regard to circadian changes. There are some weaknesses and it would improve the manuscript to address them.

    2. Reviewer #1 (Public review):

      Olmstead et al. present a single-cell nuclear sequencing dataset that interrogates how hippocampal gene expression changes in response to distinct physiological stimuli and across circadian time. The authors perform single-nucleus RNA sequencing on mouse hippocampal tissue after (1) kainic acid-induced seizure, (2) exposure to an enriched environment, and (3) at multiple circadian phases.

      The dataset is rigorously collected, and a major strength is the use of the previously established ABC taxonomy from Yao et al. (2023) to define cell types. The authors further show that this taxonomy is largely independent of activity-driven transcriptional programs. Using these annotations, they examine activity-regulated gene expression across neuronal and glial subclasses. They identify ZT12, corresponding to the transition from the light to the dark period, as transcriptionally distinct from other circadian time points, and show that this pattern is conserved across many cell types. Finally, they test how circadian phase influences activity-dependent gene expression by exposing mice to an enriched environment at different times of day, and report no significant interaction between circadian phase and enriched environment exposure.

      A crucial consideration for users of this dataset is the potential confounding effect between circadian phase and locomotor activity. This is particularly relevant because dentate gyrus activity is strongly modulated by locomotion. The authors acknowledge this issue in the Discussion and provide useful guidance for how to interpret their findings, considering this confound.

      Taken together, this dataset represents a useful resource for the neuroscience community, particularly for investigators interested in how novel experience and circadian phase shape activity-related and immediate early gene expression in the hippocampus.

    3. Reviewer #2 (Public review):

      This manuscript presents the ACT-DEPP dataset, a comprehensive single-nucleus RNA-sequencing atlas of the mouse hippocampus that examines how activity-dependent and circadian transcriptional programs intersect. The dataset spans multiple experimental conditions and circadian time points, clarifying how cell-type identity relates to transcriptional state. In particular, the authors compare stimulus-evoked activity programs (environmental enrichment and kainate-induced seizures) with circadian phase-dependent transcriptional oscillations. They also identify a transcriptional inflection point near ZT12 and argue that immediate early gene (IEG) induction is broadly maintained across circadian phases, with minimal ZT-dependent modulation.

      Strengths:

      The study is ambitious in scope and data volume, and outlines the data-processing and atlas-registration workflows. The side-by-side treatment of stimulus paradigms and ZT sampling provides a coherent framework for parsing state (activity) from phase (circadian) across diverse neuronal and non-neuronal classes. Several findings - especially the ZT12 "inflection" and the differential sensitivity of pathways across subclasses - are intriguing.

      Weaknesses:

      (1) The authors acknowledge, but do not adequately address, the fundamental confounding factor between circadian phase and spontaneous locomotor activity. The assertion that these represent "orthogonal regulatory axes," based on largely non-overlapping DEGs, may be overstated. The absence of behavioral monitoring during baseline is a major limitation.

      (2) The statement "Thus, novel experiences and seizures trigger categorically distinct transcriptional responses-with respect to both magnitude and specific genes-in these hippocampal subregions" is overstated, given the data presented. Figure 2A-B shows that approximately one-third of EE-induced DEGs at 30 minutes overlap with KA DEGs, and this overlap increases substantially at 6 hours in CA1 (where EE and KA responses become "fully shared"). This suggests the responses are quantitatively different rather than "categorically distinct."

      (3) In Figure 4B, "active cells" are defined as those with ≥3 of 15 IEGs above the 90th percentile, with thresholds apparently calibrated in CA1. Because baseline expression distributions differ across subclasses, this rule can bias activation rates across cell types.

      (4) Few genes show significant ZT × stimulus (EE or seizure) interactions, concentrated in neuronal populations. Given unequal nucleus counts and biological replicates across subclasses, small effects may be underpowered.

      (5) In Figure 6 I, J, the relationship between the highlighted pathways/functions and circadian phase is not yet explicit.

      (6) Line 276-280: The enrichment of lncRNAs at ZT12 in CA1 is intriguing but underdeveloped. What are these lncRNAs, and what might they regulate?
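
      The concern in point (3) above can be made concrete with a toy calculation. The sketch below applies a "≥3 genes above the 90th percentile" rule to two made-up subclasses that differ only in baseline expression, once with thresholds calibrated in a reference subclass and once with thresholds calibrated within each subclass. All data, function names and the nearest-rank percentile are hypothetical choices for illustration; they are not the authors' pipeline.

```python
def percentile(values, q):
    """Nearest-rank percentile (integer q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def gene_thresholds(cells, q=90):
    """Per-gene q-th percentile across cells (each cell is a list of values)."""
    n_genes = len(cells[0])
    return [percentile([c[g] for c in cells], q) for g in range(n_genes)]


def active_fraction(cells, thresholds, min_genes=3):
    """Fraction of cells with at least min_genes genes strictly above threshold."""
    active = sum(
        1 for c in cells
        if sum(x > t for x, t in zip(c, thresholds)) >= min_genes
    )
    return active / len(cells)


# Toy data: 10 cells x 15 genes per subclass; "other" has a higher baseline.
ref = [[i] * 15 for i in range(10)]        # reference subclass, values 0..9
other = [[i + 5] * 15 for i in range(10)]  # same spread, shifted by +5

ref_thr = gene_thresholds(ref)
print(active_fraction(ref, ref_thr))                   # reference, own thresholds
print(active_fraction(other, ref_thr))                 # other, reference thresholds
print(active_fraction(other, gene_thresholds(other)))  # other, own thresholds
```

      In this toy case the shifted subclass appears far more "active" under the reference thresholds than under its own, even though the two subclasses have identical spread, which is the kind of bias the reviewer describes.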

      Overall, most descriptive conclusions are supported (e.g., broad phase-robustness of classical IEGs; an inflection near ZT12). Claims about the separability/orthogonality of activity vs circadian programs, and about categorical distinctions between EE and KA responses, would benefit from more conservative wording or additional analyses to rule out behavioral and power-related alternatives.

    1. eLife Assessment

      This valuable study uses fiber photometry, implantable lenses, and optogenetics, to show that a subset of subthalamic nucleus neurons are active during movement, and that active but not passive avoidance depends in part on STN projections to substantia nigra. The strength of the evidence for these claims is solid, whereas evidence supporting the claims that STN is involved in cautious responding is unclear as presented. This paper may be of interest to basic and applied behavioural neuroscientists working on movement or avoidance.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript presents a robust set of experiments that provide new insights into the role of STN neurons during active and passive avoidance tasks. These forms of avoidance have received comparatively less attention in the literature than the more extensively studied escape or freezing responses, despite being extremely relevant to human behaviour and more strongly influenced by cognitive control.

      Strengths:

      Understanding the neural infrastructure supporting avoidance behaviour would be a fundamental milestone in neuroscience. The authors employ sophisticated methods to delineate the role of STN neurons during avoidance behaviours. The work is thorough and the evidence presented is compelling. Experiments are carefully constructed, well-controlled, and the statistical analyses are appropriate.

      Weaknesses:

      One possible remaining conceptual concern that might require future work is determining whether STN primarily mediates higher-level cognitive avoidance or if its activation primarily modulates motor tone.

    3. Reviewer #2 (Public review):

      Summary:

      Zhou, Sajid et al. present a study investigating the STN involvement in signaled movement. They use fiber photometry, implantable lenses, and optogenetics during active avoidance experiments to evaluate this. The data are useful for the scientific community and the overall evidence for their claims is solid, but many aspects of the findings are confusing. The authors present a huge collection of data, and it is somewhat difficult to extract the key information and the meaningful implications resulting from these data.

      Strengths:

      The study is comprehensive in using many techniques and many stimulation powers and frequencies and configurations.

      Weaknesses - re-review:

      All previous weaknesses have been addressed. The authors should explain how inhibition of the STN impairing active avoidance is consistent with the STN encoding cautious action. If 'caution' is related to avoid latency, why does STN lesion or inhibition increase avoid latency, and therefore increase caution? Wouldn't the opposite be more consistent with the statement that the STN 'encodes cautious action'?

    4. Reviewer #3 (Public review):

      Summary:

      The authors use calcium recordings from STN to measure STN activity during spontaneous movement and in a multi-stage avoidance paradigm. They also use optogenetic inhibition and lesion approaches to test the role of STN during the avoidance paradigm. The paper reports a large amount of data and makes many claims; some seem well supported to this Reviewer, others not so much.

      Strengths:

      Well-supported claims include data showing that during spontaneous movements, especially contraversive ones, STN calcium activity is increased using bulk photometry measurements. Single-cell measures back this claim but also show that it is only a minority of STN cells that respond strongly, with most showing no response during movement, and a similar number showing smaller inhibitions during movement.

      Photometry data during cued active avoidance procedures show that STN calcium activity sharply increases in response to auditory cues, and during cued movements to avoid a footshock. Optogenetic and lesion experiments are consistent with an important role for STN in generating cue-evoked avoidance. And a strength of these results is that multiple approaches were used.

      Original Weaknesses:

      I found the experimental design and presentation convoluted and some of the results over-interpreted.

      As presented, I don't understand this idea that delayed movement is necessarily indicative of cautious movements. Is the distribution of responses multi-modal in a way that might support this idea; or do the authors simply take a normal distribution and assert that the slower responses represent 'caution'? Even if responses are multi-modal and clearly distinguished by 'type', why should readers think that delayed responses imply cautious responding instead of, say, habituation or sensitization to cue/shock; variability in attention, motivation, or stress; or merely uncertainty, which seems plausible given what I understand of the task design, where the same mice are repeatedly tested in changing conditions? This relates to a major claim (i.e., in the title).

      Related to the last, I'm struggling to understand the rationale for dividing cells into 'types' based on their physiological responses in some experiments.

      In several figures the number of subjects used was not described. This is necessary. Also necessary is some assessment of the variability across subjects. The only measure of error shown in many figures relates trial-to-trial or event variability, which is minimal because in many cases it appears that hundreds of trials may have been averaged per animal, but this doesn't provide a strong view of biological variability (i.e., are results consistent across animals?).

      It is not clear if or how spread of expression outside of target STN was evaluated, and if or how or how many mice were excluded due to spread or fiber placements. Inadequate histological validation is presented and neighboring regions that would be difficult to completely avoid, such as paraSTN may be contributing to some of the effects.

      Raw example traces are not provided.

      The timeline of the spontaneous movement and avoidance sessions was not clear, nor was the number of events or sessions per animal and how this was set. It is not clear if there was pre-training or habituation, if many or variable sessions were combined per animal, what the time gaps between sessions were, or if or how any of these parameters might influence interpretation of the results.

      Comments on revised version:

      The authors removed the optogenetic stimulation experiments, but then also added a lot of new analyses. Overall the scope of their conclusions is essentially unchanged.

      Part of the eLife model is to leave it to the authors' discretion how they choose to present their work. But my overall view of it is unchanged. There are elements that I found clear, well executed, and compelling. But there are other elements that I found difficult to understand and where I could not follow or concur with their conclusions.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      (1) Vglut2 isn't a very selective promoter for the STN. Did the authors verify every injection across brain slices to ensure the para-subthalamic nucleus, thalamus, lateral hypothalamus, and other Vglut2-positive structures were never infected?

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      (2) The authors say in the methods that the high vs low power laser activation for optogenetic experiments was defined by the behavioral output. This is misleading, and the high vs low power should be objectively stated and the behavioral results divided according to the power used, not according to the behavioral outcome.

      Optogenetic excitation is no longer part of the study.

      (3) In the fiber photometry experiments exposing mice to the range of tones, it is impossible to separate the STN response to the tone from the STN response to the movement evoked by the tone. The authors should expose the mouse to the tones in a condition that prevents movement, such as anesthetized or restrained, to separate out the two components.

      The new mixed-effects modeling approach clearly differentiates sensory (auditory) from motor contributions during tone-evoked STN activation. In prior work (see Hormigo et al, 2023, eLife), we explored experimental methods such as head restraint or anesthesia to reduce movement, but we concluded that these approaches are unsuitable for addressing this question. Mice exhibit substantial residual movement even when head-fixed, and anesthesia profoundly alters neural excitability and behavioral state, introducing major confounds. To fully eliminate movement would require paralysis and artificial ventilation, which would again disrupt physiological network dynamics and raise ethical concerns. Therefore, the current modeling approach—incorporating window-specific covariates for movement—is the most appropriate and rigorous way to dissociate tone-evoked sensory activity from motor activity in behaving animals.

      (4) The claim 'STN activation is ideally suited to drive active avoids' needs more explanation. This claim comes after the fiber photometry experiments during active avoidance tasks, so there has been no causality established yet.

      Text adjusted. 

      (5) The statistical comparisons in Figure 7E need some justification and/or clarification. The 9 neuron types are originally categorized based on their response during avoids, then statistics are run showing that they respond differently during avoids. It is no surprise that they would have significantly different responses, since that is how they were classified in the first place. The authors must explain this further and show that this is not a case of circular reasoning.

      Statistically verifying the clustering is useful to ensure that the selected number of clusters reflects distinct classes. It is also necessary when different measurements are used to classify (movement time series classified the avoids) and to compare neuronal types within each avoid mode/class (now called “mode”). Moreover, the new modeling approach goes beyond the prior statistical limitations related to considering movement and neuronal variables separately.

      (6) The authors show that neurons that have strong responses to orientation show reduced activity during avoidance. What are the implications of this? The authors should explain why this is interesting and important.

      The new modeling approach goes beyond the prior analysis limitations. For instance, it shows that most of the prior orienting related activations closely reflect the orienting movement, and only in a few cases (noted and discussed in the results) orienting activations are related to the behavioral contingencies or behavioral outcomes in the task. 

      (7) It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing.

      Optogenetic excitation is no longer part of the study. 

      (8) The experiments in Figure 10 are used to say that STN stimulation is not aversive, but they only show that STN stimulation cannot be used as punishment in place of a shock. This doesn't mean that it is not aversive; it just means it is not as aversive as a shock. The authors should do a simpler aversion test, such as conditioned or real-time place preference, to claim that STN stimulation is not aversive. This is particularly surprising as previous work (Serra et al., 2023) does show that STN stimulation is aversive.

      Optogenetic excitation is no longer part of the study.

      (9) In the discussion, the idea that the STN encodes 'moving away' from contralateral space is pretty vague and unsupported. It is puzzling that the STN activates more strongly to contraversive turns, but when stimulated, it evokes ipsiversive turns; however, it seems a stretch to speculate that this is related to avoidance. In the last experiments of the paper, the axons from the STN to the GPe and to the midbrain are selectively stimulated. Do these evoke ipsiversive turns similarly?

      Optogenetic excitation is no longer part of the study. 

      (10) In the discussion, the authors claim that the STN is essential for modulating action timing in response to demands, but their data really only show this in one direction. The STN stimulation reliably increases the speed of response in all conditions (except maximum speed conditions such as escapes). It seems to be over-interpreting the data to say this is an inability to modulate the speed of the task, especially as clear learning and speed modulation do occur under STN lesion conditions, as shown in Figure 12B. The mice learn to avoid and increase their latency in AA2 vs AA1, though the overall avoids and latency are different from controls. The more parsimonious conclusion would be that STN stimulation biases movement speed (increasing it) and that this is true in many different conditions.

      Optogenetic excitation is no longer part of the study.

      (11)  In the discussion, the authors claim that the STN projections to the midbrain tegmentum directly affect the active avoidance behavior, while the STN projections to the SNr do not affect it. This seems counter to their results, which show STN projections to either area can alter active avoidance behavior. What is the laser power used in these terminal experiments? If it is high (3mW), the authors may be causing antidromic action potentials in the STN somas, resulting in glutamate release in many brain areas, even when terminals are only stimulated in one area. The authors could use low (0.25mW) laser power in the terminals to reduce the chance of antidromic activation and spatially restrict the optical stimulation.

      Optogenetic excitation is no longer part of the study. 

      (12) Was normality tested for data prior to statistical testing?

      Yes, although now we use mixed models

      (13) Why are there no error bars on Figure 5B, black circles and orange triangles?

      When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Reviewer #3 (Public review):

      (1) I really don't understand or accept this idea that delayed movement is necessarily indicative of cautious movements. Is the distribution of responses multi-modal in a way that might support this idea, or do the authors simply take a normal distribution and assert that the slower responses represent 'caution'? Even if responses are multi-modal and clearly distinguished by 'type', why should readers think that delayed responses imply cautious responding instead of, say, habituation or sensitization to cue/shock; variability in attention, motivation, or stress; or merely uncertainty, which seems plausible given what I understand of the task design, where the same mice are repeatedly tested in changing conditions? This relates to a major claim (i.e., in the work's title).

      In our study, “caution” is defined operationally as the tendency to delay initiation of an avoidance response in demanding situations (e.g., taking more time or care before crossing a busy street). The increase in avoidance latency with task difficulty is highly robust, as we have shown previously through detailed analyses of timing distributions and direct comparisons with appetitive behaviors (e.g., Zhou et al., 2022 JNeurosci). Moreover, we used the tracked movement time series to statistically classify responses into cautious modes, which is likely novel. This definition can dissociate cautious responding from broader constructs listed by a reviewer, such as attention, motivation, or stress, which must be explicitly defined to be rigorously considered in this context, including the likelihood that they covary with caution without being equivalent to it. 

      Cue-evoked orienting responses at CS onset are directly measured, and their habituation and sensitization have been characterized in our prior work (e.g., Zhou et al., 2023 JNeurosci). US-evoked escapes are also measured in the present study and directly compared with avoidance responses. Together, these analyses provide a rigorous and consistent framework for defining and quantifying caution within our behavioral procedures.

      Importantly, mice exhibit cautious responding as defined here across different tasks, making it more informative to classify avoidance responses by behavioral mode rather than by task alone. Accordingly, in the miniscope, single-neuron, and mixed-effects model analyses, we classified active avoids into distinct modes reflecting varying levels of caution. Although these modes covary with task contingencies, their explicit classification improves model predictability and interpretability with respect to cautious responding.

      (2) Related to the last, I'm struggling to understand the rationale for dividing cells into 'types' based on their physiological responses in some experiments (e.g., Figure 7).

      This section has now been expanded into 3 figures (Fig. 7-9) with new modeling approaches that should make the rationale more straightforward.

      By emphasizing the mixed-effects modeling results and integrating these analyses directly into the figures, the revised manuscript now more clearly delineates what is encoded at the population and single-neuron levels. Including movement and baseline covariates allowed us to dissociate motor-related modulation from other neural signals, substantially clarifying the distinction between movement encoding and other task-related variables, which we focus on in the paper. These analyses confirm the strong role of the STN in representing movement while revealing additional signals related to aversive stimulation and cautious responding that persist after accounting for motor effects. These signals arise from distinct neuronal populations that can be differentiated by their movement sensitivity and activation patterns across avoidance modes, reflecting varying levels of caution. At the same time, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.

      (3) The description and discussion of orienting head movements were not well supported, but were much discussed in the avoidance datasets. The initial speed peaks to cue seem to be the supporting data upon which these claims rest, but nothing here suggests head movement or orientation responses.

      As described in the methods (and noted above), we track the head and decompose the movement into rotational and translational components. With the new approach, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.
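
      As an illustration only, the decomposition of tracked head movement into translational and rotational components mentioned above can be sketched as follows. This is not the authors' tracking code: the function name `decompose_head_movement`, the input layout (per-frame position plus heading angle), and the units are assumptions made for the example.

```python
import math

def decompose_head_movement(xs, ys, headings, dt):
    """Split tracked head motion into translational and rotational speed.

    Hypothetical helper, not taken from the manuscript.
    xs, ys   : head position per frame (e.g. in cm)
    headings : head direction per frame, in radians
    dt       : time between frames, in seconds
    Returns two lists (translational speed, angular velocity),
    each with one value per frame-to-frame step.
    """
    translational, rotational = [], []
    for i in range(1, len(xs)):
        # Translational component: distance covered by the head per second.
        step = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        translational.append(step / dt)
        # Rotational component: signed heading change, wrapped to [-pi, pi]
        # so a turn across the +/-pi boundary is not counted as a full circle.
        dtheta = headings[i] - headings[i - 1]
        dtheta = math.atan2(math.sin(dtheta), math.cos(dtheta))
        rotational.append(dtheta / dt)
    return translational, rotational
```

      With the two series separated in this way, a speed peak at CS onset can be attributed to a head turn (large angular velocity, little displacement) or to locomotion (large displacement), which is the distinction the reviewer asks about.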

      (4) Similar to the last, the authors note in several places, including abstract, the importance of STN in response timing, i.e., particularly when there must be careful or precise timing, but I don't think their data or task design provides a strong basis for this claim.

      The avoidance modes and the measured latencies directly support the relation to action timing. The portion of the previous paper on optogenetic excitation, apparently the main source of this criticism, is no longer in the present study.

      (5) I think that other reports show that STN calcium activity is recruited by inescapable foot shock as well. What do these authors see? Is shock, independent of movement, contributing to sharp signals during escapes?

      The question, “Is shock, independent of movement, contributing to sharp signals during escapes?” is now directly addressed in the revised analyses. By incorporating movement and baseline covariates into the mixed-effects models, we dissociate STN activity related to aversive stimulation from that associated with motor output. The results show that shock-evoked STN activation persists even after controlling for movement within defined neuronal populations, supporting a specific nociceptive contribution independent of motor dynamics—a dissociation that appears to be new in this field.
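
      The logic of this dissociation can be written as a generic mixed-effects model. The form below is a sketch under stated assumptions, not the specification used in the manuscript: the symbols, the choice of fixed effects, and the nesting of sessions within mice are illustrative, based only on the statements that movement and baseline enter as covariates and that data are clustered by mouse and session.

```latex
% Assumed illustrative model; indices: m = mouse, s = session, i = trial.
% y is the neural response (e.g. \Delta F/F in a window around the shock).
\begin{aligned}
y_{msi} &= \beta_0
         + \beta_1\,\mathrm{speed}_{msi}
         + \beta_2\,\mathrm{baseline}_{msi}
         + \beta_3\,\mathrm{shock}_{msi}
         + u_m + v_{ms} + \varepsilon_{msi}, \\
u_m &\sim \mathcal{N}(0,\sigma_u^2), \qquad
v_{ms} \sim \mathcal{N}(0,\sigma_v^2), \qquad
\varepsilon_{msi} \sim \mathcal{N}(0,\sigma^2).
\end{aligned}
```

      Under this reading, a shock-related signal "persists after controlling for movement" when the estimate of $\beta_3$ remains reliably different from zero with the speed and baseline terms in the model, while the random terms $u_m$ and $v_{ms}$ absorb between-mouse and between-session variability.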

      (6) In particular, and related to the last point, the following work is very relevant and should be cited:  Note that the focus of this other paper is on a subset of VGLUT2+ Tac1 neurons in paraSTN, but using VGLUT2-Cre to target STN will target both STN and paraSTN.

      We appreciate the reviewer’s reference to the recent preprint highlighting the role of the para-subthalamic nucleus in avoidance learning. However, our study focused specifically on performance in well-trained mice rather than on learning processes. Behavioral learning is inherently more variable and can be disrupted by less specific manipulations, whereas our experiments targeted the stable execution of learned avoidance behaviors. Future work will extend these findings to the learning phase and examine potential contributions of subthalamic subdivisions, which our current Vglut2-based manipulations do not dissociate. We will consider this and related work more closely in those studies.

      (7) In multiple other instances, claims that were more tangential to the main claims were made without clearly supporting data or statistics. E.g., claim that STN activation is related to translational more than rotational movement; claim that GCaMP and movement responses to auditory cues were small; claims that 'some animals' responded differently without showing individual data.

      We have adjusted the text accordingly.

      (8) In several figures, the number of subjects used was not described. This is necessary. Also necessary is some assessment of the variability across subjects. The only measure of error shown in many figures relates to trial-to-trial or event variability, which is minimal because, in many cases, it appears that hundreds of trials may have been averaged per animal, but this doesn't provide a strong view of biological variability. When bar/line plots are used to display data, I recommend showing individual animals where feasible.

      All experiments report number of mice and sessions. Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (9) Can the authors consider the extent to which calcium imaging may be better suited to identify increases compared to decreases and how this may affect the results, particularly related to the GRIN data when similar numbers of cells show responses in both directions (e.g., Figure 3)?

      This is an interesting issue related to a widely used technique beyond the scope of our study.

      (10) Raw example traces are not provided.

      We do not think raw traces are useful here. All figures contain average traces to reflect the activity of the estimated population.

      (11) The timeline of the spontaneous movement and avoidance sessions was not clear, nor was the number of events or sessions per animal nor how this was set. It is not clear if there was pre-training or habituation, if many or variable sessions were combined per animal, or what the time gaps between sessions were, or if or how any of these parameters might influence interpretation of the results.

      We have enhanced the description of the sessions, including the number of animals and sessions, which are daily and always equal per animal in each group of experiments. As noted, the sessions are part of the random effects in the model.

      (12) It is not clear if or how the spread of expression outside of the target STN was evaluated, and if or how many mice were excluded due to spread or fiber placements.

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The primary feedback agreed upon by all the reviewers was that the manuscript requires significant streamlining as it is currently overly long and convoluted.

      We thank the reviewers and editors for their thoughtful and constructive feedback. In response to the primary comment that “the manuscript requires significant streamlining as it is currently overly long and convoluted,” we have substantially revised and refocused the paper. Specifically, we streamlined the included data and enhanced the analyses to emphasize the central findings: the encoding of movement, cautious responding, and punishment in the STN during avoidance behavior. We also focused the causal component of the study by including only the loss-of-function experiments—both optogenetic inhibition and irreversible viral/electrolytic lesions—that establish the critical role of STN circuits in generating active avoidance. Together, these revisions enhance clarity, tighten the narrative focus, and align the manuscript more closely with the reviewers’ recommendations.

      Major revisions include the addition of mixed-effects modeling to dissociate the contributions of movement from other STN-encoded signals related to caution and punishment. This modeling approach allowed us to reveal that these components are statistically separable, demonstrating that movement, cautious responding, and aversive input are encoded by neuronal subsets. To streamline the manuscript and address reviewer concerns, we removed the optogenetic excitation experiments. As revised, the paper presents a more concise and cohesive narrative showing that STN neurons differentially encode movement, caution, and aversive stimuli, and that this circuitry is essential for generating active avoidance behavior.

      Many of the specific points raised by reviewers now fall outside the scope of the revised manuscript. This is primarily because the revised version omits data and analyses related to optogenetic excitation and associated control experiments. By removing these components, the paper now presents a streamlined and internally consistent dataset focused on how the STN encodes movement, cautious responding, and aversive outcomes during avoidance behavior, as well as on loss-of-function experiments demonstrating its necessity for generating active avoidance. Below, we address the points that remain relevant across reviews.

      Following extensive revisions, the current manuscript differs in several important ways from what the assessment describes:

      The description that the study “uses fiber photometry, implantable lenses, and optogenetics” is more accurately represented as using both fiber photometry and single-neuron calcium imaging with miniscopes, combined with optogenetic and irreversible lesion approaches.

      The phrase stating that “active but not passive avoidance depends in part on STN projections to substantia nigra” is better characterized as “STN projections to the midbrain,” since our data show that optogenetic inhibition of STN terminals in both the mesencephalic reticular tegmentum (MRT) and substantia nigra pars reticulata (SNr) produces equivalent effects, and thus these sites are combined in the study.

      Finally, the original concern that evidence for STN involvement in cautious responding or avoidance speed was incomplete no longer applies. The revised focus on encoding, through the inclusion of mixed-effects modeling, now dissociates movement-related, cautious, and aversive components of STN activity. By removing the optogenetic excitation data, we no longer claim that the STN controls caution but rather that it encodes cautious responding, alongside movement and punishment signals. Furthermore, loss-of-function experiments demonstrate that silencing STN output abolishes active avoidance entirely, supporting an essential role for the STN in generating goal-directed avoidance behavior—a behavioral domain that, unlike appetitive responding, is fundamentally defined by caution and the need to balance action timing under threat.

      Reviewer #2 (Recommendations for the authors):

      (1) Show individual data points on bar plots.

      Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line. For example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (2) The active avoidance experiments are confusing when they are introduced in the results section. More explanation of what paradigms were used and what each CS means at the time these are introduced would add clarity. For example, AA1, AA2, etc, are explained only with references to other papers, but a brief description of each protocol and a schematic figure would really help.

      The avoidance protocols (AA1–4) are now described briefly but clearly in the Results section (second paragraph of “STN neurons activate during goal-directed avoidance contingencies”) and in greater detail in the Methods section. As stated, these tasks were conducted sequentially, and mice underwent the same number of sessions per procedure, which are indicated. All relevant procedural information has been included in these sections. Mice underwent daily sessions and learnt these tasks within 1-2 sessions, progressing sequentially across tasks with an equal number of sessions per task (7 per task), and the resulting data were combined and clustered by mouse/session in the statistical models.

      (3) How do the Class 1, 2, 3 avoids relate to Class 1, 2, 3 neural types established in Figure 3? It seems like they are not related, and if that is the case, they should be named something different from each other to avoid confusion. (4) Similarly, having 3 different cell types (a,b,c) in the active avoidance seems unrelated to the original classification of cell types (1,2,3), and these are different for each class of avoid. This is very confusing, and it is unclear how any of these types relate to each other. Presumably, the same mouse has all three classes of avoids, so there are recordings from each cell during each type of avoid.

      The terms class, mode, and type are now clearly distinguished throughout the manuscript. Modes refer to distinct patterns of avoidance behavior that differ in the level of cautious responding (Mode 3 is most cautious). Within each mode, types denote subgroups of neurons identified based on their ΔF/F activity profiles. In contrast, classes categorize neurons according to their relationship to movement, determined by cross-correlation analyses between ΔF/F and head speed (Classes 1-4; Fig. 7 is a new analysis) or head turns (Classes A-C, renamed from 1-3). This updated terminology clarifies the analytic structure, highlighting distinct neuronal populations within each analysis. For example, during avoidance behaviors, these classifications distinguish neurons encoding movement-, caution-, and outcome-related signals. Comparisons are conducted within each analytical set: within classes (A-C or 1-4 separately), within avoidance modes, or within mode-specific neuronal types.
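
      For readers unfamiliar with the approach, a cross-correlation between a neuron's ΔF/F trace and head speed can be sketched as below. This is a minimal illustration, not the authors' analysis: the function names, the lag convention, and the use of a Pearson coefficient at each lag are assumptions, and the criteria that map a cross-correlation profile onto Classes 1-4 or A-C are not reproduced here.

```python
import math

def cross_correlation(signal, movement, max_lag):
    """Pearson correlation between signal[t + lag] and movement[t] per lag.

    Hypothetical helper. Both inputs are equal-length lists sampled at the
    same rate, and max_lag must be well below their length.
    A peak at a positive lag means the signal follows the movement.
    """
    def pearson(a, b):
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        if var_a == 0 or var_b == 0:
            return 0.0  # a flat segment carries no correlation information
        return cov / math.sqrt(var_a * var_b)

    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = signal[lag:], movement[:len(movement) - lag]
        else:
            a, b = signal[:lag], movement[-lag:]
        result[lag] = pearson(a, b)
    return result

def peak_lag(xcorr):
    """Lag with the largest absolute correlation."""
    return max(xcorr, key=lambda lag: abs(xcorr[lag]))
```

      The sign and lag of the peak summarize whether a neuron's activity rises or falls with movement and whether it leads or trails it, which is the kind of information a movement-based classification can draw on.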

      …So the authors could compare one cell during each avoid and determine whether it relates to movement or sound, or something else. It is interesting that types a,b, and c have the exact same proportions in each class of avoid, and makes it important to investigate if these are the exact same cells or not.

      That previous table with the a,b,c % in the three figure panels was a placeholder, which was not updated in the included figure. It has now been correctly updated. They do not have the same proportions as shown in Fig. 9, although they are similar.

      Also, these mice could be recorded during the open field, so the original neural classification (class 1, 2,3) could be applied to these same cells, and then the authors can see whether each cell type defined in the open field has a different response to the different avoid types. As it stands, the paper simply finds that during movement and during avoidance behaviors, different cells in the STN do different things.

      We included a new analysis in Fig. 7 that classifies neurons based on the cross-correlation with movement. The inclusion of the models now clearly assigns variance to movement versus the other factors, and this analysis leads to the classification based on avoid modes. 

      (5) The use of the same colors to mean two different things in Figure 9 is confusing. AA1 vs AA2 shouldn't be the same colors as light-naïve vs light signaling CS.

      Optogenetic excitation is no longer part of the study.

      (6) The exact timeline of the optogenetics experiments should be presented as a schematic for understanding. It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing. The authors should make it clear whether the mice were naïve during this passive avoid experiment or whether they had experienced STN stimulation paired with anything prior to this experiment.

      Optogenetic excitation is no longer part of the study.

      (20) Similarly, the duration of the STN stimulation should be made clear on the plots that show behavior over time (e.g., Figure 9E).

      Optogenetic excitation is no longer part of the study.

      (21) There is just so much data and so many conditions for each experiment here. The paper is dense and difficult to read. It would really benefit readability if the authors put only the key experiments and key figure panels in the main text and moved much of the repetitive figure panels to supplemental figures. The addition of schematic drawings for behavioral experiment timing and for the different AA1, AA2, and AA3 conditions would also really improve clarity.

      By focusing the study, we believe we have substantially improved its clarity and readability.

      Reviewer #3 (Recommendations for the authors):

      (1) Minor error in results 'Cre-AAV in the STN of Vglut2-Cre'

      Fixed.

      (2) In some Figure 2 panels, the peaks appear to be cut off, and blue traces are obscured by red.

      In Fig. 2, the peaks of movement (speed) traces are intentionally truncated to emphasize the rising phase of the turn, which would otherwise be obscured if the full y-axis range were displayed (peaks and other measures are statistically compared). This adjustment enhances clarity without omitting essential detail and is now noted in the legend.

    1. eLife Assessment

      This valuable study provides a 3D standardised anatomical atlas of the brain of an orb-weaving spider. The authors describe the brain's shape and its inner compartments (the neuropils) and add information on the distribution of a number of neuroactive substances such as neurotransmitters and neuropeptides. Through the use of histological and microscopy methods, the authors provide a more complete view of an arachnid brain than previous studies and also present convincing evidence about the organisation and homology of brain regions. The work will serve as a reference for future studies on spider brains and will enable comparisons of brain regions with insects so that the evolution of these structures can be inferred across arthropods.

    2. Reviewer #1 (Public review):

      Summary:

      Artiushin et al. establish a comprehensive 3D atlas of the brain of the orb-web building spider Uloborus diversus. First, they use immunohistochemical detection of synapsin to mark and reconstruct the neuropils of the brains of six specimens, and they generate a standard brain by averaging these brains. Onto this standard 3D brain, they plot immunohistochemical stainings of major transmitters to detect cholinergic, serotonergic, octopaminergic/tyraminergic and GABAergic neurons, respectively. Further, they add information on the expression of a number of neuropeptides (Proctolin, AllatostatinA, CCAP and FMRFamide). Based on these data and 3D reconstructions, they extensively describe the morphology of the entire synganglion, the discernible neuropils and their neurotransmitter/neuromodulator content.

      Strengths:

      While 3D reconstruction of spider brains and the detection of some neuroactive substances have been published before, this seems to be the most comprehensive analysis so far, both in terms of the number of substances tested and the ambition to analyze the entire synganglion. Interestingly, besides the previously described neuropils, they detect a novel brain structure, which they call the tonsillar neuropil.

      Immunohistochemistry, imaging and 3D reconstruction are convincingly done and the data is extensively visualized in figures, schemes and very useful films, which allow the reader to work with the data. Due to its comprehensiveness, this dataset will be a valuable reference for researchers working on spider brains or on the evolution of arthropod brains.

      Weaknesses:

      As expected for such descriptive groundwork, new insights or hypotheses are limited, although the first description of the tonsillar neuropil is interesting. The reconstruction of the main tracts of the brain would be a very valuable complementary piece of data.

    3. Reviewer #2 (Public review):

      Summary

      Artiushin et al. created the first three-dimensional atlas of a synganglion in the hackled orb-weaver spider, which is becoming a popular model for web-building behavior. Immunohistochemical analysis with an impressive array of antisera reveals subcompartments of neuroanatomical structures described in other spider species as well as two previously undescribed arachnid structures, the protocerebral bridge, hagstone, and paired tonsillar neuropils. The authors describe the spider's neuroanatomy in detail and discuss similarities and differences from other spider species. The final section of the discussion examines the homology between onychophoran and chelicerate arcuate bodies and mandibulate central bodies.

      Strengths

      The authors set out to create a detailed 3D atlas and accomplished this goal.

      Exceptional tissue clearing and imaging of the nervous system reveals the three-dimensional relationships between neuropils and some connectivity that would not be apparent in sectioned brains.

      Detailed anatomical description makes it easy to reference structures described between the text and figures.

      The authors used a large palette of antisera which may each be investigated in future studies for function in the spider nervous system and may be compared across species.

      Weaknesses addressed in the revision

      Additional added information about spider-specific neuropils helps orient a non-expert reader. While the function and connectivity of many of these structures is currently unknown, this study will be foundational in future investigations of function.

    4. Reviewer #3 (Public review):

      Summary:

      This is an impressive paper that offers a much-needed 3D standardized brain atlas for the hackled-orb weaving spider Uloborus diversus, an emerging organism of study in neuroethology. The authors used a detailed immunohistological wholemount staining method that allowed them to localize a wide range of common neurotransmitters and neuropeptides and map them on a common brain atlas. Through this approach, they discovered groups of cells that may form parts of neuropils that had not previously been described, such as the 'tonsillar neuropil', which might be part of a larger insect-like central complex. Further, this work provides unique insights into previously underappreciated complexity of higher-order neuropils in spiders, particularly the arcuate body, and hints at a potentially important role for the mushroom bodies in vibratory processing for web-building spiders.

      Strengths:

      To understand brain function, data from many experiments on brain structure must be compiled to serve as a reference and foundation for future work. As demonstrated by the overwhelming success in genetically tractable laboratory animals, 3D standardized brain atlases are invaluable tools, especially as increasing amounts of data are obtained at the gross morphological, synaptic, and genetic levels, and as functional data from electrophysiology and imaging are integrated. Among 'non-model' organisms, such approaches have included global silver staining and confocal microscopy, MRI, and more recently, micro-computed tomography (X-ray) scans used to image multiple brains and average them into a composite reference. In this study, the authors used synapsin immunoreactivity to generate an averaged spider brain as a scaffold for mapping immunoreactivity to other neuromodulators. Using this framework, they describe many previously known spider brain structures and also identify some previously undescribed regions. They argue that the arcuate body, a midline neuropil thought to have diverged evolutionarily from the insect central complex, shows structural similarities that may support its role in path integration and navigation.

      Having diverged from insects such as the fruit fly Drosophila melanogaster over 400 million years ago, spiders are an important group for study, particularly due to their elegant web-building behavior, which is thought to have contributed to their remarkable evolutionary success. How such exquisitely complex behavior is supported by a relatively small brain remains unclear. A rich tradition of spider neuroanatomy emerged in the previous century through the work of comparative zoologists, who used reduced silver and Golgi stains to reveal remarkable detail about gross neuroanatomy. Yet, these techniques cannot uncover the brain's neurochemical landscape, highlighting the need for more modern approaches, such as those employed in the present study.

      A key insight from this study involves two prominent higher-order neuropils of the protocerebrum: the arcuate body and the mushroom bodies. The authors show that the arcuate body has a more complex structure and lamination than previously recognized, suggesting it is insect central complex-like and may support functions such as path integration and navigation, which are critical during web building. They also report strong synapsin immunoreactivity in the mushroom bodies and speculate that these structures contribute to vibratory processing during sensory feedback, particularly in the context of web building and prey localization. These findings align with prior work that noted the complex architecture of both neuropils in spiders and their resemblance (and in some cases greater complexity) compared to their insect counterparts. Additionally, the authors describe previously unrecognized neuropils, such as the 'tonsillar neuropil,' whose function remains unknown but may belong to a larger central complex. The diverse patterns of neuromodulator immunoreactivity further suggest that plasticity plays a substantial role in central circuits.

      Weaknesses:

      My major concern, however, is that some of the authors' neuroanatomical descriptions rely too heavily on inference rather than what is currently resolvable from their immunohistochemistry stains alone.

      Comments on revisions:

      I thought that the authors did an excellent job responding to the reviews, and I have no further comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Artiushin et al. establish a comprehensive 3D atlas of the brain of the orb-web building spider Uloborus diversus. First, they use immunohistochemistry detection of synapsin to mark and reconstruct the neuropils of the brain of six specimens and they generate a standard brain by averaging these brains. Onto this standard 3D brain, they plot immunohistochemical stainings of major transmitters to detect cholinergic, serotonergic, octopaminergic/tyraminergic and GABAergic neurons, respectively. Further, they add information on the expression of a number of neuropeptides (Proctolin, AllatostatinA, CCAP, and FMRFamide). Based on this data and 3D reconstructions, they extensively describe the morphology of the entire synganglion, the discernible neuropils, and their neurotransmitter/neuromodulator content.

      Strengths:

      While 3D reconstruction of spider brains and the detection of some neuroactive substances have been published before, this seems to be the most comprehensive analysis so far, both in terms of the number of substances tested and the ambition to analyze the entire synganglion. Interestingly, besides the previously described neuropils, they detect a novel brain structure, which they call the tonsillar neuropil.

      Immunohistochemistry, imaging, and 3D reconstruction are convincingly done, and the data are extensively visualized in figures, schemes, and very useful films, which allow the reader to work with the data. Due to its comprehensiveness, this dataset will be a valuable reference for researchers working on spider brains or on the evolution of arthropod brains.

      Weaknesses:

      As expected for such a descriptive groundwork, new insights or hypotheses are limited, apart from the first description of the tonsillar neuropil. A more comprehensive labeling in the panels of the mentioned structures would help to follow the descriptions. The reconstruction of the main tracts of the brain would be a very valuable complementary piece of data.

      Reviewer #2 (Public review):

      Summary

      Artiushin et al. created the first three-dimensional atlas of a synganglion in the hackled orb-weaver spider, which is becoming a popular model for web-building behavior. Immunohistochemical analysis with an impressive array of antisera reveals subcompartments of neuroanatomical structures described in other spider species as well as two previously undescribed arachnid structures, the protocerebral bridge, hagstone, and paired tonsillar neuropils. The authors describe the spider's neuroanatomy in detail and discuss similarities and differences from other spider species. The final section of the discussion examines the homology between onychophoran and chelicerate arcuate bodies and mandibulate central bodies.

      Strengths

      The authors set out to create a detailed 3D atlas and accomplished this goal.

      Exceptional tissue clearing and imaging of the nervous system reveal the three-dimensional relationships between neuropils and some connectivity that would not be apparent in sectioned brains.

      A detailed anatomical description makes it easy to reference structures described between the text and figures.

      The authors used a large palette of antisera which may be investigated in future studies for function in the spider nervous system and may be compared across species.

      Weaknesses

      It would be useful for non-specialists if the authors would introduce each neuropil with some orientation about its function or what kind of input/output it receives, if this is known for other species. This applies especially to structures that are not described in other arthropods, like the opisthosomal neuropil. Are there implications for neuroanatomical findings in this paper on the understanding of how web-building behaviors are mediated by the brain?

      Likewise, where possible, it would be helpful to have some discussion of the implications of certain neurotransmitters/neuropeptides being enriched in different areas. For example, GABA would signal areas of inhibitory connections, such as inhibitory input to mushroom bodies, as described in other arthropods. In the discussion section on relationships between spider and insect midline neuropils, are there similarities in expression patterns between those described here and in insects?

      Reviewer #3 (Public review):

      Summary:

      This is an impressive paper that offers a much-needed 3D standardized brain atlas for the hackled-orb weaving spider Uloborus diversus, an emerging organism of study in neuroethology. The authors used a detailed immunohistological whole-mount staining method that allowed them to localize a wide range of common neurotransmitters and neuropeptides and map them on a common brain atlas. Through this approach, they discovered groups of cells that may form parts of neuropils that had not previously been described, such as the 'tonsillar neuropil', which might be part of a larger insect-like central complex. Further, this work provides unique insights into the previously underappreciated complexity of higher-order neuropils in spiders, particularly the arcuate body, and hints at a potentially important role for the mushroom bodies in vibratory processing for web-building spiders.

      Strengths:

      To understand brain function, data from many experiments on brain structure must be compiled to serve as a reference and foundation for future work. As demonstrated by the overwhelming success in genetically tractable laboratory animals, 3D standardized brain atlases are invaluable tools - especially as increasing amounts of data are obtained at the gross morphological, synaptic, and genetic levels, and as functional data from electrophysiology and imaging are integrated. Among 'non-model' organisms, such approaches have included global silver staining and confocal microscopy, MRI, and, more recently, micro-computed tomography (X-ray) scans used to image multiple brains and average them into a composite reference. In this study, the authors used synapsin immunoreactivity to generate an averaged spider brain as a scaffold for mapping immunoreactivity to other neuromodulators. Using this framework, they describe many previously known spider brain structures and also identify some previously undescribed regions. They argue that the arcuate body - a midline neuropil thought to have diverged evolutionarily from the insect central complex - shows structural similarities that may support its role in path integration and navigation.

      Having diverged from insects such as the fruit fly Drosophila melanogaster over 400 million years ago, spiders are an important group for study - particularly due to their elegant web-building behavior, which is thought to have contributed to their remarkable evolutionary success. How such exquisitely complex behavior is supported by a relatively small brain remains unclear. A rich tradition of spider neuroanatomy emerged in the previous century through the work of comparative zoologists, who used reduced silver and Golgi stains to reveal remarkable detail about gross neuroanatomy. Yet, these techniques cannot uncover the brain's neurochemical landscape, highlighting the need for more modern approaches-such as those employed in the present study.

      A key insight from this study involves two prominent higher-order neuropils of the protocerebrum: the arcuate body and the mushroom bodies. The authors show that the arcuate body has a more complex structure and lamination than previously recognized, suggesting it is insect central complex-like and may support functions such as path integration and navigation, which are critical during web building. They also report strong synapsin immunoreactivity in the mushroom bodies and speculate that these structures contribute to vibratory processing during sensory feedback, particularly in the context of web building and prey localization. These findings align with prior work that noted the complex architecture of both neuropils in spiders and their resemblance (and in some cases greater complexity) compared to their insect counterparts. Additionally, the authors describe previously unrecognized neuropils, such as the 'tonsillar neuropil,' whose function remains unknown but may belong to a larger central complex. The diverse patterns of neuromodulator immunoreactivity further suggest that plasticity plays a substantial role in central circuits.

      Weaknesses:

      My major concern, however, is that some of the authors' neuroanatomical descriptions rely too heavily on inference rather than what is currently resolvable from their immunohistochemistry stains alone.

      We would like to thank the reviewers for their time and effort in carefully reading our manuscript and providing helpful feedback, and particularly for their appreciation and realistic understanding of the scope of this study and its context within the existing spider neuroanatomical literature.

      Regarding the limitations and potential additions to this study, we believe these to be well-reasoned and are in agreement. We plan to address some of these shortcomings in future publications.

      As multiple reviewers remarked, a mapping of the major tracts of the brain would be a welcome addition to understanding the neuroanatomy of U. diversus. This is something which we are actively working on and hope to provide in a forthcoming publication. Given the length of this paper as is, we considered that a treatment of the tracts would be better served as an additional paper. Likewise, mapping of the immunoreactive somata of the currently investigated targets is a component which we would like to describe as part of a separate paper, keeping the focus of the current one on neuropils, in order to leverage our aligned volumes to describe co-expression patterns, which is not as useful for the more widely dispersed somata. Furthermore, while we often see somata through immunostaining, the presence and intensity of the signal are variable among immunoreactive populations. We are finding that these populations are more consistently and comprehensively revealed through fluorescent in situ hybridization.

      We appreciate the desire of the reviewers for further information regarding the connectivity and function of the described neuropils, and where possible we have added additional statements and references. That being said, where this context remains sparse, it is largely a reflection of the lack of information in the literature. This is particularly the case for functional roles for spider neuropils, especially higher order ones of the protocerebrum, which are essentially unexamined. As summarized in the quite recent update to Foelix’s Spider Neuroanatomy, a functional understanding of protocerebral neuropils is really only available for the visual pathway. Consequently, it is also difficult to speak of the implications of the presence or absence of particular signaling elements in these neuropils if no further information about the circuitry or behavioral correlates is available. Finally, multiple reviewers suggested that it might be worthwhile to explore a comparison of the arcuate body layer innervation to that of the central bodies of insects, of which there is a richer literature. This is an idea to which we were also initially attracted, and we have now added some lines to the discussion section. Our position on this is a cautious one, as a series of more recent comparative studies spanning many insect species and using the same antibody reveals a considerable amount of variation in central body layering even within this clade, which has given us pause in interpreting how substantive similarities and differences to the far more distant spiders would be. Still, this is an interesting avenue which merits an eventual comprehensive analysis, one which would certainly benefit from having additional examples from more spider species, in order to not overstate conclusions based on the currently limited neuroanatomical representation.

      Given our framing for the impetus to advance neuroanatomical knowledge in orb-web builders, the question of whether the present findings inform the circuitry controlling web-building is one that naturally follows. While we are unable with this dataset alone to define which brain areas mediate web-building (something which would likely be beyond any anatomical dataset lacking complementary functional data), the process of assembling the atlas has revealed structures and defined innervation patterns in previously ambiguous sectors of the spider brain, particularly in the protocerebrum. A simplistic proposal is that such regions, which are more conspicuous by our techniques and in this model species, would be good candidates for further inquiries into web-building circuitry, as their absence or oversight in past work could be attributable to the different behavioral styles of those model species. Regardless, the fact that such a hypothesis cannot be readily refuted by the existing neuroanatomical literature underscores the need for more finely refined models of the spider brain, to which we hope we have positively contributed. We are gratified by the reviewers’ enthusiasm for the strengths of this study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Brenneis 2022 has done a very nice and comprehensive study focused on the visual system - this might be worth including.

      Thank you, we have included this reference on Line 34.

      (2) L 29: When talking about "connectivity maps", the emerging connectomes based on EM data could be mentioned.

      Additional references have been added, thank you. Line 35.

      (3) L 99: Please mention that you are going to describe the brain from ventral to dorsal.

      Thank you, we have added a comment to Line 99.

      (4) L 13: is found at the posterior.

      Thank you, revised.

      (5) L 168: How did you pick those two proctolin+ somata, given that there is a lot of additional punctate signal?

      Although not visible in this image, if you scroll through the stack there is a neurite which extends from these neurons directly to this area of pronounced immunoreactivity.

      (6) Figure 1: Please add the names of the neuropils you go through afterwards.

      We have added labels for neuropils which are recognizable externally.

      (7) Figure 1 and Figure 5: Please mark the esophagus.

      Label has now been added to Figure 1. In Figure 5, the esophagus should not really be visible because these planes are just ventral to its closure.

      (8) Figure 5A: I did not see any CCAP signal where the arrow points to; same for 5B (ChAT).

      In hindsight, the CCAP point is probably too minor to be worth mentioning, so we have removed it.

      The ChAT signal pattern in 5B has been reinforced by adding a dashed circle to show its location as well.

      (9) L 249: Could the circular spot also be a tract (many tracts lack synapsin - at least in insects)?

      Yes, thank you for pointing this out – the sentence is revised (L274). We are currently further analyzing anti-tubulin volumes and it seems that indeed there are tracts which occupy these synapsin-negative spaces, although interestingly they do not tend to account for the entire space.

      (10) L 302: Help me see the "conspicuous" thing.

      Brace added to Fig. 8B, note in caption.

      (11) L 315: Please first introduce the number of the eyes and how these relate to the 1° and 2° pathways. Are these separate pathways from separate eyes or two relay stations of one visual pathway?

      We have expanded the introduction to this section (L336). Yes, these are considered as two separate visual pathways, with a typical segregation of which eyes contribute to which pathway – although there is evidence for species-specific differences in these contributions. In the context of this atlas, we are not currently able to follow which eyes are innervating which pathway.

      (12) L 343: It seems that the tonsillar neuropil could be midline spanning (at least this is how I interpret the signal across the midline). Would it make sense to re-formulate from a paired structure to midline-spanning? Would that make it another option for being a central complex homolog?

      In the spectrum from totally midline spanning and unpaired (e.g., arcuate body (at least in adults)) to almost fully distinct and paired (e.g., mushroom bodies (although even here there is a midline spanning ‘bridge’)), we view the tonsillar to be more paired due to the oval components, although it does have a midline spanning section, particularly unambiguous just posterior to the oval sections.

      Regarding central complex homology, if the suggestion is that the tonsillar with its midline spanning component could represent the entire central complex, then this is a possibility, but it would neglect the highly innervated and layered arcuate body, which we think represents a stronger contender – at least as a component of the central complex. For this reason, we would still be partial to the possibility that the tonsillar is a part of the central complex, but not the entire complex.

      (13) L 407: ...and dorsal (..) lobe...

      Added the word ‘lobe’ to this sentence (L429).

      (14) L 620ff: Maybe mention the role of MBs in learning and memory.

      A reference has been added at L661.

      (15) L 644: In the context of arcuate body homology with the central body, I was missing a discussion of the neurotransmitters expressed in the respective parts in insects. Would that provide additional arguments?

      This is an interesting comparison to explore, and is one that we initially considered making as well. There are certainly commonalities that one could point to, particularly in trying to build the case of whether particular lobes of the arcuate body are similar to the fan-shaped or ellipsoid bodies in insects. Nevertheless, something which has given us pause is studying the more recent comparative works between insect species (Timm et al., 2021, J Comp Neuro, Homberg et al., 2023, J Comp Neuro), which also reveal a fair degree of heterogeneity in expression patterns between species – and this is despite the fact that the neuropils are unambiguously homologous. When comparing to a much more evolutionarily distant organism such as the spider, it becomes less clear which extant species should serve as the best point of comparison, and therefore we fear making specious arguments by focusing on similarities when there are also many differences. We have added some of these comments to the discussion (L699-725).

      Throughout the text, I frequently had difficulties in finding the panels right away in the structures mentioned in the text. It would help to number the panels (e.g., 6Ai, Aii, Aiii, etc.) and refer to those in the text. Further, all structures mentioned in the text should be labelled with arrows/arrowheads unless they are unequivocally identified in the panel.

      Thank you for the suggestion. We have adopted the additional numbering scheme for panels, and added additional markers where suggested.

      Reviewer #2 (Recommendations for the authors):

      (1) L 18: "neurotransmitter" should be pluralized.

      Thank you, revised (L18).

      (2) L 55: Missing the word "the" before "U. diversus".

      Thank you, revised (L57).

      (3) L 179: Change synaptic dense to "synapse-dense".

      Thank you, revised (L189).

      (4) L 570: "present in" would be clearer than "presented on in".

      Our intention here was to say that Loesel et al did not show slices from the subesophageal mass for CCAP, so it was ambiguous as to whether it had immunoreactivity there but they simply did not present it, or if it indeed doesn’t show signal in the subesophageal. But agreed, this is awkward phrasing which has been revised (L606-608), thank you.

      (5) L 641: It would be worth noting that the upper and lower central bodies are referred to as the fan-shaped and ellipsoid bodies in many insects.

      Thank you, this has been added in L694.

      (6) L 642: Although cited here regarding insect central body layers, Strausfeld et al. 2006 mainly describe the onychophoran brain and the evolutionary relationship between the onychophoran and chelicerate arcuate bodies. The phylogenetic relationships described here would strengthen the discussion in the section titled "A spider central complex?"

      The phylogenetic relationship of onychophorans and chelicerates remains controversial and therefore we find it tricky to use this point to advance the argument in that discussion section, as one could make opposing arguments. The homology of the arcuate body (between chelicerates, onychophorans, and mandibulates) has likewise been argued over, with this Strausfeld et al paper offering one perspective, while others are more permissive (good summary at end of Doeffinger et al., 2010). Our thought was simply to draw attention to grossly similar protocerebral neuropils in examples from distantly related arthropods, without taking a stance, as our data doesn’t really deeply advance one view over the other.

      (7) L 701: Noduli have been described in stomatopods (Thoen et al., Front. Behav. Neurosci., 2017).

      This is an important addition, thank you – it has been incorporated and cited (L766).

      (8) Antisera against DC0 (PKA-C alpha) may distinguish globuli cells from other soma surrounding the mushroom bodies, but this may be accomplished in future studies.

      Agreed, this is something we have been interested in, but have not yet acquired the antibody.

      Reviewer #3 (Recommendations for the authors):

      Overall, this paper is both timely and important. However, it may face some resistance from classically trained arthropod neuroanatomists due to the authors' reliance on immunohistochemistry alone. A method to visualize fiber tracts and neuropil morphology would have been a valuable and grounding complement to the dataset and can be added in future publications. Tract-tracing methods (e.g., dextran injections) would strengthen certain claims about connectivity - particularly those concerning the mushroom bodies. For delineating putative cell populations across regions, fluorescence in situ hybridization for key transcripts would offer convincing evidence, especially in the context of the arcuate body, the tonsillar neuropil, and proposed homologies to the insect central complex.

      That said, the dataset remains rich and valuable. Outlined below are a number of issues the authors may wish to address. Most are relatively minor, but a few require further clarification.

      (1) Abstract

      (a) L 12-14: The authors should frame their work as a novel contribution to our understanding of the spider brain, rather than solely as a tool or stepping stone for future studies. The opening sentences currently undersell the significance of the study.

      Thank you for your encouragement! We have revised the abstract.

      (b) Rather than touting "first of its kind" in the abstract, state what was learned from this.

      Thank you, we have revised the abstract.

      (c) The abstract does not mention the major results of the study. It should state which brain regions were found. It should list all of the peptides and transmitters that were tested so that they can be discoverable in searches.

      Thank you, revised.

      (2) Introduction

      (a) L 38: There's a more updated reference for Long (2016): Long, S. M. (2021). Variations on a theme: Morphological variation in the secondary eye visual pathway across the order of Araneae. Journal of Comparative Neurology, 529(2), 259-280.

      Thank you, this has been updated (L41 and elsewhere).

      (b) L 47: While whole-mount imaging offers some benefits, a downside is the need for complete brain dissection from the cuticle, which in spiders likely damages superficial structures (such as the secondary eye pathways).

      True – we have added this caveat to the section (L48-51).

      (c) L 49-52: If making this claim, more explicit comparisons with non-web building C. salei in terms of neuropil presence, volume, or density later in the paper would be useful.

      We do not have the data on hand to make measured comparisons of C. salei structures, and the neuropils identified in this study are not clearly identifiable in the slices provided in the literature, so would likely require new sample preparations. We’ve removed the reference to proportionality and softened this sentence slightly – we are not trying to make a strong claim, but simply state that this is a possibility.

      (3) Results

      (a) The authors should state how they accounted for autofluorescence.

      While we did not explicitly test for autofluorescence, the long process of establishing a working whole-mount immuno protocol and testing antibodies produced many examples of treated brains which did not show any substantial signal. We have added a note to the methods section (L866).

      (b) L 69: There is some controversy in delineating the subesophageal and supraesophageal mass as the two major divisions despite its ubiquity in the literature. It might be safer to delineate the protocerebrum, deutocerebrum, and fused postoral ganglia (including the pedipalp ganglion) instead.

      Thank you for this insight, we have modified the section, section headings and Figure 1 to account for this delineation as well. We have chosen to include both ways of describing the synganglion, in order to maintain a parallel with the past literature, and to be further accessible to non-specialist readers. L73-77

      (c) L 90: It might be useful to include a justification for the use of these particular neuropeptides.

      Thank you, revised. L97-99.

      (d) L 106 - 108: It is stated that the innervation pattern of the leg neuropils is generally consistent, but from Figure 2, it seems that there are differences. The density of 5HT, Proctolin, ChAT, and FMRFamide seems to be higher in the posterior legs. AstA seems to have a broader distribution in L1 and is absent in L4.

      We would still stand by the generalization that the innervation pattern is fairly similar for each leg. The L1 neuropils tend to be bigger than the posterior legs, which might explain the difference in density. Another important aspect to keep in mind is that not all of the leg neuropils appear at the exact same imaging plane as we move from ventral to dorsal. If you scroll through the synapsin stack (ventral to dorsal), you will see that L2 and L3 appear first, followed shortly by L1, and then L4, and at the dorsal end of the subesophageal they disappear in the opposite order. The observations listed here are true for the single z-plane in Figure 2, but the fact that they don’t appear at the same time seems to mainly account for these differences. For example, if you scroll further ventrally in the AstA volume, you will see a very similar innervation appear in L4 as well, even though it is absent in the Fig. 2 plane. We plan to have these individual volumes available from a repository so that they can be individually examined to better see the signal at all levels. At the moment, the entire repository can be accessed here: https://doi.org/10.35077/ace-moo-far.

      (e) Figure 1 and elsewhere: The axes for the posterior and lateral views show Lateral and Medial. It would be more accurate to label them Left and Right, because the current labeling does not define the medial-to-lateral axis. The medial direction is correct for only one hemiganglion, and it's the opposite for the contralateral side.

      Thank you, revised.

      (f) In Figures that show particular sections, it might be helpful to include a plane in the standard brain to illustrate where that section is.

      Yes, we agree and it was our original intention. It is something we can attempt to do, but there is not much room in the corners of many of the synapsin panels, making it harder to make the 3D representation big enough to be clear.

      (g) Figure 2, 3: Presenting the z-section stack separately in B and C is awkward because it makes it seem that they are unrelated. I think it would be better to display the z160-190 directly above its corresponding z230-260 for each of the exemplars in B and C. Since there's no left-right asymmetry, a hemibrain could be shown for all examples as was done for TH in D. It's not clear why TH was presented differently.

      Thank you for this suggestion. We rearranged the figure as described, but ultimately still found the original layout to be preferable, in part because the labelling becomes too cramped. We hope that the potential confusion of the continuity of the B and C sections will be mitigated by focusing on the z plane labels and overall shape – which should suggest that the planes are not far from each other. We trust that the form of the leg neuropils is recognizable in both B and C synapsin images, and so readers will make the connection.

      Regarding TH, this panel is apart from the rest because we were unable to register the TH volume to the standard brain because the variant of the protocol which produced good anti-TH staining conflicted with synapsin, and we could not simultaneously have adequate penetration of the synapsin signal. We did not want to align the TH panel with the others to avoid potential confusion that this was a view from the same z-plane of a registered volume, as the others are. We have added a note to the figure caption.

      (h) The locations of the labels should be consistent. The antisera are below the images in Figure 2, above in Figure 3, and to the bottom left in Figure 5. The slices are shown above in Figure 2 and below in Figure 3.

      Thank you, this has been revised for better consistency.

      (i) It is surprising to me that there is no mention of the neuronal somata visible in Figure 2 and Figure 3. A typical mapping of the brain would map the locations of the neurons, not just the neuropils.

      Our first arrangement of this paper described each immunostain individually from ventral to dorsal, including locations of the immunoreactive somata which could be observed. To aid the flow of the paper and leverage the aligned volumes to emphasize co-expression in the functional divisions of the brain, we re-formulated to this current layout which is organized around neuropils. Somata locations are tricky to incorporate in this format of the paper which focuses on key z-planes or tight max projections, because the relevant immunoreactive somata are more dispersed throughout the synganglion, not always overlapping in neighboring z-planes. Further, since only a minority of the antisera we used can reveal traceable projections from the supplying somata in the whole-mount preparation, we would be quite limited in the degree to which we could integrate the specific somata mapping with expression patterns in the neuropil. Finally, compared to immunostaining, which can be variable in staining intensity between somata for the same target, we find that FISH reveals these locations more clearly and comprehensively – so while we agree that this mapping would also be useful for the atlas, we would like to better provide this information in a future publication using whole-mount FISH.

      (j) L 139: There is a reference to a "brace" in Figure 3B, which does not seem to exist. There's one in Figure 3C.

      There is a smaller brace near the bottom of the TDC2 panel in Fig. 3B.

      (k) L 151 should be "3D".

      Thank you, revised (L160).

      (l) Figure 4C: It is not mentioned in the legend that the bottom inset is Proctolin without synapsin.

      Thank you, revised (L1213).

      (m) L 199: Are the authors sure this subdivision is solely on the anterior-posterior axis? Could it also be dorsal ventral? (i.e., could this be an artifact of the protocerebrum and deutocerebrum?)

      Yes, this division can be appreciated to extend somewhat in the dorsal-ventral axis and it is possible that this is the protocerebrum emerging after the deutocerebrum, although this area is largely dorsal to the obvious part of the deutocerebrum. In the horizontal planes there appears to be a boundary line which we use for this subdivision in order to assist in better describing features within this generally ventral part of the protocerebrum – referred to as “stalk” because it is thinner before the protocerebrum expands in size, dorsally. Our intention was more organizational, and as stated in the text, this area is likely heterogenous and we are not suggesting that it has a unified function, so being a visual artifact would not be excluded.

      (n) L 249: Could it also indicate large tracts projecting elsewhere?

      Yes, definitely, we have evidence that part of the space is occupied by tracts. Revised, thank you (L262).

      (o) L 281: Several investigators, including Long (2021), noted very large and robust mushroom bodies of Nephila.

      Thank you – the point is well taken that there are examples of orb-web builders that do have appreciable mushroom bodies. We have added a note in this section (L295), giving the examples of Deinopis spinosa and Argiope trifasciata (Figure 4.20 and 4.22 in Long, 2016).

      It looks like these species make the point better than Nephila, as Long lists the mushroom body percentage of total protocerebral volume for D. spinosa as 4.18% and for A. trifasciata as 2.38%, but doesn’t give a percentage for Nephila clavipes (Figure 4.24) and only labels the mushroom body structures as “possible” in the figure.

      In Long (2021), Nephilidae is described as follows: “In Nephilidae, I found what could be greatly reduced medullae at the caudal end of the laminae, as well as a structure that has many physical hallmarks of reduced mushroom bodies”

      (p) L 324: If the authors were able to stain for histamine or supplement this work with a different dissection technique for the dorsal structures, the visual pathways might have been apparent, which seems like a very important set of neuropils to include in a complete brain atlas.

      Yes, for this reason histamine has been an interesting target which we have attempted to visualize, but unfortunately have not yet been able to successfully stain for in U. diversus. An additional complication is that the antibodies we have seen call for glutaraldehyde fixation, which may make them incompatible with our approach to producing robust synapsin staining throughout the brain. 

      We agree that the lack of the complete visual pathway is a substantial weakness of our preparation, and should be amended in future work, but this will likely require developing a modified approach in order to preserve these delicate structures in U. diversus.

      (q) L 331: Is this bulbous shape neuropil, or just the remains of neuropil that were not fully torn away during dissection?

      This certainly is a severed part of the primary pathway, although it seems more likely that the bulbous shape is indicative of a neuropil form, rather than just being a happenstance shape that occurred during the breakage. We have examples where the same bulbous shape appears on both sides, and in different brains. It is possible that this may be the principal eye lamina – although we did not see co-staining with expected markers in examples where it did appear, so cannot be sure.

      (r) L 354: Is tyraminergic co-staining with the protocerebral bridge enough evidence to speculate that inputs are being supplied?

      We agree that this is not compelling, and have removed the statement.

      (s) L 372: This whole structure appears to be a previously described structure in spiders, the 'protocerebral commissure'.

      We are reasonably sure that what we are calling the protocerebral bridge (PCB) is a distinct structure from the protocerebral commissure (PCC). In Babu and Barth’s (1984) horizontal slice (Fig. 11b), you can see the protocerebral commissure immediately adjacent to the mushroom body bridge. It is found similarly located in other species, as can be seen in the supplementary 3D files provided by Steinhoff et al. (2024).

      While not visible with synapsin in U. diversus, we likewise can make out a commissure in this area in close proximity to the mushroom body bridge using tubulin staining. What we are calling the protocerebral bridge is a structure which is much more dorsal to the protocerebral commissure, not appearing in the same planes as the MB bridge.

      (t) L 377: Do you have an intuition why the tonsillar neuropil and the protocerebral bridge would show limited immunoreactivity, while the arcuate body's is quite extensive?

      This is an interesting question. Given the degree of interconnection and the fact that multiple classes of neurons in insects will innervate both the central body and the PCB or noduli, perhaps it would be expected that expression in the tonsillar neuropil and protocerebral bridge should be commensurate with the innervation of the arcuate body by that particular neurotransmitter-expressing population. Apart from the fact that the arcuate body is just bigger, perhaps this points to a greater role of the arcuate body in integration, whereas the tonsillar neuropil and PCB may engage in more particular processing, or be limited to certain sensory modalities.

      Interestingly, it seems that this pattern of more limited immunoreactivity in the PCB and noduli compared with the central bodies (fan-shaped/ellipsoid) also appears in insects (Kahsai et al., 2010, J Comp Neuro; Timm et al., 2021, J Comp Neuro; Homberg et al., 2023, J Comp Neuro) – particularly, with almost every target having at least some layering in the fan-shaped body (Kahsai et al., 2010, J Comp Neuro). For example, serotoninergic innervation is fairly consistently seen in the upper and lower central bodies across insects, but its presence in the PCB or noduli is more variable – appearing in one or the other in a species-dependent manner (Homberg et al., 2023, J Comp Neuro).

      (4) Discussion

      (a) L 556: But if confocal images from slices are aligned, is the 3D shape not preserved?

      Yes, fair enough – the point we wanted to make was that there is still a limitation in z resolution depending on the thickness of the slices used, which could obscure structures, but perhaps this is too minor of a comment.

      (b) L 597: This is a very interesting result. I agree it's likely to do with the processing of mechanosensory information relevant to web activities, and the mushroom body seems like the perfect candidate for this.

      (c) L 638: Worth noting that neuropil volume vs density of synapses might play a role in this, as the literature is currently a bit ambiguous with regards to the former.

      Thank you, noted (L689).

      (d) L 651: The latter seems far more plausible.

      Agreed, though the presence of mushroom bodies appears to be variable in spiders, so we didn’t want to take a strong stance, here.

    1. eLife Assessment

      This valuable study addresses T cell receptor activation during autoreactive T cell development and how the strength of T cell receptor engagement in naïve cells can predispose T cells to develop into effector/memory T cells. The authors lead with solid results that are largely consistent with data in the field suggesting that, in comparison to their counterparts with relatively lower basal self-reactivity, naive CD5hi CD8 T cells in non-obese diabetic (NOD) mice are poised for activation. They propose that diabetogenic T cells are preferentially found among the naive CD5hi CD8 T cell population. While the evidence does not fully support all the authors' conclusions, the data provide a foundation that sets up future studies.

    2. Reviewer #1 (Public review):

      Summary

      In their manuscript, Ho and colleagues investigate the importance of thymically-imprinted self-reactivity in determining CD8 T cell pathogenicity in non-obese diabetic (NOD) mice. The authors describe pre-existing functional biases associated with naive CD8 T cell self-reactivity based on CD5 levels, a well characterized proxy for T cell affinity to self-peptide. They find that naive CD5hi CD8 T cells are poised to respond to antigen challenge; these findings are largely consistent with previously published data on the C57Bl/6 background. The authors go on to suggest that naive CD5hi CD8 T cells are more diabetogenic as 1) the CD5hi naive CD8 T cell receptor repertoire has features associated with autoreactivity and contains a larger population of islet-specific T cells, and 2) the autoreactivity of "CD5hi" monoclonal islet-specific TCR transgenic T cells cannot be controlled by phosphatase over-expression. Thus, they implicate CD8 T cells with relatively higher levels of basal self-reactivity in autoimmunity. The data presented offers valuable insights and sets the foundation for future studies, but some conclusions are not yet fully supported.

      Specific comments

      There is value in presenting phenotypic differences between naive CD5lo and CD5hi CD8 T cells in the NOD background as most previous studies have used T cells harvested from C57Bl/6 mice or peripheral blood from healthy human donors.

      The comparison of a marker of self-reactivity, CD5 in this case, on broad thymocyte populations (DN/DP/CD8SP) is cautioned. CD5 is upregulated with signals associated with β-selection and positive selection; CD5 levels will thus vary even among subsets within these broad developmental intermediates. This is a particularly important consideration when comparing CD5 across thymic intermediates in polyclonal versus TCR transgenic thymocytes due to the striking differences in thymic selection efficiency, resulting in different developmental population profiles. The higher levels of CD5 noted in the DN population of NOD8.3 mice, for example, are likely due to the shift towards more mature DN4 post-β-selection cells. Similarly, in the DP population, the larger population of post-positive selection cells in the NOD8.3 transgenic thymus may also skew CD5 levels significantly. Overall, the reported differences between NOD and NOD8.3 thymocyte subsets could be due largely to differences in differentiation/maturation stage rather than affinity for self-antigen during T cell development. The authors have added some additional text to the revised manuscript that acknowledges some of these limitations.

      The lack of differences in CD5 levels of post-positive selection DP thymocytes, CD8 SP thymocytes, and CD8 T cells in the pancreas draining lymph nodes from NOD vs NOD8.3 mice also raises questions about the relevance of this model to address the question of basal self-reactivity and diabetogenicity and the authors' conclusion that "intrinsic high CD5-associated self-reactivity in NOD8.3 T cells overrides the transgenic Pep-mediated protection observed in dLPC/NOD mice"; the phenotype of the polyclonal and NOD8.3 TCR transgenic CD8 T cells that were analyzed in the (spleen and) pancreas draining lymph nodes is not clear (i.e., are these gated on naive T cells?). Furthermore, the rationale for the comparison with NOD-BDC2.5 mice that carry an MHC II-restricted TCR is unclear.

      In reference to the conclusion that transgenic Pep phosphatase does not inhibit the diabetogenic potential of "CD5hi" CD8 T cells, there is some concern that comparing diabetes development in mice receiving polyclonal versus TCR transgenic T cells specific for an islet antigen is not appropriate. The increased frequency and number of antigen specific T cells in the NOD8.3 mice may be responsible for some of the observed differences. Further justification for the comparison is suggested.

      The manuscript presents an interesting observation that TCR sequences from CD5hi CD8 T cells may share certain characteristics with diabetogenic T cells found in patients (e.g., CDR3 length), and that autoantigen-specific T cells may be enriched within the CD5hi naive CD8 T cell population. However, the percentage of tetramer-positive cells among naive CD8 T cells appears unusually high in the data presented, and caution is warranted when comparing additional T cell receptor features of self-reactivity/auto-reactivity between CD4 and CD8 T cells.

      The counts for the KEGG enrichment pathways presented are relatively low, and the robustness of the analysis should be carefully considered, particularly given that several significance values appear borderline. That said, the differentially expressed genes among CD5lo and CD5hi CD8 T cells are generally consistent with previously published datasets.

      The manuscript includes some imprecise wording that may be misleading. For example (not exhaustive): The strength of TCR reactivity to foreign antigen is not "contributed by basal TCR signal" per se but rather correlates with sub-threshold TCR signals necessary for T cell development and survival, CD5 is not broadly expressed on all B cells as the text might suggest but is restricted to a specific subset of B cells, some of the proximal signaling molecules downstream of the preTCR are different than for the mature TCR, upregulation of CD127 at early timepoints post T cell activation is not directly suggestive of their "heightened capabilities in memory T cell homeostasis", etc. The statement "Our study exclusively examined female mice because the disease modeled is relevant in females" should be reconsidered. While the use of female NOD mice can be justified by their higher incidence of diabetes than their male counterparts, the current wording could be misleading.

      For clarity and transparency, please consider that, while additional information is provided in the revised manuscript, gating strategies are not always clear (i.e., naive versus total CD8 T cells), and the age/status of the mice from which cells are harvested (i.e., prediabetic?) is not consistently provided, as far as this reviewer noted.

    3. Reviewer #2 (Public review):

      Summary:

      In this study Chia-Lo Ho et al. study the impact of CD5high CD8 T cells in the pathophysiology of type 1 diabetes (T1D) in NOD mice. The authors used high expression of CD5 as a surrogate of high TCR signaling and self-reactivity and compared the phenotype, transcriptome, TCR usage, function and pathogenic properties of CD5high vs. CD5low CD8 T cells extracted from the so-called naive T cell pool. The study shows that CD5high CD8 T cells resemble memory T cells poised for stronger response to TCR stimulation and that they exacerbate disease upon transfer in RAG-deficient NOD mice. The authors attempt to link these features to the thymic selection events of these CD5high CD8 T cells. Importantly, forced overexpression of the phosphatase PTPN22 in T cells attenuated TCR signaling and reduced pathogenicity of polyclonal CD8 T cells but not highly autoreactive 8.3-TCR CD8 T cells.

      Strengths:

      The study is nicely performed and the manuscript is clearly and well written. Interpretation of the data is careful and fair. The data are novel and likely important. However, some issues would need to be clarified through either text changes or addition of new data.

      Weaknesses:

      The definition of naïve T cells based solely on CD44low and CD62Lhigh staining may be oversimplistic. Indeed, even within this definition naïve CD5high CD8 T cells express much higher levels of CD44 than CD5low CD8 T cells.

      Comments on revisions:

      The authors addressed my previous comments thoughtfully and extensively.

    4. Reviewer #3 (Public review):

      Summary:

      In this study, Ho et al. hypothesised that autoreactive T cells receiving enhanced TCR signals during positive selection in the thymus are primed for generating effector and memory T cells. They used CD5 as a marker for TCR signal strength during their selection at the double positive stage. Supporting their hypothesis, naïve T cells with high CD5 proliferated better and expressed markers of T cell activation compared to naïve T cells with lower levels of CD5. Furthermore, results showed that autoimmune diabetes can be efficiently induced after the transfer of naïve CD5 hi T cells compared to CD5 lo T cells. This provided solid evidence in support of their hypothesis that T cells receiving higher basal TCR signaling are primed to develop into effector T cells. However, all functional characterisation was done on the cells in the periphery and CD5 hi cells in the peripheral lymphoid compartment can receive tonic TCR signaling. Hence, the function of CD5 hi T cells might not be related to development and programming in the thymus. This is a major hurdle in the interpretation of the results and justifying the title of the study. The evidence that transgenic PTPN22 expression could not regulate T cell activation in CD5 hi TCR transgenic autoreactive T cells was weak. Studying T cell development in TCR transgenic mice and looking at TCR downstream signaling could be misleading due to transgenic expression of TCR at all developmental stages.

      Strengths:

      (1) Demonstrating that CD5 hi cells in naïve CD8 T cell compartment express markers of T cell activation, proliferation and cytotoxicity at a higher level

      (2) Using gene expression analysis, the study showed that CD5 hi cells among naïve CD8 T cells are transcriptionally poised to develop into effector or memory T cells.

      (3) The study showed that CD5 hi cells have higher basal TCR signaling compared to CD5 lo T cells.

      (4) Key evidence of pathogenicity of autoreactive CD5 hi T cells was provided by doing the adoptive transfer of CD5 hi and CD5 lo CD8 T cells into NOD Rag1-/- mice and comparing them.

      Weaknesses:

      (1) Although CD5 can be used as a marker for self-reactivity and T cell signal strength during thymic development, it can also be regulated in the periphery by tonic TCR signaling or when T cells are activated by their cognate antigen. Hence, TCR signals in the periphery could also prime the T cells towards effector/memory differentiation. For this reason, it cannot be concluded from the evidence presented here that this predisposition of T cells towards effector/memory differentiation is programmed by higher reactivity towards self-MHC molecules in the thymus, as stated in the title.

      (2) Flow cytometry data needs to be revisited for the gating strategy, biological controls and interpretation.

      (3) Evidence linking CD5 hi cells to more effector phenotype using gene enrichment scores is very weak.

      (4) Experiments done in this study did not address why CD5 hi T cells could be negatively regulated in NOD mice when PTPN22 is overexpressed resulting in protection from diabetes but the same cannot be achieved in NOD8.3 mice.

      (5) Experimental evidence provided to show that PTPN22 overexpression does not regulate TCR signaling in NOD8.3 T cells is weak.

      (6) TCR sequencing analysis does not conclusively show that CD5 hi population is linked with autoreactive T cells. Doing single-cell RNAseq and TCR seq analysis would have helped address this question.

      (7) When analysing data from CD5 hi T cells from the pancreatic lymph node, it is difficult to discriminate if the phenotype is just because of T cells that would have just encountered the cognate antigen in the draining lymph node or if it is truly due to basal TCR signaling.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Review #1 (Public review):

      Figures 1 through 4 contain data that largely recapitulate published findings (Fulton et al., 2015; Lee et al., 2024; Swee et al., 2016; Dong et al., 2021); it is noted that there is value in confirming phenotypic differences between naive CD5lo and CD5hi CD8 T cells in the NOD background. It is important to contextualize the data while being wary of making parallels with results obtained from CD5lo and CD5hi CD4 T cells. There should also be additional attention paid to the wording in the text describing the data (e.g., the authors assert that, in Figure 4C, the “CD5hi group exhibited higher percentages of CD8+ T cells producing TNF-α, IFN-γ and IL-2” though there is no difference in IL-2 nor consistent differences in TNF-α between the CD5lo and CD5hi populations).

      CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells have been previously characterized in other genetic backgrounds. In our study, we aimed to confirm and extend these observations specifically in the autoimmune-prone NOD background, which had not been systematically addressed. Additionally, we carefully reviewed the text describing Figure 4C and revised the wording to accurately reflect the observed data (lines 263–264). Specifically, we now state that the CD5<sup>hi</sup> group exhibited higher levels of IFN-γ and a trend toward increased TNF-α, while IL-2 production did not show a significant difference.

      The comparison of CD5 across thymocyte populations is cautioned due to variation in developmental stages, particularly in transgenic models. The reported differences may reflect maturation stages rather than self-reactivity.

      We appreciate the reviewer’s important point regarding the interpretation of CD5 levels across thymocyte subsets. In our revised manuscript (lines 455–471), we have added clarification that CD5 expression in DN and DP subsets reflects pre-TCR and TCR signaling events during thymic development. We also acknowledge that differences in maturation stages, especially in the NOD8.3 transgenic model, may influence CD5 expression. We now discuss this caveat and interpret our results with caution, particularly emphasizing that our data support but do not sufficiently define their differential self-reactivity.

      The conclusion that PTPN22 overexpression does not inhibit the diabetogenic potential of CD5<sup>hi</sup>CD8<sup>+</sup> T cells is potentially confounded by differences between polyclonal and TCR transgenic systems.

      We thank the reviewer for raising this concern. We acknowledge that this system introduces confounders due to differences in precursor frequencies and clonal expansion compared to polyclonal repertoires. These differences may affect the responsiveness to phosphatase-mediated attenuation of signaling. Therefore, while our results support that high-affinity autoreactive CD8<sup>+</sup> T cells may be less sensitive to PTPN22 overexpression, we do not claim that this finding generalizes to all autoreactive CD8<sup>+</sup> T cells. Rather, it highlights a potential failure of peripheral tolerance in T cells with strong intrinsic self-reactivity.

      TCR sequencing data shows variability; is this representative of the overall repertoire?

      We appreciate the reviewer’s comment. We acknowledge that data from bulk TCR sequencing has potential limitations, including variability across experiments and limited resolution at the clonotype level. To improve representativeness and reduce sampling bias, we performed TCR repertoire analysis in two independent experiments. In each experiment, naïve CD5<sup>hi</sup> CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells were sorted from pooled peripheral lymph nodes of at least 20 individual NOD mice per group. This approach allowed us to capture a broader range of clonotypes and ensured that the resulting repertoire profiles reflect the characteristics of the overall CD5<sup>hi</sup> and CD5<sup>lo</sup> populations, rather than isolated outliers. Despite some variability, we observed consistent trends in key features, such as shorter CDR3β length, altered TRAV/TRBV usage and reduced diversity in the CD5<sup>hi</sup> subset across both experiments. To enhance resolution and directly assess clonotype-specific reactivity, we plan to perform single-cell RNA and TCR sequencing in future studies, as noted in the revised Discussion (lines 466–471).

      Clarifications are requested regarding naive gating, controls, gMFI reporting, and missing methods.

      We thank the reviewer for these specific suggestions. We have revised figure legends to better describe gating strategies and included appropriate controls in Figures or Supplementary Figures. Regarding gMFI reporting, we now indicate in the figure legends whether values are reported as gMFI. Additionally, we have added the missing methods for cytokine staining, EdU incorporation, overlapped count matrix construction and TCR repertoire diversity metrics.

      Review #2 (Public review):

      Summary Comment:

      The study is nicely performed, but the definition of naive T cells using only CD44 and CD62L may be oversimplified. CD5hi naive T cells express higher CD44 than CD5lo cells.

      We thank the reviewer for the critical evaluation and thoughtful comment. As noted, we defined naïve CD8<sup>+</sup> T cells using a well-established gating strategy based on CD44<sup>lo</sup> and CD62L<sup>hi</sup> expression, consistent with previous studies (Immunity. 2010; 32(2):214–26; Nat Immunol. 2015; 16(1):107–17). We acknowledge that CD44 is expressed along a continuum, and indeed, within the naïve gate, CD5<sup>hi</sup> CD8<sup>+</sup> T cells exhibited slightly higher CD44 levels compared to their CD5<sup>lo</sup> counterparts. However, both subsets remained well below the CD44 expression observed in conventional effector/memory CD8<sup>+</sup> T cells, supporting their classification as naïve. To further validate this, we assessed additional markers associated with activation and memory differentiation, including CD69, PD-1, KLRG1 and CD25. These analyses confirmed that the sorted CD5<sup>hi</sup> and CD5<sup>lo</sup> populations retained a phenotypically naïve profile while exhibiting meaningful differences in baseline activation readiness (Figure 1F).

      Review #3 (Public review):

      CD5 can be regulated by peripheral signals. Therefore, it cannot be concluded that predisposition to effector/memory differentiation is solely programmed in the thymus.

      We thank the reviewer for this important point. We agree that CD5 expression can be dynamically regulated in the periphery by tonic TCR signals and antigen encounter, as also reflected in our own data showing that cells with high CD5 levels display elevated activation potential upon encountering antigen (e.g., Figure 3L). To minimize the confounding effects of pre-existing peripheral activation, we performed an adoptive T cell transfer experiment (Figure 4). In this experiment, naïve CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells were sorted from the peripheral lymph nodes of young (6–8-week-old) prediabetic NOD mice and transferred into NOD Rag1<sup>–/–</sup> recipients. After 4 weeks, we compared the disease phenotypes and functional profiles of CD8<sup>+</sup> T cells from these two groups. This approach allowed us to evaluate the stability and differentiation capacity of CD5<sup>hi</sup> versus CD5<sup>lo</sup> cells in a lymphopenic environment, while excluding the possibility that the observed differences were due to already activated CD8<sup>+</sup> T cells at the time of isolation. We have revised the Discussion (lines 440–450) to acknowledge these experimental limitations and clarify that, while our findings demonstrate functional differences between CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells, we cannot fully exclude contributions from peripheral influences.

      Experiments do not explain why PTPN22 overexpression protects in polyclonal T cells but not in NOD8.3 mice.

      We appreciate this critical comment. Our findings support that autoreactive T cells with high-affinity TCRs, as in NOD8.3 mice, receive such strong signaling that even PTPN22 overexpression is insufficient to attenuate their activation and effector function. We acknowledge that further mechanistic studies are needed to fully elucidate the differential effects of PTPN22 in polyclonal versus TCR-transgenic settings.

      Evidence that PTPN22 does not regulate TCR signaling in NOD8.3 T cells is weak.

      We thank the reviewer for this critical comment. Our data show that NOD8.3 T cells with an intrinsic high CD5-associated self-reactivity are more resistant to transgenic Pep-mediated change in the phosphorylation status of TCR signaling molecules CD3ζ and Erk and CD5 expression (Figure 6, B-D). However, we agree that additional functional assays would strengthen this conclusion.

      TCR sequencing does not conclusively link CD5hi cells with autoreactivity; single-cell analysis is needed.

      We agree with this critical comment. Bulk TCR sequencing revealed repertoire features associated with autoreactivity, but cannot definitively link specific TCRs to function. We have acknowledged this in the discussion (lines 466–471) and highlighted plans to perform single-cell analysis.

      CD5hi cells in the PLNs may reflect antigen exposure rather than basal signaling.

      We thank the reviewer for this insightful comment. As also noted in Figure 3L, CD5 expression can be influenced by peripheral tonic TCR signals and recent antigen exposure. To minimize the contribution of peripheral activation, we specifically characterized naïve CD8<sup>+</sup> T cells isolated from the peripheral lymph nodes of young (6–8-week-old) prediabetic NOD mice before the onset of overt autoimmunity. Furthermore, we performed an adoptive transfer experiment (Figure 4) using sorted naïve CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells from these mice; after 4 weeks in lymphopenic NOD Rag1<sup>–/–</sup> recipients, we characterized the disease phenotype and evaluated the effector function of CD8<sup>+</sup> T cells. This approach allowed us to compare the differentiation potential of these subsets in a controlled setting, independent of their activation status at the time of isolation. We have revised the Discussion (lines 440–450) to emphasize that, while our data support functional differences between CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells, we cannot fully exclude the role of peripheral cues in shaping CD5 expression.

      Provide proper gating controls and representative flow plots.

      We thank the reviewer for this comment. We have revised figure legends to better describe gating strategies and included representative flow cytometry plots and appropriate gating controls in Figures or Supplementary Figures.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The authors):

      (1) The figure presentation is inconsistent and the labels/font are often too small to read easily.

      As the Reviewer suggested, the figure presentation has been revised for consistency. Labels and fonts have been adjusted for improved readability. Specific figures that were difficult to read have been reformatted with larger fonts and clearer legends.

      (2) A careful review of the text to ensure clarity of the content is suggested (e.g., “gratitude” at line 91, “were generally lied” at line 123).

      We thank the Reviewer for these comments. The text has been carefully reviewed for clarity and grammatical accuracy. Corrections have been made, including changing “gratitude” to “magnitude” (line 47) and “were generally lied” to “fell between” (line 79).

      Reviewer #2 (Recommendations For The Authors):

      (1) The definition of naïve T cells based solely on CD44low and CD62Lhigh staining may be oversimplistic. Indeed, even within this definition, naïve CD5high CD8 T cells express much higher levels of CD44 than CD5low CD8 T cells.

      We thank the Reviewer for these comments. We used a literature-supported gating strategy (Immunity. 2010; 32(2):214–26; Nat Immunol. 2015; 16(1):107–17) to define naïve T cells based on CD44<sup>low</sup> and CD62L<sup>high</sup> expression. It is important to note that CD44 expression exists along a continuum. While we were initially surprised to observe that CD5<sup>hi</sup>CD8<sup>+</sup>T cells expressed relatively higher levels of CD44 than CD5<sup>lo</sup>CD8<sup>+</sup>T cells within the naïve gate, both populations still exhibited significantly lower CD44 expression compared to conventional effector/memory CD8<sup>+</sup>T cells. To further validate the distinction between CD5<sup>hi</sup> and CD5<sup>lo</sup> subsets, we also examined additional markers such as CD69, PD1, KLRG1 and CD25, which supported their phenotypic differences within the naïve compartment (Figure 1F).

      (2) Figure 1G should show the proportion of IGRP-tetramer+ in the three groups of CD8 T cells. Additionally, it would be useful to assess reactivity against a pool of other islet autoantigens using a similar strategy.

      As suggested by the reviewer, the revised manuscript now includes additional data showing the proportion of IGRP-tetramer<sup>+</sup> cells (Supplementary Figure 1D), as well as reactivity against another islet autoantigen, insulin-1/insulin-2 (Insulin B15–23) (Supplementary Figure 1E). The description of these results, including the proportions of IGRP-tetramer<sup>+</sup> and Insulin B15–23<sup>+</sup> CD8<sup>+</sup> T cells, has been added to lines 126–129 of the revised manuscript.

      (3) The resolution of Figure 2 is suboptimal and at places poorly visible. Figure 2D is stated to show “two significant pathways stand out.” In fact, the data are barely significant, and the authors may want to correct their statement.

      The resolution of Figure 2 has been improved. As the Reviewer suggested, the text has been revised to state “two potential pathways stand out” (line 187) instead of “two significant pathways stand out”.

      (4) Figure 3C-F and 3H, showing fold change over baseline values would be much easier for the reader to grasp the data.

      As the Reviewer suggested, data in Figures 3C-F and 3H are now shown as fold change over baseline values for clarity. Baseline gMFI is the mean of each group (total CD8<sup>+</sup>, CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup>) at 0 μg/ml anti-CD3, with fold changes calculated for stimulation conditions (0.625–10 μg/ml anti-CD3). The figure legend has been updated accordingly.
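
      The normalization described above can be illustrated with a minimal sketch. It assumes each group's gMFI readings are stored per anti-CD3 dose; the function name, data layout, and all numeric values are hypothetical and are not taken from the manuscript.

```python
from statistics import mean


def fold_change_over_baseline(gmfi_by_dose, baseline_dose=0.0):
    """Divide every stimulated gMFI by the group's mean gMFI at baseline.

    gmfi_by_dose maps an anti-CD3 dose (ug/ml) to a list of replicate gMFI
    values for one group (e.g. CD5hi CD8+). The baseline dose itself is
    left out of the result.
    """
    baseline = mean(gmfi_by_dose[baseline_dose])
    if baseline == 0:
        raise ValueError("baseline gMFI is zero; fold change is undefined")
    return {
        dose: [value / baseline for value in values]
        for dose, values in gmfi_by_dose.items()
        if dose != baseline_dose
    }


# Hypothetical replicate gMFI values for a single group.
cd5_hi = {
    0.0: [100.0, 120.0, 80.0],   # unstimulated baseline, mean = 100
    0.625: [150.0, 200.0],
    10.0: [400.0, 500.0],
}
fold = fold_change_over_baseline(cd5_hi)
print(fold)
```

      Because each group is divided by its own baseline mean, differences in resting expression between groups are removed and only the response to stimulation remains visible.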

      (5) Figure 4A, it would be much more valuable to show the diabetes frequency upon transfer of CD25- CD4 T cells alone and upon transfer of CD5high CD8 T cells alone. The word “spontaneous” in the Figure 4A legend seems inappropriate.

      Thanks for the Reviewer’s comment. We apologize for not including the data for the CD25<sup>–</sup>CD4<sup>+</sup> T cell transfer group in the original manuscript. While this group was part of our initial experimental design, we had considered it a control group and unintentionally omitted it from the figure. The revised manuscript now includes this group in Figure 4A. In addition, the term “spontaneous” has been replaced with “diabetes incidence” in the Figure 4A legend and manuscript (line 248). Regarding the suggestion to assess transfer of CD5<sup>hi</sup>CD8<sup>+</sup>T cells alone, we appreciate the Reviewer’s point. However, previous studies have shown that CD8<sup>+</sup> T cells alone are not sufficient to induce diabetes in adoptive transfer models, and that effective β-cell destruction typically requires both CD4<sup>+</sup> and CD8<sup>+</sup> T cell subsets. For instance, Christianson et al. (1993) demonstrated that enriched CD8<sup>+</sup> T cells from NOD mice fail to transfer diabetes on their own, while CD4<sup>+</sup> T cells, particularly from diabetic donors, can induce disease only under specific conditions and are significantly potentiated by co-transfer of CD8<sup>+</sup> cells. These findings have contributed to the widely accepted standard of co-transferring both subsets when studying diabetogenic potential in NOD models (Diabetes. 1993;42(1):44–55).

      (6) Line 257-258, please remove “indicating superior in vivo proliferation by the CD5hi subset.” Indeed, several other possibilities may explain the phenotype, including survival, migration, etc.

      As the Reviewer suggested, the phrase “indicating superior in vivo proliferation by the CD5<sup>hi</sup> subset” has been replaced with “implying increased expansion and activation/effector potential” (line 261).

      (7) Figure 5A, it is unclear to this referee what is the significance of CD5 and pCD3zeta expression on DN thymocytes. Do these cells express rearranged alpha/beta TCR? Is it signaling through pre-TCRalpha/TCRbeta pairs?

      Thanks a lot for this important question. In the revised manuscript, we have expanded the discussion (line 455–471) to address the developmental significance of CD5 and pCD3ζ expression on DN thymocytes. CD5 expression at this stage reflects pre-TCR signaling strength during early selection, which occurs following successful TCRβ rearrangement. The associated phosphorylation of CD3ζ indicates activation of downstream signaling through the pre-TCRα/TCRβ complex. As discussed in the revised text, these early signals play a critical role in determining lineage progression and self-reactivity tuning. We now acknowledge that signaling at the DN stage occurs through the pre-TCRα/TCRβ heterodimer, not a fully rearranged αβ TCR, and that CD5 expression serves as a marker of the strength of these initial pre-selection signals (Sci Signal. 2022;15(736):eabj9842.). These developmental checkpoints are essential for calibrating TCR sensitivity and ensuring proper thymocyte maturation. This has been clarified in the revised discussion (line 455–471).

      (8) Figure 5F, could the DP TCRbeta- CD69- thymocytes from 8.3-TCR NOD mice already express low levels of the self-reactive TCR at this stage to explain their high expression of CD5? Addressing the question experimentally would be useful.

      Thanks a lot for this useful comment. According to a review by Huseby et al. (2022), expression of a functional TCRβ chain begins at the DN3 stage, initiating progression through the β-selection checkpoint. This is followed by TRAV locus recombination, resulting in the generation of αβ TCR-expressing double-positive 1 (DP-1) thymocytes. At the DP-1 stage, the quality of TCR signaling driven by self-pMHC interactions governs both positive and negative selection, as well as the development of nonconventional T cell lineages. We hypothesize that in transgenic NOD8.3 mice, which express pre-rearranged Tcra and Tcrb transgenes derived from the islet-reactive CD8<sup>+</sup>T cell clone NY8.3, thymocytes undergo allelic exclusion and lack the clonal diversity seen in non-transgenic mice. As a result, NOD8.3 thymocytes may receive strong TCR signals from early developmental stages (DN3 and DP-1) even without undergoing normal selection checkpoints. While the elevated TCR signal observed in NOD8.3 is indeed artificial, this model provides a unique system to test our hypothesis—namely, whether a strongly self-reactive TCR can generate high basal signaling during thymic development that overrides the negative regulatory effects of phosphatases like Pep. This possibility has been acknowledged in the revised Discussion section, along with a plan to validate the hypothesis experimentally (line 455–471).

      (9) Figure 7, single-cell TCR-seq would be much more appropriate to tackle the question of self-reactivity of CD5hi vs. CD5low CD8 T cells.

      Thanks a lot for this useful comment. The limitations of bulk TCR-seq are acknowledged, and single-cell TCR-seq is proposed as a future direction (line 455–471).

      Note, for Reviewer #2 (Recommendations For The Authors) (7) (8) (9), the discussion paragraphs are included to address the reviewers’ questions (line 455–471).

      Reviewer #3 (Recommendations For The Authors):

      (1) Positive controls (activated T cells from PLN or spleen), gating controls (whole naïve T cells), and representative flow-cytometry plots are needed for T-bet, EOMES, GzmB, and cytokine staining in Figure 1.

      As the Reviewer suggested, we added representative gating controls for T-bet, EOMES, GzmB and cytokine staining in Supplementary Figure 1 of the revised manuscript.

      (2) For Figure 1F, MFI for activation markers for the CD44hiCD62Llo cells should be provided for the comparison of PLN data.

      As the Reviewer suggested, MFI data for these markers have been included in Figure 1F of the revised manuscript.

      (3) In many places and figure legends, it is not mentioned from which organ cells were collected, i.e., spleen or PLN.

      As the Reviewer suggested, the origin of cells for each experiment has been explicitly indicated in the figure legends or figure content to ensure clarity.

      (4) In the pancreatic lymph node, autoreactive T cells might be upregulating CD5 because they are encountering antigens. This should be addressed in the discussion.

      As the Reviewer suggested, this issue has been included in the discussion of the revised manuscript (lines 440–450).

      (5) It is not clear if T cells from the spleen and PLN were stimulated to detect the production of pro-inflammatory cytokines.

      Thanks for the critical comment. The stimulation protocol and cytokine staining method have been added to the Cytokine staining section of the Supplementary Methods in the revised manuscript.

      (6) Figure 4C-D: It is not clear if analysis was done on naïve T cells or if they were stimulated.

      Thanks for the comment. The stimulation and cytokine staining methods used in Figure 4C-D have been described in detail in the Cytokine staining section of the Supplementary Materials of the revised manuscript.

      (7) IGRP gating in Figure 4F should be revisited with negative controls.

      Thanks for the critical comment. Negative controls have been added and used to adjust IGRP gating, and this is now mentioned in the figure legend of the revised manuscript.

      (8) Interpretation that only CD5hi cells form a central memory T cell population (Figure 4F) could be misleading.

      Thanks for this valuable comment. We agree that in conventional CD8<sup>+</sup> T cell immune responses, both CD5<sup>hi</sup> and CD5<sup>lo</sup> subsets have the potential to differentiate into central memory T cells. In our experimental approach, we adoptively transferred sorted CD5<sup>hi</sup>CD8<sup>+</sup> or CD5<sup>lo</sup>CD8<sup>+</sup> cells into Rag1<sup>-/-</sup> recipients and specifically analyzed PLNs four weeks after transfer. Using CD44 and CD62L expression as conventional markers for central memory T cells, we barely observed a CD44<sup>hi</sup>CD62L<sup>hi</sup> population in the CD5<sup>lo</sup>CD8<sup>+</sup>-transferred group. Based on these results, we stated: “This analysis underscores that the central memory T cell population and the frequency of islet autoantigen-specific CD8<sup>+</sup>T cells are higher in the CD5<sup>hi</sup> transferred subset within the PLNs, implying more robust immune responses initiated by the CD5<sup>hi</sup>cells” (line 272–274). Importantly, we did not intend to imply that only CD5<sup>hi</sup> cells can form central memory T cells, but rather that they were more enriched for this phenotype under the specific conditions and time point analyzed.

      (9) IL-2 gating representative plot should be provided for Figure 5A.

      As the Reviewer suggested, a representative IL-2 gating plot has been included in the revised Supplementary Figure 3B.

    1. eLife Assessment

      This important study demonstrates that in Drosophila melanogaster, tachykinin (Tk) expression is regulated by the microbiota. The authors present convincing evidence that axenic flies raised with no microbiota are longer-lived than conventionally reared animals, and that Tk expression and Tk receptors in the nervous system are required for this effect. They further test individual bacterial strains for their role in these effects and connect the effect to loss of lipid stores and suggest that FOXO may be involved in the phenotype, results that are of interest to the fields of environmental perception, host microbiome interactions, and geroscience.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #1 (Public review):

      Summary:

      In this study the authors use a Drosophila model to demonstrate that Tachykinin (Tk) expression is regulated by the microbiota. In Drosophila conventionally reared (CR) flies are typically shorter lived than those raised without a microbiota (axenic). Here, knockdown of Tk expression is found to prevent lifespan shortening by the microbiota and the reduction of lipid stores typically seen in CR flies when compared to axenic counterparts. It does so without reducing food intake or fecundity which are often seen as necessary trade-offs for lifespan extension. Further, the strength of the interaction between Tk and the microbiota is found to be bacteria specific and is stronger in Acetobacter pomorum (Ap) mono-associated flies compared to Levilactobacillus brevis (Lb) mono-association. The impact on lipid storage was also only apparent in Ap-flies.

      Building on these findings the authors show that gut specific knockdown is largely sufficient to explain these phenotypes. Knockdown of the Tk receptor, TkR99D, in neurons recapitulates the lifespan phenotype of intestinal Tk knockdown supporting a model whereby Tk from the gut signals to TkR99D expressing neurons to regulate lifespan. In addition, the authors show that FOXO may have a role in lifespan regulation by the Tk-microbiota interaction. However, they rule out a role for insulin producing cells or Akh-producing cells suggesting the microbiota-Tk interaction regulates lifespan through other, yet unidentified, mechanisms.

      Major comments:

      Overall, I find the key conclusions of the paper convincing. The authors present an extensive amount of experimental work, and their conclusions are well founded in the data. In particular, the impact of TkRNAi on lifespan and lipid levels, the central finding in this study, has been demonstrated multiple times in different experiments and using different genetic tools. As a result, I don't feel that additional experimental work is necessary to support the current conclusions.

      However, I find it hard to assess the robustness of the lifespan data from the other manipulations used (TkR99DRNAi, TkRNAi in dFoxo mutants etc.) because information on the population size and whether these experiments have been replicated is lacking. Can the authors state in the figure legends the numbers of flies used for each lifespan and whether replicates have been done? For all other data it is clear how many replicates have been done, and the methods give enough detail for all experiments to be reproduced.

      Significance:

      Overall, I find the key conclusions of the paper convincing. The authors present an extensive amount of experimental work, and their conclusions are well founded in the data. We have known that the microbiota influence lifespan for some time but the mechanisms by which they do so have remained elusive. This study identifies one such mechanism and as a result opens several avenues for further research. The Tk-microbiota interaction is shown to be important for both lifespan and lipid homeostasis, although it's clear these are independent phenotypes. The fact that the outcome of the Tk-microbiota interaction depends on the bacterial species is of particular interest because it supports the idea that manipulation of the microbiota, or specific aspects of the host-microbiota interaction, may have therapeutic potential.

      These findings will be of interest to a broad readership spanning host-microbiota interactions and their influence on host health. They move forward the study of microbial regulation of host longevity and have relevance to our understanding of microbial regulation of host lipid homeostasis. They will also be of significant interest to those studying the mechanisms of action and physiological roles of Tachykinins.

      Field of expertise: Drosophila, gut, ageing, microbiota, innate immunity

    3. Reviewer #2 (Public review):

      Summary:

      The main finding of this work is that microbiota impacts lifespan through regulating the expression of a gut hormone (Tk) which in turn acts on its receptor expressed on neurons. This conclusion is robust and based on a number of experimental observations, carefully using techniques in fly genetics and physiology: 1) microbiota regulates Tk expression, 2) lifespan reduction by microbiota is absent when Tk is knocked down in gut (specifically in the EEs), 3) Tk knockdown extends lifespan and this is recapitulated by knockdown of a Tk receptor in neurons. These key conclusions are very convincing. Additional data are presented detailing the relationship between Tk and insulin/IGF signalling and Akh in this context. These are two other important endocrine signalling pathways in flies. The presentation and analysis of the data are excellent.

      There are only a few experiments or edits that I would suggest as important to confirm or refine the conclusions of this manuscript. These are:

      (1) When comparing the effects of microbiota (or single bacterial species) in different genetic backgrounds or experimental conditions, I think it would be good to show that the bacterial levels are not impacted by the other intervention(s). For example, the lifespan results observed in Figure 2A are consistent with Tk acting downstream of the microbes but also with Tk RNAi having an impact on the microbiota itself. I think this simple, additional control could be done for a few key experiments. Similarly, the authors could compare the two bacterial species to see if the differences in their effects come from different ability to colonise the flies.

      (2) The effect of Tk RNAi on TAG is opposite in CR and Ax or CR and Ap flies, and the knockdown shows an effect in either case (Figure 2E, Figure 3D). Why is this? Better clarification is required.

      (3) With respect to insulin signalling, all the experiments bar one indicate that insulin is mediating the effects of Tk. The one experiment that does not is using dilpGS to knock down TkR99D. Is it possible that this driver is simply not resulting in an efficient KD of the receptor? I would be inclined to check this, but as a minimum I would be a bit more cautious with the interpretation of these data.

      (4) Is it possible to perform at least one lifespan repeat with the other Tk RNAi line mentioned? This would further clarify that there are no off-target effects that can account for the phenotypes.

      There are a few other experiments that I could suggest as I think they could enrich the current manuscript, but I do not believe they are essential for publication:

      (5) The manuscript could be extended with a little more biochemical/cell biology analysis. For example, is it possible to look at Tk protein levels, Tk levels in circulation, or even TkR receptor activation or activation of its downstream signalling pathways? Comparing Ax and CR or Ap and CR one would expect to find differences consistent with the model proposed. This would add depth to the genetic analysis already conducted. Similarly, for insulin signalling - would it be possible to use some readout of the pathway activity and compare between Ax and CR or Ap and CR?

      (6) The authors use a pan-acetyl-K antibody but are specifically interested in acetylated histones. Would it be possible to use antibodies for acetylated histones? This would have the added benefit that one can confirm the changes are not in the levels of histones themselves.

      (7) I think the presentation of the results could be tightened a bit, with fewer sections and one figure per section.

      Significance:

      The main contribution of this manuscript is the identification of a mechanism that links the microbiota to lifespan. This is very exciting and topical for several reasons:

      (1) The microbiota is very important for overall health but it is still unclear how. Studying the interaction between microbiota and health is an emerging, growing field, and one that has attracted a lot of interest, but one that is often lacking in mechanistic insight. Identifying mechanisms provides opportunities for therapies. The main impact of this study comes from using the fruit fly to identify a mechanism.

      (2) It is very interesting that the authors focus on an endocrine mechanism, especially with the clear clinical relevance of gut hormones to human health recently demonstrated with new, effective therapies (e.g. Wegovy).

      (3) Tk is emerging as an important fly hormone and this study adds a new and interesting dimension by placing TK between microbiota and lifespan.

      I think the manuscript will be of great interest to researchers in ageing, human and animal physiology and in gut endocrinology and gut function.

    4. Reviewer #3 (Public review):

      Summary:

      Marcu et al. demonstrate a gut-neuron axis that is required for the lifespan-shortening effects mediated by gut bacteria. They show that the presence of commensal bacteria, particularly Acetobacter pomorum, promotes Tk expression in the gut, which then binds to neuronal tachykinin receptors to shorten lifespan. Tk has also recently been reported to extend lifespan via adipokinetic hormone (Akh) signaling (Ahrentløv et al., Nat Metab 7, 2025), but the mechanism here appears distinct. The lifespan shortening by Ap via Tk seems to be partially dependent on foxo and independent of both insulin signaling and Akh-mediated lipid mobilization.

      Although the detailed mechanistic link to lifespan is not fully resolved, the experiment and its results clearly show the involvement of the molecules tested. This work adds a valuable dimension to our growing understanding of how gut bacteria influence host longevity. However, there are some points that should be addressed.

      (1) Tk+ EEC activity should be assessed directly, rather than relying solely on transcript levels. Approaches such as CaLexA or GCaMP could be used.

      (2) In Line 243, the manuscript states that the reporter activity was not increased in the posterior midgut. However, based on the presented results in Fig4E, there is seemingly no apparent regional specificity. A more detailed explanation is necessary.

      (3) If feasible, assessing foxo activation would add mechanistic depth. This could be done by monitoring foxo nuclear localization or measuring the expression levels of downstream target genes.

      (4) Fig1C uses Adh for normalization. Given the high variability of the result, the authors should (1) check whether Adh expression levels changed via bacterial association and/or (2) compare the results using different genes as internal standard.

      (5) While the difficulty of maintaining lifelong axenic conditions is understandable, it may still be feasible to assess the induction of Tk (i.e., Tk transcription or EE activity upregulation) by the microbiome in males.

      (6) We also had some concerns regarding the wording of the title. Fig6B and C suggest that TkR86C, in addition to TkR99D, may be involved in the A. pomorum-lifespan interaction. Consider revising the title to refer more generally to the "tachykinin receptor" rather than only TkR99D. The difference between "aging" and "lifespan" should also be addressed. While the study shows a role for Tk in lifespan, assessment of aging phenotypes (e.g. Climbing assay, ISC proliferation) beyond the smurf assay is required to make conclusions about aging.

      (7) The statement in Line 82 that EEs express 14 peptide hormones should be supported with an appropriate reference, if available.

      Significance:

      General assessment: The main strength of this study is the careful and extensive lifespan analyses, which convincingly demonstrate the role of gut microbiota in regulating longevity. The authors clarify an important aspect of how microbial factors contribute to lifespan control. The main limitation is that the study primarily confirms the involvement of previously reported signaling pathways, without identifying novel molecular players or previously unrecognized mechanisms of lifespan regulation.

      Advance: The lifespan-shortening effect of Acetobacter pomorum (Ap) has been reported previously, as has the lifespan-shortening effect of Tachykinin (Tk). However, this study is the first to link these two factors mechanistically, which represents a significant and original contribution to the field. The advance is primarily mechanistic, providing new insight into how microbial cues converge on host signaling pathways to influence ageing.

      Audience: This work will be of particular interest to a specialized audience of basic researchers in ageing biology. It will also attract interest from microbiome researchers who are investigating host-microbe interactions and their physiological consequences. The findings will be useful as a conceptual framework for future mechanistic studies in this area.

      Field of expertise: Drosophila ageing, lifespan, microbiome, metabolism

    5. Author response:

      (1) General Statements

      The goal of our study was to mechanistically connect microbiota to host longevity. We have done so using a combination of genetic and physiological experiments, which outline a role for a neuroendocrine relay mediated by the intestinal neuropeptide Tachykinin, and its receptor TkR99D in neurons. We also show a requirement for these genes in metabolic and healthspan effects of microbiota.

      The referees' comments suggest they find the data novel and technically sound. We have added data in response to numerous points, which we feel enhance the manuscript further, and we have clarified text as requested. Reviewer #3 identified an error in Figure 4, which we have rectified. We felt that some specific experiments suggested in review would not add significant further depth, as we articulate below.

      Altogether our reviewers appear to agree that our manuscript makes a significant contribution to both the microbiome and ageing fields, using a large number of experiments to mechanistically outline the role(s) of various pathways and tissues. We thank the reviewers for their positive contributions to the publication process.

      (2) Description of the planned revisions

      Reviewer #2:

      Not…essential for publication…is it possible to look at Tk protein levels?

      We have acquired a small amount of anti-TK antibody and we will attempt to immunostain guts associated with A. pomorum and L. brevis. We are also attempting the equivalent experiment in mouse colon reared with/without a defined microbiota. These experiments are ongoing, but we note that the referee feels that the manuscript is a publishable unit whether these stainings succeed or not.

      (3) Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1:

      Can the authors state in the figure legends the numbers of flies used for each lifespan and whether replicates have been done?

      We have incorporated the requested information into legends for lifespan experiments.

      Do the interventions shorten lifespan relative to the axenic cohort? Or do they prevent lifespan extension by axenic conditions? Both statements are valid, and the authors need to be consistent in which one they use to avoid confusing the reader.

      We read these statements differently. The only experiment in which a genetic intervention prevented lifespan extension by axenic conditions is neuronal TkR86C knockdown (Figure 6B-C). Otherwise, microbiota shortened lifespan relative to axenic conditions, and genetic knockdowns blocked this effect (e.g. see lines 131-133). We have ensured that the framing is consistent throughout, with text edited at lines 198-199, 298-299, 311-312, 345-347, 407-408, 424-425, 450, 497-503.

      TkRNAi consistently reduces lipid levels in axenic flies (Figs 2E, 3D), essentially phenocopying the loss of lipid stores seen in control conventionally reared (CR) flies relative to control axenic. This suggests that the previously reported role of Tk in lipid storage - demonstrated through increased lipid levels in TkRNAi flies (Song et al (2014) Cell Rep 9(1): 40) - is dependent on the microbiota. In the absence of the microbiota TkRNAi reduces lipid levels. The lack of acknowledgement of this in the text is confusing.

      We have added text at lines 219-222 to address this point. We agree that this effect is hard to interpret biologically, since expressing RNAi in axenics has no additional effect on Tk expression (Figure S7). Consequently, we can only interpret this unexpected effect as a possible off-target effect of RU feeding on TAG, specific to axenic flies. However, this possibility does not void our conclusion, because an off-target diminution of TAG cannot explain why CR flies accumulate TAG following Tk<sup>RNAi</sup> induction. We hope that our added text clarifies this point.

      I have struggled to follow the authors' logic in ablating the IPCs and feel a clear statement on what they expected the outcome to be would help the reader.

      We have added the requested statement at lines 423-424, explaining that we expected the IPC ablation to render flies constitutively long-lived and non-responsive to A. pomorum.

      Can the authors clarify their logic in concluding a role for insulin signalling, and qualify this conclusion with appropriate consideration of alternative hypotheses?

      We have added our logic at lines 449-454. In brief, we conclude involvement for insulin signalling because FoxO mutant lifespan does not respond to Tk<sup>RNAi</sup>, and diminishes the lifespan-shortening effect of A. pomorum. However, we cannot state that the effects are direct because we do not have data that mechanistically connects Tk/TkR99D signalling directly in insulin-producing cells. The current evidence is most consistent with insulin signalling priming responses to microbiota/Tk/TkR99D, as per the newly-added text.

      Typographical errors

      We have remedied the highlighted errors, at lines 128-140.

      Reviewer #2:

      it would be good to show that the bacterial levels are not impacted [by TkRNAi]

      We have quantified CFUs in CR flies upon ubiquitous TkRNAi (Figure S5), finding that the RNAi does not affect bacterial load. New text at lines 138-139 articulates this point.

      The effect of Tk RNAi on TAG is opposite in CR and Ax or CR and Ap flies, and the knockdown shows an effect in either case (Figure 2E, Figure 3D). Why is this?

      As per response to Reviewer #1, we have added text at lines 219-222 to address this point.

      Is it possible to perform at least one lifespan repeat with the other Tk RNAi line mentioned?

      We have added another experiment showing longevity upon knockdown in conventional flies, using an independent TkRNAi line (Figure S3).

      Reviewer #3:

      In Line 243, the manuscript states that the reporter activity was not increased in the posterior midgut. However, based on the presented results in Fig4E, there is seemingly no apparent regional specificity. A more detailed explanation is necessary.

      We thank the reviewer sincerely for their keen eye, which has highlighted an error in the previous version of the figure. In revisiting this figure we have noticed, to our dismay, that the figures for GFP quantification were actually re-plots of the figures for (ac)K quantification. This error led to the discrepancy between statistics and graphics, which thankfully the reviewer noticed. We have revised the figure to remedy our error, and the statistics now match the boxplots and results text.

      Fig1C uses Adh for normalization. Given the high variability of the result, the authors should (1) check whether Adh expression levels changed via bacterial association

      We selected Adh on the basis of our RNAseq analysis, which showed it was not different between AX and CV guts, whereas many commonly-used “housekeeping” genes were. We have now added a plot to demonstrate (Figure S2).

      The statement in Line 82 that EEs express 14 peptide hormones should be supported with an appropriate reference

      We have added the requested reference (Hung et al, 2020) at line 86.

      (4) Description of analyses that authors prefer not to carry out

      Reviewer #1:

      I'd encourage the authors to provide lifespan plots that enable comparison between all conditions

      We have avoided this approach because the number of survival curves that would need to be presented on the same axis (e.g. 16 for Figure 5) is not legible. However we have ensured that axes on faceted plots are equivalent and with grid lines for comparison. Moreover, our approach using statistical coefficients (EMMs) enables direct quantitative comparison of the differences among conditions.

      Reviewer #2:

      Is it possible that this driver is simply not resulting in an efficient KD of the receptor? I would be inclined to check this

      This comment relates to Figure 7G. We do see an effect of the knockdown in this experiment, so we believe that the knockdown is effective. However the direction of response is not consistent with our hypothesis so the experiment is not informative about the role of these cells. We therefore feel there is little to be gained by testing efficacy of knockdown, which would also be technically challenging because the cells are a small population in a larger tissue which expresses the same transcripts elsewhere (i.e. necessitating FISH).

      Would it be possible to use antibodies for acetylated histones?

      The comment relates to Figure 4C-E. The proposed studies would be a significant amount of work because, to our knowledge, the specific histone marks which drive activation in TK+ cells remain unknown. On the other hand, we do not see how this information would enrich the present story, rather such experiments would appear to be the beginning of something new. We therefore agree with Reviewer #1 (in cross-commenting) that this additional work is not justified.

      Reviewer #3:

      Tk+ EEC activity should be assessed directly, rather than relying solely on transcript levels. Approaches such as CaLexA or GCaMP could be used.

      We agree with reviewers 1-2 (in cross-commenting) that this proposal is non-trivial and not justified by the additional insight that would be gained. As described above, we are attempting to immunostain Tk, which if successful will provide a third line of evidence for regulation of Tk+ cells. However we note that we already have the strongest possible evidence for a role of these cells via genetic analysis (Figure 5).

      While the difficulty of maintaining lifelong axenic conditions is understandable, it may still be feasible to assess the induction of Tk (ie. Tk transcription or EE activity upregulation) by the microbiome on males.

      As the reviewer recognises, maintaining axenic experiments for months on end is not trivial. Given the tendency for males either to simply mirror female responses to lifespan-extending interventions, or to not respond at all, we made the decision in our work to only study females. We have instead emphasised in the manuscript that results are from female flies.

      TkR86C, in addition to TkR99D, may be involved in the A. pomorum-lifespan interaction. Consider revising the title to refer more generally to the "tachykinin receptor" rather than only TkR99D.

      We disagree with this interpretation: the results do not show that TkR86C-RNAi recapitulates the effect of enteric Tk-RNAi. A potentially interesting interaction is apparent, but the data do not support a causal role for TkR86C. A causal role is supported only for TkR99D, knockdown of which recapitulates the longevity of axenic flies and Tk<sup>RNAi</sup> flies. We therefore feel that our current title is justified by the data, and a more generic version would misrepresent our findings.

      The difference between "aging" and "lifespan" should also be addressed.

      The smurf phenotype is a well-established metric of healthspan. Moreover, lifespan is the leading aggregate measure of ageing. We therefore feel that the use of “ageing” in the title is appropriate.

      If feasible, assessing foxo activation would add mechanistic depth. This could be done by monitoring foxo nuclear localization or measuring the expression levels of downstream target genes.

      Foxo nuclear localisation has already been shown in axenic flies (Shin et al, 2011). We have added text and citation at lines 401-402.

    1. eLife Assessment

      In this important manuscript, the authors establish a vertebrate model for studying the development of circuits that control heart rate. This contribution uses a combination of experimental techniques to provide compelling information for scientists looking to understand how heart rate regulation emerges during development.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript by Hernandez-Nunez et al. provides a comprehensive characterization of how heart-brain circuits develop in a vertebrate brain, namely the zebrafish. The characterization is performed using a combination of modern and sophisticated imaging and neural manipulation techniques and achieves unprecedented clarity and detail in how the heart-brain communication develops early in life. The paper describes a three-stage program, where first an efferent-circuit from the motor vagus to the heart develops, followed by sympathetic innervation, and lastly sensory neurons innervate the heart.

      Strengths:

      The paper is very clearly and nicely written. The findings are novel and of high quality and relevance. The presentations are very clear and nicely interpreted. The analyses are well presented and applied.

      Weaknesses:

      From the heart rate traces, heart rate variability seems to be prominent and changes across days post-fertilization (dpf). That would be a useful dependent variable, considering that the variation captured by the models does not fully explain heart rate, both for sympathetic and parasympathetic efferents. Given the strong autorhythmicity of nodal tissue in neurogenic hearts, modulatory inputs could potentially predict heart rate variability with higher precision.

    3. Reviewer #2 (Public review):

      Hernandez-Nunez et al. investigate the development and function of neural circuits involved in the regulation of heart rate in larval zebrafish. Using conserved genetic markers, they identify neural pathways involved in the bidirectional control of heart rate and in providing sensory feedback, potentially enabling more precise tuning. The main observation is that the different elements of this circuit are laid down in a developmentally staggered manner.

      At 4 days old, the heart rate is invariant to a range of sensory stimuli, and the vagal motor or sympathetic pathways could not be seen to innervate the heart. Progressively through development, the heart is first innervated by the vagal motor pathway, whose axons are cholinergic, before the formation of phox2bb+ intracardiac neurons (ICNs). At this stage, before the first ICNs are observed, activation of the vagal motor pathway by optogenetic activation of a localized population of cholinergic hindbrain neurons leads to bradycardia. After the vagal motor innervation begins, the sympathetic pathway innervates the heart, which could be visualized in the form of TH+ fibers from the anterior paravertebral ganglia (APG). The activity of the TH+ APG neurons was diverse and showed proportional, integral, and derivative-like relationships to the heart rate, suggesting a role in more precise tuning of the rate than what could be achieved through the vagal pathway alone. The sensory vagus innervation of the heart was identified to be the last stage to develop; however, neurons in the nodose ganglion exhibited diverse responses tuned to the heart rate well before the innervation reached the heart. The authors attribute this to the fact that other indirect sensory cues from the gills or vasculature could be used to sense heart rate prior to innervation.
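
      To make the phrase "proportional, integral, and derivative-like relationships" concrete, the following is a minimal sketch of how such relationships could be tested by regression. It is not the authors' GLM. The synthetic heart-rate trace, the sampling interval, the function name `pid_regressors`, and the simulated derivative-tracking neuron are all illustrative assumptions.

```python
import numpy as np

def pid_regressors(heart_rate, dt, setpoint=None):
    """Build proportional, integral and derivative regressors from a heart-rate trace.

    The error signal is heart rate minus a setpoint; by default the setpoint is
    the mean of the trace (an assumption made for this illustration).
    """
    hr = np.asarray(heart_rate, dtype=float)
    if setpoint is None:
        setpoint = hr.mean()
    error = hr - setpoint
    proportional = error                  # tracks the current deviation
    integral = np.cumsum(error) * dt      # accumulates past deviation
    derivative = np.gradient(error, dt)   # tracks the rate of change
    return np.column_stack([proportional, integral, derivative])

# Synthetic example: a slowly oscillating heart rate (values are hypothetical).
dt = 0.1                                  # seconds per sample
t = np.arange(600) * dt                   # 60 s of data
heart_rate = 2.5 + 0.3 * np.sin(2 * np.pi * 0.05 * t)

X = pid_regressors(heart_rate, dt)

# A hypothetical neuron whose activity follows the derivative of heart rate.
neuron = 0.8 * X[:, 2]

# Least-squares weights show which term best explains the neuron's activity.
weights, *_ = np.linalg.lstsq(X, neuron, rcond=None)
```

      In this toy case the fitted weight falls almost entirely on the derivative column. With real recordings, indicator kinetics and noise would blur the separation between the three terms.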

      This study identifies key components of the control loop required for the regulation of heart rate in zebrafish. The control mechanism appears to be independent of the cues that trigger heart rate changes, indicating that the circuit is indeed part of an interoceptive pathway for heart rate control. Evidence for the staggered development of the vagal-motor, sympathetic, and sensory pathways is conclusive, and as the authors discuss, this phenomenon progressively allows for finer-grained control of the heart rate. This could be achieved through proportional-integral-derivative-like control properties emerging in a diverse set of neurons in the APG and sensory feedback of the state of the heart. In line with these findings, the baseline variability of heart rate prior to innervation at 4 days old appears to be comparatively lower than the later stages (Figure 1C, D, Supplementary Figure 1C-F) and increases over development.

      Based on this observation and the time courses of the kernels identified by the GLMs, I would expect heart rate fluctuations of a finer time scale, ultimately limited by the time course of GCaMP6s, to be captured by the models in Figures 3, 5, and 7, in addition to the stimulus-locked changes that are highlighted. While the models yield valuable insight in the form of the activation kernels and their potential roles, in one instance, this captures the potential contribution of either the motor vagus or the APG to the change in heart rate. This makes it challenging to identify where it falls short and the potential functions of pathways that are yet to be discovered.

      Lastly, the proposed anatomical connectivity of the heart-brain circuit is based on tracts observed in this study as well as those inferred from function and from previous studies.

      (1) It is not clear from the images presented here whether the VSNs send feedback projections to the brainstem VPN.

      (2) Do the brainstem neurons identified by their functional roles send efferent projections via the motor vagus nerve? This is unclear from the results presented and needs to be clarified in the text.

      (3) Add appropriate clarifying annotations to Figure 9 and a section of text discussing the potential unknowns in the proposed circuit diagram.

    4. Author response:

      We thank the reviewers for their thoughtful, constructive, and generous evaluations of our manuscript. We are encouraged by their overall assessment of the clarity, novelty, and significance of the work, and we appreciate the opportunity to further strengthen the manuscript.

      Both reviewers highlight the central contribution of this study: a developmental, circuit-level dissection of how heart–brain signaling emerges in a vertebrate. We are pleased that the evidence supporting the staggered assembly of vagal motor, sympathetic, and sensory pathways was found to be compelling, and that the computational and experimental framework was viewed as appropriate and informative.

      Below, we briefly outline how we plan to address the main points raised in the reviews.

      Heart rate variability and temporal structure

      Both reviewers note that heart rate variability (HRV) changes across development and suggest that HRV may provide additional insight into the function of autonomic circuits. We agree that HRV is an important physiological readout and that its developmental changes are consistent with the progressive emergence of autonomic control.

      In the revised manuscript, we plan to (i) discuss heart rate variability more explicitly in the context of circuit maturation and (ii) clarify the temporal scales captured by our experiments and modeling framework. In particular, we will emphasize that our analyses focus on relationships between neural activity and heart-rate trajectories at timescales accessible given imaging rate and indicator kinetics, rather than beat-to-beat variability. We will also consider adding a supplementary analysis of the variability that can be reliably measured within these constraints, and, where appropriate, how neural activity predicts that measurable variation.

      Scope and interpretation of the computational models

      Reviewer #2 raises thoughtful points regarding what the generalized linear models can and cannot disambiguate, particularly when multiple efferent pathways may contribute to heart-rate dynamics. We will revise the text to more clearly distinguish between functional encoding relationships inferred from the models and anatomical connectivity that is directly demonstrated.

      Our intent is to frame the kernels identified in the motor and sympathetic pathways as computational motifs that capture distinct dynamical contributions, rather than as exclusive or complete explanations of heart-rate control. We will clarify these limitations explicitly in the Results and Discussion.

      Circuit diagram and anatomical interpretation

      We appreciate the reviewer’s careful reading of the proposed circuit schematic. In the revised manuscript, we will revise the figure and accompanying text to clearly annotate which connections are directly observed, which are functionally inferred, and which remain hypothetical. We will also expand the Discussion to explicitly address open questions, including unresolved feedback pathways and the potential for additional nodes in the circuit.

      We believe these revisions will improve clarity without altering the core conclusions of the study. We thank the reviewers again for their insightful feedback and look forward to submitting a revised version of the manuscript that addresses these points in detail.

    1. eLife Assessment

      This paper presents an important advance in genetically encoded voltage imaging of the developing zebrafish spinal cord in vivo, capturing voltage dynamics in neuronal populations, single cells, and subcellular compartments inaccessible to patch clamp, and diverse spike waveforms and subthreshold voltage dynamics inaccessible to calcium imaging. The work identifies a developmental progression from irregular voltage fluctuations to coordinated contralateral and ipsilateral activity, providing insight into how electrical dynamics and cellular morphology evolve during circuit formation. The strength of evidence is solid, with imaging data supporting the main conclusions, although the manuscript would be strengthened by more complete methodological documentation and clearer context relative to earlier calcium imaging studies. Overall, this study provides a resource that is of importance for researchers investigating neural development and circuit assembly, illustrating the value of voltage imaging as a general tool for probing bioelectric mechanisms in morphogenesis and circuit development.

    2. Reviewer #1 (Public review):

      Summary:

      This paper demonstrates the first application of voltage imaging using a genetically encoded voltage indicator, ArcLight, for recording the spontaneous activity of the developing spinal cord in zebrafish. This technology enabled better temporal resolution compared to what has been demonstrated with calcium imaging in past studies (Muto et al., 2011; Warp et al., 2012; Wan et al., 2019), which led to the discovery of the maturation process of "firing" shapes in spinal neurons. This maturation process occurs simultaneously with axonal elongation and network integration. Thus, voltage imaging revealed new biological details of the development of the spinal circuits.

      Strengths:

      The use of voltage imaging instead of calcium imaging revealed biological details of the functional maturation of spinal cord neurons in developing zebrafish.

      Weaknesses:

      This manuscript lacks many basic components and explanations necessary for understanding the methodologies used in this study.

    3. Reviewer #2 (Public review):

      The authors present highly impressive in vivo voltage‐imaging data, demonstrating neuronal activity at subcellular, cellular, and population levels in a developing organism. The approach provides excellent spatial and temporal resolution, with sufficient signal-to-noise to detect hyperpolarizations and subthreshold events. The visualization of contralateral synchrony and its developmental loss over time is particularly compelling. The observation that ipsilateral synchrony persists despite contralateral desynchronization is a striking demonstration of the power of GEVIs in vivo. While I outline several points that should be addressed, I consider this among the strongest demonstrations of in vivo GEVI imaging to date.

      Major points:

      (1) Clarification of GEVI performance characteristics

      There is a widespread misconception in the GEVI field that response speed is the dominant or primary determinant of sensor performance. Although fast kinetics are certainly desirable, they are not the only (or even necessarily the limiting) factor for effective imaging. Kinetic speed specifies the time to reach ~63% of the maximal ΔF/F for a given voltage step (typically 100 mV, approximating the amplitude of a neuronal action potential), but in practical imaging, a slower sensor with a large ΔF/F can outperform a faster sensor with a small ΔF/F. In this context, the authors' use of ArcLight is actually instructive. ArcLight is one of the slower GEVIs in common use, yet Figures S1a-b clearly show that it still reports voltage transients in vivo very well. I therefore strongly recommend moving these panels into the main text to emphasize that robust in vivo imaging can be achieved even with a relatively slow GEVI, provided the signal amplitude and SNR are adequate. This will help counteract the common misunderstanding in the field.

      (2) ArcLight's voltage-response range

      ArcLight is shifted toward more negative potentials (V₁/₂ ≈ −30 mV). This improves subthreshold detection but makes distinguishing action potentials from subthreshold transients more challenging. The comparison with GCaMP is helpful because the Ca²⁺ signal largely reflects action potentials. Panels S1c-f show similar onset kinetics but a longer decay for GCaMP. Surprisingly, the ΔF/F amplitudes are comparable; typically, GCaMP changes are larger. To support lines 193-194, the authors should include a table summarizing the onset/offset kinetics and ΔF/F ranges for neurons expressing ArcLight versus GCaMP.

      Additionally, the expected action-potential amplitude in zebrafish neurons should be stated. In Figure S1b, a 40 mV change appears to produce ~0.5% ΔF/F, but this should be quantified and noted. Could this comparison to GCaMP help resolve action potentials from subthreshold bursts?

      (3) Axonal versus somatic amplitudes (Line 203)

      The manuscript states that voltage amplitudes are "slightly smaller" in axons than in somata; this requires quantitative values and statistical testing. More importantly, differences in optical amplitude reflect factors such as expression levels, background fluorescence, and optical geometry, not necessarily true differences in voltage amplitude. The axonal signals are clearly present, but their relative magnitude should not be interpreted without correction.

      (4) Figure 4C: need for an off-ROI control

      Figure 4C should include a control ROI located away from ROI3 to demonstrate that the axonal signal is not due to background fluctuations, similar to the control shown in Figure S3. Although the ΔF image suggests localization, showing the trace explicitly would strengthen the point. The fluorescence-change image in Figure 4C should also be fully explained in the legend.

      (5) Figure 5: hyperpolarization signals

      Figure 5 is particularly impressive. It appears that Cell 2 at 18.5 hpf and Cell 1 at 18 hpf exhibit hyperpolarizing events. The authors should confirm that these are true hyperpolarizations by giving some indication of how often they were observed.

      (6) SNR comparison (Lines 300-302)

      The claim that ArcLight and GCaMP exhibit comparable SNR requires statistical support across multiple cells.
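
      Point (1) above can be illustrated with a minimal sketch, assuming each sensor responds to a voltage step as a single exponential with time constant τ. This is a simplification; real indicators may have multi-component kinetics. The two sensors and all of their numbers are hypothetical and chosen only to show the trade-off, not to describe ArcLight or any other published indicator.

```python
import math

def step_response(amplitude_pct, tau_ms, t_ms):
    """Magnitude of the fluorescence change (in % dF/F) at time t after a voltage step.

    Assumes a single-exponential sensor: response = A * (1 - exp(-t / tau)).
    """
    return amplitude_pct * (1.0 - math.exp(-t_ms / tau_ms))

# At t = tau, a single-exponential sensor has reached 1 - 1/e (about 63%)
# of its maximal response, which is the definition of kinetic speed quoted above.
fraction_at_tau = step_response(1.0, 10.0, 10.0)

# Two hypothetical sensors (illustrative values only).
slow_large = {"amplitude_pct": 30.0, "tau_ms": 10.0}  # slow, large maximal dF/F
fast_small = {"amplitude_pct": 5.0, "tau_ms": 1.0}    # fast, small maximal dF/F

# Response magnitude at the end of a brief 2 ms depolarisation.
slow_at_2ms = step_response(t_ms=2.0, **slow_large)
fast_at_2ms = step_response(t_ms=2.0, **fast_small)
```

      With these assumed values the slow sensor still produces the larger optical signal after 2 ms, because its greater maximal amplitude outweighs its incomplete rise.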

    4. Reviewer #3 (Public review):

      Summary:

      The authors aimed to establish a long-term voltage imaging platform to investigate how coordinated neuronal activity emerges during spinal cord development in zebrafish embryos. Using the genetically encoded voltage indicator ArcLight, they tracked membrane potential dynamics in motor neurons at population, single-cell, and subcellular levels from 18 to 23 hours post-fertilization (hpf), revealing relationships between firing maturation, waveform characteristics, and axonal outgrowth.

      Strengths:

      (1) Technical advancement in developmental voltage imaging:

      This study demonstrates voltage imaging of motor neurons in the developing vertebrate spinal cord. The approach successfully captures voltage dynamics at multiple spatial scales: neuronal population, single-cell, and subcellular compartments.

      (2) Insights into the relationship between morphological and functional maturation:

      The work reveals important relationships between voltage dynamics maturation and morphological changes.

      (3) Kinetics analysis of membrane potential waveform enabled by voltage imaging:

      The characterization of "immature" versus "mature" firing based on quantitative waveform parameters provides insights into functional maturation that are inaccessible by calcium imaging. This analysis reveals a maturation process in the biophysical properties of developing neurons.

      (4) Matching of voltage indicator kinetics to biological signal:

      The authors' choice of ArcLight, despite its slow kinetics compared to newer GEVIs, proved well-suited to the low-frequency activity patterns in developing spinal neurons (frequency ~0.3 Hz).

      Weaknesses:

      (1) Insufficient comparison with prior calcium imaging studies:

      While the authors state that voltage imaging provides superior temporal resolution compared to calcium imaging (lines 192-196, 301), and this is generally true, the current manuscript does not adequately cite or discuss previous calcium imaging studies. Since neural activity occurs at low frequency in the developing spinal cord, calcium imaging is adequate for characterizing the emergence of coordinated activity patterns in the developing zebrafish spinal cord. Notably, Wan et al. (2019, Cell) performed a comprehensive single-cell reconstruction of emerging population activity in the entire developing zebrafish spinal cord using calcium imaging. This work should be properly acknowledged and compared. The specific advantages of voltage imaging over these prior studies need to be more clearly articulated, e.g. detection of subthreshold events and membrane potential waveform kinetics.

      (2) Considerations for generalizability of the ArcLight-based voltage imaging approach:

      While this study successfully demonstrates voltage imaging using ArcLight in the developing spinal cord, the generalizability of this approach to later developmental stages and other neural systems warrants discussion. ArcLight exhibits relatively slow kinetics (rise time ~100-200 ms, decay τ ~200-300 ms). In the current study, these kinetics are well-suited to the developmental activity patterns observed (firing frequency ~0.3 Hz), representing appropriate matching of indicator properties to biological timescales. However, the same approach may be less suitable for later developmental stages when neural activity occurs at higher frequencies.

      (3) Incomplete methodological descriptions:

      As a paper establishing a new imaging approach, several critical details are missing or unclear.

      (a) Imaging system specifications: The imaging setup description lacks essential information, including light source specifications, excitation wavelength/filter sets, and light power at the sample. The authors should also clarify whether wide-field optics was used rather than confocal or selective plane imaging.

      (b) Long-term imaging protocol: Whether neurons were imaged continuously or with breaks between imaging sessions is not explicitly stated. The current phrasing could be interpreted as a continuous 4.5-hour recording, which would be technically impressive but may not be what was actually done.

      (c) Image processing procedures: Denoising and bleach correction procedures are mentioned but not described, which is critical for a methods-focused paper.

      (d) The waveform classification (Supplementary Figure S6) shows overlapping kinetics between "immature" and "mature" firing, yet the classification method is not adequately justified.

      (e) Given that photostability and toxicity are critical considerations for long-term voltage imaging, these aspects warrant further clarification. While the figures suggest stable ArcLight fluorescence during the experiments, the manuscript lacks quantification of photobleaching, a discussion of potential toxicity concerns associated with the indicator, and information regarding the maximum duration over which the ArcLight signal can faithfully report physiological voltage dynamics.

      (4) Incomplete data representation and quantification:

      (a) The claim of "reduced variability" in calcium imaging (line 194) is not clearly demonstrated in Supplementary Figure S1.

      (b) Amplitude distributions for cell/subcellular compartments are not systematically quantified. Figure S3 shows ~5% changes in some axons versus ~2% in others, but it remains unclear whether these variabilities reflect differences between axonal compartments within the same cell, between individual cells, or between individual fish.
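
      The matching of indicator kinetics to biological timescales raised in Weakness (2) can be sketched numerically. The example below treats the indicator as a first-order low-pass filter, which is an assumption made for illustration. It uses τ = 250 ms, the midpoint of the decay range quoted above, and compares the ~0.3 Hz firing frequency with a hypothetical 10 Hz signal. The gain applies to a sinusoidal component at the stated frequency; individual spike waveforms contain faster components that would be attenuated more.

```python
import math

def first_order_gain(freq_hz, tau_s):
    """Amplitude gain of a first-order low-pass filter at a given frequency.

    gain = 1 / sqrt(1 + (2 * pi * f * tau)^2)
    """
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * freq_hz * tau_s) ** 2)

tau_s = 0.25  # assumed time constant: midpoint of the 200-300 ms decay range

# Fraction of signal amplitude preserved at the reported ~0.3 Hz activity.
gain_developmental = first_order_gain(0.3, tau_s)

# Fraction preserved at a hypothetical 10 Hz signal from a later stage.
gain_later_stage = first_order_gain(10.0, tau_s)
```

      Under these assumptions roughly 90% of the amplitude survives at 0.3 Hz, whereas less than 10% survives at 10 Hz, consistent with the reviewer's concern about later developmental stages.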

    1. eLife Assessment

      This study presents a valuable and practical approach for one-photon imaging through GRIN lenses. By scanning a low numerical aperture (NA) beam and collecting fluorescence with a high NA, the method expands the usable field of view and yields clearer cellular signals. The evidence is solid overall, with strong qualitative demonstrations, but some claims would benefit from additional quantitative tests. The work will interest researchers who need simple, scalable tools for large‑area cellular imaging in the brain.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript reported a method for deep brain imaging with a GRIN lens that combines "low-NA telecentric scanning (LNTS) of laser excitation with high-NA fluorescence collection" to achieve a larger FOV than conventional approaches.

      Strengths:

      The manuscript presented in vivo structural images and calcium activity results in side-by-side comparison to wide-field epi fluorescence imaging through a GRIN lens and two-photon scanning imaging.

      Weaknesses:

      (1) Lack of sufficient technical information on the "high-NA (1.0) fluorescence collection". Is it custom-made or an off-the-shelf component? The only optical schematic, Figure 1, shows two lenses and a Si-PMT as the collection apparatus. There is no information about the lenses and the spacing between each component.

      (2) There is no discussion about the speed limitation of the LNTS method, which, as a scanning-based method, is limited by the scanner speed. At a 10 Hz frame rate, the LNTS, although it has a better FOV, is much slower than widefield fluorescence imaging. The 10 Hz speed is not sufficient for some fast calcium activities.

      (3) Supplementary Figure 5 is irrelevant to the main claim of the manuscript. This is a preliminary simulation related to the authors' proposed future work.

    3. Reviewer #2 (Public review):

      Summary:

      This study introduces a simple optical strategy for one-photon imaging through GRIN lenses that prioritizes coverage while maintaining practical signal quality. By using low-NA telecentric scanned excitation together with high-NA collection, the approach aims to convert nearly the full lens facet into a usable field of view (FOV) with uniform contrast and visible somata. The method is demonstrated in 4-µm fluorescent bead samples and mouse brain, with qualitative comparisons to widefield and two-photon (2P) imaging. Because the configuration relies on standard components and a minimalist optical layout, it may enable broader access to large-area cellular imaging in the deep brain across neuroscience laboratories.

      Strengths:

      (1) This method mitigates off-axis aberrations and enlarges the usable FOV. It achieves near full-facet usable FOV with consistent centre-to-edge contrast, as evidenced by 4-µm fluorescent bead samples (uniform visibility to the edge) and in vivo microglia imaging (resolvable somata across the field).

      (2) The optical design is simple and supports efficient photon collection, lowering the barrier to adoption relative to adaptive optics (AO) or lens design-based correction. Using standard components and treating the GRIN lens as a high-NA (~1.0) light pipe increases collection efficiency for ballistic and scattered fluorescence. Figure annotations report the illumination energy required to reach a fixed detected-photon target (e.g., ~1000 detected photons per bead/cell for the 500-µm FOV condition), and under this equal-output criterion, the LNTS configuration achieves comparable or better image quality at lower illumination energy than conventional wide-field imaging, supporting improved photon efficiency and implying reduced bleaching and heating for equivalent signal levels.

      (3) The in vivo functional recordings are stable and exhibit strong signals. In vivo calcium imaging shows high-SNR ΔF/F₀ traces that remain stable over ~30-minute sessions with only modest baseline drift reported, supporting physiological measurements without heavy denoising and enabling large-scale data collection.

      (4) The low-NA excitation provides an extended focal depth, enabling more neurons to be tracked concurrently within a single FOV while maintaining practical signal quality. It reduces sensitivity to axial motion and minor misalignment and enhances overall experimental efficiency.

      Weaknesses:

      (1) Quantitative characterization is limited. Resolution and contrast are not comprehensively mapped as functions of field position and depth, and a clear, operational definition of "usable FOV" is not specified with threshold criteria.

      (2) The claim of approximately 100% usable FOV is largely supported by qualitative images; standardized metrics (e.g., PSF/MTF maps, contrast-to-noise ratio profiles, cell-detection yield versus radius) are needed to calibrate expectations and enable comparison across systems.

      (3) The trade-off inherent to low NA excitation, namely a broader axial PSF and possible neuropil/background contamination, is acknowledged qualitatively but not quantified. Analyses that separate in-focus from out-of-focus signal would help readers judge single-cell fidelity across the field.

      (4) Generalizability remains to be established. Performance across multiple GRIN lens models (e.g., differing in diameter or NA) and across excitation wavelengths has not yet been demonstrated. Longer-session photobleaching, heating, and phototoxicity, particularly near the edge of the FOV, also require fuller evaluation.

      Readers should view it as a coverage-first strategy that enlarges the FOV while accepting a modest trade-off in resolution due to the low-NA excitation and the extended axial PSF.
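
      Weaknesses (1) and (2) ask for an operational definition of "usable FOV" and for radial profiles of a standardized metric. The following is a minimal sketch of one possible definition, not the authors' method: it assumes a contrast-to-noise ratio (CNR) has already been measured for each bead or cell, bins those values by distance from the lens centre, and calls the FOV usable out to the first radial bin whose mean CNR falls below a chosen fraction of the central bin. The function name, the bin count, and the 50% threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def usable_fov_fraction(radius, cnr, facet_radius, n_bins=10, rel_threshold=0.5):
    """Radial CNR profile and usable-area fraction (illustrative definition only).

    radius, cnr  : per-object distance from the lens centre and measured CNR
    facet_radius : radius of the GRIN lens facet, same units as `radius`
    rel_threshold: hypothetical cut-off, as a fraction of the central-bin CNR
    """
    radius = np.asarray(radius, dtype=float)
    cnr = np.asarray(cnr, dtype=float)
    edges = np.linspace(0.0, facet_radius, n_bins + 1)
    # Assign each object to a radial bin; clip so r == facet_radius lands in the last bin.
    idx = np.clip(np.digitize(radius, edges) - 1, 0, n_bins - 1)
    # Mean CNR per bin; bins without any object are NaN and count as failing.
    profile = np.array([cnr[idx == b].mean() if np.any(idx == b) else np.nan
                        for b in range(n_bins)])
    ok = profile >= rel_threshold * profile[0]
    # Index of the first failing bin, or n_bins if every bin passes.
    first_fail = n_bins if ok.all() else int(np.argmin(ok))
    usable_radius = edges[first_fail]
    return edges, profile, (usable_radius / facet_radius) ** 2

# Usage with synthetic numbers (not data from the study).
r = np.linspace(0.0, 250.0, 51)        # object positions in micrometres
c = np.where(r < 200.0, 10.0, 2.0)     # CNR drops sharply beyond 200 micrometres
edges, profile, fraction = usable_fov_fraction(r, c, facet_radius=250.0)
print(profile, fraction)
```

      Reporting the profile together with the threshold would let readers compare systems on a common footing. Other metrics named above, such as PSF width or cell-detection yield, could be binned in the same way.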

    1. eLife Assessment

      This study provides a valuable advance in understanding how decision boundaries may change over time during simple choices by introducing a method that uses information about non-decision components to improve parameter estimates. The evidence supporting the main claims is convincing, with clear demonstrations on simulated and real data, although additional model comparison work would further strengthen confidence. The findings will be of interest to researchers studying human decision processes and the methods used to analyse them.

    2. Reviewer #1 (Public review):

      Summary:

      This paper proposes a non-decision time (NDT)-informed approach to estimating time-varying decision thresholds in diffusion models of decision making. The manuscript motivates the method well, outlines the identifiability issues it is intended to address, and evaluates it using simulations and two empirical datasets. The aim is clear, the scope is deliberately focused, and the manuscript is well written. The core idea is interesting, technically grounded, and a meaningful contribution to ongoing work on collapsing thresholds.

      Strengths:

      The manuscript is logically structured and easy to follow. The emphasis on parameter recovery is appropriate and appreciated. The finding that the exponential NDT-informed function produces substantially better recovery than the hyperbolic form is useful, given the importance placed on identifiability earlier in the paper. The threshold visualisations are also helpful for interpreting what the models are doing. Overall, the work offers a well-defined, methodologically oriented contribution that will interest researchers working on time-varying thresholds.

      Weaknesses / Areas for Clarification:

      A few points would benefit from clarification, additional analysis, or revised presentation:

      (1) It would help readers to see a concrete demonstration of the trade-off between NDT and collapsing thresholds, to give a sense of the scale of the identifiability problem motivating the work.

      (2) Before moving to the empirical datasets, the manuscript really needs a simulation-based model-recovery comparison, since all major conclusions of the empirical applications rely on model comparison. One approach might be to simulate from (a) an FT model with across-trial drift variability and (b) one of the CT models, then fit both models to each of the simulated data sets. This would address a longstanding issue: sometimes CT models are preferred even when the estimated collapse in the thresholds is close to zero. A recovery study would confirm that model selection behaves sensibly in the new framework.

      (3) An additional subtle point is that BIC is defined in terms of the maximised log-likelihood of the model for the data being modelled. In the joint model, the parameter estimates maximise the combined likelihood of behavioural and non-decision-time data. This means the behavioural log-likelihood evaluated at the joint MLEs is not the behavioural MLE. If BIC is being computed for the behavioural data only, this breaks the assumptions underlying BIC. The only valid BIC here would be one defined for the joint model using the joint likelihood.

      (4) Table 1 sets up the Study 1 comparisons, but there's no row for the FT model. Similarly, Figures 10 and 13 would be more informative if they included FT predictions. This matters because, in Study 1, the FT model appears to fit aggregate accuracy better than the BIC-preferred collapsing model, currently shown only in Appendix 5. Some discussion of why would strengthen the argument.

      (5) In Figure 7, the degree of decay underestimation is obscured by the use of a density plot rather than a scatterplot, which is what the other panels of the same figure use. Presenting it as a scatterplot would make the mis-recovery more transparent. The accompanying text may also need clarification: when data are generated from an FT model with across-trial drift variability, the NDT-informed model seems to infer essentially FT boundaries. If that's correct, the model must be misfitting the simulated data. This is actually a useful result, as it suggests that across-trial drift variability in FT models is discriminable from collapsing-threshold models. It would be good to make this explicit.

      (6) Given the large recovery advantage of the exponential NDT-informed function over the hyperbolic one, the authors may want to consider whether the results favour adopting the former more generally. On this evidence, I would consider recommending the exponential NDT-informed model for future use.

      (7) In Study 2 (Figure 13), all models qualitatively miss an interesting empirical pattern: under speed emphasis, errors are faster than corrects, while under accuracy emphasis, errors become slower. The error RT distribution in the speed condition is especially poorly captured. It would be helpful for the authors to comment, as it suggests that something theoretically relevant is missing from all models tested.

      (8) The threshold visualisations extend to 3 seconds, yet both datasets show decisions mostly finishing by ~1.5 seconds. Shortening the x-axis would better reflect the empirical RT distributions and avoid unintentionally overstating the timescale of the empirical decision processes.
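
      Point (3) on BIC can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not the manuscript's model: two Gaussian data sets with known unit variance share a single mean parameter, standing in for the behavioural and non-decision-time data. The joint MLE is then the pooled mean, and the behavioural log-likelihood evaluated there can never exceed its value at the behavioural-only MLE. All names are hypothetical, and treating the joint sample size as the sum of the two sample sizes is itself an assumption.

```python
import numpy as np

def gauss_loglik(data, mu, sigma=1.0):
    """Log-likelihood of i.i.d. Gaussian data with known sigma."""
    data = np.asarray(data, dtype=float)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                        - (data - mu) ** 2 / (2.0 * sigma**2)))

def bic(loglik, n_params, n_obs):
    """BIC = k * ln(n) - 2 * ln(L); valid only if loglik is a maximised value."""
    return n_params * np.log(n_obs) - 2.0 * loglik

rng = np.random.default_rng(0)
behav = rng.normal(0.0, 1.0, size=200)   # stand-in for behavioural data
aux = rng.normal(0.3, 1.0, size=200)     # stand-in for external NDT data

mu_behav = behav.mean()                          # behavioural-only MLE
mu_joint = np.concatenate([behav, aux]).mean()   # joint MLE of the shared mean

ll_behav_at_own_mle = gauss_loglik(behav, mu_behav)
ll_behav_at_joint_mle = gauss_loglik(behav, mu_joint)   # not a maximum
ll_joint = ll_behav_at_joint_mle + gauss_loglik(aux, mu_joint)

bic_behav_valid = bic(ll_behav_at_own_mle, 1, behav.size)
bic_behav_at_joint = bic(ll_behav_at_joint_mle, 1, behav.size)  # violates the BIC premise
bic_joint = bic(ll_joint, 1, behav.size + aux.size)             # BIC of the joint model

print(bic_behav_valid, bic_behav_at_joint, bic_joint)
```

      In this toy case the behaviour-only BIC computed at the joint estimate is always at least as large as the valid behavioural BIC, so comparing it against models fitted to behaviour alone is not a like-for-like comparison.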

    3. Reviewer #2 (Public review):

      Summary:

      The authors use simulations and empirical data fitting in order to demonstrate that informing a decision model on estimates of single-trial non-decision time can guide the model to more reliable parameter estimates, especially when the model has collapsing bounds.

      Strengths:

      The paper is well written and motivated, with clear depth of knowledge in the areas of neurophysiology of decision-making, sequential sampling models, and, in particular, the phenomenon of collapsing decision bounds.

      Two large-scale simulations are run to test parameter recovery, and two empirical datasets are fit and assessed; the fitting procedures themselves are state-of-the-art, and the study makes use of a very new and well-designed ERP decomposition algorithm that provides single-trial estimates of the duration of diffusion; the results provide inferences about the operation of decision bound collapse - all of this is impressive.

      Weaknesses:

      This is an interesting and promising idea, but a very important issue is not clear: it is an intuitive principle that information from an external empirical source can enhance the reliability of parameter estimates for a given model, but how can the overall BIC improve, unless it is in fact a different model? Unfortunately, it is not clear whether and how the model structure itself differs between the NDT-informed and non-NDT-informed cases. Ideally, they are the same actual model, but with one getting extra guidance on where to place the tau and/or sigma parameters from external measurements. The absence of sigma (non-decision time variance) estimates for the non-NDT-informed model, however, suggests it is different in structure, not just in its lack of constraints. If they were the same model, whether they do or do not possess non-decision time variability (which is not currently clear), the only possible reason that the NDT-informed model could achieve better BIC is because the non-NDT-informed model gets lost in the fitting procedure and fails to find the global optimum. If they are in fact different models - for example, if the NDT-informed model is endowed with NDT variability, while the non-NDT-informed model is not - then the fit superiority doesn't necessarily say anything about an NDT-informed reliability boost, but rather just that a model with NDT variability fits better than one without.

      One reason this is unclear is that Footnote 4 says that this study did not allow trial-to-trial variability in nondecision time, but the entire premise of using variable external single-trial estimates of nondecision times (illustrated in Figure 2) assumes there is nondecision time variability and that we have access to its distribution.

      It is good that there is an Intro section to explain how the tradeoff between NDT and collapsing bound parameters renders them difficult to simultaneously identify, but I think it needs more work to make it clear. First of all, identifying both is not impossible in the way that, say, resolving pre- and post-decisional nondecision time components from behaviour alone is impossible - the intro had already talked about how collapsing bounds impact RT distribution shapes in specific ways, and obviously mean (or invariant) NDT can't do that - it can only translate the whole distribution earlier/later on the time axis. This is at odds with the phrasing "one CANNOT estimate these three parameters simultaneously." So it should be first clarified that this tradeoff is not absolute. Second, many readers will wonder if it is simply a matter of characterising the bound collapse time course as beginning at accumulation onset, instead of stimulus onset - does that not sidestep the issue? Third, assuming the above can be explained, and there is a reason to keep the collapse function aligned to stimulus onset, could the tradeoff be illustrated by picking two distinct sets of parameter values for non-decision time, starting threshold, and decay rate, which produce almost identical bound dynamics as a function of RT? It is not going to work for most readers to simply give the formula on line 211 and say "There is a tradeoff." Most readers will need more hand-holding.
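
      The kind of illustration requested in the third point can be sketched numerically. The manuscript's formula on line 211 is not reproduced here, so this snippet assumes a purely exponential collapse that starts at accumulation onset, b = b0 * exp(-decay * (RT - ndt)). Under that assumed form, a shorter non-decision time can be offset by a larger starting threshold at the same decay rate. The parameter values are arbitrary and the function name is hypothetical.

```python
import numpy as np

def bound_at_rt(rt, b0, decay, ndt):
    """Assumed exponential bound, expressed on the RT axis (RT = ndt + decision time)."""
    rt = np.asarray(rt, dtype=float)
    return b0 * np.exp(-decay * (rt - ndt))

# Evaluate only at RTs later than both non-decision times.
rt = np.linspace(0.4, 1.5, 50)

# Parameter set A: arbitrary illustrative values.
set_a = dict(b0=1.5, decay=1.0, ndt=0.30)

# Parameter set B: shorter NDT, compensated by a higher starting threshold
# b0' = b0 * exp(decay * (ndt_a - ndt_b)), with the same decay rate.
set_b = dict(b0=1.5 * np.exp(1.0 * (0.30 - 0.20)), decay=1.0, ndt=0.20)

max_gap = float(np.max(np.abs(bound_at_rt(rt, **set_a) - bound_at_rt(rt, **set_b))))
print(max_gap)  # difference between the two bound trajectories on the RT axis
```

      Under this assumed form the two bound trajectories coincide at every RT shown, yet accumulation starts 100 ms apart in the two sets, so the predicted RT distributions need not be identical. That is consistent with the point above that the tradeoff is strong but not absolute.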

      A lognormal distribution is used as line 231 says it "must" produce a right-skew. Why? It is unusual for non-decision time distribution to be asymmetric in diffusion modeling, so this "must" statement must be fully explained and justified. Would I be right in saying that if either fixed or symmetrically distributed nondecision times were assumed, as in the majority of diffusion models, then the non-identifiability problem goes away? If the issue is one faced only by a special class of DDMs with lognormal NDT, this should be stated upfront.

      In the simulation study methods, is the only difference between NDT-informed and non-informed models that the non-NDT-informed must also estimate tau and sigma, whereas the NDT-informed model "knows" these two parameters and so only has the other three to estimate? And is it the exact same data that the two models are fit to, in each of the simulation runs? Why is sigma missing from the uninformed part of Figure 4? If it is nondecision time variability, shouldn't the model at least be aware of the existence of sigma and try to estimate it, in order for this to be a meaningful comparison?

      I am curious to know whether a linear bound collapse suffers from the same identifiability issues with NDT, or was it not considered here because it is so suboptimal next to the hyperbolic/exponential?

      The approach using HMP rests on the assumption that accumulation onset is marked by the peak of a certain neural event, but even if it is highly predictive of accumulation onset, depending on what it reflects, it could come systematically earlier or later than the actual accumulation onset. Could the authors comment on what implications this might have for the approach?

      Figure 7: for this simulation, it would be helpful to know the degree to which you can get away with not equipping the model to capture drift rate variability, when the degree of that d.r. variability actually produces appreciable slow error rates. The approach here is to sample uniformly from ranges of the parameters, but how many of these produce data that can be reasonably recognised as similar to human behaviour on typical perceptual decision tasks? The authors point out that only 5% of fits estimate an appreciable bound collapse but if there are only 10% of the parameter vectors that produce data in a typical RT range with typical error rates etc, and half of these produce an appreciable downturn in accuracy for slower RT, and all of the latter represent that 5%, then that's quite a different story. An easy fix would be to plot estimated decay as a scatter plot against the rate of decline of accuracy from the median RT to the slowest RT, to visualise the degree to which slow errors can be absorbed by the no-dr-var model without falsely estimating steep bound collapse. In general, I'm not so sure of the value of this section, since, in principle, there is no getting around the fact that if what is in truth a drift-variability source of slow errors is fit with a model that can only capture it with a collapsing bound, it will estimate a collapsing bound, or just fail to capture those slow errors.

    4. Reviewer #3 (Public review):

      The current paper addresses an important issue in evidence accumulation models: many modelers implement flat decision boundaries because the collapsing alternatives are hard to reliably estimate. Here, using simulations, the authors demonstrate that parameter recovery can be drastically improved by providing the model with additional data (specifically, an EEG-informed estimate of non-decision time). Moreover, in two empirical datasets, it is shown that those EEG-informed models provide a better fit to the data. The method seems sound and promising and might inform future work on the debate regarding flat vs collapsing choice boundaries. As an evidence-accumulation enthusiast, I am quite excited about this work, although for a broader audience, the immediate applicability of this approach seems limited because it does require EEG data (which limits widespread use of the method and makes it hard to address, for example, questions about individual differences that require a very large N).

    1. eLife Assessment

      This study provides important evidence that myristate, a fatty acid commonly present in soil environments, is taken up by arbuscular mycorrhizal fungi during symbiosis with a plant host. The evidence presented is solid, with multiple experimental approaches including stable isotope tracing, transcriptional analysis, and physiological measurements across different plant species and phosphorus conditions. However, the main claims are only partially supported.

    2. Reviewer #1 (Public review):

      Summary:

      Two major breakthroughs in the field of arbuscular mycorrhiza (AM) were the discoveries, first, that AM fungi obtain lipids (not only carbohydrates) from their plant hosts (Bravo et al 2017; Jiang et al 2017; Keymer et al 2017; Luginbuehl et al 2017) and, second, that presumably obligate biotrophic AM fungi can produce spores in the absence of host plants when exposed to myristate (Sugiura et al 2020; Tanaka et al 2022).

      For this manuscript, Chen et al asked the question of whether myristate in the soil may also play a role in AM symbiosis when AM fungi live in symbiosis with their plant hosts. They show that myristate occurs in natural as well as agricultural soils, probably as a component of root exudates. Further, they treat AM fungi with myristate when grown in symbiosis in a Petri dish system with carrot hairy roots or in pots with alfalfa or rice to describe which effect the exogenous myristate has on symbiosis. Using 13C labelling, they show that myristate is taken up by AM fungi, although they can obtain sugars and lipids from the plant host. They also show that myristate leads to an increase in root colonization as well as expression of fungal genes involved in FA assimilation.

      Interestingly, the effect of myristate on colonization depends on the plant species and the level of phosphate fertilization provided to the plant. The reason for this remains unknown.

      Strengths:

      The findings are interesting and provide an advance in our understanding of lipid use by the extraradical mycelium of AM fungi.

      Weaknesses:

      However, there are some misconceptions in the writing, and some experimental results remain unclear because they are presented in a highly descriptive manner without interpretation or explanation.

    3. Reviewer #2 (Public review):

      Summary:

      Arbuscular mycorrhizal fungi (AMF) are among the most widely distributed soil microorganisms, forming symbiotic relationships (AM symbiosis) with approximately 70% of terrestrial vascular plants. AMF are considered obligate biotrophs that rely on host-derived symbiotic carbohydrates. However, it remains unclear whether symbiotic AMF can access exogenous non-symbiotic carbon sources. By conducting three interconnected and complementary experiments, Chen et al. investigated the direct uptake of exogenous 13C1-labeled myristate by symbiotic Rhizophagus irregularis, R. intraradices, and R. diaphanous, and assessed their growth responses using AMF-carrot hairy root co-culture systems (Experiments 1 and 2). They also explored the environmental distribution of myristate in plant and soil substrates, and evaluated the impact of exogenous myristate on the symbiotic carbon-phosphorus exchange between R. irregularis and alfalfa or rice in a greenhouse experiment (Experiment 3). Given that the AM symbiosis not only plays a significant role in the biogeochemical cycling of C and P elements but also acts as a key driver of plant community structure and productivity, the topic of this manuscript is relevant. The study is well-designed, and the manuscript is well-written. I find it easy and interesting to follow the entire narrative.

      Strengths:

      The manuscript provides evidence from 13C labeling and molecular analyses showing that symbiotic AMF can absorb non-symbiotic C sources like myristate in the presence of plant-derived symbiotic carbohydrates, challenging the traditional assumption that AMF exclusively rely on symbiotic carbon sources supplied from associated host plants. This finding advances our understanding of the nutritional interactions between AMF and host plants. Furthermore, the manuscript reveals that myristate is widely present in diverse soil and plant components; however, exogenous myristate disrupts the carbon-phosphorus exchange in arbuscular mycorrhizal symbiosis. These insights have significant implications for the application and regulation of the AM symbiosis in sustainable agriculture and ecological restoration.

      Weaknesses:

      The limitations of this study include:

      (1) The absorption of myristate by symbiotic AMF was observed only after exogenous application under artificial conditions, which may not accurately reflect natural environments.

      (2) The investigation into the mechanism by which myristate disrupts C-P exchange in AM symbiosis remains preliminary.

      Nevertheless, the authors have adequately discussed these limitations in the manuscript.

    4. Reviewer #3 (Public review):

      Summary:

      The authors have addressed a major question that has been open since the discovery of myristate uptake by AM fungi as a non-symbiotic C source. Myristate has been used to grow some AM fungi axenically, but the biological significance of this saprobic capability in natural or agronomical environments remained unexplored. The results of this research soundly demonstrate that myristate-derived C is used by AM fungi, leading to improved development of both extraradical and intraradical mycelium (at least under low P conditions). However, this does not lead to obvious advantages for the plant, since symbiotic nutrient exchange (carbon and phosphorus) is reduced upon myristate application. Furthermore, myristate-treated plants quench their defence responses.

      Strengths:

      The study is extensive, based on a solid experimental setup and methodological approach, combining several state-of-the-art techniques. The conclusions are novel and of high relevance for the scientific community. The writing is fluent and clear.

      Weaknesses:

      Some of the figures should be improved for clarity. The conclusions do not state a point that, in my opinion, emerges clearly from the results: myristate application in agriculture does not seem to be a very promising approach, since it unbalances the nutritional equilibrium of the symbiosis and may weaken plant immunity. This is a very important point (albeit a rather unwelcome one for applied scientists) that should be stressed in the conclusions.

    1. eLife Assessment

      This important study reports on the relationships between cerebral haemodynamics and a number of factors that relate to genetics, lifestyle, and medical history using data from a large cohort. Compelling evidence suggests that brief arterial spin labelling MRI acquisition can lead to both expected observations about brain health, as manifested in cerebral blood flow, and biomarkers for use in diagnosis and treatment monitoring. The results can be used as a starting point for hypothesis generation and further evaluation of conditions expected to affect haemodynamics in the brain.

    2. Reviewer #1 (Public review):

      Summary:

      In this work, Okell et al. describe the imaging protocol and analysis pipeline pertaining to the arterial spin labeling (ASL) MRI protocol acquired as part of the UK Biobank imaging study. In addition, they present preliminary analyses of the first 7000+ subjects in whom ASL data were acquired, and this represents the largest such study to date. Careful analyses revealed expected associations between ASL-based measures of cerebral hemodynamics and non-imaging-based markers, including heart and brain health, cognitive function, and lifestyle factors. As it measures physiology and not structure, ASL-based measures may be more sensitive to these factors compared with other imaging-based approaches.

      Strengths:

      This study represents the largest MRI study to date to include ASL data in a wide age range of adult participants. The ability to derive arterial transit time (ATT) information in addition to cerebral blood flow (CBF) is a considerable strength, as many studies focus only on the latter.

      Some of the results (e.g., relationships with cardiac output and hypertension) are known and expected, while others (e.g., lower CBF and longer ATT correlating with hearing difficulty in auditory processing regions) are more novel and intriguing. Overall, the authors present very interesting physiological results, and the analyses are conducted and presented in a methodical manner.

      The analyses regarding ATT distributions and the potential implications for selecting post-labeling delays (PLD) for single PLD ASL are highly relevant and well-presented.

      Weaknesses:

      At a total scan duration of 2 minutes, the ASL sequence utilized in this cohort is much shorter than that of a typical ASL sequence (closer to 5 minutes as mentioned by the authors). However, this implementation also included multiple (n=5) PLDs. As currently described, it is unclear how many repetitions were acquired at each PLD and whether these were acquired efficiently (i.e., with a Look-Locker readout) or whether individual repetitions within this acquisition were dedicated to a single PLD. If the latter, the number of repetitions per PLD (and consequently signal-to-noise-ratio, SNR) is likely to be very low. Have the authors performed any analyses to determine whether the signal in individual subjects generally lies above the noise threshold? This is particularly relevant for white matter, which is the focus of several findings discussed in the study.

      Hematocrit is one of the variables regressed out in order to reduce the effect of potential confounding factors on the image-derived phenotypes. The effect of this, however, may be more complex than accounting for other factors (such as age and sex). The authors acknowledge that hematocrit influences ASL signal through its effect on longitudinal blood relaxation rates. However, it is unclear how the authors handled the fact that the longitudinal relaxation of blood (T1Blood) is explicitly needed in the kinetic model for deriving CBF from the ASL data. In addition, while it may reduce false positives related to the relationships between dietary factors and hematocrit, it could also mask the effects of anemia present in the cohort. The concern, therefore, is two-fold: (1) Were individual hematocrit values used to compute T1Blood values? (2) What effect would the deconfounding process have on this?

      The authors leverage an observed inverse association between white matter hyperintensity volume and CBF as evidence that white matter perfusion can be sensitively measured using the imaging protocol utilized in this cohort. The relationship between white matter hyperintensities and perfusion, however, is not yet fully understood, and there is disagreement regarding whether this structural imaging marker necessarily represents impaired perfusion. Therefore, it may not be appropriate to use this finding as support for validation of the methodology.

    3. Reviewer #2 (Public review):

      Summary:

      Okell et al. report the incorporation of arterial spin-labeled (ASL) perfusion MRI into the UK Biobank study and preliminary observations of perfusion MRI correlates from over 7000 acquired datasets, which is the largest sample of human perfusion imaging data to date. Although a large literature already supports the value of ASL MRI as a biomarker of brain function, this important study provides compelling evidence that a brief ASL MRI acquisition may lead to both fundamental observations about brain health as manifested in CBF and valuable biomarkers for use in diagnosis and treatment monitoring.

      ASL MRI noninvasively quantifies regional cerebral blood flow (CBF), which reflects both cerebrovascular integrity and neural activity, hence serves as a measure of brain function and a potential biomarker for a variety of CNS disorders. Despite a highly abbreviated ASL MRI protocol, significant correlations with both expected and novel demographic, physiological, and medical factors are demonstrated. In many such cases, ASL was also more sensitive than other MRI-derived metrics. The ASL MRI protocol implemented also enables quantification of arterial transit time (ATT), which provides stronger clinical correlations than CBF in some factors. The results demonstrate both the feasibility and the efficacy of ASL MRI in the UK Biobank imaging study, which expects to complete ASL MRI in up to 60,000 richly phenotyped individuals.

      Strengths:

      A key strength of this study is the use of an ASL MRI protocol incorporating balanced pseudocontinuous labeling with a background-suppressed 3D readout, which is the current state-of-the-art. To compensate for the short scan time, voxel resolution was intentionally only moderate. The authors also elected to acquire these data across five post-labeling delays, enabling ATT and ATT-corrected CBF to be derived using the BASIL toolbox, which is based on a variational Bayesian framework. The resulting CBF and ATT maps shown in Figure 1 are quite good, especially when combined with such a large and deeply phenotyped sample.

      Another strength of the study is the rigorous image analysis approach, which included covariation for a number of known CBF confounds as well as correction for motion and scanner effects. In doing so, the authors were able to confirm expected effects of age, sex, hematocrit, and time of day on CBF values. These observations lend confidence in the veracity of novel observations, for example, significant correlations between regional ASL parameters and cardiovascular function, height, alcohol consumption, depression, and hearing, as well as with other MRI features such as regional diffusion properties and magnetic susceptibility. They also provide valuable observations about ATT and CBF distributions across a large cohort of middle-aged and older adults.

      Weaknesses:

      This study primarily serves to illustrate the efficacy and potential of ASL MRI as an imaging parameter in the UK Biobank study, but some of the preliminary observations will be hypothesis-generating for future analyses in larger sample sizes. However, a weakness of the manuscript is that some of the reported observations are difficult to follow. In particular, the associations between ASL and resting fMRI illustrated in Figure 7 and described in the accompanying Results text are difficult to understand. It could also be clearer whether the spatial maps showing ASL correlates of other image-derived phenotypes in Figure 6B are global correlations or confined to specific regions of interest. Finally, while addressing partial volume effects in gray matter regions by covarying for cortical thickness is a reasonable approach, the Methods section seems to imply that a global mean cortical thickness is used, which could be problematic given that cortical thickness changes may be localized.

    4. Reviewer #3 (Public review):

      Summary:

      This is an extremely important manuscript in the evolution of cerebral perfusion imaging using Arterial Spin Labelling (ASL). The number of subjects that were scanned has provided the authors with a unique opportunity to explore many potential associations between regional cerebral blood flow (CBF) and clinical and demographic variables.

      Strengths:

      The major strength of the manuscript is the access to an unprecedentedly large cohort of subjects. It demonstrates the sensitivity of regional tissue blood flow in the brain as an important marker of resting brain function. In addition, the authors have demonstrated a thorough analysis methodology and good statistical rigour.

      Weaknesses:

      This reviewer did not identify any major weaknesses in this work.

    5. Author response:

      We thank the editors and reviewers for their generally positive and thoughtful feedback on this work. Below are provisional responses to some of the concerns raised:

      Reviewer 1:

      At a total scan duration of 2 minutes, the ASL sequence utilized in this cohort is much shorter than that of a typical ASL sequence (closer to 5 minutes as mentioned by the authors). However, this implementation also included multiple (n=5) PLDs. As currently described, it is unclear how many repetitions were acquired at each PLD and whether these were acquired efficiently (i.e., with a Look-Locker readout) or whether individual repetitions within this acquisition were dedicated to a single PLD. If the latter, the number of repetitions per PLD (and consequently signal-to-noise-ratio, SNR) is likely to be very low. Have the authors performed any analyses to determine whether the signal in individual subjects generally lies above the noise threshold? This is particularly relevant for white matter, which is the focus of several findings discussed in the study.

      We agree that this was a short acquisition compared to most ASL protocols, necessitated by the strict time-keeping requirements for running such a large study. We apologise if this was not clear in the original manuscript, but due to this time constraint and the use of a segmented readout (which was not Look-Locker) there was only time available for a single average at each PLD. This does mean that the perfusion weighted images at each PLD are relatively noisy, although the image quality with this sequence was still reasonable, as demonstrated in Figure 1, with perfusion weighted images visibly above the noise floor. In addition, as has been demonstrated theoretically and experimentally in recent work (Woods et al., 2023, 2020), even though the SNR of each individual PLD image might be low in multi-PLD acquisitions, this is effectively recovered during the model fitting process, giving it comparable or greater accuracy than a protocol which collects many averages at a single (long) PLD. As also noted by the reviewers, this approach has the further benefit of allowing ATT estimation, which has proven to provide useful and complementary information to CBF. Finally, the fact that many of the findings in this study pass strict statistical thresholds for significance, despite the many multiple comparisons performed, and that the spatial patterns of these relationships are consistent with expectations, even in the white matter (e.g. Figure 6B), give us confidence that the perfusion estimation is robust. However, we will consider adding some additional metrics around SNR or fitting uncertainty in a revised manuscript, as well as clarifying details of the acquisition.

      Hematocrit is one of the variables regressed out in order to reduce the effect of potential confounding factors on the image-derived phenotypes. The effect of this, however, may be more complex than accounting for other factors (such as age and sex). The authors acknowledge that hematocrit influences ASL signal through its effect on longitudinal blood relaxation rates. However, it is unclear how the authors handled the fact that the longitudinal relaxation of blood (T1Blood) is explicitly needed in the kinetic model for deriving CBF from the ASL data. In addition, while it may reduce false positives related to the relationships between dietary factors and hematocrit, it could also mask the effects of anemia present in the cohort. The concern, therefore, is two-fold: (1) Were individual hematocrit values used to compute T1Blood values? (2) What effect would the deconfounding process have on this?

      We agree this is an important point to clarify. In this work we decided not to use the haematocrit to directly estimate the T1 of blood for each participant a) because this would result in slight differences in the model fitting for each subject, which could introduce bias (e.g. the kinetic model used assumes instantaneous exchange between blood water and tissue, so changing the T1 of blood for each subject could make us more sensitive to inaccuracies in this assumption); and b) because typically the haematocrit measures were quite some time (often years) prior to the imaging session, leading to an imperfect correction. We therefore took the pragmatic approach to simply regress each subject’s average haematocrit reading out of the IDP and voxelwise data to prevent it contributing to apparent correlations caused by indirect effects on blood T1. However, we agree with the reviewer that this certainly would mask the effects of anaemia in this cohort, so for researchers interested in this condition a different approach should be taken. We will update the revised manuscript to try to clarify these points.
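
      The deconfounding step described above (regressing each subject's haematocrit out of an IDP) can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit with an intercept and uses synthetic data; the function name `regress_out` and all numerical values are hypothetical and are not taken from the authors' pipeline, which may differ in detail.

```python
import numpy as np

def regress_out(y, confounds):
    """Return the residuals of y after a least-squares fit to the confounds.

    An intercept column is added, so the residuals have zero mean and are
    uncorrelated with every confound. Hypothetical helper for illustration.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(confounds, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Synthetic example: an IDP (e.g. a regional CBF value) that depends on haematocrit.
rng = np.random.default_rng(0)
n_subjects = 500
haematocrit = rng.normal(0.42, 0.03, n_subjects)  # made-up values
cbf = 60.0 - 80.0 * (haematocrit - 0.42) + rng.normal(0, 5, n_subjects)

cbf_deconfounded = regress_out(cbf, haematocrit)
r_before = np.corrcoef(cbf, haematocrit)[0, 1]
r_after = np.corrcoef(cbf_deconfounded, haematocrit)[0, 1]
```

      After this step the residual IDP carries no linear relationship with haematocrit, which is also why, as acknowledged above, any genuine effect that travels with haematocrit (such as anaemia) would be masked.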

      The authors leverage an observed inverse association between white matter hyperintensity volume and CBF as evidence that white matter perfusion can be sensitively measured using the imaging protocol utilized in this cohort. The relationship between white matter hyperintensities and perfusion, however, is not yet fully understood, and there is disagreement regarding whether this structural imaging marker necessarily represents impaired perfusion. Therefore, it may not be appropriate to use this finding as support for validation of the methodology.

      We appreciate the reviewer’s point that there is still debate about the relationship between white matter hyperintensities and perfusion. We therefore agree that this observed relationship does not validate the methodology in the sense of being an expected finding, but it does demonstrate that the data quality is sufficient to show significant correlations between white matter hyperintensity volume and perfusion, even in white matter regions, which would not be the case if the signal there were dominated by noise. Similarly, the clear spatial pattern of perfusion changes in the white matter that correlate with DTI measures in the same regions also suggests there is sensitivity to white matter perfusion. However, we will update the wording in the revised manuscript to try to clarify this point.

      Reviewer 2:

      This study primarily serves to illustrate the efficacy and potential of ASL MRI as an imaging parameter in the UK Biobank study, but some of the preliminary observations will be hypothesis-generating for future analyses in larger sample sizes. However, a weakness of the manuscript is that some of the reported observations are difficult to follow. In particular, the associations between ASL and resting fMRI illustrated in Figure 7 and described in the accompanying Results text are difficult to understand. It could also be clearer whether the spatial maps showing ASL correlates of other image-derived phenotypes in Figure 6B are global correlations or confined to specific regions of interest. Finally, while addressing partial volume effects in gray matter regions by covarying for cortical thickness is a reasonable approach, the Methods section seems to imply that a global mean cortical thickness is used, which could be problematic given that cortical thickness changes may be localized.

      We apologise if any of the presented information was unclear and will try to improve this in our revised manuscript. To clarify, the spatial maps associated with other (non-ASL) IDPs were generated by calculating the correlation between the ASL CBF or ATT in every voxel in standard space with the non-ASL IDP of interest, not the values of the other imaging modality in the same voxel. No region-based masking was used for this comparison. This allowed us to examine whether the correlation with this non-ASL IDP was only within the same brain region or if the correlations extended to other regions too.
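
      As a rough illustration of the voxelwise maps described above, the sketch below correlates, across subjects, the value in every voxel with a single per-subject IDP and applies no regional mask. The array shapes, the function name `voxelwise_correlation`, and the synthetic data are assumptions made for this example only, not the authors' implementation.

```python
import numpy as np

def voxelwise_correlation(voxel_data, idp):
    """Pearson r between each voxel (across subjects) and one per-subject IDP.

    voxel_data: array of shape (n_subjects, n_voxels), e.g. CBF in standard space.
    idp:        array of shape (n_subjects,), one scalar per subject.
    Returns an array of shape (n_voxels,).
    """
    v = voxel_data - voxel_data.mean(axis=0)
    z = idp - idp.mean()
    numerator = z @ v
    denominator = np.sqrt((v ** 2).sum(axis=0) * (z ** 2).sum())
    return numerator / denominator

# Synthetic example: only the first 100 voxels depend on the IDP.
rng = np.random.default_rng(1)
n_subjects, n_voxels = 200, 1000
idp = rng.normal(size=n_subjects)
cbf = rng.normal(size=(n_subjects, n_voxels))
cbf[:, :100] += 0.8 * idp[:, None]

r_map = voxelwise_correlation(cbf, idp)
```

      Because every voxel is tested against the same scalar IDP, the resulting map can show correlations both inside and outside the region from which the IDP was derived.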

      We also agree that the associations between ASL and resting fMRI are not easy to interpret. We therefore tried to be clear in the manuscript that these were preliminary findings that may be of interest to others, but clearly further study is required to explore this complex relationship further. However, we will try to clarify how the results are presented in the revised manuscript.

      In relation to partial volume effects, we did indeed use only a global measure of cortical thickness in the deconfounding and we acknowledged that this could be improved in the discussion: [Partial volume effects were] “mitigated here by the inclusion of cortical thickness in the deconfounding process, although a region-specific correction approach that is aware of the through-slice blurring (Boscolo Galazzo et al., 2014) is desirable in future iterations of the ASL analysis pipeline.” As suggested here, although this is a coarse correction, we did not feel that a more comprehensive partial volume correction approach could be used without properly accounting for the through-slice blurring effects from the 3D-GRASE acquisition (that will vary across different brain regions), which is not currently available, although this is an area we are actively working on for future versions of the image analysis pipeline. We again will try to clarify this point further in the revised manuscript.

      References

      Woods JG, Achten E, Asllani I, Bolar DS, Dai W, Detre J, Fan AP, Fernández-Seara M, Golay X, Günther M, Guo J, Hernandez-Garcia L, Ho M-L, Juttukonda MR, Lu H, MacIntosh BJ, Madhuranthakam AJ, Mutsaerts HJ, Okell TW, Parkes LM, Pinter N, Pinto J, Qin Q, Smits M, Suzuki Y, Thomas DL, Van Osch MJP, Wang DJ, Warnert EAH, Zaharchuk G, Zelaya F, Zhao M, Chappell MA. 2023. Recommendations for Quantitative Cerebral Perfusion MRI using Multi-Timepoint Arterial Spin Labeling: Acquisition, Quantification, and Clinical Applications (preprint). Open Science Framework. doi:10.31219/osf.io/4tskr

      Woods JG, Chappell MA, Okell TW. 2020. Designing and comparing optimized pseudo-continuous Arterial Spin Labeling protocols for measurement of cerebral blood flow. NeuroImage 223:117246. doi:10.1016/j.neuroimage.2020.117246

    1. eLife Assessment

      This valuable study uses state-of-the-art neural encoding and video reconstruction methods to achieve a substantial improvement in video reconstruction quality from mouse neural data. It provides a convincing demonstration of how reconstruction performance can be improved by combining these methods. The goal of the study was improving reconstruction performance rather than advancing theoretical understanding of neural processing, so the results will be of practical interest to the brain decoding community.

    2. Reviewer #2 (Public review):

      Summary:

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex.

      Strengths:

      This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read.

      Weaknesses:

      The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may create some disappointment to some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noise-less artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature that show that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight.

      Specific issues:

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own, if it clarified how its findings depended on this model.

      The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements.

      (2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study?

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset.

      (4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in the review in Supplementary Figure 3C. Also, one appreciates that as a follow-up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstructions are likely to fail. But the maps went unanalyzed further, and I am not sure if there was a systematic trend in the errors.

      (5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion.

    3. Reviewer #3 (Public review):

      Summary:

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration.

      Strengths:

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments.

      Weaknesses:

      The main contribution is methodological, and the methodology combines pre-existing components without any new original component.

    4. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews: 

      Reviewer #2 (Public review): 

      Summary: 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex. 

      Strengths: 

      This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read. 

      Weaknesses: 

      The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may create some disappointment to some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noise-less artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature that show that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight. 

      We thank the reviewer for this second round of comments and hope we were able to address the remaining points below. 

      Indeed, using surrogate noiseless data is interesting and useful when developing such methods, or to demonstrate that they work in principle. But in order to evaluate if they really work in practice, we need to use real neuronal data. While we did not try movie reconstruction from layers within artificial neural networks as surrogate data, in Supplementary Figure 3C we provide the performance of our method using simulated/predicted neuronal responses from the dynamic neural encoding model alongside real neuronal responses.

      Specific issues: 

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own, if it clarified how its findings depended on this model. 

      The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements. 

      We appreciate that the additional information about the performance of the SOTA DNEM to predict neural responses could be made more visible in the paper and will therefore move it from the methods to the results section instead: 

      Line 348 “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” will be moved to the results.

      With regard to the lack of context for the performance of our reconstruction in the abstract, we may have overcorrected in the previous revision round and have tried to find a compromise which gives more context to the pixel-level correlation value: 

      Abstract: “We achieve a pixel-level correlation of 0.57 (95% CI [0.54, 0.60]) between ground-truth movies and single-trial reconstructions. Previous reconstructions based on awake mouse V1 neuronal responses to static images achieved a pixel-level correlation of 0.238 over a similar retinotopic area.”
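
      For readers unfamiliar with the metric, a pixel-level correlation can be sketched as the Pearson correlation between ground-truth and reconstructed pixel values inside an evaluation mask. Whether the published value is computed per frame, per movie, or pooled across movies is not stated here, so the pooling below, the function name `pixel_correlation`, and the array sizes are assumptions for illustration.

```python
import numpy as np

def pixel_correlation(truth, recon, mask=None):
    """Pearson correlation between two movies, pooled over frames and pixels.

    truth, recon: arrays of shape (n_frames, height, width).
    mask:         optional boolean array of shape (height, width); only pixels
                  where mask is True contribute (hypothetical evaluation mask).
    """
    truth = np.asarray(truth, dtype=float)
    recon = np.asarray(recon, dtype=float)
    if mask is None:
        mask = np.ones(truth.shape[1:], dtype=bool)
    return np.corrcoef(truth[:, mask].ravel(), recon[:, mask].ravel())[0, 1]

# Synthetic example with arbitrary frame size and noise level.
rng = np.random.default_rng(2)
movie = rng.random((30, 36, 64))
recon = movie + rng.normal(0, 0.3, movie.shape)
mask = np.zeros((36, 64), dtype=bool)
mask[8:28, 16:48] = True

score = pixel_correlation(movie, recon, mask)
```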

      (2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study? 

      As mentioned in our previous round of revisions, we chose not to pursue the comparison of reconstructions using different model architectures in this manuscript because we did not think it would add significant insights to the paper given the amount of work it would require, and we are glad the reviewer agrees. 

      While the fact that more neurons result in better reconstructions is unsurprising, how quickly performance drops off will depend on the robustness of the method, and on the dimensionality of the decoding/reconstruction task (decoding grating orientation likely requires fewer neurons than gray scale image reconstruction, which in turn likely requires fewer neurons than full color movie reconstruction). How dependent input optimization based image/movie reconstruction is on population size has not been shown, so we felt it was useful for readers to know how well movie reconstruction works with our method when recording from smaller numbers of neurons. 

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset. 

      We apologize that we did not engage with this comment enough in the previous round. We assumed that the question arose because there was a misunderstanding about Figure 5: 1000 neurons, not 1 neuron, are sufficient to reconstruct the movies to a pixel-level correlation of 0.344. Of course, the fact that increasing the number of neurons from 1000 to 8000 only increased the reconstruction performance from 0.344 to 0.569 (a 65% increase in correlation) is still worth discussing. To illustrate this drop in performance qualitatively, we show 3 example frames from movie reconstructions using 1000-8000 neurons in Author response image 1.

      Author response image 1.

      3 example frames from reconstructions using different numbers of neurons. 

      As the reviewer points out, the diminishing returns of additional neurons to reconstruction performance are at least partly due to redundancy in how a population of neurons represents visual stimuli. In Supplementary Figure S2, we inferred the on-off receptive fields of the neurons and show in panel C that visual space is oversampled in terms of receptive field positions. However, the exact slope/shape of the performance vs population size curve we show in Figure 5 will also depend on the maximum performance of our reconstruction method, which is limited in spatial resolution (Figure 4 & Supplementary Figure S5). It is possible that future reconstruction approaches will require fewer neurons than ours, so we interpret this curve as a description of the reconstruction method itself rather than a feature of the underlying neuronal code. For that reason, we chose caution and refrained from making any claims about neuronal coding principles based on this plot. 

      (4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in the review in Supplementary Figure 3C. Also, one appreciates that as a follow-up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstructions are likely to fail. But the maps went unanalyzed further, and I am not sure if there was a systematic trend in the errors. 

      We are happy to hear that we were able to answer the reviewer’s question of what the maximum theoretical performance of our reconstruction process is in Supplementary Figure 3C. Regarding the error maps, we also did not observe any clear systematic trends. If anything, we noticed that some moving edges were shifted, but we do not think we can quantify this effect with this particular dataset. 

      (5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion. 

      Thank you for pointing this out; this is indeed true. The reconstructions do have high-frequency noise. We mention this briefly in line 102: “Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise (Figure S3) and applied the evaluation mask.” In revisiting this sentence, we think it is more appropriate to replace “remove” with “reduce”. This noise is more visible in the Gaussian noise stimuli (Figure 4) because we did not apply the 3D Gaussian filter to these reconstructions, in case it interfered with the estimates of the reconstruction resolution limits.

      Given that the Gaussian noise and drifting grating stimuli reconstructions were from predicted activity (“noise-free”), this high-frequency noise is not biological in origin and must therefore come from errors in our reconstruction process. This kind of high-frequency noise has previously been observed in feature visualization (optimizing input to maximize the activity of a specific node within a neural network to visualize what that node encodes; Olah, et al., "Feature Visualization", https://distill.pub/2017/feature-visualization/, 2017). It is caused by a kind of overfitting, whereby a solution to the optimization is found that is not “realistic”. Ways of combating this kind of noise include gradient smoothing, image smoothing, and image transformations during optimization, but these methods can restrict the resolution of the features that are recovered. Since we were more interested in determining the maximum resolution of stimuli that can be reconstructed in Figure 4 and Supplementary Figures 5-6, we chose not to apply these methods.

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original component. 

      We thank the reviewer for their balanced assessment of our manuscript.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This paper presents a method for reconstructing videos from mouse visual cortex neuronal activity using a state-of-the-art dynamic neural encoding model. The authors achieve high-quality reconstructions of 10-second movies at 30 Hz from two-photon calcium imaging data, reporting a 2-fold increase in pixel-by-pixel correlation compared to previous methods. They identify key factors for successful reconstruction including the number of recorded neurons and model ensembling techniques. 

      Strengths: 

      (1) A comprehensive technical approach combining state-of-the-art neural encoding models with gradient-based optimization for video reconstruction. 

      (2) Thorough evaluation of reconstruction quality across different spatial and temporal frequencies using both natural videos and synthetic stimuli. 

      (3) Detailed analysis of factors affecting reconstruction quality, including population size and model ensembling effects. 

      (4) Clear methodology presentation with well-documented algorithms and reproducible code. 

      (5) Potential applications for investigating visual processing phenomena like predictive coding and perceptual learning. 

      We thank the reviewer for taking the time to provide this valuable feedback. We would like to add that in our eyes one additional main contribution is the step of going from reconstruction of static images to dynamic videos. We trust that in the revised manuscript, we have now made the point more explicit that static image reconstruction relies on temporally averaged responses, which negates the necessity of having to account for temporal dynamics altogether. 

      Weaknesses: 

      The main metric of success (pixel correlation) may not be the most meaningful measure of reconstruction quality: 

      High correlation may not capture perceptually relevant features.

      Different stimuli producing similar neural responses could have low pixel correlations.

      The paper doesn't fully justify why high pixel correlation is a valuable goal.

      This is a very relevant point. In retrospect, perhaps we did not justify this enough. Sensory reconstruction typically aims to reconstruct sensory input based on brain activity as faithfully as possible. A brain-to-image decoder might therefore be trained to produce images as close to the original input as possible. The loss function to train the decoder would therefore be image similarity on the pixel level. In that case, evaluating reconstruction performance based on pixel correlation is somewhat circular. 

      However, when reconstructing videos, we optimize the input video in terms of its perceptual similarity to the original video and only then evaluate pixel-level similarity. The perceptual similarity metric we optimize for is the estimate of how the neurons in mouse V1 respond to that video. We then evaluate the similarity of this perceptually optimized video to the original input video with pixel-level correlation. In other words, we optimize for perceptual similarity and then evaluate pixel similarity. If our method optimized pixel-level similarity, then we would agree that perceptual similarity is a more relevant evaluation metric. We do not think it was clear in our original submission that our optimization loss function is a perceptual loss function, and have now made this clearer in Figure 1C-D and have clarified this in the results section, line 70:

      “In effect, we optimized the input video to be perceptually similar with respect to the recorded neurons.”

      And in line 110: 

      “Because our optimization of the movies was based on a perceptual loss function, we were interested in how closely these movies matched the originals on the pixel level.”
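
      The distinction drawn above (optimize in neural-response space, evaluate in pixel space) can be sketched with a toy example. This is only a minimal illustration under loud assumptions: the "encoding model" here is a random linear map rather than the DNEM, the loss is a plain squared error between predicted and recorded activity (the actual loss function is not specified in this response), and all names (`predict`, `reconstruct`, `weights`) are hypothetical.

```python
import random

def predict(weights, video):
    """Toy linear 'encoding model' (assumption): one predicted response per neuron."""
    return [sum(w * p for w, p in zip(row, video)) for row in weights]

def reconstruct(weights, recorded, n_pixels, steps=200, lr=0.5):
    """Gradient descent on the input so that predicted activity matches recorded activity."""
    video = [0.0] * n_pixels  # start from a blank stimulus
    for _ in range(steps):
        errors = [p - r for p, r in zip(predict(weights, video), recorded)]
        # Gradient of 0.5 * sum(errors**2) with respect to each pixel.
        grads = [sum(e * row[j] for e, row in zip(errors, weights)) for j in range(n_pixels)]
        video = [v - lr * g for v, g in zip(video, grads)]
    return video

rng = random.Random(0)
n_neurons, n_pixels = 40, 8
weights = [[rng.gauss(0, 1) / n_neurons ** 0.5 for _ in range(n_pixels)]
           for _ in range(n_neurons)]
true_video = [rng.uniform(-1, 1) for _ in range(n_pixels)]
recorded = predict(weights, true_video)  # noise-free "recorded" activity

# The pixels are never compared during optimization; only neural responses are.
reconstruction = reconstruct(weights, recorded, n_pixels)
max_error = max(abs(a - b) for a, b in zip(reconstruction, true_video))
```

      In this toy setting the pixel-level error is only computed after the optimization has finished, which mirrors the argument that pixel correlation is then an independent evaluation rather than a circular one.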

      We chose to use pixel correlation to measure pixel-level similarity for several reasons: 1) it has been used in the past to evaluate reconstruction performance (Yoshida et al., 2020); 2) it is insensitive to contrast and luminance; and 3) correlation is a common metric, so most readers will have an intuitive understanding of how it relates to the data.
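
      The contrast and luminance insensitivity mentioned above follows from the definition of Pearson correlation, which is unchanged by positive rescaling and offsets. A minimal sketch, assuming frames are flattened into lists of pixel values (the helper name `pixel_correlation` and the example values are illustrative, not taken from the paper's code):

```python
import math

def pixel_correlation(a, b):
    """Pearson correlation between two equally sized, flattened frames."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

original = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2]
# Same frame with halved contrast and raised luminance.
rescaled = [0.5 * p + 0.3 for p in original]
# Contrast-inverted frame.
inverted = [1.0 - p for p in original]

r_rescaled = pixel_correlation(original, rescaled)  # approximately +1
r_inverted = pixel_correlation(original, inverted)  # approximately -1
```

      The same property means the metric does not penalize a reconstruction for getting the overall brightness or contrast wrong, but it does penalize an inverted image with a correlation near -1.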

      To further highlight why pixel similarity might be interesting to visualize, we have included additional analysis in Figure 6 illustrating pixel-level differences between reconstructions from experimentally recorded activity and predicted activity. 

      We expect that the type of perceptual similarity the reviewer is alluding to is pretrained neural network image embedding similarity (Zhang et al., 2018: https://doi.org/10.48550/arXiv.1801.03924). While these metrics seem to match human perceptual similarity, it is unclear if they reflect mouse vision. We did try to compare the embedding similarity from pretrained networks such as VGG16, but got results suggesting the reconstructed frames were no more similar to the ground truth than random frames, which is obviously not true. This might be because the ground truth videos were too different in resolution from the training data of these networks and because these metrics are typically very sensitive to decreases in resolution. 

      The best alternative approach to evaluate mouse perceptual similarity would be to show the reconstructed videos to the same animals while recording the same neurons and to compare these neural activation patterns to those evoked by the original ground truth videos. This has been done for static images in the past: Cobos et al., bioRxiv 2022, found that static image reconstructions generated using gradient descent evoked more similar trial-averaged (40 trials) responses to those evoked by ground truth images compared to other reconstruction methods. Unfortunately, we are currently not able to perform these in vivo experiments, which is why we used publicly available data for the current paper. We plan to use this method in the future. But this method is also not flawless as it assumes that the average response to an image is the best reflection of how that image is represented, which may not be the case for an individual trial.

      As far as we are aware, there is currently no method that, given a particular activity pattern in response to an image/video, can produce an image/video that induces a neural activity pattern that is closer to the original neural response than simply showing the same image/video again. Hypothetically, such a stimulus exists because of various visual processing phenomena we mention in our discussion (e.g., predictive coding and selective attention), which suggest that the image that is represented by a population of neurons likely differs from the original sensory input. In other words, what the brain represents is an interpretation of reality not a pure reflection. Experimentally verifying this is difficult, as these variations might be present on a single trial level. The first step towards establishing a method that captures the visual representation of a population of neurons is sensory reconstruction, where the aim is to get as close as possible to the original sensory input. We think pixel-level correlation is a stringent and interpretable metric for this purpose, particularly when optimizing for perceptual similarity rather than image similarity directly.

      Comparison to previous work (Yoshida et al.) has methodological concerns: Direct comparison of correlation values across different datasets may be misleading; Large differences in the number of recorded neurons (10x more in the current study); Different stimulus types (dynamic vs static) make comparison difficult; No implementation of previous methods on the current dataset or vice versa. 

      Yes, we absolutely agree that direct comparison to previous static image reconstruction methods is problematic. We primarily do so because we think it is standard practice to give related baselines. We agree that direct comparison of the performance of video reconstruction methods to image reconstruction methods is not really possible. It does not make sense to train and apply a dynamic model on a static image data set where neural activity is time-averaged, as the temporal kernels could not be learned. Conversely, for a static model, which expects a single image as input and predicts time-averaged responses, it does not make sense to feed it a series of temporally correlated movie frames and to simply concatenate the resulting activity predictions. The static model would need to be substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have now added these caveats in line 119:

      “However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      We have also toned down the language emphasising the comparison to previous image reconstruction performance in the abstract, results, and conclusion.

      Abstract: We removed “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” and replaced it with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Discussion: we removed “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” and replaced with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      Limited exploration of how the reconstruction method could provide insights into neural coding principles beyond demonstrating technical capability. 

      The aim of this paper was not to reveal principles of neural coding. Instead, we aimed to achieve the best possible performance of video reconstructions and to quantify the limitations. But to highlight its potential we have added two examples of how sensory reconstruction has been applied in human vision research in line 321: 

      “Although fMRI-based reconstruction techniques are starting to be used to investigate visual phenomena in humans (such as illusions [Cheng et al., 2023] and mental imagery [Shen et al., 2019; Koide-Majima et al., 2024; Kalantari et al., 2025]), visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data.”

      We have also added a demonstration of how this method could be used to investigate which parts of a reconstruction from a single-trial response differ from the model's prediction (Figure 6). We do this by calculating pixel-level differences between reconstructions from the recorded neural activity and reconstructions from the expected neural activity (the activity predicted by the neural encoding model). Although difficult to interpret, this pixel-by-pixel error map could represent trial-by-trial deviations of the neural code from pure sensory representation. But at this point we cannot know whether these errors are nothing more than errors in the reconstruction process. Deriving meaningful interpretations of these maps would require a substantial amount of additional work and in vivo experiments and so is outside the scope of this paper, but we include this additional analysis now a) to highlight why pixel-level similarity might be interesting to quantify and visualize and b) to demonstrate how video reconstruction could be used to provide insights into neural coding, namely as a tool to identify how sensory representations differ from a pure reflection of the visual input.

      The claim that "stimulus reconstruction promises a more generalizable approach" (line 180) is not well supported with concrete examples or evidence. 

      What we mean by generalizable is the ability to apply reconstruction to novel stimuli, which is not possible for stimulus classification. We now explain this better in the paragraph in line 211: 

      “Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al.,2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie [Schneider et al., 2023] based on the decoded frame identity. Importantly, stimulus identification methods are distinct from stimulus reconstruction where the aim is to recreate what the sensory content of a neuronal code is in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction could provide a more generalizable approach, because it can be applied to novel stimuli.”

      None of the stimuli we reconstructed were in the training set of the model, i.e., all were novel. We have also toned down the claim: we have replaced “promises” with “could provide”.

      The paper would benefit from addressing how the method handles cases where different stimuli produce similar neural responses, particularly for high-speed moving stimuli where phase differences might be lost in calcium imaging temporal resolution. 

      Thank you for this suggestion; we think this is a great question. Calcium dynamics are slow and some of the high temporal frequency information could indeed be lost, particularly phase information. In other words, when the stimulus has high temporal frequency information, it is harder to decode spatial information because of the slow calcium dynamics. Ideally, we would look at this effect using the drifting grating stimuli; however, this is problematic because we rely on predicted activity from the SOTA DNEM, and due to the dilation of the first convolution, the periodic grating stimulus causes aliasing. At 15 Hz, when the temporal frequency of the stimulus is half the movie frame rate, the model is actually being given two static images, and so the predicted activity is the interleaved activity evoked by two static images. We therefore do not think using the grating stimuli is a good idea. Instead, we have used the Gaussian noise stimuli, which are not periodic and are therefore less affected by this problem.
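
      The 15 Hz case described above can be checked with a small sampling exercise. The sketch below assumes a one-dimensional sinusoidal drifting grating sampled at the 30 Hz movie frame rate; it only illustrates the frame-rate sampling effect and does not model the dilated first convolution of the DNEM. The function names and the 8-pixel row are hypothetical.

```python
import math

FRAME_RATE_HZ = 30.0  # movie frame rate stated in the manuscript summary

def grating_frame(t, temporal_hz, n_pixels=8, cycles_per_row=1.0):
    """One row of a drifting sinusoidal grating at time t (seconds), rounded for comparison."""
    return tuple(
        round(math.sin(2 * math.pi * (cycles_per_row * x / n_pixels - temporal_hz * t)), 6)
        for x in range(n_pixels)
    )

def distinct_frames(temporal_hz, n_frames=30):
    """Number of distinct frames produced when sampling the grating at the movie frame rate."""
    return len({grating_frame(i / FRAME_RATE_HZ, temporal_hz) for i in range(n_frames)})

frames_at_15hz = distinct_frames(15.0)  # alternates between two phase-inverted images
frames_at_3hz = distinct_frames(3.0)    # ten distinct phases per temporal cycle
```

      At 15 Hz the sampled movie contains only two images that are contrast-inverted copies of each other, so the direction of drift is no longer recoverable from the frames themselves, whereas a 3 Hz grating still yields ten distinct phases per cycle.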

      We have now also reconstructed phase-inverted Gaussian noise stimuli and plotted the video correlation between the reconstructions from activity evoked by phase-inverted stimuli. On the one hand, we find that even for the fastest-changing stimuli, the correlation between the reconstructions from phase-inverted stimuli is negative, meaning phase information is not lost at high temporal frequencies. On the other hand, for the highest spatial frequency stimuli, the correlation is positive. So, the predicted neural activity (and therefore the reconstructions) is phase-insensitive when the spatial frequency is higher than the reconstruction resolution limit we identified (spatial length constant of 1 pixel, or 3.38 degrees). Beyond this limit, the DNEM predicts activity in response to phase-inverted stimuli which, when used for reconstruction, results in movies that are more similar to each other than to the stimuli that actually evoked them.

      However, not all information is lost at these high spatial frequencies. If we plot the Shannon entropy in the spatial domain or the motion energy in the temporal domain, we find that even when the reconstructions fail to capture the stimulus at a pixel-specific level (spatial length constant of 1 pixel, or 3.38 degrees), they do capture the general spatial and temporal qualities of the videos. 

      We have added these additional analyses to Figure 4 and Supplementary Figure 5.

      Reviewer #2 (Public review): 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of the mouse visual cortex. 

      This is a great project - the physiological data were measured at a single-cell resolution, the movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. Overall, it is great that teams are working towards exploring image reconstruction. Arguably, reconstruction may serve as an endgame method for examining the information content within neuronal ensembles - an alternative to training interminable numbers of supervised classifiers, as has been done in other studies. Put differently, if a reconstruction recovers a lot of visual features (maybe most of them), then it tells us a lot about what the visual brain is trying to do: to keep as much information as possible about the natural world in which its internal motor circuits may act consequently. 

      While we enjoyed reading the manuscript, we admit that the overall advance was in the range of those that one finds in a great machine learning conference proceedings paper. More specifically, we found no major technical flaws in the study, only a few potential major confounds (which should be addressable with new analyses), and the manuscript did not make claims that were not supported by its findings, yet the specific conceptual advance and significance seemed modest. Below, we will go through some of the claims, and ask about their potential significance. 

      We thank the reviewer for the positive feedback on our paper.

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I am left with the question: okay, does this mean that we should all switch to DNEM for our investigations of the mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301... single-trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best achievable score, in theory, given data noise? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model.

      This is a very good point. We do not think that everyone should switch to using this particular DNEM to investigate the mouse visual cortex, but we think DNEMs and stimulus reconstruction in general have a lot of potential. We think static neural encoding models have already been demonstrated to be an extremely valuable tool to investigate visual coding (Walker et al., 2019; Yoshida et al., 2021; Willeke et al., bioRxiv 2023). DNEMs are less common, largely because they are very large and are technically more demanding to train and use. That makes static encoding models more practical for some applications, but they do not have temporal kernels and are therefore only used for static stimuli. They cannot, for instance, encode direction tuning, only orientation tuning. But both static and dynamic encoding models have advantages over stimulus classification methods, which we outline in our discussion. Here we provide the first demonstration that previous achievements in static image reconstruction are transferable to movies.

      It has been shown in the past for static neural encoding models that choosing a better-performing model produces reconstructed static images that are closer to the original image (Pierzchlewicz et al., 2023). The factors in choosing this particular DNEM were its capacity to predict neural activity (benchmarked against other models), the fact that it was open source, and the availability of the data it was designed for.

      To give more context to the model used in the paper, we have included the following, line 348:

      “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” 

      Concerning biologically inspired model design: the winning model contained 3 fully connected layers comprising the “Cortex” just before the final readout of neural activity, but we would consider this level of biological inspiration as minor. We do not think that the exact architecture of the model is particularly important, as the crucial aspect of such neural encoders is their ability to predict neural activity irrespective of how they achieve it. There has been a move towards creating foundation models of the brain (Wang et al., 2025) and the priority so far has been on predictive performance over mechanistic interpretability or similarity to biological structures and processes.

      Finally, we would like to note that we do not know what the maximum theoretical score for single-trial responses might be, and don't think there is a good way of estimating it in this context. 

      (2) Along those lines, two major conclusions were that "critical for high-quality reconstructions are the number of neurons in the dataset and the use of model ensembling." If true, then these principles should be applicable to networks with different architectures. How well can they do with other network types? 

      This is a good question. Our method critically relies on the accurate prediction of neural activity in response to new videos. It is therefore expected that a model that better predicts neural responses to stimuli will also be better at reconstructing those stimuli given population activity. This was previously shown for static images (Pierzchlewicz et al., 2023). It is also expected that whenever the neural activity is accurately predicted, the corresponding reconstructed frames will also be more similar to the ground truth frames. We have now demonstrated this relationship between prediction accuracy and reconstruction accuracy in supplementary figure 4.

      Although it would be interesting to compare the movie reconstruction performance of many different models with different architectures and activity prediction performances, this would involve quite substantial additional work because movie reconstruction is very resource- and time-intensive. Finding optimal hyperparameters to make such a comparison fair and informative would therefore be impractical and likely not yield surprising results. 

      We also think it is unlikely that ensembling would not improve reconstruction performance in other models because ensembling across model predictions is a common way of improving single-model performance in machine learning. Likewise, we think it is unlikely that the relationship between neural population size and reconstruction performance would differ substantially when using different models, because using more neurons means that a larger population of noisy neurons is “voting” on what the stimulus is. However, we would expect that if the model were worse at predicting neural activity, then more neurons are needed for an equivalent reconstruction performance. In general, we would recommend choosing the best possible DNEM available, in terms of neural activity prediction performance, when reconstructing movies using input optimization through gradient descent. 
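
      The "voting" intuition above can be illustrated with a deliberately simplified simulation. This sketch assumes each neuron gives an independent, equally noisy readout of a single pixel value and that decoding is a plain average; real neurons are correlated and redundant, so this is only an idealized picture of why returns diminish with population size, not a model of the actual reconstruction pipeline. All names and parameter values are hypothetical.

```python
import random
import statistics

def decoding_error(n_neurons, n_trials=2000, noise_sd=1.0, seed=0):
    """RMS error when averaging n_neurons independent noisy readouts of one pixel value."""
    rng = random.Random(seed)
    true_value = 0.5
    squared_errors = []
    for _ in range(n_trials):
        readouts = [true_value + rng.gauss(0, noise_sd) for _ in range(n_neurons)]
        estimate = statistics.fmean(readouts)  # the population "vote"
        squared_errors.append((estimate - true_value) ** 2)
    return statistics.fmean(squared_errors) ** 0.5

# Each 4-fold increase in population size roughly halves the error.
errors = {n: decoding_error(n) for n in (1, 4, 16, 64)}
```

      Under these assumptions the error shrinks roughly with the square root of the number of neurons, so each additional neuron contributes less than the previous one even before any redundancy between receptive fields is taken into account.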

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1 neuron and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that ~7,999 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields were too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? 

      In the population ablation experiments, we compared the performance using ~1000, ~2000, ~4000, and ~8000 neurons, and found an attenuation of 39.5% in video correlation when dropping 87.5% of the neurons (~1000 neurons remaining). We did not try reconstruction using just 1 neuron.

      (4) On a related note, the authors address the confound of RF location and extent. The study resorted to the use of a mask on the image during reconstruction, applied during training and evaluation (Line 87). The mask depends on pixels that contribute to the accurate prediction of neuronal activity. The problem for me is that it reads as if the RF/mask estimate was obtained during the very same process of reconstruction optimization, which could be considered a form of double-dipping (see the "Dead salmon" article, https://doi.org/10.1016/S1053-8119(09)71202-9). This could inflate the reconstruction estimate. My concern would be ameliorated if the mask was obtained using a held-out set of movies or image presentations; further, the mask should shift with eye position, if it indeed corresponded to the "collective receptive field of the neural population." Ideally, the team would also provide the characteristics of these putative RFs, such as their weight and spatial distribution, and whether they matched the biological receptive fields of the neurons (if measured independently). 

      We can reassure the reviewer that there is no double-dipping. We would like to clarify that the mask was trained only on videos from the training set of the DNEM and not the videos which were reconstructed. We have added the sentence, line 91: 

      “None of the reconstructed movies were used in the optimization of this transparency mask.”

      Making the mask dependent on eye position would be difficult to implement with the current DNEM, where eye position is fed to the model as an additional channel. When using a model where the image is first transformed into retinotopic coordinates in an eye position-dependent manner (such as in Wang et al., 2025) the mask could be applied in retinotopic coordinates and therefore be dependent on eye position. 

      Effectively, the alpha mask defines the relative level of influence each pixel contributes to neural activity prediction. We agree it is useful to compare the shape of the alpha mask with the location of traditional on-off receptive fields (RFs) to clarify what the alpha mask represents and characterise the neural population available for our reconstructions. We therefore presented the DNEM with on-off patches to map the receptive fields of single neurons in an in silico experiment (the experimentally derived RFs are not available). As expected, there is a rough overlap between the alpha mask (Supplementary Figure 2D), the average population receptive field (Supplementary Figure 2B), and the location of receptive field peaks (Supplementary Figure 2C). In principle, all three could be used during training or evaluation for masking, but we think that defining a mask based on the general influence of images on neural activity, rather than just on on-off patch responses, is a more elegant solution.

      One idea of how to go a step further would be to first set the alpha mask threshold during training based on the % loss of neural activity prediction performance that threshold induces (in our case alpha=0.5 corresponds to ~3% loss in correlation between predicted vs recorded neural responses, see Supplementary Figure 3D), and second base the evaluation mask on a pixel correlation threshold (see example pixel correlation map in Supplementary Figure 2E) instead to avoid evaluating areas of the image with low image reconstruction confidence. 

      We referred to this figure in the result section, line 83:

      “The transparency masks are aligned with but not identical to the On-Off receptive field distribution maps using sparse-noise (Figure S2).” 

      We have also done additional analysis on the effect of masking during training and evaluation with different thresholds in Supplementary Figure 3.

      (5) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this further raised questions: what is the theoretical capability for the reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? 

      That’s a very interesting point. It is very hard to know what the theoretical best reconstruction performance of the model would be. Reconstruction performance could be decreased due to neural variability, experimental noise, the temporal kernel of the calcium indicator and the imaging frame rate, information compression along the visual hierarchy, visual processing phenomena (such as predictive coding and selective attention), failure of the model to predict neural activity correctly, or failure of the reconstruction process to find the best possible image which explains the neural activity. We don't think we can disentangle the contribution of all these sources, but we can provide a theoretical maximum assuming that the model and the reconstruction process are optimal. To that end, we performed additional simulations and reconstructed the natural videos using the predicted activity of the neurons in response to the natural videos as the target (similar to the synthetic stimuli) and got a correlation of 0.766. So, the single trial performance of 0.569 is ~75% of this theoretical maximum. This difference can be interpreted as a combination of the losses due to neuronal variability, measurement noise, and actual deviations in the images represented by the brain compared to reality. 

      We thank the reviewer for this suggestion, as it gave us the idea of looking at error maps (Figure 6), where the pixel-level deviation of the reconstructions from recorded vs predicted activity is overlaid on the ground truth movie.

      (6) As the authors mentioned, this reconstruction method provided a more accurate way to investigate how neurons process visual information. However, this method consisted of two parts: one was the state-of-the-art (SOTA) dynamic neural encoding model (DNEM), which predicts neuronal activity from the input video, and the other part reconstructed the video to produce a response similar to the predicted neuronal activity. Therefore, the reconstructed video was related to neuronal activity through an intermediate model (i.e., SOTA DNEM). If one observes a failure in reconstructing certain visual features of the video (for example, high-spatial frequency details), the reader does not know whether this failure was due to a lack of information in the neural code itself or a failure of the neuronal model to capture this information from the neural code (assuming a perfect reconstruction process). Could the authors address this by outlining the limitations of the SOTA DNEM encoding model and disentangling failures in the reconstruction from failures in the encoding model? 

      To test if a better neural prediction by the DNEM would result in better reconstructions, we ran additional simulations and now show that neural activity prediction performance correlates with reconstruction performance (Supplementary Figure 4B). This is consistent with Pierzchlewicz et al. (2023), who showed that static image reconstruction using better encoding models leads to better reconstruction performance. As also mentioned in the answer to the previous comment, untangling the relative contributions of reconstruction losses is hard, but we think that improvements to the DNEM performance are key. Suggestions for improving the DNEM we used would be to translate the input image into retinotopic coordinates and shift this image relative to eye position before passing it to the first convolutional layer (as is done in Wang et al., 2025), to use movies which are not spatially downsampled as heavily, to not use a dilation of 2 in the temporal convolution of the first layer, and to train on a larger dataset.

      (7) The authors mentioned that a key factor in achieving high-quality reconstructions was model assembling. However, this averaging acts as a form of smoothing, which reduces the reconstruction's acuity and may limit the high-frequency content of the videos (as mentioned in the manuscript). This averaging constrains the tool's capacity to assess how visual neurons process the low-frequency content of visual input. Perhaps the authors could elaborate on potential approaches to address this limitation, given the critical importance of high-frequency visual features for our visual perception. 

      This is exactly what we also thought. To answer this point more specifically, we ran additional simulations where we also reconstruct the movies using gradient ensembling instead of reconstruction ensembling. Here, the gradients of the loss with respect to each pixel of the movie are calculated for each of the model instances and are averaged at every iteration of the reconstruction optimization. In essence, this means that one reconstruction solution is found, and the averaging across reconstructions, which could degrade high-frequency content, is skipped. The reconstructions from both methods look very similar, and the video correlation is, if anything, slightly worse (Supplementary Figure 3A&C). This indicates that our original ensembling approach did not limit reconstruction performance, but that both approaches can be used, depending on what is more convenient given hardware restrictions.
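
      The difference between the two ensembling schemes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline used in the paper: each model instance is replaced by a hypothetical linear encoder `W`, the loss is a squared error rather than the Poisson loss, and all function names, sizes, and hyperparameters are invented for the example.

```python
import numpy as np


def grad(W, x, r):
    """Gradient of 0.5 * ||W @ x - r||^2 with respect to the stimulus x."""
    return W.T @ (W @ x - r)


def reconstruction_ensembling(models, r, n_pixels, lr=0.1, n_iter=500):
    """Optimise one stimulus per model instance, then average the stimuli."""
    recons = []
    for W in models:
        x = np.zeros(n_pixels)
        for _ in range(n_iter):
            x -= lr * grad(W, x, r)
        recons.append(x)
    return np.mean(recons, axis=0)


def gradient_ensembling(models, r, n_pixels, lr=0.1, n_iter=500):
    """Optimise a single stimulus with the gradient averaged over instances."""
    x = np.zeros(n_pixels)
    for _ in range(n_iter):
        x -= lr * np.mean([grad(W, x, r) for W in models], axis=0)
    return x


# Toy data (assumption): 7 slightly different linear "encoders" of one stimulus.
rng = np.random.default_rng(0)
n_neurons, n_pixels = 40, 16
W_true = rng.standard_normal((n_neurons, n_pixels)) / np.sqrt(n_neurons)
models = [
    W_true + 0.05 * rng.standard_normal(W_true.shape) / np.sqrt(n_neurons)
    for _ in range(7)
]
x_true = rng.standard_normal(n_pixels)
r = W_true @ x_true  # target responses, shared by all model instances

x_rec = reconstruction_ensembling(models, r, n_pixels)
x_grad = gradient_ensembling(models, r, n_pixels)
print(np.corrcoef(x_rec, x_true)[0, 1], np.corrcoef(x_grad, x_true)[0, 1])
```

      In this sketch, reconstruction ensembling runs one optimization per instance and can be parallelized across devices, whereas gradient ensembling keeps a single stimulus but needs every instance available at each step, which is the kind of hardware trade-off mentioned above.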

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and the number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original components. 

      We thank the reviewer for taking the time to review our paper and for their overall positive assessment. We would like to emphasise that combining pre-existing machine learning techniques to achieve top results in a new modality does require iteration and innovation. While gradient-based input optimization by backpropagating the brain-encoding error through a neural encoding model has been used in 2D static image optimization to generate maximally exciting images and reconstruct static images, we are the first to have applied it to movies which required accounting for the time domain. Previous methods used time averaged responses and were limited to the reconstruction of static images presented with fixed image intervals.

      The movie reconstructions include a learned "transparency mask" to concentrate on the most informative area of the frame; it is not clear how this choice impacts the comparison with prior experiments. Did they all employ this same strategy? If not, shouldn't the quantitative results also be reported without masking, for a fair comparison? 

      Yes, absolutely. All reconstruction approaches limit the field of view in some way, whether this is due to the size of the screen, the size of the image on the screen, or cropping of the presented/reconstructed images during analysis due to the retinotopic coverage of the recorded neurons. Note that we reconstruct a larger field of view than Yoshida et al. In Yoshida et al., the reconstructed field of view was 43 by 43 retinal degrees. We show the size of an example evaluation mask in comparison.

      To address the reviewer’s concern more specifically, we performed additional simulations and now also show the performance using a variety of different training and evaluation masks, including different alpha thresholds for training and evaluation masks as well as the effective retinotopic coverage at different alpha thresholds. Despite these comparisons, we would also like to highlight that the comparison to the benchmark is problematic itself. This is because image and movie reconstruction are not directly comparable. It does not make sense to train and apply a dynamic model on a static image dataset where neural activity is time averaged. Conversely, it does not make sense to train or apply a static model that expects time-averaged neural responses on continuous neural activity unless it is substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have therefore de-emphasised the phrasing comparing our method to previous publications in the abstract, results, and discussion. 

      Abstract: replaced “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Results: “This represents a ~2x higher pixel-level correlation over previous single-trial static image reconstructions from V1 in awake mice (image correlation 0.238 +/- 0.054 s.e.m for awake mice) [Yoshida et al., 2020] over a similar retinotopic area (~43° x 43°) while also capturing temporal dynamics. However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      Discussion: replaced “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      We believe that we have given enough information in our paper now so that readers can make an informed decision whether our movie reconstruction method is appropriate for the questions they are interested in.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) "Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth." This was not clear: was it done by the investigating team? I imagine that one of the most easily captured visual features is luminance and contrast, why wouldn't the optimization titrate these well? 

      The contrast and luminance matching of the reconstructions to the ground truth videos was done by us, but this was only done to help readers assess the quality of the reconstructions by eye. Our performance metrics (frame and video correlation) are contrast and luminance insensitive. To clarify this, we have also added examples of non-adjusted frames in Supplementary Figure 3A, and added a sentence in the results, line 103: 

      “When presenting videos in this paper we normalize the mean and standard deviation of the reconstructions to the average and standard deviation of the corresponding ground truth movie before applying the evaluation masks, but this is not done for quantification except in Supplementary Figure 3D.”
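
      As a minimal sketch of the luminance and contrast matching described above, and of why a correlation-based metric is insensitive to it, consider the following. The function names, array shapes, and toy data are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np


def match_luminance_contrast(recon, truth):
    """Rescale recon so its mean and std (over the whole video) match truth."""
    z = (recon - recon.mean()) / recon.std()
    return z * truth.std() + truth.mean()


def video_correlation(a, b):
    """Pearson correlation between two videos, computed over all pixels."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]


# Toy example (assumption): a dim, low-contrast, noisy copy of a random video.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 255, size=(30, 36, 64))  # (frames, height, width)
recon = 0.2 * truth + 10 + rng.normal(0, 5, truth.shape)

matched = match_luminance_contrast(recon, truth)

# Mean and std now agree with the ground truth ...
print(matched.mean() - truth.mean(), matched.std() - truth.std())
# ... while the Pearson correlation is the same before and after matching.
print(video_correlation(recon, truth), video_correlation(matched, truth))
```

      Because the matching is a positive affine rescaling of pixel values, Pearson correlation is mathematically unchanged by it, which is consistent with the adjustment mattering only for visual inspection.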

      We were also initially surprised that contrast and luminance are not captured well by our reconstruction method, but this makes sense as V1 is largely luminance invariant (O’Shea et al., 2025, https://doi.org/10.1016/j.celrep.2024.115217) and contrast only has a gain effect on V1 activity (Tring et al., 2024, https://journals.physiology.org/doi/full/10.1152/jn.00336.2024). Decoding absolute contrast is likely unreliable because it is probably not the only factor modulating the overall gain of the neural population.

      To address the reviewer’s comment more fully, we ran additional experiments. More specifically, to test why contrast and luminance are not recovered in the reconstructions, we checked how the predicted activity between the reconstruction and the contrast/luminance corrected reconstructions differs. Contrast and luminance adjustment had little impact on predicted response similarity on average. This makes the reconstruction optimization loss function insensitive to overall contrast and luminance so it cannot be decoded. There is a small effect on activity correlation, however, so we cannot completely rule out that contrast and luminance could be reconstructed with a different loss function. 

      (2) The authors attempted to investigate the variability in reconstruction quality across different movies and 10-second snippets of a movie by correlating various visual features, such as video motion energy, contrast, luminance, and behavioral factors like running speed, pupil diameter, and eye movement, with reconstruction success. However, it would also be beneficial if the authors correlated the response loss (Poisson loss between neural responses) with reconstruction quality (video correlation) for individual videos, as these metrics are expected to be correlated if the reconstruction captures neural variance. 

      We thank the reviewer for this suggestion. We have now included this analysis and find that if the neural activity was better predicted by the DNEM then the reconstruction of the video was also more similar to the ground truth video. We further found that this effect is shift-dependent (in time), meaning the prediction of activity based on proximal video frames is more influential on reconstruction performance. 

      Reviewer #3 (Recommendations for the authors): 

      (1) I was confused about the choice of applying a transparency mask thresholded with alpha>0.5 during training and alpha>1 during evaluation. Why treat the two situations differently? Also, shouldn't we expect alpha to be in the [0,1] range, in which case, what is the meaning of alpha>1? (And finally, as already described in "Weaknesses", how does this choice impact the comparison with prior experiments? Did they also employ a similar masking strategy?) 

      We found that applying a mask during training increased performance regardless of the size of the evaluation mask. Using a less stringent mask during training than during evaluation increases performance slightly, but also allows inspection of the reconstruction in areas where the model will be less confident without sacrificing performance, if this is desired. The thresholds of 0.5 and 1 were chosen through trial and error, but the exact values do not hold intrinsic meaning. The alpha mask values can go above 1 during their optimization. We could have clipped alpha during the training procedure (algorithm 1), but we decided this was not worth redoing at this stage, as the alphas used for testing were not above 1. All reconstruction approaches in previous publications limit the field of view in some form, whether this is due to the size of the screen, the size of the image on the screen, or the cropping of the presented/reconstructed images during analysis. 

      To address the reviewer’s comment in detail, we have added extensive additional analysis to evaluate the coverage of the reconstruction achieved in this paper and how different masking strategies affect performance, as well as how the mask relates to more traditional receptive field mapping.  

      (2) I would not use the word "imagery" in the first sentence of the abstract, because this might be interpreted by some readers as reconstruction of mental imagery, a very distinct question. 

      We changed imagery to images in the abstract.

      (3) Line 145-146: "<1 frame, or <30Hz" should be "<1 frame, or >30Hz". 

      We have corrected the error.

      (4) Algorithm 1, Line 5, a subscript variable 'g' should be changed to 'h'

      We have corrected the error.

      Additional Changes

      (1) Minor grammatical errors

      (2) Addition of citations: We were previously not aware of a bioRxiv preprint from 2022 (Cobos et al., 2022), which used gradient descent-based input optimization to reconstruct static images but without the addition of a diffusion model. Instead, we had cited Pierzchlewicz et al., 2023 (bioRxiv/NeurIPS) for this method. Cobos et al., 2022 compare static image reconstruction similarity to ground truth images and the similarity of the in vivo evoked activity across multiple reconstruction methods. Performance values are only given for reconstructions from trial-averaged responses across ~40 trials (in the absence of original data or code we are also not able to retrospectively calculate single-trial performance). The authors find that optimizing for evoked activity rather than image similarity produces image reconstructions that evoke more similar in vivo responses compared to reconstructions optimized for image similarity itself. We have now added and discussed the citation in the main text.

      (3) Workaround for error in the open-source code from https://github.com/lRomul/sensorium for video hashing function in the SOTA DNEM: By checking the most correlated first frame for each reconstructed movie, we discovered there was a bug in the open-source code and 9/50 movies we originally used for reconstruction were not properly excluded from the training data between DNEM instances. The reason for this error was that some of the movies differ by only a few pixels, and the video hashing function used to split training and test set folds in the original DNEM code classified these movies as different and split them across folds. We have replaced these 9 movies and provide a figure below showing the next closest first frame for every movie clip we reconstruct. This does not affect our claims. Excluding these 9 movie clips did not affect the reconstruction performance (video correlation went from 0.563 to 0.568), so there was no overestimation of performance due to test set contamination. However, they should still be removed, so some of the values in the paper have changed slightly. The only statistical test that was affected was the correlation between video correlation and mean motion energy (Supplementary Figure 4A), which went from p = 0.043 to p = 0.071.

      Author response image 2.

      Exclusion of movie clips with duplicates in the DNEM training data. A) Example frame of a reconstructed movie (ground truth) and the most correlated first frame from the training data. B) All movie clips and their corresponding most correlated clip from the training data. Red boxes indicate excluded duplicates.
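
      The duplicate check described in point (3) can be sketched as follows. This is an illustrative assumption of how such a check might look, not the authors' code: the function name, frame sizes, and toy data are hypothetical, and the exact byte-level hash (SHA-1 over raw pixels) is used only to show why near-duplicates can slip past hashing; the hashing function in the original repository may differ.

```python
import hashlib

import numpy as np


def most_correlated_first_frame(test_frame, train_first_frames):
    """Return (index, Pearson r) of the training first frame closest to test_frame."""
    test = test_frame.ravel().astype(float)
    train = train_first_frames.reshape(len(train_first_frames), -1).astype(float)
    # z-score each frame; the mean product of z-scores is the Pearson correlation
    test = (test - test.mean()) / test.std()
    train = (train - train.mean(axis=1, keepdims=True)) / train.std(axis=1, keepdims=True)
    r = train @ test / test.size
    best = int(np.argmax(r))
    return best, float(r[best])


# Toy data (assumption): 20 random training first frames of 36 x 64 pixels.
rng = np.random.default_rng(0)
train_frames = rng.integers(0, 256, size=(20, 36, 64)).astype(np.uint8)

# A "test" frame that differs from training frame 3 in only three pixels.
test_frame = train_frames[3].copy()
test_frame[0, :3] ^= 1  # flip the lowest bit of three pixels

# An exact hash treats the two frames as different clips ...
same_hash = (hashlib.sha1(test_frame.tobytes()).hexdigest()
             == hashlib.sha1(train_frames[3].tobytes()).hexdigest())
# ... whereas the correlation check flags the near-duplicate.
idx, r = most_correlated_first_frame(test_frame, train_frames)
print(same_hash, idx, r)
```

      Any cut-off on the correlation for declaring two clips duplicates would be a further assumption; the response above does not state one.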

    1. eLife Assessment

      This important study demonstrates the significance of incorporating biological constraints in training neural networks to develop models that make accurate predictions under novel conditions. By comparing standard sigmoid recurrent neural networks (RNNs) with biologically constrained RNNs, the manuscript offers compelling evidence that biologically grounded inductive biases enhance generalization to perturbed conditions. This manuscript will appeal to a wide audience in systems and computational neuroscience.

    2. Reviewer #1 (Public review):

      This manuscript introduces a biologically informed RNN (bioRNN) that predicts the effects of optogenetic perturbations in both synthetic and in vivo datasets. By comparing standard sigmoid RNNs (σRNNs) and bioRNNs, the authors make a compelling case that biologically grounded inductive biases improve generalization to perturbed conditions. This work is innovative, technically strong, and grounded in relevant neuroscience, particularly the pressing need for data-constrained models that generalize causally.

      Comments on revisions:

      The authors have addressed all my concerns.

    3. Reviewer #2 (Public review):

      Sourmpis et al. present a study in which the importance of including certain inductive biases in the fitting of recurrent networks is evaluated with respect to the generalization ability of the networks when exposed to untrained perturbations.

      The work proceeds in three stages:

      (i) a simple illustration of the problem is made. Two reference (ground-truth) networks with qualitatively different connectivity, but similar observable network dynamics, are constructed, and recurrent networks with varying aspects of design similarity to the reference networks are trained to reproduce the reference dynamics. The activity of these trained networks during untrained perturbations is then compared to the activity of the perturbed reference networks. It is shown that, of the design characteristics that were varied, the enforced sign (Dale's law) and locality (spatial extent) of efference were especially important.

      (ii) The intuition from the constructed example is then extended to networks that have been trained to reproduce certain aspects of multi-region neural activity recorded from mice during a detection task with a working-memory component. A similar pattern is demonstrated, in which enforcing the sign and locality of efference in the fitted networks has an influence on the ability of the trained networks to predict aspects of neural activity during unseen (untrained) perturbations.

      (iii) The authors then illustrate the relationship between the gradient of the motor readout of trained networks with respect to the net inputs to the network units, and the sensitivity of the motor readout to small perturbations of the input currents to the units, which (in vivo) could be controlled optogenetically. The paper is concluded with a proposed use for trained networks, in which the models could be analyzed to determine the most sensitive directions of the network and, during online monitoring, inform a targeted optogenetic perturbation to bias behavior.

      The authors do not overstate their claims, and in general, I find that I agree with their conclusions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      Major:

      (1) In line 76, the authors make a very powerful statement: 'σRNN simulation achieves higher similarity with unseen recorded trials before perturbation, but lower than the bioRNN on perturbed trials.' I couldn't find a figure showing this. This might be buried somewhere and, in my opinion, deserves some spotlight - maybe a figure or even inclusion in the abstract.

      We agree with the reviewer that these results are important. The failure of σRNN on perturbed data could be inferred from the former Figures 1E, 2C-E, and 3D. Following the reviewers' comments, we have tried to make this the most prominent message of Figure 1, in particular with the addition of the new panel E. We also moved Table 1 from the Supplementary to the main text to highlight this quantitatively.

      (2) It's mentioned in the introduction (line 84) and elsewhere (e.g., line 259) that spiking has some advantage, but I don't see any figure supporting this claim. In fact, spiking seems not to matter (Figure 2C, E). Please clarify how spiking improves performance, and if it does not, acknowledge that. Relatedly, in line 246, the authors state that 'spiking is a better metric but not significant' when discussing simulations. Either remove this statement and assume spiking is not relevant, or increase the number of simulations.

      We could not find the exact quote from the reviewer, and we believe that they intended to quote “spiking is better on all metrics, but without significant margins”. Indeed, spiking did not improve the fit significantly on perturbed trials; this is particularly true in comparison with the benefits of Dale’s law and local inhibition. As suggested by the reviewer, we rephrased the sentence from this quote and more generally the corresponding paragraphs in the intro (lines 83-87) and in the results (lines 245-271). Our corrections in the results sections are also intended to address the minor point (4) raised by the same reviewer.

      (3) The authors prefer the metric of predicting hits over MSE, especially when looking at real data (Figure 3). I would bring the supplementary results into the main figures, as both metrics are very nicely complementary. Relatedly, why not add Pearson correlation or R2, and not just focus on MSE Loss?

      For the in vivo data in Figure 3, we do not have simultaneous electrophysiological recordings and optogenetic stimulation; the two were performed in different recording sessions. Therefore, we can only compare the effect of optogenetics on the behavior, and we cannot compute Pearson correlation or R2 of the perturbed network activity. To avoid ambiguity, we wrote “For the sessions of the in vivo dataset with optogenetic perturbation that we considered, only the behavior of an animal is recorded” on line 294.

      (4) I really like the 'forward-looking' experiment in closed loop! But I felt that the relevance of micro perturbations is very unclear in the intro and results. This could be better motivated: why should an experimentalist care about this forward-looking experiment? Why exactly do we care about micro perturbation (e.g., in contrast to non-micro perturbation)? Relatedly, I would try to explain this in the intro without resorting to technical jargon like 'gradients'.

      As suggested, we updated the last paragraph of the introduction (lines 88-95) to give better motivation for why algorithmically targeted acute spatio-temporal perturbations can be important to dissect the function of neural circuits. We also added citations to recent studies with targeted in vivo optogenetic stimulation. As far as we know, previous work on targeted network stimulation mostly used linear models, whereas we used non-linear RNNs and their gradients.

      Minor:

      (1) In the intro, the authors refer to 'the field' twice. Personally, I find this term odd. I would opt for something like 'in neuroscience'.

      We implemented the suggested change: l.27 and l.30

      (2) Line 45: When referring to previous work using data-constrained RNN models, Valente et al. is missing (though it is well cited later when discussing regularization through low-rank constraints)

      We added the citation: l.45

      (3) Line 11: Method should be methods (missing an 's').

      We fixed the typo.

      (4) In line 250, starting with 'So far', is a strange choice of presentation order. After interpreting the results for other biological ingredients, the authors introduce a new one. I would first introduce all ingredients and then interpret. It's telling that the authors jump back to 2B after discussing 2C.

      We restructured the last two paragraphs of section 2.1, and we hope that the presentation order is now more logical.

      (5) The black dots in Figure 3E are not explained, or at least I couldn't find an explanation.

      We added an explanation in the caption of Figure 3E.

      Reviewer #2 (Public review):

      (1) Some aspects of the methods are unclear. For comparisons between recurrent networks trained from randomly initialized weights, I would expect that many initializations were made for each model variant to be compared, and that the performance characteristics are constructed by aggregating over networks trained from multiple random initializations. I could not tell from the methods whether this was done or how many models were aggregated.

      The reviewer's expectation is correct: for each variant, we trained multiple models with different random seeds (affecting both the weight initialization and the noise of our model) and aggregated the results. We have now clarified this in Methods 4.6, lines 658-662.
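
      The aggregation over random seeds described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for training one model variant with a given seed and returning a scalar test metric, the variant names are invented, and summarizing by mean and standard error is an assumed choice.

```python
import math
import random
import statistics


def train_and_evaluate(variant: str, seed: int) -> float:
    """Hypothetical placeholder for training one model and scoring it.

    The seed controls every source of randomness (a single RNG stands in
    here for both weight initialization and model noise).
    """
    rng = random.Random(f"{variant}-{seed}")
    return rng.gauss(0.5, 0.05)  # fake metric, for illustration only


def aggregate_over_seeds(variant: str, seeds: list[int]) -> dict[str, float]:
    """Train one model per seed and summarize the metric across seeds."""
    scores = [train_and_evaluate(variant, s) for s in seeds]
    mean = statistics.mean(scores)
    # Standard error of the mean; requires at least two seeds.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return {"n": len(scores), "mean": mean, "sem": sem}


if __name__ == "__main__":
    # Variant names are hypothetical labels, not the paper's model names.
    for variant in ("full_model", "ablated_model"):
        print(variant, aggregate_over_seeds(variant, seeds=[0, 1, 2, 3, 4]))
```

      Reporting the number of seeds alongside the summary makes it clear how many networks each comparison rests on.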

      (2) It is possible that including perturbation trials in the training sets would improve model performance across conditions, including held-out (untrained) perturbations (for instance, to units that had not been perturbed during training). It could be noted that if perturbations are available, their use may alleviate some of the design decisions that are evaluated here.

      In general, we agree with the reviewer that including perturbation trials in the training set would likely improve model performance across conditions. One practical reason we did not do so with our dataset is the small number of perturbed trials for each targeted cortical area: there are too few trials with light perturbations to robustly train and test our models.

      More fundamentally, testing hard generalization to perturbations (i.e., perturbation testing) will always require that the perturbations are not trivially represented in the training data. Including perturbation trials during training would compromise our main finding, namely that some biological model constraints improve generalization to perturbations. To test this claim, it was necessary to keep the perturbations out of the training data.

      We agree that including all available data of perturbed and non-perturbed recordings would be useful to build the best generalist predictive system. It could help, for instance, for closed-loop circuit control as we studied in Figure 5. Yet, there too, it will be important for the scientific validation process to always keep some causal perturbations of interest out of the training set. This is necessary to fairly measure the real generalization capability of any model. Importantly, this is why we think out-of-distribution “perturbation testing” is likely to have a recurring impact in the years to come, even beyond the case of optogenetic inactivation studied in detail in our paper.

      Recommendation for the authors:

      Reviewer #1 (Recommendation for the authors):

      The code is not very easy to follow. I know this is a lot to ask, but maybe make clear where the code is to train the different models, which I think is a great contribution of this work? I predict that many readers will want to use the code and so this will improve the impact of this work.

      We updated the code to make it easier to train a model from scratch.

      Reviewer #2 (Recommendation for the authors):

      The figures are really tough to read. Some of that small font should be sized up, and it's tough to tell in the posted paper what's happening in Figure 2B.

      We updated Figures 1 and 2 significantly, in part to increase their readability. We also implemented the "Superficialities" suggestions.

    1. eLife Assessment

      This valuable study explores the role of the chromatin regulator ATAD2 in mouse spermatogenesis. The data convincingly demonstrate that ATAD2 is essential for proper chromatin remodeling in haploid spermatids, influencing gene accessibility, H3.3-mediated transcription, and histone eviction. Using Atad2 knockout (KO) mice, the authors link ATAD2 to the DNA-replication-independent incorporation of sperm-specific proteins like protamines and histone H3.3. Although the findings highlight chromatin abnormalities and impaired in vitro fertilization in KO mice, natural fertility remains unaffected, suggesting possible in vivo compensatory mechanisms. Future experiments will be needed to tease out the precise molecular role of ATAD2 in spermatogenesis. This work will be of interest to the epigenetics and developmental fields.

    2. Reviewer #1 (Public review):

      Summary:

      The authors analyzed the expression of ATAD2 protein in post-meiotic stages and characterized the localization of various testis-specific proteins in the testis of the Atad2 knockout (KO). By cytological analysis as well as ATAC sequencing, the study showed increased levels of the HIRA histone chaperone, accumulation of histone H3.3 on post-meiotic nuclei, defective chromatin accessibility, and delayed deposition of protamines. Sperm from the Atad2 KO mice show reduced success in in vitro fertilization. The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin.

      Strengths:

      The paper describes the role of ATAD2 AAA+ ATPase in the proper localization of sperm-specific chromatin proteins such as protamine, suggesting the importance of the DNA replication-independent histone exchanges with the HIRA-histone H3.3 axis.

      Weaknesses:

      The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript by Liakopoulou et al. presents a comprehensive investigation into the role of ATAD2 in regulating chromatin dynamics during spermatogenesis. The authors elegantly demonstrate that ATAD2, via its control of histone chaperone HIRA turnover, ensures proper H3.3 localization, chromatin accessibility, and histone-to-protamine transition in post-meiotic male germ cells. Using a new well-characterized Atad2 KO mouse model, they show that ATAD2 deficiency disrupts HIRA dynamics, leading to aberrant H3.3 deposition, impaired transcriptional regulation, delayed protamine assembly, and defective sperm genome compaction. The study bridges ATAD2's conserved functions in embryonic stem cells and cancer to spermatogenesis, revealing a novel layer of epigenetic regulation critical for male fertility.

      Strengths:

      The MS provides the first demonstration of ATAD2's essential role in spermatogenesis, linking its expression in haploid spermatids to histone chaperone regulation and connecting ATAD2-dependent chromatin dynamics to gene accessibility (ATAC-seq), H3.3-mediated transcription, and histone eviction. Interestingly and surprisingly, sperm chromatin defects in Atad2 KO mice impair only in vitro fertilization but not natural fertility, suggesting unknown compensatory mechanisms in vivo.

      Weaknesses:

      The MS is robust and there are no major weaknesses.

      The authors have addressed all the queries successfully.

    4. Reviewer #3 (Public review):

      Summary:

      The authors generated knockout mice for Atad2, a conserved bromodomain-containing factor expressed during spermatogenesis. In Atad2 KO mice, HIRA, a chaperone for histone variant H3.3, was upregulated in round spermatids, accompanied by an apparent increase in H3.3 levels. Furthermore, the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis were partially disrupted in the absence of ATAD2, possibly due to delayed histone removal. Despite these abnormalities, Atad2 KO male mice were able to produce offspring normally.

      Strengths:

      The manuscript addresses the biological role of ATAD2 in spermatogenesis using a knockout mouse model, providing a valuable in vivo framework to study chromatin regulation during male germ cell development. The observed redistribution of H3.3 in round spermatids is clearly presented and suggests a previously unappreciated role of ATAD2 in histone variant dynamics. The authors also document defects in the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis, providing phenotypic insight into chromatin transitions in late spermatogenic stages. Overall, the study presents a solid foundation for further mechanistic investigation into ATAD2 function.

      Weaknesses:

      While the manuscript reports the gross phenotype of Atad2 KO mice, the findings remain largely superficial and do not convincingly demonstrate how ATAD2 deficiency affects chromatin.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      The authors analyzed the expression of ATAD2 protein in post-meiotic stages and characterized the localization of various testis-specific proteins in the testis of the Atad2 knockout (KO). By cytological analysis as well as ATAC sequencing, the study showed increased levels of the HIRA histone chaperone, accumulation of histone H3.3 on post-meiotic nuclei, defective chromatin accessibility, and delayed deposition of protamines. Sperm from the Atad2 KO mice show reduced success in in vitro fertilization. The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin.

      We would like to take this opportunity to highlight that the present study builds on our previously published work, which examined the function of ATAD2 in both yeast S. pombe and mouse embryonic stem (ES) cells (Wang et al., 2021). In yeast, using genetic analysis we showed that inactivation of HIRA rescues defective cell growth caused by the absence of ATAD2. This rescue could also be achieved by reducing histone dosage, indicating that the toxicity depends on histone over-dosage, and that HIRA toxicity, in the absence of ATAD2, is linked to this imbalance.

      Furthermore, HIRA ChIP-seq performed in mouse ES cells revealed increased nucleosome-bound HIRA, particularly around transcription start sites (TSS) of active genes, along with the appearance of HIRA-bound nucleosomes within normally nucleosome-free regions (NFRs). These findings pointed to ATAD2 as a major factor responsible for unloading HIRA from nucleosomes. This unloading function may also apply to other histone chaperones, such as FACT (see Wang et al., 2021, Fig. 4C).

      In the present study, our investigations converge on the same ATAD2 function in the context of a physiologically integrated mammalian system, spermatogenesis. Indeed, in the absence of ATAD2, we observed H3.3 accumulation and enhanced H3.3-mediated gene expression. Consistent with this functional model of ATAD2 (unloading chaperones from histone- and non-histone-bound chromatin), we also observed defects in histone-to-protamine replacement.

      Together, the results presented here and in Wang et al. (2021) reveal an underappreciated regulatory layer of histone chaperone activity. Previously, histone chaperones were primarily understood as factors that load histones. Our findings demonstrate that we must also consider a previously unrecognized regulatory mechanism that controls assembled histone-bound chaperones. This key point was clearly captured and emphasized by Reviewer #2 (see below).

      Strengths:

      The paper describes the role of ATAD2 AAA+ ATPase in the proper localization of sperm-specific chromatin proteins such as protamine, suggesting the importance of the DNA replication-independent histone exchanges with the HIRA-histone H3.3 axis. 

      Weaknesses: 

      (1) Some results lack quantification. 

      We will consider all the data and add appropriate quantifications where necessary.

      (2) The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin. 

      Please see our comments above.

      Reviewer #2 (Public review): 

      Summary:

      This manuscript by Liakopoulou et al. presents a comprehensive investigation into the role of ATAD2 in regulating chromatin dynamics during spermatogenesis. The authors elegantly demonstrate that ATAD2, via its control of histone chaperone HIRA turnover, ensures proper H3.3 localization, chromatin accessibility, and histone-to-protamine transition in post-meiotic male germ cells. Using a new well-characterized Atad2 KO mouse model, they show that ATAD2 deficiency disrupts HIRA dynamics, leading to aberrant H3.3 deposition, impaired transcriptional regulation, delayed protamine assembly, and defective sperm genome compaction. The study bridges ATAD2's conserved functions in embryonic stem cells and cancer to spermatogenesis, revealing a novel layer of epigenetic regulation critical for male fertility.

      Strengths:

      The MS provides the first demonstration of ATAD2's essential role in spermatogenesis, linking its expression in haploid spermatids to histone chaperone regulation and connecting ATAD2-dependent chromatin dynamics to gene accessibility (ATAC-seq), H3.3-mediated transcription, and histone eviction. Interestingly and surprisingly, sperm chromatin defects in Atad2 KO mice impair only in vitro fertilization but not natural fertility, suggesting unknown compensatory mechanisms in vivo.

      Weaknesses:

      The MS is robust and there are no major weaknesses.

      Reviewer #3 (Public review): 

      Summary: 

      The authors generated knockout mice for Atad2, a conserved bromodomain-containing factor expressed during spermatogenesis. In Atad2 KO mice, HIRA, a chaperone for histone variant H3.3, was upregulated in round spermatids, accompanied by an apparent increase in H3.3 levels. Furthermore, the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis were partially disrupted in the absence of ATAD2, possibly due to delayed histone removal. Despite these abnormalities, Atad2 KO male mice were able to produce offspring normally. 

      Strengths:

      The manuscript addresses the biological role of ATAD2 in spermatogenesis using a knockout mouse model, providing a valuable in vivo framework to study chromatin regulation during male germ cell development. The observed redistribution of H3.3 in round spermatids is clearly presented and suggests a previously unappreciated role of ATAD2 in histone variant dynamics. The authors also document defects in the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis, providing phenotypic insight into chromatin transitions in late spermatogenic stages. Overall, the study presents a solid foundation for further mechanistic investigation into ATAD2 function. 

      Weaknesses:

      While the manuscript reports the gross phenotype of Atad2 KO mice, the findings remain largely superficial and do not convincingly demonstrate how ATAD2 deficiency affects chromatin dynamics. Moreover, the phenotype appears too mild to elucidate the functional significance of ATAD2 during spermatogenesis. 

      We respectfully disagree with the statement that our findings are largely superficial. Based on our investigations of this factor over the years, it has become evident that ATAD2 functions as an auxiliary factor that facilitates mechanisms controlling chromatin dynamics (see, for example, Morozumi et al., 2015). These mechanisms can still occur in the absence of ATAD2, but with reduced efficiency, which explains the mild phenotype we observed.

      This function, while not essential, is nonetheless an integral part of the cell’s molecular biology and should be studied and brought to the attention of the broader biological community, just as we study essential factors. Unfortunately, the field has tended to focus primarily on core functional actors, often overlooking auxiliary factors. As a result, our decade-long investigations into the subtle yet important roles of ATAD2 have repeatedly been met with skepticism regarding its functional significance, which has in turn influenced editorial decisions.

      We chose eLife as the venue for this work specifically to avoid such editorial barriers and to emphasize that facilitators of essential functions do exist. They deserve to be investigated, and the underlying molecular regulatory mechanisms must be understood.

      (1) Figures 4-5: The analyses of differential gene expression and chromatin organization should be more comprehensive. First, Venn diagrams comparing the sets of significantly differentially expressed genes between this study and previous work should be shown for each developmental stage. Second, given the established role of H3.3 in MSCI, the effect of Atad2 knockout on sex chromosome gene expression should be analyzed. Third, integrated analysis of RNA-seq and ATAC-seq data is needed to evaluate how ATAD2 loss affects gene expression. Finally, H3.3 ChIP-seq should be performed to directly assess changes in H3.3 distribution following Atad2 knockout.  

      (1) In the revised version, we will include Venn diagrams to illustrate the overlap in significantly differentially expressed genes between this study and previous work. However, we believe that the GSEAs presented here provide stronger evidence, as they indicate the statistical significance of this overlap (p-values). In our case, we observed p-value < 0.01 (**) and p < 0.001 (***).

      (2) Sex chromosome gene expression was analyzed and is presented in Fig. 5C.

      (3) The effect of ATAD2 loss on gene expression is shown in Fig. 4A, B, and C as histograms, with statistical significance indicated in the middle panels.

      (4) Although mapping H3.3 incorporation across the genome in wild-type and Atad2 KO cells would have been informative, the available anti-H3.3 antibody did not work for ChIP-seq, at least in our hands. The authors of Fontaine et al., 2022, who studied H3.3 during spermatogenesis in mice, must have encountered the same problem, since they tagged the endogenous H3.3 gene to perform their ChIP experiments.

      (2) Figure 3: The altered distribution of H3.3 is compelling. This raises the possibility that histone marks associated with H3.3 may also be affected, although this has not been investigated. It would therefore be important to examine the distribution of histone modifications typically associated with H3.3. If any alterations are observed, ChIP-seq analyses should be performed to explore them further.

      Based on our understanding of ATAD2’s function—specifically its role in releasing chromatin-bound HIRA—in the absence of ATAD2 the residence time of both HIRA and H3.3 on chromatin increases. This results in the detection of H3.3 not only on sex chromosomes but across the genome. Our data provide clear evidence of this phenomenon. The reviewer is correct in suggesting that the accumulated H3.3 would carry H3.3-associated histone PTMs; however, we are unsure what additional insights could be gained by further demonstrating this point.

      (3) Figure 7: While the authors suggest that pre-PRM2 processing is impaired in Atad2 KO, no direct evidence is provided. It is essential to conduct acid-urea polyacrylamide gel electrophoresis (AU-PAGE) followed by western blotting, or a comparable experiment, to substantiate this claim. 

      Figure 7 does not suggest that pre-PRM2 processing is affected in Atad2 KO; rather, this figure—particularly Fig. 7B—specifically demonstrates that pre-PRM2 processing is impaired, as shown using an antibody that recognizes the processed portion of pre-PRM2. ELISA was used to provide a more quantitative assessment; however, in the revised manuscript we will also include a western blot image.

      (4) HIRA and ATAD2: Does the upregulation of HIRA fully account for the phenotypes observed in Atad2 KO? If so, would overexpression of HIRA alone be sufficient to phenocopy the Atad2 KO phenotype? Alternatively, would partial reduction of HIRA (e.g., through heterozygous deletion) in the Atad2 KO background be sufficient to rescue the phenotype? 

      These are interesting experiments that require the creation of appropriate mouse models, which are not currently available.

      (5) The mechanism by which ATAD2 regulates HIRA turnover on chromatin and the deposition of H3.3 remains unclear from the manuscript and warrants further investigation. 

      The Reviewer is absolutely correct. In addition to the points addressed in response to Reviewer #1’s general comments (see above), it would indeed have been very interesting to test the segregase activity of ATAD2 (likely driven by its AAA ATPase activity) through in vitro experiments using the Xenopus egg extract system described by Tagami et al., 2004. This system can be applied both in the presence and absence (via immunodepletion) of ATAD2 and would also allow the use of ATAD2 mutants, particularly those with inactive AAA ATPase or bromodomains. However, such experiments go well beyond the scope of this study, which focuses on the role of ATAD2 in chromatin dynamics during spermatogenesis.

      References:

      (1) Wang T, Perazza D, Boussouar F, Cattaneo M, Bougdour A, Chuffart F, Barral S, Vargas A, Liakopoulou A, Puthier D, Bargier L, Morozumi Y, Jamshidikia M, Garcia-Saez I, Petosa C, Rousseaux S, Verdel A, Khochbin S. ATAD2 controls chromatin-bound HIRA turnover. Life Sci Alliance. 2021 Sep 27;4(12):e202101151. doi: 10.26508/lsa.202101151. PMID: 34580178; PMCID: PMC8500222.

      (2) Morozumi Y, Boussouar F, Tan M, Chaikuad A, Jamshidikia M, Colak G, He H, Nie L, Petosa C, de Dieuleveult M, Curtet S, Vitte AL, Rabatel C, Debernardi A, Cosset FL, Verhoeyen E, Emadali A, Schweifer N, Gianni D, Gut M, Guardiola P, Rousseaux S, Gérard M, Knapp S, Zhao Y, Khochbin S. Atad2 is a generalist facilitator of chromatin dynamics in embryonic stem cells. J Mol Cell Biol. 2016 Aug;8(4):349-62. doi: 10.1093/jmcb/mjv060. Epub 2015 Oct 12. PMID: 26459632; PMCID: PMC4991664.

      (3) Fontaine E, Papin C, Martinez G, Le Gras S, Nahed RA, Héry P, Buchou T, Ouararhni K, Favier B, Gautier T, Sabir JSM, Gerard M, Bednar J, Arnoult C, Dimitrov S, Hamiche A. Dual role of histone variant H3.3B in spermatogenesis: positive regulation of piRNA transcription and implication in X-chromosome inactivation. Nucleic Acids Res. 2022 Jul 22;50(13):7350-7366. doi: 10.1093/nar/gkac541. PMID: 35766398; PMCID: PMC9303386.

      (4) Tagami H, Ray-Gallet D, Almouzni G, Nakatani Y. Histone H3.1 and H3.3 complexes mediate nucleosome assembly pathways dependent or independent of DNA synthesis. Cell. 2004 Jan 9;116(1):51-61. doi: 10.1016/s0092-8674(03)01064-x. PMID: 14718166.

      Recommendations for the authors:

      Reviewing Editor Comments:

      I note that the reviewers had mixed opinions about the strength of the evidence in the manuscript. A revision that addresses these points would be welcome.

      Reviewer #1 (Recommendations for the authors):  

      Major points: 

      (1) No line numbers: It is hard to point out the issues.

      The revised version includes line numbers.

      (2) Given the results shown in Figure 3 and Figure 4, it would be nice to show the chromosomal localization of histone H3.3 in spermatocytes or post-meiotic cells by chromatin immunoprecipitation sequencing (ChIP-seq).

      Although mapping H3.3 incorporation across the genome in wild-type and Atad2 KO cells would have been informative, the available anti-H3.3 antibody did not work for ChIP-seq in our hands. In fact, this antibody is not well regarded for ChIP-seq. For example, Fontaine et al. (2022), who investigated H3.3 during spermatogenesis in mice, circumvented this issue by tagging the endogenous H3.3 genes for their ChIP experiments.

      (3) Figure 7B and 8: Why did the authors use ELISA for the protein quantification? At least, western blotting should be shown.

      ELISA is a more quantitative method than traditional immunoblotting. Nevertheless, as requested by the reviewer, we have now included a corresponding western blot in Fig. S3.

      (4) For readers, please add a schematic pathway of histone-protamine replacement in sperm formation in Fig.1 and it would be nice to have a model figure, which contains the authors' idea in the last figure.

      As requested by this reviewer, we have now included a schematic model in Figure 9 to summarize the main conclusions of our work.

      Minor points: 

      (1) Page 2, the second paragraph, "pre-PRM2": Please explain more about pre-PRM2 and/or PRM2 as well as PRM1 (Figure 6).

      More detailed descriptions of PRM2 processing are now given in this paragraph. 

      (2) Page 3, bottom paragraph, line 1: "KO" should be "knockout (KO)".

      Done.

      (3) Page 4, second paragraph bottom: Please explain more about the protein structure of germ-line-specific ATAD2S: how it is different from ATAD2L. Germ-line specific means it is also expressed in ovary?

      As Atad2 is predominantly expressed in embryonic stem cells and in spermatogenic cells, we replaced "germ-line specific" throughout the text with more appropriate terms.

      (4) Figure 1C, western blotting: In wild-type testis extracts, both ATAD2L and -S are present. Does this mean that ATAD2L is expressed in both the germ line and supporting cells? Please clarify this and, if possible, show the western blotting of spermatids as well as spermatocytes.

      Figure 1D shows sections of seminiferous tubules from Atad2 KO mice, in which lacZ expression is driven by the endogenous Atad2 promoter. The results indicate that Atad2 is expressed mainly in post-meiotic cells. Most labeled cells are located near the lumen, whereas the supporting Sertoli cells remain unlabeled. Sertoli cells, which are anchored to the basal lamina, span the entire thickness of the germinal epithelium from the basal lamina to the lumen. Their nuclei, however, are usually positioned closer to the basal membrane. Thus, the observed lacZ expression pattern argues against substantial Atad2 expression in Sertoli cells. 

      (5) Figure 1C: Please explain a bit more about the reduction of ATAD2 proteins in heterozygous mice.

      Done

      (6) Figure 1C: Genotypes of the mice should be shown in the legend.

      Done 

      (7) Figure 1D: Please add a more magnified image of the sections to see the staining pattern in the seminiferous tubules.

      Higher magnification does not provide more information, since the structure of cells within the tubules is lost owing to the treatment of the sections required for X-gal staining. Please see our response to Reviewer #2's comment on Figure 1C.

      (8) Page 5, first paragraph, line 2, histone dosage: What do the authors mean by the histone dosage? Please explain more or use a more appropriate word.

      "Histone dosage" refers to the amount or relative abundance of histone proteins in a cell.

      (9) Figure 2A: Given the result in Figure 1C, it is interesting to check the amount of HIRA in Atad2 heterozygous mice.

      In Atad2 heterozygous mice, we would expect an increase in HIRA, but only to about half the level seen in the Atad2 homozygous knockout shown in Figure 2A, which is relatively modest. Therefore, we doubt that detecting such a small change—approximately half of that in Figure 2A—would yield clear or definitive results. 

      (10) Figure 2A, legend (n=5): What does this "n" mean? The extract of testes from "5" male mice like Figure 2B. Or 5 independent experiments. If the latter is true, it is important to share the other results in the Supplements.

      “n” refers to five WT and five Atad2 KO males. The legend has been clarified as suggested by the reviewer.

      (11) Figure 2A, legend, line 2, Atad2: This should be italicized.

      Done

      (12) Figure 2B: Please show the quantification of amounts of HIRA protein like Fig. 2A.

      As indicated in the legend, what is shown is a pool of testes from 3 individuals per genotype.

      (13) Figure 2B shows an increased level of HIRA in Atad2 KO testis. This suggests the role of ATAD2 in the protein degradation of HIRA. This possibility should be mentioned or tested since ATAD2 is an AAA+ ATPase. 

      The extensive literature on ATAD2 provides no indication that it is involved in protein degradation. In our early work on ATAD2 in the 2000s, we hypothesized that, as a member of the AAA ATPase family, ATAD2 might associate with the 19S proteasome subunit (through multimerization with the other AAA ATPase member of this regulatory subunit). However, both our published pilot studies (Caron et al., PMID: 20581866) and subsequent unpublished work ruled out this possibility. Instead, since the amount of nucleosome-bound HIRA increases in the absence of ATAD2, we propose that chromatin-bound HIRA is more stable than soluble HIRA once it has been released from chromatin by ATAD2.

      (14) Page 6, second paragraph, line 5, ko: KO should be capitalized.

      Done

      (15) Page 6, second paragraph, line 2 from the bottom, chromatin dynamics: Throughout the text, the authors used "chromatin dynamics". However, all the authors analyzed in the current study is the localization of chromatin protein.  So, it is much easier to explain the results by using "chromatin status," etc. In this context, "accessibility" is better. 

      Throughout the text, we replaced “chromatin dynamics” with a more precise term appropriate to each context.

      (16) Figure 3: Please provide the quantification of signals of histone H3.3 in a nucleus or nuclear cytoplasm.

      This request is not clear to us since we do not observe any H3.3 signal in the cytoplasm.

      (17) Figure 3: As the control of specificity in post-meiotic cells, please show the image and quantification of the H3.3 signals in spermatocyte, for example.

      This request is not clear to us. What specificity is meant? 

      (18) Figure 3, bottom panels: Please explain what the white lines indicate.

      The white lines indicate the boundary of the cell nucleus, as estimated from Hoechst staining. This is now indicated in the legend of the figure.

      (19) Figure 4A: Please explain more about what kind of data is here. Is this wild-type and/or Atad2 KO? The label of the Y-axis should be "mean expression level". What is the standard deviation (SD) here on the X-axis. Moreover, there is only one red open circle, but the number of this class is 5611. All 5611 genes in this group show NO expression. Please explain more.

      The plot displays the mean expression levels (y-axis, labeled as "mean expression level") versus the corresponding standard deviations (x-axis), both calculated from three independent biological replicates of isolated round spermatids (Atad2 wild-type and Atad2 KO). The standard deviation reflects the variability of gene expression across biological replicates. Genes were grouped into four categories (grp1: blue, grp2: cyan, grp3: green, grp4: orange) according to the quartile of their mean expression. For grp4, all genes have no detectable expression, resulting in a mean expression of zero and a standard deviation of zero; consequently, the 5611 genes in this group are represented by a single overlapping point (red open circle) at the origin. 
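
      The construction of this plot can be illustrated with a short sketch. It is not the authors' pipeline: the toy expression values, the use of the sample standard deviation, and the rank-based split into four equal-sized groups (group 1 = highest mean) are all assumptions made for illustration. It shows why genes with no detectable expression in any replicate collapse onto a single point at the origin.

```python
import statistics


def mean_and_sd(expr: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Per-gene mean and sample standard deviation across replicates."""
    return {g: (statistics.mean(v), statistics.stdev(v)) for g, v in expr.items()}


def quartile_groups(stats: dict[str, tuple[float, float]]) -> dict[str, int]:
    """Assign genes to groups 1-4 by rank of mean expression (1 = highest).

    Assumption: groups are equal-sized quartiles of the ranked gene list.
    """
    ranked = sorted(stats, key=lambda g: stats[g][0], reverse=True)
    n = len(ranked)
    return {g: 1 + (4 * i) // n for i, g in enumerate(ranked)}


# Toy data: 8 hypothetical genes, 3 replicates each.
expr = {
    "g1": [120.0, 130.0, 110.0],
    "g2": [80.0, 95.0, 90.0],
    "g3": [40.0, 42.0, 38.0],
    "g4": [20.0, 25.0, 15.0],
    "g5": [8.0, 6.0, 7.0],
    "g6": [2.0, 3.0, 1.0],
    "g7": [0.0, 0.0, 0.0],
    "g8": [0.0, 0.0, 0.0],
}

stats = mean_and_sd(expr)
groups = quartile_groups(stats)

# Every gene in group 4 has mean 0 and SD 0, so all of them are drawn
# at the same (SD, mean) coordinate: the origin.
points_grp4 = {stats[g] for g, grp in groups.items() if grp == 4}
print(points_grp4)
```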

      (20) Figure 4C: If possible, it would be better to have a statistical comparison between wild-type and the KO.  

      The mean profiles are displayed together with their variability (± 2 s.e.m.) across the four replicates for both ATAD2 WT (blue) and ATAD2 KO (red). For groups 1, 2, and 3, the envelopes of the curves remain clearly separated around the peak, indicating a consistent difference in signal between the two conditions. In contrast, group 4 does not present a strong signal and, accordingly, no marked difference is observed between WT and KO in this group.
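
      The envelope comparison described here (mean ± 2 s.e.m. across replicates) can be sketched as below. The numbers are invented toy profiles, not the ATAC-seq data, and treating non-overlapping envelopes as a "consistent difference" is a visual criterion rather than a formal statistical test.

```python
import math
import statistics


def envelope(replicates: list[list[float]], k: float = 2.0):
    """Return (lower, mean, upper) curves: mean ± k * s.e.m. at each position."""
    lower, mean, upper = [], [], []
    for values in zip(*replicates):  # one tuple of replicate values per position
        m = statistics.mean(values)
        sem = statistics.stdev(values) / math.sqrt(len(values))
        lower.append(m - k * sem)
        mean.append(m)
        upper.append(m + k * sem)
    return lower, mean, upper


def separated_positions(a: list[list[float]], b: list[list[float]], k: float = 2.0) -> list[int]:
    """Positions where the two mean ± k * s.e.m. envelopes do not overlap."""
    lo_a, _, up_a = envelope(a, k)
    lo_b, _, up_b = envelope(b, k)
    return [i for i in range(len(lo_a)) if lo_a[i] > up_b[i] or lo_b[i] > up_a[i]]


# Toy profiles: 4 replicates x 3 positions (flank, peak, flank).
wt = [[1.0, 5.0, 1.0], [1.1, 5.2, 0.9], [0.9, 4.8, 1.1], [1.0, 5.0, 1.0]]
ko = [[1.0, 7.0, 1.0], [0.9, 7.3, 1.1], [1.1, 6.9, 0.9], [1.0, 6.8, 1.0]]

# Only the peak position (index 1) has non-overlapping envelopes.
print(separated_positions(wt, ko))
```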

      (21) Figure 5, GSEA panels: Please explain more about what the GSEA is in the legend.

      The legend has been updated as follows:

      (A) Expression profiles of post-meiotic H3.3-activated genes. The heatmap (left panel) displays the normalized expression levels of genes identified by Fontaine and colleagues as upregulated in the absence of histone H3.3 (Fontaine et al. 2022) for Atad2 WT (WT) and Atad2 KO (KO) samples at days 20, 22, 24, and 26 PP (D20 to D26). The colour scale represents the z-score of log-transformed DESeq2-normalized counts. The box plots (middle panel) display pooled normalized expression levels, aggregated across replicates and genes, for each condition (WT and KO) and each time point (D20 to D26). Statistical significance between WT and KO conditions was determined using a two-sided t-test, with p-values indicated as follows: * for p-value<0.05, ** for p-value<0.01 and *** for p-value<0.001. The right panel shows the results of gene set enrichment analysis (GSEA), which assesses whether predefined groups of genes show statistically significant differences between conditions. Here, the post-meiotic H3.3-activated gene set, identified by Fontaine et al. (2022), is significantly enriched in Atad2 KO compared with WT samples at day 26 (p < 0.05, FDR < 0.25). Coloured vertical bars indicate the “leading edge” genes (i.e., those contributing most to the enrichment signal), located before the point of maximum enrichment score. (B) As shown in (A) but for the "post-meiotic H3.3-repressed genes" gene set. (C) As shown in (A) but for the "sex chromosome-linked genes" gene set.
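
      The heatmap colour scale mentioned in this legend (z-score of log-transformed normalized counts) can be illustrated with a minimal sketch. The legend does not specify the details, so the log base (log2), the pseudocount of 1, the use of the sample standard deviation, and scaling each gene across samples are all assumptions.

```python
import math
import statistics


def row_zscores(counts: list[float], pseudocount: float = 1.0) -> list[float]:
    """Z-score one gene's normalized counts across samples after log2(x + pseudocount).

    Assumptions: log2 transform, pseudocount of 1, sample standard deviation.
    """
    logged = [math.log2(c + pseudocount) for c in counts]
    mu = statistics.mean(logged)
    sd = statistics.stdev(logged)
    if sd == 0:
        # A gene with identical values in every sample carries no contrast.
        return [0.0 for _ in logged]
    return [(x - mu) / sd for x in logged]


# Toy normalized counts for one hypothetical gene across four samples.
z = row_zscores([0.0, 1.0, 3.0, 7.0])
print([round(v, 3) for v in z])
```

      Scaling each gene separately puts genes with very different absolute expression on a common colour scale, which is why a heatmap of this kind shows relative rather than absolute levels.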

      (22) Figure 6. In the KO, the number of green cells is more than red and yellow cells, suggesting the delayed maturation of green (TH2B-positive) cells. It is essential to count the number of each cell and show the quantification.

      The green cells correspond to those expressing TH2B but lacking transition proteins (TP) and protamine 1 (Prm1), indicating that they are at earlier stages than elongating–condensing spermatids. Counting these green cells simply reflects the ratio of elongating/condensing spermatids to earlier-stage cells, which varies depending on the field examined. The key point in this experiment is that in wild-type mice, only red cells (elongating/condensing spermatids) and green cells (earlier stages) are observed. By contrast, in Atad2 KO testes, a significant proportion of yellow cells appears, which are never seen in wild-type tissue. The crucial metric here is the percentage of yellow cells relative to the total number of elongating/condensing spermatids (red cells). In wild-type testes, this value is consistently 0%, whereas in Atad2 KO testes it always ranges between 50% and 100% across all fields containing substantial numbers of elongating/condensing spermatids.

      (23) Figure 8A: Please show the images of sperm (heads) in the KO mice with or without decompaction.

      The requested image is now displayed in Figure S5.

      (24) Figure 8C: In the legend, it says n=5. However, there are more than 5 plots on the graph. Please explain the experiment more in detail.

      The experiment is now better explained in the legend of this Figure.

      Reviewer #2 (Recommendations for the authors): 

      While the study is rigorous and well performed, the following minor points could be addressed to strengthen the manuscript: 

      Figure 1C should indicate each of the different types of cells present in the sections. It would be of interest to show specifically the different post-meiotic germ cells.

      With this type of sample preparation, it is difficult to precisely distinguish the different cell types within the sections. Nevertheless, the staining pattern strongly indicates that most of the intensely stained cells are post-meiotic, situated near the tubule lumens and extending roughly halfway toward the basal membrane.

      In the absence of functional ATAD2, the accumulation of HIRA primarily occurs in round spermatids (Fig. 2B). If technically possible, it would be of great interest to show this by IHC of testis section. 

      Unfortunately, our antibody did not satisfactorily work in IHC.

      The increase in H3.3 signal in Atad2 KO spermatids (Fig. 3) is interpreted as resulting from reduced turnover. However, alternative explanations (e.g., H3.3 misincorporation or altered chaperone affinity) should not be ruled out. 

      The referee is correct that alternative explanations are possible. However, our previous work (Wang et al., 2021; PMID: 34580178) demonstrated that in the absence of ATAD2, there is reduced turnover of HIRA-bound nucleosomes, as well as reduced nucleosome turnover, evidenced by the appearance of nucleosomes in regions that are normally nucleosome-free at active gene TSSs. We have no evidence supporting any other alternative hypothesis.

      In the MS, the reduced accessibility at active genes (Fig. 4) is attributed to H3.3 overloading. However, global changes in histone acetylation (e.g., H4K5ac) or other remodelers in KO cells could also be considered.

      In fact, we meant that histone overloading could be responsible for the altered accessibility. This has been clearly demonstrated in S. cerevisiae in the absence of Yta7 (the S. cerevisiae ATAD2) (PMID: 25406467).

      In relation with the sperm compaction assay (Fig. 8A), the DTT/heparin/Triton protocol may not fully reflect physiological decompaction. This could be validated with alternative methods (e.g., MNase sensitivity). 

      The referee is right, but since this is a subtle effect, as judged by the normal fertility of these mice, we doubt that milder approaches could reveal significant differences between wild-type and Atad2 KO sperm.

      It is surprising that despite the observed alterations in the genome organization of the sperm, the natural fertility of the KO mice is not affected (Fig. 8C). This warrants deeper discussion: Is functional compensation occurring (e.g., by p97/VCP)? Analysis of epididymal sperm maturation or uterine environment could provide insights.

      As detailed in the Discussion section, this work, together with our previous study (Wang et al., 2021; PMID: 34580178), highlights an overlooked level of regulation in histone chaperone activity: the release of chromatin-bound factors following their interaction with chromatin. This is an energy-dependent process, driven by ATP and the associated ATPase activity of these factors. Such activity could be mediated by various proteins, such as p97/VCP or DNAJC9–HSP70, as discussed in the manuscript, or by yet unidentified factors. However, most of these mechanisms are likely to occur during the extensive histone-to-histone variant exchanges of meiosis and post-meiotic stages. To the best of our knowledge, epididymal sperm maturation and the uterine environment do not involve substantial histone-to-histone or histone-to-protamine exchanges.

      The authors showed that MSCI genes present an enhancement of repression in the absence of ATAD2 by enhancing H3.3 function. It would be also of interest to analyze the behavior of the Sex body during its silencing (zygotene to pachytene) by looking at different markers (i.e., gamma-H2AX phosphorylation, Ubiquitylation etc). 

      The referee is correct that this is an interesting question. Accordingly, in our future work, we plan to examine the sex body in more detail during its silencing, using a variety of relevant markers, including those suggested by the reviewer. However, we believe that such investigations fall outside the scope of the present study, which focuses on the molecular relationship between ATAD2 and H3.3, rather than on the role of H3.3 in regulating sex body transcription. For a comprehensive analysis of this aspect, studies should primarily focus on the H3.3 mouse models reported by Fontaine and colleagues (PMID: 35766398).

      Fig. 6: Co-staining of TH2B/TP1/PRM1 is convincing but would benefit from quantification (% cells with overlapping signals).

      The green cells correspond to those expressing TH2B but lacking transition proteins (TP) and protamine 1 (Prm1), indicating that they are at earlier stages than elongating–condensing spermatids. Counting these green cells simply reflects the ratio of elongating/condensing spermatids to earlier-stage cells, which varies depending on the field examined. The key point is that in wild-type mice, only red cells (elongating/condensing spermatids) and green cells (earlier stages) are observed. By contrast, in Atad2 KO testes, a significant proportion of yellow cells appears, which are never seen in wild-type tissue. The crucial metric is the percentage of yellow cells relative to the total number of elongating/condensing spermatids (red cells). In wild-type testes, this value is consistently 0%, whereas in Atad2 KO testes it always ranges between 50% and 100% across all fields containing substantial numbers of elongating/condensing spermatids.

    1. eLife Assessment

      This useful study reports a method to detect and analyze a novel post-translational modification, lysine acetoacetylation (Kacac), finding it regulates protein metabolism pathways. The study unveils epigenetic modifiers involved in placing this mark, including key histone acetyltransferases such as p300, and concomitant HDACs, which remove the mark. Proteomic and bioinformatics analysis identified many human proteins with Kacac sites, potentially suggesting broad effects on cellular processes and disease mechanisms. The data presented are solid and the study will be of interest to those studying protein and metabolic regulation.

    2. Reviewer #3 (Public review):

      Summary:

      This paper presents a timely and significant contribution to the study of lysine acetoacetylation (Kacac). The authors successfully demonstrate a novel and practical chemo-immunological method using the reducing reagent NaBH4 to transform Kacac into lysine β-hydroxybutyrylation (Kbhb).

      Strengths:

      This innovative approach enables simultaneous investigation of Kacac and Kbhb, showcasing their potential in advancing our understanding of post-translational modifications and their roles in cellular metabolism and disease.

      Weaknesses:

      The study lacks supporting in vivo data, such as gene knockdown experiments, to validate the proposed conclusions at the cellular level.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      In the manuscript by Fu et al., the authors developed a chemo-immunological method for the reliable detection of Kacac, a novel post-translational modification, and demonstrated that acetoacetate and AACS serve as key regulators of cellular Kacac levels. Furthermore, the authors identified the enzymatic addition of the Kacac mark by acyltransferases GCN5, p300, and PCAF, as well as its removal by deacetylase HDAC3. These findings indicate that AACS utilizes acetoacetate to generate acetoacetyl-CoA in the cytosol, which is subsequently transferred into the nucleus for histone Kacac modification. A comprehensive proteomic analysis has identified 139 Kacac sites on 85 human proteins. Bioinformatics analysis of Kacac substrates and RNA-seq data reveal the broad impacts of Kacac on diverse cellular processes and various pathophysiological conditions. This study provides valuable additional insights into the investigation of Kacac and would serve as a helpful resource for future physiological or pathological research.

      The authors have made efforts to revise this manuscript and address my concerns. The revisions are appropriate and have improved the quality of the manuscript.

      We appreciate the constructive and thoughtful feedback, which has been invaluable in enhancing the quality of our manuscript.

      Reviewer #3 (Public review):

      Summary:

      This paper presents a timely and significant contribution to the study of lysine acetoacetylation (Kacac). The authors successfully demonstrate a novel and practical chemo-immunological method using the reducing reagent NaBH4 to transform Kacac into lysine β-hydroxybutyrylation (Kbhb).

      Thank you for the positive and insightful comments.

      Strengths:

      This innovative approach enables simultaneous investigation of Kacac and Kbhb, showcasing its potential in advancing our understanding of post-translational modifications and their roles in cellular metabolism and disease.

      We are grateful for the reviewer’s comments, which have contributed to enhancing the quality of our study.

      Weaknesses:

      The experimental evidence presented in the article is insufficient to fully support the authors' conclusions. In the in vitro assays, the proteins used appear to be highly inconsistent with their expected molecular weights, as shown by Coomassie Brilliant Blue staining (Figure S3A). For example, p300, which has a theoretical molecular weight of approximately 270 kDa, appeared at around 37 kDa; GCN5/PCAF, expected to be ~70 kDa, appeared below 20 kDa. Other proteins used in the in vitro experiments also exhibited similarly large discrepancies from their predicted sizes. These inconsistencies severely compromise the reliability of the in vitro findings. Furthermore, the study lacks supporting in vivo data, such as gene knockdown experiments, to validate the proposed conclusions at the cellular level.

      We appreciate the reviewer’s comments. In the biochemical assays, we used the expressed catalytic domains of HATs, rather than the full-length proteins, for activity testing. Specifically, the following constructs were expressed and purified: p300 (1287–1666), GCN5 (499-663), PCAF (493-658), MOF (125-458), MOZ (497-780), MBP-MORF (361-716), Tip60 (221-512), HAT1 (20-341), and HBO1 (full length). This explains the discrepancies between the molecular weights observed in Figure S3A and those expected for the full-length proteins. 

      Although a recent study (PMID: 37382194) reported the acetoacetyltransferase activities of p300 and GCN5 in cells, we recognize that additional knockdown experiments would be necessary to substantiate their contributions to in vivo Kacac generation and to explore the functional roles of Kacac in an enzyme-specific context. We plan to address these kinds of research issues in our future work.

    1. eLife Assessment

      This fundamental study provides new evidence of a change in how microglia survey neurons during the chronic phase of neurodegeneration, which researchers studying neuroinflammation and its role in neurodegenerative disease should find interesting. In this research, using time-lapse imaging of acute brain slices from prion-affected mice, the researchers show that, unlike in healthy brains, microglia become reactive, lose their territorial boundaries, and become highly mobile, exhibiting "kiss-and-ride" behavior, migrating into brain tissue and forming reversible, transient body-to-body contact with neurons. The evidence is compelling, with well-executed time-lapse imaging, good quantitative analysis across several disease stages, pharmacological validation of P2Y6 involvement, and the very surprising finding that this mobile behavior persists after microglia are removed from the brain.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Subhramanian et al. carefully examined how microglia adapt their surveillance strategies during chronic neurodegeneration, specifically in prion-infected mice. The authors used ex vivo time-lapse imaging and in vitro strategies and found that reactive microglia adopt a highly mobile, "kiss-and-ride" behavior, contrasting with the more static surveillance typical of homeostatic microglia. The manuscript provides fundamental mechanistic insights into the dynamics of microglia-neuron interactions, implicates P2Y6 signaling in regulating mobility, and suggests that intrinsic reprogramming of microglia might underlie this behavior; the conclusions are therefore compelling.

      Strengths:

      (1) The novelty of the study is high, particularly the demonstration that microglia lose territorial confinement and dynamically migrate from neuron to neuron under chronic neurodegeneration.

      (2) The possible implications of a stimulus-independent high mobility in reactive microglia are particularly striking, although this is not fully explored.

      (3) The use of time-lapse imaging in organotypic slices rather than overexpression models provided a more physiological approach.

      (4) Microglia-neuron interactions in neurodegeneration have broad implications for understanding the progression of diseases, such as Alzheimer's and Parkinson's, that are associated with chronic inflammation.

      Weaknesses:

      Previous weaknesses were addressed.

    3. Reviewer #2 (Public review):

      This is a nice paper focused on microglial responses to different clinical stages of prion infection in acute brain slices. The key here is the use of time-lapse imaging that captures the dynamics of microglial surveillance, including morphology, migration, and intracellular neuron/microglial contacts. The authors use a myeloid GFP-labeled transgenic mouse to track microglia in SSLOW-infected brain slices, quantifying differences in motility and microglial-neuronal interactions via live fluorescence imaging. Interesting findings include the elaborate patterns of motility among microglia, the distinct types and durations of intracellular contacts, the potential role of calcium signaling in facilitating hypermobility, and the fact that this motion-promoting status is intrinsic to the microglia, persisting even after the cells have been isolated from infected brains. Although largely a descriptive paper, it offers mechanistic insights, including the role of calcium in supporting microglial movement, with bursts of signaling identified even within the time lapse format, and inhibition studies implicating the purinergic receptor and calcium transient regulator P2Y6 in migratory capacity.

      Strengths:

      (1) The focus on microglia activation and activity in the context of prion disease is interesting

      (2) Two different prions produce largely the same response

      (3) Use of time-lapse provides insight into the dynamics of microglia, distinguishing between types of contact - mobility vs motility - and providing insight on the duration/transience and reversibility of extensive somatic contacts that include brief and focused connections in addition to soma envelopment.

      (4) Imaging window selection (3 hours) guided by prior publications documenting preserved morphology, activity, and gene expression regulation up to 4 hours.

      (5) The distinction between high- and low-mobility microglia is interesting, especially given that hypermobility seems to be an innate property of the cells.

      (6) The live-imaging approach is validated by fixed tissue confocal imaging.

      (7) The variance in duration of neuron/microglia contacts is interesting, although there is no insight into what might dictate which status of interaction predominates

      (8) The reversibility of the enveloping action, which is not apparently a commitment to engulfment, is interesting, as is the fact that only neurons are selected for this activity.

      (9) The calcium studies use the fluorescent dye Calbryte-590, which picks up neuronal and microglial bursts; prolonged bursts are detected in enveloped neurons and in the hyper-mobile microglia. The microglial lead is followed up using the P2Y6 inhibitor MRS-2578, which blunts the mobility of the microglia.

      Comments on revisions:

      The authors have addressed my concerns in full - I think this is a very nice addition to the literature.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      The Cx3cr1/EGFP line labels all myeloid cells, which makes it difficult to conclude that all observed behaviors are attributable to microglia rather than infiltrating macrophages. The authors refer to this and include it as a limitation. Nonetheless, complementary confirmation by additional microglia markers would strengthen their claims. 

      We appreciate the reviewer’s insightful comment regarding the cellular identity of the enveloping myeloid cells. As suggested, we performed triple co-immunostaining of SSLOW-infected Cx3cr1/EGFP mice using markers for neurons (NeuN), myeloid cells (IBA1), and resident microglia (TMEM119 or P2Y12). Because formic acid treatment used to deactivate prions abolishes the EGFP signal, we relied on IBA1 staining to identify the myeloid population. Our results confirmed that IBA1⁺ cells exhibiting the envelopment behavior are also TMEM119⁺ and P2Y12⁺, consistent with a resident microglial phenotype. These new data are presented in Figures S3 and S4 and described in the final section of the Results.

      Although the authors elegantly describe the dynamic surveillance and envelopment hypothesis, it is unclear what role this phenotype plays in disease progression, i.e., its functional consequences. For example, are the neurons that undergo sustained envelopment more likely to degenerate? 

      We appreciate this important question regarding the functional implications of neuronal envelopment. At present, technical limitations prevent us from continuously tracking the fate of individual enveloped neurons in prion-infected mice. Nevertheless, our recent study demonstrated that P2Y12 knockout increases the prevalence of neuronal envelopment and accelerates disease progression (Makarava et al., 2025, J. Neuroinflammation). These findings suggest that while microglial envelopment may represent an adaptive response to increased neuronal surveillance demands, excessive envelopment, as observed in the absence of P2Y12, appears to be maladaptive. A new paragraph has been added to the Discussion to address this point.

      Moreover, although the increase in mobility is a relevant finding, it would be interesting for the authors to further comment on what the molecular trigger(s) is/are that might promote this increase. These adaptations, which are at least long-lasting, confer apparent mobility in the absence of external stimuli. 

      We thank the reviewer for this thoughtful suggestion. The molecular mechanisms underlying the increased mobility of microglia in prion-infected brains remain to be identified, and we plan to pursue this question in future studies. One possibility we briefly discuss in the revised manuscript is that proinflammatory signaling, mediated by secreted cytokines or interleukins, may drive this phenotype. Supporting this hypothesis, recent work has shown that IFNγ enhances microglial migration in the adult mouse cortex (doi:10.1073/pnas.2302892120). This work has been cited in the revised manuscript.

      The authors performed, as far as I could understand, the experiments in cortical brain regions. There is no clear rationale for this in the manuscript, nor is it clear whether the mobility is specific to a particular brain region. This is particularly important, as microglia reactivity varies greatly depending on the brain region. 

      We appreciate this insightful comment highlighting the importance of regional determinants of microglial reactivity, which indeed aligns with our ongoing research interests. In our previous studies, neuronal envelopment by microglia was observed consistently across all prion-affected brain regions exhibiting neuroinflammation. Assuming that envelopment requires microglial mobility, it is reasonable to speculate that microglia are mobile in all brain regions affected by prions and displaying neuroinflammatory responses. In the current study, we focused exclusively on the cortex because this region was used for quantifying the prevalence of neuronal envelopment as a function of disease progression in our prior work (DOI: 10.1172/JCI181169), which guided the present study design. Our ongoing investigations indicate that the prevalence of envelopment is region-dependent and correlates with microglial reactivity/the degree of neuroinflammation. In prion diseases, the degree of microglial reactivity is dictated by the tropism of specific prion strains to distinct brain regions. Notably, our prior studies have shown that strain-specific sialylation patterns of PrP<sup>Sc</sup> glycans play a key role in determining both regional strain tropism and the extent of neuroinflammatory activation (DOI: 10.3390/ijms21030828, DOI: 10.1172/JCI138677). In response to this comment, we have added a brief rationale for using the cortex in the Results section.

      It would be relevant information to have an analysis of the percentage of cells in normal, sub-clinical, early clinical, and advanced stages that became mobile. Without this information, the speed/distance alone can have different interpretations.

      We thank the reviewer for this valuable suggestion. The percentage of mobile cells across normal, sub-clinical, early clinical, and advanced disease stages is presented in Figure 3b and described in the final paragraph of the section “Enveloping behavior of reactive myeloid cells.”

      Reviewer #2 (Public review)

      The number of individual cells tracked has been provided, but not the number of individual mice. The sex of the mice is not provided. 

      We used N = 3 animals per group throughout the study; this information has now been added to the figure legends. Animals of both sexes were included in random proportions. The sex information is now listed for each experiment in the Animals subsection of the Methods.

      The statistical approach is not clear; was each cell treated as a single observation? 

      Yes, with the exception of the heat map in Figure 2d, all mobility parameters are analyzed and presented at the level of individual cells, with each cell treated as an independent observation. The primary aim of this study is to characterize behavioral patterns of single reactive myeloid cells. Analyzing data at the cell level allows us to capture the full distribution of cell behaviors and to preserve biologically meaningful heterogeneity within and across animals. By contrast, averaging values per animal would largely mask this variability. In the heat map in Figure 2d, data are averaged per animal, specifically to illustrate inter-animal variability within each group and to visualize changes across disease progression.

      The potential for heterogeneity among animals has not been addressed. 

      To address this concern, we now include a new Supplemental Figure (Figure S4) presenting the data using Superplots, in which individual cells are shown as dots, animal-level averages as circles, and group means calculated based on animals as black horizontal lines. These plots demonstrate that cell mobility measures are highly consistent across animals within each group, indicating limited inter-animal heterogeneity.
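
      The two levels of aggregation described above (individual cells versus animal-level averages) can be sketched as follows. This is an illustration only, assuming each tracked cell is recorded as an (animal, value) pair. The function name `summarize_by_animal`, the animal labels, and the numbers are hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_animal(cells):
    """Aggregate per-cell measurements to animal-level and group-level means.

    cells: iterable of (animal_id, value) pairs, one pair per tracked cell.
    Returns (animal_means, group_mean), where group_mean is the mean of
    the animal-level means, so every animal contributes equally.
    """
    per_animal = defaultdict(list)
    for animal_id, value in cells:
        per_animal[animal_id].append(value)

    animal_means = {animal: mean(values) for animal, values in per_animal.items()}
    group_mean = mean(list(animal_means.values()))
    return animal_means, group_mean


# Hypothetical speeds for cells from three animals (not study data).
cells = [
    ("m1", 10.0), ("m1", 20.0),
    ("m2", 30.0),
    ("m3", 40.0), ("m3", 60.0), ("m3", 50.0),
]
animal_means, group_mean = summarize_by_animal(cells)
cell_level_mean = mean(value for _, value in cells)
```

      When animals contribute different numbers of cells, the cell-level mean and the mean of animal-level means can differ, which is why plotting both levels together helps reveal any inter-animal heterogeneity.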

      Validation of prion accumulation at each clinical stage of the disease is not provided. 

      We now provide validation of PrP<sup>Sc</sup> accumulation across disease stages by Western blot, along with quantitative analysis, in a new Supplemental Figure (Figure S2). This confirms progressive PrP<sup>Sc</sup> accumulation with advancing disease.

      How were the numerous captures of cells handled to derive morphological quantitative values? Based on the videos, there is a lot of movement and shape-shifting.

      The following description has been added to Methods to clarify morphology analysis: For microglial morphology analysis, we quantified morphological parameters (radius, area, perimeter, and shape index) for individual EGFP⁺ cells in each time frame of the time-lapse recordings using the TrackMate 7.13.2 plugin in FIJI. Parameter values for each cell were then averaged across the entire three-hour imaging period to obtain a single mean value per cell.

      While it is recognized that there are limits to what can be measured simultaneously with live imaging, the authors appear to have fixed tissues from each time point too. It would be very interesting to know whether the extent of prion accumulation influences microglial surveillance, i.e., do the enveloped neurons have greater pathology? 

      This is a very interesting question that is difficult to answer because of the technical challenges of monitoring the pathology or fate of individual neurons as a function of their envelopment in live prion-infected animals. Our previous work revealed that both the accumulation of total PrP<sup>Sc</sup> in the brain and the accumulation of PrP<sup>Sc</sup> specifically in lysosomal compartments of microglia due to phagocytosis precede the onset of neuronal envelopment (DOI: 10.1172/JCI181169). Moreover, the onset of neuronal envelopment occurred after a noticeable decline in neuronal levels of Grin1, a subunit of the NMDA receptor essential for synaptic plasticity. Reactive microglia were observed to envelop Grin1-deficient neurons, suggesting that microglia respond to neuronal dysfunction. However, considering that envelopment is very dynamic and, in most cases, reversible, correlating the degree of envelopment with dysfunction of individual neurons is technically challenging.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors): 

      (1) I recommend performing additional immunostaining using microglial markers to address specificity. 

      These new data showing immunostaining for markers of resident microglia TMEM119 and P2Y12 are presented in Figures S6 and S7 and described in the final section of the Results.

      (2) The authors can at least further discuss the functional consequences of their findings in further detail. 

      A new paragraph has been added to the Discussion to address this point.

      (3) Quantify the % of cells that become mobile in the different conditions. 

      The percentage of mobile cells across normal, sub-clinical, early clinical, and advanced disease stages is presented in Figure 3b and described in the final paragraph of the section “Enveloping behavior of reactive myeloid cells.”

      (4) Improve method details on the brain regions used and further expand the statistical section. 

      We have expanded the Statistical Analysis section to indicate whether statistical comparisons and mean values were calculated at the single-cell level or the animal level for each analysis. The specific statistical tests used and the number of animals (N) are now reported in the corresponding figure legends. The sex of animals is provided in Table 1 (Methods). Only the cortical region was examined in this study; this information is stated in the Methods and is now also noted in the figure legends for clarity.

      Reviewer #2 (Recommendations for the authors): 

      (1) More details on members of the P2Y receptor family expressed in microglia would be helpful. The study highlights a previously published prion-induced decline in the expression of P2Y12, a microglial marker that is required for intracellular neuron-microglial contacts, and P2Y6, involved in calcium transients, which is required for hypermotility. How are members of this family of receptors regulated at the gene and/or protein level in microglia, and given their responsiveness to nucleotide ligands, are other members implicated in the properties being quantified here? 

      We appreciate the reviewer’s insightful comment. To address this point, we examined the expression of multiple P2Y receptors and ATP-gated P2X channels known to contribute to microglial surveillance, activation, motility, and phagocytosis, alongside the activation markers Tlr2, Cd68, and Trem2. Bulk brain transcript analyses indicated that all examined genes were upregulated in SSLOW-infected mice relative to controls (new Figure S5a). However, because microglial proliferation substantially increases microglial numbers during prion disease progression, bulk tissue measurements do not necessarily reflect per-cell expression levels. Therefore, we normalized gene expression values to the microglia-specific marker Tmem119, whose per-cell expression remains stable across disease stages (Makarava et al., 2025, J. Neuroinflammation). After normalization, Tlr2, Cd68, and Trem2 were increased approximately 10-, 6-, and 4-fold, respectively. In contrast, P2 receptor genes showed more modest changes: P2ry6 increased ~3-fold, P2ry13 ~2-fold, and P2rx7 ~1.3-fold, while P2rx4 remained unchanged (Figure S5a). Within the scope of the present study, we focused on P2Y6 due to (i) its role in regulating calcium transients, (ii) the magnitude of its upregulation relative to other P2 receptors, and (iii) its highly microglia-specific expression in the CNS. We note that currently available commercial P2Y6 antibodies lack sufficient specificity, making reliable assessment of protein-level expression challenging.
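
      The normalization logic described above can be illustrated with a short sketch. It assumes linear-scale (not log-transformed) bulk expression values and a reference marker whose per-cell expression is stable. The function name `normalized_fold_change` and the input numbers are hypothetical and are not measurements from the study.

```python
def normalized_fold_change(gene_infected, gene_control, ref_infected, ref_control):
    """Fold change of a gene after normalizing to a reference marker.

    Each bulk value is divided by the reference marker measured in the
    same condition, which approximates per-cell expression when the
    number of marker-expressing cells differs between conditions.
    All inputs are assumed to be positive, linear-scale values.
    """
    values = (gene_infected, gene_control, ref_infected, ref_control)
    if any(v <= 0 for v in values):
        raise ValueError("expression values must be positive")

    ratio_infected = gene_infected / ref_infected
    ratio_control = gene_control / ref_control
    return ratio_infected / ratio_control


# Hypothetical example: a 12-fold bulk increase in the gene of interest,
# with a 4-fold bulk increase in the reference marker due to proliferation.
fold = normalized_fold_change(12.0, 1.0, 4.0, 1.0)
```

      In this hypothetical case, most of the bulk increase is attributable to the larger number of cells, and the per-cell change is correspondingly smaller than the bulk change.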

      (2) Is P2Y6 expressed in any other cell type that might account for the blunted mobility of the microglia? The authors mention P2Y12 also identifies the GFP cells; however, it would be beneficial to highlight the specificity of the target in the ex vivo treatment of the infected slices.

      In the brain, both P2Y12 and P2Y6 are considered highly specific to resident microglia under physiological and neuroinflammatory conditions. P2Y12 is, in fact, widely used as a canonical marker of homeostatic and resident microglia. While P2Y6 is also expressed in peripheral myeloid cells such as macrophages, our phenotypic characterization indicates that the cells exhibiting neuronal envelopment are TMEM119⁺ and P2Y12⁺, consistent with a resident microglial identity. These data, including new analyses added to the revised manuscript, support that the cells responding to P2Y6 signaling in our ex vivo slice experiments are resident microglia.

      (3) The fluorescent mouse lacks Cx3cr1 - have the authors investigated why there were no apparent consequences, at least in the context of prion infection? Are there functional redundancies that might be harnessed? Does this impact the generalizability of the findings here?

      The role of Cx3cr1 in prion disease has been directly examined in two independent studies (doi: 10.1099/jgv.0.000442; doi: 10.1186/1471-2202-15-44). One study reported no effect of Cx3cr1 deficiency on disease incubation time, whereas the other observed only a minor difference. Importantly, both studies found no detectable alterations in microglial activation patterns, cytokine expression, or PrP<sup>Sc</sup> deposition in Cx3cr1-deficient mice compared to wild-type controls. Our own data (Figure S1) are consistent with these findings: disease course and PrP<sup>Sc</sup> deposition were comparable between Cx3cr1/EGFP and wild-type mice. Moreover, we observed reactive microglial envelopment of neurons in both genotypes. Microglia isolated from SSLOW-infected Cx3cr1/EGFP mice also displayed similarly elevated mobility in vitro, in agreement with our previous observations of high mobility of microglia isolated from SSLOW-infected wild-type mice (Makarava et al., 2025, J. Neuroinflammation). Taken together, these results indicate that Cx3cr1 is not a key determinant of reactive microglial mobility or envelopment behavior in prion disease. Thus, the use of the Cx3cr1/EGFP reporter line does not compromise the generalizability of our conclusions.

      (4) The distinction between high mobility and low mobility microglia is interesting - is there any evidence to suggest that the slow-moving microglia are actually a separate class - do enveloping microglia exhibit both mobility states - can the authors comment on plasticity here? 

      We appreciate this insightful comment, which closely aligns with our ongoing interests. At present, we do not have evidence to support that high- versus low-mobility microglia represent distinct molecular phenotypes. Given that our time-lapse imaging spans only a three-hour window, it remains unclear whether these mobility states reflect stable cell-intrinsic properties or transient phases within a dynamic surveillance process. Notably, we observed that individual cells can transition between more stationary, neuron-associated states and highly mobile states within the same imaging session. In future work, we intend to investigate whether prolonged interactions with neuronal somas or other microenvironmental cues may drive diversification of reactive myeloid cell phenotypes.

      (5) In the discussion, the authors speculate about "collective coordinated decision making" - that seems a stretch unless greater context is provided. The fact that several microglia can be found in contact with an individual neuron and that each microglia can connect with multiple neurons simultaneously is certainly interesting; however, evidence for hive behavior is entirely lacking.

      We agree with the reviewer that our previous wording overstated the interpretation. The statement regarding collective decision-making has been removed.

    1. eLife assessment

      This important work is the first to suggest a model that the nematode C. elegans prefers specific bacteria (its major food source) that release high amounts of the known attractant isoamyl alcohol when supplemented with exogenous leucine and has also identified a likely receptor for the odorant isoamyl alcohol. The evidence supporting the claims of the authors is solid, and the manuscript would be improved by changes to the text that clarify and address the distinction between "supplemented" versus "enriched". The renaming of srd-12 to snif-1 should also be addressed.

    2. Reviewer #1 (Public review):

      Summary:

      Siddiqui et al. investigate the question of how bacterial metabolism contributes to the attraction of C. elegans to specific bacteria. They show that C. elegans prefers three bacterial species when cultured in a leucine-enriched environment. These bacterial species release more isoamyl alcohol, a known C. elegans attractant, when cultured with leucine supplement than without leucine supplement. The study shows correlative evidence that isoamyl alcohol is produced from leucine by the Ehrlich pathway. In addition, they show that SNIF-1 is a receptor for isoamyl alcohol because a null mutant of this receptor exhibits lower chemotaxis to isoamyl alcohol and that chemotaxis to isoamyl alcohol is rescued by expression of snif-1 in AWC.

      Strengths:

      (1) This study takes a creative approach to examine the question of what specific volatile chemicals released by bacteria may signify to C. elegans by examining both bacterial metabolism and C. elegans preference behavior. Although C. elegans has long been known to be attracted to bacterial metabolites, this study may be one of the first to examine the possible role of a specific bacterial metabolic pathway in mediating attraction.

      (2) A strength of the paper is the identification of SNIF-1 as a receptor for isoamyl alcohol. The ligands for very few olfactory receptors have been identified in C. elegans and so this is a significant addition to the field. The SNIF-1 null mutant strain will likely be a useful reagent for many labs examining olfactory and foraging behaviors.

      Weaknesses:

      (1) The authors write that leucine metabolism via the Ehrlich pathway is required for production of isoamyl alcohol by three bacteria (CEent1, JUb66, BIGb0170), but their evidence for this is correlation and not causation. They show that the gene ilvE (which is part of the Ehrlich pathway) is upregulated in CEent1 bacteria upon exposure to leucine. Although this indicates that the ilvE gene may be involved in leucine metabolism, it does not show causation. To show causation, they need to knock out ilvE from one of these strains and show that the bacteria do not have increased isoamyl alcohol production when cultured on leucine and are no longer attractive to C. elegans.

      (2) The authors do show that the three bacterial strains they focus on (CEent1, JUb66, and BIGb0170) are more attractive to C. elegans when supplemented with leucine. However, some other strains such as BIGb0393 are also more attractive with leucine supplementation and produce isoamyl alcohol (Fig 1B and Supp Table 2). It is unclear why these other strains are not included with the selected three strains.

      (3) The behavioral evidence that the snif-1 gene encodes a receptor for isoamyl alcohol is compelling because of the mutant phenotype and rescue experiments. The evidence would be stronger with calcium imaging of AWC neurons in response to isoamyl alcohol in the receptor mutant, with the expectation that the response would be reduced or abolished in the mutant compared to wildtype.

    3. Reviewer #2 (Public review):

      Summary:

      Siddiqui et al. show that C. elegans prefers certain bacterial strains that have been supplemented with the essential amino acid (EAA) leucine. They convincingly show that leucine supplementation stimulates the production of isoamyl alcohol (IAA) in some bacteria. IAA is an attractive odorant that is sensed by the AWC neurons. The authors identify a receptor, SRD-12, that is expressed in the AWC chemosensory neurons and is required for chemotaxis to IAA. The authors propose that IAA is a predominant olfactory cue that determines diet preference in C. elegans. Since leucine is an EAA, the authors propose that IAA sensing provides the worm with a proxy mechanism to identify EAA-rich diets.

      Strengths:

      The authors propose IAA as a predominant olfactory cue that determines diet preference in C. elegans, providing a molecular mechanism underlying diet selection. They show that wild isolates of C. elegans have a strong chemotactic response to IAA, indicating that IAA is an ecologically relevant odor for the worm. The paper is well written, and the presented data are convincing and well organized. This is an interesting paper that connects chemotactic response with bacterially produced odors and thus provides an understanding of how animals adapt their foraging behavior through the perception of molecules that may indicate nutritional value.

      Weaknesses:

      Major: While I do like the way the authors frame C. elegans IAA sensing as a mechanism to identify leucine (EAA) rich diets, it is not fully clear whether bacterial IAA production is a proxy for bacterial leucine levels.

      (1) Can the authors measure leucine (or other EAA) content of the different CeMbio strains? This would substantiate the premise in the way they frame this in the introduction. While the authors convincingly show that leucine supplementation induces IAA production in some strains, it is not clear if there are lower leucine levels in the non-preferred strains.

      (2) It is not clear whether the non-preferred bacteria in Figure 1A and 1B have the ability to produce IAA. To substantiate the claim that C. elegans prefers CEent1, JUb66, and BIGb0170 due to their ability to generate IAA from leucine, it would be useful to measure IAA levels in non-preferred bacteria (+ and - leucine supplementation). If the authors have these data it would be good to include them.

      (3) The authors would strengthen their claim if they could show that deleting or silencing the ilvE enzyme reduces IAA levels and eliminates the increased preference upon leucine supplementation.

      (4) While the three preferred bacteria possess the ilvE gene, it is not clear whether this enzyme is present in the other non-preferred bacterial strains. As far as I know, the CeMbio strains have been sequenced, so it should be easy to determine if the non-preferred bacteria possess the capacity to make IAA. Does expression of ilvE in e.g. E. coli increase its preference index or are the other genes in the biosynthesis pathway missing?

      (5) It is strongly implied that leucine-rich diets are beneficial to the worm. Do the authors have data to show the effect of leucine supplementation on C. elegans healthspan, lifespan or brood size?

      Comments on revisions:

      (1) The authors have addressed most of the earlier questions. The main unresolved issue is whether IAA production is a reflection of bacterial leucine levels. It is not clear if there are lower leucine levels in the non-preferred strains.

      The main conclusions, (1) that some bacterial strains can convert exogenous leucine into IAA, which is an attractant to C. elegans, and (2) the identification of a GPCR required for IAA responses, are solid. These are important results that carry the paper. My outstanding concern remains the overinterpretation in framing C. elegans IAA sensing as a mechanism to identify leucine (EAA) rich diets. It is fine to leave this as a favorite hypothesis in the discussion, but statements throughout the paper need to be nuanced in the absence of leucine measurements in the different bacterial strains. (Also, the fact that the bacterial chemotaxis assays were only done with a single concentration of leucine makes it difficult to infer bacterial leucine concentrations.) I recommend softening claims related to leucine-rich diet detection unless quantitative measurements are provided.

      Part of the issue in the text lies in the difference between "supplemented" and "chemotaxis" (lab based constructs) and enriched and foraging (natural environment based). This is also the way it is set up in the introduction "Do animals use specific sensing mechanisms to find an EAA-enriched diet?". If enriched is used strictly the same as supplemented then it would be fine but in the text this distinction gets blurred and enriched drifts to the more ethological explanation.

      Then it is more than just semantics since leucine-supplemented diets are not something that occurs in the natural environment. IAA production by bacteria could be a signal for a leucine rich environment and it is fine to speculate about this in the discussion.

      Examples where the wording needs to be more precise to reflect the experimental results rather than the possible impact in its natural environment:

      The title: "The olfactory receptor SNIF-1 mediates foraging for leucine-rich diets in C. elegans"

      The intro: "Taken together, SNIF-1 regulates the dietary preference of worms to IAA-producing bacteria and thereby mediates the foraging behavior of C. elegans to leucine-enriched diets. Thus, IAA produced by bacteria is a dietary quality code for leucine-enriched bacteria."

      Results: "Figure 1. C. elegans relies on odors to select leucine-enriched bacteria"

      Supplementation is used more in the text and the figure legends, whereas headings and abstract use enriched. The experiments in the paper only describe leucine-supplemented experiments. So I would use supplemented instead of enriched when describing experiments, for clarity.

      For instance:

      Page 4: "Microbial odors drive the preference of C. elegans for leucine-enriched diet"

      Page 5: "Altogether, these findings suggested that worms rely on odors to distinguish various bacteria and find leucine-enriched bacteria"

      Page 7: "Isoamyl alcohol odor is a signature for a leucine-enriched diet"

      Page 9: "AWC odor sensory neurons facilitate the diet preference of C. elegans for leucine-enriched diets"

      Page 20: "Leucine-enriched diets produce significantly higher levels of IAA odor, making up to 90% of their headspace"

      (2) As suggested in the first round of review, the authors now add data on IAA levels in non-preferred bacteria (+ and - leucine supplementation) in Table S2. While it is good to have these data, the table is not very clear. It is not clear what ND stands for in Table S2: not determined or not detected? I assume not determined, since some strains (JUb44, BIGb0393, JUb134) produce IAA even in the absence of LEU. The authors mention that "the abundance of IAA in these strains is significantly less". However, the table just reflects yes or no. Can the authors give an indication of the concentration to understand what significantly less means? Fig. 2c at least gives a heat map.

      (3) On WormBase the gene is still called srd-12. The authors should seek permission to rename srd-12 to snif-1.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This is an important study, supported by solid to convincing data, that suggests a model for diet selection in C. elegans. The significance is that while C. elegans has long been known to be attracted to bacterial volatiles, what specific bacterial volatiles may signify to C. elegans is largely unknown. This study also provides evidence for a possible odorant/GPCR pairing.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Siddiqui et al. investigate the question of how bacterial metabolism contributes to the attraction of C. elegans to specific bacteria. They show that C. elegans prefers three bacterial species when cultured in a leucine-enriched environment. These bacterial species release more isoamyl alcohol, a known C. elegans attractant, when cultured with leucine supplement than without leucine supplement. The study shows correlative evidence that isoamyl alcohol is produced from leucine by the Ehrlich pathway. In addition, they show that SRD-12 (SNIF-1) is likely a receptor for isoamyl alcohol because a null mutant of this receptor exhibits lower chemotaxis to isoamyl alcohol and lower preference for leucine-enriched bacteria.

      Strengths:

      (1) This study takes a creative approach to examine the question of what specific volatile chemicals released by bacteria may signify to C. elegans by examining both bacterial metabolism and C. elegans preference behavior. Although C. elegans has long been known to be attracted to bacterial metabolites, this study may be one of the first to examine the role of a specific bacterial metabolic pathway in mediating attraction.

      (2)  A strength of the paper is the identification of SRD-12 (SNIF-1) as a likely receptor for isoamyl alcohol. The ligands for very few olfactory receptors have been identified in C. elegans and so this is a significant addition to the field. The srd-12 (snif-1) null mutant strain will likely be a useful reagent for many labs examining olfactory and foraging behaviors.

      Weaknesses:

      (1) The authors write that leucine metabolism via the Ehrlich pathway is required for the production of isoamyl alcohol by three bacteria (CEent1, JUb66, BIGb0170), but their evidence for this is correlation and not causation. They write that the gene ilvE is a bacterial homolog of the first gene in the yeast Ehrlich pathway (it would be good to include a citation for this) and that the gene is present in these three bacterial strains. In addition, they show that this gene, ilvE, is upregulated in CEent1 bacteria upon exposure to leucine. To show causation, they need to knock out ilvE from one of these strains and show that the bacteria do not have increased isoamyl alcohol production when cultured on leucine and are no longer attractive to C. elegans.

      Thank you for the comment. We have added the appropriate citation [1,2]. We agree that testing worms’ diet preference for the preferred strains upon ilvE knockout would further strengthen the claim for IAA being used as a proxy for leucine-enriched diet. Currently, protocols and tools for genetic manipulation of CeMbio strains are not available, making this experiment not feasible at this time.

      (2) The authors examine three bacterial strains that C. elegans showed increased preference when grown with leucine supplementation vs. without leucine supplementation. However, there also appears to be a strong preference for another strain, JUb0393, when grown on plus leucine (Figure 1B). It would be good to include statistics and criteria for selecting the three strains.

      Thanks for your comment. We agree that for Pantoea nemavictus, JUb393, worms seem to prefer the leucine supplemented (+ LEU) bacteria over unsupplemented (-LEU). However, when given a choice between the individual CeMbio bacteria and E. coli OP50, worms showed preference for only CEent1, JUb66, and BIGb0170 (Figure 1F). Consequently, CEent1, JUb66, and BIGb0170 were selected for further analyses. We have included statistics for Figure 1B-C and Figure S1A-G with details mentioned in the figure legend. 

      (3) Although the behavioral evidence that the srd-12 (snif-1) gene encodes a receptor for isoamyl alcohol is compelling, it does not meet the standard for showing that it is an olfactory receptor in C. elegans. To show it is indeed a likely receptor, one or more of the following should be done:

      (a) Calcium imaging of AWC neurons in response to isoamyl alcohol in the receptor mutant with the expectation that the response would be reduced or abolished in the mutant compared to wildtype.

      (b) A "receptor swap" experiment where the SRD-12 (SNIF-1) receptor is expressed in the AWB repulsive neuron in the SRD-12 (SNIF-1) receptor mutant background, with the expectation that with the receptor swap C. elegans will now be repelled by isoamyl alcohol in chemotaxis assays (experiment from Sengupta et al., 1996 odr-10 paper).

      Thanks for all your comments and suggestions. While the lab currently does not have the necessary expertise to conduct calcium imaging of neurons, we have performed additional experiments to confirm the requirement of AWC neurons for SNIF-1 function. We generated transgenic worms with an extrachromosomal array expressing snif-1 under (a) the AWC-specific promoter, odr-1, and (b) the AWB-specific promoter, str-1. As shown in new panel 6H in the revised manuscript and Author response image 1, we found that overexpression of snif-1 in AWC neurons completely rescues the chemotaxis defect of the snif-1 mutant (referred to as VSL2401), whereas upon the “receptor swap” in AWB neurons IAA is sensed as a repellent.

      Author response image 1.

      (A) Chemotaxis index (CI) of WT, VSL2401, VSL2401 [AWCp::snif-1] and VSL2401 [AWBp::snif-1] worms to IAA at 1:1000 dilution. Significant differences are indicated as **** P ≤ 0.0001 determined by one-way ANOVA followed by post hoc Dunnett’s multiple comparison test. Error bars indicate SEM (n≥15).
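
      For readers less familiar with the metric, the sketch below shows one common way a chemotaxis index and its SEM across replicate plates can be computed. It assumes the widely used definition CI = (worms at odor - worms at control) / (worms at odor + worms at control); the exact scoring used for this figure is not stated here, and the function names and worm counts are hypothetical.

```python
import math
import statistics


def chemotaxis_index(n_odor: int, n_control: int) -> float:
    """CI = (worms at odor - worms at control) / (worms at odor + control)."""
    total = n_odor + n_control
    if total == 0:
        raise ValueError("no worms were scored")
    return (n_odor - n_control) / total


def mean_and_sem(values):
    """Mean and standard error of the mean across replicate plates."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


# Hypothetical worm counts (odor side, control side) for three plates.
plates = [(80, 20), (90, 10), (70, 30)]
cis = [chemotaxis_index(a, b) for a, b in plates]
print(mean_and_sem(cis))
```

      Under this definition a positive CI indicates attraction and a negative CI indicates repulsion, which is how the AWB "receptor swap" result would appear.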

      (4) The authors conclude that C. elegans cannot detect leucine in chemotaxis assays. It is important to add the method for how leucine chemotaxis assay was done in order to interpret these results. Because leucine is not volatile if leucine is put on the plates immediately before the worms are added (as in a traditional odor chemotaxis assay), there is no leucine gradient for the worm to detect. It would be good to put leucine on the plate several hours before worms are introduced so worms have the possibility to be able to detect the gradient of leucine (for example, see Wakabayashi et al., 2009).

      Previously, the chemotaxis assays with leucine were performed like traditional odor chemotaxis assays. We have now also performed the chemotaxis assay as detailed in Shingai et al. 2005 [3]. Leucine was spotted on the assay plates 5 hours prior to the introduction of worms on the plates. As shown in new panel S1H in the revised manuscript, wild-type worms do not show a response to leucine in the modified chemotaxis assay.

      We have included the experimental details for leucine chemotaxis assays in the revised manuscript.  

      (5) The bacterial preference assay entitled "odor-only assay" is a misleading name. In the assay, C. elegans is exposed to both volatile chemicals (odors) and non-volatile chemicals because the bacteria are grown on the assay plate for 12 hours before the worms are introduced to the assay plate. In that time, the bacteria is likely releasing non-volatile metabolites into the plate which may affect the worm's preference. A true odor-only assay would have the bacteria on the lid and the worms on the plate.

      The ‘odor-only’ diet preference assay does not allow for non-volatile chemicals to reach worms. We achieved this by using tripartite dishes where the compartments containing worms and bacterial odors are separated by polystyrene barriers. At the time of the assay, worms were spotted in a separate compartment from that of bacteria (as shown in schematic 1A). The soluble metabolites released by the bacteria during their growth will accumulate in the agar within the bacterial compartment alone such that worms only encounter the volatile metabolites produced by bacteria wafting past the polystyrene barrier.

      (6) The findings of the study should be discussed more in the context of prior literature. For example, AWC neurons have been previously shown to be involved in bacterial preference (Harris et al., 2014; Worthy et al., 2018). In addition, CeMbio bacterial strains (the strains examined in this study) have been previously shown to release isoamyl alcohol (Chai et al. 2024).

      Thanks for the suggestion. We have modified the Discussion section to discuss the study in the light of relevant prior literature.  

      Reviewer #2 (Public review):

      Summary:

      Siddiqui et al. show that C. elegans prefers certain bacterial strains that have been supplemented with the essential amino acid (EAA) leucine. They convincingly show that leucine supplementation stimulates the production of isoamyl alcohol (IAA) in some bacteria. IAA is an attractive odorant that is sensed by the AWC neurons. The authors identify a receptor, SRD-12 (SNIF-1), that is expressed in the AWC chemosensory neurons and is required for chemotaxis to IAA. The authors propose that IAA is a predominant olfactory cue that determines diet preference in C. elegans. Since leucine is an EAA, the authors propose that IAA sensing provides the worm with a proxy mechanism to identify EAA-rich diets.

      Strengths:

      The authors propose IAA as a predominant olfactory cue that determines diet preference in C. elegans, providing a molecular mechanism underlying diet selection. They show that wild isolates of C. elegans have a strong chemotactic response to IAA, indicating that IAA is an ecologically relevant odor for the worm. The paper is well written, and the presented data are convincing and well organized. This is an interesting paper that connects chemotactic response with bacterially produced odors and thus provides an understanding of how animals adapt their foraging behavior through the perception of molecules that may indicate nutritional value.

      Weaknesses:

      Major:

      While I do like the way the authors frame C. elegans IAA sensing as a mechanism to identify leucine (EAA) rich diets, it is not fully clear whether bacterial IAA production is a proxy for bacterial leucine levels.

      (1) Can the authors measure leucine (or other EAA) content of the different CeMbio strains? This would substantiate the premise in the way they frame this in the introduction. While the authors convincingly show that leucine supplementation induces IAA production in some strains, it is not clear if there are lower leucine levels in the non-preferred strains.

      Thanks for your suggestion. Estimating leucine levels in various bacteria will provide useful information, and we hope to do so in future studies.

      (2) It is not clear whether the non-preferred bacteria in Figure 1A and 1B have the ability to produce IAA. To substantiate the claim that C. elegans prefers CEent1, JUb66, and BIGb0170 due to their ability to generate IAA from leucine, it would be useful to measure IAA levels in non-preferred bacteria (+ and - leucine supplementation). If the authors have these data it would be good to include them.

      Thanks for the suggestion. We have included a table indicating the presence or absence of IAA production by all the bacteria under + LEU and – LEU conditions (Table S2). Some of the non-preferred bacteria indeed produce isoamyl alcohol. However, the abundance of IAA in these strains is significantly less than in the preferred bacteria.

      Using the available genomic sequence data, we found that all CeMbio strains encode IlvE-like transaminase enzymes[4]. This suggests that presumably all the bacteria have the metabolic capacity to make alpha-ketoisocaproate (an intermediate in IAA biosynthetic pathway) from leucine. However, the regulation of metabolic flux is likely to be quite complex in various bacteria.  

      (3) The authors would strengthen their claim if they could show that deleting or silencing the ilvE enzyme reduces IAA levels and eliminates the increased preference upon leucine supplementation.

      We agree that testing worms’ diet preference for the preferred strains upon ilvE knockout will further strengthen the claim for IAA being crucial for finding leucine-enriched diet. Currently the lab does not have the necessary expertise and standardized protocols to do genetic manipulations for the CeMbio strains.

      (4) While the three preferred bacteria possess the ilvE gene, it is not clear whether this enzyme is present in the other non-preferred bacterial strains. As far as I know, the CeMbio strains have been sequenced so it should be easy to determine if the non-preferred bacteria possess the capacity to make IAA. Does the expression of ilvE in e.g. E. coli increase its preference index or are the other genes in the biosynthesis pathway missing?

      Thanks for the suggestion. Using the available genomic sequence data, we find that all the bacteria in the CeMbio collection possess an IlvE-like transaminase necessary for synthesis of alpha-ketoisocaproate, a key metabolite in leucine turnover as well as a precursor for IAA [4]. E. coli has an IlvE-encoding gene in its genome [2]. However, we do not find IAA in the headspace of E. coli either with or without leucine supplementation. This indicates that either (i) E. coli lacks enzymes for subsequent steps in IAA biosynthesis or (ii) the leucine provided under the experimental regime is not sufficient to shift the metabolic flux to IAA production.

      Previous studies have suggested that in yeast, the final two steps leading to IAA production are catalyzed by decarboxylase and dehydrogenase enzymes [1]. The genomic and metabolic flux data available for CeMbio do not describe specific enzymes leading up to IAA synthesis [4].

      (5) It is strongly implied that leucine-rich diets are beneficial to the worm. Do the authors have data to show the effect of leucine supplementation on C. elegans healthspan, lifespan or brood size?

      Edwards et al. 2015 reported a 15% increase in the lifespan of worms upon 1 mM leucine supplementation [5]. Wang et al 2018 also showed lifespan extension upon 1 mM and 10 mM leucine supplementation. They also reported that while leucine supplementation did not have any effect on brood size, it did make worms more resistant to heat, paraquat, and UV-stress [6]. These studies have been included in the discussion section.

      Other comments:

      Page 6. Figure 2c. While the authors' conclusions are correct based on the AWC experiments, it would be good at this stage to include the possibility that odors that are enriched in the absence of leucine may be aversive.

      Thanks for the comment. We have tested the chemotaxis response of the worms for most of the odors produced by CeMbio strains without leucine supplementation. We did not find any odor that is aversive to worms. However, we cannot completely rule out the possibility that a low abundance of aversive odor in the headspace of the bacteria was missed.

      Interestingly, we did identify 2-nonanone, a known repellent, in the headspace of the preferred bacteria upon leucine supplementation. However, the abundance of 2-nonanone in the headspace of the bacteria is relatively low (less than 1% for CEent1 and JUb66, and ~10% for BIGb0170). This suggests that the relative abundance of odors in an odor bouquet may be a relevant factor in determining worms’ preference.

      Page 6. IAA increases 1.2-4 folds upon leucine supplementation. If the authors perform a chemotaxis assay with just IAA with 1-2-4 fold differences do you get the shift in preference index as seen with the bacteria? i.e. is the difference in IAA concentration sufficient to explain the shift in bacterial PI upon leucine supplementation? Other attractants such as Acetoin and isobutanol go up in -Leu conditions.

      Thanks for the suggestion. As shown in Figure S2H and S2I, when given a choice between a concentration of IAA (1:1000 dilution) attractive to worms and a 4-fold higher amount of IAA, worms chose the latter. This result suggests that worms can distinguish between relatively small differences in IAA concentration.

      We agree that the relative abundance of Acetoin and Isobutanol is high in -LEU conditions. The presence of other attractants in - LEU conditions should skew the preference of worms for – LEU bacteria. However, we found that worms prefer + LEU bacteria (Figure 1B), suggesting that the abundance of IAA mainly influences the diet preference of the worms.  

      Page 14-15. The authors identify a putative IAA receptor based on expression studies. I compliment the authors for isolating two CRISPR deletion alleles. They show that the srd-12 (snif-1) mutants have obvious defects in IAA chemotaxis. Very few ligand-odorant receptor combinations have been identified, so this is an important discovery. CenGen data indicate that srd-12 (snif-1) is expressed in a limited set of neurons. Did the authors generate a reporter to show the expression of srd-12 (snif-1)? This is a simple experiment that would add to the characterization of the SRD-12 (SNIF-1) receptor. Rescue experiments would be nice even though the authors have independent alleles. To truly claim that SRD-12 (SNIF-1) is the receptor for IAA and activates the AWC neurons would require GCaMP experiments in the AWC neuron or a heterologous expression system. I understand that GCaMP imaging might not be part of the regular arsenal of the lab, but it would be a great addition (even in collaboration with one of the many labs that do this regularly). Comparing AWC activity using GCaMP in response to IAA-producing bacteria with high leucine levels in both wild-type and SRD-12 (SNIF-1) deficient backgrounds would further support their narrative. I leave that to the authors.

      Thanks for your comments and suggestions. To address this comment, we rescued the snif-1 mutant (referred to as VSL2401) with an extrachromosomal array expressing snif-1 under an AWC-specific promoter as well as under its native promoter. As shown in Figure 6H and Author response image 2, both transgenic lines show a complete rescue of the chemotaxis response to isoamyl alcohol. To find where snif-1 is expressed, we generated a transgenic line of worms expressing GFP under the snif-1 promoter and mCherry under the odr-1 promoter (to mark AWC neurons). As shown in Figure 6I, we found that snif-1 is expressed faintly in many neurons, with strong expression in one of the two AWC neurons marked by odr-1::mCherry. This result suggests that SNIF-1 is expressed in an AWC neuron.

      We hope to perform GCaMP assays and further characterization of SNIF-1 in the future.

      Author response image 2.

      Chemotaxis index (CI) of WT, VSL2401, VSL2401 [AWCp::snif-1] and VSL2401 [snif-1p::snif-1] worms to IAA at 1:1000 dilution. Significant differences are indicated as **** P ≤ 0.0001, determined by one-way ANOVA followed by post hoc Dunnett’s multiple comparison test. Error bars indicate SEM (n≥15).

      Minor:

      Page 4. "These results suggested that worms can forage for diets enriched in specific EAA, leucine...." More precise at this stage would be to state "These results indicated that worms can forage for diets supplemented with specific EAA...".

      We have changed the statement in the revised manuscript.

      Page 5. "these findings suggested that worms not only rely on odors to choose between two bacteria but also to find leucine enriched bacteria" This statement is not clear to me and doesn't follow the data in Fig. S2. Preferred diets in odorant assays are the IAA-producing strains.

      Thanks for your comment. We have revised the manuscript to make it clear: “Altogether, these findings suggested that worms rely on odors to distinguish different bacteria and find leucine-enriched bacteria”. This statement summarizes all the data shown in Figure 1 and Figure S1.

      Page 5. Figure S2A provides nice and useful data that can be part of the main Figure 1.

      Thanks for the comment. We have incorporated the data from Figure S2A to main Figure 1.

      Reviewer #3 (Public review):

      Summary:

      The authors first tested whether EAA supplementation increases olfactory preference for bacterial food for a variety of bacterial strains. Of the EAAs, they found only leucine supplementation increased olfactory preference (within a bacterial strain), and only for 3 of the bacterial strains tested. Leucine itself was not found to be intrinsically attractive.

      They determined that leucine supplementation increases isoamyl alcohol (IAA) production in the 3 preferred bacterial strains. They identify the biochemical pathway that catabolizes leucine to IAA, showing that a required enzyme for this pathway is upregulated upon supplementation.

      Consistent with earlier studies, they find that the AWC olfactory neuron is primarily responsible for increased preference for IAA-producing bacteria.

      They tested volatile compounds produced by bacteria and identified by GC/MS, and found several to be attractive, most of which require AWC for the full effect. Adaptation assays were used to show that odorant levels produced by bacterial lawns were sufficient to induce olfactory adaptation, and adaptation to IAA reduced chemotaxis to leucine-supplemented lawns. They then showed that IAA attractiveness is conserved across wild strains, while other compounds are more variable, suggesting IAA is a principal foraging cue.

      Finally, using the CeNGEN database, they developed a list of candidate IAA receptors. Using behavioral tests, they show that mutation of srd-12 (snif-1) greatly impairs IAA chemotaxis without affecting locomotion or attraction to another AWC-sensed odor, PEA.

      Comments

      This study will be of great interest in the field of C. elegans behavior, chemical senses and chemical ecology, and understanding of the sensory biology of foraging.

      Strengths:

      The identification of a receptor for IAA is an excellent finding. The combination of microbial metabolic chemistry and the use of natural bacteria and nematode strains makes an extremely compelling case for the ecological and adaptive relevance of the findings.

      Weaknesses:

      AWC receives synaptic input from other chemosensory neurons, and thus could potentially mediate navigation behaviors to compounds detected in whole or in part by those neurons. Language concluding detection by AWC should be moderated (e.g. p9 "worms sense an extensive repertoire...predominantly using AWC") unless it has been demonstrated.

      Thanks for your comment. We have modified the manuscript to incorporate the suggestion.

      srd-12 (snif-1) is not exclusively expressed in AWC. Normally, cell-specific rescue or knockdown would be used to demonstrate function in a specific cell. The authors should provide such a demonstration or explain why they are confident srd-12 (snif-1) acts in AWC.

      Thanks for the comment. We have performed AWC-specific rescue of snif-1 in mutant worms. As shown in Figure 6H, we found that AWC neuron-specific rescue completely recovered the chemotaxis defect of the snif-1 mutant (referred to as VSL2401) for IAA. In addition, snif-1 is expressed in one of the AWC neurons.

      A comparison of AWC's physiological responses between WT and srd-12 (snif-1), preferably in an unc-13 background, would be nice. Even further, expressing srd-12 (snif-1) in a different neuron type and showing that it confers responsiveness to IAA (in this case, inhibition) would be very convincing.

      Thanks for the suggestion. We have performed a receptor swap experiment, where snif-1 is misexpressed in AWB neurons. We find that these worms show slight but significant repulsion to IAA compared to WT and snif-1 mutant worms (Author response image 1).

      Recommendations for the authors:

      Reviewing Editor:

      Please consider all of the reviewer comments. In particular, as noted in the individual reviews, the strength of the evidence would be bolstered by additional experiments to demonstrate that the iLvE enzyme affects IAA levels in the preferred bacteria. The reviewers note that the authors haven't shown that IAA production is a reflection of leucine content. Are the non-preferred bacteria low on leucine or lack iLvE or IAA synthesis pathways? Further, more direct evidence that SRD-12 (SNIF-1) is in fact the primary IAA receptor would further strengthen the study. The authors should also be aware that geographic distance for wild isolate C. elegans may not directly correlate with phylogenetic distance. This should be assessed/discussed for the strains used.

      Thanks for the suggestions. Some of these have been addressed in response to reviewers. Thanks for your comments about the possible disconnect between geographical and phylogenetic distances amongst the natural isolates used here.

      By analyzing the phylogenetic tree generated using the neighbor-joining algorithm available in the CaeNDR database, we found that QX1211 and JU3226 are phylogenetically close, but the remaining isolates fall under different clades separated by long phylogenetic distances [7,8].

      Reviewer #1 (Recommendations for the authors):

      (1) In the first sentence of the third paragraph of the introduction, C. elegans are described as "soil-dwelling." Although C. elegans has been described as soil-dwelling in the past, current research indicates they are most often found on rotten fruit, compost heaps and other bacteria-rich environments, not soil. "All Caenorhabditis species are colonizers of nutrient- and bacteria-rich substrates and none of them is a true soil nematode." from Kiontke, K. and Sudhaus, W. Ecology of Caenorhabditis species (WormBook).

      Your specific comment about C. elegans’ habitat is well received. However, in that sentence we are referring to the chemosensory system of soil-dwelling animals in general, and not particularly C. elegans.

      (2) Figure 3K, the model would be clearer if leucine-rich diet -> volatile chemicals ->AWC (instead of leucine-rich diet -> AWC <- volatile chemicals). The leucine-rich diet results in the production of volatile chemicals which are detected by AWC.

      We have modified the figure to make it clearer.

      (3) Figure 4 - it would help to include a table summarizing the volatile chemicals that each bacterium releases. Then the reader could more easily evaluate whether the adaptation to each specific odor is consistent with the change in preference for the specific bacteria based on what it releases in its headspace. In addition, it would help to clarify whether the bacteria in the Figure 4 experiments were cultured with or without leucine supplementation.

      Table S2 summarizes the odors released by all the bacteria under +LEU and -LEU conditions.

      In Figure 4, adaptation was performed with odors of bacteria cultured under leucine-unsupplemented conditions.

      Reviewer #2 (Recommendations for the authors):

      Page 9. Previous studies e.g. Bargmann Hartwieg and Horvitz have shown IAA is sensed by the AWC. Would be good to cite appropriately.

      Thanks for the comment. The reference has been cited at p9 and p16.

      References:

      (1) Yuan, J., Mishra, P., and Ching, C.B. (2017). Engineering the leucine biosynthetic pathway for isoamyl alcohol overproduction in Saccharomyces cerevisiae. Journal of Industrial Microbiology and Biotechnology 44, 107-117. 10.1007/s10295-016-1855-2.

      (2) Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y., and Ishiguro-Watanabe, M. (2025). KEGG: biological systems database as a model of the real world. Nucleic Acids Res 53, D672-D677. 10.1093/nar/gkae909.

      (3) Shingai, R., Wakabayashi, T., Sakata, K., and Matsuura, T. (2005). Chemotaxis of Caenorhabditis elegans during simultaneous presentation of two water-soluble attractants, L-lysine and chloride ions. Comparative Biochemistry and Physiology. Part A, Molecular & Integrative Physiology 142, 308-317. 10.1016/j.cbpa.2005.07.010.

      (4) Dirksen, P., Assié, A., Zimmermann, J., Zhang, F., Tietje, A.M., Marsh, S.A., Félix, M.A., Shapira, M., Kaleta, C., Schulenburg, H., and Samuel, B.S. (2020). CeMbio - The Caenorhabditis elegans Microbiome Resource. G3 (Bethesda, Md.) 10, 3025-3039. 10.1534/g3.120.401309.

      (5) Edwards, C., Canfield, J., Copes, N., Brito, A., Rehan, M., Lipps, D., Brunquell, J., Westerheide, S.D., and Bradshaw, P.C. (2015). Mechanisms of amino acid-mediated lifespan extension in Caenorhabditis elegans. BMC genetics 16, 8. 10.1186/s12863-015-0167-2.

      (6) Wang, H., Wang, J., and Zhang, Z. (2018). Leucine Exerts Lifespan Extension and Improvement in Three Types of Stress Resistance (Thermotolerance, Anti-Oxidation and Anti-UV Irradiation) in C. elegans. Journal of Food and Nutrition Research 6, 665-673.

      (7) Crombie, T.A., McKeown, R., Moya, N.D., Evans, Kathryn S., Widmayer, Samuel J., LaGrassa, V., Roman, N., Tursunova, O., Zhang, G., Gibson, Sophia B., et al. (2023). CaeNDR, the Caenorhabditis Natural Diversity Resource. Nucleic Acids Research 52, D850-D858. 10.1093/nar/gkad887.

      (8) Cook, D.E., Zdraljevic, S., Roberts, J.P., and Andersen, E.C. (2017). CeNDR, the Caenorhabditis elegans natural diversity resource. Nucleic Acids Res 45, D650-D657. 10.1093/nar/gkw893.

    1. eLife Assessment

      This important study characterized and identified clonal MSC populations from human synovium. The authors provide convincing evidence that clonal MSC populations can be isolated and expanded from both normal and osteoarthritic synovium and that CD47 represents a potential marker for improved chondrogenic potential of MSC sub-populations. These findings could provide new avenues for osteoarthritis treatment in the future and deeper mechanistic understanding of the factors involved in the repair.

    2. Reviewer #1 (Public review):

      Summary:

      This work by Al-Jezani et al. focused on characterizing clonally derived MSC populations from the synovium of normal and osteoarthritis (OA) patients. This included characterizing the cell surface marker expression in situ (at time of isolation), as well as after in vitro expansion. The group also tried to correlate marker expression with trilineage differentiation potential. They also tested the efficacy of the different sub-populations in repairing cartilage in a rat model of OA. The main finding of the study is that CD47hi MSCs may have a greater capacity to repair cartilage than CD47lo MSCs, suggesting that CD47 may be a novel marker of human MSCs that have enhanced chondrogenic potential.

      Strengths:

      Studies on cell characterization of the different clonal populations isolated indicate that the MSCs are heterogeneous and traditional cell surface markers for MSCs do not accurately predict the differentiation potential of MSCs. While this has been previously established in the field of MSC therapy, the authors did attempt to characterize clones derived from single cells, as well as evaluate the marker profile at the time of isolation. While the outcome of heterogeneity is not surprising, the methods used to isolate and characterize the cells were well developed. The interesting finding of the study is the identification of CD47 as a potential MSC marker that could be related to chondrogenic potential. The authors suggest that MSCs with high CD47 repaired cartilage more effectively than MSCs with low CD47 in a rat OA model.

      Comments on revisions:

      Thank you for addressing the comments from the first review. No additional revisions.

    3. Reviewer #2 (Public review):

      Summary:

      This is a compelling study that systematically characterized and identified clonal MSC populations derived from normal and osteoarthritis human synovium. There is immense growth in the focus on synovial-derived progenitors in the context of both disease mechanisms and potential treatment approaches, and the authors sought to understand the regenerative potential of synovial-derived MSCs.

      Strengths:

      This study has multiple strengths. MSC cultures were established from an impressive number of human subjects, and rigorous cell surface protein analyses were conducted, at both pre-culture and post-culture timepoints. In vivo experiments using a rat DMM model showed beneficial therapeutic effects of MSCs vs non-MSCs, with compelling data demonstrating that only "real" MSC clones incorporate into cartilage repair tissue and express Prg4. Proteomics analysis was performed to characterize non-MSC vs MSC cultures, and high CD47 expression was identified as a marker for MSC. Injection of CD47-Hi vs CD47-Low cells in the same rat DMM model also demonstrated beneficial effects, albeit only based on histology. A major strength of these studies is the direct translational opportunity for novel MSC-based therapeutic interventions, with high potential for a "personalized medicine" approach.

      Weaknesses:

      Weaknesses of this study include the rather cursory assessment of the OA phenotype in the rat model, confined entirely to histology (i.e. no microCT, no pain/behavioral assessments, no molecular readouts). This is relevant given the mixed results in therapeutic experiments demonstrating lower OA scores, but not lower inflammation scores, in CD47-Hi-treated rats. Thus, future work should focus on characterizing the therapeutic mechanism further given the clinical relevance of inflammation and pain in OA. It is somewhat unclear how the authors converged on CD47 vs other factors, but despite its somewhat broad profile, it was shown to be a useful marker to differentiate functional effects of MSCs. Additional work is needed to understand whether MSCs also engraft in ectopic cartilage (in the context of osteophyte/chondrophyte formation) or whether their effects are limited to articular cartilage. Despite these areas for improvement, this is a strong paper with a high degree of rigor, and the results are compelling, timely, and important.

      Overall, the authors achieved their aims, and the results support not just the therapeutic value of clonally-isolated synovial MSCs but also the immense heterogeneity in stromal cell populations (containing true MSCs and non-MSCs) that must be investigated further. Of note, the authors employed the ISCT criteria to characterize MSCs, with mixed results in pre-culture and post-culture assessments. This work is likely to have a long-term impact on methodologies used to culture and study MSCs, in addition to advancing the field's knowledge about how synovial-derived progenitors contribute to cartilage repair in vivo.

      Comments on revisions:

      I commend the authors for a good revision. While the revision primarily entailed re-analysis or additional analysis of existing data, as well as text-based changes, it improved the clarity and completeness of the manuscript.

      I do encourage the authors to expand their phenotyping assessments in future studies given that the interaction between structural disease, inflammation, and pain is complex, and our understanding of how the two interact and affect each other is evolving. There are multiple recent publications that show that a therapeutic or knock-out is protective against cartilage damage but doesn't alleviate pain, or vice versa. Thus, as a field, understanding which therapies target which pathological manifestations is an important next step to advance treatments. I also look forward to the follow-up studies on the MSC's role in ectopic cartilage.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This work by Al-Jezani et al. focused on characterizing clonally derived MSC populations from the synovium of normal and osteoarthritis (OA) patients. This included characterizing the cell surface marker expression in situ (at time of isolation), as well as after in vitro expansion. The group also tried to correlate marker expression with trilineage differentiation potential. They also tested the efficacy of the different subpopulations in repairing cartilage in a rat model of OA. The main finding of the study is that CD47hi MSCs may have a greater capacity to repair cartilage than CD47lo MSCs, suggesting that CD47 may be a novel marker of human MSCs that have enhanced chondrogenic potential.

      Strengths: 

      Studies on cell characterization of the different clonal populations isolated indicate that the MSCs are heterogeneous and traditional cell surface markers for MSCs do not accurately predict the differentiation potential of MSCs. While this has been previously established in the field of MSC therapy, the authors did attempt to characterize clones derived from single cells, as well as evaluate the marker profile at the time of isolation. While the outcome of heterogeneity is not surprising, the methods used to isolate and characterize the cells were well developed. The interesting finding of the study is the identification of CD47 as a potential MSC marker that could be related to chondrogenic potential. The authors suggest that MSCs with high CD47 repaired cartilage more effectively than MSCs with low CD47 in a rat OA model.

      Weaknesses: 

      While the identification of CD47 as a novel MSC marker could be important to the field of cell therapy and cartilage regeneration, there was a lack of robust data to support the correlation of CD47 expression to chondrogenesis. The authors indicated that the proteomics suggested that the MSC subtype expressed significantly more CD47 than the non-MSC subtype. However, it was difficult to appreciate where this was shown. It would be helpful to clearly identify where in the figure this is shown, especially since it is the key result of the study. The authors were able to isolate CD47hi and CD47 low cells. While this is exciting, it was unclear how many cells could be isolated and whether they needed to be expanded before being used in vivo. Additional details for the CD47 studies would have strengthened the paper. Furthermore, the CD47hi cells were not thoroughly characterized in vitro, particularly for in vitro chondrogenesis. More importantly, the in vivo study where the CD47hi and CD47lo MSCs were injected into a rat model of OA lacked experimental details regarding how many cells were injected and how they were labeled. No representative histology was presented and there did not seem to be a statistically significant difference between the OARSI score of the saline injected and MSC injected groups. The repair tissue was stained for Sox9 expression, which is an important marker of chondrogenesis but does not show production of cartilage. Expression of Collagen Type II would be needed to more robustly claim that CD47 is a marker of MSCs with enhanced repair potential. 

      Reviewer #2 (Public review): 

      Summary: 

      This is a compelling study that systematically characterized and identified clonal MSC populations derived from normal and osteoarthritis human synovium. There is immense growth in the focus on synovial-derived progenitors in the context of both disease mechanisms and potential treatment approaches, and the authors sought to understand the regenerative potential of synovial-derived MSCs. 

      Strengths: 

      This study has multiple strengths. MSC cultures were established from an impressive number of human subjects, and rigorous cell surface protein analyses were conducted, at both pre-culture and post-culture timepoints. In vivo experiments using a rat DMM model showed beneficial therapeutic effects of MSCs vs non-MSCs, with compelling data demonstrating that only "real" MSC clones incorporate into cartilage repair tissue and express Prg4. Proteomics analysis was performed to characterize non-MSC vs MSC cultures, and high CD47 expression was identified as a marker for MSC. Injection of CD47-Hi vs CD47-Low cells in the same rat DMM model also demonstrated beneficial effects, albeit only based on histology. A major strength of these studies is the direct translational opportunity for novel MSC-based therapeutic interventions, with high potential for a "personalized medicine" approach. 

      Weaknesses: 

      Weaknesses of this study include the rather cursory assessment of the OA phenotype in the rat model, confined entirely to histology (i.e. no microCT, no pain/behavioral assessments, no molecular readouts). It is somewhat unclear how the authors converged on CD47 vs the other factors identified in the proteomics screen, and additional information is needed to understand whether true MSCs only engraft in articular cartilage or also in ectopic cartilage (in the context of osteophyte/chondrophyte formation). Some additional discussion and potential follow-up analyses focused on other cell surface markers recently described to identify synovial progenitors is also warranted. A conceptual weakness is the lack of discussion or consideration of the multiple recent studies demonstrating that DPP4+ PI16+ CD34+ stromal cells (i.e. the "universal fibroblasts") act as progenitors in all mesenchymal tissues, and their involvement in the joint is actively being investigated. Thus, it seems important to understand how the MSCs of the present study are related to these DPP4+ progenitors. Despite these areas for improvement, this is a strong paper with a high degree of rigor, and the results are compelling, timely, and important. 

      Overall, the authors achieved their aims, and the results support not just the therapeutic value of clonally-isolated synovial MSCs but also the immense heterogeneity in stromal cell populations (containing true MSCs and non-MSCs) that must be investigated further. Of note, the authors employed the ISCT criteria to characterize MSCs, with mixed results in pre-culture and post-culture assessments. This work is likely to have a long-term impact on methodologies used to culture and study MSCs, in addition to advancing the field's knowledge about how synovial-derived progenitors contribute to cartilage repair in vivo.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      In all figures, it would be beneficial to report the sample number used for the data reported. It is difficult to appreciate the statistical analysis without that information.

      Understood, the sample number and replicates have been added to each figure legend.

      Please check that Table S7 is part of the manuscript. It could not be found.

      It was added as an additional Excel file since it was too large to fit in the Word document.

      Lines 377-379 (Figure 2E): the authors write that rats receiving MSCs had a significantly lower OARSI and Krenn score vs. rats injected with non-MSCs. However, none of the bars indicating statistical significance run between these two groups. Please verify the text and figure.

      This has been corrected.

      The details surrounding the labeling of the cells with tdTomato were not presented in the methods. 

      This has been added to the methods.

      The fluorescent antibodies used should be listed and more details provided in the methods rather than a general statement that fluorescent antibodies were used.

      Our apologies, the clones and companies have been added.

      Additional information on the CD47 experiments (# cells, # animals) would have strengthened the study.

      This has been added to the methods and figure legend.

      Reviewer #2 (Recommendations for the authors): 

      My comments span minor corrections, requests for additional analyses, some suggestions for additional experiments, and requests for additional discussion of recent important studies. 

      Introduction: 

      The introduction is thorough and well-written. I recommend a brief discussion about the emerging evidence demonstrating that DPP4+ PI16+ CD34+ synovial cells, i.e. the "universal fibroblasts", act as stromal progenitors in development, homeostasis, and disease. Relevant osteoarthritis-related papers encompass human and mouse studies (PMIDs: 39375009, 38266107, 38477740, 36175067, 36414376).

      This has been added.

      Relatedly, as DPP4 is CD26 and therefore useful as a cell-surface antigen for flow cytometry, sorting, etc, it would be interesting to understand the relationship and similarities between the CD47-High cells identified in this study and the DPP4/PI16+ cells previously described. Do they overlap in phenotype/identity?

      We have added a new flow cytometry figure to address this question.

      Results: 

      Note typo on Line 311: "preformed" instead of "performed". Line 313: "prolife" instead of "profile".

      Thank you for catching these.

      The identified convergence of the cell surface marker profile of all normal and OA clones in culture is a highly intriguing result. Do the authors have stored aliquots of these cells to demonstrate whether this would also occur in soft substrate, i.e. low stiffness culture conditions? This could be done with standard dishes coated with bulk collagen or with commercially available low-stiffness dishes (1 kPa). This is relevant to multiple studies demonstrating the induction of a myofibroblast-like phenotype by stromal cells cultured on high-stiffness plastic or glass. This is also the experiment where assessment of DPP4/CD26 could be added, if possible.

      While we agree it would be interesting to determine the mechanism by which the cells' phenotypes converge, we would argue that it is outside the scope of the current manuscript. We have instead added a sentence to the discussion.

      Line 353 regarding the use of CD68 as a negative gate: can the authors please comment on why they employed CD68 here and not CD45? While monocytes/macs/myeloid cells are the most abundant immune cells in synovium, CD45 would more comprehensively exclude all immune cells.

      That is a fair point, and we really don’t have any reason to have picked CD68 over CD45. In our opinion either would be a fair negative marker to use based on the literature. 

      Fig 2, minor suggestion: consider adding "MSC vs non-MSC" on the experimental schematic to more comprehensively summarize the experiment. 

      This has been modified 

      Fig 2E should show all individual datapoints, not just bar graphs. 

      This has been modified

      Fig 2: Given the significant reduction in Krenn score in DMM-MSC injected knees compared to DMM-saline knees, Fig 2 should also show representative images of the synovial phenotype to demonstrate which aspects of synovial pathology were mitigated. Was the effect related to lining hyperplasia, subsynovial infiltrate, fibrosis, etc? Similarly, can the authors narrate which aspects of the OARSI score drove the treatment effect (proteoglycans vs structure vs osteophytes, etc). 

      We have added a new sup figure breaking down the Krenn score as well as higher magnification images of representative synovium.

      Fig 2: In the absence of microCT imaging, can the authors quantify subchondral bone morphometrics using multiple histological sections? The tibial subchondral bone in Fig 2D appears protected from sclerosis/thickening.

      Unfortunately, this is beyond what we are able to add to the manuscript.

      The Fig 3 results are highly compelling and interesting. Congratulations.

      Thank you very much.

      Fig 4A: the cell highlighted in the high-mag zoom box in Fig 4A appears to be localized within the joint capsule or patellar tendon (it is unclear which anatomic region this image represents). The highly aligned nature of the tissue and cells along a fibrillar geometry indicates that this is not synovium. The interface between synovium and the tissue in question can be clearly observed in this image. Please choose an image more representative of synovium.

      We completely agree with the reviewer's assessment. However, it is the synovium that overlays this tissue (Fig 4A arrow). We are simply showing that there were very few MSCs that took up residence in the synovium or the adjacent tissues.

      Fig 4C and F: please show individual data points.

      This has been added

      Fig 5D: I see DPP4 and ITGA5 were also hits in the proteomics analysis, which is intriguing. Besides my comments/suggestions regarding DPP4 above, please note this recent paper identifying an ITGA5+ synovial fibroblast subset that orchestrates pathological crosstalk with lymphocytes in RA, PMID: 39486872.

      Thank you for the information. We have added the reference in the results section. 

      Fig 5B-D: How did the authors converge on CD47 as the target for follow-up study? It does not appear to be a differentially-expressed protein based on the Volcano plot in Fig 5B, and it's unclear why it is a more important factor than any of the other proteins shown in the network diagram in Fig 5D, e.g. CTSL, ITGA5, DPP4. Can the authors add a quantitative plot supporting their statement "the MSC sub-type expressed significantly more CD47 than the non-MSCs" on Line 458? 

      We have rewritten this line. It was incorrect to discuss the amount of CD47. That was shown later with the flow analysis.

      Fig 6D: Please show individual data points and also representative histology images to demonstrate the nature of the phenotypic effect.

      This has been added. 

      Fig 6E-F: In what anatomic region are these images? Please add anatomic markers to clarify the location and allow the reader to interpret whether this is articular cartilage or ectopic cartilage

      We have redone the figure to show the area as requested.

      Relevant to this, do the authors observe this type of cellular engraftment in ectopic cartilage/osteophytes or only in articular cartilage? Understanding the contribution of these cells to the formation/remodeling of various cartilage types in the context of OA is a critical aspect of this line of investigation.

      We didn’t see any contribution of these cells to ectopic cartilage formation and are actively working on a follow-up study discussing this point specifically.

      Discussion: 

      Besides my comments regarding DPP4 and ITGA5 above, the authors may also consider discussing PMID: 37681409 (JCI Insight 2023), which demonstrates that adult Prg4+ progenitors derived from synovium contribute to articular cartilage repair in vivo. 

      We agree that there are numerous markers we could look at in future studies and that other people in the field are actively studying.

    1. eLife Assessment

      This important study shows that a controlled pause in gene reading is required for early heart cells to form during development. The authors demonstrate that loss of this pause prevents the proper activation of the heart-producing program across animal and stem cell systems. The evidence is compelling, supported by careful genomic and functional analyses that clearly define the developmental block. Overall, this work will interest developmental biologists and inspire further studies on the origins of early heart defects.

    2. Reviewer #1 (Public review):

      This is a highly original and impactful study that significantly advances our understanding of transcriptional regulation, in particular RNAPII pausing, during early heart development. The Chen lab has a long history of producing influential studies in cardiac morphogenesis, and this manuscript represents another thorough and mechanistically insightful contribution. The authors have thoroughly addressed this Reviewer's concerns and incorporated all of my suggestions in the revised manuscript. In addition, their responses to the other reviewer's comments are also very clear. As it is, this work is of great interest to the readership of eLife, as well as to the general scientific community.

      The authors reveal a fundamentally new role for Rtf1, a component of the PAF1 complex, in governing promoter-proximal RNAPII pausing in the context of myocardial lineage specification. While transcriptional pausing has been implicated in stress responses and inducible gene programs, its developmental relevance has remained poorly defined. This study fills that gap with rigorous in vivo evidence demonstrating that Rtf1-dependent pausing is indispensable for activating the cardiac gene program from the lateral plate mesoderm.

      Importantly, the study also provides compelling therapeutic implications. Showing that CDK9 inhibition, using either flavopiridol or targeted knockdown, can restore promoter-proximal pausing and rescue cardiomyocyte formation in Rtf1-deficient embryos suggests that modulation of pause-release kinetics may represent a new avenue for correcting transcriptionally driven congenital heart defects. Given that many CDK inhibitors are clinically approved or in active development, this connection significantly elevates the translational impact of the findings.

      In sum, this study is rigorous, innovative, and transformative in its implications for developmental biology and cardiac medicine. I strongly support its publication.

    3. Reviewer #2 (Public review):

      Summary:

      Langenbacher et al. examine the requirement of Rtf1, a component of the PAF1 complex (PAF1C), which regulates transcriptional pausing in cardiac development. The authors first confirm that newly generated rtf1 mutant alleles recapitulate the defects in cardiac progenitor differentiation found using morpholinos from their previous work. The authors then show that conditional loss of Rtf1 in mouse embryos and depletion in mouse ESCs both demonstrate a failure to turn on cardiac progenitor and differentiation marker genes, supporting conservation of Rtf1 in promoting vertebrate cardiac progenitor development. The authors then employ bulk RNA-seq on flow-sorted hand2:GFP+ cells and multiomic single-cell RNA-seq on whole Rtf1-depleted zebrafish embryos at the 10-12 somite stage. These experiments corroborate that gene expression associated with cardiac progenitor differentiation is lost. Furthermore, analysis of differentiation trajectories suggests that the expression of genes associated with cardiac, blood, and endothelial progenitor differentiation is not initiated within the anterior lateral plate mesoderm. Structure-function analysis supports that the Rtf1 Plus3 domain is necessary for its function in promoting cardiac progenitor differentiation. ChIP-seq for RNA Pol II on 10-12 somite stage zebrafish embryos supports that Rtf1 is required for proper promoter pausing at the transcriptional start site. The transcriptional promoter pausing defect and cardiac differentiation can partially be rescued in zebrafish rtf1 mutants through pharmacological inhibition and depletion of Cdk9, a kinase that promotes elongation. Thus, the authors have provided a clear analysis of the requirements and basic mechanism that Rtf1 employs in regulating cardiac progenitor development.

      Strengths and weaknesses:

      Overall, the data presented are strong and the message of the study is clear. The conclusions that Rtf1 is required for transcriptional pause release and promotes vertebrate cardiac progenitor differentiation are supported. Areas of strength include the complementary approaches in zebrafish and mouse embryos, and mouse embryonic stem cells, which together support the conserved requirement for Rtf1 in promoting cardiac differentiation. The bulk and single-cell RNA-sequencing analyses provide further support for this model via examining broader gene expression. In particular, the pseudotime analysis bolsters the conclusion that there is a broader effect on differentiation of anterior lateral plate mesoderm derivatives. The structure-function analysis provides a relatively clean demonstration of the requirement of the Rtf1 Plus3 domain. The pharmacological and depletion epistasis of Cdk9 combined with the RNA Pol II ChIP-seq nicely support the mechanism implicating Cdk9 in the Rtf1-dependent RNA Pol II promoter pausing. Additionally, this is a revised manuscript. The authors were overall responsive to the previous critiques. The new analysis and revisions have helped to strengthen their hypothesis and improve the clarity of their study. While the revised manuscript is significantly improved, the lack of deeper analysis of the multiomic data still represents a lost opportunity to provide further insight into Rtf1 mechanisms within this study. However, the authors have nevertheless achieved their goal for this study. The data sets reported will also be useful tools for further analysis and integration by the cardiovascular development community. Thus, the study will be of interest to scientists studying cardiovascular development and those broadly interested in epigenetic regulation controlling vertebrate development.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary:

      The manuscript submitted by Langenbacher et al., entitled "Rtf1-dependent transcriptional pausing regulates cardiogenesis", describes very interesting and highly impactful observations about the function of Rtf1 in cardiac development. Over the last few years, the Chen lab has published novel insights into the genes involved in cardiac morphogenesis. Here, they used the mouse model, the zebrafish model, cellular assays, single cell transcription, chemical inhibition, and pathway analysis to provide a comprehensive view of Rtf1 in RNAPII (Pol2) transcription pausing during cardiac development. They also conducted knockdown-rescue experiments to dissect the functions of Rtf1 domains.

      Strengths:

      The most interesting discovery is the connection between Rtf1 and CDK9 in regulating Pol2 pausing as an essential step in normal heart development. The design and execution of these experiments also demonstrate a thorough approach to revealing a previously underappreciated role of Pol2 transcription pausing in cardiac development. This study also highlights the potential amelioration of related cardiac deficiencies using small molecule inhibitors against cyclin-dependent kinases, many of which are already clinically approved, while many other specific inhibitors are at various preclinical stages of development for the treatment of other human diseases. Thus, this work is impactful and highly significant.

      We thank the reviewer for appreciating our work.

      Reviewer #2 (Public Review): 

      Summary: 

      Langenbacher et al. examine the requirement of Rtf1, a component of the PAF1C, which regulates transcriptional pausing in cardiac development. The authors first confirm their previous morphant study with newly generated rtf1 mutant alleles, which recapitulate the defects in cardiac progenitor and differentiation gene expression observed previously in morphants. They then examine the conservation of Rtf1 in mouse embryos and embryonic stem cell-derived cardiomyocytes. Conditional loss of Rtf1 in mesodermal lineages and depletion in murine ESCs demonstrates a failure to turn on cardiac progenitor and differentiation marker genes, supporting conservation of Rtf1 in promoting cardiac development. The authors subsequently employ bulk RNA-seq on flow-sorted hand2:GFP+ cells and multiomic single-cell RNA-seq on whole Rtf1-depleted embryos at the 10-12 somite stage. These experiments corroborate that genes associated with cardiac and muscle development are lost. Furthermore, the differentiation trajectories suggest that the expression of genes associated with cardiac maturation is not initiated. Structure-function analysis supports that the Plus3 domain is necessary for its function in promoting cardiac progenitor formation. ChIP-seq for RNA Pol II on 10-12 somite stage embryos suggests that Rtf1 is required for proper promoter pausing. This defect can be partially rescued in rtf1 mutants through use of a pharmacological inhibitor of Cdk9, which inhibits elongation.

      Strengths: 

      Many aspects of the data are strong and support the basic conclusions of the authors that Rtf1 is required for transcriptional pausing and has a conserved requirement in vertebrate cardiac development. Areas of strength include the genetic data supporting the conserved requirement for Rtf1 in promoting cardiac development, the complementary bulk and single-cell RNA-sequencing approaches providing some insight into the gene expression changes of the cardiac progenitors, the structure-function analysis supporting the requirement of the Plus3 domain, and the pharmacological epistasis combined with the RNA Pol II ChIP-seq, supporting the mechanism implicating Cdk9 in the Rtf1-dependent mechanism of RNA Pol II pausing.

      We thank the reviewer for the summary and for recognizing many strengths of our work. 

      Weaknesses: 

      While most of the basic conclusions are supported by the data, there are a number of analyses where it is confusing why the authors chose to perform the experiments the way they did, and some places where the data presently do not support the interpretations. One of the conclusions is that the phenotype affects the maturation of the cardiomyocytes and they are arresting in an immature state. However, this seems to be mostly derived from picking a few candidates from the single cell data in Fig. 6. If that were the case, wouldn't the expectation be to observe relatively normal expression of earlier marker genes required for specification, such as Nkx2.5 and Gata5/6? The in situ expression analysis from fish and mice (Fig. 2 and Fig. 3) and bulk RNA-seq (Fig. 5) seems to suggest that there are pretty early specification and differentiation defects. While some genes associated with cardiac development are not changed, many of these are not specific to cardiomyocyte progenitors and are expressed broadly throughout the ALPM. Similarly, it is not clear why a consistent set of cardiac progenitor genes (for instance mef2ca, nkx2.5, and tbx20) was not analyzed for all the experiments, in particular with the single cell analysis.

      A major conclusion of our study is that Rtf1 deficiency impairs myocardial lineage differentiation from mesoderm, as suggested by the reviewer. Thus, the main goal of this study is to understand how Rtf1 drives cardiac differentiation from the LPM, rather than the maturation of cardiomyocytes.  Multiple lines of evidence support this conclusion:

      (a) In situ hybridization showed that Rtf1 mutant embryos do not have nkx2.5+ cardiac progenitor cells and subsequently fail to produce cardiomyocytes (Figs. 2, 3).

      (b) RT-PCR analysis showed that knockdown of Rtf1 in mouse embryonic stem cells causes a dramatic reduction of cardiac gene expression and production of significantly fewer beating patches (Fig.4).

      (c) Bulk RNA sequencing revealed significant downregulation of cardiac lineage genes, including nkx2.5 (Fig. 5).

      (d) Single cell RNA sequencing clearly showed that lateral plate mesoderm (LPM) cells are significantly more abundant in Rtf1 morphants, whereas cardiac progenitors are less abundant (Fig. 6 and Fig. 6 Supplement 1-5).

      When feasible, we used cardiac lineage restricted markers in our assays. Nkx2.5 and tbx5a are not highlighted in the single cell analysis because their expression in our sc-seq dataset was too low to examine in the clustering/trajectory analysis.  In this revised manuscript, we provide violin plots showing the low expression levels of these genes in single cells from Rtf1 deficient embryos (Figure 6 Supplement 5).

      The point of the multiomic analysis is confusing. RNA- and ATAC-seq were apparently done at the same time. Yet, the focus of the analysis that is presented is on a small part of the RNA-seq data. This data set could have been more thoroughly analyzed, particularly in light of how chromatin changes may be associated with the transcriptional pausing. This seems to be a lost opportunity. Additionally, how the single cell data is covered in Supplemental Fig. 2 and 3 is confusing. There is no indication of what the different clusters are in the Figure or the legend.

      In this study, we performed single cell multiome analysis and used both scRNAseq and scATACseq datasets to generate reliable clustering.  The scRNAseq analysis reveals how Rtf1 deficiency impacts cardiac differentiation from mesoderm, which inspired us to investigate the underlying mechanism and led to the discovery of defects in Rtf1-dependent transcriptional pause release.

      We agree with the reviewer that deep examination of Rtf1-dependent chromatin changes would provide additional insights into how Rtf1 influences early development and careful examination of the scATACseq dataset is certainly a good future direction.  

      In this revised manuscript, we have revised Fig. 6 Supplement 1 to include the predicted cell types and provide an additional Excel file showing the annotation of all 39 clusters (Supplementary Table 2).

      While the effect of Rtf1 loss on cardiomyocyte markers is certainly dramatic, it is not clear how well the mutant fish have been analyzed and how specific the effect is to this population. It is interpreted that the effects on cardiomyocytes are not due to "transfating" of other cell fates, yet supplemental Fig. 4 shows numerous effects on potentially adjacent cell populations. Minimally, additional data needs to be provided showing the live fish at these stages and marker analysis to support these statements. In some images, it is not clear the embryos are the same stage (one can see pigmentation in the eyes of controls that is not in the mutants/morphants), causing some concern about developmental delay in the mutants.

      Single cell RNA sequencing showed an increased abundance of LPM cells and a reduced abundance of cardiac progenitors in Rtf1 morphants (Fig. 6 and Fig. 6 Supplement 1-5). The reclustering of anterior lateral plate mesoderm (ALPM) cells and their derivatives further showed that cells representing undifferentiated ALPM were increased whereas cells representing all three ALPM derivatives were reduced. These findings indicate a defect in ALPM differentiation.

      The reviewer questioned whether we examined stage-matched embryos. In our assay, Rtf1 mutant embryos were collected from crosses of Rtf1 heterozygotes. Each clutch from these crosses consists of ¼ embryos showing rtf1 mutant phenotypes and ¾ embryos showing wild type phenotypes which were used as control. Mutants and their wild type siblings were fixed or analyzed at the same time.

      The reviewer questioned the specificity of the Rtf1 deficient cardiac phenotype and pointed out that Rtf1 mutant embryos do not have pigment cells around the eye. Rtf1 is a ubiquitously expressed transcriptional regulator. Previous studies in zebrafish have shown that Rtf1 deficiency significantly impacts embryonic development. Rtf1 deficiency causes severe defects in cardiac lineage and neural crest cell development; consequently, Rtf1 deficient embryos do not have cardiomyocytes or pigmentation (Langenbacher et al., 2011, Akanuma et al., 2007, and Jurynec et al., 2019). We now provide an image showing a 2-day-old Rtf1 mutant embryo and its wild type sibling to illustrate the cardiac, neural crest, and somitogenesis defects caused by loss of Rtf1 activity (Fig. 2 Supplement 1).

      With respect to the transcriptional pausing defects in the Rtf1 deficient embryos, it is not clear from the data how this effect relates to the expression of the cardiac markers. This could have been directly analyzed with some additional sequencing, such as PRO-seq, which would provide a direct analysis of transcriptional elongation.

      We showed that Rtf1 deficiency results in a nearly genome-wide decrease in promoter-proximal pausing and downregulation of cardiac markers. Attenuating transcriptional pause release could restore cardiomyocyte formation in Rtf1 deficient embryos. In this revised manuscript, we provide additional RNAseq data showing that the expression levels of critical cardiac development genes such as nkx2.5, tbx5a, tbx20, mef2ca, mef2cb, ttn.2, and ryr2b are significantly rescued. We agree with the reviewer that further analyses using the PRO-seq approach could provide additional insights, but it is beyond the scope of this manuscript.

      Some additional minor issues include the rationale that sequence conservation suggests an important requirement of a gene (line 137), for which there are many counterexamples; referencing figure panels out of order (Figs. 4, 7, and 8) as described in the text; and using the morphants for some experiments, such as the rescue, that could have been done in a blinded manner with the mutants.

      We have clarified the rationale in this revised manuscript and made the effort to reference figures in order.

      The reviewer commented that rescue experiments “could have been done in a blinded manner with the mutants”. This was indeed how the flavopiridol rescue and cdk9 knockdown experiments were carried out. Embryos from crosses of Rtf1 heterozygotes were collected, fixed after treatment and subjected to in situ hybridization. Embryos were then scored for cardiac phenotype and genotyped (Fig.8 d-g). Morpholino knockdown was used in genomic experiments because our characterization of rtf1 morphants showed that they faithfully recapitulate the rtf1 mutant phenotype during the timeframe of interest (Fig. 2).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      This reviewer has a few suggestions below, aimed at improving the clarity and impact of the current study. Once these items are addressed, the manuscript should be of interest to the eLife reader.

      Item 1. Strengthening the interaction between Rtf1 and CDK9 on Pol2 pausing.

      The authors have convincingly shown that the chemical inhibition of CDK9 by flavopiridol can partially rescue the expression of cardiac genes in the zebrafish model. Although flavopiridol is FDA approved and has been a classical inhibitor for the dissection of CDK9 function, it also inhibits related CDKs (flavopiridol (alvocidib) competes with ATP to inhibit CDKs including CDK1, CDK2, CDK4, CDK6, and CDK9, with IC50 values in the 20-100 nM range). Therefore, this study could be more impactful if the authors can provide evidence on which of these CDKs may be most relevant during Rtf1-dependent cardiogenesis. Whether the observed cardiac defect indicates a preferential role for CDK9, or whether other CDKs may also be able to provide partial rescue, may be clarified using additional, more selective small molecules (e.g., BAY1251152 and LDC000067, which are commercially available).

      The reviewer raised a reasonable concern about the specificity of flavopiridol. We thank the reviewer for the insightful suggestion and share the concern about specificity. To address this question, we used an orthogonal test, morpholino knockdown that directly targets CDK9, and observed the same level of rescue, supporting a critical role of transcription pausing in cardiogenesis.

      Item 2. Differences between CRISPR lines and morphants 

      Much of the work presented used Rtf1 morphants while the authors have already generated 2 CRISPR lines. What is the difference between morphants and mutants? The authors should comment on the similarities and/or differences between using morphants or mutants in their study and whether the same Rtf1-CDK9 connection also occurs in the CRISPR lines.

      The morphology of our mutants (rtf1<sup>LA2678</sup> and rtf1<sup>LA2679</sup>) resembles the morphants and the previously reported ENU-induced rtf1<sup>KT641</sup> allele. Extensive in situ hybridization analysis showed that the morphants faithfully recapitulate the mutant phenotypes (Fig.2). We have performed rescue experiments (flavopiridol and CDK9 morpholino) using Rtf1 mutant embryos and found that inhibiting Cdk9 restores cardiomyocyte formation (Fig.8). 

      Item 3. Discuss the therapeutic relevance of study 

      The authors have already generated a mouse model of Rtf1 Mesp1-Cre knockout where cardiac muscle development is severely derailed (Fig 3B). Thus, a demonstration of a conserved role for CDK9 inhibitor in rescuing cardiogenesis using mouse cells or the mouse model will provide important information on a conserved pathway function relevant to mammalian heart development. In the Discussion, how this underlying mechanistic role may be useful in the treatment of congenital heart disease should be provided.  

      Thank you for the insight. We have incorporated your comments in the discussion. 

      Item 4. Insights into the role of CDK9-Rtf1 in response to stress versus in cardiogenesis. 

      In the Discussion, the authors commented on the role of additional stress-related stimuli such as heat shock and inflammation that have been linked to CDK9 activity. However, the current manuscript provides the first demonstration of an endogenous role of Pol2 pausing in a critical developmental step during normal cardiogenesis. The authors should emphasize the novelty and significance of their work by providing a paragraph on the state of knowledge on the molecular mechanisms governing cardiogenesis, then placing their discovery within this framework. This minor addition will also clarify the significance of this work to the broad readership of eLife.

      Thank you for the suggestion. We have incorporated your comments and elaborated on the novelty and significance of our work in the discussion.

      Reviewer #2 (Recommendations For The Authors): 

      (1) It is difficult to assess what the overt defects are in the embryos at any stage. Images of live embryos were not included in the supplement. Do these have a small, malformed heart tube later or are the embryos just deteriorating due to broad defects?

      The Rtf1 deficient embryos do not produce nkx2.5+ cardiac progenitors. Consequently, we never observed a heart tube or detected cells expressing cardiomyocyte marker genes such as myl7. This finding is consistent with previous reports using rtf1 morphants and rtf1<sup>KT641</sup>, an ENU-induced point mutation allele (Langenbacher et al., 2011 and Akanuma, 2007). In this revised manuscript, we provide a live image of 2-day-old wild type and rtf1<sup>LA2679/LA2679</sup> embryos (Fig. 2 Supplement 1). After two days, rtf1 mutant embryos undergo broad cell death.

      (2) Fig. 2, although the in situs are convincing, there is not a quantitative assessment of expression changes for these genes. This could have been done for the bulk or single cell RNA-seq experiments, but was not, and these genes were not included in the heat maps. A quantitative assessment of these genes would benefit the study.

      The top 40 most significantly differentially expressed genes are displayed in the heatmap presented in Fig. 5d. The complete differential gene expression analysis results for our hand2 FACS-based comparison of rtf1 morphants and controls are presented in Supplementary Data File 1. In this revised manuscript, we provide a new supplemental figure with violin plots showing the expression levels of genes of interest in our single cell sequencing dataset (Fig. 6 Supplement 5).

      (3) It does not appear that any statistical tests were used for the comparisons in Fig. 2.

      We now provide the statistical data in the legend and Fig.2 b, d, f, h and i.

      (4) It's not clear the magnifications and orientations of the embryos in Fig. 3b are the same. 

      Embryos shown in Fig.3b are at the same magnification. However, because Rtf1 mutant embryos display severe morphological defects, the orientation of mutant embryos was adjusted to examine the cardiac tissue.

      (5) The n's for analysis of MLC2v in WT Rtf1 CKO embryos in Fig. 3b are only 1. At least a few more embryos should be analyzed to confirm that the phenotype is consistent. 

      We have revised the figure and present the number of embryos analyzed and statistics in Fig.3c. 

      (6) A number of figure panels are referred to out of order in the text. Fig. 4E-G are before Fig. 4C, D, Fig. 7C before 7B, Fig. 8D-I before 8A, B. In general, it is easier for the reader if the figure panels are presented in the order they are referred to in the text.

      Revised as suggested.

      (7) While additional genes can be included, it is not clear why the same sets of genes are not examined in the bulk or single-cell RNA-seq as in the in situs, where expression was analyzed in embryos. I suggest including genes like nkx2.5, tbx20, and myl7 in all the sequencing analyses.

      We used the same set of genes in all analyses when possible. However, the low expression of genes such as nkx2.5 and myl7 in our sc-seq dataset precludes them from the clustering/trajectory analysis. In this revised manuscript, we present violin plots showing their expression in wild type and rtf1 morphants (Fig. 6 Supplement 5).

      (8) If a multiomic approach was used, why wasn't its analysis incorporated more into the manuscript? In general, a clearer presentation and deeper analysis of the single cell data would benefit the study. The integration of the RNA and ATAC would benefit the analysis.

      As addressed in our response to the reviewer’s public review, both datasets were used in clustering. Examining changes in chromatin accessibility is certainly interesting, but beyond the scope of this study. 

      (9) Many of the markers analyzed are not cardiac specific or it is not clear they are expressed in cardiac progenitors at the stage of the analysis. Hand2 has broader expression. Additional confirmation of some of the genes through in situ would help the interpretations. 

      Markers used for the in situ hybridization analysis (myl7, mef2ca, nkx2.5, tbx5a, and tbx20) are known for their critical role in heart development. For sc-seq trajectory analyses, most displayed genes (sema3e, bmp6, ttn.2, mef2cb, tnnt2a, ryr2b, and myh7bb) were identified based on their differential expression along the LPM-cardiac progenitor pseudotime trajectory. Rather than selecting genes based on their cardiac specificity, our goal was to examine the progressive gene expression changes associated with cardiac progenitor formation and compare gene expression of wild type and rtf1 deficient embryos.

      (10) Additional labels of the cell clusters are needed for Supplemental Figs. 2 and 3. 

      The cluster IDs were presented on Supplementary Figures 2 and 3. In this revised version, we added predicted cell types to the UMAP (revised Fig.6 Supplement 1) and provided an excel file with this information (revised Supplementary Table 2). 

      (11) On lines 101-102, the interpretation from the previous data is that differentiation of the LPM requires Rtf1. However, later from the single cell data the interpretation based on the markers is that Rtf1 loss affects maturation. However, it is not clear this interpretation is correct or what changed from the single cell data. If that were the case, one would expect to see maintenance of more early markers and subsequent loss of maturation markers, which does not appear to be the case from the presented data.

      Our data suggest that cardiac progenitor formation is not accomplished by simultaneously switching on all cardiac marker genes. Our pseudotime trajectory analysis highlights tnnt2a, ryr2b, and myh7bb as genes that increase in expression in a lagged manner compared to mef2cb (Fig. 6). Thus, the abnormal activation of mef2cb without subsequent upregulation of tnnt2a, ryr2b, and myh7bb in rtf1 morphants suggests a requirement for rtf1 in the progressive gene expression changes required for proper cardiac progenitor differentiation. Our single cell experiment focuses on the process of cardiac progenitor differentiation and does not provide insights into cardiomyocyte maturation. We have edited the text to clarify these interpretations.

      (12) The interpretation that there is not "transfating" is not supported by the shown data. Analysis of markers in other tissues, again with in situ, to show spatially would benefit the study. 

      As stated in our response to the reviewer’s public review, we observed a dramatic increase of ALPM cells, but a decrease of ALPM derivatives including the cardiac lineage. We did not observe the expansion of one ALPM-derived subpopulation at the expense of the others. These observations suggest a defect in ALPM differentiation and argue against the notion that the region of the ALPM that would normally give rise to cardiac progenitors is instead differentiating into another cell type.

      (13) The rationale that sequence conservation means a gene is important (lines 137-139) is not really true. There are a lot of examples of highly conserved genes whose mutants don't have defects.

      We have revised the text to avoid confusion. 

      (14) The data showing that the 8 bp mutations do not affect the RNA transcript is not shown or at least indicated in Fig. 7. It would seem that this experiment could have been done in the mutant embryos, in which case the experiment would have been semi-blinded as the genotyping would occur after imaging.

      The modified Rtf1 wt RNA (Rtf1 wt* in revised Fig. 7) robustly rescued nkx2.5 expression in rtf1 deficient embryos, demonstrating that the 8 bp modifications do not negatively impact the activity of the injected RNA. As stated previously, morpholino knockdown was used in some experiments because our characterization of rtf1 morphants showed that they faithfully recapitulate the rtf1 mutant phenotype during the timeframe of interest.

      (15) Using a technique like PRO-seq at the same stage as the ChIP-seq would complement the ChIP-seq and allow a more detailed analysis of the transcriptional pausing on specific genes observed in WT and mutant embryos. 

      As stated in our response to the reviewer’s public review, we appreciate the suggestion but PRO-seq is beyond the scope of this study.

    1. eLife Assessment

      In the gram-positive model organism Bacillus subtilis, the membrane-associated ParA family member MinD concentrates the division inhibitor MinC at cell poles, where it prevents aberrant division events. This important study presents compelling data suggesting that polar localization of MinCD is largely due to differences in diffusion rates between monomeric and dimeric MinD. This finding is exciting as it negates the necessity for a third localization determinant in this system, as has been proposed by previous investigations.

    2. Reviewer #1 (Public review):

      The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V) and ATP-bound dimer (D40A, hydrolysis defective), and compared to wild-type MinD. Overall, the experimental results support the conclusions that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but the binding affinities between monomers and dimers are similar.

      The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future.

    3. Reviewer #3 (Public review):

      This important study by Bohorquez et al examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole-to-pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis, which does not encode a MinE analog, has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles although MinJ may help stabilize the MinCD complex at those locations.

      [Editor's note: The editors and reviewers have no further comments and encourage the authors to proceed with a Version of Record.]

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Review:

      Reviewer #1 (Public review):

      The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V) and ATP-bound dimer (D40A, hydrolysis defective), and compared to wild-type MinD. Overall, the experimental results support the conclusions that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but the binding affinities between monomers and dimers are similar.

      The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future.  

      Reviewer #3 (Public review):

      This important study by Bohorquez et al examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole-to-pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis, which does not encode a MinE analog, has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles although MinJ may help stabilize the MinCD complex at those locations.

      Comments on revisions:  

      I believe the authors put respectable effort into revisions and addressing reviewer comments, particularly those that focused on the strengths of the original conclusions. The language in the current version of the manuscript is more precise and the overall product is stronger.

      We are happy to learn that the reviewer considers our manuscript ready for publication.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):  

      The authors have adequately answered the questions that were raised in my previous comments. Only a few minor revisions are needed for improvement.

      Line 48−49: 'These proteins ensure that cell division occurs at midcell and not close to nascent division sites or cell poles'  

      delete 'nascent division site'  

      This has now been corrected as suggested.

      Line 64−65: 'MinC inhibits polymerization of FtsZ by direct protein-protein interactions and needs to bind to the Walker A-type ATPase MinD for its recruitment to septa or the polar regions of the cell'

      delete 'septa or', because MinD recruits MinC to the cell poles to block polar division, not septal formation.  

      This has now been corrected as suggested.

      Supplemental information:

      Some parameters in Table S1 are missing definitions. If these parameters relate to terms described in the "Methods" section, please add the corresponding parameter symbols after the terms.  

      We would like to thank the reviewer for pointing this out. We have improved Table S1 and corrected the related parameters in the Methods section (lines 605-619).

    1. eLife Assessment

      Ge et al here report a structural study of the native tripartite multidrug efflux pump complexes from Escherichia coli that identifies a novel accessory subunit, YbjP, the structure of the native TolC-YbjP-AcrABZ complex, as well as structures of the AcrB protein in L, T, and O conformations. The strength of the structural data is compelling, and the importance of the findings is potentially fundamental. However, additional analysis and comparison with pre-existing data would help to put the obtained data and its impact in the proper context, and the inclusion of functional data would help to substantiate some claims that are currently incompletely supported.