  1. Jul 2025
    1. Reviewer #2 (Public review):

      Summary:

      This manuscript presents the JAX Animal Behavior System (JABS), an integrated mouse phenotyping platform that includes modules for data acquisition, behavior annotation, and behavior classifier training and sharing. The manuscript provides details and validation for each module, demonstrating JABS as a useful open-source behavior analysis tool that removes barriers to adopting these analysis techniques by the community. In particular, with the JABS-AI module, users can download and deploy previously trained classifiers on their own data, or annotate their own data and train their own classifiers. The JABS-AI module also allows users to deploy their classifiers on the JAX strain survey dataset and receive an automated behavior and genetic report.

      Strengths:

      (1) The JABS platform addresses the critical issue of reproducibility in mouse behavior studies by providing an end-to-end system from rig setup to downstream behavioral and genetic analyses. Each step has clear guidelines, and the GUIs are an excellent way to encourage best practices for data storage, annotation, and model training. Such a platform is especially helpful for labs without prior experience in this type of analysis.

      (2) A notable strength of the JABS platform is its reuse of large amounts of previously collected data at JAX Labs, condensing this into pretrained pose estimation models and behavioral classifiers. JABS-AI also provides access to the strain survey dataset through automated classifier analyses, allowing large-scale genetic screening based on simple behavioral classifiers. This has the potential to accelerate research for many labs by identifying particular strains of interest.

      (3) The ethograph analysis will be a useful way to compare annotators/classifiers beyond the JABS platform.

      Weaknesses:

      (1) The manuscript as written lacks much-needed context in multiple areas: what are the commercially available solutions, and how do they compare to JABS (at least in terms of features offered, not necessarily performance)? What are other open-source options? How does the supervised behavioral classification approach relate to the burgeoning field of unsupervised behavioral clustering (e.g., Keypoint-MoSeq, VAME, B-SOiD)? What kind of studies will this combination of open field + pose estimation + supervised classifier be suitable for? What kind of studies is it unsuited for? These are all relevant questions that potential users of this platform will be interested in.

      (2) Throughout the manuscript, I often find it unclear what is supported by the software/GUI and what is not. For example, does the GUI support uploading videos and running pose estimation, or does this need to be done separately? How many of the analyses in Figures 4-6 are accessible within the GUI?

      (3) While the manuscript does a good job of laying out best practices, there is an opportunity to further improve reproducibility for users of the platform. The software seems likely to perform well with perfect setups that adhere to the JABS criteria, but it is very likely that there will be users with suboptimal setups - poorly constructed rigs, insufficient camera quality, etc. It is important, in these cases, to give users feedback at each stage of the pipeline so they can understand if they have succeeded or not. Quality control (QC) metrics should be computed for raw video data (is the video too dark/bright? are there the expected number of frames? etc.), pose estimation outputs (do the tracked points maintain a reasonable skeleton structure; do they actually move around the arena?), and classifier outputs (what is the incidence rate of 1-3 frame behaviors? a high value could indicate issues). In cases where QC metrics are difficult to define (they are basically always difficult to define), diagnostic figures showing snippets of raw data or simple summary statistics (heatmaps of mouse location in the open field) could be utilized to allow users to catch glaring errors before proceeding to the next stage of the pipeline, or to remove data from their analyses if they observe critical issues.
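
      As an illustration of the classifier-output check suggested in (3), the sketch below computes the fraction of predicted bouts that last 1-3 frames. It is a minimal sketch and not part of JABS: the function names, the `max_len` parameter, and the example predictions are all hypothetical, and it assumes the classifier output is available as a per-frame sequence of 0/1 labels.

```python
from itertools import groupby


def bout_lengths(labels):
    """Return the length of each run of consecutive positive (1) frames."""
    return [sum(1 for _ in run) for value, run in groupby(labels) if value == 1]


def short_bout_fraction(labels, max_len=3):
    """Fraction of predicted bouts that last at most max_len frames."""
    lengths = bout_lengths(labels)
    if not lengths:
        return 0.0  # no bouts predicted, so nothing to flag
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Hypothetical per-frame classifier output (1 = behavior, 0 = no behavior).
predictions = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1]

print("bout lengths:", bout_lengths(predictions))
print("fraction of bouts <= 3 frames:", short_bout_fraction(predictions))
```

      What counts as a worrying fraction would depend on the behavior and the frame rate, so any threshold would need to be chosen per classifier.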

    1. eLife Assessment

      This carefully conducted study aims to understand how the early visual experience of premature infants induces lasting deficits, including compromised motion processing. The authors address this important question in a ferret animal model, exposing the developing visual system prematurely to patterned visual input by opening one or both eyes at a time when both retinal waves and light traveling through closed lids can drive sensory responses. Convincing evidence is presented, suggesting that eye opening at this time impacts temporal frequency tuning and elevates spontaneous firing rates. These findings will have great relevance for neuroscientists studying visual system development, particularly in the context of premature birth.

    2. Reviewer #1 (Public review):

      The authors note that very premature infants experience the visual world early and, as a consequence, sustain lasting deficits including compromised motion processing. Here they investigate the effects of early eye opening in ferret, choosing a time point after birth when both retinal waves and light traveling through closed lids drive sensory responses. The laboratory has long experience in quantitative studies of visual response properties across development and this study reflects their expertise.

      The investigators find little or no difference in mean orientation and direction selectivity, or in spatial frequency tuning, as a result of early eye opening but marked differences in temporal frequency tuning. These changes are especially interesting as they relate to deficits seen in prematurely delivered children. Temporal frequency bandwidths for responses evoked from early-opened contralateral eyes were broader than for controls; this is the case for animals in which either one or both eyes were opened prematurely. Further, when only one eye was opened early, responses to low temporal frequencies were relatively stronger.

      The investigators also found changes in firing rate and sign of response to visual stimuli. Premature eye-opening increased spontaneous rates in all test configurations. When only one eye was opened early, firing rates recorded from the ipsilateral cortex were strongly suppressed, with more modest effects in other test cases.

      As the authors' discussion notes, these observations are just a starting point for studies of the underlying mechanisms. The experiments are so difficult to perform and so carefully described that the results will be foundational for future studies of how premature birth influences cortical development.

    3. Reviewer #2 (Public review):

      In this paper, Griswold and Van Hooser investigate what happens if animals are exposed to patterned visual experience too early, before its natural onset. To this end, they make use of the benefits of the ferret as a well-established animal model for visual development. Ferrets naturally open their eyes around postnatal day 30; here, Griswold and Van Hooser opened either one or both eyes prematurely. Subsequent recordings in the mature primary visual cortex show that while some tuning properties like orientation and direction selectivity developed normally, the premature visual exposure triggered changes in temporal frequency tuning and overall firing rates. These changes were widespread, in that they occurred even for neurons responding to the eye that was not opened prematurely. These results demonstrate that the nature of the visual input well before eye opening can have profound consequences on the developing visual system.

      The conclusions of this paper are well supported by the data, but in the initially submitted version of the paper, there were a few questions regarding the data processing and suggestions for the discussion:

      (1) The assessment of the tuning properties is based on fits to the data. Presumably, neurons for which the fits were poor were excluded? It would be useful to know what the criteria were, how many neurons were excluded, and whether there was a significant difference between the groups in the numbers of neurons excluded (which could further point to differences between the groups).

      (2) For the temporal frequency data, low- and high-frequency cut-offs are defined, but then only used for the computation of the bandwidth. Given that the responses to low temporal frequencies change profoundly with premature eye opening, it would be useful to directly compare the low- and high-frequency cut-offs between groups, in addition to the index that is currently used.

      (3) In addition to the tuning functions and firing rates that have been analyzed so far, are there any differences in the temporal profiles of neural responses between the groups (sustained versus transient responses, rates of adaptation, latency)? If the temporal dynamics of the responses are altered significantly, that could be part of an explanation for the altered temporal tuning.

      (4) It would be beneficial for the general interpretation of the results to extend the discussion. First, it would be useful to provide a more detailed discussion of what type of visual information might make it through the closed eyelids (the natural state), in contrast to the structured information available through open eyes. Second, it would be useful to highlight more clearly that these data were collected in peripheral V1 by discussing what might be expected in binocular, more central V1 regions. Third, it would be interesting to discuss the observed changes in firing rates in the context of the development of inhibitory neurons in V1 (which still undergo significant changes through the time period of premature visual experience chosen here).

    1. eLife Assessment

      This important study uses long-term behavioural observations to understand the factors that influence female-on-female aggression in gorilla social groups. The evidence supporting the claims is convincing, as it includes novel methods of assessing aggression and considers other potential factors. The work will be of interest to a broad range of biologists working on the social interactions of animals.

    2. Joint Public Review:

      Summary:

      This work aims to improve our understanding of the factors that influence female-on-female aggressive interactions in gorilla social hierarchies, using 25 years of behavioural data from five wild groups of two gorilla species. Researchers analysed aggressive interactions between 31 adult females, using behavioural observations and dominance hierarchies inferred through Elo-rating methods. Aggression intensity (mild, moderate, severe) and direction (measured as the rank difference between aggressor and recipient) were used as key variables. A linear mixed-effects model was applied to evaluate how aggression direction varied with reproductive state (cycling, trimester-specific pregnancy, or lactation) and sex composition of the group. This study highlights the direction of aggressive interactions between females, with most interactions being directed from higher- to lower-ranking adult females close in social rank. However, the results show that 42% of these interactions are directed from lower- to higher-ranking females. In particular, lactating and pregnant females targeted higher-ranking individuals, which the authors suggest might be due to higher energetic needs, which increase risk-taking in lactating and pregnant females. Sex composition within the group also influenced which individuals were targeted. The authors suggest that male presence buffers female-on-female aggression, allowing females to target females ranked higher than themselves. In contrast, females targeted females ranked lower than themselves in groups with a higher ratio of females, which presumably poses a lower risk for the females since the pool of competitors is larger.

      The findings provide an important insight into aggression heuristics in primate social systems and the social and individual factors that influence these interactions, providing a deeper understanding of the evolutionary pressures that shape risk-taking, dominance maintenance, and the flexibility of social strategies in group-living species.

      The authors achieved their aim by demonstrating that aggression direction in female gorillas is influenced by factors such as reproductive condition and social context, and their results support the broader claim that aggression heuristics are flexible. However, some specific interpretations require further support. Despite this, the study makes a valuable contribution to the field of behavioural ecology by reframing how we think about intra-sexual competition and social rank maintenance in primates.

      Strengths:

      One of the study's major strengths is the use of an extensive dataset that compiles 25 years of behavioural data and 6871 aggressive interactions between 31 adult females in five social groups, which allows for a robust statistical analysis. This study uses a novel approach to the study of aggression in social groups by including factors such as the direction and intensity of aggressive interactions, which offers a comprehensive understanding of these complex social dynamics. In addition, this study incorporates ecological and physiological factors such as the reproductive state of the females and the sex composition of the group, which allows an integrative perspective on aggression within the broader context of body condition and social environment. The authors successfully integrate their results into broader evolutionary and ecological frameworks, enriching discussions around social hierarchies and risk sensitivity in primates and other animals.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors):

      Suggestions:

      Although this study has an impressive dataset, I felt that some parts of the discussion would benefit from further explanation, specifically when discussing the differences in female aggression direction between groups with different sex compositions. In the discussion, it is suggested that males buffer female-on-female aggression and that they 'support' lower-ranking females (see line 212); however, the study only tested the sex composition of the group and does not provide any evidence of this buffering. Thus, I would suggest adding more information on how this buffering or protection from males might manifest (for example, listing male behaviours that might showcase this protection) or referencing other studies that support this claim. Another example of this can be found in lines 223-224, which suggest that females choose lower-ranking individuals when they are presented with a larger pool of competitors; however, in lines 227-228, it's stated that this result contradicts previous work in baboons, which makes the previous claim seem unjustified. I recommend adding other examples from studies that support the results of this paper and adding a line that addresses possible reasons for these differences between gorillas and baboons (for example, different social dynamics or ecological constraints). In addition, I suggest the inclusion of physiological data such as direct measures of energy expenditure, caloric intake, or hormone levels, as it would strengthen the claims made in the second paragraph of the discussion. However, I understand this might not be possible due to data or time constraints, so I suggest adding a more robust justification of why lactation and pregnancy were used as a proxy for energetic need. In the methods (lines 127-128), it is unclear which phase of the pregnancy or lactation is more energetically demanding. I would also suggest adding a comment on the limitations of using reproductive state to infer energetic need.

      Lastly, if the data is available, I believe it would be interesting to add body size and age of the females or the size difference between aggressor and target as explanatory variables in the models to test if physiological characteristics influence female-on-female aggression.

      Male support:

      We have now added more references (Watts 1994, 1997) and enriched our arguments regarding male presence buffering aggression. Previous research suggests that male gorillas may support lower-ranking females and they may intervene in female-female conflicts (Sicotte 2002). Unfortunately, our dataset did not allow us to test for male protection. We conduct proximity scans every 10 minutes and these scans are not associated with each interaction, meaning that we cannot reliably test if proximity to a male influences the likelihood of receiving aggression.

      Number of competitors and choice of weaker competitors:

      We added a very relevant reference in humans, showing that people choose weaker competitors when they can choose. We removed the baboon example because it used sex ratio and its relevance to our study was not that straightforward.

      Reproductive state as a proxy for energetic needs:

      We now mention clearly that reproductive state is an indirect measure of energetic needs.

      We rephrased our methods to: “Lactation is often considered more energetically demanding than pregnancy as a whole but the latest stages of pregnancy are highly energetically demanding, potentially even more than lactation”

      Unfortunately, we do not have access to physiological and body size data. Regarding female age, for many females, ages are estimates with errors of up to a decade, and thus, we chose not to use them as a reliable predictor. Having accurate values for all these variables would indeed be very valuable and improve the predictive power of our study.

      Recommendations for writing and presentation:

      Overall, the manuscript is well-organised and well-written, but there are certain areas that could be improved in clarity. In the introduction, I believe that the term 'aggression heuristic' should be introduced earlier and properly defined in order to accommodate a broader audience. The main question and aims of the study are not stated clearly in the last paragraph of the introduction. In the methods, I think it would improve the clarity to add a table for the classification of each type of agonistic interaction instead of naming them in the text. For example, a table could showcase the three intensity categories (mild, moderate and severe) and then list each behaviour (e.g. hit, bite, attack, etc.) with a short description. I think this would be helpful since some of the behaviours mentioned can be confusing (what's the difference between attack, hit and fight?). In addition, line 104 states that all interactions were assigned equal intensity, which needs to be explained.

      We now define aggression heuristics in both the abstract and the first paragraph of the introduction. We have also explained the aggressive interactions whose nature was not obvious from their names. Hopefully, these explanations make clear the differences among the recorded behaviours.

      We have now specified that the “equal intensity” refers to avoidances and displacements used to infer power relationships: “We assigned to all avoidance/displacement interactions equal intensity, that is, equal influence to the power relationship of the interacting individuals”

      Minor corrections:

      (1) In line 41, there is a 1 after 'similar'. I am unsure if it's a mistake or a reference.

      We corrected the typo.

      (2) In lines 68-69, there is mention of other studies, but no references are provided.

      We added citations as suggested.

      (3) Remove the reference to Figure 1 (line 82) from the introduction; the figure should be referenced in the text just before the image, however, your figure is in a different section.

      We removed the reference as suggested.

      (4) Line 98 and 136, it's written 'ad libtum' but the correct spelling is 'ad libitum'.

      We corrected the typo.

      (5) Figure 3, remove the underscores between the words in the axis titles.

      We removed the underscores.

      Reviewer #2 (Recommendations for the authors):

      Here, I have outlined some specific suggestions that require attention. Addressing these comments will enhance the readability and quality of the manuscript.

      (1) L69. Add citation here, indicating the studies focusing on aggression rates.

      We added citations as suggested.

      (2) L88. The study periods used in this study and the authors' previous study (Reference 11) are different. So please add one table as Table 1 showing detailed information on the sampling effort and the data included in the analysis of this study. For example, the study period, the numbers of females and males, sampling hours, the number of avoidance/displacement behaviors used to calculate individual Elo-ratings, and the number of mild/moderate/severe aggressive interactions, etc.

      We have now added another table, as suggested (new Table 1) and we have also made clear that we used the hierarchies presented in detail in (Smit & Robbins 2025).

      (3) L103. If readers do not look over Reference 25 on purpose, they do not know what the authors want to talk about and why they mention the optimized Elo-rating method. Clarify this statement and add more content explaining the differences between the two methods, or just remove it.

      We rephrased the text and in response to the previous comment, we clearly state that there are more details about our approach in Smit & Robbins 2025. At the end of the relevant sentence, we added the following parenthesis “(see “traditional Elo rating method”; we do not use the “optimized Elo-rating method” as it yields similar results and it is not widely used)” and we removed the sentence referring to the optimized Elo-rating method.

      (4) L110. Here, the authors stated that the individual with the standardized Elo-score 1 was the highest-ranking. L117, the "aggression direction" score of each aggressive interaction was the standardized Elo-score of the aggressor, subtracting that of the recipient. So, when the "aggression direction" score was 1, it should mean that the aggressor was the highest-ranking and the recipient was the lowest-ranking female. This is not as the authors stated in L117-120 (where the description was incorrectly reversed). Please clarify.

      The highest-ranking individual indeed has an Elo_score equal to 1, and we calculated the interaction score (or "aggression direction score") of each aggressive interaction by subtracting the standardized Elo-score of the aggressor from that of the recipient (Elo_recipient – Elo_aggressor). So, when the aggressor is the lowest-ranking female (Elo_score=0) and the recipient is the highest-ranking female (Elo_score=1), the "aggression direction score" is 1-0 = 1.
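
      The sign convention described above can be illustrated with a short sketch. It assumes that standardization is a min-max rescaling of raw Elo scores to [0, 1], which is an assumption and may differ from the procedure actually used; the function names, individuals, and raw scores are hypothetical.

```python
def standardize(elo_scores):
    """Min-max rescale raw Elo scores to [0, 1] (assumed standardization).

    The highest-ranking individual gets 1 and the lowest-ranking gets 0.
    """
    low, high = min(elo_scores.values()), max(elo_scores.values())
    if high == low:
        raise ValueError("all individuals have the same Elo score")
    return {name: (score - low) / (high - low) for name, score in elo_scores.items()}


def aggression_direction(standardized, aggressor, recipient):
    """Interaction score: Elo_recipient - Elo_aggressor.

    Positive values mean aggression directed up the hierarchy,
    negative values mean aggression directed down the hierarchy.
    """
    return standardized[recipient] - standardized[aggressor]


# Hypothetical raw Elo scores for three females.
raw = {"A": 1200, "B": 1000, "C": 800}
scores = standardize(raw)

print(aggression_direction(scores, aggressor="C", recipient="A"))  # lowest targets highest
print(aggression_direction(scores, aggressor="A", recipient="B"))  # highest targets middle
```

      Under this convention the score ranges from -1 (highest-ranking female targeting the lowest-ranking one) to 1 (lowest-ranking female targeting the highest-ranking one).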

      (5) Regarding point 3 of the Public Review, please also revise/expand the paragraph L193-208 in the Discussion section accordingly.

      Please see our response to the public review. We have enriched the results section, added pairwise comparisons in a new table (Table 2) and modified the discussion accordingly.

      (6) Table 1. It's not clear why authors added the column 'Aggression Rate' but did not provide any explanation in the Methods/Results section. How did they calculate the correlation between each tested variable and the "overall adult female aggression rates"? Correlating the number of females in the first trimester of female pregnancy with the female aggression rates in each study group? What did the correlation coefficients mean? L202-204 may provide some hints as to why the authors introduced the Aggression Rate. But it should be made clear in the previous text.

      We now added more details in the legend of the table to make our point clear: “To highlight that aggression rates can increase due to increase in interactions of different score, we also include the effect of some of the tested variables on overall adult female aggression rates, based on results of linear mixed effects models from (Smit & Robbins 2024).”  We did not include detailed methods to calculate those results because they are detailed in (Smit & Robbins 2024). We find it valuable to show the results of both aggression rates and aggression directionality according to the same predictor variables as a means to clarify that aggression rates and aggression directionality are not always coordinated to one another (they do not always change in a consistent manner relative to one another).

      (7) L166.This is not rigorous. Please rephrase. There is only one western gorilla group containing only one resident male included in the analysis.

      We have toned down our text: “Our results did not show any significant difference between female-female aggression patterns within the one western and four mountain gorillas groups”

      (8) L167. I don't think the interaction scores in the third trimester of female pregnancy were significantly higher than those in the first trimester. The same concern applies in L194-195.

      We have now added a new table with post hoc pairwise comparisons among the different reproductive states that clarifies that.

      (9) L202. There is no column 'Aggression rates' in Table 1 of Reference 11.

      We have rephrased to make clear that we refer to Table 1 of the present study.

      (10) L204-205. Reference 49. Maybe not a proper citation here. This claim requires stronger evidence or further justification. Additionally, please rephrase and clarify the arguments in L204-208 for better readability and precision.

      We have added three more references and rephrased to clarify our argument.

      Reviewer #3 (Recommendations for the authors):

      (1) Line 41: The word "similar" is misspelled.

      We corrected the typo.

    1. eLife Assessment

      In this important study, the authors model reinforcement-learning experiments using a recurrent neural network. The work examines if the detailed credit assignment necessary for back-propagation through time can be replaced with random feedback. The authors provide solid evidence that the solution is adequate within relatively simple tasks.

    2. Reviewer #1 (Public review):

      Summary:

      Can a plastic RNN serve as a basis function for learning to estimate value? Previous work showed this to be the case, using an architecture similar to the one proposed here. The learning rule in that work was back-propagation with an objective function equal to the squared TD error (delta). Such a learning rule is non-local, because the changes in weights within the RNN, and from the inputs to the RNN, depend on the weights from the RNN to the output, which estimates value. In addition, these weights themselves change over learning. The main idea in this paper is to examine whether replacing the values of these non-local, changing weights, used for credit assignment, with fixed random weights can still produce results similar to those obtained with complete back-propagation. This random feedback approach is motivated by a similar approach used for deep feed-forward neural networks.

      This work shows that random feedback in credit assignment performs well, though not as well as the precise gradient-based approach. When more constraints motivated by biological plausibility are imposed, performance degrades. These results are consistent with previous results on random feedback.

      Strengths:

      The authors show that random feedback can closely approximate a model trained with detailed credit assignment.

      The authors simulate several experiments including some with probabilistic reward schedules and show results similar to those obtained with detailed credit assignments as well as in experiments.

      The paper examines the impact of more biologically realistic learning rules and the results are still quite similar to the detailed back-prop model.

    3. Reviewer #2 (Public review):

      Summary:

      Tsurumi et al. show that recurrent neural networks can learn state and value representations in simple reinforcement learning tasks when trained with random feedback weights. The traditional method of learning for recurrent networks in such tasks (backpropagation through time) requires feedback weights which are a transposed copy of the feed-forward weights, a biologically implausible assumption. This manuscript builds on previous work regarding "random feedback alignment" and "value-RNNs", and extends them to a reinforcement learning context. The authors also demonstrate that certain non-negative constraints can enforce a "loose alignment" of feedback weights. The authors' results suggest that random feedback may be a powerful tool for learning in biological networks, even in reinforcement learning tasks.

      Strengths:

      The authors describe well the issues regarding biologically plausible learning in recurrent networks and in reinforcement learning tasks. They take care to propose networks which might be implemented in biological systems and compare their proposed learning rules to those already existing in literature. Further, they use small networks on relatively simple tasks, which allows for easier intuition into the learning dynamics.

      Weaknesses:

      The principles discovered by the authors in these smaller networks are not applied to larger networks or more complicated tasks with long temporal delays (>100 timesteps), so it remains unclear to what degree these methods can scale or can be used more generally.

    4. Reviewer #3 (Public review):

      Summary:

      The paper studies learning rules in a simple sigmoidal recurrent neural network setting. The recurrent network has a single layer of 10 to 40 units. It is first confirmed that feedback alignment (FA) can learn a value function in this setting. Then so-called bio-plausible constraints are added: (1) the value weights (readout) are non-negative, (2) the activity is non-negative (normal sigmoid rather than downscaled between -0.5 and 0.5), (3) the feedback weights are non-negative, (4) the learning rule is revised to be monotonic: the weights are not downregulated. In the simple task considered, none of the four biological features appears to completely impair learning.

      Strengths:

      (1) The learning rules are implemented in a low-level fashion of the form (pre-synaptic activity) x (post-synaptic activity) x feedback x RPE, which is therefore interpretable in terms of quantities measurable in the wet lab.

      (2) I find that non-negative FA (FA with non-negative c and w) is the most valuable theoretical insight of this paper: I understand why the alignment between w and c is automatically better at initialization.

      (3) The task choice is relevant, since it connects with experimental settings of reward conditioning with possible plasticity measurements.
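
      The product form mentioned in strength (1) can be written out as a short sketch. This is a literal reading of that expression, not the authors' implementation: the function name, the learning rate `lr`, and all numerical values are hypothetical, and the manuscript's actual rule may use a different post-synaptic term.

```python
def random_feedback_update(pre, post, feedback, rpe, lr=0.1):
    """Weight change dW[i][j] = lr * rpe * feedback[i] * post[i] * pre[j].

    pre      -- activities of the pre-synaptic units (index j)
    post     -- activities of the post-synaptic units (index i)
    feedback -- fixed random feedback weights c, one per post-synaptic unit
    rpe      -- scalar reward prediction error (TD error)
    """
    return [
        [lr * rpe * feedback[i] * post[i] * pre[j] for j in range(len(pre))]
        for i in range(len(post))
    ]


# Hypothetical non-negative activities and feedback weights for two units.
pre = [0.2, 0.8]
post = [0.5, 1.0]
c = [1.0, 0.5]

dW_pos = random_feedback_update(pre, post, c, rpe=2.0)   # positive RPE
dW_neg = random_feedback_update(pre, post, c, rpe=-2.0)  # negative RPE
print(dW_pos)
print(dW_neg)
```

      With non-negative activities and feedback weights, every entry of the update takes the sign of the RPE, and each factor corresponds to a quantity that could in principle be measured.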

    1. eLife Assessment

      This work investigates ZC3H11A as a cause of high myopia through the analysis of human data and experiments with genetic knockout of Zc3h11a in mouse, providing a useful model of myopia. The evidence supporting the conclusion is still incomplete in the revised manuscript as the concerns raised in the previous review were not fully addressed. The article would benefit from a more robust genetic analysis and comprehensive presentation of human phenotypic data to clarify the modes of inheritance in the families, currently limited by loss of patient follow-up. It would also benefit from addressing whether there is a reduction in bipolar cell number or decreased marker protein expression, through cell counts or quantifiable, less saturated Western blots. The work will be of interest to ophthalmologists and researchers working on myopia.

    2. Reviewer #1 (Public Review):

      The authors reported that mutations were identified in the ZC3H11A gene in four adolescents from 1015 high myopia subjects in their myopia cohort. They further generated Zc3h11a knockout mice utilizing the CRISPR/Cas9 technology.

      The main claims are only partially supported. The reviewers still have the concerns of 1) the modes of inheritance for the families need to be shown; 2) the phenotype of heterozygous mutant mice is too weak; 3) the authors still have not addressed the biological question of whether there are fewer bipolar cells or decreased expression of the marker protein. This would involve counting cells, which they have not done. The blots they show do not appear to support their quantifications. Considering the sensitivity of quantifying nearly saturated blots, the authors should show blots that are not exposed to that level of saturation.

    1. eLife Assessment

      This important study provides convincing evidence that the Kinesin protein family member KIF7 regulates the development of the cerebral cortex and its connectivity and the specificity of Sonic Hedgehog signaling by controlling the details of Gli repressor vs activator functions. This study provides new insights into general aspects of cortical development.

    2. Reviewer #1 (Public review):

      Summary:

      This is an interesting follow-up to a paper published in Human Molecular Genetics reporting novel roles in corticogenesis of the Kif7 motor protein that can regulate the activator as well as the repressor functions of the Gli transcription factors in Shh signalling. This new work investigates how a null mutation in the Kif7 gene affects the formation of corticofugal and thalamocortical axon tracts and the migration of cortical interneurons. It demonstrates that Kif7 null mutant embryos present with ventriculomegaly and heterotopias as observed in patients carrying KIF7 mutations. The Kif7 mutation also disrupts the connectivity between cortex and thalamus and leads to an abnormal projection of thalamocortical axons. Moreover, cortical interneurons show migratory defects that are mirrored in cortical slices treated with the Shh inhibitor cyclopamine suggesting that the Kif7 mutation results in a down-regulation of Shh signalling. Interestingly, these defects are much less severe at later stages of corticogenesis.

      Strengths/weaknesses:

      The findings of this manuscript are clearly presented and are based on detailed analyses. Using a compelling set of experiments, especially the live imaging to monitor interneuron migration, the authors convincingly investigate Kif7's roles and their results support their major claims. The migratory defects in interneurons and the potential role of Shh signalling present novel findings and provide some mechanistic insights, but rescue experiments would further support Kif7's role in interneuron migration. Similarly, the mechanism underlying the misprojection which has previously been reported in other cilia mutants remains unexplored. Taken together, this manuscript makes novel contributions to our understanding of the role of primary cilia in forebrain development and to the aetiology of the neural symptoms in ciliopathy patients.

      Comments on revisions:

      The authors addressed most of the points I raised in my original review.

    3. Reviewer #2 (Public review):

      Summary:

      This study investigates the role of KIF7, a ciliary kinesin involved in the Sonic Hedgehog (SHH) signaling pathway, in cortical development using Kif7 knockout mice. The researchers examined embryonic cortex development (mainly at E14.5), focusing on structural changes and neuronal migration abnormalities.

      Strengths:

      (1) The phenotype observed is interesting, and the findings provide neurodevelopmental insight into some of the symptoms and malformations seen in patients with KIF7 mutations.

      (2) The authors assess several features of cortical development, including structural changes in layers of the developing cortex, connectivity of the cortex with thalamus, as well as migration of cINs from CGE and MGE to cortex.

      Comments on revisions:

      The authors have made significant and thoughtful responses as well as experimental additions in response to the reviewers' comments. Their efforts are appreciated and the manuscript is much improved.

    4. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations for the authors):

      (1) I am not convinced by the figures the authors present on Shh protein expression. The "bright tiny dots" of Shh protein in the cortex are not visible on the images in Figure 7. I wonder whether the authors could present higher magnification and/or black and white images with increased contrast.

      We have modified Figure 7: we now present a higher magnification and a black and white image with increased contrast to better visualize SHH (+) bright tiny dots in the lateral cortex.

      (2) The manuscript also contains several typos.

      We apologize for these mistakes which have all been corrected.

    1. eLife Assessment

      This study presents useful findings on the application of HPV cfDNA as a marker for monitoring treatment response and prognosis in patients with recurrent or metastatic cervical cancer. The evidence supporting the claims of the authors is solid, although inclusion of a larger number of patient samples would have strengthened the study. The work will be of interest to medics and biologists working on cervical cancer.

    2. Reviewer #1 (Public review):

      Summary:

      The study by Zhuomin Yin and colleagues focuses on the relationship between cell-free HPV (cfHPV) DNA and metastatic or recurrent cervical cancer patients. It expands the application of cfHPV DNA in tracking disease progression and evaluating treatment response in cervical cancer patients. The study is overall well-designed, including appropriate analyses.

      Strengths:

      The findings provide valuable reference points for monitoring drug efficacy and guiding treatment strategies in patients with recurrent and metastatic cervical cancer. The concordance between HPV cfDNA fluctuations and changes in disease status suggests that cfDNA could play a crucial role in precision oncology, allowing for more timely interventions. As with similar studies, the authors used Droplet Digital PCR to measure cfDNA copy numbers, a technique that offers ultrasensitive nucleic acid detection and absolute quantification, lending credibility to the conclusions.

      Weaknesses:

      Despite including 28 clinical cases, only 7 involved recurrent cervical cancer, which may not be sufficient to support some of the authors' conclusions fully. Future studies on larger cohorts could solidify HPV cfDNA's role as a standard in the personalized treatment of recurrent cervical cancer patients.

      Comments on revisions:

      Thanks for your additional efforts and for addressing my concerns.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The study "Monitoring of Cell-free Human Papillomavirus DNA in Metastatic or Recurrent Cervical Cancer: Clinical Significance and Treatment Implications" by Zhuomin Yin and colleagues focuses on the relationship between cell-free HPV (cfHPV) DNA and metastatic or recurrent cervical cancer patients. It expands the application of cfHPV DNA in tracking disease progression and evaluating treatment response in cervical cancer patients. The study is overall well-designed, including appropriate analyses.

      Strengths:

      The findings provide valuable reference points for monitoring drug efficacy and guiding treatment strategies in patients with recurrent and metastatic cervical cancer. The concordance between HPV cfDNA fluctuations and changes in disease status suggests that cfDNA could play a crucial role in precision oncology, allowing for more timely interventions. As with similar studies, the authors used Droplet Digital PCR to measure cfDNA copy numbers, a technique that offers ultrasensitive nucleic acid detection and absolute quantification, lending credibility to the conclusions.

      Weaknesses:

      Despite including 28 clinical cases, only 7 involved recurrent cervical cancer, which may not be sufficient to support some of the authors' conclusions fully. Future studies on larger cohorts could solidify HPV cfDNA's role as a standard in the personalized treatment of recurrent cervical cancer patients.

      (1) The authors should provide source data for Figures 2, 3, and 4 as supplementary material.

      We greatly appreciate your evaluation of our study and fully agree with the limitations you have pointed out. We appreciate your constructive feedback. Based on your suggestions, we have made the following additions to the article. We have realized that the information provided in Figures 2, 3, and 4 is limited. Therefore, we have presented the original data from Figures 2, 3, and 4 in tabular form in Supplementary Table 2.

      (2) Description of results in Figure 2: Figure 2 would benefit from clearer annotations regarding HPV virus subtypes. For example, does the color-coding in Figure 2B imply that all samples in the LR subgroup are of type HPV16? If that is the case, is it possible that detection variations are due to differences in subtype detection efficiency rather than cfDNA levels? The authors should clarify these aspects. Annotation of Figure 2B suggests that the p-value comes from comparing the LR and LN + H + DSM groups. This should be clarified in the legend. If this p-value comes from comparing HPV cfDNA copies for the (LR, LNM, HM) and (LN + HM, LN + HM + DSM) groups, did the authors carry out post-hoc pairwise comparisons? It would be helpful to include acronyms for these groups in the legend also.

      We fully agree with your point regarding the need for clearer labeling of HPV genotypes in Figures 2B and 2C. If each data point could be color-coded to represent the HPV genotype, Figures 2B and 2C would be clearer and provide more information. However, we must acknowledge that due to the limitations of our current graphing software and our graphical expertise, we were unable to fully represent each HPV genotype in the figures. To address this, we have presented the data in Supplementary Table 2. This table shows the HPV genotype for each patient, the corresponding metastasis patterns, and the baseline HPV copy numbers. We hope this will address the limitation of insufficient information in Figure 2.

      The point you raised regarding whether the differences in detection results might stem from variations in subtype detection efficiency rather than cfDNA levels is a valid limitation of this study. Due to the limited sample size, we did not perform subgroup analyses based on different HPV genotypes, which may have introduced bias in the results presented in Figures 2B and 2C. In response, we have added the following clarification in the discussion section (lines 416-422) and addressed this limitation in the limitations section (lines 499-502). Based on your suggestion, we believe that it is essential to expand the sample size and perform subgroup analysis of the baseline copy numbers for each HPV genotype before treatment. We hope to achieve this goal in future studies.

      Thank you for your thoughtful comments regarding the statistical analyses in the study. The p-value in Figure 2B comes from the comparison among five groups, using a two-sided Kruskal-Wallis test. Your suggestion to perform post-hoc pairwise comparisons is excellent and has made the data presentation in the article more rigorous. Following your advice, we conducted pairwise comparisons between the groups. We used the Mann-Whitney U test to compare HPV cfDNA copy numbers between two groups. Since the LR group only had one value, it could not be included in the pairwise comparisons. Significant differences were observed in two comparisons: LNM vs. LN + H + DSM (P = 0.006) and HM vs. LN + H + DSM (P = 0.036). No significant differences were found between the other groups: LNM vs. HM (P = 0.768), LNM vs. LN + HM (P = 0.079), HM vs. LN + HM (P = 0.112), and LN + HM vs. LN + H + DSM (P = 0.145), as determined by the Mann-Whitney U test (Figure 2B). (Lines 258-263).
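
      The pairwise comparison described above can be sketched as follows. This is a minimal pure-Python illustration of the Mann-Whitney U statistic with an exact two-sided p-value obtained by enumerating every possible group assignment, which is feasible only for small groups. The function names and the copy-number values are hypothetical and are not taken from the study; the software the authors used is not stated here.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for the first group: its rank sum minus the minimum possible."""
    ranks = average_ranks(list(x) + list(y))
    n1 = len(x)
    return sum(ranks[:n1]) - n1 * (n1 + 1) / 2


def exact_two_sided_p(x, y):
    """Exact two-sided p-value by enumerating all ways to pick the first group."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    centre = n1 * n2 / 2  # mean of U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - centre)
    as_extreme = total = 0
    for chosen in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in chosen) - n1 * (n1 + 1) / 2
        total += 1
        if abs(u - centre) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical baseline copy numbers for two metastasis-pattern groups.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
u_stat = mann_whitney_u(group_a, group_b)
p_value = exact_two_sided_p(group_a, group_b)
```

      Even with complete separation, three values per group cannot give an exact two-sided p-value below 0.1. This illustrates how little such a test can say about very small groups, and why a group with a single value was left out of the pairwise comparisons.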

      Thank you for your thoughtful suggestion regarding the inclusion of group acronyms in the legends of Figures 2B and 2C. Including the full names corresponding to the abbreviations would indeed enhance clarity. While we attempted to add both acronyms and full names to the figure legend, the full names were too lengthy and impacted the figure's presentation. Therefore, we have provided the full names corresponding to the abbreviations in the figure caption below, to help readers easily understand the abbreviations used in the figure.

      (3) Interpretation of results in Figure 2 and elsewhere: Significant differences detected in Figure 2B could imply potential associations between HPV cfDNA levels (or subtypes) and recurrence/metastasis patterns. Figure 2C shows that there is a difference in cfDNA levels between the groups compared, suggesting an association but this would not necessarily be a direct "correlation". Overall, interpretation of statistical findings would benefit from more precise language throughout the text and overstatement should be avoided.

      Thank you for your insightful comments regarding the interpretation of results in Figure 2 and elsewhere. We acknowledge that there are several limitations in this study, and the interpretation of the results should be more careful and cautious. Indeed, in the results section, there were issues with inaccurate wording and exaggeration. We have made revisions in the discussion section, which are presented as follows: Preliminary results indicate that baseline HPV cfDNA levels may be linked to recurrence/metastasis patterns, potentially reflecting tumor burden and spread (Lines 411-413). Additionally, we have also made changes in the conclusion section, which are presented as follows: The baseline copy number of HPV cfDNA may be associated with metastatic patterns, thereby reflecting tumor burden and the extent of spread to some extent (Lines 511-513).

      (4) The authors state that six patients showed cfDNA elevation with clinically progressive disease, yet only three are represented in Figure 3B1 under "Patients whose disease progressed during treatment." What is the expected baseline variability in cfDNA for patients? If we look at data from patients with early-stage cancer would we see similar fluctuations? And does the degree of variability vary for different HPV subtypes? Without understanding the normal fluctuations in cfDNA levels, interpreting these changes as progression indicators may be premature.

      Thank you for your feedback. We appreciate your thorough review and attention to detail. Six cervical squamous cell carcinoma (SCC) patients exhibited elevated HPV cfDNA levels as their clinical condition progressed. In the previous Figures 3A1 and 3A2, we only presented data from three patients, as we initially believed that displaying the cfDNA curves from three patients would offer a clearer view, while including six patients might lead to overlap and reduce clarity. However, this may have caused confusion for readers. Based on your suggestion, we have revised Figure 3A1 to include the cfDNA curves for all six patients with squamous cell carcinoma who experienced clinical disease progression during treatment (Figure 3A1), along with the corresponding SCC-Ag curves (Figure 3A2).

      Thank you for highlighting the issue of baseline variability in HPV cfDNA. This is indeed a limitation of our study, which did not address this aspect. If baseline variability is defined as changes in HPV cfDNA levels measured at different time points before treatment in the same patient, fluctuations at different time points are inevitable and objective. Following your suggestion, we have added a discussion on baseline variability in the limitations section of the manuscript to provide readers with a more objective understanding of our study's findings (Lines 501-502). In future studies, we will incorporate baseline variability into the research design to better understand pre-treatment HPV cfDNA fluctuations and provide support for clinical decision-making.

      (5) It would be helpful if where p-values are given, the test used to derive these values was also stated within parentheses e.g. (P < 0.05, permutation test with Benjamini-Hochberg procedure).

      Thank you for your valuable suggestions and examples. Following your advice, we have included the statistical test methods used to obtain the p-values in parentheses wherever they appear in the results section. Additionally, we have specified the statistical test methods for the p-values below the figures in the results section.

      Reviewer #2 (Public review):

      Summary:

      The authors conducted a study to evaluate the potential of circulating HPV cell-free DNA (cfDNA) as a biomarker for monitoring recurrent or metastatic HPV+ cervical cancer. They analyzed serum samples from 28 patients, measuring HPV cfDNA levels via digital droplet PCR and comparing these to squamous cell carcinoma antigen (SCC-Ag) levels in 26 SCC patients, while also testing the association between HPV cfDNA levels and clinical outcomes. The main hypothesis that the authors set out to test was whether circulating HPV cfDNA levels correlated with metastatic patterns and/or treatment response in HPV+ CC.

      The main claims put forward by the paper are that:

      (1) HPV cfDNA was detected in all 28 CC patients enrolled in the study and levels of HPV cfDNA varied over a median 2-month monitoring period.

      (2) 'Median baseline' HPV cfDNA varied according to 'metastatic pattern' in individual patients.

      (3) Positivity rate for HPV cfDNA was more consistent than SCC-Ag.

      (4) In 20 SCC patients monitored longitudinally, concordance with changes in disease status was 90% for HPV cfDNA.

      This study highlights HPV cfDNA as a promising biomarker with advantages over SCC-Ag, underscoring its potential for real-time disease surveillance and individualized treatment guidance in HPV-associated cervical cancer.

      Strengths:

      This study presents valuable insights into HPV+ cervical cancer with potential translational significance for management and guiding therapeutic strategies. The focus on a non-invasive approach is particularly relevant for women's cancers, and the study exemplifies the promising role of HPV cfDNA as a biomarker that could aid personalized treatment strategies.

      Weaknesses:

      While the authors acknowledge the study's small cohort and variability in sequential sampling protocols as a limitation, several revisions should be made to ensure that (1) the findings are presented in a way that aligns more closely with the data without overstatement and (2) that the statistical support for these findings is made more clear. Specific suggestions are outlined below.

      (1) Line 54 in the abstract refers to 'combined multiple-metastasis pattern' but it is not clear what this refers to at this point in the text.

      Thank you for your detailed feedback. You are correct that the "combined multi-metastatic pattern" was not adequately explained in the abstract, which may have caused confusion. To address this, we have clarified the definitions of the combined multi-metastatic pattern and single-metastatic pattern in lines 53-55 of the manuscript. Patients with a combined multi-metastatic pattern (lymph node + hematogenous ± diffuse serosal metastasis) exhibited a higher median baseline HPV cfDNA level compared to those with a single-metastasis pattern (local recurrence, lymph node metastasis, or hematogenous metastasis) (P = 0.003).

      (2) Line 90 The reference to 'prospective clinical study (NCT03175848) in primary stage IVB CC to investigate the role of radiotherapy (RT) in combination therapy' seems not to be at all relevant at this point in the text. I would limit the description of this study to the methods.

      Thank you for your thoughtful and thorough review. Your suggestions are highly relevant. Upon further reflection, we recognized that this sentence was redundant in its original placement. Following your recommendation, we have removed it from this section and moved it to the methods section (Lines 109-111). The revised statement is as follows: "Notably, 19 cases from the primary CC group participated in our prospective clinical study (NCT03175848), focused on stage IVB cervical cancer."

      (3) Line 56 refers to HPV cfDNA levels (range 0.3-16.9) but what units?

      Thank you for your feedback regarding the manuscript format. While you highlighted this specific issue, we have since identified several other instances of omitted units in parentheses throughout the manuscript. We acknowledge that such formatting oversights can create ambiguity for readers. Following your suggestions, we have corrected all such issues in the manuscript. We greatly appreciate your careful and thorough review.

      (4) Lines 247-248 claim that higher baseline HPV cfDNA levels correlated with a more substantial post-chemotherapy decrease. This correlation should be statistically validated, and the p-value should be included.

      Thank you for your insightful comments, which highlighted an issue with this sentence. Upon review, I have made the necessary revisions. Since no statistical analysis was conducted and the P-value was not provided, the original sentence was imprecise. Given the small sample size, statistical analysis is not feasible. I have revised the sentence as follows: “For patients in whom systemic cytotoxic chemotherapy was effective, a significant decrease in HPV cfDNA levels could be detected after chemotherapy” (Lines 297-298).

      (5) The authors mention that baseline samples were collected "between Day -14 and Day +30 preceding initial treatment." If Day -14 indicates two weeks before treatment, then this would imply some samples were taken up to 30 days post-treatment. This notation should be clarified. To what extent might outliers or more extreme values in Figure 2 be driven by variability in how baseline sampling was carried out?

      Thank you for your insightful comments. Undoubtedly, this is indeed a major limitation of our study. These factors could lead to a certain degree of bias in the detection data. The primary reason is that the study was conducted during the COVID-19 pandemic, making it sometimes difficult to conduct sampling regularly. In accordance with your suggestion, I have already added this part of the content to the results section of the article (Lines 266-275). We have also included the variation in baseline sampling as a limitation in the discussion section (Lines 497-499). In future studies, we will strive to improve the study design by ensuring baseline samples are collected prior to treatment, thereby enhancing the reliability of statistical and analytical results.

      (6) Would be useful to amend Figure 1 to show a subset of patients with SCC and a subset of patients who underwent longitudinal monitoring.

      Thank you for your detailed suggestion. Including a subset of pathological types could indeed add more information to Figure 1. However, regarding the pathological types of the patients in this group, we have listed them in Table 1 and Supplementary Table 2. Among the 28 patients, 26 are diagnosed with squamous cell carcinoma, so 92.9% of the patients in this study have squamous cell carcinoma. To avoid making Figure 1 too complex, we decided not to include the pathological type in the figure.

      (7) Line 120 "a time point matching or closely following HPV cfDNA sampling" - what is the time range for 'closely following' here? A couple of hours or days after sampling?

      Thank you for your detailed feedback. Based on your suggestion, we have revised the sentence as follows:

      "For patients with squamous cell CC in the sequential sampling group, concurrent SCC-Ag testing was performed at a time point that matched, or was within 7 days before or after, the HPV cfDNA sampling." (Line 123-125)

      (8) Lines 178-190 and lines 179-180 seem to make exactly the same point.

      Thank you very much for your careful review. Indeed, these two sentences were repetitive and conveyed the same point. I have removed the previous sentence here (lines 206-207).

      (9) In Figure 4, please indicate the number of patients in each group in the legend e.g. HPV16+ (n=x number of patients).

      Thank you for your feedback on the details of Figure 4 and the examples provided. We have updated Figure 4 according to your suggestions and included the number of patients in each group in the figure legend.

      (10) Lines 322-3 'HPV cfDNA predicted treatment response or disease progression at an earlier time point than imaging assessments' - based on the data available and the numbers of patients, I would argue that this is too bold a claim.

      Thank you very much for pointing out this issue. We fully agree with your view. We have modified this sentence as follows: "Secondly, dynamically monitored HPV cfDNA levels appeared to predict treatment response and disease progression." (Lines 391-392).

    1. eLife Assessment

      This valuable study introduces a modern and accessible PyTorch reimplementation of the widely used SpliceAI model for splice site prediction. The authors provide solid evidence that their OpenSpliceAI implementation matches the performance of the original while improving usability and enabling flexible retraining across species. These advances are likely to be of broad interest to the computational genomics community.

    2. Reviewer #1 (Public review):

      Summary:

      Chao et al. produced an updated version of the SpliceAI package using modern deep learning frameworks. This includes data preprocessing, model training, direct prediction, and variant effect prediction scripts. They also added functionality for model fine-tuning and model calibration. They convincingly evaluate their newly trained models against those from the original SpliceAI package and investigate how to extend SpliceAI to make predictions in new species. While their comparisons to the original SpliceAI models are convincing on the grounds of model performance, their evaluation of how well the new models match the original's understanding of non-local mutation effects is incomplete. Further, their evaluation of the new calibration functionality would benefit from a more nuanced discussion of what set of splice sites their calibration is expected to hold for, and tests in a context for which calibration is needed.

      Strengths:

      (1) They provide convincing evidence that their new implementation of SpliceAI matches the performance of the original model on a similar dataset while benefiting from improved computational efficiencies. This will enable faster prediction and retraining of splicing models for new species as well as easier integration with other modern deep learning tools.

      (2) They produce models with strong performance on non-human model species and a simple, well-documented pipeline for producing models tuned for any species of interest. This will be a boon for researchers working on splicing in these species and make it easy for researchers working on new species to generate their own models.

      (3) Their documentation is clear and abundant. This will greatly aid the ability of others to work with their code base.

      Weaknesses:

      (1) The authors' assessment of how much their model retains SpliceAI's understanding of "non-local effects of genomic mutations on splice site location and strength" (Figure 6) is not sufficiently supported. Demonstrating this would require showing that for a large number of (non-local) mutations, their model shows the same change in predictions as SpliceAI or that attribution maps for their model and SpliceAI are concordant even at distances from the splice site. Figure 6A comes close to demonstrating this, but only provides anecdotal evidence as it is limited to 2 loci. This could be overcome by summarizing the concordance between ISM maps for the two models and then comparing across many loci. Figure 6B also comes close, but falls short because instead of comparing splicing prediction differences between the models as a function of variants, it compares the average prediction difference as a function of the distance from the splice site. This limits it to only detecting differences in the model's understanding of the local splice site motif sequences. This could be overcome by looking at comparisons between differences in predictions with mutants directly and considering non-local mutants that cause differences in splicing predictions.

      (2) The utility of the calibration method described is unclear. When thinking about a calibrated model for splicing, the expectation would be that the models' predicted splicing probabilities would match the true probabilities that positions with that level of prediction confidence are splice sites. However, the actual calibration that they perform only considers positions as splice sites if they are splice sites in the longest isoform of the gene included in the MANE annotation. In other words, they calibrate the model such that the model's predicted splicing probabilities match the probability that a position with that level of confidence is a splice site in one particular isoform for each gene, not the probability that it is a splice site more broadly. Their level of calibration on this set of splice sites may very well not hold to broader sets of splice sites, such as sites from all annotated isoforms, sites that are commonly used in cryptic splicing, or poised sites that can be activated by a variant. This is a particularly important point as much of the utility of SpliceAI comes from its ability to issue variant effect predictions, and they have not demonstrated that this calibration holds in the context of variants. This section could be improved by expanding and clarifying the discussion of what set of splice sites they have demonstrated calibration on, what it means to calibrate against this set of splice sites, and how this calibration is expected to hold or not for other interesting sets of splice sites. Alternatively, or in addition, they could demonstrate how well their calibration holds on different sets of splice sites or show the effect of calibrating their models against different potentially interesting sets of splice sites and discuss how the results do or do not differ.

      (3) It is difficult to assess how well their calibration method works in general because their original models are already well calibrated, so their calibration method finds temperatures very close to 1 and only produces very small and hard to assess changes in calibration metrics. This makes it very hard to distinguish if the calibration method works, as it doesn't really produce any changes. It would be helpful to demonstrate the calibration method on a model that requires calibration or on a dataset for which the current model is not well calibrated, so that the impact of the calibration method could be observed.
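
      Points (2) and (3) turn on what temperature scaling actually does, so a small sketch may help. The example below assumes the calibration is standard temperature scaling (a single scalar T dividing the logits, fitted by minimizing the negative log-likelihood on held-out labels). That is consistent with the "temperatures very close to 1" mentioned above but is not confirmed in detail here. All function names and the toy logits are hypothetical; this is not OpenSpliceAI code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[label])
              for row, label in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, t_min=0.05, t_max=20.0, iterations=200):
    """Fit T by golden-section search over beta = 1/T.

    The loss is convex in beta, so a bracketing search is sufficient here.
    """
    shrink = (math.sqrt(5) - 1) / 2
    lo, hi = 1 / t_max, 1 / t_min
    for _ in range(iterations):
        a = hi - shrink * (hi - lo)
        b = lo + shrink * (hi - lo)
        if mean_nll(logit_rows, labels, 1 / a) < mean_nll(logit_rows, labels, 1 / b):
            hi = b
        else:
            lo = a
    return 2 / (lo + hi)


# Toy, deliberately overconfident 3-class model (e.g. neither / acceptor / donor):
# every prediction is made with ~99% confidence, but only 4 of 6 are correct.
toy_logits = [[5, 0, 0], [5, 0, 0], [5, 0, 0], [0, 5, 0], [0, 5, 0], [0, 5, 0]]
toy_labels = [0, 0, 1, 1, 1, 0]
fitted_t = fit_temperature(toy_logits, toy_labels)
confidence_after = max(softmax(toy_logits[0], fitted_t))
```

      In this toy case the fitted temperature is well above 1 (about 3.6) and the confidence in the predicted class drops to roughly two thirds, matching the toy accuracy. A model that is already well calibrated would instead give T close to 1 and essentially unchanged probabilities, which is the difficulty raised in point (3). The labels passed to the fit also define what the calibration holds for, which is the concern in point (2).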

    3. Reviewer #2 (Public review):

      Summary:

      The paper by Chao et al. offers a reimplementation of the SpliceAI algorithm in PyTorch so that the model can be retrained more easily and efficiently. They apply their new implementation of the SpliceAI algorithm, which they call OpenSpliceAI, to several species and compare it against the original model, showing that the results are very similar and that in some small species, pre-training on other species helps improve performance.

      Strengths:

      On the upside, the code runs fine, and it is well documented.

      Weaknesses:

      The paper itself does not offer much beyond reimplementing SpliceAI. There is no new algorithm, new analysis, new data, or new insights into RNA splicing. There is no comparison to many of the alternative methods that have since been published to surpass SpliceAI. Given that some of the authors are well-known with a long history of important contributions, our expectations were admittedly different. Still, we hope some readers will find the new implementation useful.

    4. Reviewer #3 (Public review):

      Summary:

      The authors present OpenSpliceAI, a PyTorch-based reimplementation of the well-known SpliceAI deep learning model for splicing prediction. The core architecture remains unchanged, but the reimplementation demonstrates convincing improvements in usability, runtime performance, and potential for cross-species application.

      Strengths:

      The improvements are well-supported by comparative benchmarks, and the work is valuable given its strong potential to broaden the adoption of splicing prediction tools across computational and experimental biology communities.

      Major comments:

      Can fine-tuning also be used to improve prediction for human splicing? Specifically, are models trained on other species and then fine-tuned with human data able to perform better on human splicing prediction? This would enhance the model's utility for more users, and ideally, such fine-tuned models should be made available.

    5. Author response:

      Reviewer #1 (Public review):

      Summary:

      Chao et al. produced an updated version of the SpliceAI package using modern deep learning frameworks. This includes data preprocessing, model training, direct prediction, and variant effect prediction scripts. They also added functionality for model fine-tuning and model calibration. They convincingly evaluate their newly trained models against those from the original SpliceAI package and investigate how to extend SpliceAI to make predictions in new species. While their comparisons to the original SpliceAI models are convincing on the grounds of model performance, their evaluation of how well the new models match the original's understanding of non-local mutation effects is incomplete. Further, their evaluation of the new calibration functionality would benefit from a more nuanced discussion of what set of splice sites their calibration is expected to hold for, and tests in a context for which calibration is needed.

      Strengths:

      (1) They provide convincing evidence that their new implementation of SpliceAI matches the performance of the original model on a similar dataset while benefiting from improved computational efficiencies. This will enable faster prediction and retraining of splicing models for new species as well as easier integration with other modern deep learning tools.

      (2) They produce models with strong performance on non-human model species and a simple, well-documented pipeline for producing models tuned for any species of interest. This will be a boon for researchers working on splicing in these species and make it easy for researchers working on new species to generate their own models.

      (3) Their documentation is clear and abundant. This will greatly aid the ability of others to work with their code base.

      We thank the reviewer for these positive comments.  

      Weaknesses:

      (1) The authors' assessment of how much their model retains SpliceAI's understanding of "nonlocal effects of genomic mutations on splice site location and strength" (Figure 6) is not sufficiently supported. Demonstrating this would require showing that for a large number of (non-local) mutations, their model shows the same change in predictions as SpliceAI or that attribution maps for their model and SpliceAI are concordant even at distances from the splice site. Figure 6A comes close to demonstrating this, but only provides anecdotal evidence as it is limited to 2 loci. This could be overcome by summarizing the concordance between ISM maps for the two models and then comparing across many loci. Figure 6B also comes close, but falls short because instead of comparing splicing prediction differences between the models as a function of variants, it compares the average prediction difference as a function of the distance from the splice site. This limits it to only detecting differences in the model's understanding of the local splice site motif sequences. This could be overcome by looking at comparisons between differences in predictions with mutants directly and considering non-local mutants that cause differences in splicing predictions.

      We agree that two loci are insufficient to demonstrate preservation of non-local effects. To address this, we have extended our analysis to a larger set of sites: we randomly sampled 100 donor and 100 acceptor sites, applied our ISM procedure over a 5,001 nt window centered at each site for both models, and computed the ISM map as before. We then calculated the Pearson correlation between the collection of OSAI<sub>MANE</sub> and SpliceAI ISM importance scores. We also created 10 additional ISM maps similar to those in Figure 6A, which are now provided in Figure S23.

      The following is the revised paragraph in the manuscript’s Results section:

      First, we recreated the experiment from Jaganathan et al. in which they mutated every base in a window around exon 9 of the U2SURP gene and calculated its impact on the predicted probability of the acceptor site. We repeated this experiment on exon 2 of the DST gene, again using both SpliceAI and OSAI<sub>MANE</sub>. In both cases, we found a strong similarity between the resultant patterns between SpliceAI and OSAI<sub>MANE</sub>, as shown in Figure 6A. To evaluate concordance more broadly, we randomly selected 100 donor and 100 acceptor sites and performed the same ISM experiment on each site. The Pearson correlation between SpliceAI and OSAI<sub>MANE</sub> yielded an overall median correlation of 0.857 (see Methods; additional DNA logos in Figure S23).

      To characterize the local sequence features that both models focus on, we computed the average decrease in predicted splice-site probability resulting from each of the three possible single-nucleotide substitutions at every position within 80 bp for 100 donor and 100 acceptor sites randomly sampled from the test set (Chromosomes 1, 3, 5, 7, and 9). Figure 6B shows the average decrease in splice site strength for each mutation in the format of a DNA logo, for both tools.

      We added the following text to the Methods section:

      Concordance evaluation of ISM importance scores between OSAI<sub>MANE</sub> and SpliceAI

      To assess agreement between OSAI<sub>MANE</sub> and SpliceAI across a broad set of splice sites, we applied our ISM procedure to 100 randomly chosen donor sites and 100 randomly chosen acceptor sites. For each site, we extracted a 5,001 nt window centered on the annotated splice junction and, at every coordinate within that window, substituted the reference base with each of the three alternative nucleotides. We recorded the change in predicted splice-site probability for each mutation and then averaged these Δ-scores at each position to produce a 5,001-score ISM importance profile per site.

      Next, for each splice site we computed the Pearson correlation coefficient between the paired importance profiles from ensembled OSAI<sub>MANE</sub> and ensembled SpliceAI. The median correlation was 0.857 for all splice sites. Ten additional zoom-in representative splice site DNA logo comparisons are provided in Supplementary Figure S23.
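      The concordance procedure described above can be illustrated with a minimal sketch. It is not the authors' code: `toy_model_a` and `toy_model_b` are hypothetical stand-ins for the two splice-site predictors, the 11 nt window replaces the 5,001 nt window, and the sign convention (reference score minus mutant score) is an assumption. The sketch only shows the shape of the computation: one averaged Δ-score per position, then a Pearson correlation between the two profiles.

```python
import math

BASES = "ACGT"


def ism_profile(seq, predict):
    """Return one importance score per position of `seq`.

    `predict` is a hypothetical callable that maps a sequence to the
    predicted probability of the splice site at its centre. Each score is
    the mean of (reference score - mutant score) over the three
    alternative nucleotides (assumed sign convention).
    """
    ref_score = predict(seq)
    profile = []
    for i, ref_base in enumerate(seq):
        deltas = [
            ref_score - predict(seq[:i] + alt + seq[i + 1:])
            for alt in BASES
            if alt != ref_base
        ]
        profile.append(sum(deltas) / len(deltas))
    return profile


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def toy_model_a(seq):
    """Hypothetical predictor: fraction of G in the central 5 nt."""
    centre = len(seq) // 2
    window = seq[centre - 2:centre + 3]
    return window.count("G") / len(window)


def toy_model_b(seq):
    """Hypothetical second predictor: a rescaled copy of the first."""
    return 0.5 * toy_model_a(seq) + 0.1


site_window = "ACGTACGGTAC"  # toy window; the response uses 5,001 nt
profile_a = ism_profile(site_window, toy_model_a)
profile_b = ism_profile(site_window, toy_model_b)
r = pearson(profile_a, profile_b)
```

      In this toy case the second model is a linear rescaling of the first, so the two profiles correlate almost perfectly. Repeating the calculation per site and taking the median over sites would give a summary statistic of the kind reported in the response.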

      (2) The utility of the calibration method described is unclear. When thinking about a calibrated model for splicing, the expectation would be that the models' predicted splicing probabilities would match the true probabilities that positions with that level of prediction confidence are splice sites. However, the actual calibration that they perform only considers positions as splice sites if they are splice sites in the longest isoform of the gene included in the MANE annotation. In other words, they calibrate the model such that the model's predicted splicing probabilities match the probability that a position with that level of confidence is a splice site in one particular isoform for each gene, not the probability that it is a splice site more broadly. Their level of calibration on this set of splice sites may very well not hold to broader sets of splice sites, such as sites from all annotated isoforms, sites that are commonly used in cryptic splicing, or poised sites that can be activated by a variant. This is a particularly important point as much of the utility of SpliceAI comes from its ability to issue variant effect predictions, and they have not demonstrated that this calibration holds in the context of variants. This section could be improved by expanding and clarifying the discussion of what set of splice sites they have demonstrated calibration on, what it means to calibrate against this set of splice sites, and how this calibration is expected to hold or not for other interesting sets of splice sites. Alternatively, or in addition, they could demonstrate how well their calibration holds on different sets of splice sites or show the effect of calibrating their models against different potentially interesting sets of splice sites and discuss how the results do or do not differ.

      We thank the reviewer for highlighting the need to clarify our calibration procedure. Both SpliceAI and OpenSpliceAI are trained on a single “canonical” transcript per gene: SpliceAI on the hg19 Ensembl/Gencode canonical set and OpenSpliceAI on the MANE transcript set. To calibrate each model, we applied post-hoc temperature scaling, i.e. a single learnable parameter that rescales the logits before the softmax. This adjustment does not alter the model’s ranking or discrimination (AUC/precision–recall) but simply aligns the predicted probabilities for donor, acceptor, and non-splice classes with their observed frequencies. As shown in our reliability diagrams (Fig. S16-S22), temperature scaling yields negligible changes in performance, confirming that both SpliceAI and OpenSpliceAI were already well-calibrated. However, we acknowledge that we didn’t measure how calibration might affect predictions on non-canonical splice sites or on cryptic splicing. It is possible that calibration might have a detrimental effect on those, but because this is not a key claim of our paper, we decided not to do further experiments. We have updated the manuscript to acknowledge this potential shortcoming; please see the revised paragraph in our next response.
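      For readers unfamiliar with temperature scaling, the sketch below shows the idea on toy data. It is an illustration under stated assumptions, not the OpenSpliceAI implementation: the logits and labels are invented, the three classes are assumed to be ordered (non-splice, acceptor, donor), and the temperature is chosen by a simple grid search over the negative log-likelihood, whereas a gradient-based optimizer is more common in practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a single scalar temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, low=0.05, high=20.0, steps=400):
    """Pick the temperature with the lowest NLL on a log-spaced grid.

    Grid search is a stand-in chosen for brevity (an assumption of this
    sketch), not the optimizer used by any particular package.
    """
    best_t, best_loss = 1.0, nll(logit_rows, labels, 1.0)
    for k in range(steps):
        t = low * (high / low) ** (k / (steps - 1))
        loss = nll(logit_rows, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Invented, deliberately overconfident toy logits: 4 of 6 calls are right.
rows = [[8, 0, 0], [8, 0, 0], [0, 8, 0], [0, 0, 8], [8, 0, 0], [0, 8, 0]]
labels = [0, 0, 1, 2, 1, 0]

fitted_t = fit_temperature(rows, labels)
ranking_preserved = all(
    argmax(softmax(row, 1.0)) == argmax(softmax(row, fitted_t)) for row in rows
)
```

      Because every logit in a row is divided by the same positive constant, the predicted class never changes; only the confidence attached to it does. A fitted temperature close to 1, as reported for the models discussed here, means the probabilities are left almost unchanged.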

      (3) It is difficult to assess how well their calibration method works in general because their original models are already well calibrated, so their calibration method finds temperatures very close to 1 and only produces very small and hard to assess changes in calibration metrics. This makes it very hard to distinguish if the calibration method works, as it doesn't really produce any changes. It would be helpful to demonstrate the calibration method on a model that requires calibration or on a dataset for which the current model is not well calibrated, so that the impact of the calibration method could be observed.

      It’s true that the models we calibrated didn’t need many changes. It is possible that the calibration methods we used (which were not ours, but which were described in earlier publications) can’t improve the models much. We toned down our comments about this procedure, as follows.

      Original:

      “Collectively, these results demonstrate that OSAIs were already well-calibrated, and this consistency across species underscores the robustness of OpenSpliceAI’s training approach in diverse genomic contexts.”

      Revised:

      “We observed very small changes after calibration across phylogenetically diverse species, suggesting that OpenSpliceAI’s training regimen yielded well‐calibrated models, although it is possible that a different calibration algorithm might produce further improvements in performance.”

      Reviewer #2 (Public review):

      Summary:

      The paper by Chao et al offers a reimplementation of the SpliceAI algorithm in PyTorch so that the model can more easily/efficiently be retrained. They apply their new implementation of the SpliceAI algorithm, which they call OpenSpliceAI, to several species and compare it against the original model, showing that the results are very similar and that in some small species, pretraining on other species helps improve performance.

      Strengths:

      On the upside, the code runs fine, and it is well documented.

      Weaknesses:

      The paper itself does not offer much beyond reimplementing SpliceAI. There is no new algorithm, new analysis, new data, or new insights into RNA splicing. There is no comparison to many of the alternative methods that have since been published to surpass SpliceAI. Given that some of the authors are well-known with a long history of important contributions, our expectations were admittedly different. Still, we hope some readers will find the new implementation useful.

      We thank the reviewer for the feedback. We have clarified that OpenSpliceAI is an open-source PyTorch reimplementation optimized for efficient retraining and transfer learning, designed to analyze cross-species performance gains, and supported by a thorough benchmark and the release of several pretrained models to clearly position our contribution.

      Reviewer #3 (Public review):

      Summary:

      The authors present OpenSpliceAI, a PyTorch-based reimplementation of the well-known SpliceAI deep learning model for splicing prediction. The core architecture remains unchanged, but the reimplementation demonstrates convincing improvements in usability, runtime performance, and potential for cross-species application.

      Strengths:

      The improvements are well-supported by comparative benchmarks, and the work is valuable given its strong potential to broaden the adoption of splicing prediction tools across computational and experimental biology communities.

      Major comments:

      Can fine-tuning also be used to improve prediction for human splicing? Specifically, are models trained on other species and then fine-tuned with human data able to perform better on human splicing prediction? This would enhance the model's utility for more users, and ideally, such fine-tuned models should be made available.

      We evaluated transfer learning by fine-tuning models pretrained on mouse (OSAI<sub>Mouse</sub>), honeybee (OSAI<sub>Honeybee</sub>), Arabidopsis (OSAI<sub>Arabidopsis</sub>), and zebrafish (OSAI<sub>Zebrafish</sub>) on human data. While transfer learning accelerated convergence compared to training from scratch, the final human splicing prediction accuracy was comparable between fine-tuned and scratch-trained models, suggesting that performance on our current human dataset is nearing saturation under this architecture.

      We added the following paragraph to the Discussion section:

      We also evaluated pretraining on mouse (OSAI<sub>Mouse</sub>), honeybee (OSAI<sub>Honeybee</sub>), zebrafish (OSAI<sub>Zebrafish</sub>), and Arabidopsis (OSAI<sub>Arabidopsis</sub>) followed by fine-tuning on the human MANE dataset. While cross-species pretraining substantially accelerated convergence during fine-tuning, the final human splicing-prediction accuracy was comparable to that of a model trained from scratch on human data. This result indicates that our architecture seems to capture all relevant splicing features from human training data alone, and thus gains little or no benefit from cross-species transfer learning in this context (see Figure S24).

      Reviewer #1 (Recommendations for the authors):

      We thank the editor for summarizing the points raised by each reviewer. Below is our point-by-point response to each comment:

      (1) In Figure 3 (and generally in the other figures) OpenSpliceAI should be replaced with OSAI_{Training dataset} because otherwise it is hard to tell which precise model is being compared. And in Figure 3 it is especially important to emphasize that you are comparing a SpliceAI model trained on Human data to an OSAI model trained and evaluated on a different species.

      We have updated the labels in Figure 3, replacing “OpenSpliceAI” with “OSAI_{training dataset}” to more clearly specify which model is being compared.

      (2) Are genes paralogous to training set genes removed from the validation set as well as the test set? If you are worried about data leakage in the test set, it makes sense to also consider validation set leakage.

      Thank you for this helpful suggestion. We fully agree, and to avoid any data leakage we implemented the identical filtering pipeline for both validation and test sets: we excluded all sequences paralogous or homologous to sequences in the training set, and further removed any sequence sharing > 80 % length overlap and > 80 % sequence identity with training sequences. The effect of this filtering on the validation set is summarized in Supplementary Figure S7C.

      Figure S7. (C) Scatter plots of DNA sequence alignments between validation and training sets for Human-MANE, mouse, honeybee, zebrafish, and Arabidopsis. Each dot represents an alignment, with the x-axis showing alignment identity and the y-axis showing alignment coverage. Alignments exceeding 80% for both identity and coverage are highlighted in the red-shaded region and were excluded from the test sets.
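      The identity and coverage thresholds in this response amount to a simple filter, sketched below. The `Alignment` record, its field names, and the sequence identifiers are hypothetical, and the sketch assumes alignments of held-out sequences against the training set have already been computed by an external aligner. A sequence is dropped only when at least one of its alignments exceeds both thresholds.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one held-out sequence aligned to training data."""
    query_id: str    # identifier of the validation or test sequence
    identity: float  # fraction of identical aligned bases, 0.0 to 1.0
    coverage: float  # fraction of the query covered by the alignment


def leaky_ids(alignments, min_identity=0.80, min_coverage=0.80):
    """Return ids with any alignment above BOTH thresholds."""
    return {
        a.query_id
        for a in alignments
        if a.identity > min_identity and a.coverage > min_coverage
    }


def filter_held_out(sequence_ids, alignments):
    """Keep held-out sequences that do not closely match the training set."""
    flagged = leaky_ids(alignments)
    return [seq_id for seq_id in sequence_ids if seq_id not in flagged]


# Invented example data.
held_out = ["geneA", "geneB", "geneC", "geneD"]
hits = [
    Alignment("geneA", identity=0.95, coverage=0.90),  # removed: both high
    Alignment("geneB", identity=0.95, coverage=0.40),  # kept: low coverage
    Alignment("geneC", identity=0.60, coverage=0.99),  # kept: low identity
]
kept = filter_held_out(held_out, hits)
```

      Applying the same function to the validation and test splits is one way to keep the two filters identical, as the response describes.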

      Reviewer #3 (Recommendations for the authors):

      (1) The legend in Figure 3 is somewhat confusing. The labels like "SpliceAI-Keras (species name)" may imply that the model was retrained using data from that species, but that's not the case, correct?

      Yes, “SpliceAI-Keras (species name)” was not retrained; it refers to the released SpliceAI model evaluated on the specified species dataset. We have revised the Figure 3 legends, changing “SpliceAI-Keras (species name)” to “SpliceAI-Keras” to clarify this.

      (2) Please address the minor issues with the code, including ensuring the conda install works across various systems.

      We have addressed the issues you mentioned. OpenSpliceAI is now available on Conda and can be installed with: conda install openspliceai.

      The conda package homepage is at: https://anaconda.org/khchao/openspliceai

      We’ve also corrected all broken links in the documentation.

      (3) Utility:

      I followed all the steps in the Quick Start Guide, and aside from the issues mentioned below, everything worked as expected.

      I attempted installation using conda as described in the instructions, but it was unsuccessful. I assume this method is not yet supported.

      In Quick Start Guide: predict, the link labeled "GitHub (models/spliceai-mane/10000nt/)" appears to be incorrect. The correct path is likely "GitHub (models/openspliceaimane/10000nt/)".

      In Quick Start Guide: variant (https://ccb.jhu.edu/openspliceai/content/quick_start_guide/quickstart_variant.html#quick-startvariant), some of the download links for input files were broken. While I was able to find some files in the GitHub repository, I think the -A option should point to data/grch37.txt, not examples/data/input.vcf, and the -I option should be examples/data/input.vcf, not data/vcf/input.vcf.

      Thank you for catching these issues. We’ve now addressed all issues concerning Conda installation and file links. We thank the editor for thoroughly testing our code and reviewing the documentation.

    1. eLife Assessment

      The manuscript by Hawes et al. provides important findings on how striatal projection neurons regulate spontaneous locomotion speed in the context of implicit motivation and distinct contextual valence. The supporting evidence for the findings is convincing. This work will be of broad interest to neuroscientists in the fields of basal ganglia, movement control, and cognition.

    2. Reviewer #1 (Public review):

      Summary:

      This fundamental work employed multidisciplinary approaches and conducted rigorous experiments to study how a specific subset of neurons in the dorsal striatum (i.e., "patchy" striatal neurons) modulates locomotion speed depending on the valence of naturalistic contexts.

      Strengths:

      The scientific findings are novel and original and significantly advance our understanding of how the striatal circuit regulates spontaneous movement in various contexts.

      Weaknesses:

      This is extensive research involving various circuit manipulation approaches. Some of these circuit manipulations are not physiological. A balanced discussion of the technical strengths and limitations of the present work would be helpful and beneficial to the field.

    3. Reviewer #2 (Public review):

      Hawes et al. investigated the role of striatal neurons in the patch compartment of the dorsal striatum. Using the Sepw1-Cre line, the authors combined a modified version of the light/dark transition box test that allows them to examine locomotor activity in different environmental valence with a variety of approaches, including cell-type-specific ablation, miniscope calcium imaging, fiber photometry, and opto-/chemogenetics. First, they found ablation of patchy striatal neurons resulted in an increase in movement vigor when mice stayed in a safe area or when they moved back from more anxiogenic to safe environments. The following miniscope imaging experiment revealed that a larger fraction of striatal patchy neurons was negatively correlated with movement speed, particularly in an anxiogenic area. Next, the authors investigated differential activity patterns of patchy neurons' axon terminals, focusing on those in GPe, GPi, and SNr, showing that the patchy axons in SNr reflect movement speed/vigor. Chemogenetic and optogenetic activation of these patchy striatal neurons suppressed locomotor vigor, thus demonstrating their causal role in the modulation of locomotor vigor when exposed to valence differentials. Unlike the activation of striatal patches, such a suppressive effect on locomotion was absent when optogenetically activating matrix neurons by using the Calb1-Cre line, indicating distinctive roles in the control of locomotor vigor by striatal patch and matrix neurons. Together, they have concluded that nigrostriatal neurons within striatal patches negatively regulate movement vigor, dependent on behavioral contexts where motivational valence differs.

      The strengths of this work include the use of multiple experimental approaches, including genetic/viral ablation of patch neurons, miniscope single-cell imaging, as well as projection-specific recording of axonal activity by fiber photometry, and causal manipulation of the neurons by chemogenetics and optogenetics. Although similar findings were reported previously, the authors' results will be of value owing to multiple levels of investigation. In my view, this study will add to the important literature by demonstrating how patch (striosomal) neurons in the striatum control movement vigor.

    4. Reviewer #3 (Public review):

      Hawes et al. combined behavioral, optical imaging, and activity manipulation techniques to investigate the role of striatal patch SPNs in locomotion regulation. Using Sepw1-Cre transgenic mice, they found that patch SPNs encode locomotion deceleration in a light-dark box procedure through optical imaging techniques. Moreover, genetic ablation of patch SPNs increased locomotion speed, while chemogenetic activation of these neurons decreased it. The authors concluded that a subtype of patch striatonigral neurons modulates locomotion speed based on external environmental cues.

      In the revision, the authors have largely addressed my concerns with additional explanation and discussion, although some of the key experiments to strengthen the authors' claim by identifying the function of specific cell populations remain to be conducted due to technical challenges. Nevertheless, the current results remain valuable and interesting to a wide audience in the field.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This fundamental work employed multidisciplinary approaches and conducted rigorous experiments to study how a specific subset of neurons in the dorsal striatum (i.e., "patchy" striatal neurons) modulates locomotion speed depending on the valence of the naturalistic context.

      Strengths:

      The scientific findings are novel and original and significantly advance our understanding of how the striatal circuit regulates spontaneous movement in various contexts.

      We appreciate the reviewer’s positive evaluation.

      Weaknesses:

      This is extensive research involving various circuit manipulation approaches. Some of these circuit manipulations are not physiological. A balanced discussion of the technical strengths and limitations of the present work would be helpful and beneficial to the field. Minor issues in data presentation were also noted.

      We have incorporated the recommended discussion of technical limitations and addressed the physiological plausibility of our manipulations on Page 33 of the revised Discussion section. Specifically, we wrote:

      “Judicious interpretation of the present data must consider the technical limitations of the various methods and circuit-level manipulations applied. Patchy neurons are distributed unevenly across the extensive structure of the striatum, and their targeted manipulation is constrained by viral spread in the dorsal striatum. Somatic calcium imaging using single-photon microscopy captures activity from only a subset of patchy neurons within a narrow focal plane beneath each implanted GRIN lens. Similarly, limitations in light diffusion from optical fibers may reduce the effective population of targeted fibers in both photometry and optogenetic experiments. For example, the more modest locomotor slowing observed with optogenetic activation of striatonigral fibers in the SNr compared to the stronger effects seen with Gq-DREADD activation across the dorsal striatum could reflect limited fiber optic coverage in the SNr. Alternatively, it may suggest that non-striatonigral mechanisms also contribute to generalized slowing. Our photometry data does not support a role for striatopallidal projections from patchy neurons in movement suppression. The potential contribution of intrastriatal mechanisms, discussed earlier, remains to be empirically tested. Although the behavioral assays used were naturalistic, many of the circuit-level interventions were not. Broad ablation or widespread activation of patchy neurons and their efferent projections represent non-physiological manipulations. Nonetheless, these perturbation results are interpreted alongside more naturalistic observations, such as in vivo imaging of patchy neuron somata and axon terminals, to form a coherent understanding of their functional role”.

      Reviewer #2 (Public review):

      Hawes et al. investigated the role of striatal neurons in the patch compartment of the dorsal striatum. Using the Sepw1-Cre line, the authors combined a modified version of the light/dark transition box test that allows them to examine locomotor activity in different environmental valence with a variety of approaches, including cell-type-specific ablation, miniscope calcium imaging, fiber photometry, and opto-/chemogenetics. First, they found ablation of patchy striatal neurons resulted in an increase in movement vigor when mice stayed in a safe area or when they moved back from more anxiogenic to safe environments. The following miniscope imaging experiment revealed that a larger fraction of striatal patchy neurons was negatively correlated with movement speed, particularly in an anxiogenic area. Next, the authors investigated differential activity patterns of patchy neurons' axon terminals, focusing on those in GPe, GPi, and SNr, showing that the patchy axons in SNr reflect movement speed/vigor. Chemogenetic and optogenetic activation of these patchy striatal neurons suppressed locomotor vigor, thus demonstrating their causal role in the modulation of locomotor vigor when exposed to valence differentials. Unlike the activation of striatal patches, such a suppressive effect on locomotion was absent when optogenetically activating matrix neurons by using the Calb1-Cre line, indicating distinctive roles in the control of locomotor vigor by striatal patch and matrix neurons. Together, they have concluded that nigrostriatal neurons within striatal patches negatively regulate movement vigor, dependent on behavioral contexts where motivational valence differs.

      We are grateful for the reviewer’s thorough summary of our main findings.

      In my view, this study will add to the important literature by demonstrating how patch (striosomal) neurons in the striatum control movement vigor. This study has applied multiple approaches to investigate their functionality in locomotor behavior, and the obtained data largely support their conclusions. Nevertheless, I have some suggestions for improvements in the manuscript and figures regarding their data interpretation, accuracy, and efficacy of data presentation.

      We appreciate the reviewer’s overall positive assessment and have made substantial improvements to the revised manuscript in response to reviewers’ constructive suggestions. 

      (1) The authors found that the activation of the striatonigral pathway in the patch compartment suppresses locomotor speed, which contradicts the canonical role of the direct pathway. It would be great if the authors could provide mechanistic explanations in the Discussion section. One possibility is that striatal D1R patch neurons directly inhibit dopaminergic cells that regulate movement vigor (Nadel et al., Sci. Rep., 2021; Okunomiya et al., J Neurosci., 2025). Providing plausible explanations will help readers infer possible physiological processes and give them ideas for future follow-up studies.

      We have added the recommended data interpretation and future perspectives on Page 30 of the revised Discussion section. Specifically, we wrote:

      “Potential mechanisms by which striatal patchy neurons reduce locomotion involve the suppression of dopamine availability within the striatum. Dopamine, primarily supplied by neurons in the SNc and VTA, broadly facilitates locomotion (Gerfen and Surmeier 2011, Dudman and Krakauer 2016). Recent studies have shown that direct activation of patchy neurons leads to a reduction in striatal dopamine levels, accompanied by decreased walking speed (Nadel, Pawelko et al. 2021, Dong, Wang et al. 2025, Okunomiya, Watanabe et al. 2025). Patchy neuron projections terminate in structures known as “dendron bouquets”, which enwrap SNc dendrites within the SNr and can pause tonic dopamine neuron firing (Crittenden, Tillberg et al. 2016, Evans, Twedell et al. 2020). The present work highlights a role for patchy striatonigral inputs within the SN in decelerating movement, potentially through GABAergic dendron bouquets that limit dopamine release back to the striatum (Dong, Wang et al. 2025). Additionally, intrastriatal collaterals of patch spiny projection neurons (SPNs) have been shown to suppress dopamine release and associated synaptic plasticity via dynorphin-mediated activation of kappa opioid receptors on dopamine terminals (Hawes, Salinas et al. 2017). This intrastriatal mechanism may further contribute to the reduction in striatal dopamine levels and the observed decrease in locomotor speed, representing a compelling avenue for future investigation.”

      (2) On page 14, Line 301, the authors stated that "Cre-dependent mCheery signals were colocalized with the patch marker (MOR1) in the dorsal striatum (Fig. 1B)". But I could not find any mCherry on that panel, so please modify it.

      We have included representative images of mCherry and MOR1 staining in Supplementary Fig. S1 of the revised manuscript.

      (3) From data shown in Figure 1, I've got the impression that mice ablated with striatal patch neurons were generally hyperactive, but this is probably not the case, as two separate experiments using LLbox and DDbox showed no difference in locomotor vigor between control and ablated mice. For the sake of better interpretation, it may be good to add a statement in Lines 365-366 that these experiments suggest the absence of hyperactive locomotion in general by ablating these specific neurons.

      As suggested by the reviewer, we have added the following statement on Page 17 of the revised manuscript: “These data also indicate that PA elevates valence-specific speed without inducing general hyperactivity”.

      (4) In Line 536, where Figure 5A was cited, the authors mentioned that they used inhibitory DREADDs (AAV-DIO-hM4Di-mCherry), but I could not find the associated data in Figure 5. Please cite Figure S3 accordingly.

      We have added the citation for what is now Fig. S4 on Page 25 of the revised manuscript.

      (5) Personally, the Figure panel labels of "Hi" and "ii" were confusing at first glance. It would be better to have alternatives.

      As suggested by the reviewer, we have now labeled each figure panel with a distinct single alphabetical letter.

      (6) There is a typo on Figure 4A: tdTomata → tdTomato

      We have made the correction on the figure.

      Reviewer #3 (Public review):

      Hawes et al. combined behavioral, optical imaging, and activity manipulation techniques to investigate the role of striatal patch SPNs in locomotion regulation. Using Sepw1-Cre transgenic mice, they found that patch SPNs encode locomotion deceleration in a light-dark box procedure through optical imaging techniques. Moreover, genetic ablation of patch SPNs increased locomotion speed, while chemogenetic activation of these neurons decreased it. The authors concluded that a subtype of patch striatonigral neurons modulates locomotion speed based on external environmental cues. Below are some major concerns:

      The study concludes that patch striatonigral neurons regulate locomotion speed. However, unless I missed something, very little evidence is presented to support the idea that it is specifically striatonigral neurons, rather than striatopallidal neurons, that mediate these effects. In fact, the optogenetic experiments shown in Fig. 6 suggest otherwise. What about the behavioral effects of optogenetic stimulation of striatonigral versus striatopallidal neuron somas in Sepw1-Cre mice?

      Our photometry data implicate striatonigral neurons in locomotor slowing, as evidenced by a negative cross-correlation with acceleration and a negative lag, indicating that their activity reliably precedes—and may therefore contribute to—deceleration. In contrast, photometry results from striatopallidal neurons showed no clear correlation with speed or acceleration.

      Figure 6 demonstrates that optogenetic manipulation within the SNr of Sepw1-Cre<sup>+</sup> striatonigral axons recapitulated context-dependent locomotor changes seen with Gq-DREADD activation of both striatonigral and striatopallidal Sepw1-Cre<sup>+</sup> cells in the dorsal striatum but failed to produce the broader locomotor speed change observed when targeting all Sepw1-Cre<sup>+</sup> cells in the dorsal striatum using either ablation or Gq-DREADD activation. The more subtle speed-restrictive phenotype resulting from ChR activation in the SNr could, as the reviewer suggests, implicate striatopallidal neurons in broad locomotor speed regulation. However, our photometry data indicate that this scenario is unlikely, as activity of striatopallidal Sepw1-Cre<sup>+</sup> fibers is not correlated with locomotor speed. Another plausible explanation is that the optogenetic approach may have affected fewer striatonigral fibers, potentially due to the limited spatial spread of light from the optical fiber within the SNr. Broad locomotor speed change in LDbox might require the recruitment of a larger number of striatonigral fibers than we were able to manipulate with optogenetics. We have added discussion of these technical limitations to the revised manuscript. Additionally, we now discuss the possibility that intrastriatal collaterals may contribute to reduced local dopamine levels by releasing dynorphin, which acts on kappa opioid receptors located on dopamine fibers (Hawes, Salinas et al. 2017), thereby suppressing dopamine release.

      The reviewer also suggests an interesting experiment involving optogenetic stimulation of striatonigral versus striatopallidal somata in Sepw1-Cre mice. While we agree that this approach would yield valuable insights, we have thus far been unable to achieve reliable results using retroviral vectors. Moreover, selectively targeting striatopallidal terminals optogenetically remains technically challenging, as striatonigral fibers also traverse the pallidum, and the broad anatomical distribution of the pallidum complicates precise targeting. This proposed work will need to be pursued in a future study, either with improved retrograde viral tools or the development of additional mouse lines that offer more selective access to these neuronal populations as we documented recently (Dong, Wang et al. 2025).

      In the abstract, the authors state that patch SPNs control speed without affecting valence. This claim seems to lack sufficient data to support it. Additionally, speed, velocity, and acceleration are very distinct qualities. It is necessary to clarify precisely what patch neurons encode and control in the current study.

      We believe the reviewer’s interpretation pertains to a statement in the Introduction rather than the Abstract: “Our findings reveal that patchy SPNs control the speed at which mice navigate the valence differential between high- and low-anxiety zones, without affecting valence perception itself.” Throughout our study, mice consistently preferred the dark zone in the Light/Dark box, indicating intact perception of the valence differential between illuminated areas. While our manipulations altered locomotor speed, they did not affect time spent in the dark zone, supporting the conclusion that valence perception remained unaltered. We appreciate the reviewer’s insight and agree it is an intriguing possibility that locomotor responses could, over time, influence internal states such as anxiety. We addressed this in the Discussion, noting that while dark preference was robust to our manipulations, future studies are warranted to explore the relationship between anxious locomotor vigor and anxiety itself.

      We report changes in scalar measures of animal speed across Light/Dark box conditions and under various experimental manipulations. Separately, we show that activity in both patchy neuron somata and striatonigral fibers is negatively correlated with acceleration—indicating a positive correlation with deceleration. Notably, the direction of the cross-correlational lag between striatonigral fiber activity and acceleration suggests that this activity precedes and may causally contribute to mouse deceleration, thereby influencing reductions in speed. To clarify this, we revised a sentence in the Results section: “Moreover, patchy neuron efferent activity at the SNr may causally contribute to deceleration, as indicated by the negative cross-correlational lag, thereby reducing animal speed.” We also updated the Discussion to read: “Together, these data specifically implicate patchy striatonigral neurons in slowing locomotion by acting within the SNr to drive deceleration.”
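
      The lag argument above can be illustrated with a minimal sketch on synthetic data. This is not the authors' analysis pipeline: the signals, the 5-sample delay, the noise level, and the function name `lagged_cross_correlation` are all assumptions made for illustration. The convention used here is that the correlation at lag k pairs neural[t + k] with behavior[t], so a trough at a negative lag means the neural signal precedes the opposing change in acceleration.

```python
import numpy as np


def lagged_cross_correlation(neural, behavior, max_lag):
    """Approximate Pearson correlation of neural[t + lag] with behavior[t].

    Both signals are z-scored over their full length, then the mean product
    is taken over the overlapping samples at each lag. Under this convention
    an extremum at a negative lag means the neural signal leads the behavior.
    """
    neural = (neural - neural.mean()) / neural.std()
    behavior = (behavior - behavior.mean()) / behavior.std()
    n = neural.size
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.empty(lags.size)
    for i, lag in enumerate(lags):
        if lag >= 0:
            a, b = neural[lag:], behavior[: n - lag]
        else:
            a, b = neural[: n + lag], behavior[-lag:]
        corr[i] = np.mean(a * b)
    return lags, corr


# Hypothetical data: acceleration is an inverted copy of the neural signal,
# delayed by 5 samples, plus independent noise.
rng = np.random.default_rng(0)
delay = 5
neural_signal = rng.standard_normal(2000)
acceleration = -np.roll(neural_signal, delay) + 0.5 * rng.standard_normal(2000)

lags, corr = lagged_cross_correlation(neural_signal, acceleration, max_lag=20)
trough_lag = int(lags[np.argmin(corr)])  # lag of the most negative correlation
print(f"most negative correlation at lag {trough_lag} samples")
```

      A negative correlation whose trough sits at a negative lag is the pattern described in the response: activity rises first and deceleration follows. As the response itself hedges, temporal precedence is compatible with a causal contribution but does not establish one.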

      One of the major results relies on chemogenetic manipulation (Figure 5). It would be helpful to demonstrate through slice electrophysiology that hM3Dq and hM4Di indeed cause changes in the activity of dorsal striatal SPNs, as intended by the DREADD system. This would support both the positive (Gq) and negative (Gi) findings, where no effects on behavior were observed.

      We were unable to perform this experiment; however, hM3Dq has previously been shown to be effective in striatal neurons (Alcacer, Andreoli et al. 2017). The lack of effect observed in Gi-DREADD mice serves as an unintended but valuable control, helping to rule out off-target effects of the DREADD agonist JHU37160 and thereby reinforcing the specificity of hM3Dq-mediated activation in our study. We have now included an important caveat regarding the Gi-DREADD results, acknowledging the possibility that they may not have worked effectively in our target cells: “Potential explanations for the negative results in Gi-DREADD mice include inherently low basal activity among patchy neurons or insufficient expression of GIRK channels in striatal neurons, which may limit the effectiveness of Gi-coupling in suppressing neuronal activity (Shan, Fang et al. 2022).”

      Finally, could the behavioral effects observed in the current study, resulting from various manipulations of patch SPNs, be due to alterations in nigrostriatal dopamine release within the dorsal striatum?

      We agree that this is an important potential implication of our work, especially given that we and others have shown that patchy striatonigral neurons provide strong inhibitory input to dopaminergic neurons involved in locomotor control (Nadel, Pawelko et al. 2021, Lazaridis, Crittenden et al. 2024, Dong, Wang et al. 2025, Okunomiya, Watanabe et al. 2025). Accordingly, we have expanded the discussion section to include potential mechanistic explanations that support and contextualize our main findings.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Here are some minor issues for the authors' reference:

      (1) This work supports the motor-suppressing effect of patchy SPNs, and >80% of them are direct pathway SPNs. This conclusion is not expected from the traditional basal ganglia direct/indirect pathway model. Most experiments were performed using nonphysiological approaches to suppress (i.e., ablation) or activate (i.e., continuous chemo-optogenetic stimulation). It remains uncertain if the reported observations are relevant to the normal biological function of patchy SPNs under physiological conditions. Particularly, under what circumstances an imbalanced patch/matrix activity may be induced, as proposed in the sections related to the data presented in Figure 6. A thorough discussion and clarification remain needed. Or it should be discussed as a limitation of the present work.

      We have added discussion and clarification of physiological limitations in response to reviewer feedback. Additionally, we revised the opening sentence of an original paragraph in the discussion section to emphasize that it interprets our findings in the context of more physiological studies reporting natural shifts in patchy SPN activity due to cognitive conflict, stress, or training. The revised opening sentence now reads: “Together with previous studies of naturally occurring shifts in patchy neuron activation, these data illustrate ethologically relevant roles for a subgroup of genetically defined patchy neurons in behavior.”

      (2) Lines 499-500: How striatonigral cells encode speed and deceleration deserves a thorough discussion and clarification. These striatonigral cells can target both SNr GABAergic neurons and the dendrites of dopaminergic neurons. A discussion of the microcircuits formed by patchy SPN axons with SNr GABAergic and SNc DAergic neurons should be presented.

      We have added this point at lines 499–500, including a reference to a relevant review of microcircuitry. Additionally, we expanded the discussion section to address microcircuit mechanisms that may underlie our main findings.

      (3) Line 70: "BNST" should be spelled out at the first time it is mentioned.

      This has been done.

      (4) Line 133: only GCaMP6 was listed in the method, but GCaMP8 was also used (Figure 4). Clarification or details are needed.

      Thank you for your careful attention to detail. We have corrected the typographical errors in the Methods section. Specifically, in the Stereotaxic Injections section, we corrected “GCaMP83” to “GCaMP8s.” In the Fiber Implant section, we removed the incorrect reference to “GCaMP6s” and clarified that GCaMP8s was used for photometry, and hChR2 was used for optogenetics.

      (5) Line 183: Can the authors describe more precisely what "a moment" means in terms of seconds or minutes?

      This has been done.

      (6) Line 288: typo: missing / in ΔF.

      Thank you; this has been fixed.

      (7) Line 301-302: the statement of "mCherry and MOR1 colocalization" does not match the images in Figure 1B.

      This has been corrected by providing a new Supplementary Figure S1.

      (8) Related to the statement between Lines 303-304: Figure 1C data may reflect changes in MOR1 protein or cell loss. Quantification of NeuN+ neurons within the MOR1 area would strengthen the conclusion of 60% of patchy cell loss in Figure 1C.

      Since the efficacy of AAV-FLEX-taCasp3 in cell ablation has been well established in our previous publications and those of others (Yang, Chiang et al. 2013, Wu, Kung et al. 2019), we do not believe the observed loss of MOR1 staining in Fig. 1C merely reflects reduced MOR1 expression. Moreover, a general neuronal marker such as NeuN may not reliably detect the specific loss of patchy neurons in our ablation model, given the technical limitations of conventional cell-counting methods like MBF’s StereoInvestigator, which typically exhibit a variability margin of 15–20%.

      (9) Lines 313-314: "Similarly, PA mice demonstrated greater stay-time in the dark zone (Figure 1E)." Revision is needed to better reflect what is shown in Figure 1E and avoid misunderstandings.

      Thank you; this has been addressed.

      (10) The color code in Figure 2Gi seems inconsistent with the others? Clarifications are needed.

      Color coding in Figure 2Gi differs from that in 2Eii out of necessity. For example, the "Light" cells depicted in light blue in 2Eii are represented by both light gray and light red dots in 2Gi. Importantly, Figure 2G does not encode specific speed relationships; instead, any association with speed is indicated by a red hue.

      (11) Lines 538-539: the statement of "Over half of the patch was covered" was not supported by Figure 5C. Clarification is needed.

      Thank you. For clarity, we updated the x-axis labels in Figures 1C and 5C from “% area covered” to “% DS area covered,” and defined “DS” as “dorsal striatal” in the corresponding figure legends. Additionally, we revised the sentence in question to read: “As with ablation, histological examination indicated that a substantial fraction of dorsal patch territories, identified through MOR1 staining, were impacted (Fig. 5C).”

      (12) Figure 3: statistical significance in Figure 3 should be labeled in various panels.

      We believe the reviewer's concern pertains to the scatter plot in panel F—specifically, whether the data points are significantly different from zero. In panel 3F, the 95% confidence interval clearly overlaps with zero, indicating that the results are not statistically significant.
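
      The interval-based reasoning in this reply can be sketched as follows. The example is a hypothetical illustration, not the statistical procedure used in the manuscript: it assumes a percentile bootstrap for the mean, and the function names and sample values are invented. If the 95% interval contains zero, the estimate is not distinguishable from zero at the 0.05 level.

```python
import numpy as np


def bootstrap_ci_mean(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def interval_contains_zero(lower, upper):
    """True when zero lies inside the closed interval [lower, upper]."""
    return lower <= 0.0 <= upper


# Hypothetical samples: one centred on zero, one clearly shifted away from it.
centred = [-1.0, 1.0, -2.0, 2.0, -0.5, 0.5]
shifted = [2.0, 2.5, 3.0, 2.2, 2.8]

centred_ci = bootstrap_ci_mean(centred)
shifted_ci = bootstrap_ci_mean(shifted)
print("centred sample CI contains zero:", interval_contains_zero(*centred_ci))
print("shifted sample CI contains zero:", interval_contains_zero(*shifted_ci))
```

      Reporting the interval bounds together with the test statistic, as the editor's note requests, lets readers make this check without relying on significance markers in the figure.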

      (13) Figures 6D-E: The lack of a difference in speed between control mice and ChR2 mice under continuous optical stimulation was unexpected. It differs from the Gq-DREADD study in Figure 5E-F. Clarification is needed.

      For mice undergoing constant ChR2 activation of Sepw1-Cre<sup>+</sup> SNr efferents, overall locomotor speed does not differ from controls. However, the BIL (bright-to-illuminated) effect on zone transitions is disrupted: activating Sepw1-Cre<sup>+</sup> fibers in the SNr blunts the typical increase in speed observed when mice flee from the light zone toward the dark zone. This impaired BIL-related speed increase upon exiting the light was similarly observed in the Gq-DREADD cohort. The reviewer is correct that this optogenetic manipulation within the SNr did not produce the more generalized speed reductions seen with broader Gq-DREADD activation of all Sepw1-Cre<sup>+</sup> cells in the dorsal striatum. A likely explanation is the difference in targeting—ChR2 specifically activates SNr-bound terminals, whereas Gq-DREADD broadly activates entire Sepw1-Cre<sup>+</sup> cells. Notably, many of the generalized speed profile changes observed with chemogenetic activation are opposite to those resulting from broad ablation of Sepw1-Cre<sup>+</sup> cells.

      The more subtle speed-restrictive phenotype observed with ChR2 activation targeted to the SNr may suggest that fewer striatonigral fibers were affected by this technique, possibly due to the limited spread of light from the fiber optic. Broad locomotor speed change in LDbox might require the recruitment of a larger number of striatonigral fibers than we were able to manipulate with an optogenetic approach. Alternatively, it could indicate that non-striatonigral Sepw1-Cre<sup>+</sup> projections—such as striatopallidal or intrastriatal pathways—play a role in more generalized slowing. If striatopallidal fibers contributed to locomotor slowing, we would expect to see non-zero cross-correlations between neural activity and speed or acceleration, along with negative lag indicating that neural activity precedes the behavioral change. However, our fiber photometry data do not support such a role for Sepw1-Cre<sup>+</sup> striatopallidal fibers.

      We have also referenced the possibility that intrastriatal collaterals could suppress striatal dopamine levels, potentially explaining the stronger slowing phenotype observed when the entire striatal population is affected, as opposed to selectively targeting striatonigral terminals.

      These technical considerations and interpretive nuances have been incorporated and clarified in the revised discussion section.

      (14) Line 632: "compliment": a typo?

      Yes, it should be “complement”.

      (15) Figure 4 legend: descriptions of panels A and B were swapped.

      Thank you. This has been corrected.

      (16) Friedman (2020) was listed twice in the bibliography (Lines 920-929).

      Thank you. This has been corrected.

      Reviewer #3 (Recommendations for the authors):

      It will be helpful to label and add figure legends below each figure.

      Thank you for the suggestion.

      Editor's note:

      Should you choose to revise your manuscript, if you have not already done so, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and, where appropriate, 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05 in the main manuscript. We noted some instances where only p values are reported.

      Readers would also benefit from coding individual data points by sex and noting N/sex.

      We have included detailed statistical information in the revised manuscript. Both male and female mice were used in all experiments in approximately equal numbers. Since no sex-related differences were observed, we did not report the number of animals by sex.

      References

      Alcacer, C., L. Andreoli, I. Sebastianutto, J. Jakobsson, T. Fieblinger and M. A. Cenci (2017). "Chemogenetic stimulation of striatal projection neurons modulates responses to Parkinson's disease therapy." J Clin Invest 127(2): 720-734.

      Crittenden, J. R., P. W. Tillberg, M. H. Riad, Y. Shima, C. R. Gerfen, J. Curry, D. E. Housman, S. B. Nelson, E. S. Boyden and A. M. Graybiel (2016). "Striosome-dendron bouquets highlight a unique striatonigral circuit targeting dopamine-containing neurons." Proc Natl Acad Sci U S A 113(40): 11318-11323.

      Dong, J., L. Wang, B. T. Sullivan, L. Sun, V. M. Martinez Smith, L. Chang, J. Ding, W. Le, C. R. Gerfen and H. Cai (2025). "Molecularly distinct striatonigral neuron subtypes differentially regulate locomotion." Nat Commun 16(1): 2710.

      Dudman, J. T. and J. W. Krakauer (2016). "The basal ganglia: from motor commands to the control of vigor." Curr Opin Neurobiol 37: 158-166.

      Evans, R. C., E. L. Twedell, M. Zhu, J. Ascencio, R. Zhang and Z. M. Khaliq (2020). "Functional Dissection of Basal Ganglia Inhibitory Inputs onto Substantia Nigra Dopaminergic Neurons." Cell Rep 32(11): 108156.

      Gerfen, C. R. and D. J. Surmeier (2011). "Modulation of striatal projection systems by dopamine." Annual review of neuroscience 34: 441-466.

      Hawes, S. L., A. G. Salinas, D. M. Lovinger and K. T. Blackwell (2017). "Long-term plasticity of corticostriatal synapses is modulated by pathway-specific co-release of opioids through kappa-opioid receptors." J Physiol 595(16): 5637-5652.

      Lazaridis, I., J. R. Crittenden, G. Ahn, K. Hirokane, T. Yoshida, A. Mahar, V. Skara, K. Meletis, K. Parvataneni, J. T. Ting, E. Hueske, A. Matsushima and A. M. Graybiel (2024). "Striosomes Target Nigral Dopamine-Containing Neurons via Direct-D1 and Indirect-D2 Pathways Paralleling Classic Direct-Indirect Basal Ganglia Systems." bioRxiv.

      Nadel, J. A., S. S. Pawelko, J. R. Scott, R. McLaughlin, M. Fox, M. Ghanem, R. van der Merwe, N. G. Hollon, E. S. Ramsson and C. D. Howard (2021). "Optogenetic stimulation of striatal patches modifies habit formation and inhibits dopamine release." Sci Rep 11(1): 19847.

      Okunomiya, T., D. Watanabe, H. Banno, T. Kondo, K. Imamura, R. Takahashi and H. Inoue (2025). "Striosome Circuitry Stimulation Inhibits Striatal Dopamine Release and Locomotion." J Neurosci 45(4).

      Shan, Q., Q. Fang and Y. Tian (2022). "Evidence that GIRK Channels Mediate the DREADD-hM4Di Receptor Activation-Induced Reduction in Membrane Excitability of Striatal Medium Spiny Neurons." ACS Chem Neurosci 13(14): 2084-2091.

      Wu, J., J. Kung, J. Dong, L. Chang, C. Xie, A. Habib, S. Hawes, N. Yang, V. Chen, Z. Liu, R. Evans, B. Liang, L. Sun, J. Ding, J. Yu, S. Saez-Atienzar, B. Tang, Z. Khaliq, D. T. Lin, W. Le and H. Cai (2019). "Distinct Connectivity and Functionality of Aldehyde Dehydrogenase 1a1-Positive Nigrostriatal Dopaminergic Neurons in Motor Learning." Cell Rep 28(5): 1167-1181 e1167.

      Yang, C. F., M. C. Chiang, D. C. Gray, M. Prabhakaran, M. Alvarado, S. A. Juntti, E. K. Unger, J. A. Wells and N. M. Shah (2013). "Sexually dimorphic neurons in the ventromedial hypothalamus govern mating in both sexes and aggression in males." Cell 153(4): 896-909.

    1. eLife Assessment

      This important study is the first characterization of the phenotype caused by a lack of Eml3 expression in mice. Mutant animals present a disrupted pial basement membrane, leading to focal extrusions from the cerebral cortex, called ectopias. The methodology is convincing and the conclusions are solid, although further investigations on the mechanisms and inclusion of the experiments performed, but not presented, will improve the manuscript. This work would be of interest to neural development biologists and human geneticists working on brain disorders.

    2. Reviewer #1 (Public review):

      Summary:

      The paper describes the initial characterization of Eml3 knockout mice. Eml3 global inactivation leads to delayed embryonic development, perinatal lethality apparently due to failure to inflate lungs, and a cobblestone brain-like phenotype represented by focal neuronal ectopias in the marginal zone or subarachnoid space of dorsal telencephalon. The neural ectopias are associated with interruptions in the pial basal membrane (PBM), which appear around E11.5. The authors also confirmed previously described protein interactions, using coIP-MS experiments of placenta and embryonic tissues (TUBB3, several 14-3-3 proteins, and DYNLL). The authors generated mice carrying a TQT86AAA homozygous mutation in EML3 (a motif required for EML3-DYNLL interactions) that were normal and showed no focal neuronal ectopias, indicating that this particular protein interaction is dispensable. The authors propose Eml3 knockout mice as a model of cobblestone brain malformation.

      Strengths:

      The brain phenotype described in this work is relevant for the neural development field and with potential clinical relevance. The initial phenotyping is appropriate but will require additional experiments to establish the cause of the failure to inflate the lungs. The study shows convincing data regarding the main characteristics of the brain phenotype and data supporting the timing when these abnormalities arise during development.

      Weaknesses:

      The study would benefit from clearer evidence and additional experiments that would help to establish the molecular and cellular mechanisms underlying the brain phenotype, the central topic of the work.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigate the role of the microtubule-binding protein EML3 during cortical development through the generation and characterization of an Eml3 mouse mutant. The authors focus mainly on the effects of EML3 loss on brain development, although Eml3 mouse mutants also present with developmental delay and growth restriction, and die perinatally due to respiratory distress caused by delayed maturation of the lungs. The main finding in the developing cortex is the presence of focal neuronal ectopias, which contain neurons from all cortical layers, as revealed by immunostaining. The authors use electron microscopy to show that ectopias seem to be caused by disruption to the pial basement membrane at early stages of development, which allows neurons to breach through it. To find a functional link between EML3 and the observed phenotype, studies are conducted that demonstrate expression of EML3 in radial glia cells and mesenchymal cells, both cell types involved in the formation and maintenance of the pial basement membrane. Furthermore, interaction partners for EML3 are identified through coIP-MS analysis, including tubulin beta-3, 14-3-3 proteins, and cytoplasmic dynein light chain. However, mice carrying a mutant EML3 allele engineered to abolish the interaction between EML3 and cytoplasmic dynein light chain do not recapitulate any of the symptoms of complete EML3 loss.

      Strengths:

      The manuscript offers several important strengths that contribute significantly to the field. This study presents the first characterization of Eml3 knockout animals, providing novel insights into the role of Eml3 in vivo. Information on Eml3 function so far was restricted to cell culture data, so the results in this manuscript start to fill an important gap in our knowledge about this microtubule-binding protein. The experimental approach is carefully designed, with appropriate controls that ensure the reliability of the data. Moreover, the authors have addressed a key challenge in the analysis, namely the developmental delay of the knockout animals. By implementing a strategy to match developmental stages between wild-type and knockout groups, they allow for meaningful and valid comparisons between the two genotypes. Importantly, the authors have successfully generated three different Eml3 mutant mouse lines (knockout, floxed, and with disrupted binding to cytoplasmic dynein light chain), which are very valuable tools for the broader scientific community to further study the roles of this gene in development and disease in the future.

      Weaknesses:

      While the manuscript presents valuable data, there are also several weaknesses that limit the overall impact of the study. Most notably, there is no clear mechanistic link established between the loss of Eml3 function and the observed phenotype, leaving the biological significance of the findings somewhat speculative, as it is not straightforward how a microtubule-associated protein can have an impact on the stability of the pial basement membrane. In this respect, but also in general for the whole manuscript, there seems to be a considerable amount of experimental work that has been conducted but is not presented, possibly due to the negative nature of the results. At least some of those results could be shown, particularly (but not only) the stainings for the composition of the ECM components. Additionally, the phenotype reported appears to be dependent on the genetic background, as it is absent in the CD1 strain. This observation raises concerns as to how robust the results are and how much they can be generalized to other mouse strains, but, more importantly, to humans. There is no data included in the manuscript about the generation and analysis of the Eml3AAA/AAA mouse line. This is an important omission, especially as no details on the validation or phenotypic characterization of this additional mouse line are provided. Including these elements would greatly strengthen the rigor and interpretability of the work, especially if that mouse line is to be shared with the scientific community.

    4. Reviewer #3 (Public review):

      Summary:

      This work aims to understand the role of Echinoderm Microtubule-associated Protein-like 3 (EML3) in embryogenesis and neocortical development. Importantly, this work shows that depletion of EML3 causes focal neuronal ectopias by disrupting the structural integrity of the pial basement membrane, describing a new model of cobblestone brain malformation. Another member of the EML family, EML1, has already been shown to trigger neuronal migration disorders, particularly subcortical band heterotopia, by affecting cell polarity. The results presented here point to a different mechanism of action. The authors show that EML3 is expressed in radial glia cells and mesenchymal cells in the pial region, and upon EML3 depletion (i.e., Eml3 mutant mice), the pial basement membrane is structurally damaged, allowing migrating neuroblasts to ectopically migrate through. This indicates that, in this case, the weakening of the pial basement membrane is a prerequisite for focal neuronal ectopias. The authors provide a meticulous characterization of the Eml3 mutant mice, strengthening the conclusions of the results.

      Strengths:

      The authors provide a very detailed analysis of the defects observed in Eml3 mutant mice, by providing not only results by inferred day of conception but also by classifying embryos by their number of somite pairs.

      Weaknesses:

      (1) Besides the data provided in the figures, the authors report a significant amount of experiments/results as "Data not shown". Negative data is still important data to report, and the authors may want to choose some crucial "not shown data" to report in the manuscript.

      (2) Results in Figure 3A apparently contradict results in 3B. A better explanation of the results should improve understanding of the data. Even though the conclusion that the "onset and progression of neurogenesis is normal in Eml3 null mice" seems logical based on the data, the final numbers are not (Figure 3A) and this should be acknowledged, as well.

      (3) The authors should define which cell types are identified by SOX1 and PAX6.

    5. Author response:

      Reviewer #1 (Public Review):

      The study would benefit from clearer evidence and additional experiments that would help to establish the molecular and cellular mechanisms underlying the brain phenotype, the central topic of the work.

      We agree that additional experiments are necessary to elucidate the mechanism(s) by which EML3 deficiency causes the observed developmental phenotypes. However, as no further experimentation is possible due to the closure of our laboratory, we are committed to sharing available materials—including custom antibodies and cryopreserved sperm from our mouse lines. We will include previously generated experimental data not presented in the original submission. While these additional data do not reveal the mechanisms, we believe that sharing hypotheses that were experimentally ruled out will benefit the scientific community.

      Reviewer #2 (Public Review):

      While the manuscript presents valuable data, there are also several weaknesses that limit the overall impact of the study. Most notably, there is no clear mechanistic link established between the loss of Eml3 function and the observed phenotype, leaving the biological significance of the findings somewhat speculative, as it is not straightforward how a microtubule-associated protein can have an impact on the stability of the pial basement membrane. In this respect, but also in general for the whole manuscript, there seems to be a considerable amount of experimental work that has been conducted but is not presented, possibly due to the negative nature of the results. At least some of those results could be shown, particularly (but not only) the stainings for the composition of the ECM components.

      We agree that additional experiments are necessary to elucidate the mechanisms at play. While we cannot conduct further experiments, we will include additional existing data, including supplemental ECM component staining, in a new figure or panel. As this reviewer rightly anticipated, these results might not clarify the mechanism but sharing the hypotheses that were already experimentally tested will be helpful.

      Additionally, the phenotype reported appears to be dependent on the genetic background, as it is absent in the CD1 strain. This observation raises concerns as to how robust the results are and how much they can be generalized to other mouse strains, but, more importantly, to humans.

      Indeed, we have determined that genetic background greatly influences the manifestation of developmental defects caused by absence or mutation of the EML3 protein in mice. Modifier genes appear to play a significant role in phenotypic expression. In humans, the presence or absence of such modifiers may result in a broad spectrum of outcomes—from no clinical relevance, as seen in CD1 mice, to potential intrauterine mortality. We agree that this underscores the challenge of translating mouse model findings to human implications. Future studies could include a search for EML3 non-coding regulatory mutations and expanded analysis of neuronal development defects, such as COB, as well as cases of intrauterine growth restriction (IUGR).

      There is no data included in the manuscript about the generation and analysis of the Eml3AAA/AAA mouse line. This is an important omission, especially as no details on the validation or phenotypic characterization of this additional mouse line are provided. Including these elements would greatly strengthen the rigor and interpretability of the work, especially if that mouse line is to be shared with the scientific community.

      We acknowledge this oversight and will add a Materials and Methods section describing the generation of Eml3 TQT86AAA mice as well as validation and phenotypic characterizations that were done for that mouse line.

      Reviewer #3 (Public Review):

      Besides the data provided in the figures, the authors report a significant amount of experiments/results as "Data not shown". Negative data is still important data to report, and the authors may want to choose some crucial "not shown data" to report in the manuscript.

      We will incorporate key datasets previously omitted, with priority given to those requested by Reviewer #2.

      Results in Figure 3A apparently contradict results in 3B. A better explanation of the results should improve understanding of the data. Even though the conclusion that the "onset and progression of neurogenesis is normal in Eml3 null mice" seems logical based on the data, the final numbers are not (Figure 3A) and this should be acknowledged, as well.

      We will provide further explanations for the data presented in Figures 3A and 3B to better convey that the two datasets do not contradict each other. In essence, since Eml3 null mice are developmentally delayed (as determined by the number of somites at a specific age, Fig. 1C), the milestones in neurogenesis are reached at a later age in Eml3 null mice (Fig. 3A). However, Eml3 null mice have reached the same neurogenesis milestones as their WT counterparts when they have the same number of somites (Fig. 3B).

      The authors should define which cell types are identified by SOX1 and PAX6.

      We will expand our manuscript to define the expression timing and cell identity marked by SOX1 and PAX6 in neural progenitors during cortical development.

    1. eLife Assessment

      This paper presents the important finding that BNIP3/NIX, a mitophagy receptor, and its binding to ATG18 are required for mitophagy during muscle cell reorganization in Drosophila. Although the involvement of the BNIP3-ATG18/WIPI axis in mitophagy induction has been reported in mammalian cell culture systems, this study provides the first compelling evidence for this pathway in vivo in animals. The physiological significance of this BNIP3-dependent mitophagy will require further investigation.

    2. Reviewer #1 (Public review):

      During early Drosophila pupal development, a subset of larval abdominal muscles (DIOMs) is remodelled using an autophagy dependent mechanism.

      To better understand this not very well studied process, the authors have generated a systematic transcriptomics time course using dissected larval abdominal muscles of various stages from wild type and autophagy deficient mutants. The authors have further identified a function for BNIP3 for executing mitophagy during DIOM remodelling.

      Strengths:

      The paper does provide a detailed mRNA time course resource for the DIOM remodelling.

      The paper does find an interesting BNIP3 loss of function phenotype, a block of mitophagy during muscle remodelling and hence identifies a specific linker between mitochondria and the core autophagy machinery. This adds to the mechanism of how mitochondria are degraded.

      Sophisticated fly genetics demonstrates that the larval muscle mitochondria are, to a large extent, degraded by autophagy during DIOM remodelling.

      Quantitative electron microscopy data show that BNIP3 is required for initiating mito-phagosomes. It needs either its LIR or its MER domain for function.

      Weakness:

      Mitophagy during DIOM remodelling is not novel (earlier papers from Fujita et al.).

      Other weaknesses have been eliminated during the revision.

    3. Reviewer #2 (Public review):

      Summary:

      Autophagy (macroautophagy) is known to be essential for muscle function in flies and mammals. To date, many mitophagy (selective mitochondrial autophagy) receptors have been identified in mammals and other species. While loss of mitophagy receptors has been shown to impair mitochondrial degradation (e.g., OPTN and NDP52 in Parkin-mediated mitophagy and NIX and BNIP3 in hypoxia-induced mitophagy) at the level of cultured cells, it remains unclear, especially under physiological conditions in vivo. In this study, the authors revealed that one of the receptors BNIP3 plays a critical role in mitochondrial degradation during muscle remodeling in vivo.

      Overall, the manuscript provides solid evidence that BNIP3 is involved in mitophagy during muscle remodeling with in vivo analyses performed. In particular, all experiments in this study are well designed. The text is well written and the figures are very clear.

      Strengths:

      (1) In each experiment, appropriate positive and negative controls are used to indicate what is responsible for the phenomenon observed by the authors: e.g. FIP200, Atg18, Stx17 siRNAs during DIOM remodeling in Fig2 and Full, del-LIR, del-MER in Fig5.

      (2) Although the transcriptional dynamics of DIOM remodeling during metamorphosis is autophagy-independent, the transcriptome data obtained by the authors would be valuable for future studies.

      (3) In addition to the simple observation that loss of BNIP3 causes mitochondrial accumulation, the authors further observed that, by combining siRNA against STX17, which is required for fusion of autophagosomes with lysosomes, BNIP3 KO abolishes mitophagosome formation, which will provide solid evidence for BNIP3-mediated mitophagy. Furthermore, using a Gal80 temperature-sensitive approach, the authors showed that mitochondria derived from larval muscle, but not those synthesized during hypertrophy, remain in BNIP3 KO fly muscles.

      Weaknesses:

      (1) Because BNIP3 KO causes mitochondrial accumulation, it is expected that adult flies will have some physiological defects, but this has not been fully analyzed or sufficiently mentioned in the manuscript.

      (2) In Fig 5, the authors showed that BNIP3 binds to Atg18a by co-IP, but no data are provided on whether MER-mut or del-MER attenuates the affinity for Atg18a.

      Comments on revisions: The authors answered all the reviewer's concerns.

    4. Reviewer #3 (Public review):

      Summary:

      Fujita et al build on their earlier, 2017 eLife paper that showed the role of autophagy in the developmental remodeling of a group of muscles (DIOM) in the abdomen of Drosophila. Most larval muscles undergo histolysis during metamorphosis, while DIOMs are programmed to regrow after initial atrophy to give rise to temporary adult muscles, which survive for only 1 day after eclosion of the adult flies (J Neurosci. 1990;10:403-1. and BMC Dev Biol 16, 12, 2016). The authors carry out transcriptomics profiling of these muscles during metamorphosis, which are in agreement with the atrophy and regrowth phases of these muscles. Expression of the known mitophagy receptor BNIP3/NIX is high during atrophy, so the authors start to delve more into the role of this protein/mitophagy in their model. BNIP3 KO indeed impairs mitophagy and muscle atrophy, which they convincingly demonstrate via nice microscopy images. They also show that the already known Atg8a-binding LIR and Atg18a-binding MER motifs of human NIX are conserved in the Drosophila protein, although the LIR turned out to be less critical for in vivo protein function than the MER motif.

      Strengths:

      Established methodology, convincing data, in vivo model

      Weaknesses:

      Significance for Drosophila physiology and for human muscles remains to be established

    5. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary: 

      During early Drosophila pupal development, a subset of larval abdominal muscles (DIOMs) is remodelled using an autophagy-dependent mechanism. 

      To better understand this not very well studied process, the authors have generated a transcriptomics time course using dissected abdominal muscles of various stages from wild-type and autophagy-deficient mutants. The authors have further identified a function for BNIP3 in muscle mitophagy using this system. 

      Strengths: 

      (1) The paper does provide a detailed mRNA time course resource for DIOM remodeling. 

      (2) The paper does find an interesting BNIP3 loss of function phenotype, a block of mitophagy during muscle remodeling, and hence identifies a specific linker between mitochondria and the core autophagy machinery. This adds to the mechanism of how mitochondria are degraded. 

      (3) Sophisticated fly genetics demonstrates that the larval muscle mitochondria are, to a large extent, degraded by autophagy during DIOM remodeling. 

      Weaknesses: 

      (1) Mitophagy during DIOM remodeling is not novel (earlier papers from Fujita et al.). 

      (2) The transcriptomics time course data are not well connected to the autophagy part. Both could be separated into 2 independent manuscripts. 

      (3) The muscle phenotypes need better quantifications, both for the EM and light microscopy data in various figures. 

      (4) The transcriptomics data are hard to browse in the provided PDF format. 

      Thank you for reviewing our manuscript and for your feedback. While we understand and appreciate the suggestion to divide the manuscript into two separate studies, we believe that presenting the work as a single manuscript is more appropriate. This is because the time-course RNA-seq of DIOMs provides critical insight into BNIP3-mediated mitophagy during DIOM remodeling, which ties together the two components of our study. In response to Reviewer #1’s recommendations, we have quantified data from both EM and confocal images, and we have revised the RNA counts table in Supplementary File 1 accordingly. Please see our detailed responses and revisions on the following pages.

      Reviewer #2 (Public review): 

      Summary: 

      Autophagy (macroautophagy) is known to be essential for muscle function in flies and mammals. To date, many mitophagy (selective mitochondrial autophagy) receptors have been identified in mammals and other species. While the loss of mitophagy receptors has been shown to impair mitochondrial degradation (e.g., OPTN and NDP52 in Parkin-mediated mitophagy and NIX and BNIP3 in hypoxia-induced mitophagy) at the level of cultured cells, it remains unclear, especially under physiological conditions in vivo. In this study, the authors revealed that one of the receptors BNIP3 plays a critical role in mitochondrial degradation during muscle remodeling in vivo. 

      Overall, the manuscript provides solid evidence that BNIP3 is involved in mitophagy during muscle remodeling with in vivo analyses performed. In particular, all experiments in this study are well-designed. The text is well written and the figures are very clear. 

      Strengths: 

      (1) In each experiment, appropriate positive and negative controls are used to indicate what is responsible for the phenomenon observed by the authors: e.g. FIP200, Atg18, Stx17 siRNAs during DIOM remodeling in Figure 2 and Full, del-LIR, del-MER in Figure 5. 

      (2) Although the transcriptional dynamics of DIOM remodeling during metamorphosis is autophagy-independent, the transcriptome data obtained by the authors would be valuable for future studies. 

      (3) In addition to the simple observation that loss of BNIP3 causes mitochondrial accumulation, the authors further observed that, by combining siRNA against STX17, which is required for fusion of autophagosomes with lysosomes, BNIP3 KO abolishes mitophagosome formation, which will provide solid evidence for BNIP3-mediated mitophagy. Furthermore, using a Gal80 temperature-sensitive approach, the authors showed that mitochondria derived from larval muscle, but not those synthesized during hypertrophy, remain in BNIP3 KO fly muscles. 

      Weaknesses: 

      (1) Because BNIP3 KO causes mitochondrial accumulation, it is expected that adult flies will have some physiological defects, but this has not been fully analyzed or sufficiently mentioned in the manuscript. 

      (2) In Figure 5, the authors showed that BNIP3 binds to Atg18a by co-IP, but no data are provided on whether MER-mut or del-MER attenuates the affinity for Atg18a. 

      Thank you for pointing out the critical issues in the previous version of our manuscript. In this revision, we have conducted several physiological assays using BNIP3 KO flies, as well as co-IP experiments to confirm that deletion of the MER weakens the interaction with Atg18a. We have also addressed all the recommendations provided. Please see our detailed point-by-point responses below.

      Reviewer #3 (Public review): 

      Summary: 

      Fujita et al build on their earlier, 2017 eLife paper that showed the role of autophagy in the developmental remodeling of a group of muscles (DIOM) in the abdomen of Drosophila. Most larval muscles undergo histolysis during metamorphosis, while DIOMs are programmed to regrow after initial atrophy to give rise to temporary adult muscles, which survive for only 1 day after eclosion of the adult flies (J Neurosci. 1990;10:403-1. and BMC Dev Biol 16, 12, 2016). The authors carry out transcriptomics profiling of these muscles during metamorphosis, which is in agreement with the atrophy and regrowth phases of these muscles. Expression of the known mitophagy receptor BNIP3/NIX is high during atrophy, so the authors have started to delve more into the role of this protein/mitophagy in their model. BNIP3 KO indeed impairs mitophagy and muscle atrophy, which they convincingly demonstrate via nice microscopy images. They also show that the already known Atg8a-binding LIR and Atg18a-binding MER motifs of human NIX are conserved in the Drosophila protein, although the LIR turned out to be less critical for in vivo protein function than the MER motif. 

      Strengths: 

      Established methodology, convincing data, in vivo model. 

      Weaknesses: 

      The significance for Drosophila physiology and for human muscles remains to be established. 

      Thank you for reviewing our manuscript. In response to the comment, we have performed lifespan, adult locomotion, and eclosion assays in BNIP3 KO flies. Although we observed substantial mitochondrial accumulation in the DIOMs of BNIP3 KO flies, no significant differences were detected in these physiological assays under our experimental conditions. We plan to further investigate the physiological role of BNIP3 in flies and extend our studies to human muscle in future work. Please see our detailed responses below.

      Reviewer #1 (Recommendations for the authors): 

      Major points: 

      (1) Unfortunately, the RNA counts file table in Supplementary file 1 is a PDF and not an Excel sheet. The labelling makes it unclear from which time points and genotype the listed values on the 650-page files are. 

      We have now corrected the labelling of time points and genotypes in Supplementary File 1 to improve clarity and have provided the updated Excel file.

      Looking at these counts it seems that sarcomere genes (Mhc, bt, sls, wupA, TpnC ) are 10x to 100x lower in sample "ctrl_1" compared to the three other control samples. Which time point is that? It is essential to have access to the full dataset, wild type and autophagy-deficient, to be able to assess the quality of the RNA SEQ data. These need to be deposited in a public database or to be provided in a useful format. 

      Thank you for pointing that out. In the previous version, “Ctrl_1” referred to the Control sample at 1 day APF, when atrophy occurs. We have corrected the labeling in Supplementary File 1 accordingly and have deposited the RNA-seq data to GEO, where it is now publicly available (GSE293359).

      (2) Which statistical test was used to assess the differences in muscle volumes in Figure 2E? I was not able to find a table with the measured data.

      In Figure 2E, we used the Mann-Whitney test for statistical analysis. The raw data used for quantification have also been provided (Supplementary File 2).

      The shown volumes do not correlate with the scheme shown in Figure 2A, in particular at the larval stage the muscle seems much larger.

      We have revised the schematic models of muscle cells in Figures 1C and 2A in accordance with the reviewer’s suggestion.

      (3) It is important to remember that adult Drosophila muscles are not homogenous, at least not the adult leg and abdominal muscles, as they are organised as tubes with myofibrils closer to the surface, and nuclei as well as mitochondria largely in the centre (see PMID 33828099). Hence, only showing a single plane in the muscle images can be very misleading. The authors should at least provide virtual XZ-cross section views in Figure 3G to ensure that similar muscle planes are compared. This applies to the interpretation of both, the mitochondria and the myofibril phenotypes in wildtype vs BNIP3-KO. 

      Thank you for your comment. As suggested, we have added XZ-cross-sectional views in Figure 3G. The XY plane corresponds to a central section of the Z-stack, as indicated in the figure.

      (4) The EM images are nice, however only 2 of the 4 conditions shown were quantified. As the section plane can be misleading, at least several planes should be analysed also for wild type and BNIP3-KO, and not only for stx17 RNAi and the double mutant. 

      In response to the comment, we quantified the TEM images of wild-type and BNIP3-KO DIOMs and added the resulting graph to Figure 4C. The corresponding raw data have also been provided (Supplementary File 2).

      (5) How was Figure 5D, 5D' quantified? What corresponds to "regular", "medium", "high"? A statistical test is missing. I would rather conclude that MER and LIR are redundant, as the double mutant appears to be stronger than both singles. This is also concluded in some sections of the text, so the authors seem to contradict themselves. Why not measure the mitochondria areas as done in Figure 6A' instead? 

      In the previous version, we manually categorized pooled, blinded images from different genotypes. However, as the reviewer pointed out, this approach was not quantitative. In the revised version, we analyzed the images using ImageJ to quantify the mitochondrial area per cell. Statistical significance was assessed using the Kruskal-Wallis test. Accordingly, we have revised Figure 5D, the method section, and the figure legend.

      (6) Figure 6B data seem to come from a single image per genotype only. At least 3 or 4 animals should be measured and the values reported. 

      We analyzed Pearson’s correlation coefficients (R values) from at least five images per genotype and performed statistical analysis. The resulting quantification is presented in Figure 6B’, and the corresponding text has been revised accordingly.

      (7) As BNIP3 mutants are viable, it would be interesting to report if they can fly and how long they live. 

      Additional data on adult lifespan, climbing ability, and elapsed time for eclosion in BNIP3 KO flies have been included as supplemental information (Figure 3-figure supplement 2). No significant differences were observed in those assays under our experimental conditions.

      (8) The transcriptomics data are not well linked to the autophagy mechanism. In particular, the mutant transcriptomics data are confusing, as the abstract seems to suggest that blocking autophagy impacts transcriptomics, which is not (strongly) the case. I would at least re-write this part, as it is currently misleading and sparks wrong expectations to the reader. Also throughout the text, the authors need to make clear if there are transcriptomic changes or not and if there are, how these are linked to autophagy. 

      In the abstract, we described the findings as “transcriptional dynamics independent of autophagy” (line 49) because the loss of autophagy had only a minimal effect on transcriptional changes. This conclusion is supported by the data presented in our manuscript. In the Results section, we state: “In contrast to our prediction, the knockdown of Atg18a, FIP200, or Stx17 only had a slight impact on transcriptomic dynamics in DIOM remodeling (Fig. 2C), with only minor changes detected (Fig. 2-figure supplement 2G)” (lines 199-201). In the Discussion section, we further note: “The transcriptional dynamics associated with DIOM remodeling are largely independent of autophagy (Fig.2). Instead, our RNA-seq data suggest that it is regulated primarily by ecdysone signaling, with minimal influence from autophagy inhibition” (lines 326-328).

      (9) No table with the measured data is provided. 

      We have provided the raw data files corresponding to all quantified results as Supplementary File 2.

      Minor points: 

      (1) To my knowledge, it is standard to indicate the time after puparium formation in hours, instead of days, (e.g. 24h, 48h etc.). 

      Thank you for the comments. In our previous publications on DIOM remodeling during metamorphosis (PMID: 28063257 and 33077556), we used days rather than hours to indicate developmental time points. To maintain consistency across our studies, we have chosen to continue using days in the present manuscript.

      (2) "Myofibrils typically form beneath the sarcolemma (Mao et al., 2022; Sanger et al., 2010); therefore, when mitochondria accumulate, myofibrils are restricted to the cell periphery." This is quite a general statement that does not always hold, in particular not in Drosophila flight muscles and likely also not in abdominal muscles (see PMIDs 29846170, 28174246). 

      Thank you for pointing that out. We rewrote the sentence as follows: In the absence of BNIP3, mitochondria derived from the larval muscle accumulate and cluster in the cell center, physically obstructing myofibril formation during hypertrophy and restricting myofibrils to the cell periphery (Fig. 6E) (lines 392-394).

      Reviewer #2 (Recommendations for the authors): 

      Suggestions for improved or additional experiments, data or analyses. 

      The authors should test, by a co-IP experiment, whether BNIP3 mutants lose the interaction with HA-Atg18a. 

      As requested, we tested the effect of MER deletion on the interaction between BNIP3 and Atg18a in a co-IP experiment. As shown in the new Fig. 5C, the deletion of the MER weakened the interaction. This result was confirmed in three independent experiments. The corresponding text has also been revised as follows: “We confirmed that HA-tagged Drosophila Atg18a co-immunoprecipitated with GFP-tagged full-length Drosophila BNIP3, and that this interaction was attenuated by the deletion of the MER (residues 42-53) (Fig. 5C)” (lines 270-273).

      Minor corrections to the text and figures 

      (1) In the list of authors, Kawaguchi Kohei could be Kohei Kawaguchi. 

      Thank you very much. It has been corrected.

      (2) In Fig3D, other receptors (Zonda, CG12511, Key, Ref2P) should be mentioned briefly. 

      Thank you for the suggestion. We have revised the sentences as follows: “The time course RNA-seq data (Fig. 1 and 2) indicated that, among the known mitophagy regulators, only BNIP3 was robustly expressed in 1 d APF DIOMs. In contrast, Zonda, CG12511, Pink1, Park, Key, Ref(2)P, and IKKe—the Drosophila orthologs of FKBP8, FUNDC1, PINK1, Parkin, Optineurin, p62, and TBK1, respectively—showed little or undetectable expression at this stage (Fig. 3D).” (lines 230-234).

      Reviewer #3 (Recommendations for the authors): 

      Remarks: 

      (1) What is the consequence of impaired muscle remodeling on the organismal level? Is the eclosion of adult flies impaired? One could think of assays for this, such as quantifying failed eclosions and/or video microscopy of the eclosion process. Is muscle function impaired? One could measure the contractile force of isolated fibers during electrical stimulation as well, etc. I believe that showing the physiological importance of muscle remodeling would be the biggest advantage that could arise from using a complete animal model.

      We appreciate the comments. We have added data on adult lifespan, climbing ability, and the elapsed time for eclosion in BNIP3 KO flies as supplemental information (Figure 3-figure supplement 2). In BNIP3 KO DIOMs, despite the massive accumulation of mitochondria, an organized peripheral myofibril layer with contractile function is retained. However, we have not measured the contractile force of isolated muscle cells due to technical limitations. We plan to address this in future studies.

      A related note is that I missed the proper discussion of the function and fate of these short-lived adult muscles (please see references in my summary). 

      We have added a sentence regarding the function and fate of DIOMs in the introduction (lines 80-82) as follows: “The remodeled adult DIOMs function during eclosion, persist for approximately 12 hours, and are subsequently eliminated via programmed cell death (Kimura and Truman, 1990; J Neurosci. 1990;10:403-1)”.

      (2) I don't think that "data not shown" should be used these days, when supplemental data allow the inclusion of not-so-critical results. 

      We have added the data as Figure 5-figure supplement 2. As shown in the figure, overexpression of GFP-BNIP3 in 3IL BWMs did not induce the formation of tdTomato-positive autolysosomes, which are abundantly accumulated in DIOMs at 1 and 2 d APF.

      (3) The term "naked mitochondria" does not sound scientific enough to this reviewer. I suggest "cytosolic mitochondria" or "unengulfed mitochondria". 

      In accordance with the reviewer’s suggestion, we have replaced “naked mitochondria” with “unengulfed mitochondria” (lines 251 and 670).

    1. eLife Assessment

      Understanding how neural circuits mediate decision-making is a core problem in neuroscience. In this interesting and important work, the authors use detailed behavioral analysis and rigorous quantitative modeling to convincingly support the idea that the nematode C. elegans uses an "accept-reject" behavioral strategy, based on learned features of its environment, to make decisions upon encountering food patches. The work expands our understanding of the behavioral repertoire of this species, providing a foundation for future mechanistic studies in this powerful model system.

    2. Reviewer #1 (Public review):

      Summary:

      This work uses a novel, ethologically relevant behavioral task to explore decision-making paradigms in C. elegans foraging behavior. By rigorously quantifying multiple features of animal behavior as they navigate in a patch food environment, the authors provide strong evidence that worms exhibit one of three qualitatively distinct behavioral responses upon encountering a patch: (1) "search", in which the encountered patch is below the detection threshold; (2) "sample", in which animals detect a patch encounter and reduce their motor speed, but do not stay to exploit the resource and are therefore considered to have "rejected" it; and (3) "exploit", in which animals "accept" the patch and exploit the resource for tens of minutes. Interestingly, the probability of these outcomes varies with the density of the patch as well as the prior experience of the animal. Together, these experiments provide an interesting new framework for understanding the ability of the C. elegans nervous system to use sensory information and internal state to implement behavioral state decisions.

      Strengths:

      The work uses a novel, neuroethologically-inspired approach to studying foraging behavior

      The studies are carried out with an exceptional level of quantitative rigor and attention to detail

      Powerful quantitative modeling approaches including GLMs are used to study the behavioral states that worms enter upon encountering food, and the parameters that govern the decision about which state to enter

      The work provides strong evidence that C. elegans can make 'accept-reject' decisions upon encountering a food resource

      Accept-reject decisions depend on the quality of the food resource encountered as well as on internally represented features that provide measurements of multiple dimensions of internal state, including feeding status and time.

    3. Reviewer #2 (Public review):

      This study provides an experimental and computational framework to examine and understand how C. elegans make decisions while foraging in environments with patches of food. The authors show that C. elegans reject or accept food patches depending on a number of internal and external factors.

      The key novelty of this paper is the explicit demonstration of behavior analysis and quantitative modeling to elucidate decision-making processes. Particularly notable are the description of the exploring vs. exploiting phases and the sensing vs. non-sensing categories of foraging behavior, based on the clustering of behavioral states defined in a multi-dimensional behavior-metrics space, and the implementation of a generalized linear model (GLM) whose parameters can provide quantitative biological interpretations.

      The work builds on the literature of C. elegans foraging by adding the reject/accept framework.

    4. Reviewer #3 (Public review):

      Summary:

      In this study by Haley et al, the authors investigated explore-exploit foraging using C. elegans as a model system. Through an elegant set of patchy environment assays, the authors built a GLM based on past experience that predicts whether an animal will decide to stay on a patch to feed and exploit that resource, instead of choosing to leave and explore other patches.

      Strengths:

      I really enjoyed reading this paper. The experiments are simple and elegant, and address fundamental questions of foraging theory in a well-defined system. The experimental design is thoroughly vetted, and the authors provide a considerable volume of data to prove their points.

      Weakness:

      History-dependence of the GLM. The logistic GLM seems like a logical way to model a binary choice, and I think the parameters you chose are certainly important. However, the framing of them seems odd to me. I do not doubt the animals are assessing the current state of the patch with an assessment of past experience; that makes perfect logical sense. However, it seems odd to reduce past experience to the categories of recently exploited patch, recently encountered patch, and time since last exploitation. This implies the animals have some way of discriminating these past patch experiences and committing them to memory. Also, it seems logical that the time on these patches, not just their density, should also matter, just as the time without food matters. Time is inherent to memory. This model also imposes a prior categorization in trying to distinguish between sensed vs. not-sensed patches, which I criticized earlier. Only "sensed" patches are used in the model, but it is questionable whether worms genuinely do not "sense" these patches.
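
      For readers unfamiliar with the model class, the following is a minimal sketch of a logistic GLM for a binary accept/reject decision. The covariate names, the coefficient values, and the function `accept_probability` are hypothetical and chosen only for illustration; they are not the authors' fitted parameters or code.

```python
import numpy as np


def accept_probability(current_density, last_exploited_density,
                       last_encountered_density, minutes_off_food, beta):
    """Logistic GLM: P(exploit) = sigmoid(beta . [1, covariates])."""
    # Design vector with a leading 1 for the intercept term.
    x = np.array([1.0, current_density, last_exploited_density,
                  last_encountered_density, minutes_off_food])
    return 1.0 / (1.0 + np.exp(-(x @ beta)))


# Hypothetical coefficients for illustration only (not fitted values).
beta = np.array([-2.0, 1.5, -0.8, -0.4, 0.05])

# Same history, two different current patch densities.
p_dense = accept_probability(2.0, 1.0, 1.0, 10.0, beta)
p_dilute = accept_probability(0.5, 1.0, 1.0, 10.0, beta)
```

      With these assumed signs, a denser current patch raises the probability of exploitation, while richer recent experience lowers it. Each coefficient reads as a change in the log-odds of accepting the patch per unit of its covariate.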

      It seems more likely the worm simply has some memory of chemosensation and relative satiety, both of which increase on patches, and decrease while off of patches. The magnitudes are likely a function of patch density. That being said, I leave it up to the reader to decide how best to interpret the data.
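
      The alternative described above can be sketched as a single continuous memory variable that rises on patches and decays off them. This is only an illustration of that idea under assumed dynamics: the function name, the time constants `tau_up` and `tau_down`, and the saturating dependence on density are all hypothetical and do not come from the manuscript.

```python
def simulate_satiety(on_patch, density, tau_up=5.0, tau_down=20.0, dt=1.0):
    """Return a satiety trace given a sequence of on-patch booleans."""
    s = 0.0
    trace = []
    for on in on_patch:
        if on:
            # Rise toward a density-dependent ceiling while on food.
            target = density / (1.0 + density)
            s += dt * (target - s) / tau_up
        else:
            # Decay toward zero while off food.
            s += dt * (0.0 - s) / tau_down
        trace.append(s)
    return trace


# Ten time steps on a patch followed by ten time steps off it.
on_patch = [True] * 10 + [False] * 10
trace = simulate_satiety(on_patch, density=1.0)
```

      In such a formulation, time on the patch and time off food enter through the same variable, and the two time constants would be the testable timescales of the memory.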

      Impact:

      I think this work will have a solid impact on the field, as it provides tangible variables to test how animals assess their environment and decide to exploit resources. I think this research could be strengthened by a reassessment of their model that would both simplify it and provide testable timescales of satiety/starvation memory.

    5. Author Response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This work uses a novel, ethologically relevant behavioral task to explore decision-making paradigms in C. elegans foraging behavior. By rigorously quantifying multiple features of animal behavior as they navigate in a patch food environment, the authors provide strong evidence that worms exhibit one of three qualitatively distinct behavioral responses upon encountering a patch: (1) "search", in which the encountered patch is below the detection threshold; (2) "sample", in which animals detect a patch encounter and reduce their motor speed, but do not stay to exploit the resource and are therefore considered to have "rejected" it; and (3) "exploit", in which animals "accept" the patch and exploit the resource for tens of minutes. Interestingly, the probability of these outcomes varies with the density of the patch as well as the prior experience of the animal. Together, these experiments provide an interesting new framework for understanding the ability of the C. elegans nervous system to use sensory information and internal state to implement behavioral state decisions.

      Strengths:

      The work uses a novel, neuroethologically-inspired approach to studying foraging behavior

      The studies are carried out with an exceptional level of quantitative rigor and attention to detail

      Powerful quantitative modeling approaches including GLMs are used to study the behavioral states that worms enter upon encountering food, and the parameters that govern the decision about which state to enter

      The work provides strong evidence that C. elegans can make 'accept-reject' decisions upon encountering a food resource

      Accept-reject decisions depend on the quality of the food resource encountered as well as on internally represented features that provide measurements of multiple dimensions of internal state, including feeding status and time

      Reviewer #2 (Public review):

      This study provides an experimental and computational framework to examine and understand how C. elegans make decisions while foraging in environments with patches of food. The authors show that C. elegans reject or accept food patches depending on a number of internal and external factors.

      The key novelty of this paper is the explicit demonstration of behavior analysis and quantitative modeling to elucidate decision-making processes. In particular, the authors describe the exploring vs. exploiting phases and the sensing vs. non-sensing categories of foraging behavior, based on the clustering of behavioral states defined in a multi-dimensional behavior-metrics space, and implement a generalized linear model (GLM) whose parameters can provide quantitative biological interpretations.

      The work builds on the literature of C. elegans foraging by adding the reject/accept framework.

      Reviewer #3 (Public review):

      Summary:

      In this study by Haley et al, the authors investigated explore-exploit foraging using C. elegans as a model system. Through an elegant set of patchy environment assays, the authors built a GLM based on past experience that predicts whether an animal will decide to stay on a patch to feed and exploit that resource, instead of choosing to leave and explore other patches.

      Strengths:

      I really enjoyed reading this paper. The experiments are simple and elegant, and address fundamental questions of foraging theory in a well-defined system. The experimental design is thoroughly vetted, and the authors provide a considerable volume of data to prove their points. My only criticisms have to do with the data interpretation, which I think are easily addressable.

      Weaknesses:

      History-dependence of the GLM

      The logistic GLM seems like a logical way to model a binary choice, and I think the parameters you chose are certainly important. However, the framing of them seems odd to me. I do not doubt the animals are assessing the current state of the patch with an assessment of past experience; that makes perfect logical sense. However, it seems odd to reduce past experience to the categories of recently exploited patch, recently encountered patch, and time since last exploitation. This implies the animals have some way of discriminating these past patch experiences and committing them to memory. Also, it seems logical that the time on these patches, not just their density, should also matter, just as the time without food matters. Time is inherent to memory. This model also imposes a prior categorization in trying to distinguish between sensed vs. not-sensed patches, which I criticized earlier. Only "sensed" patches are used in the model, but it is questionable whether worms genuinely do not "sense" these patches.

      It seems more likely that the worm simply has some memory of chemosensation and relative satiety, both of which increase on patches and decrease while off of patches. The magnitudes are likely a function of patch density. That being said, I leave it up to the reader to decide how best to interpret the data.

      Model design: We agree with the reviewer that past experience is not likely to be discretized into the exact parameters of our model. We have added to our manuscript to further clarify this point (lines 645-647). Investigating the mechanisms behind this behavior is beyond the scope of this project but is certainly an exciting trajectory for future C. elegans research.

      osm-6

      The argument is that osm-6 animals can't sense food very well, so when they sense it, they enter the exploitation state by default. That is what they appear to do, but why? Clearly they are sensing the food in some other way, correct? Are ciliated neurons the only way worms can sense food? Don't they also actively pump on food, and can therefore sense the food entering their pharynx? I think you could provide further insight by commenting on this. Perhaps your decision model is dependent on comparing environmental sensing with pharyngeal sensing? Food intake certainly influences their decision, no? Perhaps food intake triggers exploitation behavior, which can be over-run by chemo/mechanosensory information?

      osm-6 behavior: We thank the reviewer for pointing out the need to further elaborate on a mechanistic hypothesis to explain the behavior of osm-6 sensory mutants. We agree with the reviewer’s speculation that post-ingestive and other non-ciliary sensory cues likely drive detection of food. We have added additional commentary to our manuscript to state this (lines 529-538).

      Impact

      I think this work will have a solid impact on the field, as it provides tangible variables to test how animals assess their environment and decide to exploit resources. I think this research could be strengthened by a reassessment of their model that would both simplify it and provide testable timescales of satiety/starvation memory.

      Reviewer #2 (Recommendations for the authors):

      The authors have addressed most of my concerns.

      Reviewer #3 (Recommendations for the authors):

      The authors provide a considerable amount of processed data (great, thank you!), but it would be even better if they provided the raw data of the worm coordinates, and when and where these coordinates overlapped with patches. This is the raw data that was ultimately used for all the quantifications in the paper, and would be incredibly useful to readers who are interested in modeling the data themselves.

      This should not be prohibitive.

      Data Availability: We thank the reviewer for pointing out this need. We are uploading all processed data (e.g. worm coordinates relative to the arena and patches) to a curated data storage server. We have updated our data availability statement to state this (lines 684-688).

      Search vs. sample & sensing vs. non-sensing.

      The different definitions of behaviors in Figures 2H-K are a bit confusing. I think the confusion stems in part from the changing terms and color associations in Figures 2H-K. Essentially the explore density in Figure 2H is split into two densities based on the two densities (sensing vs. non-responding) observed in Figure 2I. In turn, the sensing density in Figure 2I is split into two densities (explore vs. exploit) based on the two densities observed in Figure 2H. But the way the figures are colored, yellow means search (Figure 2H) and non-responding (Figure 2I), green means exploit (Figure 2H) which includes sensing and non-responding, but also exclusively sensing (Figure 2I), and blue consistently means exploit in both figures. It might help to use two different color codes for Figures 2H and 2I, and then in 2J you define search as explore AND non-responding, sample as explore AND sensing, and exploit as exploit.

      Color schema: While we understand the confusion, we believe that introducing additional colors may also present some misunderstandings. We have decided to leave the figure as it is.

    1. eLife Assessment

      This is a well-written study that presents a solid genetic screen to identify regulators of adipose morphology and remodeling in zebrafish. The authors generated a rigorous screening platform based on live, whole animal imaging and statistical methods that revealed both novel and known genes critical for adipose regulation. This work is valuable because it provides several candidate genes relevant to metabolic health and a quantitative screening pipeline that will be beneficial for future studies. A limitation of the study is that it precludes a definitive distinction between developmental and remodeling effects.

    2. Joint Public Review:

      In this manuscript, Wafer and Tandon et al. present a thoughtful and well-designed genetic screen for regulators of adipose remodeling using zebrafish as a model system. The authors cross-referenced several human adipocyte-related transcriptomic and genetic association datasets to identify candidate genes, which they then tested in zebrafish. Importantly, the authors devised an unbiased microscopy-based screening platform to document quantitative adipose phenotypes with whole animal imaging, while also employing rigorous statistical methods. From their screen, the authors identified 6 genes that resulted in robust adipose phenotypes out of a total of 25 that were tested. Overall, this work will be a useful resource for the field because of both the genes identified and the quantitative, rigorous screening pipeline. However, there are limitations that preclude a definitive distinction between developmental and remodeling effects that should be acknowledged and discussed, or addressed with new experiments.

      Strengths:

      (1) This work combines multiple omic datasets to identify candidate genes that informed a CRISPR-based screen to identify genes underlying adipose tissue development and adaptation. This approach offers a new avenue to improve our understanding and testing of new genetic mechanisms underlying the development of obesity.

      (2) Using a clever screening approach, this study identifies new genes that are associated with adipose tissue lipid droplet size change. Importantly, the study provides further validation using a stable CRISPR line to show the phenotype in basal and high-fat diet conditions.

      (3) The experiments are well-designed and rigorous. Sample sizes are large. Statistical analyses are highly rigorous, contributing to a high-quality study.

      Weaknesses:

      (1) The image quantification established in Figures 3 and 4 and used in CRISPR screening showed the relationship among zebrafish development, adipose tissue size, and lipid droplet size. Although adipose tissue development patterning is linked with adipose tissue adaptation, as shown by the evidence provided in this paper, it would be more powerful if the imaging method and pipeline were established to directly assess adipose tissue plasticity rather than just the developmental patterning. Furthermore, the authors should perform additional analysis of their existing data to more accurately determine lipid droplet size along the AP axis in response to HFD.

      (2) In the absence of tissue-specific manipulations, definitively establishing the mechanisms underlying the genetic regulation of adipose tissue physiology presents limitations.

    1. eLife Assessment

      This paper makes a valuable contribution to our understanding of the tradeoffs in eye design - specifically between improvements in optics and in photoreceptor performance. The authors successfully build a formal theory that enables comparisons across a wide range of species and eye types. One notable example is that how space should be allocated to optics and photoreceptors depends on eye type - with particularly notable differences between compound and simple eyes. The framework introduced to compare different design properties is convincing and provides a nice example of how to study tradeoffs in seemingly disparate design properties.

    2. Reviewer #1 (Public review):

      Summary:

      Two important factors in visual performance are the resolving power of the lens and the signal-to-noise ratio of the photoreceptors. These both compete for space: a larger lens has improved resolving power over a smaller one, and longer photoreceptors capture more photons and hence generate responses with lower noise. The current paper explores the tradeoff of these two factors, asking how space should be allocated to maximize eye performance (measured as encoded information).

      The revisions, to my read, have greatly improved the paper. Most of this was due to setting clear expectations from the start of the paper. Nice work!

    3. Reviewer #2 (Public review):

      Summary:

      In short, the paper presents a theoretical framework that predicts how resources should be optimally distributed between receptors and optics in eyes.

      After revision of an already excellent contribution, the manuscript is now even better. The authors have responded carefully to all reviewer comments.

      Strengths:

      The authors build on the principle of resource allocation within an organism and develop a formal theory for optimal distribution of resources within an eye between the receptor array and the optics. Because the two parts of eyes, receptor arrays and optics, share the same role of providing visual information to the animal it is possible to isolate these from resource allocation in the rest of the animal. This allows for a novel and powerful way of exploring the principles that govern eye design. By clever and thoughtful assumptions/constraints, the authors have built a formal theory of resource allocation between the receptor array and the optics for two major types of compound eye as well as for camera-type eyes. The theory is formalized with variables that are well characterized in a number of different animal eyes, resulting in testable predictions.

      The authors use the theory to explain a number of design features that depend on different optimal distribution of resources between the receptor array and the optics in different types of eye. As an example, they successfully explain why eye regions with different spatial resolution should be built in different ways. They also explain differences between different types of eye, such as long photoreceptors in apposition compound eyes and much shorter receptors in camera type eyes. The predictive power in the theory is impressive.

      To keep the number of parameters at a minimum, the theory was developed for two types of compound eye (neural superposition, and apposition) and for camera-type eyes. It is possible to extend the theory to other types of eye, although it would likely require more variables and assumptions/constraints to the theory. It is thus good to introduce the conceptual ideas without overdoing the applications of the theory.

      The paper extends a previous theory, developed by the senior author, that develops performance surfaces for optimal cost/benefit design of eyes. By combining this with resource allocation between receptors and optics, the theoretical understanding of eye design takes a major leap and provides entirely new sets of predictions and explanations for why eyes are built the way they are.

      The paper is well written and even though the theory development in the Results may be difficult to take in for many biologists, the Discussion very nicely lists all the major predictions under separate headings, and here the text is more tuned for readers that are not entirely comfortable with the formalism of the Results section. I must point out though that the Results section is kept exemplarily concise. The figures are excellent and help explain concepts that otherwise may go above the head of many biologists.

    4. Reviewer #3 (Public review):

      Summary:

      This is a proposal for a new theory for the geometry of insect eyes. The novel cost-benefit function combines the cost of the optical portion with the photoreceptor portion of the eye. These quantities are put on the same footing using a specific (normalized) volume measure, plus an energy factor for the photoreceptor compartment. An optimal information transmission rate then specifies each parameter and resource allocation ratio for a variable total cost. The elegant treatment allows for comparison across a wide range of species and eye types. Simple eyes are found to be several times more efficient across a range of eye parameters than neural superposition eyes. Some trends in eye parameters can be explained by optimal allocation of resources between the optics and photoreceptors compartments of the eye.

      Strengths:

      Data from a variety of species broadly align with rough trends in the cost analysis, e.g. as a function of expanding the length of the photoreceptor compartment.

      New data could be added to the framework once collected, and many species can be compared.

      Eyes of different shapes are compared.

      Weaknesses:

      Detailed quantitative conclusions are not possible given the approximations and simplifying assumptions in the models and weak accounting for trends in the data across eye types.

      Comments on revisions:

      I have no additional comments for the authors and appreciate the revisions and corrections implemented - I think those changes have improved the clarity of the manuscript and expanded the potential readership for the paper.

    5. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Two important factors in visual performance are the resolving power of the lens and the signal-to-noise ratio of the photoreceptors. These both compete for space: a larger lens has improved resolving power over a smaller one, and longer photoreceptors capture more photons and hence generate responses with lower noise. The current paper explores the tradeoff of these two factors, asking how space should be allocated to maximize eye performance (measured as encoded information).

      Your summary is clear, concise and elegant. The competition is not just for space; it is for space, materials and energy. We now emphasise that we are considering these three costs in our rewrites of the Abstract and the first paragraph of the Discussion.

      Strengths:

      The topic of the paper is interesting and not well studied. The approach is clearly described and seems appropriate (with a few exceptions - see weaknesses below). In most cases, the parameter space of the models is well explored and tradeoffs are clear.

      Weaknesses:

      Light level

      The calculations in the paper assume high light levels (which reduces the number of parameters that need to be considered). The impact of this assumption is not clear. A concern is that the optimization may be quite different at lower light levels. Such a dependence on light level could explain why the model predictions and experiment are not in particularly good agreement. The paper would benefit from exploring this issue.

      Thank you for raising this point. We briefly explained in our original Discussion, under “Understanding the adaptive radiation of eyes” (Version 1, lines 756–762), how our method can be modified to investigate eyes adapted for lower light levels. We have some thoughts on how eyes might be adapted. In general, transduction rates are increased by increasing D, reducing f, increasing d<sub>rh</sub> and increasing L. In addition, d<sub>rh</sub> is increased to allow for a larger D within the constraint of eye radius/corneal surface area, and to avoid wasteful oversampling (the changes in D, f and d<sub>rh</sub> increase acceptance angle ∆ρ). We suspect that in eyes optimised for the efficient use of space, materials and energy the increases in L will be relatively small. First, increasing D, reducing f and increasing d<sub>rh</sub> are much more effective at increasing transduction rate than increasing L. Second, increasing sensitivity by reducing f decreases the cost V<sub>o</sub> whereas increasing sensitivity by increasing L increases the cost V<sub>ph</sub>. This disadvantage, together with exponential absorption, might explain why L is only 10%–20% longer in the apposition eyes of nocturnal bees (Somanathan et al., J. Comp. Physiol. A 195, 571–583, 2009). Because this line of argument is speculative and enters new territory, we have not included it in our revised version. We already present a lot of new material for readers to digest, and we agree with referee 2 that “It is possible to extend the theory to other types of eyes, although it would likely require more variables and assumptions/constraints to the theory. It is thus good to introduce the conceptual ideas without overdoing the applications of the theory”. Nonetheless, we take your point that some of the eyes in our data set might be adapted for lower light levels, and we have rewritten the Discussion section “How efficiently do insects allocate resources within their apposition eyes” accordingly.
On lines 827–843 we address the assumption that eyes are adapted for full daylight, and also take the opportunity to mention two more reasons for increasing the eye parameter p, namely increasing image velocity (Snyder, 1979) and constructing bright zones that increase the detectability of small targets (van Hateren et al., 1989; Straw et al., 2006).

      Discontinuities

      The discontinuities and non-monotonicity of the optimal parameters plotted in Figure 4 are concerning. Are these a numerical artifact? Some discussion of their origin would be quite helpful.

      Good points. We now address the discontinuities in the Results, where they are first observed (lines 311–319).

      Discrepancies between predictions and experiment

      As the authors clearly describe, experimental measurements of eye parameters differ systematically from those predicted. This makes it difficult to know what to take away from the paper. The qualitative arguments about how resources should be allocated are pretty general, and the full model seems a complex way to arrive at those arguments. Could this reflect a failure of one of the assumptions that the model rests on - e.g. high light levels, or that the cost of space for photoreceptors and optics is similar? Given these discrepancies between model and experiment, it is also hard to evaluate conclusions about the competition between optics and photoreceptors (e.g. at the end of the abstract) and about the importance for evolution (end of introduction).

      Your misgivings boil down to two issues: what use is a model that fails to fit the data, and do we need a complicated model to show something that seems to be intuitively obvious? Our study is useful because it introduces new approaches, methods, factors and explanations which advance our analysis and understanding of eye design and evolution. Your comments make it clear that we failed to get this message across and we have revised the manuscript accordingly. We have rewritten the Abstract and the first paragraph of the Discussion to emphasise the value of our new measure of cost, specific volume, by including more of its practical advantages. In particular, our use of specific volume 1) opens the door to the morphospace of all eyes of given type and cost; 2) allows one to construct performance surfaces across morphospace that not only identify optima but, by evaluating the sub-optimal, cast light on efficiency and adaptability; 3) shows that photoreceptor energy costs have a major impact on design and efficiency; and 4) allows us to calculate and compare the capacities and efficiencies of compound eyes and simple eyes using a superior measure of cost. It is also possible that your dissatisfaction was deepened by disappointment. The first sentence of our original Abstract said that the goal of design is to maximize performance, so you might have expected to see that eyes are optimised. Given that optimization provides cast-iron proof that a system is designed to be efficient, and previous studies of coding by fly LMCs (Laughlin, 1981; Srinivasan et al., 1982; van Hateren, 1992) validated Barlow’s Efficient Coding Hypothesis by showing that coding is optimised, your expectation is reasonable. However, our investigation of how the allocation of resources to optics and photoreceptors affects an eye’s performance, efficiency and design does not depend a priori on finding optima; therefore we have removed “maximized”.
Our revised Abstract now says, “to improve performance”.

      In short, our study illustrates an old adage in statistics: “All models fail to fit, but some are useful”. As is often the case, the way in which our model fails is useful. In the original version of the Results and Discussion, we argued that the allocation of resources is efficient, and identified factors that can, in principle, explain the scattering of data points. Indeed, our modelling identifies two deficiencies: a lack of data on species-specific energy usage, and the need for models that quantify the relationship between the quality of the captured image and the behavioural tasks for which an eye might be specialised. Thus, by examining the model’s failings we identify critical factors and pose new questions for future research. We have rewritten the Discussion section “How efficiently do insects allocate resources…” to make these points. We hope that these revisions will convince you that we have established a starting point for definitive studies, invented a vehicle that has travelled far enough to discover new territory, and shown that it can be modified to cope with difficult terrain.

      Turning to the need for a complicated model, because the costs and benefits depend on elementary optics and geometry, we too thought that there ought to be a simple model. However, when we tried to formulate a simple set of equations that approximate the definitive findings of our more complicated model we discovered that this is not as straightforward as we thought.  Many of the parameters in our model interact to determine costs and benefits, and many of these interactions are non-linear (e.g. the volumes of shells in spheres involve quadratic and cubic terms, and information depends on the log of a square root). So, rather than hold back publication of our complicated model, we decided to explain how it works as clearly as we can and demonstrate its value.
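
      The two non-linearities mentioned above can be illustrated with a short sketch. It assumes a complete spherical shell for the photoreceptor layer and uses the generic Gaussian-channel form for information, the log of the square root of (1 + SNR). Both are simplifications for illustration, and the function names are hypothetical rather than taken from our model.

```python
import math


def shell_volume(outer_radius, thickness):
    """Volume of a spherical shell lying just inside outer_radius."""
    inner_radius = outer_radius - thickness
    # Expanding R^3 - (R - L)^3 gives 3R^2 L - 3R L^2 + L^3,
    # i.e. linear, quadratic and cubic terms in the thickness L.
    return (4.0 / 3.0) * math.pi * (outer_radius**3 - inner_radius**3)


def bits_per_sample(snr_power):
    """Generic Gaussian-channel information: log2(sqrt(1 + SNR))."""
    return math.log2(math.sqrt(1.0 + snr_power))


# Doubling the shell thickness less than doubles its volume,
# because the added layer sits at a smaller radius.
v_thin = shell_volume(1.0, 0.1)
v_thick = shell_volume(1.0, 0.2)

# Roughly doubling the SNR (3 to 7) adds only half a bit.
gain = bits_per_sample(7.0) - bits_per_sample(3.0)
```

      Even in this reduced form, cost grows as a cubic polynomial in photoreceptor length while benefit grows logarithmically in signal quality, which is why a simple closed-form approximation was not straightforward to obtain.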

      In response to your final comment, “it is hard to evaluate conclusions about the competition between optics and photoreceptors (e.g. at the end of the abstract) and about the importance for evolution (end of introduction)”, we stand by our original argument. There must be competition in an eye of fixed cost, and because competition favours a heavy investment in photoreceptors, both in theory and in practice, it is a significant factor in eye design. A match between investments in optics and photoreceptors is predicted by theory and observed in fly NS eyes, therefore this is a design principle. As for evolution, no one would deny that it is important to view the adaptive radiation of eyes through a cost-benefit lens. Our lens is the first to view the whole eye, optics and photoreceptor array, and the first to treat the costs of space, materials and energy. Although the view through our lens is a bit fuzzy, it reveals that costs, benefits and trade-offs are important. Thus we have established a promising starting point for a new and more comprehensive cost-benefit approach to understanding eye design and evolution. As for the involvement of genes, when there are heritable changes in phenotype genes must be involved and if, as we suggest, efficient resource allocation is beneficial, the developmental mechanisms responsible for allocating resources to optics and photoreceptor array will be playing a formative role in eye evolution.

      Reviewer #2 (Public Review):

      Summary:

      In short, the paper presents a theoretical framework that predicts how resources should be optimally distributed between receptors and optics in eyes.

      Strengths:

      The authors build on the principle of resource allocation within an organism and develop a formal theory for optimal distribution of resources within an eye between the receptor array and the optics. Because the two parts of eyes, receptor arrays and optics, share the same role of providing visual information to the animal it is possible to isolate these from resource allocation in the rest of the animal. This allows for a novel and powerful way of exploring the principles that govern eye design. By clever and thoughtful assumptions/constraints, the authors have built a formal theory of resource allocation between the receptor array and the optics for two major types of compound eye as well as for camera-type eyes. The theory is formalized with variables that are well characterized in a number of different animal eyes, resulting in testable predictions.

      The authors use the theory to explain a number of design features that depend on different optimal distribution of resources between the receptor array and the optics in different types of eyes. As an example, they successfully explain why eye regions with different spatial resolution should be built in different ways. They also explain differences between different types of eyes, such as long photoreceptors in apposition compound eyes and much shorter receptors in camera type eyes. The predictive power in the theory is impressive.

      To keep the number of parameters at a minimum, the theory was developed for two types of compound eye (neural superposition, and apposition) and for camera-type eyes. It is possible to extend the theory to other types of eyes, although it would likely require more variables and assumptions/constraints to the theory. It is thus good to introduce the conceptual ideas without overdoing the applications of the theory.

      The paper extends a previous theory, developed by the senior author, that develops performance surfaces for optimal cost/benefit design of eyes. By combining this with resource allocation between receptors and optics, the theoretical understanding of eye design takes a major leap and provides entirely new sets of predictions and explanations for why eyes are built the way they are.

      The paper is well written and even though the theory development in the Results may be difficult to take in for many biologists, the Discussion very nicely lists all the major predictions under separate headings, and here the text is more tuned for readers that are not entirely comfortable with the formalism of the Results section. I must point out though that the Results section is kept exemplarily concise. The figures are excellent and help explain concepts that otherwise may go above the head of many biologists.

      We are heartened by your appreciation of our manuscript - it persuaded us not to undertake extensive revisions – thank you.

      Reviewer #3 (Public Review):

      Summary:

      This is a proposal for a new theory for the geometry of insect eyes. The novel cost-benefit function combines the cost of the optical portion with the photoreceptor portion of the eye. These quantities are put on the same footing using a specific (normalized) volume measure, plus an energy factor for the photoreceptor compartment. An optimal information transmission rate then specifies each parameter and resource allocation ratio for a variable total cost. The elegant treatment allows for comparison across a wide range of species and eye types. Simple eyes are found to be several times more efficient across a range of eye parameters than neural superposition eyes. Some trends in eye parameters can be explained by optimal allocation of resources between the optics and photoreceptors compartments of the eye.

      Strengths:

      Data from a variety of species roughly align with rough trends in the cost analysis, e.g. as a function of expanding the length of the photoreceptor compartment.

      New data could be added to the framework once collected, and many species can be compared.

      Eyes of different shapes are compared.

      Weaknesses:

      Detailed quantitative conclusions are not possible given the approximations and simplifying assumptions in the models and poor accounting for trends in the data across eye types.

      Reviewer #1 (Recommendations For The Authors):

      Figure 1: Panel E defines the parameters described in panel d. Consider swapping the order of those panels (or defining D and Delta Phi in the figure legend for d). Order follows narrative, eye types then match 

      We think that you are referring to Figure 1. We modified the legend.

      Lines 143-145: How does a different relative cost impact your results?

      Thank you for raising this question. Because our assumption that relative costs are the same is our starting point, and for optics it is not an obvious mistake, we do not raise your question here. We address your question where you next raise it because, for photoreceptors, the assumption is obviously wrong. We now emphasise that our method for accounting for photoreceptor energy costs can be applied to other costs.

      Lines 187-190: Same as above - how do your results change if this assumption is not accurate?

      We have revised our manuscript to emphasise that we are dealing with the situation in which our initial assumption (costs per unit volume are equal) breaks down. On lines 203-208 we write: “However, this assumption breaks down when we consider specific metabolic rates. To enable and power phototransduction, photoreceptors have an exceptionally high specific metabolic rate (energy consumed per gram, and hence unit volume, per second) (Laughlin et al., 1998; Niven et al., 2007; Pangršič et al., 2005). We account for this extra cost by applying an energy surcharge, S<sub>E</sub>. To equate…”

      We also revised part of the Discussion section “Specific volume is a useful measure of cost” to make it clear that we are able to take account of situations in which the costs per unit volume are not equal, and we give our treatment of photoreceptor energy costs as an example of how this is done. On lines 626-640 we say:

      “Cost estimates can be adjusted for situations in which costs per unit volume are not equal, as illustrated by our treatment of photoreceptor energy consumption. To support transduction the photoreceptor array has an exceptionally high metabolic rate (Laughlin et al., 1998; Niven et al., 2007; Pangršič et al., 2005). We account for this higher energy cost by using the animal’s specific metabolic rate (power per unit mass and hence power per unit volume) to convert an array’s power consumption into an equivalent volume (Methods). Photoreceptor ion pumps are the major consumers of energy and the smaller contribution of pigmented glia (Coles, 1989) is included in our calculation of the energy tariff K<sub>E</sub> (Methods). The higher costs of materials and their turnover in the photoreceptor array can be added to the energy tariff K<sub>E</sub> but given the magnitude of the light-gated current (Laughlin et al., 1998) the relative increase will be very small. Thus for our intents and purposes the effects of these additional costs are covered by our models. For want of sufficient data…”.

      Reviewer #2 (Recommendations For The Authors):

      A few comments for consideration by the authors:

      (1) In the abstract, maybe give another example explaining why other eyes should be different to those of fast diurnal insects.

      This worthwhile extrapolation is best kept to the Discussion.

      (2) Would it be worthwhile mentioning that the photopigment density is low in rhabdoms compared to vertebrate outer segments? This will have major effects on the relative size of retina and optics.

      Thank you, we now make this good point in the Discussion (lines 698-702).

      (3) It took me a while to understand what you mean by an energy tariff. For the less initiated reader many other variables may be difficult to comprehend. A possible remedy would be to make a table with all variables explained first very briefly in a formal way and then explained again with a few more words for readers less fluent in the formalism.

      A very useful suggestion. We have taken your advice (p.4).

      (4) The "easy explanation" on lines 356-357 needs a few more words to be understandable.

      We have expanded this argument and corrected a mistake: the width of the head, front to back, is not 250 μm but 600 μm (lines 402-407).

      (5) Maybe devote a short paragraph in the Discussion to other types of eye, such as optical superposition eyes and pinhole eyes. This could be done very shortly and without formalism. I'm sure the authors already have a good idea of the optimal ratio of receptor arrays and optics in these eye types.

      We do not discuss this because we have not found a full account of the trade-offs and their  effects on costs and benefits. We hope that our analysis of apposition and simple eyes will encourage people to analyse the relationships between costs and benefits in other eye types. To this end we pointed out in the Discussion that recent advances in imaging and modelling could be helpful.

      (6)  Could the sentence on lines 668-671 be made a little clearer?

      “Efficiency is also depressed by increasing the photoreceptor energy tariff K<sub>E</sub>, and in line with the greater impact of photoreceptor energy costs in simple eyes, the reduction in efficiency is much greater in simple eyes (Figure 8b).”

      We replaced this sentence with “In both simple and apposition eyes efficiency is reduced by increasing the photoreceptor energy tariff K<sub>E</sub>. This effect is much greater in simple eyes, thus as found for reductions in photoreceptor length (Figure 7b), K<sub>E</sub> has more impact on the design of simple eyes” (lines 749-752).

      (7)  I have some reservations about the text on lines 789-796. The problem is that optics can do very little to improve the performance of a directional photoreceptor where delrho should optimally be very wide. Here, membrane folding is the only efficient way to improve performance (SNR). The option to reduce delrho for better performance comes later when simultaneous spatial resolution (multiple pixels) is introduced.

      Yes, we have been careless. We have rewritten this paragraph to say (lines 920-931)

      “Two key steps in the evolution of eyes were the stacking of photoreceptive membranes to absorb more photons, and the formation of optics to intercept more photons and concentrate them according to angle of incidence to form an image (Nilsson, 2013, 2021). Our modelling of well-developed image forming eyes shows that to improve performance stacked membranes (rhabdomeres) compete with optics for the resources invested in an eye, and this competition profoundly influences both form and function. It is likely that competition between optics and photoreceptors was shaping eyes as lenses evolved to support low resolution spatial vision. Thus the developmental mechanisms that allocate resources within modern high resolution eyes (Casares & MacGregor, 2021), by controlling cell size and shape, and as our study emphasises, gradients in size and shape across an eye, will have analogues or homologues in more ancient eyes. Their discovery….” (lines 920-931)

      Reviewer #3 (Recommendations For The Authors):

      Suggestions for major revisions:

      While the approach is novel and elegant, the results from the analysis of insect morphology do not broadly support the optimization argument and hardly constrain parameters, like the energy tariff value, at all. The most striking result of the paper is the flat plateau in information across a broad range of shape parameters and the length, and resolution trend in Figure 5.

      At no point in the Results and Discussion do we argue that resource allocation is optimized. Indeed, we frequently observe that it is not. Our mistake was to start the Abstract by observing that animals evolve to minimise costs. We have rewritten the Abstract accordingly.

      The information peaks are quite shallow. This might actually be a very important and interesting result in the paper - the fact that the information plateaus could give the insect eye quite a wide range of parameters to slide between while achieving relatively efficient sensing of the environment. Instead of attempting to use a rather ad hoc and poorly supported measure of energetics in PR cost, perhaps the pitch could focus on this flexibility. K<sub>E</sub> does not seem to constrain eye parameters and does not add much to the paper.

      We agree, being able to construct performance surfaces across morphospace is an important advance in the field of eye design and evolution, and the performance surface’s flat top has interesting implications for the evolution of adaptations. Encouraged by your remarks, we have rewritten the Abstract and the introductory paragraph of the Discussion to draw attention to these points. 

      We are disappointed that we failed to convince you that our energy tariff, K<sub>E</sub>, is more than a poorly supported ad hoc parameter that adds little to the paper. In our opinion a resource allocation model that ignores photoreceptor energy consumption is obviously inadequate because the high energy cost of phototransduction is both well known and considered to be a formative factor in eye evolution (Niven and Laughlin, 2008). One of the advantages of modelling is that one can assess the impact of factors that are known to be present, are thought to be important, but have not been quantified. We followed standard modelling practice by introducing a cost that has the same units as the other costs and, for good physiological reasons, increases linearly with the number of microvilli, according to K<sub>E</sub>. We then vary this unknown cost parameter to discover when and why it is significant. We were pleased to discover that we could combine data on photoreceptor energy demands and whole animal metabolic rates to establish the likely range of K<sub>E</sub>. This procedure enabled us to unify the cost-benefit analyses of optics and photoreceptors, and to discover that realistic values of K<sub>E</sub> have a profound impact on the structure and performance of an efficient eye. We hope that this advance will encourage people to collect the data needed to evaluate K<sub>E</sub>. To emphasise the importance of K<sub>E</sub> and dispel doubts associated with the failure of the model to fit the data, we have revised two sections: “Flies invest efficiently in costly photoreceptor arrays” in the Results, and “How efficiently do insects allocate resources within their apposition eyes?” in the Discussion. These rewrites also explain why it is impossible for us to infer K<sub>E</sub> by adjusting its value so that the model’s predictions fit the data.

      The graphics after Figure 3 are quite dense and hard to follow. None of the plateau extent shown in Fig 3 is carried through to the subsequent plots, which makes the conclusions drawn from these figures very hard to parse. If the peak information occurs on a flat plateau, it would be more helpful to see those ranges of parameters displayed in the figures.

      Ideally one should do as you suggest and plot the extent of the plateau, but in our situation this is not very helpful. In the best data set, flies, optimised models predict D well, get close to ∆φ in larger eyes, and demonstrate that these optimum values are not very sensitive to K<sub>E</sub>. L is a different matter: it is very sensitive to K<sub>E</sub> which, as we show (and frequently remind), is poorly constrained by experimental data. The best we can do is estimate the envelope of L vs C<sub>tot</sub> curves, as defined by a plausible range of K<sub>E</sub>. Because most of the plateau boundaries you ask for will fall within this envelope, plotting them does little to clear the fog of uncertainty. We note that all three referees agree that our model can account for two robust trends: i) in apposition eyes L increases with optical resolving power and acuity, both within individual eyes and among eyes of different sizes, and ii) L is much longer in apposition eyes than in simple eyes. Nonetheless, the scatter of data points and their failure to fit creates a bad impression. We gave a number of reasons why the model does not fit the data points, but these were scattered throughout the Results and Discussion and, as referees 1 and 3 point out, this makes it difficult to draw convincing conclusions. To rectify this failing, we have rewritten two sections, in the Results “Flies invest efficiently in costly photoreceptor arrays” and in the Discussion “How efficiently do insects allocate resources within their apposition eyes?”, to discuss these reasons en bloc, draw conclusions and suggest how better data and refinements to modelling could resolve these issues.

      Throughout the figures, the discontinuities in the optimal cuts through parameter space are not sufficiently explained.

      We added a couple of sentences that address the “jumps” (lines 313-318).

      None of the data seems to hug any of the optimal lines and only weakly follow the trends shown in the plots. This makes interpretation difficult for the reader and should be better explained. The text can be a little telegraphic in the Results after roughly page 10, and requires several readings to glean insight into the manuscript's conclusions.

      We revised the Results section in which we compare the best data set, flies’ NS eyes, with theoretical predictions (“Flies invest efficiently in costly photoreceptor arrays”) to expand our interpretation of the data and clarify our arguments. The remaining sections have not been expanded. In the next section, which is on fused rhabdom apposition eyes, our interpretation of the scattering of data points follows the same line of argument. The remaining Results sections are entirely theoretical.

      Overall, the rough conclusions outlined in the Results seem moderately supported by the matches of the data to the optimal information transmission cuts through parameter space, but only weakly.

      We agree, more data is required to test and refine our theoretical predictions.

      The Discussion is long and well-argued, and contains the most cogent writing in the manuscript.

      Thank you: this is most pleasing. We submitted our study to eLife because it allows longer Discussions, but we worried that ours was too long. However, we felt that our extensive Discussion was necessary for two reasons. First, we are introducing a new approach to understanding of eye design and evolution. Second, because the data on eye morphology and costs are limited, we had to make a number of assumptions and by discussing these, warts and all, we hoped to encourage experimentalists to gather more data and focus their efforts on the most revealing material.  

      Minor comments:

      We have acted upon most of your minor comments and we confine our remarks to our disagreements. We are grateful for your attention to details that we should have picked up on.

      It's a more standard convention to say "cost-benefit" rather than with a colon. 

      "equation" should be abbreviated "eq" or "eqn", never with a "t"

      when referring to the work of van Hateren, quote the paper and the database using "van Hateren" not just "Hateren"

      small latex note: use "\textit{SNR}" to get the proper formatting for those letters when in the math environment

      Line 100-110: "f" is introduced, but only f' is referenced in the figure. This should be explained in order. d_rh is not included in the figure. Also in this section, d_rh/f is also referenced before \Delta \rho_rf, which is the same quantity, without explanation.  

      Figure 1 shows eye structure and geometry. f’ is a lineal dimension of the eye but f is not, so f is not shown in Fig 1e. We eliminated the confusion surrounding ∆ρ<sub>rh</sub>  by deleting “and changing the acceptance angle of the photoreceptive waveguide ∆ρ<sub>rh</sub> (Snyder, 1979)”.  

      Fig 1 caption: this says "From dorsal to ventral," then describes trends that run ventral to dorsal, which is a confusing typo.

      Fig 3 - adding some data points to these plots might help the reader understand how (or if) K_E is constrained by the data.

      It is not possible to add data points because the total cost, C<sub>tot</sub>, is unknown.

      Fig 4c (and in other subplots): the jumps in L with C_tot could be explained better in the text - it wasn't clear to this reviewer why there are these discontinuities.

      Dealt with in the revised text (lines 310-318).

      Fig 4d: The caption for this subplot could be more clearly written.

      We have rewritten the subscript for subplot 4d.

      Fig 5 and other plots with data: please indicate which symbols are samples from the same species. This info is hard to reconstruct from the tables.

      We have revised Figure 5 accordingly. Species were already indicated in Figure 6.

      Line 328: missing equation number

    1. eLife Assessment

      This work is an important resource for hypothesis testing of candidate upstream transcriptional regulatory factors that control the spatiotemporal expression of selector genes and their targets for GABAergic vs glutamatergic neuron fate in the anterior brainstem. Extensive high-quality datasets were generated and state-of-the-art computational methods were convincingly implemented to identify candidate regulatory elements. The work will be of interest to biologists working to understand neuronal gene regulatory networks.

    2. Reviewer #1 (Public review):

      The objectives of this research are to understand how key selector transcription factors, Tal1, Gata2, Gata3, determine GABAergic vs glutamatergic neuron fate from the rhombencephalic V2 precursor domain and how their spatiotemporal expression is controlled by upstream regulators. Toward these goals, the authors have generated an impressive array of scRNA, scATAC-seq, and CUT&Tag datasets obtained from dissociated E12.5 ventral R1 dissections. The rV2 was subsetted with well-known markers. The authors use an extensive set of computational approaches to identify temporal patterns of chromatin accessibility, TF motif binding activities (footprints), gene expression and regulatory motifs at the different selector gene loci. These analyses are used to predict upstream regulators, candidate accessible CREs, and DNA binding motifs through which the selectors may be controlled in rV2 by upstream regulators. Further analyses predict auto- and cross-regulatory interactions for maintenance of selector expression and the downstream effectors of alternative transmitter identities controlled by the selectors. The authors have achieved their aim of making predictions about upstream and downstream selector TF regulatory networks; their conclusions and predictions are largely well supported. The work clearly illustrates the daunting gene regulatory complexity likely at play in controlling rV2 transmitter fate.

      This is a data-rich study and a valuable resource for future hypothesis testing, through perturbation approaches, of the many putative regulators and motifs identified in the study. The strengths of this work are the overall high quality of the datasets and the in-depth analyses. Through its comprehensive data and predictions, it is likely to have impact in advancing the understanding of GABAergic vs glutamatergic neuron fate decisions. The authors present a "simplified" gene regulatory model. However, the model does not illustrate the complexity of potential stage-specific upstream TF interactions with Tal1 and Vsx2 selector genes uncovered in TF footprinting analyses. While this seems nearly impossible to achieve given the plethora of potential functional TF inputs, the authors should consider assembling a focussed model by selectively illustrating the most robust, evidence-backed upstream TF input predictions, which are considered the strongest candidates for future hypothesis-driven perturbation experiments. It seems Insm1, Sox4, E2f1, Ebf1 and Tead2 TFs might be the strongest upstream candidates for future testing of Tal1 activation given the extensive analyses of their spatiotemporal expression patterns relative to Tal1, presented in Fig 4.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors seek to discover putative gene regulatory interactions underlying the lineage bifurcation process of neural progenitor cells in the embryonic mouse anterior brainstem into GABAergic and glutamatergic neuronal subtypes. The authors analyze single-cell RNA-seq and single-cell ATAC-seq datasets derived from the ventral rhombomere 1 of embryonic mouse brainstems to annotate cell types and make predictions of where TFs bind upstream and downstream of the effector TFs using computational methods. They add data on the genomic distributions of some of the key transcription factors, and layer these onto the single-cell data to develop a model of the transcription factor interactions that define this fate choice.

      Strengths:

      The authors use a well-defined fate decision point from brainstem progenitors that can make two very different kinds of neurons. They already know the key TFs for selecting the neuronal type from genetic studies, so they focus their gene regulatory analysis on the mechanisms that are immediately upstream and downstream of these key factors. The authors use a combination of single-cell and bulk sequencing data, prediction and validation, and computation.

      Weaknesses:

      The study does not go as far as to experimentally test the transcription factor network from their model.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The objective of this research is to understand how the expression of key selector transcription factors, Tal1, Gata2, Gata3, involved in GABAergic vs glutamatergic neuron fate from a single anterior hindbrain progenitor domain is transcriptionally controlled. With suitable scRNAseq, scATAC-seq, CUT&TAG, and footprinting datasets, the authors use an extensive set of computational approaches to identify putative regulatory elements and upstream transcription factors that may control selector TF expression. This data-rich study will be a valuable resource for future hypothesis testing, through perturbation approaches, of the many putative regulators identified in the study. The data are displayed in some of the main and supplemental figures in a way that makes it difficult to appreciate and understand the authors' presentation and interpretation of the data in the Results narrative. Primary images used for studying the timing and coexpression of putative upstream regulators, Insm1, E2f1, Ebf1, and Tead2 with Tal1 are difficult to interpret and do not convincingly support the authors' conclusions. There appears to be little overlap in the fluorescent labeling, and it is not clear whether the signals are located in the cell soma nucleus.

      Strengths:

      The main strength is that it is a data-rich compilation of putative upstream regulators of selector TFs that control GABAergic vs glutamatergic neuron fates in the brainstem. This resource now enables future perturbation-based hypothesis testing of the gene regulatory networks that help to build brain circuitry.

      We thank Reviewer #1 for the thoughtful assessment and recognition of the extensive datasets and computational approaches employed in our study. We appreciate the acknowledgment that our efforts in compiling data-rich resources for identifying putative regulators of key selector transcription factors (TFs)—Tal1, Gata2, and Gata3—are valuable for future hypothesis-driven research.

      Weaknesses:

      Some of the findings could be better displayed and discussed.

      We acknowledge the concerns raised regarding the clarity and interpretability of certain figures, particularly those related to expression analyses of candidate upstream regulators such as Insm1, E2f1, Ebf1, and Tead2 in relation to Tal1. We agree that clearer visualization and improved annotation of fluorescence signals are crucial to accurately support our conclusions. In our revised manuscript, we will enhance image clarity and clearly indicate sites of co-expression for Tal1 and its putative regulators, ensuring the results are more readily interpretable. Additionally, we will expand explanatory narratives within the figure legends to better align the figures with the results section.

      Reviewer #2 (Public review):

      Summary:

      In the manuscript, the authors seek to discover putative gene regulatory interactions underlying the lineage bifurcation process of neural progenitor cells in the embryonic mouse anterior brainstem into GABAergic and glutamatergic neuronal subtypes. The authors analyze single-cell RNA-seq and single-cell ATAC-seq datasets derived from the ventral rhombomere 1 of embryonic mouse brainstems to annotate cell types and make predictions of where TFs bind upstream and downstream of the effector TFs using computational methods. They add data on the genomic distributions of some of the key transcription factors and layer these onto the single-cell data to get a sense of the transcriptional dynamics.

      Strengths:

      The authors use a well-defined fate decision point from brainstem progenitors that can make two very different kinds of neurons. They already know the key TFs for selecting the neuronal type from genetic studies, so they focus their gene regulatory analysis squarely on the mechanisms that are immediately upstream and downstream of these key factors. The authors use a combination of single-cell and bulk sequencing data, prediction and validation, and computation.

      We also appreciate the thoughtful comments from Reviewer #2, highlighting the strengths of our approach in elucidating gene regulatory interactions that govern neuronal fate decisions in the embryonic mouse brainstem. We are pleased that our focus on a critical cell-fate decision point and the integration of diverse data modalities, combined with computational analyses, has been recognized as a key strength.

      Weaknesses:

      The study generates a lot of data about transcription factor binding sites, both predicted and validated, but the data are substantially descriptive. It remains challenging to understand how the integration of all these different TFs works together to switch terminal programs on and off.

      Reviewer #2 correctly points out that while our study provides extensive data on predicted and validated transcription factor binding sites, clearly illustrating how these factors collectively interact to regulate terminal neuronal differentiation programs remains challenging. We acknowledge the inherently descriptive nature of the current interpretation of our combined datasets.

      In our revision, we will clarify how the different data types support and corroborate one another, highlighting what we consider the most reliable observations of TF activity. Additionally, we will revise the discussion to address the challenges associated with interpreting the highly complex networks of interactions within the gene regulatory landscape.

      We sincerely thank both reviewers for their constructive feedback, which we believe will significantly enhance the quality and accessibility of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The results in Figure 3 and several associated supplements are mainly a description/inventory of putative CREs some of which are backed to some extent by previous transgenic studies. But given the way the authors chose to display the transgenic data in the Supplements, it is difficult to fully appreciate how well the transgenic data provide functional support. Take, for example, the Tal +40kb feature that maps to a midbrain enhancer: where exactly does +40kb map to the enhancer region? Is Tal +40kb really about 1kb long? The legend in Supplemental Figure 6 makes it difficult to interpret the bar charts; what is the meaning of: features not linked to gene -Enh? Some of the authors' claims are not readily evident or are inscrutable. For example, Tal locus features accessible in all cell groups are not evident (Fig 2A,B). Other cCREs are said to closely correlate with selector expression for example, Tal +.7kb and +40kb. However, inspection of the data seems to indicate that the two cCREs have very different dynamics and only +40kb seems to correlate with the expression track above it. Some features are described redundantly such as the Gata2 +22 kb, +25.3 kb, and +32.8 kb cCREs above and below the Gata3 cCRE. What is meant by: The feature is accessible at 3' position early, and gains accessibility at 5' positions ... Detailed feature analysis later indicated the binding of Nkx6-1 and Ascl1 that are expressed in the rV2 neuronal progenitors, at 3' positions, and binding of Insm1 and Tal1 TFs that are activated in early precursors, at 5' positions (Figure 3C).

      To allow easier assessment of the overlap of the features described in this study in reference to the transgenic studies, we have added further information about the scATAC features, cCREs and previously published enhancers, as well as visual schematics of the feature-enhancer overlaps in the Supplementary table 4. The Supplementary Table 4 column contents are also now explained in detail in the table legend (under the table). We hope those changes make the feature descriptions clearer. To answer the reviewer's question about the Tal1+40kb enhancer, the length of the published enhancer element is 685 bp and the overlapping scATAC feature length is 2067 bp (Supplementary Table 3, sheet Tal1, row 103).

      The legend and the chart labelling in the Supplementary Figure 5 (formerly Supplementary figure 6) have been elaborated, and the shown categories explained more clearly.

      Regarding the features at the Tal1 locus, the text has been revised and the references to the features accessible in all cell groups were removed. These features showed differences in the intensity of signal but were accessible in all cell groups. As the accessibility of these features does not correlate with Tal1 expression, they are of less interest in the context of this paper.

      The gain in accessibility of the +0.7 kb and +40 kb features correlates with the onset of Tal1 RNA expression. This is now stated more clearly in the text: "For example, the gain in the accessibility of Tal1 cCREs at +0.7 and +40 kb correlated temporally with the expression of Tal1 mRNA (Figure 2B), strongly increasing in the earliest GABAergic precursors (GA1) and maintained at a lower level in the more mature GABAergic precursor groups (GA2-GA6)," (Results, page 4). The reviewer is right that the later dynamics of the +0.7 and +40 kb cCREs differ, and this is now stated more clearly in the text (Results, page 5, last chapter).

      The repetition in the description of the Gata2 +22 kb, +25.3 kb, and +32.8 kb cCREs has been removed.

      The Tal1 +23 kb cCRE showed within-feature differences in accessibility signal. This is explained in the text on page 5, referring to the relevant figure 2A, showing the accessibility or scATAC signal in cell groups and the features labelled below, and 3C, showing the location of the Nkx6-1 and Ascl1 binding sites in this feature: "The Tal1 +23 kb cCRE contained two scATAC-seq peaks, having temporally different patterns of accessibility. The feature is accessible at 3' position early, and gains accessibility at 5' positions concomitant with GABAergic differentiation (Figure 2A, accessibility). Detailed feature analysis later indicated that the 3' end of this feature contains binding sites of Nkx6-1 and Ascl1 that are expressed in the rV2 neuronal progenitors, while the 5' end contains TF binding sites of Insm1 and Tal1 TFs that are activated in early precursors (described below, see Figure 3C)."

      (2) Supplementary Figure 3 is not presented in the Results.

      Essential parts of the previous Supplementary Figure 3 have been incorporated into Figure 4, and the previous Supplementary Figure 3 has been omitted.

      (3) The significance of Figure 3 and the many related supplements is difficult to understand. A large number of footprints with wide-ranging scores, many very weak or unbound, are displayed in the various temporal cell groups in different epigenomic regions of Tal1 and Vsx2. The footprints for GA1 and GA2 are combined despite Tal1 showing stronger expression in GA1 and stronger accessibility (Figure 2). Many possibilities are outlined in the Results for how the many different kinds of motifs in the cCREs might bind particular TFs to control downstream TF expression, but no experiments are performed to test any of the possibilities. How well do the TOBIAS footprints align with C&T peaks? How was C&T used to validate footprints? Are Gata2, 3, and Vsx2 known to control Tal1 expression from perturbation experiments?

      Figure 3 and related supplements present examples of the primary data and summarise the results of comprehensive analysis. The methods of identifying the selector TF regulatory features and the regulators are described in the Methods (Materials and Methods, page 16). Briefly, the correlation between feature accessibility and selector TF RNA expression (assessed by the LinkPeaks score and p-value) was used to select the features shown in Figure 3.

      We are aware of differences in Tal1 expression and accessibility between GA1 and GA2. However, the number of cells in GA2 was not high enough for reliable footprint calculations, and we therefore opted to combine related groups throughout the rV2 lineage for footprinting.

      As suggested, CUT&Tag could be used to validate the footprinting results, with some restrictions. In the revised manuscript, we included an analysis of CUT&Tag peak locations and footprints, similar to an earlier study (Eastman et al. 2025). In summary, we analysed whether CUT&Tag peaks overlap locations in which footprints were also recognized, and vice versa. For each TF with CUT&Tag data we calculated (a) the total number of CUT&Tag consensus peaks, (b) the total number of bound TFBS (footprints), (c) the percentage of CUT&Tag peaks overlapping bound TFBS, and (d) the percentage of bound TFBS overlapping CUT&Tag peaks. These results are shown in Supplementary Table 6 and in Supplementary Figure 11, with the analysis described in the Methods (Materials and Methods, page 19). There is considerable overlap between CUT&Tag peaks and bound footprints, comparable to that shown in Eastman et al. 2025. However, these two methods are not expected to match completely, for several reasons: binding by related/redundant TFs, antigen masking in the TF complex, chromatin association without DNA binding, etc. In addition, some CUT&Tag peaks with unbound footprints could arise from non-rV2 cells that were part of the bulk CUT&Tag analysis but not of the scATAC footprint analysis.
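      The four summary values (a) to (d) described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `(chrom, start, end)` half-open interval representation and the toy coordinates are all assumptions made for illustration, and any overlap of at least one base pair is counted as an overlap.

```python
# Hypothetical sketch: summarise the overlap between CUT&Tag consensus peaks
# and bound footprints (TFBS) for one TF. Intervals are (chrom, start, end),
# half-open; all names and coordinates are illustrative assumptions.

def group_by_chrom(intervals):
    """Index intervals by chromosome for simple lookup."""
    grouped = {}
    for chrom, start, end in intervals:
        grouped.setdefault(chrom, []).append((start, end))
    return grouped

def overlaps_any(interval, grouped):
    """True if the interval shares at least one base with any indexed interval."""
    chrom, start, end = interval
    return any(s < end and start < e for s, e in grouped.get(chrom, []))

def overlap_summary(peaks, footprints):
    """Return the four values (a)-(d) for one TF."""
    peaks_idx = group_by_chrom(peaks)
    fp_idx = group_by_chrom(footprints)
    peaks_hit = sum(overlaps_any(p, fp_idx) for p in peaks)
    fp_hit = sum(overlaps_any(f, peaks_idx) for f in footprints)
    return {
        "n_peaks": len(peaks),                     # (a)
        "n_bound_tfbs": len(footprints),           # (b)
        "pct_peaks_with_tfbs": 100.0 * peaks_hit / len(peaks) if peaks else 0.0,          # (c)
        "pct_tfbs_in_peaks": 100.0 * fp_hit / len(footprints) if footprints else 0.0,     # (d)
    }

# Toy data: two peaks, four footprints (two of them inside the first peak).
toy_peaks = [("chr4", 100, 500), ("chr4", 1000, 1400)]
toy_footprints = [("chr4", 200, 219), ("chr4", 300, 319),
                  ("chr4", 2000, 2019), ("chr6", 200, 219)]
summary = overlap_summary(toy_peaks, toy_footprints)
```

      The two percentages are computed in opposite directions because they need not agree: one peak can contain several footprints, and a footprint can fall outside every peak. A genome-scale analysis would normally use an interval tree or a dedicated interval tool rather than this quadratic scan.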

      The evidence for cross-regulation of selector genes and the regulation of Tal1 by Gata2, Gata3 and Vsx2 is now discussed (Discussion, chapter "Selector TFs directly autoregulate themselves and cross-regulate each other", pages 12-13). The regulation of Tal1 expression by Vsx2 has not, to our knowledge, been studied previously.

      (4) Figure 4 findings are problematic as the primary images seem uninterpretable and unconvincing in supporting the authors' claims. There is a lack of clear evidence in support of TF coexpression and that their expression precedes Tal1.

      Figure 4 has been entirely redrawn with higher resolution images and a more logical layout. In the revised Figure 4, only the most relevant ISH images are shown and arrowheads are added showing the colocalization of the mRNA in the cell cytoplasm. Next to the plots of RNA expression along the apical-basal axis of r1, an explanatory image of the quantification process is added (Figure 4D).

      (5) What was gained from also performing ChromVAR other than finding more potential regulators and do the results of the two kinds of analyses corroborate one another? What is a dual GATA:TAL BS?

      Our motivation for the ChromVAR analysis is now more clearly stated in the text (Results, page 9): "In addition to the regulatory elements of GABAergic fate selectors, we wanted to understand the genome-wide TF activity during rV2 neuron differentiation. To this aim we applied ChromVAR (Schep et al., 2017)". Further explanation of the Tal1 and Gata binding sites has also been added in this chapter (Results, page 9).

      The dual GATA:Tal BS (TAL1.H12CORE.0.P.B) is a 19-bp motif that consists of an E-box and a GATA sequence, and is likely bound by a heteromeric Gata2-Tal1 TF complex, but may also be bound by Gata2, Gata3 or Tal1 TFs separately. The other TFBSs of Tal1 contain a strong E-box motif and showed either a lower activity (TAL1.H12CORE.1.P.B) or an earlier peak of activity in common precursors with a decline after differentiation (TAL1.H12CORE.2.P.B) (Results, page 9).

      (6) The way the data are displayed it is difficult to see how the C&T confirmed the binding of Ebf1 and Insm1, Tal1, Gata2, and Gata3 (Supplementary Figures 9-11). Are there strong footprints (scores) centered at these peaks? One can't assess this with the way the displays are organized in Figure 3. What is the importance of the H3K4me3 C&T? Replicate consistency, while very strong for some TFs, seems low for other TFs, e.g. Vsx2 C&T on Tal1 and Gata2. The overlaps do not appear very strong in Supplementary Figure 10. Panels are not letter labeled.

      We have added an analysis of footprint locations within the CUT&Tag peaks (Supplementary Figure 11). The figure shows that the footprints are enriched in the middle regions of the CUT&Tag peaks, which is expected if TF binding at the footprinted TFBS was causative for the CUT&Tag peaks.
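      One way to quantify this kind of central enrichment is sketched below. It is an illustrative assumption, not the analysis behind Supplementary Figure 11: the function names, the `(chrom, start, end)` interval format and the choice of the central third as the "middle region" are all hypothetical.

```python
# Hypothetical sketch: where do footprint midpoints fall within CUT&Tag peaks?
# Position 0.0 is the peak start, 1.0 the peak end, 0.5 the peak centre.
# Names, interval format and the "central third" cut-off are assumptions.

def relative_positions(peaks, footprints):
    """Relative position of each footprint midpoint inside any peak containing it."""
    positions = []
    for f_chrom, f_start, f_end in footprints:
        midpoint = (f_start + f_end) / 2
        for p_chrom, p_start, p_end in peaks:
            if p_chrom == f_chrom and p_start <= midpoint < p_end:
                positions.append((midpoint - p_start) / (p_end - p_start))
    return positions

def fraction_central(positions, lower=1 / 3, upper=2 / 3):
    """Fraction of positions lying in the central part of the peak."""
    if not positions:
        return 0.0
    return sum(lower <= p <= upper for p in positions) / len(positions)

# Toy data: one 300-bp peak; two footprints near its centre, one near its edge,
# and one on another chromosome that is ignored.
toy_peaks = [("chr4", 0, 300)]
toy_footprints = [("chr4", 140, 160), ("chr4", 10, 30),
                  ("chr4", 145, 165), ("chr6", 140, 160)]
toy_positions = relative_positions(toy_peaks, toy_footprints)
toy_central = fraction_central(toy_positions)
```

      If footprints were placed uniformly at random within peaks, about one third of the midpoints would fall in the central third, so a markedly higher fraction would be consistent with the enrichment described above.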

      The aim of Supplementary Figures 9-11 (Supplementary Figures 8-10 in the revised manuscript) was to show the quality and replicability of the CUT&Tag.

      The anti-H3K4me3 antibody, as well as the anti-IgG antibody, was used in CUT&Tag as a technical control for the experiments. A strong CUT&Tag signal was detected in all our CUT&Tag experiments with H3K4me3. The H3K4me3 signal was not used in downstream analyses.

      We have now labelled the H3K4me3 data more clearly as "positive controls" in the Supplementary Figure 8. The control samples are shown only on Supplementary Figure 8 and not in the revised Supplementary Figure 10, to avoid repetition. The corresponding figure legends have been modified accordingly.

      To show replicate consistency, the genome view showing the Vsx2 CUT&Tag signal at Gata2 gene has been replaced by a more representative region (Supplementary Figure 8, Vsx2). The Vsx2 CUT&Tag signal at the Gata2 locus is weak, explaining why the replicability may have seemed low based on that example.

      Panel labelling has been added to Supplementary Figures 8, 9 and 10.

      (7) It would be illuminating to present 1-2 detailed examples of specific target genes fulfilling the multiple criteria outlined in Methods and Figure 6A.

      We now present examples of the supporting evidence used in the definition of selector gene target features and target genes. The new Supplementary Figure 12 shows an example gene Lmo1 that was identified as a target gene of Tal1, Gata2 and Gata3.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors perform CUT&Tag to ask whether Tal1 and other TFs indeed bind putative CREs computed. However, it is unclear whether some of the antibodies (such as Gata3, Vsx2, Insm1, Tead2, Ebf1) used are knock-out validated for CUT&Tag or a similar type of assay such as ChIP-seq and therefore whether the peaks called are specific. The authors should either provide specificity data for these or a reference that has these data. The Vsx2 signal in Figure S9 looks particularly unconvincing.

      Information about the target specificity of the antibodies can be found in previous studies or in the product information. References to these studies have now been added to the Methods (Materials and Methods, CUT&Tag, pages 18-19). Some of the antibodies are indeed not yet validated for ChIP-seq, Cut-and-Run or CUT&Tag. This is now clearly stated in the Materials and Methods (page 19): "The anti-Ebf1, anti-Tal1, anti-IgG and anti-H3K4me3 antibodies were tested on Cut-and-Run or ChIP-seq previously (Boller et al., 2016b; Courtial et al., 2012; Cell Signalling product information). The anti-Gata2 and anti-Gata3 antibodies are ChIP-validated (Ahluwalia et al., 2020a; Abcam product information). There are no previous results on ChIP, ChIP-seq or CUT&Tag with the anti-Insm1, anti-Tead2 and anti-Vsx2 antibodies used here. The specificity and nuclear localization have been demonstrated in immunohistochemistry with anti-Vsx2 (Ahluwalia et al., 2020b) and anti-Tead2 (Biorbyt product information). We observed good correlation between replicates with anti-Insm1, similar to all antibodies used here, but its specificity to target was not specifically tested". We admit that specificity testing with knockout samples would increase confidence in our data. However, we have observed robust signals and good replicability in the CUT&Tag for the antibodies shown here.

      Vsx2 CUT&Tag signal at the loci previously shown in Supplementary Figure S9 (now Supplementary Figure 8) is weak, explaining why the replicability may seem low based on those examples. The genome view showing the Vsx2 CUT&Tag signal at Gata2 gene locus in Supplementary Figure 8 (previously Supplementary figure 9) has now been replaced by a view of Vsx2 locus that is more representative of the signal.

      (2) It is unclear why the authors chose to focus on the transcription factor genes described in line 626 as opposed to the many other putative TFs described in Figure 3/Supplementary Figure 8. This is the major challenge of the paper - the authors are trying to tell a very targeted story but they show a lot of different names of TFs and it is hard to follow which are most important.

      We agree with the reviewer that the process of selecting the genes of interest is not always transparent. We are aware that interpretations of a paper are based on the known functions of the putative regulatory TFs; however, additional aspects of regulation could be revealed even if the biological functions of all the TFs were known. This is now stated in the Discussion “Caveats of the study” chapter. It would be relevant to study all identified candidate genes, but as is often the case, our possibilities were limited by the availability of materials (probes, antibodies), time, and financial resources. In the revised manuscript, we now briefly describe the biological processes related to the selected candidate regulatory TFs of the Tal1 gene (Results, page 8, "Pattern of expression of the putative regulators of Tal1 in the r1"). We hope this justifies the focus on them in our RNA co-expression analysis. The TFs analysed by RNAscope ISH are examples that demonstrate alignment of the tissue expression patterns with the scRNA-seq data, suggesting that the dynamics of gene expression detected by scRNA-seq generally reflect the pattern of expression in the developing brainstem.

      (3) How is the RNA expression level in Figure 5B and 4D-L computed? These are the clusters defined by scATAC-seq. Is this an inferred RNA expression? This should be made more clear in the text.

      The charts in Figures 5B and 4G,H,I show inferred RNA expression. The Y-axis labels have now been corrected and include the term 'inferred'. RNA expression in the scATAC-seq cell clusters is inferred from the scRNA-seq cells after the integration of the datasets.

      (4) The convergence of the GABA TFs on a common set of target genes reminds me of a nice study from the Rubenstein lab PMID: 34921112 that looked at a set of TFs in cortical progenitors. This might be a good comparison study for the authors to use as a model to discuss the convergence data.

      We thank the reviewer for bringing this article to our attention. The article is now discussed in the manuscript (Discussion, page 11).

      (5) The data in Figure 4, the in-situ figure, needs significant work. First, the images especially B, F, and J appear to be of quite low resolution, so they are hard to see. It is unclear exactly what is being graphed in C, G, and K and it does not seem to match the text of the results section. Perhaps better labeling of the figure and a more thorough description will make it clear. It is not clear how D, H, and L were supposed to relate to the images - presumably, this is a case where cell type is spatially organized, but this was unclear in the text if this is known and it needs to be more clearly described. Overall, as currently presented this figure does not support the descriptions and conclusions in the text.

      Figure 4 has been entirely redrawn with higher resolution images and a more logical layout. In the revised Figure 4, the ISH data and the quantification plots are better presented; arrows showing the colocalization of the mRNA in the cell cytoplasm have been added; and an explanatory image of the quantification process has been added (Figure 4D).

      Minor points

      (1) Helpful if the authors include scATAC-seq coverage plots for neuronal subtype markers in Figure 1/S1.

      We are unfortunately uncertain what is meant by this request. Subtype markers in the Figure 1/S1 scATAC-seq based clusters are shown from inferred RNA expression, and therefore these marker expression plots do not have any coverage information available.

      (2) The authors in line 429 mention the testing of features within TADs. They should make it clear in the main text (although tadmap is mentioned in the methods) that this is a prediction made by aggregating HiC datasets.

      Good point; this detail has been added on pages 3 and 16.

      (3) The authors should include a table with the phastcons output described between lines 511 and 521 in the main or supplementary figures.

      We have now clarified in the text that we did not recalculate any phastCons results; we merely used the already published and available per-nucleotide conservation scores as provided by the original authors (Siepel et al. 2005). (Results, page 5: the revised text is "To that aim, we used nucleotide conservation scores from UCSC (Siepel et al., 2005). We overlaid conservation information and scATAC-seq features to both validate feature definition as well as to provide corroborating evidence to recognize cCRE elements.")

      (4) It is very difficult to read the names of the transcription factor genes described in Figure 3B-D and Supplementary Figure 8 - it would be helpful to resize the text.

      The Figures 3B-D and Supplementary Figure 7 (former Supplementary figure 8) have been modified, removing unnecessary elements and increasing the size of text.

      (5) It is unclear what strain of mouse is used in the study - this should be mentioned in the methods.

      The outbred NMRI mouse strain was used in this study. Information about the mouse strain has been added to the Materials and Methods: scRNA-seq samples (page 14), scATAC-seq samples (page 15), RNAscope in situ hybridization (page 17) and CUT&Tag (page 18).

      (6) Text size in Figure 6 should be larger. R-T could be moved to a Supplementary Figure.

      Figure 6 has been revised, making the charts clearer and the chart labels larger. Figures 6R-S have been replaced by Supplementary Table 8, and Figure 6T is now shown as a new figure (Figure 7).

      Additional corrections in figures

      Figures 6D, I and N had the wrong y-axis scale. This has been corrected; it does not affect the interpretation of the data, as the Pos.link and Neg.link counts were compared to each other (as a ratio).

      In Figure 2B, the heatmap labels were shifted, making it difficult to identify the feature name for each row. This is now corrected.

    1. eLife Assessment

      This valuable study reports the physiological function of a putative transmembrane UDP-N-acetylglucosamine transporter called SLC35G3 in spermatogenesis. The conclusion that SLC35G3 is a new and essential factor for male fertility in mice and probably in humans is supported by convincing data. This study will be of interest to reproductive biologists and physicians working on male infertility.

    2. Reviewer #1 (Public review):

      Summary:

      In the present manuscript, Mashiko and colleagues describe a novel phenotype associated with deficient SLC35G3, a testis-specific sugar transporter that is important in glycosylation of key proteins in sperm function. The study characterizes a knockout mouse for this gene and the multifaceted male infertility that ensues. The manuscript is well-written and describes novel physiology through a broad set of appropriate assays.

      Strengths:

      Robust analysis with detailed functional and molecular assays

      Weaknesses:

      (1) The abstract references reported mutations in human SLC35G3, but this is not discussed or correlated to the murine findings to a sufficient degree in the manuscript. The HEK293T experiments are reasonable and add value, but a more detailed discussion of the clinical phenotype of the known mutations in this gene and whether they are recapitulated in this study (or not) would be beneficial.

      (2) Can the authors expand on how this mutation causes such a wide array of phenotypic defects? I am surprised there is a morphological defect, a fertilization defect, and a transit defect. Do the authors believe all of these are present in humans as well?

    3. Reviewer #2 (Public review):

      Summary:

      This study characterized the function of SLC35G3, a putative transmembrane UDP-N-acetylglucosamine transporter, in spermatogenesis. The authors showed that SLC35G3 is testis-specific and expressed in round spermatids. Slc35g3-null males were sterile, but females were fertile. Slc35g3-null males produced a normal sperm count, but sperm showed subtle abnormalities in head morphology. Sperm from Slc35g3-null males have defects in uterotubal junction passage, ZP binding, and oocyte fusion. Loss of SLC35G3 causes abnormal processing and glycosylation of a number of sperm proteins in the testis and sperm. The authors demonstrated that SLC35G3 functions as a UDP-GlcNAc transporter in cell lines. Two human SLC35G3 variants impaired its transporter activity, implicating these variants in human infertility.

      Strengths:

      This study is thorough. The mutant phenotype is strong and interesting. The major conclusions are supported by the data. This study demonstrated SLC35G3 as a new and essential factor for male fertility in mice, which is likely conserved in humans.

      Weaknesses:

      Some data interpretations need to be revised.

    1. eLife Assessment

      Qiu et al. present multiple dimeric structures of GPR3, which reveal the binding mode of the inverse agonist AF64394. The findings provide important insights into the regulation of GPR3 and potentially other related orphan GPCRs. The authors present convincing evidence of their claims through thoughtful analysis of their cryo-EM structures, mutagenesis, and cell-based assays. This work will be of interest to GPCR investigators, especially those studying the signaling of orphan receptors.

    2. Reviewer #1 (Public review):

      Summary:

      GPR3 is an orphan receptor that plays a crucial role in central nervous system development and cold-induced thermogenesis, with potential implications for treating neurodegenerative and metabolic diseases. Although previous structural studies of GPR3 have been reported, Qiu et al. presented both active and inactive structures of GPR3 in its dimeric form. Notably, they identified AF64394 as a negative allosteric modulator that binds at the dimerization interface. This interface, primarily formed by transmembrane helices TM5 and TM6, is significantly larger than the dimerization interfaces previously reported for class A GPCRs. The authors further elucidate GPR3's activation mechanism and propose that dimerization may serve as a regulatory feature of GPR3 function. Overall, the study is well-executed, and the conclusions are sound.

      Strengths:

      Reported a unique dimerization interface of GPR3 and identified AF64394 as a negative allosteric modulator that binds at the dimerization interface.

      Weaknesses:

      There are some minor issues in the figure presentation.

    3. Reviewer #2 (Public review):

      Qiu et al. present active and inactive state dimeric structures of GPR3 with and without the previously identified inverse agonist AF64394. The manuscript combines cryo-EM processing, mutagenesis studies, and live-cell cAMP measurements to provide insights into the mechanism of action of AF64394 as a negative allosteric modulator of GPR3. All resolved structures show the density of a presumably hydrophobic endogenous, co-purified ligand in the orthosteric receptor binding pocket, supporting previous publications by this and other groups that endogenous lipids are endogenous ligands of GPR3. However, the authors also show that none of the proposed endogenous lipids (e.g., oleoylethanolamide) are able to further increase cAMP in living cells in a GPR3-dependent manner when applied exogenously. These data are in contrast to previous studies, but are of interest to the field as they may suggest that GPR3 expressed in different cell types is already saturated by endogenous lipids.

      The overall findings are novel and exciting. GPR3 has not previously been proposed to assemble into a homodimeric complex, and no information has been published on where AF64394 binds to the receptor. Several comparative analyses between GPR3 and its close relatives, GPR6 and GPR12, including live cell experiments with GPR3/6 chimera, provide intriguing mechanistic explanations for the different dimerisation behaviour and activity of AF64394 at this GPCR cluster.

      The only weakness of the study is that the population shift towards homodimer induced by AF64394, as suggested by 2D classifications of purified GPR3, is not supported by live cell experiments. The fact that AF64394 reduces GPR3-mediated cAMP production in a concentration-dependent manner may also be due to mechanisms independent of homodimerisation. Therefore, a live cell assay that directly detects dimer formation and/or dissociation upon different stimuli would significantly strengthen the findings of Qiu et al.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Qiu and co-workers describes the single-particle cryo-electron microscopy structures of various oligomeric states of the orphan GPCR, GPR3. It describes the monomeric and dimeric structure of a mutant of GPR3 with a modified G-protein complex (miniGs) and then builds on this work to attempt an inactive 'apo' dimer and an allosteric modulator (AF)-bound dimer structure, using an ICL3 insertion and stabilizing FAB fragments.

      In general, I'm supportive of the work done in this study, and it does indeed provide valuable insight into GPR3 function. It may be that dimerization of certain class A GPCRs may be a means of signalling regulation or perhaps even amplification. However, some of the interpretation of the single particle data needs some extra attention to strengthen the hypothesis presented in the manuscript.

      Firstly, I want to thank the authors for providing the unfiltered half-maps and PDB models for careful assessment. During this review, I did my own post-processing of the half-maps and used the resultant maps for careful analysis of models.

      So to begin, I understand that the authors didn't model any lipid in the orthosteric binding site in any of the maps, but it may be worthwhile to model something in there, as many readers only download coordinates and not the maps.

      A more general point about all the maps. In no case were any focussed refinements carried out. As the point of this paper is some of the finer details between active and intermediate states and the effect of an allosteric modulator, masking out hypervariable portions of the structure and doing local Euler searches would most certainly provide richer insights into the details in GPR3 (especially as the BRIL:Fab structures are not of interest). And also, generally, no 3D-variability studies were performed to see if minor differences in, say, TM4/5/6 positions were due to large variation in the single particles or were a stable consensus position.

      As for the PFK dimeric structure. It appears to be refined with C2 point group symmetry (which is not mentioned anywhere except in a tiny bit of text in a supplemental figure). Was this also calculated in C1 to assess if there is any difference in either GPR3 protomer? Also, how certain are the authors of the cholesterol positions at the bottom of TM4/5? At lower map thresholds in the PFK dimer structure, one of them appears to be continuous with the orthosteric lipid. It also appears that there are many unmodelled lipids in this structure, and only two were assigned as cholesterol. It appears that many of the unmodelled lipids are forming bridging connections between the GPR3 protomers. Also, it may be worthwhile to provide a table of the key interactions between the protomers (although I note that there was a figure highlighting them).

      With the PFK monomer structure, there was weak density for the same cholesterol, which was not modelled in this one; perhaps some commentary on the authors' approach for deciding how to assign density would be helpful. It also appears that the refinement mask was probably a bit tight in this one (something that cryoSPARC is notorious for), and rerefining with a much looser mask around the TM domain may be helpful in resolving the inner lipid leaflet positions.

      The Apo structure, I think, I have the most issues with. Firstly, it is not 'apo'. There is definitely unaccounted for density in the orthosteric site. Also, the structure definitely needs a bit more attention. Firstly, masking out the BRIL and FABs would be a good start in helping better resolve the TMD regions, and then even focussing on a single monomer to increase the map interpretability. My major problem here is that, if this is being called 'apo' and inactive, the map doesn't reflect this; also, the TM5/6 does not look to be in a fully inactive position. The map density (at least around one of the protomers) in this region looks to be poorly resolved, most likely due to averaging due to internal motion. I think some 3DVA is certainly warranted here to strengthen the hypothesis that they have solved an 'apo' inactive.

      The AF (allosteric modulator) bound structure is of significantly better quality. But again, only AF is modelled, and no lipids are. How are the authors sure? Perhaps some focussed refinements (and changing the Euler origin to centre it on the AF molecule) could be a good start. To this reviewer, at least in one of the protomers, adjacent to the AF position, there is a density that looks very much like the allosteric modulator, so it could even be forming a bridging dimer. Also, some potential assignments of the lipids may shed light on the structure-activity relationship of this modulator, as it seems to make as many contacts with surrounding lipids as it does with TM4/5. Also, it may be worthwhile exploring carefully the 3DVA of this data. In our studies (Russel et al.), we noted that the orthosteric lipid appears to ratchet back-and-forth in concert with TM4/5 twisting. Perhaps in the AF bound structure, as AF binds at the 'exit' site of the lipid, it is locking in a specific conformation.

    1. eLife Assessment

      This study provides valuable insights into the crosstalk between ATG2A and components of the early secretory pathway, namely RAB1A and ARFGAP1. The evidence supporting the claims is convincing. However, the manuscript would benefit from a more in-depth exploration of the details of the role of RAB1A in autophagy and the functional implications of its interaction with ATG2A. In addition, the molecular details of the role of ARFGAP1 in this complex need further clarification.

    2. Reviewer #1 (Public review):

      Summary:

      D. Fuller et al. set out to study the molecular partners that cooperate with ATG2A, a lipid transfer protein essential for phagophore elongation, during the process of autophagy. Through a series of experiments combining microscopy and biochemistry, the authors identify ARFGAP1 and Rab1A as components of early autophagic membranes, which accumulate at the periphery of aberrant pre-autophagosomal structures induced by loss of ATG2. While ARFGAP1 has no apparent function in autophagy, the authors show that RAB1A is implicated in autophagy, although the mechanisms are not explored in the manuscript.

      Strengths:

      The work presented by Fuller et al. provides new insights into the composition of early autophagic membranes. The authors provide a series of MS experiments identifying proteins in close proximity to ATG2A, which is a valuable dataset for the field. Furthermore, they show for the first time the interaction between ATG2A and RAB1A, both in fed and starved conditions, which extends the characterisation of the pre-autophagosomal structures observed in ATG2 DKO cells.

      Weaknesses:

      The authors claim that this study elucidates the role of early secretory membranes in phagophore formation. However, this work is largely observational: it presents compelling evidence for the association between the RAB1A GTPase and ATG2A without providing mechanistic insights into the functional relevance of this interaction. It remains unclear whether RAB1A depletion phenocopies ATG2A depletion in terms of autophagy progression or accumulation of pre-autophagosomal structures.

      Furthermore, this research is conducted exclusively in HEK293 cells. Including at least one additional cell line would significantly strengthen the main findings (i.e., effects on LC3-II accumulation observed for RAB1A/B knockdown, given the previously published data on this topic).

      A notable weakness of this manuscript, in this reviewer's opinion, lies in the discussion of the data in the context of existing literature. The discussion is rather short, mostly focused on the phenotype observed in ATG2 DKO cells. While this phenotype is certainly intriguing, it feels the discussion overlooks some important aspects, as outlined in the comments to the authors.

    3. Reviewer #2 (Public review):

      The mechanisms governing autophagic membrane expansion remain incompletely understood. ATG2 is known to function as a lipid transfer protein critical for this process; however, how ATG2 is coordinated with the broader autophagic machinery and endomembrane systems has remained elusive. In this study, the authors employ an elegant proximity labeling approach and identify two ER-Golgi intermediate compartment (ERGIC)-localized proteins, Rab1 and ARFGAP1, as novel regulators of ATG2 during autophagic membrane expansion.

      Their findings support a model in which autophagosome formation occurs within a specialized subdomain of the ER that is enriched in both ER exit sites (ERES) and ERGIC, providing valuable mechanistic insight. The overall study is well-executed and offers an important contribution to our understanding of autophagy.

      Specific Comments

      (1) Integration with Prior Literature
      The data convincingly implicate the ERES-ERGIC interface in autophagosome biogenesis. It would strengthen the manuscript to discuss previous studies reporting ERES and ERGIC remodeling and formation of ERES-ERGIC contact sites (PMID: 34561617; PMID: 28754694) in the context of the current findings.

      (2) Experimental Conditions
      In Figures 2A-C and Figure 4, it is unclear how the cells were treated. Were they starved in EBSS? This information should be included in the corresponding figure legends.

      (3) LC3 Lipidation vs. Cleavage
      In Figure 2A, ARFGAP1 knockdown appears to reduce LC3 lipidation without affecting Halo-LC3 cleavage. Clarifying this observation would help readers better understand the functional specificity of ARFGAP1 in the pathway.

      (4) Use of HT-mGFP in Figure 2C
      It should be clarified whether the assay in Figure 2C was performed in the presence of HT-mGFP. Explaining the rationale would aid the interpretation of the results.

      (5) COPII Inhibition Strategy
      The authors used the dominant-active SAR1(H79G) mutant to inhibit COPII function. While this is effective in in vitro budding assays, the GDP-locked mutant SAR1(T39N) has been shown to be more effective in blocking COPII-mediated trafficking in cells. Including SAR1(T39N) in the analysis would provide stronger support for the conclusions.

    4. Reviewer #3 (Public review):

      The manuscript by Fuller et al. describes a crosstalk between ATG2A and components of the early secretory pathway, namely RAB1A and ARFGAP1. They show that ATG2A is recruited to membranes positive for RAB1A, which they also show to interact with ATG2A. In agreement with earlier findings by other groups, silencing RAB1A negatively affects autophagy. While ARFGAP1 was also found on ATG2A-positive membranes, silencing ARFGAP1 had no impact on autophagy. Notably, these ARFGAP1-positive membranes are not Golgi membranes.

      The findings are interesting, and in general, the data are of good quality; however, I have outstanding questions. An answer to any of these questions might strengthen the manuscript:

      (1) Are the membranes to which ATG2A is recruited a form of ERGIC?

      (2) Figure 3A/B: Is it possible to show a better example? The difference is barely detectable by eye. Since immunoblotting is not really a quantitative method, I think that such a weak effect is prone to be wrong. Is there another tool/assay to validate this result?

      (3) Is the curvature-sensitive region of ARFGAP1 required for its co-localization with ATG2A?

      (4) What does Rab1A do? What is its effector? Or does the GTPase itself remodel the membrane?

      (5) What about Arf1? It appears that the role of ARFGAP1 is unrelated to Arf1 and COPI? Thus, one would predict that Arf1 does not localize to these structures and does not affect ATG2A function.

      (6) Does ARFGAP1 promote fission of the membrane from its donor compartment?

      (7) What are ARFGAP1 and Rab1A recruited to? What is the lipid composition or protein that recruits these two players to regulate autophagy?

    1. eLife Assessment

      This important study is of relevance for the fields of predictive processing, perception and learning, with a well-designed paradigm allowing the authors to avoid several common confounds in investigating predictions, such as adaptation. Using a state-of-the-art multivariate EEG approach, the authors test the opposing process theory and find evidence in support of it, namely the persuasive within-trial effects. However, the interactions across blocks are not well motivated and much less persuasive, such that the support for the conclusions is incomplete at present.

    2. Reviewer #1 (Public review):

      Summary:

      In this lovely paper, McDermott and colleagues tackle an enduring puzzle in the cognitive neuroscience of perceptual prediction. Though many scientists agree that top-down predictions shape perception, previous studies have yielded incompatible results - with studies showing 'sharpened' representations of expected signals, and others showing a 'dampening' of predictable signals to relatively enhance surprising prediction errors. To deepen the paradox further, it seems like there are good reasons that we would want to see both influences on perception in different contexts.

      Here, the authors aim to test one possible resolution to this 'paradox' - the opposing process theory (OPT). This theory makes distinct predictions about how the timecourse of 'sharpening' and 'dampening' effects should unfold. The researchers present a clever twist on a leading-trailing perceptual prediction paradigm, using AI to generate a large dataset of test and training stimuli, so that it is possible to form expectations about certain categories without repeating any particular stimuli. This provides a powerful way of distinguishing expectation effects from repetition effects - a perennial problem in this line of work.

      Using EEG decoding, the researchers find evidence to support the OPT. Namely, they find that neural encoding of expected events is superior in earlier time ranges (sharpening-like) followed by a relative advantage for unexpected events in later time ranges (dampening-like). On top of this, the authors also show that these two separate influences may emerge differently in different phases of learning - with superior decoding of surprising prediction errors being found more in early phases of the task, and enhanced decoding of predicted events being found in the later phases of the experiment.

      Strengths:

      As noted above, a major strength of this work lies in important experimental design choices. Alongside removing any possible influence of repetition suppression mechanisms in this task, the experiment also allows us to see how effects emerge in 'real time' as agents learn to make predictions. This contrasts with many other studies in this area - where researchers 'over-train' expectations into observers to create the strongest possible effects, or rely on prior knowledge that was likely to be crystallised outside the lab.

      Weaknesses:

      This study reveals a great deal about how certain neural representations are altered by expectation and learning on shorter and longer timescales, so I am loath to describe certain limitations as 'weaknesses'. But one limitation inherent in this experimental design is that, by focusing on implicit, task-irrelevant predictions, there is not much opportunity to connect the predictive influences seen at the neural level to perceptual performance itself (e.g., how participants make perceptual decisions about expected or unexpected events, or how these events are detected or appear).

    3. Reviewer #2 (Public review):

      Summary:

      There are two accounts in the literature that propose that expectations suppress activity of neurons that are (a) not tuned to the expected stimulus to increase the signal-to-noise ratio for expected stimuli (sharpening model) or (b) tuned to the expected stimulus to highlight novel information (dampening model). One recent account, the opposing process theory, brings the two models together and suggests that both processes occur, but at different time points: initial sharpening is followed by later dampening of the neural activity of the expected stimulus. In this study, the authors aim to test the opposing process theory in a statistical learning task by applying multivariate EEG analyses and find evidence for the opposing process theory based on the within-trial dynamics.

      Strengths:

      This study addresses a very timely research question about the underlying mechanisms of expectation suppression. The applied EEG decoding approach offers an elegant way to investigate the temporal characteristics of expectation effects. A strength of the study lies in the experimental design that aims to control for repetition effects, one of the common confounds in prediction suppression studies. The reported results are novel in the field and have the potential to improve our understanding of expectation suppression in visual perception.

      Weaknesses:

      Although some of the findings are in line with the opposing process theory, especially the EEG results only partly support the hypothesis. While the initial dampening effect occurs in the grand average ERP and in image memory decoding, the expected later sharpening effect is lacking. Moreover, some methodological decisions still remain arbitrary. One of the interesting aspects of the study - prediction decoding - had to be removed due to the fact that it could not be disentangled from category decoding. This weakens the overall scope and impact of the manuscript.

    4. Reviewer #3 (Public review):

      Summary:

      In their study McDermott et al. investigate the neurocomputational mechanism underlying sensory prediction errors. They contrast two accounts: representational sharpening and dampening. Representational sharpening suggests that predictions increase the fidelity of the neural representations of expected inputs, while representational dampening suggests the opposite (decreased fidelity for expected stimuli). The authors performed decoding analyses on EEG data, showing that first expected stimuli could be better decoded (sharpening), followed by a reversal during later response windows where unexpected inputs could be better decoded (dampening). These results are interpreted in the context of opposing process theory (OPT), which suggests that such a reversal would support perception to be both veridical (i.e., initial sharpening to increase the accuracy of perception) and informative (i.e., later dampening to highlight surprising, but informative inputs).

      Strengths:

      The topic of the present study is of significant relevance for the field of predictive processing. The experimental paradigm used by McDermott et al. is well designed, allowing the authors to avoid several common confounds in investigating predictions, such as stimulus familiarity and adaptation. The introduction of the manuscript provides a well-written summary of the main arguments for the two accounts of interest (sharpening and dampening), as well as OPT. Overall, the manuscript serves as a good overview of the current state of the field.

      Weaknesses:

      In my opinion some details of the methods, results and manuscript raise some doubts about the reliability of the reported findings. Key concerns are:

      (1) In the previous round of comments, I noted that: "I am not fully convinced that Figures 3A/B and the associated results support the idea that early learning stages result in dampening and later stages in sharpening. The inference made requires, in my opinion, not only a significant effect in one time bin and the absence of an effect in other bins. Instead, to reliably make this inference one would need a contrast showing a difference in decoding accuracy between bins, or ideally an analysis not contingent on seemingly arbitrary binning of data, but a decrease (or increase) in the slope of the decoding accuracy across trials. Moreover, the decoding analyses seem to be at the edge of SNR, hence making any interpretation that depends on the absence of an effect in some bins yet more problematic and implausible". The authors responded: "we fitted a logarithmic model to quantify the change of the decoding benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1%. Given the results of this analysis and to ensure a sufficient number of trials, we focused our further analyses on bins 1-2". However, I do not see how this new analysis addresses the concern that the conclusion highlights differences in decoding performance between bins 1 and 2, yet no contrast between these bins is performed. While I appreciate the addition of the new model, in my current understanding it does not solve the problem I raised. I still believe that if the authors wish to conclude that an effect differs between two bins, they must contrast these directly and/or use a different appropriate analysis approach.

      Relatedly, the logarithmic model fitting and how it justifies the focus on analysis bins 1-2 need to be explained better, especially the rationale of the analysis, the choice of parameters (e.g., why logarithmic, why a change of the logarithmic fit < 0.1% as criterion, etc.), and why certain inferences follow from this analysis. Also, the reporting of the associated results seems rather sparse in the current iteration of the manuscript.
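
      One way to make the direct between-bin contrast requested above concrete is a paired sign-flip permutation test on per-participant decoding benefits. The sketch below is only an illustration and not the authors' analysis: the function name `sign_flip_test`, the number of permutations, and the example values are assumptions introduced here, and the example values are placeholders rather than data from the study.

```python
import random

def sign_flip_test(bin1, bin2, n_perm=10000, seed=0):
    """Paired sign-flip permutation test on the mean difference bin1 - bin2.

    bin1, bin2: per-participant values (e.g., valid-minus-invalid decoding
    accuracy in each bin), in the same participant order.
    Returns (observed mean difference, two-sided p-value).
    """
    if len(bin1) != len(bin2) or not bin1:
        raise ValueError("bin1 and bin2 must be non-empty and equally long")
    diffs = [a - b for a, b in zip(bin1, bin2)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is exchangeable.
        perm_mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(perm_mean) >= abs(observed):
            extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)

# Placeholder values for illustration only (not data from the manuscript).
bin1_benefit = [0.02, -0.01, 0.03, 0.00, 0.01, 0.02]
bin2_benefit = [-0.01, 0.00, -0.02, 0.01, -0.01, 0.00]
mean_diff, p_value = sign_flip_test(bin1_benefit, bin2_benefit)
print(f"mean difference = {mean_diff:.4f}, p = {p_value:.4f}")
```

      A test of this kind evaluates the difference between bins itself, rather than comparing a significant effect in one bin with a non-significant effect in another.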

      (2) A critical point the authors raise is that they investigate the buildup of expectations during training. They go on to show that the dampening effect disappears quickly, concluding: "the decoding benefit of invalid predictions [...] disappeared after approximately 15 minutes (or 50 trials per condition)". Maybe the authors can correct me, but my best understanding is as follows: Each bin has 50 trials per condition. The 2:1 condition has 4 leading images; this would mean ~12 trials per leading stimulus, 25% of which are unexpected, so ~9 expected trials per pair. Bin 1 represents the first time the participants see the associations. Therefore, the conclusion is that participants learn the associations so rapidly that ~9 expected trials per pair suffice to not only learn the expectations (in a probabilistic context) but learn them sufficiently well such that they result in a significant decoding difference in that same bin. If so, this would seem surprisingly fast, given that participants learn by means of incidental statistical learning (i.e. they were not informed about the statistical regularities). I acknowledge that we do not know how quickly the dampening/sharpening effects develop; however, surprising results should be accompanied by a critical evaluation and exceptionally strong evidence (see point 1). Consider for example the following alternative account to explain these results. Category pairs were fixed across and within participants, i.e. the same leading image categories always predicted the same trailing image categories for all participants. Some category pairings will necessarily result in a larger representational overlap (i.e., visual similarity, etc.) and hence differences in decoding accuracy due to adaptation and related effects. For example, house → barn will result in a different decoding performance compared to coffee cup → barn, simply due to the larger visual and semantic similarity between house and barn compared to coffee cup and barn. These effects should occur upon first stimulus presentation, independent of statistical learning, and may attenuate over time e.g., due to increasing familiarity with the categories (i.e., an overall attenuation leading to smaller between condition differences) or pairs.
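
      The trial-count estimate in point (2) can be reproduced with a few lines of arithmetic. This is a sketch of the reviewer's reading of the design (50 trials per condition per bin, 4 leading images, 25% unexpected trials); the variable names are introduced here for illustration, and the numbers are those quoted in the comment, not values checked against the manuscript.

```python
# Figures quoted in the comment above (the reviewer's understanding of the design).
trials_per_condition_per_bin = 50
n_leading_images = 4          # 2:1 condition
p_unexpected = 0.25           # share of trials with an unexpected trailing category

# Trials available per leading image within one bin.
trials_per_leading = trials_per_condition_per_bin / n_leading_images

# Split into expected and unexpected trials per leading-trailing pair.
expected_per_pair = trials_per_leading * (1 - p_unexpected)
unexpected_per_pair = trials_per_leading * p_unexpected

print(f"trials per leading image: {trials_per_leading}")
print(f"expected trials per pair: {expected_per_pair}")
print(f"unexpected trials per pair: {unexpected_per_pair}")
```

      Under these assumptions the estimate is 12.5 trials per leading image and about 9.4 expected trials per pair, in line with the rounded "~12" and "~9" given above.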

      (3) In response to my previous comment asking why the authors think their study may have found different results compared to multiple previous studies (e.g. Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011), particularly the sharpening-to-dampening switch, the authors emphasize the use of non-repeated stimuli (no repetition suppression and no familiarity confound) in their design. However, I fail to see how familiarity or RS could account for the absence of a sharpening/dampening inversion in previous studies.

      First, if the authors' argument is about stimulus novelty and familiarity as described by Feuerriegel et al., 2021, I believe this point does not apply to the cited studies. Feuerriegel et al., 2021 note: "Relative stimulus novelty can be an important confound in situations where expected stimulus identities are presented often within an experiment, but neutral or surprising stimuli are presented only rarely", which indeed is a critical confound. However, none of the studies (Han et al., 2019; Richter et al., 2018; Kumar et al., 2017; Meyer and Olson, 2011) contained this confound, because all stimuli served as expected and unexpected stimuli, with the expectation status solely determined by the preceding cue. Thus, participants were equally familiar with the images across expectation conditions.

      Second, for a similar reason the authors' argument for RS accounting for the different results does not hold either, in my opinion. Again, as Feuerriegel et al. 2021 correctly point out: "Adaptation-related effects can mimic ES when the expected stimuli are a repetition of the last-seen stimulus or have been encountered more recently than stimuli in neutral expectation conditions." However, it is critical to consider the precise design of previous studies. Take again the example of Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011. To my knowledge none of these studies contained manipulations that would result in a more frequent or recent repetition of any specific stimulus in the expected compared to unexpected condition. The crucial manipulation in all these previous studies is not that a single stimulus or stimulus feature (which could be subject to familiarity or RS) determines the expectation status, but rather the transitional probability (i.e. cue-stimulus pairing) of a particular stimulus given the cue. Therefore, unless I am missing something critical, simple RS seems unlikely to differ between expectation conditions in the previous studies and hence seems implausible to account for differences in results compared to the current study.

      Moreover, studies cited by the authors (e.g. Todorovic & de Lange, 2012) showed that RS and ES are separable in time, again making me wonder how avoiding stimulus repetition should account for the difference in the present study compared to previous ones. I am happy to be corrected in my understanding, but with the currently provided arguments by the authors I do not see how RS and familiarity can account for the discrepancy in results.

      I agree with the authors that stimulus familiarity is a clear difference compared to previous designs, but without a valid explanation why this should affect results I find this account rather unsatisfying. I see the key difference in that the authors manipulated category predictability, instead of exemplar prediction - i.e. searching for a car instead of your car. However, if results in support of OPT would indeed depend on using novel images (i.e. without stimulus repetition), would this not severely limit the scope of the account and hence also its relevance? Certainly, the account provided by the authors casts the net wider and tries to explain visual prediction. Relatedly, if OPT only applies during training, as the authors seem to argue, would this again not significantly narrow the scope of the theory? Combined, these two caveats would seem to demote the account from a general account of prediction and perception to one about perception during very specific circumstances. In my understanding the appeal of OPT is that it accounts for multiple challenges faced by the perceptual system, elegantly integrating them into a cohesive framework. Most of this would be lost by claiming that OPT's primary prediction would only apply to specific circumstances - novel stimuli during learning of predictions. Moreover, in the original formulation of the account, as outlined by Press et al., I do not see any particular reason why it should be limited to these specific circumstances. This does of course not mean that the present results are incorrect; however, it does require an adequate discussion and acknowledgement in the manuscript.

      Impact:

      McDermott et al. present an interesting study with potentially impactful results. However, given my concerns raised in this and the previous round of comments, I am not entirely convinced of the reliability of the results. Moreover, the difficulty of reconciling some of the present results with previous studies highlights the need for more convincing explanations of these discrepancies and a stronger discussion of the present results in the context of the literature.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public reviews:

      Reviewer 1 (Public Review):

      Many thanks for the positive and constructive feedback on the manuscript.

      This study reveals a great deal about how certain neural representations are altered by expectation and learning on shorter and longer timescales, so I am loath to describe certain limitations as 'weaknesses'. But one limitation inherent in this experimental design is that, by focusing on implicit, task-irrelevant predictions, there is not much opportunity to connect the predictive influences seen at the neural level to the perceptual performance itself (e.g., how participants make perceptual decisions about expected or unexpected events, or how these events are detected or appear).

      Thank you for the interesting comment. We now discuss the limitation of task-irrelevant prediction. In brief, some studies which showed sharpening found that task demands were relevant, while some studies which showed dampening were based on task-irrelevant predictions, but it is unlikely that task relevance - which was not manipulated in the current study - would explain the switch between sharpening and dampening that we observe within and across trials.

      The behavioural data that is displayed (from a post-recording behavioural session) shows that these predictions do influence perceptual choice - leading to faster reaction times when expectations are valid. In broad strokes, we may think that such a result is broadly consistent with a 'sharpening' view of perceptual prediction, and the fact that sharpening effects are found in the study to be larger at the end of the task than at the beginning. But it strikes me that the strongest test of the relevance of these (very interesting) EEG findings would be some evidence that the neural effects relate to behavioural influences (e.g., are participants actually more behaviourally sensitive to invalid signals in earlier phases of the experiment, given that this is where the neural effects show the most 'dampening' a.k.a., prediction error advantage?)

      Thank you for the suggestion. We calculated Pearson’s correlation coefficients for behavioural responses (difference in mean reaction times), neural responses during the sharpening effect (difference in decoding accuracy), and neural responses during the dampening effect for each participant, which resulted in null findings.
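
      For readers who want to see what such a brain-behaviour correlation involves, the following is a minimal sketch of a Pearson correlation across participants. It is not the authors' code: the function `pearson_r` is written out here for illustration, and the per-participant values are placeholders, not data from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)

# Placeholder per-participant values (illustration only):
# behavioural effect = invalid minus valid mean reaction time (ms),
# neural effect = valid minus invalid decoding accuracy in one time window.
rt_difference = [12.0, 30.5, -4.0, 18.2, 25.1, 7.3]
decoding_difference = [0.010, 0.004, -0.002, 0.015, 0.001, 0.008]

print(f"r = {pearson_r(rt_difference, decoding_difference):.3f}")
```

      The same function would be applied once per neural measure (sharpening window, dampening window), with one value per participant in each list.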

      Reviewer 2 (Public Review):

      Thank you for your helpful and constructive comments on the manuscript.

      The strength in controlling for repetition effects by introducing a neutral (50% expectation) condition also adds a weakness to the current version of the manuscript, as this neutral condition is not integrated into the behavioral (reaction times) and EEG (ERP and decoding) analyses. This procedure remained unclear to me. The reported results would be strengthened by showing differences between the neutral and expected (valid) conditions on the behavioral and neural levels. This would also provide a more rigorous check that participants had implicitly learned the associations between the picture category pairings.

      Following the reviewer's suggestion, we have included the neutral condition in the behavioural analysis and performed a repeated measures ANOVA on all three conditions.

      It is not entirely clear to me what is actually decoded in the prediction condition and why the authors did not perform decoding over trial bins in prediction decoding as potential differences across time could be hidden by averaging the data. The manuscript would generally benefit from a more detailed description of the analysis rationale and methods.

      In the original version of the manuscript, prediction decoding aimed at testing whether the upcoming stimulus category can be decoded from the response to the preceding (leading) stimulus. However, in response to the other Reviewers’ comments we have decided to remove the prediction decoding analysis from the revised manuscript, as it is now apparent that prediction decoding cannot be separated from category decoding based on pixel information.

      Finally, the scope of this study should be limited to expectation suppression in visual perception, as the generalization of these results to other sensory modalities or to the action domain remains open for future research.

      We have clarified the scope of the study in the revised manuscript.

      Reviewer 3 (Public Review):

      Thank you for the thought-provoking and interesting comments and suggestions.

      (1) The results in Figure 2C seem to show that the leading image itself can only be decoded with ~33% accuracy (25% chance; i.e. ~8% above chance decoding). In contrast, Figure 2E suggests the prediction (surprisingly, valid or invalid) during the leading image presentation can be decoded with ~62% accuracy (50% chance; i.e. ~12% above chance decoding). Unless I am misinterpreting the analyses, it seems implausible to me that a prediction, but not actually shown image, can be better decoded using EEG than an image that is presented on-screen.

      Following this and the remaining comments by the Reviewer (see below), we have decided to remove the prediction analysis from the manuscript. Specifically, we have focused on the Reviewer’s concern that it is implausible that image prediction would be better decoded than an image that is presented on-screen. This led us to perform a control analysis, in which we tried to decode the leading image category based on pixel values alone (rather than on EEG responses). Since this decoding was above chance, we could not rule out the possibility that EEG responses to leading images reflect physical differences between image categories. This issue does not extend to trailing images, as the results of the decoding analysis based on trailing images are based on accuracy comparisons between valid and invalid trials, and thus image features are counterbalanced. We would like to thank the Reviewer for raising this issue.

      (2) The "prediction decoding" analysis is described by the authors as "decoding the predictable trailing images based on the leading images". How this was done is however unclear to me. For each leading image decoding the predictable trailing images should be equivalent to decoding validity (as there were only 2 possible trailing image categories: 1 valid, 1 invalid). How is it then possible that the analysis is performed separately for valid and invalid trials? If the authors simply decode which leading image category was shown, but combine L1+L2 and L4+L5 into one class respectively, the resulting decoder would in my opinion not decode prediction, but instead dissociate the representation of L1+L2 from L4+L5, which may also explain why the time-course of the prediction peaks during the leading image stimulus-response, which is rather different compared to previous studies decoding predictions (e.g. Kok et al. 2017). Instead for the prediction analysis to be informative about the prediction, the decoder ought to decode the representation of the trailing image during the leading image and inter-stimulus interval. Therefore I am at present not convinced that the utilized analysis approach is informative about predictions.

      In this analysis, we attempted to decode (from the response to leading images) which trailing categories ought to be presented. The analysis was split between trials where the expected category was indeed presented (valid) vs. those in which it was not (invalid). The separation of valid vs. invalid trials in the prediction decoding analysis served as a sanity check, as no information about trial validity was yet available to participants. However, as mentioned above, we have decided to remove the “prediction decoding” analysis based on leading images as we cannot disentangle prediction decoding from category decoding.

      (3) I may be misunderstanding the reported statistics or analyses, but it seems unlikely that >10 of the reported contrasts have the exact same statistic of Tmax = 2.76. Similarly, it seems implausible, based on visual inspection of Figure 2, that the Tmax for the invalid condition decoding (reported as Tmax = 14.903) is substantially larger than for the valid condition decoding (reported as Tmax = 2.76), even though the valid condition appears to have superior peak decoding performance. Combined these details may raise concerns about the reliability of the reported statistics.

      Thank you for bringing this to our attention. This copy error has now been rectified.

      (4) The reported analyses and results do not seem to support the conclusion of early learning resulting in dampening and later stages in sharpening. Specifically, the authors appear to base this conclusion on the absence of a decoding effect in some time-bins, while in my opinion a contrast between time-bins, showing a difference in decoding accuracy, is required. Or better yet, a non-zero slope of decoding accuracy over time should be shown (not contingent on post-hoc and seemingly arbitrary binning).

      Thank you for the helpful suggestion. We have performed an additional analysis to address this issue: we calculated the trial-by-trial time-series of the decoding accuracy benefit for valid vs. invalid for each participant and averaged this benefit across time points for each of the two significant time windows. Based on this, we fitted a logarithmic model to quantify the change of this benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1% (i.e., accuracy was stabilized). Given the results of this analysis and to ensure a sufficient number of trials, we focussed our further analyses on bins 1-2 to directly assess the effects of learning. This is explained in more detail in the revised manuscript.
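
      The stabilization criterion described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes the model has the form benefit = a + b * ln(trial), and that "< 0.1%" means an absolute change of less than 0.001 in the fitted curve between consecutive trials (with accuracy expressed as a proportion). The function name `stabilization_trial` and the synthetic input are hypothetical.

```python
import math

def stabilization_trial(benefit, threshold=0.001):
    """Fit benefit = a + b*ln(trial) by least squares and return the first
    trial at which the fitted curve changes by less than `threshold`
    relative to the previous trial (None if this never happens).

    Assumption: `threshold` is an absolute per-trial change; the source
    does not state how the 0.1% criterion was defined.
    """
    n = len(benefit)
    if n < 2:
        raise ValueError("need at least two trials to fit a curve")
    x = [math.log(t) for t in range(1, n + 1)]
    mean_x = sum(x) / n
    mean_y = sum(benefit) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, benefit))
    slope = sxy / sxx  # b in a + b*ln(trial); the intercept cancels in differences
    for t in range(1, n):
        step = abs(slope * (math.log(t + 1) - math.log(t)))
        if step < threshold:
            return t + 1
    return None

# Synthetic, noise-free example (not data from the study).
synthetic_benefit = [0.5 + 0.02 * math.log(t) for t in range(1, 201)]
print(stabilization_trial(synthetic_benefit))
```

      Because the step size of a logarithmic curve shrinks monotonically, the first trial that meets the criterion is also the point after which every later step meets it.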

      (5) The present results both within and across trials are difficult to reconcile with previous studies using MEG (Kok et al., 2017; Han et al., 2019), single-unit and multi-unit recordings (Kumar et al., 2017; Meyer & Olson 2011), as well as fMRI (Richter et al., 2018), which investigated similar questions but yielded different results; i.e., no reversal within or across trials, as well as dampening effects after more training. The authors do not provide a convincing explanation as to why their results should differ from previous studies, arguably further compounding doubts about the present results raised by the methods and results concerns noted above.

      The discussion of these findings has been expanded in the revised manuscript. In short, the experimental design of the above studies did not allow for an assessment of these effects prior to learning. Several of them also used repeated stimuli (albeit some studies changed the pairings of stimuli between trials), potentially allowing for RS to confound their results.

      Recommendations for the Authors:

      Reviewer 1 (Recommendations for the authors):

      (1) On a first read, I was initially very confused by the statement on p.7 that each stimulus was only presented once - as I couldn't then work out how expectations were supposed to be learned! It became clear after reading the Methods that expectations are formed at the level of stimulus category (so categories are repeated multiple times even if exemplars are not). I suspect other readers could have a similar confusion, so it would be helpful if the description of the task in the 'Results' section (e.g., around p.7) was more explicit about the way that expectations were generated, and the (very large) stimulus set that examples are being drawn from.

      Following your suggestion, we have clarified the paradigm by adding details about the categories and the manner in which expectations are formed.

      (2) p.23: the authors write that their 1D decoding images were "subjected to statistical inference amounting to a paired t-test between valid and invalid categories". What is meant by 'amounting to' here? Was it a paired t-test or something statistically equivalent? If so, I would just say 'subjected to a paired t-test' to avoid any confusion, or explain explicitly which statistic the inference was performed over.

      We have rephrased this as “subjected to (1) a one-sample t-test against chance-level, equivalent to a fixed-effects analysis, and (2) a paired t-test”.

      Relatedly, this description of an analysis amounting to a 'paired t-test' only seems relevant for the sensory decoding and memory decoding analyses (where there are validity effects) rather than the prediction decoding analysis. As far as I can tell the important thing is that the expected image category can be decoded, not that it can be decoded better or worse on valid or invalid trials.

      In the previous version of the manuscript, the comparison of prediction decoding between valid and invalid trials was meant as a sanity check. However, in response to the other Reviewers’ comments we have decided to remove the prediction decoding analysis from the revised manuscript due to confounds.

      It would be helpful if authors could say a bit more about how the statistical inferences were done for the prediction decoding analyses and the 'condition against baseline' contrasts (e.g., when it is stated that decoding accuracy in valid trials, in general, is above 0 at some cluster-wise corrected value). My guess is that this amounts to something like a one-sample t-test - but it may be worth noting that one-sample t-tests on information measures like decoding accuracy cannot support population-level inference, because these measures cannot meaningfully be below 0 (see Allefeld et al, 2016).

      When testing for decoding accuracy against baseline, we used one-sample t-tests against chance level (rather than against 0) throughout the manuscript. We now clarify in the manuscript that this corresponds to a fixed-effects analysis (Allefeld et al., 2016). In contrast, when testing for differences in decoding accuracy between valid and invalid conditions, we used paired-sample t-tests. As mentioned above, the prediction decoding analysis has been removed from the analysis.
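
      The two kinds of test described above differ only in what is compared. The sketch below illustrates this at a single time point; it is not the authors' analysis code. It assumes one decoding accuracy per participant and condition, uses made-up values, and omits the cluster-based FWE correction across time points. The function names are hypothetical.

```python
import math
import statistics

def one_sample_t(values, popmean):
    """t statistic for testing whether the mean of `values` differs from `popmean`."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - popmean) / standard_error

def paired_t(a, b):
    """Paired t statistic: a one-sample t-test of the pairwise differences against 0."""
    differences = [x - y for x, y in zip(a, b)]
    return one_sample_t(differences, 0.0)

# Made-up per-participant accuracies at one time point (not data from the study).
valid = [0.56, 0.58, 0.54, 0.60, 0.57]
invalid = [0.53, 0.55, 0.54, 0.56, 0.52]
chance = 0.5  # assumed chance level for a two-class decoder

print(one_sample_t(valid, chance))  # condition against chance level
print(paired_t(valid, invalid))     # validity effect (valid vs. invalid)
```

      Testing against the chance level rather than against 0 changes only the reference value of the one-sample test; the paired test is unaffected by the choice of chance level because that value cancels in the differences.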

      (3) By design, the researchers focus on implicit predictive learning which means the expectations being formed are (by definition) task-irrelevant. I thought it could be interesting if the authors might speculate in the discussion on how they think their results may or may not differ when predictions are deployed in task-relevant scenarios, particularly given that some studies have found sharpening effects do not seem to depend on task demands (e.g., Kok et al., 2012; Yon et al., 2018), while other studies have found that some dampening effects do seem to depend on what the observer is attending to (e.g., Richter et al., 2018). Do these results hint at a possible explanation for why this might be? Even if the authors think they don't, it might be helpful to say so!

      Thank you for the interesting comment. We have expanded on this in the revised manuscript.

      Reviewer 2 (Recommendations for the authors):

      Methods/results

      (1) The goal of this study is the assessment of expectation effects during statistical learning while controlling for repetition effects, one of the common confounds in prediction suppression studies (see, Feuerriegel et al., 2021). I agree that this is an important aspect and I assume that this was the reason why the authors introduced the P=0.5 neutral condition (Figure 1B, L3). However, I completely missed the analyses of this condition in the manuscript. In the figure caption of Figure 1C, it is stated that the reaction times of the valid, invalid, and neutral conditions are shown, but only data from the valid and invalid conditions are depicted. To ensure that participants had built up expectations and had learned the pairing, one would not only expect a difference between the valid and invalid conditions but also between the valid and neutral conditions. Moreover, it would also be important to integrate the neutral condition in the multivariate EEG analysis to actually control for repetition effects. Instead, the authors constructed another control condition based on the arbitrary pairings. But why was the neutral condition not compared to the valid and invalid prediction decoding results? Besides this, I also suggest calculating the ERP for the neutral condition and adding it to Figure 2A to provide a more complete picture.

      As mentioned above, we have included the neutral condition in the behavioural analysis, as outlined in the revised manuscript. We have also included a repeated measures ANOVA on all 3 conditions. The purpose of the neutral condition was not to avoid RS, but rather to provide a control condition. We avoided repetition by using individual, categorised stimuli. Figure 1C has been amended to include the neutral condition. In response to the remaining comments, we have decided to remove the prediction decoding analysis from the manuscript.

      (2) One of the main results that is taken as evidence for the OPT is that there is higher decoding accuracy for valid trials (indicating sharpening) early in the trial and higher decoding accuracy for invalid trials (indicating dampening) later in the trial. I would have expected this result for prediction decoding, which surprisingly showed neither of the two effects. Instead, the result pattern occurred in sensory decoding only, and partly (early sharpening) in memory decoding. How do the authors explain these results? Additionally, I would have expected similar results in the ERP; however, only the early effect was observed. I missed a more thorough discussion of this rather complex result pattern. The lack of the opposing effect in prediction decoding limits the overall conclusion, which needs to be revised accordingly.

      Since sharpening vs. dampening rests on the comparison between valid and invalid trials, evidence for sharpening vs. dampening could only be obtained from decoding based on responses to trailing images. In prediction decoding (removed from the current version), information about the validity of the trial is not yet available. Thus, our original plan was to compare this analysis with the effects of validity on the decoding of trailing images (i.e. we expected valid trials to be decoded more accurately after the trailing image than before). The results of the memory decoding did mirror the sensory decoding of the trailing image in that we found significantly higher decoding accuracy of the valid trials from 123-180 ms. As with the sensory decoding, there was a tendency towards a later flip (280-296 ms) where decoding accuracy of invalid trials became nominally higher, but this effect did not reach statistical significance in the memory decoding.

      (3) To increase the comprehensibility of the result pattern, it would be helpful for the reader to clearly state the hypotheses for the ERP and multivariate EEG analyses. What did you expect for the separate decoding analyses? How should the results of different decoding analyses differ and why? Which result pattern would (partly, or not) support the OPT?

      Our hypotheses are now stated in the revised manuscript.

      (4) I was wondering why the authors did not test for changes during learning for prediction decoding. Despite the fact that there were no significant differences between valid and invalid conditions within-trial, differences could still emerge when the data set is separated into bins. Please test and report the results.

      As mentioned above, we have decided to remove the prediction decoding analysis from the current version of the manuscript.

      (5) To assess the effect of learning the authors write: 'Given the apparent consistency of bins 2-4, we focused our analyses on bins 1-2.' Please explain what you mean by 'apparent consistency'. Did you test for consistency or is it based on descriptive results? Why do the authors not provide the complete picture and perform the analyses for all bins? This would allow for a better assessment of changes over time between valid and invalid conditions. In Figure 3, were valid and invalid trials different in any of the QT3 or QT4 bins in sensory or memory encoding?

      We have performed an additional analysis to address this issue. The reasoning behind the decision to focus on bins 1-2 is now explained in the revised manuscript. In short, fitting a learning curve to trial-by-trial decoding estimates indicates that decoding stabilizes within <50% of the trials. To quantify changes in decoding occurring within these <50% of the trials while ensuring a sufficient number of trials for statistical comparisons, we decided to focus on bins 1-2 only.

      (6) Please provide the effect size for all statistical tests.

      Effect sizes have now been provided.

      (7) Please provide exact p-values for non-significant results and significant results larger than 0.001.

      Exact p-values have now been provided.

      (8) Decoding analyses: I suppose there is a copy/paste error in the T-values as nearly all T-values on pages 11 and 12 are identical (2.76) leading to highly significant p-values (0.001) as well as non-significant effects (>0.05). Please check.

      Thank you for bringing this to our attention. This error has now been corrected.

      (9) Page 12: There were some misleading phrases in the Results section. To give one example: 'control analyses was slightly above change' - this sounds like a close to non-significant effect, but it was indeed a highly significant effect of p<0.001. Please revise.

      This phrase was part of the prediction decoding analysis and has therefore been removed.

      (10) Sample size: How was the sample size of the study determined (N=31)? Why did only a subgroup of participants perform the behavioral categorization task after the EEG recording? With a larger sample, it would have been interesting to test if participants who showed better learning (larger difference in reaction times between valid and invalid conditions) also showed higher decoding accuracies.

      This has been clarified in the revised manuscript. In short, the larger sample size of N=31 was based on previous research; ten participants were initially tested as part of a pilot which was then expanded to include the categorisation task.

      (11) I assume catch trials were removed before data analyses?

      We have clarified that catch trials were indeed removed prior to analyses.

      (12) Page 23, 1st line: 'In each, the decoder...' Something is missing here.

      Thank you for bringing this to our attention. This sentence has now been rephrased as “In both valid and invalid analyses” in the revised manuscript.

      Discussion

      (1) The analysis over multiple trials showed dampening within the first 15 min followed by sharpening. I found the discussion of this finding very lengthy and speculative (page 17). I recommend shortening this part and providing only the main arguments that could stimulate future research.

      Thank you for the suggestion. Since Reviewer 3 has requested additional details in this part of the discussion, we have opted to keep this paragraph in the manuscript. However, we have also made it clearer that this section is relatively speculative and the arguments provided for the across trials dynamics are meant to stimulate further research.

      (2) As this task is purely perceptual, the results support the OPT for the area of visual perception. For action, different results have been reported. Suppression within-trial has been shown to be larger for expected than unexpected features of action targets and suppression even starts before the start of the movement without showing any evidence for sharpening (e.g., Fuehrer et al., 2022, PNAS). For suppression across trials, it has been found that suppression decreases over the course of learning to associate a sensory consequence to a specific action (e.g., Kilteni et al., 2019, eLife). Therefore, expectation suppression might function differently in perception and action (an area that still requires further research). Please clarify the scope of your study and results on perceptual expectations in the introduction, discussion, and abstract.

      We have clarified the scope of the study in the revised manuscript.

      Figures

      (1) Figure 1A: Add 't' to the arrow to indicate time.

      This has been rectified.

      (2) Figure 3: In the figure caption, sensory and memory decoding seem to be mixed up. Please correct. Please add what the dashed horizontal line indicates.

      Thank you for bringing this to our attention, this has been rectified.

      Reviewer 3 (Recommendations for the authors):

      I applaud the authors for a well-written introduction and an excellent summary of a complicated topic, giving fair treatment to the different accounts proposed in the literature. However, I believe a few additional studies should be cited in the Introduction, particularly time-resolved studies such as Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011. This would provide the reader with a broader picture of the current state of the literature, as well as point the reader to critical time-resolved studies that did not find evidence in support of OPT, which are important to consider in the interpretation of the present results.

      The introduction has been expanded to include the aforementioned studies in the revised manuscript.

      Given previous neuroimaging studies investigating the present phenomenon, including with time-resolved measures (e.g. Kok et al., 2017; Han et al., 2019; Kumar et al., 2017; Meyer & Olson 2011), why do the authors think that their data, design, or analysis allowed them to find support for OPT but not previous studies? I do not see obvious modifications to the paradigm, data quantity or quality, or the analyses that would suggest a superior ability to test OPT predictions compared to previous studies. Given concerns regarding the data analyses (see points below), I think it is essential to convincingly answer this question to convince the reader to trust the present results.

      The most obvious alteration to the paradigm is the use of non-repeated stimuli. Each of the above time-resolved studies utilised repeated stimuli (either repeated, identical stimuli, or paired stimuli where pairings are changed but the pool of stimuli remains the same), allowing for RS to act as a confound as exemplars are still presented multiple times. By removing this confound, it is entirely plausible that we may find different time-resolved results given that it has been shown that RS and ES are separable in time (Todorovic & de Lange, 2012). We also test during learning rather than training participants on the task beforehand. By foregoing a training session, we are better equipped to assess OPT predictions as they emerge. In our across-trial results, learning appears to take place after approximately 15 minutes or 432 trials, at which point dampening reverses to sharpening. Had we trained the participants prior to testing, this effect would have been lost.

      What is actually decoded in the "prediction decoding" analysis? The authors state that it is "decoding the predictable trailing images based on the leading images" (p.11). The associated chance level (Figure 2E) is indicated as 50%. This suggests that the classes separated by the SVM are T6 vs T7. How this was done is however unclear. For each leading image decoding the predictable trailing images should be equivalent to decoding validity (as there are only 2 possible trailing images, where one is the valid and the other the invalid image). How is it then possible that the analysis is performed separately for valid and invalid trials? Are the authors simply decoding which leading image was shown, but combine L1+L2 and L4+L5 into one class respectively? If so, this needs to be better explained in the manuscript. Moreover, the resulting decoder would in my opinion not decode the predicted image, but instead learn to dissociate the representation of L1+L2 from L4+L5, which may also explain why the time course of the prediction peaks during the leading image stimulus-response, which is rather different compared to previous studies decoding (prestimulus) predictions (e.g. Kok et al. 2017). If this is indeed the case, I find it doubtful that this analysis relates to prediction. Instead for the prediction analysis to be informative about the predicted image the authors should, in my opinion, train the decoder on the representation of trailing images and test it during the prestimulus interval.

      As mentioned above, the prediction decoding analysis has been removed from the manuscript. The prediction decoding analysis was intended as a sanity check, as validity information was not yet available to participants.

      Related to the point above, were the leading/trailing image categories and their mapping to L1, L2, etc. in Figure 1B fixed across subjects? I.e. "'beach' and 'barn' as 'Leading' categories would result in 'church' as a 'Trailing' category with 75% validity" (p.20) for all participants? If so, this poses additional problems for the interpretation of the analysis discussed in the point above, as it may invalidate the control analyses depicted in Figure 2E, as systematic differences and similarities in the leading image categories could account for the observed results.

      Image categories and their mapping were indeed fixed across participants. While this may result in physical differences and similarities between images influencing results, counterbalancing categories across participants would not have addressed this issue. For example, had we swapped “beach” with “barn” in another participant, physical differences between images may still be reflected in the prediction decoding. On the other hand, counterbalancing categories across trials was not possible given our aim of examining the initial stages of learning over trials. Had we changed the mappings of categories throughout the experiment for each participant, we would have introduced reversal learning and nullified our ability to examine the initial stages of learning under flat priors. In any case, the prediction decoding analysis has been removed from the manuscript, as outlined above.

      Why was the neutral condition L3 not used for prediction decoding? After all, if during prediction decoding both the valid and invalid image can be decoded, as suggested by the authors, we would also expect significant decoding of T8/T9 during the L3 presentation.

      In the neutral condition, L3 was followed by T8 vs. T9 with 50% probability, precluding prediction decoding. While this could have served as an additional control analysis for EEG-based decoding, we have opted for removing prediction decoding from the analysis. However, in response to the other Reviewers’ comments, the neutral condition has now been included in the behavioral analysis.

      The following concern may arise due to a misunderstanding of the analyses, but I found the results in Figures 2C and 2E concerning. If my interpretation is correct, then these results suggest that the leading image itself can only be decoded with ~33% accuracy (25% chance; i.e. ~8% above chance decoding). In contrast, the predicted (valid or invalid) image during the leading image presentation can be decoded with ~62% accuracy (50% chance; i.e. ~12% above chance decoding). Does this seem reasonable? Unless I am misinterpreting the analyses, it seems implausible to me that a prediction but not actually shown image can be better decoded than an on-screen image. Moreover, to my knowledge studies reporting decoding of predictions can (1) decode expectations just above chance level (e.g. Kok et al., 2017; which is expected given the nature of what is decoded) and (2) report these prestimulus effects shortly before the anticipated stimulus onset, and not coinciding with the leading image onset ~800ms before the predicted stimulus onset. For the above reasons, the key results reported in the present manuscript seem implausible to me and may suggest the possibility of problems in the training or interpretation of the decoding analysis. If I misunderstood the analyses, the analysis text needs to be refined. If I understood the analyses correctly, at the very least the authors would need to provide strong support and arguments to convince the reader that the effects are reliable (ruling out bias and explaining why predictions can be decoded better than on-screen stimuli) and sensible (in the context of previous studies showing different time-courses and results).

      As explained above, we have addressed this concern by performing an additional analysis, implementing decoding based on image pixel values. Indeed we could not rule out the possibility that “prediction” decoding reflected stimulus differences between leading images.

      Relatedly, the authors use the prestimulus interval (-200 ms to 0 ms before predicted stimulus onset) as the baseline period. Given that this period coincides with prestimulus expectation effects (Kok et al., 2017), would this not result in a bias during trailing image decoding? In other words, the baseline period would contain an anticipatory representation of the expected stimulus (Kok et al., 2017), which is then subtracted from the subsequent EEG signal, thereby allowing the decoder to pick up on this "negative representation" of the expected image. It seems to me that a cleaner contrast would be to use the 200 ms before leading image onset as the baseline.

      The analysis of trailing images aimed at testing specific hypotheses related to differences between decoding accuracy in valid vs. invalid trials. Since the baseline was by definition the same for both kinds of trials (since information about validity only appears at the onset of the trailing image), changing the baseline would not affect the results of the analysis. Valid and invalid trials would have the same prestimulus effect induced by the leading image.

      Again, maybe I misunderstood the analyses, but what exactly are the statistics reported on p. 11 onward? Why is the reported Tmax identical for multiple conditions, including the difference between conditions? Without further information this seems highly unlikely, further casting doubts on the rigor of the applied methods/analyses. For example: "In the sensory decoding analysis based on leading images, decoding accuracy was above chance for both valid (Tmax= 2.76, pFWE < 0.001) and invalid trials (Tmax= 2.76, pFWE < 0.001) from 100 ms, with no significant difference between them (Tmax= 2.76, pFWE > 0.05) (Fig. 2C)" (p.11).

      Thank you for bringing this to our attention. As previously mentioned, this copy error has been rectified in the revised manuscript.

      Relatedly, the statistics reported below in the same paragraph also seem unusual. Specifically, the Tmax difference between valid and invalid conditions seems unexpectedly large given visual inspection of the associated figure: "The decoding accuracy of both valid (Tmax = 2.76, pFWE < 0.001) and invalid trials (Tmax = 14.903, pFWE < 0.001)" (p.12). In fact, visual inspection suggests that the largest difference should probably be observed for the valid not invalid trials (i.e. larger Tmax).

      This copy error has also been rectified in the revised manuscript.

      Moreover, multiple subsequent sections of the Results continue to report the exact same Tmax value. I will not list all appearances of "Tmax = 2.76" here but would recommend the authors carefully check the reported statistics and analysis code, as it seems highly unlikely that >10 contrasts have exactly the same Tmax. Alternatively, if I misunderstand the applied methods, it would be essential to better explain the utilized method to avoid similar confusion in prospective readers.

      This error has also now been rectified. As mentioned above the prediction decoding analysis has been removed.

      I am not fully convinced that Figures 3A/B and the associated results support the idea that early learning stages result in dampening and later stages in sharpening. The inference made requires, in my opinion, more than a significant effect in one time bin and the absence of an effect in other bins. Instead, to reliably make this inference one would need a contrast showing a difference in decoding accuracy between bins, or ideally an analysis not contingent on seemingly arbitrary binning of data, but a decrease (or increase) in the slope of the decoding accuracy across trials. Moreover, the decoding analyses seem to be at the edge of SNR, hence making any interpretation that depends on the absence of an effect in some bins yet more problematic and implausible.

      Thank you for the helpful suggestion. As previously mentioned, we fitted a logarithmic model to quantify the change of the decoding benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1%. Given the results of this analysis and to ensure a sufficient number of trials, we focussed our further analyses on bins 1-2. This is explained in more detail in the revised manuscript.

      Relatedly, based on the literature there is no reason to assume that the dampening effect disappears with more training, thereby placing more burden of proof on the present results. Indeed, key studies supporting the dampening account (including human fMRI and MEG studies, as well as electrophysiology in non-human primates) usually seem to entail more learning than has occurred in bin 2 of the present study. How do the authors reconcile the observation that more training in previous studies results in significant dampening, while here the dampening effect is claimed to disappear with less training?

      The discussion of these findings has been expanded in the revised manuscript. As previously outlined, many of the studies supporting dampening did not explicitly test the effects of learning as they emerge, nor did they control for RS to the same extent.

      The Methods section is quite bare bones. This makes an exact replication difficult or even impossible. For example, the sections elaborating on the GLM and cluster-based FWE correction do not specify enough detail to replicate the procedure. Similarly, how exactly the time points for significant decoding effects were determined is unclear (e.g., p. 11). Relatedly, the explanation of the decoding analysis, e.g. the choice to perform PCA before decoding, is not well explained in the present iteration of the manuscript. Additionally, it is not mentioned how many PCs the applied threshold on average resulted in.

      Thank you for this suggestion, we have described our methods in more detail.

      To me, it is unclear whether the PCA step, which to my knowledge is not the default procedure for most decoding analyses using EEG, is essential to obtain the present results. While PCA is certainly not unusual, to my knowledge decoding of EEG data is frequently performed on the sensor level as SVMs are usually capable of dealing with the (relatively low) dimensionality of EEG data. In isolation this decision may not be too concerning; however, in combination with other doubts concerning the methods and results, I would suggest the authors replicate their analyses using a conventional decoding approach on the sensor level as well.

      Thank you for this suggestion, we have explained our decision to use PCA in the revised manuscript.

      Several choices, like the binning and the focus on bins 1-2 seem rather post-hoc. Consequently, frequentist statistics may strictly speaking not be appropriate. This further compounds above mentioned concerns regarding the reliability of the results.

      The reasoning behind our decision to focus on bins 1-2 is now explained in more detail in the revised manuscript.

      A notable difference in the present study, compared to most studies cited in the introduction motivating the present experiment, is that categories instead of exemplars were predicted.

      This seems like an important distinction to me, which surprisingly goes unaddressed in the Discussion section. This difference might be important, given that exemplar expectations allow for predictions across various feature levels (i.e., even at the pixel level), while category predictions only allow for rough (categorical) predictions.

      The decision to use categorical predictions over exemplars lies in the issue of RS, as it is impossible to control for RS while repeating stimuli over many trials. This has been discussed in more detail in the revised manuscript.

      While individually minor problems, I noticed multiple issues across several figures or associated figure texts. For example: Figure 1C only shows valid and invalid trials, but the figure text mentions the neutral condition. Why is the neutral condition not depicted but mentioned here? Additionally, the figure text lacks critical information, e.g. what the asterisk represents. The error shading in Figure 2 would benefit from transparency settings to not completely obscure the other time-courses. Increasing the figure content and font size within the figure (e.g. axis labels) would also help with legibility (e.g. consider compressing the time-course but therefore increasing the overall size of the figure). I would also recommend using more common methods to indicate statistical significance, such as a bar at the bottom of the time-course figure typically used for cluster permutation results instead of a box. Why is there no error shading in Figure 2A but all other panels? Fig 2C-F has the y-axis label "Decoding accuracy (%)" but certainly the y-axis, ranging roughly from 0.2 to 0.7, is not in %. The Figure 3 figure text gives no indication of what the error bars represent, making it impossible to interpret the depicted data. In general, I would recommend that the authors carefully revisit the figures and figure text to improve the quality and complete the information.

      Thank you for the suggestions. Figure 1C now includes the neutral condition. Asterisks denote significant results. The font size in Figure 2C-E has been increased. The y-axis on Figure 2C-E has been amended to accurately reflect decoding accuracy in percentage. Figure 2A does have error shading; however, the error is small enough that the shading is difficult to see. The error bars in Figure 3 have been clarified.

      Given the choice of journal (eLife), which aims to support open science, I was surprised to find no indication of (planned) data or code sharing in the manuscript.

      Plans for sharing code/data are now outlined in the revised manuscript.

      While it is explained in sufficient detail later in the Methods section, it was not entirely clear to me, based on the method summary at the beginning of the Results section, whether categories or individual exemplars were predicted. The manuscript may benefit from clarifying this at the start of the Results section.

      Thank you for this suggestion. Following this and suggestions from other reviewers, the experimental paradigm and the mappings between categories have been further explained in the revised manuscript to make it clearer that predictions are made at the categorical level.

      "Unexpected trials resulted in a significantly increased neural response 150 ms after image onset" (p.9). I assume the authors mean the more pronounced negative deflection here. Interpreting this, especially within the Results section as "increased neural response" without additional justification may stretch the inferences we can make from ERP data; i.e. to my knowledge more pronounced ERPs could also reflect increased synchrony. That said, I do agree with the authors that it is likely to reflect increased sensory responses, it would just be useful to be more cautious in the inference.

      Thank you for the interesting comment. This has been rephrased as a “more pronounced negative deflection” in the revised manuscript.

      Why was the ERP analysis focused exclusively on Oz? Why not a cluster around Oz? For object images, we may expect a rather wide dipole.

      Feuerriegel et al. (2021) have outlined issues questioning the robustness of univariate analyses for ES. As such, we opted for a targeted ROI approach on the channel showing the peak amplitude of the visually evoked response (Fig. 2B). More details on this are in the revised manuscript.

      How exactly did the authors perform FWE? The description in the Method section does not appear to provide sufficient detail to replicate the procedure.

      FWE as implemented in SPM is a cluster-based method of correcting for multiple comparisons using random field theory. We have explained our thresholding methods in more detail in the revised manuscript.

      If I misunderstand the authors and they did indeed perform standard cluster permutation analyses, then I believe the results of the timing of significant clusters cannot be so readily interpreted as done here (e.g. p.11-12); see: Maris & Oostenveld 2007; Sassenhagen & Draschkow 2019.

      All statistics were based on FWE under random field theory assumptions (as implemented in SPM) rather than on cluster permutation tests (as implemented in, e.g., FieldTrip).

      Why did the authors choose not to perform spatiotemporal cluster permutation for the ERP results?

      As mentioned above, we opted to target our ERP analyses on Oz due to controversies in the literature regarding univariate effects of ES (Feuerriegel et al., 2021).

      Some results, e.g. on p.12 are reported as T29 instead of Tmax. Why?

      As mentioned above, prediction decoding analyses have been removed from the manuscript.

    1. eLife Assessment

      This valuable manuscript addresses the longstanding question of how the brain maintains serial order in working memory, proposing a biologically grounded model based on synaptic augmentation mechanisms that operates on longer time scales than facilitation. The authors show that augmentation provides a mechanism by which this order can be maintained in memory thanks to a temporal gradient of synaptic efficacies. Although the evidence remains incomplete at present, it can be made stronger by demonstrating robustness to network heterogeneity, spiking, and threshold values for encoding the working memory.

    2. Reviewer #1 (Public review):

      Summary:

      The issue of how the brain can maintain the serial order of presented items in working memory is a major unsolved question in cognitive neuroscience. It has been proposed that this serial order maintenance could be achieved thanks to periodic reactivations of different presented items at different phases of an oscillation, but the mechanisms by which this could be achieved by brain networks, as well as the mechanisms of read-out, are still unclear. In an influential 2008 paper, the authors have proposed a mechanism by which a recurrent network of neurons could maintain multiple items in working memory, thanks to 'population spikes' of populations of neurons encoding for the different items, occurring at alternating times. These population spikes occur in a specific regime of the network and are a result of synaptic facilitation, an experimentally observed type of synaptic short-term dynamics with time scales of order hundreds of ms.

      In the present manuscript, the authors extend their model to include another type of experimentally observed short-term synaptic plasticity termed synaptic augmentation, which operates on longer time scales on the order of 10s. They show that while a network without augmentation loses information about serial order, augmentation provides a mechanism by which this order can be maintained in memory thanks to a temporal gradient of synaptic efficacies. The order can then be read out using a read-out network whose synapses are also endowed with synaptic augmentation. Interestingly, the read-out speed can be regulated using background inputs.

      Strengths:

      This is an elegant solution to the problem of serial order maintenance that only relies on experimentally observed features of synapses. The model is consistent with a number of experimental observations in humans and monkeys. The paper will be of interest to a broad readership, and I believe it will have a strong impact on the field.

      Weaknesses:

      (1) The network they propose is extremely simple. This simplicity has pros and cons: on the one hand, it is nice to see the basic phenomenon exposed in the simplest possible setting. On the other hand, it would also be reassuring to check that the mechanism is robust when implemented in a more realistic setting, using, for instance, a network of spiking neurons similar to the one they used in the 2008 paper. The more noisy and heterogeneous the setting, the better.

      (2) One major issue with the population spike scenario is that (to my knowledge) there is no evidence that these highly synchronized events occur in delay periods of working memory experiments. It seems that highly synchronized population spikes would imply (a) a strong regularity of spike trains of neurons, at odds with what is typically observed in vivo (b) high synchronization of neurons encoding for the same item (and also of different items in situations where multiple items have to be held in working memory), also at odds with in vivo recordings that typically indicate weak synchronization at best. It would be nice if the authors at least mention this issue, and speculate on what could possibly bridge the gap between their highly regular and synchronized network, and brain networks that seem to lie at the opposite extreme (highly irregular and weakly synchronized). Of course, if they can demonstrate using a spiking network simulation that they can bridge the gap, even better.

    3. Reviewer #2 (Public review):

      In this manuscript, the authors present a model to explain how working memory (WM) encodes both existence and timing simultaneously using transient synaptic augmentation. A simple yet intriguing idea.

      The model presented here has the potential to explain what previous theories like 'active maintenance via attractors' and 'liquid state machine' do not, and describe how novel sequences are immediately stored in WM. Altogether, the topic is of great interest to those studying higher cognitive processes, and the conclusions the authors draw are certainly thought-provoking from an experimental perspective. However, several questions remain that need to be addressed.

      The study relates to the well-known computational theory for working memory, which suggests short-term synaptic facilitation is required to maintain working memory, but doesn't rely on persistent spiking. This previous theory appears similar to the proposed theory, except for the change from facilitation to augmentation. A more detailed explanation of why the authors use augmentation instead of facilitation in this paper is warranted: is the facilitation too short to explain the whole process of WM? Can the theory with synaptic facilitation also explain the immediate storage of novel sequences in WM?

      In Figure 1, the authors mention that synaptic augmentation leads to an increased firing rate even after stimulus presentation. It would be good to determine, perhaps, what the lowest threshold is to see the encoding of a WM task, and whether that is biologically plausible.

      In the middle panel of Figure 4, after 15-16 sec, when the neuronal population prioritizes with the second retro-cue, although the second retro-cue item's synaptic spike dominates, why is the augmentation for the first retro-cue item higher than the second-cue augmentation until the 20 sec?

    4. Author response:

      Reviewer #1 (Public Review):

      (1) The network they propose is extremely simple. This simplicity has pros and cons: on the one hand, it is nice to see the basic phenomenon exposed in the simplest possible setting. On the other hand, it would also be reassuring to check that the mechanism is robust when implemented in a more realistic setting, using, for instance, a network of spiking neurons similar to the one they used in the 2008 paper. The more noisy and heterogeneous the setting, the better.

      The choice of a minimal model to illustrate our hypothesis is deliberate. Our main goal was to suggest a physiologically-grounded mechanism to rapidly encode temporally-structured information (i.e., sequences of stimuli) in Working Memory, where none was available before. Indeed, as discussed in the manuscript, previous proposals were unsatisfactory in several respects. In view of our main goal, we believe that a spiking implementation is beyond the scope of the present work.

      We would like to note that the mechanism originally proposed in Mongillo et al. (2008) has been repeatedly implemented, by many different groups, in various spiking network models with different levels of biological realism (see, e.g., Lundqvist et al. (2016), for an especially ‘detailed’ implementation) and, in all cases, the relevant dynamics has been observed. We take this as an indication of ‘robustness’; the relevant network dynamics doesn’t critically depend on many implementation details and, importantly, this dynamics is qualitatively captured by a simple rate model (see, e.g., Mi et al. (2017)).

      In the present work, we make a relatively ‘minor’ (from a dynamical point of view) extension of the original model, i.e., we just add augmentation. Accordingly, we are fairly confident that a set of parameters for the augmentation dynamics can be found such that the spiking network behaves, qualitatively, as the rate model. A meaningful study, in our opinion, would then require extensively testing the (large) parameter space (different models of augmentation?) to see how the network behavior compares with the relevant experimental observations (which ones? behavioral? physiological?). As said above, we believe that this is beyond the scope of the present work.

      This being said, we definitely agree with the reviewer that not presenting a spiking implementation is a limitation of the present work. We will clearly acknowledge, and discuss, this limitation in the revised version.

      (2) One major issue with the population spike scenario is that (to my knowledge) there is no evidence that these highly synchronized events occur in delay periods of working memory experiments. It seems that highly synchronized population spikes would imply (a) a strong regularity of spike trains of neurons, at odds with what is typically observed in vivo (b) high synchronization of neurons encoding for the same item (and also of different items in situations where multiple items have to be held in working memory), also at odds with in vivo recordings that typically indicate weak synchronization at best. It would be nice if the authors at least mention this issue, and speculate on what could possibly bridge the gap between their highly regular and synchronized network, and brain networks that seem to lie at the opposite extreme (highly irregular and weakly synchronized). Of course, if they can demonstrate using a spiking network simulation that they can bridge the gap, even better.

      Direct experimental evidence (in monkeys) in support of the existence of highly synchronized events -- to be identified with the ‘population spikes’ of our model -- during the delay period of a memory task is available in the literature and we have cited it, i.e., Panichello et al. (2024). In the revised version, we will provide an explicit discussion of the results of Panichello et al. (2024) and how these results directly relate to our model. After submission, we became aware of another experimental study (in humans) specifically dealing with sequence memory, i.e., Liebe et al. (2025). Their results, again, are fully consistent with our model. We will also provide an explicit discussion of these results in the revised version.

      We note that there is no fundamental contradiction between highly synchronized events in ‘small’ neural populations (e.g., a cell assembly) on one hand, and temporally irregular (i.e., Poisson-like) spiking at the single-neuron level and weakly synchronized activity at the network level, on the other hand. This was already illustrated in our original publication, i.e., Mongillo et al. (2008) (see, in particular, Fig. S2).

      We further note that the mechanism we propose to encode temporal order -- a temporal gradient in the synaptic efficacies brought about by synaptic augmentation -- would also work if the memory of the items is maintained by ‘tonic’ persistent activity (i.e., without highly synchronized events), provided this activity occurs at suitably low rates such as to prevent the saturation of the synaptic augmentation.

      We will include a detailed discussion of these points in the revised version.

      Reviewer #2 (Public Review):

      The study relates to the well-known computational theory for working memory, which suggests short-term synaptic facilitation is required to maintain working memory, but doesn't rely on persistent spiking. This previous theory appears similar to the proposed theory, except for the change from facilitation to augmentation. A more detailed explanation of why the authors use augmentation instead of facilitation in this paper is warranted: is the facilitation too short to explain the whole process of WM? Can the theory with synaptic facilitation also explain the immediate storage of novel sequences in WM?

      In the model, synaptic dynamics displays both short-term facilitation and augmentation (and short-term depression). Indeed, synaptic facilitation, alone, would be too short-lived to encode novel sequences. This is illustrated in Fig. 1B. We will provide a more detailed discussion of this point in the revised version.

      In Figure 1, the authors mention that synaptic augmentation leads to an increased firing rate even after stimulus presentation. It would be good to determine, perhaps, what the lowest threshold is to see the encoding of a WM task, and whether that is biologically plausible.

      We believe that this comment is related to the above point. The reviewer is correct; augmentation alone would require fairly long stimulus presentations to encode an item in WM. ‘Fast’ encoding, indeed, is guaranteed by the presence of short-term facilitation. We will emphasize this important point in the revised version.

      In the middle panel of Figure 4, after 15-16 sec, when the neuronal population prioritizes with the second retro-cue, although the second retro-cue item's synaptic spike dominates, why is the augmentation for the first retro-cue item higher than the second-cue augmentation until the 20 sec?

      This is because of the slow build-up and slow decay of the augmentation. When the second item is prioritized, and the corresponding neuronal population re-activates, its augmentation level starts to increase. At the same time, as the first item is now de-prioritized and the corresponding neuronal population is now silent, its augmentation level starts to decrease. Because of the ‘slowness’ of both processes (i.e., augmentation build-up and decay), it takes about 5 seconds for the augmentation level of the second item to overcome the augmentation level of the first item.

      We note that the slow time scales of the augmentation dynamics, which are consistent with experimental observations, are necessary for our mechanism to work.
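
      The delayed crossover described in this response can be illustrated with a deliberately simplified sketch. It is not the authors' model: it assumes that the augmentation of the de-prioritized item decays exponentially, that the augmentation of the newly prioritized item relaxes exponentially towards a ceiling, and that all starting levels and time constants (`a_old`, `a_new`, `a_max`, `tau_build`, `tau_decay`) are hypothetical values chosen only for illustration.

```python
import math


def augmentation_crossover(a_old, a_new, a_max=1.0,
                           tau_build=10.0, tau_decay=10.0,
                           dt=0.001, t_max=60.0):
    """Return the first time (in seconds) at which the newly prioritized
    item's augmentation reaches that of the de-prioritized item.

    Toy assumptions (not taken from the manuscript):
      * de-prioritized item: exponential decay from a_old, time constant tau_decay
      * prioritized item: exponential approach from a_new to a_max, time constant tau_build
    Returns None if no crossover occurs before t_max.
    """
    steps = int(t_max / dt)
    for i in range(steps + 1):
        t = i * dt
        old = a_old * math.exp(-t / tau_decay)
        new = a_max - (a_max - a_new) * math.exp(-t / tau_build)
        if new >= old:
            return t
    return None


# Hypothetical starting levels: the first-cued item starts well above the second.
t_cross = augmentation_crossover(a_old=0.8, a_new=0.4)
print(f"crossover after about {t_cross:.2f} s")
```

      In this toy setting the crossover time scales with the time constants, so time constants of several seconds naturally give a lag of several seconds between the switch in priority and the switch in augmentation ordering.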

    1. eLife Assessment

      This important paper takes a novel approach to the problem of automatically reconstructing long-range axonal projections from stacks of images. The key innovation is to separate the identification of sections of an axon from the statistical rules used to constrain global structure. The authors provide compelling evidence that their method is a significant improvement over existing measures in circumstances where the labelling of axons and dendrites is relatively dense.

    2. Reviewer #1 (Public review):

      Summary:

      The authors introduce a novel algorithm for the automatic identification of long-range axonal projections. This is an important problem as modern high-throughput imaging techniques can produce large amounts of raw data, but identifying neuronal morphologies and connectivities requires large amounts of manual work. The algorithm works by first identifying points in three-dimensional space corresponding to parts of labelled neural projections, these are then used to identify short sections of axon using an optimisation algorithm and the prior knowledge that axonal diameters are relatively constant. Finally, a statistical model that assumes axons tend to be smooth is used to connect the sections together into complete and distinct neural trees. The authors demonstrate that their algorithm is far superior to existing techniques, especially when a dense labelling of the tissue means that neighbouring neurites interfere with the reconstruction. Despite this improvement, however, the accuracy of reconstruction remains below 90%, so manual proof-reading is still necessary to produce accurate reconstructions of axons.

      Strengths:

      The new algorithm combines local and global information to make a significant improvement on the state-of-the-art for automatic axonal reconstruction. The method could be applied more broadly and might have applications to reconstructions of electron microscopy data, where similar issues of high-throughput imaging and relatively slow or inaccurate reconstruction remain.

      Weaknesses:

      There are three weaknesses with the algorithm and manuscript.

      (1) The best reconstruction accuracy is below 90%, which does not fully solve the problem of needing manual proof-reading.

      (2) The 'minimum information flow tree' model the authors use to construct connected axonal trees has the potential to bias data collection. In particular, the assumption that axons should always be as smooth as possible is not always correct. This is a good rule-of-thumb for reconstructions, but real axons in many systems can take quite sharp turns and this is also seen in the data presented in the paper (Fig 1C). I would like to see explicit acknowledgement of this bias in the current manuscript and ideally a relaxation of this rule in any later versions of the algorithm.

      (3) The writing of the manuscript is not always as clear as it could be. The manuscript would benefit from careful copy editing for language, and the Methods section in particular should be expanded to more clearly explain what each algorithm is doing. The pseudo code of the Supplemental Information could be brought into the Methods if possible as these algorithms are so fundamental to the manuscript.

      Comments on revisions: I have no further comments or recommendations.

    3. Reviewer #2 (Public review):

      The authors have addressed my comments in this revised version of their manuscript. PointTree is an improved method for the reconstruction of neuronal anatomy that will be useful for neuroscientists.

      In this manuscript, Cai et al. introduce PointTree, a new automated method for the reconstruction of complex neuronal projections. This method has the potential to drastically speed up the process of reconstructing complex neurites. The authors use semi-automated manual reconstruction of neurons and neurites to provide a 'ground-truth' for comparison between PointTree and other automated reconstruction methods. The reconstruction performance is evaluated for precision, recall and F1-score and positions. The performance of PointTree compared to other automated reconstruction methods is impressive based on these 3 criteria.

      As an experimentalist, I will not comment on the computational aspects of the manuscript. Rather, I am interested in how PointTree's performance decreases in noisy samples. This is because many imaging datasets contain some level of background noise for which the human eye appears essential for accurate reconstruction of neurites. Although the samples presented in Figure 5 represent an inherent challenge for any reconstruction method, the signal-to-noise ratio is extremely high (also the case in all raw data images in the paper). It would be interesting to see how PointTree's performance changes in increasingly noisy samples, and for the author to provide general guidance to the scientific community as to what samples might not be accurately reconstructed with PointTree.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors introduce a novel algorithm for the automatic identification of long-range axonal projections. This is an important problem as modern high-throughput imaging techniques can produce large amounts of raw data, but identifying neuronal morphologies and connectivities requires large amounts of manual work. The algorithm works by first identifying points in three-dimensional space corresponding to parts of labelled neural projections, these are then used to identify short sections of axons using an optimisation algorithm and the prior knowledge that axonal diameters are relatively constant. Finally, a statistical model that assumes axons tend to be smooth is used to connect the sections together into complete and distinct neural trees. The authors demonstrate that their algorithm is far superior to existing techniques, especially when dense labelling of the tissue means that neighbouring neurites interfere with the reconstruction. Despite this improvement, however, the accuracy of reconstruction remains below 90%, so manual proofreading is still necessary to produce accurate reconstructions of axons.

      Strengths:

      The new algorithm combines local and global information to make a significant improvement on the state-of-the-art for automatic axonal reconstruction. The method could be applied more broadly and might have applications to reconstructions of electron microscopy data, where similar issues of high-throughput imaging and relatively slow or inaccurate reconstruction remain.

      We thank the reviewer for their positive comments and for taking the time to review our manuscript. We are truly grateful that the reviewer recognized the value of our method in automatically reconstructing long-range axonal projections. While we report that our method achieves reconstruction accuracy of approximately 85%, we fully acknowledge that manual proofreading is still necessary to ensure accuracy greater than 95%. We also appreciate the reviewer’s insightful suggestion regarding the potential adaptation of our algorithm for reconstructing electron microscopy (EM) data, where similar challenges in high-throughput imaging and relatively slow or inaccurate reconstruction persist. We look forward to exploring ways to integrate our method with EM data in future work.

      Weaknesses:

      There are three weaknesses in the algorithm and manuscript.

      (1) The best reconstruction accuracy is below 90%, which does not fully solve the problem of needing manual proofreading.

      We sincerely appreciate the reviewer's valuable insights regarding reconstruction accuracy. Indeed, as illustrated in Figure S4, our current best automated reconstruction accuracy on fMOST data is still below 90%. This indicates that manual proofreading remains essential to ensure reliability.

      For the reconstruction of long-range axonal projections, ensuring the accuracy of the reconstruction process necessitates manual revision of the automatically generated results. Existing literature has demonstrated that a higher accuracy in automatic reconstruction correlates with a reduced need for manual revisions, thereby facilitating an accelerated reconstruction process (Winnubst et al., Cell 2019; Liu et al., Nature Methods 2025).

      As the reviewer rightly points out, achieving an accuracy exceeding 95% currently necessitates manual proofreading. Although our method does not completely eliminate this requirement, it significantly alleviates the proofreading workload by: 1) Minimizing common errors in regions with dense neuron distributions; 2) Providing more reliable initial reconstructions; and 3) Reducing the number of corrections needed during the proofreading process.

      In the future, we will continue to enhance our reconstruction framework. As imaging systems achieve higher signal-to-noise ratios and deep learning techniques facilitate more accurate foreground detection, we anticipate that our method will attain even greater reconstruction accuracy. Furthermore, we plan to develop a software system capable of predicting potential error locations in our automated reconstruction results, thereby streamlining manual revisions. This approach distinguishes itself from existing models by obviating the need for individual traversal of the brain regions associated with each neuron reconstruction.

      (2) The 'minimum information flow tree' model the authors use to construct connected axonal trees has the potential to bias data collection. In particular, the assumption that axons should always be as smooth as possible is not always correct. This is a good rule-of-thumb for reconstructions, but real axons in many systems can take quite sharp turns and this is also seen in the data presented in the paper (Figure 1C). I would like to see explicit acknowledgement of this bias in the current manuscript and ideally a relaxation of this rule in any later versions of the algorithm.

      We appreciate the reviewer's insightful comment regarding the potential bias introduced by our minimum information flow tree model. The reviewer is absolutely correct in noting that while axon smoothness serves as a useful reconstruction heuristic, it should not be treated as an absolute constraint given that real axons can exhibit sharp turns (as shown in Figure 1C). In response to this valuable feedback, we have added an explicit discussion of this limitation to the Discussion section, as follows: “Finally, the minimal information flow tree’s fundamental assumption, that axons should be as smooth as possible, does not always hold true. In fact, real axons can take quite sharp turns, leading the algorithm to erroneously separate a single continuous axon into disjoint neurites.”

      In our reconstruction process, the post-processing approach partially mitigates erroneous reconstructions derived from this rule. Specifically, the minimum information flow tree will decompose such structures into two separate branches (Fig. S7A), but the decomposition node is explicitly recorded. The newly decomposed branches attempt to reconnect by searching for plausible neurites starting from their head nodes (determined by the minimum information flow tree). If no connectable neurites are found, the branch is automatically reconnected to its originally recorded decomposition node (Fig. S7B). In Fig. S7C, two reconstruction examples demonstrate the effectiveness of the post-processing approach.
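
      The fallback logic described in this paragraph can be sketched as follows. This is a hypothetical illustration only: the response does not say how a "plausible" neurite is identified, so the sketch assumes a simple Euclidean distance threshold, and the names `choose_attachment`, `candidates`, and `max_distance` are invented for the example and are not part of PointTree.

```python
import math


def choose_attachment(head, candidates, decomposition_node, max_distance):
    """Pick where a decomposed branch should be re-attached.

    head:               (x, y, z) head node of the decomposed branch
    candidates:         list of (x, y, z) points on other neurites
    decomposition_node: (x, y, z) node recorded when the branch was split off
    max_distance:       assumed plausibility criterion (distance threshold)

    Returns the nearest candidate within max_distance of the head node,
    or the recorded decomposition node if no candidate qualifies.
    """
    best = None
    best_dist = max_distance
    for point in candidates:
        d = math.dist(head, point)
        if d <= best_dist:
            best, best_dist = point, d
    # Fallback: restore the original connection if nothing plausible was found.
    return best if best is not None else decomposition_node


# A nearby candidate is accepted; a distant one triggers the fallback.
print(choose_attachment((0, 0, 0), [(1, 0, 0), (5, 0, 0)], (9, 9, 9), 2.0))
print(choose_attachment((0, 0, 0), [(5, 0, 0)], (9, 9, 9), 2.0))
```

      Recording the decomposition node before searching is what makes the step safe in this sketch: a failed search leaves the branch attached where it was originally, rather than orphaned.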

      As pointed out by the reviewers, the proposed rule for revising neuron reconstruction does not encompass all scenarios. Relaxing the constraints of this rule may lead to numerous new erroneous connections. Currently, the proposed rule is solely based on the positions of neurite centerlines and does not integrate information regarding the intensity of the original images or segmentation data. Incorporating these elements into the rule could potentially reduce reconstruction errors. 

      (3) The writing of the manuscript is not always as clear as it could be. The manuscript would benefit from careful copy editing for language, and the Methods section in particular should be expanded to more clearly explain what each algorithm is doing. The pseudo-code of the Supplemental Information could be brought into the Methods if possible as these algorithms are so fundamental to the manuscript.

      We sincerely thank the reviewer for these valuable suggestions to improve our manuscript’s clarity and methodological presentation. We have implemented the following revisions:

      (1) Language Enhancement: we have conducted rigorous internal linguistic reviews to address grammatical inaccuracies and improve textual clarity.

      (2) Methods Expansion and Pseudo-code Integration: we have incorporated all relevant derivations from the Supplementary Materials into the Methods section, with additional explanatory text to clarify the purpose and implementation of each algorithm. All mathematical formulations have been systematically rederived, with changes to variable nomenclature and subscript/superscript notation and with corrections of the errors identified in the original submission. All pseudocode from the Supplementary Materials has been integrated into the corresponding Methods subsections.

      Reviewer #2 (Public review):

      In this manuscript, Cai et al. introduce PointTree, a new automated method for the reconstruction of complex neuronal projections. This method has the potential to drastically speed up the process of reconstructing complex neurites. The authors use semi-automated manual reconstruction of neurons and neurites to provide a 'ground-truth' for comparison between PointTree and other automated reconstruction methods. The reconstruction performance is evaluated for precision, recall, and F1-score and positions. The performance of PointTree compared to other automated reconstruction methods is impressive based on these 3 criteria.
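
      For readers less familiar with the three criteria named above, the standard definitions can be sketched as below. How reconstructed nodes are matched to the ground truth (and hence how true positives are counted) is not stated in this review, so the counts passed in here are assumed inputs and the numbers in the usage example are made up.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Standard precision, recall, and F1-score from match counts.

    true_positive:  reconstructed elements that match the ground truth
    false_positive: reconstructed elements with no ground-truth match
    false_negative: ground-truth elements missed by the reconstruction
    Ratios with a zero denominator are reported as 0.0.
    """
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(true_positive=90, false_positive=10, false_negative=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

      Because F1 is the harmonic mean of precision and recall, a method cannot score well on it by over-tracing (high recall, low precision) or by tracing only the easiest segments (high precision, low recall).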

      As an experimentalist, I will not comment on the computational aspects of the manuscript. Rather, I am interested in how PointTree's performance decreases in noisy samples. This is because many imaging datasets contain some level of background noise for which the human eye appears essential for the accurate reconstruction of neurites. Although the samples presented in Figure 5 represent an inherent challenge for any reconstruction method, the signal-to-noise ratio is extremely high (also the case in all raw data images in the paper). It would be interesting to see how PointTree's performance changes in increasingly noisy samples, and for the author to provide general guidance to the scientific community as to what samples might not be accurately reconstructed with PointTree.

      We thank the reviewer for her/his time reviewing our manuscript and for the interest in how PointTree performs on noisy samples. It is important to clarify that PointTree is solely responsible for the reconstruction of neurons from the foreground regions of neural images. The foreground regions of these neuronal images are obtained through a deep learning segmentation network. In cases where the image has a low signal-to-noise ratio, if the segmentation network can accurately identify the foreground areas, then PointTree will be able to accurately reconstruct neurons. In fact, existing deep learning networks have demonstrated their capability to effectively extract foreground regions from low signal-to-noise ratio images; therefore, PointTree is well-suited for processing neuronal images characterized by low signal-to-noise ratios.

      In the revised manuscript, we conducted experiments on datasets with varying signal-to-noise ratios (SNR). The results demonstrate that Unet3D is capable of identifying the foreground regions in low-SNR images, thereby supporting the assertion that PointTree has broad applicability across diverse neuronal imaging datasets. 

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      It would be interesting to see how PointTree's performance changes in increasingly noisy samples, and for the author to provide general guidance to the scientific community as to what samples might not be accurately reconstructed with PointTree.

      We extend our heartfelt gratitude to the reviewer for their insightful suggestion concerning experiments involving different noisy samples. Here are the details of the datasets used:

      LSM dataset: Mean SNR = 5.01, with 25 samples, and a volume size of 192×192×192.

      fMOST dataset: Mean SNR = 8.68, with 25 samples, and a volume size of 192×192×192.

      HD-fMOST dataset: Mean SNR = 11.4, with 25 samples, and a volume size of 192×192×192.

      The experimental results reveal that, thanks to the deep learning network's robust feature extraction capabilities, even when working with low-SNR data (as depicted in Figure 4B, first two columns of the top row), satisfactory segmentation results (Figure 4B, first two columns of the third row) were achieved. These results laid a solid foundation for subsequent accurate reconstruction.

      PointTree demonstrated consistent mean F1-scores of 91.0%, 90.0%, and 93.3% across the three datasets, respectively. This underscores its reconstruction robustness under varying SNR conditions when supported by the segmentation network. For more in-depth information, please refer to the manuscript section titled "Reconstruction of data with different signal-to-noise ratios" and Figure 4.
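
      For readers less familiar with the metrics quoted above, the following minimal Python sketch shows how precision, recall, and F1-score relate to one another. The function name and the counts are hypothetical and are not taken from the manuscript; in particular, the rule for deciding when a reconstructed point matches the ground truth is assumed to be handled elsewhere.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) from match counts.

    The counts are illustrative inputs; how a reconstructed point is
    matched to the ground truth is an assumption left to the caller.
    """
    predicted = true_pos + false_pos   # everything the method reported
    actual = true_pos + false_neg      # everything in the ground truth
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not taken from the manuscript.
p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

      Because F1 is a harmonic mean, it is high only when precision and recall are both high, which is why it serves as a single summary of reconstruction quality.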

    1. eLife Assessment

      This important work substantially advances our understanding of the interaction among gut microbiota, lipid metabolism, and the host in type 2 diabetes. The evidence supporting the claims of the authors is solid, although additional experiments for the control FMT are not yet satisfactory. The work will be of interest to medical biologists working on microbiota and diabetes.

    2. Reviewer #1 (Public review):

      Summary:

      The authors tried to identify the relationships between gut microbiota, lipid metabolites and the host in type 2 diabetes (T2DM) by using spontaneously developed T2DM in macaques, considered among the best human models.

      Strengths:

      The authors comprehensively compared the gut microbiota and plasma fatty acids of spontaneous T2DM and control macaques, and tried to verify the macaque results in a high-fat diet-fed mouse model.

      Weaknesses:

      The multi-omics observations made on macaques could also be made on humans, which weakens the conclusion of the manuscript, unless the macaque data could cover the onset of T2DM, which would be difficult to obtain from humans.

      Regarding the metabolomic analysis of fatty acids, the authors did not include results obtained from the macaque fecal samples, which should be important considering that the authors claim the importance of gut microbiota in the pathogenesis of T2DM. Instead, the authors measured palmitic acid in the mouse model and tried to validate their conclusions with that.

      In murine experiments, a palmitic acid-containing diet was fed to mice to induce a diabetic condition, but this does not mimic spontaneous T2DM in macaques, since the authors did not measure (or at least did not show data on) palmitic acid or other fatty acids in macaque feces; instead, they assumed from blood metabolome data that palmitic acid would be absorbed from the intestine to affect the host metabolism, and added palmitic acid to the diet in mouse experiments. This involves a probable leap of logic in support of their conclusions and the title of the study.

      In addition, the authors measured omics data after, but not before, the onset of spontaneous T2DM in macaques. This can reveal microbiota dysbiosis driven purely by disease progression, but does not support the causative effect of gut microbiota on T2DM development that the authors claim.

    1. eLife Assessment

      This study provides valuable evidence indicating that SynGap1 regulates the synaptic drive and membrane excitability of parvalbumin- and somatostatin-positive interneurons in the auditory cortex. Since haplo-insufficiency of SynGap1 has been linked to intellectual disabilities without a well-defined underlying cause, the central question of this study is timely. The experimental data is solid, as in their revisions the authors successfully addressed questions related to changes in thalamocortical presynaptic excitability, the contradiction between spontaneous and mini EPSCs data, and the anatomical analysis of excitatory synapses.

    2. Reviewer #2 (Public review):

      In this manuscript, the authors investigated how partial loss of SynGap1 affects inhibitory neurons derived from the MGE in the auditory cortex, focusing on their synaptic inputs and excitability. While haplo-insufficiency of SynGap1 is known to lead to intellectual disabilities, the underlying mechanisms remain unclear.

      This is the third revision of the manuscript that has improved further, and the main issues were addressed. Specifically, the Authors addressed the contradiction of mEPSC and sEPSC data of the previous version by new experiments and revision of the manuscript text. While alternative explanations are still possible, the new control experiments provide necessary background for reproducibility and the manuscript text puts the observations in the right context. Furthermore, the manuscript now appropriately emphasizes that anatomical analysis was restricted to somatic excitatory synapses. Thus, the readers will be aware of the potential limitations of these measurements.

      Strengths:

      The questions are novel and relevant. Most of the issues in the experimental design are solved or answered.

      Weaknesses:

      Despite the interesting and novel questions, there are potential alternative interpretations of the observations, but these cannot be addressed within the breadth of a single paper.

    3. Author Response:

      The following is the authors’ response to the previous reviews

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigated how partial loss of SynGap1 affects inhibitory neurons derived from the MGE in the auditory cortex, focusing on their synaptic inputs and excitability. While haplo-insufficiency of SynGap1 is known to lead to intellectual disabilities, the underlying mechanisms remain unclear.

      Strengths:

      The questions are novel

      Weaknesses:

      Despite the interesting and novel questions, there are significant issues regarding the experimental design and potential misinterpretations of key findings. Consequently, the manuscript contributes little to our understanding of SynGap1 loss mechanisms.

      Major issues in the second version of the manuscript:

      In the review of the first version there were major issues and contradictions with the sEPSC and mEPSC data, which were not resolved after the revision; the new control experiments rather confirmed the contradiction.

      In the original review I stated: "One major concern is the inconsistency and confusion in the intermediate conclusions drawn from the results. For instance, while the sEPSC data indicates decreased amplitude in PV+ and SOM+ cells in cHet animals, the frequency of events remains unchanged. In contrast, the mEPSC data shows no change in amplitudes in PV+ cells, but a significant decrease in event frequency. The authors conclude that the former observation implies decreased excitability. However, traditionally, such observations on mEPSC parameters are considered indicative of presynaptic mechanisms rather than changes of network activity. The subsequent synapse counting experiments align more closely with the traditional conclusions. This issue can be resolved by rephrasing the text. However, it would remain unexplained why the sEPSC frequency shows no significant difference. If the majority of sEPSC events were indeed mediated by spiking (which is blocked by TTX), the average amplitudes and frequency of mEPSCs should be substantially lower than those of sEPSCs. Yet, they fall within a very similar range, suggesting that most sEPSCs may actually be independent of action potentials. But if that was indeed the case, the changes of purported sEPSC and mEPSC results should have been similar." Contradictions remained after the revision of the manuscript. On one hand, the authors claimed in the revised version that "We found no difference in mEPSC amplitude between the two genotypes (Fig. 1g), indicating that the observed difference in sEPSC amplitude (Figure 1b) could arise from decreased network excitability". On the other hand, later they show "no significative difference in either amplitude or inter-event intervals between sEPSC and mEPSC, suggesting that in acute slices from adult A1, most sEPSCs may actually be AP independent." The latter means that sEPSCs and mEPSCs are the same type of events, which should have the same sensitivity to manipulations.

      We thank the reviewer for the detailed comments. Our results suggest a diverse population of PV+ cells, with varying reliance on action potential-dependent and -independent release. Several PV+ cells indeed show TTX sensitivity (reduced EPSC event amplitudes following TTX application: See new Supplementary Figure 2b-e), but their individual responses are diluted when all cells are pooled together. To account for this variability, we recorded sEPSC followed by mEPSC from more mice of both genotypes (new Figure 1f-j). Further, following the editors and reviewers’ suggestions, we removed speculations about the role of network activity changes.

      In summary, our data confirmed that TTX blocked APs in PV+ cells and that recordings were stable, as indicated by the lack of changes in series resistance during the recording period in our experimental setup (new Suppl. Figure 2f-i). We found no difference in mEPSC amplitude between the two genotypes (Fig. 1g, right), indicating that the observed difference in sEPSC amplitude (Figure 1c, right) could be due to impaired AP-dependent release in cHet mice and the presence of large-amplitude sEPSCs that are preferentially affected by TTX in control mice (new Suppl. Figure 2b-e). Conversely, cHet mice showed a longer inter-mEPSC time interval (cumulative distribution in Figure 1g, left), and significantly lower charge transfer and DQ*f (Figure 1j) compared to control littermates, suggesting a decrease of glutamatergic presynaptic release sites onto PV+ cells.

      Concerns about the quality of the synapse counting experiments were addressed by showing additional images in a different and explaining quantification. However, the admitted restriction of the analysis of excitatory synapses to the somatic region represents a limitation, as these synapses include only a small fraction of the total excitation, even if the slightly larger amplitudes of their EPSPs are considered.

      We agree with the reviewer that restricting the anatomical analysis of excitatory synapses to the PV cell somatic region is a limitation, as we highlighted in the discussion of the revised manuscript. Recent studies, based on serial block-face scanning electron microscopy, suggest that cortical PV+ interneurons receive more robust excitatory inputs to their perisomatic region as compared to pyramidal neurons (see for example, Hwang et al. 2021, Cerebral Cortex, http://doi.org/10.1093/cercor/bhaa378). It is thus possible that putative glutamatergic synapses, analysed by vGlut1/PSD95 colocalisation around PV+ cell somata, may be representative of a substantial excitatory input population. Since analysing putative excitatory synapses onto PV+ dendrites would be difficult and require a much longer time, we re-phrased the text to more clearly highlight the rationale and limitation of this approach.

      New experiments using paired-pulse stimulation provided an answer to issues 3 and 4. Note that the numbering of the Figures in the responses and in the manuscript is not consistent.

      We are glad that the reviewer found that the new paired-pulse experiments answered previously raised concerns. We corrected the discrepancy in figure numbers in the manuscript. Thank you for noticing.

      I agree that the low sampling rate of the APs does not change the observed large differences in AP threshold; however, the phase plots are still inconsistent in the sense that there appears to be an offset, as all values are shifted to more depolarized membrane potentials, including threshold, AP peak, and AHP peak. This consistent shift may be due to non-biological differences between the two sets of recordings, and, importantly, it may negate the interpretation of the I/f curve results (Fig. 5e).

      We agree with the reviewers that a higher sampling rate would allow a more accurate assessment of different parameters, such as AP peak, half-width, rise time, etc., while it would not affect the large differences in AP threshold we observed between control and mutant mice. Since the phase plots do not add to our result analysis, we removed them from the revised manuscript.

      Additional issues:

      The first paragraph of the Results mentioned that the recorded cells were identified by immunolabelling and axonal localization. However, neither the Results nor the Methods mention the criteria and levels of measurements of axonal arborization.

      Recorded MGE-derived interneurons were filled with biocytin, and their identity was confirmed by immunolabeling for neurochemical markers (PV or SST) and analysis of anatomical properties. In particular, whole biocytin-positive immunolabelled neurons were acquired using a Leica SP8-DLS confocal microscope (20x objective, NA 0.75; Z-step 1 μm). For each imaged neuron, which was the result of multiple merged confocal stacks, we visually determined the spatial distribution of the axonal arbor across cortical layers and whether its dendrites carried spines. We added this information to the Methods section. Furthermore, to better represent our methodological approach, we added a new figure (Supplemental Figure 1) including 1) two examples of PV+ interneurons, showing dendrites devoid of spines and axons spreading from Layer II to Layer V (new Suppl. Figure 1a); and 2) two examples of SST+ interneurons, showing dendrites with spines and axons projecting from Layer IV to Layer I, where they gave rise to multiple collaterals (new Suppl. Figure 1b).

      The other issues of the first review were adequately addressed by the Authors and the manuscript improved by these changes.

      We are happy the reviewer found that the other issues were well addressed.

      Reviewer #3 (Public review):

      This paper compares the synaptic and membrane properties of two main subtypes of interneurons (PV+, SST+) in the auditory cortex of control mice vs mutants with Syngap1 haploinsufficiency. The authors find differences between control and mutants in both interneuron populations, although they claim a predominance in PV+ cells. These results suggest that altered PV interneuron functions in the auditory cortex may contribute to the network dysfunctions observed in Syngap1 haploinsufficiency-related intellectual disability.

      The subject of the work is interesting, and most of the approach is rather direct and straightforward, which are strengths. There are also some methodological weaknesses and interpretative issues that reduce the impact of the paper.

      (1) Supplementary Figure 3: recording and data analysis. The data of Supplementary Figure 3 show no differences either in the frequency or amplitude of synaptic events recorded from the same cell in control (sEPSCs) vs TTX (mEPSCs). This suggests that, under the experimental conditions of the paper, sEPSCs are AP-independent quantal events. However, I am concerned by the high variability of the individual results included in the Figure. Indeed, several datapoints show dramatically different frequencies in control vs TTX, which may be explained by unstable recording conditions. It would be important to present these data as time course plots, so that stability can be evaluated. Also, the claim of lack of effect of TTX should be corroborated by positive control experiments verifying that TTX is working (block of action potentials, for example). Lastly, it is not clear whether the application of TTX was consistent in time and duration in all the experiments and the paper does not clarify what time window was used for quantification.

      We understand the reviewer’s concern about high variability. To account for this variability, we recorded sEPSC followed by mEPSC from more mice of both genotypes (see new Figure 1f-j). We confirmed that TTX worked as expected several times through the time course of this study, in different aliquots prepared from the same TTX vial that was used for all experiments. The results of the last test we performed, showing that TTX application blocks action potentials in a PV+ cell, are depicted in new Suppl. Figure 2a. Furthermore, new Suppl. Figure 2f-i shows series resistance (Rs) over time for 4 different PV+ interneurons, indicating recording stability. These results are representative of the entire population of recorded neurons, which we have meticulously analysed one by one. TTX was applied using the same protocol for all recorded neurons. In particular, sEPSCs were first sampled over a 2 min period. A TTX (1μM; Alomone Labs)-containing solution was then perfused into the recording chamber at a flow rate of 2 mL/min. We then waited for 5 min before sampling mEPSCs over a 2 min period. We added this information in the revised manuscript methods.

      (2)  Figure 1 and Supplementary Figure 3: apparent inconsistency. If, as the authors claim, TTX does not affect sEPSCs (either in the control or mutant genotype, Supplementary Figure 3 and point 1 above), then comparing sEPSC and mEPSC in control vs mutants should yield identical results. In contrast, Figure 1 reports a _selective_ reduction of sEPSCs amplitude (not in mEPSCs) in mutants, which is difficult to understand. The proposed explanation relying on different pools of synaptic vesicles mediating sEPSCs and mEPSCs does not clarify things. If this was the case, wouldn't it also imply a decrease of event frequency following TTX addition? However, this is not observed in Supplementary Figure 3. My understanding is that, according to this explanation, recordings in control solution would reflect the impact of two separate pools of vesicles, whereas, in the presence of TTX, only one pool would be available for release. Therefore, TTX should cause a decrease in the frequency of the recorded events, which is not what is observed in Supplementary Figure 3.

      To account for the large variability and clarify these results, we recorded sEPSCs followed by mEPSCs from more mice of both genotypes (new Figure 1f-j). We found no difference in mEPSC amplitude between the two genotypes (Fig. 1g, right), indicating that the observed difference in sEPSC amplitude (Figure 1c, right) could be due to impaired AP-dependent release in cHet mice and the presence of large-amplitude sEPSCs that are preferentially affected by TTX in control mice (new Suppl. Figure 2b-e). Conversely, cHet mice showed a longer inter-mEPSC time interval (cumulative distribution in Figure 1g, left), and significantly lower charge transfer and DQ*f (Figure 1j) compared to control littermates, suggesting a decrease of glutamatergic presynaptic release sites. We rephrased the text in the revised manuscript according to the updated data and, following the reviewer’s suggestions, we removed speculations relying on different pools of synaptic vesicles.

      (3) Figure 1: statistical analysis. Although I do appreciate the efforts of the authors to illustrate both cumulative distributions and plunger plots with individual data, I am confused by how the cumulative distributions of Figure 1b (sEPSC amplitude) may support statistically significant differences between genotypes, while this is not the case for the cumulative distributions of Figure 1g (inter mEPSC interval), where the curves appear even more separated. A difference in mEPSC frequency would also be consistent with the data of Supplementary Fig 2b, which are otherwise difficult to reconcile. I would encourage the authors to use the Kolmogorov-Smirnov test rather than a t-test for the comparison of cumulative distributions.
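
      To make the suggested comparison concrete, the following is a minimal, dependency-free Python sketch of the two-sample Kolmogorov-Smirnov statistic D, the largest vertical gap between two empirical cumulative distributions. The function name and the sample values are hypothetical. The sketch returns only the statistic, not a p-value, and it treats all events as independent observations, which is an assumption that may not hold for events pooled across cells and animals.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest vertical gap between the two empirical
    cumulative distribution functions.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap can only change at observed values, so check each one.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical inter-event intervals (arbitrary units), for illustration only.
control = [1, 2, 3, 4]
mutant = [3, 4, 5, 6]
print(ks_statistic(control, mutant))
```

      D ranges from 0 (identical empirical distributions) to 1 (no overlap), so visibly separated cumulative curves correspond to a larger D.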

      We thank the reviewer for this thoughtful suggestion. We recorded more mice of both genotypes and the updated data now show a significant difference between the cumulative distributions of the inter mEPSC intervals recorded from the two genotypes (new Figure 1g). For statistical analysis, we based our conclusion on the statistical results generated by LMM, modelling animal as a random effect and genotype as a fixed effect. We used this statistical analysis because we considered the number of mice as independent replicates and the number of cells in each mouse as repeated measures (Berryer et al. 2016; Heggland et al., 2019; Yu et al., 2022). For cumulative distributions, the same number of events was chosen randomly from each cell and analysed by LMM, modelling animal as a random effect and genotype as a fixed effect. We decided to use LMM for our statistical analyses because of the growing concern over reproducibility in biomedical research and the ongoing discussion on how data are analysed (see for example, Yu et al. (2022), Neuron 110:21-35, https://doi.org/10.1016/j.neuron.2021.10.030; Aarts et al. (2014), Nat Neurosci 17, 491–496, https://doi.org/10.1038/nn.3648). We acknowledge that patch-clamp data have historically been analysed using t-tests and analysis of variance (ANOVA), or equivalent nonparametric tests. However, these tests assume that individual observations (recorded neurons in this case) are independent of each other. Whether neurons from the same mouse are independent or correlated variables is an unresolved question, but independence does not appear likely from a biological point of view. Statisticians have developed effective methods to analyze correlated data, including LMM.
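
      In generic random-intercept notation, a linear mixed model with animal as a random effect and genotype as a fixed effect can be written as follows. The symbols are introduced here for illustration only and are an assumption; the exact specification used in the manuscript may differ.

```latex
y_{ij} = \beta_0 + \beta_1 \, g_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

      Here $y_{ij}$ is the measurement from cell $j$ of animal $i$, $g_i$ is a genotype indicator (0 for control, 1 for cHet), $\beta_1$ is the fixed effect of genotype, $u_i$ is the random intercept shared by all cells from animal $i$, and $\varepsilon_{ij}$ is the residual. The shared term $u_i$ is what allows cells from the same mouse to be correlated rather than treated as independent observations.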

      (4) Methods. I still maintain that a threshold at around -20/-15 mV for the first action potential of a train seems too depolarized (see some datapoints of Fig 5c and Fig 7c) for a healthy spike. This suggests that some cells were either in precarious condition or that the capacitance of the electrode was not compensated properly.

      As suggested by the reviewer, in the revised figures we excluded the neurons with threshold at -20/-15 mV. In addition, we performed statistical analysis with and without these cells (data reported below) and found that whether these cells are included or excluded, the statistical significance of the results does not change.

      Fig.5c: including the 2 outliers from the cHet group (values of -16.5 and -20.6 mV): -42.6±1.01 mV in control, n=33 cells from 15 mice vs -35.3±1.2 mV in cHet, n=40 cells from 17 mice, ***p<0.001, LMM; excluding the 2 outliers from the cHet group: -42.6±1.01 mV in control, n=33 cells from 15 mice vs -36.2±1.1 mV in cHet, n=38 cells from 17 mice, ***p<0.001, LMM.

      Fig.7c: including the 2 outliers from the cHet group (values of -16.5 and -20.6 mV): -43.4±1.6 mV in control, n=12 cells from 9 mice vs -33.9±1.8 mV in cHet, n=24 cells from 13 mice, **p=0.002, LMM; excluding the 2 outliers from the cHet group: -43.4±1.6 mV in control, n=12 cells from 9 mice vs -35.4±1.7 mV in cHet, n=22 cells from 13 mice, *p=0.037, LMM.

      (5) The authors claim that "cHet SST+ cells showed no significant changes in active and passive membrane properties (Figure 8d,e); however, their evoked firing properties were affected with fewer AP generated in response to the same depolarizing current injection".

      This sentence is intrinsically contradictory. Action potentials triggered by current injections are dependent on the integration of passive and active properties. If the curves of Figure 8f are different between genotypes, then some passive and/or active property MUST have changed. It is an unescapable conclusion. The general _blanket_ statement of the authors that there are no significant changes in active and passive properties is in direct contradiction with the current/#AP plot.

      We agree with the reviewer and have rephrased the abstract, results and discussion accordingly to better represent the data. As discussed in the previous revision, it is possible that other intrinsic factors, not assessed in this study, may have contributed to the effect shown in the current/#AP plot.

      (6) The phase plots of Figs 5c, 7c, and 7h suggest that the frequency of acquisition/filtering of current-clamp signals was not appropriate for fast waveforms such as spikes. The first two papers indicated by the authors in their rebuttal (Golomb et al., 2007; Stevens et al., 2021) did not perform a phase plot analysis (like those included in the manuscript). The last work quoted in the rebuttal (Zhang et al., 2023) did perform phase plot analysis, but data were digitized at a frequency of 20 kHz (not 10 kHz as incorrectly indicated by the authors) and filtered at 10 kHz (not 2-3 kHz as by the authors in the manuscript). To me, this remains a concern.

      We agree with the reviewer that a higher sampling rate would allow a more accurate assessment of different AP parameters, such as AP peak, half-width, rise time, etc. The papers were cited in the context of determining AP threshold, not performing phase plot analysis. We apologize for the confusion and error. Finally, we removed the phase plots since they did not add relevant information.
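
      To illustrate why the sampling rate matters here, the following minimal Python sketch builds a phase plot (dV/dt against V) from a sampled voltage trace and reads a spike threshold from it. Everything in it is hypothetical: the function names, the toy trace, and the 20 mV/ms criterion are assumptions chosen for illustration and do not describe the analysis used in the manuscript. At 10 kHz, consecutive samples are 0.1 ms apart, so the rising phase of a fast spike is described by only a handful of points.

```python
def phase_plot(voltage_mv, sampling_rate_hz):
    """Return (V, dV/dt) pairs, with dV/dt in mV/ms (equivalently V/s)."""
    dt_ms = 1000.0 / sampling_rate_hz  # time between samples, in ms
    points = []
    for v0, v1 in zip(voltage_mv, voltage_mv[1:]):
        # Forward difference between consecutive samples.
        points.append((v0, (v1 - v0) / dt_ms))
    return points


def ap_threshold(voltage_mv, sampling_rate_hz, dvdt_criterion=20.0):
    """First voltage at which dV/dt reaches the criterion, or None.

    The 20 mV/ms default is an assumed, illustrative criterion.
    """
    for v, dvdt in phase_plot(voltage_mv, sampling_rate_hz):
        if dvdt >= dvdt_criterion:
            return v
    return None


# Toy trace (mV) sampled at 10 kHz; values are made up for illustration.
trace = [-45.0, -44.5, -43.5, -40.0, -20.0, 20.0]
print(ap_threshold(trace, sampling_rate_hz=10_000))
```

      With so few samples on the upstroke, the threshold estimate can only fall on one of the sampled voltages, which is one way a low acquisition rate limits the precision of AP parameters.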

      (7)  The general logical flow of the manuscript could be improved. For example, Fig 4 seems to indicate no morphological differences in the dendritic trees of control vs mutant PV cells, but this conclusion is then rejected by Fig 6. Maybe Fig 4 is not necessary. Regarding Fig 6, did the authors check the integrity of the entire dendritic structure of the cells analyzed (i.e. no dendrites were cut in the slice)? This is critical as the dendritic geometry may affect the firing properties of neurons (Mainen and Sejnowski, Nature, 1996).

      As suggested by the reviewer, we removed Fig.4. All the reconstructions used for dendritic analysis contained intact cells with no evidently cut dendrites.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the editors at eLife and the one reviewer for engaging with our revised manuscript. As we noted in our previous response to reviewers, which we wrote in October 2024 when we submitted our initial revision, the majority of the critique we received was targeted not so much at the argument of this manuscript but at the debate regarding the evidence in the two other manuscripts that this one accompanied: “Evidence for deliberate burial of the dead by Homo naledi” and “241,000 to 335,000 Years Old Rock Engravings Made by Homo naledi in the Rising Star Cave system, South Africa.” Because of that critique we revised this manuscript to emphasize that the key element in constructing our argument is that H. naledi engaged in mortuary behavior (the movement of dead H. naledi by living H. naledi into the Rising Star cave system) and to place that in the context of a) the increasingly complex later Pleistocene record of meaning making activity and b) the assumed correlations between brain size and cognitive capacities in Pliocene and Pleistocene hominins. This framing, as noted in the eLife editorial comment, is the main thrust of our manuscript. There is a growing convergence of evidence that the totality of the currently available data and analyses for H. naledi in the Rising Star cave system supports mortuary behavior: that is, the agential and intentional action by H. naledi individuals in the transport of bodies to the Lesedi Chamber and Dinaledi Subsystem (see Berger et al. 2025 plus the 2nd round reviews and the eLife editorial comment associated with it, and also Van Rooyen et al. 2025). We acknowledge the serious debates around the assertion of funerary behavior (cultural burial) and seek to illustrate that while we believe the data support the funerary behavior hypothesis, it is not a necessary requirement for our main argument.

      A few specific responses to the reviewer in this revised manuscript:

      Reviewer states: “Claims for a positive correlation between absolute and/or relative brain size and cognitive ability are not common in discussions surrounding the evolution of Middle- and Late Pleistocene hominin behavior.” We are not making the argument that absolute brain size in the later Pleistocene is a point of focus, rather that there are many arguments and assertions about EQ and cognitive capacity that are central in the proposals for the evolution of hominins in general and genus Homo in particular across the Plio-Pleistocene period. We offer a brief review of this in the text and suggest, as noted by this reviewer, that “exploration of the specific/potential socio-cultural, neuro-structural, ecological and other factors will be more informative than the emphasis on absolute/relative brain size”. This (in their words) is exactly our main point. However, we contend that such a framing should not be exclusive to later Pleistocene contexts, but rather that the examination of earlier hominins might also be better served by moving away from the traditional assumptions of cognitive complexity associated with absolute/relative brain size.

      The reviewer states: “The authors use, in a number of instances throughout the paper, secondary sources of information such as review papers (e.g., McBrearty & Brooks 2000; Scerri & Will 2023; Galway-Witham et al. 2019) instead of the original works that are the basis for making the desired case.” We do indeed use review papers in the main body of the text for clarity, brevity, and to acknowledge robust previous review work in these areas; however, in the supplemental text and with the figures and table we offer substantive bibliographies of the original citations and studies. We encourage readers to please spend time with those materials as well.

      Finally, the reviewer states: “Given the inadequate analyses in the accompanying papers, and the lack of evidence for stone tools in the naledi sites, the present claims for the expression of culturally and symbolically mediated behaviors by this small-brained hominin must be adequately established.” We are quite specific in this manuscript, and in other publications, that we are not arguing for “symbolically mediated” behavior, but do stand by our non-controversial suggestions of meaning-making, and cultural behavior, as relevant in Pleistocene hominins (e.g. Kissel and Fuentes 2017, 2018). We do not argue that stone tools are necessary as mandatory indicators of such possibilities and lay out the H. naledi information in the context of the broader and increasing datasets and analyses for meaning-making behavior in Pleistocene hominins (see Figure 1 and table 1, and in the text).

      Our point with this manuscript, which we reiterate here, is that “The increasing data for complex behavior and meaning-making across the Pleistocene should play a major element in structuring how we investigate, explain, and model the origins and patterns of hominin and human evolution” and we feel that the current evidence for H. naledi behavior contributes to the broader suites of data, hypotheses, analyses, and theory building in this endeavor.


      The following is the authors’ response to the original reviews.

      Before laying out how we addressed the specific comments on this manuscript we want to clarify the goal and intent of this paper to maximize effective critical reading of its contents. We appreciate and look forward to continued critique and enhanced discussion of this topic and argument.

      Our starting point for constructing the argument in this manuscript is that H. naledi engaged in mortuary behavior. This emerges from the totality of the currently available data and analyses for Homo naledi in the Rising Star cave system, which support agential and intentional action by Homo naledi individuals in the transport of bodies to the Lesedi Chamber and Dinaledi Subsystem. We do feel that the data support the cultural burial hypothesis as well as the likelihood that at least some of the markings reported as engravings are non-naturally occurring (see Martinón-Torres et al. 2024) and made by Homo naledi. But these two elements are not necessary for the validity of the argument we pursue in this manuscript.

      Our second key point is that gross brain size does not necessarily correlate with particular patterns of complex behavior in Pleistocene hominins. On this there is wide agreement, yet both scholarly and public arguments for the success of the genus Homo and the success of Homo sapiens have incorporated an assumption of a Rubicon of cerebral size. From this we propose a third point: that smaller brained Pleistocene hominins, including Homo naledi, are part of a Pleistocene hominin niche that includes patterns of complex social and cognitive behavior. Such behavior was historically considered to be exclusive to Homo sapiens but is now documented to occur earlier, across a range of hominin taxa in the latter half of the Pleistocene. We offer the case of H. naledi behavior in the Rising Star system as an example of this. This case contributes to the development of a broader approach to the cognitive, physiological, and behavioral framings of, and explanations for, Pleistocene hominin behavior.

      Responses to specific critiques in the eLife reviews centered on this manuscript:

      Reviewer #1:

      All inferences regarding hominin behaviour and biology of Homo naledi, discussed by Fuentes and colleagues, are wholly dependent on the evidence presented in the archaeology preprints being true.

      Reviewer #2:

      Fuentes et al. provide a detailed and thoughtful commentary on the evolutionary and behavioral implications of complex behaviors associated with a small-brained hominin, Homo naledi…..While the review by Fuentes et al. highlights important assumptions about the relationship between hominin brain size, cognition, and complex behaviors, the evidence presented by Berger et al. 2023a,b does not support the claim that Homo naledi engaged in burial practices or symbolic expression through wall engravings.

      Reviewer #3:

      This paper presents the cognitive implications of claims made in two accompanying papers (Berger et al. 2023a, 2023b) about the creation of rock engravings, the intentional disposal of the dead, and fire use by Homo naledi. The importance of the paper, therefore, relies on the validity of the claims for the presence of socio-culturally complex and cognitively demanding behaviors that are presented in the associated papers. Given the archaeological, hominin, and taphonomic analyses in the associated papers are not adequate to enable the exceptional claims for naledi-associated complex behaviors, the inferences made in this paper are currently inadequate and incomplete.

      We have clarified in the manuscript text and above why we argue that the inferences we are setting as core to our argument do not require cultural burial or engravings by H. naledi be demonstrated. However, we do clarify in the revision that the current evidence for the transport of dead conspecifics into difficult to reach areas deep into the cave system by naledi is well supported by the archeological and paleoanthropological data currently available (e.g. Berger et al. 2024, Elliott et al. 2021, Robbins et al. 2021, Hawks et al. 2017) and that this is the basis for our argument.

      Reviewer #3:

      The claimed behaviors are widely recognized as complex and even quintessential to Homo sapiens. The implications of their unequivocal association with such a small-brained Middle Pleistocene hominin are thus far reaching. Accordingly, the main thrust of the paper is to highlight that greater cognition and complex socio-cultural behaviors were not necessarily associated with a positively encephalized brain. This argument begs the obvious question of whether absolute brain size and/or encephalization quotient (i.e., the actual brain volume of a given species relative the expected brain size for a species of the same average body size) can measure cognitive capacity and the complexity of socio-cultural behaviors among late Middle Pleistocene hominins….Claims for a positive correlation between absolute and/or relative brain size and cognitive ability are not common in discussions surrounding the evolution of Middle- and Late Pleistocene hominin behavior.

      We assert that claims for a positive correlation between absolute and/or relative brain size and cognitive ability are central—either explicitly or implicitly—in most arguments concerning cognitively complex behavior in the genus Homo. This is especially true for ideas about success of Pleistocene Homo relative to other hominins. We clarify this in the text offering various citations in support of this position (e.g. Meneganzin and Currie 2022, Galway-Witham, Cole, and Stringer 2019, DeCasien, Barton, and Higham 2022, Dunbar 2003, Kissel and Fuentes 2021, Muthukrishna et al. 2018, Püschel et al. 2021, Tattersall 2023).

      Reviewer #3:

      Currently, the bulk of the evidence for early complex technological and social behaviors derives from multiple sites across South Africa and postdates the emergence of H. sapiens by more than 100,000 years. Such lag in the expression of complex technologies and behaviors within our species renders the brain size-implies-cognitive capacity argument moot. Instead, a rich body of research over the past several decades has focused on aspects related to sociocultural, environmental, and even the wiring of the brain in order to understand factors underlying the expression of the capacity for greater behavioral variability. In this regard, even if the claimed evidence for complex behaviors among the small-brained naledi populations proves valid, the exploration of the specific/potential socio-cultural, neuro-structural, ecological and other factors will be more informative than the emphasis on absolute/relative brain size.

      While not at all denying the critically important and rich record of cultural complexity in the Late Pleistocene South African archeological record, we disagree that “the bulk of the evidence for early complex technological and social behaviors derives from multiple sites across South Africa and postdates the emergence of H. sapiens by more than 100,000 years”. We offer a range of examples and citations in support of our assertion in the text (esp. in pp. 12-14 and Table 1 and Figure 1).

      We lay out the currently available data for such cultural complexity in Figure 1 with extensive documentation and citations for each case in the Supplementary material (both as a table and a bibliography). We wholly agree with Reviewer 3 that “the exploration of the specific/potential socio-cultural, neuro-structural, ecological and other factors will be more informative than the emphasis on absolute/relative brain size” and are attempting to do just that in the manuscript.

      Reviewer #3:

      The paper presents as supporting evidence previous claims for the appearance of similar complex behaviors predating the emergence of our species, H. sapiens, although it does acknowledge their controversial nature. It then uses the current claims for the association of such behaviors with H. naledi as decisive. Given the inadequate analyses in the accompanying papers and the lack of evidence for stone tools in the naledi sites, the present claims for the expression of culturally and symbolically mediated behaviors by this small-brained hominin must be adequately established.

      We respond to the first part of this critique above (regarding the other papers). But again, we emphasize that although we do feel that the argument for cultural burial is supported (see Berger et al. 2024 preprint) what we are arguing for in this paper is that the agential and intentional transportation of dead (mortuary behavior) is the sufficient factor undergirding our proposal. We do not agree that absence of recognizable stone tools at the site negates our proposal and assert that the context provided by Figure 1, and the data in the table for figure 1 in the SOM, in concert with the supported mortuary behavior (transport and emplacement of the dead) offer sufficient support for the argument we make in the text regarding brain size and the role of emotional cognition and complex behavior in the Pleistocene hominin niche and H. naledi’s participation in it.

    1. eLife Assessment

      This paper describes the structure and connectivity of brain neurons that send descending connections to motor neurons and muscle in the fruit fly nerve cord, using a synapse-resolution connectome. This important work provides a wealth of hypotheses and predictions for future experimentation and modelling. Using state-of-the-art methods, the authors provide solid evidence for their conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      Cheong et al. use a synapse-resolution wiring map of the fruit fly nerve cord to comprehensively investigate circuitry between descending neurons (DNs) from the brain and motor neurons (MNs) that enact different behaviours. These neurons were painstakingly identified, categorised, and linked to existing genetic driver lines; this allows the investigation of circuitry to be informed by the extensive literature on how flies walk, fly, and escape from looming stimuli. New motifs and hypotheses of circuit function were presented. This work will be a lasting resource for those studying nerve cord function.

      Strengths:

      The authors present an impressive amount of work in reconstructing and categorising the neurons in the DN to MN pathways. There is always a strong link between the circuitry identified and what is known in the literature, making this an excellent resource for those interested in connectomics analysis or experimental circuits neuroscience. Because of this, there are many testable hypotheses presented with clear predictions, which I expect will result in many follow-up publications. Most MNs were mapped to the individual muscles that they innervate by linking this connectome to pre-existing light microscopy datasets. When combined with past fly brain connectome datasets (Hemibrain, FAFB) or future ones, there is now a tantalising possibility of following neural pathways from sensory inputs to motor neurons and muscle.

      Weaknesses:

      As with all connectome datasets, the sample size is low, limiting statistical analyses. Readers should keep this in mind, but note that this is the current state-of-the-art. Some figures are weakened by relying too much on depictions of wiring diagrams without additional quantification of connectivity. Readers may find the length of this work challenging, particularly the initial anatomical descriptions of the dataset, which span many figures and may not be of interest to those outside of the subfield.

    3. Reviewer #2 (Public review):

      Summary:

      In Cheong et al., the authors analyze a new motor system (ventral nerve cord) connectome of Drosophila. Through proofreading and cross-referencing with another female VNC connectome, they define key features of VNC circuits with a focus on descending neurons (DNs), motor neurons (MNs), and local interneuron circuits. They define DN tracts, MNs for limb and wing control and their nerves (although their sample suffers for a subset of MNs). They establish connectivity between DNs and MNs (minimal). They perform topological analysis of all VNC neurons including interneurons. They focus specifically on identifying core features of flight circuits (control of wings and halteres), leg control circuits with a focus on walking rather than other limbed behaviors (grooming, reaching, etc.), and intermediate circuits like those for escape (GF). They put these features in the context of what is known or has been posited about these various circuits.

      Strengths

      Some strengths of the manuscript include the matching of new DN and MN types to light microscopy, including serial homology of leg motor neurons. This is a valuable contribution that will certainly open up future lines of experimental work. As well, the analysis of conserved connectivity patterns within each leg neuromere and interconnecting connectivity patterns between neuromeres will be incredibly valuable. The standard leg connectome is very nice. Finally, the finding of different connectivity statistics (degrees of feedback) in different neuropils is quite interesting and will stimulate future work aimed at determining its functional significance.

      Weaknesses

      The degradation of many motor neurons is unfortunate. Figure 5 supplement 1 shows that roughly 50% of the leg motor neurons have significantly compromised connectivity data, whereas for non-leg motor neurons, few seem to be compromised. As well, the infomap communities don't seem to be so well controlled/justified. Community detection can be run on any graph - why should I believe that the VNC graph is actually composed of discrete communities? Perhaps this comes from a lack of familiarity with the infomap algorithm, but I imagine most readers will be similarly unfamiliar with it, so more work should be done to demonstrate the degree to which these communities are really communities that connect more within than across communities.
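      One way to make the requested check concrete is to compare the fraction of synaptic weight that stays inside a community with the fraction expected if connections were rewired at random. The following is a minimal sketch under stated assumptions, not the authors' analysis: the function names, the toy edge list, and the community labels are all hypothetical, and the null model is the standard directed-modularity expectation rather than anything specific to infomap.

```python
from collections import defaultdict


def within_community_fraction(edges, membership):
    """Fraction of total edge weight whose source and target share a community.

    edges: iterable of (source, target, weight) tuples, treated as directed.
    membership: dict mapping each node to its community label.
    """
    total = 0.0
    within = 0.0
    for src, dst, weight in edges:
        total += weight
        if membership[src] == membership[dst]:
            within += weight
    return within / total if total else 0.0


def expected_within_fraction(edges, membership):
    """Within-community fraction expected under random rewiring that keeps
    each community's total outgoing and incoming weight fixed."""
    out_weight = defaultdict(float)
    in_weight = defaultdict(float)
    total = 0.0
    for src, dst, weight in edges:
        out_weight[membership[src]] += weight
        in_weight[membership[dst]] += weight
        total += weight
    if not total:
        return 0.0
    return sum(out_weight[c] * in_weight[c] for c in out_weight) / (total * total)


# Hypothetical toy graph: two tightly coupled pairs joined by one weak edge.
edges = [("a", "b", 10), ("b", "a", 8), ("c", "d", 12), ("d", "c", 9), ("b", "c", 1)]
membership = {"a": 0, "b": 0, "c": 1, "d": 1}

observed = within_community_fraction(edges, membership)
expected = expected_within_fraction(edges, membership)
modularity = observed - expected  # directed modularity of this partition
```

      An observed fraction well above the expected one (equivalently, a clearly positive modularity) would indicate that the partition reflects real structure; values near the expectation would suggest the communities are largely an artifact of running the algorithm.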

    1. eLife Assessment

      In this useful study, ectopic expression and knockdown strategies were used to assess the effects of increasing and decreasing Cyclic di-AMP on the developmental cycle in Chlamydia. The authors convincingly demonstrate that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of the transitionary gene hctA and late gene omcB. Whilst the authors have attempted to revise the submission, the model currently proposed is not fully supported by the data presented.

    2. Reviewer #2 (Public review):

      This manuscript describes the role of the production of c-di-AMP on the chlamydial developmental cycle. The main findings remain the same. The authors show that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of transitionary and late genes. The authors also knocked down the expression of the dacA-ybbR operon and reported a modest reduction in the expression of both hctA and omcB. The authors conclude with a model suggesting the amount of c-di-AMP determines the fate of the RB, continued replication, or EB conversion.

      Overall, this is a very intriguing study with important implications; however, the data is very preliminary, and the model is very rudimentary. The data support the observation that dramatically increased c-di-AMP has an impact on transitionary gene expression and late gene expression, suggesting dysregulation of the developmental cycle. This effect goes away with modest changes in c-di-AMP (detaTM-DacA vs detaTM-DacA (D164N)). However, the model's prediction that low levels of c-di-AMP delay EB production is not well supported by the data. If this prediction were true, then the growth rate would increase with c-di-AMP reduction, and the data does not show this. The levels of c-di-AMP at the lower levels need to be better validated, as it seems like only very high levels make a difference for dysregulated late gene expression. However, on the low end it's not clear what levels are needed to have an effect, as only DacAopMut and DacAopKD show any effects on the cycle and the c-di-AMP levels are only different at 24 hours.

      The authors responded to reviewers' critiques by adding the overexpression of DacA without the transmembrane region. This addition does not really help their case. They show that detaTM-DacA and detaTM-DacA (D164N) had the same effects on c-di-AMP levels but the figure shows no effects on the developmental cycle.

      Describing the significance of the findings:

      The findings are important and point to very exciting new avenues to explore the important questions in chlamydial cell form development. The authors present a model that is not quantified and does not match the data well.

      Describing the strength of evidence:

      The evidence presented is incomplete. The authors do a nice job of showing that overexpression of the dacA-ybbR operon increases c-di-AMP and that knockdown or overexpression of the catalytically dead DacA protein decreases the c-di-AMP levels. However, the effects on the developmental cycle and how they fit the proposed model are less well supported.

      Overall this is a very intriguing finding that will require more gene expression data, phenotypic characterization of cell forms, and better quantitative models to fully interpret these findings.

    3. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The paper by Lee and Ouellette explores the role of cyclic di-AMP in chlamydial developmental progression. The manuscript uses a collection of different recombinant plasmids to up- and down-regulate c-di-AMP production, and then uses classical molecular and microbiological approaches to examine the effects of expression induction in each of the transformed strains.

      Strengths: 

      This laboratory is a leader in the use of molecular genetic manipulation in Chlamydia trachomatis, and their efforts to make such approaches mainstream are commendable. Overall, the model described and defended by these investigators is thorough and significant.

      Thank you for these comments.

      Weaknesses: 

      The biggest weakness in the document is their reliance on quantitative data that is statistically not significant, in the interpretation of results. These challenges can be addressed in a revision by the authors. 

      Thank you for these comments. We point out that, while certain RT-qPCR data may not be statistically significant, our RNAseq data indicate late genes are, as a group, statistically significantly increased when increasing c-di-AMP levels and decreased when decreasing c-di-AMP levels. We do not believe running additional experiments to “achieve” statistical significance in the RT-qPCR data is worthwhile. We hope the reviewer agrees with this assessment.

      We have also included new data in this revised manuscript, which we believe further strengthens aspects of the conclusions linked to individual expression of full-length DacA isoforms. We have also quantified inclusion areas and bacterial sizes for critical strains.

      Reviewer #2 (Public review): 

      Summary: 

      This manuscript describes the role of the production of c-di-AMP on the chlamydial developmental cycle. Chlamydia are obligate intracellular bacterial pathogens that rely on eukaryotic host cells for growth. The chlamydial life cycle depends on a cell form developmental cycle that produces phenotypically distinct cell forms with specific roles during the infectious cycle. The RB cell form replicates, amplifying chlamydia numbers, while the EB cell form mediates entry into new host cells, disseminating the infection to new hosts. Regulation of cell form development is a critical question in chlamydia biology and pathogenesis. Chlamydia must balance amplification (RB numbers) and dissemination (EB numbers) to maximize survival in its infection niche. The main findings in this manuscript show that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of the transitionary gene hctA and late gene omcB. The authors also knocked down the expression of the dacA-ybbR operon and reported a reduction in the expression of both hctA and omcB. The authors conclude with a model suggesting the amount of c-di-AMP determines the fate of the RB, continued replication, or EB conversion. Overall, this is a very intriguing study with important implications; however, the data is very preliminary and the model is very rudimentary and is not well supported by the data.

      Thank you for your comments. Chlamydia is not an easy experimental system, but we have done our best to address the reviewer’s concerns in this revised submission.

      Describing the significance of the findings: 

      The findings are important and point to very exciting new avenues to explore the important questions in chlamydial cell form development. The authors present a model that is not quantified and does not match the data well. 

      Describing the strength of evidence: 

      The evidence presented is incomplete. The authors do a nice job of showing that overexpression of the dacA-ybbR operon increases c-di-AMP and that knockdown or overexpression of the catalytically dead DacA protein decreases the c-di-AMP levels. However, the effects on the developmental cycle and how they fit the proposed model are less well supported. 

      dacA-ybbR ectopic expression: 

      For the dacA-ybbR ectopic expression experiments they show that hctA is induced early but there is no significant change in OmcB gene expression. This is problematic as when RBs are treated with Pen (this paper) and (DOI 10.1128/MSYSTEMS.00689-20) hctA is expressed in the aberrant cell forms but these forms do not go on to express the late genes suggesting stress events can result in changes in the developmental expression kinetic profile. The RNA-seq data are a little reassuring as many of the EB/Late genes were shown to be upregulated by dacA-ybbR ectopic expression in this assay.

      As the reviewer notes, we also generated RNAseq data, which validates that late gene transcripts (including sigma28 and sigma54 regulated genes) are statistically significantly increased earlier in the developmental cycle in parallel to increased c-di-AMP levels. The lack of statistical significance in the RT-qPCR data for omcB, which shows a trend of higher transcripts, is less concerning given the statistically significant RNAseq dataset. We have reported the data from three replicates for the RT-qPCR and do not think it would be worthwhile to attempt more replicates in an attempt to “achieve” statistical significance.

      We recognize that hctA may also increase during stress as noted by the Grieshaber Lab. In re-evaluating these data, we decided to remove the Penicillin-linked studies from the manuscript since they detract from the focus of the story we are trying to tell given the potential caveat the reviewer mentions.

      The authors also demonstrate that this ectopic expression reduces the overall growth rate but produces EBs earlier in the cycle but overall fewer EBs late in the cycle. This observation matches their model well as when RBs convert early there is less amplification of cell numbers. 

      dacA knockdown and dacA(mut) 

      The authors showed that dacA knockdown and ectopic expression of the dacA mutant both reduced the amount of c-di-AMP. The authors show that for both of these conditions, hctA and omcB expression is reduced at 24 hpi. This was also partially supported by the RNA-seq data for the dacA knockdown as many of the late genes were downregulated. However, a shift to an increase in RB-only genes was not readily evident. This is maybe not surprising as the chlamydial inclusion would just have an increase in RB forms and changes in cell form ratios would need more time points.

      Thank you for this comment. We agree that it is not surprising given the shift in cell forms. The reduction in hctA transcripts argues against a stress state as noted above by the reviewer, and the RNAseq data from dacA-KD conditions indicates at least that secondary differentiation has been delayed. We agree that more time points would help address the reviewer’s point, but the time and cost to perform such studies is prohibitive with an obligate intracellular bacterium.

      Interestingly, the overall growth rate appears to differ in these two conditions, growth is unaffected by dacA knockdown but is significantly affected by the expression of the mutant. In both cases, EB production is repressed. The overall model they present does not support this data well as if RBs were blocked from converting into EBs then the growth rate should increase as the RB cell form replicates while the EB cell form does not. This should shift the population to replicating cells. 

      We agree that it seems that perturbing c-di-AMP production by knockdown or overexpressing the mutant DacA(D164N) has different impacts on chlamydial growth. We have generated new data, which we believe addresses this. Overexpressing membrane-localized DacA isoforms is clearly detrimental to chlamydiae as noted in the manuscript. However, when we removed the transmembrane domain and expressed N-terminal truncations of these isoforms, we observed no effects of overexpression on chlamydial morphology or growth. Importantly, for the wild-type full-length or truncated isoforms, overexpressing each resulted in the same level of c-di-AMP production, further supporting that the negative effect of overexpressing the wild-type full-length is linked to its membrane localization and not c-di-AMP levels. These data have been included as new Figure 3. These data indicate that too much DacA in the membrane is disruptive and suggest that the balance of DacA to YbbR is important since overexpression of both did not result in the same phenotype. This is further described in the Discussion.

      As it relates to knockdown of dacA-ybbR, we have essentially removed/reduced the amount of these proteins from the membrane and have blocked the production of c-di-AMP. This is fundamentally different from overexpression.

      Overall this is a very intriguing finding that will require more gene expression data, phenotypic characterization of cell forms, and better quantitative models to fully interpret these findings. 

      Reviewer #1 (Recommendations for the authors): 

      There is a generally consistent set of experiments conducted with each of the mutant strains, allowing a straightforward examination of the effects of each transformant. There are a few general and specific things that need to be addressed for both the benefit of the reader and the accuracy of interpretation. The following is a list of items that need to be addressed in the document, with an overall goal of making it more readable and making the interpretations more quantitatively defended. 

      Specific comments: 

      (1) The manuscript overall is wordy and there are quite a few examples of text in the results that should be in the discussion (examples include lines 224-225, 248-262, 282-288, 304-308). The manuscript overall could use careful editing for verbosity.

      Thank you for this comment. We have removed some of the indicated sentences. However, to maintain the flow and logic of the manuscript, some statements may have been preserved to help transition between sections. As far as verbosity, we have tried to be as clear as possible in our descriptions of the results to minimize ambiguity. Others who read our manuscript appreciated the thoroughness of our descriptions.

      (2) There is also a trend in the document to base fact statements on qualitative and quantitative differences that do not approach statistical significance. Examples of this include lines 156-158, 190-192, 198-199, 230-232, 239-242, and 292-293. This is something the authors need to be careful about, as these statistically insignificant differences may compound uncertainty across the entire manuscript.

      We have quantified inclusion areas and tried to remove instances of qualitative assessments as noted by the reviewer. In regards to some of the transcripts, we can only report the data as they are. In some cases, there are trends that are not statistically significant, but it would seem to be inaccurate to state that they were unchanged. In other cases, a two-fold or less difference in transcript levels may be statistically significant but biologically insignificant. A reader can and should make their own conclusions.

      (3) Any description of inclusion or RB size being modestly different needs to be defended with microscopic quantification. 

      We have quantified inclusion areas and RB sizes and tried to remove instances of qualitative assessments as noted by the reviewer.

      (4) It would be very helpful to reviewers if there was a figure number added to each figure in the reviewer-delivered text. 

      Added.

      (5) Figure 1A: This should indicate that the genes indicated beneath each developmental form are on high (I think that is what that means). 

      We have reorganized Figure 1 to improve the flow.

      (6) Figure 1B is exactly the same as the three images in Figure 8B. I would delete this in Figure 1. This relates to comment 9. 

      We presented this intentionally to clearly illustrate to the reader, who may not be knowledgeable in this area, what we propose is happening in the various strains. As such, we respectfully disagree and have left this aspect of the figure unchanged.

      (7) Figure 1D: It is not clear if the period in E.V has any meaning. I think this is just a typo. Also, the color coding needs to be indicated here. What do the gray bars represent? The labeling for the gene schematic for dacA-KDcom should not be directly below the first graph in D. This makes the reader think this is a label for the graph. This can be accomplished if the image in panel B is removed and the first graph in panel D is moved into B. This will make a better figure. 

      We have reorganized Figure 1 to improve the flow.

      (8) Figure 2 C, G: The utility of these panels is not clear. For them to have any value, they need to be expressed in genome copies. If they are truly just a measure of chlamydia genomic DNA, they have minimal utility to the reader. There are similar panels in several other figures. 

      We have reported genome copies as suggested in lieu of ng gDNA for these measurements. Importantly, it does not alter any interpretations.

      (9) I am not sure about the overall utility of Figure 8. Granted, a summary of their model is useful, but the cartoons in the figure are identical or very nearly identical to model figures shown in two other publications from the same group (PMID: 39576108, 39464112). These are referenced at least tangentially in the current manuscript (Jensen paper, now published, and ref 53). Because the model has been published before, if they are to be included, there needs to be a direct comparison of the results in each of these three papers, as they basically describe the same developmental process. The model images should also be referenced directly to the first of the other papers.

      This was intentional so that readers familiar with our work will see the similarities between these systems. We have added additional comments in the Discussion related to our newly published work. As an aside, Dr. Lee generated the first version of the figure that was adapted by others in the lab. It is perhaps unlucky that those other studies have been published before his work.

    1. eLife Assessment

      This important study presents a new framework (ASBAR) that combines open-source toolboxes for pose estimation and behavior recognition to automate the process of categorizing behaviors in wild apes from video data. The authors present compelling evidence that this pipeline can categorize simple wild ape behaviors from out-of-context video at a similar level of accuracy as previous models, while simultaneously vastly reducing the size of the model. The study's results should be of particular interest to primatologists and other behavioral biologists working with natural populations.

    2. Reviewer #1 (Public review):

      Summary:

      Advances in machine vision and computer learning have meant that there are now state-of-the-art and open-source toolboxes that allow for animal pose estimation and action recognition. These technologies have the potential to revolutionize behavioral observations of wild primates but are often held back by labor intensive model training and the need for some programming knowledge to effectively leverage such tools. The study presented here by Fuchs et al unveils a new framework (ASBAR) that aims to automate behavioral recognition in wild apes from video data. This framework combines robustly trained and well tested pose estimate and behavioral action recognition models. The framework performs admirably at the task of automatically identifying simple behaviors of wild apes from camera trap videos of variable quality and contexts. These results indicate that skeletal-based action recognition offers a reliable and lightweight methodology for studying ape behavior in the wild and the presented framework and GUI offer an accessible route for other researchers to utilize such tools.

      Given that automated behavior recognition in wild primates will likely be a major future direction within many subfields of primatology, open-source frameworks, like the one presented here, will present a significant impact on the field and will provide a strong foundation for others to build future research upon.

      Strengths:

      Clearly articulated the argument as to why the framework was needed and what advantages it could convey to the wider field.

      For a very technical paper it was very well written. For every aspect of the framework, the authors clearly explained why it was chosen and how it was trained and tested. This information was broken down in a clear and easily digestible way that will be appreciated by technical and non-technical audiences alike.

      The study demonstrates which pose estimation architectures produce the most robust models for both within context and out of context pose estimates. This is invaluable knowledge for those wanting to produce their own robust models.

      The comparison of skeletal-based action recognition with other methodologies for action recognition is helpful in contextualizing the results.

      Weaknesses:

      While I note that this is a paper most likely aimed at the more technical reader, it will also be of interest to a wider primatological readership, including those who work extensively in the field. When outlining the need for future work I felt the paper offered almost exclusively very technical directions. This may have been a missed opportunity to engage the wider readership and suggest some practical ways those in the field could collect more ASBAR friendly video data to further improve accuracy.

      Comments on latest version:

      I think the new version is an improvement and applaud the authors on a well-written article that conveys some very technical details excellently. The authors have addressed my initial comments about reaching out to a wider, sometimes less technical, primatological audience by encouraging researchers to create large annotated datasets and make these publicly accessible. I also agree that fostering interdisciplinary collaboration is the best way to progress this field of research. These additions have certainly strengthened the paper but I still think some more practical advice for the actual collection of high-quality training data used to improve the pose estimates and behavioral classification in tough out-of-context environments could have been added. This doesn't detract from the quality of the paper though.

    3. Reviewer #2 (Public review):

      Fuchs et al. propose a framework for action recognition based on pose estimation. They integrate functions from DeepLabCut and MMAction2, two popular machine learning frameworks for behavioral analysis, in a new package called ASBAR.

      They test their framework by:

      Running pose estimation experiments on the OpenMonkeyChallenge (OMC) dataset (the public train + val parts) with DeepLabCut

      Annotating pose data for around 320 images in the PanAf dataset (which contains behavioral annotations). They show that the ResNet-152 model generalizes best from the OMC data to this out-of-domain dataset.

      They then train a skeleton-based action recognition model on PanAf and show that the top-1/3 accuracy is slightly higher than video-based methods

    4. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      Summary:

      Advances in machine vision and computer learning have meant that there are now state-of-the-art and open-source toolboxes that allow for animal pose estimation and action recognition. These technologies have the potential to revolutionize behavioral observations of wild primates but are often held back by labor-intensive model training and the need for some programming knowledge to effectively leverage such tools. The study presented here by Fuchs et al unveils a new framework (ASBAR) that aims to automate behavioral recognition in wild apes from video data. This framework combines robustly trained and well-tested pose estimate and behavioral action recognition models. The framework performs admirably at the task of automatically identifying simple behaviors of wild apes from camera trap videos of variable quality and contexts. These results indicate that skeletal-based action recognition offers a reliable and lightweight methodology for studying ape behavior in the wild and the presented framework and GUI offer an accessible route for other researchers to utilize such tools.

      Given that automated behavior recognition in wild primates will likely be a major future direction within many subfields of primatology, open-source frameworks, like the one presented here, will present a significant impact on the field and will provide a strong foundation for others to build future research upon.

      Strengths:

      Clearly articulated the argument as to why the framework was needed and what advantages it could convey to the wider field.

      For a very technical paper it was very well written. For every aspect of the framework, the authors clearly explained why it was chosen and how it was trained and tested. This information was broken down in a clear and easily digestible way that will be appreciated by technical and non-technical audiences alike.

      The study demonstrates which pose estimation architectures produce the most robust models for both within-context and out-of-context pose estimates. This is invaluable knowledge for those wanting to produce their own robust models.

      The comparison of skeletal-based action recognition with other methodologies for action recognition helps contextualize the results.

      We thank Reviewer #1 for their thoughtful and constructive review of our manuscript. We are especially grateful for your recognition of the clarity of the manuscript, the strength of the technical framework, and its accessibility to both technical and non-technical audiences. Your feedback highlights exactly the kind of interdisciplinary engagement we hope to foster with this work.

      Weaknesses

      While I note that this is a paper most likely aimed at the more technical reader, it will also be of interest to a wider primatological readership, including those who work extensively in the field. When outlining the need for future work I felt the paper offered almost exclusively very technical directions. This may have been a missed opportunity to engage the wider readership and suggest some practical ways those in the field could collect more ASBAR-friendly video data to further improve accuracy.

      We appreciate this insightful suggestion and fully agree that emphasizing practical relevance is important for engaging a broader readership. In response, we have reformulated the opening of the Discussion section to place stronger emphasis on the value of shared, open-source resources and the real-world accessibility of the ASBAR framework. The revised text explicitly highlights the practical benefits of ASBAR for field researchers working in resource-constrained environments, and underscores the importance of community-driven data sharing to advance behavioral research in natural settings.

      This section now reads: Despite the growing availability of open-source resources, such as large-scale animal pose datasets and machine learning toolboxes for pose estimation and human skeleton-based action recognition, their integration for animal behavior recognition—particularly in natural settings—remains largely unexplored. With ASBAR, a framework combining animal pose estimation and skeleton-based action recognition, we provide a comprehensive data and model pipeline, methodology, and GUI to assist researchers in automatically classifying animal behaviors via pose estimation. We hope these resources will become valuable tools for advancing the understanding of animal behavior within the research community.

      To illustrate ASBAR’s capabilities, we applied it to the challenging task of classifying great ape behaviors in their natural habitat. Our skeleton-based approach achieved accuracy comparable to previous video-based studies for Top-K and Mean Class Accuracies. Additionally, by reducing the input size of the action recognition model by a factor of approximately 20 compared to video-based methods, our approach requires significantly less computational power, storage space, and data transfer resources. These qualities make ASBAR particularly suitable for field researchers working in resource-constrained environments.

      Our framework and results are built on the foundation of shared and open-source materials, including tools like DeepLabCut, MMAction2, and datasets such as OpenMonkeyChallenge and PanAf500. This underscores the importance of making resources publicly available, especially in primatology, where data scarcity often impedes progress in AI-assisted methodologies. We strongly encourage researchers with large annotated video datasets to make them publicly accessible to foster interdisciplinary collaboration and further advancements in animal behavior research.

      Reviewer #2 (Public Review)

      Fuchs et al. propose a framework for action recognition based on pose estimation. They integrate functions from DeepLabCut and MMAction2, two popular machine-learning frameworks for behavioral analysis, in a new package called ASBAR.

      They test their framework by

      Running pose estimation experiments on the OpenMonkeyChallenge (OMC) dataset (the public train + val parts) with DeepLabCut.

      Annotating pose data for around 320 images in the PanAf dataset (which contains behavioral annotations). They show that the ResNet-152 model generalizes best from the OMC data to this out-of-domain dataset.

      They then train a skeleton-based action recognition model on PanAf and show that the top-1/3 accuracy is slightly higher than video-based methods (and strong), but that the mean class accuracy is lower - 33% vs 42%. Likely due to the imbalanced class frequencies. This should be clarified. For Table 1, confidence intervals would also be good (just like for the pose estimation results, where this is done very well).
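
      To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The labels and counts are hypothetical and are not taken from PanAf; they only illustrate how a classifier that favors the majority class can score well on top-1 accuracy while scoring poorly on mean class accuracy.

```python
from collections import defaultdict

def top1_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical toy data: 90 clips of a common behavior, 10 of a rare one.
y_true = ["walking"] * 90 + ["climbing"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["walking"] * 100

print(top1_accuracy(y_true, y_pred))        # 0.9
print(mean_class_accuracy(y_true, y_pred))  # 0.5
```

      In this toy case the rare class contributes a recall of zero, which halves the mean class accuracy even though nine out of ten clips are labeled correctly.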

      We thank Reviewer #2 for their clear and helpful summary of our work, and for the thoughtful suggestions to improve the manuscript. We appreciate this observation. In the revised manuscript, we now clarify that the lower Mean Class Accuracy (MCA) in the initial version was indeed driven by significant class imbalance in the PanAf dataset, which contains highly uneven representation across behavior categories. To address this, we made two key improvements to the action recognition model:

      (1) We replaced the standard cross-entropy loss with a class-balanced focal loss, following the approach of Sakib et al. (2021), to better account for rare behaviors during training.

      (2) We initialized the PoseConv3D model with pretrained weights from FineGym (Shao et al., 2020) rather than training from scratch, which increased performance across underrepresented classes.

      Together, these changes substantially improved model performance on tail classes, increasing the Mean Class Accuracy from 33.6% to 47%, now exceeding that of the video-based baseline.
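
      As an illustration of the first change, the sketch below shows one widely used way to combine class-balanced weighting with a focal term, written in plain Python for a single sample. It is an assumption, not the authors' implementation: the response does not spell out the exact formulation, and the function names, the `beta` and `gamma` defaults, and the class counts are all hypothetical.

```python
import math

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights from the inverse 'effective number' of samples.

    Effective number for a class with n samples: (1 - beta**n) / (1 - beta).
    Weights are rescaled to sum to the number of classes.
    Assumes every class has at least one sample (n >= 1).
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_balanced_focal_loss(logits, target, weights, gamma=2.0):
    """Focal loss for one sample, scaled by the weight of its true class.

    With gamma = 0 and unit weights this reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical counts: class 0 is common, class 1 is rare.
weights = class_balanced_weights([1000, 10])
# The same moderately confident prediction costs more on the rare class.
loss_common = class_balanced_focal_loss([1.0, 0.0], 0, weights)
loss_rare = class_balanced_focal_loss([0.0, 1.0], 1, weights)
print(weights, loss_common, loss_rare)
```

      The focal factor `(1 - p_t) ** gamma` shrinks the contribution of examples the model already classifies confidently, while the class weights raise the cost of errors on rare behaviors.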

      Moreover, we sincerely thank Reviewer #2 for the thorough and constructive private feedback. Your comments have greatly helped us improve both the structure and clarity of the manuscript, and we have implemented several key revisions based on your recommendations to streamline the text and sharpen its focus on the core contributions. In particular, we have revised the tone of both the Introduction and Discussion sections to more modestly and accurately reflect the scope of our findings. We removed unnecessary implementation details—such as the description of graph-based models that were not part of the final pipeline—to avoid distracting tangents. The Methods section has been clarified and consolidated to include all evaluation metrics, a description of the data augmentation, and other methodological elements that were previously scattered across the Results section. Additionally, the Discussion now explicitly addresses the limitations of our EfficientNet results, including a dedicated paragraph that acknowledges the use of suboptimal hyperparameters and highlights the need for architecture-specific tuning, particularly with respect to learning rate schedules.

    1. eLife Assessment

      The authors present a useful agent-based model to study the tensile force generated by myosin mini-filaments in actin systems (bundles and networks); by numerically solving a mechanical model of myosin-II filaments, the authors provide insights into how the geometry of the molecular components and their elastic responses determine the force production. This work is of interest to biophysicists (in particular theoreticians) investigating force generation of motor molecules from a biomechanical engineering and physics perspective. The authors convincingly show that cooperative effects between multiple myosin filaments can enhance the total force generated, but not the efficiency of force generation (force per myosin) if passive cross-linkers are present. This work would benefit from a more extensive discussion of the physiological relevance of the results in view of the existing experimental literature, and how the principles that govern the behavior could be different for different motor proteins.

    2. Reviewer #1 (Public review):

      Summary:

      This work by Ding et al uses agent-based simulations to explore the role of the structure of molecular motor myosin filaments in force generation in cytoskeletal structures. The focus of the study is on disordered actin bundles which can occur in the cell cytoskeleton and can be investigated with in vitro purified protein experiments. A key finding is that the force generation depends on the number of myosin motor heads and the spatial distribution of the myosin thick filaments in relation to passive crosslinkers.

      Strengths:

      The work develops a model where the detailed structure of the myosin motor filaments with multiple heads is represented. This allows the authors to test the dependence of myosin-generated forces on the number of myosin heads and their spatial distribution.

      The work highlights that forces from multiple myosin motors within a disordered actin bundle may not simply add up, but depend on their spatial distribution in relation to passive crosslinkers.

      This may explain prior experimental observations in in vitro reconstituted actomyosin bundles that the tension developed in the bundle was proportional to the number of myosin motor heads per filament rather than the number of myosin filaments. More generally, this type of modeling can guide fundamental understanding of the relationship between structure and mechanical force production.

      Weaknesses:

      The work focuses on the structure of myosin filaments but ignores other processes that may determine contractility of actomyosin structures such as the dynamics of crosslinker binding/unbinding and actin polymerization/depolymerization.

      The authors did not vary the relative concentration of myosin motors and passive crosslinkers. This would have revealed interesting competing effects between motor and crosslink density and distribution, that their model and other studies suggest are important.

      Given the above factors and the lack of direct quantitative comparisons with the experiment, the physiological significance of the work remains hard to ascertain.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors use a mechanical model to investigate how the geometry and deformations of myosin II filaments influence their force generation. They introduce a force generation efficiency that is defined as the ratio of the total generated force and the maximal force that the motors can generate. By changing the architecture of the myosin II filaments, they study the force generation efficiency in different systems: two filaments, a disorganized bundle, and a 2D network. In the simple two-filament systems, they found that in the presence of actin cross-linking proteins motors cannot add up their force because of steric hindrances. In the disorganized bundle, the authors identified a critical overlap of motors for cooperative force generation. This overlap is also influenced by the arrangement of the motor on the filaments and influenced by the length of the bare zone between the motor heads.

      Strengths:

      The strength of the study is the identification of organizational principles in myosin II filaments that influence force generation. It provides a complementary mechanistic perspective on the operation of these motor filaments. The force generation efficiency and the cooperative overlap number are quantitative ways to characterize the force generation of molecular motors in clusters and between filaments. These quantities and their conceptual implications are most likely also applicable in other systems.

      Weaknesses:

      The detailed model that the authors present relies on over 20 numerical parameters that are listed in the supplement. Because of this vast number of parameters, it is not clear how general the findings are. On the other hand, it was not obvious how specific the model is to myosin II, meaning how well it can describe experimental findings or make measurable predictions. Although the authors partially addressed this point in the revisions, I still think it is not easy to see what are the fundamental principles that govern the behavior and how they could be different for different motor proteins.

      The model seems to be quantitative, but the interpretation and connection to real experiments is rather qualitative in my point of view.

    1. eLife Assessment

      This useful study examines excitation/inhibition (E/I) balance in the CA3-CA1 circuit of the hippocampus. Experimental and computational modeling results are presented, but these results provide incomplete evidence to support the paper's main claims due to shortcomings in the experimental and modeling approaches, as well as concerns about the neurobiological relevance of the results.

    2. Reviewer #1 (Public review):

      Summary:

      This study uses optogenetics to activate CA3, while recording from CA1 neurons and characterizing the excitation/inhibition (E/I) balance. They observe use-dependent alterations in the E/I balance as a result of STP, and they develop a model to describe these observations. This is a very ambitious paper that deals with many issues using both experimental and modeling approaches.

      Strengths:

      This paper examines important principles regarding the manner in which synaptic circuitry and use-dependent synaptic plasticity can transform inputs and perform computations.

      Weaknesses:

      The use of selective ChR2 expression in CA3 cells is a good approach, but there are numerous issues that cause concern regarding the applicability of their slice recordings to physiological conditions and that make some aspects of their results difficult to interpret. Experiments are not performed under physiological conditions (high external calcium and low temperature), which makes the interpretation of their findings difficult. In addition, the reliability of stimulating action potentials in CA3 pyramidal cells needs to be determined, particularly during high-frequency trains. If it is unreliable, there are alternative approaches that might prove to be superior, such as the use of somatically targeted ChR2. In addition, a clearer, more detailed discussion of their model that distinguishes it from previous modeling studies would be helpful (and would make it seem less incremental).

    3. Reviewer #2 (Public review):

      Summary:

      The authors investigate EI balance in the CA3-CA1 projections, emphasizing synaptic depletion and the implied rebalancing of excitatory and inhibitory projections onto a single CA1 Pyramidal cell. They present physiological results with optical stimulation in CA3 and measuring various response features in CA1, showing signatures consistent with the adjustment of EI balance. In particular, the authors emphasize a transient effect where the neuron escapes from EI balance, which can be used for mismatch detection. They partially replicate these results in a computational model that looks at detailed properties of synaptic plasticity in CA1.

      Strengths:

      The authors provide compelling evidence that non-specific modulation of synaptic plasticity, combined with their differential effects on excitatory and inhibitory neurons, can be used by CA1 excitatory neurons to detect changes in the population activity of CA3 neurons. Indeed, they provide insight into the potential computational role of transient EI imbalance.

      Weaknesses:

      The authors observe that "little is known about how EI balance itself evolves dynamically due to activity-driven plasticity in sparsely active networks." This is an overstatement, or better an understatement, given the extensive literature on EI balance (e.g. Wen W, Turrigiano GG. Keeping Your Brain in Balance: Homeostatic Regulation of Network Function. Ann Rev Neurosci. 2024. https://doi.org/10.1146/annurev-neuro-092523-110001 PMID: 38382543). This way of framing the question does a disservice to the field and fails to contextualize the current research properly.

      The evidence is incomplete because the authors do not show a specific relationship between synaptic change in CA1 and EI balance adjustment, i.e., the alternative could be that this is an unspecific effect unrelated to the specific regulation of EI balance and its functional role in the hippocampus and the cortex. Indeed, the paper drifts from addressing EI balance to elucidating the mismatch detection. The second shortcoming is that they do not show that the stimulation of the CA3 neurons occurs in a physiologically realistic regime, nor do they analyze what the impact will be of the excitatory transient in "mismatch detection", and CA1, when this would occur at the level of the whole population, i.e., the physiological impossibility of triggering uncontrolled chaotic excitatory responses. In particular, when we consider CA3 as an attractor memory system, the range of deviations (mismatches) that a CA1 neuron can be exposed to and detect, given the model presented in this paper, might be below those generated due to CA3 pattern-completion dynamics. In addition, the match between the model and the physiological results is not fully quantified, leaving it to the reader to make a leap of faith.

      In addition, the manuscript suffers from poor analysis and presentation. The work could be improved by putting more effort into translating results into insightful metrics.

      Overall, the authors have not achieved their original aim to show that the observed phenomenon is relevant to computation in CA1 or the brain outside of a highly controlled in vitro setup and reductionist single cell model.

      The authors combine several techniques for in vitro whole-cell patch-clamp recordings with patterned optical stimulation of the CA3 network in the mouse hippocampus, which is consistent with the state-of-the-art.

      They introduce a metric of similarity between expected and observed response patterns, called gamma. The name is confusing given the wide use of the label gamma for oscillation frequencies above 20 Hz. Gamma is calculated as (E*O)/(E-O). This means that gamma approaches infinity as the difference goes to 0, to mention one of the problems. This metric is not interpretable, and it is not clear why the authors did not follow a standard approach, e.g., likelihood, correlation, or percent error.
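
      The divergence can be illustrated with a short sketch in plain Python. It uses the formula exactly as quoted above, which may not match the manuscript's full definition, and the response values are hypothetical. Percent error is shown alongside as one of the standard alternatives mentioned.

```python
def gamma_metric(expected, observed):
    """(E*O)/(E-O) as quoted in the review.

    Raises ZeroDivisionError when expected == observed.
    """
    return expected * observed / (expected - observed)

def percent_error(expected, observed):
    """Absolute deviation of observed from expected, in percent."""
    return 100.0 * abs(observed - expected) / abs(expected)

# Hypothetical responses: the observed value approaches the expected one.
expected = 10.0
for observed in (9.0, 9.9, 9.99, 10.1):
    print(observed, gamma_metric(expected, observed),
          percent_error(expected, observed))
```

      As the observed response approaches the expected one, the quoted metric grows without bound and flips sign once the observed value exceeds the expected value, whereas percent error shrinks smoothly toward zero.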

      The authors aim to replicate the physiological results with an "abstract model of the hippocampal FFEI network". In practice, this is a conductance-based model of a single CA1 neuron, including chemical kinetics-based multi-step neurotransmitter vesicle release. This is an abstraction from the FFEI network that the paper starts with. It raises the question whether this is the right level at which to model the computational impacts of EI imbalance on CA1 neurons. Given the highly reduced model they have elaborated, the generalization to the complete CA3-CA1 network that the authors suggest can be achieved in the discussion is overoptimistic. Network models of CA3 and CA1 must be considered, together with afferents from the entorhinal cortex to accomplish this generalization.

      The authors reveal a potentially interesting physiological feature of CA1 excitatory neurons under very specific stimulus conditions. It could warrant follow-up studies to place EI imbalance in a physiologically realistic context.

    4. Reviewer #3 (Public review):

      Summary:

      This work shows experimentally and computationally that single CA1 neurons can perform mismatch detection on patterned CA3 inputs and that STP and EI balance underlie this detection.

      Strengths:

      It has been known that STP can enhance the EPSP when the corresponding presynaptic input exhibits abrupt changes in firing rate. This work provides experimental evidence and further computational support for the hypothesis that the basic computation through STP is useful for detecting abrupt changes in the spatial pattern of synaptic inputs at the Schaffer collaterals. Further, their results indicate the novel view that mismatch detection is most efficient when gamma-frequency bursting inputs exhibit mismatches between theta cycles.

      Weaknesses:

      Their model assumes that patterned activities in CA3 do not have overlaps. However, overlaps between memory engrams have been shown. Therefore, this assumption may not hold, and whether the proposed mechanism is valid for overlapping CA3 inputs needs further clarification.

    1. eLife Assessment

      This valuable study provides evidence that the integration of the nuclear envelope into the endoplasmic reticulum provides a mechanism for mechanical integration across this continuous membrane system. If robustly demonstrated, this work would open up new avenues for studying organelle membrane tension homeostasis. While the evidence is largely convincing and carefully quantified, a key limitation is the absence of data demonstrating that microinjection of cytoskeleton-depolymerizing drugs locally disrupts the target network.

    2. Reviewer #1 (Public review):

      Summary:

      Zare‑Eelanjegh et al. investigate how the endoplasmic reticulum, the nucleus, and the cell periphery are mechanically linked by indenting intact cells with specially shaped atomic‑force probes that double as drug injection devices. Fluorescence‑lifetime imaging of the membrane tension reporter Flipper‑TR reveals that these three compartments are mechanically linked and that the actin cytoskeleton, microtubules, and lamins modulate this coupling in complex ways.

      Strengths:

      (1) The study makes an important advance by applying FluidFM to probe organelle mechanics in living cells, a technically demanding but powerful approach.

      (2) Experimental design is quantitative, the data are clearly presented, and the conclusions are broadly consistent with the measurements.

      Weaknesses:

      (1) Calcium‑dependent effects: Indentation can evoke cytoplasmic Ca²⁺ elevations that drive myosin contraction and reshape the internal membrane network (e.g., vesiculation: PMID: 9200614, 32179693), possibly confounding the Flipper-TR responses; without simultaneous/matching Ca²⁺ imaging, cell viability assays (e.g., Sytox), and intracellular Ca²⁺ sequestration or myosin inhibition experiments, a more complex mechanochemical coupling cannot be excluded, weakening the conclusions.

      (2) Baseline measurements: Flipper‑TR lifetime images acquired without indentation do not exclude potential light‑induced or time‑dependent changes, which weaken the conclusions.

      (3) Indentation depth versus nuclear stiffness/tension: Because lamin‑A/C depletion softens nuclei, a given force may produce a deeper pit and thus greater membrane stretch. It is unclear how the cytoskeletal perturbations affect indentation depth, which weakens the conclusions.

    3. Reviewer #2 (Public review):

      Summary:

      This useful study combines atomic force microscopy with genetic manipulations of the lamin meshwork and microinjection of cytoskeletal depolymerizing drugs to probe the mechanical responses of intracellular organelles to combinations of cytoskeletal perturbations. This study demonstrates both local and distal responses of intracellular organelles to mechanical forces and shows that these responses are affected by disruption of the actin, microtubule, and lamin cytoskeletal systems. Interpretation of these effects is limited by the absence of key data determining whether acute microinjection of cytoskeleton-depolymerizing drugs has complete or partial effects on the targeted cytoskeletal networks.

      Strengths:

      This study uses a sensitive micromanipulation system to apply and visualize the effects of force on intracellular organelles.

      Weaknesses:

      The choice to deliver cytoskeleton-depolymerizing drugs by local microinjection is unusual, and it is unclear to what extent actin and microtubule filaments are actually depolymerized immediately after microinjection and on the minutes-length timescale being evaluated in this study. This omission limits the interpretation of these data.

    4. Reviewer #3 (Public review):

      Summary:

      Using an approach developed by the authors (FluidFM) combined with FLIM, they discover that a mechanical force applied over the cell nucleus triggers mechanical responses dependent on the Lamina composition.

      Strengths:

      The authors present a new approach to study mechano-transduction in living cells, with which they uncover lamin-dependent properties of the nucleus.

      Weaknesses:

      (1) The transfer of the mechanical response from the Lamina to the ER is not fully covered.

      (2) In Figure 4D, WT dots are the same for each compartment. Why do the authors not make one graph for each compartment with WT, A-KO, B-KD, and A-KO/B-KD together?

      (2) In Figure 1E, the authors showed well how the probe deforms the nucleus. It is not indicated in the material and methods section or in the figure legend, where, in Z, the acquisition of FLIM images was made or if it is a maximum projection. I assume it was made at a plane in the middle of the nucleus to see the nuclear envelope border and the ER at the same time. Did the authors look at the nuclear membrane facing upward, where most of the deformation should occur? Are there more lifetime changes? In Figure D, before injection of CytoD, we can clearly see a difference at the pyramidal indentation site with two different lifetime colors.

      (3) A great result of this article regards the importance of Lamins, A and B, in triggering the response to a mechanical force applied to the nucleus. Could 3D imaging for LaminA and LaminB be performed at the different time points of indentation to see how the lamins meshworks are deformed and how they return to basal state? This could be correlated with the FLIM results described in the article.

      (4) Lamins form a meshwork underneath the nuclear membrane. They are connected to the cytoskeletons mainly by the LINC complex. Results presented here show that the cytoskeletons are implicated in transferring the stimulus from the nuclear envelope to the ER. Could the author perform the same experiments using Nesprin-2 or/and Nesprin-1 or/and SUN1/2 knockdowns to determine if this transmission is occurring through the LINC complex or rather in a passive way by modifying the nuclear close surroundings?

      (5) The authors used cytoskeleton drugs, CytoD and Nocodazole, with their FluidFM probe, but did not show whether the drugs actually worked, and to what extent, by performing actin or microtubule stainings. In the original paper describing FluidFM, 15 s were enough to obtain a full FITC-positive cell after injection. Here, the experiments are around 5 minutes long. I therefore question the rationale for injecting the drugs rather than incubating the cells with them directly, beyond restricting the treatment to the cell currently under indentation.

    1. eLife Assessment

      This important study presents a meta-analysis confirming a statistically significant association between slow oscillation-spindle coupling and memory formation, although the reported effects are limited (~0.5% of variance). The evidence is overall convincing, but the statistical methods may be difficult to follow for readers unfamiliar with advanced techniques. This work will be of particular interest to neuroscientists studying the neural mechanisms of sleep and memory.

    2. Reviewer #1 (Public review):

      In this meta-analysis, Ng and colleagues review the association between slow-oscillation spindle coupling during sleep and overnight memory consolidation. The coupling of these oscillations (and also hippocampal sharp-wave ripples) has been central to theories and mechanistic models of active systems consolidation, which posit that the coupling between ripples, spindles, and slow oscillations (SOs) coordinates and drives the reactivation of memories in hippocampus and cortex, facilitating cross-regional information transfer and ultimately memory strengthening and stabilisation.

      Given the importance that these coupling mechanisms have been given in theory, this is a timely and important contribution to the literature in terms of determining whether these theoretical assumptions hold true in human data. The results show that the timing of sleep spindles relative to the SO phase, and the consistency of that timing, predicted overnight memory consolidation in meta-analytic models. The overall amount of coupling events did not show as strong a relationship. Coupling phase in particular was moderated by a number of variables including spindle type (fast, slow), channel location (frontal, central, posterior), age, and memory type. The main takeaway is that fast spindles that consistently couple close to the peak of the SO in frontal channel locations are optimal for memory consolidation, in line with theoretical predictions. These findings will be very useful for future researchers in terms of determining necessary sample sizes to observe coupling - memory relationships, and in the selection and reporting of relevant coupling metrics.

      Although the meta-analysis covers the three main coupling metrics that are typically assessed (occurrence, timing, and consistency), the meta-analysis also includes spindle amplitude. This may be confusing to readers, as this is not a measurement of SO-spindle coupling but instead a measurement of spindles in general (which may or may not be coupled).

    3. Reviewer #2 (Public review):

      This article reviews the studies on the relationship between slow oscillation (SO)-spindle (SP) coupling and memory consolidation. It innovatively employs non-normal circular linear correlations through a Bayesian meta-analysis. A systematic analysis of the retrieved studies highlighted that co-coupling of SO and the fast SP's phase and amplitude at the frontal part better predicts memory consolidation performance.

      Regarding the moderator of age, this study not only provided evidence of the effect across all age groups but also the effect in a younger age group (without the small sample of elders that has a large gap from the younger age groups). The ageing effects become less pronounced, but the model still shows a moderate effect.

    4. Reviewer #3 (Public review):

      This manuscript presents a meta-analysis of 23 studies, which report 297 effect sizes, on the effect of SO-spindle coupling on memory performance. The analysis has been done with great care, and the results are described in great detail. In particular, there are separate analyses for coupling phase, spindle amplitude, coupling strength (e.g., measured by vector length or modulation index), and coupling percentage (i.e., the percentage of SPs coupled with SOs). The authors conclude that the precision and strength of coupling showed significant correlations with memory retention.

      There are two main points where I do not agree with the authors.

      First, the authors conclude that "SO-SP coupling should be considered as a general physiological mechanism for memory consolidation". However, the reported effect sizes are smaller than what is typically considered a "small effect" (0.10

      Second, the study implements state-of-the-art Bayesian statistics. While some might see this as a strength, I would argue that it is not. A classical meta-analysis is relatively easy to understand, even for readers with only a limited background in statistics. A Bayesian analysis, on the other hand, introduces a number of subjective choices that render it much less transparent. This becomes obvious in the forest plots. It is not immediately apparent to the reader how the distributions for each study represent the reported effect sizes (gray dots), which makes the analyses unnecessarily opaque. It is commendable that the authors now provide classical forest plots as Figs. S10.1-4.

      However, analyses that require a "Markov chain Monte Carlo (MCMC) method, [..] with the no-U-turn Hamiltonian Monte Carlo (HMC) samplers, [..] with each chain undergoing 12,000 iterations (including 2,000 warm-ups)" for calculating accurate Bayes Factors (BF), and checking its convergence "through graphical posterior predictive checks, [..] trace plots, and [..] Gelman and Rubin Diagnostic", which should then result in something resembling "a uniformly undulating wave with high overlap between chains" still seems overly complex. It follows a recent trend in using more and more opaque methods. Where we had to trust published results a decade ago because the data were not openly available, today we must trust the results because methods (including open source software toolboxes) can no longer be checked with reasonable effort.

    5. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Given the importance that these coupling mechanisms have been given in theory, this is a timely and important contribution to the literature in terms of determining whether these theoretical assumptions hold true in human data.

      Thank you!

      I did not follow the logic behind including spindle amplitude in the meta-analysis. This is not a measure of SO-spindle coupling (which is the focus of the review), unless the authors were restricting their analysis of the amplitude of coupled spindles only. It doesn't sound like this is the case though. The effect of spindle amplitude on memory consolidation has been reviewed in another recent meta-analysis (Kumral et al, 2023, Neuropsychologia). As this isn't a measure of coupling, it wasn't clear why this measure was included in the present meta-analysis. You could easily make the argument that other spindle measures (e.g., density, oscillatory frequency) could also have been included, but that seems to take away from the overall goal of the paper which was to assess coupling.

      Indeed, spindle amplitude refers to all spindle events rather than only coupled spindles. This choice was made because we recognized the challenge of obtaining relevant data from each study—only 4 out of the 23 included studies performed their analyses after separating coupled and uncoupled spindles. This inconsistency strengthens the urgency and importance of this meta-analysis to standardize the methods and measures used for future analysis on SO-SP coupling and beyond. We agree that focusing on the amplitude of coupled spindles would better reveal their relations with coupling, and we have discussed this limitation in the manuscript.

      Nevertheless, we believe including spindle amplitude in our study remains valuable, as it served several purposes. First, SO-SP coupling involves the modulation between spindle amplitude and slow oscillation phase. Different studies have reported conflicting conclusions regarding how overall spindle amplitude was related to coupling as an indicator of oscillation strength overnight: some found significant correlations (e.g., Baena et al., 2023), while others did not (e.g., Roebber et al., 2022). This discrepancy highlights an indirect but potentially crucial insight into the role of spindle amplitude in coupling dynamics. Second, in studies related to SO-SP coupling, spindle amplitude is one of the most frequently reported measures along with other coupling measures that significantly correlated with over-sleep memory improvements (e.g., Kurz et al., 2023; Ladenbauer et al., 2021; Niknazar et al., 2015), so we believe that including this measure can provide a more comprehensive review of the existing literature on SO-SP coupling. Third, incorporating spindle amplitude allows for a direct comparison between the measurement of coupling and individual events alone in their contribution to memory consolidation, a question that has been extensively explored in recent research (e.g., Hahn et al., 2020; Helfrich et al., 2019; Niethard et al., 2018; Weiner et al., 2023). Finally, spindle amplitude was identified as the most important moderator for memory consolidation in Kumral et al.'s (2023) meta-analysis. By including it in our analysis, we sought to replicate their findings within a broader framework and introduce conceptual overlaps with existing reviews. Therefore, although we were not able to selectively include coupled spindles, there is still a unique relation between spindle amplitude and SO-SP coupling that other spindle measures do not have.

      Originally, we also intended to include coupling density or counts in the analysis, which seems more relevant to the coupling metrics. However, the lack of uniformity in methods used to measure coupling density posed a significant limitation. We hope that our study will encourage consistent reporting of all relevant parameters in future research, allowing future meta-analyses to incorporate these measures comprehensively. We have added this discussion to the revised version of the manuscript (p. 3) to further clarify these points.

      All other citations were referenced in the manuscript.

      At the end of the first paragraph of section 3.1 (page 13), the authors suggest their results "... further emphasise the role of coupling compared to isolated oscillation events in memory consolidation". This had me wondering how many studies actually test this. For example, in a hierarchical regression model, would coupled spindles explain significantly more variance than uncoupled spindles? We already know that spindle activity, independent of whether they are coupled or not, predicts memory consolidation (e.g., Kumral meta-analysis). Is the variance in overnight memory consolidation fully explained by just the coupled events? If both overall spindle density and coupling measures show an equal association with consolidation, then we couldn't conclude that coupling compared to isolated events is more important.

      While primary coupling measurements, including coupling phase and strength, showed strong evidence for their associations with memory consolidation, measures of spindles, including spindle amplitude, only exhibited limited evidence (or “non-significant” effect) for their association with consolidation. These results are consistent with multiple empirical studies using different techniques (e.g., Hahn et al., 2020; Helfrich et al., 2019; Niethard et al., 2018; Weiner et al., 2023), which reported that coupling metrics are more robust predictors of consolidation and synaptic plasticity than spindle or slow oscillation metrics alone. However, we agree with the reviewer that we did not directly separate the effect between coupled and uncoupled spindles, and a more precise comparison would involve contrasting the “coupling of oscillation events” with “individual oscillation events” rather than coupling versus isolated events.

      We recognized that Kumral and colleagues’ meta-analysis reported a moderate association between spindle measures and memory consolidation (e.g., for spindle amplitude-memory association they reported an effect size of approximately r = 0.30). However, one of the advantages of our study is that we actively cooperated with the authors to obtain a large number of unreported and insignificant data relevant to our analysis, as well as separated data that were originally reported under mixed conditions. This approach decreases the risk of false positives and selective reporting of results, making the effect size more likely to approach the true value. In contrast, we found only a weak effect size of r = 0.07 with minimal evidence for spindle amplitude-memory relation. However, we agree with the reviewer that using a more conservative term in this context would be a better choice since we did not measure all relevant spindle metrics including the density.

      To improve clarity in our manuscript, we have revised the statement to: “Together with other studies included in the review, our results suggest a crucial role of coupling but did not support the role of spindle events alone in memory consolidation,” and provide relevant references (p. 13). We believe this can more accurately reflect our findings and the existing literature to address the reviewer’s concern.

      It was very interesting to see that the relationship between the fast spindle coupling phase and overnight consolidation was strongest in the frontal electrodes. Given this, I wonder why memory promoting fast spindles shows a centro-parietal topography? Surely it would be more adaptive for fast spindles to be maximally expressed in frontal sites. Would a participant who shows a more frontal topography of fast spindles have better overnight consolidation than someone with a more canonical centro-parietal topography? Similarly, slow spindles would then be perfectly suited for memory consolidation given their frontal distribution, yet they seem less important for memory.

      Regarding the topography of fast spindles and their relationship to memory consolidation, we agree this is an intriguing issue. We have already made significant progress on this topic in our ongoing work and have found evidence that participants with a more frontal topography of fast spindles show better overnight consolidation. These findings will be presented in our future publications. We share a few relevant observations: First, there are significant discrepancies in the definition of “slow spindle” in the field. Some studies defined slow spindle from 9-12 Hz (e.g. Mölle et al., 2011; Kurz et al., 2021), while others performed the event detection within a range of 11-13/14 Hz and found a frontal-dominated topography (e.g. Barakat et al., 2011; D'Atri et al., 2018). Compounding this issue, individual and age differences in spindle frequency are often overlooked, leading to challenges in reliably distinguishing between slow and fast spindles. Some studies have reported difficulty in clearly separating the two types of spindles altogether (e.g., Hahn et al., 2020). Moreover, a critical factor often ignored in past research is the propagating nature of both slow oscillations and spindles across the cortex, where spindles are coupled with significantly different phases of slow oscillations (see Figure 5). In addition, the frontal region has the strongest and most active SOs as its origin site, which may contribute to the role of frontal coupling. In contrast, not all SOs propagate from PFC to centro-parietal sites. The reviewer also raised an interesting idea that slow spindles would be perfectly suited for memory consolidation given their frontal distribution.
      We propose that one possible explanation is that if SOs couple exclusively with slow SPs, they may lose their ability to coordinate inter-area activity between centro-parietal and frontal regions, which could play a critical role in long-range memory transmission across hippocampus, thalamus, and prefrontal cortex. This hypothesis requires investigation in future studies. We believe a better understanding of coupling in the context of the propagation of these waves will help us better understand the observed frontal relationship with consolidation. Therefore, we believe this result supports our conclusion that coupling precision is more important than intensity, and we have addressed this in the revised manuscript (pp. 15-16).

      The authors rightly note the issues with multiple comparisons in sleep physiology and memory studies. Multiple comparison issues arise in two ways in this literature. First are comparisons across multiple electrodes (many studies now use high-density systems with 64+ channels). Second are multiple comparisons across different outcome variables (at least 3 ways to quantify coupling (phase, consistency, occurrence) x 2 spindle types (fast, slow)). Can the authors make some recommendations here in terms of how to move the field forward, as this issue has been raised numerous times before (e.g., Mantua 2018, Sleep; Cox & Fell 2020, Sleep Medicine Reviews for just a couple of examples)? Should researchers just be focusing on the coupling phase? Or should researchers always report all three metrics of coupling, and correct for multiple comparisons? I think the use of pre-registration would be beneficial here, and perhaps could be noted by the authors in the final paragraph of section 3.5, where they discuss open research practices.

      There are indeed multiple methods that we can discuss, including cluster-based and non-parametric methods, to correct for multiple comparisons in EEG data with spatiotemporal structures. In addition, encouraging the reporting of all tested but non-significant results, at least in supplementary materials, is an important practice that helps readers understand the findings with reduced bias. We agree with the reviewer’s suggestions and have added more information in sections 3.4-3.5 (p. 17) to advocate for a standardized “template” used to report effect sizes and correct for multiple comparisons in future research.

      We advocate for the standardization of reporting all three coupling metrics: phase, strength, and prevalence (density, count, and/or percentage coupled). Each coupling metric captures a distinct property of the coupling process, and the metrics may interact with one another (Weiner et al., 2023). Therefore, we believe it is essential to report all three metrics to comprehensively explore their different roles in the “how, what, and where” of long-distance communication and consolidation of memory. As we advance toward a deeper understanding of the relationship between memory and sleep, we hope this work establishes a standard for the standardization, transparency, and replication of relevant studies.
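
      As an illustration of the reviewer's 3 metrics x 2 spindle types scenario, the following is a minimal sketch of a Holm step-down correction over six tests. Holm's procedure is used here only because it is simple to show; it is not the cluster-based or non-parametric approach mentioned in the response, and the p-values and the `holm_adjust` helper are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k starting at 0) is multiplied by (m - k).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values: (phase, strength, prevalence) x (fast, slow) spindles.
labels = ["phase/fast", "strength/fast", "prevalence/fast",
          "phase/slow", "strength/slow", "prevalence/slow"]
p_raw = [0.004, 0.030, 0.200, 0.012, 0.045, 0.600]
p_adj = holm_adjust(p_raw)

for label, raw, adj in zip(labels, p_raw, p_adj):
    print(f"{label:16s} raw={raw:.3f} holm={adj:.3f}")
```

      With these made-up inputs, four tests fall below 0.05 before correction and one after it, which is the kind of difference a standardized reporting template would make visible.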

      Reviewer #2 (Public review):

      Regarding the Moderator of Age: Although the authors discuss the limited studies on the analysis of children and elders regarding age as a moderator, the figure shows a significant gap between the ages of 40 and 60. Furthermore, there are only a few studies involving participants over the age of 60. Given the wide distribution of effect sizes from studies with participants younger than 40, did the authors test whether removing studies involving participants over 60 would still reveal a moderator effect?

      We agree that there is an age gap between younger and older adults, as current studies often focus on contrasting newly matured and fully aged populations to amplify the effect, while neglecting the gradual changes in memory consolidation mechanisms across the aging spectrum. We suggest that a non-linear analysis of age effects would be highly valuable, particularly when additional child and older adult data become available.

      In response to the reviewer’s suggestion, we re-tested the moderation effect of age after excluding effect sizes from older adults. The results revealed a decrease in the strength of evidence for the phase-memory association due to increased variability, but were consistent for all other coupling parameters. The mean estimations also remained consistent (coupling phase-memory relation: -0.005 [-0.013, 0.004], BF10 = 5.51, the strength of evidence reduced from strong to moderate; coupling strength-memory relation: -0.005 [-0.015, 0.008], BF10 = 4.05, the strength of evidence remained moderate). These findings align with prior research, which typically observed a weak coupling-memory relationship in older adults during aging (Ladenbauer et al., 2021; Weiner et al., 2023) but not during development (Hahn et al., 2020; Kurz et al., 2021; Kurz et al., 2023). Therefore, this result is not surprising to us, and there are still observable moderate patterns in the data. We have reported these additional results in the revised manuscript (pp. 6, 11), and interpret “the moderator effect of age in the phase-memory association becomes less pronounced during development after excluding the older adult data”. We believe the original findings including the older adult group remain meaningful after cautious interpretation, given that the older adult data were derived from multiple studies and different groups, and they represent the aging effects.

      Reviewer #3 (Public review):

      First, the authors conclude that "SO-SP coupling should be considered as a general physiological mechanism for memory consolidation". However, the reported effect sizes are smaller than what is typically considered a "small effect".

      While we acknowledge the concern about the small effect sizes reported in our study, it is important to contextualize these findings within the field of neuroscience, particularly memory research. Even in individual studies, small effect sizes are not uncommon due to the inherent complexity of the mechanisms involved and the multitude of confounding variables. This is an important factor to be considered in meta-analyses where we synthesize data from diverse populations and experimental conditions. For example, the relationship between SO-slow SP coupling and memory consolidation in older adults is expected to be insignificant.

      As Funder and Ozer (2019) concluded in their highly cited paper, an effect size of r = 0.3 in psychological and related fields should be considered large, with r = 0.4 or greater likely representing an overestimation and rarely found in a large sample or a replication. Therefore, we believe r = 0.1 should not be considered the lower bound of a small effect. Bakker et al. (2019) also advocate for a contextual interpretation of the effect size. This is particularly important in meta-analyses, where the results are less prone to overestimation compared to individual studies, and we cooperated with all authors to include a large number of unreported and non-significant results. In this context, small correlations may contain substantial meaningful information to interpret. Although we agree that effect sizes reported in our study are indeed small at the overall level, they reflect a rigorous analysis that incorporates robust evidence across different levels of moderators. Our moderator analyses underscore the dynamic nature of coupling-memory relationships, with stronger associations observed in moderator subgroups that have historically exhibited better memory performance, particularly after excluding slow spindles and older adults. For example, both the coupling phase and strength of frontal fast spindles with slow oscillations exhibited "moderate-to-large" correlations with the consolidation of different types of memory, especially in young adults, with r values ranging from 0.18 to 0.32 (see Tables S9.1-9.4). We have included a discussion of the influence of moderators and hierarchical structures on the dynamics of coupling-memory associations (pp. 17, 20). In addition, we have updated the conclusion to be “SO-fast SP coupling should be considered as a general physiological mechanism for memory consolidation” (p. 1).

      Second, the study implements state-of-the-art Bayesian statistics. While some might see this as a strength, I would argue that it is the greatest weakness of the manuscript. A classical meta-analysis is relatively easy to understand, even for readers with only a limited background in statistics. A Bayesian analysis, on the other hand, introduces a number of subjective choices that render it much less transparent.

      This kind of analysis seems not to be made to be intelligible to the average reader. It follows a recent trend of using more and more opaque methods. Where we had to trust published results a decade ago because the data were not openly available, today we must trust the results because the methods can no longer be understood with reasonable effort.

      This becomes obvious in the forest plots. It is not immediately apparent to the reader how the distributions for each study represent the reported effect sizes (gray dots). Presumably, they depend on the Bayesian priors used for the analysis. The use of these priors makes the analyses unnecessarily opaque, eventually leading the reader to question how much of the findings depend on subjective analysis choices (which might be answered by an additional analysis in the supplementary information).

      We appreciate the reviewer for sharing this viewpoint and we value the opportunity to clarify some key points. To address the concern about clarity, we have included more details in the methods section explaining how to interpret Bayesian statistics including priors, posteriors, and Bayes factors, making our results more accessible to those less familiar with this approach.

      On the use of Bayesian models, we believe there may have been a misunderstanding. Bayesian methods, far from being "opaque" or overly complex, are increasingly valued for their ability to provide nuanced, accurate, and transparent inferences (Sutton & Abrams, 2001; Hackenberger, 2020; van de Schoot et al., 2021; Smith et al., 1995; Kruschke & Liddell, 2018). They have been applied in more than 1,200 meta-analyses as of 2020 (Hackenberger, 2020). In our study, we used priors that assume no effect (mean set to 0, which aligns with the null) while allowing for a wide range of variation to account for large uncertainties. This approach reduces the risk of overestimation or false positives and demonstrates much-improved performance over traditional methods in handling variability (Williams et al., 2018; Kruschke & Liddell, 2018). In addition, priors can also increase transparency, since all assumptions are formally encoded and open to critique or sensitivity analysis. In contrast, frequentist methods often rely on hidden or implicit assumptions, such as homogeneity of variance, fixed-effects models, and independence of observations, that are not directly testable. Sensitivity analyses reported in the supplemental material (Tables S9.1-9.4) confirmed the robustness of our choices of priors: our results did not vary when different priors were set.

      As Kruschke and Liddell (2018) described, “shrinkage (pulling extreme estimates closer to group averages) helps prevent false alarms caused by random conspiracies of rogue outlying data,” a well-known advantage of Bayesian over traditional approaches. This explains the observed differences between the distributions and grey dots in the forest plots, which is an advantage of Bayesian models in handling heterogeneity. Unlike p-values, which can be overestimated with a large sample size and underestimated with a small sample size, Bayesian methods make assumptions explicit, enabling others to challenge or refine them, an approach aligned with open science principles (van de Schoot et al., 2021). For example, a credible interval in a Bayesian model can be interpreted as “there is a 95% probability that the parameter lies within the interval,” while a confidence interval in a frequentist model means “in repeated experiments, 95% of the confidence intervals will contain the true value.” We believe the former is much more straightforward and convincing for readers to interpret. We will ensure our justification for using Bayesian models is more clearly presented in the manuscript (pp. 21-23).
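
      The shrinkage described above can be sketched with a normal-normal random-effects model in which the between-study standard deviation `tau` is treated as known. This is an assumption made for illustration only: the authors' model estimates all parameters with priors in Stan, and the effect sizes, standard errors, and the `shrink_estimates` helper below are hypothetical.

```python
def shrink_estimates(effects, std_errors, tau):
    """Partial pooling with a fixed between-study SD `tau` (an assumption).

    Returns the pooled mean and one shrunken estimate per study.
    """
    # Random-effects weights: inverse of (sampling variance + between-study variance).
    weights = [1.0 / (se ** 2 + tau ** 2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    shrunk = []
    for y, se in zip(effects, std_errors):
        # Fraction of a study's deviation from the pooled mean that is retained.
        # Noisy studies (large se) retain less, i.e. they are shrunk more.
        retained = tau ** 2 / (tau ** 2 + se ** 2)
        shrunk.append(pooled + retained * (y - pooled))
    return pooled, shrunk


# Hypothetical study-level correlations (Fisher-z scale) and standard errors.
effects = [0.45, 0.10, -0.05, 0.12]
std_errors = [0.30, 0.08, 0.10, 0.09]
pooled, shrunk = shrink_estimates(effects, std_errors, tau=0.10)

print(f"pooled mean = {pooled:.3f}")
for y, se, s in zip(effects, std_errors, shrunk):
    print(f"observed={y:+.2f} (se={se:.2f}) -> shrunken={s:+.3f}")
```

      In this toy example the first study, which has the largest observed effect and the largest standard error, is pulled most strongly toward the pooled mean. This is the pattern a reader sees when a posterior distribution in a forest plot does not sit on top of its grey dot.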

      We acknowledge that even with these justifications, researchers may still differ in their preferences for Bayesian or frequentist models. To further support transparent reporting, we have also reported the traditional frequentist meta-analysis results in Supplemental Material 10 to demonstrate the robustness of our analysis, which suggested non-significant differences between Bayesian and frequentist models. We have included clearer references in the updated version of the manuscript to direct readers to the figures that report the statistics provided by traditional models.

      However, most of the methods are not described in sufficient detail for the reader to understand the proceedings. It might be evident for an expert in Bayesian statistics what a "prior sensitivity test" and a "posterior predictive check" are, but I suppose most readers would wish for a more detailed description. However, using a "Markov chain Monte Carlo (MCMC) method with the no-U-turn Hamiltonian Monte Carlo (HMC) sampler" and checking its convergence "through graphical posterior predictive checks, trace plots, and the Gelman and Rubin Diagnostic", which should then result in something resembling "a uniformly undulating wave with high overlap between chains" is surely something only rocket scientists understand. Whether this was done correctly in the present study cannot be ascertained because it is only mentioned in the methods and no corresponding results are provided. 

      We appreciate the reviewer’s concerns about accessibility and potential complexity in our descriptions of Bayesian methods. Our decision to provide a detailed account serves to enhance transparency and guide readers interested in replicating our study. We acknowledge that some terms may initially seem overwhelming. These steps, such as checking the MCMC chain convergence and robustness checks, are standard practices in Bayesian research and are analogous to “linearity”, “normality” and “equal variance” checks in frequentist analysis. In addition, Hamiltonian Monte Carlo (HMC) is the default algorithm Stan (the software we used to fit Bayesian models) uses to sample from the posterior distribution in Bayesian models. It is a type of MCMC method designed to be faster and more efficient than traditional sampling algorithms, especially for complex or high-dimensional models. We have added exemplary plots in the supplemental material S4.1-4.3 and the method section (pp. 21-22) to explain the results and interpretation of these convergence checks. We hope this will help address any concerns about methodological rigor.
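
      To make the convergence check mentioned above more concrete, the following is a minimal sketch of the classic (non-split) Gelman and Rubin diagnostic, often written R-hat. It is an illustration only and not the authors' code: the function name `gelman_rubin_rhat` and the simulated chains are assumptions, and Stan itself reports a more refined variant of this statistic. Values close to 1 suggest that the chains agree with one another.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic (non-split) potential scale reduction factor for one parameter.

    `chains` is a list of equal-length lists of posterior draws (hypothetical input).
    """
    n = len(chains[0])                              # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])    # W: mean within-chain variance
    between = n * variance(chain_means)             # B: between-chain variance
    pooled = (n - 1) / n * within + between / n     # pooled estimate of posterior variance
    return math.sqrt(pooled / within)


random.seed(1)
# Four simulated chains drawing from the same distribution: R-hat should be near 1.
mixed = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Two of four chains centred on a different value: R-hat should be well above 1.
stuck = [[random.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)]

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"well-mixed chains:   R-hat = {rhat_mixed:.3f}")
print(f"poorly mixed chains: R-hat = {rhat_stuck:.3f}")
```

      In a trace plot, the first case corresponds to the overlapping "undulating wave" described in the methods, while the second would show chains running at visibly different levels.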

      In one point the method might not be sufficiently justified. The method used to transform circular-linear r (actually, all references cited by the authors for circular statistics use r² because there can be no negative values) into "Z_r", seems partially plausible and might be correct under the H0. However, Figure 12.3 seems to show that under the alternative Hypothesis H1, the assumptions are not accurate (peak Z_r=~0.70 for r=0.65). I am therefore, based on the presented evidence, unsure whether this transformation is valid. Also, saying that Z_r=-1 represents the null hypothesis and Z_r=1 the alternative hypothesis can be misinterpreted, since Z_r=0 also represents the null hypothesis and is not half way between H0 and H1.

      First, we realized that in the titles of Figures 12.2 and 12.3, “true r = 0.35” and “true r = 0.65” should be corrected to “true r_z” (note that we use r_z instead of Z_r in the revised manuscript per your suggestion). The method we used here is to first generate an underlying population that has null (0), moderate (0.35), or large (0.65) r_z correlations, then test whether the sampling distribution drawn from these populations followed a normal distribution across varying sample sizes. Nevertheless, the reviewer correctly noticed discrepancies between the reported true r_z and its sampling distribution peak. This discrepancy arises because, when generating large population data, achieving exact values close to a strong correlation like r_z = 0.65 is unlikely. We loop through simulations to generate population data and ensure their r_z values fall within a threshold. For moderate effect sizes (e.g., r_z = 0.35), this is straightforward using a narrow range (0.34 < r_z < 0.35). However, for larger effect sizes like r_z = 0.65, a wider range (0.6 < r_z < 0.7) is required; therefore, the population we used to draw the sample sometimes has an r_z that deviates slightly from 0.65. This remains reasonable since the main point of this analysis is to ensure that a large r_z still has a normal sampling distribution, rather than to achieve exactly r_z = 0.65.

      We acknowledge that this variability of the range used was not clearly explained in supplemental material 12 and it is not accurate to report “true r_z = 0.65”. In the revised version, we have addressed this issue by adding vertical lines to each subplot to indicate the r_z of the population we used to draw samples, making it easier to check if it aligns with the sampling peak. In addition, we have revised the title to “Sampling distributions of r_z drawn from strong correlations (r_z = 0.6-0.7)”. We confirmed that population r_z and the peak of their sampling distribution remain consistent under both H0 and H1 in all sample sizes with n > 25, and we hope this explanation can fully resolve your concern.

      We agree with the reviewer that claiming r_z = -1 represents the null hypothesis is not accurate. The circlin r_z = 0 is more closely analogous to Pearson’s r = 0 since both represent the mean drawn from the population under the null hypothesis. In contrast, the mean effect size under the null will be positive in the raw circlin r, which is one of the important reasons for the transformation. To provide a more accurate interpretation, we updated Table 6 to describe the following strength levels of evidence: no effect (r < 0), null (r = 0), small (r = 0.1), moderate (r = 0.3), and large (r = 0.5). We thank the reviewer again for their valuable feedback.

      Reviewer #2 (Recommendations for the authors):

      (1) There is an extra space in the Notes of Figure 1. "SW R sharp-wave ripple.".

      We thank the reviewer for pointing this out. We have confirmed that the "extra space" is not an actual error but a result of how italicized Times New Roman font is rendered in the LaTeX format. We believe that the journal’s formatting process will resolve this issue.

      (2) In the introduction, slow oscillations (SO) are defined with a frequency of 0.16-4 Hz, sleep spindles (SP) at 8-16 Hz, and sharp-wave ripples (SWR) at 80-300 Hz. The term "fast oscillation" (FO) is first introduced with the clarification "SPs in our case." However, on page 2, the authors state, "SO-FO coupling involving SWRs, SPs, and SOs..." There seems to be a discrepancy in the definition of FO; does it consistently refer to SPs and SWRs throughout the article?

      We appreciate the reviewer’s observation regarding the potential ambiguity of the term "FO." In our manuscript, "FO" is used as a general term to describe the interaction of a "relatively faster oscillation" with a "relatively slower oscillation" in the phase-amplitude coupling mechanism, therefore it is not intended to exclusively refer to SPs or SWRs. For example, it is usually used to describe SO–SP–SWR couplings during sleep memory studies, but Theta–Alpha–Gamma couplings in wakeful memory studies. To address this confusion, we removed the phrase "SPs in our case" and explicitly use "SPs" when referring to spindles. In addition, we have replaced "fast oscillation" with "faster oscillation" to emphasize that it is used in a relative sense (p. 1), rather than to refer to a specific oscillation. Also, we only retained the term “FO” when introducing the PAC mechanism.

      (3) On page 2, the first paragraph contains the phrase: "...which occur in the precise hierarchical temporal structure of SO-FO coupling involving SWRs, SPs, and SOs ..." Since "SO-FO" refers to slow and fast oscillations, it is better to maintain the order of frequencies, suggesting it as: SOs, SPs, and SWRs.

      We sincerely thank the reviewer for their valuable suggestion. We have updated the sentence to maintain the correct order from the lowest to the highest frequencies in the revised version (p. 2).

      (4) References should be provided:

      a. "Studies using calcium imaging after SP stimulation explained the significance of the precise coupling phase for synaptic plasticity."

      b. "Electrophysiology evidence indicates that the association between memory consolidation and SO-SP coupling is influenced by a variety of behavioral and physiological factors under different conditions."

      c. "Since some studies found that fast SPs predominate in the centroparietal region, while slow SPs are more common in the frontal region, a significant amount of studies only extracted specific types of SPs from limited electrodes. Some studies even averaged all electrodes to estimate coupling..."

      This is a great point.  These have been referenced as follows:

      a. Rephrased: “Studies using calcium imaging and SP stimulation explained the significance of the precise coupling phase for synaptic plasticity.” We changed “after” to “and” to reflect that these were conducted as two separate experiments. This is a summary statement, with relevant citations provided in the following two sentences of the paragraph, including Niethard et al., 2018, and Rosanova et al., 2005. (p. 2)

      b. Included diverse sources of evidence: “Electrophysiology evidence from studies included in our meta-analysis (e.g. Denis et al., 2021; Hahn et al., 2020; Mylonas et al., 2020) and others (e.g. Bartsch et al., 2019; Muehlroth et al., 2019; Rodheim et al., 2023) reported that the association between memory consolidation and SO-SP coupling is influenced by a variety of behavioral and physiological factors under different conditions.” (p. 3)

      c. Added references and more details: “Since some studies found that fast SPs predominate in the centroparietal region, while slow SPs are more common in the frontal region, a significant amount of studies selectively extracted specific types of SPs from limited electrodes (e.g. Dehnavi et al., 2021; Perrault et al., 2019; Schreiner et al., 2021). Some studies even averaged all electrodes in their spectral and/or time-series analysis to estimate metrics of oscillations and their couplings (e.g. Denis et al., 2022; Mölle et al., 2011; Nicolas et al., 2022).” (p. 4)

      Reviewer #3 (Recommendations for the authors):

      There are a number of terms that are not clearly defined or used:

      (1) SP amplitude. Does this mean only the amplitude of coupled spindles or of spindles in general?

      This refers to the amplitude of spindles in general. We clarified this in the revised text (and see response to reviewer #1, point #1).

      (2) The definition of a small effect

      We thank the reviewer again for raising this important question. As we responded in the public review, small effect sizes are common in neuroscience and meta-analyses due to the complexity of the underlying mechanisms and the presence of numerous confounding variables and hierarchical levels. To help readers better interpret effect sizes, we changed rigid ranges to widely accepted benchmarks for effect size levels in neuroscience research: small (r=0.1), moderate (r=0.3), and large (r=0.5; Cohen, 1988). We also noted that an evidence and context-based framework will provide a more practical way to interpret the observed effect sizes compared to rigid categorizations.

      (3) Can a BF10 based on experimental evidence actually be "infinite" and a probability actually be 1.00?

      We appreciate the reviewer for highlighting this potential confusion. The formula used to calculate BF10 is P(data | H1) / P(data | H0). In the experimental setting with an informative prior, an ‘infinite’ BF10 value indicates that all posterior samples are overwhelmingly compatible with H1 given the data and assumptions (Cox et al., 2023; Heck et al., 2023; Ly et al., 2016). In such cases, the denominator P(data | H0) becomes vanishingly small, leading BF10 to converge to infinity. This scenario occurs when the probability of H1 converges to 1 (e.g., 0.9999999999…).

      It is a well-established convention in Bayesian statistics to report the Bayes factor as "infinity" in cases where the evidence is overwhelmingly strong, and BF10 exceeds the numerical limits of the computation tools to become effectively infinite. To address this ambiguity, we added a footnote in the revised version of the manuscript to clarify the interpretation of an 'infinite' BF10 . (p. 8)
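
      As a hedged illustration of how an "infinite" Bayes factor can arise numerically, the sketch below uses the identity that posterior odds equal BF10 multiplied by prior odds. It is not the authors' computation: the function name `bayes_factor_from_posterior` and the equal prior odds are assumptions made for the example. When the posterior probability of H1 is so close to 1 that it is stored as exactly 1.0 in double precision, the posterior odds, and therefore BF10, can no longer be represented as a finite number.

```python
import math


def bayes_factor_from_posterior(post_h1: float, prior_h1: float = 0.5) -> float:
    """BF10 = posterior odds / prior odds (hypothetical helper for illustration)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    if post_h1 >= 1.0:
        # 1 - post_h1 is zero in floating point, so the odds are reported as infinite.
        return math.inf
    posterior_odds = post_h1 / (1.0 - post_h1)
    return posterior_odds / prior_odds


for p in (0.75, 0.999, 1.0 - 1e-12, 1.0 - 1e-20):
    print(f"P(H1 | data) = {p!r}: BF10 = {bayes_factor_from_posterior(p)}")
```

      The last value, 1 - 1e-20, is indistinguishable from 1.0 in double precision, which is the situation the footnote on 'infinite' BF10 values is meant to cover.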

      (4) Z_r should be renamed to r_z or similar. These are not Z values (-inf..+inf), but r values (-1..1).

      We thank the reviewers for their suggestions. We agree that r_z would provide a clearer and more accurate interpretation, while z is more appropriate for referring to Fisher's z-transformed r (see point (5)). We have updated the notation accordingly.

      (5) Also, it remains quite unclear at which points in the analyses, "r" values or "Fisher's z transformed r" values are used. Assumptions of normality should only apply to the transformed values. However, the formulas for the random effects model seem to assume normality for r values.

      The correlation values were z-transformed during preprocessing to ensure normality and the correct estimation of sampling variances before running the models. The outputs were then back-transformed to raw r values only when reporting the results to help readers interpret the effect size. We mentioned this in Section 5.5.1, therefore the normality assumptions are not a concern. We have updated the notation r to z (-inf..+inf) in the formula of the random and mixed effect models in the revised version of the manuscript (p. 22).
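
      The transform-then-back-transform workflow described above can be sketched as follows. This is a minimal example using the standard Fisher z-transformation for a correlation coefficient with its usual sampling variance of 1 / (n - 3); the study values are invented, the simple inverse-variance pooling stands in for the authors' Bayesian models, and the manuscript's treatment of circular-linear correlations may differ.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher z-transform: maps r in (-1, 1) to z in (-inf, +inf)."""
    return math.atanh(r)


def fisher_z_variance(n: int) -> float:
    """Approximate sampling variance of z for a sample of size n."""
    return 1.0 / (n - 3)


def back_transform(z: float) -> float:
    """Inverse Fisher transform: maps z back to the r scale for reporting."""
    return math.tanh(z)


# Hypothetical studies as (correlation, sample size); not data from the manuscript.
studies = [(0.30, 20), (0.50, 40), (0.10, 28)]

# Pool on the z scale with inverse-variance weights, then report on the r scale.
weights = [1.0 / fisher_z_variance(n) for _, n in studies]
pooled_z = sum(w * fisher_z(r) for w, (r, _) in zip(weights, studies)) / sum(weights)
pooled_r = back_transform(pooled_z)
print(f"pooled z = {pooled_z:.3f}, back-transformed r = {pooled_r:.3f}")
```

      Normality is assumed only for the z values inside the model; r appears again only at the reporting step.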

      Language

      (1) Frequency. In the introduction, the authors use "frequency" when they mean something like the incidence of spindles.

      We agree that the term "frequency" has been used inconsistently to describe both the incidence of events and the frequency bands of oscillations. We have replaced "frequency" with "prevalence" to refer to the incidence of coupling events where applicable (p. 3).

      (2) Moderate and mediate. These two terms are usually meant to indicate two different types of causal influences.

      Thanks for the reviewer’s suggestions. We agree that "moderate" is more appropriate to describe moderators in this study since it does not directly imply causality. We have replaced mediate with moderate in relevant contexts.

      (3) "the moderate effect of memory task is relatively weak": "moderator effect" or "moderate effect"?

      We appreciate the reviewer for pointing out this mistake. We have updated the term to "moderator effect" in Section 2.2.2 (p. 6).

      (4) "in frontal regions we found a latest coupled but most precise and strong SO-fast SP coupling" Meaning?

      We thank the reviewer for bringing this concern of clarity to our attention. By 'latest,' we refer to the delayed phase of SO-fast SP coupling observed in the frontal regions compared to the central and parietal regions (see Figure 5). "Precise and strong" describes the high precision and strength of phase-locking between the SO up-state and the fast SP peak in these regions. To improve clarity, we have rephrased this sentence to: “We found that SO-fast SP coupling in the frontal region occurred at the latest phase observed across all regions, characterized by the highest precision and strength of phase-locking.” (p. 9).

      (5) Figure 5 and others contain angles in degrees and radians.

      We appreciate the reviewer pointing out this inconsistency. We have updated the manuscript and supplementary material to consistently use radians throughout.

    1. eLife Assessment

      This well-designed study combining psychophysical and fMRI data presents a valuable finding regarding how adaptation alters spatial frequency processing in the cortex. The evidence supporting the claims of the authors is solid, although inclusion of more participants and better quality of the fMRI data would have strengthened the study. The study will be of interest to cognitive and perceptual neuroscientists working on human and non-human primates.

    2. Reviewer #2 (Public review):

      The revised manuscript by Altan et al. includes some real improvements to the visualizations and explanations of the authors' thesis statement with respect to fMRI measurements of pRF sizes. In particular, the deposition of the paper's data has allowed me to probe and refine several of my previous concerns. While I still have major concerns about how the data are presented in the current draft of the manuscript, my skepticism about data quality overall has been much alleviated. Note that this review focuses almost exclusively on the fMRI data as I was satisfied with the quality of the psychophysical data and analyses in my previous review.

      Major Concerns

      (I) Statistical Analysis

      In my previous review, I raised the concern that the small sample size combined with the noisiness of the fMRI data, a lack of clarity about some of the statistics, and a lack of code/data likely combine to make this paper difficult or impossible to reproduce as it stands. The authors have since addressed several aspects of this concern, most importantly by depositing their data. However their response leaves some major questions, which I detail below.

      First of all, the authors claim in their response to the previous review that the small sample size is not an issue because large samples are not necessary to obtain "conclusive" results. They are, of course, technically correct that a small sample size can yield significant results, but the response misses the point entirely. In fact, small samples are more likely than large samples to erroneously yield a significant result (Button et al., 2013, DOI:10.1038/nrn3475), especially when noise is high. The response by the authors cites Schwarzkopf & Huang (2024) to support their methods on this front. After reading the paper, I fail to see how it is at all relevant to the manuscript at hand or the criticism raised in the previous review. Schwarzkopf & Huang propose a statistical framework that is narrowly tailored to situations where one is already certain that some phenomenon (like the adaptation of pRF size to spatial frequency) either always occurs or never occurs. Such a framework is invalid if one cannot be certain that, for example, pRF size adapts in 98% of people but not the remaining 2%. Even if the paper were relevant to the current study, the authors don't cite this paper, use its framework, or admit the assumptions it requires in the current manuscript. The observation that a small dataset can theoretically lead to significance under a set of assumptions not appropriate for the current manuscript is not a serious response to the concern that this manuscript may not be reproducible.

      To overcome this concern, the authors should provide clear descriptions of their statistical analyses and explanations of why these analyses are appropriate for the data. Ideally, source code should be published that demonstrates how the statistical tests were run on the published data. (I was unable to find any such source code in the OSF repository.) If the effects in the paper were much stronger, this level of rigor might not be strictly necessary, but the data currently give the impression of being right near the boundary of significance, and the manuscript's analyses need to reflect that. The descriptions in the text were helpful, but I was only able to approximately reproduce the authors' analyses based on these descriptions alone. Specifically, I attempted to reproduce the Mood's median tests described in the second paragraph of section 3.2 after filtering the data based on the criteria described in the final paragraph of section 3.1. I found that 7/8 (V1), 7/8 (V2), 5/8 (V3), 5/8 (V4), and 4/8 (V3A) subjects passed the median test when accounting for the (40) multiple comparisons. These results are reasonably close to those reported in the manuscript and might just differ based on the multiple comparisons strategy used (which I did not find documented in the manuscript). However, Mood's median test does not test the direction of the difference (just whether the medians are different), so I additionally required that the median sigma of the high-adapted pRFs be greater than that of the low-adapted pRFs. Surprisingly, in V1 and V3, one subject each (not the same subject) failed this part of the test, meaning that they had significant differences between conditions but in the wrong direction. This leaves 6/8 (V1), 7/8 (V2), 4/8 (V3), 5/8 (V4), and 4/8 (V3A) subjects that appear to support the authors' conclusions.
      As the authors mention, however, this set of analyses runs the risk of comparing different parts of cortex, so I also performed Wilcoxon signed-rank tests on the (paired) vertex data for which both the high-adapted and low-adapted conditions passed all the authors' stated thresholds. These results largely agreed with the median test (only 5/8 subjects significant in V1 but 6/8 in V3A, other areas the same, though the two tests did not always agree which subjects had significant differences). These analyses were of course performed by a reviewer with a reviewer's time commitment to the project and shouldn't be considered a replacement for the authors' expertise with their own data. If the authors think that I have made a mistake in these calculations, then the best way to refute them would be to publish the source code they used to threshold the data and to perform the same tests.
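
      The per-subject procedure described in the two paragraphs above (a Mood's median test followed by a check on the direction of the difference) can be sketched as below. This is a minimal illustration and neither the reviewer's nor the authors' code: the helper names, the simulated pRF sizes, the absence of a continuity correction, and the use of a Bonferroni correction over 40 comparisons are all assumptions, since the correction strategy is not documented.

```python
import math
import random
from statistics import median


def moods_median_test(a, b):
    """Two-sample Mood's median test (chi-square, 1 df, no continuity correction).

    Values equal to the grand median are counted as 'not above' it.
    Returns (chi2, p_value).
    """
    grand = median(a + b)
    above = [sum(x > grand for x in a), sum(x > grand for x in b)]
    below = [len(a) - above[0], len(b) - above[1]]
    total = len(a) + len(b)
    if sum(above) == 0 or sum(below) == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = 0.0
    for row in (above, below):
        for observed, col_total in zip(row, (len(a), len(b))):
            expected = sum(row) * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def supports_hypothesis(high, low, alpha=0.05, n_comparisons=40):
    """Significant after Bonferroni correction AND median(high) > median(low)."""
    _, p = moods_median_test(high, low)
    return p < alpha / n_comparisons and median(high) > median(low)


random.seed(0)
# Simulated, skewed, zero-bounded "pRF sizes" for one hypothetical subject and area.
sigma_high = [random.lognormvariate(0.3, 0.5) for _ in range(500)]
sigma_low = [random.lognormvariate(0.0, 0.5) for _ in range(500)]

chi2, p = moods_median_test(sigma_high, sigma_low)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
print("supports hypothesis:", supports_hypothesis(sigma_high, sigma_low))
print("wrong direction:", supports_hypothesis(sigma_low, sigma_high))
```

      Swapping the two samples leaves the p-value unchanged but fails the direction check, which is the situation reported for one subject each in V1 and V3.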

      Setting aside the precise values of the relevant tests, we should also consider whether 5 of 8 subjects showing a significant effect (as they report for V3, for example) should count as significant evidence of the effect. If one assumes, as a null hypothesis, that there is no difference between the two conditions in V3 and that all differences are purely noise, then a binomial test across subjects would be appropriate. Even if 6 of 8 subjects show the effect, however (and ignoring multiple comparisons), the p-value of a one-sided binomial test is not significant at the 0.05 level (7 of 8 subjects is barely significant). Of course, a more rigorous way to approach this question could be something like an ANOVA, and the authors use an ANOVA analysis of the medians in the paragraph following their use of Mood's median test. However, ANOVA assumes normality, and the authors state in the previous paragraph that they employed Mood's median test because "the distribution of the pRF sizes is zero-bounded and highly skewed" so this choice does not make sense. The Central Limit Theorem might be applied to the medians in theory, but with only 8 subjects and with an underlying distribution of pRF sizes that is non-negative, the relevant data will almost certainly not be normally distributed. These tests should probably be something like a Kruskal-Wallis ANOVA on ranks.
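
      The binomial argument above can be checked with a few lines of code. The sketch below assumes, for illustration, that under the null hypothesis each subject is equally likely to fall on either side (a success probability of 0.5); the function name `binomial_one_sided_p` is hypothetical.

```python
from math import comb


def binomial_one_sided_p(successes: int, n: int, p_null: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, k) * p_null**k * (1.0 - p_null) ** (n - k)
        for k in range(successes, n + 1)
    )


for k in (5, 6, 7, 8):
    print(f"{k} of 8 subjects: one-sided p = {binomial_one_sided_p(k, 8):.4f}")
```

      Under this assumption, 6 of 8 gives p = 37/256 (about 0.14) and 7 of 8 gives p = 9/256 (about 0.035), consistent with the statement that only 7 of 8 is barely significant at the 0.05 level.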

      All of the above said, my intuition about the data is currently that there are significant changes to the adapted pRF size in V2. I am not currently convinced that the effects in other visual areas are significant, and I suspect that the paper would be improved if the authors abandoned their claims that areas other than V2 show a substantial effect. Importantly, I don't think this causes the paper to lose any impact; in fact, if the authors agree with my assessments, then the paper might be improved by focusing on V2. Specifically, the authors already discuss psychophysical work related to the perception of texture on pages 18 and 19 and link it to their results. V2 is also implicated in the perception of texture (see, for example, Freeman et al., 2013; DOI:10.1038/nn.3402; Ziemba et al., 2016, DOI:10.1073/pnas.1510847113; Ziemba et al., 2019; DOI:10.1523/JNEUROSCI.1743-19.2019) and so would naturally be the part of the visual cortex where one might predict that spatial frequency adaptation would have a strong effect on pRF size. This neatly connects the psychophysical and imaging sides of this project and could make a very nice story out of the present work.

      (II) Visualizations

      The manuscript's visual evidence regarding the pRF data also remains fairly weak (but I found the pRF size comparisons in the OSF repository and Figure S1 to be better evidence; more in the next paragraph). The first line of the Results section still states, "A visual inspection on the pRF size maps in Figure 4c clearly shows a difference between the two conditions, which is evident in all regions." As I mentioned in my previous review, I don't agree with this claim (specifically, that it is clear). My impression when I look at these plots is of similarity between the maps, and, where there is dissimilarity, of likely artifacts. For example, the splotch of cortex near the upper vertical meridian (ventral boundary) of V1 that shows up in yellow in the upper plot but not the lower plot also has a weirdly high eccentricity and a polar angle near the opposite vertical meridian: almost certainly not the actual tuning of that patch of cortex. If this is the clearest example subject in the dataset, then the effect looks to me to be very small and inconsistently distributed across the visual areas. That said, I'm not convinced that the problem here is the data; rather, I think it's just very hard to communicate a small difference in parameter tuning across a visual area using this kind of side-by-side figure. I think that Figure S2, though noisy (as pRF maps typically are), is more convincing than Figure 4c, personally. For what it's worth, when looking at the data myself, I found that plotting log(𝜎(H) / 𝜎(L)), which will be unstable when noise causes 𝜎(H) or 𝜎(L) to approach zero, was less useful than plotting (𝜎(H) - 𝜎(L)) / (𝜎(H) + 𝜎(L)). This latter quantity will be constrained between -1 and 1 and shows something like a proportional change in the pRF size (and thus should be more comparable across eccentricity).
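
      The difference between the two plotting quantities mentioned above can be seen in a short sketch. The function names and the example sigma values are illustrative assumptions, not values from the dataset; sigma values are taken to be strictly positive, as they are after the stated filtering.

```python
import math


def log_ratio(sigma_high: float, sigma_low: float) -> float:
    """log(sigma(H) / sigma(L)): unbounded as either sigma approaches zero."""
    return math.log(sigma_high / sigma_low)


def normalized_difference(sigma_high: float, sigma_low: float) -> float:
    """(sigma(H) - sigma(L)) / (sigma(H) + sigma(L)): stays within -1 and 1."""
    return (sigma_high - sigma_low) / (sigma_high + sigma_low)


# Hypothetical (sigma_high, sigma_low) pairs in degrees of visual angle;
# the last two mimic noise driving one estimate toward zero.
pairs = [(2.0, 1.0), (1.0, 1.0), (1.0, 0.001), (0.001, 1.0)]
for high, low in pairs:
    print(
        f"sigma(H) = {high:<6} sigma(L) = {low:<6} "
        f"log ratio = {log_ratio(high, low):+7.3f}   "
        f"normalized difference = {normalized_difference(high, low):+6.3f}"
    )
```

      For the near-zero cases the log ratio reaches roughly plus or minus 6.9 and would dominate a color scale, whereas the normalized difference saturates just inside plus or minus 1.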

      In my opinion, the inclusion of the pRF size comparison plots in the OSF repository and Figure S1 made a stronger case than any of the plots of the cortical surface. I would suggest putting these on log-log plots since the distribution of pRF size (like eccentricity) is approximately exponential on the cortical surface. As-is, it's clear in many plots that there is a big splotch of data in the compressed lower left corner, but it's hard to get a sense for how these should be compared to the upper right expanse of the plots. It is frequently hard to tell whether there is a greater concentration of points above or below the line of equality in the lower left corner as well, and this is fairly central to the paper's claims. My intuition is that the upper right is showing relatively little data (maybe 10%?), but these data are very emphasized by the current plots.
      The authors might even want to consider putting a collection of these scatter-plots (or maybe just subject 007, or possibly all subjects' pRFs on a single scatter-plot) in the main paper and using these visualizations to provide intuitive support for the main conclusions about the fMRI data (where the manuscript currently uses Figure 4c for visual intuition).

      Minor Comments

      (1) Although eLife does not strictly require it, I would like to see more of the authors' code deposited along with the data (especially the code for calculating the statistics that were mentioned above). I do appreciate the simulation code that the authors added in the latest submission (largely added in response to my criticism in the previous reviews), and I'll admit that it helped me understand where the authors were coming from, but it also contains a bug and thus makes a good example of why I'd like to see more of the authors' code. If we set aside the scientific question of whether the simulation is representative of an fMRI voxel (more in Minor Comment 5, below), Figures 1A and the "AdaptaionEffectSimulated.png" file from the repository (https://osf.io/d5agf) imply that only small RFs were excluded in the high-adapted condition and only large RFs were excluded in the low-adapted condition. However, the script provided (SimlatePrfAdaptation.m: https://osf.io/u4d2h) does not do this. Lines 7 and 8 of the script set the small and large cutoffs at the 30th and 70th percentiles, respectively, then exclude everything greater than the 30th percentile in the "Large RFs adapted out" condition (lines 19-21) and exclude anything less than the 70th percentile in the "Small RFs adapted out" condition (lines 27-29). So the figures imply that they are representing 70% of the data but they are in fact representing only the most extreme 30% of the data. (Moreover, I was unable to run the script because it contains hard-coded paths to code in someone's home directory.) Just to be clear, these kinds of bugs are quite common in scientific code, and this bug was almost certainly an honest mistake.
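
      The discrepancy described in this comment can be illustrated with a small sketch. It is a hypothetical reconstruction in Python, not a port of the authors' MATLAB script: the variable names, the simulated RF sizes, and the nearest-rank percentile helper are all assumptions. It contrasts the exclusion that the figures imply with the exclusion that the review says the script performs.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


random.seed(0)
rf_sizes = [random.lognormvariate(0.0, 0.5) for _ in range(1000)]  # simulated RFs
small_cut = percentile(rf_sizes, 30)
large_cut = percentile(rf_sizes, 70)

# What the figures imply: only the adapted-out tail is removed (about 70% kept).
implied_large_out = [s for s in rf_sizes if s <= large_cut]
implied_small_out = [s for s in rf_sizes if s >= small_cut]

# What the review describes the script as doing (about 30% kept):
# "large RFs adapted out" drops everything above the 30th percentile, and
# "small RFs adapted out" drops everything below the 70th percentile.
described_large_out = [s for s in rf_sizes if s <= small_cut]
described_small_out = [s for s in rf_sizes if s >= large_cut]

for label, kept in [
    ("implied, large RFs out", implied_large_out),
    ("implied, small RFs out", implied_small_out),
    ("described, large RFs out", described_large_out),
    ("described, small RFs out", described_small_out),
]:
    print(f"{label:<26} keeps {len(kept) / len(rf_sizes):.1%} of RFs")
```

      Comparing the retained fractions in this way is a quick sanity check that could be added to any such simulation before plotting.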

      (2) I also noticed that the individual subject scatter-plots of high versus low adapted pRF sizes on the OSF seem to occasionally have a large concentration of values on the x=0 and y=0 axes. This isn't really a big deal in the plots, but the manuscript states that "we denoised the pRF data to remove artifactual vertices where at least one of the following criteria was met: (1) sigma values were equal to or less than zero ..." so I would encourage the authors to double-check that the rest of their analysis code was run with the stated filtering.

      (3) The manuscript also says that the median test was performed "on the raw pRF size values". I'm not really sure what the "raw" means here. Does this refer to pRF sizes without thresholding applied?

      (4) The eccentricity data are much clearer now with the additional comments from the authors and the full set of maps; my concerns about this point have been met.

      (5) Regarding the simulation of RFs in a voxel (setting aside the bug), I will admit both to hoping for a more biologically-grounded situation and to nonetheless understanding where the authors are coming from based on the provided example. What I mean by biologically-grounded: something like, assume a 2.5-mm isotropic voxel aligned to the surface of V1 at 4° of eccentricity; the voxel would span X to Y degrees of eccentricity, and we predict Z neurons with RFs in this voxel with a distribution of RF sizes at that eccentricity from [reference], etc., eventually demonstrating a plausible pRF size change commensurate to the paper's measurements. I do think that a simulation like this would make the paper more compelling, but I'll acknowledge that it probably isn't necessary and might be beyond the scope here.

    3. Reviewer #3 (Public review):

      This is a well-designed study examining an important, surprisingly understudied question: how does adaptation affect spatial frequency processing in human visual cortex? Using a combination of psychophysics and neuroimaging, the authors test the hypothesis that spatial frequency tuning is shifted to higher or lower frequencies, depending on preadapted state (low or high s.f. adaptation). They do so by first validating the phenomenon psychophysically, showing that adapting to 0.5 cpd stimuli causes an increase in perceived s.f., and 3.5 cpd causes a relative decrease in perceived s.f. They then port the same stimuli to a neuroimaging study, in which population receptive fields are measured under high and low spatial frequency adaptation states. They find that adaptation changes pRF size, depending on adaptation state: adapting to high s.f. led to broader overall pRF sizes across early visual cortex, whereas adapting to low s.f. led to smaller overall pRF sizes. Finally, the authors carry out a control experiment to psychophysically rule out the possibility that the perceived contrast change with adaptation may have given rise to these imaging results (doesn't appear to be the case). All in all, I found this to be a good manuscript: the writing is taut, and the study is well designed.

    4. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      We thank the reviewer for their careful evaluation and positive comments. 

      Adaptation paradigm

      “why is it necessary to use an *adaptation* paradigm to study the link between SF tuning and pRF estimation? Couldn't you just use pRF bar stimuli with varying SFs?” 

      We thank the reviewer for this question. First, by using adaptation we can infer the correspondence between the perceptual and the neuronal adaptation to spatial frequency. We couldn’t draw any inference about perception if we only varied the SF inside the bar. More importantly, while changing the SF inside the bar might help drive different neuronal populations, this is not guaranteed. As we touched on in our discussion, responses obtained from the mapping stimuli are dominated by complex processing rather than the stimulus properties alone. A considerable proportion of the retinotopic mapping signal is probably simply due to spatial attention to the bar (de Haas & Schwarzkopf, 2018; Hughes et al., 2019). So, adaptation is a more targeted way to manipulate different neuronal populations.

      Other pRF estimates: polar angle and eccentricity 

      We included an additional plot showing the polar angle for both adapter conditions (Figure S4), as well as participant-wise scatter plots comparing raw pRF size, eccentricity, and polar angle between two adapter conditions (available in shared data repository). In line with previous work on the reliability of pRF estimates (van Dijk, de Haas, Moutsiana, & Schwarzkopf, 2016; Senden, Reithler, Gijsen, & Goebel, 2014), both polar angle and eccentricity maps are very stable between the two adaptation conditions. 

      Variability in pRF size change

      As the reviewer pointed out, the pRF size changes show some variability across eccentricities, and ROIs (Figure 5A and 5B). It is likely that the variability could relate to the varying tuning properties of different regions and eccentricities for the specific SF we used in the mapping stimulus. So one reason V2 is most consistent could be that the stimulus is best matched for the tuning there. However, what factors contribute to this variability is an interesting question that will require further study. 

      Other recommendations

      We have addressed the other recommendations of the reviewer with one exception. The reviewer suggested we should comment on the perceived contrast decrease after SF adaptation (as seen in Figure 6B) in the main text. However, since we refer the readers to the supplementary analyses (Supplementary section S8) where we discuss this in detail, we chose to keep this aspect unchanged to avoid overcomplicating the main text.

      Reviewer #2 (Public Review):

      We thank the reviewer for their comments - we improved how we report key findings which we hope will clarify matters raised by the reviewer.

      RF positions in a voxel

      The reviewer’s comments suggest that they may have misunderstood the diagram (Figure 1A) illustrating the theoretical basis of the adaptation effect, likely due to us inadvertently putting the small RFs in the middle of the illustration. We changed this figure to avoid such confusion.

      Theoretical explanation of adaptation effect

      The reviewer’s explanation for how adaptation should affect the size of pRF averaging across individual RFs is incorrect. When selecting RFs from a fixed range of semi-uniformly distributed positions (as in an fMRI voxel), the average position of RFs (corresponding to pRF position) is naturally near the center of this range. The average size (corresponding to pRF size) reflects the visual field coverage of these individual RFs. This aggregate visual field coverage thus also reflects the individual sizes. When large RFs have been adapted out, this means the visual field coverage at the boundaries is sparser, and the aggregate pRF is therefore smaller. The opposite happens when adapting out the contribution of small RFs. We demonstrate this with a simple simulation at this OSF link: https://osf.io/ebnky/. The pRF sizes of the simulated voxels illustrate that the adaptation effect should manifest precisely as we hypothesized.
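      As a rough illustration of the argument above, the following sketch models a voxel as a weighted sum of unit-area Gaussian RFs in one dimension. It is not the simulation hosted at the OSF link; the RF sizes, the number of RFs, the position range, and the residual gain of 0.2 for adapted RFs are all assumptions chosen for illustration.

```python
import math
import random

SMALL, LARGE = 0.3, 1.0   # hypothetical RF sizes (sigma), arbitrary units
ADAPTED_GAIN = 0.2        # hypothetical residual response of adapted RFs

def aggregate_prf(rfs, gain):
    """Return (position, size) of the gain-weighted sum of Gaussian RFs.

    For a mixture of Gaussians, the mean is the weighted mean of the centers
    and the variance is the weighted mean of (sigma^2 + center^2) minus mean^2.
    """
    weights = [gain(sigma) for _, sigma in rfs]
    total = sum(weights)
    mean = sum(w * c for w, (c, _) in zip(weights, rfs)) / total
    second = sum(w * (s * s + c * c) for w, (c, s) in zip(weights, rfs)) / total
    return mean, math.sqrt(second - mean * mean)

rng = random.Random(0)
# RF centers spread uniformly over a fixed range, sizes drawn independently.
rfs = [(rng.uniform(-1.0, 1.0), rng.choice([SMALL, LARGE])) for _ in range(2000)]

pos_base, size_base = aggregate_prf(rfs, lambda s: 1.0)
pos_high, size_adapt_large = aggregate_prf(
    rfs, lambda s: ADAPTED_GAIN if s == LARGE else 1.0)   # large RFs adapted out
pos_low, size_adapt_small = aggregate_prf(
    rfs, lambda s: ADAPTED_GAIN if s == SMALL else 1.0)   # small RFs adapted out

print(f"baseline size:         {size_base:.3f}")
print(f"large RFs adapted out: {size_adapt_large:.3f}")
print(f"small RFs adapted out: {size_adapt_small:.3f}")
```

      Under these assumptions the aggregate position stays near the center of the range in all three cases, while the aggregate size shrinks when large RFs are down-weighted and grows when small RFs are down-weighted.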

      Figure S2

      It is not actually possible to compare R² between regions by looking at Figure S2 because it shows the pRF size change, not R². Therefore, the arguments Reviewer #2 made based on their interpretation of the figure are not valid. Just as the reviewer expected, V1 is one of the brain regions with good pRF model fits. We included normalized and raw R² maps to make this more obvious to the readers.

      V1 appeared essentially empty in that plot primarily due to the sigma threshold we selected, which was unintentionally more conservative than those applied in our analyses and other figures. We apologize for this mistake. We corrected it in the revised version by including a plot with the appropriate sigma threshold.

      Thresholding details 

      Thresholding information was included in our original manuscript; however, we included more information in the figure captions to make it more obvious.

      2D plots replaced histograms

      We thank the reviewer for this suggestion. The original manuscript contained histograms showing the distribution of pRF size for both adaptation conditions for each participant and visual area (Figure S1). However, we agree that 2D plots better communicate the difference in pRF parameters between conditions. So we moved the histogram plots to the online repository, and included scatter plots with a color scheme revealing the 2D kernel density.

      We chose to implement 2D kernel density in scatter plots to display the distribution of individual pRF sizes transparently.

      (proportional) pRF size-change map 

      The reviewer requests pRF size difference maps. Figure S2 in fact demonstrates the proportional difference between the pRF sizes of the two adaptation conditions. Instead of simply taking the difference, we believe showing the proportional change map is more sensible because overall pRF size varies considerably between visual regions. We explained this more clearly in our revision. 
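      A minimal sketch of why a proportional measure is easier to compare across regions than a raw difference. The exact normalization used for Figure S2 is not stated here, so dividing by the mean of the two conditions is an assumption, and the pRF sizes below are made-up values.

```python
def proportional_change(size_a, size_b):
    """Change from condition A to B as a fraction of their mean size.

    Normalizing by the mean of both conditions is an assumption made for
    this sketch, not necessarily the definition used for the published map.
    """
    return (size_b - size_a) / ((size_a + size_b) / 2)

# Hypothetical pRF sizes (deg) in a region with small pRFs and one with large pRFs.
small_region = (1.0, 1.2)
large_region = (4.0, 4.8)

raw_small = small_region[1] - small_region[0]
raw_large = large_region[1] - large_region[0]
prop_small = proportional_change(*small_region)
prop_large = proportional_change(*large_region)

print(f"raw difference:      {raw_small:.2f} vs {raw_large:.2f}")
print(f"proportional change: {prop_small:.3f} vs {prop_large:.3f}")
```

      In this example the raw differences differ fourfold even though both regions change by the same relative amount, which is the situation a proportional map is meant to handle.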

      pRF eccentricity plot 

      “I suspect that the difference in PRF size across voxels correlates very strongly with the difference in eccentricity across voxels.”

      Our original manuscript already contained a supplementary plot (Figure S4 B, now Figure S4 C) comparing the eccentricity between adapter conditions, showing no notable shift in eccentricities except in V3A - but that is a small region and the results are generally more variable. In addition, we included participant-wise plots in the online repository, presenting raw comparisons of pRF size, eccentricity, and polar angle estimates between adaptation conditions. These 2D plots provide further evidence that the SF adapters resulted in a change in pRF size, while eccentricity and polar angle estimates did not show consistent differences.  

      To the reviewer’s point, even if there were an appreciable shift in eccentricity between conditions (as they suggest may have happened for the example participant we showed), this does not mean that the pRF size effect is “due [...] to shifts in eccentricity.” Parameters in a complex multi-dimensional model like the pRF are not independent. There is no way of knowing whether a change in one parameter is causally linked with a change in another. We can only report the parameter estimates the model produces. 

      In fact, it is conceivable that adaptation causes both: changes in pRF size and eccentricity. If more central or peripheral RFs tend to have smaller or larger RFs, respectively, then adapting out one part of the distribution will shift the average accordingly. However, as we already established, we find no compelling evidence that pRF eccentricity changes dramatically due to adaptation, while pRF size does.

      Other recommendations

      We have addressed the other recommendations of the reviewer, except for the y-axis alignment. Different regions in the visual hierarchy naturally vary substantially in pRF size. Aligning axes would therefore lead to incorrect visual inferences that (1) the absolute pRF sizes between ROIs are comparable, and (2) higher regions show the effect most prominently. However, for clarity, we now note this scale difference in our figure captions. Finally, as mentioned earlier, we also present a proportional pRF size change map to enable comparison of the adaptation effect between regions.

      Reviewer #3 (Public Review):

      We thank the reviewer for their comments.

      pRF model

      Top-up adapters were not modelled in our analyses because they are shared events in all TRs, critically also including the “blank” periods, providing a constant source of signal. Therefore modelling them separately cannot meaningfully change the results. However, the reviewer makes a good suggestion that it would be useful to mention this in the manuscript, so we added a discussion of this point in Section 3.1.5.

      pRF size vs eccentricity

      We added a plot showing pRF size in the two adaptation conditions (in addition to the pRF size difference) as a function of eccentricity.

      Correlation with behavioral effect

      In the original manuscript, we pointed out why the correlation between the magnitude of the behavioral effect and the pRF size change is not an appropriate test for our data. First, the reviewer is right that a larger sample size would be needed to reliably detect such a between-subject correlation. More importantly, as per our recruitment criteria for the fMRI experiment, we did not scan participants showing weak perceptual effects. This limits the variability in the perceptual effect and makes correlation inapplicable.

    1. eLife Assessment

      This work presents potentially important findings suggesting that a combination of transcranial stimulation approaches applied for a short period could improve memory performance. However, the evidence supporting the conclusions is currently incomplete. In particular, the claims relating to the specific neural mechanisms and anatomical sites of action underlying effects were viewed as overstated in the current version. The results potentially have implications for non-invasive enhancement of cognitive functions.

    2. Review #1 (Public Review):

      Summary:

      The authors employ a combination of repetitive transcranial magnetic stimulation (intermittent theta burst-iTBS) and transcranial alternating current stimulation (gamma tACS) as an approach aimed to improve memory in a face/name/profession task.

      Strengths:

      The paper has many strengths. The approach of stimulating the human brain non-invasively is potentially impactful because it could lead to a host of interesting applications. The current study aims to evaluate one such exciting application. The paper contains an unusual combination of noninvasive stimulation and brain imaging data, and includes independent replication samples.

      Weaknesses:

      (1) It remains unclear how this stimulation protocol is proposed to enhance memory. Memories are believed to be stored by precise inputs to specific neurons and highly tuned changes in synaptic strengths. It remains unclear whether proposed neural activity generated by the stimulation reflects the activation of specific memories or generally increased activity across all classes of neurons.

      (2) The claim that effects directly involve the precuneus lacks strong support. The measurements shown in Figure 3 appear to be weak (i.e., Figure 3A top and bottom look similar, and Figure 3C left and right look similar). The figure appears to show a more global brain pattern rather than effects that are limited to the precuneus. Related to this, it would perhaps be useful to show the different positions of the stimulation apparatus. This could perhaps show that the position of the stimulation matters and could perhaps illustrate a range of distances over which position of the stimulation matters.

      (3) Behavioral results showing an effect on memory would substantiate claims that the stimulation approach produces significant changes in brain activity. However, placebo effects can be extremely powerful and useful, and this should probably be mentioned. Also, in the behavioral results that are currently presented, there are several concerns:

      a) There does not appear to be a significant effect on the STMB task.

      b) The FNAT task is minimally described in the supplementary material. Experimental details that would help the reader understand what was done are not described. Experimental details are missing for: the size of the images, the duration of the image presentation, the degree of image repetition, how long the participants studied the images, whether the names and occupations were different, genders of the faces, and whether the same participant saw different faces across the different stimulation conditions. Regarding the latter point, if the same participant saw the same faces across the different stimulation conditions, then there could be memory effects across different conditions that would need to be included in the statistical analyses. If participants saw different faces across the different stimulus conditions, then it would be useful to show that the difficulty was the same across the different stimuli.

      c) Also, if I understand FNAT correctly, the task is based on just 12 presentations, and each point in Figure 2A represents a different participant. How the performance of individual participants changed across the conditions is unclear with the information provided. Lines joining performance measurements across conditions for each participant would be useful in this regard. Because there are only 12 faces, the results are quantized in multiples of 100/12 % in Figure 3A. While I do not doubt that the authors did their homework in terms of the statistical analyses, it seems as though these 12 measurements do not correspond to a large effect size. For example, in Figure 3A for the immediate condition (total), it seems that, on average, the participants may remember one more face/name/occupation.

      d) Block effects. If I understand correctly, the experiments were conducted in blocks. This is potentially problematic. An example study that articulates potential problems associated with block designs is described in Li et al (TPAMI 2021, https://ieeexplore.ieee.org/document/9264220). It is unclear if potential problems associated with block designs were taken into consideration.

      e) In the FNAT portion of the paper, some results are statistically significant, while others are not. The interpretation of this is unclear. In Figure 3A, it seems as though the authors claim that iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham. The interpretation of such a result is unclear. Results are also unclear when separated by name and occupation. There is only one condition that is statistically significant in Figure 3A in the name condition, and no significant results in the occupation condition. In short, the statistical analyses, and accompanying results that support the authors’ claims, should be explained more clearly.
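      The quantization raised in point c) can be made concrete with a few lines of arithmetic. The only input taken from the review is the count of 12 face/name/occupation items; everything else follows from it.

```python
N_ITEMS = 12  # number of faces in the FNAT, as stated in point c)

# Every attainable percent-correct score for one participant in one condition.
possible_scores = [100 * k / N_ITEMS for k in range(N_ITEMS + 1)]

# Size of one step: the change in score caused by remembering one more item.
step = 100 / N_ITEMS

print([round(s, 1) for s in possible_scores])
print(f"one additional item = {step:.2f} percentage points")
```

      A between-condition difference of roughly 8 percentage points therefore corresponds to about one additional item recalled per participant.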

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript "Dual transcranial electromagnetic stimulation of the precuneus-hippocampus network boosts human long-term memory" by Borghi and colleagues provides evidence that the combination of intermittent theta burst TMS stimulation and gamma transcranial alternating current stimulation (γtACS) targeting the precuneus increases long-term associative memory in healthy subjects compared to iTBS alone and sham conditions. Using a rich dataset of TMS-EEG and resting-state functional connectivity (rs-FC) maps and structural MRI data, the authors also provide evidence that dual stimulation increased gamma oscillations and functional connectivity between the precuneus and hippocampus. Enhanced memory performance was linked to increased gamma oscillatory activity and connectivity through white matter tracts.

      Strengths:

      The combination of personalized repetitive TMS (iTBS) and gamma tACS is a novel approach to targeting the precuneus, and thereby, connected memory-related regions to enhance long-term associative memory. The authors leverage an existing neural mechanism engaged in memory binding, theta-gamma coupling, by applying TMS at theta burst patterns and tACS at gamma frequencies to enhance gamma oscillations. The authors conducted a thorough study that suggests that simultaneous iTBS and gamma tACS could be a powerful approach for enhancing long-term associative memory. The paper was well-written, clear, and concise.

      Weaknesses:

      (1) The study did not include a condition where γtACS was applied alone. This was likely because previous work indicated that a single 3-minute γtACS did not produce significant effects, but this limits the ability to isolate the specific contribution of γtACS in the context of this target and memory function.

      (2) The authors applied stimulation for 3 minutes, which seems to be based on prior tACS protocols. It would be helpful to present some rationale for both the duration and timing relative to the learning phase of the memory task. Would you expect additional stimulation prior to recall to benefit long-term associative memory?

      (3) How was the burst frequency of theta iTBS and gamma frequency of tACS chosen? Were these also personalized to subjects' endogenous theta and gamma oscillations? If not, were increases in gamma oscillations specific to patients' endogenous gamma oscillation frequencies or the tACS frequency?

      (4) The authors do a thorough job of analyzing the increase in gamma oscillations in the precuneus through TMS-EEG; however, the authors may also analyze whether theta oscillations were also enhanced through this protocol due to the iTBS potentially targeting theta oscillations. This may also be more robust than gamma oscillations increases since gamma oscillations detected on the scalp are very low amplitude and susceptible to noise and may reflect activity from multiple overlapping sources, making precise localization difficult without advanced techniques.

      (5) Figure 4: Why are connectivity values pre-stimulation for the iTBS and sham tACS stimulation condition so much higher than the dual stimulation? We would expect baseline values to be more similar.

      (6) Figure 2: How are total association scores significantly different between stimulation conditions, but individual name and occupation associations are not? Further clarification of how the total FNAT score is calculated would be helpful.

    4. Reviewer #3 (Public review):

      Summary:

      Borghi and colleagues present results from 4 experiments aimed at investigating the effects of dual γtACS and iTBS stimulation of the precuneus on behavioral and neural markers of memory formation. In their first experiment (n = 20), they found that a 3-minute offline (i.e., prior to task completion) stimulation that combines both techniques leads to superior memory recall performance in an associative memory task immediately after learning associations between pictures of faces, names, and occupation, as well as after a 15-minute delay, compared to iTBS alone (+ tACS sham) or no stimulation (sham for both iTBS and tACS). Performance in a second task probing short-term memory was unaffected by the stimulation condition. In a second experiment (n = 10), they show that these effects persist over 24 hours and up to a full week after initial stimulation. A third (n = 14) and fourth (n = 16) experiment were conducted to investigate the neural effects of the stimulation protocol. The authors report that, once again, only combined iTBS and γtACS increase gamma oscillatory activity and neural excitability (as measured by concurrent TMS-EEG) specific to the stimulated area at the precuneus compared to a control region, as well as precuneus-hippocampus functional connectivity (measured by resting-state MRI), which seemed to be associated with structural white matter integrity of the bilateral middle longitudinal fasciculus (measured by DTI).

      Strengths:

      Combining non-invasive brain stimulation techniques is a novel, potentially very powerful method to maximize the effects of these kinds of interventions that are usually well-tolerated and thus accepted by patients and healthy participants. It is also very impressive that the stimulation-induced improvements in memory performance resulted from a short (3 min) intervention protocol. If the effects reported here turn out to be as clinically meaningful and generalizable across populations as implied, this approach could represent a promising avenue for the treatment of impaired memory functions in many conditions.

      Methodologically, this study is expertly done! I don't see any serious issues with the technical setup in any of the experiments (with the only caveat that I am not an expert in fMRI functional connectivity measures and DTI). It is also very commendable that the authors conceptually replicated the behavioral effects of experiment 1 in experiment 2 and then conducted two additional experiments to probe the neural mechanisms associated with these effects. This certainly increases the value of the study and the confidence in the results considerably.

      The authors used a within-subject approach in their experiments, which increases statistical power and allows for stronger inferences about the tested effects. They also individualized stimulation locations and intensities, which should further optimize the signal-to-noise ratio.

      Weaknesses:

      I want to state clearly that I think the strengths of this study far outweigh the concerns I have. I still list some points that I think should be clarified by the authors or taken into account by readers when interpreting the presented findings.

      I think one of the major weaknesses of this study is the overall low sample size in all of the experiments (between n = 10 and n = 20). This is, as I mentioned when discussing the strengths of the study, partly mitigated by the within-subject design and individualized stimulation parameters. The authors mention that they performed a power analysis but this analysis seemed to be based on electrophysiological readouts similar to those obtained in experiment 3. It is thus unclear whether the other experiments were sufficiently powered to reliably detect the behavioral effects of interest. That being said, the authors do report significant effects, so they were per definition powered to find those. However, the effect sizes reported for their main findings are all relatively large and it is known that significant findings from small samples may represent inflated effect sizes, which may hamper the generalizability of the current results. Ideally, the authors would replicate their main findings in a larger sample. Alternatively, I think running a sensitivity analysis to estimate the smallest effect the authors could have detected with a power of 80% could be very informative for readers to contextualize the findings. At the very least, however, I think it would be necessary to address this point as a potential limitation in the discussion of the paper.
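      A sensitivity analysis of the kind suggested above could be sketched as follows. This is a simplified illustration rather than a substitute for a proper analysis: it treats each experiment as a single two-sided paired contrast instead of the repeated-measures ANOVAs actually used, and it relies on a normal approximation, which understates the required effect size at small n compared with an exact noncentral-t calculation. The sample sizes are those reported for experiments 1 to 4.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_dz(n, alpha=0.05, power=0.80):
    """Smallest standardized paired effect (d_z) detectable with given power.

    Normal approximation: d_z = (z_{1 - alpha/2} + z_{power}) / sqrt(n).
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)

sample_sizes = {
    "experiment 1": 20,
    "experiment 2": 10,
    "experiment 3": 14,
    "experiment 4": 16,
}

min_effects = {name: min_detectable_dz(n) for name, n in sample_sizes.items()}

for name, d in min_effects.items():
    print(f"{name} (n = {sample_sizes[name]}): d_z >= {d:.2f}")
```

      Even under this optimistic approximation, only medium-to-large within-subject effects would be detectable with 80% power at these sample sizes.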

      It seems that the statistical analysis approach differed slightly between studies. In experiment 1, the authors followed up significant effects of their ANOVAs with Bonferroni-adjusted post-hoc tests, whereas it seems that in experiment 2, those post-hoc tests were "exploratory", which may suggest those were uncorrected. In experiment 3, the authors use one-tailed t-tests to follow up their ANOVAs. Given some of the reported p-values, these choices suggest that some of the comparisons might have failed to reach significance if properly corrected. This is not a critical issue per se, as the important test in all these cases is the initial ANOVA, but non-significant (corrected) post-hoc tests might be another indicator of an underpowered experiment. My assumptions here might be wrong, but even then, I would ask the authors to be more transparent about the reasons for their choices or provide additional justification. Finally, the authors sometimes report exact p-values whereas other times they simply say p < .05. I would ask them to be consistent and recommend using exact p-values for every result where p >= .001.

      While the authors went to great lengths trying to probe the neural changes likely associated with the memory improvement after stimulation, it is impossible from their data to causally relate the findings from experiments 3 and 4 to the behavioral effects in experiments 1 and 2. This is acknowledged by the authors and there are good methodological reasons for why TMS-EEG and fMRI had to be collected in separate experiments, but it is still worth pointing out to readers that this limits inferences about how exactly dual iTBS and γtACS of the precuneus modulate learning and memory.

      There were no stimulation-related performance differences in the short-term memory task used in experiments 1 and 2. The authors argue that this demonstrates that the intervention specifically targeted long-term associative memory formation. While this is certainly possible, the STM task was a spatial memory task, whereas the LTM task relied (primarily) on verbal material. It is thus also possible that the stimulation effects were specific to a stimulus domain instead of memory type. In other words, could it be possible that the stimulation might have affected STM performance if the task taxed verbal STM instead? This is of course impossible to know without an additional experiment, but the authors could mention this possibility when discussing their findings regarding the lack of change in the STM task.

      While the authors discuss the potential neural mechanisms by which the combined stimulation conditions might have helped memory formation, the psychological processes are somewhat neglected. For example, do the authors think the stimulation primarily improves the encoding of new information or does it also improve consolidation processes? Interestingly, the beneficial effect of dual iTBS and γtACS on recall performance was very stable across all time points tested in experiments 1 and 2, as was the performance in the other conditions. Do the authors have any explanation as to why there seems to be no further forgetting of information over time in either condition when even at immediate recall, accuracy is below 50%? Further, participants started learning the associations of the FNAT immediately after the stimulation protocol was administered. What would happen if learning started with a delay? In other words, do the authors think there is an ideal time window post-stimulation in which memory formation is enhanced? If so, this might limit the usability of this procedure in real-life applications.

    5. Author Response:

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses:

      (1) It remains unclear how this stimulation protocol is proposed to enhance memory. Memories are believed to be stored by precise inputs to specific neurons and highly tuned changes in synaptic strengths. It remains unclear whether proposed neural activity generated by the stimulation reflects the activation of specific memories or generally increased activity across all classes of neurons.

      Thank you for raising the important issue of the actual neurophysiological effects of non-invasive brain stimulation. Unfortunately, invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints, while studies on cadavers or rodents would not fully resolve our question. Indeed, the authors of the cited study (Mihály Vöröslakos et al., Nature Communications, 2018) highlight the impossibility of drawing definitive conclusions about the exact voltage required in the in-vivo human brain due to significant differences between rats and humans, as well as the in-vivo human brain and cadavers due to alterations in electrical conductivity that occur in postmortem tissue.

      We acknowledge that further exploration of this aspect would be highly valuable, and we agree that it is worth discussing both as a technical limitation and as a potential direction for future research; we therefore modify the manuscript accordingly. However, to address the challenge of in vivo recordings, we conducted Experiments 3 and 4, which respectively examined the neurophysiological and connectivity changes induced by the stimulation in a non-invasive manner. The observed changes in brain oscillatory activity (increased gamma oscillatory activity), cortical excitability (enhanced posteromedial parietal cortex reactivity), and brain connectivity (strengthened connections between the precuneus and hippocampi) provided evidence of the effects of our non-invasive brain stimulation protocol, further supporting the behavioral data.

      Additionally, we carefully considered the issue of stimulation distribution and, in response, performed a biophysical modeling analysis and E-field calculation using the parameters employed in our study (see Supplementary Materials).

      (2) The claim that effects directly involve the precuneus lacks strong support. The measurements shown in Figure 3 appear to be weak (i.e., Figure 3A top and bottom look similar, and Figure 3C left and right look similar). The figure appears to show a more global brain pattern rather than effects that are limited to the precuneus. Related to this, it would perhaps be useful to show the different positions of the stimulation apparatus. This could perhaps show that the position of the stimulation matters and could perhaps illustrate a range of distances over which position of the stimulation matters.

      Thank you for your feedback. We will improve the clarity of the manuscript to better address this important aspect. Our assumption that the precuneus plays a key role in the observed effects is based on several factors:

      (1) The non-invasive stimulation protocol was applied to an individually identified precuneus for each participant. Given existing evidence on TMS propagation, we can reasonably assume that the precuneus was at least a mediator of the observed effects (Ridding & Rothwell, Nature Reviews Neuroscience 2007). For further details about target identification and TMS and tACS propagation, please refer to the MRI data acquisition section in the main text and Biophysical modeling and E-field calculation section in the supplementary materials.

      (2) To investigate the effects of the neuromodulation protocol on cortical responses, we conducted a whole-brain analysis using multiple paired t-tests comparing each data point between different experimental conditions. To minimize the type I error rate, data were permuted with the Monte Carlo approach and significant p-values were corrected with the false discovery rate method (see the Methods section for details). The results identified the posterior-medial parietal areas as the only regions showing significant differences across conditions.

      (3) To control for potential generalized effects, we included a control condition in which TMS-EEG recordings were performed over the left parietal cortex (adjacent to the precuneus). This condition did not yield any significant results, reinforcing the cortical specificity of the observed effects.
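      The general logic of point (2), a paired comparison at each data point, a Monte Carlo permutation null, and false discovery rate correction, can be sketched as below. This is a generic illustration and not the analysis pipeline used in the study: the sign-flip permutation scheme, the Benjamini-Hochberg procedure, the number of permutations, and the difference scores (10 hypothetical participants at 3 hypothetical data points) are all assumptions.

```python
import math
import random
import statistics

def paired_t(diffs):
    """t statistic of paired differences (one-sample t-test against zero)."""
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

def sign_flip_p(diffs, n_perm, rng):
    """Two-sided Monte Carlo p-value from random sign flips of the differences."""
    observed = abs(paired_t(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(paired_t(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, q=0.05):
    """Return booleans marking which p-values survive FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank  # largest rank meeting the step-up criterion
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            significant[i] = True
    return significant

# Made-up condition differences (condition A minus condition B) per participant.
data_points = [
    [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.3, 1.0, 1.6],       # consistent effect
    [0.3, -0.5, 0.2, -0.1, 0.4, -0.3, 0.1, -0.2, 0.5, -0.4],  # no effect
    [0.6, -0.2, -0.4, 0.1, 0.3, -0.6, 0.2, -0.1, 0.4, -0.5],  # no effect
]

rng = random.Random(1)
p_values = [sign_flip_p(d, n_perm=2000, rng=rng) for d in data_points]
significant = benjamini_hochberg(p_values, q=0.05)

for p, sig in zip(p_values, significant):
    print(f"p = {p:.4f}  significant after FDR: {sig}")
```

      In a whole-brain analysis the same steps would run over every channel, time point, or frequency bin, so the number of tests entering the FDR correction is far larger than in this toy example.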

      However, as stated in the Discussion, we do not claim that precuneus activity alone accounts for the observed effects. As shown in Experiment 4, stimulation led to connectivity changes between the precuneus and hippocampus, a network widely recognized as a key contributor to long-term memory formation (Bliss & Collingridge, Nature 1993). These connectivity changes suggest that precuneus stimulation triggered a ripple effect extending beyond the stimulation site, engaging the broader precuneus-hippocampus network.

      Regarding Figure 3A, it represents the overall expression of oscillatory activity detected by TMS-EEG. Since each frequency band has a different optimal scaling, the figure reflects a graphical compromise. A more detailed representation of the significant results is provided in Figure 3B. The effect sizes for gamma oscillatory activity in the delta T1 and T2 conditions were 0.52 and 0.50, respectively, which correspond to a medium effect based on Cohen’s d interpretation.

      (3) Behavioral results showing an effect on memory would substantiate claims that the stimulation approach produces significant changes in brain activity. However, placebo effects can be extremely powerful and useful, and this should probably be mentioned. Also, in the behavioral results that are currently presented, there are several concerns:

      a) There does not appear to be a significant effect on the STMB task.

      b) The FNAT task is minimally described in the supplementary material. Experimental details that would help the reader understand what was done are not described. Experimental details are missing for: the size of the images, the duration of the image presentation, the degree of image repetition, how long the participants studied the images, whether the names and occupations were different, genders of the faces, and whether the same participant saw different faces across the different stimulation conditions. Regarding the latter point, if the same participant saw the same faces across the different stimulation conditions, then there could be memory effects across different conditions that would need to be included in the statistical analyses. If participants saw different faces across the different stimulus conditions, then it would be useful to show that the difficulty was the same across the different stimuli.

      Thank you for pointing out the gaps in our description of the FNAT task. We will add all the required information to the manuscript.

      In the meantime, we provide the answers to your questions here. The images measured 19 × 15 cm. They were presented for 8 seconds each in the learning phase and in the immediate recall, while in the delayed recall they were shown (after the face recognition phase) until the subject answered. The learning phase, in which the name and occupation were shown together with the faces, lasted around 2 minutes including the instructions. We used a different set of stimuli for each stimulation condition, for a total of 3 parallel task forms balanced across conditions and session order. Each parallel form comprised 6 male and 6 female faces; for each sex there were 2 young adults (around 30 years old), 2 middle-aged adults (around 50 years old), and 2 older adults (around 70 years old). Before the experiments, we ran a pilot study to ensure there were no differences between the parallel forms of the task. We can provide the task with its parallel forms upon request. The chance level in the immediate and delayed recall is not quantifiable, since participants had to freely recall the name and the occupation without multiple-choice options. In the recognition phase, the chance level was around 33% (since there were 3 possible answers).

      c) Also, if I understand FNAT correctly, the task is based on just 12 presentations, and each point in Figure 2A represents a different participant. How the performance of individual participants changed across the conditions is unclear with the information provided. Lines joining performance measurements across conditions for each participant would be useful in this regard. Because there are only 12 faces, the results are quantized in multiples of 100/12 % in Figure 3A. While I do not doubt that the authors did their homework in terms of the statistical analyses, it seems as though these 12 measurements do not correspond to a large effect size. For example, in Figure 3A for the immediate condition (total), it seems that, on average, the participants may remember one more face/name/occupation.

      We will add another graph to the manuscript with lines connecting each participant's performance. Unfortunately, we were not able to incorporate it in the box-and-whisker plot.

      We apologize for the lack of clarity in the description of the FNAT. As you correctly pointed out, we used the percentage based on the single association between face, name and occupation (12 in total). However, each association consisted of three items, resulting in a total of 36 items to learn and associate – we will make it more explicit in the manuscript.

      In the example you mentioned, participants were, on average, able to recall three more items compared to the other conditions. While this difference may not seem striking at first glance, it is important to consider that we assessed memory performance after a single, three-minute stimulation session. Similar effects are typically observed only after multiple stimulation sessions (Koch et al., NeuroImage, 2018; Grover et al., Nature Neuroscience, 2022).

      d) Block effects. If I understand correctly, the experiments were conducted in blocks. This is potentially problematic. An example study that articulates potential problems associated with block designs is described in Li et al (TPAMI 2021, https://ieeexplore.ieee.org/document/9264220). It is unclear if potential problems associated with block designs were taken into consideration.

      Thank you for the interesting reference. According to this paper, in a block design, EEG or fMRI recordings are performed in response to different stimuli of a given class presented in succession. If this is the case, it does not correspond to our experimental design where both TMS-EEG and fMRI were conducted in a resting state on different days according to the different stimulation conditions.

      e) In the FNAT portion of the paper, some results are statistically significant, while others are not. The interpretation of this is unclear. In Figure 3A, it seems as though the authors claim that iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham. The interpretation of such a result is unclear. Results are also unclear when separated by name and occupation. There is only one condition that is statistically significant in Figure 3A in the name condition, and no significant results in the occupation condition. In short, the statistical analyses, and accompanying results that support the authors’ claims, should be explained more clearly.

      Thank you again for your feedback. We will work on making the large amount of data we reported easier to interpret.

      Hoping to have thoroughly addressed your initial concerns in our previous responses, we now move on to your observations regarding the behavioral results, assuming you were referring to Figure 2A. The main finding of this study is the improvement in long-term memory performance, specifically the ability to correctly recall the association between face, name, and occupation (total FNAT), which was significantly enhanced in both Experiments 1 and 2. However, we also aimed to explore the individual contributions of name and occupation separately to gain a deeper understanding of the results. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall. We understand that this may have caused some confusion. Therefore we will clarify this in the manuscript and consider presenting the name and occupation in a separate plot.

      Regarding the stimulation conditions, your concerns about the performance pattern (iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham) are understandable. However, this new protocol was developed precisely in response to the variability observed in behavioral outcomes following non-invasive brain stimulation, particularly when used to modulate memory functions (Corp et al., 2020; Pabst et al., 2022). As discussed in the manuscript, it is intended as a boost to conventional non-invasive brain stimulation protocols, leveraging the mechanisms outlined in the Discussion section.

      Reviewer #2 (Public review):

      Weaknesses:

      (1) The study did not include a condition where γtACS was applied alone. This was likely because a previous work indicated that a single 3-minute γtACS did not produce significant effects, but this limits the ability to isolate the specific contribution of γtACS in the context of this target and memory function.

      Thank you for your comments. As you pointed out, we did not include a condition where γtACS was applied alone. This decision was based on the findings of Guerra et al. (Brain Stimulation 2018), who investigated the same protocol and reported no aftereffects. Given the substantial burden of the experimental design on patients and our primary goal of demonstrating an enhancement of effects compared to the standalone iTBS protocol, we decided to leave out this condition. However, we agree that investigating the effects of γtACS alone is an interesting and relevant aspect worthy of further exploration. In line with these observations, we will expand the discussion on this point in the study’s limitations section.

      (2) The authors applied stimulation for 3 minutes, which seems to be based on prior tACS protocols. It would be helpful to present some rationale for both the duration and timing relative to the learning phase of the memory task. Would you expect additional stimulation prior to recall to benefit long-term associative memory?

      Thank you for your comment and for raising this interesting point. As you correctly noted, the protocol we used has a duration of three minutes, a choice based on previous studies demonstrating that, from a neurophysiological point of view, it is more effective than single stimulation. Specifically, these studies have shown that the combined stimulation enhanced gamma-band oscillations and increased cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) are all associated with encoding processes, we decided to apply the co-stimulation immediately before the encoding phase to enhance it.

      Regarding the question of whether stimulation could also benefit recall, the answer is yes. We can speculate that repeating the stimulation before recall might provide an additional boost. This is supported by evidence showing that both the precuneus and gamma oscillations are involved in recall processes (Flanagin et al., Cerebral Cortex 2023; Griffiths et al., Trends in Neurosciences 2023). Furthermore, previous research suggests that reinstating the same brain state as during encoding can enhance recall performance (Javadi et al., The Journal of Neuroscience 2017).

      We will expand the study rationale and include these considerations in the future directions section.

      (3) How was the burst frequency of theta iTBS and gamma frequency of tACS chosen? Were these also personalized to subjects' endogenous theta and gamma oscillations? If not, were increases in gamma oscillations specific to patients' endogenous gamma oscillation frequencies or the tACS frequency?

      The stimulation protocol was chosen based on previous studies (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). The gamma tACS sinusoidal waveform was set at 70 Hz, while iTBS consisted of ten bursts of three pulses at 50 Hz lasting 2 s, repeated every 10 s with an 8 s pause between consecutive trains, for a total of 600 pulses over 190 s (see iTBS+γtACS neuromodulation protocol section). In particular, the theta iTBS was inspired by protocols used in animal models to elicit LTP in the hippocampus (Huang et al., Neuron 2005). Consequently, neither the theta iTBS nor the gamma frequency of tACS was personalized. The increase in gamma oscillations was measured relative to each patient’s baseline and did not correspond to the administered tACS frequency.
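
      The iTBS parameters quoted above can be checked with simple arithmetic. The following is a minimal sketch using only the numbers given in this response; the function and variable names are hypothetical, and reading the 190 s figure as the onset time of the final train is an assumption on our part, not something stated in the protocol description.

```python
def itbs_summary(pulses_per_burst=3, bursts_per_train=10,
                 train_duration_s=2, pause_s=8, total_pulses=600):
    """Derive train count and timing from the stated iTBS parameters."""
    pulses_per_train = pulses_per_burst * bursts_per_train
    n_trains = total_pulses // pulses_per_train
    # One cycle is a 2 s train followed by an 8 s pause.
    cycle_s = train_duration_s + pause_s
    # Bursts per second within a train, i.e. the theta-range burst rate.
    burst_rate_hz = bursts_per_train / train_duration_s
    # Assumed interpretation: onset of the last train, counted from the first.
    last_train_onset_s = (n_trains - 1) * cycle_s
    return {
        "pulses_per_train": pulses_per_train,
        "n_trains": n_trains,
        "cycle_s": cycle_s,
        "burst_rate_hz": burst_rate_hz,
        "last_train_onset_s": last_train_onset_s,
    }


summary = itbs_summary()
print(summary)
```

      With the stated values, 600 pulses correspond to 20 trains of 30 pulses each, delivered at a burst rate of 5 Hz within each train.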

      (4) The authors do a thorough job of analyzing the increase in gamma oscillations in the precuneus through TMS-EEG; however, the authors may also analyze whether theta oscillations were also enhanced through this protocol due to the iTBS potentially targeting theta oscillations. This may also be more robust than gamma oscillations increases since gamma oscillations detected on the scalp are very low amplitude and susceptible to noise and may reflect activity from multiple overlapping sources, making precise localization difficult without advanced techniques.

      Thank you for the suggestion. We analyzed theta oscillations and found no changes.

      (5) Figure 4: Why are connectivity values pre-stimulation for the iTBS and sham tACS stimulation condition so much higher than the dual stimulation? We would expect baseline values to be more similar.

      We acknowledge that the pre-stimulation connectivity values for the iTBS and sham tACS conditions appear higher than those for the dual stimulation condition. However, as noted in our statistical analyses, there were no significant differences at baseline between conditions (p-FDR= 0.3514), suggesting that any apparent discrepancy is due to natural variability rather than systematic bias. One potential explanation for these differences is individual variability in baseline connectivity measures, which can fluctuate due to factors such as intrinsic neural dynamics, participant state, or measurement noise. Despite these variations, our statistical approach ensures that any observed post-stimulation effects are not confounded by pre-existing differences.

      (6) Figure 2: How are total association scores significantly different between stimulation conditions, but individual name and occupation associations are not? Further clarification of how the total FNAT score is calculated would be helpful.

      We apologize for any lack of clarity. The total FNAT score reflects the ability to correctly recall all the information associated with a person—specifically, the correct pairing of the face, name, and occupation. Participants received one point for each triplet they accurately recalled. The scores were then converted into percentages, as detailed in the Face-Name Associative Task Construction and Scoring section in the supplementary materials.
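
      To make the scoring rule concrete, here is a minimal sketch under the assumption that each of the 12 faces yields one pair of outcomes (name correct, occupation correct) and that a triplet counts only when both are correct. The data layout and function name are hypothetical; the authoritative procedure is the one described in the supplementary materials.

```python
def fnat_scores(trials):
    """Score FNAT-style trials.

    trials: list of (name_correct, occupation_correct) booleans, one per face.
    Returns percentages for the total (full triplet), name, and occupation scores.
    """
    n = len(trials)
    name_points = sum(1 for name_ok, occ_ok in trials if name_ok)
    occupation_points = sum(1 for name_ok, occ_ok in trials if occ_ok)
    # One point per triplet only when both name and occupation are recalled.
    total_points = sum(1 for name_ok, occ_ok in trials if name_ok and occ_ok)

    def to_percent(points):
        return 100.0 * points / n

    return {
        "total": to_percent(total_points),
        "name": to_percent(name_points),
        "occupation": to_percent(occupation_points),
    }


# Hypothetical participant: 3 full triplets, 3 names only, 3 occupations only, 3 misses.
example = ([(True, True)] * 3 + [(True, False)] * 3
           + [(False, True)] * 3 + [(False, False)] * 3)
print(fnat_scores(example))
```

      Because the total score requires both components to be correct for the same face, it can differ between conditions even when the separate name and occupation percentages do not. With 12 faces, each score moves in steps of 100/12, about 8.33 percentage points.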

      Total FNAT was the primary outcome measure. However, we also analyzed name and occupation recall separately to better understand their individual contributions. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall.

      We acknowledge that this distinction may have caused some confusion. To improve clarity, we will revise the manuscript accordingly and consider presenting name and occupation recall in separate plots.

      Reviewer #3 (Public review):

      Weaknesses:

      I want to state clearly that I think the strengths of this study far outweigh the concerns I have. I still list some points that I think should be clarified by the authors or taken into account by readers when interpreting the presented findings.

      I think one of the major weaknesses of this study is the overall low sample size in all of the experiments (between n = 10 and n = 20). This is, as I mentioned when discussing the strengths of the study, partly mitigated by the within-subject design and individualized stimulation parameters. The authors mention that they performed a power analysis but this analysis seemed to be based on electrophysiological readouts similar to those obtained in experiment 3. It is thus unclear whether the other experiments were sufficiently powered to reliably detect the behavioral effects of interest. That being said, the authors do report significant effects, so they were per definition powered to find those. However, the effect sizes reported for their main findings are all relatively large and it is known that significant findings from small samples may represent inflated effect sizes, which may hamper the generalizability of the current results. Ideally, the authors would replicate their main findings in a larger sample. Alternatively, I think running a sensitivity analysis to estimate the smallest effect the authors could have detected with a power of 80% could be very informative for readers to contextualize the findings. At the very least, however, I think it would be necessary to address this point as a potential limitation in the discussion of the paper.

      Thank you for the observation. As you mentioned, our power analysis was based on our previous study investigating the same neuromodulation protocol with a corresponding experimental design. The relatively small sample could be considered a limitation of the study, which we will add to the discussion. A fundamental future step will be to replicate these results in a larger population. In the meantime, to strengthen our results, we performed the sensitivity analysis you suggested.

      In detail, we performed a sensitivity analysis for repeated-measures ANOVA with α = 0.05 and power (1−β) = 0.80 with no sphericity correction. For experiment 1, a sensitivity analysis with 1 group and 3 measurements showed a minimal detectable effect size of f = 0.524 with 20 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η² = 0.274, corresponding to f = 0.614; the ANOVA on FNAT delayed performance revealed an effect size of η² = 0.236, corresponding to f = 0.556. For experiment 2, a sensitivity analysis for total FNAT immediate performance (1 group and 3 measurements) showed a minimal detectable effect size of f = 0.797 with 10 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η² = 0.448, corresponding to f = 0.901. The sensitivity analysis for total FNAT delayed performance (1 group and 6 measurements) showed a minimal detectable effect size of f = 0.378 with 10 participants. In our paper, the ANOVA on total FNAT delayed performance revealed an effect size of η² = 0.484, corresponding to f = 0.968. Thus, the sensitivity analysis showed that both experiments were powered enough to detect the minimum effect size computed in the power analysis. We have now added this information to the manuscript and we thank the reviewer for her/his suggestion.
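
      The correspondence between η² and Cohen's f quoted above follows the standard conversion f = sqrt(η² / (1 − η²)). The short sketch below reproduces the four conversions and compares each with the minimal detectable f from the sensitivity analysis. It assumes the reported η² values are the quantities this conversion applies to; the function name and the labels are hypothetical.

```python
import math


def eta_squared_to_f(eta_sq):
    """Convert eta squared to Cohen's f via f = sqrt(eta_sq / (1 - eta_sq))."""
    if not 0.0 <= eta_sq < 1.0:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1.0 - eta_sq))


# (label, reported eta squared, minimal detectable f from the sensitivity analysis)
reported = [
    ("Experiment 1, immediate", 0.274, 0.524),
    ("Experiment 1, delayed", 0.236, 0.524),
    ("Experiment 2, immediate", 0.448, 0.797),
    ("Experiment 2, delayed", 0.484, 0.378),
]

for label, eta_sq, min_f in reported:
    f = eta_squared_to_f(eta_sq)
    status = "above" if f > min_f else "below"
    print(f"{label}: f = {f:.3f} ({status} minimal detectable f = {min_f})")
```

      In all four cases the observed f exceeds the minimal detectable effect size, which is the comparison the paragraph above relies on.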

      It seems that the statistical analysis approach differed slightly between studies. In experiment 1, the authors followed up significant effects of their ANOVAs by Bonferroni-adjusted post-hoc tests whereas it seems that in experiment 2, those post-hoc tests were "exploratory", which may suggest those were uncorrected. In experiment 3, the authors use one-tailed t-tests to follow up their ANOVAs. Given some of the reported p-values, these choices suggest that some of the comparisons might have failed to reach significance if properly corrected. This is not a critical issue per se, as the important test in all these cases is the initial ANOVA but non-significant (corrected) post-hoc tests might be another indicator of an underpowered experiment. My assumptions here might be wrong, but even then, I would ask the authors to be more transparent about the reasons for their choices or provide additional justification. Finally, the authors sometimes report exact p-values whereas other times they simply say p < .05. I would ask them to be consistent and recommend using exact p-values for every result where p >= .001.

      Thank you again for the suggestions. Your observations are correct: we used a slightly different statistical approach depending on our hypotheses. Here are the details:

      In experiment 1, we used a repeated-measures ANOVA with one factor, “stimulation condition” (iTBS+γtACS; iTBS+sham-tACS; sham-iTBS+sham-tACS). Following the significant effect of this factor, we performed post-hoc analyses with Bonferroni correction.

      In experiment 2, we used a repeated-measures ANOVA with two factors, “stimulation condition” and “time”. As expected, we observed a significant effect of condition, confirming the result of experiment 1, but not of time. This means that the neuromodulatory effect was present regardless of the time point. However, to explore whether the effects of stimulation condition were present at each time point, we performed exploratory t-tests with no correction for multiple comparisons, since this was only an exploratory analysis.

      In experiment 3, we used the same approach as in experiment 1. However, since we had a specific hypothesis on the direction of the effect already observed in our previous study, i.e. an increase in spectral power (Maiella et al., Scientific Reports 2022), our tests were one-tailed.

      For the p-values, we will correct the manuscript reporting the exact values for every result.

      While the authors went to great lengths trying to probe the neural changes likely associated with the memory improvement after stimulation, it is impossible from their data to causally relate the findings from experiments 3 and 4 to the behavioral effects in experiments 1 and 2. This is acknowledged by the authors and there are good methodological reasons for why TMS-EEG and fMRI had to be collected in separate experiments, but it is still worth pointing out to readers that this limits inferences about how exactly dual iTBS and γtACS of the precuneus modulate learning and memory.

      Thank you for your comment. We fully agree with your observation, which is why this aspect has been considered in the study's limitations. To address your concern, we will further emphasize the fact that our findings do not allow precise inferences regarding the specific mechanisms by which dual iTBS and γtACS of the precuneus modulate learning and memory.

      There were no stimulation-related performance differences in the short-term memory task used in experiments 1 and 2. The authors argue that this demonstrates that the intervention specifically targeted long-term associative memory formation. While this is certainly possible, the STM task was a spatial memory task, whereas the LTM task relied (primarily) on verbal material. It is thus also possible that the stimulation effects were specific to a stimulus domain instead of memory type. In other words, could it be possible that the stimulation might have affected STM performance if the task taxed verbal STM instead? This is of course impossible to know without an additional experiment, but the authors could mention this possibility when discussing their findings regarding the lack of change in the STM task.

      Thank you for your insightful observation. We argue that the intervention primarily targeted long-term associative memory formation, as our findings demonstrated effects only on FNAT. However, as you correctly pointed out, we cannot exclude the possibility that the stimulation may also influence short-term verbal associative memory. We will acknowledge this potential effect when discussing the absence of significant findings in the STM task.

      While the authors discuss the potential neural mechanisms by which the combined stimulation conditions might have helped memory formation, the psychological processes are somewhat neglected. For example, do the authors think the stimulation primarily improves the encoding of new information or does it also improve consolidation processes? Interestingly, the beneficial effect of dual iTBS and γtACS on recall performance was very stable across all time points tested in experiments 1 and 2, as was the performance in the other conditions. Do the authors have any explanation as to why there seems to be no further forgetting of information over time in either condition when even at immediate recall, accuracy is below 50%? Further, participants started learning the associations of the FNAT immediately after the stimulation protocol was administered. What would happen if learning started with a delay? In other words, do the authors think there is an ideal time window post-stimulation in which memory formation is enhanced? If so, this might limit the usability of this procedure in real-life applications.

      Thank you for your comment and for raising these important points.

      We hypothesized that co-stimulation would enhance encoding processes. Previous studies have shown that co-stimulation can enhance gamma-band oscillations and increase cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) have all been associated with encoding processes, we decided to apply co-stimulation before the encoding phase, to boost it.

      We applied the co-stimulation immediately before the learning phase to maximize its potential effects. While we observed a significant increase in gamma oscillatory activity lasting up to 20 minutes, we cannot determine whether the behavioral effects we observed would have been the same with a co-stimulation applied 20 minutes before learning. Based on existing literature, a reduction in the efficacy of co-stimulation over time could be expected (Huang et al., Neuron 2005; Thut et al., Brain Topography 2009). However, we hypothesize that multiple stimulation sessions might provide an additional boost, helping to sustain the effects over time (Thut et al., Brain Topography 2009; Koch et al., NeuroImage 2018; Koch et al., Brain 2022).

      Regarding the absence of further forgetting in both stimulation conditions, we think that the clinical and demographic characteristics of the sample (i.e., young and healthy subjects) explain the near absence of forgetting after one week.

    1. eLife Assessment

      This useful study employs optogenetics, genetically-encoded dopamine and serotonin sensors, and patch-clamp electrophysiology to investigate modulations of neurotransmitter release between striatal dopamine and serotonin neurons - a topic of interest to neuroscientists studying the basal ganglia. The results suggest that the dopamine and serotonin systems operate largely in parallel, with the activation of serotonin neurons resulting in a small, transient dopamine release. The authors suggest that this interaction occurs via glutamate release in the ventral tegmental area, findings that are closely related to previous work. Some conclusions are incomplete, requiring larger sample sizes and additional controls.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, Liu et al use optogenetics and genetically encoded neuromodulator sensors to test the extent to which dopamine neuron stimulation produces striatal serotonin release, and vice versa. The study is timely given growing interest in dopamine/serotonin interactions and in the context of recent work showing bidirectional and dynamic regulation of striatal dopamine by another neuromodulator, acetylcholine. The authors find that striatal dopamine and serotonin afferents function largely independently, with dopamine neuron stimulation producing no striatal serotonin release and serotonin neuron stimulation producing minimal striatal dopamine release. This work will inform future work seeking to dissect the contributions of striatal dopamine, serotonin, and their interactions to various motivated behaviors. While the paper's main conclusions are adequately supported (see Strengths), additional controls and experiments would significantly broaden the paper's impact (see Weaknesses). Finally, this draft of the work is poorly presented with numerous errors, omissions, and inconsistencies evident throughout the text and the figures that should be addressed.

      Strengths:

      The study employs optogenetic stimulation simultaneously with fiber photometry recording of dopamine or serotonin release measured with genetically encoded sensors. These methods are state-of-the-art, offering tighter temporal control compared to pharmacological methods for manipulating dopamine and serotonin and improved selectivity over techniques like electrochemistry and microdialysis used to record neuromodulator release in previous studies on the subject. As a result, the paper's main conclusions are well supported.

      Weaknesses:

      (1) The electrophysiology experiments in Figure 3 are only tangentially related to the focus of the study, and their findings are almost entirely irrelevant to the paper's main conclusions. The results of these experiments are also not novel. Glutamate corelease from 5HT neurons has been previously shown, including in the OFC and VTA (Ren et al., 2018, Cell; McDevitt et al., 2014, Cell Rep; Liu et al., 2014, Neuron; and others). The authors should explain more clearly what they think these data add to the manuscript and/or consider removing them altogether.

      (2) Related to the point above, as far as I can tell, the only value the electrophysiology data add is to suggest that perhaps activation of serotonin neurons may drive minimal striatal dopamine release via glutamate corelease in the VTA. The evidence provided in this version of the manuscript is insufficient to support that claim, but the manuscript would be significantly strengthened if the authors tested this hypothesis more directly. One way to do that could be to stimulate serotonin axons in the striatum (as opposed to the serotonin cell bodies) and record striatal dopamine release. A complementary anatomical approach would be to use retrograde tracing to test whether the DR 5HT neurons projecting to the striatum are the same or different from the VTA projecting population.

      (3) The findings would be strengthened by the addition of a fluorophore-only control group lacking opsin expression in all experiments in Figures 1 and 2.

      (4) The experiment of stimulating serotonin neurons and recording serotonin release in the NAc was not performed. It would be useful to be able to compare the magnitudes of evoked serotonin release in these two striatal regions, though it is not central to the main claims of the paper.

      (5) The interpretation of the results from Figure 2 is described inconsistently throughout the manuscript. The title implies there is significant crosstalk between the dopamine and serotonin systems. The abstract calls the crosstalk "transient", which is a description of its temporal dynamics, not its magnitude. Then the introduction, figures, and discussion all suggest the crosstalk is minimal. I suggest the authors describe the main findings - minimal crosstalk between the dopamine and serotonin systems - clearly and consistently in the title, abstract, and main text.

    3. Reviewer #2 (Public review):

      Summary:

      This brief communication aims to clarify interactions between the dopamine (DA) and serotonin (5HT) systems of mice. The authors use optogenetic stimulation of DA neurons in the VTA or of 5HT neurons in the DRN, while monitoring the fluorescence of DA and 5HT sensors in the nucleus accumbens (NAc) and dorsal striatum (DS) using fiber photometry. The authors report on a small release of DA in the NAc following DRN stimulation, which they attribute to glutamate co-release onto VTA DA neurons using slice electrophysiology. The authors also report on cocaine-induced 5-HT release in the striatum.

      Strengths:

      This is a topic well worth studying.

      Weaknesses:

      In its current form, this is an incomplete and underpowered study that does little to clarify the complicated relationship that exists between DA and 5HT in the mammalian brain under physiological conditions or during cocaine use.

    4. Reviewer #3 (Public review):

      The authors suggest that the small release of DA may be due to a release of glutamate from DRN 5-HT neurons to the VTA that stimulates weakly and in a transient fashion the VTA DA neurons, which in the end, produce a transient and small release of DA in the NAc.

      Their findings give more information on the previously reported complex and only partially understood crosstalk between 5-HT and DA in the NAc.

      I only have some minor concerns about the manuscript:

      (1) In Figure 2F, there is a missing curve for 5-HT in NAc. Besides, the legend shows n=2, making it difficult to perform statistical analysis with that data.

      (2) In Figure 3, the use of NBQX/AP5 is shown, but it is not mentioned either in the methodology or in the discussion. What is the meaning of those results?

      (3) Line 98 compares results from two different places of stimulation. The results are related to stimulation in the VTA, but the comparison indicates that the stimulation was made in the DRN.

      (4) If 5-HT release does not occur in the NAc, the abstract needs to state precisely that 5-HT is released in the dorsal striatum (DS) but not in the NAc (line 19).

      (5) Be consistent with the way you mention the 5-HT neurons. For example, in lines 106 to 119, "SERT neurons" is used, whereas "5-HT neurons" was used previously.

      (6) There are several points of confusion when referring to the figures, making the text difficult to follow because the text explains something that is not shown in the figure cited.

    5. Author response:

      We appreciate the reviewers’ insightful feedback and propose to undertake an extensive revision of the manuscript to strengthen our findings and underscore the significance of this work. We remain convinced that our study offers critical insights into the largely independent dopamine and serotonin neural circuits. Nevertheless, we concur that substantial revisions are warranted, as the current organization may not be ideal to showcase the central findings. In particular, we will increase the number of animals to address data variability and enhance the reproducibility of the observed effects. We also recognize the need to perform additional control experiments and to include complementary anatomical tracing studies. Moreover, we will reformat the manuscript and conduct additional analyses to emphasize that evoked dopamine and serotonin release originate from distinct loci with minimal crosstalk. To address all of these points thoroughly, we estimate that a 12-month revision period will be required.

    1. eLife Assessment

      This valuable paper introduces the Dyadic Interaction Platform, an experimental setup that enables researchers to study real-time social interactions between two participants in a controlled environment while maintaining direct face-to-face visibility. The evidence supporting the platform's effectiveness is convincing, with demonstrations of distinct experimental paradigms showing how transparency and continuous access to partners' actions can influence strategic coordination, decision-making, and learning. The work will be of broad interest to researchers studying social cognition across humans and non-human primates, providing a versatile tool that bridges the gap between naturalistic social interactions and controlled laboratory experiments.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors aim to address significant limitations of existing experimental paradigms used to study dyadic social interactions by introducing a novel experimental setup - the Dyadic Interaction Platform (DIP). The DIP uniquely allows participants to interact dynamically, face-to-face, with simultaneous access to both social cues and task-related stimuli. The authors demonstrate the versatility and utility of this platform across several exemplary scenarios, notably highlighting cases of significant behavioral differences in conditions involving direct visibility of a partner.

      Major strengths include comprehensive descriptions of previous paradigms, detailed explanations of the DIP's technical features, and clear illustrations of multimodal data integration. These elements greatly enhance the reproducibility of the methods and clarify the potential applications across various research domains and species. Particularly compelling is the authors' demonstration of behavioral impacts related to transparency in interactions, as evidenced by the macaque-human experiments using the Bach-or-Stravinsky game scenario.

      Strengths:

      The DIP represents a methodological advance in the study of social cognition. Its transparent, touch-sensitive display elegantly solves the problem of enabling participants to attend to both their social partner and task stimuli simultaneously without requiring attention switching. This paper marks a notable step forward toward more options for naturalistic yet still lab-based studies of social decision-making, an area where the field is actively moving, especially given recent research highlighting significant differences in neural activity depending upon the context in which an action is performed. The DIP offers researchers a valuable tool to bridge the gap between tightly controlled laboratory paradigms and the dynamic, bidirectional nature of real-world social interactions.

      The authors do well to provide comprehensive documentation of the technical specifications for the four different implementations of the platform, allowing other researchers to adapt and build upon their work. The detailed information about hardware configurations demonstrates careful attention to practical implementation details. They also highlight numerous options for integration with other tools and software, further demonstrating the versatility of this apparatus and the variety of research questions to which it could be applied.

      The historical review of dyadic experimental paradigms is thorough and effectively positions the DIP as addressing a critical gap in existing methodologies. The authors convincingly argue that studying continuous, dynamic social interactions is essential for understanding real-world social cognition, and that existing paradigms often force unnatural attention-splitting or turn-taking behaviors that don't reflect naturalistic interaction patterns.

      The four example applications showcase the DIP's versatility across diverse research questions. The Bach-or-Stravinsky economic game example is particularly compelling, demonstrating how continuous access to partners' actions substantially changes coordination strategies in non-human primates. This highlights a key strength of the DIP, which is that it removes a level of abstraction that can make tasks more difficult for non-human primates to learn. By being able to see their partner and actions directly, rather than having to understand that a cursor on a screen represents a partner, the platform makes the task more accessible to non-human primates and possibly children as well. This opens up important avenues for enhanced cross-species investigations of cognition, allowing researchers to study social dynamics in a setting that remains naturalistic yet controlled across different populations.

      Weaknesses:

      Some of the experimental applications would benefit from stronger evidence demonstrating the unique advantages of the transparent setup. For instance, in the dyadic foraging example, it's not entirely clear how participants' behavior differs from what might be observed when simply tracking each other's cursor movements in a non-transparent setup. More evidence showing how direct visibility of the partner, beyond simply being able to track the position of the partner's cursor, influences behavior would strengthen this example. Similarly, in the continuous perceptual report (CPR) task, the subjects could perform this task and see feedback from their partners' actions without having to see their partner through the transparent screen. Evidence showing that 1) subjects do indeed look at their partner during the task and 2) viewing their partner influences their performance on the task would significantly strengthen the claim that the ability to view the partner brings in a new dimension to this task. These additions would better demonstrate the specific value added by the transparent nature of the DIP beyond what could be achieved with standard cursor-tracking paradigms.

      A significant limitation that is inadequately addressed relates to neural investigations. While the authors position the platform's ability to merge attention to social stimuli and task stimuli as a key advantage, they don't sufficiently acknowledge the challenges this creates for dissociating neural signals attributed to social cues versus task-based stimuli. More traditional lab-based experiments intentionally separate components like task-stimulus perception, social perception, and decision-making periods so that researchers can isolate the neural signals associated with each process. This deliberate separation, which the authors frame as a weakness, actually serves an important functional purpose in neural investigations. The paper would be strengthened by explicitly discussing this limitation and offering potential approaches to address it in experimental design or data analysis. For instance, the authors could suggest methodological innovations or analytical techniques that might help disentangle the overlapping neural signals that would inevitably arise from the integrated presentation of social and task stimuli in the DIP setup.

      Furthermore, the authors' suggestion to arrange task stimuli around the periphery of the screen to maintain a clear middle area for viewing the partner appears to contradict their own critique of traditional paradigms. This recommended arrangement would seemingly reintroduce the very problem of attentional switching between task stimuli and social partners that the authors identified as a limitation of previous approaches. The paper would be strengthened by discussing the potential trade-offs associated with their suggested stimulus arrangement. Additionally, offering potential approaches to address these limitations in experimental design or data analysis would enhance the paper's contribution to the field.

    3. Reviewer #2 (Public review):

      Summary:

      This work proposes a new platform to study social cognition in a more naturalistic setting. The authors give an overview of previous work that extends from static unidirectional paradigms (i.e., the subject is presented with social stimuli such as still images or faces), to more dynamic unidirectional paradigms (i.e., the subject is presented with movies, or another individual's behavior) to dyadic interactions in a laboratory setting or in real life (i.e., interacting with a real person). Overall, this literature demonstrates that findings from realistic social situations can differ dramatically from unidirectional laboratory settings. The authors frame previous work along a spectrum, ranging from highly controlled, low-ecological-validity experiments to naturalistic, unconstrained approaches with high ecological validity, situating their current work within this continuum. They focus on a specific sub-domain of social interactions, i.e., goal-directed contexts in which interactions are purposeful for solving joint tasks or obtaining rewards. This new dyadic interaction platform claims to embed tight experimental control in a naturalistic face-to-face social interaction with the goal of investigating social information processing in bidirectional, dynamic social interactions.

      Strengths:

      The proposed dyadic interaction platform (DIP) is highly flexible, accommodating diverse visual displays, interactive components, and recording devices, making it suitable for various experiments.

      The manuscript does a good job of highlighting the strengths and weaknesses of the various display options. This clarity allows readers to easily assess which display best suits their specific experimental setup and objectives.

      One of the platform's key strengths is its versatility, allowing the same experimental setup to be used across multiple species and developmental stages, and enabling NHPs and humans to be studied as subjects within the same paradigm. Highlighting this capability could further underscore the platform's broad applicability.

      Weaknesses:

      The manuscript emphasizes the importance of ecological validity alongside tight experimental control, a significant challenge in naturalistic neuroscience. While the platform achieves tight control, the ecological validity of such a set-up remains questionable and warrants further testing and validation. For example, while the platform is designed to be more naturalistic in principle, its application to NHPs is still complex and may be as constrained as traditional NHP research. To realize its full potential for animal studies, the platform should be combined with complementary methodologies - such as wireless electrophysiology and freely moving paradigms - to truly achieve a balance between ecological validity and experimental control. Further validation in this direction could significantly enhance its utility.

      The manuscript is somewhat lengthy and occasionally reads more like a review paper, which slightly shifts the focus away from the primary emphasis on the innovative technological advancement and the considerable effort invested in optimizing this new platform. Streamlining the presentation to more directly highlight these key contributions could enhance clarity and impact.

      Overall, there is compelling evidence supporting the feasibility and value of DIP for investigating specific types of social interactions, particularly in contexts where individuals share a workspace and have full transparency regarding their opponent's actions. While I believe that DIP has the potential to significantly impact the field, which is supported by preliminary data, its broader applicability remains an open question. This platform aligns well with recent initiatives aimed at enhancing ecological validity in neuroscience research across both human and animal models. To maximize its impact, it would be beneficial to more explicitly situate this work within that broader movement, emphasizing its relevance and potential to advance ecologically valid approaches in the field.

    1. eLife Assessment

      This fundamental study presents a compelling and comprehensive analysis of the newly defined Lipocone superfamily, offering unprecedented insights into the evolutionary origins of Wnt proteins. The authors provide evidence that this superfamily evolved from membrane proteins. The work is exemplary in its use of sequence analysis and structural modeling and will be of broad interest to researchers studying protein evolution and enzymology.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #1 (Public review):

      Summary:

      In this study titled 'The Lipocone Superfamily: A Unifying Theme In Metabolism Of Lipids, Peptidoglycan, And Exopolysaccharides, Inter-Organismal Conflicts And Immunity' from L. Aravind's group, the authors report the identification of a novel domain superfamily termed "Lipocone" superfamily. This superfamily unifies Wnt protein with a spectrum of domains from about 30 families, including those from phosphatidylserine synthases (PTDSS1/2), TelC toxin, VanZ proteins, and the animal Serum Amyloid A (SAA). The authors provide evidence that this superfamily originated as membrane proteins, with few (including Wnt and SAA) evolving into soluble domains. The authors also provide contextual evidence for the Lipocone members recruited as effectors in biological conflicts in both prokaryotes and eukaryotes. Importantly, to my knowledge, this study is the first to decipher the origins of Wnt signaling (emerging from a membrane protein context) and provide novel insights into immunity.

      - The study is well-executed and provides many interesting leads for further experimental studies, which makes it very important. One of the significant hypotheses in this context is metazoan Wnt Lipocone domain interactions with lipids, which remain to be explored.

      - The manuscript is generally navigable for interesting reading despite being content-rich.

      - Overall, the figures are easy to follow.

      Significance:

      This study not only provides a plausible solution to the origins of metazoan Wnt signaling but also hypothesizes, based on retained ancestral substrate binding pocket, potential lipid interactions for lipocone wnt domains. The study also predicts novel enzymatic roles for many poorly characterized proteins that are involved in immunity across lineages/superkingdoms. This work is likely to inspire numerous experimental studies attempting to verify the hypotheses described in the study.

    3. Reviewer #2 (Public review):

      Summary:

      This is a remarkable study, one of a kind. The authors trace the entire huge superfamily containing Wnt proteins, whose origins remained obscure before this work. Even more amazingly, they show that Wnts originated from transmembrane enzymes. The work is masterfully executed and presented. The conclusions are strongly supported by multiple lines of evidence. Illustrations are beautifully crafted. This is an exemplary work of how modern sequence and structure analysis methods should be used to gain unprecedented insights into protein evolution and origins.

      Significance:

      Wnts are essential in animal development and their studies have attracted significant attention. Therefore, this work is of high importance. Moreover, the authors delineated the entire superfamily consisting of many families with unique functional roles throughout all domains of life. The broad reach of this work further elevates its significance.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Burroughs et al. uses informatic sequence analysis and structural modeling to define a very large, new superfamily which they dub the Lipocone superfamily, based on its function on lipid components and cone-shaped structure. The family includes known enzymatic domains as well as previously uncharacterized proteins (30 families in total). Support for the superfamily designation includes conserved residues located on the homologous helical structures within the fold. The findings include analyses that shed light on important evolutionary relationships including a model in which the superfamily originated as membrane proteins where one branch evolved into a soluble version. Their mechanistic proposals suggest possible functions for enzymes currently unassigned. There is also support for the evolutionary connection of this family with the human immune system. The work will be of interest to those in the broad areas of bioinformatics, enzyme mechanisms, and evolution. The work is technically well performed and presented.

    5. Author response:

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity):

      The study is well-executed and provides many interesting leads for further experimental studies, which makes it very important. One of the significant hypotheses in this context is metazoan Wnt Lipocone domain interactions with lipids, which remain to be explored.

      The manuscript is generally navigable for interesting reading despite being content-rich. Overall, the figures are easy to follow.

      We thank the reviewer for the thoughtful and favorable assessment.

      Major comments:

      I urge the authors to consider creating a first figure summarizing the broad approach and process involved in discovering the lipocone superfamily. This would help the average reader easily follow the manuscript.

      It will be helpful to have the final model/synthesis figure, which provides a take-home message that combines the main deductions from Fig 1c, Fig 4, Fig 5, and Fig 6 to provide an eagle's eye view (also translating the arguments on Page 38 last para into this potential figure).

      We have generated a two-part figure that synthesizes these two requests, also in line with the recommendations made by Reviewer 3. Depending on the accepting Review Commons journal, we plan to either submit this as a graphical abstract/TOC figure (as suggested by Reviewer 3) or as a single figure. We prefer starting with the first approach as it will keep our figure count the same.

      Minor comments:

      Fig 1C: The authors should provide a statistical estimate of the difference in transmembrane tendency scores between the "membrane" and "globular" versions of the Lipocone domains.

      To address this, we calculated group-wise differences using the Kruskal-Wallis nonparametric test, followed by Dunn’s test with Bonferroni correction for a more stringent evaluation. The results are presented as a critical difference diagram in the new Supplementary Figure S3. The analysis is explained in the Methods section of the revised manuscript, and the statistically significant difference is mentioned in the text. This analysis identifies three groups of significantly different Lipocone families based on their transmembrane tendency: those predicted (or known) to associate with the prokaryotic membranes, those predicted to be diffusible, and a small number of families residing in eukaryotic ER membranes or bacterial outer membranes.
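
      The two-step procedure named above (an omnibus Kruskal-Wallis test, then Dunn's pairwise comparisons with Bonferroni adjustment) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names, the group labels, and the toy scores in the usage note are all hypothetical. It returns the H statistic, which would be referred to a chi-square distribution with (number of groups - 1) degrees of freedom, and the Bonferroni-adjusted two-sided p-value for each pair of groups.

```python
from collections import Counter
from itertools import combinations
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict mapping a group label to a list of scores.

    Returns (H, adjusted), where H is the tie-corrected Kruskal-Wallis
    statistic and adjusted maps each pair of labels to its
    Bonferroni-adjusted p-value from Dunn's test.
    """
    names = list(groups)
    pooled = [x for name in names for x in groups[name]]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Mean rank and size of each group (pooled keeps the groups contiguous).
    mean_rank, size, start = {}, {}, 0
    for name in names:
        n = len(groups[name])
        mean_rank[name] = sum(ranks[start:start + n]) / n
        size[name] = n
        start += n

    # Tie term: sum of t^3 - t over every set of t tied values.
    tie_sum = sum(c ** 3 - c for c in Counter(pooled).values())

    # Kruskal-Wallis H with the standard tie correction.
    h = 12 / (n_total * (n_total + 1)) * sum(
        size[g] * mean_rank[g] ** 2 for g in names
    ) - 3 * (n_total + 1)
    h /= 1 - tie_sum / (n_total ** 3 - n_total)

    # Dunn's z statistic for each pair, then Bonferroni: multiply by the
    # number of comparisons and cap at 1.
    pairs = list(combinations(names, 2))
    base_var = n_total * (n_total + 1) / 12 - tie_sum / (12 * (n_total - 1))
    adjusted = {}
    for a, b in pairs:
        se = (base_var * (1 / size[a] + 1 / size[b])) ** 0.5
        z = (mean_rank[a] - mean_rank[b]) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return h, adjusted
```

      For example, calling `kruskal_dunn({"membrane": [11, 12, 13, 14, 15], "er_or_om": [6, 7, 8, 9, 10], "diffusible": [1, 2, 3, 4, 5]})` on these made-up scores compares each pair of groups after a single pooled ranking, which is what distinguishes Dunn's test from running separate two-sample rank tests.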

      Reviewer #2 (Evidence, reproducibility and clarity):

      This is a remarkable study, one of a kind. The authors trace the entire huge superfamily containing Wnt proteins, whose origins remained obscure before this work. Even more amazingly, they show that Wnts originated from transmembrane enzymes. The work is masterfully executed and presented. The conclusions are strongly supported by multiple lines of evidence. Illustrations are beautifully crafted. This is an exemplary work of how modern sequence and structure analysis methods should be used to gain unprecedented insights into protein evolution and origins.

      We thank the reviewer for the positive evaluation of our work.

      Minor comments.

      (1) In Fig. 1, the VanZ structure looks rather different from the rest and is a more tightly packed helical bundle. It might be useful for the readers to learn more about the arguments why the authors consider this family to be homologous with the rest, and what caused these structural changes in packing of the helices.

      First, the geometry of an α-helix can be approximated as a cylinder, resulting in contact points that are relatively small. Fewer contact constraints can lead to structural variation in the angular orientations between the helices of an all α-helical domain, resulting in some dispersion in space of the helical axes. As a result, some of the views can be a bit confounding when presented as static 2D images. Second, of the two VanZ clades, the characteristic structure similar to the other superfamily members is more easily seen in the VanZ-2 clade (as illustrated in supplementary Figure S2).

      Importantly, the membership of the VanZ domains was recovered via significant hits in our sequence analysis of the superfamily. Moreover, when the sequence alignments of the active site are compared (Figure 2), VanZ retains the conserved active site residue positions, which are predicted to reside spatially in the same location and project into an equivalent active site pocket as seen in the other families in the superfamily. Further, this sequence relationship is captured by the edges in the network in Figure 1B: multiple members of the superfamily show edges indicating significant relationships with the two VanZ families (e.g., HHSearch hits of probability greater than 90%; p<0.0001 are observed between VanZ-1 and Skillet-DUF2809, Skillet-1, Skillet-4, YfiM-1, YfiM-DUF2279, Wok, pPTDSS, and cpCone-1). Thus, they occupy relatively central locations in the sequence similarity network, indicating a consistent sequence similarity connection to multiple other families.

      (2) Fig. 4 color bars before names show a functional role. How does the blue bar "described for the first time" fit into this logic? Maybe some other way to mark this (an asterisk?) could be better to resolve this semantic inconsistency.

      We have shifted the blue bars into asterisks, which follow family names, now stated in the updated legend.

      Reviewer #3 (Evidence, reproducibility and clarity):

      The manuscript by Burroughs et al. uses informatic sequence analysis and structural modeling to define a very large, new superfamily which they dub the Lipocone superfamily, based on its function on lipid components and cone-shaped structure. The family includes known enzymatic domains as well as previously uncharacterized proteins (30 families in total). Support for the superfamily designation includes conserved residues located on the homologous helical structures within the fold. The findings include analyses that shed light on important evolutionary relationships including a model in which the superfamily originated as membrane proteins where one branch evolved into a soluble version. Their mechanistic proposals suggest possible functions for enzymes currently unassigned. There is also support for the evolutionary connection of this family with the human immune system. The work will be of interest to those in the broad areas of bioinformatics, enzyme mechanisms, and evolution. The work is technically well performed and presented.

      We appreciate the positive evaluation of our work by the reviewer.

      Referees cross-commenting

      All the comments seem useful to me. I like Reviewer 1's suggestion for a flowchart showing the methodology. I think the summarizing figure suggested could be a TOC abstract, which many journals request.

      To accommodate this comment (along with Reviewer 1’s comments), we have generated a two-part figure containing the methodology flowchart and the summary of findings. Combining the two provides some before-and-after symmetry to a TOC figure, while also avoiding further inflation of the figure count, which would likely be an issue at one or more of the Review Commons journals.

      The authors may wish to consider the following points (page numbers from PDF for review):

      (1) It would be useful in Fig 1A, either in main text or the supporting information, to also have an accompanying topology diagram - I like the coloring of the helices to show the homology but the connections between them are hard to follow

      We acknowledge the reviewer’s concern as one shared by ourselves. We have placed such a topology diagram in Figure 1A, and now refer to it at multiple points in the manuscript text.

      (2) Page: 6- In the paragraph marked as an example- please call out Fig1A when the family mentioned is described (I believe SAA is described as one example)

      We have added these pointers in the text, where appropriate.

      (3) Page: 7- The authors state "these 'hydrophobic families' often evince a deeper phyletic distribution pattern than the less-hydrophobic families (Figure S1), implying that the ancestral version of the superfamily was likely a TM domain" there should be more explanation or information here - I am not certain from looking at FigS1 what a deeper phyletic distribution pattern means. Perhaps explaining for a single example? I also see that this important point is discussed in the conclusions- it is useful to point to the conclusion here.

      Our use of ‘deeper’ in this context is meant to convey the concept that more widely conserved families/clades (both across and within lineages) suggest an earlier emergence. In the Lipocone superfamily, this phylogenetic reasoning supports an evolutionary scenario where the membrane-inserted versions generally emerged early, while the solubilized versions, which are found in relatively fewer lineages, emerged later.

      To address this objectively, we have calculated a simple phyletic distribution metric that combines the phyletic spread of a Lipocone clade with its depth within individual lineages, which is then plotted as a bar graph (Supplemental Figure S1). Briefly, this takes the width of the bar as the phyletic spread across the number of distinct taxonomic lineages and its height as a weighted mean of occurrence within each lineage (depth). The latter helps dampen the effects of sampling bias. In the resulting graph, lineages with a lower height and width are likely to have been derived later than those with a greater height and width. A detailed description clarifying this has been added to the Methods section of the revised manuscript. The results support two statements that are made in the text: 1) that the Wok and VanZ clades are the most widely and deeply represented clades in the superfamily, and 2) that the predicted transmembrane versions tend to be more widely and deeply distributed. We have also added a statement in the results with a pointer to Figure S1 to clarify this point raised by the referee.
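
      As a rough illustration of how such a two-number summary might be computed, the sketch below derives a bar width (spread) and height (depth) for one clade. The response does not give the exact definition of "occurrence" or of the weights, so several choices here are assumptions rather than the authors' method: occurrence is taken to be the fraction of sampled genomes in a lineage that encode the clade, the mean is taken over lineages where the clade is present, the weights default to equal, and the function name and lineage labels are hypothetical.

```python
def phyletic_profile(occurrence, weights=None):
    """Summarize one clade's phyletic distribution as (spread, depth).

    occurrence: dict mapping lineage name -> fraction (0..1) of sampled
        genomes in that lineage encoding the clade (assumed definition).
    weights: optional dict mapping lineage name -> non-negative weight;
        equal weights are used when omitted (assumed default).
    """
    # Spread (bar width): number of distinct lineages containing the clade.
    present = {lin: frac for lin, frac in occurrence.items() if frac > 0}
    spread = len(present)
    if spread == 0:
        return 0, 0.0

    if weights is None:
        weights = {lin: 1.0 for lin in present}

    # Depth (bar height): weighted mean of within-lineage occurrence.
    total_weight = sum(weights[lin] for lin in present)
    depth = sum(weights[lin] * frac for lin, frac in present.items()) / total_weight
    return spread, depth
```

      Under these assumptions, a clade found in many lineages and in most genomes of each yields a wide, tall bar, while a clade restricted to a few lineages, or to a few genomes within them, yields a narrow or short one.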

      (4) For figure 3 I would suggest instead of coloring by atom type- to color the leaving group red and the group being added blue so the reader can see where the moieties start and end in substrates and products

      We have retained the atom type coloring in the figure for ease of visualizing the atom types. However, to address the reviewer’s concern, we have added dashed colored circles to highlight attacking and leaving groups in the reactions. The legend has been updated accordingly.

      (5) Page: 13- The authors state "While the second copy in these versions is catalytically inactive, the H1' from the second duplicate displaces the H1 from the first copy," So this results in a "sort of domain swap" correct? It may be more clear to label both copies in Figure 3 upper right so it is easier for the reader to follow.

      We have added these labels to the updated Figure S4 (formerly S3).

      (6) The authors state "In addition to the fusion to the OMP β-barrel, the YfiM-DUF2279 family (Figure 5H) shows operonic associations with a secreted MltG-like peptidoglycan lytic transglycosylase (127,128), a lipid anchored cytochrome c heme-binding domain (129), a phosphoglucomutase/phosphomannomutase enzyme (130), a GNAT acyltransferase (131), a diaminopimelate (DAP) epimerase (132), and a lysozyme like enzyme (133). In a distinct operon, YfiM-DUF2279 is combined with a GT-A glycosyltransferase domain (79), a further OMP β-barrel, and a secreted PDZ-like domain fused to a ClpP-like serine protease (134,135) (Figure 5H)." this combination of enzymes sounds like those in the pathways for oligosaccharide synthesis which is cytoplasmic but the flippase acts to bring the product to the periplasm. Please make sure it is clear that these enzymes may act at different faces of the membrane.

      We have made that point explicit in the revised manuscript in the paragraph following the above-quoted statement.

      (7) Page: 21- the authors should remove the unpublished observations on other RDD domain or explain or cite them

      The analysis of the RDD domain is a part of a distinct study whose manuscript we are currently preparing, and explaining its many ramifications would be outside the scope of this manuscript. Moreover, placing even an account of it in this manuscript would break its flow and take the focus away from the Lipocone superfamily. Further, its inclusion of the RDD story would substantially increase the size of the manuscript. However, it is commonly fused to the Lipocone domain; hence, it would be remiss if we entirely remove a reference to it. Accordingly, we retain a brief account of the RDD-fused Lipocone domains in the revised manuscript that is just sufficient to make the relevant functional case”.

      (8) Page: 34- The authors state "For instance, the emergence of the outer membrane in certain bacteria was potentially coupled with the origin of the YfiM and Griddle clades (Figure 4)." I don't see origin point indicated in figure 4 (emergence of outer membrane- this may be helpful to indicate in some way- also I am not certain what the dashed circles in Fig 4 are indicating- its not in the legend?

      This annotation has been added to the revised Figure 4, and the point of recruitment is indicated with an “X” sign, along with a clarification in the legend regarding the dashed circles.

      (9) In terms of the hydrophobicity analysis, it would be good to mark on the plot (Fig 1C) one or two examples of lipocone members with known structure that are transmembrane proteins as a positive control

      We have added these markers (colored triangles and squares for these families) to the plot.

      Grammar, typos

      Page: 3- abstract severance is an odd word to use for hydrolysis or cleavage

      We have changed to “cleavage”.

      Page: 5- "While the structure of Wnt was described over a decade prior" should read "Although the structure of ..."

      Page 7 - "One family did not yield a consistent prediction for orientation"- please state which family

      Page: 8 "While the ancestral pattern is noticeably degraded in the metazoan Wnt (Met-Wnt) family, it is strongly preserved in the prokaryotic Min-Wnt family." Should read "Although the ancestral..."

      throughout- please replace solved with experimentally determined to be clear and avoid jargon

      Please replace "TelC severs the link" with "TelC cleaves the bond "

      We have made the above changes.

      Page: 19- the authors state "a lipobox-containing synaptojanin superfamily phosphoesterase (125) and a secreted R-P phosphatase (126) (see Figure 6, Supplementary Data)" I was uncertain if the authors meant Fig S6 or they meant see Fig 6 and something else in supplementary data. Please fix.

      In this pointer, we intended to flag the relevant gene neighborhoods in both Figures 5H and 6, as well as highlight the additional examples contained in the Supplementary Data. We have updated the pointer.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper concerns mechanisms of foraging behavior in C. elegans. Upon removal from food, C. elegans first executes a stereotypical local search behavior in which it explores a small area by executing many random, undirected reversals and turns called "reorientations." If the worm fails to find food, it transitions to a global search in which it explores larger areas by suppressing reorientations and executing long forward runs (Hills et al., 2004). At the population level, the reorientation rate declines gradually. Nevertheless, about 50% of individual worms appear to exhibit an abrupt transition between local and global search, which is evident as a discrete transition from high to low reorientation rate (Lopez-Cruz et al., 2019). This observation has given rise to the hypothesis that local and global search correspond to separate internal states with the possibility of sudden transitions between them (Calhoun et al., 2014). The main conclusion of the paper is that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rates. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Strengths:

      The strength of the paper is the demonstration that a more parsimonious model explains abrupt transitions in the reorientation rate.

      Weaknesses:

      (1) Use of the Gillespie algorithm is not well justified. A conventional model with a fixed dt and an exponentially decaying reorientation rate would be adequate and far easier to explain. It would also be sufficiently accurate - given the appropriate choice of dt - to support the main claims of the paper, which are merely qualitative. In some respects, the whole point of the paper - that discrete transitions are an epiphenomenon of stochastic behavior - can be made with the authors' version of the model having a constant reorientation rate (Figure 2f).

      We apologize, but we are not sure what the reviewer means by “fixed dt”. If the reviewer means taking discrete steps in time (dt), and modeling whether a reorientation occurs, we would argue that the Gillespie algorithm is a better way to do this because it provides floating-point precision, rather than a time resolution limited by dt, which we hopefully explain in the updated text (Lines 107-192).

      The reviewer is correct that discrete transitions are an epiphenomenon of stochastic behavior, as we show in Figure 2f. However, abrupt stochastic jumps that occur with a constant rate do not produce persistent changes in the observed rate because the rate is, by definition, constant. The theory that there are local and global searches is based on the observation that individual worms often abruptly change their reorientation rates. But this observation is only true for a fraction of worms. We argue that it is not observed for all, or even most, worms because the apparent transitions result from stochastic sampling, not from a sudden change in search strategy.
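
      To illustrate the distinction drawn above between a fixed time step and event-driven sampling, the following is a minimal sketch rather than the authors' code. It assumes a constant reorientation rate for simplicity; the function names and parameter values are hypothetical. The fixed-dt version can only place events on a grid of resolution dt and allows at most one event per step, whereas the Gillespie-style version draws each waiting time directly from an exponential distribution.

```python
import random


def fixed_dt_events(rate, t_end, dt, rng):
    """Bernoulli trial per time step; event times are limited to the dt grid."""
    times = []
    n_steps = int(round(t_end / dt))
    for i in range(n_steps):
        # Approximation: valid only when rate * dt is much smaller than 1.
        if rng.random() < rate * dt:
            times.append(i * dt)
    return times


def gillespie_events(rate, t_end, rng):
    """Draw exponential waiting times; event times are continuous."""
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate)  # waiting time to the next event
        if t >= t_end:
            return times
        times.append(t)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values: 1 event per minute over 40 minutes.
    print(len(fixed_dt_events(1.0, 40.0, 0.1, rng)))
    print(len(gillespie_events(1.0, 40.0, rng)))
```

      Both functions should give a similar number of events on average when dt is small; the difference lies in the time resolution of the individual events.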

      (2) In the manuscript, the Gillespie algorithm is very poorly explained, even for readers who already understand the algorithm; for those who do not it will be essentially impossible to comprehend. To take just a few examples: in Equation (1), omega is defined as reorientations instead of cumulative reorientations; it is unclear how (4) follows from (2) and (3); notation in (5), line 133, and (7) is idiosyncratic. Figure 1a does not help, partly because the notation is unexplained. For example, what do the arrows mean, what does "*" mean?

      We apologize for this. You are correct that 𝛀 is cumulative reorientations, and we have edited the text for clarity (Lines 107-192).

      We apologize for the arrow notation confusion. Arrow notation is commonly used in pseudocode to indicate variable assignment, and so we used it to indicate variable assignment updates in the algorithm.

      We added Figure 2a to help explain the Gillespie algorithm for people who are unfamiliar with it, but you are correct that some notation, such as the probabilities, was left unexplained. We have added more text to the figure legend. Hopefully this additional text, along with lines 105-190, provides better clarification.

      (3) In the model, the reorientation rate dΩ⁄dt declines to zero but the empirical rate clearly does not. This is a major flaw. It would have been easy to fix by adding a constant to the exponentially declining rate in (1). Perhaps fixing this obvious problem would mitigate the discrepancies between the data and the model in Figure 2d.

      You are correct that the model deviates slightly at longer times, but this result is consistent with Klein et al. that show a continuous decline of reorientations. However, we have added a constant to the model (b, Equation 2), since an infinite run length is likely not physiological.

      (4) Evidence that the model fits the data (Figure 2d) is unconvincing. I would like to have seen the proportion of runs in which the model generated one as opposed to multiple or no transitions in reorientation rate; in the real data, the proportion is 50% (Lopez). It is claimed that the "model demonstrated a continuum of switching to non-switching behavior" as seen in the experimental data but no evidence is provided.

      We should clarify that the 50% proportion cited by López-Cruz was based on an arbitrary difference in slopes, and by assessing the data visually (López-Cruz, Figure S2). We added a comment in the text to clarify this (Lines 76 – 78). We sought to avoid this subjective assessment by plotting the distribution of slopes and transition times produced by the method used in López-Cruz. We should also clarify what we meant by “a continuum of switching and non-switching” behavior. Neither the transition time distributions nor the slope-difference distributions appear to be the result of two distributions (the distributions in Figure 1 are not bimodal). This is unlike roaming and dwelling on food, where two distinct distributions of behavioral metrics can be identified based on speed and angular speed (Flavell et al, 2009, Fig S2a).

      Based on the advice of Reviewer #3, we have also modeled the data using different starting amounts of M (M<sub>0</sub>). By definition, an initial value of M<sub>0</sub> = 1 is a two-state switching strategy; the worm uses a reorientation rate of either a (when M = 1) or b (when M = 0). As expected, this does produce a bimodal distribution of slope differences (Figure 3b), which is significantly different from the experimental distribution (Figure 3c). We have added a new section to explain this in more detail (Lines 253 – 297).

      (5) The explanation for the poor fit between the model and data (lines 166-174) is unclear. Why would externally triggered collisions cause a shift in the transition distribution?

      Thank you, we rewrote the text to clarify this better (Lines 227-233). There were no externally triggered collisions; 10 animals were used per experiment. They would occasionally collide during the experiment, but these collisions were excluded from the data that were provided. However, worms are also known to increase reorientations when they encounter a pheromone trail, and it is unknown (from this dataset) which reorientations may have been a result of this phenomenon.

      (6) The discussion of Levy walks and the accompanying figure are off-topic and should be deleted.

      Thank you, we agree that this topic is tangential, and we removed it.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors build a statistical model that stochastically samples from a time-interval distribution of reorientation rates. The form of the distribution is extracted from a large array of behavioral data, and is then used to describe not only the dynamics of individual worms (including the inter-individual variability in behavior), but also the aggregate population behavior. The authors note that the model does not require assumptions about behavioral state transitions, or evidence accumulation, as has been done previously, but rather that the stochastic nature of behavior is "simply the product of stochastic sampling from an exponential function".

      Strengths:

      This model provides a strong juxtaposition to other foraging models in the worm. Rather than evoking a behavioral transition function (that might arise from a change in internal state or the activity of a cell type in the network), or evidence accumulation (which again maps onto a cell type, or the activity of a network) - this model explains behavior via the stochastic sampling of a function of an exponential decay. The underlying model and the dynamics being simulated, as well as the process of stochastic sampling, are well described and the model fits the exponential function (Equation 1) to data on a large array of worms exhibiting diverse behaviors (1600+ worms from Lopez-Cruz et al). The work of this study is able to explain or describe the inter-individual diversity of worm behavior across a large population. The model is also able to capture two aspects of the reorientations, including the dynamics (to switch or not to switch) and the kinetics (slow vs fast reorientations). The authors also work to compare their model to a few others including the Levy walk (whose construction arises from a Markov process) to a simple exponential distribution, all of which have been used to study foraging and search behaviors.

      Weaknesses:

      This manuscript has two weaknesses that dampen the enthusiasm for the results. First, in all of the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references, for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) is distinct from the behavior being studied. In this manuscript, the distribution being sampled is the exponential decay function of the reorientation rate (lines 100-102). This appears to be tautological - a decay function fitted to the reorientation data is then sampled to generate the distributions of the reorientation data. That the model performs well and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      Thank you, we apologize that this was not clearer. In the Lotka-Volterra model, the densities of predators and prey are being modeled, with the underlying assumption that rates of birth and death are inherently stochastic. In our model, the number of reorientations is being modeled, with the assumption (based on the experiments) that the occurrence of reorientations is stochastic, just like the occurrence (birth) of a prey animal is stochastic. However, the decay in M is phenomenological, and we speculate about the nature of M later in the manuscript.

      You are absolutely right that the decay function for M was fit to the population average of reorientations and then sampled to generate the distributions of the reorientation data. This was intentional to show that the parameters chosen to match the population average would produce individual trajectories with comparable stochastic “switching” as the experimental data. All we’re trying to show really is that observed sudden changes in reorientation that appear persistent can be produced by a stochastic process without resorting to binary state assignments. In Calhoun, et al 2014 it is reported all animals produced switch-like behavior, but in Klein et al, 2017 it is reported that no animals showed abrupt transitions. López-Cruz et al seem to show a mix of these results, which can easily be explained by an underlying stochastic process.

      The second weakness is somewhat related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides.

      Stochastically sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework, and the authors indeed point this out: "simple stochastic models should be sufficient to explain observably stochastic behaviors." (Line 233-234). But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale; which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left for want: The mechanisms suggested, including loss of sensory stimuli, alterations in motor integration, ionotropic glutamate signaling, dopamine, and neuropeptides are all suggested: these are basically all of the possible biological sources that can govern behavior, and one is left not knowing what insight the model provides. The array of biological processes listed is so variable in dynamics and meaning, that their explanation of what governs M is at best unsatisfying. Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection, especially taken in aggregate with the previous weakness.

      Providing a roadmap of how to think about the processes generating M, the meaning of those processes in search, and potential frameworks that are more constrained and with more precise biological underpinning (beyond the array of possibilities described) would go a long way to assuaging the weaknesses.

      Thank you, these are all excellent points. We should clarify that in López-Cruz et al, they claim that only 50% of the animals fit a local/global search paradigm. We are simply proposing there is no need for designating local and global searches if the data don’t really support it. The underlying behavior is stochastic, so the sudden switches sometimes observed can be explained by a stochastic process where the underlying rate is slowing down, thus producing the persistently slow reorientation rate when an apparent “switch” occurs. What we hope to convey is that foraging doesn’t appear to follow a decision paradigm, but instead a gradual change in reorientations which for individual worms, can occasionally produce reorientation trajectories that appear switch-like.

      As for M, you are correct, we should be more explicit, and we have added text (Lines 319-359) to expand upon its possible biological origin.

      Reviewer #3 (Public review):

      Summary:

      This intriguing paper addresses a special case of a fundamental statistical question: how to distinguish between stochastic point processes that derive from a single "state" (or single process) and more than one state/process. In the language of the paper, a "state" (perhaps more intuitively called a strategy/process) refers to a set of rules that determine the temporal statistics of the system. The rules give rise to probability distributions (here, the probability for turning events). The difficulty arises when the sampling time is finite, and hence, the empirical data is finite, and affected by the sampling of the underlying distribution(s). The specific problem being tackled is the foraging behavior of C. elegans nematodes, removed from food. Such foraging has been studied for decades, and described by a transition over time from 'local'/'area-restricted' search (roughly in the initial 10-30 minutes of the experiments, in which animals execute frequent turns) to 'dispersion', or 'global search' (characterized by a low frequency of turns). The authors propose an alternative to this two-state description - a potentially more parsimonious single 'state' with time-changing parameters, which they claim can account for the full-time course of these observations.

      Figure 1a shows the mean rate of turning events as a function of time (averaged across the population). Here, we see a rapid transient, followed by a gradual 4-5 fold decay in the rate, and then levels off. This picture seems consistent with the two-state description. However, the authors demonstrate that individual animals exhibit different "transition" statistics (Figure 1e) and wish to explain this. They do so by fitting this mean with a single function (Equations 1-3).

      Strengths:

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Weaknesses:

      (1) The authors claim that only about half the animals tested exhibit discontinuity in turning rates. Can they automatically separate the empirical and model population into these two subpopulations (with the same method), and compare the results?

      Thank you, we should clarify that the observation that about half the animals exhibit discontinuity was not made by us, but by López-Cruz et al. The observed fraction of 50% was based on a visual assessment of the dual regression method we described. We added text (Lines 76-79) to clarify this. To make the process more objective, we decided to simply plot the distributions of the metrics they used for this assessment to see if two distinct populations could be observed. However, the distributions of slope differences and transition times do not produce two distinct populations. Our stochastic approach, which does not assume abrupt state-transitions, also produces comparable distributions. To quantify this, we have added a section varying M<sub>0</sub>, including setting M<sub>0</sub> to 1, so that the model by definition is a switch model. This model performs the worst (Lines 253-296, Figure 3).
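
      For readers unfamiliar with the slope-difference and transition-time metrics mentioned above, the following sketch shows one simple way such quantities could be extracted from a cumulative reorientation trace. It is an assumption-laden illustration, not the method of López-Cruz et al. or of the authors: it uses an exhaustive search over breakpoints with two ordinary least-squares lines, and the names `fit_line` and `two_segment_fit` are hypothetical.

```python
def fit_line(ts, ys):
    """Ordinary least-squares line; returns (slope, sum of squared errors)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_t
    sse = sum((y - (slope * t + intercept)) ** 2 for t, y in zip(ts, ys))
    return slope, sse


def two_segment_fit(ts, ys, min_points=3):
    """Try every breakpoint; keep the pair of lines with the lowest total error.

    Returns (transition_time, s1 - s2), where s1 and s2 are the slopes
    before and after the breakpoint.
    """
    best = None
    for k in range(min_points, len(ts) - min_points + 1):
        s1, e1 = fit_line(ts[:k], ys[:k])
        s2, e2 = fit_line(ts[k:], ys[k:])
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, ts[k], s1, s2)
    if best is None:
        raise ValueError("trace too short for a two-segment fit")
    _, t_switch, s1, s2 = best
    return t_switch, s1 - s2
```

      Applying such a fit to every experimental and simulated trace, and then plotting the resulting distributions, is the kind of comparison described in the response; a bimodal distribution of slope differences would suggest two distinct populations.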

      (2) The equations consider an exponentially decaying rate of turning events. If so, Figure 2b should be shown on a semi-logarithmic scale.

      We chose not to do this because this average is based on the number of discrete reorientation events observed within a 2-minute window. The number of events ranges from 0 to 6 (hence a rate of 0.5-3 min<sup>-1</sup>), which does not span one order of magnitude. Instead, we included a heat map (Figure 1a, Figure 2b bottom panel) which shows the density that the average is based on. We hope this provides some clarity to the reader.
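
      The windowed rate described above can be sketched as follows. This is an illustrative helper, not the authors' code; the function name and the default window of 2 minutes are assumptions taken from the description in the response.

```python
def windowed_rates(turn_times, t_end, window=2.0):
    """Count events in consecutive windows and convert counts to rates.

    turn_times: event times in minutes; t_end: total duration in minutes.
    Returns one rate (events per minute) per complete window.
    """
    n_bins = int(t_end // window)
    counts = [0] * n_bins
    for t in turn_times:
        idx = int(t // window)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return [c / window for c in counts]
```

      With a 2-minute window, six events in one window correspond to a rate of 3 per minute, and a single event corresponds to 0.5 per minute, so the rates take only a few discrete values.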

      (3) The variables in Equations 1-3 and the methods for simulating them are not well defined, making the method difficult to follow. Assuming my reading is correct, Omega should be defined as the cumulative number of turning events over time (Omega(t)), not as a "turn" or "reorientation", which has no derivative. The relevant entity in Figure 1a is apparently <Omega (t)>, i.e. the mean number of events across a population which can be modelled by an expectation value. The time derivative would then give the expected rate of turning events as a function of time.

      Thank you, you are correct. Please see response to Reviewer #1.

      (4) Equations 1-3 are cryptic. The authors need to spell out up front that they are using a pair of coupled stochastic processes, sampling a hidden state M (to model the dynamic turning rate) and the actual turn events, Omega(t), separately, as described in Figure 2a. In this case, the model no longer appears more parsimonious than the original 2-state model. What then is its benefit or explanatory power (especially since the process involving M is not observable experimentally)?

      Thank you, yes we see how as written this was confusing. In our response to Reviewer #1, and in the text, we added an important detail:

      While reorientations are modeled as discrete events, which is observationally true, the amount of M at time t=0 is chosen to be large (M<sub>0</sub> = 1000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      However, you are correct that if M were chosen to have a binary value of 0 or 1, then this would indeed be the two-state model. We added a new section to address this (Lines 253-287, Figure 3). Unlike the experiments, the two-state model produces bimodal distributions in slope and transition times, and these distributions are significantly different from the experimental data (Figure 3).
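
      The coupled processes discussed in this response can be sketched with a short Gillespie-style simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that M loses one unit at a time with propensity gamma * M, and that a reorientation occurs with propensity alpha * M + beta. The exact form of the equations is not given in this response, and all parameter values below are hypothetical.

```python
import random


def simulate_worm(alpha, beta, gamma, m0, t_end, rng):
    """Return reorientation times for one simulated animal.

    Two event types compete (assumed forms, see the text above):
      decay:         M -> M - 1, propensity gamma * M
      reorientation: propensity alpha * M + beta
    """
    t = 0.0
    m = m0
    turn_times = []
    while True:
        decay_rate = gamma * m
        turn_rate = alpha * m + beta
        total = decay_rate + turn_rate
        if total <= 0:
            break  # nothing can happen any more
        t += rng.expovariate(total)  # waiting time to the next event
        if t >= t_end:
            break
        # Choose which event occurred, in proportion to its propensity.
        if rng.random() * total < decay_rate:
            m -= 1
        else:
            turn_times.append(t)
    return turn_times


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters (per minute) with M0 = 1000 over 40 minutes.
    times = simulate_worm(0.0025, 0.5, 0.1, 1000, 40.0, rng)
    print(len(times))
```

      With a large m0 the decay in M is close to continuous, so abrupt changes in the simulated trace come from the randomness of the reorientation events. Setting m0 to 1 (and scaling alpha accordingly) turns the same code into a two-state switch, because M can only be 1 or 0.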

      (5) Further, as currently stated in the paper, Equations 1-3 are only for the mean rate of events. However, the expectation value is not a complete description of a stochastic system. Instead, the authors need to formulate the equations for the probability of events, from which they can extract any moment (they write something in Figure 2a, but the notation there is unclear, and this needs to be incorporated here).

      Thank you, yes please see our response to Reviewer #1. We have clarified the text in Lines 105-190.

      (6) Equations 1-3 have three constants (alpha and gamma which were fit to the data, and M0 which was presumably set to 1000). How does the choice of M0 affect the results?

      Thank you, this is a good question. We address this in lines 253-296. Briefly, the choice of M<sub>0</sub> does not have a strong effect on the results, unless we set M<sub>0</sub> = 1, which, by definition, creates a two-state model. This model was significantly different from the experimental data, relative to the other models (Figure 3c).

      (7) M decays to near 0 over 40 minutes, abolishing omega turns by the end of the simulations. Are omega turns entirely abolished in worms after 30-40 minutes off food? How do the authors reconcile this decay with the leveling of the turning rate in Figure 1a?

      Yes, Reviewer #1 recommended adding a baseline reorientation rate which we did for all models (Equation 2). However, we should also note that in Klein et al they observed a continuous decay over 50 minutes. Though realistically, it is likely not plausible that worms will produce infinitely long runs at long time points.

      (8) The fit given in Figure 2b does not look convincing. No statistical test was used to compare the two functions (empirical and fit). No error bars were given (to either). These should be added. In the discussion, the authors explain the discrepancy away as experimental limitations. This is not unreasonable, but on the flip side, makes the argument inconclusive. If the authors could model and simulate these limitations, and show that they account for the discrepancies with the data, the model would be much more compelling.

      To do this, I would imagine that the authors would need to take the output of their model (lists of turning times) and convert them into simulated trajectories over time. These trajectories could be used to detect boundary events (for a given size of arena), collisions between individuals, etc. in their simulations and to see their effects on the turn statistics.

      Thank you, we have added dashed lines to indicate standard deviation to Figures 2b and 3a. After running the models several times, we found that some of the small discrepancies noted (like s<sub>1</sub>-s<sub>2</sub> < 0 for experiments but not the model), were spurious due to these data points being <1% of the data, so we cut this from the text. To compare how similar the continuous (M<sub>0</sub> > 1) and discrete (M<sub>0</sub> = 1) models were to the experimental data, we calculated a Jensen-Shannon distance for the models, and found that the discrete model was significantly more dissimilar to the experimental data than the continuous models (Lines 289-296, Figure 3c).
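
      As an illustration of the comparison described above, the Jensen-Shannon distance between two histograms can be computed as in the sketch below. This is a generic implementation using base-2 logarithms, so the distance lies between 0 and 1; the binning and any details of the authors' calculation are not specified in this response, and the function name is hypothetical.

```python
import math


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two histograms with identical bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same bins")
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("histograms must contain at least one count")
    p = [c / p_total for c in p_counts]
    q = [c / q_total for c in q_counts]
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence in bits; empty bins contribute zero.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
    return math.sqrt(max(0.0, divergence))
```

      Identical histograms give a distance of 0 and histograms with no overlapping bins give a distance of 1, so a larger value for the discrete model than for the continuous models indicates a poorer match to the experimental distribution.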

      (9) The other figures similarly lack any statistical tests and by eye, they do not look convincing. The exception is the 6 anecdotal examples in Figure 2e. Those anecdotal examples match remarkably closely, almost suspiciously so. I'm not sure I understood this though - the caption refers to "different" models of M decay (and at least one of the 6 examples clearly shows a much shallower exponential). If different M models are allowed for each animal, this is no longer parsimonious. Are the results in Figure 2d for a single M model? Can Figure 2e explain the data with a single (stochastic) M model?

      We certainly don’t want the panels in Figure 2e to be suspicious! These comparisons were drawn from calculating the correlations between all model traces and all experimental traces, and then choosing the top hits. Every time we run the simulation, we arrive at a different set of examples. Since it was recommended we add a baseline rate, these examples will be a completely different set when we run the simulation again.

      We apologize for the confusion regarding M. Since the worms do not all start out with identical reorientation rates, we drew the initial M value from a distribution centered on M<sub>0</sub> to match the initial distribution of observed experimental rates (Lines 206-214). However, the decay in M (γ), as well as α and β, are the same for all in silico animals.

      (10) The left axes of Figure 2e should be reverted to cumulative counts (without the normalization).

      Thank you, we made this change.

      (11) The authors give an alternative model of a Levy flight, but do not give the obvious alternative models:

      a) the 1-state model in which P(t) = alpha exp (-gamma t) dt (i.e. a single stochastic process, without a hidden M, collapsing equations 1-3 into a single equation).

      b) the originally proposed 2-state model (with 3 parameters, a high turn rate, a low turn rate, and the local-to-global search transition time, which can be taken from the data, or sampled from the empirical probability distributions). Why not? The former seems necessary to justify the more complicated 2-process model, and the latter seems necessary since it's the model they are trying to replace. Including these two controls would allow them to compare the number of free parameters as well as the model results. I am also surprised by the Levy model since Levy is a family of models. How were the parameters of the Levy walk chosen?

      Thank you, we removed this section completely, as it is tangential to the main point of the paper.

      (12) One point that is entirely missing in the discussion is the individuality of worms. It is by now well known that individual animals have individual behaviors. Some are slow/fast, and similarly, their turn rates vary. This makes this problem even harder. Combined with the tiny number of events concerned (typically 20-40 per experiment), it seems daunting to determine the underlying model from behavioral statistics alone.

      Thank you, yes we should have been more explicit in the reasoning behind drawing the initial M from a distribution (response to comment #9). We assume that not every worm starts out with the same reorientation rate, but that some start out fast (high M) and some start out slow (low M). However, we do assume M decays with the same kinetics, which seems sufficient to produce the observed phenomena. Multiple decay rates are not needed to replicate the experimental data.

      (13) That said, it's well-known which neurons underpin the suppression of turning events (starting already with Gray et al 2005, which, strangely, was not cited here). Some discussion of the neuronal predictions for each of the two (or more) models would be appropriate.

      Thank you, yes we will add Gray et al, but also the more detailed response to Reviewer #2 (Lines 319-359 of manuscript).

      (14) An additional point is the reliance entirely on simulations. A rigorous formulation (of the probability distribution rather than just the mean) should be analytically tractable (at least for the first moment, and possibly higher moments). If higher moments are not obtainable analytically, then the equations should be numerically integrable. It seems strange not to do this.

      Thank you for suggesting this. For the Levy section (which we cut) this would have been an improvement. However, since the distributions of slope differences and transition times are based on a recursive algorithm, rather than an analytical formulation, we decided to use the Jensen-Shannon divergence to compare distributions (Lines 272-296, Figure 3c) since this is a parameter-free approach.

      In summary, while sample simulations do nicely match the examples in the data (of discontinuous vs continuous turning rates), this is not sufficient to demonstrate that the transition from ARS to dispersion in C. elegans is, in fact, likely to be a single 'state', or this (eq 1-3) single state. Of course, the model can be made more complicated to better match the data, but the approach of the authors, seeking an elegant and parsimonious model, is in principle valid, i.e. avoiding a many-parameter model-fitting exercise.

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Thank you, we agree that this is a generic phenomenon, which is partly why we did this. The data from López-Cruz seem to agree in part with Calhoun et al, who claim abrupt transitions occur, and with Klein et al, who claim they do not occur. Since the underlying phenomenon is stochastic, we propose the mixed observations of sudden and gradual changes in search strategy are simply the result of a stochastic process, which can produce both phenomena for individual observations. We hope this work can help clarify why sudden changes in search strategy are not consistently observed. We propose a simple hypothesis that there is no change in search strategy. The reorientation rate decays in time, and due to the stochastic nature of this behavior, what appears as a sudden change for individual observations is not due to an underlying decision, but rather the result of a stochastic process.

    2. eLife Assessment

      This valuable paper uses a quantitative modeling approach to explore a well-studied transition in motor behavior in the nematode C. elegans. The authors provide convincing evidence that this transition, which has been interpreted as a two-state behavior, can instead be described as a process whose parameters are smoothly modulated within a single state. This finding provides insight into the relationships between latent internal states and observable behavioral states, and suggests that relatively simple neuronal mechanisms can drive behavioral sequences that appear more complex.

    3. Reviewer #1 (Public review):

      This paper concerns mechanisms of foraging behavior in C. elegans. Upon removal from food, C. elegans first executes a stereotypical local search behavior in which it explores a small area by executing many random, undirected reversals and turns called "reorientations." If the worm fails to find food, it transitions to a global search in which it explores larger areas by suppressing reorientations and executing long forward runs (Hills et al., 2004). At the population level, reorientation rate declines gradually. Nevertheless, about 50% of individual worms appear to exhibit an abrupt transition between local and global search, which is evident as a discrete transition from high to low reorientation rate (Lopez-Cruz et al., 2019). This observation has given rise to the hypothesis that local and global search correspond to separate internal states with the possibility of sudden transitions between them (Calhoun et al., 2014). The objective of the paper is to demonstrate that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rate. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Major strengths and weaknesses of the methods and results

      The model was not explicitly designed to match the sudden, stable changes in reorientation rates observed in the experimental data from individual worms. Kinetic parameters were simply chosen to match the average population behavior. Nevertheless, many sudden stable changes in reorientation rates occurred. This is a strong argument that apparent state changes can arise as an epiphenomenon of stochastic processes.

      The new stochastic model is more parsimonious than the reorientation-state change model because it posits one state rather than two.

      A prominent feature of the empirical data is that 50% of the worms exhibit a single (apparent) state change and the rest show either no state changes or multiple state changes. Does the model reproduce these proportions? This obvious question was not addressed.

      There is no obvious candidate for the neuronal basis of the decaying factor M. The authors speculate that decreasing sensory neuron activity might be the correlate of M but then provide contradictory evidence that seems to undermine that hypothesis. The absence of a plausible neuronal correlate of M weakens the case for the model.

      Appraisal of whether the authors achieved their aims, and whether the results support their conclusions

      The authors have made a convincing case that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rate. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Impact of the work on the field, and the utility of the methods and data to the community

      Positing hidden internal states to explain behavioral sequences is gaining acceptance in behavioral neuroscience. The likely impact of the paper is to establish a compelling example of how statistical reasoning can reduce the number of hidden states to achieve models that are more parsimonious.

    4. Reviewer #2 (Public review):

      Summary:

      In this study, the authors build a statistical model that stochastically samples from a time-interval distribution of reorientation rates. The form of the distribution is extracted from a large array of behavioral data, and is then used to describe not only the dynamics of individual worms (including the inter-individual variability in behavior), but also the aggregate population behavior. The authors note that the model does not require any assumptions about behavioral state transitions, or evidence accumulation, as has been done previously, but rather that the stochastic nature of behavior is "simply the product of stochastic sampling from an exponential function".

      Strengths:

      This model provides a strong juxtaposition to other foraging models in the worm. Rather than evoking a behavioral transition function (that might arise from a change in internal state or the activity of a cell type in the network), or evidence accumulation (which again maps onto a cell type, or the activity of a network) - this model explains behavior via the stochastic sampling of a function of an exponential decay. The underlying model and the dynamics being simulated, as well as the process of stochastic sampling are well described, and the model fits the exponential function (equation 1) to data on a large array of worms exhibiting diverse behaviors (1600+ worms from Lopez-Cruz et al). The work of this study can explain or describe the inter-individual diversity of worm behavior across a large population. The model is also able to capture two aspects of the reorientations, including the dynamics (to switch or not to switch) and the kinetics (slow vs fast reorientations). The authors also work to compare their model to a few others including the Levy walk (whose construction arises from a Markov process) to a simple exponential distribution, all of which have been used to study foraging and search behaviors.

      Weaknesses:

      The weaknesses are one of framework, which may nonetheless stir discussion and motivate new ideas based on these results.

      First, the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) are distinct from the behavior being studied. In this manuscript, the distribution being sampled from is the exponential decay function of the reorientation rate. That the model performs well, and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      The second weakness is related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides. Stochastically sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework. But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale, which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left for want: Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection.

      The authors provide possible roadmaps, but where they lead and how to relate that back to testable mechanistic studies remains unclear. Weighing the significance of the finding relative to the weaknesses appears to depend on how one feels about the possible mechanisms the authors identify in their responses.

    5. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public reviews):

      Weaknesses:

      This manuscript has two weaknesses that dampen the enthusiasm for the results. First, in all of the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) are distinct from the behavior being studied. In this manuscript, the distribution being sampled from is the exponential decay function of the reorientation rate (lines 100-102). This appears to be tautological - a decay function fitted to the reorientation data is then sampled to generate the distributions of the reorientation data. That the model performs well, and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      To use the Lotka-Volterra model as an analogy, the changing reorientation rate (like a changing rate of prey growth) is tied to the decay in M (like a loss of predators). You could infer the loss of predators by measuring the changing rate of prey growth. In our case, we infer the loss of M by observing the changing reorientation rate. In the Lotka-Volterra model, the prey growth rate is negatively associated with predator numbers, but in our model, the reorientation rate is positively associated with M, hence a loss in M leads to a decay in the reorientation rate.

      You are correct that the decay parameters fit to the average should produce a distribution of in silico data that reproduce this average result (Figure 3a). However, this does not necessarily mean that these kinetic parameters should produce the same distributions of switch kinetics observed in Figure 3b. Indeed, a binary model (M ∈ {0, 1}), which produces an average distribution that matches the average experimental data (Figure 3a), produces a fundamentally different (bimodal) distribution of switch distributions in Figure 3b.

      The second weakness is somewhat related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides. Stochastically sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework, and the authors indeed point this out: "simple stochastic models should be sufficient to explain observably stochastic behaviors." (Line 233-234). But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale, which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left for want: The mechanisms suggested include loss of sensory stimuli, alterations in motor integration, ionotropic glutamate signaling, dopamine, and neuropeptides: this is basically all of the possible biological sources that can govern behavior, and one is left not knowing what insight the model provides. The array of biological processes listed are so variable in dynamics and meaning, that their explanation of what governs M is at best unsatisfying. Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection, especially taken in aggregate with the previous weakness.

      Providing a roadmap of how to think about the processes generating M, the meaning of those processes in search, and potential frameworks that are more constrained and with more precise biological underpinning (beyond the array of possibilities described) would go a long way to assuaging the weaknesses.

      The insight we (hopefully) are trying to convey is that individual observations of apparent state-switching behavior do not necessarily imply that a state change is actually happening if a large fraction of the population is not producing this behavior. This same observation can be recreated by invoking a stochastic process, which we already know is how reorientation occurrences behave in the first place. Apparent switches to global foraging are simply due to the reorientation rate decaying in time, not necessarily due to a sudden state change. We modeled a stochastic binary switch (when M0 = 1), which produced a bimodal distribution of switch kinetics (Figure 3b) that was different from the experimental distribution. The biological basis of M is not addressed here, but we clarified the language on lines 342 and 343 to reinforce that it likely represents the timescales of AIA and ADE activities. We reiterated what was described in López-Cruz et al to convey that molecularly, what is governing the timescales of these two neurons is not trivial, and likely multi-faceted.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The presentation of the Gillespie algorithm, though much improved, is tough going and for many biologists will be a barrier to appreciation of what was done and what was achieved. I found the description of the algorithm generated by AI (ChatGPT) to be more accessible and the example given to be better related to the present application of the algorithm. This might provide a template for a more accessible description of the model.

      We are glad the newer draft is clearer, and apologize it is still difficult to read. We made a few changes that hopefully clarify some points (see below).

      It is unclear how instances of >1 transition were automatically distinguished from instances with 1 transition. A related point is how the transition-finding algorithm was kept from detecting too many transitions, as it seems that any quadruplet of points defines a slope change.

      In López-Cruz et al, >1 transitions (and all transitions) were distinguished by eye after running the findchangepts function. We added a clarifying statement on lines 78 and 79 to illuminate this point. As noted on line 72, the function itself only fits two regressions, so by definition, it can only define one transition. This is why we decided to plot the distribution of slope and transition parameters in the first place; to see if there was a clear bimodal distribution (as observed for other observably binary states, like roaming and dwelling). This was not the case for the experimental data, but was observed in the in silico data if we forced the algorithm to be a two-state model (Figure 3b, M0 = 1).
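
      To make the two-regression idea concrete, the sketch below fits two independent straight lines to a cumulative reorientation trace and picks the single split that minimizes the total squared residual. It is only an approximation of what a change-point routine such as findchangepts does, written under stated assumptions: the function name `two_regression_changepoint`, the `min_points` argument, and the synthetic trace are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def two_regression_changepoint(t, y, min_points=3):
    """Find the single split that best divides (t, y) into two linear fits.

    Returns (time of first point in the second segment, slope 1, slope 2).
    Because exactly two lines are fitted, exactly one transition is returned.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(t) < 2 * min_points:
        raise ValueError("not enough points for two regressions")

    best = None
    for k in range(min_points, len(t) - min_points + 1):
        sse = 0.0
        slopes = []
        for seg_t, seg_y in ((t[:k], y[:k]), (t[k:], y[k:])):
            slope, intercept = np.polyfit(seg_t, seg_y, 1)
            residual = seg_y - (slope * seg_t + intercept)
            sse += np.sum(residual ** 2)
            slopes.append(slope)
        if best is None or sse < best[0]:
            best = (sse, k, slopes[0], slopes[1])

    _, k, slope1, slope2 = best
    return t[k], slope1, slope2

# Hypothetical trace: fast accumulation of events, then slow accumulation
t = np.arange(20.0)
omega = np.where(t < 10, 2.0 * t, 25.0 + 0.5 * (t - 10))
print(two_regression_changepoint(t, omega))
```

      Plotting the distribution of the returned slope difference and transition time across many traces is then one way to check whether the population looks bimodal.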

      Line 113-4: I was confused by the distinction between the probability of observing an event and the propensity for it to occur. Are the authors implying that some events occur but are not observed?

      We apologize for this confusion, and added some phrasing in Lines 115-130 to address this. The propensity is analogous to the rate of a reaction. Given this rate, the probability of seeing Ω+1 reorientations in the infinitesimal time interval dt is the product of the propensity and the probability the current state is Ω reorientations.

      Line 120: Shouldn't propensity at t = 0 be alpha + beta?

      Yes, thank you for catching this. We fixed it.

      Why was it necessary to posit two decay processes (equations 2 and 5?). Wouldn't one suffice?

      Thank you, we have added some text to clarify this point (lines 129-132). The Gillespie algorithm models discrete temporal events, which are explicitly dependent on the current state of the system. Since the propensity itself is changing in time, it implies that it is coupled to another state variable that is changing in time, i.e. another propensity. Since an exponential decay is sufficient to model the decay in reorientations, this implies that the reorientation propensity is coupled to a first order decay propensity (equations 4-5).
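
      The coupling of two propensities described above can be sketched with a standard Gillespie simulation. This is an illustration under assumptions, not the authors' implementation: the manuscript's equations are not reproduced in this response, so the functional forms used here (a reorientation propensity of alpha + beta * M / M0, which equals alpha + beta at t = 0, and a first-order decay propensity of gamma * M) and all names and parameter values are hypothetical.

```python
import numpy as np

def simulate_reorientations(alpha, beta, gamma, m0, t_end, rng):
    """Gillespie simulation of one worm; returns reorientation times.

    Two coupled processes (assumed forms, see lead-in):
      a1 = alpha + beta * M / M0   reorientation, leaves M unchanged
      a2 = gamma * M               first-order decay, M -> M - 1
    """
    t = 0.0
    m = m0
    event_times = []
    while True:
        a1 = alpha + beta * m / m0
        a2 = gamma * m
        a_total = a1 + a2
        if a_total <= 0:
            break  # nothing can happen any more
        # Waiting time to the next event of either kind
        t += rng.exponential(1.0 / a_total)
        if t >= t_end:
            break
        # Choose which event occurred, in proportion to its propensity
        if rng.random() * a_total < a1:
            event_times.append(t)
        else:
            m -= 1
    return np.array(event_times)

# Hypothetical parameters: graded decay (M0 = 100) versus binary switch (M0 = 1)
rng = np.random.default_rng(1)
graded = simulate_reorientations(0.2, 2.0, 0.1, 100, 45.0, rng)
binary = simulate_reorientations(0.2, 2.0, 0.1, 1, 45.0, rng)
print(len(graded), len(binary))
```

      With M0 = 1 the decay process fires at most once, so each simulated trace contains a single abrupt rate change; with large M0 the rate decays in many small steps, and any apparent abrupt change in an individual trace arises from sampling noise alone.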

      Line 145: ...sudden changes in [reorientation rate] are not due to...

      Thank you, we have corrected this (Line 157).

      Fig. 2d: Legend implies (but fails to state) that each dot is a worm, raising the question of how single worms with multiple transitions were plotted in this graph as they would have more than one transition point.

      Thank you, we updated the legend. Multiple transitions are not quantified with the two-regression approach. Prior observations, such as by López-Cruz, were simply done by eye.

      Line 153: Does i denote either process 1 or 2?

      Yes, i is the subscript for each propensity ai. We have added text on line 166 to clarify this.

      Line 159: Confusing. If an "event" is a reorientation event and a "transition" is a discrete change in slope of Omega vs t, then "The probability that no events will occur for ALL transitions in this time interval" makes no sense.

      Thank you, we have reworded this part (Lines 169-172) to be clearer.

      Equation 17: Unclear what index i refers to.

      Thank you, we have changed this index to j, and modified the text on line 228 to reflect this.

      Line 227-9: Unclear how collisions are thought to have caused the shift in experimental distribution.

      We have clarified the text on lines 246 and 250. Collisions are not being referred to here, but instead the crossing of pheromone trails. This is purely speculative.

      Line 310-317. If M rises on food, then worms should reorient more on food than after long times off food, when M has decayed. But worms don't reorient much on food; they behave as though M is low. This seems like a contradiction, unless one supposes instead that M is low on food and after long times off food but spikes when food is removed.

      Thank you, we have added clarifying language on lines 333-336 to address this point. Worm behavior is fundamentally different on food, as worms transition to a dwell/roam behavioral dynamic which is fundamentally different than foraging behavior while off food.

    1. eLife Assessment

      This useful study describes distinctive characteristics of dentate gyrus granule cells and semilunar cells that are recruited during contextual memory processing. The study provides solid evidence to suggest mechanisms that may be involved in the recruitment of neurons into memory engrams in the dentate gyrus.

    2. Reviewer #1 (Public review):

      Dovek and colleagues aimed at investigating the cellular and circuitry mechanisms underlying the recruitment of dentate gyrus neurons (including two morpho-physiologically-distinct subpopulations of excitatory cells called granular cells or GCs, and semilunar cells or SGCs) into memory representations, also known as engrams. To this end, the authors used TRAP2 mice to investigate the dentate gyrus "engram" neurons that were activated or not (i.e., labeled or not) in a non-fear-based context (mostly enriched environment or EE, but also Barnes Maze or BM).

      A significant proportion of dentate gyrus neurons are labeled after EE exposure (35%) or after BM acquisition (15%). SGCs, distinguished from GCs using morphology-based classification, showed disproportionately context-dependent recruitment. Consistent with previous observations (Erwin et al., 2022), SGCs account for a third of behaviorally recruited "engram" neurons, although they represent less than 5% of excitatory neurons in the dentate gyrus.

      Then, the authors compared the intrinsic physiological properties of GCs and SGCs that are recruited or not during EE. Consistent with previous observations (Williams et al., 2007, Afrasiabi et al., 2022), SGCs and GCs exhibited numerous differences (e.g., Rin, firing frequency) regardless of whether they were behaviorally activated or not. Differences in physiology between excitatory neuron subtypes might explain the preferential recruitment of SGCs. Interestingly, "engram" SGCs displayed lower values of adaptation in firing rate than non-recruited SGCs.

      To examine how GCs and SGCs activated during EE are integrated into the local dentate gyrus microcircuits, the authors next performed a dual patch-clamp recording combined with wide-field optogenetics. Despite the presence of spontaneous EPSCs, no direct functional glutamatergic interconnection was observed between pairs of "engram" GCs and SGCs. In addition, although optogenetic stimulation of a large, random, population of neurons evokes IPSCs (indicating efficient lateral inhibition as in Stefanelli et al., 2016), the specific stimulation of behaviorally recruited GCs or SGCs rarely elicits IPSCs onto surrounding non-engram excitatory neurons.

      To assess whether neurons recruited or not during EE receive differential glutamatergic drive, the authors recorded spontaneous excitatory inputs received by labeled and unlabeled GCs and SGCs. They observed that sEPSCs in labeled GCs and SGCs are more frequent and larger than in unlabeled GCs and SGCs, respectively.

      Last, the authors investigated whether neurons (without discriminating GCs and SGCs) recruited in the same context were characterized by a higher propensity to receive temporally correlated inputs. To this end, they performed dual patch-clamp and analyzed the temporal correlation of spontaneous EPSCs received by pairs of neurons (either two dentate gyrus "engram" neurons, or one "engram" neuron and one "non-engram" neuron in an EE context). They observed that the temporal correlation of excitatory events received by pairs of engram neurons was greater than that of pairs of neurons that do not belong to the same ensemble, and that expected by chance.
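
      As an illustration of what a temporal correlation between the inputs to two cells can mean in practice, the sketch below builds a cross-correlogram from two lists of detected event times. It is a generic construction under assumptions, not the analysis used in the study: the function name `cross_correlogram`, the lag window, the bin width, and the example event times are hypothetical.

```python
import numpy as np

def cross_correlogram(times_a, times_b, max_lag, bin_width):
    """Histogram of lags (event in b minus event in a) within +/- max_lag.

    Bins are arranged so that one bin is centered on zero lag; a peak in
    that bin relative to the flanks suggests near-synchronous events.
    All pairwise lags are formed, so memory grows as len(a) * len(b).
    """
    a = np.asarray(times_a, dtype=float)
    b = np.asarray(times_b, dtype=float)
    half = int(round(max_lag / bin_width))
    edges = (np.arange(-half, half + 2) - 0.5) * bin_width
    lags = (b[None, :] - a[:, None]).ravel()
    counts, _ = np.histogram(lags, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts

# Hypothetical event times (seconds) for two simultaneously recorded cells
cell_1 = np.array([1.0, 2.0, 3.0])
cell_2 = np.array([1.0, 2.0, 3.0])
centers, counts = cross_correlogram(cell_1, cell_2, max_lag=0.05, bin_width=0.01)
print(centers[np.argmax(counts)], counts.max())
```

      In a real analysis the zero-lag count would be compared against a chance level, for example from shuffled or jittered event times, before concluding that two cells share input.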

      Altogether, the data suggest that the context-dependent recruitment of dentate gyrus excitatory neurons, particularly SGCs is correlated to distinctive intrinsic properties and (correlated) excitatory afferent. Contrary to a leading hypothesis, the authors found no evidence that recruited neurons drive robust feedforward excitation of other engram neurons or feedback inhibition of non-engram neurons.

      Strengths:

      This article provides some information about the mechanisms that may be involved in the recruitment of neural ensembles that form non-fear-based memory engrams in the dentate gyrus. I find it interesting that the authors considered not only granular cells, the main population of excitatory neurons in the dentate gyrus, but also a sparse subpopulation of semilunar cells, a relatively understudied type of dentate excitatory neuron.

      Weakness:

      Most of the data presented are descriptive and based on correlation rather than causation.

    3. Reviewer #2 (Public review):

      Summary:

      The authors use the TRAP2 mouse line to label dentate gyrus cells active during an enriched environment paradigm and cut brain slices from these animals one week later to determine whether granule cells (GC) and semilunar granule cells (SGC) labelled during the exposure share common features. They particularly focus on the role of SGCs and potential circuit mechanisms by which they could be selectively embedded in the labelled assembly. The authors claim that SGCs are disproportionately recruited into IEG-expressing assemblies due to intrinsic firing characteristics but cannot identify any contributing circuit connectivity motifs in the slice preparation, although they claim that an increased correlation between spontaneous synaptic currents in the slice could signify common synaptic inputs as the source of assembly formation.

      Strengths:

      The authors chose a timely and relevant question, namely, how memory-bearing neuronal assemblies, or 'engrams', are established and maintained in the dentate gyrus. After the initial discovery of such memory-specific ensembles of immediate-early gene expressing engrams in 2012 (Ramirez et al.) this issue has been explored by several high-profile studies that have considerably expanded our understanding of the underlying molecular and cellular mechanisms, but still leave a lot of unanswered questions.

      Weaknesses:

      (1) The authors claim that recurrent excitation from SGCs onto GCs or other SGCs is irrelevant because they did not find any connections in 32 simultaneous recordings (plus 63 in the next experiment). Without a demonstration that other connections from SGCs (e.g. onto mossy cells or interneurons) are preserved in their preparation and if so at what rates, it is unclear whether this experiment is indicative of the underlying biology or the quality of the preparation. The argument that spontaneous EPSCs are observed is not very convincing as these could equally well arise from severed axons (in fact we would expect that the vast majority of inputs are not from local excitatory cells). The argument on line 418 that SGCs have compact axons isn't particularly convincing either given that the morphologies from which they were derived were also obtained in slice preparations and would be subject to the same likelihood of severing the axon. Finally, even in paired slice recordings from CA3 pyramidal cells the experimentally detected connectivity rates are only around 1% (Guzman et al., 2016). The authors would need to record from a lot more than 32 pairs (and show convincing positive controls regarding other connections) to make the claim that connectivity is too low to be relevant.

      The authors now provide evidence that at least some synaptic connections are preserved by recruiting GC assemblies with channelrhodopsin, resulting in feedback inhibition which supports their argument.

      (2) Another concern is that optogenetic GC stimulation rarely ever evokes feedback inhibition onto other cells, which contrasts with both other in vitro (e.g. Braganza et al., 2020) and in vivo (Stefanelli et al., 2016) studies. Without a convincing demonstration that monosynaptic connections between SGCs/GCs and interneurons in both directions are preserved at least at the rates previously described in other slice studies (e.g. Geiger et al., 1997, Neuron, Hainmueller et al., 2014, PNAS, Savanthrapadian et al., 2014, J. Neurosci), this observation is difficult to interpret. The authors now provide evidence that at least some synaptic connections are preserved by stimulating a random subset of granule cells optogenetically, although it still remains unclear how the rate of connectivity compares to other studies or a live organism.

      (3) Probably the most convincing finding in this study is the higher zero-time lag correlation of spontaneous EPSCs in labelled vs. unlabelled pairs. Unfortunately, the authors use spontaneous EPSCs to begin with, which likely represent a mixture of spontaneous release from severed axons, minis, and coordinated discharge from intact axon segments or entire neurons, making it very hard to determine the meaning and relevance of this finding. The authors now show the baseline EPSC rates and conventional cross-correlograms (CCG; see e.g. English et al., 2017, Neuron; Senzai and Buzsaki, 2017, Neuron), lending more support to this conclusion.

      (4) Finally, one of the biggest caveats of the study is that the ensemble is labelled a full week before the slice experiment and thereby represents a latent state of a memory rather than encoding, consolidation, or recall processes. The authors acknowledge that in the discussion but they should also be mindful of this when discussing other (especially in vivo) studies and comparing their results to these. For instance, Pignatelli et al 2018 show drastic changes in GC engram activity and features driven by behavioral memory recall, so the results of the current study may be very different if slices were cut immediately after memory acquisition (if that was possible with a different labelling strategy), or if animals were re-exposed to the enriched environment right before sacrificing the animal. The authors discuss this limitation appropriately.

      There are also a few minor issues limiting the extent of interpretations of the data:

      (1) Only about 7% of the 'engram' cells are re-activated one week after exposure (line 147), it is unclear how meaningful this assembly is given the high number of cells that may either be labelled unrelated to the EE or no longer be part of the memory-related ensemble.

      (2) Line 215: The wording '32 pairwise connections examined' suggests that there actually were synaptic connections; would recommend altering the wording to 'simultaneously recorded cells examined' to avoid confusion.

    4. Reviewer #3 (Public review):

      Summary:

      The study explores the cellular and circuit features that distinguish dentate gyrus semilunar granule cells and granule cells activated during contextual memory formation. The authors tag memory and enriched environment-activated dentate granule cells and semilunar granule cells and show their reactivation in an appropriate context a week later. They perform patch clamp recordings from activated and surrounding neurons to understand the cellular driving of the selective activation of semilunar granule cells and granule cells. Authors perform dual patch clamp recordings from various pairs of labeled semilunar granule cells, labeled granule cells, unlabeled granule cells, and unlabeled semilunar granule cells. The sustained firing of semilunar granule cells explained their preferential activation. In addition, activated neurons received correlated inputs.

      Strengths:

      The authors confirmed the engram cell properties of activated semilunar granule cells and granule cells in two different paradigms, validating these findings using an enriched environment paradigm.

      The authors carefully separate semilunar granule cells from granule cells, using electrophysiology and morphology. Cell filling to confirm morphology further strengthens confidence.

      The dual patch recordings, which are technically challenging, are carefully performed, and the presence of synaptic activity is confirmed.

      The authors report that sEPSCs recorded from labelled SGCs are more frequent, higher in amplitude, and temporally correlated than their counterparts.

      The authors provide evidence that lateral inhibition is not playing a role in the selective activation of SGCs during contextual learning.

      Exclusive use of slice physiology limits some of these conclusions due to the shearing of connections during the slicing process.

    5. Author Response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      (1) I think the article is a little too immature in its current form. I'd recommend that the authors work on their writing. For example, the objectives of the article are not completely clear to me after reading the manuscript, composed of parts where the authors seem to focus on SGCs, and others where they study "engram" neurons without differentiating the neuronal type (Figure 5). The next version of the manuscript should clearly establish the objectives and sub-aims.

      We now provide clarification for focusing on the labeling status versus the cell types in figure 5. Since figure 5 focuses on inputs to labeled pairs versus labeled-unlabeled pairs, the pairs include mixed groups with GCs and SGCs. Since the question pertains to inputs rather than cell types, we did not specifically distinguish the cell types. This is now explained in the text on page 15: “Note that since the intent was to determine the input correlation depending on labeling status of the cell pairs rather than based on cell type, we do not explicitly consider whether analyzed cell pairs included GCs or SGCs.”

      (2) In addition, some results are not entirely novel (e.g., the disproportionate recruitment as well as the distinctive physiological properties of SGCs), and/or based on correlations that do not fully support the conclusions of the article. In addition to re-writing, I believe that the article would benefit from being enriched with further analyses or even additional experiments before being resubmitted in a more definitive form.

      We now indicate that the data comparing labeled versus unlabeled SGCs are novel. Moreover, we also highlight (1) that recruitment of SGCs has not been previously examined in Barnes Maze or Enriched Environment, (2) that our unbiased morphological analysis of SGC recruitment is more robust than subsampling of recorded neurons in prior studies, and (3) that our data show that prior studies may have overestimated SGC recruitment to engrams. Thus, the data characterized as “not novel” are essential for appropriate analysis of behaviorally tagged neurons, which is the thrust of our study.

      Reviewer #2 (Public Review):

      (1) The authors conclude that SGCs are disproportionately recruited into cfos assemblies during the enriched environment and Barnes maze task given that their classifier identifies about 30% of labelled cells as SGCs in both cases and that another study using a different method (Save et al., 2019) identified less than 5% of an unbiased sample of granule cells as SGCs. To make matters worse, the classifier deployed here was itself established on a biased sample of GCs patched in the molecular layer and granule cell layer, respectively, at even numbers (Gupta et al., 2020). The first thing the authors would need to show to make the claim that SGCs are disproportionately recruited into memory ensembles is that the fraction of GCs identified as SGCs with their own classifier is significantly lower than 30% using their own method on a random sample of GCs (e.g. through sparse viral labelling). As the authors correctly state in their discussion, morphological samples from patch-clamp studies are problematic for this purpose because of inherent technical issues (i.e. easier access to scattered GCs in the molecular layer).

      We now clarify, on page 9, that a trained investigator classified cell types based on predefined morphological criteria.  No automated classifiers were used to assign cell types in the current study.

      (2) The authors claim that recurrent excitation from SGCs onto GCs or other SGCs is irrelevant because they did not find any connections in 32 simultaneous recordings (plus 63 in the next experiment). Without a demonstration that other connections from SGCs (e.g. onto mossy cells or interneurons) are preserved in their preparation and if so at what rates, it is unclear whether this experiment is indicative of the underlying biology or the quality of the preparation. The argument that spontaneous EPSCs are observed is not very convincing as these could equally well arise from severed axons (in fact we would expect that the vast majority of inputs are not from local excitatory cells). The argument on line 418 that SGCs have compact axons isn't particularly convincing either given that the morphologies from which they were derived were also obtained in slice preparations and would be subject to the same likelihood of severing the axon. Finally, even in paired slice recordings from CA3 pyramidal cells the experimentally detected connectivity rates are only around 1% (Guzman et al., 2016). The authors would need to record from a lot more than 32 pairs (and show convincing positive controls regarding other connections) to make the claim that connectivity is too low to be relevant.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al (2016) identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system supports the feedback inhibitory circuit which requires GC/SGC to hilar neuron synapses.

      (3) Another troubling sign is the fact that optogenetic GC stimulation rarely ever evokes feedback inhibition onto other cells, which contrasts with both other in vitro (e.g. Braganza et al., 2020) and in vivo (Stefanelli et al., 2016) studies. Without a convincing demonstration that monosynaptic connections between SGCs/GCs and interneurons in both directions are preserved at least at the rates previously described in other slice studies (e.g. Geiger et al., 1997, Neuron, Hainmueller et al., 2014, PNAS, Savanthrapadian et al., 2014, J. Neurosci), the notion that this setting could be closer to naturalistic memory processing than the in vivo experiments in Stefanelli et al. (e.g. lines 443-444) strikes me as odd. In any case, the discussion should clearly state that compromised connectivity in the slice preparation is likely a significant confound when comparing these results.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system in our studies supports the feedback inhibitory circuit detailed in prior studies. We also clarify that the Stefanelli study labeled random neurons and did not examine natural behavioral engrams, and we discuss (on page 20) the consistency of our results with those of Braganza et al 2020.

      (4) Probably the most convincing finding in this study is the higher zero-time lag correlation of spontaneous EPSCs in labelled vs. unlabeled pairs. Unfortunately, the fact that the authors use spontaneous EPSCs to begin with, which likely represent a mixture of spontaneous release from severed axons, minis, and coordinated discharge from intact axon segments or entire neurons, makes it very hard to determine the meaning and relevance of this finding. At the bare minimum, the authors need to show if and how strongly differences in baseline spontaneous EPSC rates between different cells and slices are contributing to this phenomenon. I would encourage the authors to use low-intensity extracellular stimulation at multiple foci to determine whether labelled pairs really share higher numbers of input from common presynaptic axons or cells compared to unlabeled pairs as they claim. I would also suggest the authors use conventional Cross correlograms (CCG; see e.g. English et al., 2017, Neuron; Senzai and Buzsaki, 2017, Neuron) instead of their somewhat convoluted interval-selective correlation analysis to illustrate codependencies between the event time series. The references above also illustrate a more robust approach to determining whether peaks in the CCGs exceed chance levels.

      We have included data on sEPSC frequency in the recorded cell pairs (Supplemental Fig 4) and have also conducted additional experiments and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). We also include data from new experiments to show that over 50% of the sEPSCs represent action potential driven events (Supplemental Fig 3).

      We thank the reviewer for the suggestion to explore alternative methods of analyses including CCGs to further strengthen our findings. We have now conducted CCGs on the same data set and report that “The dynamics of the cross-correlograms generated from our data sets using previously established methods to evaluate monosynaptic connectivity (Bartho et al., 2004; Senzai and Buzsaki, 2017) paralleled that of the CCP plots (Supplemental Fig. 6) illustrating that the methods similarly capture co-dependencies between event time series.” We note, here, that while the CCG and CCP are qualitatively similar, the magnitudes of the peaks were different, due to the sparseness of synaptic events.
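
      For readers unfamiliar with the cross-correlogram (CCG) discussed above, the basic computation can be sketched as a histogram of time lags between events in two simultaneously recorded cells. This is a minimal illustration only, not the authors' analysis code: the function name, the window, and the bin width are assumptions chosen for the example, and the chance-level testing described in the cited references is omitted.

```python
import numpy as np

def cross_correlogram(times_a, times_b, window=0.05, bin_size=0.01):
    """Count lags (event in B minus event in A) in bins centred on multiples of bin_size.

    times_a, times_b: event times in seconds (e.g. detected sEPSC onsets).
    window, bin_size: illustrative defaults, not values taken from the manuscript.
    """
    times_a = np.asarray(times_a, dtype=float)
    times_b = np.sort(np.asarray(times_b, dtype=float))

    # Build bin edges so that one bin is centred exactly on zero lag.
    n_half = int(round(window / bin_size))
    edges = (np.arange(-n_half, n_half + 2) - 0.5) * bin_size
    half_span = edges[-1]

    # Collect every lag that falls inside the histogram range.
    lags = []
    for t in times_a:
        lo = np.searchsorted(times_b, t - half_span, side="left")
        hi = np.searchsorted(times_b, t + half_span, side="right")
        lags.append(times_b[lo:hi] - t)
    lags = np.concatenate(lags) if lags else np.empty(0)

    counts, _ = np.histogram(lags, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts
```

      A peak in the bin centred on zero lag indicates that the two cells tend to receive events at nearly the same time. Whether such a peak exceeds chance must be assessed separately, for example against jittered or shuffled event times.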

      (5) Finally, one of the biggest caveats of the study is that the ensemble is labelled a full week before the slice experiment and thereby represents a latent state of a memory rather than encoding, consolidation, or recall processes. The authors acknowledge that in the discussion but they should also be mindful of this when discussing other (especially in vivo) studies and comparing their results to these. For instance, Pignatelli et al 2018 show drastic changes in GC engram activity and features driven by behavioral memory recall, so the results of the current study may be very different if slices were cut immediately after memory acquisition (if that was possible with a different labelling strategy), or if animals were re-exposed to the enriched environment right before sacrificing the animal.

      As noted by the reviewer, we fully acknowledge and are cognizant of the concern that slices prepared a week after labeling may not reflect ongoing encoding. Although our data show that labeled cells are reactivated in higher proportion during recall, we have discussed this caveat and will include alternative experimental strategies in the discussion.

      Reviewer #3 (Public Review):

      (1) Engram cells are (i) activated by a learning experience, (ii) physically or chemically modified by the learning experience, and (iii) reactivated by subsequent presentation of the stimuli present at the learning experience (or some portion thereof), resulting in memory retrieval. The authors show that exposure to Barnes Maze and the enriched environment-activated semilunar granule cells and granule cells preferentially in the superior blade of the dentate gyrus, and a significant fraction were reactivated on re-exposure. However, physical or chemical modification by experience was not tested. Experience modifies engram cells, and a common modification is the Hebbian, i.e., potentiation of excitatory synapses. The authors recorded EPSCs from labeled and unlabeled GCs and SGCs. Was there a difference in the amplitude or frequency of EPSCs recorded from labeled and unlabeled cells?

      We have included data on sEPSC frequency in the recorded cell pairs (Supplemental Fig 4) and have also conducted additional experiments and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). We also include data from new experiments to show that over 50% of the sEPSCs represent action potential driven events (Supplemental Fig 3).

      (2) The authors studied five sequential sections, each 250 μm apart across the septotemporal axis, which were immunostained for c-Fos and analyzed for quantification. Is this an adequate sample? Also, it would help to report the dorso-ventral gradient since more engram cells are in the dorsal hippocampus. Slices shown in the figures appear to be from the dorsal hippocampus. 

      We thank the reviewer for the comment. We analyzed sections along the dorsoventral gradient. As explained in the methods, there is considerable animal-to-animal variability in the number of labeled cells, which is why we had to use matched littermate pairs in our experiments. This variability could render it difficult to tease apart dorsoventral differences.

      (3) The authors investigated the role of surround inhibition in establishing memory engram SGCs and GCs. Surprisingly, they found no evidence of lateral inhibition in the slice preparation. Interneurons, e.g., PV interneurons, have large axonal arbors that may be cut during slicing.

      Similarly, the authors point out that some excitatory connections may be lost in slices. This is a limitation of slice electrophysiology.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system supports the feedback inhibitory circuit detailed in prior studies. 

      We now discuss (page 21) that “the possibility that slice recordings lead to underestimation of feedback dendritic inhibition cannot be ruled out.”

      Reviewer #1 (Recommendations for the authors):

      (1) I struggle to understand the added value of the Barnes Maze data (Figures 1 and S1), since the authors then focus on the EE for practical reasons. In particular, the analysis of mouse performance (presented in supplemental Figure 1) does not seem traditional to me. For example, instead of the 3 classical exploration strategies (i.e., random, serial, direct), the authors describe 6, and assign each of these strategies a score based on vague criteria (why are "long corrected" and "focused research" both assigned a score of 0.5?). Unless I'm mistaken, no other classic parameters are described (e.g., success rate, latency, number of errors). If the authors decide to keep the BM results, I recommend better justifying its existence and adding more details, including in the method section. Otherwise, perhaps they should consider withdrawing it. Even if we had to use two different behavioral contexts, wouldn't it have made sense to use, in addition to the EE, the fear conditioning test, which is widely used in the study of engrams? Under these conditions (Stefanelli et al., 2016), the number of cells recruited after fear conditioning seems sufficient to reproduce the analyses presented in Figures 2-5 and determine whether or not lateral inhibition is dependent on the type of context (Stefanelli and colleagues suggest significant strong lateral inhibition during fear conditioning, whereas the data from Dovek and colleagues suggest quite the opposite after exposure to EE).

      The Barnes Maze data was included to evaluate the DG ensemble activation during a dentate-dependent, non-fear-based behavioral task. This is now introduced and explained in the results. We have now included plots of the primary latency and number of errors in finding the escape hole to confirm the improvement over time (Supplemental Fig. 1). We specifically used the BUNS analysis to evaluate the use of spatial strategy and show that by day 6, the day of tamoxifen induction, the mice are using a spatial strategy for navigation. Our approach to evaluate exploration strategy is based on criteria published in Illouz et al 2016. This is now detailed in the methods on page 25. We hope that the inclusion of the supplemental data and revisions to methods and results address the concerns regarding Barnes Maze experiments.

      Regarding Stefanelli et al., 2016, please note that the study adopted random labeling of neurons using a CaMKII promoter-driven reporter expression which they activated during spatial exploration of fear conditioning behaviors. As such, labeled neurons in the Stefanelli study were NOT behaviorally driven; rather, they were optically activated. This is now clarified in the text. The main drive for our study was to evaluate behaviorally tagged neurons, which is novel, distinct from the Stefanelli study, and, we would argue, more behaviorally realistic and relevant.

      Additionally, the lateral inhibition observed in Stefanelli et al was in response to activation of GCs labeled by virally mediated CaMKII-driven ChR2 expression. Using a similar labeling approach, new control data presented in Supplemental Fig. 3 show that we are fully able to replicate the lateral inhibition observed by Stefanelli et al. These control experiments further suggest that the sparse and distributed GC/SGC ensembles activated during non-aversive behavioral tasks may not be sufficient to elicit robust lateral inhibition as has been observed when a random population of adjacent neurons is activated. Our findings are also consistent with observations by Braganza et al., 2020. This is now discussed on page 21.

      (2) The authors recorded sEPSCs received by recruited and non-recruited GCs and SGCs after EE exposure. However, it appears that they studied them very little, apart from a temporal correlation analysis (Figure 5). Yet it would be interesting to determine whether or not the four neuronal populations possess different synaptic properties.

      What is the frequency and amplitude of sEPSCs in GCs and SGCs recruited or not after EE exposure? Similarly, can the author record the sIPSCs received by dentate gyrus engram and non-engram GCs and SGCs? If so, what is their frequency and amplitude?

      As suggested by the editorial comment #2, we now include data on the frequency and amplitude of the sEPSCs in GCs and SGCs used in our analysis of figure 5. Given the low numbers of unlabeled SGCs and labeled GCs in our paired recordings (Supplemental Fig. 5), we chose not to use this data set for analysis of cell-type and labeling based differences in EPSC parameters. However, we have previously reported that sIPSC frequency is higher in SGCs than in GCs. Additionally, we have identified that sEPSC frequency in SGCs is higher than in GCs (Dovek et al, in preprint, DOI: 10.1101/2025.03.14.643192).

      To specifically address reviewer concerns, we have conducted new recordings of EPSCs in a cohort of labeled and unlabeled GCs and SGCs and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). These experiments were conducted in TRAP2-tdT labeled cells, which were not stable in cesium-based recordings. As such, we deferred the IPSC analysis for later and restricted analysis to sEPSCs for this study.

      (3) Previous data showed that dentate gyrus neurons that are recruited or not in a given context could exhibit distinct morphological characteristics (Pléau et al. 2021) and biochemical content (Penk expression, Erwin et al., 2020). In order to enrich the electrophysiological data presented in Figure 2, could the authors take advantage of the biocytin filling to perform a morphological and biochemical comparison of the different neuronal types (i.e., GCs and SGCs recruited or not after EE)?

      Thank you for this suggestion. Unfortunately, detailed morphometry and biochemical analysis on labeled and unlabeled neurons was not conducted as part of this study as our focus was on circuit differences. In our experience, unless the sections are imaged soon after staining, the sections are suboptimal for detailed morphological reconstruction and analysis. Our ongoing studies suggest that PENK is an activity marker and not a selective marker for SGCs and we are undertaking transcriptomic analysis to identify molecular differences between GCs and SGCs. We respectfully submit that these experiments are outside the scope of this study.

      (4) Figures 3 and 4 show only schematic diagrams and representative data. No quantification is shown. Instead of pie charts showing the identity of each pair (which I find unnecessary), I would use pie charts representing the % of each pair in which an excitatory or inhibitory drive was recorded (with the corresponding n).

      Please note that we did not observe evoked synaptic potentials in any except one pair precluding the possibility of quantification. However, we submit that it is important for the readers to have information on the number of pairs and the types of pre-post synaptic pairs in which the connections were tested.

      (5) Figure 3: Given that GCs form very few recurrences in non-pathological conditions, it hardly surprises me that they form few or no local glutamatergic connections. In contrast, this result surprises me more for SGCs, whose axons form collaterals in the dentate gyrus granular and molecular layers (Williams et al., 2007; Save et al., 2019). To control the reliability of their conditions, could the authors check whether SGCs do indeed form connections with hilar mossy cells, as has been reported in the past? To test whether this lack of interconnectivity is specific to neurons belonging to the same engram (or not), could the authors test whether or not the stimulation of labeled GCs/SGCs (via membrane depolarization or even optogenetics) generates EPSCs in unlabeled GCs?

      As suggested by the reviewer, we have examined whether widefield optical activation of all labeled neurons, including GCs and SGCs, leads to EPSCs in unlabeled GCs (63 cells tested). However, we did not observe eEPSCs. This data is presented on page 13 (Fig 4F) in the results and discussed on page 20. Since the wide field stimulation should activate terminals and lead to release even if the axon is severed, our data suggest the glutamatergic drive from SGC to GC may be limited.

      As noted above, we have demonstrated the presence of lateral inhibition consistent with data in Stefanelli et al in our new supplementary figure 3. We have also shown that sustained SGC firing upon perforant path stimulation is associated with sustained firing in hilar interneurons (Afrasiabi et al., 2022), indicating the presence of SGC to hilar connectivity in our slice preparation. Therefore, we chose not to undertake challenging 2P guided paired recording of SGCs and mossy cells adjacent to SGC axon terminals reported in Williams et al 2007 to replicate the 9% SGC to MC synaptic connections. These 2P guided slice physiology studies are outside the technical scope of our study.

      (6) Figure 4: The results are relatively in contradiction with the strong lateral inhibition reported in the past (Stefanelli et al., 2016), but the experimental conditions are different in the two studies. Stimulation of a single labeled GC or SGC may not be sufficient to activate an inhibitory neuron, and for the latter to inhibit an unlabeled GC or SGC. Is it possible to measure the sIPSCs received by unlabelled neurons during optogenetic stimulation of all labelled neurons? Could the authors verify whether under their experimental conditions GCs and SGCs do indeed form connections with interneurons, as reported before? Finally, Stefanelli and colleagues (2016) suggest that lateral inhibition is provided by dendrite-targeting somatostatin interneurons. If the authors are recording in the soma, could they underestimate more distal inhibitory inputs? If so, could they record the dendrites of unlabeled neurons?

      Our new control data (Supplementary Fig. 3), using an AAV mediated CaMKII promoter-driven random expression of ChR2 in GCs similar to Stefanelli et al (2016), demonstrate our ability to replicate the lateral inhibition observed by Stefanelli et al. (2016). Thus, our findings more accurately represent lateral inhibition supported by a sparse behaviorally labeled cohort than the findings of Stefanelli et al based on randomly labeled neurons. This is now discussed on page 22-23. We respectfully submit that dendritic recordings are outside the scope of the current study.

      We also discuss the possibility that somatic recordings may undersample dendritic inhibitory inputs on page 23: “the possibility that slice recordings lead to underestimation of feedback dendritic inhibition cannot be ruled out.”

      (7) Figure 5: For ease of reading, I would substantially simplify the Results section related to Figure 5, keeping only the main general points of the analysis and the results themselves. The details of the analysis strategy, and the justification for the choices made, are better placed in the Method section (I advise against "data not shown").

      We thank the reviewer for the suggestion to improve accessibility of the results and have moved text related to justification of strategy and controls to the methods. We have also removed references to data not shown.

      (8) Figure 5: why do the authors no longer discriminate between GCs and SGCs?

      Since figure 5 focuses on inputs to labeled pairs versus labeled-unlabeled pairs the pairs include mixed groups with GCs and SGCs. Since the question pertains to inputs rather than cell types, we did not specifically distinguish the cell types. This is now explained in the text on page 15.

      (9) Figure 5: I would like to know more about the temporally connected inputs and their implication in context-dependent recruitment of dentate gyrus neurons. What could be the origin of the shared input received by the neurons recruited after EE exposure? For example, do labeled neurons receive more (temporally correlated or not) inputs from the entorhinal cortex (or any other upstream brain region) than unlabeled neurons? Is there any way (e.g., PP stimulation or any kind of manipulation) to test the causal relationship between temporally correlated input and the context-dependent recruitment of a given neuron?

      We appreciate the reviewer’s comments on the need to examine the source and nature of the correlated inputs to behaviorally labeled neurons. However, the suggested experiments are nontrivial as artificial stimulation of afferent fibers is unlikely to be selective for labeled and unlabeled cells. Given the complexities in design, implementation and interpretation of these experiments we respectfully submit that these are outside the scope of the current study.

      Reviewer #2 (Recommendations for the authors):

      There are a few minor issues limiting the extent of interpretations of the data:

      (1) Only about 7% of the 'engram' cells are re-activated one week after exposure (line 147), it is unclear how meaningful this assembly is given the high number of cells that may either be labelled unrelated to the EE or no longer be part of the memory-related ensemble.

      We now discuss (page 22-23) that the % labeling is consistent with what has been observed in the DG 1 week after fear conditioning (DeNardo et al., 2019) and discuss the caveat that all labeled cells may not represent an engram.  

      (2) Line 215: The wording '32 pairwise connections examined' suggests that there actually were synaptic connections, would recommend altering the wording to 'simultaneously recorded cells examined' to avoid confusion.

      Revised as suggested

    1. eLife Assessment

      This valuable work explores the timely idea that aperiodic activity in human electrophysiology recordings is dynamically modulated in response to task events in a manner that may be relevant for behavioral performance. Moreover, the authors present solid evidence that, in some circumstances, these aperiodic changes might be misinterpreted as oscillatory changes. While many aspects of the manuscript were intriguing, there was a sense that some of the interpretations were overstated - for instance the claim that aperiodic activity distorts interpretations of theta specifically, versus having a more nuanced impact on the time-frequency representation. Softening some of the language may further improve the manuscript.

    2. Reviewer #1 (Public review):

      Summary:

      Frelih et al. investigated both periodic and aperiodic activity in EEG during working memory tasks. In terms of periodic activity, they found post-stimulus decreases in alpha and beta activity, while in terms of aperiodic activity, they found a bi-phasic post-stimulus steepening of the power spectrum, which was weakly predictive of performance. They conclude that it is crucial to properly distinguish between aperiodic and periodic activity in event-related designs as the former could confound the latter. They also add to the growing body of research highlighting the functional relevance of aperiodic activity in the brain.
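
      To make the notion of "steepening of the power spectrum" concrete, the sketch below estimates an aperiodic exponent as the negative slope of a straight line fitted to the spectrum in log-log space. It is a deliberately simplified illustration and not the spectral parameterization procedure used in the manuscript, which additionally models oscillatory peaks; the function name, frequency range, and synthetic spectra are assumptions made for the example.

```python
import numpy as np

def fit_aperiodic(freqs, power):
    """Fit log10(power) = offset - exponent * log10(freqs) by least squares.

    Returns (offset, exponent). A larger exponent means a steeper spectrum.
    No oscillatory peaks are modelled, so this is only a rough estimate.
    """
    log_f = np.log10(np.asarray(freqs, dtype=float))
    log_p = np.log10(np.asarray(power, dtype=float))
    slope, offset = np.polyfit(log_f, log_p, 1)
    return offset, -slope

# Synthetic, purely aperiodic spectra (hypothetical values for illustration).
freqs = np.arange(2.0, 41.0)
baseline_power = 10.0 * freqs ** -1.0   # exponent 1.0
steeper_power = 10.0 * freqs ** -1.5    # exponent 1.5, i.e. a steeper spectrum

baseline_offset, baseline_exponent = fit_aperiodic(freqs, baseline_power)
steeper_offset, steeper_exponent = fit_aperiodic(freqs, steeper_power)
```

      Because a change in exponent alters power across all frequencies at once, a band-limited power measure taken relative to baseline can change without any change in a genuine oscillation, which is the confound the reviewers discuss.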

      Strengths:

      This is a well-written, timely paper that could be of interest to the field of cognitive neuroscience, especially to researchers investigating the functional role of aperiodic activity. The authors describe a well-designed study that looked at both the oscillatory and non-oscillatory aspects of brain activity during a working memory task. The analytic approach is appropriate, as a state-of-the-art toolbox is used to separate these two types of activity. The results support the basic claim of the paper that it is crucial to properly distinguish between aperiodic and periodic activity in event-related designs as the former could confound the latter. They also add to the growing body of research highlighting the functional relevance of aperiodic activity in the brain. Commendably, the authors include replications of their key findings on multiple independent data sets.

      Weaknesses:

      The authors also claim that their results speak to the interplay between oscillatory and non-oscillatory activity, and crucially, that task-related changes in the theta frequency band - often attributed to neural oscillations in the field - are in fact only a by-product of non-oscillatory changes. I believe these claims are too bold and are not supported by compelling evidence in the paper. Some control analyses - e.g., contrasting the scalp topographies of purportedly theta and non-oscillatory effects - could help strengthen the latter argument, but it may be safest to simply soften these two claims.

      In terms of the methodology used, I suggest the authors make it clearer to readers that the primary results were obtained on a sample of middle-aged-to-older-adults, some with subjective cognitive complaints, and note that while stimulus-locked event-related potentials (ERPs) were removed from the data prior to analyses, response-locked ERPs were not. This could potentially confound aperiodic findings. Contrasting the scalp topographies of response-related ERPs and the identified aperiodic components, especially the later one, could bring some clarity here too.

      I have also found certain parts of the introduction to be somewhat confusing.

      Comments on the latest version:

      The authors have addressed several of the weaknesses I noted in my original review, specifically, they softened their claims regarding the theta findings, while simultaneously strengthening these findings with additional analyses (using simulations as well as a new measure of rhythmicity, the phase autocorrelation function, pACF). Most of the other suggested control analyses were also implemented. While I believe the fact that the participants in the main sample were not young adults could be made even more explicit, and the potential interaction between age and aperiodic changes could be unpacked a little in the discussion, the age of the sample is definitely addressed upfront.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Frelih et al. investigate the relationship between aperiodic neural activity, as measured by EEG, and working memory performance, and compare this to the more commonly analyzed periodic, and in particular theta, measures that are often associated with such tasks. To do so, they analyze a primary dataset of 57 participants engaging in an n-back task, as well as a replication dataset, and use spectral parameterization to measure periodic and aperiodic features of the data, across time. In the revision, the authors have clarified some key points, and added a series of additional analyses and controls, including the use of an additional method, that help to complement the original analyses and further corroborate their claims. In doing so, they find both periodic and aperiodic features that relate to the task dynamics, but importantly, the aperiodic component appears to explain away what otherwise looks like theta activity in a more traditional analysis. This study therefore helps to establish that aperiodic activity is a task-relevant dynamic feature in working memory tasks and may be the underlying change in many other studies that reported 'theta' changes, but did not use methods that could differentiate periodic and aperiodic features.

      Strengths:

      Key strengths of this paper include that it addresses an important question - that of properly adjudicating which features of EEG recordings relate to working memory tasks - and in doing so provides a compelling answer, with important implications for considering prior work and contributing to understanding the neural underpinnings of working memory. The revision is improved by showing this using an additional analysis method. I do not find any significant faults or errors with the design, analysis, and main interpretations as presented by this paper, and as such, find the approach taken to be valid and well-enacted. The use of multiple variants of the working memory task, as well as a replication dataset, significantly strengthens this manuscript by demonstrating a degree of replicability and generalizability. This manuscript is also an important contribution to motivating best practices for analyzing neuro-electrophysiological data, including in relation to using baselining procedures. I think the updates in the revision have helped to clarify the findings and impact of this study.

      Weaknesses:

      Overall, I do not find any obvious weaknesses with this manuscript and its analyses that challenge the key results and conclusions. Updates through the revision have addressed my previous points about adding some additional notes on the methods and conclusions.

    4. Reviewer #3 (Public review):

      Summary:

      Using a specparam (1/f) analysis of task-evoked activity, the authors propose that "substantial changes traditionally attributed to theta oscillations in working memory tasks are, in fact, due to shifts in the spectral slope of aperiodic activity." This is a very bold and ambitious statement, and the field of event-related EEG would benefit from more critical assessments of the role of aperiodic changes during task events. Unfortunately, the data shown here does not support the main conclusion advanced by the authors.

      Strengths:

      The field of event-related EEG would benefit from more critical assessments of the role of aperiodic changes during task events. The authors perform a number of additional control analyses, including different types of baseline correction, ERP subtraction, as well as replication of the experiment with two additional datasets.

      Weaknesses:

      The authors did not first show that their first task successfully evoked theta power, nor that specparam is capable of quantifying the background around a short theta burst, nor that theta effects are different between baseline corrected vs. spectral parameterized quantification.

      Comments on revisions:

      The authors have completed a substantial revision based on the comments from all of the reviewers. Overall, the major claims of the initial report have been profoundly tempered, but more of the conclusions are supported by the data.

    1. eLife Assessment

      This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion. The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli, and the study involves perceptual reports from both humans and one monkey regarding whether there are one or two speeds in the stimulus. The study presents compelling evidence that (on average) MT neurons shift from faster-speed-takes-all at low speeds to representing the average of the two speeds at higher speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information could potentially be lost in an average response as described here, depending on assumptions about how MT activity is evaluated by other visual areas.

    2. Reviewer #1 (Public review):

      Summary:

      Most studies in sensory neuroscience investigate how individual sensory stimuli are represented in the brain (e.g., the motion or color of a single object). This study starts tackling the more difficult question of how the brain represents multiple stimuli simultaneously and how these representations help to segregate objects from cluttered scenes with overlapping objects.

      Strengths:

      The authors first document the ability of humans to segregate two motion patterns based on differences in speed. Then they show that a monkey's performance is largely similar; thus establishing the monkey as a good model to study the underlying neural representations.

      Careful quantification of the neural responses in the middle temporal area during the simultaneous presentation of fast and slow speeds leads to the surprising finding that, at low average speeds, many neurons respond as if the slowest speed is not present, while they show averaged responses at high speeds. This unexpected complexity of the integration of multiple stimuli is key to the model developed in this paper.

      One experiment in which attention is drawn away from the receptive field supports the claim that this is not due to the involuntary capture of attention by fast speeds.

      A classifier using the neuronal response and trained to distinguish single speed from bi-speed stimuli shows a similar overall performance and dependence on the mean speed as the monkey. This supports the claim that these neurons may indeed underlie the animal's decision process.

      The authors expand the well-established divisive normalization model to capture the responses to bi-speed stimuli. The incremental modeling (eq 9 and 10) clarifies which aspects of the tuning curves are captured by the parameters.

    3. Reviewer #3 (Public review):

      Summary:

      This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion.

      Strengths:

      The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli. The study presents compelling evidence that (on average) MT neurons shift from faster-speed-takes-all at low speeds to representing the average of the two speeds at higher speeds. An additional strength of the study is the inclusion of perceptual reports from both humans and one monkey participant performing a task in which they judged whether the stimuli involved one vs two different speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information is potentially lost in an average response as described here.

    1. eLife Assessment

      This valuable study uses tools of population and functional genomics to examine long non-coding RNAs (lncRNAs) in the context of human evolution. Analyses of computationally predicted human-specific lncRNAs and their genomic targets lead to the development of hypotheses regarding the potential roles of these genetic elements in human biology. The conclusions regarding evolutionary acceleration and adaptation, however, only incompletely take data and literature on human/chimpanzee genetics and functional genomics into account.

    2. Reviewer #2 (Public review):

      In this valuable manuscript, Lin et al attempt to examine the role of long non coding RNAs (lncRNAs) in human evolution, through a set of population genetics and functional genomics analyses that leverage existing datasets and tools. Although the methods are incomplete and at times inadequate, the results nonetheless point towards a possible contribution of long non coding RNAs to shaping humans, and suggest clear directions for future, more rigorous study.

      Comments on revisions:

      I thank the authors for their revision and changes in response to previous rounds of comments. As it had been nearly two years since I last saw the manuscript, I reread the full text to familiarise myself again with the findings presented. While I appreciate the changes made and think they have strengthened the manuscript, I still find parts of it a bit too speculative or hyperbolic. In particular, I think claims of evolutionary acceleration and adaptation require more careful integration with existing human/chimpanzee genetics and functional genomics literature. For example:

      Line 155: "About 5% of genes have significant sequence differences in humans and chimpanzees," This statement needs a citation, and a definition of what is meant by 'significant', especially as multiple lines below instead mention how it's not clear how many differences matter, or which of them, etc.

      line 187: "Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, suggesting that these DBSs are similar to HARs in evolution and have undergone human-specific evolution." I do not see any support for the inference here. Identifying HARs and acceleration relies on a far more thorough methodology than what's being presented here. Even generously, pairwise comparison between two taxa only cannot polarise the direction of differences; inferring human-specific change requires outgroups beyond chimpanzee.

      line 210: "Based on a recent study that identified 5,984 genes differentially expressed between human-only and chimpanzee-only iPSC lines (Song et al., 2021), we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences" I do not agree with the rationale for this claim, and do not agree that it supports the cutoff of 0.034 used below. I also find that my previous concerns with the very disparate numbers of results across the three archaics have not been suitably addressed.

      I also think that there is still too much of a tendency to assume that adaptive evolutionary change is the only driving force behind the observed results. As I've stated before, I do not doubt that lncRNAs contribute in some way to evolutionary divergence between these species, as do other gene regulatory mechanisms; the manuscript, however, leans on it being the sole, or primary, force, and that requires much stronger supporting evidence. Examples include, but are not limited to:

      line 230: "These results reveal when and how HS lncRNA-mediated epigenetic regulation influences human evolution." This statement is too speculative.

      Line 268: "yet the overall results agree well with features of human evolution." What does this mean? This section is too short and unclear.

      Line 325: "and form 198876 HS lncRNA-DBS pairs with target transcripts in all tissues." This has not been shown in this paper - sequence based analyses simply identify the *potential* to form pairs.

      Line 423: "Our analyses of these lncRNAs, DBSs, and target genes, including their evolution and interaction, indicate that HS lncRNAs have greatly promoted human evolution by distinctly rewiring gene expression." I do not agree that this conclusion is supported by the findings presented - this would require significant additional evidence in the form of orthogonal datasets.

      I also return briefly to some of my comments before, in particular on the confounding effects of gene length and transcript/isoform number. In their rebuttal the authors argued that there was no need to control for this, but this does in fact matter. A gene with 10 transcripts that differ in the 5' end has 10 times as many chances of having a DBS than a gene with only 1 transcript, or a gene with 10 transcripts but a single annotated TSS. When the analyses are then performed at the gene level, without taking into account the number of transcripts, this could introduce a bias towards genes with more annotated isoforms. Similarly, line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one. It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length, show up here too.
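
      The length confound described above can be made concrete with a small sketch. The records and gene names below are invented for illustration; the point is only that ranking DBSs by raw SNP count and ranking them by SNPs per base pair can pick out different genes.

```python
def snp_density(snp_count, dbs_length_bp):
    """SNPs per base pair, so that DBSs of different lengths are comparable."""
    return snp_count / dbs_length_bp

# Hypothetical DBS records (not data from the manuscript).
dbs_records = [
    {"gene": "geneA", "length_bp": 300, "snps": 15},  # long DBS, more SNPs in total
    {"gene": "geneB", "length_bp": 60, "snps": 6},    # short DBS, denser in SNPs
]

top_by_count = max(dbs_records, key=lambda r: r["snps"])["gene"]
top_by_density = max(
    dbs_records, key=lambda r: snp_density(r["snps"], r["length_bp"])
)["gene"]

print("top by raw SNP count:", top_by_count)
print("top by SNP density:  ", top_by_density)
```

      With these assumed numbers the longer DBS ranks first on raw count, while the shorter one ranks first once length is controlled for.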

    3. Author Response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary

      While DNA sequence divergence, differential expression, and differential methylation analysis have been conducted between humans and the great apes to study changes that "make us human", the role of lncRNAs and their impact on the human genome and biology has not been fully explored. In this study, the authors computationally predict HSlncRNAs as well as their DNA Binding sites using a method they have developed previously and then examine these predicted regions with different types of enrichment analyses. Broadly, the analysis is straightforward and after identifying these regions/HSlncRNAs the authors examined their effects using different external datasets.

      I no longer have any concerns about the manuscript as the authors have addressed my comments in the first round of review.

      We thank the reviewer for the valuable comments, which have helped us improve the manuscript.

      Reviewer #2 (Public Review):

      Lin et al attempt to examine the role of lncRNAs in human evolution in this manuscript. They apply a suite of population genetics and functional genomics analyses that leverage existing data sets and public tools, some of which were previously built by the authors, who clearly have experience with lncRNA binding prediction. However, I worry that there is a lack of suitable methods and/or relevant controls at many points and that the interpretation is too quick to infer selection. While I don't doubt that lncRNAs contribute to the evolution of modern humans, and certainly agree that this is a question worth asking, I think this paper would benefit from a more rigorous approach to tackling it.

      I thank the authors for their revisions to the manuscript; however, I find that the bulk of my comments have not been addressed to my satisfaction. As such, I am afraid I cannot say much more than what I said last time, emphasising some of my concerns with regards to the robustness of some of the analyses presented. I appreciate the new data generated to address some questions, but think it could be better incorporated into the text - not in the discussion, but in the results.

      We thank the reviewer for the careful reading and valuable comments. In this round of revision, we address the two main concerns: (1) there is a lack of suitable methods and/or relevant controls at many points, and (2) the interpretation is too quick to infer selection. Based on these comments, we have carefully revised all sections of the manuscript, including the Introduction, Results, Discussion, and Materials and Methods.

      In addition, we have performed two new analyses. Based on the two analyses, we have added one figure and two sections to Results, two sections to Materials and Methods, one figure to Supplementary Notes, and two tables to Supplementary Tables. These results were obtained using new methods and provided more support to the main conclusion.

      To be more thorough, we revisited the comments made in the first round and respond to them further. The following are point-by-point responses to the comments.

      Since many of the details in the Responses-To-Comments are available in published papers and eLife publishes Responses-To-Comments, we do not greatly revise supplementary notes to avoid ostensibly repeating published materials.

      “lack of suitable methods and/or relevant controls”.

      We carefully chose the methods, thresholds, and controls in the study; now, we provide clearer descriptions and explanations.

      (1) We have expanded the last paragraph in Introduction to briefly introduce the methods, thresholds, and controls.

      (2) In many places in Results and Materials and Methods, revisions are made to describe and justify methods, thresholds, and controls.

      (3) Some methods, thresholds, and controls have good consensus, such as FDR and genome-wide background, but others may not, such as the number of genes that greatly differ between humans and chimpanzees. Now, we describe our reasons for the latter situation. For example, we explain that “About 5% of genes have significant sequence differences in humans and chimpanzees, but more show expression differences due to regulatory sequences. We sorted target genes by their DBS affinity and, to be prudential, chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length<60 bp but binding affinity>36) to conduct over-representation analysis”.

      (4) We also carefully choose proper words to make descriptions more accurate.

      Responses to the suggestion “new data generated could be better incorporated into the text”.

      (1) We think that this sentence “The occurrence of HS lncRNAs and their DBSs may have three situations – (a) HS lncRNAs preceded their DBSs, (b) HS lncRNAs and their DBSs co-occurred, (c) HS lncRNAs succeeded their DBSs. Our results support the third situation and the rewiring hypothesis”, previously in Discussion, should be better in section 2.3. We have revised it and moved it into the second paragraph of section 2.3.

      (2) Our two new analyses generated new data, and we describe them in Results.

      (3) It is possible to move more materials from Supplementary Notes to the main text, but it is probably unnecessary because the main text currently has eight sub-sections, two tables, and four figures.

      Responses to the comment “the interpretation is too quick to infer selection”.

      (1) When using XP-CLR, iSAFE, Tajima's D, Fay-Wu's H, the fixation index (Fst), and linkage disequilibrium (LD) to detect selection signals, we used the widely adopted parameters and thresholds but did not mention this clearly in the original manuscript. Now, in the first sentence of the second paragraph of section 2.4, we add the phrase “with widely-used parameters and thresholds” (more details are available in section 4.7 and Supplementary Notes).

      (2) It is not the first time we used these tests. Actually, we used these tests in two other studies (Tang et al. Uncovering the extensive trade-off between adaptive evolution and disease susceptibility. Cell Rep. 2022; Tang et al. PopTradeOff: A database for exploring population-specificity of adaptive evolution, disease susceptibility, and drug responsiveness. Comput Struct Biotechnol J. 2023). In this manuscript, section 2.5 and section 4.12 describe how we use these tests to detect signals and infer selection. We also cite the above two published papers from which the reader can obtain more details.

      (3) Also, in section 2.4, we stress that “Signals in considerable DBSs were detected by multiple tests, indicating the reliability of the analysis”.

      To further respond to the comments of “lack of suitable methods” and “this paper would benefit from a more rigorous approach to tackling it”, we have performed two new analyses. The results of the new analyses agree well with previous results and provide new support for the main conclusion. The result of section 2.5 is novel and interesting.

      We write in Discussion “Two questions are how mouse-specific lncRNAs specifically rewire gene expression in mice and how human- and mouse-specific rewiring influences the cross-species transcriptional differences”. To investigate whether the rewiring of gene expression by HS lncRNA in humans is accidental in evolution, we have made further genomic and transcriptomic analyses (Lin et al. Intrinsically linked lineage-specificity of transposable elements and lncRNAs reshapes transcriptional regulation species- and tissue-specifically. doi: https://doi.org/10.1101/2024.03.04.583292). To verify the obtained conclusions, we analyzed the spermatogenesis data from multiple species and obtained supporting evidence (not published).

      I note some specific points that I think would benefit from more rigorous approaches, and suggest possible ways forward for these.

      Much of this work is focused on comparing DNA binding domains in human-unique long-noncoding RNAs and DNA binding sites across the promoters of genes in the human genome, and I think the authors can afford to be a bit more methodical/selective in their processing and filtering steps here. The article begins by searching for orthologues of human lncRNAs to arrive at a set of 66 human-specific lncRNAs, which are then characterised further through the rest of the manuscript. Line 99 describes a binding affinity metric used to separate strong DBS from weak DBS; the methods (line 432) describe this as being the product of the DBS or lncRNA length times the average Identity of the underlying TTSs. This multiplication, in fact, undoes the standardising value of averaging and introduces a clear relationship between the length of a region being tested and its overall score, which in turn is likely to bias all downstream inference, since a long lncRNA with poor average affinity can end up with a higher score than a short one with higher average affinity, and it's not quite clear to me what the biological interpretation of that should be. Why was this metric defined in this way?

      (1) Using RNA:DNA base-pairing rules, other DBS prediction programs return just DBSs with lengths. Using RNA:DNA base-pairing rules and a variant of Smith-Waterman local alignment, LongTarget returns DBSs with lengths and identity values together with DBDs (local alignment makes DBDs and DBSs predicted simultaneously). Thus, instead of measuring lncRNA/DNA binding based on DBS length, we measure lncRNA/DNA binding based on both DBS length and DBD/DBS identity (simply called identity, which is the percentage of paired nucleotides in the RNA and DNA sequences). This allows us to define “binding affinity”. One may think that binding affinity is a more complex function of length and identity. But, according to in vitro studies (see the review Abu Almakarem et al. 2012 and citations therein, and see He et al. 2015 and citations therein), the strength of a triplex is determined by all paired nucleotides (i.e., triplet). Thus, binding affinity=length * identity is biologically reasonable.

      (2) Further, different from predicting DBS upon individual base-pairing rules such as AT-G and CG-C, LongTarget integrates base-pairing rules into rulesets, each covering A, T, C, and G (see the two figures below, which are from He et al 2015). This makes every nucleotide in the RNA and DNA sequences comparable and allows the computation of identity.

      (3) On whether LongTarget may predict unreasonably long DBSs. Three technical features of LongTarget make this highly unlikely (and more unlikely than other programs). The three features are (a) local alignment, (b) gap penalty, and (c) TT penalty (He et al. 2015).

      (4) Some researchers may think that a higher identity threshold (e.g., 0.8 or even higher) makes the predicted DBSs more reliable. This is not true. To explore plausible identity values, we analyzed the distribution of Kcnq1ot1’s DBSs in the large Kcnq1 imprinting region (which contains many known imprinted genes). We found that a high threshold for identity (e.g., 0.8) will make DBSs in many known imprinted genes fail to be predicted. Upon our analysis of many lncRNAs and upon early in vitro experiments, plausible identity values range from 0.4 to 0.8.

      (5) Is it necessary or advisable to define an identity threshold? Since identity values from 0.4 to 0.8 are plausible and identity is a property of a DBS but does not reflect the strength of the whole triplex, it is more reasonable to define a threshold for binding affinity to control predicted DBSs. As explained above, binding affinity = length*identity is a reasonable measure of the strength of a triplex. The default threshold is 60, and given an identity of 0.6 in many triplexes, a DBS with affinity=60 is about 100 bp. Compared with TF binding sites (TFBS), 100 bp is quite long. As we explain in the main text, “taking a DBS of 147 bp as an example, it is extremely unlikely to be generated by chance (p < 8.2e-19 to 1.5e-48)”.

      (6) How to validate predicted DBSs? Validation faces these issues. (a) DBDs are predicted on the genome level, but target transcripts are expressed in different tissues and cells. So, no single transcriptomic dataset can validate all predicted DBSs of a lncRNA. Whatever techniques and cells are used, only a small portion of predicted DBSs can be experimentally captured (validated). (b) The resolution of current experimental techniques is limited; thus, experimentally identified DBSs (i.e., “peaks”) are much longer than computationally predicted DBSs. (c) Experimental results contain false positives and false negatives. So, validation (or performance evaluation) should also consider the ROC curves (Wen et al. 2022).

      (7) As explained above, a long DBS may have a lower binding affinity than a short DBS. A biological interpretation is that the long DBS may accumulate mutations that decrease its binding ability gradually.
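
      The definition binding affinity = length * identity discussed in the points above can be written out as a minimal sketch. The function name and the example DBSs are hypothetical; only the formula, the default threshold of 60, and the identity of 0.6 come from the response.

```python
def binding_affinity(length_bp, identity):
    """Binding affinity as defined in the response: DBS length times identity."""
    return length_bp * identity

# Hypothetical DBSs within the 0.4 to 0.8 identity range described as plausible.
long_low_identity = binding_affinity(200, 0.4)
short_high_identity = binding_affinity(90, 0.8)

# Length implied by the default affinity threshold at an identity of 0.6.
default_threshold = 60
implied_length_bp = default_threshold / 0.6

print("long, low-identity DBS affinity:  ", long_low_identity)
print("short, high-identity DBS affinity:", short_high_identity)
print("length at threshold (identity 0.6):", implied_length_bp)
```

      In this example the longer DBS with lower identity scores higher than the shorter DBS with higher identity, which is the length dependence raised in the reviewer's comment, and the threshold of 60 corresponds to a DBS of about 100 bp at an identity of 0.6.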

      There is also a strong assumption that identified sites will always be bound (line 100), which I disagree is well-supported by additional evidence (lines 109-125). The authors show that predicted NEAT1 and MALAT1 DBS overlap experimentally validated sites for NEAT1, MALAT1, and MEG3, but this is not done systematically, or genome-wide, so it's hard to know if the examples shown are representative, or a best-case scenario.

      (1) We did not make this assumption. Apparently, binding depends on multiple factors, including co-expression of genes and specific cellular context.

      (2) On the second issue, “this is not done systematically, or genome-wide”. We did this genome-wide but did not show all results (supplementary fig 2 shows three genomic regions, which are impressively good). In Wen et al. 2022, we describe the overall results.

      It's also not quite clear how overlapping promoters or TSS are treated - are these collapsed into a single instance when calculating genome-wide significance? If, eg, a gene has five isoforms, and these differ in the 3' UTR but their promoter region contains a DBS, is this counted five times, or one? Since the interaction between the lncRNA and the DBS happens at the DNA level, it seems like not correcting for this uneven distribution of transcripts is likely to skew results, especially when testing against genome-wide distributions, eg in the results presented in sections 5 and 6. I do not think that comparing genes and transcripts putatively bound by the 40 HS lncRNAs to a random draw of 10,000 lncRNA/gene pairs drawn from the remaining ~13500 lncRNAs that are not HS is a fair comparison. Rather, it would be better to do many draws of 40 non-HS lncRNAs and determine an empirical null distribution that way, if possible actively controlling for the overall number of transcripts (also see the following point).
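
      The empirical null suggested in this comment could be sketched as follows. This is an illustration under stated assumptions, not the authors' or the reviewer's procedure: the pool of per-lncRNA values is synthetic, the function name is invented, and the sketch does not control for transcript number.

```python
import random
from statistics import mean

def empirical_p_value(observed, pool, draw_size, n_draws, statistic, seed=0):
    """Compare an observed statistic with a null built from repeated random draws.

    Returns the one-sided empirical p-value (with the usual +1 correction)
    and the list of null statistics.
    """
    rng = random.Random(seed)
    null = [statistic(rng.sample(pool, draw_size)) for _ in range(n_draws)]
    exceed = sum(1 for value in null if value >= observed)
    return (exceed + 1) / (n_draws + 1), null

# Synthetic stand-in for the ~13500 non-HS lncRNAs: each value is a hypothetical
# fraction of that lncRNA's targets showing significant expression correlation.
non_hs_pool = [(i % 100) / 1000 for i in range(13500)]

# Hypothetical observed value for the set of 40 HS lncRNAs.
observed_fraction = 0.45

p_value, null_means = empirical_p_value(
    observed_fraction, non_hs_pool, draw_size=40, n_draws=999, statistic=mean
)
print("empirical p-value:", p_value)
```

      Matching each draw to the HS set on the number of annotated transcripts would require stratified sampling, which this sketch leaves out.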

      (1) We predicted DBSs in the promoter region of 179128 Ensembl-annotated transcripts and did not merge DBSs (there is no need to merge them). If multiple transcripts share the same TSS, they may share the same DBS, which is natural.

      (2) If the DBSs of multiple transcripts of a gene overlap, the overlap does not raise a problem for lncRNA/DNA binding analysis in specific tissues because usually only one transcript is expressed in a tissue. Therefore, there is no such situation “If, e.g., a gene has five isoforms, and these differ in the 3' UTR but their promoter region contains a DBS, is this counted five times, or one?”

      (3) It is unclear to us what “it seems like not correcting for this uneven distribution of transcripts is likely to skew results” means. Regarding testing against genome-wide distributions, statistically, it is beneficial to make many rounds of random draws genome-wide, but this will take a huge amount of time. Since more variables demand more rounds of drawing, to our knowledge, this is not widely practiced in large-scale transcriptomic data analyses.

      (4) If the difference (result) is small and thus calls for rigorous statistical testing, making many rounds of random draws genome-wide is necessary. In our results, “45% of these pairs show a significant expression correlation in specific tissues (Spearman's |rho| >0.3 and FDR <0.05). In contrast, when randomly sampling 10000 pairs of lncRNAs and protein-coding transcripts genome-wide, the percent of pairs showing this level of expression correlation (Spearman's |rho| >0.3 and FDR <0.05) is only 2.3%”.

      Thresholds for statistical testing are not consistent, or always well justified. For instance, in line 142 GO testing is performed on the top 2000 genes (according to different rankings), but there's no description of the background regions used as controls anywhere, or of why 2000 genes were chosen as a good number to test? Why not 1000, or 500? Are the results overall robust to these (and other) thresholds? Then line 190 the threshold for downstream testing is now the top 20% of genes, etc. I am not opposed to different thresholds in principle, but they should be justified.

      (1) We used the g:Profiler program to perform over-representation analysis to identify enriched GO terms. This analysis is used to determine what pre-defined gene sets (GO terms) are more present (over-represented) in a list of “interesting” genes than what would be expected by chance. Specifically, this analysis is often used to examine whether the majority of genes in a pre-defined gene set fall in the extremes of a list: the top and bottom of the list, for example, may correspond to the largest differences in expression between the two cell types. g:Profiler always takes the whole genome as the reference; that is why we did not mention the whole genome reference. We now add in section 2.2 “(with the whole genome as the reference)”.

      (2) The choice of 2000 rather than 2500 genes is somewhat subjective. We now explain that “About 5% of genes have significant sequence differences in humans and chimpanzees, but more show expression differences due to regulatory sequences. We sorted target genes by their DBS affinity and, to be prudential, chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length<60 bp but binding affinity>36) to conduct over-representation analysis”.
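
      For readers unfamiliar with over-representation analysis, the underlying calculation is commonly a hypergeometric tail probability against a chosen background. The sketch below shows that generic calculation with toy numbers; it is not g:Profiler's implementation and omits multiple-testing correction, and the function name and values are assumptions.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: genes in the background (e.g. the whole genome)
    annotated:  background genes carrying the GO term
    selected:   genes in the list of interest (e.g. the top 2000)
    overlap:    selected genes that carry the GO term
    """
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, selected)

# Toy background of 20 genes, 5 of them annotated; all 5 selected genes are annotated.
p_all_hit = enrichment_p_value(population=20, annotated=5, selected=5, overlap=5)
# Requiring an overlap of at least 0 is always satisfied.
p_trivial = enrichment_p_value(population=20, annotated=5, selected=5, overlap=0)
print("p (all selected genes annotated):", p_all_hit)
print("p (overlap >= 0):", p_trivial)
```

      Because the result depends on both `population` and `selected`, the choice of background and of list size (2000 versus 1000 or 500) directly changes the p-values, which is why the reviewer asks for both to be stated and justified.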

      Likewise, comparing Tajima's D values near promoters to genome-wide values is unfair, because promoters are known to be under strong evolutionary constraints relative to background regions; as such it is not surprising that the results of this comparison are significant. A fairer comparison would attempt to better match controls (eg to promoters without HS lncRNA DBS, which I realise may be nearly impossible), or generate empirical p-values via permutation or simulation.

      We used these tests to detect selection signals in DBSs but not in the whole promoter regions. Using promoters without HS lncRNA DBS as the control also has risks because promoter regions contain other kinds of regulatory sequences.

      There are huge differences in the comparisons between the Vindija and Altai Neanderthal genomes that to me suggest some sort of technical bias or the such is at play here. e.g. line 190 reports 1256 genes to have a high distance between the Altai Neanderthal and modern humans, but only 134 Vindija genes reach the same threshold of 0.034. The temporal separation between the two specimens does not seem sufficient to explain this difference, nor the difference between the Altai Denisovan and Neanderthal results (2514 genes for Denisovan), which makes me wonder if it is a technical artefact relating to the quality of the genome builds? It would be worth checking.

      We feel it is hard to know whether or not the temporal separation between these specimens is sufficient to explain the differences because many details of archaic humans and their genomes remain unknown and because mechanisms determining genotype-phenotype relationships remain poorly known. After 0.034 was determined, these numbers of genes were determined accordingly. We chose parameters and thresholds that best suit the most important requirements, but these parameters and thresholds may not best suit other requirements; this is a problem for all large-scale studies.     

      Inferring evolution: There are some points of the manuscript where the authors are quick to infer positive selection. I would caution that GTEx contains a lot of different brain tissues, thus finding a brain eQTL is a lot easier than finding a liver eQTL, just because there are more opportunities for it. Likewise, claims in the text and in Tables 1 and 2 about the evolutionary pressures underlying specific genes should be more carefully stated. The same is true when the authors observe high Fst between groups (line 515), which is only one possible cause of high Fst - population differentiation and drift are just as capable of giving rise to it, especially at small sample sizes.

      (1) We add in Discussion that “Finally, not all detected signals reliably indicate positive selection”.

      (2) Our results are that more signals are detected in CEU and CHB than in YRI; this agrees with all population genetics studies and implies that our results are not wrongly biased because more samples and larger samples were obtained from CEU and CHB.

    1. eLife Assessment

      This important study presents a well-constructed multiscale simulation framework to investigate ATP-driven DNA translocation by prokaryotic SMC complexes, supporting a segment-capture mechanism. The strength of evidence is convincing, highlighting the necessity of a precise balance between electrostatic interactions and hydrogen bonding, as well as the critical role of kleisin asymmetry in ensuring unidirectional movement.

    2. Reviewer #1 (Public review):

      Summary:

      This study used explicit-solvent simulations and coarse-grained models to identify the mechanistic features that allow for the unidirectional motion of SMC on DNA. Shorter explicit-solvent models describe relevant hydrogen bond energetics, which were then encoded in a coarse-grained structure-based model. In the structure-based model, the authors mimic chemical reactions as signaling changes in the energy landscape of the assembly. By cycling through the chemical cycle repeatedly, the authors show how these time-dependent energetic shifts naturally lead SMC to undergo translocation steps along DNA that are on a length scale that has been identified.

      Strengths:

      Simulating large-scale conformational changes in complex assemblies is extremely challenging. This study utilizes highly-detailed models to parameterize a coarse-grained model, thereby allowing the simulations to connect the dynamics of precise atomistic-level interactions with a large-scale conformational rearrangement. This study serves as an excellent example for this overall methodology, where future studies may further extend this approach to investigate any number of complex molecular assemblies.

      Weaknesses:

      The only relative weakness is that the text does not always clearly communicate which aspects of the dynamics are expected to be robust. That is, which aspects of the dynamics/energetics are less precisely described by this model? Where are the limits of the models, and why should the results be considered within the range of applicability of the models?

    3. Reviewer #2 (Public review):

      Summary:

      The authors perform coarse-grained and all-atom simulations to provide a mechanism for loop extrusion that is involved in genome compaction.

      Strengths:

      The simulations are very thoughtful. They provide insights into the translocation process, which is only one of the mechanisms. Many of the analyses are very good. Overall, the study advances the use of simulations in such complicated systems.

      Weaknesses:

      Even the authors point out several limitations, which cannot be easily overcome in the paper because of the paucity of experimental data. Nevertheless, the authors could have done so to illustrate the main assertion that loop extrusion occurs by the motor translocating on DNA. They should mention more clearly that there are alternative theories that have accounted for a number of experimental data.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Yamauchi and colleagues combine all-atom and coarse-grained MD simulations to investigate the mechanism of DNA translocation by prokaryotic SMC complexes. Their multiscale approach is well-justified and supports a segment-capture model in which ATP-dependent conformational changes lead to the unidirectional translocation of DNA. A key insight from the study is that asymmetry in the kleisin path enforces directionality. The work introduces an innovative computational framework that captures key features of SMC motor action, including DNA binding, conformational switching, and translocation.

      This work is well executed and timely, and the methodology offers a promising route for probing other large molecular machines where ATP activity is essential.

      Strengths:

      This manuscript introduces an innovative yet simple method that merges all-atom and coarse-grained, purely equilibrium, MD simulations to investigate DNA translocation by SMC complexes, which is triggered by activated ATP processes. Investigating the impact of ATP on large molecular motors like SMC complexes is extremely challenging, as ATP catalyses a series of chemical reactions that take and keep the system out of equilibrium. The authors simulate the ATP cycle by cycling through distinct equilibrium simulations where the force field changes according to whether the system is assumed to be in the disengaged, engaged, and V-shaped states; this is very clever as it avoids attempting to model the non-equilibrium process of ATP hydrolysis explicitly. This equilibrium switching approach is shown to be an effective way to probe the mechanistic consequences of ATP binding and hydrolysis in the SMC complex system.

      The simulations reveal several important features of the translocation mechanism. These include identifying that a DNA segment of ~200 bp is captured in the engaged state and pumped forward via coordinated conformational transitions, yielding a translocation step size in good agreement with experimental estimates. Hydrogen bonding between DNA and the top of the ATPase heads is shown to be critical for segment capture, as without it, translocation is shown to fail. Finally, asymmetry in the kleisin subunit path is shown to be responsible for unidirectionality.

      This work highlights how molecular simulations are an excellent complement to experiments, as they can exploit experimental findings to provide high-resolution mechanistic views currently inaccessible to experiments. The findings of these simulations are plausible and expand our understanding of how ATP hydrolysis induces directional motion of the SMC complex.

      Weaknesses:

      There are aspects of the methodology and modelling assumptions that are not clear and could be better justified. The major ones are listed below:

      (1) The all-atom MD simulations involve a 47-bp DNA duplex interacting with the ATPase heads, from which key residues involved in hydrogen bonding are identified. However, DNA mechanics, including flexibility and hydrogen bond formation, are known to be sequence-dependent. The manuscript uses a single arbitrary sequence but does not discuss potential biases. Could the authors comment on how sequence variability might affect binding geometry or the number of hydrogen bonds observed?

      (2) A key feature of the coarse-grained model is the inclusion of a specific hydrogen-bonding potential between DNA and residues on the ATPase heads. The authors select the top 15 hydrogen-bond-forming residues from the all-atom simulations (with contact probability > 0.05), but the rationale for this cutoff is not explained. Also, the strength of hydrogen bonds in coarse-grained models can be sensitive to context. How did the authors calibrate the strength of this interaction relative to electrostatics, and did they test its robustness (e.g., by varying epsilon or residue set)? Could this interaction be too strong or too weak under certain ionic conditions? What happens when salt is changed?

      (3) To enhance sampling, the translocation simulations are run at 300 mM monovalent salt. While this is argued to be physiological for Pyrococcus yayanosii, such a concentration also significantly screens electrostatics, possibly altering the interaction landscape between DNA and protein or among protein domains. This may significantly impact the results of the simulations. Why did the authors not use enhanced sampling methods to sample rare events instead of relying on a high-salt regime to accelerate dynamics?

      (4) Only a small fraction of the simulated trajectories complete successful translocation (e.g., 45 of 770 in one set), and this is attributed to insufficient simulation time. While the authors are transparent about this, it raises questions about the reliability of inferred success rates and about possible artefacts (e.g., DNA trapping in coiled-coil arms). Could the authors explore or at least discuss whether alternative sampling strategies (e.g., Markov State Models, transition path sampling) might address this limitation more systematically?
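
      To make the concern in point (4) about the reliability of inferred success rates more concrete, the uncertainty on a rate such as 45 of 770 can be estimated with a Wilson score interval. This is a minimal sketch, assuming independent trajectories; the function name `wilson_interval` is chosen here for illustration, and the interval does not address systematic artefacts such as DNA trapping.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return centre - half_width, centre + half_width


# Counts quoted in point (4): 45 successful translocations out of 770 runs.
low, high = wilson_interval(45, 770)
print(f"success rate {45 / 770:.3f}, 95% interval {low:.3f} to {high:.3f}")
```

      The interval only quantifies sampling noise for the simulated time window; a rate limited by insufficient simulation time would still be a lower bound on the long-time success rate.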

    5. Author Response:

      We thank the reviewers for their insightful comments on our manuscript. We are encouraged by their positive assessment of our multiscale simulation approach and segment-capture mechanism.

      In our revision, we will address the reviewers' primary concerns, which are summarized into three key points: (1) providing a more comprehensive discussion of the validity, robustness, and limitations of our model; (2) improving contextualization with alternative mechanisms; and (3) enhancing the clarity of our results, figures, and terminology.

      1) Model Validity, Robustness, and Limitations:

      As suggested by Reviewers #1 and #3, we will provide a more thorough discussion of our model's assumptions and limitations. This is essential to evaluate the generalizability and reliability of our conclusions. We will clarify which aspects of the dynamics we believe to be robust, elaborate on the rationale behind key parameter choices, such as the selection criteria for hydrogen-bonding residues and the calibration of their interaction strength, and discuss how these choices may influence the simulation outcomes. Furthermore, we will mention the potential impact of our choices regarding DNA sequence, DNA length, and the high-salt concentration, explaining why we opted for this simulation strategy over alternative enhanced-sampling techniques.

      2) Contextualization with Alternative Mechanisms:

      Following the comments by Reviewer #2, we will expand our discussion to better contextualize our work. We will provide a more detailed comparison between our segment-capture model and alternative mechanisms, particularly the 'scrunching' model (e.g., the theoretical work by Takaki et al., Nat. Commun. 2021). This will help clarify how our high-resolution mechanistic view that reveals stepwise conformational transitions underlying segment capture fits into the broader landscape of SMC loop extrusion research. We believe this will contribute to the ongoing scientific discourse.

      3) Clarity of Results, Figures, and Terminology:

      Based on valuable suggestions from Reviewers #2 and #3, we will revise our manuscript to improve the clarity and accessibility of our findings. We will update figures and their descriptions (e.g., Figure 4I, J), providing a clearer step-by-step explanation of the translocation process within the ATP cycle (related to Figure 2), clarifying the role of each conformational state, elucidating how these transitions contribute to the loop extrusion mechanism, and defining key terms such as "pumping" more precisely.

      We are confident that these revisions will substantially strengthen the mechanistic clarity and scientific contribution of our work.

    1. eLife Assessment

      Research on push-pull systems has often focused on controlled environments, leaving significant gaps in our understanding of how these systems function under real-world conditions. This important and solid study makes a substantial contribution by investigating the volatile emissions and behavioral effects of Desmodium in natural and semi-field contexts, offering insights of broad interest for sustainable agriculture and pest management. While the authors rightly acknowledge some remaining limitations, the revised manuscript now provides a well-supported and transparent assessment of the ecological role of Desmodium volatiles in push-pull systems.

    2. Reviewer #2 (Public review):

      Based on the controversy of whether the Desmodium intercrop emits bioactive volatiles that repel the fall armyworm, the authors conducted this study to assess the effects of the volatiles from Desmodium plants in the push-pull system on behavior of FAW oviposition. This topic is interesting and the results are valuable for understanding the push-pull system for the management of FAW, the serious pest. The methodology used in this study is valid, leading to reliable results and conclusions. I just have a few concerns and suggestions for the improvement of this paper:

      (1) The volatiles emitted from D. incanum were analyzed and their effects on the oviposition behavior of FAW moth were confirmed. However, it would be better and useful to identify the specific compounds that are crucial for the success of the push-pull system.

      (2) It would be good to add "symbols" of significance in Figure 4 (D).

      (3) Figure A is difficult for readers to understand.

      (4) It would be good to discuss in more depth, in the Discussion, the functions of the important volatile compounds identified here, in comparison with results from previous studies.

      Comments on revisions:

      The authors addressed all my concerns, and I believe that the current version is appropriate for publication.

    3. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript of Odermatt et al. investigates the volatiles released by two species of Desmodium plants and the response of herbivores to maize plants alone or in combination with these species. The results show that Desmodium releases volatiles in both the laboratory and the field. Maize grown in the laboratory also released volatiles, in a similar range. While female moths preferred to oviposit on maize, the authors found no evidence that Desmodium volatiles played a role in lowering attraction to or oviposition on maize.

      Strengths:

      The manuscript is a response to recently published papers that presented conflicting results with respect to whether Desmodium releases volatiles constitutively or in response to biotic stress, the level at which such volatiles are released, and the behavioral effect it has on the fall armyworm. These questions are relevant as Desmodium is used in a textbook example of pest-suppressive sustainable intercropping technology called push-pull, which has supported tens of thousands of smallholder farmers in suppressing moth pests in maize. A large number of research papers over more than two decades have implied that Desmodium suppresses herbivores in push-pull intercropping through the release of large amounts of volatiles that repel herbivores. This premise has been questioned in recent papers. Odermatt et al. thus contribute to this discussion by testing the role of odors in oviposition choice. The paper confirms that ovipositing FAW preferred maize, and also confirmed that odors released from Desmodium appeared not important in their bioassays.

      The paper is a welcome addition to the literature and adds quality headspace analyses of Desmodium from the laboratory and the field. Furthermore, the authors, some of whom have since long contributed to developing push-pull, also find that Desmodium odors are not significant in their choice between maize plants. This advances our knowledge of the mechanisms through which push-pull suppresses herbivores, which is critically important to evolving the technique to fit different farming systems and translating this mechanism to fit with other crops and in other geographical areas.

      Thank you for your careful assessment of our manuscript.

      Weaknesses:

      Below I outline the major concerns:

      (1) Clear induction of the experimental plants, and lack of reflective discussion around this: from literature data and previous studies of maize and Desmodium, it is clear that the plants used in this study, particularly the Desmodium, were induced. Maize appeared to be primarily manually damaged, possibly due to sampling (release of GLV, but little to no terpenoids, which is indicative of mostly physical stress and damage; see, for example, Tamiru et al. 2011, a paper by one of the coauthors), whereas Desmodium releases a blend of many compounds (many terpenoids indicative of herbivore induction). Erdei et al. also clearly show that under controlled conditions maize, silver leaf and green leaf Desmodium release volatiles in very low amounts. While the condition of the plants in Odermatt et al. may be reflective of situations in push-pull fields, the authors should elaborate on the above in the discussion (see comments) so that readers understand the plants' condition during the experiments. This is particularly important because it has been assumed that Desmodium releases typical herbivore-induced volatiles constitutively, which is not the case (see Erdei et al. 2024). This reflection is currently lacking in the manuscript.

      We acknowledge the need for a more reflective discussion on the possible causes of volatile emission due to physical damage. Although the field plants were carefully handled, it is possible that some physical stress may have contributed to the release of volatiles, such as green leaf volatiles (GLVs). We ensured the revised manuscript reflects this nuanced interpretation (lines 282 – 286). However, we also explained more clearly that our aim was to capture the volatile emission of plants used by farmers under realistic conditions and moth responses to these plants, not to be able to attribute the volatile emission to a specific cause (lines 115 – 117). We revised relevant passages throughout the results and discussion to ensure that we do not make any claims about the reason for volatile emissions, and that our claims regarding these plants and their headspace being representative of the system as practiced by farmers are supported. In the revised manuscript we provide a new supplementary table S2 that additionally shows the classification of the identified substances, which also shows that the majority of the substances that were found in the headspace of the sampled plants of Desmodium intortum or Desmodium incanum are monoterpenes, sesquiterpenes, or aromatic compounds, and not GLVs (that are typically emitted following damage).

      (2) Lack of controls that would have provided context to the data: The experiments lack important controls that would have helped in the interpretation:

      2a The authors did not control the conditions of the plants. To understand the release of volatiles and their importance in the field, the authors should have included controlled herbivory in both maize and Desmodium. This would have placed the current volatile profiles in a herbivory context. Now the volatile measurements hang in midair, leading to discussions that are not well anchored (and should be rephrased thoroughly, see eg lines 183-188). It is well known that maize releases only very low levels of volatiles without abiotic and biotic stressors. However, this changes upon stress (GLVs by direct, physical damage and eg terpenoids upon herbivory, see above). Erdei et al. confirm this pattern in Desmodium. Not having these controls means that the authors need to put the data in the context of what has been published (see above).

      We appreciate this concern. Our study aimed to capture the real-world conditions of push-pull fields, where Desmodium and maize grow in natural environments without the direct induction of herbivory for experimental purposes (lines 115 – 117). We agree that in further studies it would be important to carry out experiments under different environmental conditions, including herbivore damage. However, this was not within the scope of the present study.

      2b It would also have been better if the authors had sampled maize from the field while sampling Desmodium. Together with the above point (inclusion of herbivore-induced maize and Desmodium), the levels of volatile release by Desmodium would have been placed into context.

      We acknowledge that sampling maize and other intercrop plants, such as edible legumes, alongside Desmodium in the push-pull field would have allowed us to make direct comparisons of the volatile profiles of different plants in the push-pull system under shared field conditions. Again, this should be done in future experiments but was beyond the scope of the present study. Due to the number of samples we could handle given cost and workload, we chose to focus on Desmodium because there is much less literature on the volatile profiles of field-grown Desmodium than maize plants in the field: we are aware of one study attempting to measure field volatile profiles from Desmodium intortum (Erdei et al. 2024) and no study attempting this for Desmodium incanum. We pointed out this justification for our focus on Desmodium in the manuscript (lines 435 - 439). Additionally, we suggested in the discussion that future studies should measure volatile profiles from all plants commonly used in push-pull systems alongside Desmodium (lines 267 – 269).

      2c To put the volatiles release in the context of push-pull, it would have been important to sample other plants which are frequently used as intercrop by smallholder farmers, but which are not considered effective as push crops, particularly edible legumes. Sampling the headspace of these plants, both 'clean' and herbivore-induced, would have provided a context to the volatiles that Desmodium (induced) releases in the field - one would expect unsuccessful push crops to not release any of these 'bioactive' volatiles (although 'bioactive' should be avoided) if these odors are responsible for the pest suppressive effect of Desmodium. Many edible intercrops have been tested to increase the adoption of push-pull technology but with little success.

      We very much agree that such measurements are important for the longer-term research program in this field. But again, for the current study this would have exploded the size of the required experiment. Regarding bioactivity, we have been careful to use the phrase "potentially bioactive" solely when referring to findings from the literature (lines 99–103), in order to avoid making any definitive claims about our own results.

      Because of the lack of the above, the conclusions the authors can draw from their data are weakened. The data are still valuable in the current discussion around push-pull, provided that a proper context is given in the discussion along the points above.

      We think our revisions made the specific aims of this study more explicit and help to avoid misleading claims.

      (3) 'Tendency' of the authors to accept the odor hypothesis (i.e. that Desmodium odors are responsible for repelling FAW and thereby reduce infestation in maize under push-pull management) in spite of their own data: The authors tested the effects of odor in oviposition choice, both in a cage assay and in a 'wind tunnel'. From the cage experiments, it is clear that FAW preferred maize over Desmodium, confirming other reports (including Erdei et al. 2024). However, when choosing between two maize plants, one of which was placed next to Desmodium to which FAW had no tactile (taste, structure, etc.) access, FAW chose equally. Similarly in their wind tunnel setup (this term should not be used to describe the assay, see below), no preference was found either between maize odor in the presence or absence of Desmodium. This too confirms results obtained by Erdei et al. (but adds an important element to it by using Desmodium plants that had been induced and released volatiles, contrary to Erdei et al. 2024). Even though no support was found for repellency by Desmodium odors, the authors in many instances in the manuscript (lines 30-33, 164-169, 202, 279, 284, 304-307, 311-312, 320) appear to elevate non-significant tendencies as being important. This is misleading readers into thinking that these interactions were significant and in fact confirming this in the discussion. The authors should stay true to their own data obtained when testing the hypothesis of whether odors play a role in the pest-suppressive effect of push-pull.

      We appreciate this feedback and agree that we may have overstated claims that could not be supported by strict significance tests. However, we believe that non-significant tendencies can still provide valuable insights. In the revised version of the manuscript, we ensured a clear distinction between statistically significant findings and non-significant trends and removed any language that may imply stronger support for the odor hypothesis than what the data show in all the lines that were mentioned.

      (4) Oviposition bioassay: with so many assays in close proximity, it is hard to certify that the experiments are independent. Please discuss this in the appropriate place in the discussion.

      We have pointed this out in the submitted manuscript in lines 275 – 279. Furthermore, we included detailed captions to figure 4 - supporting figure 3 & figure 4 - supporting figure 4. We are aware that in all such experiments there is a danger of between-treatment interference, which we pointed out for our specific case. We stated that with our experimental setup we tried to minimize interference between treatments by spacing and temporal staggering. We would like to point out that this common caveat does not invalidate experimental designs when practicing replication and randomization. We assume that insects are able to select suitable oviposition sites in the background of such confounding factors under realistic conditions.

      (5) The wind tunnel has a number of issues (besides being poorly detailed):

      5a. The setup which the authors refer to as a 'wind tunnel' does not qualify as a wind tunnel. First, there is no directional flow: there are two flows entering the setup at opposite sides. Second, the flow is far too low for moths to orient in (in a wind tunnel, wind should be presented as a directional cue). Only around 1.5 l/min enters the setup, which has a volume of approximately 90 l, and this does not create any directional flow. Solution: change 'wind tunnel' throughout the text to 'dual choice setup/assay'.

      We agree with these criticisms and changed the terminology accordingly from ‘wind tunnel’ to ‘dual choice assay’. We have now conducted an additional experiment which we called ‘no-choice assay’ that provides conditions closer to a true wind tunnel. The setup of the added experiment features an odor entry point at only one side of the chamber to create a more directional airflow. Each treatment (maize alone, maize + D. intortum, maize + D. incanum, and a control with no plants) was tested separately, with only one treatment conducted per evening to avoid cross-contamination, as described in the methods section of the no-choice assay.

      5b. There is no control over the flows in the flight section of the setup. It is very well possible that moths at the release point may only sense one of the 'options'. Please discuss this.

      We added this to the discussion (lines 369 – 374). The new no-choice assays also address this concern by using a setup with laminar flow.

      5c. Too low a flow (1.5 l per minute) implies largely stagnant air, which means cross-contamination between experiments. An experiment takes 5 minutes, but it takes minimally 1.5 hours at these flows to replace the flight chamber air (but in reality much longer as the fresh air does not replace the old air, but mixes with it). The setup does not seem to be equipped with e.g. fans to quickly vent the air out of the setup. See comments in the text. Please discuss the limitations of the experimental setup at the appropriate place in the discussion.

      We added these limitations to the discussion and addressed these concerns with new experiments (see answer 5a).

      5d. The stimulus air enters through a tube (what type of tube, diameter, length, etc.?) containing pressurized air (how was the air delivered into the bags; what type of bag, and how was it sealed?), and the efflux goes directly into the flight chamber (how; via a nozzle?). However, it seems that there is no control of the efflux. How was leakage prevented, and in particular, how were the bags sealed airtight around the plants?

      We added the missing information to the methods and provided details about types of bags, manufacturers, and pre-treatments in the method section. In short, PTFE tubes connected bagged plants to the bioassay setup and air was pumped in at an overpressure, so leakage was not eliminated but contamination from ambient air was avoided.

      5e. The plants were bagged in very narrowly fitting bags. The maize plants look bent and damaged, which probably explains the GLVs found in the samples. The Desmodium in the picture (Figure 5 supplement, which we should assume is at least a representative picture?) appears to be rather crammed into the bag with maize and looks in rather poor condition to start with (perhaps also indicating why they release these volatiles?). It would be good to describe the sampling of the plants in detail and explain that the way they were handled may have caused the release of GLVs.

      We included a more detailed description of the plant handling and bagging processes to the methods to clarify how the plants were treated during the dual-choice and the no-choice assays reported in the revised manuscript. We politely disagree that the maize plants were damaged and the Desmodium plants not representative of those encountered in the field. The plants were grown in insect-proof screen houses to prevent damage by insects and carefully curved without damaging them to fit into the bag. The Desmodium plant pictured was D. incanum, which has sparser foliage and smaller leaves than D. intortum.
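
      The air-exchange concern raised in point 5c can be illustrated with a simple well-mixed dilution model, in which the residual concentration decays as C(t) = C0 · exp(−Q·t/V). This is a sketch under stated assumptions: the model, the function names, and the 5% residual target are chosen here for illustration, and the chamber volume (about 90 l) and inflow (about 1.5 l/min) are the approximate figures quoted in point 5a, so the results are indicative only.

```python
from math import log


def nominal_exchange_time_min(volume_l, flow_l_per_min):
    """Time for the inflow to deliver one chamber volume of air (V / Q)."""
    return volume_l / flow_l_per_min


def washout_time_min(volume_l, flow_l_per_min, residual_fraction):
    """Time until a well-mixed chamber retains only residual_fraction of the
    original odor, from C(t) = C0 * exp(-Q * t / V)."""
    tau = nominal_exchange_time_min(volume_l, flow_l_per_min)
    return -tau * log(residual_fraction)


# Approximate figures from point 5a; the 5% target is an assumption.
volume_l = 90.0
flow_l_per_min = 1.5
print(nominal_exchange_time_min(volume_l, flow_l_per_min), "min per chamber volume")
print(round(washout_time_min(volume_l, flow_l_per_min, 0.05)), "min to reach 5% residual")
```

      Because incoming air mixes with rather than displaces the old air, the washout time to a low residual is several times the nominal exchange time, which is the qualitative point made in 5c.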

      (6) Figure 1 seems redundant as a main figure in the text. Much of the information is not pertinent to the paper. It can be used in a review on the topic. Or perhaps if the authors strongly wish to keep it, it could be placed in the supplemental material.

      We think that Figure 1 provides essential information about the push-pull system and the FAW. To our knowledge, this partly contradictory evidence so far has not been synthesized in the literature. We realize that such a figure would more commonly be provided in a review article, but we do not think that the small number of studies on this topic so far justify a stand-alone review. Instead, the introduction to our manuscript includes a brief review of these few studies, complemented by the visual summary provided in Figure 1 and a detailed supplementary table.

      Reviewer #2 (Public review):

      Based on the controversy of whether the Desmodium intercrop emits bioactive volatiles that repel the fall armyworm, the authors conducted this study to assess the effects of the volatiles from Desmodium plants in the push-pull system on behavior of FAW oviposition. This topic is interesting and the results are valuable for understanding the push-pull system for the management of FAW, the serious pest. The methodology used in this study is valid, leading to reliable results and conclusions. I just have a few concerns and suggestions for improvement of this paper:

      (1) The volatiles emitted from D. incanum were analyzed and their effects on the oviposition behavior of FAW moth were confirmed. However, it would be better and useful to identify the specific compounds that are crucial for the success of the push-pull system.

      We fully agree that identifying specific volatile compounds responsible for the push-pull effect would provide valuable insights into the underlying mechanisms of the system. However, the primary focus of this study was to address the still unresolved question of whether Desmodium emits detectable or “significant” amounts of volatiles at all under field conditions, and the secondary aim was to test whether we could demonstrate a behavioral effect of Desmodium headspace on FAW moths. Before conducting our experiments, we carefully considered the option of using single volatile compounds and synthetic blends in bioassays. We decided against this because we judged that the contradictory evidence in the literature was not a sufficient basis for composing representative blends. Furthermore, we think it is an important first step to test for behavioral responses to the headspaces of real plants. We consider bioassays with pure compounds to be important for confirmation and more detailed investigation in future studies. There was also contradictory evidence in the literature regarding moth responses to plants. We thus opted to focus on experiments with whole plants to maintain ecological relevance.

      (2) It would be good to add "symbols" of significance in Figure 4 (D).

      We report the statistical significance of the parameters in Figure 4 (D) in Table 3, which shows the mixed model applied for oviposition bioassays. While testing significance between groups is a standard approach, we used a more robust model-based analysis to assess the effects of multiple factors simultaneously. We provided a cross-reference to Table 3 from the figure description of Figure 4 (D) for readers to easily find the statistical details.

      (3) Figure A is difficult for readers to understand.

      Unfortunately, it is not entirely clear which specific figure is being referred to as "Figure A" in this comment. We tried to keep our figures as clear as possible.

      (4) It will be good to deeply discuss the functions of important volatile compounds identified here with comparison with results in previous studies in the discussion better.

      Our study does not provide strong evidence that specific volatiles from Desmodium plants are important determinants of FAW oviposition or choice in the push-pull system. Therefore, we prefer to refrain from detailed discussions of the potential importance of individual compounds. However, in the revised version, we provide an additional Table S2, which identifies the overlap with volatiles previously reported from Desmodium, as only the total numbers were summarized in the discussion of the submitted paper.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The points raised are largely self-explanatory as to what needs to be done to fully resolve them. At a minimum the text needs to be seriously revised to:

      (1) reflect the data obtained.

      (2) reflect on the limitations of their experimental setup and data obtained.

      (3) put the data obtained and its limitations in what these tell us and particularly what not. Ideally, additional headspace measurements are taken, including from herbivory and 'clean' maize and Desmodium (in which there is better control of biotic and abiotic stress), as well as other crops commonly planted as companion crops with maize (but none of them reducing pest pressure).

      Thank you for this summary. Please see our detailed responses above.

      In addition to the main points of critique provided above, I have provided additional comments in the text (https://elife-rp.msubmit.net/elife-rp_files/2024/07/18/00134767/00/134767_0_attach_28_25795_convrt.pdf). These elaborate on the above points and include some new ones too. These are the major points of critique, which I hope the authors can address.

      Thank you very much for these detailed comments.

      Reviewer #2 (Recommendations for the authors):

      It is important to note that the original push-pull system was developed against stemborers and involved Napier grass (still used) around the field, which attracts stemborer moths, and Molasses grass as the intercrop that repels the moths and attracts parasitoids. Later, Molasses grass was replaced by desmodiums because it is a legume that fixes nitrogen and therefore can increase nitrate levels in the soil, but most importantly because it prevents germination of the parasitic Striga weed. The possible repellent effect of desmodium on pests and attraction of natural enemies was never properly tested but assumed, probably to still be able to use the push-pull terminology. This "mistake" should be recognized here and in future publications. It is a real pity that the controversy over the repellent effect of desmodium distracts from the amazing success of the push-pull system, also against the fall armyworm.

      We thank the reviewer for pointing out these issues, which are part of the reason for our Figure 1 and why we would like to keep it. We have described this development of the system in the introduction to better present the push-pull system. Our aim in Figure 1 and Table S1 is to highlight both the evidence of the system's success, and the gaps in our understanding, regarding specifically control of damage from the FAW.

    1. eLife Assessment

      This is a valuable study on how past sensory experiences shape perception across multiple time scales. Using a behavioural task and reanalysed EEG data, the authors identify two unifying mechanisms across time scales: a process resulting in faster responses to expected stimuli modulated by attention to task, and reduced early decoding precision for expected inputs interpreted as dampened feedforward processing. The manipulation to dissociate task-related and unrelated history effects over multiple timescales is novel and promising, but the evidence is incomplete and could be strengthened by clarifying the measures, justifying the analysis choices, and clarifying the relationship to other work.

    2. Reviewer #1 (Public review):

      Summary:

      This paper addresses an important and topical issue: how temporal context, at various time scales, affects various psychophysical measures, including reaction times, accuracy, and localization. It offers interesting insights, with separate mechanisms for different phenomena, which are well discussed.

      Strengths:

      The paradigm used is original and effective. The analyses are rigorous.

      Weaknesses:

      Here I make some suggestions for the authors to consider. Most are stylistic, but the issue of precision may be important.

      (1) The manuscript is quite dense, with some concepts that may prove difficult for the non-specialist. I recommend spending a few more words (and maybe some pictures) describing the difference between task-relevant and task-irrelevant planes. Nice technique, but not instantly obvious. Then we are hit with "stimulus-related", which definitely needs some words (also because it is orthogonal to neither of the above).

      (2) While I understand that the authors want the three classical separations, I actually found it misleading. Firstly, for a perceptual scientist to call intervals in the order of seconds (rather than milliseconds), "micro" is technically coming from the raw prawn. Secondly, the divisions are not actually time, but events: micro means one-back paradigm, one event previously, rather than defined by duration. Thirdly, meso isn't really a category, just a few micros stacked up (and there's not much data on this). And macro is basically patterns, or statistical regularities, rather than being a fixed time. I think it would be better either to talk about short-term and long-term, which do not have the connotations I mentioned. Or simply talk about "serial dependence" and "statistical regularities". Or both.

      (3) More serious is the issue of precision. Again, this is partially a language problem. When people use the engineering terms "precision" and "accuracy" together, they usually use the same units, such as degrees. Accuracy refers to the distance from the real position (so average accuracy gives bias), and precision is the clustering around the average bias, usually measured as standard deviation. Yet here accuracy is percent correct: also a convention in psychology, but not when contrasting accuracy with precision, in the engineering sense. I suggest you change "accuracy" to "percent correct". On the other hand, I have no idea how precision was defined. All I could find was: "mixture modelling was used to estimate the precision and guess rate of reproduction responses, based on the concentration (k) and height of von Mises and uniform distributions, respectively". I do not know what that means.

      (4) Previous studies show serial dependence can increase bias but decrease scatter (inverse precision) around the biased estimate. The current study claims to be at odds with that. But are the two measures of precision relatable? Was the real (random) position of the target subtracted from each response, leaving residuals from which the inverse precision was calculated? (If so, the authors should say so.) But if serial dependence biases responses in essentially random directions (depending on the previous position), it will increase the average scatter, decreasing the apparent precision.

      (5) I suspect they are not actually measuring precision, but location accuracy. So the authors could use "percent correct" and "localization accuracy". Or be very clear what they are actually doing.
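
      To make points (3) and (4) concrete, the following is a minimal sketch and not the authors' analysis. It assumes that response errors are angles in radians, takes the circular mean of the errors as the bias (accuracy in the engineering sense) and the circular standard deviation as the scatter (inverse precision), and models serial dependence as a hypothetical sinusoidal pull toward the previous target. The function names, the pull size (0.3 rad) and the noise level (0.2 rad) are illustrative assumptions.

```python
import math
import random


def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def circular_bias_and_sd(errors):
    """Return (bias, scatter) for a list of angular errors in radians.

    bias    = circular mean of the errors (engineering-sense accuracy)
    scatter = circular standard deviation (inverse of precision)
    """
    n = len(errors)
    c = sum(math.cos(e) for e in errors) / n
    s = sum(math.sin(e) for e in errors) / n
    bias = math.atan2(s, c)
    # Mean resultant length, clamped to 1.0 to guard against rounding.
    r = min(1.0, math.hypot(c, s))
    scatter = math.sqrt(-2.0 * math.log(r))
    return bias, scatter


def simulate(n_trials, pull, noise_sd, seed=0):
    """Simulate reproduction errors with a pull toward the previous target.

    The sinusoidal shape of the pull is an assumption made for illustration.
    """
    rng = random.Random(seed)
    prev = rng.uniform(-math.pi, math.pi)
    errors = []
    for _ in range(n_trials):
        target = rng.uniform(-math.pi, math.pi)
        delta = wrap(prev - target)  # previous target relative to current one
        response = target + pull * math.sin(delta) + rng.gauss(0.0, noise_sd)
        # Subtract the true target position to obtain the residual error.
        errors.append(wrap(response - target))
        prev = target
    return errors


no_pull = circular_bias_and_sd(simulate(20000, pull=0.0, noise_sd=0.2))
with_pull = circular_bias_and_sd(simulate(20000, pull=0.3, noise_sd=0.2))
print("no pull:   bias=%.3f scatter=%.3f" % no_pull)
print("with pull: bias=%.3f scatter=%.3f" % with_pull)
```

      Under these assumptions the pooled bias stays near zero in both runs, because the pull points in a different direction on every trial, while the pooled scatter is larger with the pull than without it. This is the concern raised in point (4): residuals pooled across trials can show lower apparent precision even when the underlying sensory noise is unchanged.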

    3. Reviewer #2 (Public review):

      Summary:

      This study investigates the influence of prior stimuli over multiple time scales in a position discrimination task, using pupillometry data and a reanalysis of EEG data from an existing dataset. The authors report consistent history-dependent effects across task-related, task-unrelated, and stimulus-related dimensions, observed across different time scales. These effects are interpreted as reflecting a unified mechanism operating at multiple temporal levels, framed within predictive coding theory.

      Strengths:

      The goal of assessing history biases over multiple time scales is interesting and resonates with both classic (Treisman & Williams, 1984) and recent work (Fritsche et al., 2020; Gekas et al., 2019). The manipulations used to distinguish task-related, unrelated, and stimulus-related reference frames are original and promising.

      Weaknesses:

      I have several concerns regarding the text, interpretation, and consistency of the results, outlined below:

      (1) The abstract should more explicitly mention that conclusions about feedforward mechanisms were derived from a reanalysis of an existing EEG dataset. As it is, it seems to present behavioral data only.

      (2) The EEG task seems quite different from the others, with location and color changes, if I understand correctly, on streaks of consecutive stimuli shown every 100 ms, with the task involving counting the number of target events. There might be different mechanisms and functions involved, compared to the behavioral experiments reported.

      (3) How is the arbitrary choice of restricting EEG decoding to a small subset of parieto-occipital electrodes justified? Blinks and other artifacts could have been corrected with proper algorithms (e.g., ICA) (Zhang & Luck, 2025) or even left in, as decoders are not necessarily affected by noise. Moreover, trials with blinks occurring at the stimulus time should be better removed, and the arbitrary selection of a subset of electrodes, while reducing the information in input to the decoder, does not account for trials in which a stimulus was missed (e.g., due to blinks).

      (4) The artifact that appears in many of the decoding results is puzzling, and I'm not fully convinced by the speculative explanation involving slow fluctuations. I wonder if a different high-pass filter (e.g., 1 Hz) might have helped. In general, the nature of this artifact requires better clarification and disambiguation.

      (5) Given the relatively early decoding results and surprisingly early differences in decoding peaks, it would be useful to visualize ERPs across conditions to better understand the latencies and ERP components involved in the task.

      (6) It is unclear why the precision derived from IEM results is considered reliable while the accuracy is dismissed due to the artifact, given that both seem to be computed from the same set of decoding error angles (equations 8-9).

      (7) What is the rationale for selecting five past events as the meso-scale? Prior history effects have been shown to extend much further back in time (Fritsche et al., 2020).

      (8) The decoding bias results, particularly the sequence of attraction and repulsion, appear to run counter to the temporal dynamics reported in recent studies (Fischer et al., 2024; Luo et al., 2025; Sheehan & Serences, 2022).

      (9) The repulsive component in the decoding results (e.g., Figure 3h) seems implausibly large, with orientation differences exceeding what is typically observed in behavior.

      (10) The pattern of accuracy, response times, and precision reported in Figure 3 (also line 188) resembles results reported in earlier work (Stewart, 2007) and in recent studies suggesting that integration may lead to interference at intermediate stimulus differences rather than improvement for similar stimuli (Ozkirli et al., 2025).

      (11) Some figures show larger group-level variability in specific conditions but not others (e.g., Figures 2b-c and 5b-c). I suggest reporting effect sizes for all statistical tests to provide a clearer sense of the strength of the observed effects.

      (12) The statement that "serial dependence is associated with sensory stimuli being perceived as more similar" appears inconsistent with much of the literature suggesting that these effects occur at post-perceptual stages (Barbosa et al., 2020; Bliss et al., 2017; Ceylan et al., 2021; Fischer et al., 2024; Fritsche et al., 2017; Sheehan & Serences, 2022).

      (13) If I understand correctly, the reproduction bias (i.e., serial dependence) is estimated on a small subset of the data (10%). Were the data analyzed by pooling across subjects?

      (14) I'm also not convinced that biases observed in forced-choice and reproduction tasks should be interpreted as arising from the same process or mechanism. Some of the effects described here could instead be consistent with classic priming.

    1. eLife Assessment

      The authors study how apolipoprotein L1 variants impact inflammation and lipid accumulation in macrophages. The findings will be useful for researchers investigating macrophage metabolism and inflammation. The discovery that the polyamine spermidine in part mediates such effects is interesting, but the supporting evidence for a physiologically relevant role is currently incomplete due to the lack of relevant in vivo studies.

    2. Reviewer #1 (Public review):

      Summary:

      Liu et al. investigated the mechanisms by which apolipoprotein L1 (APOL1) G1 and G2 variants cause inflammation and lipid accumulation in macrophages, using bone-marrow-derived macrophages from transgenic mice and human iPS cells. Although these findings are not novel, this work provides solid evidence, from a variety of in vitro assays and metabolomics measurements, that APOL1 G1 and G2 variants enhance inflammation and lipid accumulation in macrophages. Further, metabolomics measurements identified that the spermidine synthesis pathway was altered by APOL1 G1 and G2 variants, and the polyamine inhibitor reversed the variants-induced phenotypes.

      Strengths:

      Their hypothesis and choice of experiments in each section were clear and mostly solid. Mitochondrial morphological quantification by transmission electron microscopy images was convincing. The authors confirmed APOL1 localization inside macrophages and built stories based on their findings. Showing relevant positive and negative findings in line with current knowledge of APOL1-variants-driven pathologies, such as cation flux and cGAS-STING pathways, indicates good rigor.

      Weaknesses:

      Although most methods in this work were solid, the choice of α-difluoromethylornithine (DFMO) as an inhibitor of spermidine synthesis was not direct. It was still unclear if DFMO was reversing the phenotypes by lowering spermidine levels. Seahorse assay results would have avoided potential variabilities in cell densities by normalization. Heatmaps showing RNA-seq results would be appreciated better with a clear description of how the color is defined and calculated.

    3. Reviewer #2 (Public review):

      Summary:

      The G1 and G2 variants of the Apolipoprotein L1 (APOL1) gene are well-established risk factors for chronic kidney disease. While macrophages have been implicated in the pathogenesis of APOL1-mediated kidney diseases (AMKD), the precise impact of the G1 and G2 APOL1 variants on macrophage function and the underlying molecular mechanisms remains insufficiently characterized. In this manuscript, the authors investigate pathological phenotypes in macrophages carrying the G1 and G2 APOL1 variants. They report an accumulation of neutral lipids and activation of pro-inflammatory pathways, which appear to be at least partly driven by an accumulation of the polyamine spermidine and upregulation of the spermidine synthesis pathway. These findings reveal a pro-inflammatory role for G1 and G2 APOL1 in macrophages and identify the spermidine synthesis pathway as a potential therapeutic target.

      Strengths:

      The authors employ a comprehensive set of approaches to characterize macrophage phenotypes, including assessments of lipid accumulation, pro-inflammatory cytokine release, responses to M2-polarizing cytokines, autophagy, mitochondrial function, and metabolic profiling. The reversal of pathological phenotypes in G1 and G2 APOL1 macrophages by the polyamine synthesis inhibitor α-difluoromethylornithine provides compelling evidence supporting a causal role for spermidine in mediating APOL1 variant-associated dysfunction. Furthermore, the inclusion of both mouse and human models strengthens the translational relevance of the findings.

      Weaknesses:

      The manuscript would benefit from a clearer articulation of the specific role macrophages play in the pathogenesis of APOL1-associated kidney diseases to better emphasize the significance of the study. Additionally, the experimental design lacks a clear, logical progression, and the rationale behind some experiments is insufficiently justified, making certain conclusions difficult to fully support based on the presented data. Given the availability of established animal models of APOL1-associated kidney diseases, it is unclear why the authors chose to derive macrophages from the bone marrow of G1 and G2 APOL1 mice for in vitro assays rather than isolating and testing macrophages in vivo within these models. In vitro assays may exaggerate macrophage responses compared to physiological conditions, which could affect the interpretation of the data. Addressing this point would strengthen the manuscript.

    4. Reviewer #3 (Public review):

      Summary:

      Liu et al investigate the impact of G1 and G2 variants of the gene encoding Apolipoprotein L1 (APOL1) on macrophage inflammation. The authors have used bone marrow-derived macrophages and human induced pluripotent stem cell-derived macrophages as their model to identify altered immune signaling caused by G1 and G2 APOL1. The unbiased metabolite analysis indicates the possible involvement of altered polyamine metabolism in the regulation of inflammatory response in G1 and G2 macrophages. This study shows that targeting polyamine metabolism can limit macrophage inflammation and lipid accumulation in vitro conditions.

      Strengths:

      This study shows the importance of polyamine metabolism in the regulation of macrophage inflammatory response. The authors showed that spermidine synthesis is closely associated with altered macrophage functions with two risk-variant forms of APOL1 (G1 and G2). The altered macrophage lipid metabolism is known to be associated with macrophage dysfunction in G1 and G2 APOL1. However, the involvement of polyamine in the regulation of lipid accumulation and inflammation in macrophages in G1 and G2 variants is interesting and could be explored as a novel therapeutic approach for chronic inflammation.

      Weaknesses:

      The novelty of this study lies in the association of polyamine metabolism with lipid metabolism dysregulation in macrophages. The weakness of the manuscript is that there are insufficient experiments to support the claimed involvement of polyamine metabolism in the regulation of macrophage inflammation, which undermines the novelty of this study. The authors performed in vitro experiments targeting spermidine synthesis to show reduced inflammation and lipid accumulation, but have not performed any in vivo analysis of chronic kidney inflammation progression in G1 and G2 mice, which they have used to generate bone-marrow-derived macrophages. They have not shown any data that supports the specificity of DFMO in targeting spermidine synthesis.

    1. eLife Assessment

      This study presents a valuable finding of novel markers that may potentially identify resident tendon stem/progenitor cells (TSPCs). The study also presents a comprehensive single-cell transcriptional dataset that will be of value to the field. The evidence supporting the identification of novel markers of a TSPC is incomplete, requiring clarification of current analyses and additional validation experiments to demonstrate that these markers are indeed specific and these cells are indeed TSPCs. This work will be of interest to biologists and engineers focused on tendons and ligaments.

    2. Reviewer #1 (Public review):

      This study is focused on identifying unique, innovative surface markers for mature Achilles tendons by combining the latest multi-omics approaches and in vitro evaluation, which would address the knowledge gap of controversial identity of TPSCs with unspecific surface markers. The use of multi-omics technologies, in vivo characterization, in vitro standard assays of stem cells, and in vitro tissue formation is a strength of this work and could be applied for other stem cell quantification in the musculoskeletal research. The evaluation and identification of Cd55 and Cd248 in TPSCs have not been conducted in tendon, which is considered as innovative. Additionally, the study provided solid sequencing data to confirm co-expressions of Cd55 and Cd248 with other well-described surface markers such as Ly6a, Tpp3, Pdgfra, and Cd34. Generally, the data shown in the manuscript support the claims that the identified surface antigens mark TPSCs in juvenile tendons.

    3. Reviewer #2 (Public review):

      Summary:

      The molecular signature of tendon stem cells is not fully identified. The endogenous location of tendon stem cells within native tendon is also not fully elucidated. Several molecular markers have been identified to isolate tendon stem cells but they lack tendon specificity. Using the declining tendon repair capacity of mature mice, the authors compared the transcriptome landscape and activity of juvenile (2 weeks) and mature (6 weeks) tendon cells of mouse Achilles tendons and identified CD55 and CD248 as novel surface markers for tendon stem cells. CD55+ CD248+ FACS-sorted cells display a preferential tendency to differentiate into tendon cells compared to CD55neg CD248neg cells.

      Strengths:

      The authors generated a lot of data on juvenile and mature Achilles tendons, using scRNAseq, snRNAseq, and ATACseq strategies. This constitutes a resource dataset.

      Weaknesses:

      The analyses and validation of identified genes are not complete and could be pushed further. The endogenous expression of newly-identified genes in native tendons would be informative. The comparison of scRNAseq and snRNAseq datasets for tendon cell populations would strengthen the identification of tendon cell populations.

    4. Author Response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      This study is focused on identifying unique, innovative surface markers for mature Achilles tendons by combining the latest multi-omics approaches and in vitro evaluation, which would address the knowledge gap of the controversial identity of TPSCs with unspecific surface markers. The use of multi-omics technologies, in vivo characterization, in vitro standard assays of stem cells, and in vitro tissue formation is a strength of this work and could be applied for other stem cell quantification in musculoskeletal research. The evaluation and identification of Cd55 and Cd248 in TPSCs have not been conducted in tendons, which is considered innovative. Additionally, the study provided solid sequencing data to confirm co-expressions of Cd55 and Cd248 with other well-described surface markers such as Ly6a, Tpp3, Pdgfra, and Cd34. Generally, the data shown in the manuscript support the claims that the identified surface antigens mark TPSCs in juvenile tendons.

      However, there are missing links between scientific questions aimed to be addressed in Introduction and Methodology/Results. If the study focuses on unsatisfactory healing responses of mature tendons and understanding of mature TPSCs, at least mature Achilles tendons from more than 12-week-old mice and their comparison with tendons from juvenile/neonatal mice should be conducted. However, neither the 2-week-old nor the 6-week-old mice used for characterization here are skeletally mature. Additionally, there is a lack of complete comparison of TPSCs between 2-week and 6-week-old mice at the transcriptional and epigenetic levels.

      In order to distinguish TPSCs and characterize their epigenetic activities, the authors used scRNA-seq, snRNA-seq, and snATAC-seq approaches. The integration, analysis, and comparison of sequencing data across assays and/or time points is confusing and incomplete. For example, it should be more comprehensive to integrate both scRNA-seq and snRNA-seq data (if not, why both assays were used for Achilles tendons of both 2-week and 6-week timepoints). snRNA-seq and snATAC-seq data of 6-week-old mice were separately analyzed. No comparison of difference and similarity of TPSCs of 2-week and 6-week-old mice was conducted.

      Given the goal of this work to identify specific TPSC markers, the specificity of Cd55 and Cd248 for TPSCs is not clear. First, based on the data shown here, Cd55 and Cd248 mark the same cell population which is identified by Ly6a, TPPP3, and Pdgfra. Although, for instance, Cd34 is expressed by other tissues as discussed here, no data/evidence is provided by this work showing that Cd55 and Cd248 are not expressed by other musculoskeletal tissues/cells. Second, the immunostaining of Cd55 and Cd248 doesn't support their specificity. What is the advantage of using Cd55 and Cd248 for TPSCs compared to using other markers?

      Reviewer #2 (Public review): 

      Summary: 

      The molecular signature of tendon stem cells is not fully identified. The endogenous location of tendon stem cells within the native tendon is also not fully elucidated. Several molecular markers have been identified to isolate tendon stem cells but they lack tendon specificity. Using the declining tendon repair capacity of mature mice, the authors compared the transcriptome landscape and activity of juvenile (2 weeks) and mature (6 weeks) tendon cells of mouse Achilles tendons and identified CD55 and CD248 as novel surface markers for tendon stem cells. CD55+ CD248+ FACS-sorted cells display a preferential tendency to differentiate into tendon cells compared to CD55neg CD248neg cells.

      Strengths: 

      The authors generated a lot of data on juvenile and mature Achilles tendons, using scRNAseq, snRNAseq, and ATACseq strategies. This constitutes a resource dataset.

      Weaknesses: 

      The analyses and validation of identified genes are not complete and could be pushed further. The endogenous expression of newly identified genes in native tendons would be informative. The comparison of scRNAseq and snRNAseq datasets for tendon cell populations would strengthen the identification of tendon cell populations. 

      Reviewer #3 (Public review): 

      Summary: 

      In their report, Tsutsumi et al. use single nucleus transcriptional and chromatin accessibility analyses of mouse Achilles tendon in an attempt to uncover new markers of tendon stem/progenitor cells. They propose CD55 and CD248 as novel markers of tendon stem/progenitor cells. 

      Strengths: 

      This is an interesting and important research area. The paper is overall well written.

      Weaknesses: 

      Major problems: 

      (1) It is not clear what tissue exactly is being analyzed. The authors build a story on tendons, but there is little description of the dissection. The authors claim to detect MTJ and cartilage cells, but not bone or muscle cells. The tendon sheath is known to express CD55, so the population of "progenitors" may not be of tendon origin.

      (2) Cluster annotations are seemingly done with a single gene. Names are given to cells without functional or spatial validation. For example, MTJ cells are annotated based on Postn, but it is never shown that Postn is only expressed at the MTJ, and not in other anatomical locations in the tendon. 

      (3) The authors compare their data to public data based on interrogating single genes in their dataset. It is now standard practice to integrate datasets (eg, using harmony), or at a minimum using gene signatures built into Seurat (eg AddModuleScore).

      (4) Progenitor populations (SP1, SP2). The authors claim these are progenitors but show very clearly that they express macrophage genes. What are they, macrophages or fibroblasts?

      (5) All omics analysis is done on single data points (from many mice pooled). The authors make many claims on n=1 per group for readouts dependent on sample number (eg frequency of clusters).

      (6) The scRNAseq atlas in Figure 1 is made by analyzing 2W and 6W tendons at the same time. The snRNAseq and ATACseq atlas are built first on 2W data, after which the 6W data is compared. Why use the 2W data as a reference? Why not analyze the two time points together as done with the scRNAseq? 

      (7) Figure 5: The authors should show the gating strategy for FACS. Were non-fibroblasts excluded (eg, immune cells, endothelia, etc.)? Was a dead cell marker used? If not, it is not surprising that fibroblasts form colonies and express fibroblast genes when compared to CD55-CD248- immune cells, dead cells, or debris. Can control genes such as Ptprc or Pecam1 be tested to rule out contamination with other cell types?
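
      As an illustration of the signature-based comparison suggested in point (3), the following is a simplified sketch of a per-cell gene signature score: the mean expression of the signature genes minus the mean expression of a set of control genes. It is not Seurat's AddModuleScore implementation, which to my understanding additionally draws its control genes from expression-matched bins. The function name, the control gene names (Ctrl1, Ctrl2) and all expression values are hypothetical toy inputs, not data from the study.

```python
import random


def signature_score(expr, signature, n_control=5, seed=0):
    """Score each cell for a gene signature.

    expr      : dict mapping gene name -> list of per-cell expression values
    signature : list of gene names defining the signature
    Returns a list with one score per cell:
    mean(signature genes) - mean(randomly chosen control genes).
    """
    rng = random.Random(seed)
    sig = [g for g in signature if g in expr]
    if not sig:
        raise ValueError("no signature genes found in expression table")
    # Control genes are sampled from all genes outside the signature.
    pool = sorted(g for g in expr if g not in sig)
    control = rng.sample(pool, min(n_control, len(pool)))
    n_cells = len(next(iter(expr.values())))
    scores = []
    for i in range(n_cells):
        sig_mean = sum(expr[g][i] for g in sig) / len(sig)
        ctl_mean = (
            sum(expr[g][i] for g in control) / len(control) if control else 0.0
        )
        scores.append(sig_mean - ctl_mean)
    return scores


# Toy expression table (hypothetical values): four cells, five genes.
toy_expr = {
    "Cd55": [5, 4, 0, 0],
    "Cd248": [4, 5, 0, 1],
    "Ly6a": [3, 4, 1, 0],
    "Ctrl1": [2, 2, 2, 2],
    "Ctrl2": [1, 1, 1, 1],
}
toy_scores = signature_score(toy_expr, ["Cd55", "Cd248", "Ly6a"])
print(toy_scores)
```

      In this toy table the first two cells score above zero and the last two below, so a score of this kind ranks cells by a whole marker set rather than by a single gene, which is the reviewer's point.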

      Minor problems: 

      (1) Report the important tissue processing details: the type of collagenase used and cell viability before loading into the 10x machine.

      Reviewer #1 (Recommendations for the authors): 

      (1) Better healing responses in neonatal mice than mature mice have been well appreciated in the field and differences in ECM environment, immune responses, and cell function might account for varied injury results. However, direct evidence/data between better healing and abundant TSPCs needs to be discussed in the Introduction. 

      We agree with this insightful comment. We have now enhanced our introduction to include a more direct discussion of the relationship between better healing responses in neonatal mice and the abundance of TSPCs. We specifically highlighted how Howell et al. (2017) demonstrated that tendons in juvenile mice can regenerate functional tissue after injury, while this ability is lost in mature mice. Based on this observation, we articulated our hypothesis that juvenile mouse tendons likely contain abundant TSPCs, which potentially explains their superior healing capacity. Additionally, we have added a statement emphasizing that "investigating TSPCs biology is important for understanding tendon regeneration and homeostasis" (lines 61-62), which clearly articulates the central role that TSPCs play in tendon repair processes and tissue maintenance.

      (2) 6-week-old mouse Achilles tendons are not mature enough, or clinically relevant enough, to understand the deficient regenerative capacity of TSPCs underlying undesired healing. If the goal of this study is to identify TSPCs of mature tendons, evaluation of Achilles tendons from at least 12-week-old mice is more reasonable. 

      (3) 40-60 mouse Achilles tendons pooled for one sample seems a lot and there is mixed/missed information about how many total cells were collected for each sample and how they were used for different sequencing assays. This could raise the concern that cell digestion was not complete and possibly abundant resident cells might be missed for sequencing analysis.

      (4) The methods section has necessary information missing, which could create confusion for readers. Which time points are used for scRNA-seq and snATAC-seq? Which time points of cells are integrated and analyzed regarding each assay/combined assays? Why is transcriptional expression evaluated by both scRNA-seq and snRNA-seq and is there any technological difference between the two assays?

      We have thoroughly revised the Methods section to clearly specify which time points were used for each assay (lines 132-133 and 148-149). We have also clarified how cells from different time points were integrated and analyzed (lines 167-170, 179-184 and 494-502). Regarding the use of both scRNA-seq and snRNA-seq, we have explained that this complementary approach allowed us to capture both cytoplasmic and nuclear transcripts, providing a more comprehensive view of gene expression profiles while also enabling direct integration with snATAC-seq data. A comparison of the scRNA-seq integration data (2-week and 6-week) with the snRNA-seq (2-week) clusters confirmed that the clusters in the two datasets are closely correlated. We added the dot plot and correlation data in Supplemental Figure 5. Additionally, we have included comprehensive lists of differentially expressed genes (DEGs) for each identified cluster across all datasets (Supplementary Tables 1-15), which provide detailed molecular signatures for each cell population and facilitate cross-dataset comparisons.

      (5) The snATAC-sequencing data seems to be used only to confirm the findings from snRNA-seq and is not well explored. This assay directly measures/predicts transcription factor activities and epigenetic changes, which might be more accurate than inferring transcription factor activity from RNA sequencing data using the R package SCENIC.

      We appreciate the reviewer's insightful comment regarding the utilization of our snATAC-seq data. We agree that snATAC-seq provides valuable direct measurements of chromatin accessibility and transcription factor binding sites that can complement inference-based approaches like SCENIC. To address this concern, we have revised our manuscript to better emphasize the value of our snATAC-seq data in transcription factor activity evaluation. We have modified our text (lines 570-574). This modification emphasizes that our integrated approach leverages the strengths of both methodologies, with snATAC-seq providing direct measurements of chromatin accessibility and transcription factor binding sites that can validate and enhance the inference-based predictions from SCENIC analysis of RNA-seq data.

      (6) The image quality of immunostaining of Cd55 and Cd248 is low. The images show that only part of the tendon sheath has positive staining. Co-localization of Cd55 and Cd248 can't be found.

      We agree with the reviewer regarding the limitations of our immunostaining images. To obtain clearer images, we used paraffin sections for our analysis. Additionally, the antibodies for CD55 and CD248 required different antigen retrieval conditions to work effectively, which unfortunately prevented us from performing co-immunostaining to directly demonstrate co-localization. Despite these technical limitations, we have optimized the processing and imaging parameters to improve the quality of the immunostaining images in Figure 5A. These improved images more clearly demonstrate the expression of CD55 and CD248 in the tendon sheath, although in separate sections. The consistent localization patterns observed in these separate stainings, together with our FACS and functional analyses of double-positive cells, strongly support their co-expression in the same cell population. We have also updated the corresponding Methods section (lines 260-272) to include these optimized immunostaining protocols for better reproducibility.

      (7) Only TEM data of the tendon construct formed by sorted cells are shown. Results of mechanical tests would be very helpful to show the capacity of these TSPCs for tendon assembly.

      We appreciate the reviewer's suggestion regarding mechanical testing. We would like to direct the reviewer's attention to Figure 5I in our manuscript, where we have already included tensile strength measurements of the tendon construct. These mechanical test results demonstrate the functional capacity of CD55/CD248+ cells to form tendon-like tissue with appropriate mechanical properties, providing quantitative evidence of their ability for tendon assembly.

      (8) Cells negative for CD55/CD248 could be mixed cell populations, including hematopoietic lineages, cells from the tendon midsubstance, immune cells, and/or endothelial cells. Under induction with tri-lineage media, these mixed cell populations could exhibit different, unpredicted phenotypes (shown by no increased gene expression of tenogenic, chondrogenic, and osteogenic markers after induction). Higher tenogenic gene expression in TSPCs after induction does not mean that TSPCs are induced into tenocytes if they are compared to unknown cell populations with/without similar induction. Additionally, the PCR data in Figure 5, presented as ΔΔCT with unclear biological meaning, are challenging to interpret.
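
      For readers less familiar with the ΔΔCT notation raised in point (8), the standard conversion to a relative fold change can be sketched as follows. ΔCT is the target-gene CT minus the reference-gene CT within one sample, ΔΔCT is the difference in ΔCT between the test and control samples, and the fold change is 2^(-ΔΔCT) under the assumption of roughly 100% amplification efficiency. The CT values and function names below are invented for illustration and are not taken from Figure 5.

```python
def delta_delta_ct(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Return ΔΔCT: ΔCT of the test sample minus ΔCT of the control sample."""
    delta_test = ct_target_test - ct_ref_test   # ΔCT in the test sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl   # ΔCT in the control sample
    return delta_test - delta_ctrl


def fold_change(ddct):
    """Relative expression 2^(-ΔΔCT), assuming ~100% PCR efficiency."""
    return 2.0 ** (-ddct)


# Hypothetical CT values: target gene vs. a reference gene, in
# induced (test) and uninduced (control) cells.
ddct = delta_delta_ct(24.0, 18.0, 26.0, 18.0)  # (24 - 18) - (26 - 18) = -2
fc = fold_change(ddct)                         # 2 ** 2 = 4-fold higher in test
print(ddct, fc)
```

      A negative ΔΔCT therefore corresponds to higher expression in the test sample, which is why reporting the fold change is usually easier to interpret than the raw ΔΔCT.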

      Reviewer #2 (Recommendations for the authors): 

      The aim of this study was to identify novel markers for tendon stem cells. The authors used the fact that tendon cells of juvenile tendons have a greater ability to regenerate versus mature tendons. scRNAseq, snRNAseq, and snATACseq datasets were generated and analyzed in juvenile and mature Achilles tendons (mice). 

      The authors generated a lot of data that could be exploited further to show that these two novel surface tendon markers are more tendon-specific than those previously identified. Another concern is that there is no robust data indicative of the endogenous location of CD55+ CD248+ cells in the native tendon. Same comments for the transcription factors regulating the transcription of CD55 and CD248 and that of Scx and Mkx. A validation of the ATACseq data with a location in native tendons would be pertinent.

      The analysis was performed by comparing 2 sub-clusters of the same datasets and not between the two stages. Given the introduction highlighting the differential ability to regenerate between the two stages, the comparison between the two stages was somehow expected. I wonder if there is an explanation for the absence of analysis between the two stages.

      The authors have all the datasets to (bioinformatically) compare scRNAseq and snRNAseq datasets. This comparative analysis would strengthen the clustering of tendon cell populations at both stages. The labeling/identification of clusters associated with tendon cell populations is not obvious. I am surprised that there is no tendon sheath cluster such as endotenon or peritenon. A discussion on the different tendon cell populations (tendon clusters) is lacking.

      (1) Choice of the three markers 

      The authors chose three genes known to be markers for tendon stem cells, Tppp3, PdgfRa, and Ly6a, and investigated clusters (or subclusters) that co-express these three genes. Except for Tppp3, the other two genes lack tendon specificity. Ly6a is a stem cell marker and is recognized to be a marker of epi/perimysium in fetal and perinatal stages in mouse limbs (PMID: 39636726). Pdgfra is a generic marker of all connective tissue fibroblasts. Could it be that the identification of the two novel surface markers was biased with this choice? The identification of CD55 and CD248 has been done by comparing DEGs between cluster 4 (SP2) and cluster 1 (SP1). What about an unbiased comparison of both clusters 4 and 1 (or individual clusters) between mature and juvenile samples? The reader expected such a comparison since it was introduced as the rationale of the paper to compare juvenile and mature tendon cells.

      We selected Tppp3, PdgfRa, and Ly6a based on established literature identifying them as TSPC markers (Harvey et al., 2019; Tachibana et al., 2022). While only Tppp3 has tendon specificity, these genes collectively represent reliable TSPC markers currently available.

      Our identification of CD55 and CD248 came from comparing SP2 and SP1 clusters that showed these three markers plus tendon development genes. We did compare juvenile and mature samples as shown in Figure 1G, revealing decreased stem/progenitor marker expression with maturation. Additionally, we performed a comprehensive comparison between 2-week and 6-week samples visualized as a heatmap in Supplemental Figure 3, which clearly demonstrates the transcriptional changes that occur during tendon maturation. We have also provided the complete lists of differentially expressed genes for each identified cluster (Supplementary Tables 1-15), allowing for unbiased examination of cluster-specific gene signatures across developmental stages.

      Our functional validation confirmed CD55/CD248 positive cells express Tppp3, PdgfRa, and Ly6a while demonstrating high clonogenicity and tenogenic differentiation capacity, confirming their TSPC identity.

      (2) Concerns with cluster identification 

      Cluster 11, named the MTJ cluster, in the 2-week scRNAseq datasets was not detected in the 6-week scRNAseq datasets (Figure 1A). Does it mean that the MTJ disappears at 6 weeks in Achilles tendons? In the snRNAseq data, the MTJ cluster was defined on the basis of Postn expression. «Cluster 11, with high Periostin (Postn) expression, was classified as a myotendinous junction (MTJ).» Line 379.

      What is the basis/reference to set a link between Postn and MTJ? 

      Could the CA clusters be enthesis clusters? Is there any cartilage in the Achilles tendon?

      If there are MTJ clusters, one could expect to see clusters reflecting tendon attachment to cartilage/bone.

      I am surprised to see no cluster reflecting tendon attachments (endotenon or peritenon).

      Cluster 9 was identified as a proliferating cluster in scRNAseq datasets. Has the cell cycle regression step been performed?

      Thank you for highlighting these important questions about our cluster identification. The MTJ cluster (cluster 11) appears reduced but not absent in 6-week samples. We based our MTJ classification on Postn expression, which is enriched at the myotendinous junction, as documented by Jacobson et al. (2020) in their proteome analysis of myotendinous junctions. We have added this reference to the manuscript to provide clear support for our cluster annotation (lines 400-401).

      Regarding the CA cluster, these cells express chondrogenic markers, and we did not initially annotate them as enthesis cells. We have revised our manuscript to acknowledge that these could potentially represent enthesis cells, as you suggested (lines 412-414). While Achilles tendons themselves don't contain cartilage, our digestion process likely captured some adjacent cartilaginous tissues from the calcaneus insertion site.

      We acknowledge the absence of clearly defined endotenon/epitenon clusters. We have added more comprehensive explanations about peritenon tissues in our manuscript (lines 431-433 and 584-585), noting that previous studies (Harvey et al., 2019) have reported that Tppp3-positive populations are localized to the peritenon, and our SP clusters might also reflect peritenon-derived cells. This additional context helps clarify the potential tissue origins of our identified cell populations.

      For the proliferating cluster (cluster 9), we confirmed high expression of cell cycle markers (Mki67, Stmn1) but did not perform cell cycle regression to maintain biological relevance of proliferation status in our analysis. We have clarified this methodological decision in the revised Methods section.

      (3) What is the meaning of all these tendon clusters in scRNAseq, snRNAseq, and snATACseq? The authors described 2 or 3 SP clusters (depending on the scRNAseq or snRNAseq datasets), 2 CT clusters, 1 MTJ cluster, and 1 CA cluster. Do genes with enriched expression in these different clusters correspond to different anatomical locations in native tendons? Are there endotenon and peritenon clusters? Is there a correlation between clusters (or subclusters) expressing stem cell markers and peritenon, as described for Tppp3?

      Thank you for this important question about the biological significance of our identified clusters. The multiple tendon-related clusters we identified likely represent distinct cellular states and differentiation stages rather than strictly discrete anatomical locations. The SP clusters (stem/progenitor cells) express markers consistent with tendon progenitors reported in the literature, including Tppp3, which has been described in the peritenon. As we mentioned in our response to the previous question, we have added more comprehensive explanations about peritenon tissues in our manuscript (Lines 432-433 and 584-585), noting that previous studies (Harvey et al., 2019) have reported that Tppp3-positive populations are localized to the peritenon, and our SP clusters might reflect peritenon-derived cells. Our immunohistochemistry data in Figure 5A further confirms that CD55/CD248 positive cells are localized primarily to the tendon sheath region, similar to the localization pattern of Tppp3 reported by Harvey et al. (2019). The tenocyte clusters (TC) represent mature tendon cells within the fascicles, and their distinct transcriptional profiles suggest heterogeneity even within mature tenocytes. The MTJ cluster specifically expresses genes enriched at the myotendinous junction, while the CA cluster likely represents cells from the enthesis region, as you suggested. In the revised manuscript, we have clarified this interpretation and added additional discussion about the relationship between cluster identity and anatomical localization, particularly regarding the SP clusters and their correlation with peritenon regions.

      (4) The use of single-cell and single-nuclei RNAseq strategies to analyze tendon cell populations in juvenile and mature tendons is powerful, but the authors do not exploit these double analyses. A comparison between scRNAseq and snRNAseq datasets (2 weeks and 6 weeks) is missing. The similar or different features at the level of the clustering or at the level of gene expression should be explained/shown and discussed. This analysis should strengthen the clustering of tendon cell populations at both stages. In the same line, why are there 3 SP clusters in snRNAseq versus 2 SP clusters in scRNAseq? The MTJ cluster R2-5 expressing Sox9 should be discussed.

      Thank you for highlighting this important gap. We have conducted a comprehensive comparison between scRNA-seq and snRNA-seq datasets, revealing substantial correlation between cell populations identified by both methodologies. We've added a detailed dot plot visualization and correlation heatmap in Supplemental Figure 5 that demonstrates the relationships between clusters across datasets. The additional SP cluster in snRNA-seq likely reflects the greater sensitivity of nuclear RNA sequencing in capturing certain cell states that might be missed during whole-cell isolation. Our analysis shows this SP3 cluster represents a transitional state between stem/progenitor cells and differentiating tenocytes. Regarding the Sox9-expressing MTJ cluster R2-5, we have expanded our discussion in the revised manuscript (lines 500-502) to address this finding, incorporating relevant references (Nagakura et al., 2020) that describe Sox9 expression at the myotendinous junction. This expression pattern suggests that cells at this specialized interface may maintain developmental plasticity between tendon and cartilage fates, which is consistent with the transitional nature of this anatomical region.
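
      A cross-dataset cluster comparison of the kind described in this response can be sketched in a few lines. The sketch assumes each cluster is summarized by its mean expression over a shared gene list, and matches every scRNA-seq cluster to the snRNA-seq cluster with the highest Pearson correlation. This is only one plausible way to build such a correlation heatmap and is not claimed to be the authors' procedure. All expression values are invented, and the label "R2-1" is a hypothetical placeholder; "SP2", "MTJ", and "R2-5" are labels that appear in the review.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Shared gene order for every profile below.
genes = ["Tppp3", "Pdgfra", "Ly6a", "Postn", "Sox9"]

# Hypothetical cluster-mean expression profiles (illustrative values only).
sc_clusters = {"SP2": [2.1, 1.8, 2.4, 0.2, 0.1], "MTJ": [0.3, 0.9, 0.2, 2.6, 0.8]}
sn_clusters = {"R2-1": [1.9, 1.6, 2.0, 0.1, 0.2], "R2-5": [0.2, 0.7, 0.3, 2.2, 1.5]}

# For each scRNA-seq cluster, pick the best-correlated snRNA-seq cluster.
best_match = {}
for sc_name, sc_profile in sc_clusters.items():
    best_match[sc_name] = max(
        sn_clusters, key=lambda sn_name: pearson(sc_profile, sn_clusters[sn_name])
    )

print(best_match)
```

      In practice the profiles would span hundreds or thousands of variable genes, and the full correlation matrix, rather than only the best match, is what a heatmap would display.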

      (5) The claim of "high expression of CD55 and CD248 in the tendon sheath" is not supported by the experiments. The images of immunostaining (Figure 5A) are not very convincing. It is not explained if these are sections of 3D-tendon constructs or native tendons. The expression in 3D-tendon constructs is not informative, since tendon sheaths are not present. The endogenous expression of the transcription factors regulating tendon gene expression would be informative to localize tendon stem cells in native tendons.

      Thank you for this important critique. We agree that the original immunostaining images were not sufficiently convincing. To address this, we have used paraffin sections and optimized our staining protocols to improve image quality. It's worth noting that CD55 and CD248 antibodies required different antigen retrieval conditions to work effectively, which unfortunately prevented us from performing co-immunostaining to directly demonstrate co-localization in the same section. Despite these technical limitations, we have significantly improved the quality of the immunostaining images in Figure 5A with enhanced processing and imaging parameters.

      The improved images more clearly demonstrate the preferential expression of CD55 and CD248 in the tendon sheath/peritenon regions. The consistent localization patterns observed in these separate stainings, together with our FACS and functional analyses of double-positive cells, strongly support their co-expression in the same cell population.

      In the revised manuscript, we have also improved the figure legends to clearly indicate the nature of the tissue samples and updated the methods section to provide more detailed protocols for the immunostaining procedures used.

      Your suggestion regarding transcription factor visualization is valuable. While beyond the scope of our current study, we agree that examining the endogenous expression of regulatory transcription factors like Klf3 and Klf4 would provide additional insights into tendon stem cell localization in native tendons, and we plan to pursue this in future work.

      Minor concerns:

      (1) Lines 392-397 « To identify progenitor populations within these clusters, we analyzed expression patterns of previously reported markers Tppp3 and Pdgfra (Harvey et al., 2019; Tachibana, et al., 2022), along with the known stem/progenitor cell marker Ly6a (Holmes et al., 2007; Sung et al., 2008; Hittinger et al., 2013; Sidney et al., 2014; Fang et al., 2022). We identified subclusters within clusters 1 and 4 showing high expression of these genes, which we defined as SP1 and SP2. SP2 exhibited the highest expression of these genes, suggesting it had the strongest progenitor characteristics.» Please cite relevant Figures. Feature and violin plots (scRNAseq) across all cells (not for the only 2 SP1 and SP2 clusters) of Tppp3, Pdgfra and Ly6a are missing.

      Thank you for pointing out this important oversight. We have modified the manuscript to clarify that the text in question describes Figure 1B. Additionally, we have added new feature plots showing the expression of Tppp3, Pdgfra, and Ly6a across all cells in Supplemental Figure 1B.

      (2) The labeling of clusters with numbers in single-cell, single nuclei RNAseq, and ATACseq is difficult to follow.

      We appreciate your feedback on this issue. We recognize that the numerical labeling system across different datasets (scRNA-seq, snRNA-seq, and snATAC-seq) makes it difficult to track the same cell populations. To address this, we have added Supplemental Figure 5, which clearly shows the correspondence between cell populations in single-cell and single-nucleus RNA-seq datasets.

      (3) Figure 1C. It is not clear from the text and Figure legend if the DEGs are for the merged 2 and 6 weeks. If yes, an UMAP of the merged datasets of 2 and 6 weeks would be useful.

      (4) Along the Text, there are a few sentences with obscure rationale. Here are a few examples (not exhaustive):

      Abstract 

      “Combining single-nucleus ATAC and RNA sequencing analyses revealed that Cd55 and Cd248 positive fractions in tendon tissue are TSPCs, with this population decreasing at 6 weeks.”

      The rationale of this sentence is not clear. How can single-nucleus ATAC and RNA sequencing analyses identify Cd55 and Cd248 positive fractions as tendon stem cells?

      Thank you for highlighting this unclear statement in our abstract. We agree that the previous wording did not adequately explain how our sequencing analyses identified CD55 and CD248 positive cells as TSPCs. We have revised this sentence to clarify that our multi-modal approach (combining scRNA-seq, snRNA-seq, and snATAC-seq) enabled us to identify Cd55 and Cd248 positive populations as TSPCs based on their co-expression with established TSPC markers such as Tppp3, Pdgfra, and Ly6a. This comprehensive analysis across different sequencing modalities provided strong evidence for their identity as tendon stem/progenitor cells, which we further validated through functional assays. The revised abstract now more clearly communicates the logical progression of our analysis and findings.

      Line 80-82 

      “Cd34 is known to be highly expressed in mouse embryonic limb buds at E14.5 compared to E11.5 (Havis et al., 2014), making it a potential marker for TSPCs.”

      The rationale of this sentence is not clear. How can "the fact to be expressed in E14.5 mouse limbs" be an indicator of being a "potential marker of tendon stem cells"?

      Line 611 

      “Recent reports have highlighted the role of the Klf family in limb development (Kult et al., 2021), suggesting its potential importance in tendon differentiation”

      Why does the "role of Klf family in limb development" suggest an "importance in tendon differentiation"?

      Thank you for highlighting this logical gap in our manuscript. You're right that involvement in limb development doesn't necessarily indicate specific importance in tendon differentiation. We've revised this statement to more accurately reflect current knowledge, noting that while Klf factors are involved in limb development, their specific role in tendon differentiation requires further investigation (lines 658-659). This revised text better aligns with our findings of Klf3 and Klf4 expression in tendon progenitor cells without making unsupported claims about their functional significance.

      Reviewer #3 (Recommendations for the authors): 

      In addition to the points highlighted above some additional points are listed below.

      (1) Case in point: the authors claim CD55 and CD248 are found at the tendon sheath (line 541), which is not part of the tendon proper (although the IHC seems to show green in the epi/endotenon).

      (2) All cell types seem to express collagen based on Figure 1B, so either there is serious background contamination (eg, ambient RNA), or an error in data analysis.

      Minor problems: 

      (1) The figures are confusingly formatted. It is hard to go between cluster numbers and names. Clusters of similar cell types (eg progenitors) are not grouped to facilitate comparison, as ordering is based on cluster number).

      (2) The introduction does not distinguish between findings in mice and man. A lot of confusion in the tendon literature probably arises from interspecies differences, which are rarely addressed. 

      We appreciate this important point about species distinctions. We have revised our introduction to clearly identify species-specific findings by adding the term "murine" before TSPC references when discussing mouse studies (lines 64, 66, 70, 75, 100, and 108). We agree that interspecies differences are important considerations in tendon biology research, particularly when translating findings between animal models and humans. Our study focuses specifically on mouse models, and we have been careful not to overgeneralize our conclusions to human tendon biology without appropriate evidence. This clarification helps readers better contextualize our findings within the broader tendon literature landscape.

    1. eLife Assessment

      This study reanalyzed previously published scRNA-seq and TCR-seq data to examine the proportion and characteristics of dual-TCR-expressing Treg cells in mice, presenting some useful insights into TCR diversity and immune regulation. However, the evidence is incomplete, particularly with respect to data interpretation, statistical rigor, and the functionality of dual-TCR Treg cells. The study is potentially of interest to immunologists studying T-cell biology.

    2. Reviewer #2 (Public review):

      Summary:

      The manuscript by Xu and Peng et al. investigates whether co-expression of 2 T cell receptor (TCR) clonotypes can be detected in FoxP3+ regulatory CD4+ T cells (Tregs) and if it is associated with identifiable phenotypic effects. This paper presents data reanalyzing publicly available single-cell TCR sequencing and transcriptional analysis, convincingly demonstrating that dual TCR co-expression can be detected in Tregs, both in peripheral circulation as well as among Tregs in tissues. They then compare metrics of TCR diversity between single-TCR and dual TCR Tregs, as well as between Tregs in different anatomic compartments, finding the TCR repertoires to be generally similar though with dual TCR Tregs exhibiting a less diverse repertoire and some moderate differences in clonal expansion in different anatomic compartments. Finally, they examine the transcriptional profile of dual TCR Tregs in these datasets, finding some potential differences in expression of key Treg genes such as Foxp3, CTLA4, Foxo3, Foxo1, CD27, IL2RA, and Ikzf2 associated with dual TCR-expressing Tregs, which the authors postulate implies a potential functional benefit for dual TCR expression in Tregs.

      Strengths:

      This report examines an interesting and potentially biologically significant question, given recent demonstrations that dual TCR co-expression is a much more common phenomenon than previously appreciated (approximately 15-20% of T cells) and that dual TCR co-expression has been associated with significant effects on the thymic development and antigenic reactivity of T cells. This investigation leverages large existing datasets of single-cell TCRseq/RNAseq to address dual TCR expression in Tregs. The identification and characterization of dual TCR Tregs is rigorously demonstrated and presented, providing convincing new evidence of their existence.

      Weaknesses:

      The existence of dual TCR expression by Tregs has previously been demonstrated in mice and humans, limiting the novelty of the reported findings. The presented results should be considered in the context of these prior important findings. The focus on self-citation of their previous work, using the same approach to measure dual TCR expression in other datasets, limits the discussion of other more relevant and impactful published research in this area. Also, Reference #7 continues to list incorrect authors. The authors do not present a balanced or representative description of the available knowledge about either dual TCR expression by T cells or TCR repertoires of Tregs.

      The approach used follows a template used previously by this group for re-analysis of existing datasets generated by other research groups. The descriptions and interpretations of the data as presented are still shallow, lacking innovative or thoughtful approaches that could provide new insight.

      This demonstration of dual TCR Tregs is notable, though the authors do not compare the frequency of dual TCR co-expression by Tregs with non-Tregs. This limits interpreting the findings in the context of what is known about dual TCR co-expression in T cells. The response to this criticism in a previous review is considered non-responsive and does not improve the data or findings.

      Comparison of gene expression by single- and dual TCR Tregs is of interest, but as presented is difficult to interpret. The interpretations of the gene expression analyses are somewhat simplistic, focusing on single-gene expression of some genes known to have function in Tregs. However, the investigators continue to miss an opportunity to examine larger patterns of coordinated gene expression associated with developmental pathways and differential function in Tregs (Yang. 2015. Science. 348:589; Li. 2016. Nat Rev Immunol. 16:220; Wyss. 2016. Nat Immunol. 17:1093; Zemmour. 2018. Nat Immunol. 19:291). No attempt to define clusters is made. No comparison is made of the proportions of dual TCR cells in transcriptionally-defined clusters. The broad assessment of key genes by single- and dual TCR cells is conceptually interesting, but likely to be confounded by the heterogeneity of the Treg populations. This would need to be addressed and considered to make any analyses meaningful.

      The study design, re-analysis of existing datasets generated by other scientific groups, precludes confirmation of any findings by orthogonal analyses.

    3. Reviewer #3 (Public review):

      Summary:

      This study addressed the TCR pairing types and CDR3 characteristics of Treg cells. By analyzing scRNA and TCR-seq data, it claims that 10-20% of dual TCR Treg cells exist in mouse lymphoid and non-lymphoid tissues and suggests that dual TCR Treg cells in different tissues may play complex biological functions.

      Strengths:

      The study addresses an interesting question of how dual-TCR-expressing Treg cells play roles in tissues.

      Weaknesses:

      This study is inadequate, particularly regarding data interpretation, statistical rigor, and the discussion of the functional significance of Dual TCR Tregs.

      Comments on revisions:

      Although the authors have provided brief explanations in response to the reviewers' comments, they do not present any additional analyses that would address the fundamental concerns in a convincing manner. Moreover, the in silico analyses presented in the manuscript alone are insufficient to support the conclusions, and the functional experiments requested by the reviewers have not been conducted.

      In the current rebuttal, while some textual additions have been made to the manuscript, the only substantial revision to the figures appears to be the inclusion of statistical significance annotations (e.g., Fig. 1G, Fig. 3G). These changes do not adequately strengthen the overall data or address the core issues raised.

    4. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      (1) The use of single-cell RNA and TCR sequencing is appropriate for addressing potential relationships between gene expression and dual TCR.

      Thank you for your detailed review and suggestions. The main advantages of scRNA+TCR-seq are as follows: (1) It enables comparative analysis of features such as the ratio of single TCR paired T cells to dual TCR paired T cells at the level of a large number of individual T cells, through mRNA expression of the α and β chains. In the past, this analysis was limited to a small number of T cells, requiring isolation of single T cells, PCR amplification of the α and β chains, and Sanger sequencing; (2) While analyzing TCR paired T cell characteristics, it also allows examination of mRNA expression levels of transcription factors in corresponding T cells through scRNA-seq.

      (2) The data confirm the presence of dual TCR Tregs in various tissues, with proportions ranging from 10.1% to 21.4%, aligning with earlier observations in αβ T cells.

      Thank you very much for your detailed review and suggestions. Early studies on dual TCR αβ T cells have been very limited in number, with reported proportions of dual TCR T cells ranging widely from 0.1% to over 30%. In contrast, scRNA+TCR-seq can monitor over 5,000 single and paired TCRs, including dual paired TCRs, in each sample, enabling more precise examination of the overall proportion of dual TCR αβ T cells. It is important to note that our analysis focuses on T cells paired with functional α and β chains, while T cells with non-functional chain pairings and those with a single functional chain without pairing were excluded from the total cell proportion analysis. Previous studies generally lacked the ability to determine expression levels of specific chains in T cells without dual TCR pairings.

      (3) Tissue-specific patterns of TCR gene usage are reported, which could be of interest to researchers studying T cell adaptation, although these were more rigorously analyzed in the original works.

      Thank you very much for your detailed review and suggestions. T cell subpopulations exhibit tissue specificity; thus, we conducted a thorough investigation into Treg cells from different tissue sites. This study builds upon the original by innovatively analyzing the differences in VDJ rearrangement and CDR3 characteristics of dual TCR Treg cells across various tissues. This provides new insights and directions for the potential existence of “new Treg cell subpopulations” in different tissue locations. The results of this analysis suggest the necessity of conducting functional experiments on dual TCR Treg cells at both the TCR protein level and the level of effector functional molecules.

      (4) Lack of Novelty: The primary findings do not substantially advance our understanding of dual TCR expression, as similar results have been reported previously in other contexts.

      Thank you for your detailed review and suggestions. Early research on dual TCR T cells primarily relied on transgenic mouse models and in vitro experiments, using limited TCR alpha chain or TCR beta chain antibody pairings. Flow cytometry was used to analyze a small number of T cells to estimate dual TCR T cell proportion. No studies have yet analyzed dual TCR Treg cell proportion, V(D)J recombination, and CDR3 characteristics at high throughput in physiological conditions. The scRNA+TCR-seq approach offers an opportunity to conduct extensive studies from an mRNA perspective. With high-throughput advantages of single-cell sequencing technology, researchers can analyze transcriptomic and TCR sequence characteristics of all dual TCR Treg cells within a study sample, providing new ideas and technical means for investigating dual TCR T cell proportions, characteristics, and origins under different physiological and pathological states.

      (5) Incomplete Evidence: The claims about tissue-specific differences lack sufficient controls (e.g., comparison with conventional T cells) and functional validation (e.g., cell surface expression of dual TCRs).

      Thank you for your detailed review and suggestions. This study indeed only analyzed dual TCR Treg cells from different tissue locations based on the original manuscript, without a comparative analysis of other dual TCR T cell subsets corresponding to these tissue locations. The main reason for this is that, in current scRNA+TCR-seq studies of different tissue locations, unless specific T cell subsets are sorted and enriched, the number of T cells obtained from each subset is very low, making a detailed comparative analysis impossible. In the results of the original manuscript, we observed a relatively high proportion of dual TCR Treg cell populations in various tissues, with differences in TCR composition and transcription factor expression. Following the suggestions, we have included additional descriptions in R1, citing the study by Tuovinen et al., which indicates that the proportion of dual TCR Tregs in lymphoid tissues is higher than other T cell types. This will help understand the distribution characteristics of dual TCR Treg cells in different tissues and provide a basis for mRNA expression levels to conduct functional experiments on dual TCR Treg cells in different tissue locations.

      (6) Methodological Weaknesses: The diversity analysis does not account for sample size differences, and the clonal analysis conflates counts and clonotypes, leading to potential misinterpretation.

      We thank you for your review and suggestions. In response to your question about whether the diversity analysis considered the sample size issue, we conducted a detailed review and analysis. This study utilized the inverse Simpson index to evaluate TCR diversity of Treg cells. A preliminary analysis compared the richness and evenness of single TCR Treg cell and dual TCR Treg cell repertoires. The two datasets analyzed were from four mouse samples with consistent processing and sequencing conditions. However, when analyzing single TCR Tregs and dual TCR Tregs from various tissues, differences in detected T cell numbers by sequencing cannot be excluded from the diversity analysis. Following recommendations, we provided additional explanations in R1: CDR3 diversity analysis indicates TCR composition of dual TCR Treg cells exhibits diversity, similar to single TCR Treg cells; however, diversity indices of single TCR Tregs and dual TCR Tregs are not suitable for statistical comparison. Regarding the "clonal analysis" you mentioned, we define clonality based on unique TCR sequences; cells with identical TCR sequences are part of the same clone, with ≥2 counts defined as expansion. For example, in Blood, there are 958 clonal types and 1,228 cells, of which 449 are expansion cells. In R1, we systematically verified and revised clonal expansion cells across all tissue samples according to a unified standard.
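
      The two quantities discussed in this response, the inverse Simpson index and the count-based definition of clonal expansion (a clonotype seen in two or more cells), can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy clonotype labels, and the `downsample` helper are assumptions introduced here for illustration. Subsampling to a common depth is shown only as one common way to make diversity indices less dependent on sample size, which is the reviewer's concern.

```python
import random
from collections import Counter


def inverse_simpson(clone_sizes):
    """Return 1/D, where D is the sum of squared clonotype frequencies."""
    total = sum(clone_sizes)
    if total == 0:
        raise ValueError("at least one cell is required")
    return 1.0 / sum((n / total) ** 2 for n in clone_sizes)


def clonal_summary(cell_clonotypes):
    """Summarise a repertoire given one clonotype label per cell."""
    counts = Counter(cell_clonotypes)
    # A clonotype counts as expanded when it is observed in >= 2 cells.
    expanded = {clone: n for clone, n in counts.items() if n >= 2}
    return {
        "cells": sum(counts.values()),
        "clonotypes": len(counts),
        "expanded_clonotypes": len(expanded),
        "expanded_cells": sum(expanded.values()),
        "inverse_simpson": inverse_simpson(list(counts.values())),
    }


def downsample(cell_clonotypes, n_cells, seed=0):
    """Draw n_cells without replacement so repertoires share one depth (hypothetical helper)."""
    return random.Random(seed).sample(list(cell_clonotypes), n_cells)


# Toy data (hypothetical): 7 cells carrying 4 distinct clonotypes.
toy_cells = ["c1", "c1", "c1", "c2", "c2", "c3", "c4"]
summary = clonal_summary(toy_cells)
print(summary)
```

      Keeping cells and clonotypes as separate fields avoids conflating the two. Under this definition, the Blood figures quoted above (958 clonotypes, 1,228 cells, 449 expanded cells) would imply 1,228 - 449 = 779 singleton cells and therefore 958 - 779 = 179 expanded clonotypes.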

      (7) Insufficient Transparency: The sequence analysis pipeline is inadequately described, and the study lacks reproducibility features such as shared code and data.

      Thank you for your review and suggestions. Based on the original manuscript, we have made corresponding detailed additions in R1, providing further elaboration on the analysis process of shared data, screening methods, research codes, and tools. This aims to offer readers a comprehensive understanding of the analytical procedures and results.

      (8) Weak Gene Expression Analysis: No statistical validation is provided for differential gene expression, and the UMAP plots fail to reveal meaningful clustering patterns.

      Thank you very much for your review and suggestions. Based on your recommendations, we conducted an initial differential expression analysis of the top 10 mRNA molecules in single TCR Treg and dual TCR Treg cells using the DESeq2 R package in R1, with statistical significance determined by Padj < 0.05. Regarding the clustering patterns in the UMAP plots, since the analyzed samples consisted of isolated Treg cell subpopulations that highly express immune suppression-related genes, we did not perform a more detailed analysis of subtypes and expression gene differences. This study primarily aims to explore the proportions of single TCR and dual TCR Treg cells from different tissue sources, as well as the characteristics of CDR3 composition, with a focus on showcasing the clustering patterns of samples from different tissue origins and various TCR pairing types.

      (9) A quick online search reveals that the same authors have repeated their approach of reanalysing other scientists' publicly available scRNA-VDJ-seq data in six other publications. In other words, the approach used here seems to be focused on quick re-analyses of publicly available data without further validation and/or exploration.

      Thank you for your review and suggestions. Most current studies utilizing scRNA+TCR-seq overlook analysis of TCR pairing types and related research on single TCR and dual TCR T cell characteristics. Through in-depth analysis of shared scRNA+TCR-seq data from multiple laboratories, we discovered a significant presence of dual TCR T cells in high-throughput T cell research results that cannot be ignored. In this study, we highlight the higher proportion of dual TCR Tregs in different tissue locations, which exhibits a certain degree of tissue specificity, suggesting these cells may participate in complex functional regulation of Tregs. This finding provides new ideas and a foundation for further research into dual TCR Treg functions. However, as reviewers pointed out, findings from scRNA+TCR-seq at the mRNA level require additional functional experiments on dual TCR T cells at the protein level. We have supplemented our discussion in R1 based on these suggestions.

      Reviewer #2 (Public review):

      (1) The existence of dual TCR expression by Tregs has previously been demonstrated in mice and humans (Reference #18 and Tuovinen. 2006. Blood. 108:4063; Schuldt. 2017. J Immunol. 199:33, both omitted from references). The presented results should be considered in the context of these prior important findings.

      Thank you very much for your review and suggestions. Based on the original manuscript, we have supplemented our reading, understanding, and citation of closely related literature (Tuovinen, 2006, Blood, 108:4063 (lines 44 and 175 in R1); Schuldt, 2017, J Immunol, 199:33 (lines 44 and 178 in R1)). We once again appreciate the valuable comments from the reviewers, and we will refer to these in our subsequent dual TCR T cell research.

      (2) This demonstration of dual TCR Tregs is notable, though the authors do not compare the frequency of dual TCR co-expression by Tregs with non-Tregs. This limits interpreting the findings in the context of what is known about dual TCR co-expression in T cells.

      Thank you very much for your review and suggestions. This analysis is primarily based on the scRNA+TCR-seq study of sorted Treg cells, where we found the proportions and distinguishing features of dual TCR Treg cells in different tissue sites. Given the diversity and complexity of Treg function, conducting a comparative analysis of the origins of dual TCR Treg cells and non-Treg cells with dual TCRs will be a meaningful direction. Currently, peripheral induced Treg cells can originate from the conversion of non-Treg cells; however, little is known about the sources and functions of dual TCR Treg cell subsets in both central and peripheral sites. In R1, we have supplemented the discussion regarding the possible origins and potential applications of the "novel dual TCR Treg" subsets.

      (3) Comparison of gene expression by single- and dual TCR Tregs is of interest, but as presented is difficult to interpret. Statistical analyses need to be performed to provide statistical confidence that the observed differences are true.

      Thank you very much for your review and suggestions. Based on your recommendations, we performed an initial differential expression analysis of the top 10 mRNA molecules in single TCR Treg and dual TCR Treg cells using the DESeq2 R package in R1, with a statistical significance threshold of Padj<0.05 for comparisons.
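
      For readers unfamiliar with the Padj threshold mentioned here: the adjusted p-values that DESeq2 reports use the Benjamini-Hochberg procedure by default. The response does not state whether the default was used, so the following is only a sketch of that adjustment under this assumption, with made-up p-values and a hypothetical function name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four genes (hypothetical).
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in padj]
print(padj, significant)
```

      Because the adjustment depends on how many genes are tested, restricting the test to a preselected top 10 changes the resulting Padj values compared with testing all genes.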

      (4) The interpretations of the gene expression analyses are somewhat simplistic, focusing on the single-gene expression of some genes known to have a function in Tregs. However, the investigators miss an opportunity to examine larger patterns of coordinated gene expression associated with developmental pathways and differential function in Tregs (Yang. 2015. Science. 348:589; Li. 2016. Nat Rev Immunol. 16:220; Wyss. 2016. Nat Immunol. 17:1093; Zemmour. 2018. Nat Immunol. 19:291).

      Thank you for your review and suggestions. This study is based on publicly available scRNA+TCR-seq data from different organ sites generated by the original authors, focusing on sorted and enriched Treg cells within each tissue sample. However, there was no corresponding research on other cell types in each tissue sample, preventing analysis of other cells and factors involved in development and differentiation of single TCR Treg and dual TCR Treg. The literature suggested by the reviewer indicates that development, differentiation, and function of Treg cells have been extensively studied, resulting in significant advances. It also highlights complexity and diversity of Treg origins and functions. This research aims to investigate "novel dual TCR Treg cell subpopulations" that may exhibit tissue-specific differences found in the original authors' studies of Treg cells across different organ sites. This suggests further experimental research into their development, differentiation, origin, and functional gene expression as an important direction, which we have supplemented in the discussion section of R1.

      Reviewer #3 (Public review):

      (1) Definition of Dual TCR and Validity of Doublet Removal: This study analyzes Treg cells with Dual TCR, but it is not clearly stated how the possibility of doublet cells was eliminated. The authors mention using DoubletFinder for detecting doublets in scRNA-seq data, but is this method alone sufficient? We strongly recommend reporting the details of doublet removal and data quality assessment in the Supplementary Data.

      Thank you very much for your review and suggestions. In the analysis of the shared scRNA+TCR-seq data across multiple laboratories, as you mentioned, this study employed the DoubletFinder R package to exclude suspected doublets. Additionally, we used the nCount values of individual cells (i.e., the total sequencing reads or UMI counts for each cell) as auxiliary parameters to further optimize the assessment of cell quality. Because doublets may contain gene expression information from two or more cells, their nCount values are often abnormally high. In this study, all cells included in the analysis had nCount values not exceeding 20,000. Among the five tissue sample datasets, we further utilized hashtag oligonucleotide (HTO) labeling to eliminate doublets and negative cells. HTO labeling provides each cell with a unique barcode that differentiates cells from different tissue sources; by analyzing HTO labels, doublets and negative cells can be accurately identified. After the removal of chimeric cells, all samples exhibited T cells that possessed two or more TCR clones. This phenomenon validates the reliability of the methodological approach employed in this study and indicates that the analytical results accurately reflect the proportion of dual TCR T cells. Based on the recommendations of the reviewers, we have supplemented and clarified the methods and discussion sections in the manuscript. It is particularly noteworthy that in our analysis, the discussed dual TCR Treg cells and single TCR Treg cells specifically refer to those T cells that possess both functional α and β chains, which are capable of forming TCR. We have excluded from this analysis any Treg cells that possess only a single functional α or β chain and do not form TCR pairs, as well as those Treg cells in which the α or β chains involved in TCR pairing are non-functional.
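
      The filtering and classification logic described in this response can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `Cell` record, its field names, and the HTO class labels are hypothetical, and treating any paired cell with more than one functional α or β chain as "dual" is an assumption made for the example. Only the 20,000 nCount ceiling and the exclusion of cells lacking a functional α/β pair come from the text above.

```python
from dataclasses import dataclass

MAX_NCOUNT = 20_000  # per-cell nCount ceiling quoted in the response


@dataclass
class Cell:
    barcode: str
    n_count: int           # total reads or UMI counts for the cell
    hto_class: str         # hypothetical labels: "singlet", "doublet", "negative"
    functional_alpha: int  # number of functional TCR alpha chains detected
    functional_beta: int   # number of functional TCR beta chains detected


def passes_qc(cell):
    """Keep HTO singlets whose nCount does not exceed the ceiling."""
    return cell.hto_class == "singlet" and cell.n_count <= MAX_NCOUNT


def pairing_type(cell):
    """Classify a cell as 'single', 'dual', or 'unpaired' (excluded)."""
    a, b = cell.functional_alpha, cell.functional_beta
    if a == 0 or b == 0:
        return "unpaired"  # no functional alpha/beta pair
    if a == 1 and b == 1:
        return "single"
    return "dual"  # assumption: any extra functional chain counts as dual


def dual_fraction(cells):
    """Fraction of dual TCR cells among QC-passing, paired cells."""
    types = [pairing_type(c) for c in cells if passes_qc(c)]
    paired = [t for t in types if t != "unpaired"]
    return paired.count("dual") / len(paired) if paired else 0.0


# Toy cells (hypothetical values).
cells = [
    Cell("a", 5_000, "singlet", 1, 1),   # single
    Cell("b", 8_000, "singlet", 2, 1),   # dual
    Cell("c", 25_000, "singlet", 2, 2),  # removed: nCount too high
    Cell("d", 6_000, "doublet", 2, 2),   # removed: HTO doublet
    Cell("e", 4_000, "singlet", 1, 0),   # excluded: no functional pair
    Cell("f", 7_000, "singlet", 2, 2),   # dual
]
print(dual_fraction(cells))
```

      Stating the denominator explicitly (paired, QC-passing cells only) matters when comparing reported dual TCR proportions with earlier studies that may have counted cells differently.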

      (2) In Figure 3D, the proportion of Dual TCR T cells (A1+A2+B1+B2) in the skin is reported to be very high compared to other tissues. However, in Figure 4C, the proportion appears lower than in other tissues, which may be due to contamination by non-Tregs. The authors should clarify why it was necessary to include non-Tregs as a target for analysis in this study. Additionally, the sensitivity of scRNA-seq and TCR-seq may vary between tissues and may also be affected by RNA quality and sequencing depth in skin samples, so the impact of measurement bias should be assessed.

      We deeply appreciate your review and constructive comments. Based on the original manuscript, we have further supplemented and elaborated on the uniqueness and relative proportions of dual TCR T cell pairs in skin tissue samples in R1. Due to the scarcity of T cells in skin samples, we included some non-Treg cells during single-cell RNA sequencing and TCR sequencing to obtain a sufficient number of cells for effective analysis. The presence of non-regulatory T cells may indeed impact the statistical representation of dual TCR T cells as well as the related comparative analyses, as noted by the reviewer. T cells with A1+A2+B1+B2 type dual TCR pairings are primarily found within the non-regulatory T cell population in the skin. In response to this point, we have provided a detailed explanation of this analytical result in the revised manuscript R1. Furthermore, concerning the two datasets included in the study, we conducted a comparative analysis in R1, exploring how factors such as sequencing depth at different tissue sites might introduce biases in our findings, which we have thoroughly elaborated upon in the discussion section. We thank you once again for your valuable suggestions.

      (3) Issue of Cell Contamination: In Figure 2A, the data suggest a high overlap between blood, kidney, and liver samples, likely due to contamination. Can the authors effectively remove this effect? If the dataset allows, distinguishing between blood-derived and tissue-resident Tregs would significantly enhance the reliability of the findings. Otherwise, it would be difficult to separate biological signals from contamination noise, making interpretation challenging.

      We thank you for your review and suggestions. We have carefully verified data sources for tissues such as blood, kidneys, and liver. In the study by Oliver T et al., various techniques were employed to differentiate between leukocytes from blood and those from tissues, ensuring accurate identification of leukocytes from tissue samples. First, anti-CD45 antibody was injected intravenously to label cells in the vasculature, verifying that analyzed cells were indeed resident in the tissue. Second, prior to dissection and cell collection, authors performed perfusion on anesthetized mice to reduce contamination of tissue samples by leukocytes from the vasculature. Additionally, during single-cell sequencing, authors utilized HTO technology to avoid overlap between cells from different tissues.

      Analysis of the scRNA+TCR-seq data shared by the original authors revealed highly overlapping TCR sequences in blood, kidney, and liver, despite distinct cell labels associated with each tissue. While these techniques minimize overlap of cells from different sources, they cannot completely rule out the potential impact of this technical issue. As suggested, we have provided additional clarification in R1 of the manuscript regarding this phenomenon of high overlap in the kidney, liver, and blood, indicating that the possibility of Treg migration from blood to kidney and liver cannot be entirely excluded.

      (4) Inconsistency Between CDR3 Overlap and TCR Diversity: The manuscript states that Single TCR Tregs have a higher CDR3 overlap, but this contradicts the reported data that Dual TCR Tregs exhibit lower TCR diversity (higher 1/DS score). Typically, when TCR diversity is low (i.e., specific clones are concentrated), CDR3 overlap is expected to increase. The authors should carefully address this discrepancy and discuss possible explanations.

      Thank you for your review and suggestions. Regarding the potential relationship between CDR3 overlap and TCR diversity, in samples with consistent sequencing depth, lower diversity indeed corresponds to a higher proportion of CDR3 overlap. In our analysis of scRNA+TCR-seq data, we found that single TCR Tregs exhibit both higher diversity and CDR3 overlap, seemingly presenting contradictory analytical results (i.e., dual TCR Tregs show lower TCR diversity and CDR3 overlap). In R1, we supplemented the analysis of possible reasons: the presence of multiple TCR chains in dual TCR Treg cells may lead to a higher uniqueness of CDR3 due to multiple rearrangements and selections, resulting in lower CDR3 overlap; the lower diversity of dual TCR Tregs may be related to the number of T cells sequenced in each sample. The CDR3 diversity analysis in this study merely suggests that the TCR composition of dual TCR Treg cells is diverse, similar to that of single TCR Tregs. However, the diversity indices of single TCR Tregs and dual TCR Tregs are not suitable for statistical comparative analysis. A more in-depth and specific analysis of the diversity and overlap of the VDJ recombination mechanisms and CDR3 composition in dual TCR Tregs during development will be an important technical means to elucidate the function of dual TCR Treg cells.

      (5) Functional Evaluation of Dual TCR Tregs: This study indicates gene expression differences among tissue-resident Dual TCR T cells, but there is no experimental validation of their functional significance. Including functional assays, such as suppression assays or cytokine secretion analysis, would greatly enhance the study's impact.

      We sincerely appreciate your review and suggestions: In this analysis of scRNA+TCR-seq data, we innovatively discovered a higher proportion of dual TCR Treg cells in different tissue sites, which exhibited differences in tissue characteristics. Furthermore, we conducted a comparative analysis of the homogeneity and heterogeneity between single TCR Treg and dual TCR Treg cells. This result provides a foundation for further research on the origin and characteristics of dual TCR Treg cells in different tissue sites, offering new insights for understanding the complexity and functional diversity of Treg cells. Based on your suggestions, we have supplemented R1 with the feasibility of further exploring the functions of tissue-resident dual TCR T cells and the necessity for potential application research.

      (6) Appropriateness of Statistical Analysis: When discussing increases or decreases in gene expression and cell proportions (e.g., Figure 2D), the statistical methods used (e.g., t-test, Wilcoxon, FDR correction) should be explicitly described. They should provide detailed information on the statistical tests applied to each analysis.

      Thank you for your review and suggestions: Based on the original manuscript, we have supplemented the specific statistical methods for the differences in cell proportions and gene expression in R1.

    1. eLife Assessment

      This study provides an important perspective on the influence of parental care in the establishment of the amphibian microbiome. Through a combination of cross-fostering experimental work, comparative analysis, and developmental time series, the authors provide compelling evidence that vertical transmission through care is possible, and solid but somewhat preliminary evidence that it plays a significant role in shaping frog skin microbiomes in nature or across time. This work will be of interest to researchers studying the evolution of parental care and microbiomes in vertebrates.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript describes a series of lab and field experiments to understand the role of tadpole transport in shaping the microbiome of poison frogs in early life. The authors conducted a cross-foster experiment in which R. variabilis tadpoles were carried by adults of their own species, carried by adults of another frog species, or not carried at all. After being carried for 6 hours, tadpole microbiomes resembled those of their caregiving species. Next, the authors reported higher microbiome diversity in tadpoles of two species that engage in transport-based parental care compared to one species that does not. Finally, they collected tadpoles either from the backs of an adult (i.e., they had recently been transported) or from eggs (i.e., not transported) but did not find significant overlap in microbiome composition between transported tadpoles and their parents.

      Strengths:

      The cross-foster experiment and the field experiment that reared transported and non-transported tadpoles are creative ways to address an important question in animal microbiome research. Together, they imply a small role for parental care in the development of the tadpole microbiome. The manuscript is generally well-written and easy to understand. The authors make an effort (improved since the first version of the manuscript) to acknowledge the limitations of their experimental design.

      Weaknesses:

      This manuscript has improved since the initial version and now more clearly discusses the limitations of its experimental design. I have no further revisions to request.

    3. Reviewer #2 (Public review):

      Summary:

      Here, Fischer et al. attempt to understand the role of parental care, specifically the transport of offspring, in the development of the amphibian microbiome. The amphibian microbiome is an important study system due to its association with host health and disease outcomes. This study provides vertical transfer of bacteria through parental transport of tadpoles as one mechanism, among others, influencing tadpole microbiome composition. This paper gives insight into the relative roles of the environment, species, and parental care in amphibian microbiome assembly.

      The authors determine the time of bacterial colonization during tadpole development using PCR, observing that tadpoles were not colonized by bacteria prior to hatching from the vitelline membrane. This is an important finding for amphibian microbiome research and I would be curious to see if this is seen broadly across amphibian species. By doing this, the impact of transport can be more accurately assessed in their laboratory experiments. The authors found that caregiver species influenced community composition, with transported tadpoles sharing a greater proportion of their skin communities with the transporting species.

      In a comparison of three sympatric amphibian species that vary in their reproductive strategies, the authors found that tadpole community diversity was not reflective of habitat diversity, but may be associated with the different reproductive strategies of each species. Parental care explained some of the variance of tadpole microbiomes between species, however, transportation by conspecific adults did not lead to more similar microbiomes between tadpoles and adults compared to species that do not exhibit parental transport. This finding is in agreement with the understanding that the amphibian microbiome is distinct between developmental stages (eggs/tadpoles/adults) and also that amphibian microbiome composition is generally species specific.

      When investigating contributions of caretakers to transported offspring, the authors found that tadpole-adult pairs with a history of direct contact were not more similar than tadpole-adult pairs lacking that history. This conclusion was surprising when considering the direct contact between the adults and tadpoles, however if only certain taxa from the adults are capable of colonizing tadpoles, then one could expect that similar ASVs might be donated between tadpole-adult pairs.

      I did not find any major weaknesses in my review of this paper. I think that the data and conclusions here are of value to other researchers looking into the assembly of the amphibian microbiome. This paper offers insight into how tadpole-transport could influence the microbiome and adds to our overall understanding of amphibian microbiome assembly across the varied life histories of frogs.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1:

      (1) Developmental time series:

      It was not entirely clear how this experiment relates to the rest of the manuscript, as it does not compare any effects of transport within or across species.

      Implemented Changes:  

      The importance of species arrival timing for community assembly is addressed in both the introduction and discussion. To accommodate the reviewer’s concerns and further emphasize this point, we have added a clarifying sentence to the results section and included an illustrative example with supporting literature in the discussion.

      Results: Clarifying the timing of initial microbial colonization is essential for determining whether and how priority effects mediate community assembly of vertically transmitted microbes in early life, or whether these microbes arrive into an already established microbial landscape. We used non-sterile frogs of our captive laboratory colony (…)

      Discussion: For example, early microbial inoculation has been shown to increase the relative abundance of beneficial taxa such as Janthinobacterium lividum (Jones et al., 2024), whereas efforts to introduce the same probiotic into established adult communities have not led to long-term persistence (Bletz, 2013; Woodhams et al., 2016).  

      (2) Cross-foster experiment:

      The "heterospecific transport" tadpoles were manually brushed onto the back of the surrogate frog, while the "biological transport" tadpoles were picked up naturally by the parent. It is a little challenging to interpret the effect of caregiver species since it is conflated with the method of attachment to the parent. I noticed that the uptake of Os-associated microbes by Os-transported tadpoles seemed to be higher than the uptake of Rv-associated microbes by Rv-associated tadpoles (comparing the second box from the left to the rightmost boxplot in panel S2C). Perhaps this could be a technical artifact if manual attachment to Os frogs was more efficient than natural attachment to Rv frogs.

      I was also surprised to see so much of the tadpole microbiome attributed to Os in tadpoles that were not transported by Os frogs (25-50% in many cases). It suggests that SourceTracker may not be effectively classifying the taxa.

      Implemented Changes:  

      Methods (Study species, reproductive strategies and life history): Oophaga sylvatica (Os) (Funkhouser, 1956; CITES Appendix II, IUCN Conservation status: Near Threatened) is a large, diurnal poison frog (family Dendrobatidae) inhabiting lowland and submontane rainforests in Colombia and Ecuador. While male Os care for the clutch of up to seven eggs, females transport 1-2 tadpoles at a time to water-filled leaf axils where tadpoles complete their development (Pašukonis et al., 2022; Silverstone, 1973; Summers, 1992). Notably, females return regularly to these deposition sites to provision their offspring with unfertilized eggs.

      Discussion: Most poison frogs transport tadpoles on their backs, but the mechanism of adherence remains unclear. Similar to natural conditions, tadpoles that are experimentally placed onto a caregiver’s back also gradually adhere to the dorsal skin, where they remain firmly attached for several hours as the adult navigates dense terrain. Although transport durations were standardized, species-specific factors (such as microbial density at the contact site, microbial taxa identity, and skin physiology such as moisture) could influence microbial transmission between the transporting frog and the tadpole. While these differences may have contributed to varying transmission efficacies observed between the two frog species in our experiment, none of these factors should compromise the correct microbial source assignment. We thus conclude that transporting frogs serve as a source of microbiota for transported tadpoles. However, further studies on species-specific physiological traits and adherence mechanisms are needed to clarify what modulates the efficacy of microbial transmission during transport, both under experimental and natural conditions.

      Methods (Vertical transmission): Cross-fostering tadpoles onto non-parental frogs has been used previously to study navigation in poison frogs (Pašukonis et al., 2017). According to our experience, successful adherence to both parent and heterospecific frogs depends on the developmental readiness of tadpoles, which must have retracted their gills and be capable of hatching from the vitelline envelope through vigorous movement. Another factor influencing cross-fostering success is the docility of the frog during initial attachment, as erratic movements easily dislodge tadpoles before adherence is established. Rv are small, jumpy frogs that are easily stressed by handling, making experimental fostering of tadpoles, even their own, impractical. Therefore, we favored an experimental design where tadpoles initiate natural transport and parental frogs pick them up with a 100% success rate. We chose the poison frog Os as foster frogs because adults are docile, parental care in this species involves transporting tadpoles, and skin microbial communities differ from Rv, a critical prerequisite for our SourceTracker analysis. The use of the docile Os as the foster species enabled a 100% cross-fostering success rate, with no notable differences in adherence strength after six hours.

      Methods (Sourcetracker Analysis): To assess training quality, we evaluated model self-assignment using source samples. We selected the model trained on a dataset rarefied to the read depth of the adult frog sample with the lowest read count (48162 reads), as it showed the best overall self-assignment performance, whereas models trained on datasets rarefied to the lowest overall read depth performed worse. Unlike studies using technical replicates, our source samples represent distinct biological individuals and sampling timepoints, where natural microbiome variability is expected within each source category. Consequently, we considered self-assignment rates above 70% acceptable. All source samples were correctly assigned to their respective categories (Rv, Os, or control), but with varying proportions of reads assigned as 'Unknown'. Adult frog sources were reliably self-identified with high confidence (Os: 97.2% median, IQR = 1.4; Rv: 76.3% median, IQR = 38.1). Adult R. variabilis frogs displayed a higher proportion of 'Unknown' assignments compared to O. sylvatica, likely reflecting greater biological variability among individuals and/or a higher proportion of rare taxa not well captured in the training set. The control tadpole source showed lower self-assignment accuracy (median = 30.5%, IQR = 17.1), as expected given the low microbial biomass of these samples, which resulted in low read depth. Low read depth limits the information available to inform the iterative updating steps in Gibbs sampling and reduces confidence in source assignments. We therefore verified the robustness of our results by performing the second Sourcetracker analysis as described above, training the model only on adult sources and assigning all tadpoles, including low-biomass controls, as sinks (as described above). Self-assignment rates for the second training set varied (O. sylvatica: 79.2% median, IQR = 29; R. variabilis: 96.6% median, IQR = 3.7), while results remained consistent across analyses, supporting the reliability of our findings.

      (3) Cross-species analysis:

      Like the developmental time series, this analysis doesn't really address the central question of the manuscript. I don't think it is fair for the authors to attribute the difference in diversity to parental care behavior, since the comparison only includes n=2 transporting species and n=1 non-transporting species that differ in many other ways. I would also add that increased diversity is not necessarily an expectation of vertical transmission. The similarity between adults and tadpoles is likely a more relevant outcome for vertical transmission, but the authors did not find any evidence that tadpole-adult similarity was any higher in species with tadpole transport. In fact, tadpoles and adults were more similar in the non-transporting species than in one of the transporting species (lines 296-298), which seems to directly contradict the authors' hypothesis. I don't see this result explained or addressed in the Discussion.

      To address the reviewer’s concerns, we implemented the following changes:  

      Results:

      We rephrased the following sentence from the results part:  

      “These variations may therefore be linked to differing reproductive traits: Af and Rv lay terrestrial egg clutches and transport hatchlings to water, whereas Ll, a non-transporting species, lays eggs directly in water.”

      To read

      “These variations may therefore reflect differences in life history traits among the three species.”

      We moved the information on differing reproductive strategies into the Discussion, where it contributes to a broader context alongside other life history traits that may influence community diversity.

      Discussion (1): We added to our discussion that increased microbial diversity was not an expected outcome of vertical transmission.

      “However, increased microbial diversity is not a known outcome of vertical transmission, and further studies across a broader range of transporting and non-transporting species are needed to assess the role of transport in shaping diversity of tadpole-associated microbial communities.”

      Discussion (2): Likewise, communities associated with adults and tadpoles of transporting species were no more similar than those of non-transporting species. While poison frog tadpoles do acquire caregiver-specific microbes during transport, most of these microbes do not persist on the tadpoles' skin long-term. This pattern can likely be attributed to the capacity of tadpole skin- and gut microbiota to flexibly adapt to environmental changes (Emerson & Woodley, 2024; Santos et al., 2023; Scarberry et al., 2024). It may also reflect the limited compatibility of skin microbiota from terrestrial adults with aquatic habitats or tadpole skin, which differs structurally from that of adults (Faszewski et al., 2008). As a result, many transmitted microbes are probably outcompeted by microbial taxa continuously supplied by the aquatic environment. Interestingly, microbial communities of the non-transporting Ll were more similar to their adult counterparts than those of poison frogs. This pattern might reflect differences in life history among the species. While adult Ll commonly inhabit the rock pools where their tadpoles develop, adults of the two poison frog species visit tadpole nurseries only sporadically for deposition. These differences in habitat use may result in adult Ll hosting skin microbiota that are better adapted to aquatic environments as compared to Rv and Af. Additionally, their presence in the tadpoles’ habitat could make Ll a more consistent source of microbiota for developing tadpoles.

      (4) Field experiment: The rationale and interpretation of the genus-level network are not clear, and the figure is not legible. What does it mean to "visualize the microbial interconnectedness" or to be a "central part of the community"? The previous sentences in this paragraph (lines 337-343) seem to imply that transfer is parent-specific, but the genus-level network is based on the current adult frogs, not the previous generation of parents that transported them. So it is not clear that the distribution or co-distribution of these taxa provides any insight into vertical transmission dynamics.

      Implemented Changes:  

      We appreciate the reviewer’s close reading and understand how the inclusion of the network visualization without further clarification may have led to confusion. To clarify, the network was constructed from all adult frogs in the population, including—but not limited to—the parental frogs examined in the field experiment. We do not make any claims about the origin of the microbial taxa found on parental frogs. Rather, our aim was to illustrate how genera retained on tadpoles (following potential vertical transmission) contribute to the skin microbial communities of adult frogs of this population beyond just the parental individuals. This finding supports the observation that these retained taxa are generally among the most abundant in adult frogs. However, since this information is already presented in Table S8 and the figure is not essential to the main conclusions, we have removed Supplementary Figure S5 and the accompanying sentence: “A genus-level network constructed from 44 adult frogs shows that the retained genera make up a central part of the community of adult Rv in wild populations (Fig. S5).” We have adjusted the Methods section accordingly.

      Reviewer #2:

      I did not find any major weaknesses in my review of this paper. The work here could potentially benefit from absolute abundance levels for shared ASVs between adults and tadpoles to more thoroughly understand the influences of vertical transmission that might be masked by relative abundance counts. This would only be a minor improvement as I think the conclusions from this work would likely remain the same, however.

      In response to the reviewer’s suggestion, we estimated the absolute abundance of specific ASVs for all samples of tadpoles in which Sourcetracker identified shared ASVs between adults and tadpoles. The resulting scaled absolute abundance values (in copies/μL and copies per tadpole) are provided in Table S10, and a description of the method has been incorporated into the revised Methods section of the manuscript. To support the robustness of this approach in our dataset, we additionally designed an ASV-specific system for ASV24902-Methylocella. Candidate primers were assessed for specificity by performing local BLASTn alignments against the full set of ASV sequences identified in the respective microbial communities of tadpoles. We optimized the annealing temperature via gradient PCR and confirmed primer specificity through Sanger sequencing of the PCR product (Forward: 5′–GAGCACGTAGGCGGATCT–3′ Reverse: 5′–GGACTACNVGGGTWTCTAAT–3′). Using this approach, we confirmed that the relative abundance of ASV24902 (18.05% in the amplicon sequencing data) closely matched its proportion of the absolute 16S rRNA copy number in transported tadpole 6 (18.01%). While we intended to quantify all shared ASVs, we were limited to this single target due to insufficient material for optimizing the assays. As this particular ASV was also detected in the water associated with the same tadpole, we chose not to include this confirmation in the manuscript. Nevertheless, the close match supports the reliability of our approach for scaling absolute abundances in this dataset.

      Results: Absolute abundances of shared ASVs likely originating from the parental source pool (as identified by Sourcetracker) after one month of growth ranged from 7804 to 172326 copies per tadpole (Table S10).

      Methods: Quantitative analysis of 16S rRNA copy numbers with digital PCR (dPCR)

      Absolute abundances were estimated for ASVs that were shared between tadpoles after a one-month growth period and their respective caregivers, and for which Sourcetracker analysis identified the caregiver as a likely source of microbiota. We followed the quantitative sequencing framework described by Barlow et al. (2020), measuring total microbial load via digital PCR (dPCR) with the same universal 16S rRNA primers used to amplify the v4 region in our sequencing dataset. Absolute 16S rRNA copy numbers obtained from dPCR were then multiplied by the relative abundances from our amplicon sequencing dataset to calculate ASV-specific scaled absolute abundances. All dPCR reactions were carried out on a QIAcuity Digital PCR System (Qiagen) using Nanoplates with an 8.5K partition configuration, using the following cycling program: 95°C for 2 minutes, 40 cycles of 95°C for 30 seconds and 52°C for 30 seconds and 72°C for 1 minute, followed by 1 cycle of 40°C for 5 minutes. Reactions were prepared using the QIAcuity EvaGreen PCR Kit (Qiagen, Cat. No. 250111) with 2 µL of DNA template per reaction, following the manufacturer's protocol, and included a negative no-template control and a cleaned and sequenced PCR product as positive control. Samples were measured in triplicate and serial dilutions were performed to ensure accurate quantification. Data were processed with the QIAcuity Software Suite (v3.1.0.0). The threshold was set based on the negative and positive controls in 1D scatterplots. We report mean copy numbers per microliter with standard deviations, correcting for template input, dPCR reaction volume, and dilution factor. Mean copy numbers per tadpole were additionally calculated by accounting for the DNA extraction (elution) volume.
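      The scaling described in this Methods paragraph can be sketched as below. This is only an illustration of the arithmetic, not the authors' analysis code: the function name, its parameters, and the exact order of the corrections are assumptions. The 2 µL template volume and the 18.05% relative abundance appear in the text above; the reaction concentration, reaction volume, dilution factor, and elution volume are hypothetical placeholders.

```python
def scaled_absolute_abundance(
    copies_per_ul_reaction,
    reaction_volume_ul,
    template_volume_ul,
    dilution_factor,
    elution_volume_ul,
    relative_abundance,
):
    """Return (ASV copies per µL of DNA extract, ASV copies per tadpole).

    Hypothetical helper: total 16S copies measured by dPCR are traced back
    to the undiluted extract and then scaled by the ASV's relative abundance.
    """
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must be a fraction between 0 and 1")
    # Total 16S copies in the reaction, expressed per µL of template added,
    # then corrected for any dilution of the extract before loading.
    total_per_ul_extract = (
        copies_per_ul_reaction * reaction_volume_ul / template_volume_ul
    ) * dilution_factor
    # Scale the total load by the ASV's share of the sequenced community.
    asv_per_ul_extract = total_per_ul_extract * relative_abundance
    # One extraction per tadpole is assumed, so the whole eluate is counted.
    asv_per_tadpole = asv_per_ul_extract * elution_volume_ul
    return asv_per_ul_extract, asv_per_tadpole


# Placeholder inputs; only template_volume_ul and relative_abundance
# are taken from the text, the rest are invented for illustration.
per_ul, per_tadpole = scaled_absolute_abundance(
    copies_per_ul_reaction=50.0,
    reaction_volume_ul=12.0,
    template_volume_ul=2.0,
    dilution_factor=10.0,
    elution_volume_ul=100.0,
    relative_abundance=0.1805,
)
print(f"{per_ul:.1f} copies/µL extract, {per_tadpole:.0f} copies per tadpole")
```

      Because the relative abundance enters as a simple multiplier, any bias in the amplicon data carries over directly into the scaled value, which is why the ASV-specific assay mentioned in the response is a useful cross-check.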

      Recommendations for the authors:

      Reviewer #1:

      (1) Figure 1b summarizes the ddPCR data as a binary (detected/not detected), but this contradicts the main text associated with this figure, which describes bacteria as present, albeit in low abundances, in unhatched embryos (lines 145-147). Could the authors keep the diagram of tadpole development, which I find very useful, but add the ddPCR data from Figure S1c instead of simply binarizing it as present/absent?

      We appreciate the reviewer’s positive feedback on the clarity of the figure. We agree that presenting the ddPCR data in a more quantitative manner provides a more accurate representation of bacterial abundance across developmental stages. In response, we have retained the developmental diagram, as suggested, and replaced the binary (detected/not detected) information in Figure 1B with rounded mean values for each stage. To complement this, we have included mean values and standard deviations in Table S1. The corresponding text in the main manuscript and legends has been revised accordingly to reflect these changes.  

      (2) More information about the foster species, Oophaga sylvatica, would be helpful. Are they sympatric with Rv? Is their transporting behavior similar to that of Rv?

      We thank the reviewer for this helpful comment. In response, we have added further details on the biology and parental care behavior of Oophaga sylvatica, including information on its distribution range. The species does not overlap with Ranitomeya variabilis at the specific study site where the field work was conducted, although the species are sympatric in other countries. These additions have been incorporated into the Methods section under "Study species, reproductive strategies, and life history."  

      (3) Plotting the proportion of each tadpole microbiome attributed to R. variabilis and the proportion attributed to O. sylvatica on the same plot is confusing, as these points are nonindependent and there is no way for the reader to figure out which points originated from the same tadpole. I would suggest replacing Figure 1D with Figure S2C, which (if I understand correctly) displays the same data, but is separated according to source.

      We agree with the reviewer that Figure S2C allows for clearer interpretation of our results. In response, we implemented the suggested change and replaced Figure 1D with the alternative visualization previously shown in Figure S2C, which displays the same data separated by source. To provide readers with a complementary overview of the full dataset, we have retained the original combined plot in the supplementary material as Figure S2D.

      (4) On the first read, I found the use of "transport" in the cross-fostering experiment confusing until I understood that they weren't being transported "to" anywhere in particular, just carried for 6 hours. A change of phrasing might help readers here.

      We acknowledge the reviewer’s concern and have replaced “transported” with “carried” to avoid confusion for readers who may be unfamiliar with the behavioral terminology. However, because “transport” is the term widely used by specialists to describe this behavior, we now introduce it in the context of the experimental design with the following phrasing:

      “For this design, sequence-based surveys of amplified 16S rRNA genes were used to assess the composition of skin-associated microbial communities on tadpoles and their adult caregivers (i.e., the frogs carrying the tadpoles, typically referred to as ‘transporting’ frogs).”

      (5) "Horizontal transfer" typically refers to bacteria acquired from other hosts, not environmental source pools (line 394).

      We addressed this concern by rephrasing the sentence in the Discussion to avoid potential confusion. The revised text now reads:

      “Across species, newborns might acquire bacteria not only through transfer from environmental source pools and other hosts (…)”  

      (6) The authors suggest that tadpole transport may have evolved in Rv and Af to promote microbial diversity because "increased microbial diversity is linked to better health outcomes" (lines 477-479). It is often tempting to assume that more diversity is always better/more adaptive, but this is not universally true. The fact that the Ll frogs seem to be doing fine in the same environment despite their lower microbiome diversity suggests that this interpretation might be too far of a reach based on the data here.

      We appreciate the reviewer’s concern, agree that increased microbial diversity is not inherently advantageous and have revised the paragraph to make this clearer.  

      “While increased microbial diversity is not inherently advantageous, it has been associated with beneficial outcomes such as improved immune function, lower disease risk, and enhanced fitness in multiple other vertebrate systems.”

      However, rather than claiming that greater diversity is always advantageous, we suggest that this possibility should not be excluded and consider it a relevant aspect of a comprehensive discussion. We also note that whether poison frog tadpoles perform equally well with lower microbial diversity remains an open question. Drawing such conclusions would require experimental validation and cannot be inferred from comparisons with an evolutionarily distant species that differs in life history.

      Reviewer #2:

      (1) Figure 2: Are the data points in C a subset (just the tadpoles for each species) of B? The numbers look a little different between them. The number of observed ASVs in panel B for Rv look a bit higher than the observed ASVs in panel C.

      The data shown in panel C are indeed a subset of the samples presented in panel B, focusing specifically on tadpoles of each species. The slight differences in the number of observed ASVs between panels result from differences in rarefaction depth between comparisons: due to variation in sequencing depth across species and life stages, we performed rarefaction separately for each comparison in order to retain the highest number of taxa while ensuring comparability within each group. Although we acknowledge that this is not a standard approach, we found that results were consistent when rarefying across the full dataset, but chose the presented approach to better accommodate variation in our sample structure. This methodological detail is described in the Methods section:

      “All alpha diversity analyses were conducted with datasets rarefied to 90% of the read number of the sample with the fewest reads in each comparison and visualized with boxplots.”

      It is also noted in the figure legend: “The dataset was separately rarefied to the lowest read depth of each comparison.” We hope this clarification adequately addresses the reviewer’s concern and therefore have not made additional changes.
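      The per-comparison rarefaction quoted above (90% of the smallest library in each comparison) can be illustrated with a minimal sketch. It assumes a samples-by-ASVs count matrix and uses NumPy; the function names, the fixed seed, and the toy counts are hypothetical and do not come from the manuscript, whose actual pipeline is not shown here.

```python
import numpy as np


def rarefy_comparison(counts, fraction=0.9, seed=0):
    """Subsample every sample (row) without replacement to a common depth.

    counts: 2-D array-like of integer read counts, samples x ASVs.
    The depth is floor(fraction * smallest library size in this comparison).
    """
    counts = np.asarray(counts, dtype=np.int64)
    depth = int(np.floor(fraction * counts.sum(axis=1).min()))
    rng = np.random.default_rng(seed)
    rarefied = np.zeros_like(counts)
    for i, row in enumerate(counts):
        # Expand the row into one ASV index per read, then draw reads
        # without replacement so no ASV can exceed its original count.
        reads = np.repeat(np.arange(row.size), row)
        drawn = rng.choice(reads, size=depth, replace=False)
        rarefied[i] = np.bincount(drawn, minlength=row.size)
    return rarefied, depth


def observed_asvs(rarefied):
    """Number of ASVs with at least one read in each rarefied sample."""
    return (rarefied > 0).sum(axis=1)


# Toy comparison with three samples and four ASVs (invented counts).
toy = [[120, 30, 0, 50], [10, 10, 10, 70], [300, 0, 5, 95]]
rarefied, depth = rarefy_comparison(toy)
print(depth, observed_asvs(rarefied))
```

      Because the depth is recomputed for every comparison, the same sample can show a different number of observed ASVs in different panels, which is the effect the reviewer noticed between panels B and C.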

      (2) Lines 304-305: in the Figure 4B plot, there appear to be 12 transported tadpoles and 8 non-transported tadpoles.

      Thank you for catching this. We have corrected the plot and the associated statistics (alpha and beta diversity) in the results section as well as in the figure. Importantly, the correction did not affect any other results, and the overall findings and interpretations remain unchanged.  

      (3) Line 311: I think this should be Figure 4B.

      (4) Line 430: tadpole transport.

      (5) Line 431: I believe commas need to surround this phrase "which range from a few hours to several days depending on the species (Lötters et al., 2007; McDiarmid & Altig, 1999; Pašukonis et al., 2019)".

      We thank the reviewer for the thorough review and have corrected all typographical and formatting errors noted in comments (3) – (5).

    1. eLife Assessment

      This study demonstrates the application of END-seq, originally developed to study genome-wide DNA double-strand breaks, to telomere biology; the work packs a punch, concisely demonstrating the utility of this approach and the new insights that can be gained. The authors confirm that telomeres in telomerase-positive cells terminate with 5'-ATC in a Pot1-dependent manner, and demonstrate that this principle holds true in telomerase-negative ALT cells as well. S1-END-seq is similarly developed for telomeres, showing that ALT cells harbor several regions of ssDNA. The study is well-executed and convincing, the new insights are fundamental and compelling, and the optimized END-seq approaches will be widely utilized. The work will prompt additional studies that the reviewers look forward to, including combining telomeric END-seq with long-read sequencing to address the distribution and origin of variant telomere repeats and ssDNA along telomeres in ALT and telomerase-positive settings.

    2. Reviewer #1 (Public review):

      Summary

      This manuscript from Azeroglu et al. presents the application of END-Seq to examine the sequence composition of chromosome termini, i.e., telomeres. END-seq is a powerful genome sequencing strategy developed in Andre Nussenzweig's lab to examine the sequences at DNA break sites. Here, END-Seq is applied to explore the nucleotide sequences at telomeres and to ascertain (i) whether the terminal end sequence is conserved in cells that activate the ALT telomere elongation mechanism and (ii) whether the processes responsible for telomere end sequence regulation are conserved. With these aims clearly articulated, the authors convincingly show the power of this technique to examine telomere end-processing.

      Strengths

      (1) The authors effectively demonstrate the application of END-seq for these purposes. They verify prior data that 5' terminal sequences of telomeres in HeLa and RPE cells end in a canonical ATC sequence motif. They verify that the same sequence is present at the 5' ends of telomeres by performing END-seq across a panel of ALT cancer cells. As in non-ALT cells, the established role of POT1, a ssDNA telomere binding protein, in coordinating the mechanism that maintains the canonical ATC motif is likewise verified. However, by performing END-Seq in mouse cells lacking POT1 isoforms, POT1a and POT1b, the authors uncover that POT1b is dispensable for this process. This reveals a novel, important insight relating to the evolution of POT1 as a telomere regulatory factor.

      (2) The authors then demonstrate the utility of S1-END-seq, a variation of END-Seq, to explore the purported abundance of single-stranded DNA at telomeres of ALT cancer cells. Here, they demonstrate that ssDNA abundance is an intrinsic aspect of ALT telomeres and is dependent on the activity of BLM, a crucial mediator of ALT.

      Overall, the authors have effectively shown that END-seq can be applied to examine processes maintaining telomeres in normal and cancerous cells across multiple species. Using END-Seq, the authors confirm prior cell biological and sequencing data and the role of POT1 and BLM in regulating telomere termini sequences and ssDNA abundance. The study is nice and well-written, with the experimental rationale and outcomes clearly explained.

      Weaknesses

      This reviewer finds little to argue with in this study. It is timely and highly valuable for the telomere field. One minor question would be whether the authors could expand more on the application of END-Seq to examine the processive steps of the ALT mechanism? Can they speculate if the ssDNA detected in ALT cells might be an intermediate generated during BIR (i.e., is the ssDNA displaced strand during BIR) or a lesion? Furthermore, have the authors assessed whether ssDNA lesions are due to the loss of ATRX or DAXX, either of which can be mutated in the ALT setting?

      Comments on revisions:

      The authors addressed the comments. Thank you.

    3. Reviewer #2 (Public review):

      This is a short yet very clear manuscript demonstrating that two methods (END-seq and S1-END-seq), previously developed in the Nussenzweig laboratory to study DSBs in the genome, can also be applied to the 5' ends of mammalian telomeres and the accumulation of telomeric single-stranded DNA.

      The authors first validate the applicability of END-seq using different approaches and confirm that mammalian telomeres preferentially end with an ATC 5' end through a mechanism that requires intact POT1 (POT1a in mice). They then extend their analysis to cells that maintain telomeres through the ALT mechanism and demonstrate that, in these cells as well, telomeres frequently end in an ATC 5' sequence via a POT1-dependent mechanism. Using S1-END-seq, the authors further show that ALT telomeres contain single-stranded DNA and estimate that each telomere in ALT cells harbors at least five regions of ssDNA.

      I find this work very interesting and incisive. It clearly demonstrates that END-seq can be applied with unprecedented depth and precision to the study of telomeric features such as the 5' end and ssDNA. The data are very clear and thoroughly interpreted, and the manuscript is well written. The results are carefully analyzed and effectively presented. Overall, I find this manuscript worthy of publication, as the optimized END-seq methods described here will likely be widely utilized in the telomere field.

      Also, the authors have satisfactorily addressed my previous comments.

    4. Reviewer #3 (Public review):

      Summary:

      A subset of cancer cells attain replicative immortality by activating the ALT mechanism of telomere maintenance, which is currently the subject of intense research due to its potential for novel targeted therapies. Key questions remain in the field, such as whether ALT telomeres adhere to the same end-protection rules as telomeres in telomerase-expressing cells, or if ALT telomeres possess unique properties that could be targeted with new, less toxic cancer therapies. Both questions, along with the approaches developed by the authors to address them, are highly relevant.

      Strengths:

      Since chromosome ends resemble one-ended DSBs, the authors hypothesized that the previously described END-SEQ protocol could be used to accurately sequence the 5' end of telomeres on the C-rich strand. As expected, most reads corresponded to the C-rich strand and, confirming a previous observation by the de Lange group, most chromosomes end with the ATC-5' sequence, a feature that was found to be dependent on POT1 and to be conserved in both human ALT cells and mouse cells. Through a complementary method, S1-END-SEQ, the authors further explored ssDNA regions at telomeres, providing new insights into the characteristics of ALT telomeres. The study is original, and the experiments were well controlled and excellently executed.

      Weaknesses:

      A few additional experiments would have strengthened the results, such as combining error-free long-read sequencing with END-SEQ to compare the abundance of VTRs within telomeres versus at their distal ends.

      Along this line, are VTRs increased at ssDNA regions of ALT telomeres? What is the frequency of VTRs in the END-SEQ analysis of TRF1-FokI-expressing ALT cells? Is it also increased? Has TRF1-FokI been applied to telomerase-expressing cells to compare VTR frequencies at internal sites between ALT and telomerase-expressing cells?

      To what extent do ECTRs contribute to telomeric ssDNA?

      Future experiments may help shed light on this.

    5. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors): 

      One minor question would be whether the authors could expand more on the application of END-Seq to examine the processive steps of the ALT mechanism? Can they speculate if the ssDNA detected in ALT cells might be an intermediate generated during BIR (i.e., is the ssDNA displaced strand during BIR) or a lesion? Furthermore, have the authors assessed whether ssDNA lesions are due to the loss of ATRX or DAXX, either of which can be mutated in the ALT setting?

      We appreciate the reviewer’s insightful questions regarding the application of our assays to investigate the nature of the ssDNA detected in ALT telomeres. Our primary aim in this study was to establish the utility of END-seq and S1-END-seq in telomere biology and to demonstrate their applicability across both ALT-positive and -negative contexts. We agree that exploring the mechanistic origins of ssDNA would be highly informative, and we anticipate that END-seq–based approaches will be well suited for such future studies. However, it remains unclear whether the resolution of S1-END-seq is sufficient to capture transient intermediates such as those generated during BIR. We have now included a brief speculative statement in the revised discussion addressing the potential nature of ssDNA at telomeres in ALT cells.

      Reviewer #2 (Recommendations for the authors):

      How can we be sure that all telomeres are equally represented? The authors seem to assume that END-seq captures all chromosome ends equally, but can we be certain of this? While I do not see an obvious way to resolve this experimentally, I recommend discussing this potential bias more extensively in the manuscript.

      We thank the reviewer for raising this important point. END-seq and S1-END-seq are unbiased methods designed to capture either double-stranded or single-stranded DNA that can be converted into blunt-ended double-stranded DNA and ligated to a capture oligo. As such, if a subset of telomeres cannot be processed using this approach, it is possible that these telomeres may be underrepresented or lost. However, to our knowledge, there are no proposed telomeric structures that would prevent capture using this method. For example, even if a subset of telomeres possesses a 5′ overhang, it would still be captured by END-seq. Indeed, we observed the consistent presence of the 5′-ATC motif across multiple cell lines and species (human, mouse, and dog). More importantly, we detected predictable and significant changes in sequence composition when telomere ends were experimentally altered, either in vivo (via POT1 depletion) or in vitro (via T7 exonuclease treatment). Together, these findings support the robustness of the method in capturing a representative and dynamic view of telomeres across different systems.

      That said, we have now included a brief statement in the revised discussion acknowledging that we cannot fully exclude the possibility that a subset of telomeres may be missed due to unusual or uncharacterized structures.

      I believe Figures 1 and 2 should be merged.

      We appreciate the reviewer’s suggestion to merge Figures 1 and 2. However, we feel that keeping them as separate figures better preserves the logical flow of the manuscript and allows the validation of END-seq and its application to be presented with appropriate clarity and focus. We hope the reviewer agrees that this layout enhances the clarity and interpretability of the data.

      Scale bars should be added to all microscopy figures.

      We thank the reviewer for pointing this out. We have now added scale bars to all the microscopy panels in the figures and included the scale details in the figure legends.

      Reviewer #3 (Recommendations for the authors):

      Overall, the discussion section is lacking depth and should be expanded and a few additional experiments should be performed to clarify the results.

      We thank the reviewer for the suggestions. Based on this reviewer’s comments and comments from the other reviewers, we incorporated several points into the discussion. As a result, we hope that we provide additional depth to our conclusions.

      (1) The finding that the abundance of variant telomeric repeats (VTRs) within the final 30 nucleotides of the telomeric 5' ends is similar in both telomerase-expressing and ALT cells is intriguing, but the authors do not address this result. Could the authors provide more insight into this observation and suggest potential explanations? As the frequency of VTRs does not seem to be upregulated in POT1-depleted cells, what then drives the appearance of VTRs on the C-strand at the very end of telomeres? Is the CST-Polα complex responsible?

      The reviewer raises a very interesting and relevant point. We are hesitant at this point to speculate on why we do not see a difference in variant repeats in ALT versus non-ALT cells, since additional data would be needed. One possibility is that variant repeats in ALT cells accumulate stochastically within telomeres but are selected against when they are present at the terminal portion of chromosome ends. However, to prove this hypothesis, we would need error-free long-read technology combined with END-seq. We feel that developing this approach would be beyond the scope of this manuscript.

      (2) The authors also note that, in ALT cells, the frequency of VTRs in the first 30 nucleotides of the S1-END-SEQ reads is higher compared to END-SEQ, but this finding is not discussed either. Do the authors think that the presence of ssDNA regions is associated with the VTRs? Along this line, what is the frequency of VTRs in the END-SEQ analysis of TRF1-FokI-expressing ALT cells? Is it also increased? Has TRF1-FokI been applied to telomerase-expressing cells to compare VTR frequencies at internal sites between ALT and telomerase-expressing cells?

      Similarly to what is discussed above, short reads have the advantage of being very accurate but do not provide sufficient length to establish the relative frequency of VTRs across the whole telomere sequence. The TRF1-FokI experiment is a good suggestion, but it would still be biased toward non-variant repeats due to the TRF1-binding properties. We plan to address these questions in a future study involving long-read sequencing and END-seq capture of telomeres.

      Finally, in these experiments (S1-END-SEQ or END-SEQ in TRF1-FokI), is the frequency of VTRs the same on both the C- and the G-rich strands? It is possible that the sequences are not fully complementary in regions where G4 structures form.

      We thank the reviewer for this observation. While we do observe a higher frequency of variant telomeric repeats (VTRs) in the first 30 nucleotides of S1-END-seq reads compared to END-seq in ALT cells, we are currently unable to determine whether this difference is significant, as an appropriate control or matched normalization strategy for this comparison is lacking. Therefore, we refrain from overinterpreting the biological relevance of this observation.

      (3) Based on the ratio of C-rich to G-rich reads in the S1-END-SEQ experiment, the authors estimate that ALT cells contain at least 3-5 ssDNA regions per chromosome end. While the calculation is understandable, this number could be discussed further to consider the possibility that the observed ratios (of roughly 0.5) might result from the presence of extrachromosomal DNA species, such as C-circles. The observed increase in the ratio of C-rich to G-rich reads in BLM-depleted cells supports this hypothesis, as BLM depletion suppresses C-circle formation in U2OS cells. To test this, the authors should examine the impact of POLD3 depletion on the C-rich/G-rich read ratio. Alternatively, they could separate high-molecular-weight (HMW) DNA from low-molecular-weight DNA in ALT cells and repeat the S1-END-SEQ in the HMW fraction.

      The reviewer is absolutely correct. Our calculation did not exclude the possibility of extrachromosomal DNA as a source of telomeric ssDNA. We have now addressed this point in our discussion.

      (4) What is the authors' perspective on the presence of ssDNA at ALT telomeres? Do they attribute this to replication stress? It would be helpful for the authors to repeat the S1-END-SEQ in telomerase-expressing cells with very long telomeres, such as HeLa1.3 cells, to determine if ssDNA is a specific feature of ALT cells or a result of replication stress. The increased abundance of G4 structures at telomeres in HeLa1.3 cells (as shown in J. Wong's lab) may indicate that replication stress is a factor. Similar to Wong's work, it would be valuable to compare the C-rich/G-rich read ratios in HeLa1.3 cells to those in ALT cells with similar telomeric DNA content.

      The reviewer is correct in pointing out that we still do not know what causes ssDNA at telomeres in ALT cells. Replication stress seems the most logical explanation based on the work of many labs in the field. However, our data did not reveal any significant difference in the levels of ssDNA at telomeres in non-ALT cells based on telomere length. We used the HeLa1.2.11 cell line (now clarified in the Materials section), which is the parental line of HeLa1.3 and has similarly long telomeres (~20 kb vs. ~23 kb). Despite their long telomeres and potential for replication-associated challenges such as G-quadruplex formation, HeLa1.2.11 cells did not exhibit the elevated levels of telomeric ssDNA that we observed in ALT cells (Figure 4B). Additional experiments are needed to map the occurrence of ssDNA at telomeres in relation to progression toward ALT.

      Finally, Reviewer #3 raises a list of minor points:

      (1) The Y-axes of Figure 4 have been relabeled to account for the G-strand reads.

      (2) Statistical analyses have been added to the figures where applicable.

      (3) The manuscript has been carefully proofread to improve clarity and consistency throughout the text and figure legends.

      (4) We have revised the text to address issues related to the lack of cross-referencing between the supplementary figures and their corresponding legends.

    1. eLife Assessment

      This important study addresses the role of non-genetic factors in individual differences in phenotype. Using C. elegans, the study finds that non-genetic differences in gene expression, partly influenced by the environment, correlate with individual differences in two reproductive traits. This supports the use of gene expression data as a key intermediate for understanding complex traits. The clever study design makes for compelling evidence.

    2. Reviewer #1 (Public review):

      Summary:

      Genome-wide association studies have been an important approach to identifying the genetic basis of human traits and diseases. Despite their successes, for many traits, a substantial amount of variation cannot be explained by genetic factors, indicating that environmental variation and individual 'noise' (stochastic differences as well as unaccounted for environmental variation) also play important roles. The authors' goal was to address how gene expression variation in genetically identical individuals, driven by historical environmental differences and 'noise', could be used to predict reproductive trait differences.

      Strengths:

      To address this question, the authors took advantage of genetically identical C. elegans individuals to transcriptionally profile 180 adult hermaphrodite individuals that were also measured for two reproductive traits. A major strength of the paper is in its experimental design. While experimenters aim to control the environment that each worm experiences, it is known that there are small differences even when worms are grown together on the same agar plate - e.g., the age of their mother, their temperature, the amount of food they eat, and the oxygen and carbon dioxide levels depending on where they roam on the plate. Instead of neglecting this unknown variation, the authors design the experiment up front to create two differences in the historical environment experienced by each worm: 1) the age of its mother and 2) an 8-hour temperature difference, at either 20 or 25 °C. This helped the authors interpret the gene expression differences and trait expression differences that they observed.

      Using two statistical models, the authors measured the association of gene expression for 8824 genes with the two reproductive traits, considering both the level of expression and the historical environment experienced by each worm. Their data support several conclusions. They convincingly show that gene expression differences are useful for predicting reproductive trait differences, predicting ~25-50% of the trait differences depending on the trait. Using RNAi, they also show that the genes they identify play a causal role in trait differences. Finally, they demonstrate an association between trait variation and the H3K27 trimethylation mark, suggesting that chromatin structure can be an important causal determinant of gene expression and trait variation.
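      To make the "fraction of trait differences predicted" concrete, the following is a minimal sketch of how a held-out R² can be computed for an expression-based predictor. It is not the authors' pipeline: the data are synthetic, the ridge model and its penalty are assumptions chosen for illustration, and the names `fit_ridge`, `predict`, and `r_squared` are hypothetical helpers. Only the sample size of 180 individuals is taken from the text; the 50 genes stand in for the 8824 that were analyzed.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Fraction of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data (illustrative only)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def predict(X, coef, intercept):
    return X @ coef + intercept


# Synthetic stand-in data: 180 individuals, 50 hypothetical genes.
rng = np.random.default_rng(0)
n_worms, n_genes = 180, 50
expression = rng.normal(size=(n_worms, n_genes))
true_effects = np.zeros(n_genes)
true_effects[:5] = 1.0  # assume only a few genes carry signal
trait = expression @ true_effects + rng.normal(scale=2.0, size=n_worms)

# Fit on a training subset, then score on individuals the model never saw.
train = np.arange(n_worms) < 120
test = ~train
coef, intercept = fit_ridge(expression[train], trait[train], alpha=10.0)
test_r2 = r_squared(trait[test], predict(expression[test], coef, intercept))
print(f"held-out R^2: {test_r2:.2f}")
```

      Scoring on held-out individuals matters here because a model with many genes and few worms can fit the training set almost perfectly while predicting new individuals poorly.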

      Overall, this work supports the use of gene expression data as an important intermediate for understanding complex traits. This approach is also useful as a starting point for other labs in studying their trait of interest.

      Weaknesses:

      There are no major weaknesses that I have noted. Some important limitations of their work are worth highlighting, though (and I believe the authors would agree with these points):

      (1) A large remaining question in the field of complex traits is how to split the role of non-genetic factors between environmental variation and stochastic noise. It is still an open question which role each of these factors plays in controlling the gene expression differences they measured between the individual worms.

      (2) The ability of the authors to use gene expression to predict trait variation was strikingly different between the two traits they measured. For the early brood trait, 448 genes were statistically linked to the trait difference, while for egg-laying onset, only 11 genes were found. Similarly, the total R² in the test set was ~50% vs. 25%. It is unclear why the differences occur, but this somewhat limits the generalizability of this approach to other traits.

      (3) For technical reasons, this approach was limited to whole worm transcription. The role of tissue and cell-type expression differences is important to the field, so this limitation is relevant.

      Comments on revisions: The authors have addressed my previous comments to my satisfaction.

    3. Reviewer #2 (Public review):

      This paper measures associations between RNA transcript levels and important reproductive traits in the model organism C. elegans. The authors not only determine which gene expression differences underlie reproductive traits, but also (1) build a model that predicts these traits based on gene expression and (2) perform experiments to confirm that some transcript levels indeed affect reproductive traits. The clever study design allows the authors to determine which transcript levels impact reproductive traits, and also which transcriptional differences are driven by stochastic vs environmental differences. In sum, this is a comprehensive study that highlights the power of gene expression as a driver of phenotype, and also teases apart the various factors that affect the expression levels of important genes.

      Overall, this study has many strengths, is very clearly communicated, and has no substantial weaknesses that I can point to.

      One question that emerges for me is whether these findings apply broadly. In other words, I wonder whether gene expression levels are predictive of other phenotypes in other organisms. I think this question has largely been explored in microbes, where some studies (PMID: 17959824) but not others (PMID: 38895328) found that differences in gene expression were predictive of phenotypes like growth rate. Microbes are not the focus here, and instead, the discussion is mainly focused on using gene expression to predict health and disease phenotypes in humans. This feels a little complicated since humans have so many different tissues. Perhaps an area where this approach might be useful is in examining infectious single-cell populations (bacteria, tumors, fungi). But I suppose this idea might still work in humans, assuming the authors are thinking about targeting specific tissues for RNAseq.

      In sum, this is a great paper that really got me thinking about the predictive power of gene expression and where/when it could inform about (health-related) phenotypes.

      Comments on revisions: No additional comments

    4. Reviewer #3 (Public review):

      Summary:

      Webster et al. sought to understand if phenotypic variation in the absence of genetic variation can be predicted by variation in gene expression. To this end, they quantified two reproductive traits, the onset of egg laying and early brood size, in cohorts of genetically identical nematodes exposed to alternative ancestral (two maternal ages) and same generation life histories (either constant 20 °C temperature or 8-hour temperature shift to 25 °C upon hatching) in a two-factor design; then, they profiled genome-wide gene expression in each individual.

      Using multiple statistical and machine learning approaches, they showed that, at least for early brood size, phenotypic variation can be quite well predicted by molecular variation, beyond what can be predicted by life history alone.

      Moreover, they provide some evidence that expression variation in some genes might be causally linked to phenotypic variation.

      Strengths:

      Cleverly designed and carefully performed experiments that provide high-quality datasets useful for the community.

      Good evidence that phenotypic variation can be predicted by molecular variation.

      Weaknesses:

      What drives the molecular variation that impacts phenotypic variation remains unknown. While the authors show that variation in expression of some genes might indeed be causal, it is still not clear how much of the molecular variation is a cause rather than a consequence of phenotypic variation.

      Comments on revisions: I have no more comments for the authors

    5. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Genome-wide association studies have been an important approach to identifying the genetic basis of human traits and diseases. Despite their successes, for many traits, a substantial amount of variation cannot be explained by genetic factors, indicating that environmental variation and individual 'noise' (stochastic differences as well as unaccounted for environmental variation) also play important roles. The authors' goal was to address whether gene expression variation in genetically identical individuals, driven by historical environmental differences and 'noise', could be used to predict reproductive trait differences. 

      Strengths: 

      To address this question, the authors took advantage of genetically identical C. elegans individuals to transcriptionally profile 180 adult hermaphrodite individuals that were also measured for two reproductive traits. A major strength of the paper is its experimental design. While experimenters aim to control the environment that each worm experiences, it is known that there are small differences that each worm experiences even when they are grown together on the same agar plate - e.g., the age of their mother, their temperature, the amount of food they eat, and the oxygen and carbon dioxide levels depending on where they roam on the plate. Instead of neglecting this unknown variation, the authors design the experiment up front to create two differences in the historical environment experienced by each worm: 1) the age of its mother and 2) an 8-hour temperature difference, at either 20 or 25 °C. This helped the authors interpret the gene expression differences and trait expression differences that they observed.

      Using two statistical models, the authors measured the association of gene expression for 8824 genes with the two reproductive traits, considering both the level of expression and the historical environment experienced by each worm. Their data support several conclusions. They convincingly show that gene expression differences are useful for predicting reproductive trait differences, predicting ~25-50% of the trait differences depending on the trait. Using RNAi, they also show that the genes they identify play a causal role in trait differences. Finally, they demonstrate an association between trait variation and the H3K27 trimethylation mark, suggesting that chromatin structure can be an important causal determinant of gene expression and trait variation.

      Overall, this work supports the use of gene expression data as an important intermediate for understanding complex traits. This approach is also useful as a starting point for other labs in studying their trait of interest. 

      We thank the reviewer for their thorough articulation of the strengths of our study.  

      Weaknesses: 

      There are no major weaknesses that I have noted. Some important limitations of the work (that I believe the authors would agree with) are worth highlighting, however: 

      (1) A large remaining question in the field of complex traits is how to split the role of non-genetic factors between environmental variation and stochastic noise. It is still an open question which role each of these factors plays in controlling the gene expression differences they measured between the individual worms.

      Yes, we agree that this is a major question in the field. In our study, we parse out differences driven by known historical environmental factors from those driven by unknown factors, but the ‘unknown factors’ could encompass both unknown environmental factors and stochastic noise.

      (2) The ability of the authors to use gene expression to predict trait variation was strikingly different between the two traits they measured. For the early brood trait, 448 genes were statistically linked to the trait difference, while for egg-laying onset, only 11 genes were found. Similarly, the total R² in the test set was ~50% vs. 25%. It is unclear why the differences occur, but this somewhat limits the generalizability of this approach to other traits.

      We agree that the difference in predictability between the two traits is interesting. A previous study from the Phillips lab measured developmental rate and fertility across Caenorhabditis species and parsed sources of variation (1). Results indicated that 83.3% of variation in developmental rate was explained by genetic variation, while only 4.8% was explained by individual variation. In contrast, for fertility, 63.3% of variation was driven by genetic variation and 23.3% was explained by individual variation. Our results, of course, focus only on predicting the individual differences, but not genetic differences, for these two traits using gene expression data. Considering both sets of results, one hypothesis is that we have more power to explain nongenetic phenotypic differences with molecular data if the trait is less heritable, which is something that could be formally interrogated with more traits across more strains.

      (3) For technical reasons, this approach was limited to whole worm transcription. The role of tissue and cell-type expression differences is important to the field, so this limitation is important.

      We agree with this assessment, and it is something we hope to address with future work.

      Reviewer #2 (Public review): 

      Summary: 

      This paper measures associations between RNA transcript levels and important reproductive traits in the model organism C. elegans. The authors not only determine which gene expression differences underlie reproductive traits, but also (1) build a model that predicts these traits based on gene expression and (2) perform experiments to confirm that some transcript levels indeed affect reproductive traits. The clever study design allows the authors to determine which transcript levels impact reproductive traits, and also which transcriptional differences are driven by stochastic vs environmental differences. In sum, this is a rather comprehensive study that highlights the power of gene expression as a driver of phenotype, and also teases apart the various factors that affect the expression levels of important genes.

      Strengths: 

      Overall, this study has many strengths, is very clearly communicated, and has no substantial weaknesses that I can point to. One question that emerges for me is about the extent to which these findings apply broadly. In other words, I wonder whether gene expression levels are predictive of other phenotypes in other organisms. I think this question has largely been explored in microbes, where some studies (PMID: 17959824) but not others (PMID: 38895328) find that differences in gene expression are predictive of phenotypes like growth rate. Microbes are not the primary focus here, and instead, the discussion is mainly focused on using gene expression to predict health and disease phenotypes in humans. This feels a little complicated since humans have so many different tissues. Perhaps an area where this approach might be useful is in examining infectious single-cell populations (bacteria, tumors, fungi). But I suppose this idea might still work in humans, assuming the authors are thinking about targeting specific tissues for RNAseq.

      In sum, this is a great paper that really got me thinking about the predictive power of gene expression and where/when it could inform about (health-related) phenotypes. 

      We thank the reviewer for recognizing the strengths of our study. We are also interested in determining the extent to which predictive gene expression differences operate in specific tissues.

      Reviewer #3 (Public review): 

      Summary: 

      Webster et al. sought to understand if phenotypic variation in the absence of genetic variation can be predicted by variation in gene expression. To this end, they quantified two reproductive traits, the onset of egg laying and early brood size, in cohorts of genetically identical nematodes exposed to alternative ancestral (two maternal ages) and same generation life histories (either constant 20 °C temperature or 8-hour temperature shift to 25 °C upon hatching) in a two-factor design; then they profiled genome-wide gene expression in each individual.

      Using multiple statistical and machine learning approaches, they showed that, at least for early brood size, phenotypic variation can be quite well predicted by molecular variation, beyond what can be predicted by life history alone. 

      Moreover, they provide some evidence that expression variation in some genes might be causally linked to phenotypic variation. 

      Strengths: 

      (1) Cleverly designed and carefully performed experiments that provide high-quality datasets useful for the community. 

      (2) Good evidence that phenotypic variation can be predicted by molecular variation. 

      We thank the reviewer for recognizing the strengths of our study.

      Weaknesses:  

      What drives the molecular variation that impacts phenotypic variation remains unknown. While the authors show that variation in expression of some genes might indeed be causal, it is still not clear how much of the molecular variation is a cause rather than a consequence of phenotypic variation. 

      We agree that the drivers of molecular variation remain unknown. While we addressed one potential candidate (histone modifications), there is much to be done in this area of research. We agree that, while some gene expression differences cause phenotypic changes, other gene expression differences could in principle be downstream of phenotypic differences.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      I have a number of suggestions that I believe will improve the Methods section. 

      (1) Strain N2-PD1073 will probably be confusing to some readers. I recommend spelling out that this is the Phillips lab version of N2.

      Thank you for this suggestion; we have added additional explanation of this strain in the Methods.

      (2) I found the details of the experimental design confusing, and I believe a supplemental figure will help. I have listed the following points that could be clarified: 

      a. What were the biological replicates? How many worms per replicate?

      Biological replicates were defined as experiments set up on different days (in this case, all biological replicates were at least a week apart), and the biological replicate of each worm can be found in Supplementary File 1 on the Phenotypic Data tab.

      b. I believe that embryos and L4s were picked to create different aged P0s, and eggs and L4s were picked to separate plates? Is this correct?

      Yes, this is correct.

      c. What was the spread in the embryo age?

      We assume this is asking about the age of the F1 embryos, and these were laid over the course of a 2-hour window.  

      d. While the age of the parents is different, there are also features about their growth plates that will be impacted by the experimental design. For example, their pheromone exposure is different due to the role that age plays in the combination of ascarosides that are released. It is worth noting as my reading of the paper makes it seem that parental age is the only thing that matters.

      The parents (P0) of different ages likely have differential ascaroside exposure because they are in the vicinity of other similarly aged worms, but the F1 progeny were exposed to their parents for only the 2-hour egg-laying window, in an attempt to minimize this type of effect as much as possible.  

      e. Were incubators used for each temperature?

      Yes.

      f. In line 443, why approximately for the 18 hours? How much spread?

      The approximation was based on the time interval between the 2-hour egg-laying window on Day 4 and the temperature shift the following morning on Day 5. The timing was within 30 minutes of 18 hours in either direction.

      g.  In line 444, "continually left" is confusing. Does this mean left in the original incubator?

      Yes, this means left in the incubator while the worms shifted to 25°C were moved. To avoid confusion, we re-worded this to state they “remained at 20°C while the other half were shifted to 25°C”.

      h. In line 445, "all worms remained at 20 {degree sign}C" was confusing to me as to what it indicated. I assume, unless otherwise noted, the animals would not be moved to a new temperature.

      This was an attempt to avoid confusion and emphasize that all worms were experiencing the same conditions for this part of the experiment.  

      i. What size plates were the worms singled onto?

      They were singled onto 6-cm plates.

      j. If a figure were to be made, having two timelines (with respect to the P0 and F1) might be useful.

      We believe the methods should be sufficient for someone who hopes to repeat the experiment, and we believe the schematic in Figure 1A labeling P0 and F1 generations is sufficient to illustrate the key features of the experimental design.

      k. Not all eggs that are laid end up hatching. Are these censored from the number of progeny calculations?

      Yes, only progeny that hatched and developed were counted for early brood.

      (3) For the lysis, was the second transfer to dH20 also a wash step?

      Yes.

      (4) What was used for the Elution buffer?

      We used an elution buffer consisting of 10 mM Tris and 0.1 mM EDTA. We have added this to the “Cell lysate generation” section of the Methods.

      (5) The company that produced the KAPA mRNA-seq prep kit should be listed.

      We added that the kit was from Roche Sequencing Solutions.

      (6) For the GO analysis - one potential issue is that the set of 8824 genes might also be restricted to specific GO categories. Was this controlled for?

      We originally did not explicitly control for this and used the default enrichGO settings with OrgDb = org.Ce.eg.db as the background set for C. elegans. We have now repeated the analysis with the “universe” set to the 8824-gene background set. This did not qualitatively change the significant GO terms, though some have slightly higher or lower p-values. For comparison purposes, we have added the background-corrected sets to the GO_Terms tab of Supplementary File 1 with each of the three main gene groups appended with “BackgroundOf8824”.
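To illustrate why the choice of background matters, here is a minimal sketch of the one-sided hypergeometric test that underlies over-representation analysis. It is not the enrichGO implementation. The gene-list size (448) and expressed-gene background (8824) come from the responses in this document; the genome-wide background of 20000 genes and all term counts (200, 150, 15) are hypothetical values chosen for illustration.

```python
from math import comb

def enrichment_pvalue(universe_size, term_in_universe, list_size, term_in_list):
    """P(X >= term_in_list) when list_size genes are drawn without replacement
    from a universe that contains term_in_universe genes annotated to the term."""
    max_k = min(term_in_universe, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(term_in_universe, k)
        * comb(universe_size - term_in_universe, list_size - k)
        for k in range(term_in_list, max_k + 1)
    )
    return tail / total

# Hypothetical term: 200 annotated genes genome-wide, 150 of them among the
# 8824 expressed genes, and 15 of them in the 448-gene list of interest.
p_expressed = enrichment_pvalue(8824, 150, 448, 15)   # restricted background
p_genome = enrichment_pvalue(20000, 200, 448, 15)     # hypothetical genome-wide background

print(f"expressed-gene background: p = {p_expressed:.3g}")
print(f"genome-wide background:    p = {p_genome:.3g}")
```

In this toy case the term looks more strongly enriched against the genome-wide background, because the term is already over-represented among expressed genes. Restricting the universe removes that component of the signal.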

      Reviewer #2 (Recommendations for the authors): 

      (1) The abstract, introduction, and experimental design are well thought through and very clear.

      Thank you.

      (2) Figure 1B could use a clearer or more intuitive label on the horizontal axis. The two examples help. Maybe the genes (points) on the left side should be blue to match Figure 1C, where the genes with a negative correlation are in the blue cluster.

      Thank you for these suggestions. We re-labeled the x-axis as “Slope of early brood vs. gene expression (normalized by CPM)”, which we hope gives readers a better intuition of what the coefficient from the model is measuring. We also re-colored the points previously colored red in Figure 1B to be color-coded depending on the direction of association to match Figure 1C, so these points are now color-coded as pink and purple.  

      (3) If red/blue are pos/neg correlated genes in 1C, perhaps different colors should be used to label ELO and brood in Figures 2 and 3. Green/purple?

      We appreciate this point, but since we ended up using the cluster colors of pink and purple in Figure 1, we opted to leave Figures 2 and 3 alone with the early brood and ELO color-coding of red and blue.

      (4) I am unfamiliar with this type of beta value, but I thought the explanation and figure were very clear. It could be helpful to bold beta1 and beta2 in the top panels of Figure 2, so the readers are not searching around for those among all the other betas. It could also be helpful to add an English phrase to the vertical axes in Figures 2C and 2D, in addition to the beta1 and beta2. Something like "overall effect (beta1)" and "environment-controlled effect (beta2)". Or maybe "effect of environment + stochastic expression differences (beta1)" and "effect of stochastic expression differences alone (beta2)". I guess those are probably too big to fit on the figure, but it might be nice to have a label somewhere on this figure connecting them to the key thing you are trying to measure - the effect of gene expression and environment.

      Thank you for these suggestions. We increased the font sizes and bolded β1 and β2 in Figure 2A-B. In Figure 2C-D, we added a parenthetical under β1 to say “(env + noise)” and β2 to say “(noise)”. We agree that this should give the reader more intuition about what the β values are measuring.  

      Reviewer #3 (Recommendations for the authors): 

      The authors collected individuals 24 hours after the onset of egg laying for transcriptomic profiling. This is a well-designed experiment to control for the physiological age of the germline. However, this does not properly control for somatic physiological age. Somatic age can be partially uncoupled from germline age across individuals, and indeed, this can be due to differences in maternal age (Perez et al, 2017). This is because maternal age is associated with increased pheromone exposure (unless you properly controlled for it by moving worms to fresh plates), which causes a germline-specific developmental delay in the progeny, resulting in a delayed onset of egg production compared to somatic development (Perez et al. 2021). You control for germline age, therefore, it is likely that the progeny of day 1 mothers are actually somatically older than the progeny of day 3 mothers. This would predict that many genes identified in these analyses might just be somatic genes that increase or decrease their expression during the young adult stage. 

      For example, the abundance of collagen genes among the genes negatively associated (including col-20, which is the gene most significantly associated with early brood) is a big red flag, as collagen genes are known to be changing dynamically with age. If variation in somatic vs germline age is indeed what is driving the expression variation of these genes, then the expectation is that their expression should decrease with age. Vice versa, genes positively associated with early brood that are simply explained by age should be increasing.  So I would suggest that the authors first check this using time series transcriptomic data covering the young adult stage they profiled. If this is indeed the case, I would then suggest using RAPToR ( https://github.com/LBMC/RAPToR ), a method that, using reference time series data, can estimate physiological age (including tissue-specific one) from gene expression. Using this method they can estimate the somatic physiological age of their samples, quantify the extent of variation in somatic age across individuals, quantify how much of the observed differences in expressions are explained just by differences in somatic age and correct for them during their transcriptomic analysis using the estimated soma age as a covariate (https://github.com/LBMC/RAPToR/blob/master/vignettes/RAPToR-DEcorrection-pdf.pdf). 

      This should help enrich a molecular variation that is not simply driven by hidden differences between somatic and germline age. 

      To first address some of the experimental details mentioned for our paper, parents were indeed moved to fresh plates where they were allowed to lay embryos for two hours and then removed. Thus, we believe this minimizes the effects of ascarosides as much as possible within our design. As shown in the paper, we also identified genes that were not driven by parental age and, for all genes, quantified the extent to which each gene’s association was driven by parental age. Thus, it is unlikely that differences in somatic and germline age are the sole explanatory factor, even if they play some role. We also note that we accounted for egg-laying onset timing in our experimental design, and early brood was calculated as the number of progeny laid in the first 24 hours of egg-laying, where egg-laying onset was scored for each individual worm to the hour. The plot of each worm’s ELO and early brood traits is in Figure S1. Nonetheless, we read the RAPToR paper with interest, as we highlighted in the paper that germline genes tend to be positively associated with early brood while somatic genes tend to be negatively associated. While the RAPToR paper discusses using tissue-specific gene sets to stage genetically diverse C. elegans RILs, the RAPToR reference itself was not built using gene expression data acquired from different C. elegans tissues and is based on whole worms, typically collected in bulk. That is, age estimates in RILs differ depending on whether germline or somatic gene sets are used to estimate age when the aging clock is based on N2 samples. Thus, it is unclear whether such an approach would work similarly to estimate age in single-worm N2 samples. In addition, from what we can tell, the RAPToR R package appears to implement the overall age estimate, rather than using the tissue-specific gene sets used for RILs in the paper. Because RAPToR would be estimating the overall age of our samples using a reference that is based on fewer samples than we collected here, and because we already know the overall age of our samples measured using standard approaches, we believe that estimating the age with the package would not give very much additional insight.

      Bonferroni correction: 

      First, I think there is some confusion in how the authors report their p-values: I don't think the authors are using a cut-off of Bonferroni corrected p-value of 5.7 x 10^-6 (it wouldn't make sense). It's more likely that they are using a Bonferroni corrected p of 0.05 or 0.1, which corresponds to a nominal p value of 5.7 x 10^-6, am I right?

      Yes, we used a nominal p-value of 5.7 x 10^-6 to correspond to a Bonferroni-corrected p-value of 0.05, calculated as 0.05/8824. We have re-worded this wherever Bonferroni correction was mentioned.
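The relationship between the nominal and corrected thresholds can be checked with a few lines of arithmetic. This is a minimal sketch: the two input numbers come from the response above, and the helper name `bonferroni_adjust` is introduced here only for illustration.

```python
# Number of genes tested and target family-wise error rate, from the response above.
n_tests = 8824
alpha = 0.05

# Bonferroni: a test is significant if its nominal p-value is below alpha / n_tests.
nominal_threshold = alpha / n_tests

def bonferroni_adjust(p, n=n_tests):
    """Return the Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n)

print(f"nominal threshold = {nominal_threshold:.2e}")  # about 5.7 x 10^-6
```

Multiplying a nominal p-value by 8824 and comparing it with 0.05 is equivalent to comparing the nominal p-value with this threshold.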

      Second, Bonferroni is an overly stringent correction method that has now been substituted by the more powerful Benjamini Hochberg method to control the false discovery rate. Using this might help find more genes and better characterize the molecular variation, especially the one associated with ELO?

      We agree that Bonferroni is quite stringent and because we were focused on identifying true positives, we may have some false negatives. Because all nominal p-values are included in the supplement, it is straightforward for an interested reader to search the data to determine if a gene is significant at any other threshold.   
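For a reader who wants to apply a different threshold to the nominal p-values in the supplement, the Benjamini-Hochberg adjustment suggested by the reviewer can be sketched as follows. This is an illustrative standard-library implementation, not code from the study, and the five p-values are invented toy numbers.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for illustration only; not taken from the study.
toy_p = [0.001, 0.012, 0.021, 0.039, 0.60]
q = benjamini_hochberg(toy_p)

bh_hits = [p for p, qv in zip(toy_p, q) if qv <= 0.05]
bonferroni_hits = [p for p in toy_p if p <= 0.05 / len(toy_p)]
print(len(bh_hits), "genes pass BH;", len(bonferroni_hits), "pass Bonferroni")
```

On this toy input the false discovery rate criterion retains more tests than Bonferroni, which is the trade-off between false negatives and false positives discussed above.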

      Minor comments: 

      (1) "In our experiment, isogenic adult worms in a common environment (with distinct historical environments) exhibited a range of both ELO and early brood trait values (Fig S1A)" I think this and the figure is not really needed, Figure S1B is already enough to show the range of the phenotypes and how much variation is driven by the life history traits.

      We agree that the information in S1A is also included in S1B, but we think it is a little more straightforward if one is primarily interested in viewing the distribution for a single trait.

      (2) Line 105 It should be Figure S2, not S3.

      Thank you for catching this mistake.

      (3) Gene Ontology on positive and negatively associated genes together: what about splitting the positive and negative?

      We have added a split of positive and negative GO terms to the GO_Terms tab of Supplement File 1. Broadly speaking, the most enriched positively associated genes have many of the same GO terms found on the combined list that are germline related (e.g., involved in oogenesis and gamete generation), whereas the most enriched negatively associated genes have GO terms found on the combined list that are related to somatic tissues (e.g., actin cytoskeleton organization, muscle cell development). This is consistent with the pattern we see for somatic and germline genes shown in Figure 4.

      (4) A lot of muscle-related GOs, can you elaborate on that?

      Yes, there are several muscle-related GOs in addition to germline and epidermis. While we do not know exactly why from a mechanistic perspective these muscle-related terms are enriched, it may be important to note that many of these terms have highly overlapping sets of genes which are listed in Supplementary File 1. For example, “muscle system process” and “muscle contraction” have the exact same set of 15 genes causing the term to be significantly enriched. Thus, we tend to not interpret having many GO terms on a given tissue as indicating that the tissue is more important than others for a given biological process. While it is clear there are genes related to muscle that are associated with early brood, it is not yet clear that the tissue is more important than others.  

      (5) "consistent with maternal age affecting mitochondrial gene expression in progeny " - has this been previously reported?

      We do not believe this particular observation has been reported. It is important to note that these genes are involved in mitochondrial processes, but are expressed from the nuclear rather than mitochondrial genome. We re-worded the quoted portion of the sentence to say “consistent with parental age affecting mitochondria-related gene expression in progeny”.

      (6) PCA: "Therefore, the optimal number of PCs occurs at the inflection points of the graph, which is after only 7 PCs for early brood (R2 of 0.55) but 28 PCs for ELO (R2 of 0.56)." 

      Not clear how this is determined: just graphically? If yes, there are several inflection points in the plot. How did you choose which one to consider? Also, a smaller component is not necessarily less predictive of phenotypic variation (as you can see from the graph), so instead of subsequently adding components based on the variance, they explain the transcriptomic data, you might add them based on the variance they explain in the phenotypic data? To this end, have you tried partial least square regression instead of PCA? This should give gene expression components that are ranked based on how much phenotypic variance they explain.  

      Thank you for this thoughtful comment. We agree that, unlike for Figure 3B, there is some interpretation involved on how many PCs is optimal because additional variance explained with each PC is not strictly decreasing beyond a certain number of PCs. Our assessment was therefore made both graphically and by looking at the additional variance explained with each additional PC. For example, for early brood, there was no PC after PC7 that added more than 0.04 to the R2. We could also have plotted early brood and ELO separately and had a different ordering of PCs on the x-axis. By plotting the data this way, we emphasized that the factors that explain the most variation in the gene expression data typically explain most variation in the phenotypic data.  
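The numerical part of this criterion (no later PC adding more than 0.04 to the R2) can be written as a small rule. The sketch below is one possible reading of that rule, not the procedure used in the study; the function name and the cumulative R2 values are hypothetical.

```python
def last_informative_pc(cumulative_r2, min_gain=0.04):
    """Return the 1-based index of the last PC whose addition raises the
    cumulative R^2 by more than min_gain (0 if no PC does)."""
    last = 0
    previous = 0.0
    for index, r2 in enumerate(cumulative_r2, start=1):
        if r2 - previous > min_gain:
            last = index
        previous = r2
    return last

# Invented cumulative R^2 values as PCs are added in order; illustration only.
toy_r2 = [0.10, 0.22, 0.23, 0.35, 0.42, 0.48, 0.55, 0.56, 0.58, 0.59]
print(last_informative_pc(toy_r2))
```

Note that a PC with a small gain (the third one here) does not stop the search, which reflects the point above that the additional variance explained is not strictly decreasing.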

      (7) The fact that there are 7 PC of molecular variation that explain early brood is interesting. I think the authors can analyze this further. For example, could you perform separate GO enrichment for each component that explains a sizable amount of phenotypic variance? Same for the ELO.  

      Because each gene has a loading for each PC, and each PC lacks the explanatory power of combined PCs, we believe performing GO term analysis on the list of genes that contribute most to each PC is of minimal utility. The power of the PCA prediction approach is that it uses the entire transcriptome, but the other side of the coin is that it is perhaps less useful to do a gene-by-gene analysis with PCA. This is why we separately performed individual gene associations and 10-gene predictive analyses. However, we have added the PC loadings for all genes and all PCs to Supplementary File 1.

      (8) Avoid acronyms when possible (i.e. ELO in figures and figure legends could be spelled out to improve readability).

      We appreciate this point, but because we introduced the acronym both in Figure 1 and the text and use it frequently, we believe the reader will understand this acronym. Because it is sometimes needed (especially in dense figures), we think it is best to use it consistently throughout the paper.

      (9) Multiple regression: I see the most selected gene is col-20, which is also the most significantly differentially expressed from the linear mixed model (LMM). But what is the overlap between the top 300 genes in Figure 3F and the 448 identified by the LMM? And how much is the overlap in GO enrichment?

      Genes that showed up in at least 4 out of 500 iterations were selected more often than expected by chance, which includes 246 genes (as indicated by the red line in Figure 3F). Of these genes, 66 genes (27%) are found in the set of 448 early brood genes. The proportion of overlap increases as the number of iterations required to consider a gene predictive increases, e.g., 34% of genes found in 5 of 500 iterations and 59% of genes found in 10 of 500 iterations overlap with the 448 early brood genes. However, likely because of the approach to identify groups of 10 genes that are predictive, we do not find significant GO terms among the 246 genes identified with this approach after multiple test correction. We think this makes sense because the LMM identifies genes that are individually associated with early brood, whereas each subsequent gene included in multiple regression affects early brood after controlling for all previous genes. These additional genes added to the multiple regression are unlikely to have similar patterns as genes that are individually correlated with early brood.  

      (10) Elastic nets: prediction power is similar or better than multiple regression, but what is the overlap between genes selected by the elastic net (not presented if I am not mistaken) and multiple regression and the linear mixed model?

      For the elastic net models, we used a leave-one-out cross validation approach, meaning there were separate models fit by leaving out the trait data for each worm, training a model using the trait data and transcriptomic data for the other worms, and using the transcriptomic data of the remaining worm to predict the trait data. By repeating this for each worm, the regressions shown in the paper were obtained. Each of these models therefore has its own set of genes. Of the 180 models for early brood, the median model selects 83 genes (range from 72 to 114 genes). Across all models, 217 genes were selected at least once. Interestingly, there was a clear bimodal distribution in terms of how many models a given gene was selected for: 68 genes were selected in over 160 out of 180 models, while 114 genes were selected in fewer than 20 models (and 45 genes were selected only once). Therefore, we consider the set of 68 genes as highly robustly selected, since they were selected in the vast majority of models. This set of 68 exhibits substantial overlap with both the set of 448 early brood-associated genes (43 genes or 63% overlap) and the multiple regression set of 246 genes (54 genes or 79% overlap). For ELO, the median model selected 136 genes (range of 96 to 249 genes) and a total of 514 genes were selected at least once. The distribution for ELO was also bimodal with 78 genes selected over 160 times and 255 genes selected fewer than 20 times. This set of 78 included 6 of the 11 significant ELO genes identified in the LMM.  We have added tabs to Supplementary File 1 that include the list of genes selected for the elastic net models as well as a count of how many times they were selected out of 180 models.
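The leave-one-out procedure and the per-gene selection counts described above can be sketched as follows. To stay self-contained, the sketch swaps the elastic net for a much simpler stand-in that selects the single best feature by least squares, so it illustrates only the cross-validation loop and the tally of how often each feature is selected. All function names and data are hypothetical.

```python
from collections import Counter

def fit_best_single_feature(X, y):
    """Stand-in for elastic net: choose the one feature with the smallest
    squared error under simple least squares. Assumes at least one feature
    varies across samples. Returns (feature_index, slope, intercept)."""
    n = len(y)
    mean_y = sum(y) / n
    best = None
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        sxx = sum((x - mean_x) ** 2 for x in col)
        if sxx == 0:
            continue  # constant feature carries no information
        sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(col, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((v - (intercept + slope * x)) ** 2 for x, v in zip(col, y))
        if best is None or sse < best[0]:
            best = (sse, j, slope, intercept)
    return best[1], best[2], best[3]

def leave_one_out(X, y):
    """Fit one model per held-out sample; return predictions and selection counts."""
    predictions = []
    selection_counts = Counter()
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        j, slope, intercept = fit_best_single_feature(X_train, y_train)
        selection_counts[j] += 1
        # Predict the held-out sample from its own feature values only.
        predictions.append(intercept + slope * X[i][j])
    return predictions, selection_counts

# Toy data: feature 0 determines the trait (y = 2 * x0 + 1); feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0], [5.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
predictions, selection_counts = leave_one_out(X, y)
print(selection_counts)
```

Features that are selected in nearly every held-out model, like feature 0 here, correspond to the robustly selected genes described above.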

      (11) In other words, do these different approaches yield similar sets of genes, or are there some differences? In the end, which approach is actually giving the best predictive power?

      From the perspective of R2, both the multiple regression and elastic net models are similarly predictive for early brood, but elastic net is more predictive for ELO. However, in presenting multiple approaches, part of our goal was identifying predictive genes that could be considered the ‘best’ in different contexts. The multiple regression was set to identify exactly 10 genes, whereas the elastic net model determined the optimal number of genes to include, which was always over 70 genes. Thus, the elastic net model is likely better if one has gene expression data for the entire transcriptome, whereas the multiple regression genes are likely more useful if one were to use reporters or qRT-PCR to measure a more limited number of genes.

      (12) Line 252: "Within this curated set, genes causally affected early brood in 5 of 7 cases compared to empty vector (Figure 4A)." It seems to me 4 out of 7 from Figure 4A.

      In Figure 4A the five genes are (1) cin-4, (2) puf-5; puf-7, (3) eef-1A.2, (4) C34C12.8, and (5) tir-1. We did not count nex-2 (p = 0.10) or gly-13 (p = 0.07), and empty vector is the control.

      (13) Do puf-5 and -7 affect total brood size or only early brood size? Not clear. What's the effect of single puf-5 and puf-7 RNAi on brood?

      We only measured early brood in this paper, but a previous report found that puf-5 and puf-7 act redundantly to affect oogenesis, and RNAi is only effective if both are knocked down together (2). We performed pilot experiments to confirm that this was the case in our hands as well.

      (14)  To truly understand if the noise in expression of Puf-5 and /or -7 really causes some of the observed difference in early brood, could the author use a reporter and dose response RNAi to reduce the level of puf-5/7 to match the lower physiological noise range and observe if the magnitude of the reduction of early brood by the right amount of RNAi indeed matches the observed physiological "noise" effect of puf-5/7 on early brood?

      We agree that it would be interesting to do the dose response of RNAi, measure early brood, and get a readout of mRNA levels to determine the true extent of gene knockdown in each worm (since RNAi can be noisy) and whether this corresponds to early brood when the knockdown is at physiological levels. While we believe we have shown that a dose response of gene knockdown results in a dose response of early brood, this additional analysis would be of interest for future experiments.

      (15) Regulated soma genes (enriched in H3K27me3) are negatively correlated with early brood. What would be the mechanism there? As mentioned before, it is more likely that these genes are just indicative of variation in somatic vs germline age (maybe due to latent differences in parental perception of pheromone).

      We can think of a few potential mechanisms/explanations, but at this point we do not have a decisive answer. Regulated somatic genes marked with H3K27me3 (facultative heterochromatin) are expressed in particular tissues and/or at particular times in development. In this study and others, genes marked with H3K27me3 exhibit more gene expression noise than genes with other marks. This could suggest that there are negative consequences for the animal if genes are expressed at higher levels at the wrong time or place, and one interpretation of the negative association is that higher expression of somatic genes results in lower fitness (where early brood is a proxy for fitness). Another related interpretation is that there are tradeoffs between somatic and germline development and each individual animal lands somewhere on a continuum between prioritizing germline or somatic development, where prioritizing somatic integrity (e.g., higher expression of somatic genes) comes at a cost to the germline, resulting in fewer progeny. Additional experiments, including measurements of histone marks in worms measured for the early brood trait, would likely be required to more decisively answer this question.

      (16) Line 151: "Among significant genes for both traits, β2 values were consistently lower than β1 (Figures 2CD), suggesting some of the total effect size was driven by environmental history rather than pure noise".

      We are interpreting this quote as part of point 17 below.

      (17) It looks like most of the genes associated with phenotypes from the univariate model have a decreased effect once you account for life history, but have you checked for cases where the life history actually masks the effect of a gene? In other words, do you have cases where the effect of gene expression on a phenotype is only (or more) significant after you account for the effect of life history (β2 values higher than β1)?

      This is a good question and one that we did not explicitly address in the paper because we focused on beta values for genes that were significant in the univariate analysis. Indeed, for the sets of 448 early brood genes and 11 ELO genes, there are no genes for which β2 is larger than β1. In looking at the larger dataset of 8824 genes, with a Bonferroni-corrected p-value of 0.05, there are 306 genes with a significant β2 for early brood. The majority (157 genes) overlap with the 448 genes significant in the univariate analysis and do not have a higher β2 than β1. Of the remaining genes, 72 have a larger β2 than β1. However, in most cases, this difference is relatively small (median difference of 0.025) and likely insignificant. There are only three genes in which β1 is not nominally significant, and these are the three genes with the largest difference between β1 and β2 with β2 being larger (differences of 0.166, 0.155, and 0.12). In contrast, the median difference between β1 and β2 for the 448 genes (in which β1 is larger) is 0.17, highlighting that the most extreme examples of β2 > β1 are smaller in magnitude than the typical case of β1 > β2. For ELO, there are no notable cases where β2 > β1. There are eight genes with a significant β2 value, and all of these have a β1 value that is nominally significant. Therefore, while this phenomenon does occur, we find it to be relatively rare overall. For completeness, we have added the β1 and β2 values for all 8824 genes as a tab in Supplementary File 1.

    1. eLife Assessment

      The authors address a fundamental question for cell and tissue biology. They use the skin epidermis as a paradigm and ask how stratifying self-renewing epithelia induce differentiation and upward migration in basal dividing progenitor cells to generate suprabasal barrier-forming cells that are essential for a functional barrier formed by such an epithelium. The authors provide compelling evidence that an increase in intracellular actomyosin contractility, a hallmark of barrier-forming keratinocytes, is sufficient to trigger terminal differentiation, providing in vivo evidence of the interdependency of cell mechanics and differentiation. To illustrate their points, the authors use a combination of genetic mouse models, RNA sequencing, and immunofluorescence analysis. Precisely how the changes in gene expression, cell morphology, mechanics, and cell position are instructive and whether consecutive changes in differentiation are required still remain unclear, but the paper takes a nice step in advancing our knowledge of the process.

    2. Reviewer #1 (Public review):

      Summary:

      The authors address a fundamental question for cell and tissue biology using the skin epidermis as a paradigm and ask how stratifying self-renewing epithelia induce differentiation and upwards migration in basal dividing progenitor cells to generate suprabasal barrier-forming cells that are essential for a functional barrier formed by such an epithelium. The authors show for the first time that an increase in intracellular actomyosin contractility, a hallmark of barrier-forming keratinocytes, is sufficient to trigger terminal differentiation. Hence the data provide in vivo evidence of the more general interdependency of cell mechanics and differentiation. The data appear to be of high quality and the evidence is strengthened through a combination of different genetic mouse models, RNA sequencing and immunofluorescence analysis.

      To generate and maintain the multilayered, barrier-forming epidermis, keratinocytes of the basal stem cell layer differentiate and move suprabasally accompanied by stepwise changes not only in gene expression but also in cell morphology, mechanics and cell position. Whether any of these changes are instructive for differentiation itself, and whether consecutive changes in differentiation are required, remains unclear. Also, there are few comprehensive data sets on the exact changes in gene expression between different states of keratinocyte differentiation. In this study, through genetic fluorescence labeling of cell states at different developmental timepoints, the authors were able to analyze gene expression of basal stem cells and suprabasal differentiated cells at two different stages of maturation: E14 (embryonic day 14), when the epidermis comprises mostly two functional compartments (basal stem cells and suprabasal so-called intermediate cells), and E16, when the epidermis comprises three (living) compartments where the spinous layer separates basal stem cells from the barrier-forming granular layer, as is the case in adult epidermis. Using bulk RNA sequencing, the authors developed useful new markers for suprabasal stages of differentiation like MafB and Cox1. The transcription factor MafB was then shown to inhibit suprabasal proliferation in a MafB transgenic model.

      The data indicate that early in development, at E14, the suprabasal intermediate cells resemble, in terms of RNA expression, the barrier-forming granular layer at E16, suggesting that keratinocytes can undergo either stepwise (E16) or more direct (E14) terminal differentiation.

      Previous studies by several groups found an increased actomyosin contractility in the barrier-forming granular layer and showed that this increase in tension is important for epidermal barrier formation and function. However, it was not clear whether contractility itself serves as an instructive signal for differentiation. To address this question, the authors use a previously published model to induce premature hypercontractility in the spinous layer by using spastin overexpression (K10-Spastin) to disrupt microtubules (MT), thereby indirectly inducing actomyosin contractility. A second model activates myosin contractility more directly through overexpression of a constitutively active RhoA GEF (K10-Arhgef11CA). Both models induce late differentiation of suprabasal keratinocytes regardless of the suprabasal position in either spinous or granular layer, indicating that increased contractility is key to induce late differentiation of granular cells. A potential weakness is the use of the K10-spastin model that disrupts MT and likely has additional roles in altering differentiation next to the induction of hypercontractility. Their previous publications provided some evidence that the effect on differentiation is driven by the increase in contractility (Ning et al., Cell Stem Cell 2021). Moreover, their data are now further supported by a second model activating myosin through RhoA. This manuscript extends their previous findings that indicated a role for contractility in early differentiation, now focussing on the regulation of late differentiation in barrier-forming cells. This data set thus helps to unravel the interdependencies of cell position, mechanical state and differentiation in the epidermis, and suggests that an increase in cellular contractility within the epidermis can induce terminal differentiation. Importantly, the authors show that despite contractility-induced nuclear localization of the mechanoresponsive transcription factor YAP in the barrier-forming granular layer, YAP nuclear localization is not sufficient to drive premature differentiation when forced to the nucleus in the spinous layer.

      Overall, this is a well-written manuscript and a comprehensive dataset.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript from Prado-Mantilla and co-workers addresses mechanisms of embryonic epidermis development, focusing on the intermediate layer cells, a transient population of suprabasal cells that contributes to the expansion of the epidermis through proliferation. Using bulk RNA-seq, they show that these cells are transcriptionally distinct from the suprabasal spinous cells and identify specific marker genes for these populations. They then use transgenesis to demonstrate that one of these selected spinous layer-specific markers, the transcription factor MafB, is capable of suppressing proliferation in the intermediate layers, providing a potential explanation for the shift of suprabasal cells into a non-proliferative state during development. Further, lineage tracing experiments show that the intermediate cells become granular cells without a spinous layer intermediate. Finally, the authors show that the intermediate layer cells express higher levels of contractility-related genes than spinous layers and that overexpression of cytoskeletal regulators accelerates differentiation of spinous layer cells into granular cells.

      Overall, the manuscript presents a number of interesting observations on the developmental stage-specific identities of suprabasal cells and their differentiation trajectories, and points to a potential role of contractility in promoting differentiation of suprabasal cells into granular cells. The precise mechanisms by which MafB suppresses proliferation, how the intermediate cells bypass the spinous layer stage to differentiate into granular cells and how contractility feeds into these mechanisms remain open. Interestingly, while the mechanosensitive transcription factor YAP appears differentially active in the two states, it is shown to be downstream rather than upstream of the observed differences in mechanics.

      Strengths:

      The authors use a nice combination of RNA sequencing, imaging, lineage tracing and transgenesis to address the suprabasal to granular layer transition. The imaging is convincing and the biological effects appear robust. The manuscript is clearly written and logical to follow.

      Weaknesses:

      While the data overall support the authors' claims, there are a few minor weaknesses that pertain to the role of contractility. The choice of spastin overexpression to modulate contractility is not ideal, as spastin has multiple roles in regulating microtubule dynamics and membrane transport, which could also be potential mechanisms explaining some of the phenotypes. Use of Arhgef11 overexpression mitigates this effect to some extent, but overall it would have been more convincing to manipulate myosin activity directly. It would also be important to show that these manipulations increase the levels of F-actin and myosin II as shown for the intermediate layer. It would also be logical to address whether further increasing contractility in the intermediate layer would enhance the differentiation of these cells.

      Despite these minor weaknesses, the manuscript is overall of high quality, sheds new light on the fundamental mechanisms of epidermal stratification during embryogenesis and will likely be of interest to the skin research community.

    4. Reviewer #3 (Public review):

      Summary:

      This is an interesting paper by Lechler and colleagues describing the transcriptomic signature and fate of intermediate cells (ICs), a transient and poorly defined embryonic cell type in the skin. ICs are the first suprabasal cells in the stratifying skin and unlike later-developing suprabasal cells, ICs continue to divide. Using bulk RNA seq to compare ICs to spinous and granular transcriptomes, the authors find that IC-specific gene signatures include hallmarks of granular cells, such as genes involved in lipid metabolism and skin barrier function that are not expressed in spinous cells. ICs were assumed to differentiate into spinous cells, but lineage tracing convincingly shows ICs differentiate directly into granular cells without passing through a spinous intermediate. Rather, basal cells give rise to the first spinous cells. They further show that transcripts associated with contractility are also shared signatures of ICs and granular cells, and overexpression of two contractility inducers (Spastin and ArhGEF-CA) can induce granular and repress spinous gene expression. This contractility-induced granular gene expression does not appear to be mediated by the mechanosensitive transcription factor, Yap. The paper also identifies new markers that distinguish IC and spinous layers, and shows the spinous signature gene, MafB, is sufficient to repress proliferation when prematurely expressed in ICs.

      Strengths:

      Overall this is a well-executed study, and the data are clearly presented and the findings convincing. It provides an important contribution to the skin field by characterizing the features and fate of ICs, a much-understudied cell type, at high levels of spatial and transcriptomic detail. The conclusions challenge the assumption that ICs are spinous precursors through compelling lineage tracing data. The demonstration that differentiation can be induced by cell contractility is an intriguing finding, and adds to a growing list of examples where cell mechanics influence gene expression and differentiation.

      Weaknesses:

      A weakness of the study is an over-reliance on overexpression and sufficiency experiments to test the contributions of MafB, Yap, and contractility in differentiation. The inclusion of loss-of-function approaches would enable one to determine if, for example, contractility is required for the transition of ICs to granular fate, and whether MafB is required for spinous fate. Second, whether the induction of contractility-associated genes is accompanied by measurable changes in the physical properties or mechanics of the IC and granular layers is not directly shown. Inclusion of physical measurements would bolster the conclusion that mechanics lies upstream of differentiation.

      Finally, the role of ICs in epidermal development remains unclear. Although not essential to support the conclusions of this study, insights into the function of this transient cell layer would strengthen the overall impact.

    5. Author Response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      Summary: 

      The authors address a fundamental question for cell and tissue biology using the skin epidermis as a paradigm and ask how stratifying self-renewing epithelia induce differentiation and upward migration in basal dividing progenitor cells to generate suprabasal barrier-forming cells that are essential for a functional barrier formed by such an epithelium. The authors show for the first time that an increase in intracellular actomyosin contractility, a hallmark of barrier-forming keratinocytes, is sufficient to trigger terminal differentiation. Hence the data provide in vivo evidence of the more general interdependency of cell mechanics and differentiation. The data appear to be of high quality and the evidence is strengthened through a combination of different genetic mouse models, RNA sequencing, and immunofluorescence analysis.

      To generate and maintain the multilayered, barrier-forming epidermis, keratinocytes of the basal stem cell layer differentiate and move suprabasally accompanied by stepwise changes not only in gene expression but also in cell morphology, mechanics, and cell position. Whether any of these changes is instructive for differentiation itself and whether consecutive changes in differentiation are required remains unclear. Also, there are few comprehensive data sets on the exact changes in gene expression between different states of keratinocyte differentiation. In this study, through genetic fluorescence labeling of cell states at different developmental time points, the authors were able to analyze gene expression of basal stem cells and suprabasal differentiated cells at two different stages of maturation: E14 (embryonic day 14), when the epidermis comprises mostly two functional compartments (basal stem cells and suprabasal so-called intermediate cells), and E16, when the epidermis comprises three (living) compartments where the spinous layer separates basal stem cells from the barrier-forming granular layer, as is the case in adult epidermis. Using RNA bulk sequencing, the authors developed useful new markers for suprabasal stages of differentiation like MafB and Cox1. The transcription factor MafB was then shown to inhibit suprabasal proliferation in a MafB transgenic model.

      The data indicate that early in development, at E14, the suprabasal intermediate cells resemble, in terms of RNA expression, the barrier-forming granular layer at E16, suggesting that keratinocytes can undergo either stepwise (E16) or more direct (E14) terminal differentiation.

      Previous studies by several groups found an increased actomyosin contractility in the barrier-forming granular layer and showed that this increase in tension is important for epidermal barrier formation and function. However, it was not clear whether contractility itself serves as an instructive signal for differentiation. To address this question, the authors use a previously published model to induce premature hypercontractility in the spinous layer by using spastin overexpression (K10-Spastin) to disrupt microtubules (MT), thereby indirectly inducing actomyosin contractility. A second model activates myosin contractility more directly through overexpression of a constitutively active RhoA GEF (K10-Arhgef11CA). Both models induce late differentiation of suprabasal keratinocytes regardless of the suprabasal position in either the spinous or granular layer, indicating that increased contractility is key to inducing late differentiation of granular cells. A potential weakness of the K10-spastin model is the disruption of MT as the primary effect, which secondarily causes hypercontractility. However, their previous publications provided some evidence that the effect on differentiation is driven by the increase in contractility (Ning et al. Cell Stem Cell 2021). Moreover, the data are confirmed by the second model directly activating myosin through RhoA. These previous publications already indicated a role for contractility in differentiation but were focused on early differentiation. The data in this manuscript focus on the regulation of late differentiation in barrier-forming cells. These important data help to unravel the interdependencies of cell position, mechanical state, and differentiation in the epidermis, suggesting that an increase in cellular contractility in most apical positions within the epidermis can induce terminal differentiation.

      Importantly, the authors show that despite contractility-induced nuclear localization of the mechanoresponsive transcription factor YAP in the barrier-forming granular layer, YAP nuclear localization is not sufficient to drive premature differentiation when forced to the nucleus in the spinous layer.

      Overall, this is a well-written manuscript and a comprehensive dataset. Only the RNA sequencing result should be presented more transparently, providing the full lists of regulated genes instead of presenting just the GO analysis and selected target genes, so that this analysis can serve as a useful repository. The authors themselves have profited from and used published datasets of gene expression of the granular cells. Moreover, some of the previous data should be better discussed. The authors state that forced suprabasal contractility in their mouse models induces the expression of some genes of the epidermal differentiation complex (EDC). However, in their previous publication, the authors showed that major classical EDC genes, like filaggrin and loricrin, are actually not regulated (Muroyama and Lechler eLife 2017). This should be discussed better and necessitates including the full list of regulated genes to show what exactly is regulated.

      We thank the reviewers for their suggestions and comments.

      Thank you for the suggestion to include gene lists. We had an Excel document with all these data but neglected to upload it with the initial manuscript. This includes all the gene signatures for the different cell compartments across development. We also include a tab that lists all EDC genes and whether they were up-regulated in intermediate cells and in cells in which contractility was induced. Further, we note that all the RNA-Seq datasets are available for use on GEO (GSE295753).

      In our previous publication, we indeed included images showing that loricrin and filaggrin were both still expressed in the differentiated epidermis in the spastin mutant. Both Flg and Lor mRNA were up in the RNA-Seq (although only Flg was statistically significant), though we didn’t see a notable change in protein levels. It is unclear whether this is just difficult to see on top of the normal expression, or whether there are additional levels of regulation where mRNA levels are increased but protein isn’t. That said, our data clearly show that other genes associated with granular fate were increased in the contractile skin. 

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript from Prado-Mantilla and co-workers addresses mechanisms of embryonic epidermis development, focusing on the intermediate layer cells, a transient population of suprabasal cells that contributes to the expansion of the epidermis through proliferation. Using bulk RNA-seq, they show that these cells are transcriptionally distinct from the suprabasal spinous cells and identify specific marker genes for these populations. They then use transgenesis to demonstrate that one of these selected spinous layer-specific markers, the transcription factor MafB, is capable of suppressing proliferation in the intermediate layers, providing a potential explanation for the shift of suprabasal cells into a non-proliferative state during development. Further, lineage tracing experiments show that the intermediate cells become granular cells without a spinous layer intermediate. Finally, the authors show that the intermediate layer cells express higher levels of contractility-related genes than spinous layers and that overexpression of cytoskeletal regulators accelerates the differentiation of spinous layer cells into granular cells.

      Overall the manuscript presents a number of interesting observations on the developmental stage-specific identities of suprabasal cells and their differentiation trajectories and points to a potential role of contractility in promoting differentiation of suprabasal cells into granular cells. The precise mechanisms by which MafB suppresses proliferation, how the intermediate cells bypass the spinous layer stage to differentiate into granular cells, and how contractility feeds into these mechanisms remain open. Interestingly, while the mechanosensitive transcription factor YAP appears differentially active in the two states, it is shown to be downstream rather than upstream of the observed differences in mechanics.

      Strengths: 

      The authors use a nice combination of RNA sequencing, imaging, lineage tracing, and transgenesis to address the suprabasal to granular layer transition. The imaging is convincing and the biological effects appear robust. The manuscript is clearly written and logical to follow. 

      Weaknesses: 

      While the data overall support the authors' claims, there are a few minor weaknesses that pertain to the role of contractility. The choice of spastin overexpression to modulate contractility is not ideal, as spastin has multiple roles in regulating microtubule dynamics and membrane transport, which could also be potential mechanisms explaining some of the phenotypes. Use of Arhgef11 overexpression mitigates this effect to some extent, but overall it would have been more convincing to manipulate myosin activity directly. It would also be important to show that these manipulations increase the levels of F-actin and myosin II as shown for the intermediate layer. It would also be logical to address whether further increasing contractility in the intermediate layer would enhance the differentiation of these cells.

      We agree with the reviewer that the development of additional tools to precisely control myosin activity will be of great use to the field. That said, our series of publications has clearly demonstrated that ablating microtubules results in increased contractility and that this phenocopies the effects of Arhgef11-induced contractility. Further, we showed that these phenotypes were rescued by myosin inhibition with blebbistatin. Our prior publications also showed a clear increase in junctional actomyosin through expression of either spastin or Arhgef11, as well as increased staining for the tension-sensitive epitope of alpha-catenin (alpha18). We are not aware of any existing mouse models that allow direct manipulation of myosin activity.

      The gene expression analyses are relatively superficial and rely heavily on GO term analyses, which are of course informative but do not give the reader a good sense of what kind of genes and transcriptional programs are regulated. It would be useful to show volcano plots or heatmaps of actual gene expression changes, as well as to perform additional analyses, for example gene set enrichment and/or transcription factor enrichment analyses, to better describe the transcriptional programs.

      We have included an Excel document that lists all the gene signatures. In addition, a volcano plot is included in the new Fig 2, Supplement 1. All our NGS data are deposited in GEO for others to perform these analyses. As the paper does not delve further into transcriptional regulation, we do not specifically present this information in the paper.

      Claims of changes in cell division/proliferation are made exclusively by quantifying EdU incorporation. It would be useful to look more directly at mitosis. At minimum, y-axis labels should be changed from "% Dividing cells" to "% EdU+ cells" to more accurately represent the findings.

      We changed the axis label to precisely match our analysis. We note that Figure 1, Supplement 1 also contains data on mitosis.  

      Despite these minor weaknesses the manuscript is overall of high quality, sheds new light on the fundamental mechanisms of epidermal stratification during embryogenesis, and will likely be of interest to the skin research community. 

      Reviewer #3 (Public review): 

      Summary: 

      This is an interesting paper by Lechler and colleagues describing the transcriptomic signature and fate of intermediate cells (ICs), a transient and poorly defined embryonic cell type in the skin. ICs are the first suprabasal cells in the stratifying skin and unlike later-developing suprabasal cells, ICs continue to divide. Using bulk RNA seq to compare ICs to spinous and granular transcriptomes, the authors find that IC-specific gene signatures include hallmarks of granular cells, such as genes involved in lipid metabolism and skin barrier function that are not expressed in spinous cells. ICs were assumed to differentiate into spinous cells, but lineage tracing convincingly shows ICs differentiate directly into granular cells without passing through a spinous intermediate. Rather, basal cells give rise to the first spinous cells. They further show that transcripts associated with contractility are also shared signatures of ICs and granular cells, and overexpression of two contractility inducers (Spastin and ArhGEF-CA) can induce granular and repress spinous gene expression. This contractility-induced granular gene expression does not appear to be mediated by the mechanosensitive transcription factor, Yap. The paper also identifies new markers that distinguish IC and spinous layers and shows the spinous signature gene, MafB, is sufficient to repress proliferation when prematurely expressed in ICs. 

      Strengths: 

      Overall this is a well-executed study, and the data are clearly presented and the findings convincing. It provides an important contribution to the skin field by characterizing the features and fate of ICs, a much-understudied cell type, at high levels of spatial and transcriptomic detail. The conclusions challenge the assumption that ICs are spinous precursors through compelling lineage tracing data. The demonstration that differentiation can be induced by cell contractility is an intriguing finding and adds to a growing list of examples where cell mechanics influence gene expression and differentiation.

      Weaknesses: 

      A weakness of the study is an over-reliance on overexpression and sufficiency experiments to test the contributions of MafB, Yap, and contractility in differentiation. The inclusion of loss-of-function approaches would enable one to determine if, for example, contractility is required for the transition of ICs to granular fate, and whether MafB is required for spinous fate. Second, whether the induction of contractility-associated genes is accompanied by measurable changes in the physical properties or mechanics of the IC and granular layers is not directly shown. The inclusion of physical measurements would bolster the conclusion that mechanics lies upstream of differentiation.

      We agree that loss-of-function studies would be useful. For MafB, these have been performed in cultured human keratinocytes, where loss of MafB and its ortholog cMaf results in a phenotype consistent with loss of spinous differentiation (Pajares-Lopez et al, 2015). Due to the complex genetics involved, generating these double mutant mice is beyond the scope of this study. Loss-of-function studies of myosin are also complicated by genetic redundancy of the non-muscle type II myosin genes, as well as the role of these myosins in cell division and in actin cross-linking in addition to contractility. In addition, we have found that these myosins are quite stable in the embryonic intestine, with loss of protein delayed by several days from the induction of recombination. Therefore, elimination of myosins by embryonic day e14.5 with our current drivers is not likely possible. Generation of inducible inhibitors of contractility is therefore a valuable future goal.

      Several recent papers have used AFM of skin sections to probe tissue stiffness. We have not attempted these studies and are unclear about the spatial resolution and whether, in the very thin epidermis at these stages, we could spatially resolve differences. That said, we previously assessed the macro-contractility of tissues in which myosin activity was induced and demonstrated that there was a significant increase in this over a tissue-wide scale (Ning et al, Cell Stem Cell, 2021).  

      Finally, whether the expression of granular-associated genes in ICs provides them with some sort of barrier function in the embryo is not addressed, so the role of ICs in epidermal development remains unclear. Although not essential to support the conclusions of this study, insights into the function of this transient cell layer would strengthen the overall impact.  

      By traditional dye penetration assays, there is no epidermal barrier at the time that intermediate cells exist. One interpretation of the data is that cells are beginning to express mRNAs (and in some cases, proteins) so that they are able to rapidly generate a barrier as they become granular cells. In addition, many EDC genes, important for keratinocyte cornification and barrier formation, are not upregulated in ICs at E14.5. We have attempted experiments to ablate intermediate cells with DTA expression; these resulted in inefficient and delayed death and thus did not yield strong conclusions about the role of intermediate cells. Our findings that transcriptional regulators of granular differentiation (such as Grhl3 and Hopx) are also present in intermediate cells should allow future analysis of the effects of their ablation on the earliest stages of granular differentiation from intermediate cells. In fact, previous studies have shown that Grhl3 null mice have disrupted barrier function at embryonic stages (Ting et al, 2005), supporting a role for ICs in barrier formation.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      Overall, this is a well-written manuscript and a comprehensive dataset. Only the RNA sequencing result should be presented more transparently, providing the full lists of regulated genes instead of presenting just the GO analysis and selected target genes, so that this analysis can serve as a useful repository. The authors themselves have profited from and used the published dataset of gene expression of the granular cells. Moreover, some of the previous data should be better discussed. The authors state that forced suprabasal contractility in their mouse models induces the expression of some genes of the epidermal differentiation complex (EDC). However, in their previous publication, the authors showed that major classical EDC genes, like filaggrin and loricrin, are actually not regulated (Muroyama and Lechler eLife 2017). This should be discussed better and necessitates including the full list of regulated genes to show what exactly is regulated.

      A general point regarding statistics throughout the manuscript: it seems that regular t-tests or ANOVAs, which assume a Gaussian distribution, have been used for sample sizes below N=5, which is technically not correct. Instead, non-parametric tests such as the Mann-Whitney test should be used. Since GraphPad was used for statistics according to the methods, this is easy to change.

      Figure 1: It would be good to show the FACS plot of the analyzed and sorted population in the supplementary figures. 

      If granular cells can be analyzed and detected by FACS, why were they not included in the RNA sequencing analysis? 

      Figure 1 supplement 1c: cell division numbers are analyzed from only 2 mice, and the combined 5 or 4 fields of view are used for statistics with a test assuming normal distribution, which is not really appropriate. Means per mouse should be used or, if accumulated fields of view are used, their number should be increased and more stringent tests applied. Otherwise, the p-values here clearly overstate the significance.

      Granular cells could not be specifically isolated in the approach we used. The lectin binds to both upper spinous and granular cells. For this reason, we relied on a separate granular gene list as described.

      For Figure 1 Supplement 1, we removed the statistical analysis and now use it simply as a validation of the data in Figure 1.
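
      The reviewer's alternative of using per-mouse means, rather than pooled fields of view, can be sketched as follows. This is only an illustration under stated assumptions: the function name, mouse identifiers, and values are hypothetical and are not taken from the manuscript.

```python
from collections import defaultdict
from statistics import mean


def per_mouse_means(measurements):
    """Collapse (mouse_id, value) pairs, one per field of view, into one mean per mouse."""
    by_mouse = defaultdict(list)
    for mouse_id, value in measurements:
        by_mouse[mouse_id].append(value)
    return {mouse_id: mean(values) for mouse_id, values in by_mouse.items()}


# Hypothetical values for illustration only (not data from the study).
fields = [
    ("mouse_1", 10.0), ("mouse_1", 14.0), ("mouse_1", 12.0),
    ("mouse_2", 20.0), ("mouse_2", 22.0),
]
means = per_mouse_means(fields)
print(means)
print(len(fields), "fields of view collapse to", len(means), "independent animals")
```

      Treating each animal as the unit of replication keeps the sample size equal to the number of mice, which is the quantity a statistical test should see.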

      Figure 2: It is not completely clear on which basis the candidate genes were picked. They are described as the most enriched, but how do they compare to the rest of the enriched genes? The full list of regulated genes should be provided.

      Some markers for the IC or granular layer are verified either by RNAscope or by immunofluorescence. Is there a technical reason for that? It would be good to compare protein levels for all markers. Figure 2-Supplement 1: There is no statement about the number of animals that these images are representative of.

      We have included a volcano plot to show where the genes picked reside. We have also included the full gene lists for interested readers. 

      When validated antibodies were available, we used them. When they were not, we performed RNAscope to validate the RNA-Seq dataset.

      We have included animal numbers in the revised Fig 2-Supplement 2 legend (previously Fig 2-Supplement 1).

      Figure 4b: It would be good to include the E16 spinous cells to get an idea of how much closer ICs are to the granular population. 

      We have included a new Venn diagram showing the overlap between each of the IC and spinous signatures with the granular cell signature in Fig 4B. Overall, 36% of IC signature genes are in common with granular cells, while just 20% of spinous genes overlap.  
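
      The overlap percentages quoted here read as the share of each signature that is also found in the granular signature. A minimal sketch of that calculation is given below; the denominator choice is an assumption based on the wording above, and the gene identifiers are placeholders, not genes from the deposited dataset.

```python
def percent_shared(signature, reference):
    """Percentage of genes in `signature` that also appear in `reference`."""
    signature, reference = set(signature), set(reference)
    if not signature:
        raise ValueError("signature must contain at least one gene")
    return 100.0 * len(signature & reference) / len(signature)


# Placeholder identifiers for illustration only (not real signature genes).
granular = {"gene_a", "gene_b", "gene_c", "gene_d"}
ic_signature = {"gene_a", "gene_b", "gene_e", "gene_f", "gene_g"}
spinous_signature = {"gene_c", "gene_h", "gene_i", "gene_j", "gene_k"}

ic_overlap = percent_shared(ic_signature, granular)
spinous_overlap = percent_shared(spinous_signature, granular)
print(f"IC: {ic_overlap:.0f}% shared, spinous: {spinous_overlap:.0f}% shared")
```

      Because the denominator is the size of each signature rather than the granular list, the two percentages are comparable even when the IC and spinous signatures differ in length.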

      Reviewer #2 (Recommendations for the authors): 

      (1)  Figure 6B is confusing, as the y-axis is labeled as EdU+ suprabasal cells whereas basal cells are also quantified.

      We have altered the y-axis title to make it clearer.  

      (2)  Not clear why HA-control is sometimes included and sometimes not. 

      We include the HA when it did not disrupt visualization of the loss of fluorescence. As it was uniform in most cases, we excluded it for clarity in some images. HA staining is now included in Fig 3C.

      (3)  The authors might reconsider the title as it currently is somewhat vague, to more precisely represent the content of the manuscript. 

      We thank the reviewer for the suggestion. We considered other options but felt that this gave an overview of the breadth of the paper.  

      Reviewer #3 (Recommendations for the authors): 

      (1)  ICs are shown to express Tgm1 and Abca12, important for cornified envelope function and formation of lamellar bodies. Do ICs provide any barrier function at E14.5? 

      By traditional dye penetration assays, there is no epidermal barrier at the time that intermediate cells exist. One interpretation of the data is that cells are beginning to express mRNAs (and in some cases, proteins) so that they are able to rapidly generate a barrier as they become granular cells.  

      (2)  Genes associated with contractility are upregulated in ICs and granular cells. And ICs have higher levels of F-actin, MyoIIA, alpha-18, and nuclear Yap. Does this correspond to a measurable difference in stiffness? Can you use AFM to compare to physical properties of ICs, spinous, and granular cells? 

      Several recent papers have used AFM of skin sections to probe tissue stiffness. We have not attempted these studies and are unclear about the spatial resolution and whether, in the very thin epidermis at these stages, we could spatially resolve differences. It is also important to note that this tissue rigidity is influenced by factors other than contractility. That said, we previously assessed the macro-contractility of tissues in which myosin activity was induced and demonstrated that there was a significant increase in this over a tissue-wide scale (Ning et al, Cell Stem Cell, 2021).

      (3)  Overexpression of two contractility inducers (spastin and ArhGEF-CA) can induce granular gene expression and repress spinous gene expression, suggesting differentiation lies downstream of contractility. Is contractility required for granular differentiation? 

      This is an important question and one that we hope to directly address in the future. Published studies have shown defects in tight junction formation and barrier function in myosin II mutants. However, a thorough characterization of differentiation was not performed.  

      (4)  ICs are a transient cell type, and it would be important to know what is the consequence of the epidermis never developing this layer. Does it perform an important temporary structural/barrier role, or patterning information for the skin?

      We have attempted experiments to ablate intermediate cells with DTA expression; this resulted in inefficient and delayed death and thus did not yield strong conclusions. Our findings that transcriptional regulators of granular differentiation (such as Grhl3 and Hopx) are also present in intermediate cells should allow future analysis of the effects of their ablation on the earliest stages of granular differentiation from intermediate cells.

    1. eLife assessment

      This convincing study advances our understanding of the physiological consequences of the strong overexpression of non-toxic proteins in baker's yeast. The findings suggest that a massive protein burden results in nitrogen starvation and a shift in metabolism likely regulated via the TORC1 pathway, as well as defects in ribosome biogenesis in the nucleolus. The study presents findings and tools that are important for the cell biology and protein homeostasis fields.

    2. Reviewer #1 (Public Review):

      Summary:

      The study "Impact of Maximal Overexpression of a Non-toxic Protein on Yeast Cell Physiology" by Fujita et al. aims to elucidate the physiological impacts of overexpressing non-toxic proteins in yeast cells. By identifying model proteins with minimal cytotoxicity, the authors claim to provide insights into cellular stress responses and metabolic shifts induced by protein overexpression.

      Strengths:

      The study introduces a neutrality index to quantify cytotoxicity and investigates the effects of protein burden on yeast cell physiology. The study identifies mox-YG (a non-fluorescent fluorescent protein) and Gpm1-CCmut (an inactive glycolytic enzyme) as proteins with the lowest cytotoxicity, capable of being overexpressed to more than 40% of total cellular protein while maintaining yeast growth. Overexpression of mox-YG leads to a state resembling nitrogen starvation probably due to TORC1 inactivation, increased mitochondrial function, and decreased ribosomal abundance, indicating a metabolic shift towards more energy-efficient respiration and defects in nucleolar formation.

      Weaknesses:

      While the introduction of the neutrality index seems useful to differentiate between cytotoxicity and protein burden, the biological relevance of the effects of overexpression of the model proteins is unclear.

    3. Reviewer #2 (Public Review):

      Summary:

      In this manuscript, Fujita et al. characterized the neutrality indexes of several protein mutants in S. cerevisiae and uncovered that mox-YG and Gpm1-CCmut can be expressed at levels as high as 40% of total protein without causing severe growth defects. The authors then looked at the transcriptome and proteome of cells expressing excess mox-YG to investigate how protein burden affects yeast cells. Based on RNA-seq and mass-spectrometry results, the authors uncover that cells with excess mox-YG exhibit nitrogen starvation, increased respiration, an inactivated TORC1 response, and decreased ribosomal abundance. The authors further showed that the decreased ribosomal amount is likely due to nucleoli defects, which can be partially rescued by nuclear exosome mutations.

      Strengths:

      Overall, this is a well-written manuscript that provides many valuable resources for the field, including the neutrality analysis on various fluorescent proteins and glycolytic enzymes, as well as the RNA-seq and proteomics results of cells overexpressing mox-YG. Their model on how mox-YG overexpression impairs the nucleolus and thus leads to ribosomal abundance decline will also raise many interesting questions for the field.

      Weaknesses:

      The authors concluded from their RNA-seq and proteomics results that cells with excess mox-YG expression showed increased respiration and TORC1 inactivation. I think it will be more convincing if the authors can show some characterization of mitochondrial respiration/membrane potential and the TOR responses to further verify their -omic results.

      In addition, the authors only investigated how overexpression of mox-YG affects cells. It would be interesting to see whether overexpressing other non-toxic proteins causes similar effects, or if there are protein-specific effects. It would be good if the authors could at least discuss this point, considering that the workload of doing another RNA-seq or mass spectrometry analysis might be too heavy.

    4. Reviewer #3 (Public Review):

      Summary:

      Protein overexpression is widely used in experimental systems to study the function of the protein, assess its (beneficial or detrimental) effects in disease models, or challenge cellular systems involved in synthesis, folding, transport, or degradation of proteins in general. Especially at very high expression levels, protein-specific effects and general effects of a high protein load can be hard to distinguish. To overcome this issue, Fujita et al. use the previously established genetic tug-of-war system to identify proteins that can be expressed at extremely high levels in yeast cells with minimal protein-specific cytotoxicity (high 'neutrality'). They focus on two versions of the protein mox-GFP, the fluorescent version and a point mutation that is non-fluorescent (mox-YG) and is the most 'neutral' protein on their screen. They find that massive protein expression (up to 40% of the total proteome) results in a nitrogen starvation phenotype, likely inactivation of the TORC1 pathway, and defects in ribosome biogenesis in the nucleolus.

      Strengths:

      This work uses an elegant approach and succeeds in identifying proteins that can be expressed at surprisingly high levels with little cytotoxicity. Many of the changes they see have been observed before under protein burden conditions, but some are new and interesting. This work solidifies previous hypotheses about the general effects of protein overexpression and provides a set of interesting observations about the toxicity of fluorescent proteins (that is alleviated by mutations that render them non-fluorescent) and metabolic enzymes (that are less toxic when mutated into inactive versions).

      Weaknesses:

      The data are generally convincing; however, to back up the major claim of this work - that the observed changes are due to general protein burden and not to the specific protein or condition - a broader analysis of different conditions would be highly beneficial.

      Major points:

      (1) The authors identify several proteins with high neutrality scores but only analyze the effects of mox/mox-YG overexpression in depth. Hence, it remains unclear which molecular phenotypes they observe are general effects of protein burden or more specific effects of these specific proteins. To address this point, a proteome (and/or transcriptome) of at least a Gpm1-CCmut expressing strain should be obtained and compared to the mox-YG proteome. Ideally, this analysis should be done simultaneously on all strains to achieve a good comparability of samples, e.g. using TMT multiplexing (for a proteome) or multiplexed sequencing (for a transcriptome). If feasible, the more strains that can be included in this comparison, the more powerful this analysis will be and can be prioritized over depth of sequencing/proteome coverage.

      (2) The genetic tug-of-war system is elegant but comes at the cost of requiring specific media conditions (synthetic minimal media lacking uracil and leucine), which could be a potential confound, given that metabolic rewiring and especially nitrogen starvation are among the observed phenotypes. I wonder if some of the changes might be specific to these conditions. The authors should corroborate their findings under different conditions. Ideally, this would be done using an orthogonal expression system that does not rely on auxotrophy (e.g. using antibiotic resistance instead) and can be used in rich, complex media like YPD. Minimally, using different conditions (media with excess or more limited nitrogen source, amino acids, different carbon source, etc.) would be useful to test the robustness of the findings towards changes in media composition.

      (3) The authors suggest that the TORC1 pathway is involved in regulating some of the changes they observed. This is likely true, but it would be great if the hypothesis could be directly tested using an established TORC1 assay.

      (4) The finding that the nucleolus appears to be virtually missing in mox-YG-expressing cells (Figure 6B) is surprising and interesting. The authors suggest possible mechanisms to explain this and partially rescue the phenotype by a reduction-of-function mutation in an exosome subunit. I wonder if this is specific to the mox-YG protein or a general protein burden effect, which the experiments suggested in point 1 should address. Additionally, could a mox-YG variant with a nuclear export signal be expressed that stays exclusively in the cytosol to rule out that mox-YG itself interferes with phase separation in the nucleus?

      Minor points:

      (5) It would be great if the authors could directly compare the changes they observed at the transcriptome and proteome levels. This can help distinguish between changes that are transcriptionally regulated versus more downstream processes (like protein degradation, as proposed for ribosome components).